Version 8.0.1

HTJ2K in Kakadu

This software includes a heavily optimized implementation of the new High Throughput JPEG 2000 (HTJ2K) standard (ITU-T Rec T.814 | IS 15444-15), which was first published in August 2019. The HTJ2K standard is highly functional, since it embraces all existing JPEG 2000 technologies and adds a new EXTREMELY fast HT block coding algorithm, as an option that can be used for some, none or all code-blocks. HT stands for "High Throughput".

The HT block coding algorithm itself is easily 10x faster than the original JPEG 2000 block coding algorithm when implemented in a carefully optimized way, but can be much faster again (more than 30x for lossless compression). This is implementation dependent, and the fastest implementation you will get from Kakadu at the moment is one that uses the AVX2 and BMI2 instruction sets, so you will need a moderately recent Intel platform (Haswell+, preferably Skylake+) to get the most out of it. Note that the AVX2/BMI2 accelerated implementations of the HT block coding algorithm contained in this release make use of the PDEP/PEXT BMI2 instructions, whose implementation on current AMD processors is effectively unusable, with enormous latencies and tiny throughput. As a result, AMD processors will fall back to the SSE4.2-based accelerations most of the time, which are a lot slower but still fast. Regrettably, the recent release of the Zen 2 based Matisse architecture did nothing to help this situation. The Matisse architecture also has very high latencies for the AVX2 permutation and variable shift instructions, which are important to the coding operations both in HTJ2K and in other algorithms -- we hope this will change with Zen 3.

To get going with HTJ2K, consult the special sections in the "Usage_Examples.txt" document that deal with HTJ2K -- just search for that term in the usage examples -- and also read the short "HTJ2K_in_Kakadu.txt" document. Both documents are in the "documentation" directory at the top level of your Kakadu software distribution. Additionally, you may like to read the HTJ2K white paper at "www.htj2k.com" or "www.kakadusoftware.com/documentation-support".

Currently, the Kakadu implementation contains an extremely fast implementation of the HT block encoding and decoding procedures for processors capable of AVX2 and BMI2, along with a quite fast implementation for processors capable only of the SSE4.2 x86 instruction set, and then only on 64-bit platforms. End-to-end decoding can be 7-8 times faster than regular JPEG 2000 at moderate bit-rates like 2 bits/pixel, reaching more than 3 Giga-samples/second even on a 4-core Skylake CPU. Lossless decoding of 12-bit/channel content can be more than 30 times faster than regular JPEG 2000, consuming well over 1 GByte/s of compressed data on a 4-core CPU, with low latency.

HTJ2K encoding in Kakadu supports essentially all the same modalities as the original JPEG 2000 block coder, including length constrained encoding (as in kdu_compress with "-rate") and distortion-length slope constrained encoding (as in kdu_compress with "-slope"). However, when used in this way without any care, HTJ2K encoding is not as fast (typically only 3 or 4 times faster than the original JPEG 2000 algorithm). Used with care, however, end-to-end encoding throughputs are extremely high, yielding similar speedups over regular JPEG 2000 to those experienced for decoding. Current end-to-end lossless encoding can be well over 30 times faster than regular JPEG 2000; in fact, it is significantly faster than lossless decoding -- both can easily be real-time for 4K 4:4:4 60fps video on a 4-core Skylake CPU, for example. Encoding without rate control -- i.e., quantization-based quality control, using say the Qstep attribute or a quality factor -- is extremely fast, since only one HT coding pass need be performed. Encoding to a target bit-rate or compressed size can also be extremely fast, reaching 7 to 11 times the speed of regular JPEG 2000 at common bit-rates, if you are careful to use the new "Cplex-EST" complexity control methodology and also careful to avoid any I/O related bottlenecks (disk accesses of pretty much any form) that can impact the throughput of this extremely fast codec. The demo-apps, usage examples and other documentation provide good guidance on how to use the "Cplex-EST" methodology, which is activated in the first instance simply by setting the `Cplex' parameter attribute to use the "EST" method -- typically by passing "Cplex={6,EST,0.25,0}" on the command line to the relevant encoding demo-app.

In the future, Kakadu will include a heavily optimized implementation of the HT block coder for ARM-NEON, to benefit mobile applications in particular. The speed-pack version of Kakadu is also likely to include an ultra-high performance AVX512-based implementation of the HT block encoding and decoding procedures.

A major use case for HTJ2K is transcoding existing JPEG 2000 content to HTJ2K so that it can be decoded at lightning speed, without losing any information whatsoever. Consult the "kdu_transcode" usage examples to see how this can be done. Kakadu can transcode not only from any original JPEG 2000 codestream to an HTJ2K codestream, but also from an HTJ2K codestream back to an equivalent original JPEG 2000 codestream. It can also perform bit-rate adaptive transcoding on any codestream that has multiple quality layers to inform such a process, including multi-layer HTJ2K codestreams, even though the HT block coder itself is not strictly embedded or quality scalable. We note that the current implementation of the "kdu_transcode" demo-app is only single-threaded, which is an area for improvement in the next release, since we expect transcoding of original JPEG 2000 content to HTJ2K to be an important task that will be applied to huge volumes of data.
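As a concrete starting point, encoding and transcoding invocations along the following lines illustrate the ideas above. This is only a sketch -- the file names are placeholders, and "Usage_Examples.txt" remains the authoritative reference for the exact options -- but `Cmodes=HT' is the parameter attribute that selects the HT block coder, and the quotation marks merely protect the braces from the shell:

    kdu_compress -i frame.ppm -o frame_ht.j2c Cmodes=HT Creversible=yes
    kdu_compress -i frame.ppm -o frame_ht.j2c Cmodes=HT "Cplex={6,EST,0.25,0}" -rate 2
    kdu_transcode -i legacy.j2c -o legacy_ht.j2c Cmodes=HT

The first line performs lossless HT encoding with no rate control (so only one HT coding pass is required), the second combines the Cplex-EST methodology with length constrained encoding to a target of 2 bits/pixel, and the third losslessly transcodes an existing JPEG 2000 codestream to HTJ2K.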

Other Features NEW in KDU-8.0.1 (beyond HTJ2K)

  1. Speed-pack and regular versions of Kakadu have been harmonized to a greater extent, so that you can write any application for regular Kakadu and immediately take advantage of the accelerations in the speed-pack version of Kakadu without modifying your code, even in transcoding applications.
  2. Kakadu's detection and checking of profiles and profile compliance has been completely overhauled, from an ad-hoc collection of tests to a rational framework that is designed to support future standardized profiles, as well as custom external profiles that might be defined within application domains we are not aware of.
  3. The encoding pipeline in Kakadu has been significantly modified to improve cache utilization and reduce the likelihood of inappropriate speculative cache reads by the processor, which can eat into the processor's bandwidth -- this is primarily important in conjunction with the HT block coder, which is so fast that bandwidth constraints inside the CPU can become an issue.
  4. Kakadu now provides a fragmented wavelet transformation mode that allows the wavelet analysis processing to be split into more than just one thread per image plane (or tile-component). This may be advantageous on large multi-core platforms, especially in conjunction with the ultra-fast HT block coding algorithm, since in some cases the wavelet transform could become the processing bottleneck for heavily multi-threaded deployments.
  5. The new "Cplex" parameter attribute provides various means to constrain encoding complexity that are most interesting with the HT block coder, but can also be useful with the original JPEG 2000 block coding algorithm. This is accessible to any existing application with minimal or no changes -- the Kakadu demo-apps, in particular, did not have to be changed at all, since they support the setting of any parameter attribute from the command line (see the sketch after this list).
  6. Added a new "kdu_makeppm" demo-app that can move JPEG 2000 packet headers to PPM marker segments (main header packed packet headers) or PPT marker segments (tile-part header packed packet headers).
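As foreshadowed in item 5, the `Cplex' attribute is supplied on the command line just like any other parameter attribute, with no changes to the demo-apps. The following sketch (the file names are placeholders) applies Cplex-EST complexity constraints even without the HT block coder, i.e., with the original JPEG 2000 block coding algorithm, since `Cmodes=HT' is omitted:

    kdu_compress -i image.tif -o image.j2c "Cplex={6,EST,0.25,0}" -rate 1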