Direct 2:3 (8:12) DCT upscaling with fast 12x12 IDCT
A fast 16x16 IDCT has been developed to provide a 'djpeg -scale 2/1' option for
1:2 (8:16) image upscaling.
This feature can also be used for direct internal color upsampling while normal
(1:1) decoding of 2hx2v color subsampled JPEG images (also 2hx1v and 1hx2v with
enhanced library).
A fast 12x12 IDCT has been developed to provide a 'djpeg -scale 3/2' option for
2:3 (8:12) image upscaling.
This is interesting because it is the first non-integer upsampling feature in the
IJG context.
Computational efficiency of algorithms
The computational efficiency of the developed algorithms is estimated here by the
number of multiplications per output pixel. We do not take into account the
dequantization mults here, which would improve the upscaling results even further
because only input values (less than output in case of upscaling) need to be
dequantized.
- Normal decoding (1:1, 8:8, djpeg -scale 1/1):
The used LL&M 8-point IDCT kernel requires 12 multiplications (3 even + 9 odd).
This gives 12x16/64 (16 = 8 column loops + 8 row loops) = 3 mults per
output pixel.
- Downscaling to a half (2:1, 8:4, djpeg -scale 1/2):
The developed 4-point IDCT kernel requires 3 multiplications (0 even + 3 odd).
This gives 3x8/16 (8 = 4 column loops + 4 row loops) = 1.5 mults per
output pixel.
This is twice the speed than normal 1:1 decoding!
- Upscaling x2 (1:2, 8:16, djpeg -scale 2/1):
The developed 16-point IDCT kernel requires 28 multiplications (8 even + 20 odd).
This gives 28x24/256 (24 = 8 column loops + 16 row loops) = 2.625 mults per
output pixel.
This is faster than normal 1:1 decoding regarding the output size!
- Upscaling x1.5 (2:3, 8:12, djpeg -scale 3/2):
The developed 12-point IDCT kernel requires 15 multiplications (2 even + 13 odd).
This gives 15x20/144 (20 = 8 column loops + 12 row loops) = 2.08 mults per
output pixel.
This makes the 2:3 upscaling faster than normal 1:1 decoding regarding the output
size!
Derivation of the fast 12x12 IDCT algorithm
Using the NxN DCT formula we can derive the 12x12 DCT as follows.
Note that we only need to calculate the upper 8 of 12 full matrix rows,
since we have to apply them only to 8x8 input values (see later in transposed
IDCT scheme).
The dotted vertical line in the middle marks the central (anti)symmetry axis, so
we can further simplify computation by only calculating the left half and then
mirror it alternately as is or with sign alternation to the right half.
Furthermore we omit the scalar scaling factor because it turns out that the final
scaling is always by a factor of 1/8 for any output (I)DCT size.
col 0 1 2 3 4 5 6 7 8 9 10 11
index
:
/ C6 C6 C6 C6 C6 C6 : C6 C6 C6 C6 C6 C6 \
| : |
| C1 C3 C5 C7 C9 C11 :-C11 -C9 -C7 -C5 -C3 -C1 |
| : |
| C2 C6 C10 -C10 -C6 -C2 :-C2 -C6 -C10 C10 C6 C2 |
| : |
| C3 C9 -C9 -C3 -C3 -C9 : C9 -C3 C3 C9 -C9 -C3 |
| : |
| C4 0 -C4 -C4 0 C4 : C4 0 -C4 -C4 0 C4 |
| : |
| C5 -C9 -C1 -C11 C3 C7 :-C7 -C3 C11 C1 C9 -C5 |
| : |
| C6 -C6 -C6 C6 C6 -C6 :-C6 C6 C6 -C6 -C6 C6 |
| : |
| C7 -C3 -C11 C1 -C9 -C5 : C5 C9 -C1 C11 C3 -C7 |
|....................................:...................................|
| : |
where Ck = cos(k*pi/24)
Now the IDCT is the transpose of the DCT, hence
col 0 1 2 3 4 5 6 7
index
/ C6 C1 C2 C3 C4 C5 C6 C7
|
| C6 C3 C6 C9 0 -C9 -C6 -C3
|
| C6 C5 C10 -C9 -C4 -C1 -C6 -C11
|
| C6 C7 -C10 -C3 -C4 -C11 C6 C1
|
| C6 C9 -C6 -C3 0 C3 C6 -C9
|
| C6 C11 -C2 -C9 C4 C7 -C6 -C5
|----------------------------------------------
| C6 -C11 -C2 C9 C4 -C7 -C6 C5
|
| C6 -C9 -C6 -C3 0 -C3 C6 C9
|
| C6 -C7 -C10 C3 -C4 C11 C6 -C1
|
| C6 -C5 C10 C9 -C4 C1 -C6 C11
|
| C6 -C3 C6 -C9 0 C9 -C6 C3
|
\ C6 -C1 C2 -C3 C4 -C5 C6 -C7
With ck = sqrt(2) * Ck and C6 = 1/sqrt(2) we get
col 0 1 2 3 4 5 6 7
index
/ 1 c1 c2 c3 c4 c5 1 c7
|
| 1 c3 1 c9 0 -c9 -1 -c3
|
| 1 c5 c10 -c9 -c4 -c1 -1 -c11
|
| 1 c7 -c10 -c3 -c4 -c11 1 c1
|
| 1 c9 -1 -c3 0 c3 1 -c9
|
| 1 c11 -c2 -c9 c4 c7 -1 -c5
|----------------------------------------------
| 1 -c11 -c2 c9 c4 -c7 -1 c5
|
| 1 -c9 -1 -c3 0 -c3 1 c9
|
| 1 -c7 -c10 c3 -c4 c11 1 -c1
|
| 1 -c5 c10 c9 -c4 c1 -1 c11
|
| 1 -c3 1 -c9 0 c9 -1 c3
|
\ 1 -c1 c2 -c3 C4 -c5 1 -c7
where
c1 = 1.402114769
c2 = 1.366025404
c3 = 1.306562965
c4 = 1.224744871
c5 = 1.121971054
[c6 = 1 not needed]
c7 = 0.860918669
[c8 not needed]
c9 = 0.541196100
c10 = 0.366025404
c11 = 0.184591911
Even part optimization (columns 0, 2, 4, 6):
You see that we have only 3 multiplicators in this part: c4, c2, and c10.
Furthermore we see from the numbers below that c10 = c2 - 1. Thus multiplication
with c10 can be replaced by subtraction. This leaves us with just 2 mults in
the even part calculation (by factors c4 and c2).
Odd part optimization (columns 1, 3, 5, 7):
Rows 1 and 4 form a 'rotation' expression (see fast 4x4 IDCT derivation) which can
be spanned over the full odd columns block (1, 3, 5, 7) and thereby 'normalized' by
substituting x1 - x7 and x3 - x5 with factors c3 and c9.
Column 3 has just 2 multiplicators (c3 and c9).
The remaining elements can be reduced to 8 multiplications.
This gives us 3 (rotation) + 2 (column 3) + 8 = 13 mults in the odd part
calculation.
Note that the rotation with c3 and c9 is again the same as in the even part of
the 8x8 point LL&M IDCT algorithm.