Direct 2:3 (8:12) DCT upscaling with fast 12x12 IDCT

A fast 16x16 IDCT has been developed to provide a 'djpeg -scale 2/1' option for 1:2 (8:16) image upscaling.
This feature can also be used for direct internal color upsampling while normal (1:1) decoding of 2hx2v color subsampled JPEG images (also 2hx1v and 1hx2v with enhanced library).

A fast 12x12 IDCT has been developed to provide a 'djpeg -scale 3/2' option for 2:3 (8:12) image upscaling.
This is interesting because it is the first non-integer upsampling feature in the IJG context.

Computational efficiency of algorithms

The computational efficiency of the developed algorithms is estimated here by the number of multiplications per output pixel. We do not take into account the dequantization mults here, which would improve the upscaling results even further because only input values (less than output in case of upscaling) need to be dequantized.

Normal decoding (1:1, 8:8, djpeg -scale 1/1):
The used LL&M 8-point IDCT kernel requires 12 multiplications (3 even + 9 odd).
This gives 12x16/64 (16 = 8 column loops + 8 row loops) = 3 mults per output pixel.
Downscaling to a half (2:1, 8:4, djpeg -scale 1/2):
The developed 4-point IDCT kernel requires 3 multiplications (0 even + 3 odd).
This gives 3x8/16 (8 = 4 column loops + 4 row loops) = 1.5 mults per output pixel.
This is twice the speed than normal 1:1 decoding!
Upscaling x2 (1:2, 8:16, djpeg -scale 2/1):
The developed 16-point IDCT kernel requires 28 multiplications (8 even + 20 odd).
This gives 28x24/256 (24 = 8 column loops + 16 row loops) = 2.625 mults per output pixel.
This is faster than normal 1:1 decoding regarding the output size!
Upscaling x1.5 (2:3, 8:12, djpeg -scale 3/2):
The developed 12-point IDCT kernel requires 15 multiplications (2 even + 13 odd).
This gives 15x20/144 (20 = 8 column loops + 12 row loops) = 2.08 mults per output pixel.
This makes the 2:3 upscaling faster than normal 1:1 decoding regarding the output size!

Derivation of the fast 12x12 IDCT algorithm

Using the NxN DCT formula we can derive the 12x12 DCT as follows.
Note that we only need to calculate the upper 8 of 12 full matrix rows, since we have to apply them only to 8x8 input values (see later in transposed IDCT scheme).
The dotted vertical line in the middle marks the central (anti)symmetry axis, so we can further simplify computation by only calculating the left half and then mirror it alternately as is or with sign alternation to the right half.
Furthermore we omit the scalar scaling factor because it turns out that the final scaling is always by a factor of 1/8 for any output (I)DCT size.

col   0     1     2     3     4     5     6     7     8     9    10    11
index
                                       :
  /  C6    C6    C6    C6    C6    C6  : C6    C6    C6    C6    C6    C6  \
  |                                    :                                   |
  |  C1    C3    C5    C7    C9    C11 :-C11  -C9   -C7   -C5   -C3   -C1  |
  |                                    :                                   |
  |  C2    C6    C10  -C10  -C6   -C2  :-C2   -C6   -C10   C10   C6    C2  |
  |                                    :                                   |
  |  C3    C9   -C9   -C3   -C3   -C9  : C9   -C3    C3    C9   -C9   -C3  |
  |                                    :                                   |
  |  C4    0    -C4   -C4    0     C4  : C4    0    -C4   -C4    0     C4  |
  |                                    :                                   |
  |  C5   -C9   -C1   -C11   C3    C7  :-C7   -C3    C11   C1    C9   -C5  |
  |                                    :                                   |
  |  C6   -C6   -C6    C6    C6   -C6  :-C6    C6    C6   -C6   -C6    C6  |
  |                                    :                                   |
  |  C7   -C3   -C11   C1   -C9   -C5  : C5    C9    -C1   C11   C3   -C7  |
  |....................................:...................................|
  |                                    :                                   |

  where  Ck = cos(k*pi/24)

Now the IDCT is the transpose of the DCT, hence

col   0     1     2     3     4     5     6     7
index

  /  C6    C1    C2    C3    C4    C5    C6    C7
  |
  |  C6    C3    C6    C9    0    -C9   -C6   -C3
  |
  |  C6    C5    C10  -C9   -C4   -C1   -C6   -C11
  |
  |  C6    C7   -C10  -C3   -C4   -C11   C6    C1
  |
  |  C6    C9   -C6   -C3    0     C3    C6   -C9
  |
  |  C6    C11  -C2   -C9    C4    C7   -C6   -C5
  |----------------------------------------------
  |  C6   -C11  -C2    C9    C4   -C7   -C6    C5
  |
  |  C6   -C9   -C6   -C3    0    -C3    C6    C9
  |
  |  C6   -C7   -C10   C3   -C4    C11   C6   -C1
  |
  |  C6   -C5    C10   C9   -C4    C1   -C6    C11
  |
  |  C6   -C3    C6   -C9    0     C9   -C6    C3
  |
  \  C6   -C1    C2   -C3    C4   -C5    C6   -C7

With ck = sqrt(2) * Ck and C6 = 1/sqrt(2) we get

col   0     1     2     3     4     5     6     7
index

  /  1     c1    c2    c3    c4    c5    1     c7
  |
  |  1     c3    1     c9    0    -c9   -1    -c3
  |
  |  1     c5    c10  -c9   -c4   -c1   -1    -c11
  |
  |  1     c7   -c10  -c3   -c4   -c11   1     c1
  |
  |  1     c9   -1    -c3    0     c3    1    -c9
  |
  |  1     c11  -c2   -c9    c4    c7   -1    -c5
  |----------------------------------------------
  |  1    -c11  -c2    c9    c4   -c7   -1     c5
  |
  |  1    -c9   -1    -c3    0    -c3    1     c9
  |
  |  1    -c7   -c10   c3   -c4    c11   1    -c1
  |
  |  1    -c5    c10   c9   -c4    c1   -1     c11
  |
  |  1    -c3    1    -c9    0     c9   -1     c3
  |
  \  1    -c1    c2   -c3    C4   -c5    1    -c7

where

  c1  = 1.402114769
  c2  = 1.366025404
  c3  = 1.306562965
  c4  = 1.224744871
  c5  = 1.121971054
 [c6  = 1 not needed]
  c7  = 0.860918669
 [c8  not needed]
  c9  = 0.541196100
  c10 = 0.366025404
  c11 = 0.184591911

Even part optimization (columns 0, 2, 4, 6):
You see that we have only 3 multiplicators in this part: c4, c2, and c10. Furthermore we see from the numbers below that c10 = c2 - 1. Thus multiplication with c10 can be replaced by subtraction. This leaves us with just 2 mults in the even part calculation (by factors c4 and c2).

Odd part optimization (columns 1, 3, 5, 7):
Rows 1 and 4 form a 'rotation' expression (see fast 4x4 IDCT derivation) which can be spanned over the full odd columns block (1, 3, 5, 7) and thereby 'normalized' by substituting x1 - x7 and x3 - x5 with factors c3 and c9.
Column 3 has just 2 multiplicators (c3 and c9). The remaining elements can be reduced to 8 multiplications.
This gives us 3 (rotation) + 2 (column 3) + 8 = 13 mults in the odd part calculation.

Note that the rotation with c3 and c9 is again the same as in the even part of the 8x8 point LL&M IDCT algorithm.