The JPEG default quantization tables (QT) are based on psychovisual thresholding and were derived empirically. But because the same QT is used for every image, the default QT is image independent and cannot achieve an optimal compression result for each specific image.
Some perceptual models have been developed to compute an image-dependent QT. However, because each block contributes different properties to the image as a whole, the single QT that is best for the whole image is not necessarily best for every block. If each block could be quantized with a QT specifically suited to that block, the compression result could be optimized block by block.
Because JPEG allows only one QT per image, pre-quantization was proposed by Johnston and Safranek (J&S) [1]. For every block, a specific masking threshold is computed and used to zero out the perceptually irrelevant coefficients, while the other coefficients pass through unchanged. A single base QT is then used to quantize all the remaining coefficients in each block.
Despite the benefit of simple implementation, the J&S model has the disadvantage of computing a single masking elevation for each input block, so it carries no information about the distribution of energy within the block.
This problem can be overcome by applying a contrast masking model to each DCT coefficient (model-1).
The computation for model-1 is somewhat complex. This led us to design a second model (model-2), based on the ratio of each block's luminance to that of the total image, as a replacement for the J&S model. This second model has the same single-masking-elevation limitation, but its calculation is simpler.
Our test results show that, compared to the JPEG default setting, both of our models reduce the bit rate by about 10% with little or no perceived loss in quality.
The rest of this paper is organized as follows. Section 2 describes algorithms to "prequantize" a JPEG image. Section 3 describes the two perceptual models we designed. Section 4 presents a detailed evaluation of our models, and Section 5 reviews related work and future extensions.
2.2 Perceptual Model
Many studies have attempted to derive a computational model of this visual masking level. For each block in the input image, the model attempts to determine to what degree the features present in that block inhibit the visual system from perceiving the distortion introduced by the compression/decompression process. From these values, it is possible to determine a masking threshold for each DCT coefficient.
2.3 J&S's Model
Johnston and Safranek have developed a framework for computing a locally adaptive masking model from an engineering point of view. They assume that the total masking level for any block of the input can be represented as a base masking level multiplied by elevation factors that represent the contribution of input-dependent properties of the visual system to the total mask. This model may be expressed as:
M(u,v) = Global(u, v) x Local(u, v) --- (1)
where M(u, v) is the masking level for frequency (u, v) of the input block, Global(u, v) is the base masking level, which depends only on global properties, and Local(u, v) handles the image-specific local variation in the masking threshold. The adaptation is derived as a function of the block standard deviation using the following formula:
---(2)
This applies to all of the AC coefficients; the masking elevation for the DC coefficient is always set to unity. The model has the advantage of simple implementation and works well in practice, but it has the disadvantage of computing a single masking elevation for each input block.
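To make the structure of equation (1) concrete, the following minimal sketch (in Python/NumPy, our illustration rather than J&S's code) computes a per-block masking table using the JPEG default luminance QT as the global level. Because equation (2) is not reproduced in this copy, local_elevation is only a placeholder for the block-standard-deviation adaptation.

    import numpy as np

    # JPEG default luminance quantization table, used here as the global masking level.
    JPEG_DEFAULT_QT = np.array([
        [16, 11, 10, 16, 24, 40, 51, 61],
        [12, 12, 14, 19, 26, 58, 60, 55],
        [14, 13, 16, 24, 40, 57, 69, 56],
        [14, 17, 22, 29, 51, 87, 80, 62],
        [18, 22, 37, 56, 68, 109, 103, 77],
        [24, 35, 55, 64, 81, 104, 113, 92],
        [49, 64, 78, 87, 103, 121, 120, 101],
        [72, 92, 95, 98, 112, 100, 103, 99],
    ], dtype=np.float64)

    def local_elevation(block_std):
        """Placeholder for equation (2): the multiplicative elevation as a
        function of the block standard deviation.  The exact form is not
        reproduced in this excerpt, so this stub returns 1.0 (no elevation)."""
        return 1.0

    def js_masking_levels(pixel_block):
        """Per-block masking levels M(u, v) = Global(u, v) * Local(u, v) (eq. 1).
        The DC elevation is always unity, as described in the text."""
        elevation = local_elevation(pixel_block.std())
        local = np.full((8, 8), elevation)
        local[0, 0] = 1.0   # DC coefficient: elevation fixed to 1
        return JPEG_DEFAULT_QT * local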
Figure 2 illustrates the structure of such an encoder. The forward transformation is identical to the one in baseline JPEG. The DCT coefficients are then input to the perceptual model, which generates the data-dependent quantization table for that block. This table and the raw DCT coefficients are next input to a "pre-quantizer". The purpose of this module is to zero out the coefficients whose magnitude is less than the corresponding entry in the quantization table for that block, and to pass the other coefficients through unchanged. Finally, these prequantized coefficients are quantized and entropy coded as in standard JPEG.
In the dequantization step, only the base QT is used.
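As an illustration only (not the authors' code), the per-block pre-quantization step can be sketched in Python/NumPy as follows; dct2, perceptual_model, and base_qt in the trailing comments are assumed names standing in for the surrounding pipeline.

    import numpy as np

    def prequantize(dct_block, block_table):
        """Zero out DCT coefficients whose magnitude is below the per-block
        perceptual threshold; leave the remaining coefficients unchanged."""
        return np.where(np.abs(dct_block) < block_table, 0.0, dct_block)

    def quantize(dct_block, base_qt):
        """Standard JPEG quantization with the single base table; the decoder
        dequantizes with this same table only."""
        return np.rint(dct_block / base_qt).astype(np.int32)

    # Typical per-block flow in the encoder of Figure 2:
    #   coeffs   = dct2(pixel_block)            # forward DCT, as in baseline JPEG
    #   table    = perceptual_model(coeffs)     # data-dependent table for this block
    #   survived = prequantize(coeffs, table)   # drop perceptually irrelevant coefficients
    #   levels   = quantize(survived, base_qt)  # entropy coded as in standard JPEG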
M(u, v) = MAX[ t(u, v), |c(u, v)|^w(u,v) * t(u, v)^(1 - w(u,v)) ] ---(3)
where c(u, v) is the DCT coefficient and t(u, v) is the luminance threshold for frequency (u, v). w is a constant that lies between 0 and 1: when w = 0, no masking occurs, and when w = 1, we have "Weber's Law" behavior. For our experiment, an empirical value of w = 0.7 was used.
In our implementation of the masking threshold, we did not calculate the luminance threshold using Peterson's model [3] as suggested by Watson [4]. Instead, as indicated in the JPEG standard, the JPEG default quantization tables are based on psychovisual thresholding and were derived empirically; if the default QT is divided by 2, an almost indistinguishable image can be reconstructed. This means the default QT can be treated as a general luminance threshold. We therefore replaced t(u, v) in equation (3) with the JPEG default QT value divided by 2.
As the Global masking level, we used the default JPEG QT.
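A small sketch of the model-1 threshold computation, following equation (3) with the empirical w = 0.7 and the substitution t(u, v) = (default QT)/2 described above; the function and variable names are ours, not the paper's.

    import numpy as np

    W = 0.7  # masking exponent w from equation (3); empirical value used in the paper

    def model1_masking(dct_block, default_qt):
        """Per-coefficient contrast-masking threshold (equation 3):
        M(u, v) = MAX[ t(u, v), |c(u, v)|^w * t(u, v)^(1 - w) ],
        with t(u, v) taken as the JPEG default QT entry divided by 2."""
        t = default_qt / 2.0
        c = np.abs(dct_block)
        return np.maximum(t, (c ** W) * (t ** (1.0 - W)))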
Basically, model-2 follows the J&S masking model (equation 1), but for the calculation of the multiplicative elevation factor Local(u, v), we use the luminance ratio of each block to the total image.
The well-known "Weber's Law" is expressed as:
df / f = constant (0.02) ---(4)
where f is the luminance and df is the just-noticeable difference.
This equation means that our perception is sensitive to luminance contrast rather than to absolute luminance values themselves. At a given luminance f, if a block's luminance differs only a little from f, then distortion in that block is less visible and a lot of perceptually unimportant information can be dropped.
The adaptation is derived as a function of the ratio of the block's average luminance to the average luminance of the total image, using the following formula:
---(5)
The maximum threshold elevation "max_e" and the minimum threshold (minimum luminance ratio) "min_t" are the parameters that must be determined experimentally.
As the Global masking level, we used the default JPEG QT.
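Since equation (5) itself is not preserved in this copy, the sketch below shows only one plausible reading of the description: the ratio of block-average luminance to image-average luminance, bounded by min_t and max_e, is used as the multiplicative elevation, with unity elevation for the DC term as in the J&S convention. The clamping form is an assumption, not the paper's exact formula.

    import numpy as np

    def model2_elevation(block, image_mean_luminance, max_e=2.0, min_t=0.01):
        """Illustrative stand-in for equation (5): the Local elevation is
        derived from the ratio of the block's average luminance to the
        average luminance of the whole image, clamped to [min_t, max_e].
        The exact form of equation (5) is not reproduced in this excerpt."""
        ratio = block.mean() / image_mean_luminance
        return float(np.clip(ratio, min_t, max_e))

    def model2_masking(block, image_mean_luminance, default_qt):
        """M(u, v) = Global(u, v) * Local(u, v) (equation 1), with the JPEG
        default QT as the global level and unity elevation for the DC term."""
        local = np.full((8, 8), model2_elevation(block, image_mean_luminance))
        local[0, 0] = 1.0
        return default_qt * local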
For the following experiment, we chose the preferred values max_e = 2 and min_t = 0.01 for the parameters of model-2. Five images were used in this experiment: a photo of a human face ("Lena"), a flower scene ("flowers"), a photo of an animal face ("baboon"), and two photos of airplanes (F-18 and Pitts).
The bit rate is calculated after compressing the image file using the pack (Huffman coding) program. For the evaluation of image quality, we used two evaluation methods. The first is SNR (signal-to-noise ratio). To provide further insight into the subjective quality of the models, we also used the DCON metric [5]. This algorithm takes as input two images, a reference and a test, and compares their luminance difference pixel by pixel. The formula is as follows:
DCON = 1/N * sum((y1-y2)/(y1+y2) + 23) --- (6)
This method is simple but very competitive with other, more complicated human-eye-model metrics.
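For reference, the two evaluation measures can be sketched as follows. The SNR definition is standard; because equation (6) appears slightly garbled in this copy, the DCON sketch places the constant in the denominator and takes the absolute luminance difference, which should be treated as an assumption rather than the exact published formula.

    import numpy as np

    def snr_db(reference, test):
        """Signal-to-noise ratio in dB between a reference and a test image."""
        reference = reference.astype(np.float64)
        test = test.astype(np.float64)
        noise = np.sum((reference - test) ** 2)
        return 10.0 * np.log10(np.sum(reference ** 2) / noise)

    def dcon(reference, test, c=23.0):
        """Pixel-by-pixel luminance-contrast difference in the spirit of
        equation (6).  The placement of the constant c follows our reading
        of the printed formula and is an assumption."""
        y1 = reference.astype(np.float64)
        y2 = test.astype(np.float64)
        return np.mean(np.abs(y1 - y2) / (y1 + y2 + c))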
Table 2 summarizes the results of these experiments.
Both model-1 and model-2 compress about 10% better than baseline JPEG with almost no perceptual loss in quality. Figure 4 shows one example ("flowers") of the output images of our models in comparison with the original image and the baseline JPEG output.
Figure 4: (1) original image, (2) Baseline JPEG, (3) model-1, (4) model-2.
Klein of the U.C. Berkeley optometry school has reported techniques for improving the quality of JPEG at high compression rates from the viewpoint of the vision community [8]. He suggests that with improved human vision models the quantization step could be made more effective by considering effects such as mean luminance, color, bandpass filters in spatio-temporal frequency and orientation, contrast masking, and human contrast sensitivity.
The primary contribution of our work is that it details encoding-specific prequantization algorithms that compress images at high compression rates with minimal artifacts. Our future work will include extending this work to other human-vision-related factors such as spatio-temporal frequency and orientation.
[2] G. E. Legge and J. M. Foley, "Contrast masking in human vision", Journal of the Optical Society of America, 70(12), pp. 1458-1471, 1980.
[3] A. J. Ahumada and H. A. Peterson, "Luminance-Model-Based DCT quantization for color image compression", SPIE: Human Vision, Visual Processing, and Digital Display III, Vol. 1666, 1992, pp. 365-374.
[4] A. B. Watson, "DCT quantization matrices visually optimized for individual images", SPIE: Human Vision, Visual Processing, and Digital Display IV, Vol. 1913, Feb. 1993, pp. 202-216.
[5] D. R. Fuhrmann, J. A. Baro, and J. R. Cox, "Experimental evaluation of psychophysical distortion metrics for JPEG-encoded images", SPIE: Human Vision, Visual Processing, and Digital Display, Vol. 1913, pp. 179-190.
[6] S. J. Daly, "The visible difference predictor: an algorithm for the assessment of image fidelity", SPIE: Human Vision, Visual Processing, and Digital Display III, Vol. 1666, 1992, pp. 2-15.
[7] J. L. Mannos and D. J. Sakrison, "The effects of a visual fidelity criterion on the encoding of images", IEEE Transactions on Information Theory, IT-20, pp. 525-536.
[8] S. A. Klein, A. D. Silverstein, and T. Carney, "Relevance of human vision to JPEG-DCT compression", SPIE: Human Vision, Visual Processing, and Digital Display III, Vol. 1666, 1992, pp. 200-215.