I. INTRODUCTION
H.264/AVC [Reference Wiegand, Sullivan, Bjøntegaard and Luthra1] is widely adopted in many applications because of its advanced coding tools. Under the same quality constraint, the bit-rate saving of H.264/AVC is significant when compared with predecessors such as MPEG-2 and MPEG-4. It should be noted that video frame quality is considerably affected by the quantization parameter (QP) assigned to each frame. Owing to varying content in video frames, the quality may fluctuate considerably and careless assignment of QP may result in serious distortion in certain frames. This negative effect may not be acceptable in applications such as video surveillance and video archiving, where the recorded frames may be critically viewed afterwards and the quality of every frame should therefore be preserved equally well. The objective of this research is to develop a distortion–quantization (D–Q) model so that a suitable QP can be assigned to each frame efficiently according to the frame content, which helps achieve constant quality video coding.
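The measurement of quality has long been a research focus of video processing. The most commonly used metric is peak signal-to-noise ratio (PSNR), which for 8-bit content is defined as
$$\mathrm{PSNR}(x, y) = 10 \log_{10} \frac{255^2}{\mathrm{MSE}(x, y)}, \qquad (1)$$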
where MSE(x, y) is the mean squared error between two contents x and y, e.g., the original/reference frame and the coded/processed frame, respectively. Simplicity is the major advantage of PSNR and comparing different algorithms based on PSNR is easy. Although PSNR is sometimes questioned for its lack of representing subjective or perceptual quality, when the original video is available, PSNR still serves as a pretty good indicator of quality degradation from the process of lossy compression. To further reflect the subjective quality in measurement, many researchers [Reference Wang, Bovik, Sheikh and Simoncelli2–Reference Moorthy and Bovik5] tried to take human visual systems (HVS) into account. Structural SIMilarity index, SSIM [Reference Wang, Bovik, Sheikh and Simoncelli2], is one of the well-known metrics. SSIM of two contents x and y is defined as
$$\mathrm{SSIM}(x, y) = \frac{(2\mu_x \mu_y + C_1)(2\sigma_{xy} + C_2)}{(\mu_x^2 + \mu_y^2 + C_1)(\sigma_x^2 + \sigma_y^2 + C_2)}, \qquad (2)$$
where $\mu_x$ ($\mu_y$) and $\sigma_x$ ($\sigma_y$) are the local mean and standard deviation of x (y), respectively, $\sigma_{xy}$ is the local covariance of x and y, and $C_1$ and $C_2$ are small constants that avoid instability when the denominator is close to zero. Considered to be more closely related to the HVS, SSIM is also suggested these days for evaluating the quality of processed frames in video coding. Encoding algorithms that explicitly employ SSIM have also been proposed [Reference Huang, Ou, Su and Chen6,Reference Ou, Huang and Chen7]. Since PSNR and SSIM are commonly used in video codec designs, we adopt them as examples to demonstrate the idea of constant quality coding. Other quality metrics that have a higher correlation with human perceptual quality can be better choices, but their computational complexity may be too high for a real-time encoding system.
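As a concrete illustration of the two metrics, the following minimal sketch computes PSNR and a block-based SSIM with NumPy. It assumes 8-bit luminance frames; the constants `c1` and `c2` are the commonly used defaults from the SSIM literature and the function names are illustrative, not taken from this paper. The SSIM here is averaged over non-overlapping 8 × 8 blocks, which matches the evaluation setting reported in Section IV.

```python
import numpy as np

def psnr(x, y, peak=255.0):
    """PSNR in dB between two equally sized frames (8-bit peak assumed)."""
    mse = np.mean((x.astype(np.float64) - y.astype(np.float64)) ** 2)
    if mse == 0:
        return float("inf")  # identical frames
    return 10.0 * np.log10(peak ** 2 / mse)

def block_ssim(x, y, block=8, c1=(0.01 * 255) ** 2, c2=(0.03 * 255) ** 2):
    """Mean SSIM over non-overlapping block x block windows.

    c1 and c2 are assumed default constants, not values from the paper.
    """
    x = x.astype(np.float64)
    y = y.astype(np.float64)
    h = x.shape[0] - x.shape[0] % block  # ignore incomplete border blocks
    w = x.shape[1] - x.shape[1] % block
    scores = []
    for r in range(0, h, block):
        for c in range(0, w, block):
            bx = x[r:r + block, c:c + block]
            by = y[r:r + block, c:c + block]
            mx, my = bx.mean(), by.mean()
            vx, vy = bx.var(), by.var()
            cxy = ((bx - mx) * (by - my)).mean()  # local covariance
            scores.append(((2 * mx * my + c1) * (2 * cxy + c2)) /
                          ((mx ** 2 + my ** 2 + c1) * (vx + vy + c2)))
    return float(np.mean(scores))
```

The corresponding distortion measure used later for SSIM would simply be `1.0 - block_ssim(x, y)`.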
To achieve constant quality video coding, one may think that encoding the entire video with a fixed QP value would work. Figure 1 shows an example of encoding the video “Foreman” with a fixed QP equal to 30. We can see that the PSNR values vary and that smaller QP values should have been assigned in the latter part of this video. A similar problem exists if SSIM is used as the measurement. Therefore, constant quality video coding is not a trivial issue and extra attention should be paid to the encoder. Up to now, most of the existing work related to constant quality video coding has adopted PSNR as the measurement. Huang et al. [Reference Huang and Hang8] proposed one of the early approaches, which encodes the video several times and employs the Viterbi algorithm to pick a suitable QP for each frame. To be more specific, a trellis structure is formed with each node representing a QP value. After encoding the video with different QP's, a few nodes resulting in similar PSNR values are clustered. By connecting nodes (of adjacent frames) in clusters, a QP value can be assigned to each frame so that the resulting PSNR values are within a pre-defined range. However, since every frame has to be encoded several times, this scheme is quite time-consuming and only acceptable in off-line applications. To attain more efficient quality control, D–Q and/or rate-distortion (R–D) models have been developed to facilitate QP assignment. Ma et al. determined the relationship between PSNR and QP to develop a rate–quantization (R–Q) model for effectively allocating bit budgets [Reference Ma, Gao and Lu9]. Kamaci et al. made use of the Cauchy density function to describe the distribution of AC coefficients after the block discrete cosine transform and developed an effective D–Q model [Reference Kamaci, Altunbasak and Mersereau10].
In [Reference Wu and Su11], sum of absolute transform differences is used to determine the related parameters of a D–Q model, which can accurately predict the PSNR in intra coded frames. De Vito et al. assigned or adjusted the QP values according to the difference between the average PSNR of previously encoded frames and the target PSNR [Reference Vito and Martin12]. If the difference is small, the QP of previous frame is used. Han et al. encoded the video twice and used the information of first-run encoding as the reference to attain constant quality coding [Reference Han and Zhou13]. In our opinion, the major drawback of the existing methods is the requirement of encoding the video several times. In addition, a practical D–Q model has not been successfully developed. In this research, we aim at proposing a framework, which can adopt more flexible quality measurements, to achieve constant quality video coding. Before encoding a frame, we will approximately predict its D–Q relationship from frame content to help determine a suitable QP such that the resultant quality is close to the target value. The model parameters should be content adaptive since frames with different characteristics should have varying D–Q relationships. Different from the existing approaches, we do not encode every single frame several times to collect the data points for forming the D–Q curve. A trained content adaptive D–Q model is built for assigning a QP value efficiently and most of the frames will thus be encoded just once. A few frames will be encoded at most twice to pursue the objective of constant quality encoding and to avoid significant increase of encoding time as well.
The rest of the paper is organized as follows. Model training of our proposed scheme is described in Section II and the complete QP assignment procedure is presented in Section III. Section IV demonstrates the experimental results, followed by the conclusion in Section V.
II. D–Q MODEL
As we aim at building a model that links the distortion and QP, the measurement of distortion has to be defined first. The measurement of distortion based on PSNR, $D_{\mathrm{PSNR}}$, is related to MSE, and we can simply use the sum of squared errors (SSE) as the measurement. SSIM will be close to one if the contents to be compared are similar, so we define the distortion $D_{\mathrm{SSIM}}$ as 1 − SSIM. It is observed that a power function can reasonably depict the relation between $D_{\mathrm{PSNR}}$ and quantization in both intra- and inter-coding. We employ the following function to describe the relationship, i.e.,
$$D_{\mathrm{PSNR}} = \alpha \cdot QP^{\beta}, \qquad (3)$$
where α and β are the two model parameters. It should be noted that most of the existing algorithms use Qstep in the fitting function while we choose QP instead. The reason for doing so is to develop a single-parameter model, which will be explained later. To verify the power function, we encode some test CIF videos, including Foreman, Coastguard, Container, Football, Mobile, Paris, and Stefan, each with 100 frames, by using intra coding with QP's ranging from 20 to 40 and record the corresponding $D_{\mathrm{PSNR}}$. The curves from the collected data are matched with the above power function by regression. The $R^2$ values are all very close to one, which means that the chosen function fits the data very well. In fact, by replacing $D_{\mathrm{PSNR}}$ with $D_{\mathrm{SSIM}}$, we observe a similar relationship, and again the $R^2$ values are almost equal to one. However, the parameters α and β, which are listed in Table 1, vary from video to video. Existing work usually chose to train on some data in the same video or to employ the data in the previously decoded frames to acquire these parameters for subsequent encoding. The major disadvantage is that quality fluctuation may be observed in the first few encoded frames if inappropriate parameters are set, so more encoding passes may be required. In addition, when a scene change happens, the parameters have to be determined again or the performance will be seriously affected.
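The regression step can be illustrated with a short sketch. The text does not state how the power function was fitted, so the following assumes ordinary least squares in the log-log domain, where a power law of the form D = α·QP^β becomes a straight line; the data are synthetic, and the function name and the $R^2$ computed on log values are choices made for this example only.

```python
import numpy as np

def fit_power_law(qps, distortions):
    """Fit D = alpha * QP**beta by least squares on (ln QP, ln D).

    Returns (alpha, beta, r2), with r2 measured in the log domain.
    """
    log_qp = np.log(np.asarray(qps, dtype=np.float64))
    log_d = np.log(np.asarray(distortions, dtype=np.float64))
    beta, log_alpha = np.polyfit(log_qp, log_d, 1)  # slope, intercept
    residual = log_d - (beta * log_qp + log_alpha)
    r2 = 1.0 - residual.var() / log_d.var()
    return float(np.exp(log_alpha)), float(beta), float(r2)

# Synthetic (not measured) distortions generated from alpha = 0.5, beta = 3.
qps = np.arange(20, 41)
alpha, beta, r2 = fit_power_law(qps, 0.5 * qps ** 3.0)
```

With real encoder output, `distortions` would hold the per-QP SSE or 1 − SSIM values recorded for one sequence, and the fitted pair would correspond to one row of Table 1.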
The objective of this research is to appropriately estimate these parameters by using a content-adaptive model. The first step is to collect various data samples for training. To begin with, the frame will be divided into basic units. There are several choices for deciding the size of basic units, e.g., an entire frame, a group of macroblocks (MB's) or a single MB. Designing a frame model, i.e., determining a QP value according to the feature representing the entire frame, sounds a reasonable and straightforward approach. A feature representing the frame is computed to determine α and β in equation (3) for the whole frame. However, we found that a slight model inaccuracy will result in poor determination of QP. Using MB's directly for model training should be more flexible. Nevertheless, according to our experience, when the unit size is too small, it will be difficult to determine a well-defined relationship between the content and the parameters. An obvious example is that we may easily obtain small blocks with uniform colors and encoding such blocks with different QP values may generate unexpected results. It is worth noting that such blocks occupy a large portion in common frames. In other words, there will be a large number of outliers in our training data. Training the model with so many “unusual” blocks will be challenging and the model parameters may not be acquired accurately. Therefore, we choose to use a group of MB's as the basic unit in our framework. For a CIF video frame, we divide it into basic units as shown in Fig. 2. A unit contains 33 MB's so a frame contains 12 basic units. Such division may look a bit awkward but we have a reason for this choice. By dividing the frame across the center as shown in Fig. 2, we can obtain blocks or basic units that contain meaningful content more easily since there are usually important objects at the center of a frame. Besides, the units should be reasonably large too. 
In other words, we expect that a unit can consist of areas with different characteristics so that the number of outliers can be reduced to facilitate the training process. Furthermore, a larger number of “meaningful” units certainly helps QP determination.
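The unit layout can be made concrete with a few lines of code. This sketch assumes that each basic unit is 11 MB's wide and 3 MB's tall, so that a CIF frame (22 × 18 MB's) is split into two columns across its vertical center; the orientation and the helper name are assumptions, since Fig. 2 is not reproduced here.

```python
MB_SIZE = 16                    # macroblock width and height in pixels
UNIT_W_MB, UNIT_H_MB = 11, 3    # assumed orientation: 11 MBs wide, 3 MBs tall

def basic_units(width, height):
    """Return (x, y, w, h) pixel rectangles of the basic units, row by row."""
    unit_w = UNIT_W_MB * MB_SIZE
    unit_h = UNIT_H_MB * MB_SIZE
    return [(x, y, unit_w, unit_h)
            for y in range(0, height - unit_h + 1, unit_h)
            for x in range(0, width - unit_w + 1, unit_w)]

cif_units = basic_units(352, 288)   # 2 columns x 6 rows of 33 MBs each
```

Each rectangle covers 33 MB's, and the same function applied to a larger frame simply yields more units of the same size.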
We first deal with the intra-coded frames. Since many frames in a video have similar content, we do not use video sequences for training but select still images. We use 200 images from the Berkeley image database [Reference Martin, Fowlkes, Tal and Malik14]. Each image is scaled and cropped properly to the CIF frame size. These images are concatenated into a video, which is encoded with various QP's. The distortion of each basic unit and the corresponding QP values are collected. The relationship between the distortion and QP shown in Equation (3) still holds. A very important finding is that there exists a linear relationship between ln(α) and β for both PSNR and SSIM, as shown in Fig. 3. The $R^2$ values of this linear fit are both as high as 0.99. This fact indicates that Equation (3) can be reduced to a model with only one variable: substituting the linear fit of ln(α) in terms of β into Equation (3) leaves β as the only free parameter. For I-frames, the D–Q model can thus be expressed as
for PSNR, and
for SSIM. These relationships are derived by regression. In fact, according to our tests, a similar relation can also be found in P-frames and the data can be fitted well by
and
The $R^2$ values in P-frames can also reach 0.99 for both distortion measurements. In our opinion, since PSNR and SSIM behave quite differently, such a relationship may exist in many different quality metrics. If PSNR is adopted as the quality metric, we can use Equations (4) and (6) to determine the mapping between QP and the distortion for a given frame. For SSIM, Equations (5) and (7) will be employed.
The next step is to seek an efficient way to choose suitable β for a basic unit. It is worth noting that β is content-related. According to our observations, if the content can be affected by lossy coding more easily, the value of β will be larger. On the other hand, for the unit with relatively more uniform content, β will be quite small. Therefore, we would like to predict the effects of compression on content so that a reasonably good β can be selected. One way to achieve this is to encode the frame with different QP's to observe the curve but it may be computationally prohibitive. In other words, this “pre-processing” has to be efficient to avoid considerable increase in the load of video coding. Besides, we aim at developing a more general framework for constant quality H.264 video coding, in which the distortion measurement may be different in targeted applications. We thus adopt the following strategy. The pre-processing or, in fact, a process of distortion is applied on the input frame and then the selected quality measurement will be used to evaluate the degradation of these distorted versions. That is, we make use of these degradation measurements to help us select a suitable β.
Again, we collect training data for coding with different QP's to determine β and, at the same time, preprocess these training data to obtain the distortions. By examining β and the degradations, we would like to know whether such a solid relationship exists. After various trials, the preprocessing we consider right now includes two parts: resizing and singular value decomposition (SVD). The resizing process quickly removes high-frequency textures. We simply calculate the 16×16 block means to obtain a down-sampled version of an input frame. Then, this small frame is filtered by a 3×3 Gaussian low-pass filter. Finally, we linearly interpolate it to form the frame with the original frame size. Figure 4(a) shows a seriously blurred version of Foreman. The reason for removing high-frequency textures is to predict the effects of lossy compression as these parts are affected more. The other process is applying 16×16 block SVD after the block mean is removed. We then use the block mean and the important eigenvectors/eigenvalues to reconstruct the block. Such blocks will contain significant content and can serve as reliable references to see what may be left after coding. The first and second eigenvector pairs are used to reconstruct the block as shown in Fig. 4(b). Although the blocky artifacts are seen, the content can still be preserved quite well. In addition, we found that this SVD process performs better in blocks with more textures. Given these two pre-processed or distorted frames, we calculate their quality degradation (D PSNR or D SSIM for now) compared with the raw input frame. Then, the two distortion measurements are combined to form a so-called “content feature” for evaluating the single parameter β in our model. Since it can be shown from Fig. 4 that the degrees of distortions in these two steps are quite different as resizing results in more serious quality degradation, the two evaluations are weighted and summed to form the feature. 
In our training data evaluated in SSIM, the average distortion of the resized frames, $D^{\mathrm{SSIM}}_{\mathrm{resize}}$, is around K = 4 times that of the SVD-processed frames, $D^{\mathrm{SSIM}}_{\mathrm{svd}}$. We thus calculate the “spatial feature”, $F^{\mathrm{SSIM}}_{\mathrm{spatial}}$, by
which will be used to determine β. In the case of using PSNR, K is around 5.5 and the two values are weighted accordingly to obtain $F^{\mathrm{PSNR}}_{\mathrm{spatial}}$, i.e.,
Figure 5 shows the relationship between the extracted feature and β in the training data of I-frames. We also found that the data are clustered and can be fitted well by using regression. For PSNR, the data can be depicted reasonably well by
The fitting function for SSIM is
Since only intra-coding is applied in I-frames, the feature for I-frames, $F^{\mathrm{PSNR/SSIM}}_{I}$, is simply $F^{\mathrm{PSNR/SSIM}}_{\mathrm{spatial}}$.
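The two pre-processing steps behind the spatial feature (block-mean resizing with Gaussian smoothing, and rank-2 block SVD reconstruction) can be sketched as follows. This is a simplified illustration under several assumptions: the frame is a float luminance array whose dimensions are multiples of 16, the 3 × 3 Gaussian uses a separable [1, 2, 1]/4 kernel, and the linear interpolation aligns the end points of the small and full-size grids. The weighting that combines the two distortions into the feature is not reproduced here.

```python
import numpy as np

def block_means(frame, b=16):
    """Down-sample by replacing each b x b block with its mean."""
    h, w = frame.shape
    return frame.reshape(h // b, b, w // b, b).mean(axis=(1, 3))

def gaussian3x3(img):
    """Separable 3x3 low-pass filter with assumed weights [1, 2, 1] / 4."""
    k = np.array([1.0, 2.0, 1.0]) / 4.0
    p = np.pad(img, 1, mode="edge")
    rows = k[0] * p[:, :-2] + k[1] * p[:, 1:-1] + k[2] * p[:, 2:]
    return k[0] * rows[:-2] + k[1] * rows[1:-1] + k[2] * rows[2:]

def upsample_linear(small, shape):
    """Linearly interpolate a small frame back to the given (h, w) shape."""
    h, w = shape
    ys = np.linspace(0, small.shape[0] - 1, h)
    xs = np.linspace(0, small.shape[1] - 1, w)
    tmp = np.stack([np.interp(xs, np.arange(small.shape[1]), row)
                    for row in small])
    return np.stack([np.interp(ys, np.arange(small.shape[0]), col)
                     for col in tmp.T], axis=1)

def resized_version(frame, b=16):
    """Blurred frame: block means, Gaussian filter, then interpolation."""
    return upsample_linear(gaussian3x3(block_means(frame, b)), frame.shape)

def svd_version(frame, b=16, rank=2):
    """Rebuild each b x b block from its mean and top singular components."""
    out = np.empty_like(frame, dtype=np.float64)
    for r in range(0, frame.shape[0], b):
        for c in range(0, frame.shape[1], b):
            blk = frame[r:r + b, c:c + b]
            mean = blk.mean()
            u, s, vt = np.linalg.svd(blk - mean, full_matrices=False)
            out[r:r + b, c:c + b] = mean + (u[:, :rank] * s[:rank]) @ vt[:rank]
    return out

def sse(x, y):
    """Sum of squared errors, the PSNR-based distortion measure."""
    return float(np.sum((x - y) ** 2))
```

For a frame `f`, `sse(f, resized_version(f))` and `sse(f, svd_version(f))` would give the two PSNR-based degradations that are then weighted and summed into the spatial feature.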
In P-frames, temporal information is required. As in regular video coding, we apply motion estimation with 16 × 16 blocks and a search range of ±8 to form a motion-compensated frame. Only the integer positions are searched. Similar to what we have done for I-frames, the distortion of this compensated frame is computed to determine the temporal feature, $F^{\mathrm{PSNR/SSIM}}_{\mathrm{temporal}}$. However, since intra-coding may still be employed in P-frames, we also calculate the spatial feature, $F^{\mathrm{PSNR/SSIM}}_{\mathrm{spatial}}$, and use the average of the two features to determine most of the P-frame features by
$$F^{\mathrm{PSNR/SSIM}}_{P} = \frac{F^{\mathrm{PSNR/SSIM}}_{\mathrm{spatial}} + F^{\mathrm{PSNR/SSIM}}_{\mathrm{temporal}}}{2}. \qquad (12)$$
The method of calculating the average value to form the feature does look a bit heuristic and one may even think of estimating the percentages of intra and inter coding in a frame to decide a more suitable weighting function. However, whether a block will be intra or inter coded may depend on the QP value. A block may become intra-coded when a smaller QP is used. Applying the block type prediction or classification before the QP assignment is thus less reasonable. In addition, separating the intra and inter coding in the model training process of P-frames is rather complicated. Therefore, we choose to take both spatial and temporal characteristics into account to form a P-frame feature and resort to the simplified model training process on a large number of collected data to achieve good performances. Figure 6 shows the relationship between the feature and β in P-frames in the case of PSNR and SSIM. Although some outliers exist, the fitting can still be good enough to help us choose suitable QP values of P-frames. The fitting curve for PSNR is
and that for SSIM is
As mentioned before, we will calculate the feature to determine the single parameter β for each basic unit. Then, the frame QP, $QP_F$, is determined such that the overall distortion will be as close to the target distortion as possible. That is,
where $D^{(i)}(QP)$ is the distortion of the ith basic unit estimated by Equations (4) and (6) or Equations (5) and (7), and $D_{\mathrm{target}}$ is the target distortion. The use of 12 units helps to reduce the negative effects of possible model inaccuracy in a single unit.
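The QP selection described above amounts to a small search over candidate QP values. The sketch below is only an illustration: it assumes the single-parameter form obtained by substituting ln(α) = a·β + b into the power function, it aggregates the unit distortions by their mean, and it searches QP values 1 to 51. The values of `a` and `b` must come from training and are passed in by the caller; the function names are hypothetical.

```python
import math

def unit_distortion(qp, beta, a, b):
    """Single-parameter model: ln(alpha) = a * beta + b, D = alpha * QP**beta."""
    return math.exp(a * beta + b) * qp ** beta

def select_frame_qp(betas, target, a, b, qp_range=range(1, 52)):
    """Pick the QP whose predicted mean unit distortion is closest to target.

    betas holds one predicted beta per basic unit (12 for a CIF frame).
    Averaging the units is an assumption made for this sketch.
    """
    def frame_distortion(qp):
        return sum(unit_distortion(qp, beta, a, b) for beta in betas) / len(betas)
    return min(qp_range, key=lambda qp: abs(frame_distortion(qp) - target))
```

Because the search covers at most 51 candidates and each evaluation is a handful of exponentials, this step adds negligible cost compared with the feature extraction itself.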
III. THE ENCODING PROCEDURE
Our objective is to strictly maintain the quality of each frame. That is, after a target quality is set, e.g., PSNR equal to 40 dB or SSIM equal to 0.92, the measured quality of each decoded frame should be as close to the target value as possible so that constant quality coding can be achieved. With the proposed D–Q model, the selection of QP can be done in a straightforward manner. Given an input frame, the feature is computed to determine the D–Q relations of basic units and the frame QP is chosen according to Equation (15). There are a few issues that affect the design of our proposed encoding procedure. First, adjacent frames in a video usually have similar content, which results in similar features, so calculating the features in every frame is not necessary. The spatial feature $F_{\mathrm{spatial}}$ is relatively cheap to compute but the temporal feature $F_{\mathrm{temporal}}$ is more time-consuming because of motion estimation. Therefore, if we can reuse the feature of a previous frame with similar content for computing the model parameter β, the whole encoding procedure will be more efficient. In other words, the spatial and temporal features of a frame will only be re-calculated if such a feature from a frame with the same frame type or similar content is not available. Second, the quality of the reference frame affects that of the currently encoded frame. The quality of the subsequent frames may be poor and larger quality variations may also be observed, especially when a scene-change frame appears and its QP is not appropriately assigned. Our strategy is to apply scene-change detection to determine the so-called key frames to build the D–Q model. We will then encode these frames carefully, probably with two runs, so that the quality of subsequent frames can also be maintained. Third, as mentioned before, the content of adjacent frames will be similar. 
If the frame coding types are also the same, the coding results of the previous frame can serve as a good indication of model accuracy. Therefore, the coding performance of the previous frame of the same type will be examined for model adjustment so that single-run coding may work as well as multiple-run coding. The flowchart of the encoding process is demonstrated in Fig. 7 and explained as follows.
A simple scene-change detection process that examines the luminance histograms of adjacent frames is adopted. The Bhattacharyya distance between the two histograms is calculated and compared with a threshold. If the distance is larger than the threshold, a scene change is detected and we call this scene-change frame the key frame. It should be noted that, although the key frame may need to be encoded as a P-frame, we only use the spatial feature $F_{\mathrm{spatial}}$ to calculate β, instead of using the P-frame feature $F_{P}$ shown in Equation (12), because a large number of intra-coded blocks will appear in this frame. After using $F_{\mathrm{spatial}}$ to determine the D–Q relation and the frame QP for encoding this frame, we usually encode this frame once again if the resulting quality of the decoded frame is not close to the target value. This two-pass encoding ensures that these important scene-change frames have the targeted quality. The model is slightly adjusted according to the first-run encoding results. We call this process the model update, which introduces an additional adjusting factor θ defined by
where a and b are the trained variables listed in Equations (4)–(7). That is, the denominator is the distortion predicted by our model and $D_p(QP_F)$ is the distortion that results from using $QP_F$ to encode the frame in the first run. In the second-run encoding of this frame, the model becomes
where D PSNR/SSIM is defined in Equations (4)–(7), and a better QP F can then be chosen accordingly. In other words, we simply adjust the parameter α in Equation (3) and this strategy is quite effective. Figure 8 shows the comparison of coding results on Foreman by using the original model and those by using the updated model with θ. In Figs 8(a) and 8(c), we encode all the frames by using intra coding only. In Figs 8(b) and 8(d), only the first frame is an I-frame and the other frames are coded as P-frames. The qualities of P-frames are then averaged. We select Foreman in this test since it contains large content variations and our original model does not perform that well. By using the simple scaling factor θ, the predicted quality, measured in either PSNR or SSIM, will be close to the actual quality after model adjustment in second-run encoding.
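A small worked example may help to show how the scaling factor changes the second-run QP. The sketch below assumes a single power-law model D = α·QP^β for the whole frame (the actual scheme combines 12 basic units), and the numerical values of α and β in the usage note are made up for illustration.

```python
def second_pass_qp(qp_first, alpha, beta, target, actual):
    """Return (theta, qp) for the second run of a frame.

    theta is the ratio of the measured first-run distortion to the model
    prediction; the second-run QP solves theta * alpha * QP**beta = target.
    """
    predicted = alpha * qp_first ** beta              # model output, first run
    theta = actual / predicted                        # model-update factor
    qp = (target / (theta * alpha)) ** (1.0 / beta)   # invert the scaled model
    return theta, min(51, max(0, round(qp)))          # clip to the H.264 QP range
```

For instance, with the hypothetical values α = 0.01 and β = 2.5, a first run at QP 30 whose measured distortion is 20% above the prediction gives θ = 1.2, and the scaled model then asks for a smaller QP in the second run.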
For other frames, we will use the coding result of the previous frame with the same frame type as the reference to adjust our D–Q model. That is, θ will be computed by dividing the resulting distortion of the previous frame (i.e., D p(QP F) in Equation (16)) by the predicted distortion so that most of the frames will be encoded only once. As mentioned before, only the spatial feature F spatial will be used to find β in the scene-change frames. It should be noted that there will be a couple of special cases for other frames. (1) For the first P-frame after the scene-change frame, since its temporal feature F temporal is not available, we will calculate its own feature F P. In addition, since the previous P-frames do not have similar content, this P-frame may be encoded twice without referring to the coding results of previous frames. (2) For the first I-frame after the scene-change frame, we will calculate its own F spatial to calculate β and may also encode this frame twice to use its own first-run coding results for model adjustment. To sum up, the features will be computed and the coding may be applied twice in the following three cases: (1) The scene-change or key frame, (2) the first I-frame after the key frame, and (3) the first P-frame after the key frame. For most of the other I/P-frames, we basically employ the existing features and use the coding results of the frames with similar content and with same frame type for model adjustment. Then, the calculation of the features will not be applied repeatedly. Finally, to achieve extremely consistent video quality, the coding result of each frame will be checked. If the result deviates from the target too far, we may encode that frame once more and the model is also adjusted by Equation (17). In our scheme, if the absolute difference of target PSNR and the resulting PSNR is larger than 0.25 dB, the frame will be encoded again. In SSIM, the threshold of absolute difference is set as 0.015. 
A frame will not be encoded more than twice to maintain the efficiency of the proposed method.
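The two decisions that drive this procedure, detecting a key frame and deciding whether a frame needs a second pass, can be sketched as below. The Bhattacharyya distance is taken here as the negative logarithm of the Bhattacharyya coefficient of the normalized histograms, which is one common definition; the scene-change threshold is a placeholder because its value is not given in the text, while the 0.25 dB and 0.015 tolerances are the ones stated above.

```python
import math

def bhattacharyya_distance(hist_a, hist_b):
    """Bhattacharyya distance between two histograms (normalized internally)."""
    sa, sb = float(sum(hist_a)), float(sum(hist_b))
    bc = sum(math.sqrt((p / sa) * (q / sb)) for p, q in zip(hist_a, hist_b))
    return -math.log(bc) if bc > 0 else float("inf")

def is_scene_change(hist_prev, hist_cur, threshold=0.3):
    """True if the luminance histograms differ enough (placeholder threshold)."""
    return bhattacharyya_distance(hist_prev, hist_cur) > threshold

def needs_second_pass(target, achieved, metric="psnr"):
    """True if the first-run quality misses the target by more than the tolerance."""
    tolerance = 0.25 if metric == "psnr" else 0.015   # dB for PSNR, absolute for SSIM
    return abs(target - achieved) > tolerance
```

In an encoder loop, `is_scene_change` would be evaluated on the 256-bin luminance histograms of consecutive frames, and `needs_second_pass` would be checked once after the first encoding of each frame.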
IV. EXPERIMENTAL RESULTS
We implemented our scheme in JM 15.1 reference software of H.264/AVC [15] to evaluate the performances of our proposed D–Q model and encoding procedure. The settings are as follows:
(1) Rate distortion optimization is enabled.
(2) Motion search range for coding is ±16.
(3) Fast full search algorithm is used.
(4) CAVLC is used.
(5) De-blocking filter is enabled.
We set the target PSNR as 30, 35, 40, and 45 dB and the target SSIM as 0.91, 0.95, and 0.99 to test the feasibility of our scheme on different quality measures. SSIM is calculated in non-overlapping 8 × 8 blocks. Eight CIF videos including Coastguard, Monitor, Table, Foreman, Mobile, Stefan, News, and Paris, each with 300 frames, are used in our experiments. Figures 9 and 10 show the performance of constant quality video coding measured in PSNR and SSIM, respectively. We can see that the resulting quality reaches the target quality in all of the cases. When the target quality is set lower, the variations of both PSNR and SSIM become larger because of the wider range of QP. The variations are more obvious in the latter part of Foreman because of fast camera motions. Two-pass encoding is not applied very often; the most frequent case happens when the target SSIM is set as 0.91 in Foreman, in which only 16 out of 300 frames are encoded twice. In other videos such as Monitor and Mobile, all frames are encoded just once except for the first two frames, which are the first I- and P-frames, respectively, and do not have any previous coding results.
Table 2 compares the performance of our quality control algorithm with the one proposed by De Vito et al. [Reference Vito and Martin12], in which the PSNR and QP values in previous frames are used to maintain constant quality in one pass. Five sequences at three different target PSNR values are tested. The average absolute deviations of PSNR are 0.42 and 1.02 dB in our method and [Reference Vito and Martin12], respectively. The average PSNR variances are 0.06 and 0.25 dB in our method and [Reference Vito and Martin12], respectively. Therefore, our scheme achieves better constant quality coding performance. Table 3 shows another comparison of our scheme with [Reference Han and Zhou13] and [Reference Zhang, Ngan and Chen16], both of which are two-pass schemes. That is, they apply first-pass encoding to estimate the R–D curve and second-pass encoding to achieve constant quality coding. We use the resulting PSNR values of [Reference Han and Zhou13,Reference Zhang, Ngan and Chen16] as targets to compress the videos with our scheme. We can see that the average PSNR values are close to the targeted ones and the PSNR variances of our scheme are lower than those of the other two methods. It should be noted that our scheme is more efficient since most of the frames are encoded only once.
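The two statistics used in these comparisons can be computed per sequence as in the short sketch below. The exact definitions behind Tables 2 and 3 are not spelled out in the text, so this assumes the mean absolute deviation from the target and the population variance of the per-frame PSNR values; the function name is illustrative.

```python
from statistics import fmean, pvariance

def quality_stats(psnr_per_frame, target):
    """Return (mean absolute deviation from target, variance) of per-frame PSNR."""
    mad = fmean(abs(p - target) for p in psnr_per_frame)
    return mad, pvariance(psnr_per_frame)
```

Averaging these two numbers over all tested sequences and targets would give figures comparable in form to those quoted above.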
Furthermore, we demonstrate the performance of the proposed quality control on videos with a larger resolution. Four 4CIF (704 × 576) videos are tested and the performance of maintaining SSIM is shown in Fig. 11. The size of the basic unit is set as 11 × 3 MB's, the same as for CIF videos, and the same model parameters are also employed. The reasonably good performance in Fig. 11 indicates that, as long as most of the basic units used in the training process contain meaningful content, the built model can work well on videos with different resolutions.
Finally, we would like to discuss the strategy of video encoding involving B-frames. In our framework, we choose not to train the models of B-frames for the following reasons. First, the number of B-frames (between P-frames) may vary according to the settings of encoders. Training models with different parameter settings is not a flexible approach. Second, several prediction modes can be used in a B-frame, including list 0, list 1, bi-predictive, and direct predictions. As mentioned earlier, we do not perform block classification before the actual encoding process, so it would be difficult to determine reasonable features and the corresponding weighting factors. Therefore, instead of training the models for B-frames, we propose a simple QP determination method that assigns the QP value according to the related P-frames. More specifically, the QP value for a B-frame is set as $\left\lfloor \frac{QP_{list0} + QP_{list1}}{2} \right\rfloor$ when both the list 0 and list 1 references are available and both of them are encoded as P-frames. If either the list 0 or the list 1 reference is unavailable or is not a P-frame, the QP value of the only reference P-frame is used to encode the current B-frame. Figure 12 illustrates the performance of the proposed B-frame QP determination. We can see that the target PSNR can be achieved in all of the sequences. Although the quality variations are a bit larger than those shown in Fig. 9, the performance is still satisfactory.
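The B-frame rule is simple enough to state directly in code. In this sketch a reference QP is passed as `None` when that reference is unavailable or is not a P-frame; the function name and the error raised when neither reference is usable are choices made for the example, since the text does not cover that case.

```python
def b_frame_qp(qp_list0=None, qp_list1=None):
    """QP for a B-frame from the QPs of its P-frame references.

    Pass None for a reference that is unavailable or not a P-frame.
    """
    if qp_list0 is not None and qp_list1 is not None:
        return (qp_list0 + qp_list1) // 2    # floor of the average
    if qp_list0 is not None:
        return qp_list0
    if qp_list1 is not None:
        return qp_list1
    raise ValueError("no P-frame reference available")
```

For example, references coded at QP 28 and QP 31 lead to a B-frame QP of 29, while a single usable reference at QP 28 is reused as is.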
V. CONCLUSION
In this research, a frame quality control mechanism for H.264/AVC is proposed. A suitable QP can be assigned to each frame so that the target frame quality can be achieved. A single-parameter D–Q model is derived and the model parameter can be determined from the frame content. The results obtained with quality measurements such as PSNR and SSIM verify the feasibility of our proposed method. In future work, we will test more quality metrics to further verify the generality of this framework.
ACKNOWLEDGEMENT
This research is supported by “The Aim for the Top University Project” of the National Central University and the Ministry of Education, Taiwan, R.O.C.