This thesis has presented techniques for multi-view video acquisition, compression and rendering, culminating in a complete multi-view video coding system architecture that yields high-quality image rendering while retaining high compression efficiency. In this chapter, the achievements are summarized and possible extensions of the techniques are proposed. Finally, we discuss several perspectives on multi-view video.
For estimating accurate depth images, we have proposed a depth-estimation technique that uses several views simultaneously and circumvents the need for image-pair rectification. The algorithm integrates two smoothness constraints, ensuring smooth depth transitions both across image lines and across views. We have shown that both constraints can be efficiently integrated into a one-dimensional dynamic programming optimization. Note that this optimization of the depth images is performed independently for each frame.
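To make the structure of this optimization concrete, the following minimal sketch outlines the dynamic programming recursion for a single scanline. It assumes a precomputed multi-view matching-cost volume and a linear penalty on depth-label transitions; both the cost model and all identifiers are illustrative simplifications, not the exact implementation of the thesis.

```python
import numpy as np

def estimate_depth_scanline(matching_cost, smoothness_weight=1.0):
    """Estimate depth labels for one scanline by dynamic programming.

    matching_cost : (W, D) array; matching_cost[x, d] is the (multi-view)
    photo-consistency cost of assigning depth hypothesis d to pixel x.
    A linear penalty on depth-label jumps stands in for the smoothness
    constraint across the line.
    """
    width, num_depths = matching_cost.shape
    labels = np.arange(num_depths)
    # Transition penalty between depth labels of neighboring pixels.
    transition = smoothness_weight * np.abs(labels[:, None] - labels[None, :])

    cost = np.empty_like(matching_cost, dtype=float)
    backptr = np.zeros((width, num_depths), dtype=int)
    cost[0] = matching_cost[0]
    for x in range(1, width):
        # total[i, j]: cumulative cost of label i at x-1 followed by j at x.
        total = cost[x - 1][:, None] + transition
        backptr[x] = np.argmin(total, axis=0)
        cost[x] = matching_cost[x] + np.min(total, axis=0)

    # Backtrack the minimum-cost path to recover the depth labels.
    depth = np.zeros(width, dtype=int)
    depth[-1] = int(np.argmin(cost[-1]))
    for x in range(width - 2, -1, -1):
        depth[x] = backptr[x + 1, depth[x + 1]]
    return depth
```

A cross-view smoothness term can be added to the transition cost in the same fashion, which is the essence of integrating both constraints in one optimization pass.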
To extend these results and obtain a more accurate depth estimation, the proposed frame-based scheme can be extended to the video domain. Specifically, because of the inertia of natural objects, the depth of objects can be assumed to vary smoothly along the temporal dimension. Similar to the two smoothness constraints mentioned above, a temporal smoothness constraint can be integrated into the dynamic programming optimization. This can be carried out by modeling the temporal variation of depth as a cost function that penalizes fast depth transitions across consecutive depth images, and by integrating this penalty into the dynamic programming graph as an edge cost. Such a constraint would ensure the temporal stability of the depth image and thus lead to temporally stable rendered images.
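One possible realization of this idea is sketched below. For simplicity, the temporal penalty is folded into the per-pixel data cost before running the spatial optimization, rather than modeled as a separate graph edge as described above; the linear penalty and the weight are illustrative assumptions.

```python
import numpy as np

def add_temporal_cost(matching_cost, prev_depth, temporal_weight=0.5):
    """Augment the per-pixel matching cost with a temporal smoothness term.

    matching_cost : (W, D) per-pixel cost over D depth hypotheses.
    prev_depth    : (W,) depth labels estimated for the previous frame.
    """
    width, num_depths = matching_cost.shape
    labels = np.arange(num_depths)
    # Linear penalty on the change of depth label between consecutive frames.
    temporal_penalty = temporal_weight * np.abs(
        labels[None, :] - prev_depth[:, None])
    return matching_cost + temporal_penalty
```

The augmented cost volume can then be passed unchanged to the scanline optimization, so the temporal constraint comes at negligible extra complexity.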
To enable high-quality rendering and avoid occluded regions, the proposed 3D video system is based on an N-depth/N-texture video representation format, where each camera view covers different regions of the video scene. First, to render synthetic images, we have proposed a variant of relief texture mapping. The described rendering technique efficiently handles holes and occluded pixels in the rendered images and can also be executed favorably on a Graphics Processing Unit (GPU). Second, we have proposed an image rendering technique based on an inverse mapping method that allows a simple and accurate re-sampling of synthetic pixels. Additionally, the presented method handles occlusions by an elegant, unambiguous construction of a single synthetic image from the two neighboring views used for compositing. Experimental comparisons with 3D image warping show a rendering-quality improvement of up to 3.0 dB for the inverse mapping technique.
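The sketch below illustrates the principle of inverse (backward) mapping with two-view compositing, under the strong simplifying assumption of rectified cameras with purely horizontal disparity; the virtual view's depth map and the camera parameters (`alpha`, `f_times_b`) are assumed given, and the per-pixel weighting is a simplified stand-in for the occlusion handling of the thesis method.

```python
import numpy as np

def render_virtual_view(tex_l, tex_r, depth_v, alpha, f_times_b):
    """Backward-warping sketch for a virtual view between two rectified views.

    tex_l, tex_r : (H, W, 3) textures of the left/right reference views.
    depth_v      : (H, W) depth map of the virtual view (assumed available).
    alpha        : virtual camera position in [0, 1] between left and right.
    f_times_b    : focal length times the left-right baseline.
    """
    h, w = depth_v.shape
    xs = np.tile(np.arange(w, dtype=float), (h, 1))
    ys = np.arange(h)[:, None]
    disp = f_times_b / depth_v  # disparity between the two reference views

    def backward_sample(tex, src_x):
        # Inverse mapping: resample the reference texture at the
        # (non-integer) source position of every virtual pixel.
        x0 = np.clip(np.floor(src_x).astype(int), 0, w - 2)
        a = np.clip(src_x - x0, 0.0, 1.0)[..., None]
        return (1 - a) * tex[ys, x0] + a * tex[ys, x0 + 1]

    src_l = xs + alpha * disp          # position in the left view
    src_r = xs - (1 - alpha) * disp    # position in the right view
    ok_l = (src_l >= 0) & (src_l <= w - 1)
    ok_r = (src_r >= 0) & (src_r <= w - 1)

    img_l = backward_sample(tex_l, src_l)
    img_r = backward_sample(tex_r, src_r)

    # Composite: blend where both views see the pixel (favoring the nearer
    # camera), otherwise take the visible view; remaining pixels are holes.
    w_l = np.where(ok_l, 1.0 - alpha, 0.0)[..., None]
    w_r = np.where(ok_r, alpha, 0.0)[..., None]
    total = w_l + w_r
    return np.where(total > 0,
                    (w_l * img_l + w_r * img_r) / np.maximum(total, 1e-9),
                    0.0)
```

Because every output pixel fetches its own source sample, the re-sampling is simple and leaves no rounding gaps, which is the main advantage of inverse mapping over forward 3D image warping.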
In future research, the rendering quality can be further increased by compositing from a larger number of reference source images. Specifically, it can be envisioned that compositing from multiple source images would enable the synthesis of super-resolution synthetic images. This would not only yield high-quality synthetic images, but also high-quality prediction views, and thus increase the coding efficiency.
We have presented an algorithm for the predictive coding of multiple depth and texture camera views that employs two different view-prediction algorithms in parallel: (1) a block-based motion prediction and (2) a view-synthesis prediction. The algorithm selects the most appropriate predictor on a block basis, using a rate-distortion criterion for an optimal prediction-mode selection. The advantages of the complete algorithm are that the compression is robust against inaccurately estimated depth images and that the chosen prediction structure features random access to different views. Furthermore, we have integrated this prediction scheme into an H.264/MPEG-4 AVC encoder, such that the disparity-compensated prediction, derived from conventional motion compensation, is combined with the view-synthesis prediction. Experimental results have shown a modest gain for texture coding and a quality improvement of up to 3.2 dB for the depth signal, when compared to solely performing H.264/MPEG-4 AVC disparity-compensated prediction. A major advantage of the proposed multi-view depth encoder is that the depth prediction scheme does not require the transmission of any side information (such as motion vectors for motion-compensated prediction), because the prediction can be based on the depth signal itself.
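The per-block mode decision follows the usual Lagrangian rate-distortion formulation J = D + λR. A minimal sketch is given below; the candidate predictors, the rate estimates and the sum-of-squared-differences distortion are assumptions for illustration, not the encoder's exact cost computation.

```python
import numpy as np

def select_prediction_mode(block, candidates, lam):
    """Choose, per block, the predictor minimizing J = D + lambda * R.

    candidates : iterable of (name, predicted_block, rate_bits) tuples,
    e.g., one entry from motion/disparity-compensated prediction and one
    from view-synthesis prediction.
    """
    best = None
    for name, pred, rate_bits in candidates:
        distortion = float(np.sum((block.astype(float) - pred) ** 2))  # SSD
        cost = distortion + lam * rate_bits
        if best is None or cost < best[0]:
            best = (cost, name, pred)
    return best[1], best[2]  # chosen mode and its prediction signal
```

Note that the view-synthesis candidate carries a near-zero rate term, since no motion vectors need to be transmitted, which is what makes it attractive for depth coding.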
The current algorithm requires the selection of reference views from which neighboring camera views are predicted. This provides a simple method for view prediction, but it requires a proper choice of the reference views. The automatic selection of reference views remains an interesting topic for future research. Such automatic selection can be applied not only to the reference views, but also to the prediction structure: depending on the properties of the video scene, the prediction structure that yields the minimum bit rate for a given distortion may be generated automatically, for a further gain in compression.
Furthermore, we have proposed a novel depth image coding algorithm that exploits the properties of depth images, i.e., smooth regions delineated by sharp edges. The algorithm models smooth regions with piecewise-linear functions and sharp edges with straight lines. The quadtree decomposition, the selection of the modeling function, and the quantizer setting for the model coefficients are jointly optimized, such that a global rate-distortion trade-off is realized. Comparisons with a JPEG-2000 encoder showed advantages in both coding efficiency and rendering quality.
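The sketch below illustrates the two core ingredients, planar model fitting and quadtree splitting. For brevity, a fixed error threshold stands in for the full rate-distortion optimization, and the straight-line edge models are omitted; block sizes are assumed square with power-of-two dimensions.

```python
import numpy as np

def fit_plane(block):
    """Least-squares fit of a linear model z = a*x + b*y + c to a depth block."""
    h, w = block.shape
    ys, xs = np.mgrid[0:h, 0:w]
    A = np.column_stack([xs.ravel(), ys.ravel(), np.ones(h * w)])
    coeffs, *_ = np.linalg.lstsq(A, block.ravel().astype(float), rcond=None)
    residual = block.ravel().astype(float) - A @ coeffs
    return coeffs, float(np.sum(residual ** 2))

def quadtree_split(block, max_error, min_size=4, origin=(0, 0)):
    """Recursively split a square depth block until a plane fits well enough.

    Returns a list of (origin, size, coefficients) leaf descriptions; a
    threshold-based split criterion replaces the thesis' R-D optimization.
    """
    coeffs, err = fit_plane(block)
    n = block.shape[0]
    if err <= max_error or n <= min_size:
        return [(origin, n, coeffs)]
    half = n // 2
    y0, x0 = origin
    leaves = []
    for dy in (0, half):
        for dx in (0, half):
            sub = block[dy:dy + half, dx:dx + half]
            leaves += quadtree_split(sub, max_error, min_size,
                                     (y0 + dy, x0 + dx))
    return leaves
```

In the full algorithm, the split decision and the coefficient quantizers would be chosen by comparing Lagrangian costs rather than a fixed threshold.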
Possible future research involves the extension of the proposed compression algorithm to the video domain. Following standardized video encoders, a conventional approach is to reduce the temporal redundancy between consecutive depth images. To this end, a simple approach consists of encoding the motion-compensated differences between consecutive depth frames. Assuming that depth images can be modeled by piecewise-linear functions, the same modeling can be applied to the residual difference of two depth images, so that an additional coding gain can be expected from motion compensation.
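Reusing the quadtree modeling sketched above, such a residual coder could look as follows; the motion-compensation step is omitted for brevity and the function names are illustrative.

```python
def encode_depth_residual(prev_depth, curr_depth, max_error):
    # Model the inter-frame depth difference with the same piecewise-linear
    # quadtree used for intra depth images (quadtree_split defined above).
    # A motion-compensated version of prev_depth would be used in practice.
    residual = curr_depth.astype(float) - prev_depth.astype(float)
    return quadtree_split(residual, max_error)
```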
Finally, we have proposed a novel algorithm that addresses the joint compression of depth and texture images. Instead of the conventional independent compression of texture and depth, the presented algorithm jointly optimizes both rate and distortion to obtain an optimal rendering quality. Specifically, it optimizes the depth and texture quantization parameters using an iterative hierarchical search, similar to the well-known Three-Step Search employed in motion estimation.
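The following sketch conveys the structure of such a hierarchical search over the two-dimensional quantization-parameter space. The `evaluate` callback is assumed to encode with the given parameters and return a scalar cost (e.g., rendering distortion plus a Lagrangian rate term); the initial point, step size and H.264-style parameter range are illustrative.

```python
def joint_qp_search(evaluate, q_init=(26, 26), step=8, q_range=(0, 51)):
    """Three-Step-Search-like optimization of the texture/depth
    quantization parameters (qt, qd)."""
    qt, qd = q_init
    lo, hi = q_range
    best_cost = evaluate(qt, qd)
    while step >= 1:
        # Probe the eight neighbors at the current step size, move to the
        # best point, then halve the step (as in Three-Step Search).
        best_move = (0, 0)
        for dt in (-step, 0, step):
            for dd in (-step, 0, step):
                if dt == 0 and dd == 0:
                    continue
                ct = min(max(qt + dt, lo), hi)
                cd = min(max(qd + dd, lo), hi)
                cost = evaluate(ct, cd)
                if cost < best_cost:
                    best_cost, best_move = cost, (ct - qt, cd - qd)
        qt += best_move[0]
        qd += best_move[1]
        step //= 2
    return (qt, qd), best_cost
```

The hierarchical refinement keeps the number of (expensive) encode-and-render evaluations far below that of an exhaustive search over all parameter pairs.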
In the current scheme, the quantization parameters are optimized for the first video frame only and remain fixed for the remainder of the video sequence. Instead of using fixed quantization parameters, a possible way to reduce the complexity (per frame) is to employ the parameters of the previous video frame as an initialization for the search in the succeeding frame. This would lead to an adaptive method for optimizing the quantization parameters, as sketched below.
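In terms of the search sketch above, such a warm start is a one-line change; the per-frame cost functions `evaluate_t` and `evaluate_t1` are hypothetical names for the callback evaluated on frames t and t+1.

```python
# Hypothetical warm-start: initialize frame t+1 with the optimum of frame t
# and search with a smaller initial step, exploiting temporal coherence.
qp_t,  _ = joint_qp_search(evaluate_t)                       # frame t
qp_t1, _ = joint_qp_search(evaluate_t1, q_init=qp_t, step=2) # frame t+1
```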
Although research in multi-view acquisition and compression is expanding continuously, a number of key issues still need to be addressed.
First, to gain acceptance of 3D multi-view video technology in the motion picture or consumer electronics industry, simple acquisition systems should be easily accessible and ready for use. Comparing the simplicity of acquiring monocular video with the difficulty of acquiring multi-view video reveals a large complexity gap between the two technologies. For example, acquiring multi-view video requires a setup composed of multiple synchronized cameras. Also, to obtain high-quality images, the colors of the multiple cameras should be calibrated, so that the images show consistent colors across the views. Additionally, to enable specific features such as free navigation within the scene (as discussed in Section 4.12), the acquisition setup should be geometrically calibrated, so that the positions and orientations of all cameras are known. Finally, a high-performance system enabling the simultaneous recording of multiple input video streams is necessary. Hence, setting up a multi-view acquisition system requires scientific expertise and hands-on experience. Presently, this forms a high entry barrier for potential users and thus reduces the acceptance of multi-view technology. More generally, working on multi-view video affects the complete communication chain, from acquisition through coding to rendering. This is clearly more complicated than introducing a new compression standard such as MPEG-1 or MPEG-2.
A second issue is related to the estimation of the depth data that is necessary for rendering high-quality synthetic images. As discussed in Chapter 3, depth image estimation is an ill-posed problem in many situations, and it remains an open research topic for which no generic, robust solution has yet emerged.
A final issue is the uncertainty about the value of potential applications of multi-view video technologies. Currently, it is not known how much value such a technology would bring to entertainment or advertisement applications. It is fairly evident, however, that it will enhance video production technology and improve the quality of professional video presentations, for example in industrial design and medical systems.
As elaborated at the beginning of this thesis, various applications can be enabled by multi-view video technology. The considered applications, i.e., 3D-TV and free-viewpoint video, are currently emerging in the movie and medical industries. However, instead of restricting multi-view video technology to this limited set of applications, multi-view video should be considered as a tool that paves the way to other, unforeseen applications. As a consequence, the investigated (potential) N-depth/N-texture multi-view video coding standard should provide sufficient flexibility. For example, in post-production studios, very high rendering quality will be preferred over fast real-time image rendering, whereas another application may require real-time rendering of the 3D scene. The multi-view acquisition, compression and rendering system should be able to handle both cases. It is the author's opinion that sufficient flexibility of the multi-view video coding standard is a key element for the potential success of N-depth/N-texture multi-view video technology.
Moreover, in the near future, a convergence between natural image processing and computer graphics techniques is anticipated. Since computer graphics and multi-view video technology both provide 3D video content, this convergence can be facilitated and accelerated by standardizing a multi-view video compression algorithm, and continued research in both fields will stimulate further advances to their mutual benefit. Early applications that combine computer graphics and multi-view video content are already being introduced in the motion picture industry and by content-producing companies.