Acquisition, Compression and Rendering of Depth and Texture for Multi-View Video
Three-dimensional (3D) video and imaging technologies are an emerging
trend in the development of digital video systems, as we presently
witness the appearance of 3D displays, coding systems, and 3D camera
setups. Three-dimensional multi-view video is typically obtained from a set of synchronized cameras that capture the same scene from different viewpoints. This technique enables applications such as free-viewpoint video and 3D-TV. Free-viewpoint video applications allow the user to interactively select and render a virtual viewpoint of the scene. A 3D experience, such as in 3D-TV, is obtained when the data representation and display enable the viewer to distinguish the relief of the scene, i.e., the depth within the scene. With 3D-TV, the depth of the scene can be perceived using a multi-view display that simultaneously renders several views of the same scene. To render these multiple views on a remote display, efficient transmission, and thus compression, of the multi-view video
is necessary. However, a major problem when dealing with multi-view
video is the intrinsically large amount of data to be compressed,
decompressed and rendered. We aim at an efficient and flexible
multi-view video system, and explore three different aspects. First,
we develop an algorithm for acquiring a depth signal from a multi-view
setup. Second, we present efficient 3D rendering algorithms for a
multi-view signal. Third, we propose coding techniques for 3D
multi-view signals, based on the use of an explicit depth signal. Accordingly, the thesis is divided into three parts. The first part
(Chapter 3) addresses the problem of 3D multi-view video
acquisition. Multi-view video acquisition refers to the task of
estimating and recording a 3D geometric description of the scene. A 3D
description of the scene can be represented by a so-called depth
image, which can be estimated by triangulation of the corresponding
pixels in the multiple views. Initially, we focus on the problem of
depth estimation using two views, and present the basic geometric
model that enables the triangulation of corresponding pixels across
the views. Next, we review two strategies for determining corresponding pixels: a local matching strategy and a one-dimensional optimization strategy.
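As an illustration of the local strategy (with hypothetical window size, search range, and camera parameters, not values from the thesis), corresponding pixels of a rectified image pair can be found by window-based block matching, after which the winning disparity is converted to depth by triangulation:

```python
import numpy as np

def block_matching_disparity(left, right, max_disp=64, win=4):
    """Local strategy sketch: per-pixel winner-takes-all SAD matching.
    Assumes rectified grayscale views, so corresponding pixels lie on
    the same scanline; window size and search range are illustrative."""
    h, w = left.shape
    disparity = np.zeros((h, w), dtype=np.float32)
    for y in range(win, h - win):
        for x in range(win + max_disp, w - win):
            ref = left[y - win:y + win + 1, x - win:x + win + 1]
            costs = [np.abs(ref - right[y - win:y + win + 1,
                                        x - d - win:x - d + win + 1]).sum()
                     for d in range(max_disp)]   # SAD matching costs
            disparity[y, x] = np.argmin(costs)   # best local match
    return disparity

def disparity_to_depth(disparity, focal=1000.0, baseline=0.1):
    """Triangulation for a rectified camera pair: depth Z = f * B / d."""
    depth = np.zeros(disparity.shape)
    valid = disparity > 0
    depth[valid] = focal * baseline / disparity[valid]
    return depth
```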
Second, to generalize from the two-view case, we introduce a simple geometric model for estimating the depth using
multiple views simultaneously. Based on this geometric model, we
propose a new multi-view depth-estimation technique, employing a
one-dimensional optimization strategy that (1) reduces the noise level
in the estimated depth images and (2) enforces consistent depth images
across the views.
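The proposed multi-view technique is only summarized here; as one standard way to enforce consistency between per-view estimates (a cross-view consistency check, not necessarily the exact mechanism of Chapter 3), each disparity estimate can be verified against the estimate found at its matched position in a neighboring view:

```python
import numpy as np

def cross_check(disp_left, disp_right, tol=1.0):
    """Cross-view consistency check: keep a left-view disparity only if
    the right view's disparity at the matched pixel agrees with it.
    Rejected pixels become NaN and can be re-estimated or filled in
    from consistent neighbors."""
    h, w = disp_left.shape
    checked = np.full((h, w), np.nan)
    for y in range(h):
        for x in range(w):
            d = int(round(float(disp_left[y, x])))
            if 0 <= x - d < w and \
               abs(disp_left[y, x] - disp_right[y, x - d]) <= tol:
                checked[y, x] = disp_left[y, x]
    return checked
```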
The second part (Chapter 4) details the problem of multi-view image
rendering. Multi-view image rendering refers to the process of
generating synthetic images using multiple views. Two different
rendering techniques are initially explored: a 3D image warping and a
mesh-based rendering technique. Each of these methods has its
limitations and suffers from either high computational complexity or
low image rendering quality.
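For reference, the 3D image-warping baseline back-projects every source pixel with its depth, transforms the resulting 3D point to the target camera, and re-projects it. The sketch below shows this standard forward-warping step; the intrinsic matrices K_src and K_dst and the pose (R, t) are assumed given by calibration, and depth is assumed strictly positive:

```python
import numpy as np

def warp_3d(src, depth, K_src, K_dst, R, t):
    """Standard forward 3D image warping (the baseline, not the improved
    techniques proposed in Chapter 4). Because source pixels splat onto
    scattered target positions, uncovered target pixels remain empty,
    which causes the "hole" artifacts discussed below."""
    h, w = depth.shape
    ys, xs = np.mgrid[0:h, 0:w]
    pix = np.stack([xs.ravel(), ys.ravel(), np.ones(h * w)])  # homogeneous
    pts = (np.linalg.inv(K_src) @ pix) * depth.ravel()  # 3D points, src frame
    proj = K_dst @ (R @ pts + t.reshape(3, 1))          # into target camera
    u = np.round(proj[0] / proj[2]).astype(int)
    v = np.round(proj[1] / proj[2]).astype(int)
    ok = (u >= 0) & (u < w) & (v >= 0) & (v < h)
    dst = np.zeros_like(src)
    dst[v[ok], u[ok]] = src.ravel()[ok]                 # nearest-pixel splat
    return dst
```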
As a consequence, we present two image-based rendering algorithms that improve the balance between these issues. First, we derive an alternative formulation of the relief texture algorithm, extended to the geometry of
multiple views. The proposed technique features two advantages: it
avoids rendering artifacts (“holes”) in the synthetic image and it is
suitable for execution on a standard Graphics Processing Unit (GPU). Second, we propose an inverse mapping rendering technique that allows a simple and accurate re-sampling of synthetic pixels.
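The inverse mapping idea can be illustrated as backward warping: instead of splatting source pixels forward, each destination pixel is mapped back into the source image and re-sampled there, so every destination pixel receives a well-defined, interpolated value. The sketch below assumes a rectified setup with a hypothetical per-pixel disparity map defined in the destination view; it illustrates the re-sampling principle, not the thesis's exact algorithm:

```python
import numpy as np

def inverse_map(src, disp_dst):
    """Backward warping sketch: each destination pixel looks up its
    (sub-pixel) source position and is re-sampled by linear
    interpolation, so no holes appear and re-sampling is accurate."""
    h, w = src.shape
    dst = np.zeros((h, w))
    for y in range(h):
        for x in range(w):
            xs = x + disp_dst[y, x]      # sub-pixel source position
            x0 = int(np.floor(xs))
            a = xs - x0                  # interpolation weight
            if 0 <= x0 and x0 + 1 < w:
                dst[y, x] = (1 - a) * src[y, x0] + a * src[y, x0 + 1]
    return dst
```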
Experimental comparisons with 3D image warping show a rendering-quality improvement of 3.8 dB for the relief texture
mapping and 3.0 dB for the inverse mapping rendering technique.
The third part concentrates on the compression problem of multi-view
texture and depth video (Chapters 5–7). In Chapter 5, we extend the
standard H.264/MPEG-4 AVC video compression algorithm for handling the
compression of multi-view video. As opposed to the Multi-view Video
Coding (MVC) standard that encodes only the multi-view texture data,
the proposed encoder performs the compression of both the texture and
the depth multi-view sequences. The proposed extension is based on
exploiting the correlation between the multiple camera views. To this
end, two different approaches for predictive coding of views have been
investigated: a block-based disparity-compensated prediction technique
and a View Synthesis Prediction (VSP) scheme. Whereas VSP relies on an
accurate depth image, the block-based disparity-compensated prediction
scheme can be performed without any geometry information. Our encoder
adaptively selects the most appropriate prediction scheme using a rate-distortion criterion for optimal prediction-mode selection.
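This adaptive selection follows the usual Lagrangian rate-distortion optimization: for each block, the encoder evaluates the cost J = D + λ·R of every candidate prediction mode and keeps the cheapest one. The sketch below shows the decision rule; the mode names and the distortion/rate values are placeholders, not the encoder's actual interfaces:

```python
from dataclasses import dataclass
from typing import List

@dataclass
class ModeCost:
    name: str          # e.g. "disparity-compensated" or "view synthesis"
    distortion: float  # e.g. SSD between the block and its prediction
    rate: float        # bits to signal the mode and code the residual

def select_mode(candidates: List[ModeCost], lam: float) -> ModeCost:
    """Lagrangian mode decision: minimize J = D + lambda * R per block."""
    return min(candidates, key=lambda m: m.distortion + lam * m.rate)

# Hypothetical costs for one block; in a real encoder they come from
# actually predicting the block with each scheme and coding the residual.
best = select_mode(
    [ModeCost("disparity-compensated", distortion=1200.0, rate=96.0),
     ModeCost("view synthesis (VSP)", distortion=950.0, rate=180.0)],
    lam=5.0)
print(best.name)  # -> disparity-compensated (J = 1680 < 1850)
```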
We present experimental results for several texture and depth multi-view
sequences, yielding a quality improvement of up to 0.6 dB for the
texture and 3.2 dB for the depth, when compared to solely performing
H.264/MPEG-4 AVC disparity-compensated prediction. Additionally, we
discuss the trade-off between random access to a user-selected view and the coding efficiency. Experimental results illustrating and quantifying this trade-off are provided.

In Chapter 6, we focus on the
compression of a depth signal. We present a novel depth image coding
algorithm which concentrates on the special characteristics of depth
images: smooth regions delineated by sharp edges. The algorithm models these smooth regions with parameterized piecewise-linear functions and the sharp edges with straight lines, making it more efficient than a conventional transform-based encoder.
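As a simplified illustration of this modeling idea (a single region per block, without the straight-edge partitioning), a smooth depth block can be approximated by a linear function z(x, y) ≈ ax + by + c fitted by least squares, so that only three parameters need to be coded instead of transform coefficients:

```python
import numpy as np

def fit_linear_depth(block):
    """Fit z(x, y) = a*x + b*y + c to a depth block by least squares.
    A coder in the spirit of Chapter 6 would first split the block along
    a straight edge and fit one such function per smooth region; here a
    single fit illustrates the piecewise-linear model for one region."""
    h, w = block.shape
    ys, xs = np.mgrid[0:h, 0:w]
    A = np.column_stack([xs.ravel(), ys.ravel(), np.ones(h * w)])
    coeffs, *_ = np.linalg.lstsq(A, block.ravel(), rcond=None)
    return coeffs  # (a, b, c): the only parameters that need coding

def reconstruct_block(coeffs, shape):
    """Rebuild the approximated block from the three model parameters."""
    a, b, c = coeffs
    ys, xs = np.mgrid[0:shape[0], 0:shape[1]]
    return a * xs + b * ys + c
```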
To optimize the quality of the coding system for a given bit rate, a dedicated global rate-distortion optimization balances the rate against the accuracy of the signal representation. For typical bit rates, i.e., between 0.01 and 0.25
bit/pixel, experiments have revealed that the coder outperforms a
standard JPEG-2000 encoder by 0.6–3.0 dB. Preliminary results were published in the Proceedings of the 26th Symposium on Information Theory in the Benelux.

In Chapter 7, we propose a novel joint depth-texture
bit-allocation algorithm for the joint compression of texture and
depth images. The algorithm combines the depth and texture Rate-Distortion (R-D) curves into a single R-D surface, which allows the joint bit allocation to be optimized with respect to the resulting rendering quality.
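The optimization over such an R-D surface can be sketched as a constrained search: among all (texture rate, depth rate) pairs whose sum fits the bit budget, pick the pair with the lowest rendering distortion. The sampled rate points and the distortion table below are placeholders for measured R-D data:

```python
import itertools

def allocate_bits(tex_rates, depth_rates, render_dist, budget):
    """Joint bit allocation over an R-D surface: choose the operating
    point (R_texture, R_depth) minimizing rendering distortion subject
    to R_texture + R_depth <= budget."""
    feasible = [(rt, rd)
                for rt, rd in itertools.product(tex_rates, depth_rates)
                if rt + rd <= budget]
    return min(feasible, key=lambda point: render_dist[point])

# Placeholder R-D surface: rendering distortion measured for each pair
# of texture/depth rates (bit/pixel); real values come from experiments.
tex_rates, depth_rates = [0.1, 0.2, 0.4], [0.02, 0.05, 0.1]
render_dist = {(rt, rd): 1.0 / (rt + 4.0 * rd)
               for rt in tex_rates for rd in depth_rates}
print(allocate_bits(tex_rates, depth_rates, render_dist, budget=0.3))
```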
Experimental results show an estimated gain of 1 dB compared to compression performed without joint bit-allocation optimization. Moreover, our joint R-D model can be readily integrated into a multi-view H.264/MPEG-4 AVC coder because it yields the optimal compression setting with limited computational effort.