%!TEX root = main.tex
\chapter{Depth Estimation}
\label{chap:depthestimation_theory}
In this chapter we look at some of the core principles of depth
estimation theory and some of the previous work done in the field.
\section{Introduction}
Extracting depth from images has been a heavily researched area of computer
vision for decades. As a result, there are many vastly different techniques,
but they all share some core principles. In short, depth estimation is about
finding points in a 3D scene that can be identified in two images taken from
different points of view. For each point, the difference between its pixel
coordinates in one view and in the other gives us the point's disparity. By
finding the disparity of each point that is visible in both images, we get a
disparity map, and from such maps we can roughly calculate the physical
distance from the cameras to these points. A simple experiment demonstrates
how the same scene changes between two viewpoints: hold a finger vertically in
front of your eyes, alternately close one eye and then the other, and notice
how the finger's position changes relative to the background. This change of
horizontal position between the views is the disparity.
\begin{figure}
\centering
\includegraphics[width=0.8\textwidth]{images/depth-estimation-theory.pdf}
\caption{Example of how the depth can be calculated}
\label{fig:depth-theory}
\end{figure}
Figure \ref{fig:depth-theory} shows how the depth $z$ can be
triangulated. Given two corresponding points, $x_l$ in the left image
and $x_r$ in the right, the disparity $d = x_l - x_r$, the baseline
(distance between the cameras) $b$, and the focal length $f$, $z$ can
be calculated with the following formula:
\[
\mathlarger{z = \frac{fb}{d}}
\]
Finding the corresponding points $x_{l}$ and $x_{r}$ is the main
challenge of all depth estimation approaches.
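To make the relationship concrete, the following is a minimal Python/NumPy
sketch of converting a disparity map into a depth map with $z = fb/d$. The
function name and the focal length and baseline values are illustrative
assumptions, not taken from any particular camera:
\begin{verbatim}
import numpy as np

def disparity_to_depth(disparity, focal_length_px, baseline_m):
    """Convert a disparity map (pixels) to depth (metres) via z = f*b/d."""
    depth = np.full(disparity.shape, np.inf)
    valid = disparity > 0          # zero disparity corresponds to infinite depth
    depth[valid] = focal_length_px * baseline_m / disparity[valid]
    return depth

# Illustrative values: 1000-pixel focal length, 12 cm baseline.
disparity = np.array([[40.0, 20.0], [10.0, 0.0]])
print(disparity_to_depth(disparity, 1000.0, 0.12))  # 3 m, 6 m, 12 m, inf
\end{verbatim}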
\section{Related Work}
Scharstein and Szeliski present a very thorough taxonomy and evaluation of
many of the best-known stereo correspondence algorithms in \cite{taxonomy}.
Most can be grouped into one of three types of algorithms:
\textit{feature-based}, \textit{graph-cut} and \textit{correlation-based}.
Another way to categorize them is by how they select which pixels are
compared. \textit{Local} algorithms only consider pixels in close proximity to
the current pixel of interest, while \textit{global} algorithms may consider
pixels from anywhere in the input. A third, lesser-used category is
\textit{semi-global} algorithms, which consider every pixel up to the maximum
search distance along the scan line of the current pixel of interest.
Feature-based algorithms\cite{hannah74, marr-poggio79, nishimoto88} extract
features like edges, corners, circles and regions from the input images and
then find the disparity of the corresponding matches. These algorithms can be
fast, but are usually only able to produce sparse disparity maps, containing
values only near the features they could match. They have been in use in
mobile robotics for some time\cite{feature-based-robotics}, but are unusable
for applications like image-based rendering and view synthesis, which need
dense disparity maps with values for all pixels.
Graph cut algorithms\cite{markov98,boykov99,boykov04} produce disparity maps
of very high quality in terms of erroneous matches, but at a very high
computational cost. Even the fastest graph cut algorithms need seconds to
complete estimation on sub-VGA resolution inputs.
Lastly, there are correlation-based algorithms. The fastest algorithms
mentioned in the taxonomy are all correlation-based (the taxonomy only covers
algorithms producing dense disparity maps). Many implementations of
correlation-based algorithms exist and are used commercially. `Small Vision
Systems'\cite{svs98} and `Point Grey'\cite{pg} sell devices capable of
real-time depth estimation at lower resolutions, and the Jet Propulsion
Laboratory\cite{jpl1, jpl2} has used the approach in mobile robotics. Most
correlation-based algorithms perform the correlation calculations using Sum of
Absolute or Squared Differences (SAD and SSD)\cite{faugeras93} or Normalized
Cross-Correlation (NCC)\cite{ncc93} with a rectangular support region centered
around the pixel of interest.
\begin{figure}
\centering
\begin{subfigure}[b]{0.4\textwidth}
\includegraphics[width=\textwidth]{images/bumblebee2.jpg}
\caption{Point Grey's Bumblebee2 stereo camera}
\end{subfigure}
~
\begin{subfigure}[b]{0.4\textwidth}
\includegraphics[width=\textwidth]{images/bumblebee_output.png}
\caption{Bumblebee2 disparity map output}
\end{subfigure}
\end{figure}
Regardless of approach, most algorithms perform the following four
steps in some way\cite{taxonomy}:
\begin{enumerate}
\item Matching Cost computation
\item Cost (support) aggregation
\item Disparity computation/optimization
\item Disparity refinement
\end{enumerate}
Not all approaches follow these steps rigorously. Feature-based approaches
usually have some pre-processing step that searches for features and then
perform the four steps on the areas around the extracted features. Graph cut
approaches usually skip the cost aggregation step, and instead search for a
value that minimizes a global cost function.
The correlation-based approach seems the most likely to achieve real-time
performance on high-definition input, and it defines the scope of this thesis
and the implementation.
\section{Block matching algorithm}
Of the correlation-based algorithms, \textit{block matching} algorithms seem
to achieve the best overall performance, both in run time and disparity map
quality\cite{taxonomy}. Block matching algorithms take their name from
correlating windows (blocks of pixels) rather than single pixels or scanlines.
The window varies between algorithms, but is most often square, from
$7\times7$ to $21\times21$ pixels. Some algorithms use adaptive windows that
change shape under certain conditions, or other non-square shapes.
\subsection{Matching Cost calculation}
\label{sec:matchingcost}
The matching cost is a measurement of the similarity of pixel locations in the
input images. This calculation is typically done for all pixels, at all the
disparity hypotheses.
There are many ways to calculate this. Two of the simplest methods are
\textit{Absolute intensity differences} (AD):
\begin{equation}
| I_L(x_L,y_L) - I_R(x_R,y_R) |
\end{equation}
and \textit{Squared intensity differences} (SD):
\begin{equation}
(I_L(x_L,y_L) - I_R(x_R,y_R))^2
\end{equation}
Both methods are computationally very cheap, AD even more so than SD.
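As a concrete illustration, both costs can be computed for a whole image at
one disparity hypothesis with a couple of NumPy operations. The following is a
sketch with illustrative function names, using the convention that a left
pixel at $x$ matches the right pixel at $x - d$:
\begin{verbatim}
import numpy as np

def ad_cost(left, right, d):
    """Absolute intensity difference at disparity hypothesis d."""
    h, w = left.shape
    cost = np.full((h, w), np.inf)
    cost[:, d:] = np.abs(left[:, d:].astype(float) - right[:, :w - d])
    return cost

def sd_cost(left, right, d):
    """Squared intensity difference at disparity hypothesis d."""
    h, w = left.shape
    cost = np.full((h, w), np.inf)
    diff = left[:, d:].astype(float) - right[:, :w - d]
    cost[:, d:] = diff * diff
    return cost
\end{verbatim}
The first $d$ columns have no valid match in the right image and are marked
with an infinite cost.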
Another well-known matching technique is \textit{Birchfield and Tomasi}'s
image sampling insensitive method (BT)\cite{bt}. Instead of comparing pixel
values shifted by integral amounts, BT compares each pixel in the reference
image against a linearly interpolated function of the other image. It has a
higher calculation cost, but has the benefit of alleviating input sampling
problems. As explained in the paper, $\hat{I}_R$ is first defined as the
linearly interpolated function between the sample points of the right
scanline. The following quantity, measuring how well the intensity at $x_L$
fits into the linearly interpolated region surrounding $x_R$, is then defined:
\[ \mathlarger{\bar{d}(x_L, x_R, I_L, I_R) = \min_{x_R - \frac{1}{2}\ \le\ x\ \le \ x_R + \frac{1}{2}} | I_L(x_L) - \hat{I}_R(x) |} \]
Defining $\hat{I}_L$ similarly, a symmetric quantity is obtained:
\[\mathlarger{
\bar{d}(x_R, x_L, I_R, I_L) = \min_{x_L - \frac{1}{2}\ \le\ x\
\le \ x_L + \frac{1}{2}} | \hat{I}_L(x) - I_R(x_R) |
}\]
The dissimilarity $d$ between the pixels is then defined symmetrically as the
minimum of the two quantities:
\begin{equation}
\label{eq:btd}
\mathlarger{
d(x_L,x_R) = \min \{\bar{d}(x_L,x_R,I_L,I_R),\ \bar{d}(x_R,x_L,I_R,I_L) \}
}
\end{equation}
Since the extreme points of a piecewise linear function must be its
breakpoints, the computation of $d$ is straightforward. First, compute the
linearly interpolated intensity halfway between $x_R$ and its neighboring
pixel to the left:
\[
I^{-}_{R} \equiv{} \hat{I}_R ( x_R - \frac{1}{2}) =
\frac{1}{2}(I_R(x_R) + I_R(x_R - 1))
\]
and the analogous quantity:
\[
I^{+}_{R} \equiv{} \hat{I}_R ( x_R + \frac{1}{2}) =
\frac{1}{2}(I_R(x_R) + I_R(x_R + 1))
\]
Then, $I_{min} = \min\{I^{-}_{R},I^{+}_{R},I_{R}(x_R)\}$ and $I_{max}
= \max\{I^{-}_{R},I^{+}_{R},I_{R}(x_R)\}$.
With these quantities defined, $\bar{d}$ can be calculated as:
\[
\bar{d}(x_L,x_R,I_L,I_R) = \max\{0,\ I_L(x_L) - I_{max},\ I_{min} - I_L(x_L)\}
\]
along with its symmetric counterpart $\bar{d}(x_R,x_L,I_R,I_L)$.
For all approaches, a lower cost is a better match, where a cost of 0
means the left and right pixels are identical.
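The min/max form of $\bar{d}$ translates almost directly into code. The
following sketch (function names are our own) computes the one-sided quantity
and the symmetric dissimilarity for two scanlines, assuming interior pixel
positions:
\begin{verbatim}
def bt_one_sided(xl, xr, left, right):
    """bar_d(x_L, x_R): fit of left[xl] against the interpolated right scanline.

    Assumes 0 < xr < len(right) - 1; border pixels would need clamping.
    """
    i_minus = 0.5 * (right[xr] + right[xr - 1])  # intensity halfway to the left
    i_plus = 0.5 * (right[xr] + right[xr + 1])   # intensity halfway to the right
    i_min = min(i_minus, i_plus, right[xr])
    i_max = max(i_minus, i_plus, right[xr])
    return max(0.0, left[xl] - i_max, i_min - left[xl])

def bt_dissimilarity(xl, xr, left, right):
    """Symmetric BT dissimilarity d(x_L, x_R) between two scanlines."""
    return min(bt_one_sided(xl, xr, left, right),
               bt_one_sided(xr, xl, right, left))
\end{verbatim}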
\subsection{Aggregation of Costs}
\label{sec:aggregatecost}
Single pixel matching is in most cases too ambiguous. Some kind of additional
information is needed. \textit{Local} algorithms sum up or average a small
neighborhood of costs into an aggregated cost, yielding a much better value
for comparison.
The simplest method is to sum the costs in a square window $w$ centered around
the pixel of interest. This method is known as \textit{Sum of Absolute
Differences} (SAD):
\begin{equation}
\label{eq:sad}
\mathlarger{\mathlarger{\sum}}_{i,j \in w} |I_L(i,j) - I_R(x + i, y + j)|
\end{equation}
or \textit{Sum of Squared Differences} (SSD),
\begin{equation}
\label{eq:ssd}
\mathlarger{\mathlarger{\sum}}_{i,j \in w} (I_L(i,j) - I_R(x + i, y + j))^2
\end{equation}
depending on which matching cost was used.
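A full SAD cost volume can be sketched in a few lines of NumPy, summing the
per-pixel AD costs over the support window with a box filter (SciPy's uniform
filter computes a mean, so the result is scaled back up by the window area;
the function name and window default are illustrative):
\begin{verbatim}
import numpy as np
from scipy.ndimage import uniform_filter

def sad_cost_volume(left, right, max_disparity, window=9):
    """Aggregated SAD cost for every pixel at every disparity hypothesis."""
    h, w = left.shape
    volume = np.full((max_disparity + 1, h, w), np.inf)
    for d in range(max_disparity + 1):
        ad = np.abs(left[:, d:].astype(float) - right[:, :w - d])
        # Box-filter sum over the square support window.
        volume[d, :, d:] = uniform_filter(ad, size=window) * window ** 2
    return volume
\end{verbatim}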
Aggregation with Birchfield \& Tomasi's image sampling insensitive method does
not have an equally catchy and widespread name, but the calculation is done in
the same manner, with the dissimilarity from equation \ref{eq:btd} as the
per-pixel cost:
\begin{equation}
\label{eq:bt}
\mathlarger{\mathlarger{\sum}}_{i,j \in w} d((i,j)_L,\ (x+i, y+j)_R)
\end{equation}
Another slightly more complex technique is to use \textit{Adaptive Windows}
(Okutomi and Kanade, 1992; Kanade and Okutomi, 1994; Veksler, 2001; Kang et
al., 2001). Instead of the fixed square windows mentioned above, this approach
computes costs in five smaller sub-windows arranged as shown in figure
\ref{fig:adaptivewindow}. The final matching cost is the sum of the costs from
the center window and the two surrounding windows with the lowest cost sums.
This method handles object borders better than fixed windows at low
resolutions, but at higher resolutions the small size of the sub-windows
becomes insufficient\cite{gpugems}.
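Assuming the five sub-window cost sums have already been computed as separate
arrays (for example with the SAD aggregation above), the selection rule itself
is a small reduction. A sketch with illustrative names:
\begin{verbatim}
import numpy as np

def adaptive_window_cost(center, above, below, left, right):
    """Center sub-window cost plus the two cheapest of the four neighbors.

    Each argument is an (H, W) array of aggregated costs for one sub-window.
    """
    neighbors = np.stack([above, below, left, right])  # shape (4, H, W)
    neighbors.sort(axis=0)                             # cheapest first, per pixel
    return center + neighbors[0] + neighbors[1]
\end{verbatim}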
\begin{figure}
\setcounter{subfigure}{0}
\centering
\begin{subfigure}[b]{0.20\textwidth}
\includegraphics[width=\textwidth]{images/adaptive-windows-9x9.pdf}
\caption{}
\end{subfigure}
~
\begin{subfigure}[b]{0.20\textwidth}
\includegraphics[width=\textwidth]{images/adaptive-windows.pdf}
\caption{}
\end{subfigure}
~
\begin{subfigure}[t]{0.40\textwidth}
\includegraphics[width=\textwidth]{images/adaptive-windows-shapes.pdf}
\caption{}
\end{subfigure}
\caption{(a) Fixed-size window of $9\times9$ pixels; (b) adaptive windows,
five $3\times3$ sub-windows; (c) possible combinations of the center and 2
adjacent windows}
\label{fig:adaptivewindow}
\end{figure}
A lower aggregated cost is a better match, where a cost of 0 means the
left and right pixels and their neighborhoods are identical.
\subsection{Disparity Selection}
\label{sec:disparity_selection}
The very simplest disparity selection strategy is the
\textit{Winner-Takes-All} (WTA), which is just a search for the lowest value
among the aggregated costs for each pixel. Not many other selection strategies
for local correlation-based algorithms have been proposed in research.
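Given a cost volume like the one sketched in section
\ref{sec:aggregatecost}, WTA selection reduces to a single minimum-reduction
over the disparity axis:
\begin{verbatim}
import numpy as np

def winner_takes_all(cost_volume):
    """Per pixel, pick the disparity with the lowest aggregated cost.

    cost_volume has shape (max_disparity + 1, height, width).
    """
    return np.argmin(cost_volume, axis=0)
\end{verbatim}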
\subsection{Refinement}
\label{sec:refinement}
There are multiple refinements that can be applied to the disparity map. The
most common ones are presented below.
Most depth estimation algorithms produce a disparity map of integer
disparities. Many applications such as robotics navigation and tracking do
fine with that level of quantization. However, for image-based rendering such
as \textit{Visual Hulls}\cite{torivar} and 3D reconstruction, this level of
quantization is detrimental to the quality of the output. This can be
alleviated by applying a sub-pixel refinement stage. Linear interpolation may
be too simple, but others have researched methods such as iterative gradient
descent and fitting a curve to the matching costs at discrete disparity levels
(Ryan et al., 1980; Lucas and Kanade, 1981; Tian and Huhns, 1986; Matthies et
al., 1989; Kanade and Okutomi, 1994)\cite{taxonomy}.
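One common curve-fitting variant, shown here as an illustrative sketch rather
than the method of any single cited paper, fits a parabola through the costs
at the winning disparity and its two neighbors and takes the parabola's
minimum (reusing the cost volume and integer disparities from the sketches
above):
\begin{verbatim}
import numpy as np

def subpixel_refine(cost_volume, disparity):
    """Refine integer disparities by a parabola fit over neighboring costs."""
    levels, h, w = cost_volume.shape
    d = np.clip(disparity, 1, levels - 2)       # both neighbors must exist
    ys, xs = np.mgrid[0:h, 0:w]
    c_m = cost_volume[d - 1, ys, xs]            # cost at d - 1
    c_0 = cost_volume[d, ys, xs]                # cost at d
    c_p = cost_volume[d + 1, ys, xs]            # cost at d + 1
    denom = c_m - 2.0 * c_0 + c_p               # curvature of the parabola
    offset = np.where(denom > 0,
                      0.5 * (c_m - c_p) / np.maximum(denom, 1e-9),
                      0.0)
    return d + offset
\end{verbatim}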
Applying a median filter to the disparity map removes high-intensity noise
caused by erroneous and spurious matches. Along an object's surface the
disparity is most likely continuous, and the blurring may even out
slightly-off matches.
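With SciPy this is a one-liner; the $3\times3$ kernel size here is an
arbitrary illustrative choice:
\begin{verbatim}
import numpy as np
from scipy.ndimage import median_filter

disparity_map = np.random.randint(0, 64, (480, 640))  # stand-in disparity map
filtered = median_filter(disparity_map, size=3)       # 3x3 median filter
\end{verbatim}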
Cross-checking eliminates unreliable matches. All depth discontinuities have
occlusions, where a real-world point is visible in only one of the images, and
such points will always produce an unreliable match. Cross-checking removes
these regions, marks them as invalid, and lets an application deal with them
as it sees fit.
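A sketch of such a left-right consistency check, assuming integer disparity
maps computed with the left and the right image as reference respectively (the
one-pixel tolerance is a common but arbitrary choice):
\begin{verbatim}
import numpy as np

def cross_check(disp_left, disp_right, tolerance=1):
    """Mark pixels as invalid (-1) where the two disparity maps disagree."""
    h, w = disp_left.shape
    ys = np.arange(h)[:, None]
    xs = np.arange(w)[None, :]
    # A left pixel at x should match the right pixel at x - d.
    match_x = np.clip(xs - disp_left, 0, w - 1)
    consistent = np.abs(disp_left - disp_right[ys, match_x]) <= tolerance
    return np.where(consistent, disp_left, -1)
\end{verbatim}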
The occlusions that cross-checking has invalidated can optionally be filled.
For the left disparity map, the background region immediately to the left of
an occlusion most likely stays constant up until the foreground object border,
so an occluded pixel can be filled with the nearest valid disparity to its
left.
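A minimal sketch of this fill, propagating the last valid disparity from the
left along each row (invalid pixels marked with $-1$, matching the cross-check
sketch above):
\begin{verbatim}
import numpy as np

def fill_occlusions(disparity):
    """Replace invalid pixels (-1) with the nearest valid value to their left."""
    filled = disparity.copy()
    for row in filled:                  # each row is a view into `filled`
        last_valid = 0                  # fallback if a row starts invalid
        for x in range(row.shape[0]):
            if row[x] == -1:
                row[x] = last_valid
            else:
                last_valid = row[x]
    return filled
\end{verbatim}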
\section{Summary}
Depth estimation is a vast field of research, and this chapter has barely
glanced at the related work. It has given a brief presentation of the most
common approaches, and discarded those that do not fit the goals of this
thesis. Block matching algorithms seem the most promising for a parallel
architecture, so they were explored and explained in further detail.