source: trunk/documents/theses/dstn/kdtree.tex @ 12331

Revision 12331, 45.1 KB checked in by dstn, 14 months ago (diff)

update kdtree figs

  • Property svn:executable set to *
Line 
1\comment{
2
3HOGG comments
4
5Section 2+3, general:  I like the fact that the tricks and tips and
6parts are all individually numbered, but it is also distracting to
7have so many full-break headings.  What about formatting the
8subsections the way LaTeX usually formats \paragraph{} so that "2.1
9Data structures:" appears at the beginning of the line that currently
10starts "Figure 2 shows pseudocode...".
11
12Subsection 2.2: You don't specify when or how you change the splitting
13dimension d.  Even though it *does* say in the pseudocode, I think you
14should be explicit here in the text, and maybe also give some
15alternatives that are also sensible.  That is, give the default method
16but explain where a sensible person might make a different choice and
17why.
18
19Figure 1:  You don't explain the meaning of the dashed lines in the
20caption.  Also are you really going to have red color in your thesis
21here?  Also, the dashed lines are not very clear given the choppy
22data; I would suggest grey or red solid lines.
23
24Figure 3 (and subsection 2.2):  I think you should note in the
25pseudocode comments and the text that traditionally you stop when the
26node size goes below some threshold, but that in Section 3 you are
27going to suggest an alternative.
28
29Figure 8:  I prefer solid grey to dotted lines.
30
31Subsection 2.6:  Either cite or explain the approximate kernel density
32estimation algorithm, or both.
33
34Section 3:   First line:  "most" or "many"?  I think you should spend
35a bit more time here noting that your advice is only for the case in
36which the kd-tree is edited or constructed FAR LESS FREQUENTLY than it
37is searched.
38
39Subsection 3.2:  Could you give the reader some idea of *when* you
40might choose *not* to split at the median?
41
42Subsection 3.6: I would be even a bit more explicit here, that the
43data dimensions in the tree are usually only some small part of your
44data, and that you have all this other information for each data point
45elsewhere.  You call these "labels" in the text, but it could be all
46the other columns in an enormous data base.
47
48Section 4, para starting "We chose a very simple...": Isn't it a HUGE
49DEAL that you can make bigger trees?  If so, this should be trumpeted
50a bit more, here and in the abstract.  So you beat all the standard
51implementations on data-set applicability alone!
52
53Section 4, near the end of the section:  Once again, mention the
54capability of making much bigger trees.
55
56Section 5:  Give the location for the libkd download.  Perhaps it
57should go to sourceforge or code.google.com, and not just
58astrometry.net?  Or is it already *in* some such location?
59
60one more thing:  I think it would be useful to indicate, among the
61speed-up and optimization hints, which depend most strongly on the
62data being *static* and which do not; in general, if it isn't clear,
63it is worth saying what tips and tricks depend on what assumptions.
64}
65
66
67\input{kdfigs}
68
69\section{Introduction}
70
71  The \kdtree is a simple yet versatile data structure for speeding up
72  searches in $k$-dimensional data sets.  It applies to
73  nearest-neighbour queries, which arise in many computer vision
74  applications such as vector quantization and template matching.  It
75  also applies to a family of related problems such as range search
76  (finding all points within a given radius of a query), $K$-nearest
77  neighbours, and approximate kernel density estimation.  Over its
78  long history the \kdtree has accumulated many extensions and
79  variations making the literature difficult to comprehend.
80  This \doctype presents a modern view of the \kdtree.  It also points
81  out a number of optimizations that can increase the performance by
82  an order of magnitude over existing publicly available
83  implementations while also reducing the memory requirements---or
84  increasing the size of the largest \kdtree that be constructed in a given
85  amount of memory.
86
87\section{The standard \kdtree}
88
89
90\Kdtrees have accelerated spatial queries for over thirty years.
91\Kdtrees speed up N-body simulations, search, object recognition,
92ray tracing, template matching, kernel density estimation, and more.
93%But what \emph{is} a \kdtree?
94
95
96
97Imagine you have a classification problem.  You have a million
98labelled 2-dimensional data points (``feature vectors'') in the unit
99square.  You also have a set of a million unlabelled data points, and
100a deadline.  You decide to run nearest-neighbour classification.
101After a moment of thought you realize that if you compare every
102labelled feature to every unlabelled feature you will have to do
103$10^{12}$ comparisons.  Your deadline looms.  Upon further reflection,
104you decide to split the labelled points in half: the ones with $x \le
105\half$ and the ones with $x > \half$.  When you want to
106find the nearest neighbour of an unlabelled point that has $x \le
107\half$, you look in the first set of labelled points.  Most of
108the time, you find that the distance between the unlabelled point and its
109nearest labelled neighbour is less than the distance to the $x =
110\half$ boundary.  You don't even need to look at the second set
111of labelled points!
112
113\begin{figure}
114\begin{center}
115\begin{tabular}{@{}c@{\hspace{5em}}c@{}}
116 % as big as they can be before overflowing the page...
117 % created by kdtree-tests/tree-figs.py
118 \includegraphics[height=0.82\textheight]{kdtree-bbox}&
119 \includegraphics[height=0.82\textheight]{kdtree-split}%
120\end{tabular}
121\end{center}
122\caption{Two varieties of \kdtree.  \captionpart{Left:} A \kdtree with bounding boxes.
123Parent-child relationships are shown by gray lines.
124\captionpart{Right:} A \kdtree with splitting planes.  The splitting plane at each
125node is shown by a gray line.  The top box is the root node.
126Parent-child relationships are not shown in order to avoid cluttering
127the figure.}
128\label{fig:tree}
129\end{figure}
130
131
132
133
134While waiting for your algorithm to finish, you \mbox{remember}
135recursion.  You decide to split the labelled points in half again.
136{\small And again.}  {\scriptsize And again.}  {\tiny And again.}  You
137just invented \kdtrees.  If it was 1975 you would be the first, and
138win a prize \cite{bentley1975}.
139
140
141You used a \kdtree to speed up nearest neighbour search (find the
142closest data point to the query point).  They are also useful for
143\emph{range search} (find all data points within some radius of a
144query point) and more complex queries such as \emph{approximate kernel
145density estimation}.
146
147
148
149A \kdtree is a type of \emph{Binary Space Partition} tree.  Each node
150in the tree represents a portion of space.  In \kdtrees the portion of
151space is an axis-aligned bounding box.%
152\footnote{In the example above, the first node you created had the
153bounding box $x \in [0, \half]$ and $y \in [0, 1]$.}  Each node has
154two children which split the node's space in two.  This split is
155usually chosen so that half the data points go into each child.  A
156node's axis-aligned bounding box is either explicitly stored, or
157implicitly defined by its ancestors' splitting planes.
158
159
160% binary, balanced and complete.
161
162
163%Related stuff.
164%\cite{friedman1977, sproull1991}
165%Competition: \cite{arya1998}, \cite{uhlmann1991}, \cite{moore2000},
166%\cite{datar2004}; \cite{hjaltason2003} (survey).
167
168\Kdtrees work well in low-dimensional spaces (opinions vary, but say up to 15).
169In high dimensions, other data structures may work better \cite{bohm2001}.
170
171\def\ignore#1{}
172\ignore{}{Below, we first describe the core structure of \kdtrees, but note that the foundational implementation is not the most efficient. We then move to the comprehensive series of implementation ``tricks'' that must be considered for optimal performance. Finally, we compare our implementation, {\tt libkd} (incorporating the efficiencies described), with two previously released \kdtree implementations: {\tt ANN} \cite{arya1998} and {\tt Simple \kdtrees} \cite{simkd} (abbreviated to {\tt simkd}) to show the considerable level of speed-ups that can be attained.}
173
174\section{\Kdtree implementation}
175
176\subsection{Data structures}
177\label{sec:datastruct}
178
179\Figref{code:datastruct} shows pseudocode for \kdtree data
180structures.  There are two types of \kdtree nodes listed, {\tt
181SplittingNode} and {\tt BoundingBoxNode}.  In this \doctype, a \kdtree
182contains nodes of only one type, but of course it is possible to store
183both the splitting plane and bounding box of each node if required by
184the application.  {\tt BoundingBoxNode}s contain two points which
185define the maximum and minimum corners of the bounding box.  {\tt
186SplittingNode}s do not contain an explicit representation of their
187spatial extent.  Instead, each node contains a split dimension and
188position.  A {\tt SplittingNode}'s bounding box is implicitly
189described by the splits of its ancestors.
190
191
192The two representations have different performance characteristics on
193different data distributions.  For example, the benchmark in
194\secref{sec:speed} favours splitting planes.  Structured data sets
195where the dimensions are correlated and the data point distribution is
196clumpy work well with bounding box nodes.
197
198
199Please note that these data structures are only for explaining the
200\kdtree; for practical implementation details see
201\secref{sec:impl}.
202
203
204
205\subsection{Construction}
206
207
208\Kdtree construction starts with a set of points.  Construction begins by
209creating a root node which contains all of the data points.  The root
210node is split into two halves by choosing a splitting dimension
211(usually the dimension with the largest spread) and a splitting
212position (usually the median value of the points along that
213dimension).  Then, the set of points whose position along the
214splitting dimension is less than the splitting position are put in the
215left child.  The remaining points are put in the right child.
216Construction proceeds recursively until the nodes at the lowest level
217of the tree contain some small number of points (perhaps 10).  See
218\figref{code:build} for pseudocode.
219
220\clearpage
221
222\begin{figure}
223  \begin{pcode}
224      \class KDTree: \\
225      \> Node root \\
226      \\
227      \class Node: \\
228      \> \codecomment{empty superclass for leaf and internal nodes} \\
229      \\
230      \class InternalNode extends Node: \\
231      \> \codecomment{superclass of bounding box} \\
232      \> \codecomment{and splitting plane nodes} \\
233      \> Node left\_child \\
234      \> Node right\_child \\
235      \\
236      \class BoundingBoxNode extends InternalNode: \\
237      \> \codecomment{lower and upper corners of the bounding box} \\
238      \> point lower \\
239      \> point upper \\
240      \\
241      \class SplittingNode extends InternalNode: \\
242      \> \codecomment{splitting plane dimension and value} \\
243      \> int  split\_dimension \\
244      \> real split\_position \\
245      \\
246      \class LeafNode extends Node: \\
247      \> point[] points\_owned
248  \end{pcode}
249  \caption{Data structures for \kdtree nodes.
250    Each {\tt LeafNode} contains a set of $D$-dimensional {\tt point}s.
251    All nodes in a \kdtree are either of type {\tt BoundingBoxNode} or
252    {\tt SplittingNode}.
253    In practice, more efficient structures should be used; see \secref{sec:impl}.}
254  \label{code:datastruct}
255\end{figure}
256
257\begin{figure}
258\begin{pcode}
259  \func build\_kdtree(point[] p, int nleaf, bool b\_boxes): \\
260  \> t = new KDTree() \\
261  \> t.root = build\_node(p, nleaf, b\_boxes) \\
262  \> return t \\
263  \\
264  \codecomment{Build a \kdtree node from a set of points.} \\
265  \func build\_node(point[] p, int nleaf, bool boxes): \\
266  \> \codecomment{if the set of points is small enough...} \\
267  \> if p.size() <= nleaf: \\
268  \>\> \codecomment{create a leaf node.} \\
269  \>\> return new LeafNode(p) \\
270  \> \codecomment{else, choose a splitting dimension...} \\
271  \> split\_dim = choose\_split\_dimension(p) \\
272  \> \codecomment{and split the points along that dimension.} \\
273  \> (p\_left, p\_right, split\_pos) = partition(p, split\_dim) \\
274  \> if boxes: \\
275  \>\> n = new BoundingBoxNode() \\
276  \>\> n.lower = min(p) \\
277  \>\> n.upper = max(p) \\
278  \> else: \\
279  \>\> n = new SplittingNode() \\
280  \>\> n.split\_dimension = split\_dim \\
281  \>\>  n.split\_position = split\_pos \\
282  \> \codecomment{recurse on the two point sets} \\
283  \> \codecomment{to create the child nodes.} \\
284  \> n.left\_child \spc = build\_node(points\_left, \spc nleaf, boxes) \\
285  \> n.right\_child = build\_node(points\_right, nleaf, boxes) \\
286  \> return n \\
287  \\
288  \codecomment{Typical: split the dimension with largest range.} \\
289  \func choose\_split\_dimension(points): \\
290  \> return argmax(max(points) - min(points)) \\
291  \\
292  \codecomment{Typical: split at the median value.} \\ % to yield a balanced tree.} \\
293  \func partition(points, dim): \\
294  \> \codecomment{find the median value in the splitting dimension} \\
295  \> split\_val = median(points, dim) \\
296  \> \codecomment{grab points on each side of the partition.} \\
297  \> p\_left  \spc = [p in points
298              where p[dim] <= split\_val] \\
299  \> p\_right = [p in points
300              where p[dim] > \spc split\_val] \\
301  \> return (points\_left, points\_right, split\_val)
302\end{pcode}
303\caption{Pseudocode for \kdtree construction.}
304\label{code:build}
305\end{figure}
306
307
308
309\subsection{Distance bounds}
310
311Fundamentally, \kdtree algorithms are about pruning nodes that don't need
312to be examined.  The primary tool to achieve this is the \mindist
313function, which provides a lower bound on the distance between a point and
314any point that lives in the region owned by a node.
315Intuitively, this is the distance from a point to the closest corner or closest
316edge of the node.
317
318The form of the \mindist function depends on the representation of a \kdtree
319node.  For \kdtrees that have bounding boxes, one computes the
320\mindist to an individual node.  For \kdtrees that only store the splitting plane,
321it makes more sense to compute the \mindist to each of the node's two children.
322An illustration is given in \figref{fig:mindist} and pseudocode in
323\figref{code:mindist}.
324
325
326\begin{figure}
327\begin{center}%
328\begin{tabular}{@{}c@{}c@{}}
329\includegraphics[width=0.49\columnwidth]{mindist-bb}&
330\includegraphics[width=0.49\columnwidth]{mindist-split}%
331\end{tabular}
332\end{center}
333\caption{\captionpart{Left:} \mindist for a \kdtree node with bounding boxes.
334\captionpart{Right:} \mindist for a \kdtree node with only the splitting dimension and value,
335the \mindist is zero for the child on the same side of the splitting
336plane as the query point; for the other child the \mindist is simply
337the distance in the splitting dimension from the query to the
338splitting plane.}
339\label{fig:mindist}
340\end{figure}
341
342\begin{figure}
343\begin{pcode}
344\func mindist\_to\_box(point p, BoundingBoxNode n): \\
345\>   mindist = 0 \\
346\>   for each dimension d: \\
347\>\>      if p[d] > n.upper[d]: \\
348\>\>\> \codecomment{p is above the bounding} \\
349\>\>\> \codecomment{box in this dimension} \\
350%\>\>\> \codecomment{p is above the bounding box in this dimension} \\
351\>\>\>         diff = n.upper[d] - p[d] \\
352\>\>      else if p[d] < n.lower[d]: \\
353\>\>\> \codecomment{p is below the bounding} \\
354\>\>\> \codecomment{box in this dimension} \\
355\>\>\> diff = p[d] - n.lower[d] \\
356\>\>      else: \\
357\>\>\> \codecomment{p is within the bounding} \\
358\>\>\> \codecomment{box in this dimension} \\
359%\>\>\> \codecomment{p is within the bounding box in this dimension} \\
360\>\>\>      diff = 0 \\
361\> mindist += diff * diff \\
362\> return sqrt(mindist) \\
363\\
364\func mindist\_to\_left\_child(point p, SplittingNode n): \\
365\>   d = n.split\_dimension \\
366\>   if p[d] > n.split\_position: \\
367\>\>   \codecomment{point is on the right side} \\
368\>\>   \codecomment{of node's splitting plane} \\
369%\>\>   \codecomment{point is on the right side of node's splitting plane} \\
370\>\>      return p[d] - n.split\_position \\
371\>   return 0 \\
372\\
373\func mindist\_to\_right\_child(point p, SplittingNode n): \\
374\>   d = n.split\_dimension \\
375\>   if p[d] < n.split\_position: \\
376\>\>   \codecomment{point is on the left side} \\
377\>\>   \codecomment{of node's splitting plane.} \\
378%\>\>   \codecomment{point is on the left side of node's splitting plane.} \\
379\>\>      return n.split\_position - p[d] \\
380\>   return 0
381\end{pcode}
382\caption{Pseudocode for \mindist.  \captionpart{Top:} for \kdtrees with bounding boxes.
383\captionpart{Bottom:} for \kdtrees with splitting
384planes.}
385\label{code:mindist}
386\end{figure}
387
388\subsection{Nearest neighbour}
389
390Given a set of data points and a query point, solving the nearest neighbour
391problem consists of finding the data point that is closest to the query.
392Specifically, given query point $\mathbf{q}$, find
393\begin{equation}
394\mathbf{x}_{\mathit NN} = \argmin_{i} \| \mathbf{x}_i - \mathbf{q} \|_{2} \quad .
395\end{equation}
396The algorithm is of the \emph{branch and bound} variety
397\cite{land1960}: the tree is traversed, maintaining a \emph{pruning
398distance}---the distance to the nearest data point found so far.  In
399addition, the set of nodes that might contain the solution are stored
400in a stack or priority queue ordered by \mindist.  If a node is
401encountered whose \mindist to the query point is larger than the
402pruning distance, it is discarded because it cannot contain the
403solution.  When a leaf node is encountered, its data points are
404examined individually.  If any data point is closer than the pruning
405distance, that point becomes the nearest neighbour found so far, and
406the pruning distance is set to the point's distance to the query
407point.  The algorithm begins by setting the pruning distance to
408$+\infty$ and placing the root node on the stack.  See the pseudocode
409in \figref{code:nn}.
410
411
412\subsection{Rangesearch}
413
414In the rangesearch problem, the goal is to find all points that are
415within a certain radius of the query point.  Specifically, given query
416point $\mathbf{q}$, radius $r$, and a set of points $\mathbb{X}$, find
417the set of points
418\begin{equation}
419\left\{
420 \ \mathbf{x}
421\ \Big| \ \| \mathbf{x} - \mathbf{q} \|_{2} \le r,\ \mathbf{x}\in\mathbb{X} \
422\right\}
423\quad .
424\end{equation}
425Range search is simpler than nearest neighbour search because the
426pruning distance is fixed.  This implies that the list of nodes that
427must be inspected does not depend on the order of tree traversal.  At
428each node the \mindist to the query point is computed.  If it is
429larger than the search radius, it is pruned because it cannot possibly
430contain any data points within the radius.  See the pseudocode
431in \figref{code:rangesearch}.
432
433
434\begin{figure}
435\begin{pcode}
436\codecomment{Input: root of a \kdtree, and a query point q} \\
437\codecomment{Output: (x, distance(x, q)) such that x is the} \\
438\codecomment{  nearest neighbour of q in the \kdtree} \\
439%\codecomment{  x = argmin dist(x, q) for all x in the \kdtree} \\
440\func nearest\_neighbour(KDTree t, point q): \\
441%1234567890123456789012\=\kill
442%\> real pruning\_distance=+inf): \\
443%\tabstops
444\>   nodestack.push(t.root) \\
445\>   closest\_so\_far = +inf \\
446\>   nearest\_neighbour = null \\
447\>   while nodestack is not empty: \\
448\>\>      n = nodestack.pop() \\
449\>\>      if mindist(n, q) >= closest\_so\_far: \\
450\>\>\>         \codecomment{this node can be pruned.} \\
451\>\>\>         continue \\
452\>\>      if n.is\_leaf(): \\
453\>\>\>         \codecomment{leaf node: examine individual points} \\
454\>\>\>         for p in n.points\_owned: \\
455\>\>\>\>            d = distance(p, q) \\
456\>\>\>\>            if d < closest\_so\_far: \\
457\>\>\>\>\>               \codecomment{nearest neighbour found so far} \\
458\>\>\>\>\>               closest\_so\_far = d \\
459\>\>\>\>\>               nearest\_neighbour = p \\
460\>\>      else: \\
461\>\>\>         \codecomment{find the mindists to the child nodes} \\
462\>\>\>         mindist\_L = mindist\_to\_left\_child(n, q) \\
463\>\>\>         mindist\_R = mindist\_to\_right\_child(n, q) \\
464\>\>\>         \codecomment{visit the closest child first} \\
465\>\>\>         \codecomment{(by stacking the farther child first)} \\
466\>\>\>         if mindist\_L > mindist\_R: \\
467\>\>\>\>            nodestack.push(n.left\_child) \\
468\>\>\>\>            nodestack.push(n.right\_child) \\
469\>\>\>         else: \\
470\>\>\>\>            nodestack.push(n.right\_child) \\
471\>\>\>\>            nodestack.push(n.left\_child) \\
472\>   return (nearest\_neighbour, closest\_so\_far)
473\end{pcode}
474\caption{Pseudocode for nearest neighbour search.  The algorithm maintains
475a stack of nodes to explore and a pruning distance ({\tt
476closest\_so\_far}), which is the distance to the nearest data point
477found so far.  Any node whose
478\mindist is greater than the pruning distance is pruned.}
479\label{code:nn}
480\end{figure}
481
482
483%\begin{tikzpicture}[remember picture, overlay]
484%  \node [xshift=0.3em, yshift=-1in] (A) at (current page.north) {};
485%  \node [xshift=-0.3125in] (tl) at (A.center) {};
486%  \node [yshift=-8.875in] (bl) at (tl.center) {};
487%  \node [xshift=-3.28125in] (br) at (bl.center) {};
488%  \node [xshift=-3.28125in] (tr) at (tl.center) {};
489%  \draw [blue] (tl) -- (tr) -- (br) -- (bl) -- (tl);
490%\end{tikzpicture}
491
492
493\begin{figure}
494\begin{pcode}
495\codecomment{Input: A \kdtree, a query point q, and a range} \\
496\codecomment{Output: The set [(x, distance(x, q)), ...] } \\
497\codecomment{  such that distance(x, q) < range} \\
498\codecomment{  for all x in the \kdtree} \\
499\func range\_search(KDTree t, point q, real range): \\
500\>   results = [] \codecomment{empty list} \\
501\>   range\_search\_helper(t.root, q, range, results) \\
502\>   return results \\
503\\
504\func range\_search\_helper(Node n, point q, \\
505  12345678901234567890123\=\kill
506  \> real range, list res): \\
507  \tabstops
508\>   if is\_leaf(n): \\
509 \>\>     for each p in n.points\_owned: \\
510\>\>\>         d = distance(p,q) \\
511\>\>\>         if d < range: \\
512\>\>\>\>            res.append( (p, d) ) \\
513\>\>      return \\
514\>   if mindist\_to\_left\_child(n, q) < range: \\
515\>\>      range\_search\_helper(n.left\_child, \spc q, range, res) \\
516\>   if mindist\_to\_right\_child(n, q) < range: \\
517\>\>      range\_search\_helper(n.right\_child, q, range, res)
518\end{pcode}
519\caption{Pseudocode for range search.}
520\label{code:rangesearch}
521\end{figure}
522
523%For \kdtrees that only store the splitting plane, the algorithm is slightly
524%simpler.  Each time we examine a non-leaf node, we determine whether the
525%query point lies to the left or right of the splitting plane.  We place the
526%opposite child on the stack, followed by the child that owns the space
527%containing the query point.  A lower bound on the distance from the query
528%point to the opposite child is simply the distance in the splitting dimension
529%of the query to the splitting plane.
530
531% Pseudocode...
532
533
534% Good seed: 1197089319
535
536\begin{figure}
537  \begin{center}
538    \newlength{\nnfigw}
539    \setlength{\nnfigw}{0.25\textwidth}
540  \begin{tabular}{@{}c@{}c@{}c@{}}
541      \includegraphics[width=\nnfigw]{nnbb01} &
542  \includegraphics[width=\nnfigw]{nnbb02} &
543  \includegraphics[width=\nnfigw]{nnbb03} \\
544      \includegraphics[width=\nnfigw]{nnbb04} &
545  \includegraphics[width=\nnfigw]{nnbb05} &
546  \includegraphics[width=\nnfigw]{nnbb06}
547  \end{tabular}
548
549  \vspace{20pt}
550
551  \begin{tabular}{@{}c@{}c@{}c@{}}
552      \includegraphics[width=\nnfigw]{nnsplit01} &
553  \includegraphics[width=\nnfigw]{nnsplit02} &
554  \includegraphics[width=\nnfigw]{nnsplit03} \\
555      \includegraphics[width=\nnfigw]{nnsplit04} &
556  \includegraphics[width=\nnfigw]{nnsplit05} &
557  \includegraphics[width=\nnfigw]{nnsplit06}
558  \end{tabular}
559  \end{center}
560  \caption{\captionpart{Top:} Nearest-neighbour search with bounding boxes.
561  \captionpart{Bottom:} Nearest-neighbour search with splitting planes.
562    The node currently being examined is
563    shown in bold.  In this example,
564  the first leaf node explored contains the nearest neighbour and all
565  remaining nodes can be pruned.}
566  \label{fig:nn}
567\end{figure}
568
569%\clearpage
570
571\subsection{Approximations}
572
573\Kdtrees are applicable to approximate as well as exact algorithms.
574A prominent example is the \emph{Best Bin First} (BBF) algorithm used by many
575SIFT-based vision systems \cite{beis1997}.  BBF is essentially \kdtree
576nearest-neighbour search where the tree traversal is ordered by a priority queue.
577By terminating the search after exploring some fixed number of leaf nodes,
578an approximate nearest neighbour is produced quickly.
579
580There is an elegant solution to approximate kernel density estimation
581which looks similar to the nearest-neighbour algorithm.  It refines
582upper and lower bounds on the kernel density with a series of
583\mindist and
584\maxdist calculations.  The bounds start loose and tighten as the tree is
585traversed.  The algorithm terminates when the interval is small
586enough.
587
588
589%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
590\section{Efficient Implementation Tricks}
591\label{sec:impl}
592
593
594In most cases, \kdtrees are built from a static set of data points
595and then queried millions of times.\footnote{There are some exceptions
596where the tree is updated online, but they are not discussed
597here.}  A number of opportunities arise when points are not added or
598removed after the \kdtree is built.
599
600
601This section presents a series of optimization ``tricks'' taken from
602the battle-hardened \kdtree implementation at the heart of \an.
603These tricks yield an order of magnitude speed increase and vastly
604reduce the memory required compared to the alternatives available.
605
606
607In this section, the dimension of the data points is $D$, the number
608of data points is $N$, and the number of nodes in the \kdtree is $M$.
609All trees are binary and complete. $N$ is large relative to $D$; for
610example in the authors' application $D$ is 3 or 4, and $N$ is 10 to
611100 million.  For us, the maximum tree size that fits in a 32-bit
612address space is an important design parameter.
613
614Not all of these implementation ``tricks'' apply in all cases, but each one
615helps.  Although most of the tricks are concerned with reducing the memory
616requirements of the \kdtree, this often has the effect of increasing the
617speed as well.
618
619
620\trick{Store the data points as a flat array.}
621\label{trick:flatarray}
622Use a one-dimensional array of size $ND$, where coordinate $j$ of data
623point $i$ is stored
624%\linebreak[4]
625at location $iD + j$.  For example, in three dimensions
626the array will contain
627\[ (x_0, y_0, z_0, x_1, y_1, z_1, \ldots) .\]
628
629It may be tempting to store the data points as an $N \times D$
630two-dimensional array.  This incurs a memory overhead of $N$ pointers,
631and at runtime requires a pointer dereference.  To store one million
632points in 3 dimensions ($N=10^{6}, D=3$), our {\tt libkd} requires an
633overhead of 4 bytes, while the existing implementations {\tt ANN} and
634{\tt simkd} use over $4$ megabytes ($8$ megabytes on 64-bit) and over
635$20$ megabytes (40 on 64-bit) respectively.
636
637\trick{Create a complete tree.}
638\label{trick:complete}
639Instead of splitting nodes until each leaf node contains some maximum
640number of points, fix the number of tree levels and create a complete
641tree.  The huge advantage of this trick is that the number of nodes is
642known in advance, which allows the next trick to be used without
643wasting any memory.
644
645There are disadvantages to this trick.  First, the number of data
646points in the leaf nodes is only adjustable by factors of two.
647Second, if nodes are not split at the median value, then some leaf
648nodes will contain more data points than others.
649
650\trick{Don't use pointers to connect nodes.}
651\label{trick:nopointers}
652Instead, put the nodes in a single array.  Use the heap indexing
653strategy shown below instead of explicit pointers.
654The root is node $0$.  The left child of node $i$ is node $2i + 1$,
655and the right child is node $2i + 2$.
656%This works because \kdtrees are binary and complete.
657
658\nonumberparagraphs
659\begin{center}
660\includegraphics[width=\pointerlessfigwidth]{pointerless-fig}
661\end{center}
662\numberparagraphs
663
664For a tree with $M$ nodes, this saves about $M$ pointers, and places
665sibling nodes---which are likely to be accessed at the same
666time---next to each other in memory.  This locality of reference
667improves cache performance.
668
669
670A convenient side-effect of using this trick is that the \kdtree is
671position-independent: the array of nodes can be moved in memory without
672changing any of the contents.  Indeed, it can be written directly to
673disk in `live' form and restored at a later date.  By using the
674{\tt mmap()} system call, the live on-disk representation is
675immediately available for queries; the time required to read it
676from disk will be amortized over the subsequent queries.  Regions of
677space that are never queried may never be read from disk.
678
679
680\trick{Pivot the data points while building the tree.}
681\label{trick:pivot}
682When building the tree, pivot the data along the splitting dimension
683so that the data points owned by the child nodes are contiguous in
684memory.  Represent the set of data points owned by a node as leftmost
685and rightmost offsets into the data array ($L$ and $R$).  Keep a
686permutation array to map from the final array indices back to the
687original indices.
688
689\nonumberparagraphs
690\begin{center}
691  \includegraphics[width=\permutefigwidth]{permute-fig}
692\end{center}
693\numberparagraphs
694
695Once this has been done, the data points owned by leaf nodes are
696contiguous in memory.  In addition, leaf nodes that are nearby in
697space are likely to have their data points stored nearby in memory.
698This increases locality of reference and therefore performance.
699
700
701
702\trick{Don't use {\tt C++} virtual functions.}
703\label{trick:nocpp}
704It may be tempting to define a class hierarchy with an abstract base
705class {\tt Node} and classes {\tt InternalNode} and {\tt LeafNode}
706which inherit from it (as in \secref{sec:datastruct}).  Compilers
707implement this by adding a {\tt vtable} pointer to each {\tt
708InternalNode} and {\tt LeafNode} object.  This enlarges each node by
709the size of a pointer.  This wastes 32 (or 64) bits of memory per
710node, since the node type is completely determined by its position in
711the tree.  Using the node layout shown above, node $i$ is a leaf if $i
712\ge M/2 - 1$ (where $M$ is the total number of nodes).  Using
713inheritance also incurs an indirect function call for each virtual
714function, which itself costs time and prevents the compiler from
715performing inlining optimizations.
716
717
718
719
720\trick{Consider discarding the permutation array.}
721\label{trick:noperm}
722
723Apply the permutation array to auxiliary data to make the \kdtree's
724data order the canonical order.
725For example, you might have a set of data points where each point has
726a class label such as ``cat'' or ``dog''.  After creating a \kdtree
727from the data points, the points are permuted so that the $i$th data
728point no longer corresponds to label $i$.  However, if you apply the
729inverse of the \kdtree's permutation array to the labels, they will
730again correspond.  You can then discard the \kdtree's permutation
731array.  This eliminates a level of indirection and a significant
732amount of memory ($N$ integers).
733
734\nonumberparagraphs
735\begin{center}
736  \includegraphics[width=\applypermfigwidth]{apply-perm-fig}
737\end{center}
738\numberparagraphs
739
740
741% This is frequently the case when the index of the data serves as a
742% label.  Consider a set of data points where each point has a class label
743% such as ``cat'' or ``dog'' associated with it.  After creating a \kdtree
744%from the data points, they will be permuted so that index $i$ in the array
745%of data points no longer corresponds to index $i$ in the array of labels.
746%The \kdtree's permutation array can be used to map the \kdtree's ordering
747%back to the original ordering.  Alternatively, you can apply the inverse
748%permutation to the array of labels.  This eliminates a level of indirection
749%and a significant amount of memory from the tree.
750
751
752\trick{Store only the rightmost offset of points owned by a node.}
753\label{trick:ronly}
754
755The $L$ and $R$ offsets describe the range of data points owned by a
756node.  Within a level of the \kdtree, the $L$ offset of a node is
757simply one greater than the $R$ offset of the node just to the left,
758or zero if there is no node to the left.  Since the $L$ value can be
759trivially computed from the $R$ value, there is no need to store both.
760This saves $M$ integers.
761
762
763\nonumberparagraphs
764\begin{center}
765  \includegraphics[width=\ronlyfigwidth]{r-only-fig}
766\end{center}
767\numberparagraphs
768
769\trick{Don't store the $R$ offsets of internal nodes.}
770Only leaf nodes need to store the $R$ offsets.
771The leftmost data point $L$ of a non-leaf node is just the $L$ value of its
772leftmost leaf.  Similarly, the rightmost data point $R$ of a non-leaf node is
773$R$ of its rightmost leaf.  Try saying that ten times fast!
774This saves $M/2$ integers.
775
776%The savings are approximately $1/2\cdot${\tt sizeof(offset)}.
777
778
779\trick{With median splits, don't store the $R$ offsets.}
780\label{trick:noR}
781Compute them instead.  If the \kdtree is built by splitting at the
782median value (the common case), then the number of data points owned
783by each node in the tree is a function of the total number of data
784points, independent of the values of the data points.  Computing the
785offsets costs $\mathcal{O}(\log M)$\footnote{There {\em may} be an
786$\mathcal{O}(1)$ algorithm, though the authors haven't found it.} if
787exact median splits are used and a fixed rounding direction is chosen
788(\ie, the right subtree gets the extra data point when the number of
789data points is odd).  However, by choosing the rounding direction
790carefully---by moving the splitting position by one place---the data
791points can be distributed so that the computation is $\mathcal{O}(1)$.
792
793
794
795\trick{Consider transposing the data structures.}
796\label{trick:transpose}
797Instead of creating an array of {\tt node} structures, pull the
798node contents directly into the {\tt kdtree} structure, creating a
799separate array for each element.
800
801\nonumberparagraphs
802\begin{center}
803    \includegraphics[width=\transposefigwidth]{transpose-fig}
804\end{center}
805\numberparagraphs
806
807When a compiler encounters a structure such as {\tt node} above,
808it pads the structure so that its total size is a multiple of the size of
809the largest element.  For {\tt node}, the structure will be padded to 16
810bytes, even though only 9 bytes are used.  It is possible to force the
811compiler to pack the structure tightly, but this results in unaligned memory
812accesses (many of the {\tt double} values will not start on a multiple of eight
813bytes), which incurs significant overhead.
814
815
816The advantage of transposing the data structures is that the memory required
817can be minimized without destroying the structure alignment.  The disadvantage
818is a loss of locality of reference: typically the members of a structure are
819accessed at the same time, and transposing them results in the members being
820dispersed in memory.
821
822%\trick{Consider converting the data points to a smaller type.}
823\trick{Consider using a smaller data type.}
824\label{trick:smalltypes}
825For most applications, using {\tt double} values to represent points
826in $\left[0,1\right]$ is not an effective use of bits.  Consider
827transforming the data to a smaller integer representation.  If the
828data points live in a small bounded space, then converting {\tt
829double} values to 32-bit integers saves a factor of two of space with
830very little loss of precision.  Converting to 16- or 8-bit integers
831saves more memory but the effect of introducing this approximation
832should be considered in the context of the problem at hand.
833
834
835The boundary between the \kdtree software library and the application
836is an ideal place to do this transformation: the application need not
837know that the \kdtree library represents the data points in a smaller
838format.  The \kdtree library converts query values into the smaller
839format on the way into the library, and converts results back to the
840original format on the way out.  The only user-visible change should
841be that the data points occupy much less memory, queries are faster,
842and there may be small approximation errors.
843
844
845\trick{Consider bit-packing the splitting value and dimension.}
846\label{trick:packbits}
847Since \kdtrees are only suitable for low-dimensional data, the
848splitting dimension only requires a few bits.  Instead of storing it
849in a separate array, store it in the bottom few bits of the splitting
850value.  For example, with four-dimensional data points, the splitting
851dimension can be stored in the bottom two bits of a 16-bit integer,
852leaving 14 bits to store the splitting position.
853
854\nonumberparagraphs
855\begin{center}
856  \includegraphics[width=\bitpackfigwidth]{bitpack-fig}
857\end{center}
858\numberparagraphs
859
860In general, this trick conflicts with trick of not storing the $R$
861offsets (\secref{trick:noR}), since that trick requires precise
862control over the splitting plane location, while this trick involves
863sacrificing some precision to save space.  For some data point sets,
864this conflict may not arise, and both tricks may be used
865simultaneously.
866
867
868\trick{Consider throwing away the data.}
869If your data points are associated with some other information (such
870as labels), and your application can tolerate some false positive
871results, consider the tradeoffs of discarding the data and keeping the
872\kdtree.  By using the other tricks in this section, the memory
873requirements of a \kdtree can be reduced to a small fraction of the
874size of the data points themselves.  In some applications, the benefit
875of being able to search more data points might outweigh the cost of
876not being able to say exactly where the data points were.
877
878Given $N$ data points, we can build a \kdtree with $N-1$ total
879splitting plane nodes.  Each node contains a splitting value and
880possibly the splitting dimension: it is at most slightly larger than a
881single data value.  The entire tree requires about $1/D$ as much
882memory as the data points themselves.
883
884The \kdtree search algorithms then produce bounded approximations.
885For nearest-neighbour search, the algorithm can produce the index of the
886data point that lives in the same box as the query point, an upper bound
887on the distance to that point, and a lower bound to the distance to the
888true nearest neighbour.  It is also possible to produce a small list of
889points that is guaranteed to contain the nearest neighbour.
890
891
892
893\section{Speed Comparison}
894\label{sec:speed}
895
896This section gives empirical evidence that the tricks presented above
897can yield significant improvements.  We compare our implementation,
898{\tt libkd}, with two previously released \kdtree implementations:
899{\tt ANN} \cite{arya1998} and {\tt simkd} ({\tt Simple \kdtrees})
900\cite{simkd}.  We used these packages `out of the box', using the default compiler settings
901and choices about how to build the \kdtree.
902
903We chose a very simple benchmark test: the data points are 5 million
904samples drawn uniformly from the unit cube ($N=5 \times 10^6, D=3$).
905The number of points owned by a leaf node was set to $14$ for {\tt
906ANN} and {\tt simkd} in order to make the number of nodes in the three
907implementations approximately equal.  The query points are one million
908samples from the same distribution.  We tested the one-nearest
909neighbour algorithm.  We would have preferred to use larger $N$, since
910this is the regime in which our \an project operates, but the other
911libraries were not able to build such large trees.  We present results
912for a 32-bit machine%
913\footnote{Intel Xeon (SL72G) at 3.06 GHz with 4 GB of RAM.}
914and a 64-bit machine%
915\footnote{AMD Opteron 8220 SE at 2.8 GHz with 32 GB of RAM.}
916to show how the memory requirements differ.
917
918%\nonumberparagraphs
919
920\begin{table}
921\begin{center}
922\newcommand{\ftt}[1]{{\tt #1}}
923\newcommand{\minitab}[2][l]{\begin{tabular}{#1}#2\end{tabular}}
924%\newcommand{\ftt}[1]{{\scriptsize\tt #1}}
925%\begin{footnotesize}
926\begin{tabular}{|l|c@{\hspace{5pt}}c|c@{\hspace{5pt}}c|}
927\hline
928%\multirow{3}{*}{{\bf Implementation}} &
929\multirow{3}{*}{\minitab[c]{{\bf Implementation}}} &
930%
931\multicolumn{2}{|c|}{{\bf Speed}} &
932\multicolumn{2}{|c|}{{\bf Memory}} \\
933%& \multicolumn{2}{|c|}{(1000 queries/sec)} &
934& \multicolumn{2}{|c|}{(k q/sec)} &
935\multicolumn{2}{|c|}{data + tree = total (Mbytes)} \\
936& 32-bit & 64-bit & 32-bit & 64-bit \\
937\hline
938%{\tt simkd} &          47 &  39 & 370 & 486 \\
939%{\tt ANN}   &          71 &  90 & 187 & 221 \\
940%\hline
941%{\tt libkd-d-box}   & 127 & 144 & 172 & 172 \\
942%{\tt libkd-d-split} & 231 & 284 & 126 & 126 \\
943%{\tt libkd-d-noR} &   239 & 293 & 125 & 125 \\
944%{\tt libkd-i-split} & 242 & 311 &  64 &  64 \\
945%{\tt libkd-i-noR} &   240 & 326 &  63 &  63 \\
946%{\tt libkd-s-split} & 328 & 386 &  33 &  33 \\
947%{\tt libkd-s-noR} &   307 & 396 &  32 &  32 \\
948\ftt{simkd} &          47 &  39 & 120 + 250=370 & 120 + 366=486 \\
949\ftt{ANN}   &          71 &  90 & 120 + 67=187 & 120 + 101=221 \\
950\hline
951\ftt{libkd-d-box}   & 127 & 144 & \multicolumn{2}{|c|}{120 + 52 = 172}  \\
952\ftt{libkd-d-split} & 231 & 284 & \multicolumn{2}{|c|}{120 + 6 = 126} \\
953\ftt{libkd-d-noR} &   239 & 293 & \multicolumn{2}{|c|}{120 + 5 = 125} \\
954\ftt{libkd-i-split} & 242 & 311 & \multicolumn{2}{|c|}{ 60 + 4 = 64} \\
955\ftt{libkd-i-noR} &   240 & 326 & \multicolumn{2}{|c|}{ 60 + 3 = 63} \\
956\ftt{libkd-s-split} & 328 & 386 & \multicolumn{2}{|c|}{ 30 + 3 = 33} \\
957\ftt{libkd-s-noR} &   307 & 396 & \multicolumn{2}{|c|}{ 30 + 2 = 32} \\
958\hline
959\end{tabular}
960\end{center}
961%\end{footnotesize}
962%\vspace{5pt}
963\caption{Benchmark results.
964Speed is measured in thousands of queries per second (k q/sec), memory
965in megabytes.  The memory values include the memory required to store
966the data points themselves, which is $120$ MB when {\tt double}s are
967used.  For the {\tt libkd} entries, {\tt d} indicates that the data
968points are stored as {\tt double}, {\tt i} indicates 32-bit integers,
969and {\tt s} indicates 16-bit integers (using
970trick \ref{trick:smalltypes}).  {\tt box} means that bounding-boxes
971are created; otherwise splitting-planes are used.  The {\tt noR}
972entries indicate that the we avoid storing the offset arrays
973(trick \ref{trick:noR}).  We have also assumed that trick
974\ref{trick:noperm} is applicable, so the permutation arrays are not stored.}
975\label{table:results}
976\end{table}
977
978%\numberparagraphs
979
980Table \ref{table:results} shows our results.  Our implementation, {\tt
981libkd}, always takes less memory and produces faster results.  The
982memory requirements of {\tt ANN} and {\tt simkd} increase when using a
98364-bit machine, while {\tt libkd} stays fixed because we use no
984extraneous pointers.  Note that the data points themselves require
985$120$ MB of space, so the memory overhead of a {\tt libkd} tree is
986only a few percent, compared to $50$ to $300$ percent for {\tt ANN}
987and {\tt simkd}.
988
989In this test, splitting planes are much faster than bounding boxes.
990Using trick \ref{trick:noR} to avoid storing the offset arrays both
991reduces the memory required and slightly increases the speed.  Using
992trick \ref{trick:smalltypes} to use smaller data types (32- and 16-bit
993integers) reduces the memory footprint by a factor of two or four,
994with a corresponding increase in speed.  It also incurs some
995approximation error.  In this test, we found that when using 32-bit
996integers, we still always produced the correct nearest neighbour, and
997the absolute error in the distance estimates was never more than
998$10^{-9}$.  When using 16-bit integers, we produced the correct
999nearest neighbour 99.5\% of the time, and the absolute distance error
1000was less than $10^{-4}$.
1001
1002% actual approximation errors; 4x10^{-10} and 8x10^{-5}.
1003
1004
1005In summary, these results show that, on this benchmark, by using the
1006tricks presented {\tt libkd} is three times faster and incurs
1007one-fifteenth the memory overhead compared to the closest competitor.
1008While benchmarks are not listed among the three canonical types of
1009lies (``lies, damned lies, and statistics''), we urge practitioners to
1010test these implementations on their own data.  Our testing and
1011benchmarking software is available for download.
1012
1013
1014% cluster59:
1015% (?) http://processorfinder.intel.com/Details.aspx?sSpec=SLABS
1016
1017% cluster23:
1018%    2 * Intel(R) Xeon(TM) CPU 3.06GHz
1019%    S-spec SL72G
1020%    http://processorfinder.intel.com/details.aspx?sSpec=SL72G
1021%    4 GB RAM
1022%    32-bit
1023%    libkd: -O3
1024%           -DNDEBUG
1025%           -march=pentium4
1026%           -fomit-frame-pointer
1027%           -fno-math-errno
1028%           -fno-trapping-math
1029%           -fno-signaling-nans
1030%           -ffinite-math-only
1031%    ann: -O3
1032%    simkd: -O2
1033
1034% testB.in
1035% 5M points in 3D
1036% 120 MB data
1037% ann, simkd: bucket size 14
1038% libkd: bucket size 10
1039
1040% simkd:
1041%  516,748 leaves
1042%  370 MB including data
1043
1044% ann:
1045%  516,748 leaves
1046%  187 MB including data
1047
1048% libkd:
1049%  524,288 leaves
1050%  172 MB ddd/bbox
1051%  126 MB ddd/split
1052%  125 MB ddd/nolr
1053%   64 MB duu/split
1054%   63 MB duu/splitdim/linearlr
1055%   33 MB dss/split
1056%   32 MB dss/splitdim/linearlr
1057
1058% speed: q/sec:
1059%   47  simkd
1060%   71  ann
1061%
1062%  127  ddd/bbox
1063%  231  ddd/split
1064% (203) ddd/nolr
1065%  239  ddd/linearlr
1066%
1067%  242  duu/split
1068% (233) duu/splitdim
1069% (205) duu/splitdim/nolr
1070%  240  duu/splitdim/linearlr
1071%
1072%  328  dss/split
1073% (303) dss/splitdim
1074% (255) dss/splitdim/nolr
1075%  307  dss/splitdim/linearlr
1076
1077
1078
1079% cluster27:
1080%    4 * AMD Opteron(tm) Processor 8220 SE
1081%    32 GB RAM
1082%    64-bit
1083%    libkd: -O3
1084%           -DNDEBUG
1085%           -march=k8 -m64
1086%           -fomit-frame-pointer
1087%           -fno-signaling-nans
1088%           -ffinite-math-only
1089%           -fno-math-errno
1090%           -fno-trapping-math
1091%    ann: -O3
1092%    simkd: -O2
1093
1094% testB.in
1095% 5M points in 3D
1096% 120 MB data
1097% ann, simkd: bucket size 14
1098% libkd: bucket size 10
1099
1100% simkd:
1101%  516,865 leaves
1102%  486 MB including data
1103
1104% ann:
1105%  516,785 leaves
1106%  221 MB including data
1107
1108% libkd:
1109%  524,288 leaves
1110%  172 MB ddd/bbox
1111%  126 MB ddd/split
1112%  125 MB ddd/nolr
1113%   64 MB duu/split
1114%   63 MB duu/splitdim/linearlr
1115%   33 MB dss/split
1116%   32 MB dss/splitdim/linearlr
1117
1118% speed: q/sec:
1119%   39  simkd
1120%   90  ann
1121%
1122%  144  ddd/bbox
1123%  284  ddd/split
1124% (260) ddd/nolr
1125%  293  ddd/linearlr
1126%
1127%  311  duu/split
1128% (314) duu/splitdim
1129% (287) duu/splitdim/nolr
1130%  326  duu/splitdim/linearlr
1131%
1132%  386  dss/split
1133% (378) dss/splitdim
1134% (341) dss/splitdim/nolr
1135%  396  dss/splitdim/linearlr
1136
1137
1138
1139
1140
1141\section{Conclusion}
1142
1143\Kdtrees are a fun and easy data structure for searching in $k$-dimensional
1144space. A careful implementation can be orders of magnitude faster than
1145a na\"ive one. Readers wishing to reap the benefits of such a careful
1146implementation, without building it themselves, are encouraged to
1147download the GPL'd {\tt libkd}. Patches are welcome!
1148
Note: See TracBrowser for help on using the repository browser.