| 1 | \comment{ |
|---|
| 2 | |
|---|
| 3 | HOGG comments |
|---|
| 4 | |
|---|
| 5 | Section 2+3, general: I like the fact that the tricks and tips and |
|---|
| 6 | parts are all individually numbered, but it is also distracting to |
|---|
| 7 | have so many full-break headings. What about formatting the |
|---|
| 8 | subsections the way LaTeX usually formats \paragraph{} so that "2.1 |
|---|
| 9 | Data structures:" appears at the beginning of the line that currently |
|---|
| 10 | starts "Figure 2 shows pseudocode...". |
|---|
| 11 | |
|---|
| 12 | Subsection 2.2: You don't specify when or how you change the splitting |
|---|
| 13 | dimension d. Even though it *does* say in the pseudocode, I think you |
|---|
| 14 | should be explicit here in the text, and maybe also give some |
|---|
| 15 | alternatives that are also sensible. That is, give the default method |
|---|
| 16 | but explain where a sensible person might make a different choice and |
|---|
| 17 | why. |
|---|
| 18 | |
|---|
| 19 | Figure 1: You don't explain the meaning of the dashed lines in the |
|---|
| 20 | caption. Also are you really going to have red color in your thesis |
|---|
| 21 | here? Also, the dashed lines are not very clear given the choppy |
|---|
| 22 | data; I would suggest grey or red solid lines. |
|---|
| 23 | |
|---|
| 24 | Figure 3 (and subsection 2.2): I think you should note in the |
|---|
| 25 | pseudocode comments and the text that traditionally you stop when the |
|---|
| 26 | node size goes below some threshold, but that in Section 3 you are |
|---|
| 27 | going to suggest an alternative. |
|---|
| 28 | |
|---|
| 29 | Figure 8: I prefer solid grey to dotted lines. |
|---|
| 30 | |
|---|
| 31 | Subsection 2.6: Either cite or explain the approximate kernel density |
|---|
| 32 | estimation algorithm, or both. |
|---|
| 33 | |
|---|
| 34 | Section 3: First line: "most" or "many"? I think you should spend |
|---|
| 35 | a bit more time here noting that your advice is only for the case in |
|---|
| 36 | which the kd-tree is edited or constructed FAR LESS FREQUENTLY than it |
|---|
| 37 | is searched. |
|---|
| 38 | |
|---|
| 39 | Subsection 3.2: Could you give the reader some idea of *when* you |
|---|
| 40 | might choose *not* to split at the median? |
|---|
| 41 | |
|---|
| 42 | Subsection 3.6: I would be even a bit more explicit here, that the |
|---|
| 43 | data dimensions in the tree are usually only some small part of your |
|---|
| 44 | data, and that you have all this other information for each data point |
|---|
| 45 | elsewhere. You call these "labels" in the text, but it could be all |
|---|
| 46 | the other columns in an enormous data base. |
|---|
| 47 | |
|---|
| 48 | Section 4, para starting "We chose a very simple...": Isn't it a HUGE |
|---|
| 49 | DEAL that you can make bigger trees? If so, this should be trumpeted |
|---|
| 50 | a bit more, here and in the abstract. So you beat all the standard |
|---|
| 51 | implementations on data-set applicability alone! |
|---|
| 52 | |
|---|
| 53 | Section 4, near the end of the section: Once again, mention the |
|---|
| 54 | capability of making much bigger trees. |
|---|
| 55 | |
|---|
| 56 | Section 5: Give the location for the libkd download. Perhaps it |
|---|
| 57 | should go to sourceforge or code.google.com, and not just |
|---|
| 58 | astrometry.net? Or is it already *in* some such location? |
|---|
| 59 | |
|---|
| 60 | one more thing: I think it would be useful to indicate, among the |
|---|
| 61 | speed-up and optimization hints, which depend most strongly on the |
|---|
| 62 | data being *static* and which do not; in general, if it isn't clear, |
|---|
| 63 | it is worth saying what tips and tricks depend on what assumptions. |
|---|
| 64 | } |
|---|
| 65 | |
|---|
| 66 | |
|---|
| 67 | \input{kdfigs} |
|---|
| 68 | |
|---|
| 69 | \section{Introduction} |
|---|
| 70 | |
|---|
| 71 | The \kdtree is a simple yet versatile data structure for speeding up |
|---|
| 72 | searches in $k$-dimensional data sets. It applies to |
|---|
| 73 | nearest-neighbour queries, which arise in many computer vision |
|---|
| 74 | applications such as vector quantization and template matching. It |
|---|
| 75 | also applies to a family of related problems such as range search |
|---|
| 76 | (finding all points within a given radius of a query), $K$-nearest |
|---|
| 77 | neighbours, and approximate kernel density estimation. Over its |
|---|
| 78 | long history the \kdtree has accumulated many extensions and |
|---|
| 79 | variations making the literature difficult to comprehend. |
|---|
| 80 | This \doctype presents a modern view of the \kdtree. It also points |
|---|
| 81 | out a number of optimizations that can increase the performance by |
|---|
| 82 | an order of magnitude over existing publicly available |
|---|
| 83 | implementations while also reducing the memory requirements---or |
|---|
| 84 | increasing the size of the largest \kdtree that be constructed in a given |
|---|
| 85 | amount of memory. |
|---|
| 86 | |
|---|
| 87 | \section{The standard \kdtree} |
|---|
| 88 | |
|---|
| 89 | |
|---|
| 90 | \Kdtrees have accelerated spatial queries for over thirty years. |
|---|
| 91 | \Kdtrees speed up N-body simulations, search, object recognition, |
|---|
| 92 | ray tracing, template matching, kernel density estimation, and more. |
|---|
| 93 | %But what \emph{is} a \kdtree? |
|---|
| 94 | |
|---|
| 95 | |
|---|
| 96 | |
|---|
| 97 | Imagine you have a classification problem. You have a million |
|---|
| 98 | labelled 2-dimensional data points (``feature vectors'') in the unit |
|---|
| 99 | square. You also have a set of a million unlabelled data points, and |
|---|
| 100 | a deadline. You decide to run nearest-neighbour classification. |
|---|
| 101 | After a moment of thought you realize that if you compare every |
|---|
| 102 | labelled feature to every unlabelled feature you will have to do |
|---|
| 103 | $10^{12}$ comparisons. Your deadline looms. Upon further reflection, |
|---|
| 104 | you decide to split the labelled points in half: the ones with $x \le |
|---|
| 105 | \half$ and the ones with $x > \half$. When you want to |
|---|
| 106 | find the nearest neighbour of an unlabelled point that has $x \le |
|---|
| 107 | \half$, you look in the first set of labelled points. Most of |
|---|
| 108 | the time, you find that the distance between the unlabelled point and its |
|---|
| 109 | nearest labelled neighbour is less than the distance to the $x = |
|---|
| 110 | \half$ boundary. You don't even need to look at the second set |
|---|
| 111 | of labelled points! |
|---|
| 112 | |
|---|
| 113 | \begin{figure} |
|---|
| 114 | \begin{center} |
|---|
| 115 | \begin{tabular}{@{}c@{\hspace{5em}}c@{}} |
|---|
| 116 | % as big as they can be before overflowing the page... |
|---|
| 117 | % created by kdtree-tests/tree-figs.py |
|---|
| 118 | \includegraphics[height=0.82\textheight]{kdtree-bbox}& |
|---|
| 119 | \includegraphics[height=0.82\textheight]{kdtree-split}% |
|---|
| 120 | \end{tabular} |
|---|
| 121 | \end{center} |
|---|
| 122 | \caption{Two varieties of \kdtree. \captionpart{Left:} A \kdtree with bounding boxes. |
|---|
| 123 | Parent-child relationships are shown by gray lines. |
|---|
| 124 | \captionpart{Right:} A \kdtree with splitting planes. The splitting plane at each |
|---|
| 125 | node is shown by a gray line. The top box is the root node. |
|---|
| 126 | Parent-child relationships are not shown in order to avoid cluttering |
|---|
| 127 | the figure.} |
|---|
| 128 | \label{fig:tree} |
|---|
| 129 | \end{figure} |
|---|
| 130 | |
|---|
| 131 | |
|---|
| 132 | |
|---|
| 133 | |
|---|
| 134 | While waiting for your algorithm to finish, you \mbox{remember} |
|---|
| 135 | recursion. You decide to split the labelled points in half again. |
|---|
| 136 | {\small And again.} {\scriptsize And again.} {\tiny And again.} You |
|---|
| 137 | just invented \kdtrees. If it was 1975 you would be the first, and |
|---|
| 138 | win a prize \cite{bentley1975}. |
|---|
| 139 | |
|---|
| 140 | |
|---|
| 141 | You used a \kdtree to speed up nearest neighbour search (find the |
|---|
| 142 | closest data point to the query point). They are also useful for |
|---|
| 143 | \emph{range search} (find all data points within some radius of a |
|---|
| 144 | query point) and more complex queries such as \emph{approximate kernel |
|---|
| 145 | density estimation}. |
|---|
| 146 | |
|---|
| 147 | |
|---|
| 148 | |
|---|
| 149 | A \kdtree is a type of \emph{Binary Space Partition} tree. Each node |
|---|
| 150 | in the tree represents a portion of space. In \kdtrees the portion of |
|---|
| 151 | space is an axis-aligned bounding box.% |
|---|
| 152 | \footnote{In the example above, the first node you created had the |
|---|
| 153 | bounding box $x \in [0, \half]$ and $y \in [0, 1]$.} Each node has |
|---|
| 154 | two children which split the node's space in two. This split is |
|---|
| 155 | usually chosen so that half the data points go into each child. A |
|---|
| 156 | node's axis-aligned bounding box is either explicitly stored, or |
|---|
| 157 | implicitly defined by its ancestors' splitting planes. |
|---|
| 158 | |
|---|
| 159 | |
|---|
| 160 | % binary, balanced and complete. |
|---|
| 161 | |
|---|
| 162 | |
|---|
| 163 | %Related stuff. |
|---|
| 164 | %\cite{friedman1977, sproull1991} |
|---|
| 165 | %Competition: \cite{arya1998}, \cite{uhlmann1991}, \cite{moore2000}, |
|---|
| 166 | %\cite{datar2004}; \cite{hjaltason2003} (survey). |
|---|
| 167 | |
|---|
| 168 | \Kdtrees work well in low-dimensional spaces (opinions vary, but say up to 15). |
|---|
| 169 | In high dimensions, other data structures may work better \cite{bohm2001}. |
|---|
| 170 | |
|---|
| 171 | \def\ignore#1{} |
|---|
| 172 | \ignore{}{Below, we first describe the core structure of \kdtrees, but note that the foundational implementation is not the most efficient. We then move to the comprehensive series of implementation ``tricks'' that must be considered for optimal performance. Finally, we compare our implementation, {\tt libkd} (incorporating the efficiencies described), with two previously released \kdtree implementations: {\tt ANN} \cite{arya1998} and {\tt Simple \kdtrees} \cite{simkd} (abbreviated to {\tt simkd}) to show the considerable level of speed-ups that can be attained.} |
|---|
| 173 | |
|---|
| 174 | \section{\Kdtree implementation} |
|---|
| 175 | |
|---|
| 176 | \subsection{Data structures} |
|---|
| 177 | \label{sec:datastruct} |
|---|
| 178 | |
|---|
| 179 | \Figref{code:datastruct} shows pseudocode for \kdtree data |
|---|
| 180 | structures. There are two types of \kdtree nodes listed, {\tt |
|---|
| 181 | SplittingNode} and {\tt BoundingBoxNode}. In this \doctype, a \kdtree |
|---|
| 182 | contains nodes of only one type, but of course it is possible to store |
|---|
| 183 | both the splitting plane and bounding box of each node if required by |
|---|
| 184 | the application. {\tt BoundingBoxNode}s contain two points which |
|---|
| 185 | define the maximum and minimum corners of the bounding box. {\tt |
|---|
| 186 | SplittingNode}s do not contain an explicit representation of their |
|---|
| 187 | spatial extent. Instead, each node contains a split dimension and |
|---|
| 188 | position. A {\tt SplittingNode}'s bounding box is implicitly |
|---|
| 189 | described by the splits of its ancestors. |
|---|
| 190 | |
|---|
| 191 | |
|---|
| 192 | The two representations have different performance characteristics on |
|---|
| 193 | different data distributions. For example, the benchmark in |
|---|
| 194 | \secref{sec:speed} favours splitting planes. Structured data sets |
|---|
| 195 | where the dimensions are correlated and the data point distribution is |
|---|
| 196 | clumpy work well with bounding box nodes. |
|---|
| 197 | |
|---|
| 198 | |
|---|
| 199 | Please note that these data structures are only for explaining the |
|---|
| 200 | \kdtree; for practical implementation details see |
|---|
| 201 | \secref{sec:impl}. |
|---|
| 202 | |
|---|
| 203 | |
|---|
| 204 | |
|---|
| 205 | \subsection{Construction} |
|---|
| 206 | |
|---|
| 207 | |
|---|
| 208 | \Kdtree construction starts with a set of points. Construction begins by |
|---|
| 209 | creating a root node which contains all of the data points. The root |
|---|
| 210 | node is split into two halves by choosing a splitting dimension |
|---|
| 211 | (usually the dimension with the largest spread) and a splitting |
|---|
| 212 | position (usually the median value of the points along that |
|---|
| 213 | dimension). Then, the set of points whose position along the |
|---|
| 214 | splitting dimension is less than the splitting position are put in the |
|---|
| 215 | left child. The remaining points are put in the right child. |
|---|
| 216 | Construction proceeds recursively until the nodes at the lowest level |
|---|
| 217 | of the tree contain some small number of points (perhaps 10). See |
|---|
| 218 | \figref{code:build} for pseudocode. |
|---|
| 219 | |
|---|
| 220 | \clearpage |
|---|
| 221 | |
|---|
| 222 | \begin{figure} |
|---|
| 223 | \begin{pcode} |
|---|
| 224 | \class KDTree: \\ |
|---|
| 225 | \> Node root \\ |
|---|
| 226 | \\ |
|---|
| 227 | \class Node: \\ |
|---|
| 228 | \> \codecomment{empty superclass for leaf and internal nodes} \\ |
|---|
| 229 | \\ |
|---|
| 230 | \class InternalNode extends Node: \\ |
|---|
| 231 | \> \codecomment{superclass of bounding box} \\ |
|---|
| 232 | \> \codecomment{and splitting plane nodes} \\ |
|---|
| 233 | \> Node left\_child \\ |
|---|
| 234 | \> Node right\_child \\ |
|---|
| 235 | \\ |
|---|
| 236 | \class BoundingBoxNode extends InternalNode: \\ |
|---|
| 237 | \> \codecomment{lower and upper corners of the bounding box} \\ |
|---|
| 238 | \> point lower \\ |
|---|
| 239 | \> point upper \\ |
|---|
| 240 | \\ |
|---|
| 241 | \class SplittingNode extends InternalNode: \\ |
|---|
| 242 | \> \codecomment{splitting plane dimension and value} \\ |
|---|
| 243 | \> int split\_dimension \\ |
|---|
| 244 | \> real split\_position \\ |
|---|
| 245 | \\ |
|---|
| 246 | \class LeafNode extends Node: \\ |
|---|
| 247 | \> point[] points\_owned |
|---|
| 248 | \end{pcode} |
|---|
| 249 | \caption{Data structures for \kdtree nodes. |
|---|
| 250 | Each {\tt LeafNode} contains a set of $D$-dimensional {\tt point}s. |
|---|
| 251 | All nodes in a \kdtree are either of type {\tt BoundingBoxNode} or |
|---|
| 252 | {\tt SplittingNode}. |
|---|
| 253 | In practice, more efficient structures should be used; see \secref{sec:impl}.} |
|---|
| 254 | \label{code:datastruct} |
|---|
| 255 | \end{figure} |
|---|
| 256 | |
|---|
| 257 | \begin{figure} |
|---|
| 258 | \begin{pcode} |
|---|
| 259 | \func build\_kdtree(point[] p, int nleaf, bool b\_boxes): \\ |
|---|
| 260 | \> t = new KDTree() \\ |
|---|
| 261 | \> t.root = build\_node(p, nleaf, b\_boxes) \\ |
|---|
| 262 | \> return t \\ |
|---|
| 263 | \\ |
|---|
| 264 | \codecomment{Build a \kdtree node from a set of points.} \\ |
|---|
| 265 | \func build\_node(point[] p, int nleaf, bool boxes): \\ |
|---|
| 266 | \> \codecomment{if the set of points is small enough...} \\ |
|---|
| 267 | \> if p.size() <= nleaf: \\ |
|---|
| 268 | \>\> \codecomment{create a leaf node.} \\ |
|---|
| 269 | \>\> return new LeafNode(p) \\ |
|---|
| 270 | \> \codecomment{else, choose a splitting dimension...} \\ |
|---|
| 271 | \> split\_dim = choose\_split\_dimension(p) \\ |
|---|
| 272 | \> \codecomment{and split the points along that dimension.} \\ |
|---|
| 273 | \> (p\_left, p\_right, split\_pos) = partition(p, split\_dim) \\ |
|---|
| 274 | \> if boxes: \\ |
|---|
| 275 | \>\> n = new BoundingBoxNode() \\ |
|---|
| 276 | \>\> n.lower = min(p) \\ |
|---|
| 277 | \>\> n.upper = max(p) \\ |
|---|
| 278 | \> else: \\ |
|---|
| 279 | \>\> n = new SplittingNode() \\ |
|---|
| 280 | \>\> n.split\_dimension = split\_dim \\ |
|---|
| 281 | \>\> n.split\_position = split\_pos \\ |
|---|
| 282 | \> \codecomment{recurse on the two point sets} \\ |
|---|
| 283 | \> \codecomment{to create the child nodes.} \\ |
|---|
| 284 | \> n.left\_child \spc = build\_node(points\_left, \spc nleaf, boxes) \\ |
|---|
| 285 | \> n.right\_child = build\_node(points\_right, nleaf, boxes) \\ |
|---|
| 286 | \> return n \\ |
|---|
| 287 | \\ |
|---|
| 288 | \codecomment{Typical: split the dimension with largest range.} \\ |
|---|
| 289 | \func choose\_split\_dimension(points): \\ |
|---|
| 290 | \> return argmax(max(points) - min(points)) \\ |
|---|
| 291 | \\ |
|---|
| 292 | \codecomment{Typical: split at the median value.} \\ % to yield a balanced tree.} \\ |
|---|
| 293 | \func partition(points, dim): \\ |
|---|
| 294 | \> \codecomment{find the median value in the splitting dimension} \\ |
|---|
| 295 | \> split\_val = median(points, dim) \\ |
|---|
| 296 | \> \codecomment{grab points on each side of the partition.} \\ |
|---|
| 297 | \> p\_left \spc = [p in points |
|---|
| 298 | where p[dim] <= split\_val] \\ |
|---|
| 299 | \> p\_right = [p in points |
|---|
| 300 | where p[dim] > \spc split\_val] \\ |
|---|
| 301 | \> return (points\_left, points\_right, split\_val) |
|---|
| 302 | \end{pcode} |
|---|
| 303 | \caption{Pseudocode for \kdtree construction.} |
|---|
| 304 | \label{code:build} |
|---|
| 305 | \end{figure} |
|---|
| 306 | |
|---|
| 307 | |
|---|
| 308 | |
|---|
| 309 | \subsection{Distance bounds} |
|---|
| 310 | |
|---|
| 311 | Fundamentally, \kdtree algorithms are about pruning nodes that don't need |
|---|
| 312 | to be examined. The primary tool to achieve this is the \mindist |
|---|
| 313 | function, which provides a lower bound on the distance between a point and |
|---|
| 314 | any point that lives in the region owned by a node. |
|---|
| 315 | Intuitively, this is the distance from a point to the closest corner or closest |
|---|
| 316 | edge of the node. |
|---|
| 317 | |
|---|
| 318 | The form of the \mindist function depends on the representation of a \kdtree |
|---|
| 319 | node. For \kdtrees that have bounding boxes, one computes the |
|---|
| 320 | \mindist to an individual node. For \kdtrees that only store the splitting plane, |
|---|
| 321 | it makes more sense to compute the \mindist to each of the node's two children. |
|---|
| 322 | An illustration is given in \figref{fig:mindist} and pseudocode in |
|---|
| 323 | \figref{code:mindist}. |
|---|
| 324 | |
|---|
| 325 | |
|---|
| 326 | \begin{figure} |
|---|
| 327 | \begin{center}% |
|---|
| 328 | \begin{tabular}{@{}c@{}c@{}} |
|---|
| 329 | \includegraphics[width=0.49\columnwidth]{mindist-bb}& |
|---|
| 330 | \includegraphics[width=0.49\columnwidth]{mindist-split}% |
|---|
| 331 | \end{tabular} |
|---|
| 332 | \end{center} |
|---|
| 333 | \caption{\captionpart{Left:} \mindist for a \kdtree node with bounding boxes. |
|---|
| 334 | \captionpart{Right:} \mindist for a \kdtree node with only the splitting dimension and value, |
|---|
| 335 | the \mindist is zero for the child on the same side of the splitting |
|---|
| 336 | plane as the query point; for the other child the \mindist is simply |
|---|
| 337 | the distance in the splitting dimension from the query to the |
|---|
| 338 | splitting plane.} |
|---|
| 339 | \label{fig:mindist} |
|---|
| 340 | \end{figure} |
|---|
| 341 | |
|---|
| 342 | \begin{figure} |
|---|
| 343 | \begin{pcode} |
|---|
| 344 | \func mindist\_to\_box(point p, BoundingBoxNode n): \\ |
|---|
| 345 | \> mindist = 0 \\ |
|---|
| 346 | \> for each dimension d: \\ |
|---|
| 347 | \>\> if p[d] > n.upper[d]: \\ |
|---|
| 348 | \>\>\> \codecomment{p is above the bounding} \\ |
|---|
| 349 | \>\>\> \codecomment{box in this dimension} \\ |
|---|
| 350 | %\>\>\> \codecomment{p is above the bounding box in this dimension} \\ |
|---|
| 351 | \>\>\> diff = n.upper[d] - p[d] \\ |
|---|
| 352 | \>\> else if p[d] < n.lower[d]: \\ |
|---|
| 353 | \>\>\> \codecomment{p is below the bounding} \\ |
|---|
| 354 | \>\>\> \codecomment{box in this dimension} \\ |
|---|
| 355 | \>\>\> diff = p[d] - n.lower[d] \\ |
|---|
| 356 | \>\> else: \\ |
|---|
| 357 | \>\>\> \codecomment{p is within the bounding} \\ |
|---|
| 358 | \>\>\> \codecomment{box in this dimension} \\ |
|---|
| 359 | %\>\>\> \codecomment{p is within the bounding box in this dimension} \\ |
|---|
| 360 | \>\>\> diff = 0 \\ |
|---|
| 361 | \> mindist += diff * diff \\ |
|---|
| 362 | \> return sqrt(mindist) \\ |
|---|
| 363 | \\ |
|---|
| 364 | \func mindist\_to\_left\_child(point p, SplittingNode n): \\ |
|---|
| 365 | \> d = n.split\_dimension \\ |
|---|
| 366 | \> if p[d] > n.split\_position: \\ |
|---|
| 367 | \>\> \codecomment{point is on the right side} \\ |
|---|
| 368 | \>\> \codecomment{of node's splitting plane} \\ |
|---|
| 369 | %\>\> \codecomment{point is on the right side of node's splitting plane} \\ |
|---|
| 370 | \>\> return p[d] - n.split\_position \\ |
|---|
| 371 | \> return 0 \\ |
|---|
| 372 | \\ |
|---|
| 373 | \func mindist\_to\_right\_child(point p, SplittingNode n): \\ |
|---|
| 374 | \> d = n.split\_dimension \\ |
|---|
| 375 | \> if p[d] < n.split\_position: \\ |
|---|
| 376 | \>\> \codecomment{point is on the left side} \\ |
|---|
| 377 | \>\> \codecomment{of node's splitting plane.} \\ |
|---|
| 378 | %\>\> \codecomment{point is on the left side of node's splitting plane.} \\ |
|---|
| 379 | \>\> return n.split\_position - p[d] \\ |
|---|
| 380 | \> return 0 |
|---|
| 381 | \end{pcode} |
|---|
| 382 | \caption{Pseudocode for \mindist. \captionpart{Top:} for \kdtrees with bounding boxes. |
|---|
| 383 | \captionpart{Bottom:} for \kdtrees with splitting |
|---|
| 384 | planes.} |
|---|
| 385 | \label{code:mindist} |
|---|
| 386 | \end{figure} |
|---|
| 387 | |
|---|
| 388 | \subsection{Nearest neighbour} |
|---|
| 389 | |
|---|
| 390 | Given a set of data points and a query point, solving the nearest neighbour |
|---|
| 391 | problem consists of finding the data point that is closest to the query. |
|---|
| 392 | Specifically, given query point $\mathbf{q}$, find |
|---|
| 393 | \begin{equation} |
|---|
| 394 | \mathbf{x}_{\mathit NN} = \argmin_{i} \| \mathbf{x}_i - \mathbf{q} \|_{2} \quad . |
|---|
| 395 | \end{equation} |
|---|
| 396 | The algorithm is of the \emph{branch and bound} variety |
|---|
| 397 | \cite{land1960}: the tree is traversed, maintaining a \emph{pruning |
|---|
| 398 | distance}---the distance to the nearest data point found so far. In |
|---|
| 399 | addition, the set of nodes that might contain the solution are stored |
|---|
| 400 | in a stack or priority queue ordered by \mindist. If a node is |
|---|
| 401 | encountered whose \mindist to the query point is larger than the |
|---|
| 402 | pruning distance, it is discarded because it cannot contain the |
|---|
| 403 | solution. When a leaf node is encountered, its data points are |
|---|
| 404 | examined individually. If any data point is closer than the pruning |
|---|
| 405 | distance, that point becomes the nearest neighbour found so far, and |
|---|
| 406 | the pruning distance is set to the point's distance to the query |
|---|
| 407 | point. The algorithm begins by setting the pruning distance to |
|---|
| 408 | $+\infty$ and placing the root node on the stack. See the pseudocode |
|---|
| 409 | in \figref{code:nn}. |
|---|
| 410 | |
|---|
| 411 | |
|---|
| 412 | \subsection{Rangesearch} |
|---|
| 413 | |
|---|
| 414 | In the rangesearch problem, the goal is to find all points that are |
|---|
| 415 | within a certain radius of the query point. Specifically, given query |
|---|
| 416 | point $\mathbf{q}$, radius $r$, and a set of points $\mathbb{X}$, find |
|---|
| 417 | the set of points |
|---|
| 418 | \begin{equation} |
|---|
| 419 | \left\{ |
|---|
| 420 | \ \mathbf{x} |
|---|
| 421 | \ \Big| \ \| \mathbf{x} - \mathbf{q} \|_{2} \le r,\ \mathbf{x}\in\mathbb{X} \ |
|---|
| 422 | \right\} |
|---|
| 423 | \quad . |
|---|
| 424 | \end{equation} |
|---|
| 425 | Range search is simpler than nearest neighbour search because the |
|---|
| 426 | pruning distance is fixed. This implies that the list of nodes that |
|---|
| 427 | must be inspected does not depend on the order of tree traversal. At |
|---|
| 428 | each node the \mindist to the query point is computed. If it is |
|---|
| 429 | larger than the search radius, it is pruned because it cannot possibly |
|---|
| 430 | contain any data points within the radius. See the pseudocode |
|---|
| 431 | in \figref{code:rangesearch}. |
|---|
| 432 | |
|---|
| 433 | |
|---|
| 434 | \begin{figure} |
|---|
| 435 | \begin{pcode} |
|---|
| 436 | \codecomment{Input: root of a \kdtree, and a query point q} \\ |
|---|
| 437 | \codecomment{Output: (x, distance(x, q)) such that x is the} \\ |
|---|
| 438 | \codecomment{ nearest neighbour of q in the \kdtree} \\ |
|---|
| 439 | %\codecomment{ x = argmin dist(x, q) for all x in the \kdtree} \\ |
|---|
| 440 | \func nearest\_neighbour(KDTree t, point q): \\ |
|---|
| 441 | %1234567890123456789012\=\kill |
|---|
| 442 | %\> real pruning\_distance=+inf): \\ |
|---|
| 443 | %\tabstops |
|---|
| 444 | \> nodestack.push(t.root) \\ |
|---|
| 445 | \> closest\_so\_far = +inf \\ |
|---|
| 446 | \> nearest\_neighbour = null \\ |
|---|
| 447 | \> while nodestack is not empty: \\ |
|---|
| 448 | \>\> n = nodestack.pop() \\ |
|---|
| 449 | \>\> if mindist(n, q) >= closest\_so\_far: \\ |
|---|
| 450 | \>\>\> \codecomment{this node can be pruned.} \\ |
|---|
| 451 | \>\>\> continue \\ |
|---|
| 452 | \>\> if n.is\_leaf(): \\ |
|---|
| 453 | \>\>\> \codecomment{leaf node: examine individual points} \\ |
|---|
| 454 | \>\>\> for p in n.points\_owned: \\ |
|---|
| 455 | \>\>\>\> d = distance(p, q) \\ |
|---|
| 456 | \>\>\>\> if d < closest\_so\_far: \\ |
|---|
| 457 | \>\>\>\>\> \codecomment{nearest neighbour found so far} \\ |
|---|
| 458 | \>\>\>\>\> closest\_so\_far = d \\ |
|---|
| 459 | \>\>\>\>\> nearest\_neighbour = p \\ |
|---|
| 460 | \>\> else: \\ |
|---|
| 461 | \>\>\> \codecomment{find the mindists to the child nodes} \\ |
|---|
| 462 | \>\>\> mindist\_L = mindist\_to\_left\_child(n, q) \\ |
|---|
| 463 | \>\>\> mindist\_R = mindist\_to\_right\_child(n, q) \\ |
|---|
| 464 | \>\>\> \codecomment{visit the closest child first} \\ |
|---|
| 465 | \>\>\> \codecomment{(by stacking the farther child first)} \\ |
|---|
| 466 | \>\>\> if mindist\_L > mindist\_R: \\ |
|---|
| 467 | \>\>\>\> nodestack.push(n.left\_child) \\ |
|---|
| 468 | \>\>\>\> nodestack.push(n.right\_child) \\ |
|---|
| 469 | \>\>\> else: \\ |
|---|
| 470 | \>\>\>\> nodestack.push(n.right\_child) \\ |
|---|
| 471 | \>\>\>\> nodestack.push(n.left\_child) \\ |
|---|
| 472 | \> return (nearest\_neighbour, closest\_so\_far) |
|---|
| 473 | \end{pcode} |
|---|
| 474 | \caption{Pseudocode for nearest neighbour search. The algorithm maintains |
|---|
| 475 | a stack of nodes to explore and a pruning distance ({\tt |
|---|
| 476 | closest\_so\_far}), which is the distance to the nearest data point |
|---|
| 477 | found so far. Any node whose |
|---|
| 478 | \mindist is greater than the pruning distance is pruned.} |
|---|
| 479 | \label{code:nn} |
|---|
| 480 | \end{figure} |
|---|
| 481 | |
|---|
| 482 | |
|---|
| 483 | %\begin{tikzpicture}[remember picture, overlay] |
|---|
| 484 | % \node [xshift=0.3em, yshift=-1in] (A) at (current page.north) {}; |
|---|
| 485 | % \node [xshift=-0.3125in] (tl) at (A.center) {}; |
|---|
| 486 | % \node [yshift=-8.875in] (bl) at (tl.center) {}; |
|---|
| 487 | % \node [xshift=-3.28125in] (br) at (bl.center) {}; |
|---|
| 488 | % \node [xshift=-3.28125in] (tr) at (tl.center) {}; |
|---|
| 489 | % \draw [blue] (tl) -- (tr) -- (br) -- (bl) -- (tl); |
|---|
| 490 | %\end{tikzpicture} |
|---|
| 491 | |
|---|
| 492 | |
|---|
| 493 | \begin{figure} |
|---|
| 494 | \begin{pcode} |
|---|
| 495 | \codecomment{Input: A \kdtree, a query point q, and a range} \\ |
|---|
| 496 | \codecomment{Output: The set [(x, distance(x, q)), ...] } \\ |
|---|
| 497 | \codecomment{ such that distance(x, q) < range} \\ |
|---|
| 498 | \codecomment{ for all x in the \kdtree} \\ |
|---|
| 499 | \func range\_search(KDTree t, point q, real range): \\ |
|---|
| 500 | \> results = [] \codecomment{empty list} \\ |
|---|
| 501 | \> range\_search\_helper(t.root, q, range, results) \\ |
|---|
| 502 | \> return results \\ |
|---|
| 503 | \\ |
|---|
| 504 | \func range\_search\_helper(Node n, point q, \\ |
|---|
| 505 | 12345678901234567890123\=\kill |
|---|
| 506 | \> real range, list res): \\ |
|---|
| 507 | \tabstops |
|---|
| 508 | \> if is\_leaf(n): \\ |
|---|
| 509 | \>\> for each p in n.points\_owned: \\ |
|---|
| 510 | \>\>\> d = distance(p,q) \\ |
|---|
| 511 | \>\>\> if d < range: \\ |
|---|
| 512 | \>\>\>\> res.append( (p, d) ) \\ |
|---|
| 513 | \>\> return \\ |
|---|
| 514 | \> if mindist\_to\_left\_child(n, q) < range: \\ |
|---|
| 515 | \>\> range\_search\_helper(n.left\_child, \spc q, range, res) \\ |
|---|
| 516 | \> if mindist\_to\_right\_child(n, q) < range: \\ |
|---|
| 517 | \>\> range\_search\_helper(n.right\_child, q, range, res) |
|---|
| 518 | \end{pcode} |
|---|
| 519 | \caption{Pseudocode for range search.} |
|---|
| 520 | \label{code:rangesearch} |
|---|
| 521 | \end{figure} |
|---|
| 522 | |
|---|
| 523 | %For \kdtrees that only store the splitting plane, the algorithm is slightly |
|---|
| 524 | %simpler. Each time we examine a non-leaf node, we determine whether the |
|---|
| 525 | %query point lies to the left or right of the splitting plane. We place the |
|---|
| 526 | %opposite child on the stack, followed by the child that owns the space |
|---|
| 527 | %containing the query point. A lower bound on the distance from the query |
|---|
| 528 | %point to the opposite child is simply the distance in the splitting dimension |
|---|
| 529 | %of the query to the splitting plane. |
|---|
| 530 | |
|---|
| 531 | % Pseudocode... |
|---|
| 532 | |
|---|
| 533 | |
|---|
| 534 | % Good seed: 1197089319 |
|---|
| 535 | |
|---|
| 536 | \begin{figure} |
|---|
| 537 | \begin{center} |
|---|
| 538 | \newlength{\nnfigw} |
|---|
| 539 | \setlength{\nnfigw}{0.25\textwidth} |
|---|
| 540 | \begin{tabular}{@{}c@{}c@{}c@{}} |
|---|
| 541 | \includegraphics[width=\nnfigw]{nnbb01} & |
|---|
| 542 | \includegraphics[width=\nnfigw]{nnbb02} & |
|---|
| 543 | \includegraphics[width=\nnfigw]{nnbb03} \\ |
|---|
| 544 | \includegraphics[width=\nnfigw]{nnbb04} & |
|---|
| 545 | \includegraphics[width=\nnfigw]{nnbb05} & |
|---|
| 546 | \includegraphics[width=\nnfigw]{nnbb06} |
|---|
| 547 | \end{tabular} |
|---|
| 548 | |
|---|
| 549 | \vspace{20pt} |
|---|
| 550 | |
|---|
| 551 | \begin{tabular}{@{}c@{}c@{}c@{}} |
|---|
| 552 | \includegraphics[width=\nnfigw]{nnsplit01} & |
|---|
| 553 | \includegraphics[width=\nnfigw]{nnsplit02} & |
|---|
| 554 | \includegraphics[width=\nnfigw]{nnsplit03} \\ |
|---|
| 555 | \includegraphics[width=\nnfigw]{nnsplit04} & |
|---|
| 556 | \includegraphics[width=\nnfigw]{nnsplit05} & |
|---|
| 557 | \includegraphics[width=\nnfigw]{nnsplit06} |
|---|
| 558 | \end{tabular} |
|---|
| 559 | \end{center} |
|---|
| 560 | \caption{\captionpart{Top:} Nearest-neighbour search with bounding boxes. |
|---|
| 561 | \captionpart{Bottom:} Nearest-neighbour search with splitting planes. |
|---|
| 562 | The node currently being examined is |
|---|
| 563 | shown in bold. In this example, |
|---|
| 564 | the first leaf node explored contains the nearest neighbour and all |
|---|
| 565 | remaining nodes can be pruned.} |
|---|
| 566 | \label{fig:nn} |
|---|
| 567 | \end{figure} |
|---|
| 568 | |
|---|
| 569 | %\clearpage |
|---|
| 570 | |
|---|
| 571 | \subsection{Approximations} |
|---|
| 572 | |
|---|
| 573 | \Kdtrees are applicable to approximate as well as exact algorithms. |
|---|
| 574 | A prominent example is the \emph{Best Bin First} (BBF) algorithm used by many |
|---|
| 575 | SIFT-based vision systems \cite{beis1997}. BBF is essentially \kdtree |
|---|
| 576 | nearest-neighbour search where the tree traversal is ordered by a priority queue. |
|---|
| 577 | By terminating the search after exploring some fixed number of leaf nodes, |
|---|
| 578 | an approximate nearest neighbour is produced quickly. |
|---|
| 579 | |
|---|
| 580 | There is an elegant solution to approximate kernel density estimation |
|---|
| 581 | which looks similar to the nearest-neighbour algorithm. It refines |
|---|
| 582 | upper and lower bounds on the kernel density with a series of |
|---|
| 583 | \mindist and |
|---|
| 584 | \maxdist calculations. The bounds start loose and tighten as the tree is |
|---|
| 585 | traversed. The algorithm terminates when the interval is small |
|---|
| 586 | enough. |
|---|
| 587 | |
|---|
| 588 | |
|---|
| 589 | %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% |
|---|
| 590 | \section{Efficient Implementation Tricks} |
|---|
| 591 | \label{sec:impl} |
|---|
| 592 | |
|---|
| 593 | |
|---|
| 594 | In most cases, \kdtrees are built from a static set of data points |
|---|
| 595 | and then queried millions of times.\footnote{There are some exceptions |
|---|
| 596 | where the tree is updated online, but they are not discussed |
|---|
| 597 | here.} A number of opportunities arise when points are not added or |
|---|
| 598 | removed after the \kdtree is built. |
|---|
| 599 | |
|---|
| 600 | |
|---|
| 601 | This section presents a series of optimization ``tricks'' taken from |
|---|
| 602 | the battle-hardened \kdtree implementation at the heart of \an. |
|---|
| 603 | These tricks yield an order of magnitude speed increase and vastly |
|---|
| 604 | reduce the memory required compared to the alternatives available. |
|---|
| 605 | |
|---|
| 606 | |
|---|
| 607 | In this section, the dimension of the data points is $D$, the number |
|---|
| 608 | of data points is $N$, and the number of nodes in the \kdtree is $M$. |
|---|
| 609 | All trees are binary and complete. $N$ is large relative to $D$; for |
|---|
| 610 | example in the authors' application $D$ is 3 or 4, and $N$ is 10 to |
|---|
| 611 | 100 million. For us, the maximum tree size that fits in a 32-bit |
|---|
| 612 | address space is an important design parameter. |
|---|
| 613 | |
|---|
| 614 | Not all of these implementation ``tricks'' apply in all cases, but each one |
|---|
| 615 | helps. Although most of the tricks are concerned with reducing the memory |
|---|
| 616 | requirements of the \kdtree, this often has the effect of increasing the |
|---|
| 617 | speed as well. |
|---|
| 618 | |
|---|
| 619 | |
|---|
| 620 | \trick{Store the data points as a flat array.} |
|---|
| 621 | \label{trick:flatarray} |
|---|
| 622 | Use a one-dimensional array of size $ND$, where coordinate $j$ of data |
|---|
| 623 | point $i$ is stored |
|---|
| 624 | %\linebreak[4] |
|---|
| 625 | at location $iD + j$. For example, in three dimensions |
|---|
| 626 | the array will contain |
|---|
| 627 | \[ (x_0, y_0, z_0, x_1, y_1, z_1, \ldots) .\] |
|---|
| 628 | |
|---|
| 629 | It may be tempting to store the data points as an $N \times D$ |
|---|
| 630 | two-dimensional array. This incurs a memory overhead of $N$ pointers, |
|---|
| 631 | and at runtime requires a pointer dereference. To store one million |
|---|
| 632 | points in 3 dimensions ($N=10^{6}, D=3$), our {\tt libkd} requires an |
|---|
| 633 | overhead of 4 bytes, while the existing implementations {\tt ANN} and |
|---|
| 634 | {\tt simkd} use over $4$ megabytes ($8$ megabytes on 64-bit) and over |
|---|
| 635 | $20$ megabytes (40 on 64-bit) respectively. |
|---|
| 636 | |
|---|
| 637 | \trick{Create a complete tree.} |
|---|
| 638 | \label{trick:complete} |
|---|
| 639 | Instead of splitting nodes until each leaf node contains some maximum |
|---|
| 640 | number of points, fix the number of tree levels and create a complete |
|---|
| 641 | tree. The huge advantage of this trick is that the number of nodes is |
|---|
| 642 | known in advance, which allows the next trick to be used without |
|---|
| 643 | wasting any memory. |
|---|
| 644 | |
|---|
| 645 | There are disadvantages to this trick. First, the number of data |
|---|
| 646 | points in the leaf nodes is only adjustable by factors of two. |
|---|
| 647 | Second, if nodes are not split at the median value, then some leaf |
|---|
| 648 | nodes will contain more data points than others. |
|---|
| 649 | |
|---|
| 650 | \trick{Don't use pointers to connect nodes.} |
|---|
| 651 | \label{trick:nopointers} |
|---|
| 652 | Instead, put the nodes in a single array. Use the heap indexing |
|---|
| 653 | strategy shown below instead of explicit pointers. |
|---|
| 654 | The root is node $0$. The left child of node $i$ is node $2i + 1$, |
|---|
| 655 | and the right child is node $2i + 2$. |
|---|
| 656 | %This works because \kdtrees are binary and complete. |
|---|
| 657 | |
|---|
| 658 | \nonumberparagraphs |
|---|
| 659 | \begin{center} |
|---|
| 660 | \includegraphics[width=\pointerlessfigwidth]{pointerless-fig} |
|---|
| 661 | \end{center} |
|---|
| 662 | \numberparagraphs |
|---|
| 663 | |
|---|
| 664 | For a tree with $M$ nodes, this saves about $M$ pointers, and places |
|---|
| 665 | sibling nodes---which are likely to be accessed at the same |
|---|
| 666 | time---next to each other in memory. This locality of reference |
|---|
| 667 | improves cache performance. |
|---|
| 668 | |
|---|
| 669 | |
|---|
| 670 | A convenient side-effect of using this trick is that the \kdtree is |
|---|
| 671 | position-independent: the array of nodes can be moved in memory without |
|---|
| 672 | changing any of the contents. Indeed, it can be written directly to |
|---|
| 673 | disk in `live' form and restored at a later date. By using the |
|---|
| 674 | {\tt mmap()} system call, the live on-disk representation is |
|---|
| 675 | immediately available for queries; the time required to read it |
|---|
| 676 | from disk will be amortized over the subsequent queries. Regions of |
|---|
| 677 | space that are never queried may never be read from disk. |
|---|
| 678 | |
|---|
| 679 | |
|---|
| 680 | \trick{Pivot the data points while building the tree.} |
|---|
| 681 | \label{trick:pivot} |
|---|
| 682 | When building the tree, pivot the data along the splitting dimension |
|---|
| 683 | so that the data points owned by the child nodes are contiguous in |
|---|
| 684 | memory. Represent the set of data points owned by a node as leftmost |
|---|
| 685 | and rightmost offsets into the data array ($L$ and $R$). Keep a |
|---|
| 686 | permutation array to map from the final array indices back to the |
|---|
| 687 | original indices. |
|---|
| 688 | |
|---|
| 689 | \nonumberparagraphs |
|---|
| 690 | \begin{center} |
|---|
| 691 | \includegraphics[width=\permutefigwidth]{permute-fig} |
|---|
| 692 | \end{center} |
|---|
| 693 | \numberparagraphs |
|---|
| 694 | |
|---|
| 695 | Once this has been done, the data points owned by leaf nodes are |
|---|
| 696 | contiguous in memory. In addition, leaf nodes that are nearby in |
|---|
| 697 | space are likely to have their data points stored nearby in memory. |
|---|
| 698 | This increases locality of reference and therefore performance. |
|---|
| 699 | |
|---|
| 700 | |
|---|
| 701 | |
|---|
| 702 | \trick{Don't use {\tt C++} virtual functions.} |
|---|
| 703 | \label{trick:nocpp} |
|---|
| 704 | It may be tempting to define a class hierarchy with an abstract base |
|---|
| 705 | class {\tt Node} and classes {\tt InternalNode} and {\tt LeafNode} |
|---|
| 706 | which inherit from it (as in \secref{sec:datastruct}). Compilers |
|---|
| 707 | implement this by adding a {\tt vtable} pointer to each {\tt |
|---|
| 708 | InternalNode} and {\tt LeafNode} object. This enlarges each node by |
|---|
| 709 | the size of a pointer. This wastes 32 (or 64) bits of memory per |
|---|
| 710 | node, since the node type is completely determined by its position in |
|---|
| 711 | the tree. Using the node layout shown above, node $i$ is a leaf if $i |
|---|
| 712 | \ge M/2 - 1$ (where $M$ is the total number of nodes). Using |
|---|
| 713 | inheritance also incurs an indirect function call for each virtual |
|---|
| 714 | function, which itself costs time and prevents the compiler from |
|---|
| 715 | performing inlining optimizations. |
|---|
| 716 | |
|---|
| 717 | |
|---|
| 718 | |
|---|
| 719 | |
|---|
| 720 | \trick{Consider discarding the permutation array.} |
|---|
| 721 | \label{trick:noperm} |
|---|
| 722 | |
|---|
| 723 | Apply the permutation array to auxiliary data to make the \kdtree's |
|---|
| 724 | data order the canonical order. |
|---|
| 725 | For example, you might have a set of data points where each point has |
|---|
| 726 | a class label such as ``cat'' or ``dog''. After creating a \kdtree |
|---|
| 727 | from the data points, the points are permuted so that the $i$th data |
|---|
| 728 | point no longer corresponds to label $i$. However, if you apply the |
|---|
| 729 | inverse of the \kdtree's permutation array to the labels, they will |
|---|
| 730 | again correspond. You can then discard the \kdtree's permutation |
|---|
| 731 | array. This eliminates a level of indirection and a significant |
|---|
| 732 | amount of memory ($N$ integers). |
|---|
| 733 | |
|---|
| 734 | \nonumberparagraphs |
|---|
| 735 | \begin{center} |
|---|
| 736 | \includegraphics[width=\applypermfigwidth]{apply-perm-fig} |
|---|
| 737 | \end{center} |
|---|
| 738 | \numberparagraphs |
|---|
| 739 | |
|---|
| 740 | |
|---|
| 741 | % This is frequently the case when the index of the data serves as a |
|---|
| 742 | % label. Consider a set of data points where each point has a class label |
|---|
| 743 | % such as ``cat'' or ``dog'' associated with it. After creating a \kdtree |
|---|
| 744 | %from the data points, they will be permuted so that index $i$ in the array |
|---|
| 745 | %of data points no longer corresponds to index $i$ in the array of labels. |
|---|
| 746 | %The \kdtree's permutation array can be used to map the \kdtree's ordering |
|---|
| 747 | %back to the original ordering. Alternatively, you can apply the inverse |
|---|
| 748 | %permutation to the array of labels. This eliminates a level of indirection |
|---|
| 749 | %and a significant amount of memory from the tree. |
|---|
| 750 | |
|---|
| 751 | |
|---|
| 752 | \trick{Store only the rightmost offset of points owned by a node.} |
|---|
| 753 | \label{trick:ronly} |
|---|
| 754 | |
|---|
| 755 | The $L$ and $R$ offsets describe the range of data points owned by a |
|---|
| 756 | node. Within a level of the \kdtree, the $L$ offset of a node is |
|---|
| 757 | simply one greater than the $R$ offset of the node just to the left, |
|---|
| 758 | or zero if there is no node to the left. Since the $L$ value can be |
|---|
| 759 | trivially computed from the $R$ value, there is no need to store both. |
|---|
| 760 | This saves $M$ integers. |
|---|
| 761 | |
|---|
| 762 | |
|---|
| 763 | \nonumberparagraphs |
|---|
| 764 | \begin{center} |
|---|
| 765 | \includegraphics[width=\ronlyfigwidth]{r-only-fig} |
|---|
| 766 | \end{center} |
|---|
| 767 | \numberparagraphs |
|---|
| 768 | |
|---|
| 769 | \trick{Don't store the $R$ offsets of internal nodes.} |
|---|
| 770 | Only leaf nodes need to store the $R$ offsets. |
|---|
| 771 | The leftmost data point $L$ of a non-leaf node is just the $L$ value of its |
|---|
| 772 | leftmost leaf. Similarly, the rightmost data point $R$ of a non-leaf node is |
|---|
| 773 | $R$ of its rightmost leaf. Try saying that ten times fast! |
|---|
| 774 | This saves $M/2$ integers. |
|---|
| 775 | |
|---|
| 776 | %The savings are approximately $1/2\cdot${\tt sizeof(offset)}. |
|---|
| 777 | |
|---|
| 778 | |
|---|
| 779 | \trick{With median splits, don't store the $R$ offsets.} |
|---|
| 780 | \label{trick:noR} |
|---|
| 781 | Compute them instead. If the \kdtree is built by splitting at the |
|---|
| 782 | median value (the common case), then the number of data points owned |
|---|
| 783 | by each node in the tree is a function of the total number of data |
|---|
| 784 | points, independent of the values of the data points. Computing the |
|---|
| 785 | offsets costs $\mathcal{O}(\log M)$\footnote{There {\em may} be an |
|---|
| 786 | $\mathcal{O}(1)$ algorithm, though the authors haven't found it.} if |
|---|
| 787 | exact median splits are used and a fixed rounding direction is chosen |
|---|
| 788 | (\ie, the right subtree gets the extra data point when the number of |
|---|
| 789 | data points is odd). However, by choosing the rounding direction |
|---|
| 790 | carefully---by moving the splitting position by one place---the data |
|---|
| 791 | points can be distributed so that the computation is $\mathcal{O}(1)$. |
|---|
| 792 | |
|---|
| 793 | |
|---|
| 794 | |
|---|
| 795 | \trick{Consider transposing the data structures.} |
|---|
| 796 | \label{trick:transpose} |
|---|
| 797 | Instead of creating an array of {\tt node} structures, pull the |
|---|
| 798 | node contents directly into the {\tt kdtree} structure, creating a |
|---|
| 799 | separate array for each element. |
|---|
| 800 | |
|---|
| 801 | \nonumberparagraphs |
|---|
| 802 | \begin{center} |
|---|
| 803 | \includegraphics[width=\transposefigwidth]{transpose-fig} |
|---|
| 804 | \end{center} |
|---|
| 805 | \numberparagraphs |
|---|
| 806 | |
|---|
| 807 | When a compiler encounters a structure such as {\tt node} above, |
|---|
| 808 | it pads the structure so that its total size is a multiple of the size of |
|---|
| 809 | the largest element. For {\tt node}, the structure will be padded to 16 |
|---|
| 810 | bytes, even though only 9 bytes are used. It is possible to force the |
|---|
| 811 | compiler to pack the structure tightly, but this results in unaligned memory |
|---|
| 812 | accesses (many of the {\tt double} values will not start on a multiple of eight |
|---|
| 813 | bytes), which incurs significant overhead. |
|---|
| 814 | |
|---|
| 815 | |
|---|
| 816 | The advantage of transposing the data structures is that the memory required |
|---|
| 817 | can be minimized without destroying the structure alignment. The disadvantage |
|---|
| 818 | is a loss of locality of reference: typically the members of a structure are |
|---|
| 819 | accessed at the same time, and transposing them results in the members being |
|---|
| 820 | dispersed in memory. |
|---|
| 821 | |
|---|
| 822 | %\trick{Consider converting the data points to a smaller type.} |
|---|
| 823 | \trick{Consider using a smaller data type.} |
|---|
| 824 | \label{trick:smalltypes} |
|---|
| 825 | For most applications, using {\tt double} values to represent points |
|---|
| 826 | in $\left[0,1\right]$ is not an effective use of bits. Consider |
|---|
| 827 | transforming the data to a smaller integer representation. If the |
|---|
| 828 | data points live in a small bounded space, then converting {\tt |
|---|
| 829 | double} values to 32-bit integers saves a factor of two of space with |
|---|
| 830 | very little loss of precision. Converting to 16- or 8-bit integers |
|---|
| 831 | saves more memory but the effect of introducing this approximation |
|---|
| 832 | should be considered in the context of the problem at hand. |
|---|
| 833 | |
|---|
| 834 | |
|---|
| 835 | The boundary between the \kdtree software library and the application |
|---|
| 836 | is an ideal place to do this transformation: the application need not |
|---|
| 837 | know that the \kdtree library represents the data points in a smaller |
|---|
| 838 | format. The \kdtree library converts query values into the smaller |
|---|
| 839 | format on the way into the library, and converts results back to the |
|---|
| 840 | original format on the way out. The only user-visible change should |
|---|
| 841 | be that the data points occupy much less memory, queries are faster, |
|---|
| 842 | and there may be small approximation errors. |
|---|
| 843 | |
|---|
| 844 | |
|---|
| 845 | \trick{Consider bit-packing the splitting value and dimension.} |
|---|
| 846 | \label{trick:packbits} |
|---|
| 847 | Since \kdtrees are only suitable for low-dimensional data, the |
|---|
| 848 | splitting dimension only requires a few bits. Instead of storing it |
|---|
| 849 | in a separate array, store it in the bottom few bits of the splitting |
|---|
| 850 | value. For example, with four-dimensional data points, the splitting |
|---|
| 851 | dimension can be stored in the bottom two bits of a 16-bit integer, |
|---|
| 852 | leaving 14 bits to store the splitting position. |
|---|
| 853 | |
|---|
| 854 | \nonumberparagraphs |
|---|
| 855 | \begin{center} |
|---|
| 856 | \includegraphics[width=\bitpackfigwidth]{bitpack-fig} |
|---|
| 857 | \end{center} |
|---|
| 858 | \numberparagraphs |
|---|
| 859 | |
|---|
| 860 | In general, this trick conflicts with trick of not storing the $R$ |
|---|
| 861 | offsets (\secref{trick:noR}), since that trick requires precise |
|---|
| 862 | control over the splitting plane location, while this trick involves |
|---|
| 863 | sacrificing some precision to save space. For some data point sets, |
|---|
| 864 | this conflict may not arise, and both tricks may be used |
|---|
| 865 | simultaneously. |
|---|
| 866 | |
|---|
| 867 | |
|---|
| 868 | \trick{Consider throwing away the data.} |
|---|
| 869 | If your data points are associated with some other information (such |
|---|
| 870 | as labels), and your application can tolerate some false positive |
|---|
| 871 | results, consider the tradeoffs of discarding the data and keeping the |
|---|
| 872 | \kdtree. By using the other tricks in this section, the memory |
|---|
| 873 | requirements of a \kdtree can be reduced to a small fraction of the |
|---|
| 874 | size of the data points themselves. In some applications, the benefit |
|---|
| 875 | of being able to search more data points might outweigh the cost of |
|---|
| 876 | not being able to say exactly where the data points were. |
|---|
| 877 | |
|---|
| 878 | Given $N$ data points, we can build a \kdtree with $N-1$ total |
|---|
| 879 | splitting plane nodes. Each node contains a splitting value and |
|---|
| 880 | possibly the splitting dimension: it is at most slightly larger than a |
|---|
| 881 | single data value. The entire tree requires about $1/D$ as much |
|---|
| 882 | memory as the data points themselves. |
|---|
| 883 | |
|---|
| 884 | The \kdtree search algorithms then produce bounded approximations. |
|---|
| 885 | For nearest-neighbour search, the algorithm can produce the index of the |
|---|
| 886 | data point that lives in the same box as the query point, an upper bound |
|---|
| 887 | on the distance to that point, and a lower bound to the distance to the |
|---|
| 888 | true nearest neighbour. It is also possible to produce a small list of |
|---|
| 889 | points that is guaranteed to contain the nearest neighbour. |
|---|
| 890 | |
|---|
| 891 | |
|---|
| 892 | |
|---|
| 893 | \section{Speed Comparison} |
|---|
| 894 | \label{sec:speed} |
|---|
| 895 | |
|---|
| 896 | This section gives empirical evidence that the tricks presented above |
|---|
| 897 | can yield significant improvements. We compare our implementation, |
|---|
| 898 | {\tt libkd}, with two previously released \kdtree implementations: |
|---|
| 899 | {\tt ANN} \cite{arya1998} and {\tt simkd} ({\tt Simple \kdtrees}) |
|---|
| 900 | \cite{simkd}. We used these packages `out of the box', using the default compiler settings |
|---|
| 901 | and choices about how to build the \kdtree. |
|---|
| 902 | |
|---|
| 903 | We chose a very simple benchmark test: the data points are 5 million |
|---|
| 904 | samples drawn uniformly from the unit cube ($N=5 \times 10^6, D=3$). |
|---|
| 905 | The number of points owned by a leaf node was set to $14$ for {\tt |
|---|
| 906 | ANN} and {\tt simkd} in order to make the number of nodes in the three |
|---|
| 907 | implementations approximately equal. The query points are one million |
|---|
| 908 | samples from the same distribution. We tested the one-nearest |
|---|
| 909 | neighbour algorithm. We would have preferred to use larger $N$, since |
|---|
| 910 | this is the regime in which our \an project operates, but the other |
|---|
| 911 | libraries were not able to build such large trees. We present results |
|---|
| 912 | for a 32-bit machine% |
|---|
| 913 | \footnote{Intel Xeon (SL72G) at 3.06 GHz with 4 GB of RAM.} |
|---|
| 914 | and a 64-bit machine% |
|---|
| 915 | \footnote{AMD Opteron 8220 SE at 2.8 GHz with 32 GB of RAM.} |
|---|
| 916 | to show how the memory requirements differ. |
|---|
| 917 | |
|---|
| 918 | %\nonumberparagraphs |
|---|
| 919 | |
|---|
| 920 | \begin{table} |
|---|
| 921 | \begin{center} |
|---|
| 922 | \newcommand{\ftt}[1]{{\tt #1}} |
|---|
| 923 | \newcommand{\minitab}[2][l]{\begin{tabular}{#1}#2\end{tabular}} |
|---|
| 924 | %\newcommand{\ftt}[1]{{\scriptsize\tt #1}} |
|---|
| 925 | %\begin{footnotesize} |
|---|
| 926 | \begin{tabular}{|l|c@{\hspace{5pt}}c|c@{\hspace{5pt}}c|} |
|---|
| 927 | \hline |
|---|
| 928 | %\multirow{3}{*}{{\bf Implementation}} & |
|---|
| 929 | \multirow{3}{*}{\minitab[c]{{\bf Implementation}}} & |
|---|
| 930 | % |
|---|
| 931 | \multicolumn{2}{|c|}{{\bf Speed}} & |
|---|
| 932 | \multicolumn{2}{|c|}{{\bf Memory}} \\ |
|---|
| 933 | %& \multicolumn{2}{|c|}{(1000 queries/sec)} & |
|---|
| 934 | & \multicolumn{2}{|c|}{(k q/sec)} & |
|---|
| 935 | \multicolumn{2}{|c|}{data + tree = total (Mbytes)} \\ |
|---|
| 936 | & 32-bit & 64-bit & 32-bit & 64-bit \\ |
|---|
| 937 | \hline |
|---|
| 938 | %{\tt simkd} & 47 & 39 & 370 & 486 \\ |
|---|
| 939 | %{\tt ANN} & 71 & 90 & 187 & 221 \\ |
|---|
| 940 | %\hline |
|---|
| 941 | %{\tt libkd-d-box} & 127 & 144 & 172 & 172 \\ |
|---|
| 942 | %{\tt libkd-d-split} & 231 & 284 & 126 & 126 \\ |
|---|
| 943 | %{\tt libkd-d-noR} & 239 & 293 & 125 & 125 \\ |
|---|
| 944 | %{\tt libkd-i-split} & 242 & 311 & 64 & 64 \\ |
|---|
| 945 | %{\tt libkd-i-noR} & 240 & 326 & 63 & 63 \\ |
|---|
| 946 | %{\tt libkd-s-split} & 328 & 386 & 33 & 33 \\ |
|---|
| 947 | %{\tt libkd-s-noR} & 307 & 396 & 32 & 32 \\ |
|---|
| 948 | \ftt{simkd} & 47 & 39 & 120 + 250=370 & 120 + 366=486 \\ |
|---|
| 949 | \ftt{ANN} & 71 & 90 & 120 + 67=187 & 120 + 101=221 \\ |
|---|
| 950 | \hline |
|---|
| 951 | \ftt{libkd-d-box} & 127 & 144 & \multicolumn{2}{|c|}{120 + 52 = 172} \\ |
|---|
| 952 | \ftt{libkd-d-split} & 231 & 284 & \multicolumn{2}{|c|}{120 + 6 = 126} \\ |
|---|
| 953 | \ftt{libkd-d-noR} & 239 & 293 & \multicolumn{2}{|c|}{120 + 5 = 125} \\ |
|---|
| 954 | \ftt{libkd-i-split} & 242 & 311 & \multicolumn{2}{|c|}{ 60 + 4 = 64} \\ |
|---|
| 955 | \ftt{libkd-i-noR} & 240 & 326 & \multicolumn{2}{|c|}{ 60 + 3 = 63} \\ |
|---|
| 956 | \ftt{libkd-s-split} & 328 & 386 & \multicolumn{2}{|c|}{ 30 + 3 = 33} \\ |
|---|
| 957 | \ftt{libkd-s-noR} & 307 & 396 & \multicolumn{2}{|c|}{ 30 + 2 = 32} \\ |
|---|
| 958 | \hline |
|---|
| 959 | \end{tabular} |
|---|
| 960 | \end{center} |
|---|
| 961 | %\end{footnotesize} |
|---|
| 962 | %\vspace{5pt} |
|---|
| 963 | \caption{Benchmark results. |
|---|
| 964 | Speed is measured in thousands of queries per second (k q/sec), memory |
|---|
| 965 | in megabytes. The memory values include the memory required to store |
|---|
| 966 | the data points themselves, which is $120$ MB when {\tt double}s are |
|---|
| 967 | used. For the {\tt libkd} entries, {\tt d} indicates that the data |
|---|
| 968 | points are stored as {\tt double}, {\tt i} indicates 32-bit integers, |
|---|
| 969 | and {\tt s} indicates 16-bit integers (using |
|---|
| 970 | trick \ref{trick:smalltypes}). {\tt box} means that bounding-boxes |
|---|
| 971 | are created; otherwise splitting-planes are used. The {\tt noR} |
|---|
| 972 | entries indicate that the we avoid storing the offset arrays |
|---|
| 973 | (trick \ref{trick:noR}). We have also assumed that trick |
|---|
| 974 | \ref{trick:noperm} is applicable, so the permutation arrays are not stored.} |
|---|
| 975 | \label{table:results} |
|---|
| 976 | \end{table} |
|---|
| 977 | |
|---|
| 978 | %\numberparagraphs |
|---|
| 979 | |
|---|
| 980 | Table \ref{table:results} shows our results. Our implementation, {\tt |
|---|
| 981 | libkd}, always takes less memory and produces faster results. The |
|---|
| 982 | memory requirements of {\tt ANN} and {\tt simkd} increase when using a |
|---|
| 983 | 64-bit machine, while {\tt libkd} stays fixed because we use no |
|---|
| 984 | extraneous pointers. Note that the data points themselves require |
|---|
| 985 | $120$ MB of space, so the memory overhead of a {\tt libkd} tree is |
|---|
| 986 | only a few percent, compared to $50$ to $300$ percent for {\tt ANN} |
|---|
| 987 | and {\tt simkd}. |
|---|
| 988 | |
|---|
| 989 | In this test, splitting planes are much faster than bounding boxes. |
|---|
| 990 | Using trick \ref{trick:noR} to avoid storing the offset arrays both |
|---|
| 991 | reduces the memory required and slightly increases the speed. Using |
|---|
| 992 | trick \ref{trick:smalltypes} to use smaller data types (32- and 16-bit |
|---|
| 993 | integers) reduces the memory footprint by a factor of two or four, |
|---|
| 994 | with a corresponding increase in speed. It also incurs some |
|---|
| 995 | approximation error. In this test, we found that when using 32-bit |
|---|
| 996 | integers, we still always produced the correct nearest neighbour, and |
|---|
| 997 | the absolute error in the distance estimates was never more than |
|---|
| 998 | $10^{-9}$. When using 16-bit integers, we produced the correct |
|---|
| 999 | nearest neighbour 99.5\% of the time, and the absolute distance error |
|---|
| 1000 | was less than $10^{-4}$. |
|---|
| 1001 | |
|---|
| 1002 | % actual approximation errors; 4x10^{-10} and 8x10^{-5}. |
|---|
| 1003 | |
|---|
| 1004 | |
|---|
| 1005 | In summary, these results show that, on this benchmark, by using the |
|---|
| 1006 | tricks presented {\tt libkd} is three times faster and incurs |
|---|
| 1007 | one-fifteenth the memory overhead compared to the closest competitor. |
|---|
| 1008 | While benchmarks are not listed among the three canonical types of |
|---|
| 1009 | lies (``lies, damned lies, and statistics''), we urge practitioners to |
|---|
| 1010 | test these implementations on their own data. Our testing and |
|---|
| 1011 | benchmarking software is available for download. |
|---|
| 1012 | |
|---|
| 1013 | |
|---|
| 1014 | % cluster59: |
|---|
| 1015 | % (?) http://processorfinder.intel.com/Details.aspx?sSpec=SLABS |
|---|
| 1016 | |
|---|
| 1017 | % cluster23: |
|---|
| 1018 | % 2 * Intel(R) Xeon(TM) CPU 3.06GHz |
|---|
| 1019 | % S-spec SL72G |
|---|
| 1020 | % http://processorfinder.intel.com/details.aspx?sSpec=SL72G |
|---|
| 1021 | % 4 GB RAM |
|---|
| 1022 | % 32-bit |
|---|
| 1023 | % libkd: -O3 |
|---|
| 1024 | % -DNDEBUG |
|---|
| 1025 | % -march=pentium4 |
|---|
| 1026 | % -fomit-frame-pointer |
|---|
| 1027 | % -fno-math-errno |
|---|
| 1028 | % -fno-trapping-math |
|---|
| 1029 | % -fno-signaling-nans |
|---|
| 1030 | % -ffinite-math-only |
|---|
| 1031 | % ann: -O3 |
|---|
| 1032 | % simkd: -O2 |
|---|
| 1033 | |
|---|
| 1034 | % testB.in |
|---|
| 1035 | % 5M points in 3D |
|---|
| 1036 | % 120 MB data |
|---|
| 1037 | % ann, simkd: bucket size 14 |
|---|
| 1038 | % libkd: bucket size 10 |
|---|
| 1039 | |
|---|
| 1040 | % simkd: |
|---|
| 1041 | % 516,748 leaves |
|---|
| 1042 | % 370 MB including data |
|---|
| 1043 | |
|---|
| 1044 | % ann: |
|---|
| 1045 | % 516,748 leaves |
|---|
| 1046 | % 187 MB including data |
|---|
| 1047 | |
|---|
| 1048 | % libkd: |
|---|
| 1049 | % 524,288 leaves |
|---|
| 1050 | % 172 MB ddd/bbox |
|---|
| 1051 | % 126 MB ddd/split |
|---|
| 1052 | % 125 MB ddd/nolr |
|---|
| 1053 | % 64 MB duu/split |
|---|
| 1054 | % 63 MB duu/splitdim/linearlr |
|---|
| 1055 | % 33 MB dss/split |
|---|
| 1056 | % 32 MB dss/splitdim/linearlr |
|---|
| 1057 | |
|---|
| 1058 | % speed: q/sec: |
|---|
| 1059 | % 47 simkd |
|---|
| 1060 | % 71 ann |
|---|
| 1061 | % |
|---|
| 1062 | % 127 ddd/bbox |
|---|
| 1063 | % 231 ddd/split |
|---|
| 1064 | % (203) ddd/nolr |
|---|
| 1065 | % 239 ddd/linearlr |
|---|
| 1066 | % |
|---|
| 1067 | % 242 duu/split |
|---|
| 1068 | % (233) duu/splitdim |
|---|
| 1069 | % (205) duu/splitdim/nolr |
|---|
| 1070 | % 240 duu/splitdim/linearlr |
|---|
| 1071 | % |
|---|
| 1072 | % 328 dss/split |
|---|
| 1073 | % (303) dss/splitdim |
|---|
| 1074 | % (255) dss/splitdim/nolr |
|---|
| 1075 | % 307 dss/splitdim/linearlr |
|---|
| 1076 | |
|---|
| 1077 | |
|---|
| 1078 | |
|---|
| 1079 | % cluster27: |
|---|
| 1080 | % 4 * AMD Opteron(tm) Processor 8220 SE |
|---|
| 1081 | % 32 GB RAM |
|---|
| 1082 | % 64-bit |
|---|
| 1083 | % libkd: -O3 |
|---|
| 1084 | % -DNDEBUG |
|---|
| 1085 | % -march=k8 -m64 |
|---|
| 1086 | % -fomit-frame-pointer |
|---|
| 1087 | % -fno-signaling-nans |
|---|
| 1088 | % -ffinite-math-only |
|---|
| 1089 | % -fno-math-errno |
|---|
| 1090 | % -fno-trapping-math |
|---|
| 1091 | % ann: -O3 |
|---|
| 1092 | % simkd: -O2 |
|---|
| 1093 | |
|---|
| 1094 | % testB.in |
|---|
| 1095 | % 5M points in 3D |
|---|
| 1096 | % 120 MB data |
|---|
| 1097 | % ann, simkd: bucket size 14 |
|---|
| 1098 | % libkd: bucket size 10 |
|---|
| 1099 | |
|---|
| 1100 | % simkd: |
|---|
| 1101 | % 516,865 leaves |
|---|
| 1102 | % 486 MB including data |
|---|
| 1103 | |
|---|
| 1104 | % ann: |
|---|
| 1105 | % 516,785 leaves |
|---|
| 1106 | % 221 MB including data |
|---|
| 1107 | |
|---|
| 1108 | % libkd: |
|---|
| 1109 | % 524,288 leaves |
|---|
| 1110 | % 172 MB ddd/bbox |
|---|
| 1111 | % 126 MB ddd/split |
|---|
| 1112 | % 125 MB ddd/nolr |
|---|
| 1113 | % 64 MB duu/split |
|---|
| 1114 | % 63 MB duu/splitdim/linearlr |
|---|
| 1115 | % 33 MB dss/split |
|---|
| 1116 | % 32 MB dss/splitdim/linearlr |
|---|
| 1117 | |
|---|
| 1118 | % speed: q/sec: |
|---|
| 1119 | % 39 simkd |
|---|
| 1120 | % 90 ann |
|---|
| 1121 | % |
|---|
| 1122 | % 144 ddd/bbox |
|---|
| 1123 | % 284 ddd/split |
|---|
| 1124 | % (260) ddd/nolr |
|---|
| 1125 | % 293 ddd/linearlr |
|---|
| 1126 | % |
|---|
| 1127 | % 311 duu/split |
|---|
| 1128 | % (314) duu/splitdim |
|---|
| 1129 | % (287) duu/splitdim/nolr |
|---|
| 1130 | % 326 duu/splitdim/linearlr |
|---|
| 1131 | % |
|---|
| 1132 | % 386 dss/split |
|---|
| 1133 | % (378) dss/splitdim |
|---|
| 1134 | % (341) dss/splitdim/nolr |
|---|
| 1135 | % 396 dss/splitdim/linearlr |
|---|
| 1136 | |
|---|
| 1137 | |
|---|
| 1138 | |
|---|
| 1139 | |
|---|
| 1140 | |
|---|
| 1141 | \section{Conclusion} |
|---|
| 1142 | |
|---|
| 1143 | \Kdtrees are a fun and easy data structure for searching in $k$-dimensional |
|---|
| 1144 | space. A careful implementation can be orders of magnitude faster than |
|---|
| 1145 | a na\"ive one. Readers wishing to reap the benefits of such a careful |
|---|
| 1146 | implementation, without building it themselves, are encouraged to |
|---|
| 1147 | download the GPL'd {\tt libkd}. Patches are welcome! |
|---|
| 1148 | |
|---|