Distributed-Memory Parallel Algorithms for Fixed-Radius Near Neighbor Graph Construction
Abstract
Computing fixed-radius near-neighbor graphs is an important first step for many data
analysis algorithms. Near-neighbor graphs connect points that are close under some
metric, endowing point clouds with a combinatorial structure. As computing power and
data acquisition methods advance, diverse sources of large scientific datasets would
greatly benefit from scalable solutions to this common subroutine for downstream
analysis. Prior work on parallel nearest neighbors has made great progress in problems
like k-nearest and approximate nearest neighbor search problems, with particular
attention on Euclidean spaces. Yet many applications need exact solutions and non-
Euclidean metrics. This paper presents a scalable sparsity-aware distributed memory
algorithm using cover trees to compute near-neighbor graphs in general metric spaces. We
provide a shared-memory algorithm for cover tree construction and demonstrate its
competitiveness with state-of-the-art fixed-radius search data structures. We then
introduce two distributed-memory algorithms for the near-neighbor graph problem, a
simple point-partitioning strategy and a spatial-partitioning strategy, which leverage
the cover tree algorithm on each node. Our algorithms exhibit parallel scaling across a
variety of real and synthetic datasets for both traditional and non-traditional metrics.
On real world high dimensional datasets with one million points, we achieve speedups up
to 678.34x over the state-of-the-art using 1024 cores for graphs with 70 neighbors per
vertex (on average), and up to 1590.99x using 4096 cores for graphs with 500 neighbors
per vertex (on average).