|
| 1 | +.. Licensed to the Apache Software Foundation (ASF) under one |
| 2 | +.. or more contributor license agreements. See the NOTICE file |
| 3 | +.. distributed with this work for additional information |
| 4 | +.. regarding copyright ownership. The ASF licenses this file |
| 5 | +.. to you under the Apache License, Version 2.0 (the |
| 6 | +.. "License"); you may not use this file except in compliance |
| 7 | +.. with the License. You may obtain a copy of the License at |
| 8 | +
|
| 9 | +.. http://www.apache.org/licenses/LICENSE-2.0 |
| 10 | +
|
| 11 | +.. Unless required by applicable law or agreed to in writing, |
| 12 | +.. software distributed under the License is distributed on an |
| 13 | +.. "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY |
| 14 | +.. KIND, either express or implied. See the License for the |
| 15 | +.. specific language governing permissions and limitations |
| 16 | +.. under the License. |
| 17 | +
|
| 18 | +Java Algorithms |
| 19 | +=============== |
| 20 | + |
| 21 | +Arrow's Java library provides algorithms for some commonly-used |
| 22 | +functionalities. The algorithms are provided in the ``org.apache.arrow.algorithm`` |
| 23 | +package of the ``algorithm`` module. |
| 24 | + |
| 25 | +Comparing Vector Elements |
| 26 | +------------------------- |
| 27 | + |
| 28 | +Comparing vector elements is the basic for many algorithms. Vector |
| 29 | +elements can be compared in one of the two ways: |
| 30 | + |
| 31 | +1. **Equality comparison**: there are two possible results for this type of comparisons: ``equal`` and ``unequal``. |
| 32 | +Currently, this type of comparison is supported through the ``org.apache.arrow.vector.compare.VectorValueEqualizer`` |
| 33 | +interface. |
| 34 | + |
| 35 | +2. **Ordering comparison**: there are three possible results for this type of comparisons: ``less than``, ``equal to`` |
| 36 | +and ``greater than``. This comparison is supported by the abstract class ``org.apache.arrow.algorithm.sort.VectorValueComparator``. |
| 37 | + |
| 38 | +We provide default implementations to compare vector elements. However, users can also define ways |
| 39 | +for customized comparisons. |
| 40 | + |
| 41 | +Vector Element Search |
| 42 | +--------------------- |
| 43 | + |
| 44 | +A search algorithm tries to find a particular value in a vector. When successful, a vector index is |
| 45 | +returned; otherwise, a ``-1`` is returned. The following search algorithms are provided: |
| 46 | + |
| 47 | +1. **Linear search**: this algorithm simply traverses the vector from the beginning, until a match is |
| 48 | +found, or the end of the vector is reached. So it takes ``O(n)`` time, where ``n`` is the number of elements |
| 49 | +in the vector. This algorithm is implemented in ``org.apache.arrow.algorithm.search.VectorSearcher#linearSearch``. |
| 50 | + |
| 51 | +2. **Binary search**: this represents a more efficient search algorithm, as it runs in ``O(log(n))`` time. |
| 52 | +However, it is only applicable to sorted vectors. To get a sorted vector, |
| 53 | +one can use one of our sorting algorithms, which will be discussed in the next section. This algorithm |
| 54 | +is implemented in ``org.apache.arrow.algorithm.search.VectorSearcher#binarySearch``. |
| 55 | + |
| 56 | +3. **Parallel search**: when the vector is large, it takes a long time to traverse the elements to search |
| 57 | +for a value. To make this process faster, one can split the vector into multiple partitions, and perform the |
| 58 | +search for each partition in parallel. This is supported by ``org.apache.arrow.algorithm.search.ParallelSearcher``. |
| 59 | + |
| 60 | +4. **Range search**: for many scenarios, there can be multiple matching values in the vector. |
| 61 | +If the vector is sorted, the matching values reside in a contiguous region in the vector. The |
| 62 | +range search algorithm tries to find the upper/lower bound of the region in ``O(log(n))`` time. |
| 63 | +An implementation is provided in ``org.apache.arrow.algorithm.search.VectorRangeSearcher``. |
| 64 | + |
| 65 | +Vector Sorting |
| 66 | +-------------- |
| 67 | + |
| 68 | +Given a vector, a sorting algorithm turns it into a sorted one. The sorting criteria must |
| 69 | +be specified by some ordering comparison operation. The sorting algorithms can be |
| 70 | +classified into the following categories: |
| 71 | + |
| 72 | +1. **In-place sorter**: an in-place sorter performs the sorting by manipulating the original |
| 73 | +vector, without creating any new vector. So it just returns the original vector after the sorting operations. |
| 74 | +Currently, we have ``org.apache.arrow.algorithm.sort.FixedWidthInPlaceVectorSorter`` for in-place |
| 75 | +sorting in ``O(nlog(n))`` time. As the name suggests, it only supports fixed width vectors. |
| 76 | + |
| 77 | +2. **Out-of-place sorter**: an out-of-place sorter does not mutate the original vector. Instead, |
| 78 | +it copies vector elements to a new vector in sorted order, and returns the new vector. |
| 79 | +We have ``org.apache.arrow.algorithm.sort.FixedWidthInPlaceVectorSorter.FixedWidthOutOfPlaceVectorSorter`` |
| 80 | +and ``org.apache.arrow.algorithm.sort.FixedWidthInPlaceVectorSorter.VariableWidthOutOfPlaceVectorSorter`` |
| 81 | +for fixed width and variable width vectors, respectively. Both algorithms run in ``O(nlog(n))`` time. |
| 82 | + |
| 83 | +3. **Index sorter**: this sorter does not actually sort the vector. Instead, it returns an integer |
| 84 | +vector, which correspond to indices of vector elements in sorted order. With the index vector, one can |
| 85 | +easily construct a sorted vector. In addition, some other tasks can be easily achieved, like finding the ``k`` th |
| 86 | +smallest value in the vector. Index sorting is supported by ``org.apache.arrow.algorithm.sort.IndexSorter``, |
| 87 | +which runs in ``O(nlog(n))`` time. It is applicable to vectors of any type. |
| 88 | + |
| 89 | +Other Algorithms |
| 90 | +---------------- |
| 91 | + |
| 92 | +Other algorithms include vector deduplication, dictionary encoding, etc., in the ``algorithm`` module. |
0 commit comments