This commit does not belong to any branch on this repository, and may belong to a fork outside of the repository.

Commit 45dbe6e

committed Nov 27, 2024
Docs: Harley-Seal plans for binary kernels
#138
1 parent a39419c commit 45dbe6e

1 file changed: +29 -13 lines changed

include/simsimd/binary.h
@@ -12,8 +12,37 @@
  *  - Arm: NEON, SVE
  *  - x86: Haswell, Ice Lake
  *
+ *  The hardest part of optimizing binary similarity measures is the population count operation.
+ *  It's natively supported by almost every instruction set, but the throughput and latency can
+ *  be suboptimal. There are several ways to optimize this operation:
+ *
+ *  - Lookup tables, mostly using nibbles (4-bit lookups)
+ *  - Harley-Seal population counts: https://arxiv.org/pdf/1611.07612
+ *
+ *  When computing Jaccard similarity on binary vectors, we can clearly see the CPU struggle to
+ *  sustain that many population counts: Jaccard needs two per word (intersection and union),
+ *  where Hamming needs one. There are several instructions we should keep in mind for future
+ *  optimizations:
+ *
+ *  - `_mm512_popcnt_epi64` maps to `VPOPCNTQ (ZMM, K, ZMM)`:
+ *      - On Ice Lake: 3 cycles latency, ports: 1*p5
+ *      - On Genoa: 2 cycles latency, ports: 1*FP01
+ *  - `_mm512_shuffle_epi8` maps to `VPSHUFB (ZMM, ZMM, ZMM)`:
+ *      - On Ice Lake: 1 cycle latency, ports: 1*p5
+ *      - On Genoa: 2 cycles latency, ports: 1*FP12
+ *  - `_mm512_sad_epu8` maps to `VPSADBW (ZMM, ZMM, ZMM)`:
+ *      - On Ice Lake: 3 cycles latency, ports: 1*p5
+ *      - On Zen4: 3 cycles latency, ports: 1*FP01
+ *  - `_mm512_ternarylogic_epi64` maps to `VPTERNLOGQ (ZMM, ZMM, ZMM, I8)`:
+ *      - On Ice Lake: 1 cycle latency, ports: 1*p05
+ *      - On Zen4: 1 cycle latency, ports: 1*FP0123
+ *  - `_mm512_gf2p8mul_epi8` maps to `VGF2P8MULB (ZMM, ZMM, ZMM)`:
+ *      - On Ice Lake: 5 cycles latency, ports: 1*p0
+ *      - On Zen4: 3 cycles latency, ports: 1*FP01
+ *
  *  x86 intrinsics: https://www.intel.com/content/www/us/en/docs/intrinsics-guide/
  *  Arm intrinsics: https://developer.arm.com/architectures/instruction-sets/intrinsics/
+ *  SSE POPCOUNT experiments by Wojciech Muła: https://github.com/WojciechMula/sse-popcount
+ *  R&D progress tracker: https://github.com/ashvardanian/SimSIMD/pull/138
  */
 #ifndef SIMSIMD_BINARY_H
 #define SIMSIMD_BINARY_H
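
The two strategies named in the new header comment, nibble lookups and Harley-Seal counting, can be sketched in portable scalar C. This is a minimal illustration, not the library's kernels: every function name below is hypothetical, and the Harley-Seal loop is a simplified one-level variant that uses a single carry-save adder, whereas the linked paper builds a deeper adder tree. In an AVX-512 version the table lookup would presumably be a `VPSHUFB` and the adder two `VPTERNLOGQ` operations, which is an assumption about how the listed instructions would be combined.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

// Population counts of all 16 possible nibbles (4-bit values).
static const uint8_t nibble_popcounts[16] = {0, 1, 1, 2, 1, 2, 2, 3, 1, 2, 2, 3, 2, 3, 3, 4};

// Counts the set bits of one byte with two 4-bit table lookups.
static unsigned popcount_byte_lookup(uint8_t byte) {
    return nibble_popcounts[byte & 0x0F] + nibble_popcounts[byte >> 4];
}

// Counts the set bits of a 64-bit word, one byte at a time.
static unsigned popcount_word_lookup(uint64_t word) {
    unsigned count = 0;
    for (; word != 0; word >>= 8) count += popcount_byte_lookup((uint8_t)(word & 0xFF));
    return count;
}

// Carry-save adder: a bitwise full adder over three words.
// `sum` is the 3-way XOR (ternary-logic immediate 0x96) and
// `carry` is the 3-way majority (ternary-logic immediate 0xE8).
static void carry_save_add(uint64_t a, uint64_t b, uint64_t c, uint64_t *carry, uint64_t *sum) {
    uint64_t partial = a ^ b;
    *carry = (a & b) | (partial & c);
    *sum = partial ^ c;
}

// Simplified Harley-Seal: folds two input words per step into the running
// `ones` word, so only one population count is needed per pair of words.
static size_t popcount_words_harley_seal(const uint64_t *words, size_t n_words) {
    assert(words != NULL || n_words == 0);
    uint64_t ones = 0, twos = 0;
    size_t total = 0, i = 0;
    for (; i + 2 <= n_words; i += 2) {
        carry_save_add(ones, words[i], words[i + 1], &twos, &ones);
        total += 2 * (size_t)popcount_word_lookup(twos); // each bit of `twos` stands for two set bits
    }
    if (i < n_words) total += popcount_word_lookup(words[i]); // odd trailing word
    return total + popcount_word_lookup(ones);                // bits still held in `ones`
}
```

The appeal of the carry-save step is that it trades a population count for a few cheap bitwise operations, which matters when the count itself is bound to a single port.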
@@ -321,19 +350,6 @@ SIMSIMD_PUBLIC void simsimd_jaccard_b8_ice(simsimd_b8_t const *a, simsimd_b8_t c
                                            simsimd_distance_t *result) {
 
     simsimd_size_t intersection = 0, union_ = 0;
-    //? On such vectors we can clearly see that the CPU struggles to perform this many parallel
-    //? population counts, because the throughput of Jaccard and Hamming in this case starts to differ.
-    //? One optimization, aside from Harley-Seal transforms can be using "shuffles" for nibble-popcount
-    //? lookups, to utilize other ports on the CPU.
-    //? https://github.com/ashvardanian/SimSIMD/pull/138
-    //
-    //  - `_mm512_popcnt_epi64` maps to `VPOPCNTQ (ZMM, K, ZMM)`:
-    //      - On Ice Lake: 3 cycles latency, ports: 1*p5
-    //      - On Genoa: 2 cycles latency, ports: 1*FP01
-    //  - `_mm512_shuffle_epi8` maps to `VPSHUFB (ZMM, ZMM, ZMM)`:
-    //      - On Ice Lake: 1 cycles latency, ports: 1*p5
-    //      - On Genoa: 2 cycles latency, ports: 1*FP12
-    //
     // It's harder to squeeze out performance from tiny representations, so we unroll the loops for binary metrics.
     if (n_words <= 64) { // Up to 512 bits.
         __mmask64 mask = (__mmask64)_bzhi_u64(0xFFFFFFFFFFFFFFFF, n_words);
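
The short-vector branch above builds a lane mask with one bit per byte. A scalar sketch of that idea follows, under stated assumptions: `zero_high_bits` is a hypothetical portable stand-in for `_bzhi_u64`, the loop emulates a zero-masking load (the diff does not show which load the kernel actually uses), and the result is reported as a distance with `1` for an empty union, which is an assumed convention rather than something this commit confirms.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

// Population counts of all 16 possible nibbles (4-bit values).
static const uint8_t nibble_popcounts[16] = {0, 1, 1, 2, 1, 2, 2, 3, 1, 2, 2, 3, 2, 3, 3, 4};

static unsigned popcount_byte(uint8_t byte) {
    return nibble_popcounts[byte & 0x0F] + nibble_popcounts[byte >> 4];
}

// Hypothetical stand-in for `_bzhi_u64(value, index)`:
// clears every bit at position `index` and above, and keeps `value` as-is for index >= 64.
static uint64_t zero_high_bits(uint64_t value, unsigned index) {
    return index >= 64 ? value : value & ((1ull << index) - 1);
}

// Jaccard distance over at most 64 bytes (512 bits), mirroring the masked branch.
static double jaccard_distance_masked(const uint8_t *a, const uint8_t *b, size_t n_words) {
    assert(n_words <= 64);
    uint64_t mask = zero_high_bits(UINT64_MAX, (unsigned)n_words); // one bit per byte lane
    size_t intersection = 0, union_ = 0;
    for (size_t i = 0; i != 64; ++i) {
        // Disabled lanes read as zero and their memory is never touched.
        int enabled = (int)((mask >> i) & 1);
        uint8_t a_byte = enabled ? a[i] : 0;
        uint8_t b_byte = enabled ? b[i] : 0;
        intersection += popcount_byte(a_byte & b_byte);
        union_ += popcount_byte(a_byte | b_byte);
    }
    // Assumed convention: two empty sets are reported as maximally distant.
    return union_ != 0 ? 1.0 - (double)intersection / (double)union_ : 1.0;
}
```

Zeroed lanes contribute nothing to either count, so the tail needs no separate scalar loop. Note the two population counts per byte, one for the intersection and one for the union.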
