12 | 12 | * - Arm: NEON, SVE
13 | 13 | * - x86: Haswell, Ice Lake
14 | 14 | *
| 15 | + * The hardest part of optimizing binary similarity measures is the population count operation.
| 16 | + * It's natively supported by almost every instruction set, but its throughput and latency can
| 17 | + * be suboptimal. There are several ways to optimize this operation:
| 18 | + *
| 19 | + * - Lookup tables, mostly using nibbles (4-bit lookups)
| 20 | + * - Harley-Seal population counts: https://arxiv.org/pdf/1611.07612
| 21 | + *
| 22 | + * On binary vectors, when computing Jaccard similarity, we can clearly see how the CPU struggles
| 23 | + * to compute that many population counts. There are several instructions we should keep in mind
| 24 | + * for future optimizations:
| 25 | + *
| 26 | + * - `_mm512_popcnt_epi64` maps to `VPOPCNTQ (ZMM, K, ZMM)`:
| 27 | + *     - On Ice Lake: 3 cycles latency, ports: 1*p5
| 28 | + *     - On Genoa: 2 cycles latency, ports: 1*FP01
| 29 | + * - `_mm512_shuffle_epi8` maps to `VPSHUFB (ZMM, ZMM, ZMM)`:
| 30 | + *     - On Ice Lake: 1 cycle latency, ports: 1*p5
| 31 | + *     - On Genoa: 2 cycles latency, ports: 1*FP12
| 32 | + * - `_mm512_sad_epu8` maps to `VPSADBW (ZMM, ZMM, ZMM)`:
| 33 | + *     - On Ice Lake: 3 cycles latency, ports: 1*p5
| 34 | + *     - On Zen4: 3 cycles latency, ports: 1*FP01
| 35 | + * - `_mm512_ternarylogic_epi64` maps to `VPTERNLOGQ (ZMM, ZMM, ZMM, I8)`:
| 36 | + *     - On Ice Lake: 1 cycle latency, ports: 1*p05
| 37 | + *     - On Zen4: 1 cycle latency, ports: 1*FP0123
| 38 | + * - `_mm512_gf2p8affine_epi64_epi8` maps to `VPGF2P8AFFINEQB (ZMM, ZMM, ZMM, I8)`:
| 39 | + *     - On Ice Lake: 5 cycles latency, ports: 1*p0
| 40 | + *     - On Zen4: 3 cycles latency, ports: 1*FP01
| 41 | + *
15 | 42 | * x86 intrinsics: https://www.intel.com/content/www/us/en/docs/intrinsics-guide/
16 | 43 | * Arm intrinsics: https://developer.arm.com/architectures/instruction-sets/intrinsics/
| 44 | + * SSE POPCOUNT experiments by Wojciech Muła: https://github.com/WojciechMula/sse-popcount
| 45 | + * R&D progress tracker: https://github.com/ashvardanian/SimSIMD/pull/138
17 | 46 | */
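The 4-bit lookup-table technique named in the comment above can be sketched in portable scalar C. The 16-entry table below mirrors what `VPSHUFB` does in the SIMD kernels, where one instruction performs the lookup for every byte lane of a ZMM register at once. The helper name `popcount_nibbles` is hypothetical, not part of SimSIMD:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Bit counts for every possible 4-bit value. In the SIMD variant this
 * table lives in a vector register and `VPSHUFB` indexes it per lane. */
static const uint8_t nibble_popcount[16] = {0, 1, 1, 2, 1, 2, 2, 3,
                                            1, 2, 2, 3, 2, 3, 3, 4};

/* Count set bits in a byte array using two 4-bit lookups per byte. */
static size_t popcount_nibbles(const uint8_t *data, size_t n) {
    size_t count = 0;
    for (size_t i = 0; i != n; ++i)
        count += nibble_popcount[data[i] & 0x0F] + nibble_popcount[data[i] >> 4];
    return count;
}
```

The appeal on AVX-512 is not fewer operations but port diversity: `VPSHUFB` and `VPSADBW` can execute on ports that `VPOPCNTQ` does not occupy.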
18 | 47 | #ifndef SIMSIMD_BINARY_H
19 | 48 | #define SIMSIMD_BINARY_H
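The Harley-Seal approach referenced in the header comment amortizes population counts with carry-save adders: three words are compressed into a (high, low) pair so that a single popcount covers four inputs per loop iteration. A minimal scalar sketch, assuming plain `uint64_t` words rather than the library's vector registers:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Carry-save adder: popcount(a) + popcount(b) + popcount(c)
 * == 2 * popcount(*high) + popcount(*low). */
static void csa(uint64_t a, uint64_t b, uint64_t c, uint64_t *high, uint64_t *low) {
    uint64_t u = a ^ b;
    *high = (a & b) | (u & c); /* majority(a, b, c) */
    *low = u ^ c;
}

/* Portable 64-bit popcount (compilers map __builtin_popcountll to POPCNT). */
static uint64_t popcount64(uint64_t x) {
    x = x - ((x >> 1) & 0x5555555555555555ULL);
    x = (x & 0x3333333333333333ULL) + ((x >> 2) & 0x3333333333333333ULL);
    x = (x + (x >> 4)) & 0x0F0F0F0F0F0F0F0FULL;
    return (x * 0x0101010101010101ULL) >> 56;
}

/* Harley-Seal popcount over `n` words, consuming 4 words per iteration
 * but issuing only one full popcount (on the `fours` accumulator). */
static uint64_t harley_seal(const uint64_t *data, size_t n) {
    uint64_t total = 0, ones = 0, twos = 0;
    size_t i = 0;
    for (; i + 4 <= n; i += 4) {
        uint64_t twos_a, twos_b, fours;
        csa(data[i], data[i + 1], ones, &twos_a, &ones);
        csa(data[i + 2], data[i + 3], ones, &twos_b, &ones);
        csa(twos_a, twos_b, twos, &fours, &twos);
        total += 4 * popcount64(fours);
    }
    total += 2 * popcount64(twos) + popcount64(ones);
    for (; i < n; ++i) total += popcount64(data[i]); /* Tail words. */
    return total;
}
```

In the AVX-512 setting the same `csa` step is a single `VPTERNLOGQ` (1 cycle, several ports), which is what makes the transform attractive despite the extra bookkeeping.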
|
@@ -321,19 +350,6 @@ SIMSIMD_PUBLIC void simsimd_jaccard_b8_ice(simsimd_b8_t const *a, simsimd_b8_t c
321 | 350 | simsimd_distance_t *result) {
322 | 351 |
323 | 352 | simsimd_size_t intersection = 0, union_ = 0;
324 | | - //? On such vectors we can clearly see that the CPU struggles to perform this many parallel
325 | | - //? population counts, because the throughput of Jaccard and Hamming in this case starts to differ.
326 | | - //? One optimization, aside from Harley-Seal transforms can be using "shuffles" for nibble-popcount
327 | | - //? lookups, to utilize other ports on the CPU.
328 | | - //? https://github.com/ashvardanian/SimSIMD/pull/138
329 | | - //
330 | | - // - `_mm512_popcnt_epi64` maps to `VPOPCNTQ (ZMM, K, ZMM)`:
331 | | - //     - On Ice Lake: 3 cycles latency, ports: 1*p5
332 | | - //     - On Genoa: 2 cycles latency, ports: 1*FP01
333 | | - // - `_mm512_shuffle_epi8` maps to `VPSHUFB (ZMM, ZMM, ZMM)`:
334 | | - //     - On Ice Lake: 1 cycles latency, ports: 1*p5
335 | | - //     - On Genoa: 2 cycles latency, ports: 1*FP12
336 | | - //
337 | 353 | // It's harder to squeeze out performance from tiny representations, so we unroll the loops for binary metrics.
338 | 354 | if (n_words <= 64) { // Up to 512 bits.
339 | 355 | __mmask64 mask = (__mmask64)_bzhi_u64(0xFFFFFFFFFFFFFFFF, n_words);
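For contrast with the unrolled AVX-512 kernel above, the quantity it computes can be expressed as a minimal scalar sketch: Jaccard distance over bit-packed vectors is one minus the ratio of the popcount of the AND to the popcount of the OR. The helper name is hypothetical, not the library's serial fallback:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Scalar binary Jaccard distance over `n_words` bytes:
 * 1 - |A AND B| / |A OR B|, counting bits with Kernighan's trick. */
static double jaccard_b8_scalar(const uint8_t *a, const uint8_t *b, size_t n_words) {
    size_t intersection = 0, union_ = 0;
    for (size_t i = 0; i != n_words; ++i) {
        uint8_t and_ = a[i] & b[i], or_ = a[i] | b[i];
        for (; and_; and_ &= (uint8_t)(and_ - 1)) ++intersection;
        for (; or_; or_ &= (uint8_t)(or_ - 1)) ++union_;
    }
    /* Assumed convention: two all-zero vectors are at distance 0. */
    return union_ ? 1.0 - (double)intersection / (double)union_ : 0.0;
}
```

Every byte of both inputs needs two popcounts here, which is exactly why the vectorized kernel saturates the single popcount port and benefits from shuffling work onto other ports.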