Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Cache GroupByHash raw values where appropriate #25294

Merged
merged 2 commits into from
Mar 19, 2025

Conversation

pettyjamesm
Copy link
Member

@pettyjamesm pettyjamesm commented Mar 12, 2025

Description

Adds logic to dynamically decide whether to cache hash values into the hash table itself based on whether the GroupByHash instance is spillable, container types are present in the grouping keys, there are 2 or more variable width types present, or there are 3 or more types in total.

Caching hash values in GroupByHash is a classic compute / memory trade-off. Caching the value costs more memory (8 extra bytes per record) but avoids relatively more expensive hash recalculation when re-hashing the table or sorting the table contents by raw hash value for spilling.

Caching the hash value can also make inserting new entries cheaper by avoiding the need for relatively more expensive
valueIdentical checks when the hash value can prove that the values are not identical, although this effect is already somewhat mitigated by the FlatHash#control vector.

Additional context and related issues

Builds on top of:

Release notes

( ) This is not user-visible or is docs only, and no release notes are required.
( ) Release notes are required. Please propose a release note for me.
(x) Release notes are required, with the following suggested text:

## General
* Improve performance of GROUP BY and DISTINCT aggregations when spilling to disk is enabled or grouping by ROW, ARRAY, or MAP columns ({issue}`25294`)

@cla-bot cla-bot bot added the cla-signed label Mar 12, 2025
@pettyjamesm pettyjamesm force-pushed the cache-hash-values branch 3 times, most recently from 7c12b77 to 90f9484 Compare March 13, 2025 14:10
@pettyjamesm pettyjamesm marked this pull request as ready for review March 13, 2025 15:17
@pettyjamesm pettyjamesm requested a review from dain March 13, 2025 15:17
Copy link
Member

@dain dain left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good.

BTW the PR message about saving on collisions isn't as big of a deal as you would think. The control byte array contains 8bits of the hash, so it reduces collision identity checks by a factor of 256.

Adds logical to dynamically decide whether to cache hash values into
the hash table itself based on whether the GroupByHash instance is
spillable, whether container types are present in the grouping keys, the
number of variable width types, and the number of types overall.

Caching hash values in GroupByHash is a classic compute / memory
trade-off. Caching the value costs more memory (8 extra bytes per
record) but avoids relatively more expensive hash recalculation when
re-hashing the table or sorting the table contents by raw hash value for
spilling. Caching the hash value can also make inserting new entries
cheaper by avoiding the need for relatively more expensive
valueIdentical checks when the hash value can prove that the values are
not identical, although this effect is already somewhat mitigated by the
FlatHash control vector.
@pettyjamesm
Copy link
Member Author

BTW the PR message about saving on collisions isn't as big of a deal as you would think. The control byte array contains 8bits of the hash, so it reduces collision identity checks by a factor of 256.

Fair point, I've rephrased the commit message and PR description- although minor clarification, the control vector only stores 7 bits of the hash (since the sign bit is used to detect set vs unset). Thanks for the review!

@pettyjamesm pettyjamesm merged commit ace0da6 into trinodb:master Mar 19, 2025
94 checks passed
@pettyjamesm pettyjamesm deleted the cache-hash-values branch March 19, 2025 16:46
@github-actions github-actions bot added this to the 473 milestone Mar 19, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Development

Successfully merging this pull request may close these issues.

2 participants