-
Notifications
You must be signed in to change notification settings - Fork 3.1k
Fast Decimal Type
Floating point literals should be represented as decimals in order to be compliant with other databases and SQL standard. To make that happen, decimal arithmetic has to be fast enough, so that there is no significant performance gap compared to BIGING
or DOUBLE
types. Take for instance those SQL expressions:
a_bigint >= 2.5 => returns boolean
a_bigint >= 2.5 * b_bigint => returns boolean
sqrt((a_bigint * 2.5)^3) => returns double
a_decimal_5_2 * 2.5 <= b_bigint => returns boolean
There are number of issues which have to be addressed given that floating point literals are DECIMALS
(and not DOUBLES
):
- Doing fast comparison where one side is
BIGINT
and the otherDECIMAL
. The problem here is thatBIGINT
should be coerced toDECIMAL(19, 0)
which cannot be represented bylong
Java primitive for all values; - Performing fast calculations when both
BIGINT
orDOUBLE
andDECIMAL
is involved (previouslyBIGINT
would be cast toDOUBLE
). The performance gap should be as small as possible; - Performance of short decimals (
precision <= 19
) should be comparable to the performance of Javalong
ordouble
primitive; - Performance of long decimals (
precision >= 19
) should also be as good as possible (the same order of magnitude as for short decimals).
Given those requirements, those are proposed steps for implementing fast DECIMAL
:
- Add support in Presto for output
Slice
argument for scalar operators. This allows to implementSlice
based version ofDECIMAL
without instantiatingSlice
for each operator result. This also makes calculations faster because same accumulatorSlice
can be reused between calculations (the accumulator will get cached in CPU). - Implementing long decimal (
precision >= 19
) arithmetic that would be specifically designed for 128-bit decimals and would work directly onSlices
without instantiating temporary objects (e.g.BigInteger
). Such arithmetic POC is available here: https://github.com/Teradata/presto/commit/6aa867720dfb6e5b30ccda143442d76e9d84b621 and is based on Hive's Decimal128 (https://github.com/prongs/apache-hive/blob/master/common/src/java/org/apache/hadoop/hive/common/type/Decimal128.java); - Have a dual
Slice
based representations of long decimals (precision >= 19
), e.g store either Java long primitive or four integers in aSlice
depending on value magnitude. This will allow for fast computations (long
based) for smaller decimals withprecision >= 19
. - (Optional depending on performance) Have specific operators for
BIGINT
andDECIMAL
logical and/or arithmetic operations - (Optional) Rewrite expressions, e.g.
a_bigint >= 2.5
could be rewritten toa_bigint >= 2
Concepts behind those proposals were investigated and benchmarked. Benchmark code is available: https://github.com/Teradata/presto/commit/e1dac35111cdc3cab807c763b7e43c0c2ed36f4e. The assumptions are following:
-
N
rows contain1, 10, 100
columns. Column is stored in multiple formats (all wrapped inSlice
):long
primitive,long
primitive with additional byte flag,double
primitive, fast decimal POC (four integers), long decimal (based on JavaBigInteger
). - Column data is stored in
Slice
to simulate Presto workflow (e.g. extracting value fromBlock
or reading it from hard drive) - Each benchmark computes sum of all columns for each row. The result is stored in
Slice
to simulate Presto workflow (e.g. storing value usingBlockBuilder
or writing it to disk) - Column data is generated from
extendedPrice
column oflineitem
TPCH table.
There are two ways of computing columns sum:
-
sum(a_column, sum(b_column, sum(c_column, ...) ->
in this approach result of one sum is input to other sum. This works well with Java primitives, but causes overhead when the sum result isSlice
. This is because every sum operation has to return newSlice
instance. Such benchmarks have suffix:NoAccumulator
-
sum(a_column, 0, result); sum(b_column, result, result); ... ->
in this approach sum operator gets a reference to resultSlice
variable where the sum result should be stored. This approach is best forDECIMALS
that work onSlices
because unnecessary objects are not created
Here are benchmarks results:
Benchmark (numberOfColumns) (numberOfRows) Mode Cnt Score Error Units
DecimalBenchmark.benchmarkAddDouble 100 10000 avgt 10 11009680,098 ± 136702,006 ns/op
DecimalBenchmark.benchmarkAddExactBigInt 100 10000 avgt 10 10102069,534 ± 192094,824 ns/op
DecimalBenchmark.benchmarkAddExactSliceShortDecimal 100 10000 avgt 10 10000773,936 ± 314484,661 ns/op
DecimalBenchmark.benchmarkAddExactSliceShortDecimalWithCondition 100 10000 avgt 10 12233633,786 ± 281106,921 ns/op
DecimalBenchmark.benchmarkAddExactSliceShortDecimalWithFlag 100 10000 avgt 10 12851717,002 ± 649326,304 ns/op
DecimalBenchmark.benchmarkAddExactSliceShortDecimalNoAccumulator 100 10000 avgt 10 27868271,539 ± 641941,350 ns/op
DecimalBenchmark.benchmarkAddFastSliceDecimal 100 10000 avgt 10 23306906,663 ± 906315,500 ns/op
DecimalBenchmark.benchmarkAddFastSliceDecimalNoAccumulator 100 10000 avgt 10 45606653,390 ± 1319108,845 ns/op
DecimalBenchmark.benchmarkAddDouble 10 10000 avgt 10 2691757,513 ± 42252,904 ns/op
DecimalBenchmark.benchmarkAddExactBigInt 10 10000 avgt 10 2294961,172 ± 45196,712 ns/op
DecimalBenchmark.benchmarkAddExactSliceShortDecimal 10 10000 avgt 10 2105299,290 ± 154218,157 ns/op
DecimalBenchmark.benchmarkAddExactSliceShortDecimalWithCondition 10 10000 avgt 10 2468484,502 ± 40638,119 ns/op
DecimalBenchmark.benchmarkAddExactSliceShortDecimalWithFlag 10 10000 avgt 10 2882945,564 ± 233256,605 ns/op
DecimalBenchmark.benchmarkAddExactSliceShortDecimalNoAccumulator 10 10000 avgt 10 4018341,231 ± 78284,635 ns/op
DecimalBenchmark.benchmarkAddFastSliceDecimal 10 10000 avgt 10 4435123,345 ± 118476,732 ns/op
DecimalBenchmark.benchmarkAddFastSliceDecimalNoAccumulator 10 10000 avgt 10 6486706,982 ± 125816,358 ns/op
DecimalBenchmark.benchmarkAddLongDecimal 10 10000 avgt 10 25570694,178 ± 718172,389 ns/op
DecimalBenchmark.benchmarkAddDouble 2 10000 avgt 10 531261,601 ± 7925,572 ns/op
DecimalBenchmark.benchmarkAddExactBigInt 2 10000 avgt 10 582688,160 ± 13036,141 ns/op
DecimalBenchmark.benchmarkAddExactSliceShortDecimal 2 10000 avgt 10 576614,844 ± 61817,158 ns/op
DecimalBenchmark.benchmarkAddExactSliceShortDecimalWithCondition 2 10000 avgt 10 558155,381 ± 60957,820 ns/op
DecimalBenchmark.benchmarkAddExactSliceShortDecimalWithFlag 2 10000 avgt 10 834472,680 ± 111274,151 ns/op
DecimalBenchmark.benchmarkAddExactSliceShortDecimalNoAccumulator 2 10000 avgt 10 1368111,040 ± 40764,386 ns/op
DecimalBenchmark.benchmarkAddFastSliceDecimal 2 10000 avgt 10 1402975,001 ± 54223,765 ns/op
DecimalBenchmark.benchmarkAddFastSliceDecimalNoAccumulator 2 10000 avgt 10 1658072,818 ± 20921,517 ns/op
DecimalBenchmark.benchmarkAddLongDecimal 2 10000 avgt 10 5707279,463 ± 152062,044 ns/op
DecimalBenchmark.benchmarkAddDouble 1 10000 avgt 10 153647,332 ± 8232,550 ns/op
DecimalBenchmark.benchmarkAddExactBigInt 1 10000 avgt 10 199451,630 ± 4178,584 ns/op
DecimalBenchmark.benchmarkAddExactSliceShortDecimal 1 10000 avgt 10 167010,986 ± 3828,009 ns/op
DecimalBenchmark.benchmarkAddExactSliceShortDecimalWithCondition 1 10000 avgt 10 176952,276 ± 5064,431 ns/op
DecimalBenchmark.benchmarkAddExactSliceShortDecimalWithFlag 1 10000 avgt 10 223903,387 ± 8515,021 ns/op
DecimalBenchmark.benchmarkAddExactSliceShortDecimalNoAccumulator 1 10000 avgt 10 330760,579 ± 8886,068 ns/op
DecimalBenchmark.benchmarkAddFastSliceDecimal 1 10000 avgt 10 390820,057 ± 14403,589 ns/op
DecimalBenchmark.benchmarkAddFastSliceDecimalNoAccumulator 1 10000 avgt 10 594768,346 ± 8167,884 ns/op
DecimalBenchmark.benchmarkAddLongDecimal 1 10000 avgt 10 2527092,601 ± 85102,283 ns/op
Benchmarks explanation:
-
benchmarkAddExactBigInt
are benchmarks that operate onlong
Java primitives as inputs and results. -
benchmarkAddExactSliceShortDecimal
are benchmarks that operate onlong
Java primitives wrapped inSlice
and passed as arguments. The result (accumulator) parameter is provided to sum operator. -
benchmarkAddExactSliceShortDecimalWithCondition/Flag
are benchmarks that do additional checks on input parameters. This is to simulate dual representation of short and long decimals inSlice
. Condition checking means checking if the first bit of extractedlong
primitive is set. Flag checking means checking additional byte fromSlice
(this requires one additional call toSlice.getByte()
method). -
benchmarkAddExactSliceShortDecimalNoAccumulator
simulates case when a newSlice
is returned as a result of sum operator instead of using accumulator -
benchmarkAddFastSliceDecimal
benchmarks POC implementation of fast long decimal -
benchmarkAddLongDecimal
benchmarksBigInteger
based long decimal implementation
Conclusions:
-
benchmarkAddExactBigInt
andbenchmarkAddDouble
have comparable performance tobenchmarkAddExactSliceShortDecimalWithCondition
andbenchmarkAddExactSliceShortDecimal
. This confirms that it is possible to have dual representation for short and long decimals inSlice
. -
benchmarkAddExactSliceShortDecimal
is about twice as fast asbenchmarkAddExactSliceShortDecimalNoAccumulator
. This confirms that having outputSlice
argument is beneficial. -
benchmarkAddFastSliceDecimal
is 2-3 times slower thanbenchmarkAddExactBigInt
but about order of magnitude faster thanbenchmarkAddLongDecimal
depending on number of columns.