Skip to content
Karol Sobczak edited this page Feb 22, 2016 · 10 revisions

Floating point literals should be represented as decimals in order to be compliant with other databases and SQL standard. To make that happen, decimal arithmetic has to be fast enough, so that there is no significant performance gap compared to BIGING or DOUBLE types. Take for instance those SQL expressions:

  • a_bigint >= 2.5 => returns boolean
  • a_bigint >= 2.5 * b_bigint => returns boolean
  • sqrt((a_bigint * 2.5)^3) => returns double
  • a_decimal_5_2 * 2.5 <= b_bigint => returns boolean

There are number of issues which have to be addressed given that floating point literals are DECIMALS (and not DOUBLES):

  • Doing fast comparison where one side is BIGINT and the other DECIMAL. The problem here is that BIGINT should be coerced to DECIMAL(19, 0) which cannot be represented by long Java primitive for all values;
  • Performing fast calculations when both BIGINT or DOUBLE and DECIMAL is involved (previously BIGINT would be cast to DOUBLE). The performance gap should be as small as possible;
  • Performance of short decimals (precision <= 19) should be comparable to the performance of Java long or double primitive;
  • Performance of long decimals (precision >= 19) should also be as good as possible (the same order of magnitude as for short decimals).

Given those requirements, those are proposed steps for implementing fast DECIMAL:

  • Add support in Presto for output Slice argument for scalar operators. This allows to implement Slice based version of DECIMAL without instantiating Slice for each operator result. This also makes calculations faster because same accumulator Slice can be reused between calculations (the accumulator will get cached in CPU).
  • Implementing long decimal (precision >= 19) arithmetic that would be specifically designed for 128-bit decimals and would work directly on Slices without instantiating temporary objects (e.g. BigInteger). Such arithmetic POC is available here: https://github.com/Teradata/presto/commit/6aa867720dfb6e5b30ccda143442d76e9d84b621 and is based on Hive's Decimal128 (https://github.com/prongs/apache-hive/blob/master/common/src/java/org/apache/hadoop/hive/common/type/Decimal128.java);
  • Have a dual Slice based representations of long decimals (precision >= 19), e.g store either Java long primitive or four integers in a Slice depending on value magnitude. This will allow for fast computations (long based) for smaller decimals with precision >= 19.
  • (Optional depending on performance) Have specific operators for BIGINT and DECIMAL logical and/or arithmetic operations
  • (Optional) Rewrite expressions, e.g. a_bigint >= 2.5 could be rewritten to a_bigint >= 2

Concepts behind those proposals were investigated and benchmarked. Benchmark code is available: https://github.com/Teradata/presto/commit/e1dac35111cdc3cab807c763b7e43c0c2ed36f4e. The assumptions are following:

  • N rows contain 1, 10, 100 columns. Column is stored in multiple formats (all wrapped in Slice): long primitive, long primitive with additional byte flag, double primitive, fast decimal POC (four integers), long decimal (based on Java BigInteger).
  • Column data is stored in Slice to simulate Presto workflow (e.g. extracting value from Block or reading it from hard drive)
  • Each benchmark computes sum of all columns for each row. The result is stored in Slice to simulate Presto workflow (e.g. storing value using BlockBuilder or writing it to disk)
  • Column data is generated from extendedPrice column of lineitem TPCH table.

There are two ways of computing columns sum:

  • sum(a_column, sum(b_column, sum(c_column, ...) -> in this approach result of one sum is input to other sum. This works well with Java primitives, but causes overhead when the sum result is Slice. This is because every sum operation has to return new Slice instance. Such benchmarks have suffix: NoAccumulator
  • sum(a_column, 0, result); sum(b_column, result, result); ... -> in this approach sum operator gets a reference to result Slice variable where the sum result should be stored. This approach is best for DECIMALS that work on Slices because unnecessary objects are not created

Here are benchmarks results:

Benchmark                                                 (numberOfColumns)  (numberOfRows)  Mode  Cnt         Score         Error  Units
DecimalBenchmark.benchmarkAddDouble                                     100           10000  avgt   10  11009680,098 ± 136702,006  ns/op
DecimalBenchmark.benchmarkAddExactBigInt                                100           10000  avgt   10  10102069,534 ±  192094,824  ns/op
DecimalBenchmark.benchmarkAddExactSliceShortDecimal                     100           10000  avgt   10  10000773,936 ±  314484,661  ns/op
DecimalBenchmark.benchmarkAddExactSliceShortDecimalWithCondition        100           10000  avgt   10  12233633,786 ±  281106,921  ns/op
DecimalBenchmark.benchmarkAddExactSliceShortDecimalWithFlag             100           10000  avgt   10  12851717,002 ±  649326,304  ns/op
DecimalBenchmark.benchmarkAddExactSliceShortDecimalNoAccumulator        100           10000  avgt   10  27868271,539 ±  641941,350  ns/op
DecimalBenchmark.benchmarkAddFastSliceDecimal                           100           10000  avgt   10  23306906,663 ±  906315,500  ns/op
DecimalBenchmark.benchmarkAddFastSliceDecimalNoAccumulator              100           10000  avgt   10  45606653,390 ± 1319108,845  ns/op
 
DecimalBenchmark.benchmarkAddDouble                                      10           10000  avgt   10   2691757,513 ±  42252,904  ns/op
DecimalBenchmark.benchmarkAddExactBigInt                                 10           10000  avgt   10   2294961,172 ±   45196,712  ns/op
DecimalBenchmark.benchmarkAddExactSliceShortDecimal                      10           10000  avgt   10   2105299,290 ±  154218,157  ns/op
DecimalBenchmark.benchmarkAddExactSliceShortDecimalWithCondition         10           10000  avgt   10   2468484,502 ±   40638,119  ns/op
DecimalBenchmark.benchmarkAddExactSliceShortDecimalWithFlag              10           10000  avgt   10   2882945,564 ±  233256,605  ns/op
DecimalBenchmark.benchmarkAddExactSliceShortDecimalNoAccumulator         10           10000  avgt   10   4018341,231 ±   78284,635  ns/op
DecimalBenchmark.benchmarkAddFastSliceDecimal                            10           10000  avgt   10   4435123,345 ±  118476,732  ns/op
DecimalBenchmark.benchmarkAddFastSliceDecimalNoAccumulator               10           10000  avgt   10   6486706,982 ±  125816,358  ns/op
DecimalBenchmark.benchmarkAddLongDecimal                                 10           10000  avgt   10  25570694,178 ±  718172,389  ns/op
 
DecimalBenchmark.benchmarkAddDouble                                       2           10000  avgt   10    531261,601 ±   7925,572  ns/op
DecimalBenchmark.benchmarkAddExactBigInt                                  2           10000  avgt   10    582688,160 ±   13036,141  ns/op
DecimalBenchmark.benchmarkAddExactSliceShortDecimal                       2           10000  avgt   10    576614,844 ±   61817,158  ns/op
DecimalBenchmark.benchmarkAddExactSliceShortDecimalWithCondition          2           10000  avgt   10    558155,381 ±   60957,820  ns/op
DecimalBenchmark.benchmarkAddExactSliceShortDecimalWithFlag               2           10000  avgt   10    834472,680 ±  111274,151  ns/op
DecimalBenchmark.benchmarkAddExactSliceShortDecimalNoAccumulator          2           10000  avgt   10   1368111,040 ±   40764,386  ns/op
DecimalBenchmark.benchmarkAddFastSliceDecimal                             2           10000  avgt   10   1402975,001 ±   54223,765  ns/op
DecimalBenchmark.benchmarkAddFastSliceDecimalNoAccumulator                2           10000  avgt   10   1658072,818 ±   20921,517  ns/op
DecimalBenchmark.benchmarkAddLongDecimal                                  2           10000  avgt   10   5707279,463 ±  152062,044  ns/op
 
DecimalBenchmark.benchmarkAddDouble                                       1           10000  avgt   10    153647,332 ±   8232,550  ns/op
DecimalBenchmark.benchmarkAddExactBigInt                                  1           10000  avgt   10    199451,630 ±    4178,584  ns/op
DecimalBenchmark.benchmarkAddExactSliceShortDecimal                       1           10000  avgt   10    167010,986 ±    3828,009  ns/op
DecimalBenchmark.benchmarkAddExactSliceShortDecimalWithCondition          1           10000  avgt   10    176952,276 ±    5064,431  ns/op
DecimalBenchmark.benchmarkAddExactSliceShortDecimalWithFlag               1           10000  avgt   10    223903,387 ±    8515,021  ns/op
DecimalBenchmark.benchmarkAddExactSliceShortDecimalNoAccumulator          1           10000  avgt   10    330760,579 ±    8886,068  ns/op
DecimalBenchmark.benchmarkAddFastSliceDecimal                             1           10000  avgt   10    390820,057 ±   14403,589  ns/op
DecimalBenchmark.benchmarkAddFastSliceDecimalNoAccumulator                1           10000  avgt   10    594768,346 ±    8167,884  ns/op
DecimalBenchmark.benchmarkAddLongDecimal                                  1           10000  avgt   10   2527092,601 ±   85102,283  ns/op

Benchmarks explanation:

  • benchmarkAddExactBigInt are benchmarks that operate on long Java primitives as inputs and results.
  • benchmarkAddExactSliceShortDecimal are benchmarks that operate on long Java primitives wrapped in Slice and passed as arguments. The result (accumulator) parameter is provided to sum operator.
  • benchmarkAddExactSliceShortDecimalWithCondition/Flag are benchmarks that do additional checks on input parameters. This is to simulate dual representation of short and long decimals in Slice. Condition checking means checking if the first bit of extracted long primitive is set. Flag checking means checking additional byte from Slice (this requires one additional call to Slice.getByte() method).
  • benchmarkAddExactSliceShortDecimalNoAccumulator simulates case when a new Slice is returned as a result of sum operator instead of using accumulator
  • benchmarkAddFastSliceDecimal benchmarks POC implementation of fast long decimal
  • benchmarkAddLongDecimal benchmarks BigInteger based long decimal implementation

Conclusions:

  • benchmarkAddExactBigInt and benchmarkAddDouble have comparable performance to benchmarkAddExactSliceShortDecimalWithCondition and benchmarkAddExactSliceShortDecimal. This confirms that it is possible to have dual representation for short and long decimals in Slice.
  • benchmarkAddExactSliceShortDecimal is about twice as fast as benchmarkAddExactSliceShortDecimalNoAccumulator. This confirms that having output Slice argument is beneficial.
  • benchmarkAddFastSliceDecimal is 2-3 times slower than benchmarkAddExactBigInt but about order of magnitude faster than benchmarkAddLongDecimal depending on number of columns.