Arnab Sarkar's implementation of the Insight 2018 challenge
The aim of the project is to process a stream of campaign-contribution data, identify repeat donors, & report the total dollars received, the total number of contributions received, & the donation amount at a given percentile.
Big data processing requires close attention to runtime & to how data is stored & processed. Although a naive percentile calculation is fairly easy to implement, a special algorithm for fast percentiles (using two heap queues) was developed & is the main highlight of my implementation.
The program can process a 1.2 GB itcont.txt file in about 2.5 minutes on an Intel(R) Core(TM)2 Duo CPU E8135 (year: 2009) processor. It uses about 450 MB of main memory.
Download the application from github. Follow these steps to run the application:
$ git clone https://github.com/myonlinecode1988/ELECTION_ANALYTICS.git
$ cd ELECTION_ANALYTICS
$ chmod +x run.sh
$ ./run.sh
If you want to run your own input data set, you can do so by editing run.sh as shown below:
#!/bin/bash
#
python ./src/donation-analytics.py ./<YOUR-PATH>/itcont.txt ./<YOUR-PATH>/percentile.txt ./output/repeat_donors.txt
We have included 4 new tests highlighting the accuracy & speed of our code & algorithm. To run the tests, follow these steps:
$ cd insight_testsuite/
insight_testsuite~$ chmod +x run_tests.sh
insight_testsuite~$ ./run_tests.sh
The output should look similar to this:
[PASS]: test_1 repeat_donors.txt
[PASS]: test_2 repeat_donors.txt
[PASS]: test_3 repeat_donors.txt
[PASS]: test_4 repeat_donors.txt
[Mon Feb 12 11:04:19 EST 2018] 4 of 4 tests passed
[UNIT TEST#1: Comparing Speed & Accuracy: Naive-Nearest-Rank-Percentile-Method vs. Dual-Heap-Nearest-Rank-Percentile-Method Implementations for 10 Random Seeds]
SUCCESS: Percentile Match. Dual-Heap-Percentile implementation is 3.47 times faster than naive implementation
SUCCESS: Percentile Match. Dual-Heap-Percentile implementation is 3.45 times faster than naive implementation
SUCCESS: Percentile Match. Dual-Heap-Percentile implementation is 3.2 times faster than naive implementation
SUCCESS: Percentile Match. Dual-Heap-Percentile implementation is 3.13 times faster than naive implementation
SUCCESS: Percentile Match. Dual-Heap-Percentile implementation is 3.08 times faster than naive implementation
SUCCESS: Percentile Match. Dual-Heap-Percentile implementation is 3.21 times faster than naive implementation
SUCCESS: Percentile Match. Dual-Heap-Percentile implementation is 3.1 times faster than naive implementation
SUCCESS: Percentile Match. Dual-Heap-Percentile implementation is 3.05 times faster than naive implementation
SUCCESS: Percentile Match. Dual-Heap-Percentile implementation is 3.12 times faster than naive implementation
SUCCESS: Percentile Match. Dual-Heap-Percentile implementation is 3.19 times faster than naive implementation
- test_1: This is the default test provided to us.
- test_2: This test has been provided to evaluate correct results for CMTE_ID=C00640623 & ZIP_CODE=35043.
- test_3: This test has been provided to take care of out-of-order streams as explained in the FAQ section.
- test_4: This test has been provided to highlight how the code checks & skips malformed data.
- UNIT TEST#1: This is a test that highlights the accuracy/performance of the Dual-Heap Percentile Implementation vs. the Naive Percentile Implementation.
Read rows, check input data sanity
|
Use (NAME, ZIP_CODE) as a key in Dictionary#1
and store donor info (YEAR, COUNT)
|
Reset repeat donor info to current donor entry
if out-of-order calendar year appears
|
Is (NAME, ZIP_CODE) key present in Dictionary#1 ?
|
|----------------------------------------------------------------|
(if no) (if yes)
Create a new entry in Dictionary#1 Update count in Dictionary#1
(NAME,ZIP_CODE):[YEAR,count=1] (NAME,ZIP_CODE):[YEAR,count+=1]
This is a repeat donor
|
Is (CMTE_ID,ZIP_CODE,YEAR) key present in RepeatDonorDictionary ?
|----------------------------------------------------------------|
(if no) (if yes)
Create entry in RepeatDonorDictionary with Update entry in RepeatDonorDictionary with
(CMTE_ID,ZIP_CODE,YEAR):[update Percentile object, (CMTE_ID,ZIP_CODE,YEAR):[update Percentile object,
find new percentile and update running donation total and counts] find new percentile and update running donation total and counts]
We use two dictionaries & a Percentile object which comprises two heap data structures.
The first dictionary (referred to as Dictionary#1 above & NAME_ZIP_to_YEAR_COUNT in code) uses (NAME, ZIP_CODE) as its key & stores [YEAR, count] as its value. The second dictionary (referred to as RepeatDonorDictionary above & REPEAT_DONOR_DICT in code) uses (CMTE_ID, ZIP_CODE, YEAR) as its key & stores [CMTE_ID, ZIP_CODE, TRANSACTION_DT, Percentile-Object, Percentile, Running Count, Running Total Amount] as its value. A dictionary is a good data structure to use here because key lookup is of O(1) complexity.
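The repeat-donor bookkeeping in Dictionary#1, including the out-of-order reset from the flowchart above, can be sketched roughly as follows. This is a minimal illustration, not the author's exact code; the function name is hypothetical, though NAME_ZIP_to_YEAR_COUNT matches the name used in the source.

```python
# Sketch of the Dictionary#1 bookkeeping described above (illustrative only).
NAME_ZIP_to_YEAR_COUNT = {}   # (NAME, ZIP_CODE) -> [YEAR, count]

def is_repeat_donor(name, zip_code, year):
    """Return True if this (name, zip_code) was already seen in an
    earlier-or-equal calendar year; reset on out-of-order years."""
    key = (name, zip_code)
    if key not in NAME_ZIP_to_YEAR_COUNT:
        # First time we see this donor: create a new entry.
        NAME_ZIP_to_YEAR_COUNT[key] = [year, 1]
        return False
    entry = NAME_ZIP_to_YEAR_COUNT[key]
    if year < entry[0]:
        # Out-of-order stream: reset the entry to the current donation.
        NAME_ZIP_to_YEAR_COUNT[key] = [year, 1]
        return False
    entry[1] += 1
    return True
```

Only donations for which this check returns True would feed the RepeatDonorDictionary and its Percentile object.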
We create two heaps, MinHeap & MaxHeap. Say we are evaluating the 30th percentile: MaxHeap would contain the elements smaller than or equal to the 30th-percentile value, & MinHeap would contain all the elements greater than it. On each iteration we insert the new value into the appropriate heap & rebalance according to the heap sizes. This takes good advantage of the ordering required for percentile calculation & of the fact that a heap offers O(1) complexity for the find-min operation & O(log n) complexity for the insert operation. The heapq module in Python is an implementation of a MinHeap; to implement a MaxHeap, I made the values of the keys negative.
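The dual-heap scheme above can be sketched as a small running-percentile class. This is an illustrative reconstruction under the stated design (nearest-rank method, heapq MinHeap, negated keys for the MaxHeap), not the author's exact Percentile class; the class and method names are assumptions.

```python
import heapq
import math

class RunningPercentile(object):
    """Nearest-rank running percentile via two heaps (sketch)."""

    def __init__(self, percentile):
        self.p = percentile   # e.g. 30 for the 30th percentile, 0 < p <= 100
        self.max_heap = []    # negated values: the `rank` smallest seen so far
        self.min_heap = []    # the remaining (larger) values
        self.n = 0            # total number of values seen

    def add(self, value):
        self.n += 1
        # Nearest-rank index: ceil(p/100 * N); the percentile value is the
        # rank-th smallest, i.e. the top of max_heap after rebalancing.
        rank = int(math.ceil(self.p / 100.0 * self.n))
        if self.max_heap and value <= -self.max_heap[0]:
            heapq.heappush(self.max_heap, -value)
        else:
            heapq.heappush(self.min_heap, value)
        # Rebalance so max_heap holds exactly `rank` elements.
        while len(self.max_heap) > rank:
            heapq.heappush(self.min_heap, -heapq.heappop(self.max_heap))
        while len(self.max_heap) < rank:
            heapq.heappush(self.max_heap, -heapq.heappop(self.min_heap))
        return -self.max_heap[0]
```

Each insert costs O(log n) heap operations, whereas the naive method re-sorts (or re-scans) the full list per donation, which matches the roughly 3x speedups reported in UNIT TEST#1.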
- The code was tested in Python 2.7.10. Although I haven't tested it in other versions, I expect the code to work with Python 2.7.x. The code has not been tested with Python 3.
- The code should work with a baseline Python installation. It uses the following modules: sys, os.path, heapq & math.
- I have used calendar years instead of dates in the entire program. All output donation values have been rounded to int.
- The nearest-rank method is NOT defined for Percentile=0. Please use a Percentile in (0, 100].
- Although the fast-percentile algorithm was exhaustively tested with thousands of random data sets, as shown in my tests, I should have used the unittest module for more formal testing.
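For reference, the naive nearest-rank method that the dual-heap algorithm is compared against can be written in a few lines. This is a generic sketch of the textbook definition, not necessarily the exact naive routine in the repository:

```python
import math

def nearest_rank_percentile(values, p):
    """Naive nearest-rank percentile: sort all values and take the
    ceil(p/100 * N)-th smallest. Undefined for p == 0, as noted above."""
    assert 0 < p <= 100
    ordered = sorted(values)
    rank = int(math.ceil(p / 100.0 * len(ordered)))
    return ordered[rank - 1]
```

Because it sorts the full list on every call, running it once per incoming donation costs O(n log n) per update, versus O(log n) for the dual-heap approach.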