
Commit 3dde9ef

Committed Aug 7, 2019
first files commit
0 parents  commit 3dde9ef

14 files changed: +741 -0 lines changed
 

‎.gitignore

+160
@@ -0,0 +1,160 @@
# Created by .ignore support plugin (hsz.mobi)
### JetBrains template
# Covers JetBrains IDEs: IntelliJ, RubyMine, PhpStorm, AppCode, PyCharm, CLion, Android Studio and WebStorm
# Reference: https://intellij-support.jetbrains.com/hc/en-us/articles/206544839

# User-specific stuff
.idea/**/workspace.xml
.idea/**/tasks.xml
.idea/**/dictionaries
.idea/**/shelf

# Sensitive or high-churn files
.idea/**/dataSources/
.idea/**/dataSources.ids
.idea/**/dataSources.local.xml
.idea/**/sqlDataSources.xml
.idea/**/dynamic.xml
.idea/**/uiDesigner.xml
.idea/**/dbnavigator.xml

# Gradle
.idea/**/gradle.xml
.idea/**/libraries

# CMake
cmake-build-debug/
cmake-build-release/

# Mongo Explorer plugin
.idea/**/mongoSettings.xml

# File-based project format
*.iws

# IntelliJ
out/

# mpeltonen/sbt-idea plugin
.idea_modules/

# JIRA plugin
atlassian-ide-plugin.xml

# Cursive Clojure plugin
.idea/replstate.xml

# Crashlytics plugin (for Android Studio and IntelliJ)
com_crashlytics_export_strings.xml
crashlytics.properties
crashlytics-build.properties
fabric.properties

# Editor-based Rest Client
.idea/httpRequests
### Python template
# Byte-compiled / optimized / DLL files
__pycache__/
*.py[cod]
*$py.class

# C extensions
*.so

# Distribution / packaging
.Python
build/
develop-eggs/
dist/
downloads/
eggs/
.eggs/
lib/
lib64/
parts/
sdist/
var/
wheels/
*.egg-info/
.installed.cfg
*.egg
MANIFEST

# PyInstaller
# Usually these files are written by a python script from a template
# before PyInstaller builds the exe, so as to inject date/other infos into it.
*.manifest
*.spec

# Installer logs
pip-log.txt
pip-delete-this-directory.txt

# Unit test / coverage reports
htmlcov/
.tox/
.coverage
.coverage.*
.cache
nosetests.xml
coverage.xml
*.cover
.hypothesis/
.pytest_cache/

# Translations
*.mo
*.pot

# Django stuff:
*.log
local_settings.py
db.sqlite3

# Flask stuff:
instance/
.webassets-cache

# Scrapy stuff:
.scrapy

# Sphinx documentation
docs/_build/

# PyBuilder
target/

# Jupyter Notebook
.ipynb_checkpoints

# pyenv
.python-version

# celery beat schedule file
celerybeat-schedule

# SageMath parsed files
*.sage.py

# Environments
.env
.venv
env/
venv/
ENV/
env.bak/
venv.bak/

# Spyder project settings
.spyderproject
.spyproject

# Rope project settings
.ropeproject

# mkdocs documentation
/site

# mypy
.mypy_cache/

‎README.md

+128
@@ -0,0 +1,128 @@
## Data Engineering Capstone Project for Udacity

### Objective
Under construction

### Installing and starting

#### Installing Python Dependencies
Install the Python dependencies from a terminal/command prompt.

Without Anaconda you can do this:
```
$ python3 -m venv virtual-env-name
$ source virtual-env-name/bin/activate
$ pip install -r requirements.txt
```
With Anaconda you can do this (on Windows):
```
$ conda env create -f env.yml
$ source activate <conda-env-name>
```
or (on other platforms):
```
conda create -y -n <conda-env-name> python==3.6
conda install -f -y -q -n <conda-env-name> -c conda-forge --file requirements.txt
source activate <conda-env-name>   # or: conda activate <conda-env-name>
```
#### Fixing/Configuring Airflow
```
$ pip install --upgrade Flask
$ pip install zappa
$ mkdir airflow_home
$ export AIRFLOW_HOME=./airflow_home
$ cd airflow_home
$ airflow initdb
$ airflow webserver
$ airflow scheduler
```
Note that `airflow webserver` and `airflow scheduler` are long-running processes, so start each one in its own terminal.

#### More Airflow commands
To list the existing DAGs registered with Airflow:
```
$ airflow list_dags
```
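
A couple of other 1.x-era commands are handy while developing; this is a sketch for the Airflow 1.x CLI this project appears to use (the CLI was reorganized in Airflow 2.x, e.g. `airflow dags list`):
```bash
# Run a single task for a given execution date without the scheduler
$ airflow test <dag_id> <task_id> 2019-08-07

# Trigger a full DAG run on demand
$ airflow trigger_dag <dag_id>
```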

#### Secure/Encrypt your connections and hooks
**Run**
```bash
$ python cryptosetup.py
```
Copy the key it prints into *airflow.cfg*, after:
fernet_key = ************
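
`cryptosetup.py` itself is not shown in this commit; the following is a minimal sketch of what such a script presumably does, assuming it just generates a Fernet key with the `cryptography` package (the same library Airflow uses for connection encryption):
```python
# Sketch of a cryptosetup.py-style script (an assumption; the real file is
# not shown in this commit). Prints a Fernet key suitable for airflow.cfg.
from cryptography.fernet import Fernet

if __name__ == '__main__':
    # Fernet keys are url-safe base64-encoded 32-byte keys
    print(Fernet.generate_key().decode())
```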

#### Setting up connections and variables in Airflow UI for AWS
There is no code to modify for this part; we only create connections and variables.

**S3**
1. Open your browser to localhost:8080 and open Admin->Variables
2. Click "Create"
3. Set "Key" equal to "s3_bucket" and set "Val" equal to "udacity-dend"
4. Set "Key" equal to "s3_prefix" and set "Val" equal to "data-pipelines"
5. Click "Save"

**AWS**
1. Open Admin->Connections
2. Click "Create"
3. Set "Conn Id" to "aws_credentials" and "Conn Type" to "Amazon Web Services"
4. Set "Login" to your aws_access_key_id and "Password" to your aws_secret_access_key
5. Click "Save"
6. If it doesn't work, put this in the "Extra" field:
{"region_name": "your_aws_region", "aws_access_key_id": "your_aws_access_key_id", "aws_secret_access_key": "your_aws_secret_access_key", "aws_iam_user": "your_created_iam_user"}
7. The full set of fields "Extra" accepts:
- aws_account_id: AWS account ID for the connection
- aws_iam_role: AWS IAM role for the connection
- external_id: AWS external ID for the connection
- host: Endpoint URL for the connection
- region_name: AWS region for the connection
- role_arn: AWS role ARN for the connection

**Redshift**
1. Open Admin->Connections
2. Click "Create"
3. Set "Conn Id" to "redshift" and "Conn Type" to "postgres"
4. Set "Login" to the master_username of your cluster and "Password" to its master_password
5. Click "Save"
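
The same variables and connections can also be created from the command line, which is easier to script. A sketch against the Airflow 1.x CLI (verify the flags with `airflow connections --help` for your version; Airflow 2.x renamed these to `airflow variables set` / `airflow connections add`):
```bash
# Variables for the S3 source data
$ airflow variables --set s3_bucket udacity-dend
$ airflow variables --set s3_prefix data-pipelines

# AWS and Redshift connections (credentials and host are placeholders)
$ airflow connections --add --conn_id aws_credentials --conn_type aws \
    --conn_login <aws_access_key_id> --conn_password <aws_secret_access_key>
$ airflow connections --add --conn_id redshift --conn_type postgres \
    --conn_host <cluster-endpoint> --conn_port 5439 --conn_schema <db-name> \
    --conn_login <master_username> --conn_password <master_password>
```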

### About the data
#### I94 Immigration Data:
This data comes from the [US National Tourism and Trade Office](https://travel.trade.gov/research/reports/i94/historical/2016.html). There is a sample file so you can take a look at the data in CSV format before reading it all in. The report contains international visitor arrival statistics by world regions and selected countries (including top 20), type of visa, mode of transportation, age groups, states visited (first intended address only), and the top ports of entry (for select countries).

#### World Temperature Data:
This dataset comes from Kaggle. You can read more about it [here](https://www.kaggle.com/berkeleyearth/climate-change-earth-surface-temperature-data).

#### U.S. City Demographic Data:
This data comes from OpenDataSoft. You can read more about it [here](https://public.opendatasoft.com/explore/dataset/us-cities-demographics/export/).

#### Airport Code Table:
This is a simple table of airport codes and corresponding cities. It comes from [here](https://datahub.io/core/airport-codes#data).

### Run the project
Under construction

#### Links for Airflow
**Context Variables**
https://airflow.apache.org/macros.html

**Hacks for Airflow**
https://medium.com/datareply/airflow-lesser-known-tips-tricks-and-best-practises-cf4d4a90f8f
https://medium.com/handy-tech/airflow-tips-tricks-and-pitfalls-9ba53fba14eb
https://www.astronomer.io/guides/dag-best-practices/

```
  ____________       _____________
 ____    |__( )_________  __/__  /________      __
____  /| |_  /__  ___/_  /_ __  /_  __ \_ | /| / /
___  ___ |  / _  /   _  __/ _  / / /_/ /_ |/ |/ /
 _/_/  |_/_/  /_/    /_/    /_/  \____/____/|__/
```

‎airflow/plugins/operators/__init__.py

+11
@@ -0,0 +1,11 @@
from operators.stage_redshift import StageToRedshiftOperator
from operators.load_fact import LoadFactOperator
from operators.load_dimension import LoadDimensionOperator
from operators.data_quality import DataQualityOperator

__all__ = [
    'StageToRedshiftOperator',
    'LoadFactOperator',
    'LoadDimensionOperator',
    'DataQualityOperator'
]
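
These imports only resolve because Airflow puts the `plugins/` folder on the Python path. The commit does not show a plugin registration module, but in projects laid out like this the operators are usually exposed through an `AirflowPlugin` subclass; a hypothetical sketch (the plugin and class names are assumptions, not part of this commit):
```python
# Hypothetical plugins/__init__.py sketch; plugin/class names are assumptions.
from airflow.plugins_manager import AirflowPlugin

import operators


class UdacityPlugin(AirflowPlugin):
    name = 'udacity_plugin'
    # Exposes the operators as airflow.operators.udacity_plugin.*
    operators = [
        operators.StageToRedshiftOperator,
        operators.LoadFactOperator,
        operators.LoadDimensionOperator,
        operators.DataQualityOperator,
    ]
```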

airflow/plugins/operators/data_quality.py

+30
@@ -0,0 +1,30 @@
from airflow.hooks.postgres_hook import PostgresHook
from airflow.models import BaseOperator
from airflow.utils.decorators import apply_defaults


class DataQualityOperator(BaseOperator):

    ui_color = '#89DA59'

    @apply_defaults
    def __init__(self,
                 redshift_conn_id,
                 sql_stmt,
                 tables,
                 *args, **kwargs):
        super(DataQualityOperator, self).__init__(*args, **kwargs)
        self.redshift_conn_id = redshift_conn_id
        # Check query with a {} placeholder for the table name;
        # it should return 1 for a table that passes.
        self.sql_stmt = sql_stmt
        self.tables = tables

    def execute(self, context):
        self.log.info("Checking ETL result quality")
        redshift = PostgresHook(self.redshift_conn_id)
        for cur_table in self.tables:
            # PostgresHook.run() does not return a result set, so fetch the
            # check value with get_records() instead.
            records = redshift.get_records(self.sql_stmt.format(cur_table))
            if not records or records[0][0] != 1:
                raise ValueError(f"Quality check failed for table {cur_table}")
            self.log.info(f"Quality test passed for {cur_table}")
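
A hypothetical usage sketch for this operator (the DAG, task id, check query, and table names are assumptions, not part of this commit):
```python
# Hypothetical usage sketch; DAG name, tables, and check query are assumptions.
from datetime import datetime

from airflow import DAG
from operators import DataQualityOperator

dag = DAG('capstone_etl',
          start_date=datetime(2019, 8, 1),
          schedule_interval='@daily')

# The check query must return 1 for a table that passes.
run_quality_checks = DataQualityOperator(
    task_id='run_quality_checks',
    dag=dag,
    redshift_conn_id='redshift',
    sql_stmt='SELECT CASE WHEN COUNT(*) > 0 THEN 1 ELSE 0 END FROM {}',
    tables=['immigration_fact', 'airport_dim'],
)
```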

airflow/plugins/operators/load_dimension.py

+35
@@ -0,0 +1,35 @@
from airflow.hooks.postgres_hook import PostgresHook
from airflow.models import BaseOperator
from airflow.utils.decorators import apply_defaults


class LoadDimensionOperator(BaseOperator):
    ui_color = '#358140'

    @apply_defaults
    def __init__(self,
                 redshift_conn_id,
                 table,
                 sql_stmt,
                 mode,
                 *args, **kwargs):
        super(LoadDimensionOperator, self).__init__(*args, **kwargs)
        self.redshift_conn_id = redshift_conn_id
        self.table = table
        self.sql_stmt = sql_stmt
        self.mode = mode.lower()

    def execute(self, context):
        self.log.info("Creating Postgres hook")
        redshift = PostgresHook(self.redshift_conn_id)
        self.log.info(f"Loading data into table {self.table}")
        if self.mode == 'update':
            # 'update' appends the query results to the existing rows
            formatted_sql = f"INSERT INTO {self.table} ({self.sql_stmt})"
            redshift.run(formatted_sql)
        elif self.mode == 'insert':
            # 'insert' empties the table first, then loads the query results
            formatted_sql = f"""TRUNCATE TABLE {self.table};
                INSERT INTO {self.table} ({self.sql_stmt})"""
            redshift.run(formatted_sql)
        else:
            raise ValueError(f"Unknown mode '{self.mode}'; expected 'update' or 'insert'")

airflow/plugins/operators/load_fact.py

+26
@@ -0,0 +1,26 @@
from airflow.hooks.postgres_hook import PostgresHook
from airflow.models import BaseOperator
from airflow.utils.decorators import apply_defaults


class LoadFactOperator(BaseOperator):
    ui_color = '#F98866'

    @apply_defaults
    def __init__(self,
                 redshift_conn_id,
                 table,
                 sql_stmt,
                 *args, **kwargs):
        super(LoadFactOperator, self).__init__(*args, **kwargs)
        self.redshift_conn_id = redshift_conn_id
        self.table = table
        self.sql_stmt = sql_stmt

    def execute(self, context):
        self.log.info("Creating Postgres hook")
        redshift = PostgresHook(self.redshift_conn_id)
        self.log.info(f"Loading data into table {self.table}")
        # Fact loads are append-only: insert the query results into the table
        formatted_sql = f"INSERT INTO {self.table} ({self.sql_stmt})"
        redshift.run(formatted_sql)
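
A hypothetical sketch of how the two load operators chain together, reusing the `dag` and `run_quality_checks` objects from the DataQualityOperator sketch above (table names and SQL are assumptions, not part of this commit):
```python
# Hypothetical wiring sketch; tables and queries are assumptions.
from operators import LoadFactOperator, LoadDimensionOperator

load_immigration_fact = LoadFactOperator(
    task_id='load_immigration_fact',
    dag=dag,
    redshift_conn_id='redshift',
    table='immigration_fact',
    sql_stmt='SELECT * FROM staging_immigration',
)

load_airport_dim = LoadDimensionOperator(
    task_id='load_airport_dim',
    dag=dag,
    redshift_conn_id='redshift',
    table='airport_dim',
    sql_stmt='SELECT DISTINCT iata_code, name FROM staging_airports',
    mode='insert',  # truncate-and-load; 'update' would append instead
)

# Load the fact and dimension tables, then run the quality checks
load_immigration_fact >> load_airport_dim >> run_quality_checks
```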
