This package provides operators, sensors, and a hook to integrate DataRobot into Apache Airflow. Using these components, you can build the essential DataRobot pipeline: create a project, train models, deploy a model, and score predictions against the model deployment.
The DataRobot provider for Apache Airflow requires an environment with the following dependencies installed:
- Apache Airflow >= 2.3
- DataRobot Python API Client >= 3.2.0
To install the DataRobot provider, run the following command:
```bash
pip install airflow-provider-datarobot
```
The next step is to create a connection from Airflow to DataRobot:
- In the Airflow user interface, click Admin > Connections to add an Airflow connection.
- On the List Connection page, click + Add a new record.
- In the Add Connection dialog box, configure the following fields:

  | Field | Description |
  | --- | --- |
  | Connection Id | `datarobot_default` (this name is used by default in all operators) |
  | Connection Type | DataRobot |
  | API Key | A DataRobot API key, created in the DataRobot Developer Tools, from the API Keys section. |
  | DataRobot endpoint URL | `https://app.datarobot.com/api/v2` by default |

- Click Test to establish a test connection between Airflow and DataRobot.
- When the connection test is successful, click Save.
Operators and sensors use parameters from the config JSON submitted when triggering the DAG; for example:
```json
{
    "training_data": "s3-presigned-url-or-local-path-to-training-data",
    "project_name": "Project created from Airflow",
    "autopilot_settings": {
        "target": "readmitted"
    },
    "deployment_label": "Deployment created from Airflow",
    "score_settings": {
        "intake_settings": {
            "type": "s3",
            "url": "s3://path/to/scoring-data/Diabetes10k.csv",
            "credential_id": "<credential_id>"
        },
        "output_settings": {
            "type": "s3",
            "url": "s3://path/to/results-dir/Diabetes10k_predictions.csv",
            "credential_id": "<credential_id>"
        }
    }
}
```
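As a sketch of how these operators and sensors fit together, a hypothetical pipeline DAG driven by the config above might look like the following. The import paths, class names, and argument names (`project_id`, `deployment_id`, `job_id`) are assumptions modeled on the provider's example pipeline; check the example DAGs in the repository for the exact interfaces.

```python
from datetime import datetime

from airflow.decorators import dag

# NOTE: the imports, class names, and argument names below are assumptions
# based on the provider's example pipeline; check the example DAGs shipped in
# the repository for the exact classes and parameters.
from datarobot_provider.operators.datarobot import (
    CreateProjectOperator,
    DeployRecommendedModelOperator,
    ScorePredictionsOperator,
    TrainModelsOperator,
)
from datarobot_provider.sensors.datarobot import (
    AutopilotCompleteSensor,
    ScoringCompleteSensor,
)


@dag(schedule=None, start_date=datetime(2023, 1, 1), catchup=False, tags=["example"])
def datarobot_pipeline():
    # Each task reads its settings (project_name, autopilot_settings,
    # score_settings, ...) from the config JSON submitted at trigger time.
    create_project = CreateProjectOperator(task_id="create_project")

    train_models = TrainModelsOperator(
        task_id="train_models",
        project_id=create_project.output,
    )

    autopilot_complete = AutopilotCompleteSensor(
        task_id="autopilot_complete",
        project_id=create_project.output,
    )

    deploy_model = DeployRecommendedModelOperator(
        task_id="deploy_recommended_model",
        project_id=create_project.output,
    )

    score_predictions = ScorePredictionsOperator(
        task_id="score_predictions",
        deployment_id=deploy_model.output,
    )

    ScoringCompleteSensor(
        task_id="check_scoring_complete",
        job_id=score_predictions.output,
    )

    # XComArgs above already create most dependencies; make the ordering
    # between training, Autopilot completion, and deployment explicit.
    train_models >> autopilot_complete >> deploy_model


datarobot_pipeline()
```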
These config values are accessible in the `execute()` method of any operator in the DAG through the `context["params"]` variable; for example, to get the training data, you could use the following:
```python
def execute(self, context: Context) -> str:
    ...
    training_data = context["params"]["training_data"]
    ...
```
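As a more complete illustration, a minimal custom operator that pulls the training data location from the trigger config might look like the sketch below; the class itself is hypothetical and not part of the provider.

```python
from airflow.models.baseoperator import BaseOperator
from airflow.utils.context import Context


class PrintTrainingDataOperator(BaseOperator):
    """Hypothetical operator that reads `training_data` from the DAG run config."""

    def execute(self, context: Context) -> str:
        # `params` holds the config JSON submitted when the DAG was triggered.
        training_data = context["params"]["training_data"]
        self.log.info("Training data location: %s", training_data)
        return training_data
```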
It is useful to have a simple Airflow testing environment and a local development environment for the operators and DAGs. The following steps set up both environments.
- Clone the `airflow-provider-datarobot` repository:

  ```bash
  cd ~/workspace
  git clone git@github.com:datarobot/airflow-provider-datarobot.git
  cd airflow-provider-datarobot
  ```
- Create a virtual environment and install the dependencies:

  ```bash
  pyenv virtualenv 3.12 airflow-provider-datarobot
  pyenv local airflow-provider-datarobot
  make req-dev
  pre-commit install
  ```
- (Optional) Install astro with the following command, or manually from the links above:

  ```bash
  make install-astro
  ```
- Build an astro development environment with the following command:

  ```bash
  make create-astro-dev
  ```
- A new `./astro-dev` folder will be constructed for you to use as a development and test environment.
- Compile and run Airflow on the development package with:

  ```bash
  make build-astro-dev
  ```

  Note: All credentials and logins will be printed in the terminal after running the `build-astro-dev` command.
- Test, compile, and run new or updated operators on the development package with:

  ```bash
  make build-astro-dev
  ```
- Manually start the Airflow dev environment without rebuilding the package with:

  ```bash
  make start-astro-dev
  ```
- Manually stop the Airflow dev environment without rebuilding the package with:

  ```bash
  make stop-astro-dev
  ```
- If there are problems with the Airflow environment, you can reset it to a clean state with:

  ```bash
  make clean-astro-dev
  ```
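Once the dev environment is running, one way to confirm the provider is installed inside it is to drop a small DAG into the dev project's dags folder, assumed here to be `./astro-dev/dags/`. The smoke-test DAG below is hypothetical and only reports the installed provider version.

```python
# Hypothetical smoke test, e.g. saved as ./astro-dev/dags/provider_smoke_test.py
from datetime import datetime
from importlib.metadata import version

from airflow.decorators import dag, task


@dag(schedule=None, start_date=datetime(2023, 1, 1), catchup=False, tags=["smoke-test"])
def datarobot_provider_smoke_test():
    @task
    def report_provider_version() -> str:
        # Fails loudly if the provider distribution is not installed in the image.
        installed = version("airflow-provider-datarobot")
        print(f"airflow-provider-datarobot version: {installed}")
        return installed

    report_provider_version()


datarobot_provider_smoke_test()
```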
Please submit issues and pull requests in our official repo: https://github.com/datarobot/airflow-provider-datarobot
We are happy to hear from you. Please email any feedback to the authors at [email protected].
Copyright 2023 DataRobot, Inc. and its affiliates.
All rights reserved.
This is proprietary source code of DataRobot, Inc. and its affiliates.
Released under the terms of DataRobot Tool and Utility Agreement.