
DataRobot Provider for Apache Airflow


This package provides operators, sensors, and a hook to integrate DataRobot into Apache Airflow. Using these components, you can build the essential DataRobot pipeline: create a project, train models, deploy a model, and score predictions against the model deployment.
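
Using these operators together, a minimal pipeline DAG might look like the sketch below. The import paths, operator and sensor names, and parameter names are assumptions made for illustration based on the pipeline described above, not the authoritative API; consult the provider's reference documentation for exact signatures. Most values (training data, project name, deployment label, scoring settings) come from the DAG-run config described later in this README.

from datetime import datetime

from airflow.decorators import dag

# NOTE: import paths, class names, and parameter names below are assumptions
# for illustration; check the provider's reference for the exact API.
from datarobot_provider.operators.datarobot import (
    CreateProjectOperator,
    DeployRecommendedModelOperator,
    ScorePredictionsOperator,
    TrainModelsOperator,
)
from datarobot_provider.sensors.datarobot import (
    AutopilotCompleteSensor,
    ScoringCompleteSensor,
)


@dag(schedule=None, start_date=datetime(2024, 1, 1), tags=["example"])
def datarobot_pipeline():
    # Create a project from the training data referenced in the DAG-run config.
    create_project = CreateProjectOperator(task_id="create_project")

    # Start Autopilot for the new project and wait for it to finish.
    train_models = TrainModelsOperator(
        task_id="train_models",
        project_id=create_project.output,  # assumed parameter name
    )
    autopilot_complete = AutopilotCompleteSensor(
        task_id="check_autopilot_complete",
        project_id=create_project.output,
    )

    # Deploy the recommended model, then score against the deployment and
    # wait for the batch prediction job to finish.
    deploy_model = DeployRecommendedModelOperator(
        task_id="deploy_recommended_model",
        project_id=create_project.output,
    )
    score_predictions = ScorePredictionsOperator(
        task_id="score_predictions",
        deployment_id=deploy_model.output,
    )
    scoring_complete = ScoringCompleteSensor(
        task_id="check_scoring_complete",
        job_id=score_predictions.output,
    )

    train_models >> autopilot_complete >> deploy_model


datarobot_pipeline()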

Install the Airflow provider

The DataRobot provider for Apache Airflow requires an environment with the following dependencies installed:

  • Apache Airflow
  • DataRobot Python API Client

To install the DataRobot provider, run the following command:

pip install airflow-provider-datarobot
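
Optionally, confirm that Airflow has registered the provider (this assumes the Airflow CLI is available in the same environment):

airflow providers list | grep datarobot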

Create a connection from Airflow to DataRobot

The next step is to create a connection from Airflow to DataRobot (a command-line alternative is sketched after these steps):

  1. In the Airflow user interface, click Admin > Connections to add an Airflow connection.

  2. On the List Connection page, click + Add a new record.

  3. In the Add Connection dialog box, configure the following fields:

     • Connection Id: datarobot_default (this name is used by default in all operators)
     • Connection Type: DataRobot
     • API Key: a DataRobot API key, created in the DataRobot Developer Tools, from the API Keys section
     • DataRobot endpoint URL: https://app.datarobot.com/api/v2 by default
  4. Click Test to establish a test connection between Airflow and DataRobot.

  5. When the connection test is successful, click Save.
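
Alternatively, the same connection can be created with the Airflow CLI. The sketch below makes assumptions about the connection type name and about where the hook reads the API key from; verify both against the connection form above and the provider's hook before relying on it:

airflow connections add datarobot_default \
    --conn-type datarobot \
    --conn-host https://app.datarobot.com/api/v2 \
    --conn-password <your-api-key>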

JSON configuration for the DAG run

Operators and sensors use parameters from the config JSON submitted when triggering the DAG; for example:

{
    "training_data": "s3-presigned-url-or-local-path-to-training-data",
    "project_name": "Project created from Airflow",
    "autopilot_settings": {
        "target": "readmitted"
    },
    "deployment_label": "Deployment created from Airflow",
    "score_settings": {
        "intake_settings": {
            "type": "s3",
            "url": "s3://path/to/scoring-data/Diabetes10k.csv",
            "credential_id": "<credential_id>"
        },
        "output_settings": {
            "type": "s3",
            "url": "s3://path/to/results-dir/Diabetes10k_predictions.csv",
            "credential_id": "<credential_id>"
        }
    }
}
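
To run a pipeline with this configuration, trigger the DAG with the config attached, either through Trigger DAG w/ config in the Airflow UI or from the CLI. The DAG ID datarobot_pipeline below is a placeholder for your own DAG:

airflow dags trigger datarobot_pipeline --conf '{
    "training_data": "s3-presigned-url-or-local-path-to-training-data",
    "project_name": "Project created from Airflow",
    "autopilot_settings": {"target": "readmitted"}
}'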

These config values are accessible in the execute() method of any operator in the DAG through the context["params"] variable; for example, to retrieve the training data, you could use the following:

def execute(self, context: Context) -> str:
    ...
    training_data = context["params"]["training_data"]
    ...
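
If a config value only needs to flow into a templated operator argument, it can also be referenced with Jinja instead of a custom execute() method. A minimal sketch using the standard BashOperator (any templated field works the same way):

from airflow.operators.bash import BashOperator

# "{{ params.training_data }}" is rendered from the DAG-run config at runtime.
print_training_data = BashOperator(
    task_id="print_training_data",
    bash_command="echo {{ params.training_data }}",
)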

Development

Pre-requisites

  • Docker: https://docs.docker.com/get-docker/
  • Astro CLI: https://docs.astronomer.io/astro/cli/install-cli
  • pyenv: https://github.com/pyenv/pyenv

Environment Setup

It is useful to have a simple Airflow testing environment and a local development environment for the operators and DAGs. The following steps construct both environments.

  1. Clone the airflow-provider-datarobot repository
        cd ~/workspace
        git clone git@github.com:datarobot/airflow-provider-datarobot.git
        cd airflow-provider-datarobot
  2. Create a virtual environment and install the dependencies
        pyenv virtualenv 3.12 airflow-provider-datarobot
        pyenv local airflow-provider-datarobot
        make req-dev
        pre-commit install

Astro Setup

  1. (OPTIONAL) Install astro with the following command or manually from the links above:
        make install-astro
  2. Build an astro development environment with the following command:
        make create-astro-dev
     This creates a new ./astro-dev folder to use as a development and test environment.
  3. Compile and run Airflow on the development package with:
        make build-astro-dev

Note: All credentials and logins will be printed in the terminal after running the build-astro-dev command.

Updating Operators in the Dev Environment

  • Test, compile, and run new or updated operators on the development package with:
        make build-astro-dev
  • Manually start the Airflow dev environment without rebuilding the package with:
        make start-astro-dev
  • Manually stop the Airflow dev environment without rebuilding the package with:
        make stop-astro-dev
  • If there are problems with the Airflow environment, you can reset it to a clean state with:
        make clean-astro-dev
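
Once the dev environment is running, you can also exercise a DAG through the Airflow CLI inside the containers via the Astro CLI; the DAG ID here is a placeholder for your own DAG:

astro dev run dags test datarobot_pipeline 2024-01-01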

Issues

Please submit issues and pull requests in our official repo: https://github.com/datarobot/airflow-provider-datarobot

We are happy to hear from you. Please email any feedback to the authors at [email protected].

Copyright Notice

Copyright 2023 DataRobot, Inc. and its affiliates.

All rights reserved.

This is proprietary source code of DataRobot, Inc. and its affiliates.

Released under the terms of the DataRobot Tool and Utility Agreement.