| 1 | +# Prefect Integration with DataHub |
| 2 | + |
| 3 | +## Overview |
| 4 | + |
| 5 | +DataHub supports integration with Prefect, allowing you to ingest: |
| 6 | + |
| 7 | +- Prefect flow and task metadata |
| 8 | +- Flow run and Task run information |
| 9 | +- Lineage information (when available) |
| 10 | + |
| 11 | +This integration lets you track and monitor your Prefect workflows in DataHub, alongside the rest of your data pipeline metadata. |
| 12 | + |
| 13 | +## Prefect DataHub Block |
| 14 | + |
| 15 | +### What is a Prefect DataHub Block? |
| 16 | + |
| 17 | +Blocks in Prefect are primitives that enable the storage of configuration and provide an interface for interacting with external systems. The `prefect-datahub` block uses the [DataHub REST](../../metadata-ingestion/sink_docs/datahub.md#datahub-rest) emitter to send metadata events while running Prefect flows. |
| 18 | + |
| 19 | +### Prerequisites |
| 20 | + |
| 21 | +1. Use either Prefect Cloud (recommended) or a self-hosted Prefect server. |
| 22 | +2. For Prefect Cloud setup, refer to the [Cloud Quickstart](https://docs.prefect.io/latest/getting-started/quickstart/) guide. |
| 23 | +3. For self-hosted Prefect server setup, refer to the [Host Prefect Server](https://docs.prefect.io/latest/guides/host/) guide. |
| 24 | +4. Ensure the Prefect API URL is set correctly. Verify using: |
| 25 | + |
| 26 | +   ```shell |
| 27 | +   prefect profile inspect |
| 28 | +   ``` |
| 29 | + |
| 30 | +5. Confirm that the API URL matches the format for your deployment: |
| 31 | +   - Prefect Cloud: `https://api.prefect.cloud/api/accounts/<account_id>/workspaces/<workspace_id>` |
| 32 | +   - Self-hosted: `http://<host>:<port>/api` |
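
As a rough illustration of the two URL formats above, the following Python sketch builds an API URL and checks whether a given string matches either format. The helper names `cloud_api_url`, `self_hosted_api_url`, and `looks_like_prefect_api_url` are hypothetical and are not part of Prefect. The patterns only mirror the formats listed in this guide, and accepting `https` for self-hosted servers is an assumption.

```python
import re

# Hypothetical helpers for illustration only; they are not part of Prefect.


def cloud_api_url(account_id: str, workspace_id: str) -> str:
    """Build a Prefect Cloud API URL from an account ID and a workspace ID."""
    return (
        "https://api.prefect.cloud/api"
        f"/accounts/{account_id}/workspaces/{workspace_id}"
    )


def self_hosted_api_url(host: str, port: int) -> str:
    """Build the API URL for a self-hosted Prefect server."""
    return f"http://{host}:{port}/api"


# Patterns that mirror the two formats listed in this guide.
_CLOUD = re.compile(
    r"^https://api\.prefect\.cloud/api/accounts/[^/]+/workspaces/[^/]+$"
)
# Assumption: a self-hosted server may also be served over https.
_SELF_HOSTED = re.compile(r"^https?://[^/:]+:\d+/api$")


def looks_like_prefect_api_url(url: str) -> bool:
    """Return True if the URL matches either documented format."""
    return bool(_CLOUD.match(url) or _SELF_HOSTED.match(url))
```

For example, `looks_like_prefect_api_url(self_hosted_api_url("127.0.0.1", 4200))` evaluates to `True`, while a URL that is missing the `/api` suffix is rejected. This is only a shape check and does not confirm that a server is listening at that address.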
| 33 | + |
| 34 | +## Setup Instructions |
| 35 | + |
| 36 | +### 1. Installation |
| 37 | + |
| 38 | +Install `prefect-datahub` using pip: |
| 39 | + |
| 40 | +```shell |
| 41 | +pip install 'prefect-datahub' |
| 42 | +``` |
| 43 | + |
| 44 | +**Note**: `prefect-datahub` requires Python 3.7 or later. |
| 45 | + |
| 46 | +### 2. Saving Configurations to a Block |
| 47 | + |
| 48 | +Save your configuration to the [Prefect block document store](https://docs.prefect.io/latest/concepts/blocks/#saving-blocks): |
| 49 | + |
| 50 | +```python |
| 51 | +from prefect_datahub.datahub_emitter import DatahubEmitter |
| 52 | + |
| 53 | +DatahubEmitter( |
| 54 | +    datahub_rest_url="http://localhost:8080", |
| 55 | +    env="PROD", |
| 56 | +    platform_instance="local_prefect", |
| 57 | +).save("MY-DATAHUB-BLOCK") |
| 58 | +``` |
| 59 | + |
| 60 | +Configuration options: |
| 61 | + |
| 62 | +| Config | Type | Default | Description | |
| 63 | +|--------|------|---------|-------------| |
| 64 | +| datahub_rest_url | `str` | `http://localhost:8080` | DataHub GMS REST URL | |
| 65 | +| env | `str` | `PROD` | Environment for assets (see [FabricType](https://datahubproject.io/docs/graphql/enums/#fabrictype)) | |
| 66 | +| platform_instance | `str` | `None` | Platform instance for assets (see [Platform Instances](https://datahubproject.io/docs/platform-instances/)) | |
| 67 | + |
| 68 | +### 3. Using the Block in Prefect Workflows |
| 69 | + |
| 70 | +Load and use the saved block in your Prefect workflows: |
| 71 | + |
| 72 | +```python |
| 73 | +from prefect import flow, task |
| 74 | +from prefect_datahub.dataset import Dataset |
| 75 | +from prefect_datahub.datahub_emitter import DatahubEmitter |
| 76 | + |
| 77 | +datahub_emitter = DatahubEmitter.load("MY-DATAHUB-BLOCK") |
| 78 | + |
| 79 | +@task(name="Transform", description="Transform the data") |
| 80 | +def transform(data): |
| 81 | +    data = data.split(" ") |
| 82 | +    datahub_emitter.add_task( |
| 83 | +        inputs=[Dataset("snowflake", "mydb.schema.tableA")], |
| 84 | +        outputs=[Dataset("snowflake", "mydb.schema.tableC")], |
| 85 | +    ) |
| 86 | +    return data |
| 87 | + |
| 88 | +@flow(name="ETL flow", description="Extract transform load flow") |
| 89 | +def etl(): |
| 90 | +    data = transform("This is data") |
| 91 | +    datahub_emitter.emit_flow() |
| 92 | +``` |
| 93 | + |
| 94 | +**Note**: Task metadata registered with `add_task()` is only sent when `emit_flow()` is called. If you omit `emit_flow()`, no metadata is emitted. |
| 95 | + |
| 96 | +## Concept Mapping |
| 97 | + |
| 98 | +| Prefect Concept | DataHub Concept | |
| 99 | +|-----------------|-----------------| |
| 100 | +| [Flow](https://docs.prefect.io/latest/concepts/flows/) | [DataFlow](https://datahubproject.io/docs/generated/metamodel/entities/dataflow/) | |
| 101 | +| [Flow Run](https://docs.prefect.io/latest/concepts/flows/#flow-runs) | [DataProcessInstance](https://datahubproject.io/docs/generated/metamodel/entities/dataprocessinstance) | |
| 102 | +| [Task](https://docs.prefect.io/latest/concepts/tasks/) | [DataJob](https://datahubproject.io/docs/generated/metamodel/entities/datajob/) | |
| 103 | +| [Task Run](https://docs.prefect.io/latest/concepts/tasks/#tasks) | [DataProcessInstance](https://datahubproject.io/docs/generated/metamodel/entities/dataprocessinstance) | |
| 104 | +| [Task Tag](https://docs.prefect.io/latest/concepts/tasks/#tags) | [Tag](https://datahubproject.io/docs/generated/metamodel/entities/tag/) | |
| 105 | + |
| 106 | +## Validation and Troubleshooting |
| 107 | + |
| 108 | +### Validating the Setup |
| 109 | + |
| 110 | +1. Check the Prefect UI's Blocks menu for the DataHub emitter. |
| 111 | +2. Run a Prefect workflow and look for DataHub-related log messages: |
| 112 | + |
| 113 | + ```text |
| 114 | + Emitting flow to datahub... |
| 115 | + Emitting tasks to datahub... |
| 116 | + ``` |
| 117 | + |
| 118 | +### Debugging Common Issues |
| 119 | + |
| 120 | +#### Incorrect Prefect API URL |
| 121 | + |
| 122 | +If the Prefect API URL is incorrect, set it manually: |
| 123 | + |
| 124 | +```shell |
| 125 | +prefect config set PREFECT_API_URL='http://127.0.0.1:4200/api' |
| 126 | +``` |
| 127 | + |
| 128 | +#### DataHub Connection Error |
| 129 | + |
| 130 | +If you encounter a `ConnectionError: HTTPConnectionPool(host='localhost', port=8080)`, ensure that your DataHub GMS service is running and reachable at the `datahub_rest_url` configured in the block. |
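
One way to narrow down a connection error is to check whether anything accepts TCP connections at the configured address. The sketch below uses only the Python standard library. The function name `gms_port_is_open` is hypothetical and is not part of `prefect-datahub` or DataHub, and a successful connection only shows that the port is open, not that GMS itself is healthy.

```python
import socket
from urllib.parse import urlparse


def gms_port_is_open(datahub_rest_url: str, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to the URL's host and port succeeds.

    Hypothetical helper for illustration; it does not call any DataHub API.
    """
    parsed = urlparse(datahub_rest_url)
    host = parsed.hostname or "localhost"
    # Fall back to the scheme's default port when the URL does not name one.
    port = parsed.port or (443 if parsed.scheme == "https" else 80)
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name resolution failures.
        return False
```

Calling `gms_port_is_open("http://localhost:8080")` before running a flow gives a quick signal on whether the error comes from the network path or from the block configuration.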
| 131 | + |
| 132 | +## Additional Resources |
| 133 | + |
| 134 | +- [Prefect Documentation](https://docs.prefect.io/) |
| 135 | +- [DataHub Documentation](https://datahubproject.io/docs/) |
| 136 | + |
| 137 | +For more information or support, please refer to the official Prefect and DataHub documentation or reach out to their respective communities. |