# DataHub Dataset Command

The `dataset` command allows you to interact with Dataset entities in DataHub. This includes creating, updating, retrieving, validating, and synchronizing Dataset metadata.

## Commands

### sync

Synchronize Dataset metadata between YAML files and DataHub.

```shell
datahub dataset sync -f PATH_TO_YAML_FILE --to-datahub|--from-datahub
```
| 14 | + |
| 15 | +**Options:** |
| 16 | +- `-f, --file` - Path to the YAML file (required) |
| 17 | +- `--to-datahub` - Push metadata from YAML file to DataHub |
| 18 | +- `--from-datahub` - Pull metadata from DataHub to YAML file |
| 19 | + |
| 20 | +**Example:** |
| 21 | +```shell |
| 22 | +# Push to DataHub |
| 23 | +datahub dataset sync -f dataset.yaml --to-datahub |
| 24 | + |
| 25 | +# Pull from DataHub |
| 26 | +datahub dataset sync -f dataset.yaml --from-datahub |
| 27 | +``` |
| 28 | + |
The `sync` command provides bidirectional synchronization, letting you keep local YAML files in sync with the DataHub platform. Internally, the `upsert` command uses `sync` with the `--to-datahub` flag.

For details on the supported YAML format, see the [Dataset YAML Format](#dataset-yaml-format) section.
### file

Operate on a Dataset YAML file for validation or linting.

```shell
datahub dataset file [--lintCheck] [--lintFix] PATH_TO_YAML_FILE
```

**Options:**
- `--lintCheck` - Check the YAML file for formatting issues (optional)
- `--lintFix` - Fix formatting issues in the YAML file (optional)

**Example:**
```shell
# Check for linting issues
datahub dataset file --lintCheck dataset.yaml

# Fix linting issues
datahub dataset file --lintFix dataset.yaml
```

This command helps maintain consistent formatting of your Dataset YAML files. For more information on the expected format, refer to the [Dataset YAML Format](#dataset-yaml-format) section.
### upsert

Create or update Dataset metadata in DataHub.

```shell
datahub dataset upsert -f PATH_TO_YAML_FILE
```

**Options:**
- `-f, --file` - Path to the YAML file containing Dataset metadata (required)

**Example:**
```shell
datahub dataset upsert -f dataset.yaml
```

This command parses the YAML file, validates that any entity references exist in DataHub, and then emits the corresponding metadata change proposals to create or update the Dataset.

For details on the required structure of your YAML file, see the [Dataset YAML Format](#dataset-yaml-format) section.
### get

Retrieve Dataset metadata from DataHub and optionally write it to a file.

```shell
datahub dataset get --urn DATASET_URN [--to-file OUTPUT_FILE]
```

**Options:**
- `--urn` - The Dataset URN to retrieve (required)
- `--to-file` - Path to write the Dataset metadata as YAML (optional)

**Example:**
```shell
datahub dataset get --urn "urn:li:dataset:(urn:li:dataPlatform:hive,example_table,PROD)" --to-file my_dataset.yaml
```

If the URN does not start with `urn:li:dataset:`, it will be automatically prefixed.

The output file is formatted according to the [Dataset YAML Format](#dataset-yaml-format) section.
### add_sibling

Add sibling relationships between Datasets.

```shell
datahub dataset add_sibling --urn PRIMARY_URN --sibling-urns SECONDARY_URN [--sibling-urns ANOTHER_URN ...]
```

**Options:**
- `--urn` - URN of the primary Dataset (required)
- `--sibling-urns` - URNs of secondary sibling Datasets (required, multiple allowed)

**Example:**
```shell
datahub dataset add_sibling --urn "urn:li:dataset:(urn:li:dataPlatform:hive,example_table,PROD)" --sibling-urns "urn:li:dataset:(urn:li:dataPlatform:snowflake,example_table,PROD)"
```

Siblings are semantically equivalent datasets, typically representing the same data across different platforms or environments.
## Dataset YAML Format

The Dataset YAML file follows a structured format with various supported fields:

```yaml
# Basic identification (required)
id: "example_table" # Dataset identifier
platform: "hive" # Platform name
env: "PROD" # Environment (PROD by default)

# Metadata (optional)
name: "Example Table" # Display name (defaults to id if not specified)
description: "This is an example table"

# Schema definition (optional)
schema:
  fields:
    - id: "field1" # Field identifier
      type: "string" # Data type
      description: "First field" # Field description
      doc: "First field" # Alias for description
      nativeDataType: "VARCHAR" # Native platform type (defaults to type if not specified)
      nullable: false # Whether field can be null (default: false)
      label: "Field One" # Display label (optional business label for the field)
      isPartOfKey: true # Whether field is part of primary key
      isPartitioningKey: false # Whether field is a partitioning key
      jsonProps: {"customProp": "value"} # Custom JSON properties

    - id: "field2"
      type: "number"
      description: "Second field"
      nullable: true
      globalTags: ["PII", "Sensitive"]
      glossaryTerms: ["urn:li:glossaryTerm:Revenue"]
      structured_properties:
        property1: "value1"
        property2: 42
  file: example.schema.avsc # Optional schema file (required if defining tables with nested fields)

# Additional metadata (all optional)
properties: # Custom properties as key-value pairs
  origin: "external"
  pipeline: "etl_daily"

subtype: "View" # Dataset subtype
subtypes: ["View", "Materialized"] # Multiple subtypes (if only one, use subtype field instead)

downstreams: # Downstream Dataset URNs
  - "urn:li:dataset:(urn:li:dataPlatform:hive,downstream_table,PROD)"

tags: # Tags
  - "Tier1"
  - "Verified"

glossary_terms: # Associated glossary terms
  - "urn:li:glossaryTerm:Revenue"

owners: # Dataset owners
  - "jdoe" # Simple format (defaults to TECHNICAL_OWNER)
  - id: "alice" # Extended format with ownership type
    type: "BUSINESS_OWNER"

structured_properties: # Structured properties
  priority: "P1"
  cost_center: 123

external_url: "https://example.com/datasets/example_table"
```
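Most of these fields are optional. Based on the required identification fields above, a minimal Dataset YAML might look like this (`env` defaults to `PROD` when omitted):

```yaml
# Minimal dataset definition
id: "example_table"
platform: "hive"
```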
You can also define multiple datasets in a single YAML file by using a list format:

```yaml
- id: "dataset1"
  platform: "hive"
  description: "First dataset"
  # other properties...

- id: "dataset2"
  platform: "snowflake"
  description: "Second dataset"
  # other properties...
```
| 198 | + |
| 199 | +### Schema Definition |
| 200 | + |
| 201 | +You can define Dataset schema in two ways: |
| 202 | + |
| 203 | +1. **Direct field definitions** as shown above |
| 204 | + > **Important limitation**: When using inline schema field definitions, only non-nested (flat) fields are currently supported. For nested or complex schemas, you must use the Avro file approach described below. |
| 205 | +
|
| 206 | +2. **Reference to an Avro schema file**: |
| 207 | + ```yaml |
| 208 | + schema: |
| 209 | + file: "path/to/schema.avsc" |
| 210 | + ``` |
| 211 | +
|
| 212 | +Even when using the Avro file approach for the basic schema structure, you can still use the `fields` section to provide additional metadata like structured properties, tags, and glossary terms for your schema fields. |
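Combining the two approaches might look like the sketch below; the `.avsc` path, field id, and property names are illustrative. Note that the `fields` entries carry no `type`, consistent with the all-or-nothing rule described in the next section:

```yaml
schema:
  file: "schemas/example_table.avsc" # Avro file defines the field structure
  fields: # Metadata-only entries for fields declared in the Avro file
    - id: "field1"
      globalTags: ["PII"]
      structured_properties:
        prop1: prop_value
```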
| 213 | + |
| 214 | +#### Schema Field Properties |
| 215 | + |
| 216 | +The Schema Field object supports the following properties: |
| 217 | + |
| 218 | +| Property | Type | Description | |
| 219 | +|----------|------|-------------| |
| 220 | +| `id` | string | Field identifier/path (required if `urn` not provided) | |
| 221 | +| `urn` | string | URN of the schema field (required if `id` not provided) | |
| 222 | +| `type` | string | Data type (one of the supported [Field Types](#field-types)) | |
| 223 | +| `nativeDataType` | string | Native data type in the source platform (defaults to `type` if not specified) | |
| 224 | +| `description` | string | Field description | |
| 225 | +| `doc` | string | Alias for description | |
| 226 | +| `nullable` | boolean | Whether the field can be null (default: false) | |
| 227 | +| `label` | string | Display label for the field | |
| 228 | +| `recursive` | boolean | Whether the field is recursive (default: false) | |
| 229 | +| `isPartOfKey` | boolean | Whether the field is part of the primary key | |
| 230 | +| `isPartitioningKey` | boolean | Whether the field is a partitioning key | |
| 231 | +| `jsonProps` | object | Custom JSON properties | |
| 232 | +| `globalTags` | array | List of tags associated with the field | |
| 233 | +| `glossaryTerms` | array | List of glossary terms associated with the field | |
| 234 | +| `structured_properties` | object | Structured properties for the field | |
| 235 | + |
**Important Note on Schema Field Types**:
When specifying fields in the YAML file, you must follow an all-or-nothing approach with the `type` field:
- If you want the command to generate the schema for you, specify the `type` field for ALL fields.
- If you only want to add field-level metadata (like tags, glossary terms, or structured properties), do NOT specify the `type` field for ANY field.

Example of fields with only metadata (no types):
```yaml
schema:
  fields:
    - id: "field1" # Field identifier
      structured_properties:
        prop1: prop_value
    - id: "field2"
      structured_properties:
        prop1: prop_value
```
| 253 | + |
| 254 | +### Ownership Types |
| 255 | + |
| 256 | +When specifying owners, the following ownership types are supported: |
| 257 | +- `TECHNICAL_OWNER` (default) |
| 258 | +- `BUSINESS_OWNER` |
| 259 | +- `DATA_STEWARD` |
| 260 | + |
| 261 | +Custom ownership types can be specified using the URN format. |
| 262 | + |
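For instance, assuming a custom ownership type has already been created in DataHub (the `dataOwner` id below is illustrative), it can be referenced by URN in the extended owner format:

```yaml
owners:
  - id: "alice"
    type: "urn:li:ownershipType:dataOwner" # Custom ownership type, referenced by URN
```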
### Field Types

When defining schema fields, the following primitive types are supported:
- `string`
- `number`
- `int`
- `long`
- `float`
- `double`
- `boolean`
- `bytes`
- `fixed`
## Implementation Notes

- URNs are generated automatically from the platform, id, and env values if not provided explicitly
- The command validates that referenced entities (such as structured properties) exist in DataHub
- The Dataset object checks for the existence of entity references and skips datasets with missing references
- Updates to schema fields are propagated so that field-level metadata remains consistent
- When using the `sync` command with `--from-datahub`, existing YAML files are updated with metadata from DataHub while preserving comments and structure
- For structured properties, single values are written as scalars rather than wrapped in single-element lists
- Field paths are simplified for better readability
- When specifying field types, either all fields must have type information or none of them should