
Commit a517173

Merge remote-tracking branch 'oss-datahub/master' into vertex_src_temp
2 parents 99269aa + cc3782e

29 files changed: +2770, -223 lines changed


docs-website/sidebars.js (+1, -1)

@@ -742,7 +742,7 @@ module.exports = {
         type: "category",
         label: "DataHub CLI",
         link: { type: "doc", id: "docs/cli" },
-        items: ["docs/datahub_lite"],
+        items: ["docs/cli-commands/dataset", "docs/datahub_lite"],
       },
       {
         type: "category",

docs/cli-commands/dataset.md (new file, +285)

# DataHub Dataset Command

The `dataset` command allows you to interact with Dataset entities in DataHub. This includes creating, updating, retrieving, validating, and synchronizing Dataset metadata.

## Commands

### sync

Synchronize Dataset metadata between YAML files and DataHub.

```shell
datahub dataset sync -f PATH_TO_YAML_FILE --to-datahub|--from-datahub
```

**Options:**

- `-f, --file` - Path to the YAML file (required)
- `--to-datahub` - Push metadata from the YAML file to DataHub
- `--from-datahub` - Pull metadata from DataHub into the YAML file

**Example:**

```shell
# Push to DataHub
datahub dataset sync -f dataset.yaml --to-datahub

# Pull from DataHub
datahub dataset sync -f dataset.yaml --from-datahub
```

The `sync` command offers bidirectional synchronization, allowing you to keep your local YAML files in sync with the DataHub platform. Internally, the `upsert` command uses `sync` with the `--to-datahub` flag.

For details on the supported YAML format, see the [Dataset YAML Format](#dataset-yaml-format) section.
### file

Operate on a Dataset YAML file for validation or linting.

```shell
datahub dataset file [--lintCheck] [--lintFix] PATH_TO_YAML_FILE
```

**Options:**

- `--lintCheck` - Check the YAML file for formatting issues (optional)
- `--lintFix` - Fix formatting issues in the YAML file (optional)

**Example:**

```shell
# Check for linting issues
datahub dataset file --lintCheck dataset.yaml

# Fix linting issues
datahub dataset file --lintFix dataset.yaml
```

This command helps maintain consistent formatting of your Dataset YAML files. For more information on the expected format, refer to the [Dataset YAML Format](#dataset-yaml-format) section.
### upsert

Create or update Dataset metadata in DataHub.

```shell
datahub dataset upsert -f PATH_TO_YAML_FILE
```

**Options:**

- `-f, --file` - Path to the YAML file containing Dataset metadata (required)

**Example:**

```shell
datahub dataset upsert -f dataset.yaml
```

This command will parse the YAML file, validate that any entity references exist in DataHub, and then emit the corresponding metadata change proposals to update or create the Dataset.

For details on the required structure of your YAML file, see the [Dataset YAML Format](#dataset-yaml-format) section.
### get

Retrieve Dataset metadata from DataHub and optionally write it to a file.

```shell
datahub dataset get --urn DATASET_URN [--to-file OUTPUT_FILE]
```

**Options:**

- `--urn` - The Dataset URN to retrieve (required)
- `--to-file` - Path to write the Dataset metadata as YAML (optional)

**Example:**

```shell
datahub dataset get --urn "urn:li:dataset:(urn:li:dataPlatform:hive,example_table,PROD)" --to-file my_dataset.yaml
```

If the URN does not start with `urn:li:dataset:`, it will be automatically prefixed.

The output file will be formatted according to the [Dataset YAML Format](#dataset-yaml-format) section.
### add_sibling

Add sibling relationships between Datasets.

```shell
datahub dataset add_sibling --urn PRIMARY_URN --sibling-urns SECONDARY_URN [--sibling-urns ANOTHER_URN ...]
```

**Options:**

- `--urn` - URN of the primary Dataset (required)
- `--sibling-urns` - URNs of secondary sibling Datasets (required, multiple allowed)

**Example:**

```shell
datahub dataset add_sibling --urn "urn:li:dataset:(urn:li:dataPlatform:hive,example_table,PROD)" --sibling-urns "urn:li:dataset:(urn:li:dataPlatform:snowflake,example_table,PROD)"
```

Siblings are semantically equivalent datasets, typically representing the same data across different platforms or environments.
## Dataset YAML Format

The Dataset YAML file follows a structured format with various supported fields:

```yaml
# Basic identification (required)
id: "example_table" # Dataset identifier
platform: "hive" # Platform name
env: "PROD" # Environment (PROD by default)

# Metadata (optional)
name: "Example Table" # Display name (defaults to id if not specified)
description: "This is an example table"

# Schema definition (optional)
schema:
  fields:
    - id: "field1" # Field identifier
      type: "string" # Data type
      description: "First field" # Field description
      doc: "First field" # Alias for description
      nativeDataType: "VARCHAR" # Native platform type (defaults to type if not specified)
      nullable: false # Whether field can be null (default: false)
      label: "Field One" # Display label (optional business label for the field)
      isPartOfKey: true # Whether field is part of primary key
      isPartitioningKey: false # Whether field is a partitioning key
      jsonProps: {"customProp": "value"} # Custom JSON properties

    - id: "field2"
      type: "number"
      description: "Second field"
      nullable: true
      globalTags: ["PII", "Sensitive"]
      glossaryTerms: ["urn:li:glossaryTerm:Revenue"]
      structured_properties:
        property1: "value1"
        property2: 42
  file: example.schema.avsc # Optional schema file (required if defining tables with nested fields)

# Additional metadata (all optional)
properties: # Custom properties as key-value pairs
  origin: "external"
  pipeline: "etl_daily"

subtype: "View" # Dataset subtype
subtypes: ["View", "Materialized"] # Multiple subtypes (if only one, use subtype field instead)

downstreams: # Downstream Dataset URNs
  - "urn:li:dataset:(urn:li:dataPlatform:hive,downstream_table,PROD)"

tags: # Tags
  - "Tier1"
  - "Verified"

glossary_terms: # Associated glossary terms
  - "urn:li:glossaryTerm:Revenue"

owners: # Dataset owners
  - "jdoe" # Simple format (defaults to TECHNICAL_OWNER)
  - id: "alice" # Extended format with ownership type
    type: "BUSINESS_OWNER"

structured_properties: # Structured properties
  priority: "P1"
  cost_center: 123

external_url: "https://example.com/datasets/example_table"
```
You can also define multiple datasets in a single YAML file by using a list format:

```yaml
- id: "dataset1"
  platform: "hive"
  description: "First dataset"
  # other properties...

- id: "dataset2"
  platform: "snowflake"
  description: "Second dataset"
  # other properties...
```
### Schema Definition

You can define Dataset schema in two ways:

1. **Direct field definitions** as shown above

   > **Important limitation**: When using inline schema field definitions, only non-nested (flat) fields are currently supported. For nested or complex schemas, you must use the Avro file approach described below.

2. **Reference to an Avro schema file**:

   ```yaml
   schema:
     file: "path/to/schema.avsc"
   ```

Even when using the Avro file approach for the basic schema structure, you can still use the `fields` section to provide additional metadata like structured properties, tags, and glossary terms for your schema fields.
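
As a minimal sketch of combining the two, the file path, tag, and structured property below are illustrative (the property name and value are reused from the examples elsewhere in these docs); note that no `type` is given, since the Avro file already defines the field structure:

```yaml
schema:
  file: "path/to/schema.avsc" # Avro file provides the field structure
  fields:
    - id: "field1" # must match a field path in the Avro schema
      globalTags: ["PII"]
      structured_properties:
        io.acryl.dataManagement.deprecationDate: "2023-01-01" # illustrative property
```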

#### Schema Field Properties

The Schema Field object supports the following properties:

| Property | Type | Description |
|----------|------|-------------|
| `id` | string | Field identifier/path (required if `urn` not provided) |
| `urn` | string | URN of the schema field (required if `id` not provided) |
| `type` | string | Data type (one of the supported [Field Types](#field-types)) |
| `nativeDataType` | string | Native data type in the source platform (defaults to `type` if not specified) |
| `description` | string | Field description |
| `doc` | string | Alias for description |
| `nullable` | boolean | Whether the field can be null (default: false) |
| `label` | string | Display label for the field |
| `recursive` | boolean | Whether the field is recursive (default: false) |
| `isPartOfKey` | boolean | Whether the field is part of the primary key |
| `isPartitioningKey` | boolean | Whether the field is a partitioning key |
| `jsonProps` | object | Custom JSON properties |
| `globalTags` | array | List of tags associated with the field |
| `glossaryTerms` | array | List of glossary terms associated with the field |
| `structured_properties` | object | Structured properties for the field |
**Important Note on Schema Field Types:**

When specifying fields in the YAML file, you must follow an all-or-nothing approach with the `type` field:

- If you want the command to generate the schema for you, specify the `type` field for ALL fields.
- If you only want to add field-level metadata (like tags, glossary terms, or structured properties), do NOT specify the `type` field for ANY field.

Example of fields with only metadata (no types):

```yaml
schema:
  fields:
    - id: "field1" # Field identifier
      structured_properties:
        prop1: prop_value
    - id: "field2"
      structured_properties:
        prop1: prop_value
```
### Ownership Types

When specifying owners, the following ownership types are supported:

- `TECHNICAL_OWNER` (default)
- `BUSINESS_OWNER`
- `DATA_STEWARD`

Custom ownership types can be specified using the URN format.
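
As a sketch, an owners block might mix the built-in types with a custom one; the custom ownership type URN below is illustrative and assumes such a type has already been created in DataHub:

```yaml
owners:
  - "jdoe" # simple format, defaults to TECHNICAL_OWNER
  - id: "alice"
    type: "BUSINESS_OWNER"
  - id: "bob" # illustrative owner
    type: "urn:li:ownershipType:architect" # custom ownership type referenced by URN (assumed to exist)
```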

### Field Types

When defining schema fields, the following primitive types are supported:

- `string`
- `number`
- `int`
- `long`
- `float`
- `double`
- `boolean`
- `bytes`
- `fixed`
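
For instance, a small sketch using a couple of these types together with `nativeDataType` (the field names and native platform types here are illustrative):

```yaml
schema:
  fields:
    - id: "event_ts"
      type: "long" # one of the supported primitive types
      nativeDataType: "BIGINT" # optional, platform-specific type
    - id: "is_active"
      type: "boolean"
```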

## Implementation Notes

- URNs are generated automatically if not provided, based on the platform, id, and env values (see the example below)
- The command performs validation to ensure referenced entities (like structured properties) exist
- When updating schema fields, changes are propagated correctly to maintain consistent metadata
- The Dataset object will check for existence of entity references and will skip datasets with missing references
- When using the `sync` command with `--from-datahub`, existing YAML files will be updated with metadata from DataHub while preserving comments and structure
- For structured properties, single values are simplified (not wrapped in lists) when appropriate
- Field paths are simplified for better readability
- When specifying field types, all fields must have type information or none of them should
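
For example, URN generation maps the identification block to a dataset URN as follows (values taken from the examples earlier on this page):

```yaml
# Given this identification block ...
id: "example_table"
platform: "hive"
env: "PROD"
# ... the generated URN is:
# urn:li:dataset:(urn:li:dataPlatform:hive,example_table,PROD)
```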

docs/cli.md (+5, -9)
@@ -404,21 +404,17 @@ datahub timeline --urn "urn:li:dataset:(urn:li:dataPlatform:mysql,User.UserAccou
 
 ### dataset (Dataset Entity)
 
-The `dataset` command allows you to interact with the dataset entity.
-
-The `get` operation can be used to read in a dataset into a yaml file.
+The `dataset` command allows you to interact with Dataset entities in DataHub, including creating, updating, retrieving, and validating Dataset metadata.
 
 ```shell
-datahub dataset get --urn "$URN" --to-file "$FILE_NAME"
-```
+# Get a dataset and write to YAML file
+datahub dataset get --urn "urn:li:dataset:(urn:li:dataPlatform:hive,example_table,PROD)" --to-file dataset.yaml
 
-The `upsert` operation can be used to create a new user or update an existing one.
-
-```shell
+# Create or update dataset from YAML file
 datahub dataset upsert -f dataset.yaml
 ```
 
-An example of `dataset.yaml` would look like as in [dataset.yaml](https://github.com/datahub-project/datahub/blob/master/metadata-ingestion/examples/cli_usage/dataset/dataset.yaml).
+➡️ [Learn more about the dataset command](./cli-commands/dataset.md)
 
 ### user (User Entity)

metadata-ingestion/examples/structured_properties/click_event.avsc (+11, -1)

@@ -9,6 +9,16 @@
     { "name": "referer", "type": ["string", "null"] },
     { "name": "user_agent", "type": ["string", "null"] },
     { "name": "user_id", "type": ["string", "null"] },
-    { "name": "session_id", "type": ["string", "null"] }
+    { "name": "session_id", "type": ["string", "null"] },
+    {
+      "name": "locator", "type": {
+        "type": "record",
+        "name": "Locator",
+        "fields": [
+          { "name": "latitude", "type": "float" },
+          { "name": "longitude", "type": "float" }
+        ]
+      }
+    }
   ]
 }

metadata-ingestion/examples/structured_properties/dataset.yaml (+21, -15)

@@ -6,17 +6,22 @@
   schema:
     file: examples/structured_properties/click_event.avsc
     fields:
-      - id: ip
-      - urn: urn:li:schemaField:(urn:li:dataset:(urn:li:dataPlatform:hive,user.clicks,PROD),ip)
-        structured_properties: # structured properties for schema fields/columns go here
-          io.acryl.dataManagement.deprecationDate: "2023-01-01"
-          io.acryl.dataManagement.certifier: urn:li:corpuser:[email protected]
-          io.acryl.dataManagement.replicationSLA: 90
+      - id: ip
+        structured_properties:
+          io.acryl.dataManagement.deprecationDate: '2023-01-01'
+          io.acryl.dataManagement.certifier: urn:li:corpuser:[email protected]
+          io.acryl.dataManagement.replicationSLA: 90
+      - id: url
+        structured_properties:
+          io.acryl.dataManagement.deprecationDate: '2023-01-01'
+      - id: locator.latitude
+        structured_properties:
+          io.acryl.dataManagement.deprecationDate: '2023-01-01'
   structured_properties: # dataset level structured properties go here
     io.acryl.privacy.retentionTime: 365
     projectNames:
-    - Tracking
-    - DataHub
+      - Tracking
+      - DataHub
 - id: ClickEvent
   platform: events
   subtype: Topic
@@ -27,19 +32,20 @@
     project_name: Tracking
     namespace: org.acryl.tracking
     version: 1.0.0
-    retention: 30
+    retention: '30'
   structured_properties:
     io.acryl.dataManagement.certifier: urn:li:corpuser:[email protected]
   schema:
     file: examples/structured_properties/click_event.avsc
   downstreams:
-  - urn:li:dataset:(urn:li:dataPlatform:hive,user.clicks,PROD)
+    - urn:li:dataset:(urn:li:dataPlatform:hive,user.clicks,PROD)
 - id: user.clicks
   platform: snowflake
-  schema:
-    fields:
-      - id: user_id
-        structured_properties:
-          io.acryl.dataManagement.deprecationDate: "2023-01-01"
   structured_properties:
     io.acryl.dataManagement.replicationSLA: 90
+  schema:
+    fields:
+      - id: user_id
+        structured_properties:
+          io.acryl.dataManagement.deprecationDate: '2023-01-01'
+        type: string
