Commit a700448

feat(ingestion/business-glossary): Automatically generate predictable glossary term and node URNs when incompatible URL characters are specified in term and node names. (#12673)
1 parent 4714f46 commit a700448

13 files changed, +938 -268 lines changed

docs/how/updating-datahub.md

+2
@@ -20,6 +20,8 @@ This file documents any backwards-incompatible changes in DataHub and assists pe
2020

2121
### Breaking Changes
2222

23+
- #12673: Business Glossary ID generation now cleans names so that generated URNs are URL-safe. When `enable_auto_id` is false (default), each node and term name is cleaned (spaces become hyphens, special characters are removed, case is preserved), and the cleaned names are joined with periods, which serve only as path separators. This may result in different IDs being generated for terms with special characters.
24+
2325
- #12580: The OpenAPI source handled nesting incorrectly. #12580 fixes it to create proper nested field paths; however, this will rewrite the incorrect schemas of existing OpenAPI runs.
2426

2527
- #12408: The `platform` field in the DataPlatformInstance GraphQL type is removed. Clients need to retrieve the platform via the optional `dataPlatformInstance` field.

metadata-ingestion/docs/sources/business-glossary/datahub-business-glossary.md

+134-120
@@ -24,7 +24,8 @@ nodes: # list of child **Glossa
2424
Example **GlossaryNode**:
2525
2626
```yaml
27-
- name: Shipping # name of the node
27+
- name: "Shipping" # name of the node
28+
id: "Shipping-Logistics" # (optional) custom identifier for the node
2829
description: Provides terms related to the shipping domain # description of the node
2930
owners: # (optional) owners contains 2 nested fields
3031
users: # (optional) a list of user IDs
@@ -43,7 +44,8 @@ Example **GlossaryNode**:
4344
Example **GlossaryTerm**:
4445
4546
```yaml
46-
- name: FullAddress # name of the term
47+
- name: "Full Address" # name of the term
48+
id: "Full-Address-Details" # (optional) custom identifier for the term
4749
description: A collection of information to give the location of a building or plot of land. # description of the term
4850
owners: # (optional) owners contains 2 nested fields
4951
users: # (optional) a list of user IDs
@@ -67,10 +69,86 @@ Example **GlossaryTerm**:
6769
domain: "urn:li:domain:Logistics" # (optional) domain name or domain urn
6870
```
6971
70-
To see how these all work together, check out this comprehensive example business glossary file below:
72+
## ID Management and URL Generation
73+
74+
The business glossary provides two primary ways to manage term and node identifiers:
75+
76+
1. **Custom IDs**: You can explicitly specify an ID for any term or node using the `id` field. This is recommended for terms that need stable, predictable identifiers:
77+
```yaml
78+
terms:
79+
- name: "Response Time"
80+
id: "support-response-time" # Explicit ID
81+
description: "Target time to respond to customer inquiries"
82+
```
83+
84+
2. **Automatic ID Generation**: When no ID is specified, the system will generate one based on the `enable_auto_id` setting:
85+
- With `enable_auto_id: false` (default):
86+
- Node and term names are converted to URL-friendly format
87+
- Spaces within names are replaced with hyphens
88+
- Special characters are removed (except hyphens)
89+
- Case is preserved
90+
- Multiple hyphens are collapsed to single ones
91+
- Path components (node/term hierarchy) are joined with periods
92+
- Example: Node "Customer Support" with term "Response Time" → "Customer-Support.Response-Time"
93+
94+
- With `enable_auto_id: true`:
95+
- Generates GUID-based IDs
96+
- Recommended for guaranteed uniqueness
97+
- Required for terms with non-ASCII characters
98+
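The cleaning rules listed for `enable_auto_id: false` can be sketched in Python. This is a minimal illustration of the documented rules, not the actual DataHub implementation: the helper names `clean_component` and `path_id` are hypothetical, the exact set of retained characters is an assumption, and stripping leading or trailing hyphens is inferred from the `CSAT %` → `KPIs.CSAT` example in this document.

```python
import re


def clean_component(name: str) -> str:
    """Clean a single node or term name (hypothetical helper)."""
    cleaned = name.strip().replace(" ", "-")          # spaces become hyphens
    # Assumption: keep only ASCII letters, digits, and hyphens.
    # Periods are dropped too, since they are reserved as path separators.
    cleaned = re.sub(r"[^A-Za-z0-9-]", "", cleaned)
    cleaned = re.sub(r"-{2,}", "-", cleaned)          # collapse repeated hyphens
    return cleaned.strip("-")                         # assumption: no edge hyphens


def path_id(*names: str) -> str:
    """Join independently cleaned path components with periods."""
    return ".".join(clean_component(n) for n in names)


print(path_id("Customer Support", "Response Time"))
print(path_id("Data Services", "API @ Response"))
```

Each component is cleaned on its own before joining, so a period can only ever appear in the result as a separator between hierarchy levels.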
99+
Here's how path-based ID generation works:
100+
```yaml
101+
nodes:
102+
- name: "Customer Support" # Node ID: Customer-Support
103+
terms:
104+
- name: "Response Time" # Term ID: Customer-Support.Response-Time
105+
description: "Response SLA"
106+
107+
- name: "First Reply" # Term ID: Customer-Support.First-Reply
108+
description: "Initial response"
109+
110+
- name: "Product Feedback" # Node ID: Product-Feedback
111+
terms:
112+
- name: "Response Time" # Term ID: Product-Feedback.Response-Time
113+
description: "Feedback response"
114+
```
115+
116+
**Important Notes**:
117+
- Periods (.) are used exclusively as path separators between nodes and terms
118+
- Periods in term or node names themselves will be removed
119+
- Each component of the path (node names, term names) is cleaned independently:
120+
- Spaces to hyphens
121+
- Special characters removed
122+
- Case preserved
123+
- The cleaned components are then joined with periods to form the full path
124+
- Non-ASCII characters in any component trigger automatic GUID generation
125+
- Once an ID is created (either manually or automatically), it cannot be easily changed
126+
- All references to a term (in `inherits`, `contains`, etc.) must use its correct ID
127+
- Moving terms in the hierarchy does NOT update their IDs:
128+
- The ID retains its original path components even after moving
129+
- This can lead to IDs that don't match the current location
130+
- Consider using `enable_auto_id: true` if you plan to reorganize your glossary
131+
- For terms that other terms will reference, consider using explicit IDs or setting `enable_auto_id: true`
132+
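Because every reference must use the exact ID of its target, it can help to check a glossary for dangling references before ingesting it. The following is a rough sketch under stated assumptions: `find_dangling_references` is a hypothetical helper, not part of DataHub, and it expects the term IDs and their references to have been collected already (for example from a parsed glossary file).

```python
from typing import Dict, List, Set


def find_dangling_references(
    known_ids: Set[str], references: Dict[str, List[str]]
) -> Dict[str, List[str]]:
    """Return, per term ID, the referenced IDs that match no known term."""
    dangling: Dict[str, List[str]] = {}
    for term_id, targets in references.items():
        missing = [t for t in targets if t not in known_ids]
        if missing:
            dangling[term_id] = missing
    return dangling


# IDs taken from the path-based example in this document.
known = {
    "Customer-Support.Response-Time",
    "Customer-Support.First-Reply",
    "Product-Feedback.Response-Time",
}
refs = {
    "Customer-Support.First-Reply": ["Customer-Support.Response-Time"],
    # Hypothetical mistake: the hyphen in "Response-Time" is missing.
    "Product-Feedback.Response-Time": ["Customer-Support.ResponseTime"],
}
print(find_dangling_references(known, refs))
```

A check like this is most useful after renaming terms, since a renamed term gets a new path-based ID while references to the old ID remain in the file.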
133+
Example of how different names are handled:
134+
```yaml
135+
nodes:
136+
- name: "Data Services" # Node ID: Data-Services
137+
terms:
138+
# Basic term name
139+
- name: "Response Time" # Term ID: Data-Services.Response-Time
140+
description: "SLA metrics"
141+
142+
# Term name with special characters
143+
- name: "API @ Response" # Term ID: Data-Services.API-Response
144+
description: "API metrics"
145+
146+
# Term with non-ASCII (triggers GUID)
147+
- name: "パフォーマンス" # Term ID will be a 32-character GUID
148+
description: "Performance"
149+
```
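The non-ASCII fallback can be illustrated with a short Python sketch. Everything here is an assumption made for illustration: `needs_guid` and `sketch_guid` are hypothetical names, and MD5 is used only because its hex digest happens to be 32 characters long. The way DataHub actually derives its GUIDs may differ.

```python
import hashlib


def needs_guid(*names: str) -> bool:
    """True if any path component contains a non-ASCII character."""
    return any(not name.isascii() for name in names)


def sketch_guid(*names: str) -> str:
    """Hypothetical stand-in for a GUID: a 32-character hex digest of the path."""
    path = ".".join(names)
    return hashlib.md5(path.encode("utf-8")).hexdigest()


path = ("Data Services", "パフォーマンス")
if needs_guid(*path):
    print(len(sketch_guid(*path)))  # length of the generated identifier
```

Hashing the full path rather than the term name alone keeps the identifier deterministic while still distinguishing same-named terms under different nodes.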
71150

72-
<details>
73-
<summary>Example business glossary file</summary>
151+
To see how these all work together, check out this comprehensive example business glossary file below:
74152

75153
```yaml
76154
version: "1"
@@ -80,172 +158,108 @@ owners:
80158
- mjames
81159
url: "https://github.com/datahub-project/datahub/"
82160
nodes:
83-
- name: Classification
161+
- name: "Data Classification"
162+
id: "Data-Classification" # Custom ID for stable references
84163
description: A set of terms related to Data Classification
85164
knowledge_links:
86165
- label: Wiki link for classification
87166
url: "https://en.wikipedia.org/wiki/Classification"
88167
terms:
89-
- name: Sensitive
168+
- name: "Sensitive Data" # Will generate: Data-Classification.Sensitive-Data
90169
description: Sensitive Data
91170
custom_properties:
92171
is_confidential: "false"
93-
- name: Confidential
172+
- name: "Confidential Information" # Will generate: Data-Classification.Confidential-Information
94173
description: Confidential Data
95174
custom_properties:
96175
is_confidential: "true"
97-
- name: HighlyConfidential
176+
- name: "Highly Confidential" # Will generate: Data-Classification.Highly-Confidential
98177
description: Highly Confidential Data
99178
custom_properties:
100179
is_confidential: "true"
101180
domain: Marketing
102-
- name: PersonalInformation
181+
182+
- name: "Personal Information"
103183
description: All terms related to personal information
104184
owners:
105185
users:
106186
- mjames
107187
terms:
108-
- name: Email
109-
## An example of using an id to pin a term to a specific guid
110-
## See "how to generate custom IDs for your terms" section below
111-
# id: "urn:li:glossaryTerm:41516e310acbfd9076fffc2c98d2d1a3"
188+
- name: "Email" # Will generate: Personal-Information.Email
112189
description: An individual's email address
113190
inherits:
114-
- Classification.Confidential
191+
- Data-Classification.Confidential-Information # References the term by its generated ID
115192
owners:
116193
groups:
117194
- Trust and Safety
118-
- name: Address
195+
- name: "Address" # Will generate: Personal-Information.Address
119196
description: A physical address
120-
- name: Gender
197+
- name: "Gender" # Will generate: Personal-Information.Gender
121198
description: The gender identity of the individual
122199
inherits:
123-
- Classification.Sensitive
124-
- name: Shipping
125-
description: Provides terms related to the shipping domain
126-
owners:
127-
users:
128-
- njones
129-
groups:
130-
- logistics
131-
terms:
132-
- name: FullAddress
133-
description: A collection of information to give the location of a building or plot of land.
134-
owners:
135-
users:
136-
- njones
137-
groups:
138-
- logistics
139-
term_source: "EXTERNAL"
140-
source_ref: FIBO
141-
source_url: "https://www.google.com"
142-
inherits:
143-
- Privacy.PII
144-
contains:
145-
- Shipping.ZipCode
146-
- Shipping.CountryCode
147-
- Shipping.StreetAddress
148-
related_terms:
149-
- Housing.Kitchen.Cutlery
150-
custom_properties:
151-
- is_used_for_compliance_tracking: "true"
152-
knowledge_links:
153-
- url: "https://en.wikipedia.org/wiki/Address"
154-
label: Wiki link
155-
domain: "urn:li:domain:Logistics"
156-
knowledge_links:
157-
- label: Wiki link for shipping
158-
url: "https://en.wikipedia.org/wiki/Freight_transport"
159-
- name: ClientsAndAccounts
200+
- Data-Classification.Sensitive-Data # References the term by its generated ID
201+
202+
- name: "Clients And Accounts"
160203
description: Provides basic concepts such as account, account holder, account provider, relationship manager that are commonly used by financial services providers to describe customers and to determine counterparty identities
161204
owners:
162205
groups:
163206
- finance
207+
type: DATAOWNER
164208
terms:
165-
- name: Account
209+
- name: "Account" # Will generate: Clients-And-Accounts.Account
166210
description: Container for records associated with a business arrangement for regular transactions and services
167211
term_source: "EXTERNAL"
168212
source_ref: FIBO
169213
source_url: "https://spec.edmcouncil.org/fibo/ontology/FBC/ProductsAndServices/ClientsAndAccounts/Account"
170214
inherits:
171-
- Classification.HighlyConfidential
215+
- Data-Classification.Highly-Confidential # References parent node path
172216
contains:
173-
- ClientsAndAccounts.Balance
174-
- name: Balance
217+
- Clients-And-Accounts.Balance # References term in same node
218+
- name: "Balance" # Will generate: Clients-And-Accounts.Balance
175219
description: Amount of money available or owed
176220
term_source: "EXTERNAL"
177221
source_ref: FIBO
178222
source_url: "https://spec.edmcouncil.org/fibo/ontology/FBC/ProductsAndServices/ClientsAndAccounts/Balance"
179-
- name: Housing
180-
description: Provides terms related to the housing domain
181-
owners:
182-
users:
183-
- mjames
184-
groups:
185-
- interior
186-
nodes:
187-
- name: Colors
188-
description: "Colors that are used in Housing construction"
189-
terms:
190-
- name: Red
191-
description: "red color"
192-
term_source: "EXTERNAL"
193-
source_ref: FIBO
194-
source_url: "https://spec.edmcouncil.org/fibo/ontology/FBC/ProductsAndServices/ClientsAndAccounts/Account"
195-
196-
- name: Green
197-
description: "green color"
198-
term_source: "EXTERNAL"
199-
source_ref: FIBO
200-
source_url: "https://spec.edmcouncil.org/fibo/ontology/FBC/ProductsAndServices/ClientsAndAccounts/Account"
201-
202-
- name: Pink
203-
description: pink color
204-
term_source: "EXTERNAL"
205-
source_ref: FIBO
206-
source_url: "https://spec.edmcouncil.org/fibo/ontology/FBC/ProductsAndServices/ClientsAndAccounts/Account"
223+
224+
- name: "KPIs"
225+
description: Common Business KPIs
207226
terms:
208-
- name: WindowColor
209-
description: Supported window colors
210-
term_source: "EXTERNAL"
211-
source_ref: FIBO
212-
source_url: "https://spec.edmcouncil.org/fibo/ontology/FBC/ProductsAndServices/ClientsAndAccounts/Account"
213-
values:
214-
- Housing.Colors.Red
215-
- Housing.Colors.Pink
227+
- name: "CSAT %" # Will generate: KPIs.CSAT
228+
description: Customer Satisfaction Score
229+
```
216230

217-
- name: Kitchen
218-
description: a room or area where food is prepared and cooked.
219-
term_source: "EXTERNAL"
220-
source_ref: FIBO
221-
source_url: "https://spec.edmcouncil.org/fibo/ontology/FBC/ProductsAndServices/ClientsAndAccounts/Account"
231+
## Custom ID Specification
222232

223-
- name: Spoon
224-
description: an implement consisting of a small, shallow oval or round bowl on a long handle, used for eating, stirring, and serving food.
225-
term_source: "EXTERNAL"
226-
source_ref: FIBO
227-
source_url: "https://spec.edmcouncil.org/fibo/ontology/FBC/ProductsAndServices/ClientsAndAccounts/Account"
228-
related_terms:
229-
- Housing.Kitchen
230-
knowledge_links:
231-
- url: "https://en.wikipedia.org/wiki/Spoon"
232-
label: Wiki link
233-
```
234-
</details>
233+
Custom IDs can be specified in two ways, both of which are supported:
235234

236-
Source file linked [here](https://github.com/datahub-project/datahub/blob/master/metadata-ingestion/examples/bootstrap_data/business_glossary.yml).
235+
1. Just the ID portion (simpler approach):
236+
```yaml
237+
terms:
238+
- name: "Email"
239+
id: "company-email" # Will become urn:li:glossaryTerm:company-email
240+
description: "Company email address"
241+
```
237242

238-
## Generating custom IDs for your terms
243+
2. Full URN format:
244+
```yaml
245+
terms:
246+
- name: "Email"
247+
id: "urn:li:glossaryTerm:company-email"
248+
description: "Company email address"
249+
```
239250

240-
IDs are normally inferred from the glossary term/node's name, see the `enable_auto_id` config. But, if you need a stable
241-
identifier, you can generate a custom ID for your term. It should be unique across the entire Glossary.
251+
Both forms are equivalent. If you specify only the ID portion, the system adds the URN prefix automatically.
242252

243-
Here's an example ID:
244-
`id: "urn:li:glossaryTerm:41516e310acbfd9076fffc2c98d2d1a3"`
253+
The same applies for nodes:
254+
```yaml
255+
nodes:
256+
- name: "Communications"
257+
id: "internal-comms" # Will become urn:li:glossaryNode:internal-comms
258+
description: "Internal communication methods"
259+
```
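The prefix handling described in this section amounts to a small normalization step. A minimal Python sketch, assuming a hypothetical helper named `to_urn` (not a DataHub API) and the `urn:li:glossaryTerm:` and `urn:li:glossaryNode:` prefixes shown in the examples:

```python
def to_urn(custom_id: str, entity_type: str) -> str:
    """Return a full URN, adding the prefix only when it is missing.

    entity_type is expected to be "glossaryTerm" or "glossaryNode".
    """
    prefix = f"urn:li:{entity_type}:"
    if custom_id.startswith(prefix):
        return custom_id          # already a full URN, leave unchanged
    return prefix + custom_id


print(to_urn("company-email", "glossaryTerm"))
print(to_urn("urn:li:glossaryTerm:company-email", "glossaryTerm"))
print(to_urn("internal-comms", "glossaryNode"))
```

Because the function leaves a full URN untouched, both ways of writing the `id` field resolve to the same identifier.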
245260

246-
A note of caution: once you select a custom ID, it cannot be easily changed.
261+
Note: Once you select a custom ID, it cannot be easily changed.
247262

248263
## Compatibility
249264

250-
Compatible with version 1 of business glossary format.
251-
The source will be evolved as we publish newer versions of this format.
265+
Compatible with version 1 of business glossary format. The source will be evolved as newer versions of this format are published.
