# DataHub Garbage Collection Source Documentation

## Overview

The DataHub Garbage Collection (GC) source is a maintenance component responsible for cleaning up various types of metadata to maintain system performance and data quality. It performs multiple cleanup tasks, each focusing on a different aspect of DataHub's metadata.

## Configuration

### Example GC Configuration

```yaml
source:
  type: datahub-gc
  config:
    # Whether to run the recipe in dry-run mode or not
    dry_run: false
    # Clean up expired tokens
    cleanup_expired_tokens: true
    # Whether to truncate Elasticsearch indices that can be safely truncated
    truncate_indices: true

    # Clean up DataProcess Instances
    dataprocess_cleanup:
      enabled: true
      retention_days: 10
      # Delete empty DataJobs (those with no associated DataProcessInstance)
      delete_empty_data_jobs: true
      # Delete empty DataFlows (those with no associated DataJob)
      delete_empty_data_flows: true
      # Whether to hard delete entities or soft delete them
      hard_delete_entities: false
      # Keep the last n DataProcess Instances
      keep_last_n: 5
    soft_deleted_entities_cleanup:
      enabled: true
      # Delete soft-deleted entities that were deleted more than 10 days ago
      retention_days: 10
    execution_request_cleanup:
      # Minimum number of execution requests to keep, per ingestion source
      keep_history_min_count: 10
      # Maximum number of execution requests to keep, per ingestion source
      keep_history_max_count: 1000
      # Maximum number of days to keep execution requests for, per ingestion source
      keep_history_max_days: 30
      # Number of records per read operation
      batch_read_size: 100
      # Global switch for this cleanup task
      enabled: true
```

## Cleanup Tasks

### 1. Index Cleanup

Manages Elasticsearch indices in DataHub, particularly focusing on time-series data.

#### Configuration

```yaml
source:
  type: datahub-gc
  config:
    truncate_indices: true
    truncate_index_older_than_days: 30
    truncation_watch_until: 10000
    truncation_sleep_between_seconds: 30
```

#### Features

- Truncates old Elasticsearch indices for:
  - Dataset operations
  - Dataset usage statistics
  - Chart usage statistics
  - Dashboard usage statistics
  - Query usage statistics
- Monitors truncation progress
- Implements safe deletion with monitoring thresholds
- Supports gradual truncation with sleep intervals

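The watch/sleep behavior can be pictured as a polling loop: after issuing a truncation, repeatedly check how many documents remain until the count drops to the `truncation_watch_until` threshold, sleeping `truncation_sleep_between_seconds` between checks. A minimal sketch of that idea (function and parameter names here are illustrative, not the actual source code):

```python
import time

def wait_for_truncation(get_remaining, watch_until, sleep_seconds, max_checks=10):
    """Poll a remaining-document count until it falls to the watch threshold.

    `get_remaining` is any callable returning the number of documents still
    pending deletion; this is a hedged sketch, not DataHub's implementation.
    """
    for _ in range(max_checks):
        remaining = get_remaining()
        if remaining <= watch_until:
            return remaining  # below threshold: safe to stop watching
        time.sleep(sleep_seconds)
    return get_remaining()

# Simulated index whose pending-delete count shrinks on each poll
counts = iter([50000, 20000, 8000])
result = wait_for_truncation(lambda: next(counts), watch_until=10000, sleep_seconds=0)
print(result)  # → 8000
```

The sleep interval is what makes the truncation "gradual": it gives the cluster time to absorb deletes before the next progress check.
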
### 2. Expired Token Cleanup

Manages access tokens in DataHub to maintain security and prevent token accumulation.

#### Configuration

```yaml
source:
  type: datahub-gc
  config:
    cleanup_expired_tokens: true
```

#### Features

- Automatically identifies and revokes expired access tokens
- Processes tokens in batches for efficiency
- Maintains system security by removing outdated credentials
- Reports the number of tokens revoked
- Uses the GraphQL API for token management

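The core decision the task makes is simply "is this token past its expiry?"; the expired partition is then revoked batch by batch. A small sketch of that selection step (the `expiresAt` field name and dict shape are assumptions for illustration, not the actual GraphQL schema):

```python
from datetime import datetime, timezone

def partition_expired(tokens, now=None):
    """Split tokens into (expired, active) by their expiry timestamp.

    Illustrative sketch of the cleanup's selection logic only; revocation
    itself would go through DataHub's GraphQL API.
    """
    now = now or datetime.now(timezone.utc)
    expired = [t for t in tokens if t["expiresAt"] <= now]
    active = [t for t in tokens if t["expiresAt"] > now]
    return expired, active

now = datetime(2024, 6, 1, tzinfo=timezone.utc)
tokens = [
    {"id": "t1", "expiresAt": datetime(2024, 1, 1, tzinfo=timezone.utc)},
    {"id": "t2", "expiresAt": datetime(2025, 1, 1, tzinfo=timezone.utc)},
]
expired, active = partition_expired(tokens, now=now)
print([t["id"] for t in expired])  # → ['t1']
```
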
### 3. Data Process Cleanup

Manages the lifecycle of data processes, jobs, and their instances (DPIs) within DataHub.

#### Features

- Cleans up Data Process Instances (DPIs) based on age and count
- Can remove empty DataJobs and DataFlows
- Supports both soft and hard deletion
- Uses parallel processing for efficient cleanup
- Maintains configurable retention policies

#### Configuration

```yaml
source:
  type: datahub-gc
  config:
    dataprocess_cleanup:
      enabled: true
      retention_days: 10
      keep_last_n: 5
      delete_empty_data_jobs: false
      delete_empty_data_flows: false
      hard_delete_entities: false
      batch_size: 500
      max_workers: 10
      delay: 0.25
```

#### Limitations

- Processes at most 9000 DPIs per job, for performance reasons

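The interplay between `keep_last_n` and `retention_days` is worth spelling out: the newest N instances are always kept, and of the remainder, only those older than the retention window are deleted. A hedged sketch of that policy (names are illustrative, not the actual source code):

```python
from datetime import datetime, timedelta, timezone

def select_dpis_to_delete(instance_times, keep_last_n, retention_days, now=None):
    """Pick DPI timestamps to delete: always keep the newest `keep_last_n`,
    and of the rest, delete only those older than `retention_days`.
    Illustrative sketch of the retention policy, not DataHub's code.
    """
    now = now or datetime.now(timezone.utc)
    cutoff = now - timedelta(days=retention_days)
    newest_first = sorted(instance_times, reverse=True)
    candidates = newest_first[keep_last_n:]       # never touch the newest N
    return [t for t in candidates if t < cutoff]  # delete only stale ones

now = datetime(2024, 6, 30, tzinfo=timezone.utc)
runs = [now - timedelta(days=d) for d in (0, 2, 5, 12, 30, 90)]
stale = select_dpis_to_delete(runs, keep_last_n=5, retention_days=10, now=now)
print(len(stale))  # → 1
```

Note that with `keep_last_n: 5`, the 30-day-old run above survives even though it is past the retention window, because it is still among the five newest instances.
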
### 4. Execution Request Cleanup

Manages DataHub execution request records to prevent accumulation of historical execution data.

#### Features

- Maintains execution history per ingestion source
- Preserves a minimum number of recent requests
- Removes old requests beyond the retention period
- Special handling for running/pending requests
- Automatic cleanup of corrupted records

#### Configuration

```yaml
source:
  type: datahub-gc
  config:
    execution_request_cleanup:
      enabled: true
      keep_history_min_count: 10
      keep_history_max_count: 1000
      keep_history_max_days: 30
      batch_read_size: 100
      runtime_limit_seconds: 3600
      max_read_errors: 10
```

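The three history knobs combine per ingestion source: the newest `keep_history_min_count` requests are always protected, anything past `keep_history_max_count` is dropped regardless of age, and the rest are dropped only once older than `keep_history_max_days`. A small sketch of that decision (illustrative names, not the actual source code):

```python
from datetime import datetime, timedelta, timezone

def requests_to_delete(request_times, min_count, max_count, max_days, now=None):
    """Decide which execution requests to purge for one ingestion source.
    Hedged sketch of the retention rules described above only.
    """
    now = now or datetime.now(timezone.utc)
    cutoff = now - timedelta(days=max_days)
    newest_first = sorted(request_times, reverse=True)
    delete = []
    for i, t in enumerate(newest_first):
        if i < min_count:
            continue                 # protected by keep_history_min_count
        if i >= max_count or t < cutoff:
            delete.append(t)         # over the cap, or past retention
    return delete

now = datetime(2024, 6, 30, tzinfo=timezone.utc)
history = [now - timedelta(days=d) for d in range(0, 60, 5)]  # 12 requests
doomed = requests_to_delete(history, min_count=10, max_count=1000, max_days=30, now=now)
print(len(doomed))  # → 2
```

Here only the two requests outside the protected ten and older than 30 days are selected, even though four requests exceed the 30-day window.
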
### 5. Soft-Deleted Entities Cleanup

Manages the permanent removal of soft-deleted entities after a retention period.

#### Features

- Permanently removes soft-deleted entities after the retention period
- Handles cleanup of entity references
- Special handling for query entities
- Supports filtering by entity type, platform, or environment
- Concurrent processing with safety limits

#### Configuration

```yaml
source:
  type: datahub-gc
  config:
    soft_deleted_entities_cleanup:
      enabled: true
      retention_days: 10
      batch_size: 500
      max_workers: 10
      delay: 0.25
      entity_types: null # Optional list of entity types to clean
      platform: null # Optional platform filter
      env: null # Optional environment filter
      query: null # Optional custom query filter
      limit_entities_delete: 25000
      futures_max_at_time: 1000
      runtime_limit_seconds: 7200
```

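The `max_workers` and `futures_max_at_time` settings suggest a thread pool whose in-flight work is capped so that submissions cannot outrun completions. A minimal sketch of that throttling pattern, assuming a stand-in `delete_fn` in place of the real hard-delete call:

```python
from concurrent.futures import FIRST_COMPLETED, ThreadPoolExecutor, wait

def delete_concurrently(urns, delete_fn, max_workers=10, futures_max_at_time=1000):
    """Submit deletions to a thread pool while capping in-flight futures.
    Illustrative sketch of the safety limit, not DataHub's implementation.
    """
    deleted = 0
    pending = set()
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        for urn in urns:
            if len(pending) >= futures_max_at_time:
                # Block until at least one deletion finishes before submitting more
                done, pending = wait(pending, return_when=FIRST_COMPLETED)
                deleted += len(done)
            pending.add(pool.submit(delete_fn, urn))
        done, _ = wait(pending)  # drain whatever is still running
        deleted += len(done)
    return deleted

n = delete_concurrently([f"urn:li:dataset:{i}" for i in range(25)],
                        delete_fn=lambda urn: None,
                        max_workers=4, futures_max_at_time=8)
print(n)  # → 25
```

Capping pending futures keeps memory bounded and, together with the `delay` setting, acts as back-pressure on the backend.
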
### Performance Considerations

- Concurrent processing using thread pools
- Configurable batch sizes for optimal performance
- Rate limiting through configurable delays
- Maximum limits on concurrent operations

## Reporting

Each cleanup task maintains detailed reports, including:

- Number of entities processed
- Number of entities removed
- Errors encountered
- A sample of affected entities
- Runtime statistics
- Task-specific metrics
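The report fields above can be pictured as a simple record per task; a hedged sketch (field names are illustrative, not the actual report class in the source):

```python
from dataclasses import dataclass, field

@dataclass
class CleanupReport:
    """Illustrative shape of a per-task cleanup report."""
    entities_processed: int = 0
    entities_removed: int = 0
    errors: list = field(default_factory=list)
    sample_removed_entities: list = field(default_factory=list)
    runtime_seconds: float = 0.0

report = CleanupReport(entities_processed=120, entities_removed=45)
print(report.entities_removed)  # → 45
```
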
