|
| 1 | +# DataHub Garbage Collection Source Documentation |
| 2 | + |
| 3 | +## Overview |
| 4 | + |
| 5 | +The DataHub Garbage Collection (GC) source is a maintenance component that cleans up stale metadata to maintain system performance and data quality. It runs five cleanup tasks, each covering a different part of DataHub's metadata: index cleanup, expired token cleanup, data process cleanup, execution request cleanup, and soft-deleted entity cleanup. |
| 6 | + |
| 7 | +## Configuration |
| 8 | + |
| 9 | +### Example GC Configuration |
| 10 | +```yaml |
| 11 | +source: |
| 12 | +  type: datahub-gc |
| 13 | +  config: |
| 14 | +    # Whether to run the recipe in dry-run mode |
| 15 | +    dry_run: false |
| 16 | +    # Clean up expired tokens |
| 17 | +    cleanup_expired_tokens: true |
| 18 | +    # Whether to truncate the Elasticsearch indices that can be safely truncated |
| 19 | +    truncate_indices: true |
| 20 | + |
| 21 | +    # Clean up DataProcess Instances |
| 22 | +    dataprocess_cleanup: |
| 23 | +      enabled: true |
| 24 | +      retention_days: 10 |
| 25 | +      # Delete empty DataJobs (those with no associated DataProcessInstance) |
| 26 | +      delete_empty_data_jobs: true |
| 27 | +      # Delete empty DataFlows (those with no associated DataJob) |
| 28 | +      delete_empty_data_flows: true |
| 29 | +      # Whether to hard delete entities or soft delete them |
| 30 | +      hard_delete_entities: false |
| 31 | +      # Keep the last n DataProcess Instances |
| 32 | +      keep_last_n: 5 |
| 33 | +    soft_deleted_entities_cleanup: |
| 34 | +      enabled: true |
| 35 | +      # Delete soft-deleted entities that were deleted more than 10 days ago |
| 36 | +      retention_days: 10 |
| 37 | +    execution_request_cleanup: |
| 38 | +      # Minimum number of execution requests to keep, per ingestion source |
| 39 | +      keep_history_min_count: 10 |
| 40 | +      # Maximum number of execution requests to keep, per ingestion source |
| 41 | +      keep_history_max_count: 1000 |
| 42 | +      # Maximum number of days to keep execution requests for, per ingestion source |
| 43 | +      keep_history_max_days: 30 |
| 44 | +      # Number of records per read operation |
| 45 | +      batch_read_size: 100 |
| 46 | +      # Global switch for this cleanup task |
| 47 | +      enabled: true |
| 48 | +``` |
| 49 | +
|
| 50 | +## Cleanup Tasks |
| 51 | +
|
| 52 | +### 1. Index Cleanup |
| 53 | +
|
| 54 | +Manages Elasticsearch indices in DataHub, particularly focusing on time-series data. |
| 55 | +
|
| 56 | +#### Configuration |
| 57 | +```yaml |
| 58 | +source: |
| 59 | +  type: datahub-gc
| 60 | +  config:
| 61 | +    truncate_indices: true
| 62 | +    truncate_index_older_than_days: 30
| 63 | +    truncation_watch_until: 10000
| 64 | +    truncation_sleep_between_seconds: 30
| 65 | +``` |
| 66 | +
|
| 67 | +#### Features |
| 68 | +- Truncates old Elasticsearch indices for: |
| 69 | + - Dataset operations |
| 70 | + - Dataset usage statistics |
| 71 | + - Chart usage statistics |
| 72 | + - Dashboard usage statistics |
| 73 | + - Query usage statistics |
| 74 | +- Monitors truncation progress |
| 75 | +- Implements safe deletion with monitoring thresholds |
| 76 | +- Supports gradual truncation with sleep intervals |
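
The monitoring behaviour listed above can be pictured as a polling loop. The sketch below is not DataHub's implementation: it assumes `truncation_watch_until` is a threshold on the number of documents still waiting to be removed and `truncation_sleep_between_seconds` is the pause between checks. The names `wait_for_truncation`, `count_remaining`, and `max_polls` are hypothetical and exist only for illustration.

```python
import time
from typing import Callable


def wait_for_truncation(
    count_remaining: Callable[[], int],
    watch_until: int = 10000,
    sleep_between_seconds: float = 30,
    max_polls: int = 1000,
    sleep: Callable[[float], None] = time.sleep,
) -> int:
    """Poll until at most `watch_until` documents remain.

    Returns the number of polls that were needed. `count_remaining` is a
    hypothetical callback that reports how many documents are still pending.
    """
    for poll in range(1, max_polls + 1):
        if count_remaining() <= watch_until:
            return poll
        # Back off before checking again so the cluster is not hammered.
        sleep(sleep_between_seconds)
    raise TimeoutError(
        f"more than {watch_until} documents remain after {max_polls} polls"
    )


# Demo with a fake counter and a no-op sleep so it finishes instantly.
counts = iter([50000, 20000, 8000])
print(wait_for_truncation(lambda: next(counts), sleep=lambda _s: None))  # prints 3
```

Injecting `sleep` as a parameter keeps the loop testable without real waiting; a production version would also want to log progress on each poll.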
| 77 | +
|
| 78 | +### 2. Expired Token Cleanup |
| 79 | +
|
| 80 | +Manages access tokens in DataHub to maintain security and prevent token accumulation. |
| 81 | +
|
| 82 | +#### Configuration |
| 83 | +```yaml |
| 84 | +source: |
| 85 | +  type: datahub-gc
| 86 | +  config:
| 87 | +    cleanup_expired_tokens: true
| 88 | +``` |
| 89 | +
|
| 90 | +#### Features |
| 91 | +- Automatically identifies and revokes expired access tokens |
| 92 | +- Processes tokens in batches for efficiency |
| 93 | +- Maintains system security by removing outdated credentials |
| 94 | +- Reports number of tokens revoked |
| 95 | +- Uses GraphQL API for token management |
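
As a rough illustration of the first two features, the sketch below picks out expired tokens and groups them into batches. It does not call DataHub's GraphQL API. The `Token` shape, the `expires_at_ms` field, and both function names are assumptions made for this example.

```python
from typing import Iterable, Iterator, List, Optional, TypedDict


class Token(TypedDict):
    """Hypothetical token record; real token metadata may look different."""

    id: str
    expires_at_ms: Optional[int]  # None means the token never expires


def expired_token_ids(tokens: Iterable[Token], now_ms: int) -> List[str]:
    """Return the ids of tokens whose expiry time lies before `now_ms`."""
    return [
        t["id"]
        for t in tokens
        if t["expires_at_ms"] is not None and t["expires_at_ms"] < now_ms
    ]


def batches(items: List[str], size: int) -> Iterator[List[str]]:
    """Yield consecutive slices of at most `size` items."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


tokens: List[Token] = [
    {"id": "a", "expires_at_ms": 100},
    {"id": "b", "expires_at_ms": None},
    {"id": "c", "expires_at_ms": 300},
]
for batch in batches(expired_token_ids(tokens, now_ms=200), size=50):
    print("would revoke", batch)
```

Each batch would then be handed to whatever revocation call the deployment uses.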
| 96 | +
|
| 97 | +### 3. Data Process Cleanup |
| 98 | +
|
| 99 | +Manages the lifecycle of data processes, jobs, and their instances (DPIs) within DataHub. |
| 100 | +
|
| 101 | +#### Features |
| 102 | +- Cleans up Data Process Instances (DPIs) based on age and count |
| 103 | +- Can remove empty DataJobs and DataFlows |
| 104 | +- Supports both soft and hard deletion |
| 105 | +- Uses parallel processing for efficient cleanup |
| 106 | +- Maintains configurable retention policies |
| 107 | +
|
| 108 | +#### Configuration |
| 109 | +```yaml |
| 110 | +source: |
| 111 | +  type: datahub-gc
| 112 | +  config:
| 113 | +    dataprocess_cleanup:
| 114 | +      enabled: true
| 115 | +      retention_days: 10
| 116 | +      keep_last_n: 5
| 117 | +      delete_empty_data_jobs: false
| 118 | +      delete_empty_data_flows: false
| 119 | +      hard_delete_entities: false
| 120 | +      batch_size: 500
| 121 | +      max_workers: 10
| 122 | +      delay: 0.25
| 123 | +``` |
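
To make the interaction between `retention_days` and `keep_last_n` concrete, here is a minimal sketch of one plausible selection rule. It assumes the newest `keep_last_n` instances of a job are always kept, and that the rest are removed only once they are older than `retention_days`. The actual source may combine the two settings differently; `Dpi` and `select_dpis_to_delete` are hypothetical names.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone
from typing import List, Sequence


@dataclass(frozen=True)
class Dpi:
    """Hypothetical stand-in for a Data Process Instance."""

    urn: str
    created: datetime


def select_dpis_to_delete(
    dpis: Sequence[Dpi],
    now: datetime,
    retention_days: int = 10,
    keep_last_n: int = 5,
) -> List[Dpi]:
    """Pick the DPIs of a single job that are eligible for deletion."""
    cutoff = now - timedelta(days=retention_days)
    newest_first = sorted(dpis, key=lambda d: d.created, reverse=True)
    # The newest `keep_last_n` instances are protected regardless of age.
    candidates = newest_first[keep_last_n:]
    # Of the remainder, only instances older than the cutoff are selected.
    return [d for d in candidates if d.created < cutoff]


now = datetime.now(timezone.utc)
runs = [Dpi(f"run-{age}", now - timedelta(days=age)) for age in (1, 5, 12, 20, 40)]
print([d.urn for d in select_dpis_to_delete(runs, now, keep_last_n=2)])
```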
| 124 | +
|
| 125 | +#### Limitations
| 126 | +
|
| 127 | +- At most 9000 DPIs are handled per job, a cap that exists for performance reasons
| 128 | +
|
| 129 | +
|
| 130 | +### 4. Execution Request Cleanup |
| 131 | +
|
| 132 | +Manages DataHub execution request records to prevent accumulation of historical execution data. |
| 133 | +
|
| 134 | +#### Features |
| 135 | +- Maintains execution history per ingestion source |
| 136 | +- Preserves minimum number of recent requests |
| 137 | +- Removes old requests beyond retention period |
| 138 | +- Special handling for running/pending requests |
| 139 | +- Automatic cleanup of corrupted records |
| 140 | +
|
| 141 | +#### Configuration |
| 142 | +```yaml |
| 143 | +source: |
| 144 | +  type: datahub-gc
| 145 | +  config:
| 146 | +    execution_request_cleanup:
| 147 | +      enabled: true
| 148 | +      keep_history_min_count: 10
| 149 | +      keep_history_max_count: 1000
| 150 | +      keep_history_max_days: 30
| 151 | +      batch_read_size: 100
| 152 | +      runtime_limit_seconds: 3600
| 153 | +      max_read_errors: 10
| 154 | +``` |
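
The three `keep_history_*` settings can be read as a per-request decision. The sketch below shows one way they might combine, under these assumptions: requests are ranked newest first within each ingestion source, the minimum count always wins, the maximum count always loses, and running or pending requests are exempt from the age rule. None of this is confirmed by the source; `should_delete` and the status strings are hypothetical.

```python
from datetime import datetime, timedelta, timezone

# Assumed status labels for requests that are still in flight.
ACTIVE_STATUSES = {"RUNNING", "PENDING"}


def should_delete(
    position: int,
    requested_at: datetime,
    status: str,
    now: datetime,
    keep_history_min_count: int = 10,
    keep_history_max_count: int = 1000,
    keep_history_max_days: int = 30,
) -> bool:
    """Decide whether one execution request should be removed.

    `position` is the request's 0-based rank within its ingestion source,
    with 0 being the most recent request.
    """
    if position < keep_history_min_count:
        return False  # always preserve the most recent requests
    if position >= keep_history_max_count:
        return True  # history beyond the hard cap is dropped
    if status in ACTIVE_STATUSES:
        return False  # assumed exemption for in-flight requests
    return requested_at < now - timedelta(days=keep_history_max_days)


now = datetime.now(timezone.utc)
print(should_delete(50, now - timedelta(days=31), "SUCCESS", now))  # prints True
```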
| 155 | +
|
| 156 | +### 5. Soft-Deleted Entities Cleanup |
| 157 | +
|
| 158 | +Manages the permanent removal of soft-deleted entities after a retention period. |
| 159 | +
|
| 160 | +#### Features |
| 161 | +- Permanently removes soft-deleted entities after retention period |
| 162 | +- Handles entity references cleanup |
| 163 | +- Special handling for query entities |
| 164 | +- Supports filtering by entity type, platform, or environment |
| 165 | +- Concurrent processing with safety limits |
| 166 | +
|
| 167 | +#### Configuration |
| 168 | +```yaml |
| 169 | +source: |
| 170 | +  type: datahub-gc
| 171 | +  config:
| 172 | +    soft_deleted_entities_cleanup:
| 173 | +      enabled: true
| 174 | +      retention_days: 10
| 175 | +      batch_size: 500
| 176 | +      max_workers: 10
| 177 | +      delay: 0.25
| 178 | +      entity_types: null  # Optional list of entity types to clean
| 179 | +      platform: null  # Optional platform filter
| 180 | +      env: null  # Optional environment filter
| 181 | +      query: null  # Optional custom query filter
| 182 | +      limit_entities_delete: 25000
| 183 | +      futures_max_at_time: 1000
| 184 | +      runtime_limit_seconds: 7200
| 185 | +``` |
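
A small sketch of how `retention_days`, `entity_types`, and `limit_entities_delete` could narrow the set of entities to remove. It assumes the limit caps the number of entities selected in a single run and that a `null` entity type list means "all types". `SoftDeleted` and `select_for_hard_delete` are hypothetical names, not part of DataHub.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone
from typing import Iterable, List, Optional, Sequence


@dataclass(frozen=True)
class SoftDeleted:
    """Hypothetical record of a soft-deleted entity."""

    urn: str
    entity_type: str
    deleted_at: datetime


def select_for_hard_delete(
    entities: Iterable[SoftDeleted],
    now: datetime,
    retention_days: int = 10,
    entity_types: Optional[Sequence[str]] = None,
    limit_entities_delete: int = 25000,
) -> List[str]:
    """Return URNs past the retention window, capped at the per-run limit."""
    cutoff = now - timedelta(days=retention_days)
    selected: List[str] = []
    for entity in entities:
        if len(selected) >= limit_entities_delete:
            break  # stop once the per-run cap is reached
        if entity_types is not None and entity.entity_type not in entity_types:
            continue  # filtered out by type
        if entity.deleted_at < cutoff:
            selected.append(entity.urn)
    return selected


now = datetime.now(timezone.utc)
demo = [SoftDeleted("urn:a", "dataset", now - timedelta(days=30))]
print(select_for_hard_delete(demo, now))  # prints ['urn:a']
```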
| 186 | +
|
| 187 | +### Performance Considerations |
| 188 | +- Concurrent processing using thread pools |
| 189 | +- Configurable batch sizes for optimal performance |
| 190 | +- Rate limiting through configurable delays |
| 191 | +- Maximum limits on concurrent operations |
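
The points above describe a common pattern: a thread pool with a cap on in-flight work and a pause between submissions. The sketch below shows that pattern with Python's standard library. It is an illustration only, and it assumes `max_workers`, `futures_max_at_time`, and `delay` behave as their names suggest; `run_bounded` is a hypothetical helper.

```python
import time
from concurrent.futures import FIRST_COMPLETED, Future, ThreadPoolExecutor, wait
from typing import Callable, Iterable, List, Set, TypeVar

T = TypeVar("T")
R = TypeVar("R")


def run_bounded(
    items: Iterable[T],
    work: Callable[[T], R],
    max_workers: int = 10,
    futures_max_at_time: int = 1000,
    delay: float = 0.0,
) -> List[R]:
    """Run `work` over `items` while capping the number of pending futures.

    Results are returned in completion order, not input order.
    """
    results: List[R] = []
    pending: Set[Future] = set()
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        for item in items:
            # Wait for at least one future to finish while the set is full.
            while len(pending) >= futures_max_at_time:
                done, pending = wait(pending, return_when=FIRST_COMPLETED)
                results.extend(f.result() for f in done)
            pending.add(pool.submit(work, item))
            if delay:
                time.sleep(delay)  # simple rate limiting between submissions
        # Drain whatever is still running.
        done, _ = wait(pending)
        results.extend(f.result() for f in done)
    return results


print(sorted(run_bounded(range(5), lambda x: x * x, max_workers=2, futures_max_at_time=2)))
# prints [0, 1, 4, 9, 16]
```

Capping pending futures bounds memory when the input is very large, at the cost of occasionally stalling the submitting thread.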
| 192 | +
|
| 193 | +## Reporting |
| 194 | +
|
| 195 | +Each cleanup task maintains detailed reports including: |
| 196 | +- Number of entities processed |
| 197 | +- Number of entities removed |
| 198 | +- Errors encountered |
| 199 | +- Sample of affected entities |
| 200 | +- Runtime statistics |
| 201 | +- Task-specific metrics |
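
For orientation, a report with the fields listed above might be modelled as a small dataclass. This is a hypothetical shape, not the report classes DataHub actually ships; the field names and the `max_sample` cap are assumptions.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class CleanupReport:
    """Hypothetical per-task report with counters and a bounded sample."""

    processed: int = 0
    removed: int = 0
    errors: List[str] = field(default_factory=list)
    sample_removed: List[str] = field(default_factory=list)
    max_sample: int = 20

    def record_removed(self, urn: str) -> None:
        self.processed += 1
        self.removed += 1
        # Keep only a small sample so the report stays compact.
        if len(self.sample_removed) < self.max_sample:
            self.sample_removed.append(urn)

    def record_skipped(self) -> None:
        self.processed += 1

    def record_error(self, urn: str, message: str) -> None:
        self.processed += 1
        self.errors.append(f"{urn}: {message}")


report = CleanupReport(max_sample=2)
for urn in ("urn:a", "urn:b", "urn:c"):
    report.record_removed(urn)
report.record_error("urn:d", "timeout")
print(report.processed, report.removed, report.sample_removed, report.errors)
```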