
Commit 8335c20

david-leifker authored and shirshanka committed
docs(restore-indices): added best practices (datahub-project#12741)
1 parent c615740 commit 8335c20

1 file changed: +169 −10

docs/how/restore-indices.md

@@ -1,19 +1,22 @@
# Restoring Search and Graph Indices from Local Database

If search infrastructure (Elasticsearch/Opensearch) or graph services (Elasticsearch/Opensearch/Neo4j) become inconsistent,
you can restore them from the aspects stored in the local database.

When a new version of the aspect gets ingested, GMS initiates an MCL event for the aspect which is consumed to update
the search and graph indices. As such, we can fetch the latest version of each aspect in the local database and produce
MCL events corresponding to the aspects to restore the search and graph indices.

By default, restoring the indices from the local database will not remove any existing documents in
the search and graph indices that no longer exist in the local database, potentially leading to inconsistencies
between the search and graph indices and the local database.

## Configuration

The upgrade jobs take arguments as command line args to the job itself rather than environment variables for job specific
configuration. The RestoreIndices job is specified through the `-u RestoreIndices` upgrade ID parameter and then additional
parameters are specified like `-a batchSize=1000`.
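
For example, on a docker-compose style deployment the invocation might look like this (a sketch; it assumes the `datahub-upgrade.sh` helper from the source repository described in the Docker-compose section below, and the `batchDelayMs` value is illustrative):

```shell
# Sketch: run the RestoreIndices upgrade with a custom batch size and delay between batches.
# Assumes the datahub-upgrade helper script from the DataHub source repository.
./docker/datahub-upgrade/datahub-upgrade.sh -u RestoreIndices -a batchSize=1000 -a batchDelayMs=100
```
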
The following configurations are available:

### Time-Based Filtering
@@ -43,7 +46,9 @@ The following configurations are available:

These are available in the helm charts as configurations for Kubernetes deployments under the `datahubUpgrade.restoreIndices.args` path, which will set them up as args to the pod command.
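
If you manage the deployment with Helm, the same args can also be set from the command line rather than in `values.yaml` (a sketch; the release name `datahub`, the `datahub/datahub` chart, and Helm >= 3.10 for `--set-json` are assumptions about your setup):

```shell
# Sketch: override the restore-indices args via Helm instead of editing values.yaml.
# Release name, chart reference, and argument values are illustrative.
helm upgrade datahub datahub/datahub --reuse-values \
  --set-json 'datahubUpgrade.restoreIndices.args=["-a", "batchSize=1000", "-a", "batchDelayMs=100"]'
```
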

## Execution Methods

### Quickstart

If you're using the quickstart images, you can use the `datahub` cli to restore the indices.

@@ -57,7 +62,7 @@ Using the `datahub` CLI to restore the indices when using the quickstart images

See [this section](../quickstart.md#restore-datahub) for more information.

### Docker-compose

If you are on a custom docker-compose deployment, run the following command (you need to check out [the source repository](https://github.com/datahub-project/datahub)) from the root of the repo to send an MCL for each aspect in the local database.

@@ -78,7 +83,7 @@ If you need to clear the search and graph indices before restoring, add `-a clea
Refer to this [doc](../../docker/datahub-upgrade/README.md#environment-variables) on how to set environment variables
for your environment.

### Kubernetes

Run `kubectl get cronjobs` to see if the restoration job template has been deployed. If you see results like below, you
are good to go.
@@ -120,6 +125,160 @@ datahubUpgrade:
- "clean"
```

### Through APIs

See also the [Best Practices](#best-practices) section below. Note that the APIs can only handle a few thousand
aspects; in this mode one of the GMS instances performs the required actions and is subject to timeouts. Use one of the
approaches above for longer-running restoreIndices operations.

#### OpenAPI

There are two primary APIs: one exposes the common parameters for restoreIndices, and the other accepts a list of URNs
for which all aspects are to be restored.

Full configuration:

<p align="center">
<img width="80%" src="https://github.com/datahub-project/static-assets/blob/main/imgs/how/restore-indices/openapi-restore-indices.png?raw=true"/>
</p>

All Aspects:

<p align="center">
<img width="80%" src="https://github.com/datahub-project/static-assets/blob/main/imgs/how/restore-indices/openapi-restore-indices-urns.png?raw=true"/>
</p>

#### Rest.li

For Rest.li, see [Restore Indices API](../api/restli/restore-indices.md).

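
As a quick illustration, a single-URN restore through the Rest.li endpoint could look like the following (a sketch; it assumes GMS is reachable at `localhost:8080`, that `$DATAHUB_TOKEN` holds a personal access token, and uses a sample dataset URN — see the linked page for the authoritative parameter list):

```shell
# Sketch: restore indices for one (illustrative) dataset URN via the Rest.li action endpoint.
curl -X POST 'http://localhost:8080/aspects?action=restoreIndices' \
  -H "Authorization: Bearer $DATAHUB_TOKEN" \
  -H 'Content-Type: application/json' \
  --data-raw '{
    "urn": "urn:li:dataset:(urn:li:dataPlatform:hive,SampleHiveDataset,PROD)"
  }'
```
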
## Best Practices

In general, this process does not need to be run unless there has been a disruption of storage services or infrastructure,
such as Elasticsearch/Opensearch cluster failures, data corruption events, or significant version upgrade inconsistencies
that have caused the search and graph indices to become out of sync with the local database.

### Kubernetes Job vs. API

#### When to Use Kubernetes Jobs

For operations affecting 2,000 or more aspects, it's strongly recommended to use the Kubernetes job approach. This job is
designed for long-running processes and provides several advantages:

* Won't time out like API calls
* Can be monitored through Kubernetes logging
* Won't consume resources from your primary GMS instances
* Can be scheduled during off-peak hours to minimize system impact

#### When to Use APIs

The RestoreIndices APIs (available through both Rest.li and OpenAPI) are best suited for:

* Targeted restores affecting fewer than 2,000 aspects
* Emergencies where you need to quickly restore critical metadata
* Testing or validating the restore process before running a full-scale job
* Scenarios where you don't have direct access to the Kubernetes cluster

Remember that API-based restoration runs within one of your GMS instances and is subject to timeouts, which could lead to
incomplete restorations for larger installations.

### Targeted Restoration Strategies

Being selective about what you restore is crucial for efficiency. Combining these filtering strategies can dramatically
reduce the restoration scope, saving resources and time.

#### Entity Type Filtering

Use the `urnLike` parameter to target specific entity types (an example invocation follows the list):

* For datasets: `urnLike=urn:li:dataset:%`
* For users: `urnLike=urn:li:corpuser:%`
* For dashboards: `urnLike=urn:li:dashboard:%`

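
For instance, a restore limited to dataset entities might look like this (a sketch; it reuses the docker-compose helper script shown earlier, and the batch size is illustrative):

```shell
# Sketch: restore indices only for dataset entities.
./docker/datahub-upgrade/datahub-upgrade.sh -u RestoreIndices \
  -a batchSize=1000 \
  -a 'urnLike=urn:li:dataset:%'
```
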
#### Single Entity

When only one entity is affected, provide the specific URN to minimize processing overhead.

#### Aspect-Based Filtering

Use `aspectNames` to target only the specific aspects that need restoration (a combined example follows the list):

* For ownership inconsistencies: `aspectNames=ownership`
* For tag issues: `aspectNames=globalTags`

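
Combining the two, restoring only the ownership aspect of a single entity could look like this (a sketch; the URN is illustrative, and the `urn` parameter name is an assumption — check the parameter list in the Configuration section for your version):

```shell
# Sketch: restore only the ownership aspect of one (illustrative) dataset URN.
# The "urn" parameter name is an assumption; verify against the configuration reference above.
./docker/datahub-upgrade/datahub-upgrade.sh -u RestoreIndices \
  -a 'urn=urn:li:dataset:(urn:li:dataPlatform:hive,SampleHiveDataset,PROD)' \
  -a aspectNames=ownership
```
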
#### Time-Based

If you know when the inconsistency began, use time-based filtering (example below):

* `gePitEpochMs={timestamp}` to process only records created after the incident
* `lePitEpochMs={timestamp}` to limit processing to records before a certain time

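
For example, to reprocess only aspects written after an incident began (a sketch; GNU `date` is assumed for the epoch-millisecond conversion, and the timestamp is illustrative):

```shell
# Sketch: restore only aspects written after 2024-06-01 00:00 UTC (illustrative incident start).
# GNU date is assumed for the epoch conversion.
SINCE_MS=$(( $(date -u -d '2024-06-01T00:00:00Z' +%s) * 1000 ))
./docker/datahub-upgrade/datahub-upgrade.sh -u RestoreIndices \
  -a "gePitEpochMs=${SINCE_MS}"
```
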
### Parallel Processing Strategies

To optimize restoration speed while managing system load:

#### Multiple Parallel Jobs

Run several restoreIndices processes simultaneously by:

* Working on non-overlapping sets of aspects or entities
* Dividing work by entity type (one job for datasets, another for users, etc.)
* Splitting aspects among different jobs (one for ownership aspects, another for lineage, etc.)
* Partitioning large entity types by prefix or time range
* Staggering start times to prevent initial resource contention

Monitor system metrics closely during concurrent restoration to ensure you're not overloading your infrastructure.

:::caution
Avoid conflicts: ensure that concurrent jobs work on non-overlapping sets of aspects or entities, and never specify the `-a clean` argument in concurrent jobs.
:::

### Temporary Workload Reduction

* Pause scheduled ingestion jobs during restoration
* Temporarily disable or reduce the frequency of the `datahub-gc` job to prevent conflicting deletes
* Consider pausing automated workflows or integrations that generate metadata events

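
On Kubernetes, one way to pause the garbage-collection job is to suspend its cronjob for the duration of the restore (a sketch; the cronjob name depends on your Helm release, so verify it with `kubectl get cronjobs` first):

```shell
# Sketch: suspend the datahub-gc cronjob during restoration, then re-enable it afterwards.
# The cronjob name is illustrative; check `kubectl get cronjobs` for the actual name.
kubectl patch cronjob datahub-datahub-gc -p '{"spec": {"suspend": true}}'
# ... run the restore ...
kubectl patch cronjob datahub-datahub-gc -p '{"spec": {"suspend": false}}'
```
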
### Infrastructure Tuning

Implementing these expanded best practices should help ensure a smoother, more efficient restoration process while
minimizing impact on your DataHub environment.

This operation can be I/O intensive on the SQL read side and on the Elasticsearch write side. If you rely on
provisioned I/O or throughput, monitor your infrastructure for possible saturation of those limits.

#### Elasticsearch/Opensearch Optimization

To improve write performance during restoration:

##### Refresh Interval Adjustment:

Temporarily increase the `refresh_interval` setting from the default (typically 1s) to something like 30s or 60s.
Run the system update job with the following environment variable: `ELASTICSEARCH_INDEX_BUILDER_REFRESH_INTERVAL_SECONDS=60`

:::caution
Remember to reset this after restoration completes!
:::

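
If you need to inspect or reset the interval directly on the search cluster, the standard index-settings API can be used (a sketch; it assumes the cluster is reachable at `localhost:9200` — scope the index pattern to your DataHub indices as appropriate):

```shell
# Sketch: check the current refresh_interval, then reset it to the default after restoration.
# Cluster address is illustrative; narrow the index pattern if you don't want to touch every index.
curl -s 'http://localhost:9200/_all/_settings/index.refresh_interval?pretty'
curl -s -X PUT 'http://localhost:9200/_all/_settings' \
  -H 'Content-Type: application/json' \
  -d '{"index": {"refresh_interval": "1s"}}'
```
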
##### Bulk Processing Improvements:

* Adjust the Elasticsearch batching parameters to optimize bulk request size (try values between 2000-5000)
* Run your GMS or `mae-consumer` with environment variables
  * `ES_BULK_REQUESTS_LIMIT=3000`
  * `ES_BULK_FLUSH_PERIOD=60`
* Configure `batchDelayMs` on restoreIndices to add breathing room between batches if your cluster is struggling

##### Shard Management:

* Ensure your indices have an appropriate number of shards for your cluster size.
* Consider temporarily adding nodes to your search cluster during massive restorations.

#### SQL/Primary Storage

Consider using a read replica as the source of the job's data.

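
A minimal sketch of what this could look like, assuming the standard `EBEAN_DATASOURCE_*` environment variables used by the upgrade job (hostnames and database names are illustrative; see the environment-variables doc linked above for where to set them):

```shell
# Sketch: env-file entries pointing the restore job's SQL reads at a read replica.
# Hostname and database name are illustrative.
EBEAN_DATASOURCE_HOST=mysql-replica.internal:3306
EBEAN_DATASOURCE_URL=jdbc:mysql://mysql-replica.internal:3306/datahub
```
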
#### Kafka & Consumers

##### Partition Strategy:

* Verify that the Kafka Metadata Change Log (MCL) topic has enough partitions to allow for parallel processing.
* Recommended: At least 10-20 partitions for the MCL topic in production environments.

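
You can check the current partition count with the standard Kafka CLI (a sketch; `MetadataChangeLog_Versioned_v1` is the default MCL topic name and the broker address is illustrative):

```shell
# Sketch: inspect the partition count of the MCL topic.
# Default topic name assumed; broker address is illustrative.
kafka-topics.sh --bootstrap-server broker:9092 \
  --describe --topic MetadataChangeLog_Versioned_v1
```
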
##### Consumer Scaling:

* Temporarily increase the number of `mae-consumer` pods to process the higher event volume.
* Scale GMS instances if they're handling consumer duties.

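
On Kubernetes this can be as simple as scaling the standalone consumer deployment (a sketch; the deployment name depends on your Helm release and on whether standalone consumers are enabled):

```shell
# Sketch: temporarily scale the standalone MAE/MCL consumer (deployment name illustrative).
# Keep replicas at or below the MCL topic's partition count; extra consumers would sit idle.
kubectl scale deployment datahub-datahub-mae-consumer --replicas=3
```
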
##### Monitoring:

* Watch consumer lag metrics closely during restoration.
* Be prepared to adjust scaling or batch parameters if consumers fall behind.
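
If you don't have lag dashboards in place, the Kafka CLI can report lag per consumer group (a sketch; the group name shown is a common default for the MAE/MCL consumer, so list the groups first to confirm it):

```shell
# Sketch: list consumer groups, then describe the MCL consumer group to see per-partition lag.
# Group name and broker address are illustrative defaults; confirm with --list.
kafka-consumer-groups.sh --bootstrap-server broker:9092 --list
kafka-consumer-groups.sh --bootstrap-server broker:9092 \
  --describe --group generic-mae-consumer-job-client
```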
