metadata-integration/java/openlineage-converter/src/main/java/io/datahubproject/openlineage/dataset/DatahubJob.java
@@ -9,6 +9,7 @@
 import com.linkedin.common.EdgeArray;
 import com.linkedin.common.GlobalTags;
 import com.linkedin.common.Ownership;
+import com.linkedin.common.Status;
 import com.linkedin.common.TagAssociation;
 import com.linkedin.common.UrnArray;
 import com.linkedin.common.urn.DataFlowUrn;
@@ -110,6 +111,8 @@ public List<MetadataChangeProposal> toMcps(DatahubOpenlineageConfig config) thro
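The body of this hunk is not shown, but the newly imported `Status` aspect is DataHub's marker for whether an entity is soft-deleted. As a rough illustration of what importing `Status` enables inside `toMcps`, here is a minimal sketch using DataHub's Java emitter API; the class and method names are hypothetical, not the actual lines added in this hunk:

```java
import com.linkedin.common.Status;
import com.linkedin.common.urn.DatasetUrn;
import datahub.event.MetadataChangeProposalWrapper;

public class StatusAspectSketch {
  // Hypothetical helper: builds a proposal that marks the given dataset
  // as present (removed = false), i.e. not soft-deleted.
  static MetadataChangeProposalWrapper statusProposal(DatasetUrn datasetUrn) {
    return MetadataChangeProposalWrapper.builder()
        .entityType("dataset")
        .entityUrn(datasetUrn.toString())
        .upsert()
        .aspect(new Status().setRemoved(false))
        .build();
  }
}
```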
@@ -104,7 +104,7 @@ and [Init script](https://docs.databricks.com/clusters/configure.html#init-scrip
 information like tokens.
 
 - Download `datahub-spark-lineage` jar
-  from [the Maven central repository](https://s01.oss.sonatype.org/content/groups/public/io/acryl/acryl-spark-lineage/0.0.1/).
+  from [the Maven central repository](https://s01.oss.sonatype.org/content/groups/public/io/acryl/acryl-spark-lineage/).
 - Create `init.sh` with below content
 
 ```sh
@@ -178,7 +178,10 @@ information like tokens.
 | spark.datahub.flow_name ||| If it is set it will be used as the DataFlow name otherwise it uses spark app name as flow_name |
 | spark.datahub.partition_regexp_pattern ||| Strip partition part from the path if path end matches with the specified regexp. Example `year=.*/month=.*/day=.*` |
 | spark.datahub.tags ||| Comma separated list of tags to attach to the DataFlow |
-| spark.datahub.stage_metadata_coalescing ||| Normally it coalesce and send metadata at the onApplicationEnd event which is never called on Databricsk. You should enable this on Databricsk if you want coalesced run .|
+| spark.datahub.domains ||| Comma separated list of domain urns to attach to the DataFlow |
+| spark.datahub.stage_metadata_coalescing ||| Normally metadata is coalesced and sent at the onApplicationEnd event, which is never called on Databricks. Enable this on Databricks if you want coalesced runs. |
+| spark.datahub.patch.enabled ||| Set this to true to send lineage as a patch, which appends rather than overwrites existing Dataset lineage edges. It is enabled by default. |
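To make the table concrete, here is a minimal sketch of a Spark job wired up with the listener and a few of the options above. The app name, server URL, `<version>` placeholder, and option values are illustrative assumptions, not defaults:

```java
import org.apache.spark.sql.SparkSession;

public class DatahubLineageConfigSketch {
  public static void main(String[] args) {
    SparkSession spark = SparkSession.builder()
        .appName("my-pipeline")
        // Pull the listener jar; replace <version> with a real release.
        .config("spark.jars.packages", "io.acryl:acryl-spark-lineage:<version>")
        .config("spark.extraListeners", "datahub.spark.DatahubSparkListener")
        .config("spark.datahub.rest.server", "http://localhost:8080")
        // Options from the table above (values are examples):
        .config("spark.datahub.flow_name", "daily_ingest")
        .config("spark.datahub.tags", "spark,example")
        .config("spark.datahub.domains", "urn:li:domain:example")
        .config("spark.datahub.stage_metadata_coalescing", "true")
        .config("spark.datahub.patch.enabled", "true")
        .getOrCreate();

    // ... run transformations; the listener emits lineage automatically ...
    spark.stop();
  }
}
```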
 
 ## What to Expect: The Metadata Model
@@ -207,7 +210,7 @@ For Spark on Databricks, pipeline start time is the cluster start time.
 
 ### Spark versions supported
 
-Supports Spark 3.x series and was tested with Spark 3.2.x and 3.3.x.
+Supports Spark 3.x series.
 
 ### Environments tested with
@@ -219,12 +222,6 @@ This initial release has been tested with the following environments:
 
 Testing with Databricks Standard and High-concurrency Cluster is not done yet.
 
-### Spark commands not yet supported
-
-- View related commands
-- Cache commands and implications on lineage
-- RDD jobs
-
 ### Configuring Hdfs based dataset URNs
 
 Spark emits lineage between datasets. It has its own logic for generating urns. Python sources emit metadata of