Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat(ingest/looker): Added delta-lake to platform naming convention involving 2 parts #12061

Open
wants to merge 1 commit into
base: master
Choose a base branch
from

Conversation

k7ragav
Copy link
Contributor

@k7ragav k7ragav commented Dec 9, 2024

Checklist

Added delta-lake to platforms that follow the naming convention of 'db.table_name`.

  • The PR conforms to DataHub's Contributing Guideline (particularly Commit Message Format)
  • Links to related issues (if applicable)
  • Tests for the changes have been added/updated (if applicable)
  • Docs related to the changes have been added/updated (if applicable). If a new feature has been added a Usage Guide has been added for the same.
  • For any breaking change/potential downtime/deprecation/big changes an entry has been made in Updating DataHub

@github-actions github-actions bot added ingestion PR or Issue related to the ingestion of metadata community-contribution PR or Issue raised by member(s) of DataHub Community labels Dec 9, 2024
@datahub-cyborg datahub-cyborg bot added the needs-review Label for PRs that need review from a maintainer. label Dec 9, 2024
Copy link
Collaborator

@hsheth2 hsheth2 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Based on the logic here

logger.debug(f"Ingesting table {table_name} from location {path}")
if self.source_config.relative_path is None:
browse_path: str = (
strip_s3_prefix(path) if self.source_config.is_s3 else path.strip("/")
)
else:
browse_path = path.split(self.source_config.base_path)[1].strip("/")
data_platform_urn = make_data_platform_urn(self.source_config.platform)
- it does not seem like delta-lake is always going to have a two-part identifier

@datahub-cyborg datahub-cyborg bot added pending-submitter-response Issue/request has been reviewed but requires a response from the submitter and removed needs-review Label for PRs that need review from a maintainer. labels Dec 17, 2024
Copy link
Contributor Author

@k7ragav k7ragav left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I thought they would also have database.tablename identifier in Looker while using the sql_table_name. And will not require the DEFAULT_DATABASE that is mentioned in the Looker connection
I saw some docs here as well https://docs.delta.io/latest/delta-batch.html

But let me know what would be the best way forward. We have database.tablename in sql_table_name and would like that to be the identifier as well for upstream lineage

@hsheth2
Copy link
Collaborator

hsheth2 commented Dec 17, 2024

@k7ragav looking at our golden files, delta-lake seems to have / in the dataset urns and can have an arbitrary number of parts

What do your delta-lake urns look like? Are you using our delta-lake ingestion source, or something else?

Overall - I don't think this change does what we want. That said, my understanding of how delta-lake works is somewhat weak.

@k7ragav
Copy link
Contributor Author

k7ragav commented Dec 17, 2024

@hsheth2 We do not use the delta-lake ingestion from DataHub. We ingest it using the emitters. And we do not have the file paths as it is shown in the golden dataset. We have it like this urn:li:dataset:(urn:li:dataPlatform:delta-lake,databasename.tablename,DEV
I think with the proposed change, it will only matter based on what is provided in the sql_table_name right? Or did I misunderstand it. Because that is why I added it to _platform_names_have_2_parts cause in sql _table_name , it works when it is database.tablename for Delta Lake queries in Looker
Let me know what you think.

@datahub-cyborg datahub-cyborg bot added needs-review Label for PRs that need review from a maintainer. and removed pending-submitter-response Issue/request has been reviewed but requires a response from the submitter labels Dec 17, 2024
@k7ragav
Copy link
Contributor Author

k7ragav commented Jan 6, 2025

@hsheth2 Do you have some suggestions on how to move forward based on the explanation below?

@hsheth2 We do not use the delta-lake ingestion from DataHub. We ingest it using the emitters. And we do not have the file paths as it is shown in the golden dataset. We have it like this urn:li:dataset:(urn:li:dataPlatform:delta-lake,databasename.tablename,DEV I think with the proposed change, it will only matter based on what is provided in the sql_table_name right? Or did I misunderstand it. Because that is why I added it to _platform_names_have_2_parts cause in sql _table_name , it works when it is database.tablename for Delta Lake queries in Looker Let me know what you think.

@treff7es
Copy link
Contributor

I briefly checked how our different sources work right now.

  1. delta-lake: This source ingests delta lake files from cloud storage right now, and here the platform is delta-lake. In the above case, you have delta-lake tables registered in some metastore, and you use some query execution engine like Databricks/Trino/Presto/Hive to query the delta-lake tables from Looker. Because of this, I think the preferable platform should be hive/presto/trino/databricks and not the underlying file format. These tables could also be iceberg tables or pure parquet files, if I get it right. Am I right?
  2. Then, the urn creation should be aligned with what the source-specific ingestions generate. This is mostly catalog_name.schema.table.

@k7ragav, what do you think about this?

@datahub-cyborg datahub-cyborg bot added pending-submitter-response Issue/request has been reviewed but requires a response from the submitter and removed needs-review Label for PRs that need review from a maintainer. labels Jan 22, 2025
@k7ragav
Copy link
Contributor Author

k7ragav commented Jan 22, 2025

@treff7es

  1. Yes. You're right, we do not store in the cloud and store in a metastore. Does this platform 'delta-lake' is only when the location is in a cloud? (like s3)? If I am understanding correclty, if that is the case, then we cannot use it the way I described it above?
  2. So if we use a catalog (for example, let's say hive), so that the URN will contain hive.database.tablename instead of database.tablename?

@datahub-cyborg datahub-cyborg bot added needs-review Label for PRs that need review from a maintainer. and removed pending-submitter-response Issue/request has been reviewed but requires a response from the submitter labels Jan 22, 2025
@treff7es
Copy link
Contributor

@treff7es

  1. Yes. You're right, we do not store in the cloud and store in a metastore. Does this platform 'delta-lake' is only when the location is in a cloud? (like s3)? If I am understanding correclty, if that is the case, then we cannot use it the way I described it above?
  2. So if we use a catalog (for example, let's say hive), so that the URN will contain hive.database.tablename instead of database.tablename?

Here is an example of how this looks like in the Hive tests ->

"urn": "urn:li:dataset:(urn:li:dataPlatform:hive,db1._test_table_underscore,PROD)",

urn:li:dataset:(urn:li:dataPlatform:hive,db1._test_table_underscore,PROD)
In the above example, platform is: hive db name:: db1 and table name _test_table_underscore
In your case, this would look like:
urn:li:dataset:(urn:li:dataPlatform:hive,database.tablename,PROD)

@datahub-cyborg datahub-cyborg bot added pending-submitter-response Issue/request has been reviewed but requires a response from the submitter and removed needs-review Label for PRs that need review from a maintainer. labels Jan 22, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
community-contribution PR or Issue raised by member(s) of DataHub Community ingestion PR or Issue related to the ingestion of metadata pending-submitter-response Issue/request has been reviewed but requires a response from the submitter
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants