
feat(ingestion) Adding vertexAI ingestion source (v1 - model group and model) #12632

Merged
merged 75 commits into datahub-project:master on Mar 13, 2025

Conversation

ryota-cloud
Collaborator

Checklist

  • The PR conforms to DataHub's Contributing Guideline (particularly Commit Message Format)
  • Links to related issues (if applicable)
  • Tests for the changes have been added/updated (if applicable)
  • Docs related to the changes have been added/updated (if applicable). If a new feature has been added a Usage Guide has been added for the same.
  • For any breaking change/potential downtime/deprecation/big changes an entry has been made in Updating DataHub

@github-actions github-actions bot added ingestion PR or Issue related to the ingestion of metadata community-contribution PR or Issue raised by member(s) of DataHub Community labels Feb 13, 2025
@datahub-cyborg datahub-cyborg bot added the needs-review Label for PRs that need review from a maintainer. label Feb 13, 2025

codecov bot commented Feb 13, 2025

Codecov Report

Attention: Patch coverage is 84.81481% with 41 lines in your changes missing coverage. Please review.

Files with missing lines Patch % Lines
...ingestion/src/datahub/ingestion/source/vertexai.py 82.70% 41 Missing ⚠️


@ryota-cloud ryota-cloud changed the title (WIP) feat(ingestion) Adding vertexAI ingestion source (v1 - model group and model) feat(ingestion) Adding vertexAI ingestion source (v1 - model group and model) Feb 24, 2025
Collaborator

@hsheth2 hsheth2 left a comment

Every entity needs container aspects

@datahub-cyborg datahub-cyborg bot added pending-submitter-response Issue/request has been reviewed but requires a response from the submitter needs-review Label for PRs that need review from a maintainer. and removed needs-review Label for PRs that need review from a maintainer. pending-submitter-response Issue/request has been reviewed but requires a response from the submitter labels Feb 25, 2025
super().__init__(**data)

if self.credential:
    self._credentials_path = self.credential.create_credential_temp_file(
Collaborator

do we actually need to create a credentials file?

Collaborator Author

good point, it isn't needed here even though GCPCredential requires it; deleted

Collaborator

a bit confused - it looks like we're still writing the credentials to a file

Collaborator

this is not a blocker - but we should not be writing credentials to disk if we can avoid it

Collaborator Author

@ryota-cloud ryota-cloud Mar 13, 2025

this file is actually used to create a credentials object via the service_account.Credentials utility, which is then fed into aiplatform.init()

        credentials = (
            service_account.Credentials.from_service_account_file(
                self.config._credentials_path
            )
            if self.config.credential
            else None
        )

Collaborator

right - but in the original GCPCredential we already receive the config either as JSON or as a real file path.

then we take that credential, write it out to a new file, store that file's path in self.config._credentials_path, and then load self.config._credentials_path again.

that flow is pretty strange

Collaborator Author

@ryota-cloud ryota-cloud Mar 13, 2025

it's not unusual to see this pattern (only the credential part is written to a temp file and then loaded back), but I understand your point about avoiding yet another file write. How about changing it to something like the below:

        credentials = (
            service_account.Credentials.from_service_account_info(
                self.config.get_credentials()  # pass the parsed dict directly
            )
            if self.config.credential
            else None
        )
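The round trip being discussed (write the credential dict to a temp file, then read it straight back) can be sketched in plain Python. This is a generic stand-in to illustrate why the direct-dict path is simpler; `load_via_temp_file` and `load_directly` are illustrative names, not the actual GCPCredential or google-auth API:

```python
import json
import os
import tempfile


def load_via_temp_file(cred_dict: dict) -> dict:
    """Original flow: serialize the credential dict to a temp file on disk,
    then immediately read it back (mirrors from_service_account_file)."""
    fd, path = tempfile.mkstemp(suffix=".json")
    try:
        with os.fdopen(fd, "w") as f:
            json.dump(cred_dict, f)
        with open(path) as f:
            return json.load(f)
    finally:
        os.unlink(path)  # credential file must be cleaned up


def load_directly(cred_dict: dict) -> dict:
    """Proposed flow: hand the dict straight to the consumer
    (mirrors from_service_account_info). Nothing touches disk."""
    return cred_dict


creds = {"type": "service_account", "project_id": "demo-project"}
assert load_via_temp_file(creds) == load_directly(creds)
```

Both paths yield the same dict, but the direct variant avoids writing secrets to disk and the cleanup burden that comes with it.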

@datahub-cyborg datahub-cyborg bot added pending-submitter-response Issue/request has been reviewed but requires a response from the submitter and removed needs-review Label for PRs that need review from a maintainer. labels Mar 5, 2025
@datahub-cyborg datahub-cyborg bot added needs-review Label for PRs that need review from a maintainer. and removed pending-submitter-response Issue/request has been reviewed but requires a response from the submitter labels Mar 9, 2025

    if job.create_time
    else datetime_to_ts_millis(datetime.now())
)
created_actor = f"urn:li:platformResource:{self.platform}"
Collaborator

why are we using a platformResource here? or is this just a dummy value?

Collaborator Author

this is a placeholder, since actor info is missing from the Vertex training job itself.
The MLflow connector uses "urn:li:corpuser:datahub". I can change to that for now.
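For context on the snippet above, the audit stamp relies on a helper named `datetime_to_ts_millis`. A minimal stand-in (an assumption about its behavior, not DataHub's actual implementation) would convert a datetime to epoch milliseconds:

```python
from datetime import datetime, timezone


def datetime_to_ts_millis(dt: datetime) -> int:
    """Epoch milliseconds for a datetime.

    Stand-in for the DataHub helper of the same name; here we assume
    naive datetimes are UTC, which may differ from the real helper.
    """
    if dt.tzinfo is None:
        dt = dt.replace(tzinfo=timezone.utc)
    return int(dt.timestamp() * 1000)


epoch = datetime(1970, 1, 1, tzinfo=timezone.utc)
assert datetime_to_ts_millis(epoch) == 0
```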

@datahub-cyborg datahub-cyborg bot added pending-submitter-response Issue/request has been reviewed but requires a response from the submitter and removed needs-review Label for PRs that need review from a maintainer. labels Mar 13, 2025
@datahub-cyborg datahub-cyborg bot added needs-review Label for PRs that need review from a maintainer. and removed pending-submitter-response Issue/request has been reviewed but requires a response from the submitter labels Mar 13, 2025
@datahub-cyborg datahub-cyborg bot added pending-submitter-response Issue/request has been reviewed but requires a response from the submitter and removed needs-review Label for PRs that need review from a maintainer. labels Mar 13, 2025
Collaborator

@hsheth2 hsheth2 left a comment

Approving to unblock you, but noting that we agreed a number of things will need to be done in follow-ups:

  • Fixing the way the unit tests are set up
  • Moving to use MCPW.construct_many
  • Changing the audit stamp urns to not be platformResources
  • Changing the auth flow to avoid writing credentials to a file
  • (a couple of others that I might be forgetting - basically all pending "conversation" threads in the GitHub PR review)

@datahub-cyborg datahub-cyborg bot added merge-pending-ci A PR that has passed review and should be merged once CI is green. and removed pending-submitter-response Issue/request has been reviewed but requires a response from the submitter labels Mar 13, 2025
@ryota-cloud
Collaborator Author

ryota-cloud commented Mar 13, 2025

thanks for the review! @hsheth2
a couple of follow-ups to come

@ryota-cloud ryota-cloud merged commit 0e62e8c into datahub-project:master Mar 13, 2025
139 of 152 checks passed