Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

✨ [source-google-sheets] add row_batch_size as an input parameter #35320

Conversation

tautvydas-v
Copy link
Contributor

What

Resolves my raised issue: #35274
Previously, source_google_sheets source connector had hardcoded value of 200 for row_batch_size. This means that one request sent to Google Sheets API processes only 200 rows, even though there's pretty much no limit, as long as it's processed in under 3 minutes. Also, limitation is on requests sent, but not the rows processed. Sometimes, a problem rises when doing a sync with a google sheet with over 100k records, where exponential backoff fails and the sync silently fails. In order to overcome this, making "batch_size" input parameter, with a default value of 200, so there wouldn't be any kind of change when updating the connector.

How

Added "batch_size" to spec with a default value of 200, removed ** increase_row_batch_size** method from Backoff class, but still left backoff functionality (it just doesn't add additional 10 records processed to the previous 200), created a new method get_batch_size which fetches the batch_size from config and returns batch size, when the method is called in source.py, _read method.

Copy link

vercel bot commented Feb 15, 2024

The latest updates on your projects. Learn more about Vercel for Git ↗︎

1 Ignored Deployment
Name Status Preview Comments Updated (UTC)
airbyte-docs ⬜️ Ignored (Inspect) Visit Preview Feb 15, 2024 2:32pm

@CLAassistant
Copy link

CLAassistant commented Feb 15, 2024

CLA assistant check
Thank you for your submission! We really appreciate it. Like many open source projects, we ask that you all sign our Contributor License Agreement before we can accept your contribution.
1 out of 2 committers have signed the CLA.

✅ tautvydas-v
❌ tautvydas.varnas


tautvydas.varnas seems not to be a GitHub user. You need a GitHub account to be able to sign the CLA. If you have already a GitHub account, please add the email address used for this commit to your account.
You have signed the CLA already but the status is still pending? Let us recheck it.

@octavia-squidington-iii octavia-squidington-iii added area/connectors Connector related issues area/documentation Improvements or additions to documentation community connectors/source/google-sheets labels Feb 15, 2024
Copy link
Contributor

Before Merging a Connector Pull Request

Wow! What a great pull request you have here! 🎉

To merge this PR, ensure the following has been done/considered for each connector added or updated:

  • PR name follows PR naming conventions
  • Breaking changes are considered. If a Breaking Change is being introduced, ensure an Airbyte engineer has created a Breaking Change Plan.
  • Connector version has been incremented in the Dockerfile and metadata.yaml according to our Semantic Versioning for Connectors guidelines
  • You've updated the connector's metadata.yaml file any other relevant changes, including a breakingChanges entry for major version bumps. See metadata.yaml docs
  • Secrets in the connector's spec are annotated with airbyte_secret
  • All documentation files are up to date. (README.md, bootstrap.md, docs.md, etc...)
  • Changelog updated in docs/integrations/<source or destination>/<name>.md with an entry for the new version. See changelog example
  • Migration guide updated in docs/integrations/<source or destination>/<name>-migrations.md with an entry for the new version, if the version is a breaking change. See migration guide example
  • If set, you've ensured the icon is present in the platform-internal repo. (Docs)

If the checklist is complete, but the CI check is failing,

  1. Check for hidden checklists in your PR description

  2. Toggle the github label checklist-action-run on/off to re-run the checklist CI.

Sorry, something went wrong.

@marcosmarxm
Copy link
Member

Thanks for submitting a change to the connector @tautvydas-v. I don't think removing the increase row when there is an exceptions is the right solution here. @darynaishchenko can help you in find a better way to implement the batch row functionality again.

@tautvydas-v
Copy link
Contributor Author

Ok, it seems things were overcomplicated from my side - pushed the newer code's version update.

@darynaishchenko
Copy link
Collaborator

@tautvydas-v Hi, what do you think about improving "increase" logic instead of providing batch size as custom param? for example in client.py do + 100 instead of + 10 on backoff. Or another value that will be more suitable for your data. As I understand your rate limits is really strict and we should think about backoff strategy too. Because on some row_batch_size value you also can face rate limits and +10 rows is to small value to fix it.

def increase_row_batch_size(cls, details):
            if details["exception"].status_code == status_codes.TOO_MANY_REQUESTS and cls.row_batch_size < 1000:
                cls.row_batch_size = cls.row_batch_size + 100

@tautvydas-v
Copy link
Contributor Author

Hey @darynaishchenko, I was thinking about increase parameter too. If we use default value of row_batch_size, which is 200, then increasing by 10 seems ok I think. But what if we made it also an input parameter, or instead of hardcoded value, which is 10, we would calculate a percentage of the row_batch_size, for example with each error, increase the value by 10%? In our case, we have some google sheets with 50-100k records and sometimes exponential backoff just fails, but we tested out that having row_batch_size as a much larger value, for example 10000, has no effect on the API, if it manages to process that request in under 3 minutes. So in this case, if we have a much higher value, adding in 10 or 100 doesn't really make a lot of difference, meanwhile a percentage could be more dynamic solution?

@marcosmarxm marcosmarxm changed the title Source Google Sheets - making row_batch_size as an input parameter ✨ [source-google-sheets] add row_batch_size as an input parameter Feb 16, 2024
@tautvydas-v
Copy link
Contributor Author

tautvydas-v commented Feb 18, 2024

Before Merging a Connector Pull Request

Wow! What a great pull request you have here! 🎉

To merge this PR, ensure the following has been done/considered for each connector added or updated:

  • [] PR name follows PR naming conventions
  • Breaking changes are considered. If a Breaking Change is being introduced, ensure an Airbyte engineer has created a Breaking Change Plan.
  • Connector version has been incremented in the Dockerfile and metadata.yaml according to our Semantic Versioning for Connectors guidelines
  • You've updated the connector's metadata.yaml file any other relevant changes, including a breakingChanges entry for major version bumps. See metadata.yaml docs
  • Secrets in the connector's spec are annotated with airbyte_secret
  • All documentation files are up to date. (README.md, bootstrap.md, docs.md, etc...)
  • Changelog updated in docs/integrations/<source or destination>/<name>.md with an entry for the new version. See changelog example
  • Migration guide updated in docs/integrations/<source or destination>/<name>-migrations.md with an entry for the new version, if the version is a breaking change. See migration guide example
  • If set, you've ensured the icon is present in the platform-internal repo. (Docs)

If the checklist is complete, but the CI check is failing,

  1. Check for hidden checklists in your PR description
  2. Toggle the github label checklist-action-run on/off to re-run the checklist CI.

Sorry, something went wrong.

@tautvydas-v tautvydas-v deleted the tautvydas-v/source-google-sheets-add-batch-size-as-input-parameter branch February 18, 2024 09:12
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area/connectors Connector related issues area/documentation Improvements or additions to documentation community connectors/source/google-sheets
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

5 participants