✨ [source-google-sheets] add row_batch_size as an input parameter #35320
Conversation
Thanks for submitting a change to the connector @tautvydas-v. I don't think removing the batch-size increase when there is an exception is the right solution here. @darynaishchenko can help you find a better way to implement the batch row functionality.
Ok, it seems I overcomplicated things on my side. I've pushed an updated version of the code.
@tautvydas-v Hi, what do you think about improving "increase" logic instead of providing batch size as custom param? for example in
Hey @darynaishchenko, I was thinking about the increase parameter too. If we use the default row_batch_size of 200, then increasing by 10 seems fine. But what if we also made the increment an input parameter, or, instead of the hardcoded value of 10, calculated it as a percentage of row_batch_size, e.g. growing the value by 10% on each error? In our case we have Google Sheets with 50-100k records, and sometimes exponential backoff just fails. We tested that a much larger row_batch_size, for example 10000, has no effect on the API as long as the request is processed in under 3 minutes. With such a high value, adding 10 or even 100 per retry makes little difference, whereas a percentage would be a more dynamic solution.
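The percentage-based growth proposed above could be sketched roughly as follows. This is only an illustration of the idea from the thread, not the merged implementation; the `Backoff` class name and `row_batch_size` attribute mirror the connector's naming as described in this PR, and the 10% factor is the value floated in the comment.

```python
class Backoff:
    """Illustrative sketch of percentage-based batch growth (assumption,
    not the connector's actual code)."""

    row_batch_size = 200  # default from the connector spec

    @classmethod
    def increase_row_batch_size(cls):
        # Grow by 10% of the current batch size instead of a fixed +10,
        # so a large configured batch (e.g. 10000) scales proportionally
        # while the default of 200 still grows in small steps.
        cls.row_batch_size = int(cls.row_batch_size * 1.1)
```

With the default of 200, one retry yields 220; with 10000 it yields 11000, which is the "more dynamic" behavior the comment argues for.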
What
Resolves my raised issue: #35274
Previously, the source-google-sheets connector had a hardcoded value of 200 for row_batch_size. This meant one request to the Google Sheets API processed only 200 rows, even though there is effectively no limit as long as the request completes in under 3 minutes; the API limits the number of requests sent, not the rows processed. With sheets of over 100k records, exponential backoff sometimes fails and the sync fails silently. To overcome this, batch_size becomes an input parameter with a default value of 200, so updating the connector introduces no behavior change by default.
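The new input parameter might look roughly like this in the connector's spec. This is a hypothetical fragment to illustrate the shape of the change; the exact property name, title, and placement in the real spec file may differ.

```json
{
  "batch_size": {
    "type": "integer",
    "title": "Row Batch Size",
    "description": "Number of rows fetched per request to the Google Sheets API. Defaults to 200 to preserve previous behavior.",
    "default": 200
  }
}
```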
How
Added "batch_size" to the spec with a default value of 200. Removed the increase_row_batch_size method from the Backoff class while keeping the backoff functionality itself (it just no longer adds 10 rows to the previous batch of 200 on each retry). Created a new method, get_batch_size, which fetches batch_size from the config and returns it; the method is called in the _read method in source.py.
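The get_batch_size method described above could be sketched like this. A minimal illustration based on the PR text: the function name and the "batch_size" config key come from the description, but the exact signature and call site in source.py are assumptions.

```python
DEFAULT_BATCH_SIZE = 200  # previous hardcoded value, kept as the default


def get_batch_size(config: dict) -> int:
    # Fall back to the old hardcoded value so existing configs
    # behave exactly as before the connector update.
    return config.get("batch_size", DEFAULT_BATCH_SIZE)


# Illustrative use inside _read:
# batch_size = get_batch_size(config)
# ...then request `batch_size` rows per Google Sheets API call.
```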