[Destination-S3] Uses New Load CDK LoadStrategy Interface (temporarily disabled) #54695

Merged
johnny-schmidt merged 4 commits into master from jschmidt/s3v2/s3-uses-new-interface on Mar 19, 2025

Contributor

@johnny-schmidt johnny-schmidt commented Feb 26, 2025

What

Migrate S3 (and object storage in general) to the new ObjectLoader interface using the LoadPipeline. This has been cloud tested for performance and reliability, but I'm leaving it disabled so that other connectors can use the code w/o waiting on a full S3 release.

How

Creates a new LoadStrategy called ObjectLoader. (See the documentation on the interface for usage guidelines for connector devs.) This will also be the foundation for any GCS, Azure, or other file-like interfaces, as well as for BulkLoad, in progress here.

If the ObjectLoader bean is present, also creates

  • ObjectLoaderRecordToPartAccumulator: straight port of RecordToPartAccumulator, except that it uses the per-record interface instead of the old batch iterator interface
  • ObjectLoaderPartToObjectAccumulator: same as above, but a straight port of PartToObjectAccumulator
  • a step for each accumulator (just the minimum necessary to wrap each accumulator in a LoadPipelineStep)
  • an ObjectLoaderPartQueue for passing finished parts from the first step to the second step (ObjectLoaderPartStep and ObjectLoaderObjectStep)
  • an ObjectLoaderPipeline that sequences the two steps into a LoadPipeline for injection (presence of any LoadPipeline is enough to trigger the core using the new interface)
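
For readers new to the design, the two-step flow listed above can be sketched roughly as follows. This is a simplified illustration, not the CDK code: `PartSketch`, `RecordToPartSketch`, `PartToObjectSketch`, and `runTwoStepSketch` are hypothetical names, parts are plain strings, and both steps run sequentially on one thread, whereas the real steps run as separate workers.

```kotlin
import java.util.concurrent.LinkedBlockingQueue

// Hypothetical stand-in for a finished part of an object.
data class PartSketch(val objectKey: String, val index: Int, val payload: String, val isFinal: Boolean)

// Step 1: consumes one record at a time (per-record interface) and emits a
// part once at least partSizeChars characters are buffered.
class RecordToPartSketch(private val objectKey: String, private val partSizeChars: Int) {
    private val buffer = StringBuilder()
    private var nextIndex = 0

    fun accept(record: String): PartSketch? {
        buffer.append(record).append('\n')
        return if (buffer.length >= partSizeChars) flush(isFinal = false) else null
    }

    // End of stream: whatever is buffered becomes the final part.
    fun finish(): PartSketch = flush(isFinal = true)

    private fun flush(isFinal: Boolean): PartSketch {
        val part = PartSketch(objectKey, nextIndex++, buffer.toString(), isFinal)
        buffer.setLength(0)
        return part
    }
}

// Step 2: collects parts and assembles the object when the final part arrives.
// Assumes the final part arrives last, which holds in this sequential sketch.
class PartToObjectSketch {
    private val parts = mutableListOf<PartSketch>()

    fun accept(part: PartSketch): String? {
        parts += part
        if (!part.isFinal) return null
        return parts.sortedBy { it.index }.joinToString(separator = "") { it.payload }
    }
}

// Wires the two steps together through a queue of finished parts.
fun runTwoStepSketch(records: List<String>, partSizeChars: Int): String {
    val partQueue = LinkedBlockingQueue<PartSketch>()
    val toParts = RecordToPartSketch("example-object", partSizeChars)
    val toObject = PartToObjectSketch()

    for (record in records) toParts.accept(record)?.let { partQueue.put(it) }
    partQueue.put(toParts.finish())

    return generateSequence { partQueue.take() }.firstNotNullOf { toObject.accept(it) }
}
```

The queue between the two steps is what allows them to scale independently, since each side only ever sees finished parts.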

Additionally, there are round-robin partitioners to distribute the incoming records and generated parts to the steps
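
A round-robin partitioner in this sense can be as small as a shared counter. The sketch below is an assumption about the general shape, not the CDK class: `RoundRobinPartitionerSketch` and `nextPartition` are hypothetical names.

```kotlin
import java.util.concurrent.atomic.AtomicInteger

// Hypothetical round-robin partitioner: ignores the record or part entirely
// and hands out partitions 0, 1, ..., numParts - 1, 0, 1, ... in turn.
class RoundRobinPartitionerSketch(private val numParts: Int) {
    init {
        require(numParts > 0) { "numParts must be positive" }
    }

    private val next = AtomicInteger(0)

    // floorMod keeps the result non-negative even after the counter overflows.
    fun nextPartition(): Int = Math.floorMod(next.getAndIncrement(), numParts)
}
```

Because the counter is atomic, several producers can share one partitioner instance without extra locking.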

Also, I made a couple of minor changes to the pipeline interface

  • Pipeline doesn't pass Reserved<...> around; it just has a callback for post-processing the message (it was impossible to make a 2-step pipeline otherwise)
  • I changed the accumulator interface to suspend -- this gave a ~20% performance increase for s3 (and had no effect on iceberg)

@octavia-squidington-iii octavia-squidington-iii added area/connectors Connector related issues CDK Connector Development Kit connectors/destination/s3 labels Feb 26, 2025
@johnny-schmidt johnny-schmidt force-pushed the jschmidt/s3v2/s3-uses-new-interface branch from 7c788b3 to e7f785d Compare February 26, 2025 22:18
@johnny-schmidt johnny-schmidt force-pushed the jschmidt/s3v2/s3-uses-new-interface branch 2 times, most recently from 7c75a47 to f899595 Compare February 28, 2025 01:31
@johnny-schmidt johnny-schmidt force-pushed the jschmidt/s3v2/s3-uses-new-interface branch 2 times, most recently from a1e327c to 35509dc Compare February 28, 2025 19:37
@johnny-schmidt johnny-schmidt force-pushed the jschmidt/s3v2/s3-uses-new-interface branch from 35509dc to dbfd7fc Compare March 4, 2025 00:17
@johnny-schmidt johnny-schmidt force-pushed the jschmidt/s3v2/s3-uses-new-interface branch from dbfd7fc to 09e3ac5 Compare March 4, 2025 20:16
@johnny-schmidt johnny-schmidt changed the base branch from master to jschmidt/s3/possible-to-use-legacy-cdk March 4, 2025 20:17
@johnny-schmidt johnny-schmidt force-pushed the jschmidt/s3v2/s3-uses-new-interface branch from 09e3ac5 to b4349ef Compare March 4, 2025 20:29
@johnny-schmidt johnny-schmidt force-pushed the jschmidt/s3/possible-to-use-legacy-cdk branch from 8d8cb6c to 0782ea4 Compare March 4, 2025 22:39
@johnny-schmidt johnny-schmidt force-pushed the jschmidt/s3v2/s3-uses-new-interface branch from b4349ef to a78c844 Compare March 4, 2025 22:42
@johnny-schmidt johnny-schmidt force-pushed the jschmidt/s3/possible-to-use-legacy-cdk branch from 0782ea4 to ecf8fed Compare March 4, 2025 23:58
@johnny-schmidt johnny-schmidt force-pushed the jschmidt/s3v2/s3-uses-new-interface branch from a78c844 to 1f8561d Compare March 5, 2025 00:02
@johnny-schmidt johnny-schmidt force-pushed the jschmidt/s3v2/s3-uses-new-interface branch from 1f8561d to eae8200 Compare March 5, 2025 00:13
Base automatically changed from jschmidt/s3/possible-to-use-legacy-cdk to master March 5, 2025 00:15
@johnny-schmidt johnny-schmidt force-pushed the jschmidt/s3v2/s3-uses-new-interface branch from eae8200 to b5c5d8e Compare March 5, 2025 20:27
@octavia-squidington-iii octavia-squidington-iii added the area/documentation Improvements or additions to documentation label Mar 5, 2025
Contributor

@edgao edgao left a comment

seems reasonable. Had a few minor comments, plus questions about the PartToObjectAccumulator, and about how we're injecting the queues.

Can I get a link to the PR adding bulk loader? I was a bit nervous about the micronaut wiring you described on tuesday (with all the Replaces stuff, queues, etc.) - it felt like we're moving towards the same problems that platform has, with it being difficult to understand which beans are in play at any given moment (probably even worse, since we're doing stuff across classpaths, so intellij doesn't know how to discover all the beans)

but that code isn't in this PR, so 🤷

Contributor Author

johnny-schmidt commented Mar 13, 2025

Everything that is implicitly included by some XLoader has a @Requires(XLoader::class)

The confusing part will be when something @Replaces something else, but in practice you have YLoader : XLoader, and some of the things that @Requires(YLoader) also replace something that @Requires(XLoader).

@edgao

#55671

@johnny-schmidt johnny-schmidt force-pushed the jschmidt/s3v2/s3-uses-new-interface branch 2 times, most recently from e7eff8d to eb39313 Compare March 14, 2025 00:32
Contributor

@edgao edgao left a comment

lgtm, had a few nits

@Value("\${airbyte.destination.core.record-batch-size-override:null}")
val batchSizeOverride: Long? = null,
) : LoadPipelineStep {
override val numWorkers: Int = objectLoader.numPartWorkers
private val streamCompletionMap = ConcurrentHashMap<DestinationStream.Descriptor, AtomicLong>()
Contributor

does streamCompletionMap need to be in the LoadPipelineStepTask constructor? seems a bit weird to require every Step implementation to pass in this hashmap

Contributor Author

it needs to be shared across tasks and it needs to be affined to the step (ie, it can't be global).

I think breaking out a factory makes sense. i was thinking something like

LoadPipelineStepTaskFactory:

  createFirstStep(... arguments except input queue/input partitioner/batch queue/num workers/stream completions/output queue)

  createMiddleStep(...same as above but also takes input queue, but injects the partitioner)

  createFinalStep(...same as middle but doesn't require output)
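
Roughly, the idea is something like the sketch below. It is heavily simplified and hypothetical: `StepTaskSketch` and `StepTaskFactorySketch` are made-up stand-ins, the only shared wiring shown is the record queue, and the partitioners, batch queue, and stream completions are left out.

```kotlin
import java.util.concurrent.ConcurrentLinkedQueue

// Hypothetical, heavily simplified stand-in for LoadPipelineStepTask.
class StepTaskSketch<I : Any, O : Any>(
    val role: String,
    val input: ConcurrentLinkedQueue<I>,
    val output: ConcurrentLinkedQueue<O>?, // null on the final step
    val transform: (I) -> O,
) {
    // Processes at most one queued item; returns null if the input is empty.
    fun runOnce(): O? {
        val item = input.poll() ?: return null
        return transform(item).also { output?.add(it) }
    }
}

// The factory owns the wiring that every step shares (here, only the record
// queue), so step classes pass in just what is specific to them.
class StepTaskFactorySketch<R : Any>(private val recordQueue: ConcurrentLinkedQueue<R>) {
    // First step: input is always the injected record queue.
    fun <O : Any> createFirstStep(
        output: ConcurrentLinkedQueue<O>,
        transform: (R) -> O,
    ) = StepTaskSketch("first", recordQueue, output, transform)

    // Middle step: the caller supplies both the input and the output queue.
    fun <I : Any, O : Any> createMiddleStep(
        input: ConcurrentLinkedQueue<I>,
        output: ConcurrentLinkedQueue<O>,
        transform: (I) -> O,
    ) = StepTaskSketch("middle", input, output, transform)

    // Final step: same as middle, but nothing is passed downstream.
    fun <I : Any, O : Any> createFinalStep(
        input: ConcurrentLinkedQueue<I>,
        transform: (I) -> O,
    ) = StepTaskSketch<I, O>("final", input, null, transform)
}
```

A step class would then take the factory as a constructor argument and call one of these methods from taskForPartition instead of building the task itself.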

Contributor Author

I did that and went ahead and updated DirectLoader. (I'm punting on Object loader, since it's about to change a little with BulkLoad and I want to avoid the rebase.)

ie,

@Singleton
@Requires(bean = DirectLoaderFactory::class)
class DirectLoadPipelineStep<S : DirectLoader>(
    val directLoaderFactory: DirectLoaderFactory<S>,
    val accumulator: DirectLoadRecordAccumulator<S, StreamKey>,
    val taskFactory: LoadPipelineStepTaskFactory,
) : LoadPipelineStep {
    private val log = KotlinLogging.logger {}
    override val numWorkers: Int = directLoaderFactory.inputPartitions

    override fun taskForPartition(partition: Int): LoadPipelineStepTask<*, *, *, *, *> {
        log.info { "Creating DirectLoad pipeline step task for partition $partition" }
        return taskFactory.createOnlyStep<S, StreamKey, DirectLoadAccResult>(accumulator, partition, numWorkers)
    }
}

Contributor Author

@johnny-schmidt johnny-schmidt Mar 18, 2025

I also reran tests for S3DataLake (temporarily unpinned)

Contributor Author

13001de (I fixed the constructor, too)

as
PartitionedQueue<
PipelineEvent<ObjectKey, ObjectLoaderUploadCompleter.UploadResult>>?,
null,
Contributor

nit: let's have an explicit flushStrategy = null here - it wasn't immediately obvious that this is how we implement "the enclosing step should be configured not to flush" in UploadCompleter.finish

Contributor Author

That's hidden by the factory now.

In practice, forcing only makes sense on the first step (it will finish early, and that effect will propagate downstream.)

Contributor

@edgao edgao left a comment

seems reasonable - I'll take your word that the factory createFirstStep / etc. methods make sense, since you're only calling createOnlyStep here

(but I think you meant to replace the constructor calls with factory calls, in the ObjectFooStep classes?)

@@ -28,7 +28,8 @@ class ObjectLoaderUploadCompleterStep(
     @Named("batchStateUpdateQueue") val batchQueue: QueueWriter<BatchUpdate>,
 ) : LoadPipelineStep {
     override val numWorkers: Int = objectLoader.numUploadCompleters
-    private val streamCompletionMap = ConcurrentHashMap<DestinationStream.Descriptor, AtomicLong>()
+    private val streamCompletionMap =
+        ConcurrentHashMap<DestinationStream.Descriptor, AtomicInteger>()

     override fun taskForPartition(partition: Int): LoadPipelineStepTask<*, *, *, *, *> {
         return LoadPipelineStepTask(
Contributor

did you mean to replace this with a factory call?

Contributor Author

I didn't make the changes to the object steps yet, but I will go ahead and do it

Contributor Author

I was trying to avoid a rebase but the changes are actually really light

@Value("\${airbyte.destination.core.record-batch-size-override:null}")
val batchSizeOverride: Long? = null,
) {
private val streamCompletions = ConcurrentHashMap<DestinationStream.Descriptor, AtomicInteger>()
Contributor

"needs to be affined to the step (ie, it can't be global)"

should this be a function so that we get a new instance on each invocation?

Contributor Author

it's worse than that 🙃

we can't do that, because each individual task would get its own map. we need the counts to be affined to the step and shared across all the workers in the step.

After some local testing, the right pattern is a map of Pair<StepIndex, Descriptor> -> Count. Not too bad since the index is deterministic on the first, final, and only steps, so the factory can set it.
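
As a rough sketch of that pattern (hypothetical names, and assuming the count tracks how many workers in a step have seen end-of-stream, which is a guess at the semantics rather than the actual implementation):

```kotlin
import java.util.concurrent.ConcurrentHashMap
import java.util.concurrent.atomic.AtomicInteger

// Hypothetical stand-in for DestinationStream.Descriptor.
data class DescriptorSketch(val namespace: String?, val name: String)

// One shared map for the whole pipeline. Keying by (stepIndex, descriptor)
// gives each step its own counter while all workers in a step share it.
class StreamCompletionsSketch {
    private val counts = ConcurrentHashMap<Pair<Int, DescriptorSketch>, AtomicInteger>()

    // Called by a worker when it sees end-of-stream for `stream`.
    // Returns true only for the last of the step's numWorkers workers.
    fun markWorkerDone(stepIndex: Int, stream: DescriptorSketch, numWorkers: Int): Boolean =
        counts.computeIfAbsent(stepIndex to stream) { AtomicInteger(0) }
            .incrementAndGet() == numWorkers
}
```

With this shape, a single injected instance can serve every step, because two steps never touch the same key.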

Contributor

@edgao edgao left a comment

one nonblocking comment, feel free to :shipit:

@@ -221,6 +226,7 @@ class LoadPipelineStepTaskFactory(
flushStrategy: PipelineFlushStrategy?,
part: Int,
numWorkers: Int,
taskIndex: Int,
Contributor

nonblocking: maybe nicer to use a taskName: String instead? Then we don't have magic numbers like -1, 2, etc. floating around, and it may also be useful for future debugging

@johnny-schmidt johnny-schmidt force-pushed the jschmidt/s3v2/s3-uses-new-interface branch from f48b486 to 5cd5111 Compare March 19, 2025 16:14
@johnny-schmidt johnny-schmidt merged commit bdf9688 into master Mar 19, 2025
27 checks passed
@johnny-schmidt johnny-schmidt deleted the jschmidt/s3v2/s3-uses-new-interface branch March 19, 2025 16:33
Labels
area/connectors Connector related issues area/documentation Improvements or additions to documentation CDK Connector Development Kit connectors/destination/mssql-v2 connectors/destination/s3
4 participants