[Destination-S3] Uses New Load CDK LoadStrategy Interface (temporarily disabled) #54695

johnny-schmidt · 2025-02-26T22:17:16Z

What

Migrate S3 (and object storage in general) to the new ObjectLoader interface using the LoadPipeline. This has been cloud tested for performance and reliability, but I'm leaving it disabled so that other connectors can use the code w/o waiting on a full S3 release.

How

Creates a new LoadStrategy called ObjectLoader. (See the documentation on the interface for usage guidelines for the connector dev.) This will also be the foundation for any GCS, Azure, or other file-like interfaces, and for BulkLoad, which is in progress here.

If the ObjectLoader bean is present, this also creates:

ObjectLoaderRecordToPartAccumulator: straight port of RecordToPartAccumulator, except that it uses the per-record interface instead of the old batch iterator interface
ObjectLoaderPartToObjectAccumulator: same as above, but a straight port of PartToObjectAccumulator
a step for each accumulator (just the minimum necessary to wrap each accumulator in a LoadPipelineStep)
an ObjectLoaderPartQueue for passing finished parts from the first step to the second step (ObjectLoaderPartStep and ObjectLoaderObjectStep)
an ObjectLoaderPipeline that sequences the two steps into a LoadPipeline for injection (presence of any LoadPipeline is enough to trigger the core using the new interface)
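
As a rough illustration of the per-record accumulator shape listed above, here is a minimal sketch. It assumes a simplified part type and a byte-size threshold; `PartSketch`, `RecordToPartSketch`, `accept`, and `finish` are hypothetical names for this example, not the CDK's actual classes or signatures.

```kotlin
import java.io.ByteArrayOutputStream

// Hypothetical, simplified part type; the real CDK part carries more metadata.
class PartSketch(val objectKey: String, val index: Int, val bytes: ByteArray, val isFinal: Boolean)

// Hypothetical per-record accumulator: called once per record instead of with a batch iterator.
class RecordToPartSketch(private val objectKey: String, private val partSizeBytes: Int) {
    private val buffer = ByteArrayOutputStream()
    private var nextIndex = 1

    // Buffers one record; returns a part only when the buffer has reached the threshold.
    fun accept(record: ByteArray): PartSketch? {
        buffer.write(record)
        return if (buffer.size() >= partSizeBytes) emit(isFinal = false) else null
    }

    // Called at end of stream: emits whatever is left (possibly empty) as the final part.
    fun finish(): PartSketch = emit(isFinal = true)

    private fun emit(isFinal: Boolean): PartSketch {
        val part = PartSketch(objectKey, nextIndex++, buffer.toByteArray(), isFinal)
        buffer.reset()
        return part
    }
}
```

In this sketch, each non-null result of `accept` is what the first step would place on the part queue for the second step to assemble into an object.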

Additionally, there are round-robin partitioners that distribute the incoming records and generated parts to the steps.
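
A round-robin partitioner of the kind mentioned above can be sketched in a few lines. This is an assumption-laden illustration: `RoundRobinPartitionerSketch` and `nextPartition` are hypothetical names, and the real partitioners implement the CDK's own partitioner interface rather than this one.

```kotlin
import java.util.concurrent.atomic.AtomicInteger

// Hypothetical stand-in for a round-robin partitioner; not the CDK's actual interface.
class RoundRobinPartitionerSketch(private val numPartitions: Int) {
    private val next = AtomicInteger(0)

    init {
        require(numPartitions > 0) { "numPartitions must be positive" }
    }

    // Cycles through 0 until numPartitions, ignoring the content of the record or part.
    // floorMod keeps the result non-negative even after the counter overflows.
    fun nextPartition(): Int = Math.floorMod(next.getAndIncrement(), numPartitions)
}
```

The counter is atomic so that several producers can share one partitioner without coordinating.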

Also, I made a couple of minor changes to the pipeline interface:

The pipeline doesn't pass Reserved<...> around; it just has a callback for post-processing the message (it was impossible to make a 2-step pipeline otherwise).
I changed the accumulator interface to suspend -- this caused a ~20% performance increase for S3 (and had no effect on Iceberg).
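
The callback change can be pictured roughly as follows. This is only a sketch under assumptions: `PipelineMessageSketch`, `postProcessingCallback`, and `handleMessage` are hypothetical names, and releasing reserved memory is an assumed example of what the callback might do.

```kotlin
// Hypothetical wrapper: the payload plus a callback to run once a step is done with it.
class PipelineMessageSketch<T>(
    val value: T,
    val postProcessingCallback: () -> Unit = {},
)

// Applies one step to the payload, then runs the callback
// (for example, to release memory that was reserved for the message).
fun <T, U> handleMessage(message: PipelineMessageSketch<T>, step: (T) -> U): U {
    val output = step(message.value)
    message.postProcessingCallback()
    return output
}
```

Because the step only sees the payload and the callback, its output type does not need to carry a reservation wrapper into the next step.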

edgao

seems reasonable. Had a few minor comments, plus questions about the PartToObjectAccumulator, and about how we're injecting the queues.

Can I get a link to the PR adding bulk loader? I was a bit nervous about the Micronaut wiring you described on Tuesday (with all the Replaces stuff, queues, etc.) - it felt like we're moving towards the same problems that platform has, with it being difficult to understand which beans are in play at any given moment (probably even worse, since we're doing stuff across classpaths, so IntelliJ doesn't know how to discover all the beans)

but that code isn't in this PR, so 🤷

...cdk/bulk/core/load/src/main/kotlin/io/airbyte/cdk/load/task/internal/LoadPipelineStepTask.kt

...ge/src/main/kotlin/io/airbyte/cdk/load/pipline/object_storage/ObjectLoaderPartPartitioner.kt

...ain/kotlin/io/airbyte/cdk/load/pipline/object_storage/ObjectLoaderPartToObjectAccumulator.kt

...t-storage/src/main/kotlin/io/airbyte/cdk/load/pipline/object_storage/ObjectLoaderPartStep.kt

johnny-schmidt · 2025-03-13T18:20:55Z

Everything that is implicitly included by some XLoader has a @Requires(XLoader::class).

The confusing part will be when something @Replaces something, but in practice you have YLoader: XLoader, and some of the things that @Requires(YLoader) also replace something with a @Requires(XLoader).

@edgao

#55671

edgao

lgtm, had a few nits

...cdk/bulk/core/load/src/main/kotlin/io/airbyte/cdk/load/task/internal/LoadPipelineStepTask.kt

.../main/kotlin/io/airbyte/cdk/load/pipline/object_storage/ObjectLoaderLoadedPartPartitioner.kt

edgao · 2025-03-18T16:11:17Z

.../src/main/kotlin/io/airbyte/cdk/load/pipline/object_storage/ObjectLoaderPartFormatterStep.kt

    @Value("\${airbyte.destination.core.record-batch-size-override:null}")
    val batchSizeOverride: Long? = null,
 ) : LoadPipelineStep {
    override val numWorkers: Int = objectLoader.numPartWorkers
+    private val streamCompletionMap = ConcurrentHashMap<DestinationStream.Descriptor, AtomicLong>()


does streamCompletionMap need to be in the LoadPipelineStepTask constructor? seems a bit weird to require every Step implementation to pass in this hashmap

it needs to be shared across tasks and it needs to be affined to the step (ie, it can't be global).

I think breaking out a factory makes sense. i was thinking something like

LoadPipelineStepTaskFactory:
createFirstStep(...arguments except input queue/input partitioner/batch queue/num workers/stream completions/output queue)
createMiddleStep(...same as above but also takes input queue, but injects the partitioner)
createFinalStep(...same as middle but doesn't require output)

I did that and went ahead and updated DirectLoader. (I'm punting on Object loader, since it's about to change a little with BulkLoad and I want to avoid the rebase.)

ie,

    @Singleton
    @Requires(bean = DirectLoaderFactory::class)
    class DirectLoadPipelineStep<S : DirectLoader>(
        val directLoaderFactory: DirectLoaderFactory<S>,
        val accumulator: DirectLoadRecordAccumulator<S, StreamKey>,
        val taskFactory: LoadPipelineStepTaskFactory,
    ) : LoadPipelineStep {
        private val log = KotlinLogging.logger {}

        override val numWorkers: Int = directLoaderFactory.inputPartitions

        override fun taskForPartition(partition: Int): LoadPipelineStepTask<*, *, *, *, *> {
            log.info { "Creating DirectLoad pipeline step task for partition $partition" }
            return taskFactory.createOnlyStep<S, StreamKey, DirectLoadAccResult>(
                accumulator,
                partition,
                numWorkers
            )
        }
    }

I also reran tests for S3DataLake (temporarily unpinned)

13001de (I fixed the constructor, too)
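
The factory shape discussed in this thread (first / middle / final / only step) can be sketched as below. This is a simplified illustration, not the real LoadPipelineStepTaskFactory: `StepTaskSketch`, `StepTaskFactorySketch`, the plain queues, and the `transform` lambdas are all hypothetical stand-ins, and the real tasks take many more arguments (partitioners, batch queue, flush strategy, and so on).

```kotlin
import java.util.concurrent.ConcurrentLinkedQueue

// Hypothetical, simplified task: reads from an input queue and optionally writes downstream.
class StepTaskSketch<I, O>(
    val name: String,
    private val inputQueue: ConcurrentLinkedQueue<I>,
    private val outputQueue: ConcurrentLinkedQueue<O>?, // null: nothing downstream
    private val transform: (I) -> O,
) {
    // Drains the input queue, forwarding results when an output queue exists.
    // Returns the number of inputs handled.
    fun runOnce(): Int {
        var handled = 0
        while (true) {
            val input = inputQueue.poll() ?: break
            val output = transform(input)
            outputQueue?.add(output)
            handled++
        }
        return handled
    }
}

// Hypothetical factory: it owns the shared record queue so step classes never pass it in.
class StepTaskFactorySketch<R>(private val recordQueue: ConcurrentLinkedQueue<R>) {
    // First step: input is injected by the factory, the caller supplies only the output.
    fun <O> createFirstStep(
        output: ConcurrentLinkedQueue<O>,
        transform: (R) -> O,
    ): StepTaskSketch<R, O> = StepTaskSketch("first", recordQueue, output, transform)

    // Middle step: the caller supplies both queues.
    fun <I, O> createMiddleStep(
        input: ConcurrentLinkedQueue<I>,
        output: ConcurrentLinkedQueue<O>,
        transform: (I) -> O,
    ): StepTaskSketch<I, O> = StepTaskSketch("middle", input, output, transform)

    // Final step: takes an input queue but has no output.
    fun <I, O> createFinalStep(
        input: ConcurrentLinkedQueue<I>,
        transform: (I) -> O,
    ): StepTaskSketch<I, O> = StepTaskSketch("final", input, null, transform)

    // Only step: first and final at once.
    fun <O> createOnlyStep(transform: (R) -> O): StepTaskSketch<R, O> =
        StepTaskSketch("only", recordQueue, null, transform)
}
```

The point of the design in this sketch is that shared state lives in the factory, so a step implementation only names its position in the pipeline and its own logic.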

edgao · 2025-03-18T16:38:00Z

...rc/main/kotlin/io/airbyte/cdk/load/pipline/object_storage/ObjectLoaderUploadCompleterStep.kt

+                as
+                PartitionedQueue<
+                    PipelineEvent<ObjectKey, ObjectLoaderUploadCompleter.UploadResult>>?,
+            null,


nit: let's have an explicit flushStrategy = null here - it wasn't immediately obvious that this is how the enclosing step is configured not to flush in UploadCompleter.finish

That's hidden by the factory now.

In practice, forcing a flush only makes sense on the first step (it will finish early, and that effect will propagate downstream).

...ge/src/main/kotlin/io/airbyte/cdk/load/pipline/object_storage/ObjectLoaderPartPartitioner.kt

edgao

seems reasonable - I'll take your word that the factory createFirstStep / etc. methods make sense, since you're only calling createOnlyStep here

(but I think you meant to replace the constructor calls with factory calls, in the ObjectFooStep classes?)

edgao · 2025-03-18T23:24:42Z

...rc/main/kotlin/io/airbyte/cdk/load/pipline/object_storage/ObjectLoaderUploadCompleterStep.kt

@@ -28,7 +28,8 @@ class ObjectLoaderUploadCompleterStep(
    @Named("batchStateUpdateQueue") val batchQueue: QueueWriter<BatchUpdate>,
 ) : LoadPipelineStep {
    override val numWorkers: Int = objectLoader.numUploadCompleters
-    private val streamCompletionMap = ConcurrentHashMap<DestinationStream.Descriptor, AtomicLong>()
+    private val streamCompletionMap =
+        ConcurrentHashMap<DestinationStream.Descriptor, AtomicInteger>()

    override fun taskForPartition(partition: Int): LoadPipelineStepTask<*, *, *, *, *> {
        return LoadPipelineStepTask(


did you mean to replace this with a factory call?

I didn't make the changes to the object steps yet, but I will go ahead and do it

I was trying to avoid a rebase but the changes are actually really light

edgao · 2025-03-18T23:26:20Z

...cdk/bulk/core/load/src/main/kotlin/io/airbyte/cdk/load/task/internal/LoadPipelineStepTask.kt

+    @Value("\${airbyte.destination.core.record-batch-size-override:null}")
+    val batchSizeOverride: Long? = null,
+) {
+    private val streamCompletions = ConcurrentHashMap<DestinationStream.Descriptor, AtomicInteger>()


needs to be affined to the step (ie, it can't be global)

should this be a function so that we get a new instance on each invocation?

it's worse than that 🙃

we can't do that, because each individual task would get its own map. we need the counts to be affined to the step and shared across all the workers in the step.

After some local testing, the right pattern is a map of Pair<StepIndex, Descriptor> -> Count. Not too bad since the index is deterministic on the first, final, and only steps, so the factory can set it.
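
A minimal sketch of that pattern, assuming the count tracks how many workers of a step have seen end-of-stream for a given stream. `StreamDescriptorSketch` stands in for DestinationStream.Descriptor, and `StepCompletionCounts` / `workerFinished` are hypothetical names for this example.

```kotlin
import java.util.concurrent.ConcurrentHashMap
import java.util.concurrent.atomic.AtomicInteger

// Hypothetical stand-in for DestinationStream.Descriptor.
data class StreamDescriptorSketch(val namespace: String?, val name: String)

// One shared instance for the whole pipeline; the step index in the key
// keeps each step's counts separate while all workers of a step share them.
class StepCompletionCounts {
    private val counts =
        ConcurrentHashMap<Pair<Int, StreamDescriptorSketch>, AtomicInteger>()

    // Records that one worker of `stepIndex` finished `stream`.
    // Returns true only for the call that brings the count up to numWorkers.
    fun workerFinished(stepIndex: Int, stream: StreamDescriptorSketch, numWorkers: Int): Boolean {
        val seen = counts
            .computeIfAbsent(stepIndex to stream) { AtomicInteger(0) }
            .incrementAndGet()
        return seen == numWorkers
    }
}
```

With two workers per step, the first `workerFinished(0, stream, 2)` call returns false and the second returns true, while calls for step 1 are counted independently.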

edgao

one nonblocking comment, feel free to

edgao · 2025-03-19T00:11:29Z

...cdk/bulk/core/load/src/main/kotlin/io/airbyte/cdk/load/task/internal/LoadPipelineStepTask.kt

@@ -221,6 +226,7 @@ class LoadPipelineStepTaskFactory(
        flushStrategy: PipelineFlushStrategy?,
        part: Int,
        numWorkers: Int,
+        taskIndex: Int,


nonblocking: maybe nicer to use taskName:String? then we don't have magic numbers like -1, 2, etc. floating around, and is maybe also useful for future debugging

octavia-squidington-iii added area/connectors Connector related issues CDK Connector Development Kit connectors/destination/s3 labels Feb 26, 2025

johnny-schmidt force-pushed the jschmidt/s3v2/s3-uses-new-interface branch from 7c788b3 to e7f785d Compare February 26, 2025 22:18

johnny-schmidt force-pushed the jschmidt/s3v2/s3-uses-new-interface branch 2 times, most recently from 7c75a47 to f899595 Compare February 28, 2025 01:31

johnny-schmidt force-pushed the jschmidt/s3v2/s3-uses-new-interface branch 2 times, most recently from a1e327c to 35509dc Compare February 28, 2025 19:37

johnny-schmidt force-pushed the jschmidt/s3v2/s3-uses-new-interface branch from 35509dc to dbfd7fc Compare March 4, 2025 00:17

johnny-schmidt force-pushed the jschmidt/s3v2/s3-uses-new-interface branch from dbfd7fc to 09e3ac5 Compare March 4, 2025 20:16

johnny-schmidt changed the base branch from master to jschmidt/s3/possible-to-use-legacy-cdk March 4, 2025 20:17

johnny-schmidt force-pushed the jschmidt/s3v2/s3-uses-new-interface branch from 09e3ac5 to b4349ef Compare March 4, 2025 20:29

johnny-schmidt force-pushed the jschmidt/s3/possible-to-use-legacy-cdk branch from 8d8cb6c to 0782ea4 Compare March 4, 2025 22:39

johnny-schmidt force-pushed the jschmidt/s3v2/s3-uses-new-interface branch from b4349ef to a78c844 Compare March 4, 2025 22:42

johnny-schmidt force-pushed the jschmidt/s3/possible-to-use-legacy-cdk branch from 0782ea4 to ecf8fed Compare March 4, 2025 23:58

johnny-schmidt force-pushed the jschmidt/s3v2/s3-uses-new-interface branch from a78c844 to 1f8561d Compare March 5, 2025 00:02

johnny-schmidt force-pushed the jschmidt/s3v2/s3-uses-new-interface branch from 1f8561d to eae8200 Compare March 5, 2025 00:13

Base automatically changed from jschmidt/s3/possible-to-use-legacy-cdk to master March 5, 2025 00:15

johnny-schmidt force-pushed the jschmidt/s3v2/s3-uses-new-interface branch from eae8200 to b5c5d8e Compare March 5, 2025 20:27

octavia-squidington-iii added the area/documentation Improvements or additions to documentation label Mar 5, 2025

edgao reviewed Mar 13, 2025

johnny-schmidt force-pushed the jschmidt/s3v2/s3-uses-new-interface branch 2 times, most recently from e7eff8d to eb39313 Compare March 14, 2025 00:32

S3 Destination Uses New Load CDK Interface (temporarily disabled)

263b106

johnny-schmidt force-pushed the jschmidt/s3v2/s3-uses-new-interface branch from eb39313 to c0cd6cc Compare March 17, 2025 18:48

johnny-schmidt force-pushed the jschmidt/s3v2/s3-uses-new-interface branch from c0cd6cc to 6f31250 Compare March 17, 2025 18:58

johnny-schmidt force-pushed the jschmidt/s3v2/s3-uses-new-interface branch from 6f31250 to f1a6b7b Compare March 17, 2025 22:21

Fix bad state ack issue

53cbe5a

johnny-schmidt force-pushed the jschmidt/s3v2/s3-uses-new-interface branch from f1a6b7b to 53cbe5a Compare March 17, 2025 22:34

edgao approved these changes Mar 18, 2025

ed comments

13001de

edgao reviewed Mar 18, 2025


edgao approved these changes Mar 19, 2025

johnny-schmidt force-pushed the jschmidt/s3v2/s3-uses-new-interface branch from 19eaebd to f48b486 Compare March 19, 2025 16:01

johnny-schmidt enabled auto-merge (squash) March 19, 2025 16:01

everything is a factory (and the complete counts work)

5cd5111

johnny-schmidt force-pushed the jschmidt/s3v2/s3-uses-new-interface branch from f48b486 to 5cd5111 Compare March 19, 2025 16:14

johnny-schmidt merged commit bdf9688 into master Mar 19, 2025
27 checks passed

johnny-schmidt deleted the jschmidt/s3v2/s3-uses-new-interface branch March 19, 2025 16:33
