[RFC] feat!: kernel-based log replay #3137

roeap · 2025-01-16T14:18:21Z

Description

This PR aims to provide new implementations for the current Snapshot (now called LazySnapshot) and EagerSnapshot back by the delta-kernel-rs library.

This PR focusses on the implementation of the new snapshots, but avoids updating all usage and removing the old ones. I plan to provide some stacked PRs that actually use these in operations etc., hoping that this way reviews and feedback can be a bit more streamlined.

To reduce churn in the codebase, after the switch has been made, we introduce a trait Snapshot which is implemented by the new snapshots and should also be implemented for DeltaTableState. We can now establish a more uniform API across the Snapshot variants since Kernel's execution model allows us to avoid async in all APIs.

One of the most significant conceptual changes is how eager the EagerSnapshot is. The parquet reading in both delta-rs and delta-kernel-rs has evolved much since the EagerSnapshot was first written and handles pushdown of columns and predicates much more effectively. TO mitigate the cost of repeated reads of commit data, we introduce a simple caching layer in form of an ObjectStore implementation that caches commit reads in memory. This is right now a simple brute force approach to allow for migration, but hopefully will be extended in the future to also avoid json parsing and caching parquet metadata reads.

Any feedback on the direction this is taking is greatly appreciated.

github-actions · 2025-01-16T14:18:48Z

ACTION NEEDED

delta-rs follows the Conventional Commits specification for release automation.

The PR title and description are used as the merge commit message. Please update your PR title and description to match the specification.

ion-elgreco · 2025-01-16T15:05:34Z

@roeap exciting stuff :D

This will close then this issue: #2776

Will your PR also address this: #3062?

roeap · 2025-01-16T15:33:59Z

Will your PR also address this: #3062?

Not in its current form, but updating Snapshot and with that the log segment needs to definitely go in here...

codecov · 2025-01-19T23:58:44Z

Codecov Report

Attention: Patch coverage is 53.01095% with 515 lines in your changes missing coverage. Please review.

Project coverage is 71.45%. Comparing base (3bff47b) to head (5d2cf48).

Files with missing lines	Patch %	Lines
crates/core/src/kernel/snapshot_next/iterators.rs	37.35%	150 Missing and 11 partials ⚠️
crates/core/src/kernel/snapshot_next/mod.rs	55.60%	66 Missing and 45 partials ⚠️
crates/test/src/acceptance/data.rs	0.00%	87 Missing ⚠️
crates/core/src/kernel/snapshot_next/lazy.rs	73.14%	28 Missing and 30 partials ⚠️
crates/core/src/kernel/snapshot_next/eager.rs	64.00%	35 Missing and 19 partials ⚠️
crates/core/src/kernel/snapshot_next/cache.rs	64.15%	36 Missing and 2 partials ⚠️
crates/test/src/acceptance/meta.rs	81.48%	0 Missing and 5 partials ⚠️
crates/core/src/protocol/checkpoints.rs	66.66%	0 Missing and 1 partial ⚠️

Additional details and impacted files

@@            Coverage Diff             @@
##             main    #3137      +/-   ##
==========================================
- Coverage   71.90%   71.45%   -0.45%     
==========================================
  Files         137      144       +7     
  Lines       44263    45351    +1088     
  Branches    44263    45351    +1088     
==========================================
+ Hits        31826    32406     +580     
- Misses      10397    10797     +400     
- Partials     2040     2148     +108

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

Signed-off-by: Robert Pack <robstar.pack@gmail.com>

ion-elgreco · 2025-01-22T15:17:13Z

@roeap I assume with the introduction of the CommitCacheObjectStore, you would want to have the instantiate two object_stores on the log store, one for commits, and one for reading/writing parquet.

With regards to the object store for reading/writing parquet, the folks at "seafowl" built an interesting caching layer for reading parquets https://github.com/splitgraph/seafowl/blob/main/src/object_store/cache.rs, I asked whether they could publish that as a crate, I think it could be really valuable for read operations during some operations that require scans

roeap · 2025-01-22T15:46:02Z

Well .. this very naive caching implementation is mainly meant for now to not double down on some of the "regrets" from our pasts selves when it comes to the Snapshot implementation.

By now the parquet read is very selective in delta-rs and delta-kernel-rs with column selection and row group filtering... as such the assumption is, that we do not need to cache data from checkpoints and focus on caching all these expensive json commit reads.

This simplifies the data we keep in memory significantly - essentially just reconciled add action data. While not incurring too much of a penalty for repeated json (commit) reads.

But this is mostly just a stop-gap for adopting kernel "the right way", or at least not in an obviously wrong way 😆.

As you rightfully mention, there is much more that can be done. IIRC, datafusion also at least has the wiring to inject caching of parquet footers, which should make scanning snapshots for actions other then adds also much more efficient.

Without having spend too much time thinking about it, I think the abstraction you mentioned is much nicer - i.e. we are aware of what type of file we are reading. For us this would in a kernel world mean we would hoist some caching up to a higher level, the json and parquet handler traits in Engine. This way we could also avoid additional parsing.

One could argue that this is more or less what we are doing now, keeping all arrow state in memory, but I would say that we can build something much more efficient - and shareable across snapshots - at the engine layer. Also do things like local file cache etc ..

One thing I discussed with @rtyler is to move the caching object store to a dedicated PR, as we can get that merged much quicker then this one - which may yet take some time :). Also, we can think about if we can (and should) iterate on our configuration system a bit. The tombstones config for instance has no effect for a while now.

ion-elgreco · 2025-01-22T15:49:58Z

@roeap on your last note, I think that could be useful indeed to already provide the benefit of it. I haven't looked to depth in to that code, but I assume you can limit the cache size?

roeap · 2025-01-22T15:58:20Z

Indeed you can - right now its a hard-coded count, bit in a separate PR this should be configurable. The crate also allows in a simple way to use other weights - e.g. limit by size, as well as choose eviction policies. Some of which we should allow users to configure, but hopefully we can just have great defaults based on what we know about delta tables 😆.

ion-elgreco

Looks good overall!

ion-elgreco · 2025-03-03T14:43:05Z

crates/core/src/kernel/snapshot_next/cache.rs

+        Self {
+            inner,
+            check: Arc::new(cache_json),
+            cache: Arc::new(Cache::new(100)),


I think we should add a Weight capacity here as well with a configurable env var to limit the Bytes held in memory

ion-elgreco · 2025-03-03T14:49:56Z

crates/core/src/kernel/snapshot_next/eager.rs

+    files: Option<RecordBatch>,
+}
+
+impl Snapshot for EagerSnapshot {


One thing that is missing is the DeltaTableConfig, I added this some time ago to the old snapshot because we some times need to be aware in the operation how the table got loaded.

/// Get the table config which is loaded with of the snapshot pub fn load_config(&self) -> &DeltaTableConfig { self.snapshot.load_config() }

ion-elgreco · 2025-03-03T14:51:17Z

crates/core/src/kernel/snapshot_next/eager.rs

+        self.snapshot.table_root()
+    }
+
+    fn version(&self) -> Version {


No more version_timestamp as well?

ion-elgreco · 2025-03-03T14:59:07Z

crates/core/src/kernel/snapshot_next/iterators.rs

+    fn next(&mut self) -> Option<Self::Item> {
+        if self.index < self.paths.len() {
+            let path = self.paths.value(self.index).to_string();
+            let add = AddVisitor::visit_add(self.index, path, self.getters.as_slice())


Is this always guaranteed to find the next add action?

ion-elgreco · 2025-03-03T15:18:07Z

crates/core/src/kernel/snapshot_next/iterators.rs

+        ))
+    }
+
+    pub fn stats(&self) -> Option<&str> {


Why does a logicalFileView have stats?

ion-elgreco · 2025-03-03T15:31:07Z

crates/core/src/kernel/snapshot_next/iterators.rs

+
+fn extract_column<'a>(
+    mut parent: &'a dyn ProvidesColumnByName,
+    col: &[impl AsRef<str>],


The name threw me off, I thought it was multiple columns, but it's a single column_path

ion-elgreco · 2025-03-03T15:42:45Z

crates/core/src/kernel/snapshot_next/lazy.rs

+                    res.and_then(|(data, predicate)| {
+                        let batch: RecordBatch =
+                            ArrowEngineData::try_from_engine_data(data)?.into();
+                        Ok(filter_record_batch(&batch, &BooleanArray::from(predicate))?)


Why even filter when the predicate was None?

ion-elgreco · 2025-03-03T15:44:06Z

crates/core/src/kernel/snapshot_next/lazy.rs

+        start_version: Option<Version>,
+        limit: Option<usize>,
+    ) -> DeltaResult<Box<dyn Iterator<Item = (Version, CommitInfo)>>> {
+        // let start_version = start_version.into();


ion-elgreco · 2025-03-03T15:56:19Z

crates/core/src/kernel/snapshot_next/lazy.rs

+        let end_version = start_version.unwrap_or_else(|| self.version());
+        let start_version = limit
+            .and_then(|limit| {
+                if limit == 0 {
+                    Some(end_version)
+                } else {
+                    Some(end_version.saturating_sub(limit as u64 - 1))


This is highly confusing xd, the end version becomes the start versions when passed, and then the start_versions becomes the end version again when there is no limit :S

ion-elgreco · 2025-03-03T15:58:57Z

crates/core/src/kernel/snapshot_next/lazy.rs

+        store: Arc<dyn ObjectStore>,
+        version: impl Into<Option<Version>>,
+    ) -> DeltaResult<Self> {
+        // TODO: how to deal with the dedicated IO runtime? Would this already be covered by the


We currently do that all the way at the beginning in logstore_with

github-actions bot added the binding/rust label Jan 16, 2025

roeap force-pushed the feat/kernel-data branch 4 times, most recently from f8049db to e7c7766 Compare January 19, 2025 00:11

roeap force-pushed the feat/kernel-data branch from d59867e to 4f8ff2d Compare January 21, 2025 19:08

roeap added 14 commits January 21, 2025 20:14

chore: setup dat test scaffolding

8e8378b

Signed-off-by: Robert Pack <robstar.pack@gmail.com>

feat: file action replay

333198c

Signed-off-by: Robert Pack <robstar.pack@gmail.com>

feat: add objectstore with commit file caching

0f2c1c4

Signed-off-by: Robert Pack <robstar.pack@gmail.com>

feat: add owned file view

55565ae

Signed-off-by: Robert Pack <robstar.pack@gmail.com>

feat: basic updates of file state

3d6d263

Signed-off-by: Robert Pack <robstar.pack@gmail.com>

feat: introduce snapshot trait

a5672b5

Signed-off-by: Robert Pack <robstar.pack@gmail.com>

test: run some more dat tests

b0f794f

Signed-off-by: Robert Pack <robstar.pack@gmail.com>

feat: add commit infos apis to new snapshots

7a559ac

Signed-off-by: Robert Pack <robstar.pack@gmail.com>

feat: snapshot updates and improved file data iterators

adb9df8

Signed-off-by: Robert Pack <robstar.pack@gmail.com>

fix: cocnsistent schemas in file replay and object safe snapshot trait

f3b0edb

Signed-off-by: Robert Pack <robstar.pack@gmail.com>

test: more snapshot tests

e83c3ca

Signed-off-by: Robert Pack <robstar.pack@gmail.com>

feat: allow iterating over logical files

5364f4a

Signed-off-by: Robert Pack <robstar.pack@gmail.com>

fix: revert accidentally commited file

51349f4

Signed-off-by: Robert Pack <robstar.pack@gmail.com>

fix: tombstone replay

5d2cf48

Signed-off-by: Robert Pack <robstar.pack@gmail.com>

roeap force-pushed the feat/kernel-data branch from 4f8ff2d to 5d2cf48 Compare January 21, 2025 20:55

ion-elgreco reviewed Mar 3, 2025

View reviewed changes

roeap mentioned this pull request Mar 4, 2025

[tracking] Kernelize! #3298

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[RFC] feat!: kernel-based log replay #3137

[RFC] feat!: kernel-based log replay #3137

roeap commented Jan 16, 2025

github-actions bot commented Jan 16, 2025

ion-elgreco commented Jan 16, 2025

roeap commented Jan 16, 2025

codecov bot commented Jan 19, 2025 •

edited

Loading

ion-elgreco commented Jan 22, 2025

roeap commented Jan 22, 2025

ion-elgreco commented Jan 22, 2025

roeap commented Jan 22, 2025

ion-elgreco left a comment

ion-elgreco Mar 3, 2025

ion-elgreco Mar 3, 2025

ion-elgreco Mar 3, 2025

ion-elgreco Mar 3, 2025

ion-elgreco Mar 3, 2025

ion-elgreco Mar 3, 2025

ion-elgreco Mar 3, 2025

ion-elgreco Mar 3, 2025

ion-elgreco Mar 3, 2025

ion-elgreco Mar 3, 2025

[RFC] feat!: kernel-based log replay #3137

Are you sure you want to change the base?

[RFC] feat!: kernel-based log replay #3137

Conversation

roeap commented Jan 16, 2025

Description

github-actions bot commented Jan 16, 2025

ion-elgreco commented Jan 16, 2025

roeap commented Jan 16, 2025

codecov bot commented Jan 19, 2025 • edited Loading

Codecov Report

ion-elgreco commented Jan 22, 2025

roeap commented Jan 22, 2025

ion-elgreco commented Jan 22, 2025

roeap commented Jan 22, 2025

ion-elgreco left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

codecov bot commented Jan 19, 2025 •

edited

Loading