
[WIP] Stream GeoArrow by batch from the driver #1855

Draft · paleolimbot wants to merge 4 commits into master

Conversation

paleolimbot (Member) commented Mar 12, 2025

Did you read the Contributor Guide?

Is this PR related to a ticket?

(Not yet!)

  • Yes, and the PR name follows the format [SEDONA-XXX] my subject.

  • Yes, and the PR name follows the format [GH-XXX] my subject.

  • No:

    • this is a documentation update. The PR name follows the format [DOCS] my subject
    • this is a CI update. The PR name follows the format [CI] my subject

What changes were proposed in this PR?

This PR implements a batch-by-batch reader (as opposed to a reader that materializes the complete table at once). For large results this may increase the size of result that can be handled on the Python side, although I think the whole result still has to be collected in the JVM either way.

Still working on the details!
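
As a rough sketch of the consumption pattern this enables (a hypothetical helper, not part of this PR; the GeoArrowDataFrameReader name and iteration behaviour are taken from the reprex below, and process_batch stands in for arbitrary per-batch work):

from sedona.utils.geoarrow import GeoArrowDataFrameReader

def stream_geoarrow(df, process_batch):
    """Consume a Spark DataFrame one pyarrow RecordBatch at a time.

    Only one batch needs to be held in Python memory at once, although the
    full result is still collected on the JVM driver as noted above.
    """
    reader = GeoArrowDataFrameReader(df)
    for batch in reader:           # each item is a pyarrow.RecordBatch
        process_batch(batch)
    return reader.batch_order      # batch indices, as in the reprex output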

How was this patch tested?

Tests forthcoming!

Did this PR include necessary documentation updates?

  • Yes, I am adding a new API. I am using the current SNAPSHOT version number in vX.Y.Z format.
  • Yes, I have updated the documentation.
  • No, this PR does not affect any public API so no need to change the documentation.

Reprex:

from sedona.spark import SedonaContext

config = (
    SedonaContext.builder()
    .config(
        "spark.jars",
        "spark-shaded/target/sedona-spark-shaded-3.5_2.12-1.7.1-SNAPSHOT.jar",
    )
    .config("spark.executor.memory", "6G")
    .config("spark.driver.memory", "6G")
    .getOrCreate()
)

sedona = SedonaContext.create(config)

from sedona.utils.geoarrow import GeoArrowDataFrameReader, dataframe_to_arrow

df = sedona.read.format("geoparquet").load(
    "/Users/dewey/gh/geoarrow-data/microsoft-buildings/files/microsoft-buildings_point_geo.parquet"
).limit(100_000)

reader = GeoArrowDataFrameReader(df)
print(reader.schema)
#> geometry: extension<geoarrow.wkb<WkbType>>
for batch in reader:
    print(".", end="")
print()
print(batch)
#> pyarrow.RecordBatch
#> geometry: extension<geoarrow.wkb<WkbType>>
#> ----
#> geometry: [01...
print(reader.batch_order)
#> [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
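
As a possible follow-up to the reprex (a minimal sketch, not part of this PR: it assumes the reader can simply be constructed again for a fresh pass and that reader.schema is the pyarrow schema printed above), the streamed batches can be reassembled into a single pyarrow Table:

import pyarrow as pa

# Re-create the reader for a fresh iteration and collect every RecordBatch.
reader = GeoArrowDataFrameReader(df)
batches = list(reader)

# Stitch the batches back into one Table; reader.batch_order (printed above)
# could be used here if the original batch ordering needs to be restored.
table = pa.Table.from_batches(batches, schema=reader.schema)
print(table.num_rows)  # should equal the 100_000 row limit above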
