
[WIP] Stream GeoArrow by batch from the driver #1855

Draft · paleolimbot wants to merge 4 commits into master

Conversation

paleolimbot (Member) commented Mar 12, 2025

Did you read the Contributor Guide?

Is this PR related to a ticket?

(Not yet!)

  • Yes, and the PR name follows the format [SEDONA-XXX] my subject.

  • Yes, and the PR name follows the format [GH-XXX] my subject.

  • No:

    • this is a documentation update. The PR name follows the format [DOCS] my subject
    • this is a CI update. The PR name follows the format [CI] my subject

What changes were proposed in this PR?

This PR implements a batch-by-batch reader (as opposed to a reader that materializes the complete table at once). For large results this may increase the size of result that can be handled on the Python side, although I think the whole result still has to be collected in the JVM either way.

Still working on the details!
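
As a rough sketch of the consumption pattern this enables (a hypothetical helper, not part of this PR; the GeoArrowDataFrameReader name and iteration behaviour are taken from the reprex below, and process_batch stands in for arbitrary per-batch work):

from sedona.utils.geoarrow import GeoArrowDataFrameReader

def stream_geoarrow(df, process_batch):
    """Consume a Spark DataFrame one pyarrow RecordBatch at a time.

    Only one batch needs to be held in Python memory at once, although the
    full result is still collected on the JVM driver as noted above.
    """
    reader = GeoArrowDataFrameReader(df)
    for batch in reader:           # each item is a pyarrow.RecordBatch
        process_batch(batch)
    return reader.batch_order      # batch indices, as in the reprex output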

How was this patch tested?

Tests forthcoming!

Did this PR include necessary documentation updates?

  • Yes, I am adding a new API. I am using the current SNAPSHOT version number in vX.Y.Z format.
  • Yes, I have updated the documentation.
  • No, this PR does not affect any public API so no need to change the documentation.

Reprex:

from sedona.spark import SedonaContext

config = (
    SedonaContext.builder()
    .config(
        "spark.jars",
        "spark-shaded/target/sedona-spark-shaded-3.5_2.12-1.7.1-SNAPSHOT.jar",
    )
    .config("spark.executor.memory", "6G")
    .config("spark.driver.memory", "6G")
    .getOrCreate()
)

sedona = SedonaContext.create(config)

from sedona.utils.geoarrow import GeoArrowDataFrameReader, dataframe_to_arrow

df = sedona.read.format("geoparquet").load(
    "/Users/dewey/gh/geoarrow-data/microsoft-buildings/files/microsoft-buildings_point_geo.parquet"
).limit(100_000)

reader = GeoArrowDataFrameReader(df)
print(reader.schema)
#> geometry: extension<geoarrow.wkb<WkbType>>
for batch in reader:
    print(".", end="")
print()
print(batch)
#> pyarrow.RecordBatch
#> geometry: extension<geoarrow.wkb<WkbType>>
#> ----
#> geometry: [01...
print(reader.batch_order)
#> [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
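
As a possible follow-up to the reprex (a minimal sketch, not part of this PR: it assumes the reader can simply be constructed again for a fresh pass and that reader.schema is the pyarrow schema printed above), the streamed batches can be reassembled into a single pyarrow Table:

import pyarrow as pa

# Re-create the reader for a fresh iteration and collect every RecordBatch.
reader = GeoArrowDataFrameReader(df)
batches = list(reader)

# Stitch the batches back into one Table; reader.batch_order (printed above)
# could be used here if the original batch ordering needs to be restored.
table = pa.Table.from_batches(batches, schema=reader.schema)
print(table.num_rows)  # should equal the 100_000 row limit above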
