Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

SNOW-1890076: SnowflakeConnectionV1#downloadStream has the chance of downloading the wrong file if files only differ by suffix. #2030

Closed
wukachn opened this issue Jan 14, 2025 · 8 comments
Assignees
Labels
bug status-fixed_awaiting_release The issue has been fixed, its PR merged, and now awaiting the next release cycle of the connector. status-triage_done Initial triage done, will be further handled by the driver team

Comments

@wukachn
Copy link

wukachn commented Jan 14, 2025

Please answer these questions before submitting your issue.
In order to accurately debug the issue this information is required. Thanks!

  1. What version of JDBC driver are you using?
    3.21.0 but seems to be introduced in 3.19.1 by this PR: SNOW-1708304: Download stream from git repository #1920

  2. What operating system and processor architecture are you using?
    macOS ARM

  3. What version of Java are you using?
    17

  4. What did you do?

I uploaded two files to my stage (not wrapped in quotes):

  • filename goes here
  • filename goes here.dat

Then, when I try to download filename goes here, I get the file content of filename goes here.dat.

Below is the code snippet that assumes sourceFiles should only ever contain a single filename when we call downloadStream. However, sourceFiles is determined by applying some kind of parsing process to the CMD generated earlier (get '@"DPC_STREAMING_TESTING_DB"."CDC_IT_f3333863-c08f-4c9a-9dd2-6cd3c25465d2"."SNOWFLAKE_CLIENT_IT"/filename goes here' file:///tmp/ /*jdbc download stream*/). In this example, sourceFiles gets set to a list of both file names and in this case the list is ordered in a way not in my favor, so the wrong file is downloaded.

// when downloading files as stream there should be only one file in source files
String sourceLocation =
sourceFiles.stream()
.findFirst()
.orElseThrow(
() ->
new SnowflakeSQLException(
queryID,
SqlState.NO_DATA,
ErrorCode.FILE_NOT_FOUND.getMessageCode(),
session,
"File not found: " + fileName));```

File names sharing a name besides a prefix do not encounter this issue but filenames sharing a name besides a suffix do.

  1. What did you expect to see?

    I would expect the correct file to be downloaded.

@wukachn wukachn added the bug label Jan 14, 2025
@wukachn wukachn changed the title SnowflakeConnectionV1#downloadStream have the chance of downloading the wrong file if files only differ by suffix. SnowflakeConnectionV1#downloadStream has the chance of downloading the wrong file if files only differ by suffix. Jan 14, 2025
@sfc-gh-dszmolka sfc-gh-dszmolka self-assigned this Jan 15, 2025
@sfc-gh-dszmolka sfc-gh-dszmolka added question Issue is a usage/other question rather than a bug status-triage_done Initial triage done, will be further handled by the driver team and removed bug labels Jan 15, 2025
@sfc-gh-dszmolka
Copy link
Contributor

hi there and thanks for submitting this issue ! i think you're observing the expected and documented behaviour. Why? the Snowflake GET command which the driver is using under the hood, per its documentation: https://docs.snowflake.com/en/sql-reference/sql/get

GET internalStage file://<local_directory_path>
..
internalStage ::=
    @[<namespace>.]<int_stage_name>[/<path>]
  | @[<namespace>.]%<table_name>[/<path>]
  | @~[/<path>]

<path> is an optional case-sensitive path for files in the cloud storage location (i.e. files have names that begin with a common string) that limits access to a set of files. Paths are alternatively called prefixes or folders by different cloud storage services. If path is specified, but no file is explicitly named in the path, all data files in the path are downloaded.

This is precisely what is happening here.

I think as it is the expected behaviour, considering a different approach

  • (you seem to be already using that) naming the files differently
  • placing the files under different <path>

might help you. Hope this helps.

@wukachn
Copy link
Author

wukachn commented Jan 20, 2025

The explanation makes sense but I don't agree with that this should be the behaviour. If downloadStream is built to only get a single file, it feels like there should be some enforcement on that. It feels odd to me that if you hit the multiple file case, you essentially get the content of a random file in that group.

As a first step, it'd be nice to have some kind of warning if you call downloadStream and multiple files with that path are found.

What do you think about exposing the PATTERN parameter?

@sfc-gh-dszmolka
Copy link
Contributor

had a quick chat with the team about downloadStream intended behaviour and they agreed that the current behaviour, even if it works according to the backing GET functionality, is not expected. This function is expected to 1:1 match the name provided as input.

we'll take a look and i'll keep this thread posted on the progress.

@sfc-gh-dszmolka sfc-gh-dszmolka added bug and removed question Issue is a usage/other question rather than a bug labels Jan 21, 2025
@sfc-gh-dszmolka sfc-gh-dszmolka changed the title SnowflakeConnectionV1#downloadStream has the chance of downloading the wrong file if files only differ by suffix. SNOW-1890076: SnowflakeConnectionV1#downloadStream has the chance of downloading the wrong file if files only differ by suffix. Jan 21, 2025
@sfc-gh-dszmolka sfc-gh-dszmolka added the status-pr_pending_merge A PR is made and is under review label Jan 22, 2025
@sfc-gh-dszmolka
Copy link
Contributor

a PR is prepared with the fix : #2043

@sfc-gh-dszmolka sfc-gh-dszmolka added status-fixed_awaiting_release The issue has been fixed, its PR merged, and now awaiting the next release cycle of the connector. and removed status-pr_pending_merge A PR is made and is under review labels Jan 27, 2025
@sfc-gh-dszmolka
Copy link
Contributor

PR is merged and will be part of the next release

@wukachn
Copy link
Author

wukachn commented Jan 27, 2025

Great news, thanks for the quick action @sfc-gh-dszmolka

@sfc-gh-dszmolka
Copy link
Contributor

All credits go to my colleague @sfc-gh-dprzybysz but glad we could help you quickly :)

@sfc-gh-dszmolka
Copy link
Contributor

change released with JDBC driver version 3.22.0

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug status-fixed_awaiting_release The issue has been fixed, its PR merged, and now awaiting the next release cycle of the connector. status-triage_done Initial triage done, will be further handled by the driver team
Projects
None yet
Development

No branches or pull requests

3 participants