Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

pull,load: use *Image instead of re-resolving via name #2339

Merged
merged 2 commits into from
Mar 21, 2025

Conversation

flouthoc
Copy link
Collaborator

@flouthoc flouthoc commented Mar 1, 2025

Following commit fixes a race condition in libimage because in Pull( after performing copy from remote sources it agains attempts to resolve image via LookupImage, any operation between copy and LookupImage can remove name from the recently pulled image. Causing race in builds.

Removing unnecessary image -> name -> image round-trip from Pull(

This issue was discoverd while working on PR containers/buildah#5971

buildah build -t test --jobs=2 --skip-unused-stages=false .

Containerfile

FROM quay.io/jitesoft/alpine
RUN arch
FROM --platform=linux/arm64 quay.io/jitesoft/alpine AS foreign

Following commit also addresses the commit e2324dd by performing the neccessary refactor.

No functional change in public exposed API, exisiting tests should pass as-is. [NO NEW TESTS NEEDED]

@flouthoc flouthoc force-pushed the refactor-pull branch 2 times, most recently from 4f50746 to 663a29b Compare March 1, 2025 17:35
@flouthoc
Copy link
Collaborator Author

flouthoc commented Mar 3, 2025

I don't know why after image.reload( here https://github.com/containers/common/blob/main/libimage/image_test.go#L53 the digest gets changed although I never changed anything in storage. @mtrmac any hints ?

@mtrmac
Copy link
Contributor

mtrmac commented Mar 3, 2025

I didn’t trace it, for now a quick guess: https://github.com/containers/image/blob/c3293eccad1afb55fb9b45d674c8cd06f605eac9/storage/storage_reference.go#L173-L188 means that when using a digested c/storage reference, the reference refers to the manifest with that digest.

So, for a digested pull, the resolved reference is a name@digest@ID reference, and always uses the manifest with referenced digest. But Image.reload replaces the storage reference with an @ID-only reference, in which case the “default” manifest selection applies, and that might differ.

Let me know if this does not help and I should look into this in more detail.

@flouthoc
Copy link
Collaborator Author

flouthoc commented Mar 3, 2025

I didn’t trace it, for now a quick guess: https://github.com/containers/image/blob/c3293eccad1afb55fb9b45d674c8cd06f605eac9/storage/storage_reference.go#L173-L188 means that when using a digested c/storage reference, the reference refers to the manifest with that digest.

So, for a digested pull, the resolved reference is a name@digest@ID reference, and always uses the manifest with referenced digest. But Image.reload replaces the storage reference with an @ID-only reference, in which case the “default” manifest selection applies, and that might differ.

Let me know if this does not help and I should look into this in more detail.

This is very helpful, I will investigate with above notes and will report back 😄 , thanks for the tip.

@mtrmac
Copy link
Contributor

mtrmac commented Mar 4, 2025

But Image.reload replaces the storage reference with an @ID-only reference,

That’s imprecise I’m afraid; Image.storageReference is not modified — but Image.storageImage is, to an image coming directly from c/storage and not from storageReference.resolveImage. But I think the core of the idea, that resolveImage modifies Image.Digest to match the user input, remains.

@flouthoc flouthoc force-pushed the refactor-pull branch 2 times, most recently from 37b8f64 to d585e3d Compare March 4, 2025 20:38
@flouthoc
Copy link
Collaborator Author

flouthoc commented Mar 4, 2025

@mtrmac Tests passes now, this is ready for review. Although I must created vendor PR in podman and buildah to verify everything else.

@flouthoc flouthoc requested a review from mtrmac March 4, 2025 21:22
Copy link
Contributor

@mtrmac mtrmac left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just an extremely brief skim for now, without reading any of the context; mostly just a list of topics to discuss / for me to research.

  • I very much like minimizing repeated image lookups. The overall direction is great.
  • I’m not at all sure about the .reload() part now. Do we know what is exactly happening?
  • I don’t know about the non-registry copies, e.g. loads from archives. I would be fine with not fixing those cases in this PR and deferring that for later… OTOH it might be easier to do all now than to re-establish the context later.

@flouthoc
Copy link
Collaborator Author

flouthoc commented Mar 6, 2025

Okay buildah tests all pass containers/buildah#6039, podman test is having an issue once I fix the podman tests I will address comments on this PR

Copy link

Ephemeral COPR build failed. @containers/packit-build please check.

@flouthoc flouthoc force-pushed the refactor-pull branch 5 times, most recently from da19354 to f1f8f60 Compare March 17, 2025 18:05
@flouthoc flouthoc requested a review from mtrmac March 17, 2025 18:28
@flouthoc
Copy link
Collaborator Author

@mtrmac Could you review the PR again ? I have modfied tests a little bit and also removed reload( from my code.

Comment on lines 68 to 75
digestFound := false
for _, d := range digests {
if d.String() == origDigest.String() {
digestFound = true
break
}
}
require.True(t, digestFound, "original digest string should be present in the digests array of new pulled image")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That results in a bad error message, the proper matcher for this is require.Contains()

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I want to match .String() part of the object. I wonder if require.Contains would allow to me to match only .String() parts of it.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

digest is defined as type Digest string so there doesn't not seem to be a reason why you would need to match .String() as this does just a type cast to string

func (d Digest) String() string {
	return string(d)
}

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(I’m academically interested in learning a good answer, but, pragmatically, I’m not worried. If the test ever fails, someone will add a few debugging prints.)

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(I’m academically interested in learning a good answer, but, pragmatically, I’m not worried. If the test ever fails, someone will add a few debugging prints.)

The first time this fails you would be greeted with a rather useless "true should equal false" message. Then in order to debug you would have to to extra work adding these prints which seems unnecessary. If the failure doesn't reproduce locally and/or is a flake then you must commit first all the debug info to CI so we can gather more useful logs which again is extra work.

One thing we should always assume is that the person who wrote the test is not the same person who will see this fail so they don't know what the test is doing, or even the same person will not remember right away and then you have to dig in the sources, etc... In the best case the error is good enough for another person to see what they broke without having to fully read and understand the entire test code. Or consider a QE person like Ed looking at flakes. They will not necessarily dig up the sources or understand them enough to then file a meaningful issue. If they just copy paste the test error it would be much more meaningful if it includes the real digests.

Logically for this this test there seem to be a few rather different cases I could see right away with the better error. The digest list is empty so the element cannot be part of it, list is not empty but only has different digests and similar maybe the digest algorithm was changed but the test still expects a sha256 one.

Sure here in c/common with these unit tests this is less of a concern compared to podman integration tests but still we should apply the same good habits everywhere.

@flouthoc flouthoc requested a review from Luap99 March 17, 2025 19:15
@flouthoc flouthoc force-pushed the refactor-pull branch 2 times, most recently from 82c2b5e to 706bcf2 Compare March 18, 2025 14:02
Copy link
Contributor

@mtrmac mtrmac left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The logic LGTM, just some cleanups.

@flouthoc flouthoc force-pushed the refactor-pull branch 2 times, most recently from 280ee10 to 20e2caa Compare March 20, 2025 19:05
Copy link
Contributor

@mtrmac mtrmac left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Non-blocking: As discussed elsewhere, consider removing the report… parameter from NewCopier.

Great work!

Copy link
Contributor

@mtrmac mtrmac left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(This doesn’t build, but otherwise) LGTM.

Following commit fixes a `race` condition in `libimage` because in `Pull(`
after performing `copy` from remote sources it agains attempts to resolve
image via `LookupImage`, any operation between `copy` and `LookupImage` can remove
`name` from the recently pulled image. Causing race in builds.

This issue was discoverd while working on PR containers/buildah#5971
```
buildah build -t test --jobs=2 --skip-unused-stages=false .
```

Containerfile
```
FROM quay.io/jitesoft/alpine
RUN arch
FROM --platform=linux/arm64 quay.io/jitesoft/alpine AS foreign
```

Following commit also addresses the commit containers@e2324dd
by performing the neccessary refactor.

No functional change in public exposed API, exisiting tests should pass as-is.
[NO NEW TESTS NEEDED]

Signed-off-by: flouthoc <[email protected]>
Fix linter error
```
Error: libimage/copier.go:180:51: `(*Runtime).newCopier` - `reportResolvedReference` always receives `nil` (unparam)
```

Signed-off-by: flouthoc <[email protected]>
@flouthoc
Copy link
Collaborator Author

@containers/podman-maintainers @nalind PTAL

@flouthoc
Copy link
Collaborator Author

Both buildah and podman CI passes with this patch.

Copy link
Member

@Luap99 Luap99 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

/lgtm

Copy link
Contributor

openshift-ci bot commented Mar 21, 2025

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: flouthoc, Luap99

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-merge-bot openshift-merge-bot bot merged commit dbeb17e into containers:main Mar 21, 2025
15 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants