Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

🐛clustercatalog_controller: hacky, but more correct status updating in reconciler #424

Merged
merged 1 commit into from
Oct 4, 2024

Conversation

joelanford
Copy link
Member

@joelanford joelanford commented Oct 3, 2024

For awhile in catalogd, we've had a very loose "needsUnpack" behavior where we would make the assumption that if we didn't need to unpack, then we also didn't need to update the status of the object.

This means that some objects may never have their status updated after an upgrade. That's problematic because upgrades may add new helpful fields or fix bugs in the way that status is reported.

This PR is a somewhat hacky fix for the situation. It solves the problem by keeping an in-memory map of the observedGeneration and resolved source, and then using that as the source of truth to and then use that expected status as part of the determination of whether an unpack is required.

A new criteria of "needsUnpack" is now: "does the expected status differ from the status that is on the object. If so, unpack again.

Fixes #422
Fixes #420

@joelanford joelanford requested a review from a team as a code owner October 3, 2024 20:20
Copy link

codecov bot commented Oct 3, 2024

Codecov Report

Attention: Patch coverage is 85.10638% with 14 lines in your changes missing coverage. Please review.

Project coverage is 38.55%. Comparing base (8137da0) to head (63b0914).
Report is 1 commits behind head on main.

Files with missing lines Patch % Lines
...rnal/controllers/core/clustercatalog_controller.go 87.20% 11 Missing ⚠️
cmd/manager/main.go 0.00% 3 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main     #424      +/-   ##
==========================================
+ Coverage   35.28%   38.55%   +3.26%     
==========================================
  Files          14       14              
  Lines         802      843      +41     
==========================================
+ Hits          283      325      +42     
- Misses        472      474       +2     
+ Partials       47       44       -3     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

Unpacker source.Unpacker
Storage storage.Instance

finalizers crfinalizer.Finalizers
Copy link
Member Author

@joelanford joelanford Oct 3, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I needed to refactor finalizers so that it had access to the reconciler struct. That way it would be possible to delete(r.storedCatalogs, catalog.Name) in the finalizer logic.

case r.needsPoll(storedCatalog.unpackResult.ResolvedSource.Image.LastSuccessfulPollAttempt.Time, catalog):
l.Info("unpack required: poll duration has elapsed")
needsUnpack = true
}
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I intentionally added lots of logging here to help us deduce why we might be unpacking more frequently than expected.

clearUnknownConditions(expectedStatus)
if hasStoredCatalog && r.Storage.ContentExists(catalog.Name) {
updateStatusServing(expectedStatus, storedCatalog.unpackResult, r.Storage.ContentURL(catalog.Name), storedCatalog.observedGeneration)
updateStatusProgressing(expectedStatus, storedCatalog.observedGeneration, nil)
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I needed to change the updateStatusProgressing signature so that I could call it with a deep-copied status, and a different observed generation.

@@ -396,7 +396,7 @@ func TestImageRegistry(t *testing.T) {
// If the digest should already exist check that we actually hit it
if tt.digestAlreadyExists {
assert.Contains(t, buf.String(), "image already unpacked")
assert.Equal(t, rs.UnpackTime, unpackDirStat.ModTime())
assert.Equal(t, rs.UnpackTime, unpackDirStat.ModTime().Truncate(time.Second))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think even asserting on the second could be brittle if we happen to run on the second to next second break. How about just asserting the duration is in a reasonable range?

Suggested change
assert.Equal(t, rs.UnpackTime, unpackDirStat.ModTime().Truncate(time.Second))
assert.WithinDuration(t, rs.UnpackTime, unpackDirStat.ModTime(), time.Second)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This one should be exact. rs.UnpackTime is always truncated to the second in the same way. I'll add a comment in the code about why we need that though.

@bentito
Copy link
Contributor

bentito commented Oct 3, 2024

@joelanford when I point current operator-controller main at this with a replace and run make test-upgrade-e2e I am still seeing:

# github.com/operator-framework/operator-controller/internal/catalogmetadata/cache
internal/catalogmetadata/cache/cache.go:86:42: catalog.Status.ResolvedSource.Image.ResolvedRef undefined (type *"github.com/operator-framework/catalogd/api/core/v1alpha1".ResolvedImageSource has no field or method ResolvedRef)
internal/catalogmetadata/cache/cache.go:122:62: catalog.Status.ResolvedSource.Image.ResolvedRef undefined (type *"github.com/operator-framework/catalogd/api/core/v1alpha1".ResolvedImageSource has no field or method ResolvedRef)
internal/catalogmetadata/cache/cache.go:164:52: catalog.Status.ResolvedSource.Image.ResolvedRef undefined (type *"github.com/operator-framework/catalogd/api/core/v1alpha1".ResolvedImageSource has no field or method ResolvedRef)
# github.com/operator-framework/operator-controller/internal/controllers
internal/controllers/clusterextension_controller.go:419:53: oldObject.Status.ResolvedSource.Image.ResolvedRef undefined (type *"github.com/operator-framework/catalogd/api/core/v1alpha1".ResolvedImageSource has no field or method ResolvedRef)
internal/controllers/clusterextension_controller.go:419:106: newObject.Status.ResolvedSource.Image.ResolvedRef undefined (type *"github.com/operator-framework/catalogd/api/core/v1alpha1".ResolvedImageSource has no field or method ResolvedRef)
# github.com/operator-framework/operator-controller/internal/controllers [github.com/operator-framework/operator-controller/internal/controllers.test]
internal/controllers/clusterextension_controller.go:419:53: oldObject.Status.ResolvedSource.Image.ResolvedRef undefined (type *"github.com/operator-framework/catalogd/api/core/v1alpha1".ResolvedImageSource has no field or method ResolvedRef)
internal/controllers/clusterextension_controller.go:419:106: newObject.Status.ResolvedSource.Image.ResolvedRef undefined (type *"github.com/operator-framework/catalogd/api/core/v1alpha1".ResolvedImageSource has no field or method ResolvedRef)
make: *** [vet] Error 1

@joelanford
Copy link
Member Author

@bentito, that's expected. See the accompanying changes in operator-framework/operator-controller#1329.

No matter how you slice it, there is a breaking change from 0.29.0 to 0.30.0 (and from 0.29.0 to this PR):

  1. We removed some status fields that operator-controller depends on, and
  2. We changed the spec.source.type enum from image to Image, with no backcompat support. So the upgrade tests are bound to fail because the new catalogd no longer understands old data (i.e. spec.source.type: "image")

@bentito
Copy link
Contributor

bentito commented Oct 4, 2024

@bentito, that's expected. See the accompanying changes in operator-framework/operator-controller#1329.

No matter how you slice it, there is a breaking change from 0.29.0 to 0.30.0 (and from 0.29.0 to this PR):

1. We removed some status fields that operator-controller depends on, and

2. We changed the `spec.source.type` enum from `image` to `Image`, with no backcompat support. So the upgrade tests are bound to fail because the new catalogd no longer understands old data (i.e. `spec.source.type: "image"`)

ah, okay, understanding that now, was just worried it was supposed to be fixed from catalogd side alone

ankitathomas
ankitathomas previously approved these changes Oct 4, 2024
Copy link
Contributor

@ankitathomas ankitathomas left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good! Left a small comment about the status comparison

clearUnknownConditions(expectedStatus)
if hasStoredCatalog && r.Storage.ContentExists(catalog.Name) {
updateStatusServing(expectedStatus, storedCatalog.unpackResult, r.Storage.ContentURL(catalog.Name), storedCatalog.observedGeneration)
updateStatusProgressing(expectedStatus, storedCatalog.observedGeneration, nil)
Copy link
Contributor

@ankitathomas ankitathomas Oct 4, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this to avoid waiting till the next poll before reconciling in case of unpack errors? Looks like the equivalence check for the Progressing condition will always fail if we had an error on the last unpack; do we backoff between reconciling errored unpacks?

Copy link
Member Author

@joelanford joelanford Oct 4, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh this is a good question. Here's a scenario:

  1. We are successfully serving content for generation 1
  2. We reconcile generation 2, which has a bad image ref, let's say the tag doesn't exist. So we:
    • Do not update storedCatalog map
    • DO update the status with Progressing=True, Retrying
  3. On the next reconcile (that is requeued basically immediately), we build an "expectedStatus" that says Installed=True and Progressing=False based on this code. We later compare this to the real status from the object and we are told that there are differences, which forces an unpack.

Step 3 is a little odd though. Maybe if we return an error from Reconcile we should delete(r.storedCatalogs, catalog.Name) as a better signal for "we need to attempt to unpack again"?

Comment on lines +215 to 224
updateStatusProgressing(&catalog.Status, catalog.GetGeneration(), storageErr)
return ctrl.Result{}, storageErr
Copy link
Contributor

@ankitathomas ankitathomas Oct 4, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

considering this isn't a terminal error, should we requeue here?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

By returning the storageErr, we get requeue based on exponential backoff. We only want the requeue based on poll interval if there are no errors during reconcile.

everettraven
everettraven previously approved these changes Oct 4, 2024
@joelanford joelanford added this pull request to the merge queue Oct 4, 2024
Merged via the queue into operator-framework:main with commit dee3337 Oct 4, 2024
13 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Bug: when needsUnpack returns false no status update happens Avoid using resource status for reconcile logic
5 participants