
Delete/Recreate openshift marketplace pods #1482

Merged
merged 1 commit into from
Apr 17, 2024

Conversation

@raukadah (Contributor) commented Apr 16, 2024

On the crc Zuul reproducer job, the cert-manager operator installation is failing with the following error.

the cert-manager CRDs are not yet installed on the Kubernetes API server

The cert-manager operator gets installed from the OpenShift marketplace. After digging deeper, we found that pods under the openshift-marketplace namespace are hitting CrashLoopBackOff due to the following error.

failed to populate resolver cache from source redhat-operators/openshift-marketplace:
failed to list bundles: rpc error: code = Unavailable desc = connection error: desc =

Based on crc-org/crc#4109 (comment), deleting and recreating the openshift-marketplace pods fixes the issue.

Since OCP is deployed after the pre_infra hook and the cert_manager role is called before post_infra, there is no way to run this workaround as a hook.

It would be best to include it under the openshift_setup role.

As the pull request owner and reviewers, we checked that:

  • Appropriate testing is done and actually running
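The workaround described above — delete the crashing pods so OLM recreates them — could be sketched as a pair of Ansible tasks like the following. This is a minimal illustration, not the exact diff from the PR; the kubeconfig variable follows the role's existing conventions, and the container-state filter is an assumption (CrashLoopBackOff is reported as a container waiting reason, not a pod phase).

```yaml
# Sketch of the workaround: list pods in openshift-marketplace and
# delete those whose containers are waiting in CrashLoopBackOff.
# OLM's catalog-operator then recreates the catalog pods.
- name: Gather pods in the openshift-marketplace namespace
  kubernetes.core.k8s_info:
    kubeconfig: "{{ cifmw_openshift_kubeconfig }}"
    kind: Pod
    namespace: openshift-marketplace
  register: _marketplace_pods

- name: Delete pods stuck in CrashLoopBackOff
  kubernetes.core.k8s:
    kubeconfig: "{{ cifmw_openshift_kubeconfig }}"
    state: absent
    kind: Pod
    namespace: openshift-marketplace
    name: "{{ item.metadata.name }}"
  loop: "{{ _marketplace_pods.resources | default([]) }}"
  when: >-
    item.status.containerStatuses | default([])
    | selectattr('state.waiting', 'defined')
    | selectattr('state.waiting.reason', 'equalto', 'CrashLoopBackOff')
    | list | length > 0
```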

@@ -0,0 +1,18 @@
---
- name: Disable/Enable default CatalogSource
Collaborator
I'd do this only if the issue is hitting; AFAIK, it's happening only in the latest 4.15, but I'd rather not mess with the marketplace always as a default.

Contributor Author
Done


Build failed (check pipeline). Post recheck (without leading slash)
to rerun all jobs. Make sure the failure cause has been resolved before
you rerun jobs.

https://review.rdoproject.org/zuul/buildset/a3722701902d4a80a1d4f3fc59925e8d

✔️ openstack-k8s-operators-content-provider SUCCESS in 30m 43s
podified-multinode-edpm-deployment-crc RETRY_LIMIT in 8m 45s
✔️ noop SUCCESS in 0s
cifmw-pod-pre-commit FAILURE in 7m 51s (non-voting)

@raukadah raukadah force-pushed the fix_cert_manager branch 2 times, most recently from 6e9b003 to 05e96bf Compare April 16, 2024 13:12

Build failed (check pipeline). Post recheck (without leading slash)
to rerun all jobs. Make sure the failure cause has been resolved before
you rerun jobs.

https://review.rdoproject.org/zuul/buildset/f6bf3b2e2b8c40b684c4021b6323cd3e

✔️ openstack-k8s-operators-content-provider SUCCESS in 29m 43s
podified-multinode-edpm-deployment-crc RETRY_LIMIT in 8m 44s
✔️ noop SUCCESS in 0s
✔️ cifmw-pod-pre-commit SUCCESS in 7m 39s (non-voting)

kubernetes.core.k8s_info:
kind: Pod
kubeconfig: "{{ cifmw_openshift_kubeconfig }}"
name: "{{ pod_list.stdout | regex_search('^pod/redhat-operators-.*$', multiline=True) | split('/') | last }}"
Collaborator
Hmm, I don't get this one; this will query a single pod since you are passing the name field. Isn't it enough to remove the previous task and the name? The field_selector here should do the rest.

Contributor Author
Yes, got it now! Updated it.
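For reference, a single-task lookup along the lines the reviewer suggests might look like this. It is only a sketch, not the final diff: `olm.catalogSource` is the label OLM conventionally puts on catalog registry pods, but the exact selector used in the merged change may differ.

```yaml
# One k8s_info call replaces the shell lookup plus name filtering:
# select the redhat-operators catalog pods directly via a label selector.
- name: Get redhat-operators catalog pods
  kubernetes.core.k8s_info:
    kubeconfig: "{{ cifmw_openshift_kubeconfig }}"
    kind: Pod
    namespace: openshift-marketplace
    label_selectors:
      - olm.catalogSource=redhat-operators
  register: _pod_status
```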


Build failed (check pipeline). Post recheck (without leading slash)
to rerun all jobs. Make sure the failure cause has been resolved before
you rerun jobs.

https://review.rdoproject.org/zuul/buildset/d3f10df365794f53936424b8e6df5aca

✔️ openstack-k8s-operators-content-provider SUCCESS in 38m 08s
podified-multinode-edpm-deployment-crc FAILURE in 18m 00s
✔️ noop SUCCESS in 0s
✔️ cifmw-pod-pre-commit SUCCESS in 8m 29s (non-voting)

@raukadah (Contributor Author)

recheck


Build failed (check pipeline). Post recheck (without leading slash)
to rerun all jobs. Make sure the failure cause has been resolved before
you rerun jobs.

https://review.rdoproject.org/zuul/buildset/ee2bcb7ff4f74e11a38412cb7bd97ca6

✔️ openstack-k8s-operators-content-provider SUCCESS in 34m 41s
podified-multinode-edpm-deployment-crc FAILURE in 17m 32s
✔️ noop SUCCESS in 0s
✔️ cifmw-pod-pre-commit SUCCESS in 8m 22s (non-voting)

@lewisdenny (Collaborator) left a comment

Interesting issue, Chandan. I was going to suggest we add a TODO to remove this "workaround" once fixed, but it seems from the RH solution you linked that this is just how it works.

One request left

@@ -4,6 +4,12 @@ cifmw_use_libvirt: false

cifmw_openshift_setup_skip_internal_registry_tls_verify: true

pre_infra:
Collaborator

We can't use pre_infra here as OCP isn't available at that stage.

post_infra looks to be the correct hook to use but I didn't test.

You can see CI failing due to this[1]:

fatal: [localhost]: FAILED! => {"changed": false, "msg": "Could not find or access '/home/zuul/.crc/machines/crc/kubeconfig' on the Ansible Controller.\nIf you are using a module and expect the file to exist on the remote, see the remote_src option"}

[1] https://logserver.rdoproject.org/82/1482/43267db1d1546643bcf8d8f5f0a5fc8cc69f8f2d/github-check/podified-multinode-edpm-deployment-crc/a1df9e4/controller/ci-framework-data/logs/ci_script_000_run_disable_enable_red_hat.log

Contributor Author

Done!

Contributor Author

Thank you @pablintino @lewisdenny for the review. Since OCP is deployed after the pre_infra hook and the cert_manager role is called before post_infra, there is no way to run this workaround as a hook, so we need to include it in the cert_manager role itself. I have updated it accordingly.

Collaborator

Nice, let's see how CI likes it. Thanks for adding the comment too :)

namespace: openshift-marketplace
field_selectors:
- status.phase=CrashLoopBackOff
register: _pod_status
Contributor

Just deleting the crashing pod will also recover it: crc-org/crc#4109 (comment).
Issue seen in 4.15.3 at least; 4.15.8 didn't hit it.

Contributor Author

Done!
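Per this suggestion, the fix reduces to deleting the affected pods and letting OLM reconcile. A hedged sketch of such a delete step, reusing the `_pod_status` register from the diff above:

```yaml
# Deleting the crashing catalog pod is enough: the catalog-operator
# notices the missing pod for the CatalogSource and recreates it.
- name: Delete crashing openshift-marketplace pods
  kubernetes.core.k8s:
    kubeconfig: "{{ cifmw_openshift_kubeconfig }}"
    state: absent
    kind: Pod
    namespace: openshift-marketplace
    name: "{{ item.metadata.name }}"
  loop: "{{ _pod_status.resources | default([]) }}"
```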

@raukadah raukadah force-pushed the fix_cert_manager branch 4 times, most recently from 137c8f7 to 8ea3496 Compare April 17, 2024 06:10
On the crc Zuul reproducer job, cert-manager operator installation is
failing with the following error.
```
the cert-manager CRDs are not yet installed on the Kubernetes API server
```

The cert-manager operator gets installed from the OpenShift marketplace.
After digging deeper, we found that pods under the openshift-marketplace
namespace are hitting CrashLoopBackOff due to the following error.
```
failed to populate resolver cache from source redhat-operators/openshift-marketplace:
failed to list bundles: rpc error: code = Unavailable desc = connection error: desc =
```

Based on crc-org/crc#4109 (comment),
deleting and recreating the openshift-marketplace pods fixes the issue.

Since OCP is deployed after the pre_infra hook and the cert_manager role is
called before post_infra, there is no way to run this workaround
as a hook.

It would be best to include it under the openshift_setup role.

Signed-off-by: Chandan Kumar <[email protected]>
@raukadah raukadah changed the title [cert_manager]Disable/Enable default catalogsource Delete/Recreate openshift marketplace Apr 17, 2024
@raukadah raukadah changed the title Delete/Recreate openshift marketplace Delete/Recreate openshift marketplace pods Apr 17, 2024
@arxcruz (Contributor) commented Apr 17, 2024

/lgtm

@rebtoor (Contributor) commented Apr 17, 2024

/approve

openshift-ci bot commented Apr 17, 2024

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: rebtoor

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-merge-bot openshift-merge-bot bot merged commit a113e1a into main Apr 17, 2024
10 checks passed
@openshift-merge-bot openshift-merge-bot bot deleted the fix_cert_manager branch April 17, 2024 09:36