Blob Get meta timeout, which blocks other image pulls #357

Open
1 task done
djdongjin opened this issue Dec 3, 2024 · 0 comments
Labels
bug Something isn't working

Comments

@djdongjin
Member

What happened in your environment?

Sometimes we see long image pull latency, even with lazy pulling enabled. Looking into the containerd and overlaybd logs, we found that overlaybd hits a Get meta timedout error on an image blob, and later image pull requests are blocked for the duration of that timeout.

Below is a rough timeline from the containerd and overlaybd logs. I probably can't share all the logs, but let me know if you need more details.

It seems the snapshotter (or overlaybd) cannot process requests in parallel, even when the requests are for different images.

# (containerd) - notice the long latency for img2, both are overlaybd images
"2024-12-03T01:24:32.973066470Z" PullImage img1
"2024-12-03T01:24:33.300435436Z" PullImage img1 returns image reference
"2024-12-03T01:24:36.923843733Z" PullImage img2
"2024-12-03T01:25:06.203165092Z" PullImage img2 returns image reference

# (overlaybd-snapshotter)
# the snapshotter starts to log img2-related entries only AFTER the previous img1 blob timed out.
"2024-12-03T01:24:33Z" "{repoBlobUrl: img1/blobs/... }"
"2024-12-03T01:25:06Z" "failed to enable target for ... failed:failed to open remote file ... Get meta timedout"
...
"2024-12-03T01:25:06Z" "{repoBlobUrl: img2/blobs/...}"

# (overlaybd)
"2024/12/03 01:24:36" /src/src/switch_file.cpp:50|try_open_zfile:open file as zfile format, path: img1/blobs/...
"2024/12/03 01:25:06" /src/src/image_file.cpp:165|__open_ro_remote:failed to open remote file img1/blobs/... Get meta timedout errno=110(Connection timed out)
"2024/12/03 01:25:08" /src/src/image_file.cpp:145|__open_ro_remote:open file from remotefs img2/blobs/...
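
To put numbers on the containerd timeline above, here is a small sketch that computes the two pull durations from the logged timestamps. The helper names (`parse_ts`, `pull_seconds`) are made up for illustration. The img2 pull takes about 29.3s, which is close to the 30s between overlaybd opening the img1 blob (01:24:36) and reporting the timeout (01:25:06).

```python
from datetime import datetime, timezone


def parse_ts(ts: str) -> datetime:
    """Parse a nanosecond-precision UTC timestamp, truncating to microseconds."""
    head, frac = ts.rstrip("Z").split(".")
    base = datetime.strptime(head, "%Y-%m-%dT%H:%M:%S").replace(tzinfo=timezone.utc)
    # datetime only stores microseconds, so keep the first six fractional digits.
    return base.replace(microsecond=int(frac[:6].ljust(6, "0")))


def pull_seconds(start: str, end: str) -> float:
    """Seconds elapsed between a PullImage request and its response."""
    return (parse_ts(end) - parse_ts(start)).total_seconds()


# Timestamps copied from the containerd log lines in this issue.
img1 = pull_seconds("2024-12-03T01:24:32.973066470Z", "2024-12-03T01:24:33.300435436Z")
img2 = pull_seconds("2024-12-03T01:24:36.923843733Z", "2024-12-03T01:25:06.203165092Z")

print(f"img1 pull: {img1:.3f}s")  # img1 pull: 0.327s
print(f"img2 pull: {img2:.3f}s")  # img2 pull: 29.279s
```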

What did you expect to happen?

  1. A Get meta timeout on one image shouldn't block requests for other images.
  2. The Get meta timeout should be configurable, so we can reduce the wait time. (I believe the payload for Get meta is small, so 30s seems too long.)
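
The two expectations above could look roughly like the sketch below. This is not how overlaybd or the snapshotter is implemented (overlaybd is C++); it is a Python illustration only, and every name in it (`fetch_meta`, `open_images`, `meta_timeout_s`) is hypothetical. Each image's metadata fetch runs on its own worker, and the timeout is a parameter rather than a fixed value, so a slow blob for img1 does not delay img2.

```python
import concurrent.futures as cf
import time
from typing import Dict


def fetch_meta(image: str, delay_s: float) -> str:
    """Hypothetical stand-in for the remote Get meta call; delay_s simulates latency."""
    time.sleep(delay_s)
    return f"meta for {image}"


def open_images(delays: Dict[str, float], meta_timeout_s: float) -> Dict[str, str]:
    """Fetch metadata for all images concurrently, with one shared, configurable deadline."""
    results: Dict[str, str] = {}
    pool = cf.ThreadPoolExecutor(max_workers=len(delays))
    start = time.monotonic()
    futures = {image: pool.submit(fetch_meta, image, d) for image, d in delays.items()}
    for image, fut in futures.items():
        # Every image gets the same deadline, measured from when all fetches started.
        remaining = max(0.0, meta_timeout_s - (time.monotonic() - start))
        try:
            results[image] = fut.result(timeout=remaining)
        except cf.TimeoutError:
            results[image] = "Get meta timed out"
    # Do not wait for the slow fetch; it finishes in the background.
    pool.shutdown(wait=False)
    return results


if __name__ == "__main__":
    # img1 is slow (0.5s), img2 is fast (0.01s); the timeout is set to 0.1s.
    print(open_images({"img1": 0.5, "img2": 0.01}, meta_timeout_s=0.1))
```

In this sketch img2 completes even though img1 times out, and the total wait is bounded by `meta_timeout_s` rather than by the slowest blob.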

How can we reproduce it?

Not sure. This only happens in one region. We have captured some live nodes that show the issue, but haven't been able to reproduce it yet.

What is the version of your Overlaybd?

overlaybd: 0.6.17

overlaybd-snapshotter: 0.6.7

What is your OS environment?

Ubuntu 20.04

Are you willing to submit PRs to fix it?

  • Yes, I am willing to fix it.
@djdongjin added the bug label on Dec 3, 2024