Race condition 500 Internal Server Error when submitting multiple builds to a directory that has never been used #3358

Open
hroncok opened this issue Aug 6, 2024 · 6 comments


@hroncok
Contributor

hroncok commented Aug 6, 2024

This happens to me fairly regularly when I run Copr impact checks to see whether an upgrade of some Fedora package breaks anything. I decided to create a smaller reproducer and report it.

Using the copr CLI:

  1. create a new copr project
  2. add packages from Fedora distgit (other sources may also be impacted)
  3. submit several builds at the same time to a custom directory that has never been used

Some of the builds will fail with:

Something went wrong:
Error: Response is not in JSON format, there is probably a bug in the API code.
Try 'copr-cli --debug' for more info.

Adding --debug does not reveal much:

Server response:
----------------


500 Internal Server Error

Internal Server Error
The server encountered an internal error or
misconfiguration and was unable to complete
your request.
Please contact the server administrator at 
 root@localhost to inform them of the time this error occurred,
 and the actions you performed just before this error.
More information about this error may be available
in the server error log.

Reproducer (uses moreutils-parallel):

COPR=reproducer-race
copr create $COPR --chroot fedora-rawhide-x86_64 --delete-after-days 30
copr add-package-distgit $COPR --webhook-rebuild off --commit rawhide --name dummy-test-package-gloster
parallel -j8 copr build-package $COPR:custom:1 --nowait --background --name -- dummy-test-package-gloster dummy-test-package-gloster dummy-test-package-gloster dummy-test-package-gloster dummy-test-package-gloster dummy-test-package-gloster dummy-test-package-gloster dummy-test-package-gloster

Often, some of the first builds error out:

Build was added to reproducer-race:
  https://copr.fedorainfracloud.org/coprs/build/...
Created builds: ...

Something went wrong:
Error: Response is not in JSON format, there is probably a bug in the API code.
Try 'copr-cli --debug' for more info.
Build was added to reproducer-race:
  https://copr.fedorainfracloud.org/coprs/build/...
Created builds: ...

If it does not happen to you, repeat with a new directory name ($COPR:custom:2, $COPR:custom:3...) until it does.

Use this to cancel the running/pending builds after you run the above in case you want to preserve resources for others:

parallel copr cancel -- $(copr list-builds --output-format text-row $COPR | cut -f1)

I hypothesize that a first build in the custom directory does something special (wrt creating the directory) and when multiple builds think they are first, they all attempt to do the special thing at the same time and some of them get an unhandled exception because of a race condition.
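To make the hypothesis concrete, here is a minimal sketch of a check-then-create race and one common way to tolerate it. This is not Copr's actual code: the `copr_dir` table, the function names, and the use of SQLite are all assumptions made for illustration, and the two "requests" are interleaved by hand so that the outcome is deterministic.

```python
import sqlite3


def make_db():
    # Hypothetical schema: one row per project directory, with a unique name.
    db = sqlite3.connect(":memory:")
    db.execute("CREATE TABLE copr_dir (name TEXT PRIMARY KEY)")
    return db


def dir_exists(db, name):
    row = db.execute("SELECT 1 FROM copr_dir WHERE name = ?", (name,)).fetchone()
    return row is not None


def create_dir(db, name):
    db.execute("INSERT INTO copr_dir (name) VALUES (?)", (name,))


def naive_race(db, name):
    """Simulate two requests that both check before either one inserts."""
    a_sees = dir_exists(db, name)  # request A: directory is missing
    b_sees = dir_exists(db, name)  # request B: directory is still missing
    outcomes = []
    for sees in (a_sees, b_sees):
        try:
            if not sees:
                create_dir(db, name)
            outcomes.append("ok")
        except sqlite3.IntegrityError:
            # In a web app, leaving this uncaught would surface as a 500.
            outcomes.append("IntegrityError")
    return outcomes


def get_or_create_dir(db, name):
    """Insert first and treat a duplicate as 'someone else created it'."""
    try:
        create_dir(db, name)
        return "created"
    except sqlite3.IntegrityError:
        return "already existed"
```

With the naive version, the second request fails on the duplicate insert. With `get_or_create_dir`, losing the race is an expected outcome rather than an error. Whether the real failure in Copr follows this pattern is unconfirmed.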

@FrostyX
Member

FrostyX commented Aug 7, 2024

Triage: Two issues to solve: 1. Why does the server return 500? 2. Return something reasonable to the client when a 500 does occur.

@hroncok
Contributor Author

hroncok commented Aug 7, 2024

In my experience, 500 happens when there is an unhandled Python exception. If the webserver runs in debug mode, the exception is shown, but if it is in production mode, it is hidden. If you have a development copr server with debug mode enabled, we could try reproducing there.
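As a rough illustration of that behavior, and of the "return something reasonable" half of the triage, the sketch below wraps a generic WSGI application so that an unhandled exception is logged and turned into a JSON 500 response instead of an HTML error page. It is an assumption-laden example, not Copr's code: `json_errors` and the response body are hypothetical, and it only catches exceptions raised while the app is called, not ones raised later while the response is being iterated.

```python
import json
import sys
import traceback


def json_errors(app):
    """Wrap a WSGI app so unhandled exceptions become a JSON 500 response."""

    def wrapper(environ, start_response):
        try:
            return app(environ, start_response)
        except Exception:
            # Keep the traceback in the server log; do not leak it to the client.
            log = environ.get("wsgi.errors", sys.stderr)
            traceback.print_exc(file=log)
            body = json.dumps({"error": "Internal Server Error"}).encode("utf-8")
            headers = [
                ("Content-Type", "application/json"),
                ("Content-Length", str(len(body))),
            ]
            # Passing exc_info tells the server this is an error response.
            start_response("500 Internal Server Error", headers, sys.exc_info())
            return [body]

    return wrapper
```

A client that expects JSON could then report the error cleanly rather than failing with "Response is not in JSON format".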

@hroncok
Contributor Author

hroncok commented Aug 7, 2024

I am looking at the code, searching where this could have happened and I found c1fa04b -- if this wasn't deployed yet, perhaps this fixed the issue.

@FrostyX
Member

FrostyX commented Aug 7, 2024

Hello @hroncok,
thank you for the report. The step-by-step reproducer is very much appreciated.

We decided to not prioritize this issue for the next 3 months because although annoying, it seems there should be an easy workaround. I suppose only the reproducer is done via parallel to hit the issue more easily but your actual script goes one by one? Then something like sleep 1 between calls should work around this? If I am wrong and there isn't an easy workaround, please let us know and we will prioritize this more.

@hroncok
Contributor Author

hroncok commented Aug 7, 2024

No, I use parallel to submit thousands of builds.

The workaround I use is to resubmit the failed ones later (a bit tricky to figure out which failed, but I can manage).

Another workaround is to submit the first one manually and use parallel to submit the rest after.
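Both workarounds can be combined in a small driver script. The following is only a sketch: `submit_build` and `submit_all` are hypothetical helpers, the copr command line is copied from the reproducer above, and it assumes that the CLI exits with a non-zero status when a submission fails. The `submit` parameter is injectable so the logic can be tried without contacting a Copr server.

```python
import subprocess
from concurrent.futures import ThreadPoolExecutor


def submit_build(target, package):
    """Run the copr CLI once; return True if it exited successfully."""
    cmd = ["copr", "build-package", target,
           "--nowait", "--background", "--name", package]
    return subprocess.run(cmd).returncode == 0


def submit_all(target, packages, submit=submit_build, jobs=8, retries=2):
    """Submit builds and return the packages that still failed at the end."""
    pending = list(packages)
    failed = []

    # Serial phase: submit one at a time until the first success,
    # so that only one request touches the unused directory.
    while pending:
        package = pending.pop(0)
        if submit(target, package):
            break
        failed.append(package)

    # Parallel phase: the directory should exist by now.
    with ThreadPoolExecutor(max_workers=jobs) as pool:
        results = list(pool.map(lambda p: submit(target, p), pending))
    failed += [p for p, ok in zip(pending, results) if not ok]

    # Retry phase: resubmit whatever failed, one by one.
    for _ in range(retries):
        if not failed:
            break
        failed = [p for p in failed if not submit(target, p)]
    return failed
```

For example, `submit_all("reproducer-race:custom:1", ["dummy-test-package-gloster"] * 8)` would mirror the reproducer, with the first build submitted on its own.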

@praiskup
Member

Probably related to #3372

Projects
Status: In 2 years