Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Run tests on GitHub builds #211

Open
wants to merge 12 commits into
base: main
Choose a base branch
from

Conversation

wonglkd
Copy link
Contributor

@wonglkd wonglkd commented Mar 8, 2023

  1. Run most tests after build
  2. Cache submodules (folly, fizz, wangle, fbthrift) to speed up build
  3. Update workflow files to upgrade actions to v3 and remove Node version warning
  4. Shortened job names to avoid cut-off in GitHub UI

Test runner script

  • Runs most tests except 2 long-running benchmarks in $TO_SKIP
  • Adds a Markdown summary to GitHub action job, including log excerpt
  • Adds warning annotations for failing optional tests or timed out tests
  • Adds error annotation for failing tests not in $OPTIONAL
  • $OPTIONAL tests have their failures ignored. List needs to be manually updated.
  • Tests are run first (allocated 20 mins), then benchmarks (allocated a separate 20 mins). Within each stage, they are run in parallel ($PARALLELISM = 10)

Trade-offs considered

  • To keep things simple, ignored list is not segmented by OS; may miss out some new failing tests if already ignored for another OS)
  • A standalone bash script keeps it within GitHub Actions instead of adding an external dependency (most FB OSS repos use GH actions alone, with RocksDB using CircleCI and HHVM using NixCI). Still, it would be nice to have proper test insights (e.g., CircleCI's)
  • Because we fail build if tests do not pass, this could add a lot of noise. Hopefully this should be mitigated with $OPTIONAL. If not, one can comment the last line of run_tests.sh so that builds will not fail, with the disadvantage that one needs to view individual workflows to see the summary. (Or write a script to compile that)

Build time impact (net gain):

  • 30-50 mins less from not building submodules, if they do not change
  • 30 mins more for tests

Test plan:

  • Caching: ran with it enabled on my fork over a few days
  • Ran testing script on my own fork, tried with 1) tests that failed but were not ignored, and 2) tests that timed out

Screenshots of job summary page (accessible via Checks tab on PRs, or in Actions tab)
image
image

wonglkd added 5 commits March 8, 2023 21:36
Going via the GitHub Actions tab brings people to the summary page first, but clicking Details in a PR will bring people to the log directly and skip the summary page. Print a message telling people to look at the summary.
@facebook-github-bot facebook-github-bot added the CLA Signed This label is managed by the Facebook bot. Authors need to sign the CLA before a PR can be reviewed. label Mar 8, 2023
@wonglkd
Copy link
Contributor Author

wonglkd commented Mar 8, 2023

  1. I figure there are multiple ways to do tests, please feel free to suggest changes or turn down this PR!
  2. As a side-effect, this appears to show failing tests on the following OSes:

Ignored tests (and failing OSes)

  • allocator-test-AllocationClassTest
    • Ubuntu 18 (segfault)
  • allocator-test-AllocatorResizeTypeTest
    • Rocky 8.6
  • allocator-test-AllocatorTypeTest
    • CentOS 8.5, Rocky 8.6
  • allocator-test-MemoryAllocatorTest
    • Ubuntu 18 (segfault)
  • allocator-test-MM2QTest
    • Ubuntu 18 (segfault)
  • allocator-test-BlockCacheTest
    • Rocky 8.6
  • allocator-test-NavySetupTest
    • CentOS 8.1, CentOS 8.5, Debian, Fedora 36, Rocky 9, Rocky 8.6, Ubuntu 18
  • common-test-UtilTests
    • CentOS 8.1, CentOS 8.5, Debian, Fedora 36, Rocky 9, Rocky 8.6, Ubuntu 18
  • navy-test-BlockCacheTest
    • CentOS 8.1, Rocky 9, Ubuntu 18
  • navy-test-DriverTest
    • CentOS 8.1, CentOS 8.5, Debian, Fedora 36, Rocky 9, Ubuntu 18, Ubuntu 20, Ubuntu 22
  • navy-test-MockJobSchedulerTest
    • CentOS 8.5, Rocky 9, Ubuntu 20
  • shm-test-test_page_size
    • CentOS 8.1, CentOS 8.5, Debian, Fedora 36, Rocky 9, Rocky 8.6, Ubuntu 18, Ubuntu 20, Ubuntu 22
    • Large pages need to be enabled in workflow files or test script
  • datatype-test-MapTest (Ubuntu 20)
  • allocator-test-SlabAllocatorTest (Ubuntu 18)
  • allocator-test-NvmCacheTests (Rocky 8)
  • allocator-test-RefCountTest (Ubuntu 18)

Failing, not yet ignored

wonglkd added 2 commits March 8, 2023 19:54
- datatype-test-MapTest (Ubuntu 20)
- allocator-test-SlabAllocatorTest (Ubuntu 18)
@wonglkd wonglkd marked this pull request as ready for review March 9, 2023 06:09
@therealgymmy
Copy link
Contributor

therealgymmy commented Mar 9, 2023

We really appreciate you working on this! This will save us so much time maintaining OSS builds going forward :-D

shm-test-test_page_size

This test is expected to fail since it relies on adjusting page size. Internally, we also have this disabled on test services (since the test needs sudo permission to manipulate page size).

"Util.SysctlTests" is also expected to fail because it expects to change overcommit settings in the kernel

The other test failures are not expected. I'll go through the logs.

A standalone bash script keeps it within GitHub Actions instead of adding an external dependency (most FB OSS repos use GH actions alone, with RocksDB using CircleCI and HHVM using NixCI). Still, it would be nice to have proper test insights (e.g., CircleCI's)

RocksDB team develops for OSS first and then imports into our internal source control. CacheLib and most other FB OSS projects develops primarily within FB's internal setup, and then a system picks up our changes and push to Github. It's a trade-off between internal dev velocity and OSS dev velocity.

@wonglkd
Copy link
Contributor Author

wonglkd commented Mar 10, 2023

Sounds good! I hope it helps! GitHub's CI runners do have sudo, btw, so one could add a step in the test script to adjust that. Is there a recommended setting? (Realised that for OSes running within Docker containers, one will get a sysctl: permission denied on key "vm.nr_hugepages" even with sudo. For native builds, setting the value works, but the test still fails.)

Also, since GitHub Actions has free unlimited mins on OSS repos anyway, perhaps one could consider enabling CI on every push? Scheduled builds are helpful but one still needs to line up the timestamps and manually bisect the exact commit or code change that causes a break - plus, it's free anyway.

Btw, is there any expectation that tests are run sequentially vs in parallel?

@therealgymmy
Copy link
Contributor

therealgymmy commented Mar 14, 2023

Btw, is there any expectation that tests are run sequentially vs in parallel?

Internally we run them in parallel. That said, sequential is fine as long as tests finish in a reasonable time.

Also, since GitHub Actions has free unlimited mins on OSS repos anyway, perhaps one could consider enabling CI on every push? Scheduled builds are helpful but one still needs to line up the timestamps and manually bisect the exact commit or code change that causes a break - plus, it's free anyway.

Good to know. I wasn't aware github actions are free actually. Previously we had to pay for it (but maybe that was when we were first setting up this repo and it was private then).

Generalize from segfaults to other things that cause core dumps
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
CLA Signed This label is managed by the Facebook bot. Authors need to sign the CLA before a PR can be reviewed.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants