
[SYCL][CUDA][HIP] Remove CUDA and HIP PI unit tests #12459

Merged: 1 commit merged into intel:sycl on Feb 5, 2024

Conversation

@npmiller (Contributor) commented on Jan 22, 2024

These tests are not currently running and are covered in other test suites:

* `test_primary_context.cpp`
  * Deprecated feature, covered in `test-e2e/Basic/context.cpp`
* `test_commands.cpp`
  * Covered by UR CTS
* `test_sampler_properties.cpp`
  * Covered by UR CTS: https://github.com/oneapi-src/unified-runtime/tree/main/test/conformance/sampler
* `PlatformTest.cpp`
  * Covered by UR CTS: https://github.com/oneapi-src/unified-runtime/blob/main/test/conformance/platform/urPlatformGetInfo.cpp
* `test_device.cpp`
  * Covered by UR CTS: https://github.com/oneapi-src/unified-runtime/blob/main/test/conformance/device/urDeviceGetInfo.cpp
* `EnqueueMemTest.cpp`
  * Covered by UR CTS: https://github.com/oneapi-src/unified-runtime/blob/main/test/conformance/enqueue/urEnqueueMemBufferFill.cpp
* `test_mem_obj.cpp`
  * Moved to UR CTS
* `test_contexts.cpp`
  * Covered by https://github.com/oneapi-src/unified-runtime/blob/main/test/adapters/cuda/context_tests.cpp
* `test_kernels.cpp`
  * Covered by https://github.com/oneapi-src/unified-runtime/blob/main/test/adapters/cuda/kernel_tests.cpp
* `test_base_objects.cpp`
  * Basic tests, mostly covered in UR
* `test_interop_get_native.cpp`
  * Mostly covered in UR tests and E2E tests

After this, both the CUDA and HIP directories could be removed. There are two PI tests remaining: one regarding xpti handling of PI call arguments, and one regarding OpenCL interop ownership.
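
For reference, the linked UR CTS checks for queries like `urDeviceGetInfo` follow roughly this pattern. This is a sketch based on the documented Unified Runtime API, not the actual CTS source:

```cpp
// Sketch of a UR device-info conformance check; illustrative only,
// not the actual CTS code linked above.
#include <ur_api.h>
#include <cassert>
#include <cstdio>
#include <string>

void checkDeviceName(ur_device_handle_t device) {
  // First call: ask how large the property value is.
  size_t size = 0;
  ur_result_t res =
      urDeviceGetInfo(device, UR_DEVICE_INFO_NAME, 0, nullptr, &size);
  assert(res == UR_RESULT_SUCCESS && size > 0);

  // Second call: retrieve the property itself and sanity-check it.
  std::string name(size, '\0');
  res = urDeviceGetInfo(device, UR_DEVICE_INFO_NAME, size, name.data(),
                        nullptr);
  assert(res == UR_RESULT_SUCCESS);
  std::printf("device name: %s\n", name.c_str());
}
```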

@npmiller changed the title from [SYCL] Remove PI unit tests to [SYCL][CUDA][HIP] Remove CUDA and HIP PI unit tests on Jan 24, 2024
@npmiller marked this pull request as ready for review on January 24, 2024 18:10
@npmiller requested a review from a team as a code owner on January 24, 2024 18:10
@npmiller (Contributor, Author) commented on Feb 5, 2024

ping @intel/llvm-reviewers-runtime

@steffenlarsen (Contributor) left a comment

Sorry for the delay! Seems reasonable, given the UR replacements.

@steffenlarsen merged commit b781e6c into intel:sycl on Feb 5, 2024
12 checks passed
@aelovikov-intel (Contributor) commented on Feb 5, 2024

Post-commit failures on Arc GPU:

Failed Tests (2):
  SYCL :: OneapiDeviceSelector/level_zero_top.cpp
  SYCL :: Plugin/level_zero_ext_intel_cslice.cpp
FAIL: SYCL :: OneapiDeviceSelector/level_zero_top.cpp (1421 of 1868)
******************** TEST 'SYCL :: OneapiDeviceSelector/level_zero_top.cpp' FAILED ********************
Exit Code: -6

Command Output (stdout):
--
# RUN: at line 2
/__w/llvm/llvm/toolchain/bin//clang++   -fsycl -fsycl-targets=spir64 /__w/llvm/llvm/llvm/sycl/test-e2e/OneapiDeviceSelector/level_zero_top.cpp -o /__w/llvm/llvm/build-e2e/OneapiDeviceSelector/Output/level_zero_top.cpp.tmp.out
# executed command: /__w/llvm/llvm/toolchain/bin//clang++ -fsycl -fsycl-targets=spir64 /__w/llvm/llvm/llvm/sycl/test-e2e/OneapiDeviceSelector/level_zero_top.cpp -o /__w/llvm/llvm/build-e2e/OneapiDeviceSelector/Output/level_zero_top.cpp.tmp.out
# note: command had no output on stdout or stderr
# RUN: at line 3
env ONEAPI_DEVICE_SELECTOR=level_zero:gpu  /__w/llvm/llvm/build-e2e/OneapiDeviceSelector/Output/level_zero_top.cpp.tmp.out
# executed command: env ONEAPI_DEVICE_SELECTOR=level_zero:gpu /__w/llvm/llvm/build-e2e/OneapiDeviceSelector/Output/level_zero_top.cpp.tmp.out
# .---command stdout------------
# | Level-Zero GPU Device is found: true
# | Intel(R) Level-Zero is found: true
# | Expectedly, cpu device is not found.
# | Expectedly, ACC device is not found.
# | Abort was called at 253 line in file:
# | ../../neo/level_zero/core/source/builtin/builtin_functions_lib_impl.cpp
# `-----------------------------
# error: command failed with exit status: -6
FAIL: SYCL :: Plugin/level_zero_ext_intel_cslice.cpp (1457 of 1868)
******************** TEST 'SYCL :: Plugin/level_zero_ext_intel_cslice.cpp' FAILED ********************
Exit Code: -6

Command Output (stdout):
--
# RUN: at line 4
/__w/llvm/llvm/toolchain/bin//clang++   -fsycl -fsycl-targets=spir64 /__w/llvm/llvm/llvm/sycl/test-e2e/Plugin/level_zero_ext_intel_cslice.cpp -o /__w/llvm/llvm/build-e2e/Plugin/Output/level_zero_ext_intel_cslice.cpp.tmp.out
# executed command: /__w/llvm/llvm/toolchain/bin//clang++ -fsycl -fsycl-targets=spir64 /__w/llvm/llvm/llvm/sycl/test-e2e/Plugin/level_zero_ext_intel_cslice.cpp -o /__w/llvm/llvm/build-e2e/Plugin/Output/level_zero_ext_intel_cslice.cpp.tmp.out
# note: command had no output on stdout or stderr
# RUN: at line 6
env ZEX_NUMBER_OF_CCS=0:4 UR_L0_DEBUG=1 env ONEAPI_DEVICE_SELECTOR=level_zero:gpu  /__w/llvm/llvm/build-e2e/Plugin/Output/level_zero_ext_intel_cslice.cpp.tmp.out 2>&1 | /__w/llvm/llvm/toolchain/bin/FileCheck /__w/llvm/llvm/llvm/sycl/test-e2e/Plugin/level_zero_ext_intel_cslice.cpp --check-prefixes=CHECK-PVC
# executed command: env ZEX_NUMBER_OF_CCS=0:4 UR_L0_DEBUG=1 env ONEAPI_DEVICE_SELECTOR=level_zero:gpu /__w/llvm/llvm/build-e2e/Plugin/Output/level_zero_ext_intel_cslice.cpp.tmp.out
# note: command had no output on stdout or stderr
# error: command failed with exit status: -6
# executed command: /__w/llvm/llvm/toolchain/bin/FileCheck /__w/llvm/llvm/llvm/sycl/test-e2e/Plugin/level_zero_ext_intel_cslice.cpp --check-prefixes=CHECK-PVC
# note: command had no output on stdout or stderr
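
For readers unfamiliar with the failing test, here is a rough sketch of the kind of check `level_zero_top.cpp` performs, reconstructed from the output above rather than taken from the actual test source: with `ONEAPI_DEVICE_SELECTOR=level_zero:gpu`, only Level-Zero GPU devices should be visible.

```cpp
// Hypothetical reconstruction of the checks behind the output above;
// NOT the actual sycl/test-e2e/OneapiDeviceSelector/level_zero_top.cpp.
#include <sycl/sycl.hpp>
#include <iostream>
#include <string>

int main() {
  bool foundGpu = false, foundL0Platform = false;
  bool foundCpu = false, foundAcc = false;
  // With ONEAPI_DEVICE_SELECTOR=level_zero:gpu, get_platforms() should
  // report only the Level-Zero platform, and only its GPU devices.
  for (const auto &P : sycl::platform::get_platforms()) {
    if (P.get_info<sycl::info::platform::name>().find("Level-Zero") !=
        std::string::npos)
      foundL0Platform = true;
    for (const auto &D : P.get_devices()) {
      foundGpu |= D.is_gpu();
      foundCpu |= D.is_cpu();
      foundAcc |= D.is_accelerator();
    }
  }
  std::cout << std::boolalpha;
  std::cout << "Level-Zero GPU Device is found: " << foundGpu << '\n';
  std::cout << "Intel(R) Level-Zero is found: " << foundL0Platform << '\n';
  if (!foundCpu)
    std::cout << "Expectedly, cpu device is not found.\n";
  if (!foundAcc)
    std::cout << "Expectedly, ACC device is not found.\n";
}
```

Note that in the log above the expected lines did print; the abort came afterwards, from inside the Level-Zero driver (`builtin_functions_lib_impl.cpp`).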

@npmiller (Contributor, Author) commented on Feb 5, 2024

I don't see how this patch could have caused these failures; could it be something on the machine, or from a previous patch?

This PR just removed old tests from a separate test suite that were already disabled and not running.

@aelovikov-intel (Contributor) commented on Feb 5, 2024

That wasn't a call to action. I'm just posting the failures (and encourage others in @intel/llvm-gatekeepers to do the same) so that failures are searchable through the GitHub interface (it can't look into the logs). That way, we can get some statistics on how flaky a test is, what configurations it can fail on, etc.

Ultimately, once a search for a failing test turns up several instances, we should create an issue/internal bug report and disable the test until it is resolved.

@bader (Contributor) commented on Feb 5, 2024

It sounds like something that should be automated rather than requested from gatekeepers.
@stdale-intel, FYI.

@aelovikov-intel (Contributor) commented on Feb 5, 2024

> It sounds like something that should be automated rather than requested from gatekeepers. @stdale-intel, FYI.

While I agree that some automation would be helpful, it's still the gatekeepers' responsibility to explain every failure in post-commit or to ask the PR author to do so. We cannot simply ignore all the flaky failures in our CI.

@ldrumm (Contributor) commented on Feb 6, 2024

> I'm just posting the failures (and encourage others in @intel/llvm-gatekeepers to do the same) so that failures are searchable through the GitHub interface

How does this make failures searchable? Are you using some consistent language in your comment?

> It's still the gatekeepers' responsibility to explain every failure in post-commit or to ask the PR author to do so.

Do we really want to divert the attention of all users in the failure blame list even when it's obvious which commits are not responsible? That seems like pointless busywork for the author and the gatekeeper. If we have a buildbot commenting on every PR in the blamelist, then fine, since the machine can't make that distinction; but if a human is in the loop, then surely they should use their judgment. And if the human is not allowed to use their judgment, they should be replaced by a machine, in order not to waste human time.

@aelovikov-intel (Contributor) commented on Feb 6, 2024

> How does this make failures searchable? Are you using some consistent language in your comment?

Copy-paste the failing test name and then search for it using GitHub's repo search:

https://github.com/search?q=repo%3Aintel%2Fllvm+Plugin%2Flevel_zero_ext_intel_cslice.cpp&type=pullrequests

[Screenshot: GitHub search results listing PRs that mention the failing test]

> Do we really want to divert the attention of all users

I usually tag people if I expect some action from them, so I don't think we divert attention that much.

We do have lots of flaky tests, though, and we need to do something about that. The first step is to gather some statistics, and that's the best we can do for now. If somebody is willing to write scripts to parse the logs, update some spreadsheet/database with those flaky failures, and maintain that process, I'd be more than happy to switch to it, but nobody has volunteered so far.

Yes, it's the gatekeepers' responsibility, because we can't place that burden on occasional contributors, but we have to have somebody looking into the issues.
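
A log-parsing script of the kind mentioned above could start as small as this sketch. It assumes lit output in the format shown earlier in this thread; the invocation and CSV layout are illustrative only, not an existing tool.

```cpp
// Hypothetical flaky-test tally script; not an existing tool.
// Usage (illustrative): ./flaky-stats post-commit-*.log > flaky.csv
#include <fstream>
#include <iostream>
#include <map>
#include <string>

int main(int argc, char **argv) {
  std::map<std::string, unsigned> failCounts;
  const std::string marker = "FAIL: SYCL :: ";
  for (int i = 1; i < argc; ++i) {
    std::ifstream log(argv[i]);
    std::string line;
    while (std::getline(log, line)) {
      auto pos = line.find(marker);
      if (pos == std::string::npos)
        continue;
      // The test name runs up to the " (N of M)" progress counter.
      auto end = line.find(" (", pos);
      auto start = pos + marker.size();
      ++failCounts[line.substr(start, end == std::string::npos
                                          ? std::string::npos
                                          : end - start)];
    }
  }
  // Emit CSV rows suitable for a spreadsheet of flaky-test statistics.
  std::cout << "test,failures\n";
  for (const auto &[test, count] : failCounts)
    std::cout << test << ',' << count << '\n';
}
```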

@sommerlukas (Contributor) commented on Feb 6, 2024

An alternative would be to create an issue for each flaky test (or for a group of related flaky tests) and, in that issue, post the PRs where the test failed in post-commit without being related to the changes in the PR.

That way, the information for a flaky test would be collected in a single location (the issue) and the PR author's attention would not be diverted.

@aelovikov-intel (Contributor) commented on Feb 6, 2024

That would only work if you already know the issue number. In my experience, we couldn't even get people to post a comment without searching; expecting them to find an issue first is unrealistic.
