Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

use C++ for unit tests #646

Open
wants to merge 22 commits into
base: master
Choose a base branch
from
Open

Commits on Oct 5, 2024

  1. fix(build): treat functional+unit tests separately

    CI has not been running unit tests for neuron, even though it's
    possible. This doesn't make functional tests work on neuron yet, and it
    remains automatically disabled. Separate out the enablement of the
    functional tests from the unit tests. Make --disable-tests not build the
    unit tests or the functional tests.
    
    Signed-off-by: Nicholas Sielicki <[email protected]>
    aws-nslick committed Oct 5, 2024
    Configuration menu
    Copy the full SHA
    2331505 View commit details
    Browse the repository at this point in the history
  2. fix(ci/gha): remove "libfabric" from job names

    858f448 stopped building libfabric from git, but didn't change the name,
    so all of these builds have been named "libfabric@git" when they're
    using the prebuilt efa installer. Remove the token.
    
    Signed-off-by: Nicholas Sielicki <[email protected]>
    aws-nslick committed Oct 5, 2024
    Configuration menu
    Copy the full SHA
    6552704 View commit details
    Browse the repository at this point in the history
  3. feat(ci/gha): enable unit tests for neuron builds

    Remove the existing unit-tests action -- pay the install cost time only
    once, and just run the tests after we run the build.
    
    Prefer to not pass -j to make -- this can create jumbled output that is
    harder to debug. These are running in parallel, it's okay if each
    individual TU takes longer.
    
    Signed-off-by: Nicholas Sielicki <[email protected]>
    aws-nslick committed Oct 5, 2024
    Configuration menu
    Copy the full SHA
    d51bba6 View commit details
    Browse the repository at this point in the history
  4. feat(ci/gha): add V=1 to all make invocations

    The only reason you are ever looking at these logs is because the job
    failed, in which case, it's a lot easier to debug if you can just see
    precisely what the compiler called. Enable that here.
    
    Signed-off-by: Nicholas Sielicki <[email protected]>
    aws-nslick committed Oct 5, 2024
    Configuration menu
    Copy the full SHA
    dfd2669 View commit details
    Browse the repository at this point in the history
  5. feat(ci/gha): use /etc/alternatives for cc/cxx

    This was a mess previously, and a lot of jobs were mislabeled as a
    result.
    
    Signed-off-by: Nicholas Sielicki <[email protected]>
    aws-nslick committed Oct 5, 2024
    Configuration menu
    Copy the full SHA
    64e4346 View commit details
    Browse the repository at this point in the history
  6. fix(rdma): assert on create_send_comm result

    Fixing the compiler being misapplied in CI has exposed another gcc13
    -fanalyzer report. This is likely a false positive; it seems to gets
    confused about how the error status is being stored within the allocated
    req before being freed and is unable to reason about the relationship
    between ret and s_comm. Just add another check.
    
    Signed-off-by: Nicholas Sielicki <[email protected]>
    aws-nslick committed Oct 5, 2024
    Configuration menu
    Copy the full SHA
    818afbb View commit details
    Browse the repository at this point in the history
  7. fix(test) handle more -fanalyzer warnings

    Similar to this commits parent, fixing the compiler being misapplied in
    CI has exposed another gcc13 -fanalyzer warning. The compiler wants to
    be able to trust that if the rank is not 0, it must be 1 and vice-versa.
    
    Unnecessarily checking for:
    
      if (rank == 0) {} else if (rank == 2) {}
    
    instead of just
    
      if (rank == 0) {} else {}
    
    means that the compiler considers cases where neither branch is taken.
    Assert at init that both world size is two and further that the rank
    assigned to each job is within {0,1}. Then, also make the else-ifs into
    elses to silence the warning.
    
    Signed-off-by: Nicholas Sielicki <[email protected]>
    aws-nslick committed Oct 5, 2024
    Configuration menu
    Copy the full SHA
    8c6480c View commit details
    Browse the repository at this point in the history
  8. feat(build): add a macro for managing cflags

    Add a macro that uses AX_CHECK_COMPILE_FLAG to test for the existence of
    a potential flag, and appends into the picky flags variable if and only
    if the compiler has the flag.
    
    Signed-off-by: Nicholas Sielicki <[email protected]>
    aws-nslick committed Oct 5, 2024
    Configuration menu
    Copy the full SHA
    98840e1 View commit details
    Browse the repository at this point in the history
  9. feat(build): enable -Wjump-misses-init

    GCC manual says: -Wjump-misses-init is included in -Wc++-compat. It can
    be disabled with the -Wno-jump-misses-init option.
    
    This is seemingly only true in newer GCC, or is just bugged outright,
    because we had -Wc++-compat enabled and did not see these warnings.
    
    Add a macro for appending "picky" compiler flags and use it for existing
    options, then ensure that -Wjump-misses-init is passed.
    
    Signed-off-by: Nicholas Sielicki <[email protected]>
    aws-nslick committed Oct 5, 2024
    Configuration menu
    Copy the full SHA
    13c025d View commit details
    Browse the repository at this point in the history
  10. feat(build): enable -Wshadow, remove all shadowing

    a global `provider_filter` was made redundant with 159bfed and this is
    now passed to every user directly, but the provider_filter string that
    now exists in nccl_net_ofi_create_plugin was shadowing the global one
    that was exported everywhere in nccl_ofi.h. Delete the global one.
    
    rdma had two cases of declaring a new variable, shadowing the old
    variable, pointing to the same thing. Delete the second useless
    declaration. sendrecv had a case where it was shadowing a "req" variable
    in an inner scope, but pointing to a different req. Just rename the
    variable in the inner scope to be more clear.
    
    mr.c needed to remove a useless include, which removes the shadow on
    system_page_size. This unfortunately exposed an issue in the header for
    mr.h; config.h was included late, it was transiently relying on an
    inclusion for the definition of offsetof and, because we do not build
    with C11, stdbool. The conditional inclusion of fi_domain.h was not
    correct in all cases. So add <stddef.h> there, always include
    fi_domain.h, and fix the config.h misordering.
    
    Rename some things in tests to avoid shadowing.
    
    With these fixed, enable -Wshadow whenever -Werror is enabled.
    
    Signed-off-by: Nicholas Sielicki <[email protected]>
    aws-nslick committed Oct 5, 2024
    Configuration menu
    Copy the full SHA
    e2007aa View commit details
    Browse the repository at this point in the history
  11. feat(build): enable -Wformat=2

    Enable -Wformat=2, which is stricter than Wall's -Wformat:
    > -Wformat=2
    >> Enable -Wformat plus additional format checks. Currently equivalent to
    >> -Wformat -Wformat-nonliteral -Wformat-security -Wformat-y2k.
    
    Fix a warning that this exposed: this literal in topo.c can be made
    completely const and should be.
    
    Signed-off-by: Nicholas Sielicki <[email protected]>
    aws-nslick committed Oct 5, 2024
    Configuration menu
    Copy the full SHA
    1ec4605 View commit details
    Browse the repository at this point in the history
  12. feat(build): enable -Wimplicit-fallthrough

    Add missing case statements and enable -Wimplicit-fallthrough.
    
    Signed-off-by: Nicholas Sielicki <[email protected]>
    aws-nslick committed Oct 5, 2024
    Configuration menu
    Copy the full SHA
    55da767 View commit details
    Browse the repository at this point in the history
  13. feat(build): enable -Wcast-align

    Without this, gcc complains:
    > cast from 'char *' to 'struct nccl_ofi_freelist_elem_t *' increases
    > required alignment from 1 to 8
    
    Fix problematic areas in container_of and freelist, then enable
    -Wcast-align whenever -Werror is enabled.
    
    Related-commit: 4e64913
    Signed-off-by: Nicholas Sielicki <[email protected]>
    aws-nslick committed Oct 5, 2024
    Configuration menu
    Copy the full SHA
    e86adb8 View commit details
    Browse the repository at this point in the history
  14. feat(build): enable -Wmissing-declarations

    Many TU-local functions were missing a `static` keyword. Add it.
    
    Signed-off-by: Nicholas Sielicki <[email protected]>
    aws-nslick committed Oct 5, 2024
    Configuration menu
    Copy the full SHA
    186b679 View commit details
    Browse the repository at this point in the history
  15. feat(build): enable -Wnull-dereference

    We get this for free, likely fixed in aws#627 or with some of the
    -fanalyzer fixes. Enable -Wnull-dereference whenever -Werror is enabled.
    
    Signed-off-by: Nicholas Sielicki <[email protected]>
    aws-nslick committed Oct 5, 2024
    Configuration menu
    Copy the full SHA
    6301635 View commit details
    Browse the repository at this point in the history
  16. feat(build): use -isystem instead of -I

    When constructing include directories, use -isystem instead of -I. This
    accomplishes largely the same thing, but informs the compiler that we
    don't control these headers, which is useful for filtering warnings.
    
    Signed-off-by: Nicholas Sielicki <[email protected]>
    aws-nslick committed Oct 5, 2024
    Configuration menu
    Copy the full SHA
    df17f32 View commit details
    Browse the repository at this point in the history
  17. fix(freelist): add missing stdbool.h include

    C99 does not define bool as a keyword (unlike C11 and beyond), which
    means that files are free to use `bool` as something other than a type
    name. When <stdbool.h> is not included through some other mechanism,
    -Wc++-compat treats it as such and complains that it's a reserved
    keyword in C++.
    
    freelist.h uses bool, but does not bring in stdbool.h; add <stdbool.h>
    aws-nslick committed Oct 5, 2024
    Configuration menu
    Copy the full SHA
    3ac7f56 View commit details
    Browse the repository at this point in the history
  18. feat(build): handle 3p headers better

    Don't put 3p header-only dependencies (meaning, nccl's ext-net, nccl's
    ext-tuner, uthash) under the same directory as our headers, put them
    under a separate top-level 3rd-party/ directory. Ensure that a license
    file is present in each directory.
    
    On the build side, use AC_SUBST to make $device_interface accessible
    outside of the top-level configure.ac, then use this to generate any
    accelerator-specific include paths. It is now not possible to include
    the wrong copy of the nccl interface files, because when compiling under
    that accelerator it is simply not in the include path.
    
    Move the tuner to its own Makefile.am for the src/tuner subdirectory;
    don't try to accomplish this in the single src/Makefile.am.
    
    When moving over the includes to the new expected format, prefer
    <header.h> to "header.h" so that it's clear that these are system
    headers, which also has the benefit of telling the compiler that
    warnings in those files are different than any warnings in our actual
    source files.
    
    Signed-off-by: Nicholas Sielicki <[email protected]>
    aws-nslick committed Oct 5, 2024
    Configuration menu
    Copy the full SHA
    70418f4 View commit details
    Browse the repository at this point in the history
  19. fix(cuda): delete broken using directives

    This was missed in review. It doesn't work this way. In C++20, the
    following is legal:
    
    > enum foo { bar, baz };
    > using enum foo;
    
    But it does not support the glob syntax done here, and this is also
    missing the `enum` specifier required, and it also doesn't work with
    C++17. It's unnecessary in this file, anyway, we can just use the bare
    names here. Delete it so that this file can be parsed with `-x c++`
    without causing build failures.
    aws-nslick committed Oct 5, 2024
    Configuration menu
    Copy the full SHA
    dc427a3 View commit details
    Browse the repository at this point in the history
  20. fix(mr_ckey): add C++ compatibility with GCC7

    g++7 fails to initialize C-style tagged unions with the following error
    message:
    
    > sorry, unimplemented: non-trivial designated initializers not
    > supported
    
    Note that gcc7 handles this fine when the file is parsed as C, as well
    as g++8 and beyond, and the problem only exists when operating in C++
    mode under g++7. AL2 is still using this ancient toolchain, so add a
    hack in the case that it is used. This is pretty horrible, and should be
    cleaned up sooner rather than later.
    aws-nslick committed Oct 5, 2024
    Configuration menu
    Copy the full SHA
    36e6948 View commit details
    Browse the repository at this point in the history
  21. feat(rdma): constrain C linkage to init

    On builds where lttng is enabled, the inclusion of nccl_ofi_tracepoint.h
    pulls in <lttng/ust-utils.h>, which defines templates when __cplusplus
    is defined. Those templates cannot have C linkage.
    
    This file does not really need nccl_ofi_tracepoint.h; it only needs
    config.h to test for whether NVTX tracing is enabled and nvtx3 headers
    for conditionally defining the nvtx domain handles.
    
    It's also simultaneously true that the large majority of these headers
    are consumed only by rdma.c, so another option is to just move the guard
    down to where it's actually required.
    
    Signed-off-by: Nicholas Sielicki <[email protected]>
    aws-nslick committed Oct 5, 2024
    Configuration menu
    Copy the full SHA
    db13eef View commit details
    Browse the repository at this point in the history
  22. feat(test): parse as c++ source

    All included headers are fully safe to be parsed as C++, so these unit
    tests can now use C++. Rename them to `.cc' files and use a few c++
    keywords to enforce that the compiler is actually treating them as such.
    Actual C++ cleanups to follow later, hopefully alongside a unit test
    framework like gtest or catch2.
    
    Signed-off-by: Nicholas Sielicki <[email protected]>
    aws-nslick committed Oct 5, 2024
    Configuration menu
    Copy the full SHA
    4ac7d2b View commit details
    Browse the repository at this point in the history