
[BUG] Notebook tests failing on latest 24.10 nightlies #712

Closed
jameslamb opened this issue Sep 30, 2024 · 22 comments
Assignees: jameslamb
Labels: ? - Needs Triage (need team to review and classify), bug (something isn't working)

Comments

@jameslamb (Member)

Describe the bug

Several notebook jobs are failing on the 24.10 nightlies:

Error during notebook tests!
Errors during cugraph/algorithms/community/Community-Clustering.ipynb
Errors during cugraph/algorithms/community/Spectral-Clustering.ipynb
Errors during cuml/arima_demo.ipynb
Errors during cuml/forest_inference_demo.ipynb
Errors during cuml/kmeans_demo.ipynb
Errors during cuml/linear_regression_demo.ipynb
Errors during cuml/nearest_neighbors_demo.ipynb
Errors during cuml/random_forest_demo.ipynb
Errors during cuspatial/trajectory_clustering.ipynb

(build link)

The logs don't contain much other detail.

Steps/Code to reproduce bug

Just run the build CI job against branch-24.10 at https://github.com/rapidsai/docker/actions/runs/11103412516.

Expected behavior

N/A

Environment details (please complete the following information):

N/A

Additional context

N/A

@jameslamb jameslamb added ? - Needs Triage Need team to review and classify bug Something isn't working labels Sep 30, 2024
@jameslamb jameslamb self-assigned this Sep 30, 2024
@jameslamb (Member, Author) commented Sep 30, 2024

Tried with one of the failing cuml notebooks. Ran the following interactively on an x86_64 machine with CUDA 12.2 and 8 V100s.

docker run \
    --rm \
    --gpus "0,1" \
    -p 1234:8888 \
    -it rapidsai/notebooks:24.10a-cuda11.8-py3.10-amd64

Opened cuml/arima_demo.ipynb in JupyterLab and ran it interactively. The kernel died the first time I even tried to initialize a cuml.tsa.arima.ARIMA object. I was able to reproduce the crash reliably with this more minimal script derived from the notebook.

import cudf
from cuml.tsa.arima import ARIMA

import numpy as np
import pandas as pd

def load_dataset(name, max_batch=4):
    import os
    pdf = pd.read_csv(os.path.join("data", "time_series", "%s.csv" % name))
    return cudf.from_pandas(pdf[pdf.columns[1:max_batch+1]].astype(np.float64))

df_mig = load_dataset("net_migrations_auckland_by_age", 4)

model_mig = ARIMA(df_mig, order=(0,0,2), fit_intercept=True)
# Kernel restarting: The kernel for cuml/arima_demo.ipynb appears to have died. It will restart automatically.

I noticed we were getting older versions of fmt and spdlog in the environment.

conda env export
  - fmt=10.2.1=h00ab1b0_0
  - spdlog=1.12.0=hd2e6256_2

That makes me think there's something wrong with the environment solve building this image, and that maybe these failures are a result of mismatched nightlies. Will keep investigating.
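To spot this kind of mismatch quickly, a saved conda env export can be grepped for the implicated packages. A small sketch; the heredoc below is a stand-in sample copied from the image (in practice you would pipe conda env export directly):

```shell
# Grep a saved `conda env export` dump for the packages implicated in the
# pin mismatch. The heredoc is a stand-in for the real export from the image.
cat > /tmp/env-export.txt <<'EOF'
  - fmt=10.2.1=h00ab1b0_0
  - spdlog=1.12.0=hd2e6256_2
  - libraft=24.10.00a37=cuda11_240923_gf49567e1_37
EOF

grep -E 'fmt=|spdlog=|raft' /tmp/env-export.txt
```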

@jameslamb (Member, Author)

Running that same script in the same image, but under gdb:

conda install --yes -c conda-forge gdb
gdb --args python test.py
# (gdb) run
# (gdb) bt

Here's what I saw in the trace:

#0  0x00007fd274f69437 in ML::detect_missing(raft::handle_t&, double const*, int) () from /opt/conda/lib/python3.10/site-packages/cuml/internals/../../../../libcuml++.so
#1  0x00007fd1e3ce5339 in ?? () from /opt/conda/lib/python3.10/site-packages/cuml/tsa/arima.cpython-310-x86_64-linux-gnu.so
#2  0x000055f86693f908 in do_call_core (kwdict={}, 
    callargs=(<ARIMA(handle=<pylibraft.common.handle.Handle at remote 0x7fd2b3471080>, verbose=4, output_type='input', output_mem_type=<MemoryType(_value_=1, _name_='device', __objclass__=<EnumMeta(_generate_next_value_=<function at remote 0x7fd5a38a9b40>, __module__='cuml.internals.mem_type', from_str=<classmethod at remote 0x7fd2a8e3e440>, xpy=<property at remote 0x7fd2a8e4f330>, xdf=<property at remote 0x7fd2a8e4f380>, xsparse=<property at remote 0x7fd2a8e4f3d0>, is_device_accessible=<property at remote 0x7fd2a8e4f420>, is_host_accessible=<property at remote 0x7fd2a8e4f470>, __doc__='An enumeration.', _member_names_=['device', 'host', 'managed', 'mirror'], _member_map_={'device': <...>, 'host': <MemoryType(_value_=2, _name_='host', __objclass__=<...>) at remote 0x7fd2a8e3e620>, 'managed': <MemoryType(_value_=3, _name_='managed', __objclass__=<...>) at remote 0x7fd2a8e3e680>, 'mirror': <MemoryType(_value_=4, _name_='mirror', __objclass__=<...>) at remote 0x7fd2a8e3e6e0>}, _member_type_=<type at remote 0x55f866b86920>, _...(truncated), 
    func=<_cython_3_0_11.cython_function_or_method at remote 0x7fd1e7770380>, trace_info=0x7ffcb14a5dd0, tstate=<optimized out>)
    at /usr/local/src/conda/python-3.10.15/Python/ceval.c:5945
Full trace (expanded below):
#0  0x00007fd274f69437 in ML::detect_missing(raft::handle_t&, double const*, int) () from /opt/conda/lib/python3.10/site-packages/cuml/internals/../../../../libcuml++.so
#1  0x00007fd1e3ce5339 in ?? () from /opt/conda/lib/python3.10/site-packages/cuml/tsa/arima.cpython-310-x86_64-linux-gnu.so
#2  0x000055f86693f908 in do_call_core (kwdict={}, 
    callargs=(<ARIMA(handle=<pylibraft.common.handle.Handle at remote 0x7fd2b3471080>, verbose=4, output_type='input', output_mem_type=<MemoryType(_value_=1, _name_='device', __objclass__=<EnumMeta(_generate_next_value_=<function at remote 0x7fd5a38a9b40>, __module__='cuml.internals.mem_type', from_str=<classmethod at remote 0x7fd2a8e3e440>, xpy=<property at remote 0x7fd2a8e4f330>, xdf=<property at remote 0x7fd2a8e4f380>, xsparse=<property at remote 0x7fd2a8e4f3d0>, is_device_accessible=<property at remote 0x7fd2a8e4f420>, is_host_accessible=<property at remote 0x7fd2a8e4f470>, __doc__='An enumeration.', _member_names_=['device', 'host', 'managed', 'mirror'], _member_map_={'device': <...>, 'host': <MemoryType(_value_=2, _name_='host', __objclass__=<...>) at remote 0x7fd2a8e3e620>, 'managed': <MemoryType(_value_=3, _name_='managed', __objclass__=<...>) at remote 0x7fd2a8e3e680>, 'mirror': <MemoryType(_value_=4, _name_='mirror', __objclass__=<...>) at remote 0x7fd2a8e3e6e0>}, _member_type_=<type at remote 0x55f866b86920>, _...(truncated), 
    func=<_cython_3_0_11.cython_function_or_method at remote 0x7fd1e7770380>, trace_info=0x7ffcb14a5dd0, tstate=<optimized out>)
    at /usr/local/src/conda/python-3.10.15/Python/ceval.c:5945
#3  _PyEval_EvalFrameDefault (tstate=<optimized out>, 
    f=Frame 0x55f86c6b1e60, for file /opt/conda/lib/python3.10/site-packages/cuml/internals/api_decorators.py, line 188, in wrapper (args=(<ARIMA(handle=<pylibraft.common.handle.Handle at remote 0x7fd2b3471080>, verbose=4, output_type='input', output_mem_type=<MemoryType(_value_=1, _name_='device', __objclass__=<EnumMeta(_generate_next_value_=<function at remote 0x7fd5a38a9b40>, __module__='cuml.internals.mem_type', from_str=<classmethod at remote 0x7fd2a8e3e440>, xpy=<property at remote 0x7fd2a8e4f330>, xdf=<property at remote 0x7fd2a8e4f380>, xsparse=<property at remote 0x7fd2a8e4f3d0>, is_device_accessible=<property at remote 0x7fd2a8e4f420>, is_host_accessible=<property at remote 0x7fd2a8e4f470>, __doc__='An enumeration.', _member_names_=['device', 'host', 'managed', 'mirror'], _member_map_={'device': <...>, 'host': <MemoryType(_value_=2, _name_='host', __objclass__=<...>) at remote 0x7fd2a8e3e620>, 'managed': <MemoryType(_value_=3, _name_='managed', __objclass__=<...>) at remote 0x7fd2a8e3e680>, 'mirror': <Mem...(truncated), throwflag=<optimized out>)
    at /usr/local/src/conda/python-3.10.15/Python/ceval.c:4277
#4  0x000055f86694cd1c in _PyEval_EvalFrame (throwflag=0, 
    f=Frame 0x55f86c6b1e60, for file /opt/conda/lib/python3.10/site-packages/cuml/internals/api_decorators.py, line 188, in wrapper (args=(<ARIMA(handle=<pylibraft.common.handle.Handle at remote 0x7fd2b3471080>, verbose=4, output_type='input', output_mem_type=<MemoryType(_value_=1, _name_='device', __objclass__=<EnumMeta(_generate_next_value_=<function at remote 0x7fd5a38a9b40>, __module__='cuml.internals.mem_type', from_str=<classmethod at remote 0x7fd2a8e3e440>, xpy=<property at remote 0x7fd2a8e4f330>, xdf=<property at remote 0x7fd2a8e4f380>, xsparse=<property at remote 0x7fd2a8e4f3d0>, is_device_accessible=<property at remote 0x7fd2a8e4f420>, is_host_accessible=<property at remote 0x7fd2a8e4f470>, __doc__='An enumeration.', _member_names_=['device', 'host', 'managed', 'mirror'], _member_map_={'device': <...>, 'host': <MemoryType(_value_=2, _name_='host', __objclass__=<...>) at remote 0x7fd2a8e3e620>, 'managed': <MemoryType(_value_=3, _name_='managed', __objclass__=<...>) at remote 0x7fd2a8e3e680>, 'mirror': <Mem...(truncated), tstate=0x55f867f83100)
    at /usr/local/src/conda/python-3.10.15/Include/internal/pycore_ceval.h:46
#5  _PyEval_Vector (kwnames=<optimized out>, argcount=<optimized out>, args=<optimized out>, locals=0x0, con=0x7fd1e774e2a0, tstate=0x55f867f83100)
    at /usr/local/src/conda/python-3.10.15/Python/ceval.c:5067
#6  _PyFunction_Vectorcall (func=<function at remote 0x7fd1e774e290>, stack=<optimized out>, nargsf=<optimized out>, kwnames=<optimized out>)
    at /usr/local/src/conda/python-3.10.15/Objects/call.c:342
#7  0x00007fd1e3d177fa in ?? () from /opt/conda/lib/python3.10/site-packages/cuml/tsa/arima.cpython-310-x86_64-linux-gnu.so
#8  0x000055f866958afc in PyVectorcall_Call (kwargs=<optimized out>, tuple=<optimized out>, callable=<_cython_3_0_11.cython_function_or_method at remote 0x7fd1e77702b0>)
    at /usr/local/src/conda/python-3.10.15/Objects/call.c:267
#9  _PyObject_Call (kwargs=<optimized out>, args=<optimized out>, callable=<_cython_3_0_11.cython_function_or_method at remote 0x7fd1e77702b0>, tstate=<optimized out>)
    at /usr/local/src/conda/python-3.10.15/Objects/call.c:290
#10 PyObject_Call (callable=<_cython_3_0_11.cython_function_or_method at remote 0x7fd1e77702b0>, args=<optimized out>, kwargs=<optimized out>)
    at /usr/local/src/conda/python-3.10.15/Objects/call.c:317
#11 0x000055f86693f908 in do_call_core (
    kwdict={'order': (0, 0, 2), 'fit_intercept': True, 'self': <ARIMA(handle=<pylibraft.common.handle.Handle at remote 0x7fd2b3471080>, verbose=4, output_type='input', output_mem_type=<MemoryType(_value_=1, _name_='device', __objclass__=<EnumMeta(_generate_next_value_=<function at remote 0x7fd5a38a9b40>, __module__='cuml.internals.mem_type', from_str=<classmethod at remote 0x7fd2a8e3e440>, xpy=<property at remote 0x7fd2a8e4f330>, xdf=<property at remote 0x7fd2a8e4f380>, xsparse=<property at remote 0x7fd2a8e4f3d0>, is_device_accessible=<property at remote 0x7fd2a8e4f420>, is_host_accessible=<property at remote 0x7fd2a8e4f470>, __doc__='An enumeration.', _member_names_=['device', 'host', 'managed', 'mirror'], _member_map_={'device': <...>, 'host': <MemoryType(_value_=2, _name_='host', __objclass__=<...>) at remote 0x7fd2a8e3e620>, 'managed': <MemoryType(_value_=3, _name_='managed', __objclass__=<...>) at remote 0x7fd2a8e3e680>, 'mirror': <MemoryType(_value_=4, _name_='mirror', __objclass__=<...>) at remote 0x7fd2a8e3e6e0>...(truncated), callargs=(), 
    func=<_cython_3_0_11.cython_function_or_method at remote 0x7fd1e77702b0>, trace_info=0x7ffcb14a6420, tstate=<optimized out>)
    at /usr/local/src/conda/python-3.10.15/Python/ceval.c:5945
#12 _PyEval_EvalFrameDefault (tstate=<optimized out>, 
    f=Frame 0x7fd2a086a700, for file /opt/conda/lib/python3.10/site-packages/cuml/internals/api_decorators.py, line 344, in inner_f (args=(<ARIMA(handle=<pylibraft.common.handle.Handle at remote 0x7fd2b3471080>, verbose=4, output_type='input', output_mem_type=<MemoryType(_value_=1, _name_='device', __objclass__=<EnumMeta(_generate_next_value_=<function at remote 0x7fd5a38a9b40>, __module__='cuml.internals.mem_type', from_str=<classmethod at remote 0x7fd2a8e3e440>, xpy=<property at remote 0x7fd2a8e4f330>, xdf=<property at remote 0x7fd2a8e4f380>, xsparse=<property at remote 0x7fd2a8e4f3d0>, is_device_accessible=<property at remote 0x7fd2a8e4f420>, is_host_accessible=<property at remote 0x7fd2a8e4f470>, __doc__='An enumeration.', _member_names_=['device', 'host', 'managed', 'mirror'], _member_map_={'device': <...>, 'host': <MemoryType(_value_=2, _name_='host', __objclass__=<...>) at remote 0x7fd2a8e3e620>, 'managed': <MemoryType(_value_=3, _name_='managed', __objclass__=<...>) at remote 0x7fd2a8e3e680>, 'mirror': <Mem...(truncated), throwflag=<optimized out>)
    at /usr/local/src/conda/python-3.10.15/Python/ceval.c:4277
#13 0x000055f86694cd1c in _PyEval_EvalFrame (throwflag=0, 
    f=Frame 0x7fd2a086a700, for file /opt/conda/lib/python3.10/site-packages/cuml/internals/api_decorators.py, line 344, in inner_f (args=(<ARIMA(handle=<pylibraft.common.handle.Handle at remote 0x7fd2b3471080>, verbose=4, output_type='input', output_mem_type=<MemoryType(_value_=1, _name_='device', __objclass__=<EnumMeta(_generate_next_value_=<function at remote 0x7fd5a38a9b40>, __module__='cuml.internals.mem_type', from_str=<classmethod at remote 0x7fd2a8e3e440>, xpy=<property at remote 0x7fd2a8e4f330>, xdf=<property at remote 0x7fd2a8e4f380>, xsparse=<property at remote 0x7fd2a8e4f3d0>, is_device_accessible=<property at remote 0x7fd2a8e4f420>, is_host_accessible=<property at remote 0x7fd2a8e4f470>, __doc__='An enumeration.', _member_names_=['device', 'host', 'managed', 'mirror'], _member_map_={'device': <...>, 'host': <MemoryType(_value_=2, _name_='host', __objclass__=<...>) at remote 0x7fd2a8e3e620>, 'managed': <MemoryType(_value_=3, _name_='managed', __objclass__=<...>) at remote 0x7fd2a8e3e680>, 'mirror': <Mem...(truncated), tstate=0x55f867f83100)
    at /usr/local/src/conda/python-3.10.15/Include/internal/pycore_ceval.h:46
#14 _PyEval_Vector (kwnames=<optimized out>, argcount=<optimized out>, args=<optimized out>, locals=0x0, con=0x7fd1e774df40, tstate=0x55f867f83100)
    at /usr/local/src/conda/python-3.10.15/Python/ceval.c:5067
#15 _PyFunction_Vectorcall (func=<function at remote 0x7fd1e774df30>, stack=<optimized out>, nargsf=<optimized out>, kwnames=<optimized out>)
    at /usr/local/src/conda/python-3.10.15/Objects/call.c:342
#16 0x000055f866945207 in _PyObject_FastCallDictTstate (tstate=0x55f867f83100, callable=<function at remote 0x7fd1e774df30>, args=<optimized out>, nargsf=<optimized out>, 
    kwargs=<optimized out>) at /usr/local/src/conda/python-3.10.15/Objects/call.c:153
#17 0x000055f866955c89 in _PyObject_Call_Prepend (kwargs={'order': (0, 0, 2), 'fit_intercept': True}, 
    args=(<DataFrame(_data=<ColumnAccessor(_data={'5-9 years': <NumericalColumn(nan_count=<numpy.int64 at remote 0x7fd1e7701f70>) at remote 0x7fd1e774e560>, '10-14 years': <NumericalColumn(nan_count=<numpy.int64 at remote 0x7fd1e7701eb0>) at remote 0x7fd1e774e8c0>, '15-19 years': <NumericalColumn(nan_count=<numpy.int64 at remote 0x7fd1e7702110>) at remote 0x7fd1e774ea70>, '30-34 years': <NumericalColumn(nan_count=<numpy.int64 at remote 0x7fd1e7702170>) at remote 0x7fd1e774ec20>}, rangeindex=False, multiindex=False, label_dtype=<numpy.dtypes.ObjectDType at remote 0x7fd5a36a93e0>, _level_names=(None,), _grouped_data={...}, names=('5-9 years', '10-14 years', '15-19 years', '30-34 years'), columns=(<...>, <...>, <...>, <...>)) at remote 0x7fd1e777ac20>, _index=<RangeIndex(_name=None, _range=<range at remote 0x7fd1e777aa30>) at remote 0x7fd1e7779e40>) at remote 0x7fd1e777a830>,), 
    obj=<optimized out>, callable=<function at remote 0x7fd1e774df30>, tstate=0x55f867f83100) at /usr/local/src/conda/python-3.10.15/Objects/call.c:431
#18 slot_tp_init (self=<optimized out>, args=<optimized out>, kwds=<optimized out>) at /usr/local/src/conda/python-3.10.15/Objects/typeobject.c:7734
#19 0x000055f866945cdb in type_call (kwds={'order': (0, 0, 2), 'fit_intercept': True}, 
    args=(<DataFrame(_data=<ColumnAccessor(_data={'5-9 years': <NumericalColumn(nan_count=<numpy.int64 at remote 0x7fd1e7701f70>) at remote 0x7fd1e774e560>, '10-14 years': <NumericalColumn(nan_count=<numpy.int64 at remote 0x7fd1e7701eb0>) at remote 0x7fd1e774e8c0>, '15-19 years': <NumericalColumn(nan_count=<numpy.int64 at remote 0x7fd1e7702110>) at remote 0x7fd1e774ea70>, '30-34 years': <NumericalColumn(nan_count=<numpy.int64 at remote 0x7fd1e7702170>) at remote 0x7fd1e774ec20>}, rangeindex=False, multiindex=False, label_dtype=<numpy.dtypes.ObjectDType at remote 0x7fd5a36a93e0>, _level_names=(None,), _grouped_data={...}, names=('5-9 years', '10-14 years', '15-19 years', '30-34 years'), columns=(<...>, <...>, <...>, <...>)) at remote 0x7fd1e777ac20>, _index=<RangeIndex(_name=None, _range=<range at remote 0x7fd1e777aa30>) at remote 0x7fd1e7779e40>) at remote 0x7fd1e777a830>,), 
    type=<optimized out>) at /usr/local/src/conda/python-3.10.15/Objects/typeobject.c:1135
#20 _PyObject_MakeTpCall (tstate=0x55f867f83100, 
    callable=callable@entry=<BaseMetaClass(__module__='cuml.tsa.arima', __doc__='\n    Implements a batched ARIMA model for in- and out-of-sample\n    time-series prediction, with support for seasonality (SARIMA)\n\n    ARIMA stands for Auto-Regressive Integrated Moving Average.\n    See https://en.wikipedia.org/wiki/Autoregressive_integrated_moving_average\n\n    This class can fit an ARIMA(p,d,q) or ARIMA(p,d,q)(P,D,Q)_s model to a\n    batch of time series of the same length (or various lengths, using missing\n    values at the start for padding).\n    The implementation is designed to give the best performance when using\n    large batches of time series.\n\n    Parameters\n    ----------\n    endog : dataframe or array-like (device or host)\n        Endogenous variable, assumed to have each time series in columns.\n        Acceptable formats: cuDF DataFrame, cuDF Series, NumPy ndarray,\n        Numba device ndarray, cuda array interface compliant array like CuPy.\n        Missing values are accepted, represented by NaN.\n    order ...(truncated), 
    args=<optimized out>, nargs=<optimized out>, keywords=keywords@entry=('order', 'fit_intercept')) at /usr/local/src/conda/python-3.10.15/Objects/call.c:215
#21 0x000055f8669423dd in _PyObject_VectorcallTstate (kwnames=('order', 'fit_intercept'), nargsf=<optimized out>, args=<optimized out>, 
    callable=<BaseMetaClass(__module__='cuml.tsa.arima', __doc__='\n    Implements a batched ARIMA model for in- and out-of-sample\n    time-series prediction, with support for seasonality (SARIMA)\n\n    ARIMA stands for Auto-Regressive Integrated Moving Average.\n    See https://en.wikipedia.org/wiki/Autoregressive_integrated_moving_average\n\n    This class can fit an ARIMA(p,d,q) or ARIMA(p,d,q)(P,D,Q)_s model to a\n    batch of time series of the same length (or various lengths, using missing\n    values at the start for padding).\n
    The implementation is designed to give the best performance when using\n    large batches of time series.\n\n    Parameters\n    ----------\n    endog : dataframe or array-like (device or host)\n        Endogenous variable, assumed to have each time series in columns.\n        Acceptable formats: cuDF DataFrame, cuDF Series, NumPy ndarray,\n        Numba device ndarray, cuda array interface compliant array like CuPy.\n        Missing values are accepted, represented by NaN.\n    order ...(truncated), tstate=<optimized out>)
    at /usr/local/src/conda/python-3.10.15/Include/cpython/abstract.h:112
#22 _PyObject_VectorcallTstate (kwnames=('order', 'fit_intercept'), nargsf=<optimized out>, args=<optimized out>, 
    callable=<BaseMetaClass(__module__='cuml.tsa.arima', __doc__='\n    Implements a batched ARIMA model for in- and out-of-sample\n    time-series prediction, with support for seasonality (SARIMA)\n\n    ARIMA stands for Auto-Regressive Integrated Moving Average.\n    See https://en.wikipedia.org/wiki/Autoregressive_integrated_moving_average\n\n    This class can fit an ARIMA(p,d,q) or ARIMA(p,d,q)(P,D,Q)_s model to a\n    batch of time series of the same length (or various lengths, using missing\n    values at the start for padding).\n    The implementation is designed to give the best performance when using\n    large batches of time series.\n\n    Parameters\n    ----------\n    endog : dataframe or array-like (device or host)\n        Endogenous variable, assumed to have each time series in columns.\n        Acceptable formats: cuDF DataFrame, cuDF Series, NumPy ndarray,\n        Numba device ndarray, cuda array interface compliant array like CuPy.\n        Missing values are accepted, represented by NaN.\n    order ...(truncated), tstate=<optimized out>)
    at /usr/local/src/conda/python-3.10.15/Include/cpython/abstract.h:99
#23 PyObject_Vectorcall (kwnames=('order', 'fit_intercept'), nargsf=<optimized out>, args=<optimized out>, 
    callable=<BaseMetaClass(__module__='cuml.tsa.arima', __doc__='\n    Implements a batched ARIMA model for in- and out-of-sample\n    time-series prediction, with support for seasonality (SARIMA)\n\n    ARIMA stands for Auto-Regressive Integrated Moving Average.\n    See https://en.wikipedia.org/wiki/Autoregressive_integrated_moving_average\n\n    This class can fit an ARIMA(p,d,q) or ARIMA(p,d,q)(P,D,Q)_s model to a\n    batch of time series of the same length (or various lengths, using missing\n    values at the start for padding).\n    The implementation is designed to give the best performance when using\n    large batches of time series.\n\n    Parameters\n    ----------\n    endog : dataframe or array-like (device or host)\n        Endogenous variable, assumed to have each time series in columns.\n        Acceptable formats: cuDF DataFrame, cuDF Series, NumPy ndarray,\n        Numba device ndarray, cuda array interface compliant array like CuPy.\n        Missing values are accepted, represented by NaN.\n    order ...(truncated))
    at /usr/local/src/conda/python-3.10.15/Include/cpython/abstract.h:123
#24 call_function (kwnames=('order', 'fit_intercept'), oparg=<optimized out>, pp_stack=<synthetic pointer>, trace_info=0x7ffcb14a6740, tstate=<optimized out>)
    at /usr/local/src/conda/python-3.10.15/Python/ceval.c:5893
#25 _PyEval_EvalFrameDefault (tstate=<optimized out>, f=Frame 0x7fd5a3aada40, for file /home/rapids/notebooks/cuml/test.py, line 14, in <module> (), throwflag=<optimized out>)
    at /usr/local/src/conda/python-3.10.15/Python/ceval.c:4231
#26 0x000055f8669ddbac in _PyEval_EvalFrame (throwflag=0, f=Frame 0x7fd5a3aada40, for file /home/rapids/notebooks/cuml/test.py, line 14, in <module> (), tstate=0x55f867f83100)
    at /usr/local/src/conda/python-3.10.15/Include/internal/pycore_ceval.h:46
#27 _PyEval_Vector (tstate=tstate@entry=0x55f867f83100, con=con@entry=0x7ffcb14a6840, 
    locals=locals@entry={'__name__': '__main__', '__doc__': None, '__package__': None, '__loader__': <SourceFileLoader(name='__main__', path='/home/rapids/notebooks/cuml/test.py') at remote 0x7fd5a39a8af0>, '__spec__': None, '__annotations__': {}, '__builtins__': <module at remote 0x7fd5a3b1c900>, '__file__': '/home/rapids/notebooks/cuml/test.py', '__cached__': None, 'cudf': <module at remote 0x7fd5a3909580>, 'ARIMA': <BaseMetaClass(__module__='cuml.tsa.arima', __doc__='\n    Implements a batched ARIMA model for in- and out-of-sample\n    time-series prediction, with support for seasonality (SARIMA)\n\n    ARIMA stands for Auto-Regressive Integrated Moving Average.\n    See https://en.wikipedia.org/wiki/Autoregressive_integrated_moving_average\n\n    This class can fit an ARIMA(p,d,q) or ARIMA(p,d,q)(P,D,Q)_s model to a\n    batch of time series of the same length (or various lengths, using missing\n    values at the start for padding).\n    The implementation is designed to give the best performance when using\n    large batches of...(truncated), args=args@entry=0x0, 
    argcount=argcount@entry=0, kwnames=kwnames@entry=0x0) at /usr/local/src/conda/python-3.10.15/Python/ceval.c:5067
#28 0x000055f8669ddaf7 in PyEval_EvalCode (co=co@entry=<code at remote 0x7fd5a3910be0>, 
    globals=globals@entry={'__name__': '__main__', '__doc__': None, '__package__': None, '__loader__': <SourceFileLoader(name='__main__', path='/home/rapids/notebooks/cuml/test.py') at remote 0x7fd5a39a8af0>, '__spec__': None, '__annotations__': {}, '__builtins__': <module at remote 0x7fd5a3b1c900>, '__file__': '/home/rapids/notebooks/cuml/test.py', '__cached__': None, 'cudf': <module at remote 0x7fd5a3909580>, 'ARIMA': <BaseMetaClass(__module__='cuml.tsa.arima', __doc__='\n    Implements a batched ARIMA model for in- and out-of-sample\n    time-series prediction, with support for seasonality (SARIMA)\n\n    ARIMA stands for Auto-Regressive Integrated Moving Average.\n    See https://en.wikipedia.org/wiki/Autoregressive_integrated_moving_average\n\n    This class can fit an ARIMA(p,d,q) or ARIMA(p,d,q)(P,D,Q)_s model to a\n    batch of time series of the same length (or various lengths, using missing\n    values at the start for padding).\n    The implementation is designed to give the best performance when using\n    large batches of...(truncated), 
    locals=locals@entry={'__name__': '__main__', '__doc__': None, '__package__': None, '__loader__': <SourceFileLoader(name='__main__', path='/home/rapids/notebooks/cuml/test.py') at remote 0x7fd5a39a8af0>, '__spec__': None, '__annotations__': {}, '__builtins__': <module at remote 0x7fd5a3b1c900>, '__file__': '/home/rapids/notebooks/cuml/test.py', '__cached__': None, 'cudf': <module at remote 0x7fd5a3909580>, 'ARIMA': <BaseMetaClass(__module__='cuml.tsa.arima', __doc__='\n    Implements a batched ARIMA model for in- and out-of-sample\n    time-series prediction, with support for seasonality (SARIMA)\n\n    ARIMA stands for Auto-Regressive Integrated Moving Average.\n    See https://en.wikipedia.org/wiki/Autoregressive_integrated_moving_average\n\n    This class can fit an ARIMA(p,d,q) or ARIMA(p,d,q)(P,D,Q)_s model to a\n    batch of time series of the same length (or various lengths, using missing\n    values at the start for padding).\n    The implementation is designed to give the best performance when using\n    large batches of...(truncated))
    at /usr/local/src/conda/python-3.10.15/Python/ceval.c:1134
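The top frame is ML::detect_missing. Judging only by its name and the ARIMA docstring visible in the trace ("Missing values are accepted, represented by NaN"), it presumably scans the input for NaN padding. Conceptually something like the following pure-Python sketch (not the actual CUDA implementation, which operates on a flat device double* buffer):

```python
import math

def detect_missing(batch):
    """Return True for each series in the batch that contains a NaN.

    Conceptual stand-in for ML::detect_missing: flag NaN-padded series
    in a batch. Takes a list of lists instead of a device buffer.
    """
    return [any(math.isnan(x) for x in series) for series in batch]

batch = [
    [1.0, 2.0, 3.0],           # complete series
    [float("nan"), 2.0, 3.0],  # NaN-padded series
]
print(detect_missing(batch))  # [False, True]
```

Since the crash happens in this very first kernel, before any model logic runs, it points at a binary incompatibility in the environment rather than bad input data.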

@jameslamb (Member, Author) commented Sep 30, 2024

It looks to me like the environment has an older set of RAFT packages; that's definitely troubling.

  - libraft=24.10.00a37=cuda11_240923_gf49567e1_37
  - libraft-headers=24.10.00a37=cuda11_240923_gf49567e1_37
  - libraft-headers-only=24.10.00a37=cuda11_240923_gf49567e1_37

The latest nightly for those is 24.10.00a48 (https://anaconda.org/rapidsai-nightly/libraft/files?version=24.10.00a48). So this environment is 11 commits behind.
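Assuming the aNN suffix on RAPIDS nightly versions encodes the build number (which the a37 vs. a48 comparison above suggests), the skew can be computed mechanically. A hypothetical helper:

```python
import re

def nightly_build(version):
    """Extract the nightly build number from a version like '24.10.00a37'."""
    m = re.search(r"a(\d+)$", version)
    return int(m.group(1)) if m else None

# Versions observed in the image vs. the latest published nightly.
installed = "24.10.00a37"
latest = "24.10.00a48"
print(nightly_build(latest) - nightly_build(installed))  # 11
```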

That older version of libraft-headers-only has the older fmt / spdlog pins!

(screenshot: dependency listing for libraft-headers-only 24.10.00a37, showing the older fmt / spdlog pins)

https://anaconda.org/rapidsai-nightly/libraft-headers-only/files?version=24.10.00a37

I'll look into how that pin is getting in there; I think that's a likely root cause for these failures.

@jameslamb (Member, Author)

This definitely looks related to fmt / spdlog (at this point, I should probably link what I'm talking about: rapidsai/build-planning#56).

Trying to install the latest RAFT in the container:

conda install \
    --name base \
    --yes libraft-headers-only=24.10.00a48

Results in this:

Platform: linux-64
Collecting package metadata (repodata.json): done
Solving environment: | warning  libmamba Added empty dependency for problem type SOLVER_RULE_UPDATE
failed

LibMambaUnsatisfiableError: Encountered problems while solving:
  - package libmambapy-1.5.10-py310h86cbe3b_0 requires fmt >=10.2.1,<11.0a0, but none of the providers can be installed

Could not solve for environment specs
The following packages are incompatible
├─ conda >=24.3.0  is installable with the potential options
│  ├─ conda [24.3.0|24.4.0|24.5.0|24.7.1] would require
│  │  └─ conda-libmamba-solver >=23.11.0 , which requires
│  │     └─ libmambapy [>=1.5.3,<2.0.0a0 |>=1.5.6,<2.0a0 ] with the potential options
│  │        ├─ libmambapy [1.5.10|1.5.7|1.5.8|1.5.9] would require
│  │        │  └─ fmt >=10.2.1,<11.0a0 , which can be installed;
│  │        ├─ libmambapy [1.5.10|1.5.3|...|1.5.9] would require
│  │        │  └─ python >=3.11,<3.12.0a0 , which can be installed;
│  │        ├─ libmambapy [1.5.10|1.5.3|...|1.5.9] would require
│  │        │  └─ python >=3.12,<3.13.0a0 , which can be installed;
│  │        ├─ libmambapy [1.5.10|1.5.3|...|1.5.9] would require
│  │        │  └─ python >=3.9,<3.10.0a0 , which can be installed;
│  │        ├─ libmambapy [1.5.3|1.5.4|1.5.5|1.5.6] would require
│  │        │  └─ fmt >=10.1.1,<11.0a0 , which can be installed;
│  │        └─ libmambapy [1.5.3|1.5.4|...|1.5.8] would require
│  │           └─ python >=3.8,<3.9.0a0 , which can be installed;
│  ├─ conda [24.3.0|24.4.0|24.5.0|24.7.1] would require
│  │  └─ python >=3.11,<3.12.0a0 , which can be installed;
│  ├─ conda [24.3.0|24.4.0|24.5.0|24.7.1] would require
│  │  └─ python >=3.12,<3.13.0a0 , which can be installed;
│  ├─ conda [24.3.0|24.4.0|24.5.0|24.7.1] would require
│  │  └─ python >=3.8,<3.9.0a0 , which can be installed;
│  └─ conda [24.3.0|24.4.0|24.5.0|24.7.1] would require
│     └─ python >=3.9,<3.10.0a0 , which can be installed;
├─ libraft-headers-only 24.10.00a48**  is not installable because it requires
│  └─ fmt >=11.0.2,<12 , which conflicts with any installable versions previously reported;
└─ pin-1 is not installable because it requires
   └─ python 3.10.* , which conflicts with any installable versions previously reported.

@jameslamb (Member, Author) commented Oct 1, 2024

Root cause

I think it's just not possible to install packages that depend on fmt>=11 into the base environment.

As of this writing, the latest version of conda is 24.7.1 (conda-forge/conda).

That depends on conda-libmamba-solver >=23.11.0, which depends on libmambapy >=1.5.6,<2.0a0 (conda-forge/conda-libmamba-solver).

The latest 1.x of libmambapy is 1.5.10 (conda-forge/libmambapy), which depends on fmt >=10.2.1,<11.0a0.
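The conflict can be illustrated with a toy range-intersection check (nothing like the real SAT-based solve libmamba performs, but it shows why no fmt build can satisfy both consumers at once):

```python
def ranges_overlap(a, b):
    """Check whether two half-open version ranges [lo, hi) intersect.

    Versions are tuples like (10, 2, 1). A toy model of the one conflict
    the solver reports; real conda constraints are far richer than this.
    """
    lo = max(a[0], b[0])
    hi = min(a[1], b[1])
    return lo < hi

libmambapy_fmt = ((10, 2, 1), (11, 0, 0))  # fmt >=10.2.1,<11.0a0
libraft_fmt = ((11, 0, 2), (12, 0, 0))     # fmt >=11.0.2,<12

print(ranges_overlap(libmambapy_fmt, libraft_fmt))  # False -> unsatisfiable
```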

Why didn't we catch this in CI earlier?

Throughout RAPIDS libraries' CI, we don't install packages into the base environment... they're always installed into the isolated build environment created by conda-build or into new environments created like this:

rapids-dependency-file-generator \
  --output conda \
  --file-key ${FILE_KEY} \
  --matrix "cuda=${RAPIDS_CUDA_VERSION%.*};arch=$(arch);py=${RAPIDS_PY_VERSION};dependencies=${RAPIDS_DEPENDENCIES}" \
    | tee "${ENV_YAML_DIR}/env.yaml"

rapids-mamba-retry env create --yes -f "${ENV_YAML_DIR}/env.yaml" -n test

(cudf code link)

Will this be resolved by upstream changes?

Eventually, ... but probably not in the next few days. And even if it were, this could happen again the next time conda-forge updates its fmt pins.

The spdlog / fmt migrations on conda-forge are not fully done yet: rapidsai/build-planning#56 (comment)

And the first libmambapy to support fmt>=11 is v2.0 (conda-forge/mamba-feedstock#237), which conda can't yet be installed alongside (code link).

So what can we do?

Stop using the base environment in the images produced from this repo, and instead create a new environment.

I'm testing that approach in #713.
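For illustration, the separate-environment approach might look roughly like this in the image build. All names, versions, and channels here are hypothetical stand-ins, not the actual diff in #713:

```dockerfile
# Hypothetical sketch only -- not the actual #713 change.
FROM condaforge/miniforge3

# Install RAPIDS into a dedicated environment instead of base, so base keeps
# the fmt/spdlog versions that conda/libmambapy need, while the rapids env
# is free to pick up fmt>=11.
RUN conda create --yes --name rapids \
        --channel rapidsai-nightly --channel conda-forge \
        python=3.10 cuda-version=11.8 rapids=24.10 \
    && conda clean --all --yes

# Run downstream commands inside the rapids env.
ENTRYPOINT ["conda", "run", "--no-capture-output", "--name", "rapids"]
```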

@raydouglass (Member)

Thanks for the thorough investigation @jameslamb!

I agree that creating a new environment to install RAPIDS will work, but eliminating that was one of the goals/requirements for the overhaul (#539). That said, I am struggling to think of a solution that works for 24.10 in time.

@jameslamb (Member, Author)

eliminating that was one of the goals/requirements for the overhaul (#539)

oy 😫

Thanks for pointing that issue out. Do you recall why it was a requirement? Was it just about reducing the friction introduced by needing to conda activate rapids in all the places you use these images?

I am struggling to think of a solution that works for 24.10 in time.

The only other thing I can think of... is it possible to use micromamba to create an environment named base but which doesn't contain conda? Sort of like @msarahan is pursuing in rapidsai/ci-imgs#190.

Though even if we do that, it'll still be a breaking change from the perspective of anyone who's right now using these rapids/* images as a base and then expecting to be able to run conda install inside them to further modify the environment.

@bdice
Copy link
Contributor

bdice commented Oct 1, 2024

Yes, a separate environment will be needed here. I don't think we can count on base always solving properly with all of RAPIDS; we've seen related issues with fmt and mamba before. Maybe we could try micromamba, but I think micromamba hard-errors when packages clobber one another (rapidsai/cuml#4832). Perhaps that error could be disabled.

@msarahan
Copy link

msarahan commented Oct 1, 2024

Though even if we do that, it'll still be a breaking change from the perspective of anyone who's right now using these rapids/* images as a base and then expecting to be able to run conda install inside them to further modify the environment.

You can alias conda to micromamba, but that's still kind of yuck.

You could also consider stacking environments. https://stackoverflow.com/a/76746419/1170370, https://docs.conda.io/projects/conda/en/latest/user-guide/tasks/manage-environments.html#nested-activation

It's not commonplace and probably has rough spots, but maybe it's good enough as a stopgap.

@raydouglass
Copy link
Member

I think given the time constraints (ie 24.10 release is next week), we should use a separate environment like @jameslamb suggested and is testing in #713.

I think for many users, this change will not impact them since the docker entrypoint will activate the right environment. So the affected users would be those overriding the entrypoint, which some tooling that deploys our containers does. Might need to talk to @rapidsai/deployment / @jacobtomlinson to confirm.

@jacobtomlinson
Copy link
Member

jacobtomlinson commented Oct 1, 2024

Switching to a separate environment that needs to be activated via an entrypoint will break container use on a large number of platforms including AI Workbench, Vertex AI, Kubeflow, Databricks, DGX Cloud Base Command Platform and many more.

The general requirement that these platforms have is that the required dependencies (usually ipython, jupyter, dask or similar) must be available on the PATH/PYTHONPATH without needing to run any additional code like an entrypoint script. This is because they either override the entrypoint with their own platform-specific one, or they just use the container image as a packaging mechanism and don't actually start the container.

Perhaps a solution could be to bake the environment variables that get set by conda activate into the container, so when you start the container without an entrypoint script the environment is already active.

@jameslamb
Copy link
Member Author

I just saw all CI pass on #713: #713 (comment)

Which is at least confirmation that the root cause of the notebook failures is this environment-solve stuff, and not something like "cuml requires code changes".

@KyleFromNVIDIA
Copy link

KyleFromNVIDIA commented Oct 1, 2024

Perhaps a solution could be to bake the environment variables that get set by conda activate into the container

Could we just actually call conda activate in the Dockerfile instead of manually setting environment variables? No, because that would just spawn a new process within docker build that sets the environment variables and then immediately exits, resulting in the loss of said variables.

@raydouglass
Copy link
Member

raydouglass commented Oct 1, 2024

No, because that would just spawn a new process within docker build that sets the environment variables and then immediately exits

But doing a conda activate rapids in a build step would add filesystem changes to that layer, such as symlinks (maybe?) or other files created by the packages' activation scripts. Then maybe doing the environment variable diff before & after activation would cover everything else?

There must be some edge cases in this though.
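The variable-diff part of that idea can be sketched without conda at all. In this illustration, the `export` line is just a stand-in for whatever `conda activate rapids` would actually set; the paths and variable name are invented for the demo:

```shell
# Sketch of the before/after environment-variable diff (no conda required here;
# the `export` stands in for what `conda activate rapids` would actually set).
env | sort > /tmp/env_before.txt
export DEMO_CONDA_PREFIX=/opt/conda/envs/rapids   # stand-in for activation changes
env | sort > /tmp/env_after.txt
# lines present only after "activation" are the candidates to bake in with ENV
comm -13 /tmp/env_before.txt /tmp/env_after.txt
```

In an image build, the lines printed by the final `comm` would be the ones to hard-code as Dockerfile `ENV` instructions.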

@jameslamb
Copy link
Member Author

I was thinking the same thing! A conda activate could leave the filesystem changes in the image, and combining that with setting the environment variables on the container directly might be enough.

There is one other possibility I'm exploring right now... it might be possible to downgrade conda all the way to the first version before it took on a libmambapy dependency. That might allow installing a newer fmt alongside it, letting us keep the base environment for this release (with the hope that by 24.12 we could return to the latest conda, which would hopefully support a newer fmt by then).

@jameslamb
Copy link
Member Author

it might be possible to downgrade conda all the way to the first version before it took on a libmambapy dependency

This did not work for Python 3.12 (solve timed out). I'm going to go back to the conda activate + set environment variables approach.

@KyleFromNVIDIA
Copy link

What if we added conda activate to a .bashrc file?

@jameslamb
Copy link
Member Author

That isn't sufficient, because .bashrc is only sourced by interactive bash shells, and it can't be assumed that the images will only be used that way.

Some of the examples @jacobtomlinson mentioned in #712 (comment) are equivalent to running like:

docker run \
   rapidsai/notebooks \
   jupyter lab --ip 0.0.0.0

Or similar.
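A quick way to see the problem, using a throwaway HOME so nothing real is modified (this is a stand-alone demo; the variable name is invented):

```shell
# Demo: a line appended to ~/.bashrc never runs for non-interactive shells,
# which is how `docker run IMAGE some-command` executes things.
export HOME="$(mktemp -d)"   # throwaway HOME so we don't touch a real .bashrc
echo 'export DEMO_ACTIVATED=yes' >> "$HOME/.bashrc"
# non-interactive bash (like an overridden docker entrypoint) skips ~/.bashrc:
bash -c 'echo "DEMO_ACTIVATED=${DEMO_ACTIVATED:-unset}"'
# prints: DEMO_ACTIVATED=unset
```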

@jameslamb
Copy link
Member Author

jameslamb commented Oct 1, 2024

@msarahan I want to be sure to address your suggestions, so you know I did consider them.

You can alias conda to micromamba, but that's still kind of yuck.

I agree; micromamba doesn't make API compatibility guarantees with mamba / conda, so I'm unsure how big the risk is there.

HOWEVER... if we find that the hacks in #713 are just intolerably bad, using micromamba is the next-best option I can think of.

That would look like:

  • use micromamba to create an environment named base with all the RAPIDS libraries (and pip)
  • modify these entrypoints to use micromamba:

    if [ -e "/home/rapids/environment.yml" ]; then
        echo "environment.yml found. Installing packages."
        timeout ${CONDA_TIMEOUT:-600} mamba env update -n base -f /home/rapids/environment.yml || exit $?
    fi
    if [ "$EXTRA_CONDA_PACKAGES" ]; then
        echo "EXTRA_CONDA_PACKAGES environment variable found. Installing packages."
        timeout ${CONDA_TIMEOUT:-600} mamba install -n base -y $EXTRA_CONDA_PACKAGES || exit $?
    fi
  • document that installing more packages requires either using micromamba install or that EXTRA_CONDA_PACKAGES environment variable pattern
    • (I think this is preferable to aliasing conda to micromamba... a big "not found: conda" would be a clear sign of what you need to do)
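The micromamba variant of that entrypoint logic would look roughly like the following. This is a sketch only: the exact micromamba subcommands are an assumption, and `run()` just echoes the command so the snippet executes anywhere (a real entrypoint would invoke micromamba directly, wrapped in `timeout` as today):

```shell
# Sketch of the entrypoint logic with micromamba swapped in (subcommands assumed).
run() { echo "would run: $*"; }   # stand-in so this sketch runs without micromamba

if [ -e "/home/rapids/environment.yml" ]; then
    echo "environment.yml found. Installing packages."
    run micromamba env update -n base -f /home/rapids/environment.yml
fi
if [ "$EXTRA_CONDA_PACKAGES" ]; then
    echo "EXTRA_CONDA_PACKAGES environment variable found. Installing packages."
    run micromamba install -n base -y $EXTRA_CONDA_PACKAGES
fi
```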

You could also consider stacking environments. https://stackoverflow.com/a/76746419/1170370, https://docs.conda.io/projects/conda/en/latest/user-guide/tasks/manage-environments.html#nested-activation

Reading these docs, it seems like this is only about the PATH bits, and that it still requires running conda activate? If that's right, I think the hack of calling conda activate once at build time (to get the bulk of the filesystem changes) plus hard-coding the environment variable changes from the activation scripts is a preferable, if hacky, fix, because it's less likely to lead to surprises at runtime when people do e.g. conda install -n rapids some-other-library.

@jacobtomlinson
Copy link
Member

Here's a concrete example that might be useful for testing. We know that Vertex AI inspects the available Jupyter kernels of a user provided image. It does this by calling jupyter kernelspec list and it resets the entrypoint.

docker run --rm --entrypoint='' rapidsai/notebooks jupyter kernelspec list --json

The output of this has to be a valid JSON because it will get deserialised by the Vertex AI backend. So the 24.08 release images look like this.

$ docker run --rm --entrypoint='' nvcr.io/nvidia/rapidsai/notebooks:24.08-cuda12.5-py3.11 jupyter kernelspec list --json
{
  "kernelspecs": {
    "python3": {
      "resource_dir": "/opt/conda/share/jupyter/kernels/python3",
      "spec": {
        "argv": [
          "/opt/conda/bin/python",
          "-m",
          "ipykernel_launcher",
          "-f",
          "{connection_file}"
        ],
        "env": {},
        "display_name": "Python 3 (ipykernel)",
        "language": "python",
        "interrupt_mode": "signal",
        "metadata": {
          "debugger": true
        }
      }
    }
  }
}

@jameslamb
Copy link
Member Author

There are now new libmambapy=1.5.* packages supporting the newer versions of fmt and spdlog, thanks to @msarahan 's PR here: conda-forge/mamba-feedstock#253.

And mamba / libmamba / libmambapy 1.x will now automatically be included in future conda-forge migrations, thanks to conda-forge/mamba-feedstock#254.

Thanks to those changes... there is no action required in this repo 🎉

Re-ran a nightly build and saw what I'd hoped for... the latest raft, cuml, cudf, and others getting installed in the base environment, and all the tests passing: https://github.com/rapidsai/docker/actions/runs/11147797532/job/30986558932

Thanks so much for the help everyone!!!
