Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Remote IO: http support #464

Open
wants to merge 71 commits into
base: branch-24.12
Choose a base branch
from
Open

Conversation

madsbk
Copy link
Member

@madsbk madsbk commented Sep 18, 2024

Support read directly from a http server like:

import kvikio
import cupy

with kvikio.RemoteFile.from_http_url("http://127.0.0.1:9000/myfile") as f:
    ary = cupy.empty(f.nbytes, dtype="uint8")
    f.read(ary)

This PR is the first step to support S3 using libcurl instead of aws-s3-sdk, which has some pros and cons:

  • Pros
  • Cons
    • Hard to support the AWS configuration file. We will require the user to either specify the options programmatically or through environment variables like AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY .

@madsbk madsbk added improvement Improves an existing functionality non-breaking Introduces a non-breaking change labels Sep 18, 2024
Copy link
Contributor

@wence- wence- left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

C++, not python yet

cpp/include/kvikio/remote_handle.hpp Outdated Show resolved Hide resolved
cpp/include/kvikio/remote_handle.hpp Outdated Show resolved Hide resolved
cpp/include/kvikio/remote_handle.hpp Outdated Show resolved Hide resolved
cpp/include/kvikio/remote_handle.hpp Outdated Show resolved Hide resolved
cpp/include/kvikio/remote_handle.hpp Outdated Show resolved Hide resolved
cpp/include/kvikio/remote_handle.hpp Show resolved Hide resolved
cpp/include/kvikio/remote_handle.hpp Outdated Show resolved Hide resolved
cpp/include/kvikio/remote_handle.hpp Outdated Show resolved Hide resolved
cpp/include/kvikio/remote_handle.hpp Outdated Show resolved Hide resolved
dependencies.yaml Outdated Show resolved Hide resolved
@madsbk madsbk mentioned this pull request Sep 24, 2024
3 tasks
@madsbk madsbk force-pushed the remote-io branch 2 times, most recently from 701f603 to b886419 Compare September 30, 2024 08:47
@madsbk madsbk requested review from a team as code owners October 1, 2024 13:34
@madsbk madsbk mentioned this pull request Oct 2, 2024
@madsbk madsbk requested a review from wence- October 2, 2024 08:51
Copy link
Contributor

@wence- wence- left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Minor comments, overall looking good to me

const std::size_t nbytes = size * nmemb;
if (ctx->size < ctx->offset + nbytes) {
ctx->overflow_error = true;
return CURL_WRITEFUNC_ERROR;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

note: Nothing can be done, because it's in the curl API, but if nbytes == CURL_WRITEFUNC_ERROR and ctx->size < ctx->offset + nbytes then curl won't notice that we returned an error here, I think.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nbytes == CURL_WRITEFUNC_ERROR cannot happen, CURL_WRITEFUNC_ERROR is defined as 0xFFFFFFFF, which is greater than CURL_MAX_WRITE_SIZE.

cpp/include/kvikio/remote_handle.hpp Outdated Show resolved Hide resolved
cpp/include/kvikio/remote_handle.hpp Outdated Show resolved Hide resolved
cpp/include/kvikio/remote_handle.hpp Outdated Show resolved Hide resolved
cpp/include/kvikio/remote_handle.hpp Outdated Show resolved Hide resolved
python/kvikio/kvikio/_lib/CMakeLists.txt Outdated Show resolved Hide resolved
Comment on lines +19 to +26
cdef extern from "<kvikio/remote_handle.hpp>" nogil:
cdef cppclass cpp_RemoteEndpoint "kvikio::RemoteEndpoint":
pass

cdef cppclass cpp_HttpEndpoint "kvikio::HttpEndpoint":
cpp_HttpEndpoint(string url) except +

cdef cppclass cpp_RemoteHandle "kvikio::RemoteHandle":
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
cdef extern from "<kvikio/remote_handle.hpp>" nogil:
cdef cppclass cpp_RemoteEndpoint "kvikio::RemoteEndpoint":
pass
cdef cppclass cpp_HttpEndpoint "kvikio::HttpEndpoint":
cpp_HttpEndpoint(string url) except +
cdef cppclass cpp_RemoteHandle "kvikio::RemoteHandle":
cdef extern from "<kvikio/remote_handle.hpp>" nogil namespace "kvikio":
cdef cppclass RemoteEndpoint:
pass
cdef cppclass HttpEndpoint:
HttpEndpoint(string url) except +
cdef cppclass RemoteHandle:

?
Since RemoteEndPoint/HttpEndpoint and RemoteHandle are not used by any python cdef class names.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I keep mixing C++ and Python objects when reading Cython, I think always using cpp_ makes it a bit easier?

python/kvikio/kvikio/remote_file.py Outdated Show resolved Hide resolved
Comment on lines +69 to +72
self.process = multiprocessing.Process(
target=LocalHttpServer._server,
args=(queue, str(self.root_path), self.range_support, self.max_lifetime),
)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

question: Since we're starting a new process here to run the server, why do we also run the server in its own thread in _server?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To handle the max_lifetime

python/kvikio/tests/test_examples.py Show resolved Hide resolved
dependencies.yaml Outdated Show resolved Hide resolved
@@ -83,6 +84,7 @@ outputs:
{% else %}
- libcufile-dev # [linux]
{% endif %}
- libcurl>=7.87.0
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Use an exact pinning in host: (next to cuda-version on line 77). Then it will use a run-export to get the compatible run pinning here.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

@@ -52,6 +52,7 @@ requirements:
{% else %}
- libcufile-dev # [linux]
{% endif %}
- libcurl>=7.87.0
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Use an exact pinning here -- building against a specific (possibly older) version will allow users to have anything equal or newer at runtime. See other comment below about run pinnings.

The problem you want to avoid is that this host pinning could pull libcurl 7.999 at build time (anything >=7.87), but your run pinning below is >=7.87.0 which might be incompatible. Pinning a specific version at build time and then relying on run-exports for a compatible runtime pinning is the cleanest solution here, and it's how we handle most dependencies like this.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the clarification, I understand meta.yaml a little bit better now :)

cpp/CMakeLists.txt Outdated Show resolved Hide resolved
cpp/cmake/thirdparty/get_libcurl.cmake Outdated Show resolved Hide resolved
cpp/include/kvikio/shim/libcurl.hpp Outdated Show resolved Hide resolved
@madsbk madsbk requested review from wence- and bdice October 3, 2024 10:59
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
improvement Improves an existing functionality non-breaking Introduces a non-breaking change
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants