Replies: 3 comments 3 replies
-
(Keeping comparison performant did require some special casing for strings, since StringView supports prefix comparison before branching on size.) |
Beta Was this translation helpful? Give feedback.
-
(👋 Apache Arrow PMC Chair here) BackgroundVelox has a Recently Apache Arrow added However, we have hit snag recently in that existing implementation of variable length strings in is not compatible with what is (now) in the Arrow Spec (which is what @bkietz is describing above). Specifically, directly using pointers (rather than a base + offset) doesn't work for many language implementations. Request:
@pedroerp and @mbasmanova perhaps you can offer some guidance. |
Beta Was this translation helpful? Give feedback.
-
More discussion with DuckDB folks: https://lists.apache.org/thread/3qhkomvvc69v3gkotbwldyko7yk9cs9k |
Beta Was this translation helpful? Give feedback.
-
Context
Ideally velox's in memory format would not only be efficient, but also directly usable across memory spaces and across interface boundaries. This requires that the format be guaranteed safe and amenable to validation. However
FlatVector<StringView>::values_
is a buffer containingStringView
, which includes raw pointers. This is certainly efficient, but raw pointers are difficult to validate and do not provide a robust lifetime guarantee. Furthermore, the array must be rewritten when the array is serialized or passed into a different memory space.Proposal - indices/offset representation
In the place of a 64 bit pointer,
FlatVector<StringView>
could store a 16 bit index and a 48 bit offset. A string view's index will refer to the buffer inFlatVector::stringBuffers_
where its character data is stored, while its offset will refer to the first character of its character data in that buffer. Accessors to the vector's elements can be rewritten to materialize the pointer; unless the values buffer is directly accessed usage of string vectors will not change.The lifetime guarantee which is necessary for vectors of strings is that the character data viewed by each string have lifetime equal to the vector itself; in C++ terms the character data should always be a member of the vector. Velox stores character data for each view in
FlatVector<StringView>::stringBuffers_
, as described in a comment in FlatVector.h. However there is no code to enforce this contract; it is possible to construct a vector whose strings view memory outside the vector, and these strings' pointers may dangle or otherwise trigger undefined behavior when dereferenced. Using indices/offsets makes this contract explicit and obvious since each string refers to a specific region in a specific buffer, and validation is as simple as boundschecking the index and offset.The ability to validate and test guarantees is key to building and testing integrations between disparate systems. The added unavoidability of the lifetime guarantee will serve as guidance for producers who might otherwise be unclear what vectors are legal to construct and export. From the consumer perspective, it’s important to note that on the other side of an interface boundary might be code that we don’t own, with which we can't communicate except through an IPC channel, or which in extreme cases may be untrusted.
Compounding the problem of acceptability to disparate systems, programming languages have differing constraints which must be accommodated. Some languages are not able to interact with raw pointers in a fully compartmentalized way. For example, rust’s compiler mandates lifetimes be known at compile time or the code must be marked unsafe... which then also applies to consumers of that code transitively. This will be a major barrier to usage of string vectors in rust, since unsafe code carries a significant stigma in the rust community. To illustrate how serious this stigma is, the arrow implementation in rust was entirely rewritten to avoid the necessity of transmute (equivalent to a c++ reinterpret_cast) in buffers- because the compiler could not otherwise verify that buffer data had the correct alignment (demoting the implementation to unsafe code) and the rust community decided it was worth the investment to achieve that guarantee.
A final benefit of this proposal is venue agnosticism- an indices/offsets based string vector can be used in shared memory or serialized without rewriting. By contrast string vector using raw pointers must modify those pointers on each transition to a new memory space, and are not meaningful when directly serialized. The benefits of anywhere-usable data are demonstrated by the success of the arrow ecosystem. The string vector is the only corner of velox data which is not equally anywhere-usable, so I think it'd be worth the investment to close that gap.
Benchmarking
I have written a PR which makes a first pass at this refactoring and runs some microbenchmarks to demonstrate that no great performance loss is incurred: main...bkietz:velox:indices-and-offsets
In order to keep the scope of the refactor minimal for now, I haven't repaired the entire suite. Below is a selection of compared results:
Vector hashing, to test simple linear access
Vector comparison and sorting, to test random access
Beta Was this translation helpful? Give feedback.
All reactions