[pitch] 2431 Storage layer and metastore v3.3 (git-inspired) #1222
rufuspollock asked this question in Ideas
This is a major RFC/Shaping doc. It sets out a design for the storage layer and metastore of DataHub Cloud. IMO this represents something of a breakthrough 😄
Definitions
Tasks
Executive Summary
This design is git-inspired and git-syncable i.e. you can sync git to it.
It has the following attractive features ...
This in turn supports the following useful use cases:
Layout on disk
On disk e.g. in R2, it looks like the following:
Index
Index structure. This will likely be in a central database; however, it could also be on disk if wanted.
This is per project, so imagine everything prefixed by project.
Questions
content-disposition header. On S3 you can do this per request by setting a query parameter when giving out a signed URL.
Algorithms
Key algorithms operate as follows:
Original
Background and motivation
We want a design for data/content store. We would like one where:
What we don't need:
What we may need ...
Storage layer needs
On retrieval
/@me/myproject/abc.csv: display file associated with it ... (optionally a commit / branch / tag)
Design
Layout on disk
See exec summary
How do we support queries for files ...
Design
A full cached index of this would look like (per project, though strictly commit should be unique per project ...):
ASIDE: do we want commit_sha or root_tree in the table?
How to get all files
SELECT * FROM object_index WHERE commit_sha = XXX;
SELECT * FROM object_index WHERE commit_sha = XXX AND path = YYY;
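The two queries above can be tried out with a minimal sketch using Python's built-in sqlite3. Only the table name object_index and the columns commit_sha and path come from the queries; the blob_sha and size columns, the sample rows, and the helper names list_files and get_file are assumptions made for illustration, not the actual schema.

```python
import sqlite3

# Hypothetical schema: only commit_sha and path appear in the queries above;
# blob_sha and size are assumed columns added for illustration.
conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE object_index (
        commit_sha TEXT NOT NULL,
        path       TEXT NOT NULL,
        blob_sha   TEXT NOT NULL,
        size       INTEGER,
        PRIMARY KEY (commit_sha, path)
    )
    """
)

# Placeholder rows: two commits, where only data/abc.csv changed in the second.
conn.executemany(
    "INSERT INTO object_index VALUES (?, ?, ?, ?)",
    [
        ("c1", "README.md", "b1", 120),
        ("c1", "data/abc.csv", "b2", 2048),
        ("c2", "README.md", "b1", 120),
        ("c2", "data/abc.csv", "b3", 2100),
    ],
)


def list_files(conn, commit_sha):
    """Return (path, blob_sha) for every file recorded at a commit."""
    rows = conn.execute(
        "SELECT path, blob_sha FROM object_index "
        "WHERE commit_sha = ? ORDER BY path",
        (commit_sha,),
    )
    return rows.fetchall()


def get_file(conn, commit_sha, path):
    """Return (path, blob_sha) for one file at a commit, or None if absent."""
    return conn.execute(
        "SELECT path, blob_sha FROM object_index "
        "WHERE commit_sha = ? AND path = ?",
        (commit_sha, path),
    ).fetchone()
```

Using bound parameters (the ? placeholders) instead of interpolating XXX and YYY keeps the lookups safe when the path comes from a URL.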
How do we add a file direct ...
type: lfs and storing sha256 e.g.
Delete a file
How do we sync from git(hub)
Support for Git LFS
We can use our storage for git lfs support ...
Need to be a bit specific about what we mean. Do we mean:
TODO
Optimizations ...
HEAD field in object_index table that shows TRUE if part of HEAD. This allows skipping the first step in lookup.
Appendix: How does current system work and what are its disadvantages
Advantages:
Disadvantages / issues
Appendix: storage calculations
Is it a problem that we store a full copy of a file every time it changes, even for a small change to a large file (i.e. we don't do packing like git does)? Basic answer: no, mainly because
Let's do an illustrative calculation:
❓ [minor] Aside: can we use gzip compression for more efficient storage and transfer?
Appendix: How does git work?
See https://git-scm.com/book/en/v2/Git-Internals-Git-Objects
Git is a content-addressable filesystem with revisions.
A git repository stores all content in the object-store.
Three things are stored in the object store, each named by its hash:
What does each entry look like?
References
https://git-scm.com/book/en/v2/Git-Internals-Git-References
We need a convenient way to find the relevant commit, e.g. what is HEAD, or what a given branch is pointing to.
Refs then make it easy by providing human readable names ...
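As a rough sketch of how refs map human-readable names to commits: in git, HEAD is usually a symbolic ref whose content is `ref: refs/heads/<branch>`, and the branch ref holds a commit sha. The in-memory REFS dict, the placeholder shas (c1, c2, c3), and the resolve_ref helper below are hypothetical, used only to show the lookup.

```python
# Hypothetical in-memory ref store mirroring git's layout: HEAD is a symbolic
# ref pointing at a branch; branches and tags map to commit shas (placeholders).
REFS = {
    "HEAD": "ref: refs/heads/main",
    "refs/heads/main": "c2",
    "refs/heads/dev": "c3",
    "refs/tags/v1.0": "c1",
}

SYMBOLIC_PREFIX = "ref: "


def resolve_ref(refs, name, max_depth=5):
    """Follow symbolic refs until a commit sha is reached."""
    value = refs[name]
    for _ in range(max_depth):
        if not value.startswith(SYMBOLIC_PREFIX):
            return value
        # Strip the "ref: " prefix and look up the ref it points to.
        value = refs[value[len(SYMBOLIC_PREFIX):]]
    if not value.startswith(SYMBOLIC_PREFIX):
        return value
    raise ValueError(f"symbolic ref chain too deep for {name!r}")
```

Resolving HEAD this way gives the commit sha to use in the index lookup, which is the "first step" a cached HEAD flag would let you skip.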
Git LFS
TODO
Sha calculation
Strictly, git calculates not the sha of the file contents but the sha of the file contents plus a short header, which is of the form:
e.g. for a blob:
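For reference, the header git prepends is the object type, a space, the content length in bytes as a decimal number, and a NUL byte; for a blob that is `blob <size>\0` followed by the content. A minimal sketch of the calculation, assuming git's default SHA-1 object format (the function name git_blob_sha1 is ours):

```python
import hashlib


def git_blob_sha1(content: bytes) -> str:
    """Compute the object id git assigns to a blob with this content."""
    # Header: object type, space, byte length in decimal, then a NUL byte.
    header = b"blob " + str(len(content)).encode("ascii") + b"\0"
    return hashlib.sha1(header + content).hexdigest()
```

This should match what `git hash-object <file>` prints for the same bytes, which is useful when syncing from git because blob ids can be recomputed locally without a git checkout. It also means a plain sha1 or sha256 of the file contents will not equal the git blob id.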
Appendix: How does huggingface work
Extra benefits
Builds on / supersedes
![[../Excalidraw/metastore-and-storage-layer-2024-05-24.excalidraw.svg]]