update mission #28

Merged 1 commit on Aug 17, 2023
9 changes: 8 additions & 1 deletion content/home/mission.md
@@ -9,4 +9,11 @@ design:
css_class: null
---

SkyhookDM is an open source project to enable automatic mapping of data processing of structured data to heterogeneous architectures by providing a framework for efficient and composable data processing in storage and network layers. SkyhookDM leverages Apache Arrow and other open source projects that receive significant investment by the data management community in science and industry. The project strives to maximize contributions to existing open source projects while minimizing the size (and need for maintenance) of an independent codebase.
Cultivate an ecosystem in which open source software for the computational I/O stack can be developed, distributed, and sustained. This software must reduce barriers to adoption and meet the current and future challenges of the computational I/O stack, and its solutions should leverage existing expertise from outside the storage and network I/O communities.

## Goals

- Foster collaboration around the open source data science, storage, and networking systems ecosystem
- Support the development of system- and domain-specific computational I/O stack packages
- Reduce barriers to adoption of open source software for the computational I/O stack

8 changes: 5 additions & 3 deletions content/home/why.md
@@ -8,10 +8,12 @@ design:
css_style: null
css_class: null
---
A key challenge in data science is extracting efficient and timely insights from an ever increasing flood of data streams. Apache Arrow, an open source data processing framework, provides an efficient and timely approach by reducing in-memory serialization and copy overheads. It is widely used in the development of data management services of structured data due to its interoperability across multiple programming languages and runtimes and plays an essential role in the rapid evolution of the open source data science ecosystem.
The key advantage of the cloud is its elasticity. This is implemented by systems that can expand and shrink resources quickly and by disaggregating services, including compute, networking, and storage. Elasticity is also valuable for on-premises datacenters, where disaggregation allows compute and storage to scale independently. This disaggregation, however, places greater demand on expensive top-of-rack networking resources, since compute and storage nodes end up in different racks and even rows as the installation grows. More network traffic also requires more CPU cycles to be dedicated to sending and receiving data. Therefore, disaggregation, somewhat paradoxically, amplifies the benefit of moving some compute (the compute that involves data management) into the storage and network layers, because data management filtering operations can reduce data movement significantly.

Apache Arrow, an open source data processing framework, provides an efficient and timely approach by reducing in-memory serialization and copy overheads. It is widely used in the development of data management services of structured data due to its interoperability across multiple programming languages and runtimes and plays an essential role in the rapid evolution of the open source data science ecosystem.

While many existing data management services that use Apache Arrow are well suited to resource-rich environments, the open source data science ecosystem lacks a common framework for data management services designed for resource-constrained environments, like those found in the storage and network layers. This has led to a variety of insular and hard-to-reuse embedded data processing solutions. An efficient approach is to reduce data movement within the storage and network layers by embedding data-reductive processing and caching throughout the data path. Like most emerging technologies, computational storage devices, smart NICs, and similar devices where data processing can be embedded have to overcome market entry barriers. Thus, reducing complexity, increasing interoperability, and lowering development costs are critical issues in embedded data management.

SkyhookDM is a full-stack data management framework with the purpose of bridging the gap between resource-rich and resource-constrained environments to better serve data-intensive applications. By leveraging Apache Arrow in the storage and network layers, SkyhookDM adds extra data processing capabilities to embedded devices that were previously inflexible black boxes. This allows SkyhookDM to lower market entry barriers by saving costs and accelerating the development of data management services of structured data.
SkyhookDM is an ecosystem of computational I/O stack components and building blocks that bridge the gap between resource-rich and resource-constrained environments to better serve data-intensive applications. By leveraging Apache Arrow in the storage and network layers, these components add extra data processing capabilities to embedded devices that were previously inflexible black boxes. This lowers market entry barriers by saving costs and accelerating the development of data management services for structured data.

Beyond the integration and portability of data management services to embedded devices, SkyhookDM allows data management systems to interoperate with heterogeneous data management services, data sources, and devices in the storage and network layers. Sharing rich metadata from data management systems to embedded devices allows data management services to become smarter and better adapt to heterogeneous architectures. This allows truly distributed data management services to interact even more intelligently with their specific hardware and software contexts.
Beyond the integration and portability of data management services to embedded devices, the SkyhookDM ecosystem allows data management systems to interoperate with heterogeneous data management services, data sources, and devices in the storage and network layers. Sharing rich metadata from data management systems to embedded devices allows data management services to become smarter and better adapt to heterogeneous architectures. This allows truly distributed data management services to interact even more intelligently with their specific hardware and software contexts.