POD5 File Format Design Details

Summary

This file format has the following design goals (roughly in priority order):

Good write performance for MinKNOW
Recoverable if the writing process crashes
Good read performance for downstream tools, including basecall model generation
Efficient use of space
Straightforward to implement and maintain
Extensibility

Note that trade-offs have been made between these goals, but we have mostly aimed to make those run-time decisions.

We have also chosen not to optimise for editing existing files.

Write performance

The aspects of this format that are designed to maximise write performance are:

Data can be written sequentially
- The sequential access pattern makes it easy to use efficient operating system APIs (such as io_uring on Linux)
- The sequential access pattern helps the operating system's I/O scheduler maximise throughput
Signal data from different reads can be interleaved, and data streams can be safely abandoned (at the cost of using more space than necessary)
- This allows MinKNOW to write out data as it arrives, potentially avoiding the need have an intermediate caching format (this file format can be used for the cache and the final output)
Support for space- and CPU-efficient compression routines (VBZ)
- This reduces the amount of data that needs to be written, which reduces I/O load

Recovery

The aspects of this format that are designed to allow for recovery if the writing process crashes are:

A way to indicate that a file is actually complete as intended (complete files end with a recognisable footer)
The Apache Feather format can be assembled by reading it sequentially, without using the footer
The data file format is append-only, which means that once data is recorded it cannot be corrupted by later updates

Read performance

The aspects of this format that are designed to maximise read performance are:

The Apache Feather format can be memory mapped and used directly
Apache Arrow has significant existing engineering work geared around efficient access to data, from the layout of the data itself to the library tooling
Storing direct information about signal data locations with the row table
- This allows quick access to a read's data without scanning the data file
It is possible to only decode part of a long read, due to read data being stored in chunks
- This is useful for model training
Read access does not require locking or otherwise modifying the file
- This allows multi-threaded and multi-process access to a file for reading

Efficient use of space

The aspects of this format that are designed to maximise use of space are:

Support for efficient compression routines (VBZ)
Apache Arrow's support for dictionary encoding
Apache Arrow's support for compressing buffers with standard compression routines

Ease of implementation

The aspects of this format that are designed to make the format easy to implement are:

Relying on an existing, widely-used format (Apache Arrow)

Extensibility

The aspects of this format that are designed to make the format extensible are:

Apache Arrow uses a self-describing schema with named columns, so it is straightforward to write code that is resilient in the face of things like additional columns being added.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

DESIGN.md

DESIGN.md

POD5 File Format Design Details

Summary

Write performance

Recovery

Read performance

Efficient use of space

Ease of implementation

Extensibility

Files

DESIGN.md

Latest commit

History

DESIGN.md

File metadata and controls

POD5 File Format Design Details

Summary

Write performance

Recovery

Read performance

Efficient use of space

Ease of implementation

Extensibility