Skip to content

Latest commit

 

History

History
71 lines (47 loc) · 3.3 KB

DESIGN.md

File metadata and controls

71 lines (47 loc) · 3.3 KB

POD5 File Format Design Details

Summary

This file format has the following design goals (roughly in priority order):

  • Good write performance for MinKNOW
  • Recoverable if the writing process crashes
  • Good read performance for downstream tools, including basecall model generation
  • Efficient use of space
  • Straightforward to implement and maintain
  • Extensibility

Note that trade-offs have been made between these goals, but we have mostly aimed to make those run-time decisions.

We have also chosen not to optimise for editing existing files.

Write performance

The aspects of this format that are designed to maximise write performance are:

  • Data can be written sequentially
    • The sequential access pattern makes it easy to use efficient operating system APIs (such as io_uring on Linux)
    • The sequential access pattern helps the operating system's I/O scheduler maximise throughput
  • Signal data from different reads can be interleaved, and data streams can be safely abandoned (at the cost of using more space than necessary)
    • This allows MinKNOW to write out data as it arrives, potentially avoiding the need have an intermediate caching format (this file format can be used for the cache and the final output)
  • Support for space- and CPU-efficient compression routines (VBZ)
    • This reduces the amount of data that needs to be written, which reduces I/O load

Recovery

The aspects of this format that are designed to allow for recovery if the writing process crashes are:

  • A way to indicate that a file is actually complete as intended (complete files end with a recognisable footer)
  • The Apache Feather format can be assembled by reading it sequentially, without using the footer
  • The data file format is append-only, which means that once data is recorded it cannot be corrupted by later updates

Read performance

The aspects of this format that are designed to maximise read performance are:

  • The Apache Feather format can be memory mapped and used directly
  • Apache Arrow has significant existing engineering work geared around efficient access to data, from the layout of the data itself to the library tooling
  • Storing direct information about signal data locations with the row table
    • This allows quick access to a read's data without scanning the data file
  • It is possible to only decode part of a long read, due to read data being stored in chunks
    • This is useful for model training
  • Read access does not require locking or otherwise modifying the file
    • This allows multi-threaded and multi-process access to a file for reading

Efficient use of space

The aspects of this format that are designed to maximise use of space are:

  • Support for efficient compression routines (VBZ)
  • Apache Arrow's support for dictionary encoding
  • Apache Arrow's support for compressing buffers with standard compression routines

Ease of implementation

The aspects of this format that are designed to make the format easy to implement are:

  • Relying on an existing, widely-used format (Apache Arrow)

Extensibility

The aspects of this format that are designed to make the format extensible are:

  • Apache Arrow uses a self-describing schema with named columns, so it is straightforward to write code that is resilient in the face of things like additional columns being added.