Add splitting BAM index to spec #321

Open
wants to merge 6 commits into
base: master
Changes from 2 commits
42 changes: 42 additions & 0 deletions SAMv1.tex
@@ -1204,6 +1204,48 @@ \subsection{C source code for computing bin number and overlapping bins}\label{s
\end{verbatim}
}

\subsection{Splitting BAM}\label{sec:code}
A BAM file can be processed in parallel by conceptually dividing the file into
splits (typically of a fixed, but arbitrary, number of bytes) and for each
split processing alignments from the first known alignment after the split
Member

We might want a sentence here describing why this additional index is necessary and this use can't be handled by the bai.

start up to the first known alignment of the next split.

A splitting BAM index is a linear index of virtual file offsets of alignment
Member

This sentence is a bit hard to parse on the first run through. Too many "of"s in a row. I'm not sure how to improve it without making it much more wordy, but maybe someone has a good idea.

Also, does linear index imply that it's a sorted list of increasing offsets? Should we mention that somewhere?

start positions. The index must contain the virtual file offset for the first
alignment, and a virtual file offset for the overall length of the BAM
file.\footnote{In the unlikely event the BAM file has no alignment records,
the index will consist of a single entry for the overall length of the
BAM file.} It does not need to contain a virtual file offset for every
alignment, merely a subset. A granularity of $n$ means that an offset is
Member

Do we want to allow indication of the approximate number of records per offset? Or is that just making things unnecessarily complicated? At the cost of increasing the index size, we could include the number of records in each section in the index. Instead of [offset], we'd have [offset, number of records until next offset]. That might be useful for tools deciding how to split the file.

written for every $n$ alignments.
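
The selection of entries for a given granularity can be sketched in C as follows. This is only an illustration under stated assumptions: the function name {\tt sbi\_select\_offsets} and its parameters are hypothetical, and ``every $n$ alignments'' is read here as alignments $0, n, 2n, \ldots$, which always includes the first alignment; the offset for the overall length of the BAM file is appended last.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical helper (not part of the specification): choose the entries
 * of a splitting BAM index.
 *
 * aln_voffs:     virtual file offsets of all alignment starts, ascending
 * n_aln:         number of alignments
 * file_len_voff: virtual file offset for the overall length of the BAM file
 * granularity:   one offset is written for every `granularity` alignments (> 0)
 * out:           receives the entries; needs room for
 *                (n_aln + granularity - 1) / granularity + 1 values
 *
 * Returns the number of entries written.
 */
size_t sbi_select_offsets(const uint64_t *aln_voffs, size_t n_aln,
                          uint64_t file_len_voff, int32_t granularity,
                          uint64_t *out)
{
    size_t i, k = 0;
    assert(granularity > 0);
    /* Assumption: keep alignments 0, n, 2n, ... so the first is always kept. */
    for (i = 0; i < n_aln; i += (size_t)granularity)
        out[k++] = aln_voffs[i];
    /* The final entry is the offset for the overall length of the file. */
    out[k++] = file_len_voff;
    return k;
}
```

With no alignment records the loop body never runs, so the result is the single end-of-file entry described in the footnote.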

To find the alignments for a split that covers a byte range {\tt [beg,\,end)},
use the index to find the smallest virtual file offset, {\tt v1}, that falls
in this range, and the smallest virtual file offset, {\tt v2}, that is
greater than or equal to {\tt end}. If {\tt v1} does not exist, then the
split has no alignments. Otherwise, it has alignments in the range
Member

no alignments -> no alignment starts

{\tt [v1,\,v2)}. This method will map a set of contiguous, non-overlapping
{\it file ranges} that cover the whole BAM file to a set of contiguous,
non-overlapping {\it virtual file ranges} that cover the whole file.
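
A minimal C sketch of this lookup is given below. It rests on assumptions that the text above does not state: a byte position {\tt p} is compared with virtual file offsets as {\tt p<<16} (that is, against the compressed-block part of the offset), the last index entry is the offset for the overall length of the file, and {\tt v2} is clamped to that last entry when {\tt end} lies beyond the end of the file. The function names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Index of the first entry >= key in the ascending array offs[0..n),
 * or n if there is no such entry (binary search). */
size_t sbi_lower_bound(const uint64_t *offs, size_t n, uint64_t key)
{
    size_t lo = 0, hi = n;
    while (lo < hi) {
        size_t mid = lo + (hi - lo) / 2;
        if (offs[mid] < key) lo = mid + 1;
        else hi = mid;
    }
    return lo;
}

/*
 * Hypothetical helper: map the byte range [beg, end) to the virtual file
 * range [*v1, *v2). Returns 1 if an indexed alignment start falls in the
 * split, 0 otherwise (in which case *v1 and *v2 are left untouched).
 * Assumes beg and end are below 2^48 so that the shifts do not overflow.
 */
int sbi_split_range(const uint64_t *offs, size_t n,
                    uint64_t beg, uint64_t end,
                    uint64_t *v1, uint64_t *v2)
{
    uint64_t vbeg, vend;
    size_t i, j;
    assert(n > 0 && beg <= end);
    vbeg = beg << 16;                      /* assumption: byte -> virtual */
    vend = end << 16;
    i = sbi_lower_bound(offs, n, vbeg);    /* candidate for v1 */
    j = sbi_lower_bound(offs, n, vend);    /* candidate for v2 */
    if (j == n) j = n - 1;                 /* clamp to the end-of-file entry */
    if (i >= j) return 0;                  /* nothing indexed in [beg, end) */
    *v1 = offs[i];
    *v2 = offs[j];
    return 1;
}
```

Because the same {\tt end} is used as the next split's {\tt beg}, the {\tt v2} of one split equals the {\tt v1} of the next non-empty split, which is what keeps the resulting virtual file ranges contiguous.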

Splitting BAM index filenames have a {\tt .sbi} extension added to the BAM
filename (so {\tt foo.bam.sbi} is the splitting BAM index filename for
{\tt foo.bam}). Index files contain a header followed by a sorted list of
virtual file offsets in ascending order.

\begin{table}[ht]
\centering
{\small
\begin{tabular}{|l|l|l|p{8.15cm}|l|r|}
\cline{1-6}
\multicolumn{3}{|c|}{\bf Field} & \multicolumn{1}{c|}{\bf Description} & \multicolumn{1}{c|}{\bf Type} & \multicolumn{1}{c|}{\bf Value} \\\cline{1-6}
\multicolumn{3}{|l|}{\sf magic} & Magic string & {\tt char[4]} & {\tt SBI\char92 1}\\\cline{1-6}
\multicolumn{3}{|l|}{\sf granularity} & Number of alignments between offsets, or $-1$ if unspecified & {\tt int32\_t} & \\\cline{1-6}
\multicolumn{6}{|c|}{\textcolor{gray}{\it List of offsets}} \\\cline{2-6}
& \multicolumn{2}{l|}{\sf offset} & Virtual file offset of the alignment & {\tt uint64\_t} & \\\cline{1-6}
Member

It seems like there should be a special final entry after the list in the table for the offset to the end of the BAM.

\end{tabular}}
\end{table}
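
As an illustration of the layout in the table, the following C sketch parses an index held in memory. It assumes little-endian integers, as elsewhere in BAM (the table does not state a byte order), and, because the layout as given has no count field, it treats every 8 bytes after the header as one offset up to the end of the buffer. The name {\tt sbi\_parse} and its signature are hypothetical.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Read a little-endian 64-bit unsigned integer (byte order is an assumption). */
uint64_t sbi_le64(const unsigned char *p)
{
    uint64_t v = 0;
    int i;
    for (i = 7; i >= 0; i--)
        v = (v << 8) | p[i];
    return v;
}

/*
 * Hypothetical parser for the layout in the table: magic, granularity,
 * then a list of offsets running to the end of the buffer.
 * Returns the number of offsets read, or -1 if the buffer is malformed,
 * the offsets are not in ascending order, or there are more than max_offs.
 */
long sbi_parse(const unsigned char *buf, size_t len,
               int32_t *granularity, uint64_t *offs, size_t max_offs)
{
    size_t i, n;
    uint32_t g;
    if (len < 8 || memcmp(buf, "SBI\1", 4) != 0) return -1;  /* bad magic */
    if ((len - 8) % 8 != 0) return -1;      /* trailing partial offset */
    n = (len - 8) / 8;
    if (n > max_offs) return -1;
    g = (uint32_t)buf[4] | (uint32_t)buf[5] << 8 |
        (uint32_t)buf[6] << 16 | (uint32_t)buf[7] << 24;
    *granularity = (int32_t)g;              /* -1 means unspecified */
    for (i = 0; i < n; i++) {
        offs[i] = sbi_le64(buf + 8 + 8 * i);
        if (i > 0 && offs[i] < offs[i - 1]) return -1;  /* must ascend */
    }
    return (long)n;
}
```

Rejecting a descending list is a choice made for this sketch; the text only says that the list is sorted in ascending order.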

\pagebreak

\begin{appendices}