From 4323b3bbe5d488f4f2bd3f5910cc4bbd5891ff3a Mon Sep 17 00:00:00 2001
From: James Bonfield
Date: Wed, 29 Mar 2023 12:05:31 +0100
Subject: [PATCH] Add an MZ:i tag.

This is used as a sanity check on the validity of the MM and ML tags.
It holds the length of SEQ at the time MM and ML were produced and/or
updated.

The intention is to provide a mechanism to detect that hard-clipping has
been performed with a tool that is not MM/ML aware.

Fixes #646
---
 SAMtags.tex | 11 +++++++++++
 1 file changed, 11 insertions(+)

diff --git a/SAMtags.tex b/SAMtags.tex
index e19ec290b..5d18557d6 100644
--- a/SAMtags.tex
+++ b/SAMtags.tex
@@ -93,6 +93,7 @@ \section{Standard tags}
   {\tt ML} & B,C & Base modification probabilities \\
   {\tt MM} & Z & Base modifications / methylation \\
   {\tt MQ} & i & Mapping quality of the mate/next segment \\
+  {\tt MZ} & i & Length of sequence at the time {\tt MM} and {\tt ML} were produced \\
   {\tt NH} & i & Number of reported alignments that contain the query in the current record \\
   {\tt NM} & i & Edit distance to the reference \\
   {\tt OA} & Z & Original alignment \\
@@ -621,6 +622,16 @@ \subsection{Base modifications}
 {\tt ML} values for ambiguity codes give the probability that the modification is one of the possible codes compatible with that ambiguity code.
 For example {\tt MM:Z:C+C,10; ML:B:C,229} indicates a C call with a probability of 90\% of having some form of unspecified modification.
 
+\item[MZ:i:\tagvalue{length}]
+\hfill\\
+Tools may edit the {\sf SEQ} sequence data, for example by hard-clipping the alignment.
+If the sequence is shrunk in this manner then the base offsets in {\tt MM} and {\tt ML} become invalid unless they are also updated accordingly.
+
+There may be hard-clipping tools which update {\tt MM} and tools which do not, so the {\tt MZ} tag offers a simple sanity check.
+It holds the length of the sequence at the time {\tt MM} and {\tt ML} were last written.
+Tools that wish to validate {\tt MM} should compare the length of the {\sf SEQ} field with the value of the {\tt MZ} tag.
+The tag is optional but recommended; if it is absent, the {\tt MM} data is assumed to be valid unless there is evidence otherwise (such as coordinates beyond the end of the sequence).
+
 \end{description}
 
 \section{Draft tags}
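
The check described in the new MZ text could be sketched as follows. This is a minimal illustration only, assuming the record's optional tags have already been parsed into a plain dict; the function name `mm_length_check` and the dict layout are hypothetical and not part of the patch or of any existing library.

```python
from typing import Mapping, Optional


def mm_length_check(seq: str, tags: Mapping[str, object]) -> Optional[bool]:
    """Compare len(SEQ) with the proposed MZ tag (hypothetical helper).

    Returns True if MZ matches the SEQ length, False if it does not
    (for example after hard-clipping by a tool that did not update MM/ML),
    and None when no check is possible.
    """
    if "MM" not in tags:
        return None  # no modification data to validate
    if seq == "*":
        return None  # SEQ not stored, so its length is unknown
    mz = tags.get("MZ")
    if mz is None:
        return None  # MZ absent: MM validity is only assumed, not verified
    return int(mz) == len(seq)


if __name__ == "__main__":
    intact = {"MM": "C+m,0;", "ML": [229], "MZ": 8}
    print(mm_length_check("ACGTACGT", intact))  # lengths agree
    print(mm_length_check("ACGT", intact))      # SEQ shrunk, MZ stale
```

A `None` result corresponds to the case the text calls an implicit assumption of validity: the caller may still want to reject MM data whose coordinates run past the end of the sequence.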