Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Redundancy update #425

Closed
wants to merge 26 commits into from
Closed
Show file tree
Hide file tree
Changes from all commits
Commits
Show all changes
26 commits
Select commit Hold shift + click to select a range
a6c0aee
Basic setup
RCollins13 Jun 10, 2021
bb1bf8c
Basic changes to reorganize flexible external benchmarking
RCollins13 Jun 10, 2021
eddc23e
Rework inputs in QC WDL to make trio analyses and external benchmarki…
RCollins13 Jun 10, 2021
bb41290
Update
RCollins13 Jun 18, 2021
0677e3e
Updates to QC for more efficient VID collection per sample
RCollins13 Jun 22, 2021
89e9973
Allow VCF-wide QC plotting to accept VCFs with only allosomes
RCollins13 Jul 1, 2021
2faf23c
Subset external benchmarking to contigs of interest
RCollins13 Jul 8, 2021
a251170
QC updates with XZ IRS
RCollins13 Aug 2, 2021
8e3b131
Fix typo in site-level benchmarking plotting code
RCollins13 Sep 18, 2021
00784c5
Fix external benchmarking helper
RCollins13 Sep 18, 2021
06c4fd9
Docker update
RCollins13 Sep 19, 2021
390fada
Update
RCollins13 Sep 19, 2021
346fc3c
Fix per-sample benchmarking in QC
RCollins13 Sep 20, 2021
c3ed180
Updates to per-sample plotting code
RCollins13 Sep 20, 2021
616b6f6
Fix compression edge-case for per-sample callset comparison
RCollins13 Sep 21, 2021
0d03868
Add contingengy for edge case where comparison set has no variants
RCollins13 Sep 22, 2021
8c27332
Make per-sample benchmark plotting more tractable
RCollins13 Sep 22, 2021
efd0924
Make SV type-matching more generous in compare_callsets.sh
RCollins13 Sep 27, 2021
5c9ee0d
Add contingency for when variant is SVTYPE=CNV
RCollins13 Sep 27, 2021
00cfef7
Add contingency for QC on VCFs prior to cleanVCF step 5
RCollins13 Oct 20, 2021
2dd3a89
Update logic for selecting best matches in AF benchmarking
RCollins13 Oct 25, 2021
ad167a8
Split site-level benchmarking per chromosome
RCollins13 Oct 25, 2021
3858221
Update QC code before Vahid's PR
RCollins13 Oct 28, 2021
db23760
Merge pull request #1 from talkowski-lab/rlc_qc_basics
RCollins13 Oct 28, 2021
e3ff4be
added scripts to remove redundancies
xuefzhao Sep 28, 2022
a993595
recluster script updated
xuefzhao Oct 7, 2022
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
440 changes: 9 additions & 431 deletions README.md

Large diffs are not rendered by default.

13 changes: 10 additions & 3 deletions dockerfiles/expansion-hunter-denovo/Dockerfile
Original file line number Diff line number Diff line change
@@ -1,10 +1,17 @@
FROM alpine:latest
RUN apk --no-cache add curl && \
FROM python:3.7-slim
RUN apt-get update && apt-get install -y \
wget \
&& apt-get clean \
&& rm -rf /var/lib/apt/lists/* && \

wget https://github.com/Illumina/ExpansionHunterDenovo/releases/download/v0.9.0/ExpansionHunterDenovo-v0.9.0-linux_x86_64.tar.gz && \
mkdir ehdn_extract && \
tar -xf *.tar.gz --strip-components=1 -C ehdn_extract && \
rm -rf *.tar.gz && \
mkdir ehdn && \
mv ehdn_extract/bin/* ehdn/ && \
rm -rf ehdn_extract
mv ehdn_extract/scripts ehdn/ && \
rm -rf ehdn_extract && \
pip install -r /ehdn/scripts/requirements.txt
ENV PATH="/ehdn/:$PATH"
ENV SCRIPTS_DIR /ehdn/scripts
417 changes: 226 additions & 191 deletions dockerfiles/igv/MakeRDtest.py

Large diffs are not rendered by default.

21 changes: 10 additions & 11 deletions dockerfiles/igv/igv.py
Original file line number Diff line number Diff line change
@@ -1,13 +1,12 @@
import sys
import sys
[_,varfile]=sys.argv
plotdir="plots"
igvfile="igv.txt"
igvsh="igv.sh"
with open(varfile,'r') as f:
[_, varfile] = sys.argv
plotdir = "plots"
igvfile = "igv.txt"
igvsh = "igv.sh"
with open(varfile, 'r') as f:
for line in f:
dat=line.split('\t')
chr=dat[0]
start=dat[1]
end=dat[2]
data=dat[3].split(',')
dat = line.split('\t')
chr = dat[0]
start = dat[1]
end = dat[2]
data = dat[3].split(',')
79 changes: 45 additions & 34 deletions dockerfiles/igv/makeigv_cram.py
Original file line number Diff line number Diff line change
@@ -1,70 +1,81 @@
import sys,os,argparse
#[_,varfile,buff,fasta]=sys.argv #assume the varfile has *.bed in the end
# Example
import os
import argparse
# [_,varfile,buff,fasta]=sys.argv #assume the varfile has *.bed in the end
# Example
# python makeigv.py /data/talkowski/xuefang/local/src/IGV_2.4.14/IL_DUP/IL.DUP.HG00514.V2.bed /data/talkowski/Samples/1000Genomes/HGSV_Illumina_Alignment_GRCh38 400
# bash IL.DUP.HG00514.V2.sh
# bash igv.sh -b IL.DUP.HG00514.V2.txt


parser = argparse.ArgumentParser("makeigvsplit_cram.py")
parser.add_argument('varfile', type=str, help='name of variant file in bed format, with cram and SVID in last two columns')
parser.add_argument('buff', type=str, help='length of buffer to add around variants')
parser.add_argument('varfile', type=str,
help='name of variant file in bed format, with cram and SVID in last two columns')
parser.add_argument(
'buff', type=str, help='length of buffer to add around variants')
parser.add_argument('fasta', type=str, help='reference sequences')

parser.add_argument('sample', type=str, help='name of sample to make igv on')
parser.add_argument('chromosome', type=str, help='name of chromosome to plot igv on')
parser.add_argument('chromosome', type=str,
help='name of chromosome to plot igv on')
args = parser.parse_args()


buff = int(args.buff)
fasta = args.fasta
varfile = args.varfile

outstring=os.path.basename(varfile)[0:-4]
bamdir="bam"
outdir="screenshot"
igvfile="igv.txt"
bamfiscript="igv.sh"
outstring = os.path.basename(varfile)[0:-4]
bamdir = "bam"
outdir = "screenshot"
igvfile = "igv.txt"
bamfiscript = "igv.sh"
###################################
with open(bamfiscript,'w') as h:
with open(bamfiscript, 'w') as h:
h.write("#!/bin/bash\n")
h.write("set -e\n")
h.write("mkdir -p {}\n".format(bamdir))
h.write("mkdir -p {}\n".format(outdir))
with open(igvfile,'w') as g:
with open(igvfile, 'w') as g:
g.write('new\n')
g.write('genome {}\n'.format(fasta))
with open(varfile,'r') as f:
with open(varfile, 'r') as f:
for line in f:
dat=line.rstrip().split("\t")
Chr=dat[0]
if not Chr == args.chromosome: continue
Start=str(int(dat[1])-buff)
End=str(int(dat[2])+buff)
Dat=dat[3].split(',')
ID=dat[4]
dat = line.rstrip().split("\t")
Chr = dat[0]
if not Chr == args.chromosome:
continue
Start = str(int(dat[1]) - buff)
End = str(int(dat[2]) + buff)
Dat = dat[3].split(',')
ID = dat[4]
for cram in Dat:
#sample=cram.split("/")[-1].split('.')[0]
g.write('load '+bamdir+'/'+args.sample+'_'+args.chromosome+'.bam\n')
if int(End)-int(Start)<10000:
g.write('goto '+Chr+":"+Start+'-'+End+'\n')
# sample=cram.split("/")[-1].split('.')[0]
g.write('load ' + bamdir + '/' + args.sample +
'_' + args.chromosome + '.bam\n')
if int(End) - int(Start) < 10000:
g.write('goto ' + Chr + ":" + Start + '-' + End + '\n')
g.write('sort base\n')
g.write('viewaspairs\n')
g.write('collapse\n')
g.write('snapshotDirectory '+outdir+'\n')
g.write('snapshot '+args.sample+'_'+ID+'.png\n' )
g.write('snapshotDirectory ' + outdir + '\n')
g.write('snapshot ' + args.sample + '_' + ID + '.png\n')
else:
g.write('goto '+Chr+":"+Start+'-'+str(int(Start)+1000)+'\n') # Extra 1kb buffer if variant large
# Extra 1kb buffer if variant large
g.write('goto ' + Chr + ":" + Start +
'-' + str(int(Start) + 1000) + '\n')
g.write('sort base\n')
g.write('viewaspairs\n')
g.write('collapse\n')
g.write('snapshotDirectory '+outdir+'\n')
g.write('snapshot '+args.sample+'_'+ID+'.left.png\n' )
g.write('goto '+Chr+":"+str(int(End)-1000)+'-'+End+'\n')
g.write('snapshotDirectory ' + outdir + '\n')
g.write('snapshot ' + args.sample +
'_' + ID + '.left.png\n')
g.write('goto ' + Chr + ":" +
str(int(End) - 1000) + '-' + End + '\n')
g.write('sort base\n')
g.write('collapse\n')
g.write('snapshotDirectory '+outdir+'\n')
g.write('snapshot '+args.sample+'_'+ID+'.right.png\n' )
g.write('snapshotDirectory ' + outdir + '\n')
g.write('snapshot ' + args.sample +
'_' + ID + '.right.png\n')
# g.write('goto '+Chr+":"+Start+'-'+End+'\n')
# g.write('sort base\n')
# g.write('viewaspairs\n')
Expand All @@ -73,4 +84,4 @@
# g.write('snapshot '+ID+'.png\n' )
g.write('new\n')
g.write('exit\n')
# with open(bamfiscript,'w') as g:
# with open(bamfiscript,'w') as g:
106 changes: 59 additions & 47 deletions dockerfiles/igv/makeigvpesr_cram.py
Original file line number Diff line number Diff line change
@@ -1,21 +1,26 @@
import sys,os,argparse
#[_,varfile,buff,fasta]=sys.argv #assume the varfile has *.bed in the end
# Usage
import os
import argparse
# [_,varfile,buff,fasta]=sys.argv #assume the varfile has *.bed in the end
# Usage
# python makeigvpesr_cram.py varfile fasta sample ped cram_list buffer chromosome
# bash IL.DUP.HG00514.V2.sh
# bash igv.sh -b IL.DUP.HG00514.V2.txt


parser = argparse.ArgumentParser("makeigvsplit_cram.py")
parser.add_argument('varfile', type=str, help='name of variant file in bed format, with cram and SVID in last two columns')
parser.add_argument('varfile', type=str,
help='name of variant file in bed format, with cram and SVID in last two columns')
parser.add_argument('fasta', type=str, help='reference sequences')
#parser.add_argument('bam', type=str, help='name of bam to make igv on')
# parser.add_argument('bam', type=str, help='name of bam to make igv on')
parser.add_argument('sample', type=str, help='name of sample to make igv on')
parser.add_argument('ped', type=str, help='name of ped file')
parser.add_argument('cram_list', type=str, help='a file including sample and cram path')
parser.add_argument('outdir', type=str, help = 'output folder')
parser.add_argument('-b','--buff', type=str, help='length of buffer to add around variants', default=500)
parser.add_argument('-c','--chromosome', type=str, help='name of chromosome to make igv on', default='all')
parser.add_argument('cram_list', type=str,
help='a file including sample and cram path')
parser.add_argument('outdir', type=str, help='output folder')
parser.add_argument('-b', '--buff', type=str,
help='length of buffer to add around variants', default=500)
parser.add_argument('-c', '--chromosome', type=str,
help='name of chromosome to make igv on', default='all')

args = parser.parse_args()

Expand All @@ -24,84 +29,91 @@
fasta = args.fasta
varfile = args.varfile

outstring=os.path.basename(varfile)[0:-4]
bamdir="pe_bam"
outdir=args.outdir
igvfile="pe.txt"
bamfiscript="pe.sh"
outstring = os.path.basename(varfile)[0:-4]
bamdir = "pe_bam"
outdir = args.outdir
igvfile = "pe.txt"
bamfiscript = "pe.sh"
###################################
sample = args.sample
chromosome = args.chromosome


def ped_info_readin(ped_file):
out={}
fin=open(ped_file)
out = {}
fin = open(ped_file)
for line in fin:
pin=line.strip().split()
pin = line.strip().split()
if not pin[1] in out.keys():
out[pin[1]]=[pin[1]]
if not(pin[2])==0:
out[pin[1]] = [pin[1]]
if not(pin[2]) == 0:
out[pin[1]].append(pin[2])
if not(pin[3])==0:
if not(pin[3]) == 0:
out[pin[1]].append(pin[3])
fin.close()
return out


def cram_info_readin(cram_file):
out={}
fin=open(cram_file)
out = {}
fin = open(cram_file)
for line in fin:
pin=line.strip().split()
pin = line.strip().split()
if not pin[0] in out.keys():
out[pin[0]]=pin[1:]
out[pin[0]] = pin[1:]
fin.close()
return(out)


ped_info = ped_info_readin(args.ped)
cram_info = cram_info_readin(args.cram_list)
cram_list=[]
cram_list = []
for member in ped_info[sample]:
if member in cram_info.keys():
cram_list.append(cram_info[member][0])

with open(bamfiscript,'w') as h:
with open(bamfiscript, 'w') as h:
h.write("#!/bin/bash\n")
h.write("set -e\n")
h.write("mkdir -p {}\n".format(bamdir))
h.write("mkdir -p {}\n".format(outdir))
with open(igvfile,'w') as g:
with open(igvfile, 'w') as g:
g.write('new\n')
g.write('genome {}\n'.format(fasta))
with open(varfile,'r') as f:
with open(varfile, 'r') as f:
for line in f:
dat=line.rstrip().split("\t")
Chr=dat[0]
if not chromosome=='all':
if not Chr == chromosome: continue
Start=str(int(dat[1])-buff)
End=str(int(dat[2])+buff)
ID=dat[4]
dat = line.rstrip().split("\t")
Chr = dat[0]
if not chromosome == 'all':
if not Chr == chromosome:
continue
Start = str(int(dat[1]) - buff)
End = str(int(dat[2]) + buff)
ID = dat[4]
for cram in cram_list:
g.write('load '+cram+'\n')
if int(End)-int(Start)<10000:
g.write('goto '+Chr+":"+Start+'-'+End+'\n')
g.write('load ' + cram + '\n')
if int(End) - int(Start) < 10000:
g.write('goto ' + Chr + ":" + Start + '-' + End + '\n')
g.write('sort base\n')
g.write('viewaspairs\n')
g.write('squish\n')
g.write('snapshotDirectory '+outdir+'\n')
g.write('snapshot '+sample+'_'+ID+'.png\n' )
g.write('snapshotDirectory ' + outdir + '\n')
g.write('snapshot ' + sample + '_' + ID + '.png\n')
else:
g.write('goto '+Chr+":"+Start+'-'+str(int(Start)+1000)+'\n') # Extra 1kb buffer if variant large
# Extra 1kb buffer if variant large
g.write('goto ' + Chr + ":" + Start +
'-' + str(int(Start) + 1000) + '\n')
g.write('sort base\n')
g.write('viewaspairs\n')
g.write('squish\n')
g.write('snapshotDirectory '+outdir+'\n')
g.write('snapshot '+sample+'_'+ID+'.left.png\n' )
g.write('goto '+Chr+":"+str(int(End)-1000)+'-'+End+'\n')
g.write('snapshotDirectory ' + outdir + '\n')
g.write('snapshot ' + sample + '_' + ID + '.left.png\n')
g.write('goto ' + Chr + ":" +
str(int(End) - 1000) + '-' + End + '\n')
g.write('sort base\n')
g.write('squish\n')
g.write('snapshotDirectory '+outdir+'\n')
g.write('snapshot '+sample+'_'+ID+'.right.png\n' )
g.write('snapshotDirectory ' + outdir + '\n')
g.write('snapshot ' + sample + '_' + ID + '.right.png\n')
# g.write('goto '+Chr+":"+Start+'-'+End+'\n')
# g.write('sort base\n')
# g.write('viewaspairs\n')
Expand All @@ -110,4 +122,4 @@ def cram_info_readin(cram_file):
# g.write('snapshot '+ID+'.png\n' )
g.write('new\n')
g.write('exit\n')
# with open(bamfiscript,'w') as g:
# with open(bamfiscript,'w') as g:
Loading