
Giraffe alignment is very slow and produces warning[vg::Watchdog] messages unless rescue is disabled #4368

Open
polchan opened this issue Aug 7, 2024 · 14 comments


polchan commented Aug 7, 2024

Dear VG Team Members,

I hope this message finds you well.

While using the vg giraffe tool to align short reads, I encountered an issue similar to the one described in issue #4171. Specifically, some samples process successfully in approximately 20 minutes, while others fail to complete even after a full day of computation. This inconsistency was observed using vg version 1.58.0, with the graph constructed from 23 genomes using minigraph-cactus. Notably, the command executed for all samples was identical.

Given that some samples were processed successfully, I ruled out potential issues related to the graph and vg itself. Upon examining the differences between the samples, I noticed that the fastq.gz files for those that were easily processed are only half to a third of the size of those that encountered errors. These files originate from resequencing data provided by others. As such, I suspect that the errors may be due to the large size of the data in some samples.

Here is the log from a sample that failed to complete. The command was:

~/tools/vg giraffe -p -t 12 --max-extensions 1600 -Z panMalus.d2.gbz -d panMalus.d2.dist -m panMalus.d2.min -f /ngsproject/qmyu/ncbi/public/re_sequence/2021_Plant_Biotechnol_J/rawreads/SRR14557116_1.fastq.gz -f /ngsproject/qmyu/ncbi/public/re_sequence/2021_Plant_Biotechnol_J/rawreads/SRR14557116_2.fastq.gz > SRR14557116n.gam

Preparing Indexes
Loading Minimizer Index
Loading GBZ
Loading Distance Index v2
Paging in Distance Index v2
Initializing MinimizerMapper
Loading and initialization: 58.8633 seconds
Of which Distance Index v2 paging: 1.80507 seconds
Mapping reads to "-" (GAM)
--watchdog-timeout 10
--match 1
--mismatch 4
--gap-open 6
--gap-extend 1
--full-l-bonus 5
--max-multimaps 1
--hit-cap 10
--hard-hit-cap 500
--score-fraction 0.9
--max-min 500
--num-bp-per-min 1000
--distance-limit 200
--max-extensions 1600
--max-alignments 8
--cluster-score 50
--pad-cluster-score 20
--cluster-coverage 0.3
--extension-score 1
--extension-set 20
--rescue-attempts 15
--max-fragment-length 2000
--paired-distance-limit 2
--rescue-subgraph-size 4
--rescue-seed-limit 100
--chaining-cluster-distance 100
--precluster-connection-coverage-threshold 0.3
--min-precluster-connections 10
--max-precluster-connections 50
--max-lookback-bases 100
--min-lookback-items 1
--lookback-item-hard-cap 15
--chain-score-threshold 100
--min-chains 1
--chain-min-score 100
--max-chain-connection 100
--max-tail-length 100
--max-dp-cells 16777216
--rescue-algorithm dozeu
Using fragment length estimate: 433.691 +/- 19.1175
warning[vg::Watchdog]: Thread 0 has been checked in for 10 seconds processing: SRR14557116.18068, SRR14557116.18068
warning[vg::Watchdog]: Thread 2 has been checked in for 10 seconds processing: SRR14557116.20283, SRR14557116.20283
warning[vg::Watchdog]: Thread 11 has been checked in for 10 seconds processing: SRR14557116.19070, SRR14557116.19070
warning[vg::Watchdog]: Thread 6 has been checked in for 10 seconds processing: SRR14557116.25833, SRR14557116.25833
warning[vg::Watchdog]: Thread 4 has been checked in for 10 seconds processing: SRR14557116.26411, SRR14557116.26411
warning[vg::Watchdog]: Thread 1 has been checked in for 10 seconds processing: SRR14557116.29993, SRR14557116.29993
warning[vg::Watchdog]: Thread 8 has been checked in for 10 seconds processing: SRR14557116.28908, SRR14557116.28908
warning[vg::Watchdog]: Thread 10 has been checked in for 10 seconds processing: SRR14557116.29652, SRR14557116.29652
warning[vg::Watchdog]: Thread 7 has been checked in for 10 seconds processing: SRR14557116.32621, SRR14557116.32621
warning[vg::Watchdog]: Thread 9 has been checked in for 10 seconds processing: SRR14557116.33712, SRR14557116.33712
warning[vg::Watchdog]: Thread 7 finally checked out after 14 seconds and 20808 kb memory growth processing: SRR14557116.32621, SRR14557116.32621
warning[vg::Watchdog]: Thread 5 has been checked in for 10 seconds processing: SRR14557116.296057, SRR14557116.296057
warning[vg::Watchdog]: Thread 3 has been checked in for 10 seconds processing: SRR14557116.39673, SRR14557116.39673
warning[vg::Watchdog]: Thread 5 finally checked out after 23 seconds and 0 kb memory growth processing: SRR14557116.296057, SRR14557116.296057
warning[vg::Watchdog]: Thread 7 has been checked in for 10 seconds processing: SRR14557116.42053, SRR14557116.42053
warning[vg::Watchdog]: Thread 5 has been checked in for 10 seconds processing: SRR14557116.305002, SRR14557116.305002
warning[vg::Watchdog]: Thread 5 finally checked out after 20 seconds and 0 kb memory growth processing: SRR14557116.305002, SRR14557116.305002
warning[vg::Watchdog]: Thread 5 has been checked in for 10 seconds processing: SRR14557116.308221, SRR14557116.308221
warning[vg::Watchdog]: Thread 7 finally checked out after 1020 seconds and 24628 kb memory growth processing: SRR14557116.42053, SRR14557116.42053
warning[vg::Watchdog]: Thread 7 has been checked in for 10 seconds processing: SRR14557116.42113, SRR14557116.42113
warning[vg::Watchdog]: Thread 11 finally checked out after 10631 seconds and 508080 kb memory growth processing: SRR14557116.19070, SRR14557116.19070
warning[vg::Watchdog]: Thread 11 has been checked in for 10 seconds processing: SRR14557116.42588, SRR14557116.42588
warning[vg::Watchdog]: Thread 11 finally checked out after 2780 seconds and 0 kb memory growth processing: SRR14557116.42588, SRR14557116.42588
warning[vg::Watchdog]: Thread 11 has been checked in for 10 seconds processing: SRR14557116.43021, SRR14557116.43021

Despite running for 10 hours, the process did not complete and showed no signs of doing so even with extended time.

I experimented with various parameters, such as --max-extensions, --max-alignments, and --hard-hit-cap, but these adjustments had no effect and the samples continued to fail. However, when I modified the --rescue-algorithm parameter, I found that disabling rescue entirely made the run succeed. Here is the log from a successful run of the same sample, with the only difference being the --rescue-algorithm setting (~/tools/vg giraffe -p -t 12 -A none --max-extensions 1600 -Z panMalus.d2.gbz -d panMalus.d2.dist -m panMalus.d2.min -f /ngsproject/qmyu/ncbi/public/re_sequence/2021_Plant_Biotechnol_J/rawreads/SRR14557116_1.fastq.gz -f /ngsproject/qmyu/ncbi/public/re_sequence/2021_Plant_Biotechnol_J/rawreads/SRR14557116_2.fastq.gz > SRR14557116.gam):

Preparing Indexes
Loading Minimizer Index
Loading GBZ
Loading Distance Index v2
Paging in Distance Index v2
Initializing MinimizerMapper
Loading and initialization: 61.6039 seconds
Of which Distance Index v2 paging: 1.47501 seconds
Mapping reads to "-" (GAM)
--watchdog-timeout 10
--match 1
--mismatch 4
--gap-open 6
--gap-extend 1
--full-l-bonus 5
--max-multimaps 1
--hit-cap 10
--hard-hit-cap 500
--score-fraction 0.9
--max-min 500
--num-bp-per-min 1000
--distance-limit 200
--max-extensions 1600
--max-alignments 8
--cluster-score 50
--pad-cluster-score 20
--cluster-coverage 0.3
--extension-score 1
--extension-set 20
--rescue-attempts 0
--max-fragment-length 2000
--paired-distance-limit 2
--rescue-subgraph-size 4
--rescue-seed-limit 100
--chaining-cluster-distance 100
--precluster-connection-coverage-threshold 0.3
--min-precluster-connections 10
--max-precluster-connections 50
--max-lookback-bases 100
--min-lookback-items 1
--lookback-item-hard-cap 15
--chain-score-threshold 100
--min-chains 1
--chain-min-score 100
--max-chain-connection 100
--max-tail-length 100
--max-dp-cells 16777216
--rescue-algorithm none
Using fragment length estimate: 433.691 +/- 19.1175
Mapped 115609390 reads across 12 threads in 16840.4 seconds with 48.7877 additional single-threaded seconds.
Mapping speed: 571.946 reads per second per thread
Used 218606 CPU-seconds (including output).
Achieved 528.848 reads per CPU-second (including output)
Used 1164134888389127 CPU instructions (not including output).
Mapping slowness: 10.0696 M instructions per read at 5325.26 M mapping instructions per inclusive CPU-second
Memory footprint: 27.0476 GB

Based on this observation, could it be that the default --rescue-algorithm parameter is causing the warnings flagged by [vg::Watchdog] and leading to the failures? Additionally, I would like to inquire whether the Dozeu algorithm might be having an adverse effect when applied to samples with high sequencing depth.

I would greatly appreciate any insights or suggestions you might have regarding this issue.

Best regards,

Bo-Cheng Guo

zhangyixing3 commented Aug 7, 2024

Hi, it could be the same problem as #4207. I have solved that problem.

@zhangyixing3

Maybe you could apply a stricter filter when clipping the pangenome graph, for example d5 or d8 instead of d2.

polchan (Author) commented Aug 7, 2024

Hi, it could be the same problem as #4207. I have solved that problem.

@zhangyixing3 Thanks for your reply. I have been exploring various solutions to address the issue. Specifically, I tried renaming the graph, distance, and minimizer files, and I tested on both a local server and an HPC cluster. Unfortunately, these changes did not resolve the problem.

It was not until I modified the --rescue-algorithm parameter that I observed a significant improvement. Previously, I had been using the default Dozeu algorithm. After I disabled rescue entirely (i.e., set -A none), the samples that had previously failed were mapped into GAM files successfully, without triggering warning[vg::Watchdog].

Therefore, I suspect that the Dozeu rescue algorithm may be responsible for triggering the warning[vg::Watchdog]. In particular, it may cause runs on high-depth sequencing files to fail.

zhangyixing3 commented Aug 7, 2024

In my opinion, some reads differ too much from the reference genome or map to complex regions, and Giraffe runs the rescue algorithm to handle such situations. You might find that some read pairs are very hard to map, for example Thread 1 finally checked out after 62309 seconds and 0 kb memory growth processing: SRR19465.19470728, SRR19465.19470728. Setting -A none directly might be another solution.

polchan (Author) commented Aug 7, 2024

@zhangyixing3 In fact, all my samples belong to the same species, and the graph I used was built from genomes of that species as well, so it is unlikely that the issue is due to large divergence. Admittedly, I have not tested GSSW, so I cannot comment on its performance. Overall, I believe the Dozeu algorithm may have trouble processing certain samples.

jltsiren (Contributor) commented Aug 7, 2024

I think the specific issue here is the complexity of the rescue subgraph.

When Giraffe can find a good alignment for one read but not the pair, it extracts a subgraph at approximately the right distance from the aligned read. Then it tries to align the pair to the subgraph using a simplified version of the Giraffe aligner. We have some safeguards for too large subgraphs and for having too many seeds in the subgraph, but not for complex subgraphs.

Out of the two rescue algorithms, dozeu is faster but uses heuristics. GSSW is slower but always finds the best alignment. Neither of them is haplotype-aware, and both of them require that the graph they are aligning to is acyclic. If the relevant part of the rescue subgraph contains cycles, it will be transformed into an equivalent DAG. And if the rescue subgraph is complex, the DAG can be very large and aligning the read to it can be slow.

We used to have a prototype haplotype-aware rescue algorithm, but it was too slow to be practical. Perhaps we could try using our new haplotype-aware WFA here.

adamnovak (Member) commented Aug 8, 2024

If we think the problem is that the rescue DAGs get too big, we could add a limit and abort dagification and rescue if we hit it.

@polchan I don't think that the depth specifically is the problem, except that it provides more reads and thus more chances to hit it. The watchdog times how long each read pair takes to map individually and complains when that particular pair takes a long time. So if you see:

warning[vg::Watchdog]: Thread 11 finally checked out after 10631 seconds and 508080 kb memory growth processing: SRR14557116.19070, SRR14557116.19070

Then you should be able to reproduce that warning with a FASTQ containing only that one SRR14557116.19070 read pair, and it should still take ten thousand seconds to map on its own.
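
For example, a small shell sketch like the one below could pull that single pair out of the original FASTQs. The paths are the ones from the first comment, the output names are just placeholders, and the match assumes SRA-style headers where the read name is followed by a space; if your read names end in /1 and /2 instead, adjust the pattern accordingly.

# Extract one read pair by name so the slow pair can be re-mapped on its own.
READ=SRR14557116.19070
RAWDIR=/ngsproject/qmyu/ncbi/public/re_sequence/2021_Plant_Biotechnol_J/rawreads
for END in 1 2; do
  zcat ${RAWDIR}/SRR14557116_${END}.fastq.gz \
    | awk -v name="@${READ} " 'index($0, name) == 1 {keep = 4} keep > 0 {print; keep--}' \
    > slow_pair_${END}.fq   # placeholder output name
done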

Running a pair by itself with the --show-work and --track-provenance Giraffe options would make a GAM with embedded annotations noting how long it spent in each stage of the alignment process, and a debugging log that might have useful information about what it was doing that managed to take so long.

adamnovak changed the title from "Dozeu algorithm is a contributor to the warnings[vg::Watchdog]." to "Giraffe alignment is very slow and produces warning[vg::Watchdog] messages unless rescue is disabled" on Aug 8, 2024
@zhangyixing3

If we think the problem is that the rescue DAGs get too big, we could add a limit and abort dagification and rescue if we hit it.

If the rescue DAGs take too long, it would be better to impose some limitations rather than disable the rescue altogether.

When I was mapping sample 10023, I encountered the warnings below.

warning[vg::Watchdog]: Thread 3 finally checked out after 295 seconds and 23804 kb memory growth processing: SRR1946553.2920144, SRR1946553.2920144
warning[vg::Watchdog]: Thread 5 has been checked in for 10 seconds processing: SRR1946553.9728216, SRR1946553.9728216
warning[vg::Watchdog]: Thread 5 finally checked out after 143 seconds and 0 kb memory growth processing: SRR1946553.9728216, SRR1946553.9728216
warning[vg::Watchdog]: Thread 3 has been checked in for 10 seconds processing: SRR1946553.11939425, SRR1946553.11939425
warning[vg::Watchdog]: Thread 4 has been checked in for 10 seconds processing: SRR1946553.16735946, SRR1946553.16735946
warning[vg::Watchdog]: Thread 4 finally checked out after 13 seconds and 0 kb memory growth processing: SRR1946553.16735946, SRR1946553.16735946
warning[vg::Watchdog]: Thread 1 has been checked in for 10 seconds processing: SRR1946553.19470728, SRR1946553.19470728
warning[vg::Watchdog]: Thread 3 finally checked out after 2914 seconds and 30932 kb memory growth processing: SRR1946553.11939425, SRR1946553.11939425
warning[vg::Watchdog]: Thread 1 finally checked out after 62309 seconds and 0 kb memory growth processing: SRR1946553.19470728, SRR1946553.19470728
warning[vg::Watchdog]: Thread 8 has been checked in for 10 seconds processing: SRR1946553.25227565, SRR1946553.25227565
warning[vg::Watchdog]: Thread 8 finally checked out after 2871 seconds and 0 kb memory growth processing: SRR1946553.25227565, SRR1946553.25227565

I extracted reads SRR1946553.2920144, SRR1946553.9728216, SRR1946553.16735946, SRR1946553.25227565, SRR1946553.11939425, and SRR1946553.19470728 from the raw data. However, mapping them took only 5 seconds, and I did not reproduce the warnings.

vg giraffe -t 1 -p -Z ../../mapping.d5.giraffe.gbz -m ../../mapping.d5.min -d ../../mapping.d5.dist \
  -f test_1.fq -f test_2.fq --show-work --track-provenance 1>test_gam.log 2>test.log

Attachments: test.log, test_gam.log

@jltsiren (Contributor)

@zhangyixing3 When you are mapping a small subset of reads, you have to specify the fragment length distribution manually to ensure that the reads are mapped in the same way as in the full run.

For example, if you have the following line in the original log:

Using fragment length estimate: 433.691 +/- 19.1175

You add the following options to the vg giraffe command:

--fragment-mean 433.691 --fragment-stdev 19.1175
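
Put together with the earlier reproduction command, that would look roughly like the following. This is only a sketch: the index and FASTQ names are taken from the command above, and the fragment values here use the log line quoted above as an example; substitute the estimate from your own full run.

vg giraffe -t 1 -p -Z ../../mapping.d5.giraffe.gbz -m ../../mapping.d5.min -d ../../mapping.d5.dist \
  -f test_1.fq -f test_2.fq --show-work --track-provenance \
  --fragment-mean 433.691 --fragment-stdev 19.1175 \
  > test.gam 2> test.log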

@zhangyixing3

Thanks, I've already rerun it.

@zhangyixing3

Hi, dear sir,
Using the following command:

vg giraffe -t 1 -p -Z ../../mapping.d5.giraffe.gbz  -m ../../mapping.d5.min  -d  ../../mapping.d5.dist  \
      -f test_1.fq  -f test_2.fq   \
     --show-work  --track-provenance --fragment-mean 334.005  --fragment-stdev 28.9223  1>test.gam 2>test.log

The mapping process for six sequences took a long time (from August 12, 18:15 to August 14, 08:53). Here are the results.
Attachments: test.gam.gz, test.log.gz

@adamnovak (Member)

Yeah, it definitely looks like the rescues are taking all the time. There's warning[vg::Watchdog]: Thread 0 finally checked out after 125536 seconds and 0 kb memory growth processing: SRR1946553.19470728, SRR1946553.19470728, and between the 10-second check-in message and that check-out the log shows nothing but rescues and output. And the GAM-embedded stats for one of those reads show:

    "stage_align_results": 97,
    "stage_align_time": 0.016973553225398064,
    "stage_cluster_results": 66,
    "stage_cluster_time": 0.35473459959030151,
    "stage_extend_results": 66,
    "stage_extend_time": 0.14604765176773071,
    "stage_minimizer_results": 4,
    "stage_minimizer_time": 0.000012450000212993473,
    "stage_pairing_results": 0,
    "stage_pairing_time": 125535.9375,
    "stage_seed_results": 318,
    "stage_seed_time": 0.0020093880593776703,
    "stage_winner_results": 0,
    "stage_winner_time": 0.0099111190065741539

So all the time is going to the "pairing" stage where rescue happens.
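
For reference, a rough way to pull those per-read stage timings out of the annotated GAM (assuming vg view -a emits one JSON alignment per line and that the stage_* values live under each alignment's annotation field) is something like:

# Keep only the read name and the stage_* provenance annotations for each read.
vg view -a test.gam \
  | jq '{name: .name, stages: (.annotation // {} | with_entries(select(.key | startswith("stage_"))))}'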

We need to add those rescue work limits to resolve this.

zwh82 commented Sep 10, 2024

Is it possible to add a time limit for rescue? By counting the number of output reads, I found that only a very small portion of the reads (about 0.5%) exhibit this issue. I believe that even discarding these reads would not significantly affect the results.

@adamnovak (Member)

We much prefer to express limits in terms of work rather than wall-clock time since otherwise the results are not deterministic across systems, and so runs are very hard to repeat.
