Giraffe alignment is very slow and produces warning[vg::Watchdog] messages unless rescue is disabled #4368
Hi, it could be the same problem as #4207. I have solved this problem. Maybe you could filter out more samples when clipping the pangenome graph, for example by using the d5 or d8 version.
@zhangyixing3 Thanks for your reply. I have been exploring various solutions to address the issue at hand. Specifically, I attempted renaming the graph, distance, and minimizer files and tested both on a local server and an HPC cluster. Unfortunately, these changes did not resolve the problem. It was not until I modified the --rescue-algorithm parameter that I observed a significant improvement. Previously, I had been using the default Dozeu algorithm. By opting not to use any rescue algorithm (i.e., setting -A none), the samples that had previously failed to process were successfully computed into GAM files without triggering warning[vg::Watchdog]. Therefore, I suspect that the Dozeu rescue algorithm may be responsible for triggering the warning[vg::Watchdog]. In particular, this issue may lead to program failures when dealing with sequencing files of high depth.
In my opinion, some reads differ too much from the reference genome or map to complex regions. Giraffe executes the rescue algorithm to handle such situations. You might find that some data is difficult to map, for example: Thread 1 finally checked out after 62309 seconds and 0 kb memory growth processing: SRR19465.19470728, SRR19465.19470728. Setting -A none directly might be another solution.
@zhangyixing3 In fact, all my samples belong to the same species, and the graph I used includes this species as well, making it unlikely that the issue is due to significant differences. Admittedly, I have not tested GSSW, so I cannot comment on its performance. Overall, I believe that the Dozeu algorithm may have issues with processing certain samples.
I think the specific issue here is the complexity of the rescue subgraph. When Giraffe can find a good alignment for one read but not the pair, it extracts a subgraph at approximately the right distance from the aligned read. Then it tries to align the pair to the subgraph using a simplified version of the Giraffe aligner. We have some safeguards against overly large subgraphs and against having too many seeds in the subgraph, but not against complex subgraphs. Of the two rescue algorithms, dozeu is faster but uses heuristics, while GSSW is slower but always finds the best alignment. Neither of them is haplotype-aware, and both of them require that the graph they are aligning to is acyclic. If the (relevant part of the) rescue subgraph contains cycles, it will be transformed into an equivalent DAG. And if the rescue subgraph is complex, the DAG can be very large and aligning the read to it can be slow. We used to have a prototype haplotype-aware rescue algorithm, but it was too slow to be practical. Perhaps we could try using our new haplotype-aware WFA here.
If we think the problem is that the rescue DAGs get too big, we could add a limit and abort dagification and rescue if we hit it.

@polchan I don't think that the depth specifically is the problem, except that it provides more reads and thus more chances to hit it. The watchdog times how long each read pair takes to map individually and complains when that particular pair takes a long time. So if you see a Watchdog warning that names a particular read pair, you should be able to reproduce the warning with a FASTQ containing only that one pair, running it by itself.
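A rough sketch of how that isolation might look, assuming standard 4-line FASTQ records and that the read name reported by the watchdog (here the SRR19465.19470728 example quoted above) starts its header line; the FASTQ and index file names are placeholders:

```
# Read name taken from a Watchdog warning; substitute your own.
READ=SRR19465.19470728

# Pull the single 4-line FASTQ record for that read from each end.
# (Adjust the pattern if your headers carry /1, /2, or comment suffixes.)
zcat sample_1.fastq.gz | grep -A 3 "^@${READ}" > pair_1.fastq
zcat sample_2.fastq.gz | grep -A 3 "^@${READ}" > pair_2.fastq

# Re-map just that pair with the same graph and indexes as the full run.
vg giraffe -p -t 1 -Z graph.gbz -d graph.dist -m graph.min \
    -f pair_1.fastq -f pair_2.fastq > pair.gam
```

When mapping such a tiny subset, the fragment length distribution also has to be pinned manually so the pair is handled exactly as in the full run; see the fragment length discussion further down in this thread.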
If the rescue DAGs take too long, it would be better to impose some limits rather than disable rescue altogether. When I was mapping the 10023 sample, I encountered warnings like the ones below.
I extracted the reads named in those warnings and tried mapping them separately.
@zhangyixing3 When you are mapping a small subset of reads, you have to specify the fragment length distribution manually to ensure that the reads are mapped in the same way as in the full run. For example, if the original log reports the estimated fragment length distribution, you add the corresponding fragment length options to the vg giraffe command line for the subset.
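A minimal sketch of such a rerun, assuming the --fragment-mean and --fragment-stdev options of vg giraffe and purely illustrative values of 500 and 100; the real numbers must come from the fragment length line in your own full-run log:

```
# --fragment-mean / --fragment-stdev force the fragment length distribution
# instead of re-estimating it from the (tiny) input; values are placeholders.
vg giraffe -p -t 1 \
    --fragment-mean 500 --fragment-stdev 100 \
    -Z graph.gbz -d graph.dist -m graph.min \
    -f pair_1.fastq -f pair_2.fastq > pair.gam
```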
Thanks, I've already rerun it.
Hi, dear sir,
The mapping process for six sequences took a long time (from August 12, 18:15 to August 14, 08:53). Here are the results.
Yeah, it definitely looks like the rescues are taking all the time. All the time is going to the "pairing" stage, where rescue happens. We need to add those rescue work limits to resolve this.
Is it possible to add a time limit for rescue? By counting the number of output reads, I found that only a very small portion of the reads (about 0.5%) exhibit this issue. I believe that even discarding these reads would not significantly affect the results.
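A rough sketch of one way to do such a count, assuming vg stats -a reports total alignment counts for a GAM; file names are placeholders:

```
# Alignments actually written to the GAM so far.
vg stats -a sample.gam

# Read counts in the input FASTQs (4 lines per record).
echo $(( $(zcat sample_1.fastq.gz | wc -l) / 4 ))
echo $(( $(zcat sample_2.fastq.gz | wc -l) / 4 ))
```

Comparing the GAM total against the FASTQ read counts then gives the fraction of pairs that never made it into the output.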
We much prefer to express limits in terms of work rather than wall-clock time since otherwise the results are not deterministic across systems, and so runs are very hard to repeat. |
Dear VG Team Members,
I hope this message finds you well.
While using the vg giraffe tool to align short reads, I encountered an issue similar to the one described in issue #4171. Specifically, some samples process successfully in approximately 20 minutes, while others fail to complete even after a full day of computation. This inconsistency was observed using vg version 1.58.0, with the graph constructed from 23 genomes using minigraph-cactus. Notably, the command executed for all samples was identical.

Given that some samples were processed successfully, I ruled out potential issues related to the graph and vg itself. Upon examining the differences between the samples, I noticed that the fastq.gz files for those that were easily processed are only half to a third of the size of those that encountered errors. These files originate from resequencing data provided by others. As such, I suspect that the errors may be due to the large size of the data in some samples.
Here is a log from a sample that failed to run:
~/tools/vg giraffe -p -t 12 --max-extensions 1600 -Z panMalus.d2.gbz -d panMalus.d2.dist -m panMalus.d2.min -f /ngsproject/qmyu/ncbi/public/re_sequence/2021_Plant_Biotechnol_J/rawreads/SRR14557116_1.fastq.gz -f /ngsproject/qmyu/ncbi/public/re_sequence/2021_Plant_Biotechnol_J/rawreads/SRR14557116_2.fastq.gz > SRR14557116n.gam
Despite running for 10 hours, the process did not complete and showed no signs of doing so even with extended time.
I experimented with various parameters, such as --max-extensions, --max-alignments, and --hard-hit-cap, but found that these adjustments had no effect and the samples continued to fail. However, when I modified the --rescue-algorithm parameter, I discovered that not using any rescue algorithm yielded unexpected results. Here is the log from a successful run of the same sample, with the only difference being the change in the --rescue-algorithm parameter:

~/tools/vg giraffe -p -t 12 -A none --max-extensions 1600 -Z panMalus.d2.gbz -d panMalus.d2.dist -m panMalus.d2.min -f /ngsproject/qmyu/ncbi/public/re_sequence/2021_Plant_Biotechnol_J/rawreads/SRR14557116_1.fastq.gz -f /ngsproject/qmyu/ncbi/public/re_sequence/2021_Plant_Biotechnol_J/rawreads/SRR14557116_2.fastq.gz > SRR14557116.gam

Based on this observation, could it be that the default --rescue-algorithm parameter is causing the warnings flagged by [vg::Watchdog] and leading to the failures? Additionally, I would like to inquire whether the Dozeu algorithm might be having an adverse effect when applied to samples with high sequencing depth. I would greatly appreciate any insights or suggestions you might have regarding this issue.
Best regards,
Bo-Cheng Guo