Memory limit exceeded during vg autoindex for GCSA/LCP indexing #4404

Open
HuangXZhuo opened this issue Sep 20, 2024 · 15 comments

@HuangXZhuo

Hello,

I am encountering an issue when running vg autoindex to construct a graph from an HG002 reference FASTA and VCF file. The command I am using is as follows:
vg autoindex --workflow map --threads 24 --prefix /public1/home/sc30852/HG002/vg/graph --ref-fasta ../../hg002.mat.fasta --vcf ../mat.vcf.gz
Here is part of the log output:
[IndexRegistry]: Checking for phasing in VCF(s).
[IndexRegistry]: Chunking inputs for parallelism.
[IndexRegistry]: Chunking FASTA(s).
[IndexRegistry]: Chunking VCF(s).
[IndexRegistry]: Constructing VG graph from FASTA and VCF input.
[IndexRegistry]: Constructing XG graph from VG graph.
[IndexRegistry]: Pruning complex regions of VG to prepare for GCSA indexing.
[IndexRegistry]: Constructing GCSA/LCP indexes.
PathGraphBuilder::write(): Memory use of file 5 of kmer paths (503.81 GB) exceeds memory limit (503.781 GB).

It seems like the memory consumption during the GCSA indexing step exceeds the available memory (around 504 GB). Do you have any suggestions on how I can reduce memory usage, or is there a way to chunk the input differently to avoid this issue?

Any help would be appreciated!

Thank you!

@jltsiren
Contributor

What do you have in the VCF file? Usually when GCSA construction runs out of memory, it happens because the graph is too complex, there are too many variants in repetitive regions, or there are too many nondeterministic locations (where the first/last bases of reference and alternative nodes are identical).
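
If it helps to check, a quick way to profile the VCF is something like the following (a sketch, not from the original comment; it assumes bcftools is installed and reuses the file name from the command above):

# Summary numbers: record counts, SNP/indel/multiallelic totals.
# Dense clusters of indels or many multiallelic sites are warning signs.
bcftools stats ../mat.vcf.gz | grep '^SN'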

@HuangXZhuo
Author

I used the commands:
nucmer -t 24 ../hg002.mat.fasta ../hg002.pat.fasta
delta-filter -i 99 -l 100000 out.delta > out.delta.filter
to align the maternal and paternal genomes of HG002 using MUMmer for diploid mapping. Could it be that the VCF file generated from this process makes the graph too complex?

@jltsiren
Contributor

I'm not very familiar with MUMmer, but I think the problem is that you have too much duplicated sequence in the graph. GCSA construction does not like that, because it can't collapse identical k-mers if they start from different positions in the graph.

If you want to build a graph based on two aligned haplotypes, Minigraph-Cactus should be a better choice. You can then map reads using Giraffe, which is faster than vg map. I'm not sure if the default clipped graph or the full unclipped graph is a better choice, so you should probably try both.
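
For orientation, a build-and-map round trip might look roughly like this (a sketch only; the jobstore, the seqfile listing the two assemblies, and the output names are placeholders, and the exact cactus-pangenome options should be checked against the Minigraph-Cactus documentation):

# Build a pangenome graph from a seqfile listing the MAT and PAT assemblies,
# producing Giraffe indexes alongside the graph.
cactus-pangenome ./jobstore ./hg002.seqfile --outDir hg002-pg --outName hg002 --reference MAT --giraffe

# Map reads with Giraffe against the resulting GBZ.
vg giraffe -Z hg002-pg/hg002.gbz -f reads.fq.gz > mapped.gam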

@HuangXZhuo
Author

Thank you so much for your advice! I will definitely try using Minigraph-Cactus to build the graph. I appreciate your help and will let you know how it goes after testing. Thanks again!

@ld9866

ld9866 commented Sep 20, 2024

Dear developer:
I generated a VCF file from 29 genomes using the Minigraph-Cactus pipeline, and now I want to do some pan-transcriptome analysis, so I need to build the corresponding indexes.

There is still enough storage, but the task terminates automatically after running for a day. How can I resolve this?

Only sample.trans.spliced.gcsa and sample.trans.spliced.gcsa.lcp were not produced; the other files are fine.

Best
Dong

Code:
vg autoindex --threads 32 --workflow mpmap --workflow rpvg --prefix sample.trans --ref-fasta reference.fa --vcf sample.result.vcf.gz --tx-gff Duroc.111.chr1-18.gtf --tmp-dir /home/test/nvmedata2/02.Pantrans/TMP -M 850G
Error log:
warning:[vg::Constructor] Skipping duplicate variant with hash c5757e8eca1e42a9bafd6bf1aed0bacad2826367 at 1:274146418
[IndexRegistry]: Constructing GBWT from spliced VG graph and phased VCF input.
[IndexRegistry]: Merging contig GBWTs.
[IndexRegistry]: Stripping allele paths from spliced VG.
[IndexRegistry]: Constructing haplotype-transcript GBWT and finishing spliced VG.
[IndexRegistry]: Merging contig GBWTs.
[IndexRegistry]: Joining transcript origin table.
[IndexRegistry]: Constructing spliced XG graph from spliced VG graph.
[IndexRegistry]: Constructing distance index for a spliced graph.
[IndexRegistry]: Pruning complex regions of spliced VG to prepare for GCSA indexing with GBWT unfolding.
[IndexRegistry]: Constructing GCSA/LCP indexes.
PathGraphBuilder::write(): Memory use of file 0 of kmer paths (850.002 GB) exceeds memory limit (850 GB)
PathGraphBuilder::write(): Memory use of file 0 of kmer paths (850.045 GB) exceeds memory limit (850 GB)
[IndexRegistry]: Exceeded disk or memory use limit while performing k-mer doubling steps. Rewinding to pruning step with more aggressive pruning to simplify the graph.
[IndexRegistry]: Pruning complex regions of spliced VG to prepare for GCSA indexing with GBWT unfolding.
[IndexRegistry]: Constructing GCSA/LCP indexes.
PathGraphBuilder::write(): Memory use of file 0 of kmer paths (850.017 GB) exceeds memory limit (850 GB)
PathGraphBuilder::write(): Memory use of file 0 of kmer paths (850.06 GB) exceeds memory limit (850 GB)
[IndexRegistry]: Exceeded disk or memory use limit while performing k-mer doubling steps. Rewinding to pruning step with more aggressive pruning to simplify the graph.
[IndexRegistry]: Pruning complex regions of spliced VG to prepare for GCSA indexing with GBWT unfolding.
[IndexRegistry]: Constructing GCSA/LCP indexes.
PathGraphBuilder::write(): Memory use of file 0 of kmer paths (850.039 GB) exceeds memory limit (850 GB)
PathGraphBuilder::write(): Memory use of file 0 of kmer paths (850.082 GB) exceeds memory limit (850 GB)
[IndexRegistry]: Exceeded disk or memory use limit while performing k-mer doubling steps. Rewinding to pruning step with more aggressive pruning to simplify the graph.
[IndexRegistry]: Pruning complex regions of spliced VG to prepare for GCSA indexing with GBWT unfolding.
[IndexRegistry]: Constructing GCSA/LCP indexes.
DiskIO::write(): Write failed
DiskIO::write(): You may have run out of temporary disk space at /home/test/nvmedata2/02.Pantrans/TMP
[IndexRegistry]: Unrecoverable error in GCSA2 indexing.

@adamnovak
Member

@jeizenga Do you know why the GCSA write might fail even if the internal size limit is not exceeded?

@ld9866 How much disk space is actually present at /home/test/nvmedata2/02.Pantrans/TMP when you see this? What did you do to test whether it was sufficient?

@jeizenga
Contributor

The error robustness in vg autoindex for GCSA indexing all assumes that you run up to a maximum memory/disk space limit that is set by the program. When you do that, GCSA gives a particular error that can be caught and handled. If instead your system runs out of space before hitting this limit (e.g., because there is too little disk space, or because a Linux policy limits the disk usage of your temp directory), then the error is determined by the OS, and it is not caught.

I would advise users to fix the disk availability issues rather than dig into the internals of vg autoindex if at all possible. For example, you could use -T to redirect temp files to a location with more disk availability. If the limited disk space is truly insurmountable based on your available hardware, there is an option --gcsa-size-limit that you can use to reduce the target maximum disk space to an amount that your system can handle (given in units of bytes).
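
Concretely, that might look like this (a sketch; the temp path and the limit value are placeholders, and --gcsa-size-limit takes a number of bytes):

# Same workflow as above, but with temp files redirected and a lower GCSA size target.
vg autoindex --threads 32 --workflow mpmap --workflow rpvg --prefix sample.trans --ref-fasta reference.fa --vcf sample.result.vcf.gz --tx-gff Duroc.111.chr1-18.gtf -T /path/with/more/space --gcsa-size-limit 500000000000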

@ld9866

ld9866 commented Sep 25, 2024

Dear developers:
I think this was indeed the ordinary problem of not having enough storage: I realized I was running out of space once the drive was down to 1 TB free, so I switched to a drive that still has 100 TB of storage and tested again.
Unfortunately, there is another problem:

Code:
vg autoindex --threads 64 --workflow mpmap --workflow rpvg --prefix test.rpvg --ref-fasta test.fa --vcf test.result.vcf.gz --tx-gff test.gtf --tmp-dir /home/lidong/Data/10.pantrans/TMP -M 850G

Error log:
[IndexRegistry]: Constructing GBWT from spliced VG graph and phased VCF input.
[IndexRegistry]: Merging contig GBWTs.
[IndexRegistry]: Stripping allele paths from spliced VG.
[IndexRegistry]: Constructing haplotype-transcript GBWT and finishing spliced VG.
[IndexRegistry]: Merging contig GBWTs.
[IndexRegistry]: Joining transcript origin table.
[IndexRegistry]: Constructing spliced XG graph from spliced VG graph.
[IndexRegistry]: Constructing distance index for a spliced graph.
[IndexRegistry]: Pruning complex regions of spliced VG to prepare for GCSA indexing with GBWT unfolding.
[IndexRegistry]: Constructing GCSA/LCP indexes.
InputGraph::checkK(): Invalid kmer length in graph file /home/lidong/Data/10.pantrans/TMP/vg-kqDzCD/vg-kmers-tmp-DxFyw5
InputGraph::checkK(): Expected 16, got 0
[IndexRegistry]: Unrecoverable error in GCSA2 indexing.

@jeizenga
@adamnovak

@adamnovak
Member

I'm not able to see how the GCSA code can read a 0 off disk for a kmer length, unless it was actually written there, or the filesystem manages to return binary 0s when reading binary-format kmer data. It looks like something simple, such as a file being truncated, can't trigger this.

We don't have error checking at the close call when we write these files (vg/src/kmer.cpp, line 351 at commit d6ea214):

out.close();

And I suppose it's possible for close() to fail and leave a bunch of 0 bytes visible in the file. But I can't see why that would happen unless the filesystem was almost exactly full and we couldn't successfully flush the very end of the file.

@jltsiren do you know how the GCSA library might be getting convinced to write a kmer length of 0 into its temp files? Maybe if it ends up processing an empty subgraph somehow?

@ld9866 What do you get for mkdir -p /home/lidong/Data/10.pantrans/TMP && df -T /home/lidong/Data/10.pantrans/TMP? In particular, what is in the Type column, which gives the filesystem type? If it is a 100 TB filesystem, it is presumably a distributed filesystem, and some of those interpret what a filesystem needs to do rather creatively.

If your particular distributed filesystem doesn't actually guarantee that immediately after a successful close() by one process, another process on the same machine will be able to read the data written if it does an open(), then the child process that reads these files might be seeing 0s instead of actual data, and that could cause this problem.
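
A crude probe of that behavior (a sketch, and certainly not a complete test of close-to-open consistency) is to write a file in one process and immediately read it back from another:

# Write and fsync a test file, then hash it from a fresh process.
# Zeros, a short read, or an error here would point squarely at the filesystem.
dd if=/dev/urandom of=/home/lidong/Data/10.pantrans/TMP/probe bs=1M count=16 conv=fsync
md5sum /home/lidong/Data/10.pantrans/TMP/probe

A clean result here does not rule the filesystem out, but an obvious failure would be telling.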

adamnovak added a commit that referenced this issue Sep 26, 2024
This might help with problems like #4404 (comment) by letting us tell whether the kmer files were actually accepted by the filesystem or not.

@jltsiren
Contributor

@adamnovak The error is from InputGraph, which only reads input files to GCSA construction. The temporary files created by the GCSA library are read using PathGraph.

Each input file consists of one or more sections, and each section starts with a header that defines kmer length. The error seems to occur in the InputGraph constructor, which reads all section headers. At some point, it manages to read a header containing kmer length 0 without reaching the end of the file.
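
If the temporary file from the error message is still on disk, the header bytes can be inspected directly (a sketch; xxd is assumed to be available, and the exact header layout is defined by the GCSA2 library):

# Dump the first bytes of the kmer file; a section header whose kmer
# length field reads as 0 should show up as zeros near the start.
xxd -l 64 /home/lidong/Data/10.pantrans/TMP/vg-kqDzCD/vg-kmers-tmp-DxFyw5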

@ld9866

ld9866 commented Sep 26, 2024

Dear developer:

Now what can I do to solve this problem?

Best
Dong

@ld9866

ld9866 commented Sep 26, 2024

df -T /home/lidong/Data/10.pantrans/TMP

Filesystem Type 1K-blocks Used Available Use% Mounted on
/dev/sda1 ext4 295793979636 183805691260 97144184544 66% /data

@adamnovak
Member

@jltsiren I guess these are really vg's temp files and not properly GCSA's.

@ld9866 You have an approximately three hundred terabyte single block device? Can you describe that storage setup in more detail to convincingly rule it out as the source of the problem? How confident are you in the quality of its implementation?

If the storage is working, and vg really is writing 0-size kmers into the GCSA input files, then someone has to go through vg's kmer.cpp and figure out how that could happen. That might involve adding checks before writing to ensure a 0-size kmer is never written, logging what is going on if a check fails, and then running that modified build of vg on the data that produces the problem.

Are you able to share the input files you are using?

As a workaround, you might be able to generate the indexes you need with a series of vg commands instead of using the vg autoindex wrapper. Then you could manually adjust the pruning parameters to see if you can get a graph that the GCSA can be built for; maybe vg autoindex is pruning it too much or not enough, somehow. @jeizenga do we have documentation that would explain how to get the pantranscriptome indexes without autoindex?
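
That series of commands might look roughly like this (a sketch of the GBWT-unfolding prune-and-index recipe; file names are placeholders, and vg prune's pruning options are exactly the knobs to adjust):

# Prune complex regions, unfolding haplotype paths with the GBWT so that
# observed haplotypes survive pruning; the node mapping records ID changes.
vg prune -u -g spliced.gbwt -m node_mapping spliced.vg > spliced.pruned.vg

# Build the GCSA/LCP pair from the pruned graph, passing the node mapping
# so kmer positions can be translated back to the original graph.
vg index -g spliced.gcsa -f node_mapping spliced.pruned.vg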

@ld9866

ld9866 commented Sep 30, 2024

Dear developer:

I have tested the process on a single extracted chromosome, and the construction works. I suspect the failure is caused by too much variant information. Do you have a good way to keep only the informative variants?

Or is there a way to build an index for a very large set of variants?

Best
Dong

@jltsiren @adamnovak

@jeizenga
Contributor

Yes, there is some documentation about how to make the spliced pangenome graph with vg rna here: https://github.com/vgteam/vg/wiki/Transcriptomic-analyses

Manual GCSA construction is documented here: https://github.com/vgteam/vg/wiki/Index-Construction#complex-graph
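
If graph complexity from the sheer number of variants is the underlying issue, thinning the VCF before construction can also help (a sketch using bcftools, which is not part of vg; the frequency cutoff is an arbitrary placeholder):

# Normalize the VCF and split multiallelic records, then drop rare alleles
# to thin out dense variant clusters before graph construction.
bcftools norm -f reference.fa -m -any sample.result.vcf.gz -Oz -o normalized.vcf.gz
bcftools view -q 0.05:minor normalized.vcf.gz -Oz -o filtered.vcf.gz
tabix -p vcf filtered.vcf.gz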
