
hdtCat error in LongArrayDisk with large files #211

Open
balhoff opened this issue Jun 20, 2024 · 6 comments
balhoff commented Jun 20, 2024

I'm trying to merge two HDT files using hdtCat.sh. Each file has more than 13 billion triples:

  • file 1 has 13736601325 triples
  • file 2 has 13827925785 triples

After about 25 hours I get this error:

Exception in thread "main" java.lang.ArrayIndexOutOfBoundsException: Index -4 out of bounds for length 29
	at org.rdfhdt.hdt.util.disk.LongArrayDisk.get(LongArrayDisk.java:116)
	at org.rdfhdt.hdt.dictionary.impl.utilCat.CatMappingBack.set(CatMappingBack.java:77)
	at org.rdfhdt.hdt.dictionary.impl.FourSectionDictionaryCat.cat(FourSectionDictionaryCat.java:244)
	at org.rdfhdt.hdt.hdt.impl.HDTImpl.cat(HDTImpl.java:486)
	at org.rdfhdt.hdt.hdt.HDTManagerImpl.doHDTCat(HDTManagerImpl.java:329)
	at org.rdfhdt.hdt.hdt.HDTManager.catHDT(HDTManager.java:642)
	at org.rdfhdt.hdt.tools.HDTCat.cat(HDTCat.java:82)
	at org.rdfhdt.hdt.tools.HDTCat.execute(HDTCat.java:116)
	at org.rdfhdt.hdt.tools.HDTCat.main(HDTCat.java:184)

I tried both v3.0.10 and v3.0.9 with the same result. I can provide these files, but each is about 170 GB. I haven't run into this issue with any smaller files.
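For context on why the index comes out negative: one plausible cause (an assumption, not confirmed against the LongArrayDisk source) is a 64-bit index being truncated to a 32-bit `int` somewhere in the mapping arithmetic. Both input files here have more than 2^31 entries, and in Java casting a `long` to `int` silently keeps only the low 32 bits, which can yield a negative value. A minimal sketch with a hypothetical index value:

```java
// Sketch (hypothetical value): a long index past Integer.MAX_VALUE
// wraps around when narrowed to int.
public class IndexOverflowDemo {
    public static void main(String[] args) {
        long bigIndex = 3_000_000_000L;  // > Integer.MAX_VALUE (2_147_483_647)
        int truncated = (int) bigIndex;  // keeps low 32 bits: -1_294_967_296
        System.out.println(truncated);   // prints -1294967296
        // Indexing any array with such a truncated value throws
        // ArrayIndexOutOfBoundsException with a negative index,
        // similar to the "Index -4" seen in the stack trace above.
    }
}
```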


D063520 commented Jun 21, 2024

Hi, could you try this out:

https://github.com/the-qa-company/qEndpoint/wiki/qEndpoint-CLI-commands#hdtdiffcat-qep-specific

It is an evolution of the tool.


balhoff commented Jun 21, 2024

@D063520 thank you for pointing that out, I hadn't come across it yet. I'm trying it now.


balhoff commented Jun 24, 2024

@D063520 the qEndpoint tool worked! It also seems a good bit faster, but it uses quite a bit more RAM. I had originally been using a max heap of 150 GB, but ended up increasing it three times until it succeeded with a 400 GB heap. Now I've got an HDT file containing 27.5 billion triples.


balhoff commented Jun 24, 2024

@D063520 actually I used hdtCat.sh from your package, rather than hdtDiffCat. Are these different?


D063520 commented Jun 26, 2024

@ate47


ate47 commented Jun 26, 2024

If you use the -kcat option it's the same; otherwise, by default the qEndpoint CLI uses the disk-optimized version while the rdfhdt CLI uses the in-memory version. The in-memory one is slow and inefficient.
