Getting your hands dirty on serious RDF/Linked Data processing – Step 2: Loading and processing large amounts of RDF data

This continues my previous post on playing around with HDT and seeing whether we can load and process some really serious data.
As an example of a big RDF dataset, we will use the PubChem RDF data. Thanks to Maulik Kamdar for the pointer!
Maulik told me that in some earlier attempts he had difficulties loading that dataset with HDT, as he ran into space problems. So, let's see how far we get with the current HDT version (I am using a machine with 200GB RAM, a 500GB disk, and running Ubuntu 16.04, FWIW).
First, we need to download the respective .ttl.gz files as described on the PubChem page.
As this will take a while, I started the script with nohup, something like:
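I haven't included the download script itself here; a minimal sketch using wget (the FTP URL and directory layout are assumptions based on the PubChemRDF download page, so double-check them there) could look like this:
#! /bin/bash
# download_script.sh -- mirror the compressed Turtle dumps, keeping the per-directory layout
# (--cut-dirs=2 drops the leading pubchem/RDF/ path components so folders like endpoint/ land in the current directory)
wget -r -np -nH --cut-dirs=2 -A 'ttl.gz' ftp://ftp.ncbi.nlm.nih.gov/pubchem/RDF/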
$ nohup ./download_script.sh &
[1] 2312
The returned number is the process ID, which will be useful for checking whether the job is still running.
Once this was done, let's proceed to create HDTs for the downloaded .ttl.gz files.
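For example, to check on the download later (using PID 2312 from above), a couple of standard commands do the job:
$ ps -p 2312          # still listed? then the download script is still running
$ tail -f nohup.out   # follow the output that nohup collects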
I have decided to do one HDT per directory and then merge them.
So, e.g., to create the HDT for the endpoint/ folder you'd run:
$ cd endpoint
$ rm -f all.ttl.gz; zcat *.ttl.gz | gzip > all.ttl.gz ; rdf2hdt -f ttl all.ttl.gz all.hdt ; rm all.ttl.gz
Since this actually runs looong (the unpacked .ttl.gz files are around 500GB), it's probably better to let a script (we call it hdt_directories_script.sh) do the work and go for a coffee or leave it over the weekend 😉
#! /bin/bash
for i in $(ls -d */ | sed -e 's/\///g') ; do
       echo "Creating ${i}_all.ttl.gz"
       cat `find ${i} -name "*.ttl.gz" -print | tr '\n' ' '` > ${i}_all.ttl.gz
       echo "Creating ${i}_all.hdt"
       rdf2hdt -f ttl ${i}_all.ttl.gz ${i}_all.hdt
done
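One small thing before launching it: if you just created the script, make it executable first:
$ chmod +x hdt_directories_script.sh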
Now, let’s run this through:
$ nohup ./hdt_directories_script.sh &
[1] 4416
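One tweak I would make in hindsight (see my note about timing further down): prefix the rdf2hdt call inside the script with GNU time, so the per-directory run times end up in nohup.out. A minimal sketch, assuming /usr/bin/time is installed:
       # inside hdt_directories_script.sh, instead of the plain rdf2hdt call:
       /usr/bin/time -v rdf2hdt -f ttl ${i}_all.ttl.gz ${i}_all.hdt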
Hope that runs through; see you on Monday to see how we can merge these HDTs 😉
Update 1, Sunday evening… I didn't yet check in detail how long this took, but all seems to have worked fine. Here are all the created HDT files; it looks like the bigger ones needed some 7-8 hrs max to create:
-rw-r--r-- 1 poll poll 14379369 Jan 26 22:41 bioassay_all.hdt
-rw-r--r-- 1 poll poll 26932724 Jan 26 22:42 biosystem_all.hdt
-rw-r--r-- 1 poll poll 217454 Jan 26 22:42 concept_all.hdt
-rw-r--r-- 1 poll poll 2160706 Jan 26 22:42 conserveddomain_all.hdt
-rw-r--r-- 1 poll poll 4095466858 Jan 27 02:09 endpoint_all.hdt
-rw-r--r-- 1 poll poll 19693329 Jan 27 02:09 gene_all.hdt
-rw-r--r-- 1 poll poll 6756298527 Jan 27 04:29 inchikey_all.hdt
-rw-r--r-- 1 poll poll 2825071919 Jan 27 05:54 measuregroup_all.hdt
-rw-r--r-- 1 poll poll 13119606 Jan 27 05:55 protein_all.hdt
-rw-r--r-- 1 poll poll 3305089984 Jan 27 07:03 reference_all.hdt
-rw-r--r-- 1 poll poll 34257 Jan 27 07:03 source_all.hdt
-rw-r--r-- 1 poll poll 11392523014 Jan 27 15:30 synonym_all.hdt
So, let's proceed again to merge the resulting HDTs:
$ nohup ../tools/mergeHDT pubchem_all.hdt *hdt &
Let's go for a break and wait again (note to self: again I forgot to time that job, but I started it around 5:15am CET on the 28th on the machine…).
Update 2: I had to restart this on a machine with about double (200GB) the RAM I had used originally (100GB) and fixed the script… so the text above has changed slightly.
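Once the merge finishes, a quick sanity check should be possible with the other hdt-cpp command line tools (assuming hdtInfo and hdtSearch were built alongside rdf2hdt and are on the PATH):
$ hdtInfo pubchem_all.hdt     # print the header metadata, including the total number of triples
$ hdtSearch pubchem_all.hdt   # interactive triple-pattern queries against the merged file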
