RDF Data Wrangling: How to work with huge RDF files…

When working with Bio2RDF and PubChem you quickly realize that you sometimes need to deal with pretty large gzipped TTL and NQ files. Some of the tools in your RDF tool chain may not swallow those, though, or you simply can’t afford to unpack those files due to sheer disk space restrictions (unless you have a really large disk or cloud storage at hand).
So what can you do?
This post suggests a Bash way of solving two problems in this context that I recently encountered:
  1. converting a gzipped N-Quads file (.nq.gz) into .nt.gz (i.e., throwing away the context URIs and extracting just the plain triples), and
  2. splitting huge N-Triples or Turtle files (.nt.gz or .ttl.gz) into “digestible” chunks.
As for 1, we can use the following bash one-liner to merge all *.nq.gz files in one directory into a single gzipped N-Triples file (output.nt.gz):
$ gunzip -c *.nq.gz | sed -e 's/ \.$//' -e 's/ <[^>]*>$/ ./' | gzip > output.nt.gz
What this does is simply strip off the trailing context URI at the end of each quad and pipe the result into a new N-Triples file that is directly gzipped again… without ever storing an intermediate uncompressed file… Might be useful.
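To illustrate (with made-up URIs), a single quad turns into a plain triple like this:

$ echo '<http://ex.org/s> <http://ex.org/p> "o" <http://ex.org/g> .' | \
    sed -e 's/ \.$//' -e 's/ <[^>]*>$/ ./'
<http://ex.org/s> <http://ex.org/p> "o" .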
Update (after the painful experience that this didn’t work as expected on a huge 60GB+ nq.gz dump ;-)): Attention, this doesn’t always work! Imagine an n-quad looking like this:
  "x < 3"  .
Ouch! The following fixes that by first replacing “<” in between quotes with something else, e.g. ±, and then replacing it back afterwards. Bit of a hack again, since also here you need to make sure that the auxiliary character doesn’t appear elsewhere in the n-quads (but not as bad, since at least it would only replace that character within quoted strings, so it wouldn’t make your parser choke ;-)). First, pick the auxiliary character:
export auxchar=±

and replace the sed command in the one-liner above with the following:

sed -e "s/\(\".*\)<\(.*\"\)/\1${auxchar}\2/g" \     -e s/' \.$'// -e "s/ ]*>$/ \./" \ 
    -e "s/\(\".*\)${auxchar}\(.*\"\)/\1<\2/g"

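Putting it all together, the complete (and hopefully safer) one-liner then reads:

$ export auxchar=±
$ gunzip -c *.nq.gz | \
    sed -e "s/\(\".*\)<\(.*\"\)/\1${auxchar}\2/g" \
        -e 's/ \.$//' -e 's/ <[^>]*>$/ ./' \
        -e "s/\(\".*\)${auxchar}\(.*\"\)/\1<\2/g" | \
    gzip > output.nt.gz
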
As for 2, now how do you split a huge .ttl.gz or .nt.gz file into smaller ones
  • without having to uncompress it entirely to disk, and
  • splitting correctly, that is, making sure that the split files are all proper Turtle files, preserving prefixes and not splitting in the middle of Turtle shortcuts?

I did this with the following script (no one-liner here, sorry):

#! /bin/bash

# splitlines is a variable that tells the script after how many
# Turtle triple groups it should split. At the moment, we simply
# split by assuming that each trailing '.' at the end of a line
# ends a triple group.
# The default is 200000000, but depending on the size of your
# ttl.gz or your memory you may want to opt for smaller chunks.
# splitlines can be changed/overridden with option -l
splitlines=200000000

# auxchar is an auxiliary character that the script needs to
# replace ends of lines in an intermediate step. You need to
# make sure that it doesn't appear in the Turtle file you want
# to split. Caveat: tr(1) operates on bytes, so the multi-byte
# default ± is only safe for plain-ASCII data; for UTF-8 input,
# better pick an unused single-byte character.
# auxchar can be changed/overridden with option -x
auxchar=±

while getopts l:x: opt; do
  case $opt in
    l)
      splitlines=$OPTARG
      echo "splitlines set to $splitlines explicitly by option -l" >&2
      ;;
    x)
      auxchar=$OPTARG
      echo "auxchar set to $auxchar explicitly by option -x" >&2
      ;;
    \?)
      echo "Invalid option: -$OPTARG" >&2
      ;;
  esac
done

# get the last argument (filename); expects a ttl.gz:
fn="${@: -1}"

# Then create all the split gz files.
# Note: since ${fn} already ends in .gz, the generated file
# names will contain .gz twice, but who cares.

splitfiles=(*_${fn}_split*)

if [ -e "${splitfiles[0]}" ] || [ -e "prefixes_${fn}.gz" ]
then
  echo "splits or prefix file for file ${fn} already exist!"
else
  # First, extract a file containing all the prefixes...
  echo "Creating prefixes_${fn}.gz"
  gunzip -c "${fn}" | grep @prefix | sort -u | gzip > "prefixes_${fn}.gz"
  
  # ... then, create the actual split files
  echo "Creating split files"
  zcat "${fn}" | grep -v @prefix |   # drop prefix lines (collected above)
    sed -e "s/\.\$/$auxchar/" |      # mark the '.' that ends each triple group
    tr '\n' ' ' |                    # join all physical lines...
    tr "$auxchar" '\n' |             # ...and re-split at the triple group ends
    grep -v -e '^[ ]*$' |            # drop empty lines
    sed -e 's/$/ ./' |               # restore the trailing ' .'
    split -l "$splitlines" --additional-suffix="_${fn}_split" --filter='gzip > $FILE.gz'
fi

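To give a usage example: assuming you saved the script as split_ttl.sh (name made up) and have a file dataset.ttl.gz (likewise hypothetical), you would call it as

$ chmod +x split_ttl.sh
$ ./split_ttl.sh -l 1000000 -x '±' dataset.ttl.gz

which yields prefixes_dataset.ttl.gz.gz plus chunks named like xaa_dataset.ttl.gz_split.gz. Since gzip files can simply be concatenated, you can rebuild a self-contained Turtle chunk via

$ cat prefixes_dataset.ttl.gz.gz xaa_dataset.ttl.gz_split.gz > chunk1.ttl.gz
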
I needed this recently. Let me know if you find this useful or if you have any other “RDF Data Wrangling” experiences worth sharing in the comments 😉

Quick update on other compression formats:

If your file isn’t gzipped but xz-compressed, use

xz -dc

instead of

gunzip -c

Likewise, for bzip2, use

bzip2 -dc
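
So, for instance, the N-Quads one-liner from the beginning becomes (assuming .nq.xz input files):

$ xz -dc *.nq.xz | sed -e 's/ \.$//' -e 's/ <[^>]*>$/ ./' | gzip > output.nt.gz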

 
