Creating self-updating databases in bioinformatics

In the project I am working on we have combined multiple data sources, and we wanted every new release of a source database to be incorporated automatically. This way we avoid the hassle of updating our service by hand every year, so I think putting some extra time into setting up the pipeline is a worthwhile investment. Here is a draft of what such a pipeline can look like:

#!/bin/sh
# Change to correct working directory.
cd ...
# Download UniProtKB if necessary.
wget -N ftp://ftp.uniprot.org/pub/databases/uniprot/current_release/knowledgebase/complete/uniprot_sprot.dat.gz
...
# Run make to update files as needed.
make all
# Import new tagging results into database.
./create_database.pl
../pgsql/bin/psql ...
# Clean up temporary files.
rm database_*.tsv

The update script is a simple shell script. First it changes to the directory where the scripts and data files are located; this is necessary because the script is usually scheduled to be launched by cron, so the working directory has to be set explicitly.
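
For example, a crontab entry along the following lines (the paths and the schedule are only placeholders, not the ones we actually use) would run the update once a week and keep a log of its output:

# Run the update script every Monday at 03:00 and append its output to a log.
0 3 * * 1 /home/user/project/update_db.sh >> /home/user/project/update_db.log 2>&1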

Afterwards wget with the -N flag downloads the database only if the remote file is newer than the local copy. Then make checks whether any of the downloaded databases have changed and rebuilds only those targets that depend on the newer data sources.
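
As a rough illustration, a rule in the Makefile could express such a dependency as sketched below; the target and script names are hypothetical and only show the idea:

# Hypothetical Makefile fragment: the tagging results are rebuilt only when
# the downloaded UniProtKB release is newer than the existing output.
all: tagging_results.tsv

tagging_results.tsv: uniprot_sprot.dat.gz
	./run_tagger.pl uniprot_sprot.dat.gz > tagging_results.tsv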

The last part of the script takes care of the export to the SQL server. A Perl script prepares a tab-separated value file for PostgreSQL, psql then imports that file into the database, and finally the temporary files are cleaned up.
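
To give an idea of the import step, psql's \copy meta-command can load a tab-separated file directly into a table; the database, table and file names below are made up for illustration:

# Hypothetical import command; COPY reads tab-separated text by default.
../pgsql/bin/psql -d annotations -c "\copy tagging_results FROM 'database_tagging.tsv'"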