Creating self-updating databases in bioinformatics

In the project I am working on we combine multiple data sources, and we wanted every new release of a source database to be incorporated automatically. This spares us the hassle of updating our service by hand every year, so I think the extra time spent on setting up the pipeline is a worthwhile investment. Here is a draft of what such a pipeline can look like:

#!/bin/sh

# Change to correct working directory.
cd ...

# Download UniProtKB if necessary.
wget -N ftp://ftp.uniprot.org/pub/databases/uniprot/current_release/knowledgebase/complete/uniprot_sprot.dat.gz

...

# Run make to update files as needed.
make all

# Import new tagging results into database.
./create_database.pl
../pgsql/bin/psql ...
rm database_*.tsv

The update script is a simple shell script. First it changes to the directory where the scripts and data files are located. This is needed because the script is usually scheduled to be launched from crontab, so the working directory has to be set explicitly.
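The directory change can be sketched as follows. This is only an illustration: enter_script_dir is a hypothetical helper name, and the temporary directory and update.sh stand in for the real project location, which I leave elided above. The underlying assumption is that cron typically starts jobs in the user's home directory, so relative paths would otherwise point to the wrong place.

```shell
#!/bin/sh
# Hypothetical helper: change to the directory that contains the given script path.
enter_script_dir() {
    cd "$(dirname "$1")" || return 1
}

# Demonstration with a throw-away directory standing in for the project directory.
demo_dir=$(mktemp -d)
enter_script_dir "$demo_dir/update.sh"

# Record where we ended up (physical path, with symlinks resolved).
current_dir=$(pwd -P)
echo "Working in: $current_dir"
```

In a real update script the call would be enter_script_dir "$0", so the script finds its data files no matter where it is launched from.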

Next, wget with the -N flag downloads the database only if the remote copy is newer than the local one. Then make checks whether any of the source databases have changed and rebuilds the targets that depend on the newer data sources.

The last part of the script takes care of the export to an SQL server. A Perl script prepares tab-separated value files for PostgreSQL, which are then imported into the database. Finally, the temporary files are cleaned up.

Using makefiles in bioinformatics pipelines

Recently I have been involved in many projects where parsing text files is necessary, and I use small scripts to accomplish the task. However, the scripts need to be chained into pipelines, and I have found makefiles very handy for that.

Makefiles have long been used in software development to define rulesets for building programs automatically; make dates back to the late 1970s. The make command reads a makefile (named makefile or Makefile), executes the commands in it, and produces the desired build. Because make can invoke any script, it is also a very handy utility in bioinformatics.

A makefile consists of rules, and each rule defines which components are needed to create a target and which script creates it. A rule can be compared to a recipe in a cookbook: the components are the ingredients, and the script is the set of instructions that “compiles” the ingredients into the dish on the table.

fried_chicken: raw_chicken oil
	./fry_chicken_in_a_pan.pl

Formally speaking:

target1 [target2 ...]: [component1 component2 ...]
	[<TAB>command 1]
	[<TAB>command 2]
	...

The targets are listed on the left side, and the components they require follow the colon on the right. Components are optional: a rule that creates a file simply by “touching” it needs none. The commands that create the targets are listed below the first line, each starting with a tab. By default make runs each command with the standard Unix shell, /bin/sh, so cat, cp, rm and similar commands can be invoked.
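A rule without components can be tried out with a few lines of shell. This is a minimal sketch that assumes make is installed; marker.txt is a hypothetical target name, and printf is used to write the makefile so that the required tab before the command is explicit.

```shell
#!/bin/sh
set -e

# Work in a throw-away directory.
work=$(mktemp -d)
cd "$work"

# A rule with a target, no components, and a single tab-indented command.
printf 'marker.txt:\n\ttouch marker.txt\n' > Makefile

# Build the target; -s keeps make from echoing the command.
make -s marker.txt

if [ -f marker.txt ]; then marker_created=yes; else marker_created=no; fi
echo "marker.txt created: $marker_created"
```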

Another nice feature of make is that a target can itself be a component of another rule. For example:

dinner: fried_chicken baked_potatoes

fried_chicken: ...

baked_potatoes: ...

Here, if we issue make dinner, make checks whether fried_chicken and baked_potatoes exist and are up to date; if not, it runs their rules first.
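The chain can be run end to end with the sketch below, assuming make is installed. The commands are my own placeholders (plain echo and cat instead of real scripts), chosen only so that the example is self-contained.

```shell
#!/bin/sh
set -e

work=$(mktemp -d)
cd "$work"

# Write the makefile; \t produces the tab that must start each command line.
{
    printf 'dinner: fried_chicken baked_potatoes\n'
    printf '\tcat fried_chicken baked_potatoes > dinner\n\n'
    printf 'fried_chicken:\n'
    printf '\techo "chicken" > fried_chicken\n\n'
    printf 'baked_potatoes:\n'
    printf '\techo "potatoes" > baked_potatoes\n'
} > Makefile

# Asking only for dinner makes both components first, because they are missing.
make -s dinner
menu=$(cat dinner)
echo "$menu"
```

Only dinner is requested on the command line, yet all three files exist afterwards.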

There are two common targets: all and clean. By convention, all creates every target, while clean runs an rm command to clean up the build environment.

Makefiles have another advantage. Let us assume that we have an all target and one of its components has been updated (e.g., a newer source file has been downloaded from the internet, so it has a newer timestamp). When we issue make all again, make discovers that the component is newer and reruns every rule in which that component is listed, along with anything that depends on the results. This feature makes it possible to build pipelines.
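The timestamp check can be observed with a small experiment, sketched here under the assumption that make is installed. The file names source.txt and upper.txt and the tr command are hypothetical stand-ins for a downloaded database and a parsing script.

```shell
#!/bin/sh
set -e

work=$(mktemp -d)
cd "$work"

# Hypothetical one-step pipeline: upper.txt is derived from source.txt.
printf 'all: upper.txt\n\nupper.txt: source.txt\n\ttr a-z A-Z < source.txt > upper.txt\n' > Makefile

# First "release" of the source file, followed by the first build.
echo 'first release' > source.txt
make -s all
first_run=$(cat upper.txt)

# Nothing has changed, so make -q reports that upper.txt is up to date (exit status 0).
if make -q upper.txt; then up_to_date=yes; else up_to_date=no; fi

# Simulate a newer release; the pause guarantees a later timestamp.
sleep 1
echo 'second release' > source.txt
make -s all
second_run=$(cat upper.txt)

echo "$first_run / $up_to_date / $second_run"
```

The second make all rebuilds upper.txt only because source.txt now carries a newer timestamp than the target.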

Before executing make, we may want to know what it is going to do.

make -n target1 target2 ...

Calling make with -n shows the commands that would be issued in a real run, without executing them.
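A quick way to convince yourself that a dry run changes nothing is the sketch below, again assuming make is installed; report.txt is a hypothetical target.

```shell
#!/bin/sh
set -e

work=$(mktemp -d)
cd "$work"

# One rule whose command would create report.txt.
printf 'report.txt:\n\techo "done" > report.txt\n' > Makefile

# Dry run: capture what make says it would execute.
planned=$(make -n report.txt)

# The command was only printed, so the file must not exist yet.
if [ -e report.txt ]; then created=yes; else created=no; fi

echo "Planned: $planned"
echo "File created: $created"
```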

I hope this article has given a brief introduction to make. Here are a few links about make that I found useful:

Advanced Makefile Tricks: describes how to use special macros. These are very useful for, e.g., passing components as arguments to the commands or piping output to the target.

Make (software): the Wikipedia entry about make, which describes its history and shows some examples.