The Helical Harp: new-age music from genome data

The command-line conversion utility fa2mid.py (also see COPYING) will make a standard MIDI file out of genome data such as that available at http://genome.ucsc.edu/goldenPath/hg16/chromosomes/.

Here is some sample data from one of the files:

CGGGAGCCGGGGCGCTTGCCTGCACTCCCCTCCGGCCTGTAGCTGGATGA
GCTACTGGCGAGCCGGGAGTTTCTGCAGCAGGCGCAACGTTTCCTAGAAA
CGGAGGCCCTGGGGGAACTGGAGGCCTTCGAAGTGGCCGCCTTGCTAGAG
CACCCATCAGTGAGGAAGAATACCGGGCTCTGCTGGAGGAGCATTAGGAC
AAGGGATTGGGACGGGGTCGGGTCGGGACACGGCAGTGCCCTCTCTTTCA

And here is how it sounds after running

python fa2mid.py -a -l 3
 chr10_random.sample.fa

. Load the MIDI file into Winamp or Timidity to see the data as lyrics. No MIDI? Try the u-law audio file (1.5 MB). With the -a option it generates rather "noisy" music because it translates all codons for all three forward frames whether or not there was a 'start' codon (AUG). The result ranges from horrendous discord to remarkable beauty; considering that the notes were taken without modification from the original abbreviations of the genetic code, mapping notes above 'G' to higher octaves (e. g. 'H' is the MIDI note A3), anything resembling music is purely serendipitous. Running the script without the '-a' option generates music which matches (more or less correctly according to my limited knowledge of the subject) what the actual translation process produces. The bases (nucleotides) are represented by an acoustic bass, while the amino acids are, by default, played by synthesized "aahs". Other options are available; execute the script with no arguments to get a summary.

This, obviously, isn't the first work ever done in this area. The idea was inspired by Douglas Hofstadter's book Gödel, Escher, Bach: An Eternal Golden Braid; I wouldn't be surprised if Doug himself wrote a similar program 30 years or more ago. And after a google search for 'genome music' many other such efforts came to light. One particularly well-known example appears to be that of Todd Barton; his work is a lot more artistic and abstract than my simple approach, but to me has the drawback of not making the relationship of the music to the data quite as clear as mine. But hey, since when did any two programmers ever agree on anything?

Note that these data files are huge, and depending on the options you choose, the resulting .mid files can be more than 10 times larger. You might want to snip out a part of one chromosome as I did, and/or use the -e option to skip the introns.

To get the reverse frames, first invert the file using fareverse or similar tool, then run fa2mid.py on the result. Listening to the music generated by both forward and reverse frames of all 3.1 billion base pairs of the human genome should only take about 98 years if I calculated correctly (at 1/2 second per base). Of course, you could speed it up using the --duration option.

The coding of the script leaves a lot to be desired; I'm still trying to get the hang of Python two years or so after hacker emeritus Eric Raymond recommended it to me.

Better genome data compression

I've been fiddling with this on and off for about 2 years now, but finally have it to where I can compress and decompress the actual genome data from http://genome/ucsc.edu and get much better compression out of the standard compression tools zip, gzip, and bzip2: 19% smaller with bzip2 and 35% smaller with the other two:

-rwxr-xr-x    1 jcomeau  None       724468 Sep 28 19:03 chr10_random.fa
-rw-r--r--    1 jcomeau  None       355150 Sep 28 19:10 chr10_random.fa.2bit
-rwx------+   2 jcomeau  None     51292293 Aug  4 16:26 chrY.fa
-rw-r--r--    1 jcomeau  None     25143296 Sep 28 16:54 chrY.fa.2bit
-rw-r--r--    1 jcomeau  None      5792890 Sep 28 18:49 chrY.fa.2bit.bz2
-rw-r--r--    1 jcomeau  None      5884500 Sep 28 18:51 chrY.fa.2bit.gz
-rw-r--r--    1 jcomeau  None      5884625 Sep 28 18:59 chrY.fa.2bit.zip
-rwx------    1 jcomeau  None      6907201 Sep 28 18:46 chrY.fa.bz2
-rwx------    1 jcomeau  None      7956104 Sep 28 18:50 chrY.fa.gz
-rw-r--r--    1 jcomeau  None      7956224 Sep 28 18:59 chrY.fa.zip

What it does is to pack the bases using the fact that 4 values can be expressed in two bits: binary 00, 01, 10, and 11. Would it were that simple! Because repeat-masked fragments of a sequence can be lowercased, and there are 'N' and 'n' to mark unknown bases, I needed another 'mask' section of the output file to indicate what was what. Luckily, that is mostly zeroes and long runs of 'U', so it compresses very well afterwards using the aforementioned programs (normally to less than 2% of the finished bzipped size). The routines for compression and decompression are 'packgenome()' and 'unpackgenome()' in fa2mid.py; use them by making symlinks to that program:

ln -s fa2mid.py packgenome
ln -s fa2mid.py unpackgenome

Then you invoke the program so, after decompressing the zipfiles you got from the databank:

./packgenome *.fa
bzip2 *.fa.2bit

Of course, to decompress you simply reverse the above:

bunzip2 *.fa.2bit.bz2
./unpackgenome *.fa.2bit

I realize these routines are very hard to follow; I'll be working on that in my "spare time". Meanwhile that extra percentage of compression can save a lot of bandwidth, so use it whenever you can! And please spread the word. Bit-bangers may want to review the 2bit file format (such as it is), so that better programs can be written. I used that spec to write a new program in C which is even more ugly than the python script but is much faster (by about 10 times on my laptop). It has compiled without warning on Cygwin, Linux, and NetBSD, of course all GNU systems though.

(Update 2003-10-10)
Found out a few days ago that the chr3.fa data file does not pack, due to its containing 'M' and 'R' characters. Anybody have an idea what those stand for? In any case, I had to modify both the Python and C programs; it was a good thing in the case of 2bit.c, because I corrected a bunch of very grotty code (still a lot left though), and it seems to run even faster though I didn't benchmark it to make sure. The source linked from this page is the latest update; check the RCSid's to make sure.

The current date is Sunday, 03-May-2026 12:01:45 EDT
This page was last modified Friday, 26-Aug-2005 17:57:31 EDT
Author jc@jcomeau.com, otherwise known as:

John Comeau
P. O. Box 100632
Ft. Lauderdale, FL 33310-0632

Home page and GPG key