Here is some sample data from one of the files:
CGGGAGCCGGGGCGCTTGCCTGCACTCCCCTCCGGCCTGTAGCTGGATGA GCTACTGGCGAGCCGGGAGTTTCTGCAGCAGGCGCAACGTTTCCTAGAAA CGGAGGCCCTGGGGGAACTGGAGGCCTTCGAAGTGGCCGCCTTGCTAGAG CACCCATCAGTGAGGAAGAATACCGGGCTCTGCTGGAGGAGCATTAGGAC AAGGGATTGGGACGGGGTCGGGTCGGGACACGGCAGTGCCCTCTCTTTCAAnd here is how it sounds after running
. Load the MIDI file into Winamp or Timidity to see the data as lyrics. No MIDI? Try the u-law audio file (1.5 MB). With the
python fa2mid.py -a -l 3 chr10_random.sample.fa
-aoption it generates rather "noisy" music because it translates all codons for all three forward frames whether or not there was a 'start' codon (AUG). The result ranges from horrendous discord to remarkable beauty; considering that the notes were taken without modification from the original abbreviations of the genetic code, mapping notes above 'G' to higher octaves (e. g. 'H' is the MIDI note A3), anything resembling music is purely serendipitous. Running the script without the '-a' option generates music which matches (more or less correctly according to my limited knowledge of the subject) what the actual translation process produces. The bases (nucleotides) are represented by an acoustic bass, while the amino acids are, by default, played by synthesized "aahs". Other options are available; execute the script with no arguments to get a summary.
This, obviously, isn't the first work ever done in this area. The idea was inspired by Douglas Hofstadter's book Gödel, Escher, Bach: An Eternal Golden Braid; I wouldn't be surprised if Doug himself wrote a similar program 30 years or more ago. And after a google search for 'genome music' many other such efforts came to light. One particularly well-known example appears to be that of Todd Barton; his work is a lot more artistic and abstract than my simple approach, but to me has the drawback of not making the relationship of the music to the data quite as clear as mine. But hey, since when did any two programmers ever agree on anything?
Note that these data files are huge, and depending on the options you choose,
the resulting .mid files can be more than 10 times larger. You might want to
snip out a part of one chromosome as I did, and/or use the
option to skip the introns.
To get the reverse frames, first invert the file using fareverse or similar tool, then run fa2mid.py on the result. Listening to the music generated by both forward and reverse frames of all 3.1 billion base pairs of the human genome should only take about 98 years if I calculated correctly (at 1/2 second per base). Of course, you could speed it up using the --duration option.
The coding of the script leaves a lot to be desired; I'm still trying to get the hang of Python two years or so after hacker emeritus Eric Raymond recommended it to me.
Better genome data compression
I've been fiddling with this on and off
for about 2 years now, but finally have it
to where I can compress and decompress the actual genome data from
http://genome/ucsc.edu and get much
better compression out of the standard compression tools zip, gzip, and bzip2:
19% smaller with bzip2 and 35% smaller with the other two:
-rwxr-xr-x 1 jcomeau None 724468 Sep 28 19:03 chr10_random.fa -rw-r--r-- 1 jcomeau None 355150 Sep 28 19:10 chr10_random.fa.2bit -rwx------+ 2 jcomeau None 51292293 Aug 4 16:26 chrY.fa -rw-r--r-- 1 jcomeau None 25143296 Sep 28 16:54 chrY.fa.2bit -rw-r--r-- 1 jcomeau None 5792890 Sep 28 18:49 chrY.fa.2bit.bz2 -rw-r--r-- 1 jcomeau None 5884500 Sep 28 18:51 chrY.fa.2bit.gz -rw-r--r-- 1 jcomeau None 5884625 Sep 28 18:59 chrY.fa.2bit.zip -rwx------ 1 jcomeau None 6907201 Sep 28 18:46 chrY.fa.bz2 -rwx------ 1 jcomeau None 7956104 Sep 28 18:50 chrY.fa.gz -rw-r--r-- 1 jcomeau None 7956224 Sep 28 18:59 chrY.fa.zipWhat it does is to pack the bases using the fact that 4 values can be expressed in two bits: binary 00, 01, 10, and 11. Would it were that simple! Because repeat-masked fragments of a sequence can be lowercased, and there are 'N' and 'n' to mark unknown bases, I needed another 'mask' section of the output file to indicate what was what. Luckily, that is mostly zeroes and long runs of 'U', so it compresses very well afterwards using the aforementioned programs (normally to less than 2% of the finished bzipped size). The routines for compression and decompression are 'packgenome()' and 'unpackgenome()' in fa2mid.py; use them by making symlinks to that program:
ln -s fa2mid.py packgenome
ln -s fa2mid.py unpackgenome
Then you invoke the program so, after decompressing the zipfiles you
got from the databank:
Of course, to decompress you simply reverse the above:
I realize these routines are very hard to follow; I'll be working on that in my "spare time". Meanwhile that extra percentage of compression can save a lot of bandwidth, so use it whenever you can! And please spread the word. Bit-bangers may want to review the 2bit file format (such as it is), so that better programs can be written. I used that spec to write a new program in C which is even more ugly than the python script but is much faster (by about 10 times on my laptop). It has compiled without warning on Cygwin, Linux, and NetBSD, of course all GNU systems though.
Found out a few days ago that the chr3.fa data file does not pack, due to its containing 'M' and 'R' characters. Anybody have an idea what those stand for? In any case, I had to modify both the Python and C programs; it was a good thing in the case of 2bit.c, because I corrected a bunch of very grotty code (still a lot left though), and it seems to run even faster though I didn't benchmark it to make sure. The source linked from this page is the latest update; check the RCSid's to make sure.
John Comeau P. O. Box 100632 Ft. Lauderdale, FL 33310-0632Home page and GPG key