2bit file format

The 2bit compression format is very simple and can be made very efficient. With the proper software it can be the best way to store genome data, because the file can be accessed linearly and extracting a range of bases takes only straightforward processing. Here are the nitty-gritty details.

Fields: Header, Data, Mask
Description (Header): The header from the FASTA file is copied, and a colon (:) followed by the range represented by the data is appended to it. I originally chose the Python indexing method of expressing the range, e.g. 0-10 means the first 10 bases; on more careful consideration, however, it is better to use the 1-based system used by the genome browser at http://genome.ucsc.edu/, which would express the same range as 1-10. Since the final number is the same in either case, and is the only one used by the program, this should not present any difficulty, such as a newer program being unable to decompress an older file.

Following the header proper is a newline character, then a capital letter 'P'. (Actually, one or more newline characters should be allowed, so that the format is not Unix-specific; however, the first version of the program probably will not work with DOS-format files.) Immediately following the 'P' is the packed data, followed by the packed "mask" information.

This header was chosen for ease of use: with standard GNU tools, one can run head -n 1 unknown.fa.2bit and see what data is in the file.
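The header layout described above can be read with a few lines of code. The following is only a minimal sketch, not the original program: the function name parse_header and its return tuple are hypothetical, and it assumes the header line is plain ASCII and ends in ':start-end'.

```python
import re


def parse_header(raw: bytes):
    """Return (name, start, end, data_offset) for a 2bit file image.

    Hypothetical helper for illustration; not part of the original program.
    """
    newline = raw.index(b"\n")
    # Tolerate a DOS-style "\r\n" ending by dropping a trailing "\r".
    header = raw[:newline].rstrip(b"\r").decode("ascii")
    match = re.fullmatch(r">(.*):(\d+)-(\d+)", header)
    if match is None:
        raise ValueError("header does not end in ':start-end'")
    name = match.group(1)
    start, end = int(match.group(2)), int(match.group(3))
    # Skip one or more line-ending characters, then expect the 'P' marker.
    pos = newline
    while raw[pos:pos + 1] in (b"\n", b"\r"):
        pos += 1
    if raw[pos:pos + 1] != b"P":
        raise ValueError("missing 'P' marker before the packed data")
    return name, start, end, pos + 1  # packed data starts right after 'P'
```

For a file image beginning with ">testfile:0-10", a newline, and 'P', this returns the name "testfile", the range 0 and 10, and offset 16 for the first packed byte.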

Description (Data): The packed data uses two bits per byte of source data, using the arbitrary mapping (binary - ASCII) 00 - 'G', 01 - 'A', 10 - 'T', and 11 - 'C'. In conjunction with the mask information following (q.v.), the same binary dibits can represent the introns: 00 - 'g', 01 - 'a', 10 - 't', and 11 - 'c', and the unknown or "don't care" data: 00 - 'N' and 01 - 'n'. The first base in the input file is shifted into the highest two bits of the first data byte of the output file; for example, input of GATC is packed as 00011011, hex 0x1b, the ASCII 'Escape' (<ESC>) character. At the end of the data, if the input did not have an exact multiple of 4 bases, the remaining bases are shifted into the high positions of the final byte, and that byte is output before the mask information begins. If the input base count is an exact multiple of 4, there is no space between the end of the data proper and the mask information. In no case are there more than 6 bits of separation between the two sets.

Description (Mask): The mask is exactly as long as the compressed data, and clarifies the meaning of the data proper. The data can be used without the mask; however, the repeat-masking would be lost, and unknown data would show up incorrectly as repeats of 'G' (or as 'A' for a lowercase 'n'). Currently 3 values of mask are used: 00 - normal (exon) sequence data, 01 - repeat (intron) sequence data, and 10 - unknown ([NnMR]) sequence data. The decision to store the mask information separately, rather than using 3 or more bits for each byte of input data, was an arbitrary one based on my love of simplicity. It may turn out to have been the best possible design after all. Note: the 'M' and 'R' tags were only seen in chr3.fa. I have no idea what they stand for; such little surprises make me glad I left some room in the spec for some extra stuff.
Example Input
  Header: >chrY\n
  Data:   gatcGATCnN
  Mask:   (N/A)

Example Output
  Header: >chrY:1-10\nP
  Data:   (hex) 1B1B40 or (binary) 00 01 10 11 00 01 10 11 01 00 00 00
  Mask:   (hex) 5500A0 or (binary) 01 01 01 01 00 00 00 00 10 10 00 00
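The packing rules and the example can be checked with a short sketch. This is an illustration under stated assumptions rather than the original program: the names CODES and pack are hypothetical, and 'M' and 'R' are left out because the text does not say which data dibits they receive.

```python
# (data dibit, mask dibit) for each supported input character.
# 'M' and 'R' are omitted: their data dibits are not given in the text.
CODES = {
    "G": (0b00, 0b00), "A": (0b01, 0b00), "T": (0b10, 0b00), "C": (0b11, 0b00),
    "g": (0b00, 0b01), "a": (0b01, 0b01), "t": (0b10, 0b01), "c": (0b11, 0b01),
    "N": (0b00, 0b10), "n": (0b01, 0b10),
}


def pack(sequence: str):
    """Return (data, mask) byte strings for a sequence of bases."""
    data = bytearray()
    mask = bytearray()
    for start in range(0, len(sequence), 4):
        data_byte = mask_byte = 0
        for position, base in enumerate(sequence[start:start + 4]):
            data_bits, mask_bits = CODES[base]
            shift = 6 - 2 * position  # first base lands in the top two bits
            data_byte |= data_bits << shift
            mask_byte |= mask_bits << shift
        # A short final group leaves its unused low bits as zero padding.
        data.append(data_byte)
        mask.append(mask_byte)
    return bytes(data), bytes(mask)
```

Calling pack("gatcGATCnN") yields the data bytes 1B 1B 40 and the mask bytes 55 00 A0, the same values as in the example output.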

Another way of expressing it is that 00 (G) masked by the dibit 00 is still G, masked with 01 becomes g, and masked with 10 becomes N. A mask of 11 would render the data undefined at this point; I may find valuable use for that mask value later.
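Decoding applies the same idea in reverse: each data dibit is read together with the mask dibit at the same position. The sketch below is an assumption-laden illustration, not the original decompressor; the name unpack is hypothetical, and it rejects the mask value 11 and any unknown-data dibit other than 00 or 01, since the text leaves those undefined.

```python
BASES = "GATC"  # index is the data dibit: 00, 01, 10, 11


def unpack(data: bytes, mask: bytes, count: int) -> str:
    """Return the first `count` bases described by data and mask."""
    out = []
    for index in range(count):
        byte_index, slot = divmod(index, 4)
        shift = 6 - 2 * slot  # slot 0 is the highest two bits of the byte
        data_bits = (data[byte_index] >> shift) & 0b11
        mask_bits = (mask[byte_index] >> shift) & 0b11
        if mask_bits == 0b00:
            out.append(BASES[data_bits])
        elif mask_bits == 0b01:
            out.append(BASES[data_bits].lower())
        elif mask_bits == 0b10 and data_bits in (0b00, 0b01):
            out.append("Nn"[data_bits])
        else:
            # Mask 11, or unknown data other than 'N'/'n', is not defined.
            raise ValueError(f"undefined dibit pair at base {index}")
    return "".join(out)
```

With the example bytes, unpack(bytes.fromhex("1b1b40"), bytes.fromhex("5500a0"), 10) gives back gatcGATCnN. Passing an all-zero mask instead gives GATCGATCAG, which shows what is lost when the mask is ignored.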

There is no final end-of-line or end-of-file character specified by the format; however, because of the structure design, any remaining bytes (carriage return, linefeed, ^Z, whatever) will be ignored. That's all there is to it! The above example is repeated below as a hex dump, using the header 'testfile' and the older 0-based range.


jcomeau@notebook ~
$ cat gatcGATCnN.txt
>testfile
gatcGATCnN

jcomeau@notebook ~
$ dump gatcGATCnN.txt.2bit
gatcGATCnN.txt.2bit:

  Addr     0 1  2 3  4 5  6 7  8 9  A B  C D  E F 0 2 4 6 8 A C E
--------  ---- ---- ---- ---- ---- ---- ---- ---- ----------------
00000000  3e74 6573 7466 696c 653a 302d 3130 0a50 >testfile:0-10.P
00000010  1b1b 4055 00a0                          ..@U.
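
Because every base occupies a fixed two bits in each section, a range of bases can be pulled out of a file like the one dumped above by offset arithmetic alone. The following is a rough sketch, not the original program: extract_range and TABLE are hypothetical names, and it assumes the file starts at the first base, so that the final number in the header equals the total base count.

```python
# Rows are indexed by mask dibit, columns by data dibit.
TABLE = ("GATC", "gatc", "Nn")


def extract_range(image: bytes, first: int, last: int) -> str:
    """Return bases first..last (1-based, inclusive) from a 2bit file image."""
    header, _, rest = image.partition(b"\n")
    # Assumption: the final number in the header is the total base count.
    total = int(header.strip().rsplit(b"-", 1)[1])
    body = rest.lstrip(b"\r\n")
    if not body.startswith(b"P"):
        raise ValueError("missing 'P' marker before the packed data")
    packed_len = (total + 3) // 4  # bytes needed for `total` dibits
    data = body[1:1 + packed_len]
    # Anything after the mask (CR, LF, ^Z, ...) is never looked at.
    mask = body[1 + packed_len:1 + 2 * packed_len]
    if not 1 <= first <= last <= total:
        raise ValueError("range lies outside the stored data")
    out = []
    for index in range(first - 1, last):
        byte_index, slot = divmod(index, 4)
        shift = 6 - 2 * slot
        data_bits = (data[byte_index] >> shift) & 0b11
        mask_bits = (mask[byte_index] >> shift) & 0b11
        # Undefined combinations (e.g. mask 11) raise IndexError here.
        out.append(TABLE[mask_bits][data_bits])
    return "".join(out)
```

Given the 22 bytes shown in the dump, extract_range(image, 5, 9) returns GATCn, and appending stray trailing bytes to the image does not change the result.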