The header from the FASTA file is copied, and a colon (:) followed by
the range represented by the data is appended to it.
I originally chose the Python indexing method of
expressing the range,
Following the header proper is a newline character, then a capital letter 'P'. (Actually, one or more newline characters should be allowed, so that the format is not Unix-specific; however, the first version of the program probably will not work with DOS-format files.) Immediately following the 'P' is the packed data, followed by the packed "mask" information.
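As a rough illustration of this layout, the following Python sketch splits one record into its parts. The function name split_record is hypothetical, and two things are assumed rather than stated by the format: that the range is written as start-end with an exclusive end (so the sequence length is end - start), and that the number of packed bytes can be computed from that length at four bases per byte. The sample bytes are the gatcGATCnN example used in this document.

```python
def split_record(blob: bytes):
    """Split one packed record into (name, start, end, data, mask).

    Illustrative sketch only; names and range convention are assumptions.
    """
    header, sep, rest = blob.partition(b"\n")
    header = header.rstrip(b"\r")          # tolerate a DOS-style line ending
    if not sep or not header.startswith(b">"):
        raise ValueError("missing FASTA-style header line")
    # The range is whatever follows the last colon in the header.
    name, _, span = header[1:].decode("ascii").rpartition(":")
    start, end = (int(part) for part in span.split("-"))
    # Allow one or more newline characters before the 'P' marker.
    rest = rest.lstrip(b"\r\n")
    if not rest.startswith(b"P"):
        raise ValueError("missing 'P' marker")
    body = rest[1:]
    nbytes = (end - start + 3) // 4        # two bits per base, rounded up
    data = body[:nbytes]
    mask = body[nbytes:2 * nbytes]         # mask is as long as the data
    if len(mask) != nbytes:
        raise ValueError("record is shorter than its range implies")
    return name, start, end, data, mask    # any trailing bytes are ignored


SAMPLE = bytes.fromhex("3e7465737466696c653a302d31300a501b1b405500a0")
print(split_record(SAMPLE))
```

Because the lengths of the data and mask are derived from the range, anything after the mask (a stray line ending, for instance) is simply never read.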
This header was chosen for ease of use: with standard GNU tools, one can
The packed data uses two bits per byte of source data, using the arbitrary
The mask is exactly as long as the compressed data, and clarifies the
meaning of the data proper. The data can be used without the mask; however,
the repeat-masking would be lost, and unknown data would show up
incorrectly as repeats of 'G'. Currently three values of mask are used:
|Example Output||(hex) 1B1B40 or (binary) 00011011 00011011 01000000
||(hex) 5500A0 or (binary) 01010101 00000000 10100000
Another way of expressing it is that 00 (G) masked by the dibit 00 is still G, masked with 01 becomes g, and masked with 10 becomes N. A mask of 11 would render the data undefined at this point; I may find valuable use for that mask value later.
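The masking rule can be sketched in Python as follows. Only the code 00 for G is stated in the text; the other codes (01 = A, 10 = T, 11 = C) and the high-bits-first ordering of dibits within a byte are inferred from the example, where "gatc" packs to hex 1B. The names dibits and unpack are hypothetical. Rendering mask 11 as '?' and treating every dibit masked with 10 as 'N', whatever its data value, are assumptions made for illustration.

```python
BASES = "GATC"  # 00=G is from the text; 01=A, 10=T, 11=C inferred from the example


def dibits(packed: bytes):
    """Yield the four 2-bit values of each byte, highest bits first (assumed order)."""
    for byte in packed:
        for shift in (6, 4, 2, 0):
            yield (byte >> shift) & 0b11


def unpack(data: bytes, mask: bytes, length: int) -> str:
    """Combine packed data and mask into a sequence of `length` characters."""
    out = []
    for d, m in zip(dibits(data), dibits(mask)):
        if len(out) == length:
            break                          # remaining dibits are padding
        if m == 0b00:
            out.append(BASES[d])           # unmasked: plain upper-case base
        elif m == 0b01:
            out.append(BASES[d].lower())   # repeat-masked: lower-case base
        elif m == 0b10:
            out.append("N")                # unknown base (assumed for any data value)
        else:
            out.append("?")                # mask 11 is undefined in the format
    return "".join(out)


print(unpack(bytes.fromhex("1b1b40"), bytes.fromhex("5500a0"), 10))  # gatcGATCNN
```

Under these assumptions the example bytes come back as gatcGATCNN; the sketch makes no attempt to distinguish 'n' from 'N'.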
There is no final end-of-line or end-of-file character specified by the format; however, because of the structure design, any remaining bytes (carriage return, linefeed, ^Z, whatever) will be ignored. That's all there is to it! The above example is repeated below in a different format.
jcomeau@notebook ~ $ cat gatcGATCnN.txt
>testfile
gatcGATCnN
jcomeau@notebook ~ $ dump gatcGATCnN.txt.2bit
gatcGATCnN.txt.2bit:
  Addr     0 1  2 3  4 5  6 7  8 9  A B  C D  E F 0 2 4 6 8 A C E
--------  ---- ---- ---- ---- ---- ---- ---- ---- ----------------
00000000  3e74 6573 7466 696c 653a 302d 3130 0a50 >testfile:0-10.P
00000010  1b1b 4055 00a0                          ..@U.