OK, so this is part of a project I'm working on. I am dealing with DNA strings that consist only of the letters A, C, G and T. I have to write these strings to a file and be able to read them back out of the file. However, before putting them in the file we have to convert them to bytes and compress them (to save file space) using this 2-bit format:
A: 00
C: 01
G: 10
T: 11
It was explained to me that you can do this with bit shifting along with some ANDs and ORs; however, after hours of work I'm not getting it.
So as I understand it, each character is its own byte, for example:
A: 00000000
C: 00000001
G: 00000010
T: 00000011
These strings can be anywhere from a few dozen to thousands of characters long. However, I have to compress 4 letters at a time, into a single byte.
Example:
If I read in the above as the string "ACGT",
it should be compressed into a single byte that looks like this:
00011011
So the letters are read from left to right, with the first letter ending up in the two highest bits.
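To make the target layout concrete, here is a minimal sketch of the arithmetic for a single byte. The class and method names (PackFourBases, packFour) are made up for illustration; the only thing taken from the description above is the A=0, C=1, G=2, T=3 coding and the left-to-right order.

```java
final class PackFourBases {
    // Packs four 2-bit codes (each 0..3) into one value, first code in the top two bits.
    static int packFour(int c0, int c1, int c2, int c3) {
        return (c0 << 6) | (c1 << 4) | (c2 << 2) | c3;
    }

    public static void main(String[] args) {
        int packed = packFour(0, 1, 2, 3); // A, C, G, T
        // Pad the binary string to 8 digits so the leading zeros are visible.
        String bits = String.format("%8s", Integer.toBinaryString(packed)).replace(' ', '0');
        System.out.println(bits + " = " + packed); // prints "00011011 = 27"
    }
}
```

Each letter's code is shifted into its own 2-bit slot and the slots are ORed together, which is why "ACGT" should come out as 27.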
I don't know how to do this. Here is my compression attempt, which is not working:
String string = "ACGT";
byte[] byt = new byte[string.length()/4];
byte temp = 0;
for(int i = 0; i < string.length(); i += 4) {
    for(int j = 0; j < 4; j++) {
        if(string.charAt(j+i) == 'A')
            temp = (byte) (temp << 2 | 00);
        if(string.charAt(j+i) == 'C')
            temp = (byte) (temp << 2 | 01);
        if(string.charAt(j+i) == 'G')
            temp = (byte) (temp << 2 | 10);
        if(string.charAt(j+i) == 'T')
            temp = (byte) (temp << 2 | 11);
    }
    byt[i] = temp;
}
for(int k = 0; k < byt.length; k++)
    System.out.println("byt[k] " + byt[k]);
System.out.println("byt: " + byt);
The output is:
byt[k] 59
byt: [B@3e25a5
I believe it's supposed to be
byt[k] 27 (since binary 00011011 is 27 in decimal, right?)
and byt: won't be readable at all. So since it's showing 59, I think it's incorrect.
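One likely cause of the 59: in Java, the literals 10 and 11 are decimal ten and eleven, not the bit patterns 10 and 11 (00 and 01 only work because octal 0 and 1 happen to equal 0 and 1). Tracing "ACGT" through the loop gives 0, 1, (1 << 2 | 10) = 14, (14 << 2 | 11) = 59. The index byt[i] would also run past the end of the array for longer strings, since i advances by 4. Below is a hedged sketch of one way to pack the string, assuming Java 7+ binary literals (0b10, 0b11); the names DnaPacker, codeOf and pack are hypothetical, and padding a short final group with zero bits is my assumption, not something the format above specifies.

```java
import java.util.Arrays;

final class DnaPacker {
    // Maps a base to its 2-bit code: A=00, C=01, G=10, T=11.
    static int codeOf(char base) {
        switch (base) {
            case 'A': return 0b00;
            case 'C': return 0b01;
            case 'G': return 0b10;
            case 'T': return 0b11;
            default: throw new IllegalArgumentException("Not a DNA base: " + base);
        }
    }

    // Packs four bases per byte, first base in the two highest bits.
    // Assumption: if the length is not a multiple of 4, the unused low bits
    // of the last byte are left as zeros.
    static byte[] pack(String dna) {
        byte[] out = new byte[(dna.length() + 3) / 4];
        for (int i = 0; i < dna.length(); i++) {
            int shift = 6 - 2 * (i % 4);                    // 6, 4, 2, 0 within each byte
            out[i / 4] |= codeOf(dna.charAt(i)) << shift;   // |= narrows back to byte
        }
        return out;
    }

    public static void main(String[] args) {
        byte[] packed = pack("ACGT");
        System.out.println(packed[0]);               // prints 27
        // Printing the array reference itself gives something like [B@3e25a5;
        // Arrays.toString shows the contents instead.
        System.out.println(Arrays.toString(packed)); // prints [27]
    }
}
```

Because zero padding is indistinguishable from trailing A's, the original number of bases would have to be stored alongside the bytes for the data to be read back exactly.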
Here is my attempt at decompressing it, which should read back out the string that was entered, "ACGT":
String str = "";
for(int i = 0; i < byt.length; i++) {
    for(int j = 0; j < 4; j++) {
        if((byte) (byt[i]%4) == 00)
            str = str + "A";
        if((byte) (byt[i]%4) == 01)
            str = str + "C";
        if((byte) (byt[i]%4) == 10)
            str = str + "G";
        if((byte) (byt[i]%4) == 11)
            str = str + "T";
        byt[i] = (byte) (byt[i] >> 2);
    }
}
System.out.println("str: " + str);
The output is:
str: A
So for some reason it's not reading the rest of the characters.
Could someone help me out with this? I've spent way too much time on this and I'm just not getting it to work. I also need to be able to decompress it after I can compress it.
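For the decompression side, a few things in the attempt above look suspect: the comparisons against 10 and 11 are again decimal, byt[i] % 4 can go negative once the top bit of a byte is set, and reading the low two bits first would return the letters in reverse order. The following is only a sketch of one possible approach, masking with & 0b11 and reading from the high bits down. DnaUnpacker and unpack are hypothetical names, and the length parameter assumes the number of bases was saved somewhere when the data was written.

```java
final class DnaUnpacker {
    // Index in this table is the 2-bit code: 00=A, 01=C, 10=G, 11=T.
    private static final char[] BASES = {'A', 'C', 'G', 'T'};

    // Reads `length` bases back out of the packed bytes, first base from the
    // two highest bits of each byte. The input array is not modified.
    static String unpack(byte[] packed, int length) {
        StringBuilder sb = new StringBuilder(length);
        for (int i = 0; i < length; i++) {
            int shift = 6 - 2 * (i % 4);                // 6, 4, 2, 0 within each byte
            int code = (packed[i / 4] >> shift) & 0b11; // mask also strips sign extension
            sb.append(BASES[code]);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        byte[] packed = {27}; // 00011011
        System.out.println("str: " + unpack(packed, 4)); // prints "str: ACGT"
    }
}
```

Masking with & 0b11 rather than using % 4 keeps the result in the range 0..3 even for negative byte values, and a StringBuilder avoids rebuilding the string on every letter for sequences that are thousands of characters long.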