Thursday, September 26, 2013

Sorting Fasta files

I am doing a three-way comparison of repeat elements in three species of anole lizards. All three genomes were sequenced with Illumina next-generation technology. One challenge I face is moving the data around and formatting it properly, so I had to do some Googling to find the right bash commands for processing the FASTA files from the command line.

Below I:
1) Concatenate all the files into one large file with cat.
2) Put each record on a single line. I want to filter by line length, but in a normal FASTA file the header is a different line from the sequence body, so a length filter would throw away the short header lines, which have all the information I need! Therefore I replaced every newline (\n) with a * so that each header and its sequence are read as one long line.
3) Use awk to sort the records by length and remove the ones shorter than about 250 base pairs (the cutoff is applied to the whole line, header included).
4) Put the FASTA format back the way it was before I sorted by length.

#concatenate all the files from all 3 species into one sub-subfamily specific file
cat    species1.fasta     species2.fasta     species3.fasta   >   /Desktop/masters_repeats/anolis_repeatfamily_3way_total.fasta

#I have to make each entry one line, so I first replace every newline with a *
tr "\n" "*" <  /Desktop/masters_repeats/anolis_repeatfamily_3way_total.fasta > /Desktop/masters_repeats/anolis_repeatfamily_3way_total_oneline.txt

#then put a real newline back in front of every >, so each record (header plus sequence) sits on its own line
sed -e 's/\*>/\'$'\n>/g'    /Desktop/masters_repeats/anolis_repeatfamily_3way_total_oneline.txt  > /Desktop/masters_repeats/anolis_repeatfamily_3way_total_oneline2.txt

#prefix each line with its length and sort numerically, shortest first
awk '{ print length(), $0 | "sort -n" }'    /Desktop/masters_repeats/anolis_repeatfamily_3way_total_oneline2.txt    >    /Desktop/masters_repeats/anolis_repeatfamily_3way_total_SORTEDlength.txt

#I then kept only the lines longer than 250 characters (the count includes the length prefix, the header and the * placeholders)
awk 'length > 250'  /Desktop/masters_repeats/anolis_repeatfamily_3way_total_SORTEDlength.txt  > /Desktop/masters_repeats/anolis_repeatfamily_3way_total_SORTEDlength_over250.txt

#remove everything before the >, which is the length and a space (the awk sorting step adds an unnecessary number to the beginning of each line; there may be a less verbose option, but I couldn't find it). Matching only the leading digits and the space leaves any spaces inside the header alone.

sed 's/^[0-9]* //'    /Desktop/masters_repeats/anolis_repeatfamily_3way_total_SORTEDlength_over250.txt  >   /Desktop/masters_repeats/anolis_repeatfamily_3way_total_SORTEDlength_over250_clean.txt

#I replace each * with a newline again, so the FASTA header (the line starting with >) is on its own line with the sequence below it.

tr "*" "\n"    <    /Desktop/masters_repeats/anolis_repeatfamily_3way_total_SORTEDlength_over250_clean.txt  > /Desktop/masters_repeats/anolis_repeatfamily_3way_total_SORTEDlength_over250_clean_expandedw-headers.fasta
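
Before loading the file anywhere, a quick sanity check may be worthwhile. The 250-character cutoff is applied by awk to the whole line, which at that step includes the length prefix, the header and the * placeholders, so a record can pass with somewhat fewer than 250 bases. The sketch below is one way to list the actual sequence length of every record. The fasta_demo directory and the two toy records are made up for illustration; point the awk command at your own FASTA file instead.

```shell
# Toy input, made up for illustration (not the real anole data).
mkdir -p fasta_demo
bases=$(printf 'ACGT%.0s' $(seq 1 75))        # 75 x 4 = 300 bases
printf '>keep_me family=demo\n%s\n>too_short\nACGT\n' "$bases" > fasta_demo/check_input.fasta

# Print "header <TAB> sequence length" for every record,
# summing over wrapped sequence lines and ignoring the header itself.
awk '
  /^>/ { if (name != "") print name "\t" len; name = $0; len = 0; next }
       { len += length($0) }
  END  { if (name != "") print name "\t" len }
' fasta_demo/check_input.fasta > fasta_demo/lengths.tsv

cat fasta_demo/lengths.tsv
```

On the toy file this reports 300 for the first record and 4 for the second. Run against the final FASTA, any value below the intended cutoff would show that the header length pushed a short sequence over the line.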

My final product is ready to be uploaded into Geneious for alignments and tree-building.
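
As for a less verbose option, the same idea (one record per line, filter, sort, restore) can probably be chained into one pipeline with no intermediate files. This is only a sketch under a few assumptions: it joins the records with awk instead of tr and sed, it filters before sorting, it strips the length prefix with cut, and it runs on made-up toy files in a hypothetical fasta_demo directory. Swap in your own file names and paths.

```shell
# Toy inputs, made up for illustration (not the real anole data).
mkdir -p fasta_demo
a=$(printf 'ACGT%.0s' $(seq 1 75))     # 300 bases
b=$(printf 'TTGCA%.0s' $(seq 1 40))    # 200 bases
printf '>sp1_long\n%s\n>sp1_short\nACGT\n' "$a"  > fasta_demo/species1.fasta
printf '>sp2_long\n%s\n%s\n' "$b" "$b"           > fasta_demo/species2.fasta
printf '>sp3_short\nGGCC\n'                      > fasta_demo/species3.fasta

cat fasta_demo/species1.fasta fasta_demo/species2.fasta fasta_demo/species3.fasta |
  # 1) join each record onto one line, using * in place of the newlines
  awk '/^>/ { if (rec != "") print rec; rec = $0; next }
            { rec = rec "*" $0 }
       END  { if (rec != "") print rec }' |
  # 2) keep lines longer than 250 characters (header included) and prefix the length
  awk 'length($0) > 250 { print length($0), $0 }' |
  # 3) sort by that length, then drop the prefix (first space-separated field only)
  sort -n | cut -d' ' -f2- |
  # 4) turn each * back into a newline to restore the FASTA layout
  tr '*' '\n' > fasta_demo/filtered_sorted.fasta

# Count the records that survived the filter.
grep -c '^>' fasta_demo/filtered_sorted.fasta
```

On the toy data only the two long records should survive, shortest first. Because cut removes just the first field, headers that contain spaces come through unchanged.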