Bioinformatics : September 2013

I am doing a three-way comparison of repeat elements in three species of anole lizards. All three genomes were sequenced with Illumina next-generation technology. One challenge I face is moving the data around and formatting it properly. I had to do some Googling to find the appropriate bash commands for processing the FASTA files from the terminal command line.

Below I:
1) cat - concatenate all the files into one large file.
2) Put each record on a single line. I want to filter by line length, but in a normal FASTA file the header is a separate line from the sequence body, so a length filter would delete every line under 250 characters INCLUDING the FASTA headers, which have all the information I need! Therefore I replace every newline (\n) with a * so that a header and its sequence are read as one long line.
3) Use awk to sort the records by length and remove everything shorter than 250 base pairs (strictly, 250 characters on the joined line, header included).
4) Put the FASTA format back the way it was before I sorted by length.
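
For anyone who wants to try steps 2 to 4 on a toy file first, here is a minimal sketch of them chained into one pipeline with no intermediate files. The function name filter_fasta_by_line_length and the toy records are made up for illustration, and the joining is done with awk instead of tr plus sed, so treat it as one possible variant rather than the exact commands used in this post.

```shell
# Hypothetical helper (name and arguments are assumptions, not from the post):
# keeps records whose one-line form is at least min_len characters, shortest first.
filter_fasta_by_line_length() {
    local min_len="$1" fasta="$2"
    # Step 2: one record per line, with "*" standing in for the original newlines.
    awk '/^>/ { if (rec != "") print rec; rec = $0; next }
         { rec = rec "*" $0 }
         END { if (rec != "") print rec }' "$fasta" |
    # Step 3: prefix each kept line with its length, then sort numerically.
    awk -v min="$min_len" 'length($0) >= min { print length($0), $0 }' |
    sort -n |
    # Step 4: strip the length prefix and turn "*" back into newlines.
    sed 's/^[0-9]* //' |
    tr '*' '\n'
}

# Toy input: one short record and one longer, two-line record.
demo=$(mktemp)
printf '>short one\nACGT\n>long two\nACGTACGT\nACGTACGT\n' > "$demo"
filter_fasta_by_line_length 20 "$demo"
rm -f "$demo"
```

With a cutoff of 20 only the "long two" record survives, because its joined line is 27 characters long while the "short one" line is only 15. For the real data the cutoff would be 250.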

#Concatenate the files from all 3 species into one sub-subfamily-specific file
cat species1.fasta species2.fasta species3.fasta > /Desktop/masters_repeats/anolis_repeatfamily_3way_total.fasta
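
One thing worth checking after cat: if an input file does not end with a newline, its last sequence line gets glued to the first header of the next file and a record silently disappears. A minimal sketch of a sanity check, assuming headers always start with > at the beginning of a line (the function name check_fasta_concat is made up for illustration):

```shell
# Hypothetical helper: succeeds only if the combined file has as many
# header lines as all of the input files put together.
check_fasta_concat() {
    local combined="$1"; shift
    local expected=0 n f
    for f in "$@"; do
        n=$(grep -c '^>' "$f")          # number of headers in this input file
        expected=$((expected + n))
    done
    [ "$(grep -c '^>' "$combined")" -eq "$expected" ]
}
```

Usage would look like: check_fasta_concat combined.fasta species1.fasta species2.fasta species3.fasta && echo "record counts match" (with combined.fasta standing in for whatever the cat output file is called).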

#I have to make all the entries one line, so I replace every newline with a *
tr "\n" "*" < /Desktop/masters_repeats/anolis_repeatfamily_3way_total.fasta > /Desktop/masters_repeats/anolis_repeatfamily_3way_total_oneline.txt

#Start a new line before every > so that each record (header plus sequence) sits on its own line
sed -e 's/*>/\'$'\n>/g' /Desktop/masters_repeats/anolis_repeatfamily_3way_total_oneline.txt > /Desktop/masters_repeats/anolis_repeatfamily_3way_total_oneline2.txt

#Sort based on size: awk puts the length in front of each line and sort -n orders the lines by that number
awk '{ print length(), $0 | "sort -n" }' /Desktop/masters_repeats/anolis_repeatfamily_3way_total_oneline2.txt > /Desktop/masters_repeats/anolis_repeatfamily_3way_total_SORTEDlength.txt

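#I then remove all the lines that are under 250 characters. The test uses the length field ($1), because the length of the whole line now also counts the number and space added by the previous step
awk '$1 >= 250' /Desktop/masters_repeats/anolis_repeatfamily_3way_total_SORTEDlength.txt > /Desktop/masters_repeats/anolis_repeatfamily_3way_total_SORTEDlength_over250.txt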

#Remove the number and the space that the awk sorting step added to the beginning of each line. Matching only the leading digits leaves any spaces inside the FASTA header untouched. (There may be a less verbose option, but I couldn't find it.)

sed 's/^[0-9]* //' /Desktop/masters_repeats/anolis_repeatfamily_3way_total_SORTEDlength_over250.txt > /Desktop/masters_repeats/anolis_repeatfamily_3way_total_SORTEDlength_over250_clean.txt

#I replace each star with a newline again, so the FASTA header (the line starting with >) is on its own line with the sequence below it.

tr "*" "\n" < /Desktop/masters_repeats/anolis_repeatfamily_3way_total_SORTEDlength_over250_clean.txt > /Desktop/masters_repeats/anolis_repeatfamily_3way_total_SORTEDlength_over250_clean_expandedw-headers.fasta
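
On the "less verbose option": if the sorting is not needed and only the 250 cutoff matters, a single awk pass can do the filtering while counting only the bases, not the header characters. This is a minimal sketch of that alternative, not the method used above; the function name filter_fasta_by_bases is made up for illustration, and it keeps records in their input order.

```shell
# Hypothetical helper: prints records with at least min_bp bases, in input order.
filter_fasta_by_bases() {
    local min_bp="$1" fasta="$2"
    awk -v min="$min_bp" '
        # Print the record collected so far if it is long enough.
        function emit() {
            if (header != "" && n >= min) printf "%s\n%s", header, body
        }
        /^>/ { emit(); header = $0; body = ""; n = 0; next }   # new record starts
        { body = body $0 "\n"; n += length($0) }               # sequence line
        END { emit() }                                         # last record
    ' "$fasta"
}
```

Usage would look like: filter_fasta_by_bases 250 input.fasta > output.fasta (both file names are placeholders). Because the header is not counted, this can keep slightly fewer records than the line-length version above.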

My final product is ready to be uploaded into Geneious for alignments and tree-building.

Thursday, September 26, 2013

Sorting Fasta files