Tuesday, December 18, 2012

Text Manipulation via Bash commands and Blast

Text Manipulation via Bash commands and Blast

A very common problem in bioinformatics is how to efficiently and effectively pull the information you want from large, sprawling data files. You could search online for pre-made Perl scripts or write your own Perl script to do this, but often it can be time-consuming and doesn't have a good cost:benefit ratio. Alternatively, you can learn how to do text manipulation using the Terminal, with sed commands and using MPIBlast. For instance, you can do text manipulation, using a blast database to pull your data. The idea is that you format a database, that contains the data in full. Then you blast, or send a query against the database. The query will have a list of information you want mpiblast to pull from the full dataset and output to a file. The query file is a text file that contains information identifying what sequences, etc. you want to pull out.

MPIBlast is an important and useful program to have. It is also a real pain to install on your computer.
To see if its installed already, open the terminal and type which mpiblast. I did this and got back:

/usr/local/bin/mpiblast

This tells me I have it installed somewhere on my Mac. This location just points to the program, but you want to see where the folder is, do a search. If you are lucky and have MPIBlast installed, then make sure that MPIBlast is in your path, but adding it to your bash profile. Your bash profile is usually in your home folder. If you are using Ubuntu it is called .bash_src. You can locate this by opening the terminal and cd into your home folder. Type ls -a to see all the hidden files, such as ones beginning with a '.' On a Mac:

cd /Users/username
ls -a
emacs .bash_profile
Add to the end:
PATH=$PATH\:/Applications/mpiblast-1.6.0/ncbi/bin ; export PATH
control+x+c
y     (to save)
source ~/.bash_profile  (reloads your bash profile while your Terminal is still open).

This strategy of text manipulation is a multi-step process, which will result in a series of output files that progressively move towards the final output you want. I am going to take a sample task in which I need to pull certain data from a large text file. Thankfully, a colleague was here to show me how to do this in a quick and efficient manner. Otherwise, I could easily waste many hours trying to Google ways to text manipulate. The sed command can be used in the Terminal to quickly and easily manipulate text. (There are tutorials online on how to use sed commands). You can either print to the screen or use the > fileoutput.txt to output the data to a file.

I have two major steps to do: reformat both the input file (containing the data I want to pull) and the fasta file. Then I can build a database. First off, I have a large file containing repeat elements, named consensi.fa.classified. I want to pull the sequences for the repeat elements labeled "Unknown." I have attached this sample file to this post.

The entries in this file look like:

>rnd-2_family-448#DNA ( Recon Family Size = 52 Final Multiple Alignment Size = 44 )
GGTGCATCTGCGGCAGGCATCCTCGGAGGCTGTGAGGTCTGGAGTTGTGG
GANTTGAAGTCCAAAACACCCGGAGGGCCGAAGTTTGCCCGTGCCTGATT
GACACTGTAGAATTAATGCAGTTTGACACCACTTTAACTGCCATGGCTCA
ATGCTATGGAATCCTGGGAGTTGTAGTTTGGTGAGGCACCAGCACTCTTT
GGCAGAGAAGGCTAAAGACCTCGTAAAACTACAACTCCCACGATTCCATA
GCATTGAGCCATGGCAGTTAAAGTGGTGTCAAACCGCATTAATTCTACAG
TGTAG

>rnd-2_family-1258#Unknown ( Recon Family Size = 45 Final Multiple Alignment Size = 43 )
AGGGGTGCTCACACTTTTTCAGCATGCGAGCTACTTATAAAACAAC

>rnd-2_family-160#LINE/RTE-BovB ( Recon Family Size = 53 Final Multiple Alignment Size = 41 )
GATTGATAAAGAGATAGATCACAGATTAGCAAAGGCATANAGTGCATTCG
GAAGGCTTCACAAAAGAGTCTGGAGCAACAAGCACCTGAGGCAAAGCACC
AAAATCAGTGTGTACAGAGCTATCGTACTGTCTATTCTCCTCTATGGGTC
TGAAACATGGGTCACCTATCGCCAACACCTACGACTCCTCGAACGCTTTC
ATCAGCGCTGTCTTCGCACAATCCTAAATATACACTGGACCGACTATGTG
ACGAATGTTGCTGTCCTTGAGCAAGCAGGGATCACCAGCGTTGAGGCCAT

Each entry is between a > and a >. The first line has information which acts as an identifier. It is difficult to create a script to pull the information from > to >, for every first line containing the word "Unknown." So what we will do is to first pull ONLY the first line for every entry with the word "Unknown."

I used the sed command:
sed -n '/Unknown/p' /Volumes/scratch/results/consensi.fa.classified > temp1.txt

I did this and got the output file temp1.txt, which looks like this:

rnd-2_family-1258#Unknown ( Recon Family Size = 45 Final Multiple Alignment Size = 43 )
rnd-2_family-144#Unknown ( Recon Family Size = 43 Final Multiple Alignment Size = 41 )
rnd-2_family-295#Unknown ( Recon Family Size = 44 Final Multiple Alignment Size = 40 )

When I try to run fastacmd, I get a long list of errors:

fastacmd -p F -o temp2.txt -i /Volumes/scratch/results/temp1.txt -d /Volumes/scratch/TEs/results/consensi.fa.classified

[fastacmd] ERROR: Entry "(" not found
[fastacmd] ERROR: Entry "recon" not found
[fastacmd] ERROR: Entry "family" not found
[fastacmd] ERROR: Entry "size" not found
[fastacmd] ERROR: Entry "=" not found
[fastacmd] ERROR: Entry "45" not found

Now, in sed, sometimes ( ) can cause confusion, as do spaces, so we want to get rid of them. 

We can do this by trading a space for a tab.

The command is:
sed 's/ /    /' /Volumes/scratch/results/consensi.fa.classified > newdb

You have to type control+v+TAB to create the tab in sed.
This sed command will take a space and turn it into a tab.
Tab-delineated files can be easy to work with. 


Note: Shortcut keys vary from system to system. In EMACS, you have to look into key binding to find out what the shortcuts are, or to set them to your desired key combinations and/or with meta-keys.


After the sed command, the output file temp3.txt looks like this:

rnd-2_family-1258#Unknown    ( Recon Family Size = 45 Final Multiple Alignment Size = 43 )
rnd-2_family-144#Unknown    ( Recon Family Size = 43 Final Multiple Alignment Size = 41 )
rnd-2_family-295#Unknown    ( Recon Family Size = 44 Final Multiple Alignment Size = 40 )
rnd-2_family-368#Unknown    ( Recon Family Size = 46 Final Multiple Alignment Size = 40 )

I want to get rid of the extra text: '    ( Recon Family Size = 46 Final Multiple Alignment Size = 40 )', since I only need the Unknown information.

So, I typed the following command to cut the text:
cut -f1 /Volumes/scratch/TEs/results/temp3.txt > temp4.txt

Then I got an output file called temp4.txt which looks like this:

rnd-2_family-1258#Unknown
rnd-2_family-144#Unknown
rnd-2_family-295#Unknown
rnd-2_family-368#Unknown

This particular cut command will delete everything after a tab. So if there are spaces or strange formatting, I do a replace all with a tab. (ie. change spaces behind Unknown to a tab behind Unknown).

Then you will need to use two commands: formatdb and fastcmd. If you are in doubt about how to use these tools, you can always type the command then help. Help varies from program to program, but in general its either '-h' or '--h' or '-help' and sometimes '--help'. For instance, to find out how to use mpiblast's formatdb command, type:

formatdb --help

This gives me a long list of options that I need/can specify. Using this knowledge, I typed the following commands:

formatdb -t repeat2.txt  -p false -o T -i /Volumes/scratch/results/finaldb

Similarly, to figure out how to use mpiblast's fastacmd command, I typed:

fastacmd --help
fastacmd -p F -o 5Jul_Unknown_Elements.txt -i /Volumes/scratch/TEs/results/temp4.txt -d  /Volumes/scratch/results/finaldb


I get an output file that, for some reason, has two things added to it now that I don't want:

>lcl|rnd-2_family-1258#Unknown No definition line found
AGGGGTGCTCACACTTTTTCAGCATGCGAGCTACTTATAAAACAAC
>lcl|rnd-2_family-144#Unknown No definition line found
AATGATGTAGACCAGGCATGGGCAAACTTGGGCCCTCCAGGTGTTTTGGACTTCAGGCGTTTTGGACTTCAANTCCCAGA
ATTCCTAGCAGCCTGCAGGCTGTTAGGAATTGTGGGAGTTGAAGTCCAAAACACCTGGAGGGCCGAAGTTTGCCCATGCC
TGATCTACACTGTAGAATTAATGCAGTTTGACACCACTTTAACTGCCATGGCTCAACGCTATGGAATCCTGGGAGTTGTA
GTTTGGCGAGGCACCAGCACTCTTTGGCAGAGAAGGCTAAAGAAAACTACAACTCCCATGATTCCATAGCAGTTGAAGTG
NTGTCAAACTGCATCAATTGTACAGNTGTGT
>lcl|rnd-2_family-295#Unknown No definition line found
AGAACATAAGAACATAAGAAGAGCCATGCTGAATCGGGCCTCAGCCC
>lcl|rnd-2_family-368#Unknown No definition line found
GTACACTGTAGAATTAATGCAGTTTGACACCACTTTAACTGCCATGGCTCAATGCTATGGAATCATGGGAGTTGTAGTTT
TACAAGGTCTTCTCTGCCAAAGAGTGCTGGTGCCTCGCCAAACTACAANTCCCAGGATCCCATAGCATTGAGCCATGGCA
GTTAAAGTGGTGTCAAACTGCATTAATTCTACAGTGTAG


In the text editor, I do a simple find and replace for '>lcl|' to nothing and 'No definition line found' to nothing.
This gets rid of the extra text and gives me my final output:

rnd-2_family-1258#Unknown
AGGGGTGCTCACACTTTTTCAGCATGCGAGCTACTTATAAAACAAC
rnd-2_family-144#Unknown
AATGATGTAGACCAGGCATGGGCAAACTTGGGCCCTCCAGGTGTTTTGGACTTCAGGCGTTTTGGACTTCAANTCCCAGA
ATTCCTAGCAGCCTGCAGGCTGTTAGGAATTGTGGGAGTTGAAGTCCAAAACACCTGGAGGGCCGAAGTTTGCCCATGCC
TGATCTACACTGTAGAATTAATGCAGTTTGACACCACTTTAACTGCCATGGCTCAACGCTATGGAATCCTGGGAGTTGTA
GTTTGGCGAGGCACCAGCACTCTTTGGCAGAGAAGGCTAAAGAAAACTACAACTCCCATGATTCCATAGCAGTTGAAGTG
NTGTCAAACTGCATCAATTGTACAGNTGTGT
rnd-2_family-295#Unknown
AGAACATAAGAACATAAGAAGAGCCATGCTGAATCGGGCCTCAGCCC
rnd-2_family-368#Unknown
GTACACTGTAGAATTAATGCAGTTTGACACCACTTTAACTGCCATGGCTCAATGCTATGGAATCATGGGAGTTGTAGTTT
TACAAGGTCTTCTCTGCCAAAGAGTGCTGGTGCCTCGCCAAACTACAANTCCCAGGATCCCATAGCATTGAGCCATGGCA
GTTAAAGTGGTGTCAAACTGCATTAATTCTACAGTGTAG
rnd-2_family-267#Unknown
CTTGGCNTTNCTTCGCGAACGAAGATTCTTAGGAAGGACTCATCCCACGCTTTCTACAAGCGCGTTGATGATTCGGAGGT
TGGATGTGTCCTTCGAGCCTGTGCAATGGATTTTTCTGGTGGAGCGCAGCGTGCACAGAACTGGCCTCACCCTTTAAACC
AAAGGTTCGTCTGCCGGGGCCTAGCAAGCTTGGACGGTGGCAGCGAGGTCCTCAGGTCGTAGGTTTGGTTCGAGT

Another great tool is cat! For instance, I had two large files about ~1 GB each that I needed to be in one file. With smaller files, I could copy and paste the text of one file into the text of the other, etc. But this is not really possible with files this large. Using the bash command cat (concatenate) is the easy answer. 

cat    file1.txt    file2.txt    > /Users/username/Desktop/appended_file.txt

Now, I have an appended file with all the data I need.

I am just learning how to do work with these tools, but I hope this helps you out!

No comments:

Post a Comment