Bioinformatics : Welcome to Bash

Welcome to Bash

What is Bash? Bash is an acronym for "Bourne-Again SHell." Bash is the shell, accessible via the Terminal that acts as a command language interpreter. Bash is what lets you type commands into the Terminal to make things happen. You are probably already familiar with the Bash commands like:

cd (change directory)
ls (list files)
pwd (print working directory)
mkdir (make a new directory)
cp (copy)
mv (move or rename a file)
more (display output one screen at a time)
jobs (show jobs)
bg (send a job to the background)
fg (send a job to the foreground)
kill (stop a job)
echo (display something like echo $PATH)
chmod (change permissions)
ssh (secure Shell client - remote login program)
which (search the user's $path for a program file, eg. which perl)

Beyond these basic Bash commands are some for manipulating text:

awk      Find and Replace text, database sort/validate/index
cat        Concatenate and print (display) the content of files
egrep    Search file(s) for lines that match an extended expression
fgrep    Search file(s) for lines that match a fixed string
gawk     Find and Replace text within file(s)
grep     Search file(s) for lines that match a given pattern
look     Display lines beginning with a given string
sed      Stream Editor
sort     Sort text files
tr          (translate, squeeze, and/or delete characters)
wc       Print byte, word, and line counts

To learn about more Bash commands, click here.

In Terminal, you can write bash commands that are quite useful, such as

grep -c "Gene" /Users/username/Desktop/gene_file.txt

That command would count up every single time the string "Gene" occurs in the file and it will output the number to you (case sensitive). If you wanted to use a large list of genes to make a smaller subset of genes, you could also

grep "myod" /Users/username/Desktop/gene_file.txt > /Users/username/Desktop/myodgenes.txt

That would output a file that only contain the muscle specific myod genes.

But what if you want to print out everything except a certain string of elements? For instance, I had a really long list of elements, over 50,000. I wanted to make sure I had each type of element (eg. L1-1_Acar) It would be impractical to scroll through to catalog each type. For example, there are only 8 different types of LINE repeat elements listed below, but in the full file there are over 30 types interspersed and some elements are repeated a lot more.

422   18.7 0.0 8.2 1               6079081   6079185 (257841273) C     L1-1_Acar         LINE/L1            (117) 5470   5374    7560
1077   15.7 6.1 3.0 1              6093682   6093944 (257826514) +     L1_AC_17          LINE/L1             4613   4883 (527)    7570
867   18.0 0.0 0.0 1               6094716   6094887 (257825571) +     L1_AC_17          LINE/L1             4959   5130 (280)    7570 *
534   16.6 0.7 10.1 1               6095318   6095458 (257825000) +    L1_AC_17           LINE/L1             5129   5257 (153)    7570 *
8699   29.8 1.1 1.1 1              6185694   6189641 (257730817) +    L1-28_ACar         LINE/L1             35     6162   (86)    7682
242   25.8 4.8 0.0 1               6230638   6230699 (257689759) +     L1_AC_18          LINE/L1             3488   3552 (1971)    7740
534   16.6 0.7 10.1 1               6095318   6095458 (257825000) +    L1_AC_17           LINE/L1             5129   5257 (153)    7570 *
534   16.6 0.7 10.1 1               6095318   6095458 (257825000) +    L1_AC_17           LINE/L1             5129   5257 (153)    7570 *
534   16.6 0.7 10.1 1               6095318   6095458 (257825000) +    L1_AC_17           LINE/L1             5129   5257 (153)    7570 *
534   16.6 0.7 10.1 1               6095318   6095458 (257825000) +    L1_AC_17           LINE/L1             5129   5257 (153)    7570 *
2706   30.1 3.1 2.4 1              6256710   6258043 (257662415) C    L1_AC_16           LINE/L1            (2892) 3237   1895    7770
725   23.2 4.3 2.6 1               6342819   6343048 (257577410) C    L1_AC_15           LINE/L1             (294)   5869   5636    7826
534   16.6 0.7 10.1 1               6095318   6095458 (257825000) +    L1_AC_17           LINE/L1             5129   5257 (153)    7570 *
534   16.6 0.7 10.1 1               6095318   6095458 (257825000) +    L1_AC_17           LINE/L1             5129   5257 (153)    7570 *
534   16.6 0.7 10.1 1               6095318   6095458 (257825000) +    L1_AC_17           LINE/L1             5129   5257 (153)    7570 *
241   16.1 6.1 6.1 1               6373497   6373562 (257546896) C    L1_AC_3            LINE/L1             (438)   5866   5801    7853
534   16.6 0.7 10.1 1               6095318   6095458 (257825000) +    L1_AC_17           LINE/L1             5129   5257 (153)    7570 *
534   16.6 0.7 10.1 1               6095318   6095458 (257825000) +    L1_AC_17           LINE/L1             5129   5257 (153)    7570 *
534   16.6 0.7 10.1 1               6095318   6095458 (257825000) +    L1_AC_17           LINE/L1             5129   5257 (153)    7570 *
534   16.6 0.7 10.1 1               6095318   6095458 (257825000) +    L1_AC_17           LINE/L1             5129   5257 (153)    7570 *
298   28.8 11.6 3.5 1               6513640   6513984 (257406474) +    L1_AC_9            LINE/L1             5476   5847 (564)    7961
848   27.3 3.2 4.1 1               6570802   6571363 (257349095) +    L1_AC_16           LINE/L1             4711   5267 (862)    7999 *
924   26.4 0.0 1.0 1               6571344   6571638 (257348820) +    L1_AC_16           LINE/L1             5638   5929 (200)    7999

In order to compile a list, I want to first subtract out the most repeated elements. In this case, L1_AC_17 is repeated so often its difficult to find the other elements.

sed '/L1_AC_17/d' /Users/username/Desktop/lines.txt > /Users/username/Desktop/lines_allexcept.txt

This will give me an updated list:

422   18.7 0.0 8.2 1               6079081   6079185 (257841273) C     L1-1_Acar         LINE/L1            (117) 5470   5374    7560
8699   29.8 1.1 1.1 1              6185694   6189641 (257730817) +    L1-28_ACar         LINE/L1             35     6162   (86)    7682
242   25.8 4.8 0.0 1               6230638   6230699 (257689759) +     L1_AC_18          LINE/L1             3488   3552 (1971)    7740
2706   30.1 3.1 2.4 1              6256710   6258043 (257662415) C    L1_AC_16           LINE/L1            (2892) 3237   1895    7770
725   23.2 4.3 2.6 1               6342819   6343048 (257577410) C    L1_AC_15           LINE/L1             (294)   5869   5636    7826
241   16.1 6.1 6.1 1               6373497   6373562 (257546896) C    L1_AC_3            LINE/L1             (438)   5866   5801    7853
298   28.8 11.6 3.5 1               6513640   6513984 (257406474) +    L1_AC_9            LINE/L1             5476   5847 (564)    7961
848   27.3 3.2 4.1 1               6570802   6571363 (257349095) +    L1_AC_16           LINE/L1             4711   5267 (862)    7999 *
924   26.4 0.0 1.0 1               6571344   6571638 (257348820) +    L1_AC_16           LINE/L1             5638   5929 (200)    7999

Without the L1_AC_17's I can clearly see more element types, such as L1-1_Acar, L1-28_ACar, L1_AC_18, L1_AC_16, etc. As I compile my list, I want to start eliminating the genes I have already marked down, so that I know the list includes every element. Since my list is 50,000 long, I start constructing multiple sed commands in a row.

sed   -e '/L1-1_Acar/d'    -e '/L1-28_ACar/d'    -e '/L1_AC_18/d'    -e '/L1_AC_16/d'   -e '/L1_AC_15/d'   -e '/L1_AC_3/d'   -e '/L1_AC_9/d'   /Users/username/Desktop/lines.txt > /Users/username/Desktop/lines_allexcept2.txt

At this point, I see a new file pop up on the Desktop that is an empty text file. Now, I know that my list of elements is comprehensive.

Using bash commands in Terminal is quite useful, but you can go beyond this and write bash scripts to automate your processes. To write a bash script, just open up new file in TextEditor, save it as name.sh and put the header #!/bin/bash at the beginning.

#!/bin/bash

sed -e '/R2/d' -e '/LOA/d' -e '/telomeric/d' -e '/R1/d' -e '/I\ /d' -e '/MaLR/d' -e '/SINE?/d' -e '/LINE?/d' -e '/LTR?/d' -e '/ERV/d' -e '/Low_complexity/d' -e '/hAT-Ac/d' -e '/Satellite/d' -e '/Harbinger/d' -e '/hAT-Charlie/d' -e '/DNA/d' -e '/hAT-Tip100/d' -e '/TcMar-Tc1/d' -e '/TcMar-Tc2/d' -e '/TcMar-Mariner/d' -e '/Maverick/d' -e '/Penelope/d' -e '/Chapaev/d' -e '/Sola/d' -e '/Harbinger/d' -e '/Helitron/d' -e '/Piggybac/d' -e '/SINE\ /d' -e '/LINE\ /d' -e '/LTR\ /d' -e '/Sauria/d' -e '/Tigger/d' -e '/RTE-BovB/d' -e '/RTE-X/d' -e '/Jockey/d' -e '/Gypsy/d' -e '/Copia/d' -e '/Ngaro/d' -e '/MIR/d' -e '/Rex-Babar/d' -e '/LTR\ /d' -e '/Pao/d' -e '/ERV1/d' -e '/ERVK/d' -e '/ERVL/d'   -e '/DIRS/d' -e '/Blackjack/d' -e '/RNA/d' -e '/rRNA/d' -e '/Kolobok/d' -e '/EnSpm/d' -e '/Simple_repeat/d' -e '/CR1/d' -e '/RC/d' -e'/tRNA/d' -e '/scRNA/d' -e '/snrNA/d' -e '/Unknown/d' -e '/Chompy/d' -e '/RTE/d' -e'/Dong-R4/d' -e '/LINE\ /d' -e '/Deu/d' -e '/B2/d' -e '/ID/d' -e '/B4/d' -e '/5S/d' -e '/Alu/d' -e '/MuDR/d' -e '/MULE-MuDR/d' -e '/L1/d' -e '/L2/d' /Users/username/Desktop/species_comp_repeats/acar2.fa.out    > /Users/username/Desktop/acar2.txt

The 'shebang' #! (# sh, ! bang) at the header of the file allows you to run the script from the Terminal, as you would for cd or ls. After that, move the file to the /usr/sbin folder (lose the .sh). To make it executable, you can type chmod u+x script-name, or you can do chmod 777 name. Since my bash script was named allexecept, I simply type allexcept into the terminal and the script will run, removing the 90 or so elements from my file. After I automated this process, I was able to get information on 7 other species and create a massive Excel spreadsheet that I otherwise would never have time to make. In addition, when I get new assemblies or updating data I can run the script quickly instead of having to gather all that data all over again!

Another script I wrote is called greprep.sh. It stands for grep repeats. Again, my file contained over 100,000 elements, but I wanted to get information about specific subfamilies of repeats. I wrote the script to tell me how many copy numbers the element has. For instance, Mariner appears 72,898 times!

For each element, I just wrote:

STRING12="DNA/TcMar-Mariner"

echo $STRING12

grep -c "DNA/TcMar-Mariner" /Users/username/Desktop/species_comp_repeats/acar2.fa.out

The bash script greprep.sh is attached. I also wanted to be able to copy a column of the numbers, so I could easily paste them into Excel. I created another bash script called greprepnum.sh that just displays the numbers (not the names using echo). There is probably an easier way to do this, as there is always an easier way to do everything in scripting. ie creating an output Excel file that is already formatted the way you need it. But it takes too long sometimes to create these "shortcuts." With writing bash scripts there are so many different ways to do any one thing, and you can waste all day trying to get your perfect product. Choose your battles wisely!

Another thing that can be done with bash scripts is putting different commands into one script, such as:

grep "Unknown" /Users/username/Desktop/names.txt | sed -e '/Other/d' | awk ' '{print $1}'   file | perl fastqconversion.pl   fa2std   acar.fastq

You can separate commands with a | and kind of make your own pipeline that can include several different transformations or processing of the data. This is a very useful utility for the kind of data we get in bioinformatics! Remember Google is your friend, when it comes to finding scripts or script commands (but it may take some tweaking to get things up and running).

Good luck.

Bioinformatics

Tuesday, December 18, 2012

Welcome to Bash

No comments:

Post a Comment