Welcome to Bash
What is Bash? Bash is an acronym for "Bourne-Again SHell." Bash is the shell, accessible via the Terminal that acts as a command language interpreter. Bash is what lets you type commands into the Terminal to make things happen. You are probably already familiar with the Bash commands like:
cd (change directory)
ls (list files)
pwd (print working directory)
mkdir (make a new directory)
cp (copy)
mv (move or rename a file)
more (display output one screen at a time)
jobs (show jobs)
bg (send a job to the background)
fg (send a job to the foreground)
kill (stop a job)
echo (display something like echo $PATH)
chmod (change permissions)
ssh (secure Shell client - remote login program)
which (search the user's $path for a program file, eg. which perl)
Beyond these basic Bash commands are some for manipulating text:
awk Find and Replace text, database sort/validate/index
cat Concatenate and print (display) the content of files
egrep Search file(s) for lines that match an extended expression
fgrep Search file(s) for lines that match a fixed string
gawk Find and Replace text within file(s)
grep Search file(s) for lines that match a given pattern
look Display lines beginning with a given string
sed Stream Editor
sort Sort text files
tr (translate, squeeze, and/or delete characters)
wc Print byte, word, and line counts
To learn about more Bash commands, click here.
In Terminal, you can write bash commands that are quite useful, such as
grep -c "Gene" /Users/username/Desktop/gene_file.txt
That command would count up every single time the string "Gene" occurs in the file and it will output the number to you (case sensitive). If you wanted to use a large list of genes to make a smaller subset of genes, you could also
grep "myod" /Users/username/Desktop/gene_file.txt > /Users/username/Desktop/myodgenes.txt
That would output a file that only contain the muscle specific myod genes.
But what if you want to print out everything except a certain string of elements? For instance, I had a really long list of elements, over 50,000. I wanted to make sure I had each type of element (eg. L1-1_Acar) It would be impractical to scroll through to catalog each type. For example, there are only 8 different types of LINE repeat elements listed below, but in the full file there are over 30 types interspersed and some elements are repeated a lot more.
422 18.7 0.0 8.2 1 6079081 6079185 (257841273) C L1-1_Acar LINE/L1 (117) 5470 5374 7560
1077 15.7 6.1 3.0 1 6093682 6093944 (257826514) + L1_AC_17 LINE/L1 4613 4883 (527) 7570
867 18.0 0.0 0.0 1 6094716 6094887 (257825571) + L1_AC_17 LINE/L1 4959 5130 (280) 7570 *
534 16.6 0.7 10.1 1 6095318 6095458 (257825000) + L1_AC_17 LINE/L1 5129 5257 (153) 7570 *
8699 29.8 1.1 1.1 1 6185694 6189641 (257730817) + L1-28_ACar LINE/L1 35 6162 (86) 7682
242 25.8 4.8 0.0 1 6230638 6230699 (257689759) + L1_AC_18 LINE/L1 3488 3552 (1971) 7740
534 16.6 0.7 10.1 1 6095318 6095458 (257825000) + L1_AC_17 LINE/L1 5129 5257 (153) 7570 *
534 16.6 0.7 10.1 1 6095318 6095458 (257825000) + L1_AC_17 LINE/L1 5129 5257 (153) 7570 *
534 16.6 0.7 10.1 1 6095318 6095458 (257825000) + L1_AC_17 LINE/L1 5129 5257 (153) 7570 *
534 16.6 0.7 10.1 1 6095318 6095458 (257825000) + L1_AC_17 LINE/L1 5129 5257 (153) 7570 *
2706 30.1 3.1 2.4 1 6256710 6258043 (257662415) C L1_AC_16 LINE/L1 (2892) 3237 1895 7770
725 23.2 4.3 2.6 1 6342819 6343048 (257577410) C L1_AC_15 LINE/L1 (294) 5869 5636 7826
534 16.6 0.7 10.1 1 6095318 6095458 (257825000) + L1_AC_17 LINE/L1 5129 5257 (153) 7570 *
534 16.6 0.7 10.1 1 6095318 6095458 (257825000) + L1_AC_17 LINE/L1 5129 5257 (153) 7570 *
534 16.6 0.7 10.1 1 6095318 6095458 (257825000) + L1_AC_17 LINE/L1 5129 5257 (153) 7570 *
241 16.1 6.1 6.1 1 6373497 6373562 (257546896) C L1_AC_3 LINE/L1 (438) 5866 5801 7853
534 16.6 0.7 10.1 1 6095318 6095458 (257825000) + L1_AC_17 LINE/L1 5129 5257 (153) 7570 *
534 16.6 0.7 10.1 1 6095318 6095458 (257825000) + L1_AC_17 LINE/L1 5129 5257 (153) 7570 *
534 16.6 0.7 10.1 1 6095318 6095458 (257825000) + L1_AC_17 LINE/L1 5129 5257 (153) 7570 *
534 16.6 0.7 10.1 1 6095318 6095458 (257825000) + L1_AC_17 LINE/L1 5129 5257 (153) 7570 *
298 28.8 11.6 3.5 1 6513640 6513984 (257406474) + L1_AC_9 LINE/L1 5476 5847 (564) 7961
848 27.3 3.2 4.1 1 6570802 6571363 (257349095) + L1_AC_16 LINE/L1 4711 5267 (862) 7999 *
924 26.4 0.0 1.0 1 6571344 6571638 (257348820) + L1_AC_16 LINE/L1 5638 5929 (200) 7999
In order to compile a list, I want to first subtract out the most repeated elements. In this case, L1_AC_17 is repeated so often its difficult to find the other elements.
sed '/L1_AC_17/d' /Users/username/Desktop/lines.txt > /Users/username/Desktop/lines_allexcept.txt
This will give me an updated list:
422 18.7 0.0 8.2 1 6079081 6079185 (257841273) C L1-1_Acar LINE/L1 (117) 5470 5374 7560
8699 29.8 1.1 1.1 1 6185694 6189641 (257730817) + L1-28_ACar LINE/L1 35 6162 (86) 7682
242 25.8 4.8 0.0 1 6230638 6230699 (257689759) + L1_AC_18 LINE/L1 3488 3552 (1971) 7740
2706 30.1 3.1 2.4 1 6256710 6258043 (257662415) C L1_AC_16 LINE/L1 (2892) 3237 1895 7770
725 23.2 4.3 2.6 1 6342819 6343048 (257577410) C L1_AC_15 LINE/L1 (294) 5869 5636 7826
241 16.1 6.1 6.1 1 6373497 6373562 (257546896) C L1_AC_3 LINE/L1 (438) 5866 5801 7853
298 28.8 11.6 3.5 1 6513640 6513984 (257406474) + L1_AC_9 LINE/L1 5476 5847 (564) 7961
848 27.3 3.2 4.1 1 6570802 6571363 (257349095) + L1_AC_16 LINE/L1 4711 5267 (862) 7999 *
924 26.4 0.0 1.0 1 6571344 6571638 (257348820) + L1_AC_16 LINE/L1 5638 5929 (200) 7999
Without the L1_AC_17's I can clearly see more element types, such as L1-1_Acar, L1-28_ACar, L1_AC_18, L1_AC_16, etc. As I compile my list, I want to start eliminating the genes I have already marked down, so that I know the list includes every element. Since my list is 50,000 long, I start constructing multiple sed commands in a row.
sed -e '/L1-1_Acar/d' -e '/L1-28_ACar/d' -e '/L1_AC_18/d' -e '/L1_AC_16/d' -e '/L1_AC_15/d' -e '/L1_AC_3/d' -e '/L1_AC_9/d' /Users/username/Desktop/lines.txt > /Users/username/Desktop/lines_allexcept2.txt
At this point, I see a new file pop up on the Desktop that is an empty text file. Now, I know that my list of elements is comprehensive.
Using bash commands in Terminal is quite useful, but you can go beyond this and write bash scripts to automate your processes. To write a bash script, just open up new file in TextEditor, save it as name.sh and put the header #!/bin/bash at the beginning.
#!/bin/bash
sed -e '/R2/d' -e '/LOA/d' -e '/telomeric/d' -e '/R1/d' -e '/I\ /d' -e '/MaLR/d' -e '/SINE?/d' -e '/LINE?/d' -e '/LTR?/d' -e '/ERV/d' -e '/Low_complexity/d' -e '/hAT-Ac/d' -e '/Satellite/d' -e '/Harbinger/d' -e '/hAT-Charlie/d' -e '/DNA/d' -e '/hAT-Tip100/d' -e '/TcMar-Tc1/d' -e '/TcMar-Tc2/d' -e '/TcMar-Mariner/d' -e '/Maverick/d' -e '/Penelope/d' -e '/Chapaev/d' -e '/Sola/d' -e '/Harbinger/d' -e '/Helitron/d' -e '/Piggybac/d' -e '/SINE\ /d' -e '/LINE\ /d' -e '/LTR\ /d' -e '/Sauria/d' -e '/Tigger/d' -e '/RTE-BovB/d' -e '/RTE-X/d' -e '/Jockey/d' -e '/Gypsy/d' -e '/Copia/d' -e '/Ngaro/d' -e '/MIR/d' -e '/Rex-Babar/d' -e '/LTR\ /d' -e '/Pao/d' -e '/ERV1/d' -e '/ERVK/d' -e '/ERVL/d' -e '/DIRS/d' -e '/Blackjack/d' -e '/RNA/d' -e '/rRNA/d' -e '/Kolobok/d' -e '/EnSpm/d' -e '/Simple_repeat/d' -e '/CR1/d' -e '/RC/d' -e'/tRNA/d' -e '/scRNA/d' -e '/snrNA/d' -e '/Unknown/d' -e '/Chompy/d' -e '/RTE/d' -e'/Dong-R4/d' -e '/LINE\ /d' -e '/Deu/d' -e '/B2/d' -e '/ID/d' -e '/B4/d' -e '/5S/d' -e '/Alu/d' -e '/MuDR/d' -e '/MULE-MuDR/d' -e '/L1/d' -e '/L2/d' /Users/username/Desktop/species_comp_repeats/acar2.fa.out > /Users/username/Desktop/acar2.txt
The 'shebang' #! (# sh, ! bang) at the header of the file allows you to run the script from the Terminal, as you would for cd or ls. After that, move the file to the /usr/sbin folder (lose the .sh). To make it executable, you can type chmod u+x script-name, or you can do chmod 777 name. Since my bash script was named allexecept, I simply type allexcept into the terminal and the script will run, removing the 90 or so elements from my file. After I automated this process, I was able to get information on 7 other species and create a massive Excel spreadsheet that I otherwise would never have time to make. In addition, when I get new assemblies or updating data I can run the script quickly instead of having to gather all that data all over again!
Another script I wrote is called greprep.sh. It stands for grep repeats. Again, my file contained over 100,000 elements, but I wanted to get information about specific subfamilies of repeats. I wrote the script to tell me how many copy numbers the element has. For instance, Mariner appears 72,898 times!
For each element, I just wrote:
STRING12="DNA/TcMar-Mariner"
echo $STRING12
grep -c "DNA/TcMar-Mariner" /Users/username/Desktop/species_comp_repeats/acar2.fa.out
The bash script greprep.sh is attached. I also wanted to be able to copy a column of the numbers, so I could easily paste them into Excel. I created another bash script called greprepnum.sh that just displays the numbers (not the names using echo). There is probably an easier way to do this, as there is always an easier way to do everything in scripting. ie creating an output Excel file that is already formatted the way you need it. But it takes too long sometimes to create these "shortcuts." With writing bash scripts there are so many different ways to do any one thing, and you can waste all day trying to get your perfect product. Choose your battles wisely!
Another thing that can be done with bash scripts is putting different commands into one script, such as:
grep "Unknown" /Users/username/Desktop/names.txt | sed -e '/Other/d' | awk ' '{print $1}' file | perl fastqconversion.pl fa2std acar.fastq
You can separate commands with a | and kind of make your own pipeline that can include several different transformations or processing of the data. This is a very useful utility for the kind of data we get in bioinformatics! Remember Google is your friend, when it comes to finding scripts or script commands (but it may take some tweaking to get things up and running).
Good luck.
No comments:
Post a Comment