Thursday, December 27, 2012

Terminal Shortcuts

While typing commands into the Terminal, you may find it faster to use shortcuts to navigate. First, you can change your system preferences so the cursor moves faster when you hold down an arrow key. On Mac OS X, go to System Preferences > Keyboard > Keyboard tab and turn the Key Repeat Rate up to Fast. This will save a lot of time. Some general Unix shortcuts that can help enormously are:

Ctrl+A    go to the start of the command line
Ctrl+E    go to the end of the command line
Ctrl+W    delete the word before the cursor
Ctrl+U    delete from the cursor to the start of the command line
Ctrl+K    delete from the cursor to the end of the command line
Ctrl+XX   move between the start of the command line and the current cursor position (and back again)

Alt+B (Option+B)    move the cursor to the beginning of the current or previous word
Alt+F (Option+F)    move the cursor to the end of the current or next word
Tab       autocomplete commands and file names (ENORMOUSLY USEFUL!!!)
Ctrl+L    clear the screen

To find old Terminal commands, you can scroll through your history:

Arrow up (or Ctrl+P)    scroll back through the history, where you can edit a previously executed command and press Enter to run it

To search through old commands, press Ctrl+R and then type letters to search.

Ctrl+Z    suspend the current process
fg        bring the suspended process back to the foreground (or type bg to let it keep running in the background)
 

Bioinformatic data sets can be gigabytes or more. There are Perl and Python scripts available that can assist in fast data manipulation, such as pulling certain sequences from a large file, but script writing can take longer than it's worth. You can also use grep and sed commands right in the Terminal to pull information out of a large data file. I have attached general manuals on sed and grep commands to this post. Below is an example that a colleague gave me:

To grab a sequence from a file:


If you are familiar with the command line, you can try grep in the Terminal:

grep "yoursequence" yourfastafile > /Users/name/Desktop/outputfile.txt

Using the '-B 1' option will allow you to grab the header line for this sequence too:

grep -B 1 "yoursequence" yourfastafile > myfinds.txt

If you have a text file with one sequence per line, you can use fgrep:

fgrep -B 1 -f filewithlistofsequences yourfastafile


The grep command can also take an -A, -B, or -C option followed by the number of lines you want to grab. -A will print that many lines after each matching line, -B will print that many lines before each match, and -C will print that many lines of context on both sides of each match.

For instance, grep -A 10 "HOX" hox.txt > hoxfile.fa will find instances of HOX and print the 10 lines after each match. So if my HOX gene sequence is longer than 10 lines, it will truncate the sequence.

If you are searching for a sequence in a very long scaffold, this can be useful:

grep -A 1000 "TATACTGCGATGCTCGCTCGAGAGTATGAGAG" seq.fa > mysequence.txt

These will print output to the screen if the sequence exists (or you can redirect it to a file using '> output.txt'); otherwise there will be no output. But the search only finds exact matches, and will miss sequences that have line breaks in the middle.
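One way around the line-break problem is to join each record's sequence onto a single line before searching. A minimal sketch using awk; wrapped.fa and flat.fa are made-up file names for the demo:

```shell
# Make a small wrapped FASTA file for the demo.
cat > wrapped.fa <<'EOF'
>seq1
TATACTGCGA
TGCTCGCTCG
>seq2
GGGGCCCC
EOF

# For each record, print the header, then the sequence joined onto one line.
awk '/^>/{if(seq)print seq; print; seq=""; next}{seq=seq $0}END{if(seq)print seq}' \
    wrapped.fa > flat.fa

# A motif that straddled the original line break can now be found
# (with -B 1 pulling in the header line):
grep -B 1 "CGATGCTC" flat.fa
```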

Tuesday, December 18, 2012

Welcome to Bash


What is Bash? Bash is an acronym for "Bourne-Again SHell." Bash is the shell, accessible via the Terminal, that acts as a command language interpreter: it is what lets you type commands into the Terminal to make things happen. You are probably already familiar with Bash commands like:

cd (change directory)
ls (list files)
pwd (print working directory)
mkdir (make a new directory)
cp (copy)
mv (move or rename a file)
more (display output one screen at a time)
jobs (show jobs)
bg (send a job to the background)
fg (send a job to the foreground)
kill (stop a job)
echo (display something like echo $PATH)
chmod (change permissions)
ssh (secure Shell client - remote login program)
which (search the user's $PATH for a program file, e.g. which perl)


Beyond these basic Bash commands are some for manipulating text:

awk      Find and replace text; database sort/validate/index
cat      Concatenate and print (display) the contents of files
egrep    Search file(s) for lines that match an extended expression
fgrep    Search file(s) for lines that match a fixed string
gawk     Find and replace text within file(s)
grep     Search file(s) for lines that match a given pattern
look     Display lines beginning with a given string
sed      Stream editor
sort     Sort text files
tr       Translate, squeeze, and/or delete characters
wc       Print byte, word, and line counts
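A quick taste of three of these tools on a scratch file (genes.txt is a made-up example):

```shell
# Small made-up gene list for the demo.
printf 'pax6\nmyod\nshh\nmyod\n' > genes.txt

sort genes.txt              # sort the names alphabetically
wc -l < genes.txt           # count the lines
tr 'a-z' 'A-Z' < genes.txt  # translate the names to upper case
```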


To learn about more Bash commands, click here.


In Terminal, you can write bash commands that are quite useful, such as

grep -c "Gene" /Users/username/Desktop/gene_file.txt

That command counts how many lines in the file contain the string "Gene" (case sensitive) and outputs the number to you. Note that -c counts matching lines, not total occurrences. If you wanted to use a large list of genes to make a smaller subset of genes, you could also:

grep "myod" /Users/username/Desktop/gene_file.txt > /Users/username/Desktop/myodgenes.txt

That would output a file that contains only the muscle-specific myod genes.
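A tiny runnable illustration of both commands, with a made-up gene_file.txt in place of the real path:

```shell
# Made-up gene list for the demo.
cat > gene_file.txt <<'EOF'
Gene myod1 muscle
Gene pax6 eye
gene shh signaling
Gene myod2 muscle
EOF

# -c counts MATCHING LINES, case sensitively, so the lowercase
# "gene" line is not counted here:
grep -c "Gene" gene_file.txt

# Redirect matches to build a smaller subset file:
grep "myod" gene_file.txt > myodgenes.txt
```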

But what if you want to print out everything except lines containing a certain string? For instance, I had a really long list of elements, over 50,000. I wanted to make sure I had catalogued each type of element (e.g. L1-1_Acar), and it would be impractical to scroll through the whole file. In the example below there are only 8 different types of LINE repeat elements, but in the full file there are over 30 types interspersed, and some elements are repeated far more often than others.


422   18.7  0.0  8.2  1               6079081   6079185 (257841273) C     L1-1_Acar         LINE/L1            (117)  5470   5374    7560
1077   15.7  6.1  3.0  1              6093682   6093944 (257826514) +     L1_AC_17          LINE/L1             4613   4883  (527)    7570
867   18.0  0.0  0.0  1               6094716   6094887 (257825571) +     L1_AC_17          LINE/L1             4959   5130  (280)    7570 *
534   16.6  0.7 10.1  1               6095318   6095458 (257825000) +    L1_AC_17           LINE/L1             5129   5257  (153)    7570 *
8699   29.8  1.1  1.1  1              6185694   6189641 (257730817) +    L1-28_ACar         LINE/L1             35     6162   (86)    7682
242   25.8  4.8  0.0  1               6230638   6230699 (257689759) +     L1_AC_18          LINE/L1             3488   3552 (1971)    7740
534   16.6  0.7 10.1  1               6095318   6095458 (257825000) +    L1_AC_17           LINE/L1             5129   5257  (153)    7570 *
534   16.6  0.7 10.1  1               6095318   6095458 (257825000) +    L1_AC_17           LINE/L1             5129   5257  (153)    7570 *
534   16.6  0.7 10.1  1               6095318   6095458 (257825000) +    L1_AC_17           LINE/L1             5129   5257  (153)    7570 *
534   16.6  0.7 10.1  1               6095318   6095458 (257825000) +    L1_AC_17           LINE/L1             5129   5257  (153)    7570 *
2706   30.1  3.1  2.4  1              6256710   6258043 (257662415) C    L1_AC_16           LINE/L1            (2892)  3237   1895    7770
725   23.2  4.3  2.6  1               6342819   6343048 (257577410) C    L1_AC_15           LINE/L1             (294)   5869   5636    7826
534   16.6  0.7 10.1  1               6095318   6095458 (257825000) +    L1_AC_17           LINE/L1             5129   5257  (153)    7570 *
534   16.6  0.7 10.1  1               6095318   6095458 (257825000) +    L1_AC_17           LINE/L1             5129   5257  (153)    7570 *
534   16.6  0.7 10.1  1               6095318   6095458 (257825000) +    L1_AC_17           LINE/L1             5129   5257  (153)    7570 *
241   16.1  6.1  6.1  1               6373497   6373562 (257546896) C    L1_AC_3            LINE/L1             (438)   5866   5801    7853
534   16.6  0.7 10.1  1               6095318   6095458 (257825000) +    L1_AC_17           LINE/L1             5129   5257  (153)    7570 *
534   16.6  0.7 10.1  1               6095318   6095458 (257825000) +    L1_AC_17           LINE/L1             5129   5257  (153)    7570 *
534   16.6  0.7 10.1  1               6095318   6095458 (257825000) +    L1_AC_17           LINE/L1             5129   5257  (153)    7570 *
534   16.6  0.7 10.1  1               6095318   6095458 (257825000) +    L1_AC_17           LINE/L1             5129   5257  (153)    7570 *
298   28.8 11.6  3.5  1               6513640   6513984 (257406474) +    L1_AC_9            LINE/L1             5476   5847  (564)    7961
848   27.3  3.2  4.1  1               6570802   6571363 (257349095) +    L1_AC_16           LINE/L1             4711   5267  (862)    7999 *
924   26.4  0.0  1.0  1               6571344   6571638 (257348820) +    L1_AC_16           LINE/L1             5638   5929  (200)    7999




In order to compile a list, I first want to subtract out the most repeated elements. In this case, L1_AC_17 is repeated so often it's difficult to find the other elements.

sed  '/L1_AC_17/d' /Users/username/Desktop/lines.txt > /Users/username/Desktop/lines_allexcept.txt

This will give me an updated list:


422   18.7  0.0  8.2  1               6079081   6079185 (257841273) C     L1-1_Acar         LINE/L1            (117)  5470   5374    7560
8699   29.8  1.1  1.1  1              6185694   6189641 (257730817) +    L1-28_ACar         LINE/L1             35     6162   (86)    7682
242   25.8  4.8  0.0  1               6230638   6230699 (257689759) +     L1_AC_18          LINE/L1             3488   3552 (1971)    7740
2706   30.1  3.1  2.4  1              6256710   6258043 (257662415) C    L1_AC_16           LINE/L1            (2892)  3237   1895    7770
725   23.2  4.3  2.6  1               6342819   6343048 (257577410) C    L1_AC_15           LINE/L1             (294)   5869   5636    7826
241   16.1  6.1  6.1  1               6373497   6373562 (257546896) C    L1_AC_3            LINE/L1             (438)   5866   5801    7853
298   28.8 11.6  3.5  1               6513640   6513984 (257406474) +    L1_AC_9            LINE/L1             5476   5847  (564)    7961
848   27.3  3.2  4.1  1               6570802   6571363 (257349095) +    L1_AC_16           LINE/L1             4711   5267  (862)    7999 *
924   26.4  0.0  1.0  1               6571344   6571638 (257348820) +    L1_AC_16           LINE/L1             5638   5929  (200)    7999


 
Without the L1_AC_17's, I can clearly see more element types, such as L1-1_Acar, L1-28_ACar, L1_AC_18, L1_AC_16, etc. As I compile my list, I want to keep eliminating the elements I have already marked down, so that I know the list includes every element. Since my list is 50,000 lines long, I start stringing multiple sed expressions together:

sed   -e '/L1-1_Acar/d'    -e '/L1-28_ACar/d'    -e '/L1_AC_18/d'    -e '/L1_AC_16/d'   -e '/L1_AC_15/d'   -e '/L1_AC_3/d'   -e '/L1_AC_9/d'   /Users/username/Desktop/lines.txt > /Users/username/Desktop/lines_allexcept2.txt

At this point, I see a new, empty text file pop up on the Desktop. Now I know that my list of elements is comprehensive.
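A toy version of this whittling-down check, small enough to run anywhere (lines.txt and lines_allexcept.txt are made-up stand-ins for the real files):

```shell
# Made-up element list for the demo.
cat > lines.txt <<'EOF'
L1_AC_17 ...
L1-1_Acar ...
L1_AC_17 ...
L1_AC_18 ...
EOF

# Delete every line matching an element already catalogued.
sed -e '/L1_AC_17/d' -e '/L1-1_Acar/d' -e '/L1_AC_18/d' lines.txt > lines_allexcept.txt

# An empty output file means every element type is accounted for:
wc -l lines_allexcept.txt
```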

Using Bash commands in Terminal is quite useful, but you can go beyond this and write Bash scripts to automate your processes. To write a Bash script, just open up a new file in a text editor, save it as name.sh, and put the header #!/bin/bash at the beginning.

#!/bin/bash

sed -e '/R2/d' -e '/LOA/d' -e '/telomeric/d' -e '/R1/d' -e '/I\ /d' -e '/MaLR/d' -e '/SINE?/d' -e '/LINE?/d' -e '/LTR?/d' -e '/ERV/d' -e '/Low_complexity/d' -e '/hAT-Ac/d' -e '/Satellite/d' -e '/Harbinger/d' -e '/hAT-Charlie/d' -e '/DNA/d' -e '/hAT-Tip100/d' -e '/TcMar-Tc1/d' -e '/TcMar-Tc2/d' -e '/TcMar-Mariner/d' -e '/Maverick/d' -e '/Penelope/d' -e '/Chapaev/d' -e '/Sola/d' -e '/Harbinger/d' -e '/Helitron/d' -e '/Piggybac/d' -e '/SINE\ /d' -e '/LINE\ /d' -e '/LTR\ /d' -e '/Sauria/d' -e '/Tigger/d' -e '/RTE-BovB/d' -e '/RTE-X/d' -e '/Jockey/d' -e '/Gypsy/d' -e '/Copia/d' -e '/Ngaro/d' -e '/MIR/d' -e '/Rex-Babar/d' -e '/LTR\ /d' -e '/Pao/d'  -e '/ERV1/d' -e '/ERVK/d' -e '/ERVL/d'   -e '/DIRS/d' -e '/Blackjack/d' -e '/RNA/d' -e '/rRNA/d' -e '/Kolobok/d' -e '/EnSpm/d' -e '/Simple_repeat/d' -e '/CR1/d' -e '/RC/d' -e'/tRNA/d' -e '/scRNA/d' -e '/snrNA/d' -e '/Unknown/d' -e '/Chompy/d' -e '/RTE/d' -e'/Dong-R4/d' -e '/LINE\ /d' -e '/Deu/d' -e '/B2/d' -e '/ID/d' -e '/B4/d' -e '/5S/d' -e '/Alu/d' -e '/MuDR/d' -e '/MULE-MuDR/d' -e '/L1/d' -e '/L2/d' /Users/username/Desktop/species_comp_repeats/acar2.fa.out    >  /Users/username/Desktop/acar2.txt

The 'shebang' #! (hash-bang) at the head of the file is what allows you to run the script from the Terminal, just as you would cd or ls. After that, move the file to a folder on your $PATH (I use /usr/sbin; /usr/local/bin is the more conventional choice) and lose the .sh. To make it executable, type chmod u+x script-name, or chmod 777 script-name. Since my bash script was named allexcept, I simply type allexcept into the Terminal and the script runs, removing the 90 or so elements from my file. After I automated this process, I was able to get information on 7 other species and create a massive Excel spreadsheet that I otherwise would never have had time to make. In addition, when I get new assemblies or updated data, I can rerun the script quickly instead of having to gather all that data all over again!

Another script I wrote is called greprep.sh. It stands for grep repeats. Again, my file contained over 100,000 elements, but I wanted to get information about specific subfamilies of repeats. I wrote the script to tell me how many copy numbers the element has. For instance, Mariner appears 72,898 times!

For each element, I just wrote:

STRING12="DNA/TcMar-Mariner"

echo $STRING12

grep -c "DNA/TcMar-Mariner" /Users/username/Desktop/species_comp_repeats/acar2.fa.out

The bash script greprep.sh is attached. I also wanted to be able to copy a column of the numbers, so I could easily paste them into Excel. I created another bash script called greprepnum.sh that just displays the numbers (not the names using echo). There is probably an easier way to do this, as there is always an easier way to do everything in scripting. ie creating an output Excel file that is already formatted the way you need it. But it takes too long sometimes to create these "shortcuts." With writing bash scripts there are so many different ways to do any one thing, and you can waste all day trying to get your perfect product. Choose your battles wisely!
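The repeated STRING/echo/grep blocks could also be collapsed into a single loop. A minimal sketch, using a made-up repeats.out file in place of the real acar2.fa.out:

```shell
# Made-up stand-in for a RepeatMasker .fa.out file.
cat > repeats.out <<'EOF'
100  1.0  6079081  DNA/TcMar-Mariner
200  2.0  6093682  LINE/L1
300  3.0  6094716  DNA/TcMar-Mariner
EOF

# Echo each element name, then its copy-number count, just as the
# one-block-per-element version of greprep.sh does:
for elem in "DNA/TcMar-Mariner" "LINE/L1"; do
    echo "$elem"
    grep -c "$elem" repeats.out
done
```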

Another thing that can be done with bash scripts is putting different commands into one script, such as:

grep "Unknown" /Users/username/Desktop/names.txt | sed -e '/Other/d' | awk '{print $1}' | perl fastqconversion.pl fa2std > acar.fastq

You can separate commands with a | (pipe) and in effect make your own pipeline that chains several different transformations or processing steps together. This is a very useful technique for the kind of data we get in bioinformatics! Remember, Google is your friend when it comes to finding scripts or script commands (but it may take some tweaking to get things up and running).
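As a runnable toy version of that idea (names.txt here is made up, and the Perl step is dropped so the pipeline stands on its own):

```shell
# Made-up input for the demo pipeline.
cat > names.txt <<'EOF'
kmer1 Unknown scaffold1
kmer2 Known scaffold2
kmer3 Unknown Other
kmer4 Unknown scaffold4
EOF

# Keep "Unknown" lines, drop any that mention "Other", print column 1:
grep "Unknown" names.txt | sed -e '/Other/d' | awk '{print $1}'
```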


Good luck.

Convert FASTQ files

Here is a Perl script for FASTQ conversion (I didn't write it). Just copy and paste it into a text file, but save it with the .pl extension. If you already have Perl installed and the script in your path, just type:

fastqconversion.pl   command   infile.txt


#!/usr/local/bin/perl -w

# Author: lh3

use strict;
use warnings;
use Getopt::Std;

my $usage = qq(
Usage:   fastqconversion.pl <command> <in.txt>

Command: scarf2std      Convert SCARF format to the standard/Sanger FASTQ
         fqint2std      Convert FASTQ-int format to the standard/Sanger FASTQ
         sol2std        Convert Solexa/Illumina FASTQ to the standard FASTQ
         fa2std         Convert FASTA to the standard FASTQ
         fq2fa          Convert various FASTQ-like format to FASTA
         instruction    Explanation to different format
         example        Show examples of various formats

Note:    Read/quality sequences MUST be presented in one line.
\n);

die($usage) if (@ARGV < 1);

# Solexa->Sanger quality conversion table
my @conv_table;
for (-64..64) {
  $conv_table[$_+64] = chr(int(33 + 10*log(1+10**($_/10.0))/log(10)+.499));
}

# parsing command line
my $cmd = shift;
my %cmd_hash = (scarf2std=>\&scarf2std, fqint2std=>\&fqint2std, sol2std=>\&sol2std, fa2std=>\&fa2std,
                fq2fa=>\&fq2fa, example=>\&example, instruction=>\&instruction);
if (defined($cmd_hash{$cmd})) {
  &{$cmd_hash{$cmd}};
} else {
  die("** Unrecognized command $cmd");
}

sub fa2std {
  my %opts = (q=>25);
  getopts('q:', \%opts);
  my $q = chr($opts{q} + 33);
  warn("-- The default quality is set to $opts{q}. Use '-q' at the command line to change the default.\n");
  while (<>) {
    if (/^>(\S+)/) {
      print "\@$1\n";
      $_ = <>;
      print "$_+\n", $q x (length($_)-1), "\n";
    }
  }
}

sub fq2fa {
  while (<>) {
    if (/^@(\S+)/) {
      print ">$1\n";
      $_ = <>; print;
      <>; <>;
    }
  }
}

sub scarf2std {
  while (<>) {
    my @t = split(':', $_);
    my $name = join('_', @t[0..4]);
    print "\@$name\n$t[5]\n+\n";
    my $qual = '';
    @t = split(/\s/, $t[6]);
    $qual .= $conv_table[$_+64] for (@t);
    print "$qual\n";
  }
}

sub fqint2std {
  while (<>) {
    if (/^@/) {
      print;
      $_ = <>; print; $_ = <>; $_ = <>;
      my @t = split;
      my $qual = '';
      $qual .= $conv_table[$_+64] for (@t);
      print "+\n$qual\n";
    }
  }
}

sub sol2std {
  my $max = 0;
  while (<>) {
    if (/^@/) {
      print;
      $_ = <>; print; $_ = <>; $_ = <>;
      my @t = split('', $_);
      my $qual = '';
      $qual .= $conv_table[ord($_)] for (@t);
      print "+\n$qual\n";
    }
  }
}

sub instruction {

  print "
FASTQ format is first used in the Sanger Institute, and therefore
we take the Sanger specification as the standard FASTQ. Although
Solexa/Illumina reads file looks pretty much like the standard
FASTQ, they are different in that the qualities are scaled
differently. In the quality string, if you can see a character
with its ASCII code higher than 90, probably your file is in the
Solexa/Illumina format.

Sometimes we also use an integer, instead of a single character,
to explicitly show the qualities. In that case, negative
qualities indicates that Solexa/Illumina qualities are used.

";

}

sub example {
  my $exam_scarf = '
USI-EAS50_1:4:2:710:120:GTCAAAGTAATAATAGGAGATTTGAGCTATTT:23 23 23 23 23 23 23 23 23 23 23 23 23 23 23 23 23 23 23 23 23 23 23 23 19 23 23 23 18 23 23 23
USI-EAS50_1:4:2:690:87:GTTTTTTTTTTTCTTTCCATTAATTTCCCTTT:23 23 23 23 23 23 23 23 23 23 23 23 12 23 23 23 23 23 16 23 23 9 18 23 23 23 12 23 18 23 23 23
USI-EAS50_1:4:2:709:32:GAGAAGTCAAACCTGTGTTAGAAATTTTATAC:23 23 23 23 23 23 23 23 20 23 23 23 23 23 23 23 23 23 23 23 23 12 23 18 23 23 23 23 23 23 23 23
USI-EAS50_1:4:2:886:890:GCTTATTTAAAAATTTACTTGGGGTTGTCTTT:23 23 23 23 23 23 23 23 23 23 23 23 23 23 23 23 23 23 23 23 23 23 23 23 23 23 23 23 23 23 23 23
USI-EAS50_1:4:2:682:91:GGGTTTCTAGACTAAAGGGATTTAACAAGTTT:23 23 23 23 23 23 23 23 23 23 23 23 23 23 23 20 23 23 23 23 23 23 23 23 23 23 23 18 23 23 23 23
USI-EAS50_1:4:2:663:928:GAATTTGTTTGAAGAGTGTCATGGTCAGATCT:23 23 23 23 23 23 23 23 23 23 23 23 23 23 23 23 23 23 23 23 23 23 23 23 23 23 23 23 23 23 23 23
';

  my $exam_fqint = '
@4_1_912_360
AAGGGGCTAGAGAAACACGTAATGAAGGGAGGACTC
+4_1_912_360
40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 21 40 40 40 40 40 40 40 40 40 26 40 40 14 39 40 40
@4_1_54_483
TAATAAATGTGCTTCCTTGATGCATGTGCTATGATT
+4_1_54_483
40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 16 40 40 40 28 40 40 40 40 40 40 16 40 40 5 40 40
@4_1_537_334
ATTGATGATGCTGTGCACCTAGCAAGAAGTTGCATA
+4_1_537_334
40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 21 29 40 40 33 40 40 33 40 40 33 31 40 40 40 40 18 26 40 -2
@4_1_920_361
AACGGCACAATCCAGGTTGATGCCTACGGCGGGTAC
+4_1_920_361
40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 39 40 40 40 40 40 40 40 40 31 40 40 40 40 40 40 15 5 -1 3
@4_1_784_155
AATGCATGCTTCGAATGGCATTCTCTTCAATCACGA
+4_1_784_155
40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 31 40 40 40 40 40
@4_1_595_150
AAAGACGTGGCCAGATGGGTGGCCAAGTGCCCGACT
+4_1_595_150
40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 30 40 40 40 40 40 40 40 40 40 20 40 40 40 40 40 14 40 40
';

  my $exam_sol = '
@SLXA-B3_649_FC8437_R1_1_1_610_79
GATGTGCAATACCTTTGTAGAGGAA
+SLXA-B3_649_FC8437_R1_1_1_610_79
YYYYYYYYYYYYYYYYYYWYWYYSU
@SLXA-B3_649_FC8437_R1_1_1_397_389
GGTTTGAGAAAGAGAAATGAGATAA
+SLXA-B3_649_FC8437_R1_1_1_397_389
YYYYYYYYYWYYYYWWYYYWYWYWW
@SLXA-B3_649_FC8437_R1_1_1_850_123
GAGGGTGTTGATCATGATGATGGCG
+SLXA-B3_649_FC8437_R1_1_1_850_123
YYYYYYYYYYYYYWYYWYYSYYYSY
@SLXA-B3_649_FC8437_R1_1_1_362_549
GGAAACAAAGTTTTTCTCAACATAG
+SLXA-B3_649_FC8437_R1_1_1_362_549
YYYYYYYYYYYYYYYYYYWWWWYWY
@SLXA-B3_649_FC8437_R1_1_1_183_714
GTATTATTTAATGGCATACACTCAA
+SLXA-B3_649_FC8437_R1_1_1_183_714
YYYYYYYYYYWYYYYWYWWUWWWQQ
';

  print qq(
solexa
======
$exam_sol
scarf
=====
$exam_scarf
fqint
=====
$exam_fqint
);
}

BLAST notes

A note on systems: 

I recommend downloading VMware and Ubuntu and running the operating system virtually. You may need a virtual drive mounting program to simulate a physical installation CD so that you can install Ubuntu. I run Ubuntu on my PC inside a VMware virtual machine.

Also, if you will be logging on to a remote server via SSH, you will often have to go through some sort of secure logon (at most universities); I have to use the Cisco VPN client. SSH is a "secure shell," which allows you to log on to another machine and run programs there. You will need a username and the IP address of the computer to log on remotely. The command will look like: ssh username@11.111.111.111, and you will be prompted for that computer's password. If you are really uncomfortable using text commands, you can always enable Screen Sharing, but this has its downsides. If you have a set of tasks to do in a routine manner, you can write out the commands in advance, in a script, and execute them. With a normal graphical interface you would have to click through icons and menus, you would have no record of what you had done (this is why people prefer the R software environment as well), and the repetitive icon/menu clicking gets annoying. Automating tasks and scripting are your friends!


Set up a BLAST database:


It is necessary to become comfortable navigating your file system using text commands in the Terminal. At first this is difficult for Windows users who are used to a graphical interface, but you can explore using commands like

cd Desktop
to move down into a folder, or
cd ..
to move up one level in the folder hierarchy.

The command ls is like a flashlight, letting you see what's in a folder. In addition to ls, you can use ls -l to see more information about the files in a folder, such as date modified, size, and what permissions are set. If you want to see hidden files (such as system files that begin with a . ) you can type ls -a.
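A safe way to get comfortable is to practice in a scratch folder; all the names below are made up:

```shell
# Make a practice folder with a subfolder inside it, then look around.
mkdir -p practice/subfolder
cd practice
pwd            # prints the full path of the folder you are in
ls -l          # long listing: permissions, owner, size, date modified
cd subfolder   # move down into the subfolder
cd ..          # move back up one level
ls -a          # also show hidden dot-files like .bash_history
```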




makeblastdb -in infile.txt -dbtype nucl -out databasename

Here's the general structure of the command:
program -in nameoffile -nextoption option -nextoption option


If I type ssh username@11.111.111.111, I can log on my computer at school, from home, and execute commands remotely. As some of my genome files are 30 gigabytes, this is very convenient. Some of the temporary folders created with bioinformatics programs generate 500 gigabytes of data! Trying to move folders that large, even on a 1 terabyte removable harddrive is a real pain and it eats up a lot of time.

You can use mkdir to make a new folder.

mkdir gene_run

cd gene_run

Now I am in the folder where I want my files to go. Since I am in this folder, when I execute the makeblastdb command and blastn commands, files will be created and the programs will run fine. 


NOTE: if you exit the folder, or start in a default folder like /Users/name, your files will be created there instead. This causes confusion for new users, who might execute a command without being in the right directory and then wonder why their files disappeared: they thought they were in the Desktop folder, but weren't.


"WHY WON'T IT WORK? IT DID BEFORE" might be what you ask yourself. Check what folder you are in, and be aware that your location is implicit when you are using the Terminal but can be critical to running certain programs. When doing BLASTs, the simplest approach is to stay in the folder where you made your database; if you run the BLAST from elsewhere, you will need to give the full path to the database (or make a new one).


In Windows, to run a program you click on an icon. In Terminal, you just type the program's name, followed by the options you want to run it with. Here the "program" is makeblastdb, and I can tell it what file to use as input, what kind of database I want to make, and the name of the output files, separated by spaces. It doesn't matter what order you put the options in, either.


NOTE: First I made sure I was in the directory, which I named jan_blasts. I also made sure the .fa file was in the same folder. THEN I made the BLAST database.


There will be 3 files generated in the directory:

xtropdatabase.nhr
xtropdatabase.nin
xtropdatabase.nsq


These are files that the BLAST program will use.


When in doubt about what commands you need to use for doing a BLAST, you can always type makeblastdb -h to pull up the help file.





-dbtype is the database type. Since my Ensembl FASTA file has nucleotide sequences, I choose nucl as the option; if I had proteins, I would use prot. -in is the input file. -out is the name of your "database"; you will need to remember this name for the next step.

When running multiple BLASTs, which is often the case, you can press the up arrow at the Terminal's blinking cursor and the last command will appear; for multiple BLASTs this is a good shortcut. You can also drag a file from the Desktop to the Terminal window and its file path will appear at the blinking cursor.

The input file should be from Ensembl: the gene you want to look at. The output file can be a tabular format for Excel, as denoted by -m 9. You should look into other output formats too, like the 7 option. The -a 4 option instructs the BLAST to use 4 cores, to speed up the search.


Getting Gene Sequences

To begin using bioinformatic tools with a Linux-like terminal, proper organization is a must. Because you will be working from a command line, all your files, file names, and folders should be structured and arranged very clearly; otherwise, when you have hundreds of files it gets confusing, especially if you move files or archive them. It can be advantageous to keep many types of files in one folder, although this can get confusing, so I name all my files to include pertinent information (i.e. species_gene_cDNA_hits_extracted.txt); that way, if they are moved to another folder, they can easily be identified. This will prevent confusion in the future if files are moved around.

To set up for the blast you should download your gene files from Ensembl, perhaps cDNA, or the exon, depending on what you want. You can export your genes from Ensembl, or just copy the text into a text file. Go to Ensembl, type in a gene and select your species. Then, you will choose the Export Data button on the left sidebar.






Next, choose what parts of the gene you want and click Next.



 

You can just copy and paste this text into a text file and name it species_gene.txt. You only need to copy the part up to the first line break; the subsequent blocks of text represent other isoforms, and at the end is a long scaffold. To find a gene ortholog in another species, the first isoform is enough.





You will need to use different commands for nucleotide, protein, etc. Descriptions of tblastn, tblastx, blastp, and the rest can be found in Ensembl's BLAST Program Selection Guide. The website is well documented, so make use of it.

 You can download whole genomes from Ensembl from the Downloads section.

Okay, to start the BLAST: 
 
You will need to format, or make, the BLAST database. Essentially you are designating a file as the "target" which you will then search with a "query." The target file is the genome you are using as a reference, and the query is, say, the individual gene you want to search the target for. For example: if I want to find A. apletophallus genes related to development, but I only have the complete genome of A. carolinensis to guide me, I would set up the A. apletophallus contigs I have (aapl.fa) as the target for pulling out known A. carolinensis genes. Then, while BLASTing, I would use acar_lfng_cdna.txt to search against the aapl database.
 
 


BLAST COMMANDS - WHAT ARE THEY?

blastx -help will bring up the command line arguments (or blastn, etc)

After -d you can drag and drop your target database (Acar-contigs1.fa) into the Terminal window; you can erase the path, leaving only Acar-contigs1.fa. The program knows where to look for your target database because you already set your path variables. Put the path of your query file after -i, then after -o give a new output file name where you will get your result. I like the -m 9 result (also naming the file with an .xls extension), which is in tabular form and can be opened in Excel (appl_lfng_exons_out.xls).

Without the -m 9 option, a plain text file will be created.


Once the BLAST results have been obtained, I select the top hits with the greatest accuracy. You should look for a very low e-value: ideally below 10^-6, and often far lower, e.g. 10^-50. The option is, for instance, -evalue .00000000001. I then copied the kmer names and created a .txt file for them. (Creating file names with information about the file is very important once you have a folder with hundreds of files in it!)



To do the BLAST type:

blastn –query acar_gene_dkk2.txt   -db Acar-contigs1 -evalue .000000000001 -out acar_results.txt





-db is the name of the database you specified earlier; BLAST will search the gene (the text query file) against this database (the genome file). -evalue should be around 10^-10 or so. blastn does a search using a nucleotide query against a nucleotide database. You may have to use other programs like blastx, blastp, tblastn, etc. Look on Ensembl's website for an explanation of these programs and when it's best to use each. Or you can always check by typing:


blastn -h

blastx -h

blastp -h

tblastx -h


A description will appear at the bottom as a reminder of what the program is for.


blastn - Nucleotide-Nucleotide BLAST

blastx - Translated Query-Protein Subject BLAST

blastp - Protein-Protein BLAST

tblastx - Translated Query-Translated Subject BLAST


I'm not going to get into why using a protein query is sometimes the best thing, but there are often very good reasons to do a BLAST using protein sequences. Remember that genes have a lot of non-coding elements, introns are more poorly conserved, and if you are searching for genes in two species separated by a long evolutionary distance, say chicken vs. stickleback, you might not find a lot of conservation and could get few hits. It is easier to find orthologous matches in protein-coding regions, because proteins have a greater propensity to be conserved for functional reasons.


Grepping


You will also need to know how to grab parts of a sequence from very large FASTA files. With a normal-size file you could just do a text search, but this becomes difficult when working with a whole genome. For this reason, grep is your friend! Let's say you get back your BLAST results and you have a match. The file shows you the sequence that matches your query (called the subject, abbreviated sbjct). I copy that sequence.




Next, I use the grep command to pull the sequence from the fasta file and output it to a file on my desktop.

grep -C 15 "sequencehere" inputfile > outputfile





This will give me a text file with the sequence in it, as well as 15 lines above and below my search target, which is good if you want to check out the additional sequence around the exon you are looking at.



Likewise, if you type:

grep -A 3 "sequencehere" inputfile > outputfile




You will get the matching line plus the 3 lines after your search target.



grep -B will print a given number of lines before your search target.

grep -B 3 "sequencehere" inputfile > outputfile

Grep can be a great tool to use! Read more here.
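To see what these flags do on a small scale, here is a self-contained sketch using a toy two-record fasta file (the file name and sequences are made up for illustration):

```shell
# Make a tiny two-record fasta file to experiment on.
printf '>contig1\nACGTACGT\n>contig2\nTTTTCCCC\n' > toy.fa

# -c counts matching lines; since each record has exactly one ">" header,
# this counts the sequences in the file.
grep -c ">" toy.fa

# -A 1 prints the matching header plus the 1 line after it (the sequence).
grep -A 1 ">contig2" toy.fa
```

The same grep -c ">" trick on a real assembly is a quick way to count how many contigs it contains.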


Other things




I next created a folder called Tools with all my perl scripts and important files. In order to run these from the terminal, permissions must be set correctly. In the file's directory, type:
ls -l

This tells you what the permissions are: read-write-execute.

chmod 777 *.pl

This sets full read, write, and execute access for all files that end in .pl. I set my permissions to full, but you can set them according to your preferences: each digit runs from 7 down to 0, with 0 removing all access (be careful not to lock yourself out of a file). If you are working on Saguaro, be aware that others can access your files if you do not restrict access. For more info on the user/group/other structure of permissions, see Google. Google is your friend, and there are many sites with answers about Linux or bash that you can use as a resource to solve problems you encounter.
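If you would rather not open a file to everyone with 777, symbolic modes let you grant access selectively. A minimal sketch (the demo.pl file here is created just for illustration):

```shell
# Create a throwaway script to practice on.
printf '#!/bin/sh\necho hello\n' > demo.pl

# u+rwx: the owner gets read/write/execute; go-w: group and others lose write.
chmod u+rwx,go-w demo.pl

# ls -l now shows a permission string beginning -rwx for the owner.
ls -l demo.pl

# The execute bit means you can run the script directly.
./demo.pl
```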

Often you will want to reference an earlier terminal session to see where you left off. To review your terminal history, open the hidden file in your home folder (username) by:

open .bash_history


To search, use Command+F.
I keep copies of .bash_history to keep track of what I've done, as I think this is a good scientific habit for documenting bioinformatic methods.


Installing the BLAST binaries can be difficult if you don't have the right software dependencies installed already. At minimum, you should install Perl and make sure it is on the PATH set in your .bash_profile.

You can type emacs ~/.bash_profile. Add the path to your program and then type Control+X Control+C to save and exit. I have added a long list of my programs to my bash profile. 
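For example, a PATH line in ~/.bash_profile looks like the following. The BLAST directory shown is an assumed install location, so substitute wherever you actually unpacked the binaries:

```shell
# Prepend an assumed BLAST install directory to the search path.
# Directories earlier in PATH are searched first.
export PATH="$HOME/Tools/ncbi-blast/bin:$PATH"

# Verify the new directory now sits at the front of PATH.
echo "$PATH"
```

Remember to open a new terminal (or run source ~/.bash_profile) before the change takes effect.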




Emacs is a text editor available in the Mac terminal. You can use it to edit this file, which tells your computer where to look for programs, so that when you type a program name from any location, it will run. (Type which followed by a program name to find where things are kept, i.e. which blastn or which perl will show you the path to the executable. If you want to know where you are, you can type pwd and it will print your current location.) Control+Z will suspend Emacs and drop you back into the terminal (type fg to return to it).
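A quick sketch of those lookups, using a command that exists on every system:

```shell
# `which` prints the full path of the first matching executable on PATH;
# swap in blastn or perl once those are installed.
which ls

# `pwd` prints the directory you are currently in.
pwd
```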

To install the BLAST software, follow the install instructions; you will probably have to run tar -xvf on the downloaded .tar archive to unpack it into your directory.

Next, I used the perl script to search through the Acar-contigs1.fa to pull out the information on the kmers I want out of this massive file, which would take a long time to search through otherwise. Learn how to use scripts!! They will make your life so much easier. There are plenty of perl scripts that can be downloaded for free from the internet. Make use of as many resources as possible. This extract2align.pl script is attached to this post.


/Users/username/Tools/extract2align.pl --infile /Users/username/Aapl_JanProject/blastdb/Acar-contigs1.fa --idlist /Users/username/Aapl_JanProject/aapl_lfng_hit.txt --outfile /Users/username/Aapl_JanProject/aapl_lfng_hit_extracted.txt


This executes the extract2align.pl perl script, which will dig through my target database, pull out the kmers that were the top hits, and create a new file with the nucleotides (put _extracted in the file name to stay organized). You can find a copy of this Perl script here.

aapl_lfng_hit_extracted.txt 


Results

hes1_aminos_hits.txt
--------------------
k31:121332
k51:3345759


------------------------------
>k31:121332
ACACGACTACAGCTCCTCGGACAGCGAGCTAGACGAGAACATCGAGGTGGA
GAAGGAGAGTGCAGACGAGAATGGAAACCTGAGTTCAGTGCCGGGGTCCAT
GTCCCCCTCCACGTCCTCCCAGATCTTGGCCAGAAAAAGGCGCAGAGGGGT
GATCGAGAAACGCCGGCGAGATCGAATCAACAACAGTTTATCTGAATTAAG
GAGACTGGTGCCCAGCGCTTTCGAGAAACAGGGATCGGCGAAGCTGGAAA
AGGCAGAGATTTTGCAAATGACCGTAGACCACCTGAAAATGCTGCACACGG
CAGGAGGAAAGGGGTACTTCGACGCTCACGCTCTGGCGATGGACTACCGCAG
CCTAGGGTTCCGGGAGTGCCTGGCAGAGGTAGCACGGTACCTCAGCATCATC
GAGGGCCTGGACACCTCCGACCCT
------------------------------
 
(Or, if the .fa was run on only one contig, there will be no k31: prefix before the subject id number. Just use the subject id number that came up in your _out.xls file.)

If you used -m 9 for the tabulated file, your output will not have any column headers, so copy and paste this into the first row of your Excel spreadsheet.

Query    Subject ID    % ID    length    mismatches    gaps    query start    query end    subject start    subject end    e value    bit score
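If you would rather add the header row on the command line before opening the file in Excel, one way is below. The data row is made up so the snippet stands alone; point cat at your real tabular output instead:

```shell
# Fake one tabular BLAST hit so the example is self-contained.
printf 'q1\tk31:121332\t98.5\t120\t1\t0\t1\t120\t200\t319\t1e-50\t220\n' > hits.tab

# Prepend the 12 tabular column names, then append the data rows.
# (%% in printf prints a literal % character.)
{ printf 'Query\tSubject ID\t%%ID\tlength\tmismatches\tgaps\tq.start\tq.end\ts.start\ts.end\tevalue\tbitscore\n'
  cat hits.tab; } > hits_with_header.tab

# The first line is now the header.
head -n 1 hits_with_header.tab
```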


Look at both the alignment length and the e-value to determine which hit is most likely the gene match you are looking for.
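Sorting the tabular output by e-value makes the best hit easy to spot. A sketch with two made-up hits (in the 12-column tabular format, column 11 is the e-value):

```shell
# Two made-up tabular hits: s1 is short with a mediocre e-value,
# s2 is long with a very strong one.
printf 'q1\ts1\t99\t50\t0\t0\t1\t50\t1\t50\t1e-5\t90\n' > hits.tab
printf 'q1\ts2\t97\t200\t3\t0\t1\t200\t1\t200\t1e-60\t350\n' >> hits.tab

# -g sorts general numeric values (it understands that 1e-60 < 1e-5);
# -k 11,11 sorts on the e-value column, so the best hit comes first.
sort -g -k 11,11 hits.tab
```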

Next I use Geneious and drag the _extracted file into the window. Then you can do alignments with another genome or gene to compare. You can also use UCSC's Genome Browser.