DNA Sequence Databases

1. Introduction

This paper examines techniques used to implement DNA sequence databases. Specifically, we consider will the following issues:

For each of these areas we will discuss the issues involved. We will compare the techniques used in terms of complexity, resource utilization, and results returned. Existing biological databases will be used to determine the accuracy of the results returned. Our comparison will center around three popular DNA sequence search methods: Smith-Waterman dynamic programming alignment, BLAST (Basic Local Alignment Search), and FastA (Fast Alignment). We will also examine a few other techniques which have demonstrate limited success, but have not become broadly accepted.

A glossary of the terms used in this paper and the papers referenced herein may be found at http://www.cs.ucsb.edu/~mdipper/dna/terms.html

1.1 Why DNA Sequences?

Proteins are a basic building blocks of life. The majority of a cell's structure is composed of proteins. Proteins (more specifically enzymes) are also responsible for all chemical transformations that a cell is capable of. Proteins are composed of a series of amino acids. There are 20 types of amino acids, the order in which they appear in a protein determines both its shape and function.

Though much of an organism's biochemically important information may be learned from the study of proteins, it is difficult to chemically determine the sequence of amino acids that make up a protein. The proteins of a cell are built from the cell's RNA. RNA is a single strand of nucleotides which is transcribed from the cell's DNA. Using present day biological techniques, it is a much simpler task to determine the nucleotides that form a DNA strand, then RNA nucleotides or protein amino acids.

1.2 About DNA Sequences

DNA is composed of two complementary sequences of nucleotides. If one of the two sequences is known, it is a trivial task to compute it's complement. The nucleotides which make up a DNA strand are:

Nucleotide Referred to as Abbreviation
deoxyAdenosine monophosphate Adenosine A
deoxyCytidine monophosphate Cytidine C
deoxyGuanosine monophosphate Guanosine G
deoxyThymidine monophosphate Thymidine T

Adenosine and Thymidine monophosphate are complements of each other as are Guanosine and Cytidine. If a single DNA subsequence is known to be, CGATGATC, The double stranded subsequence would be formed as follows:

CGATGATC
||||||||
GCTACTAG

Nucleotide triples in a DNA sequence are known as codons. Codons encode a single amino acid on a protein. Since there are four DNA nucleotides, there are 64 (43) possible nucleotide triples, yet there are only 20 amino acids. The mapping of triples to amino acids is a many to one function, with some triples also mapping to an end of sequence marker. The following table lists the codon mappings.

Second Position of Codon
T C A G
F
i
r
s
t

P
o
s
i
t
i
o
n
T
TTTPhe[F]
TTCPhe[F]
TTALeu[L]
TTGLeu[L]
TCTSer[S]
TCCSer[S]
TCASer[S]
TCGSer[S]
TATTyr[Y]
TACTyr[Y]
TAATer[*]
TAGTer[*]
TGTCys[C]
TGCCys[C]
TGATer[*]
TGGTrp[W]
T
C
A
G
T
h
i
r
d

P
o
s
i
t
i
o
n
C
CTTLeu[L]
CTCLeu[L]
CTALeu[L]
CTGLeu[L]
CCTPro[P]
CCCPro[P]
CCAPro[P]
CCGPro[P]
CATHis[H]
CACHis[H]
CAAGln[Q]
CAGGln[Q]
CGTArg[R]
CGCArg[R]
CGAArg[R]
CGGArg[R]
T
C
A
G
A
ATTIle[I]
ATCIle[I]
ATAIle[I]
ATGMet[M]
ACTThr[T]
ACCThr[T]
ACAThr[T]
ACGThr[T]
AATAsn[N]
AACAsn[N]
AAALys[K]
AAGLys[K]
AGTSer[S]
AGCSer[S]
AGAArg[R]
AGGArg[R]
T
C
A
G
G
GTTVal[V]
GTCVal[V]
GTAVal[V]
GTGVal[V]
GCTAla[A]
GCCAla[A]
GCAAla[A]
GCGAla[A]
GATAsp[D]
GACAsp[D]
GAAGlu[E]
GAGGlu[E]
GGTGly[G]
GGCGly[G]
GGAGly[G]
GGGGly[G]
T
C
A
G
Codon to Amino Acid Mapping

1.3 Information Gained from Matching Sequences

A goal a of sequence matching databases is to determine if there are any sequences known by the database similar to a query sequence. If two sequences are very similar, it is likely that:

  1. The sequences have a related structure or function.
    In most cases, there is some available information on the structure and function of the sequences in the database. A researcher with a sequence similar to a known sequence may be able to gain information about the form and function of the new sequence by studying similar known sequences.
  2. The sequences may have a common ancestor sequence.
    If two sequences are reasonably similar, it is likely that both sequences evolved from a common ancestor and an evolutionary relationship may exist between the source of each sequence.

If the query sequence is a partial sequence, it may also be possible to gain information about the sequence's position and role in the sequence it came from.

In section 3, we will discuss techniques used to measure the similarity of two DNA sequences.

1.4 Sequence Alignment Issues

A DNA strand may be thought of as a string of four nucleotides in any order (e.g. ATGCCATT). A simple approach towards aligning two DNA sequences would be to search for an exact match of one string within another. Reasonably fast techniques for performing exact matches of strings are well known.

There are a few problems with using exact matches to align two DNA sequences.

In order for more meaningful alignments to be formed inexact alignments must be allowed. We will discuss some of these alignment techniques the section which follows.

1.5 Nucleotide Data Sets

The largest publicly accessible nucleotide data sets are maintained in: GenBank (National Center for Biotechnology Information Genetic Databank), EMBL (European Molecular Biology Laboratory), and DDJB (DNA Database of Japan). Each of these three databases shares their information. In October 1999 GenBank reported that there were approximately 3,841,000,000 bases in 4,865,000 sequence records in the GenBank database.

2. Aligning DNA sequences

Nucleotide and amino acid databases traditionally perform nearest neighbor and range queries based upon the scores of gapped local alignments. In this section we will discuss the differences between gapped and ungapped alignments as well as local and global alignments. Section 3 discusses the scoring of such alignments.

2.1 Gapped Alignments vs. Ungapped Alignments

In the ungapped alignment model, only nucleotides (or amino acids) may be aligned with each other. In such alignments, nucleotides must be inserted or deleted to correct length differences or allow for breaks in the alignment sequences. Gapped alignments have all the features of an ungapped alignment, plus they allow for sequences to be extended by the insertion of gaps. Algorithms which allow for gaps generally have a penalty associated with opening the gap and another for extending it. See Section 3.3 for the scoring of gaps.

2.2 Local vs. Global

In the local alignment model, portions of two sequences are aligned with each other, while portions which are not similar are not used in the alignment. Global alignment, differs in that two sequences are aligned from beginning to end. While global alignments are useful for comparing two complete, closely related sequences, they are not very useful when searching for sequences that contain near matching subsequences or more distantly related sequences.

3. Scoring of Similarity Between Sequences

Alignment of two nucleotide sequences is traditionally scored using values which may be looked up in a weighted scoring matrix. Several biologically significant matrices exist. The matrices most frequently used for scoring alignments of amino acid and nucleotide sequences come from the PAM (Percent Accepted Mutation) and BLOSUM (Blocks Substitution Matrices) families.

3.1 PAM

The PAM matrices are constructed so that the highest scoring alignments using PAMn will be within n PAM units of each other. PAM1 is the base PAM matrix. It is constructed so that maximal scores are produced for alignments with only 1% mutation (99% conservation). To construct the PAMn matrix from PAM1 (M1), the following formula is used[28]:

Mn = (M1) n

The following matrix is a representation of PAM100.

#
# This matrix was produced by "pam" Version 1.0.6 [28-Jul-93]
#
# PAM 100 substitution matrix, scale = ln(2)/2 = 0.346574
#
# Expected score = -1.99, Entropy = 1.18 bits
#
# Lowest score = -9, Highest score = 12
#
   A  R  N  D  C  Q  E  G  H  I  L  K  M  F  P  S  T  W  Y  V  B  Z  X  *
A  4 -3 -1 -1 -3 -2  0  1 -3 -2 -3 -3 -2 -5  1  1  1 -7 -4  0 -1 -1 -1 -9
R -3  7 -2 -4 -5  1 -3 -5  1 -3 -5  2 -1 -6 -1 -1 -3  1 -6 -4 -3 -1 -2 -9
N -1 -2  5  3 -5 -1  1 -1  2 -3 -4  1 -4 -5 -2  1  0 -5 -2 -3  4  0 -1 -9
D -1 -4  3  5 -7  0  4 -1 -1 -4 -6 -1 -5 -8 -3 -1 -2 -9 -6 -4  4  3 -2 -9
C -3 -5 -5 -7  9 -8 -8 -5 -4 -3 -8 -8 -7 -7 -4 -1 -4 -9 -1 -3 -6 -8 -5 -9
Q -2  1 -1  0 -8  6  2 -3  3 -4 -2  0 -2 -7 -1 -2 -2 -7 -6 -3  0  5 -2 -9
E  0 -3  1  4 -8  2  5 -1 -1 -3 -5 -1 -4 -8 -2 -1 -2 -9 -5 -3  3  4 -2 -9
G  1 -5 -1 -1 -5 -3 -1  5 -4 -5 -6 -3 -4 -6 -2  0 -2 -9 -7 -3 -1 -2 -2 -9
H -3  1  2 -1 -4  3 -1 -4  7 -4 -3 -2 -4 -3 -1 -2 -3 -4 -1 -3  1  1 -2 -9
I -2 -3 -3 -4 -3 -4 -3 -5 -4  6  1 -3  1  0 -4 -3  0 -7 -3  3 -3 -3 -2 -9
L -3 -5 -4 -6 -8 -2 -5 -6 -3  1  6 -4  3  0 -4 -4 -3 -3 -3  0 -5 -4 -3 -9
K -3  2  1 -1 -8  0 -1 -3 -2 -3 -4  5  0 -7 -3 -1 -1 -6 -6 -4  0 -1 -2 -9
M -2 -1 -4 -5 -7 -2 -4 -4 -4  1  3  0  9 -1 -4 -3 -1 -6 -5  1 -4 -2 -2 -9
F -5 -6 -5 -8 -7 -7 -8 -6 -3  0  0 -7 -1  8 -6 -4 -5 -1  4 -3 -6 -7 -4 -9
P  1 -1 -2 -3 -4 -1 -2 -2 -1 -4 -4 -3 -4 -6  7  0 -1 -7 -7 -3 -3 -1 -2 -9
S  1 -1  1 -1 -1 -2 -1  0 -2 -3 -4 -1 -3 -4  0  4  2 -3 -4 -2  0 -2 -1 -9
T  1 -3  0 -2 -4 -2 -2 -2 -3  0 -3 -1 -1 -5 -1  2  5 -7 -4  0 -1 -2 -1 -9
W -7  1 -5 -9 -9 -7 -9 -9 -4 -7 -3 -6 -6 -1 -7 -3 -7 12 -2 -9 -6 -8 -6 -9
Y -4 -6 -2 -6 -1 -6 -5 -7 -1 -3 -3 -6 -5  4 -7 -4 -4 -2  9 -4 -4 -6 -4 -9
V  0 -4 -3 -4 -3 -3 -3 -3 -3  3  0 -4  1 -3 -3 -2  0 -9 -4  5 -4 -3 -2 -9
B -1 -3  4  4 -6  0  3 -1  1 -3 -5  0 -4 -6 -3  0 -1 -6 -4 -4  4  2 -2 -9
Z -1 -1  0  3 -8  5  4 -2  1 -3 -4 -1 -2 -7 -1 -2 -2 -8 -6 -3  2  5 -2 -9
X -1 -2 -1 -2 -5 -2 -2 -2 -2 -2 -3 -2 -2 -4 -2 -1 -1 -6 -4 -2 -2 -2 -2 -9
* -9 -9 -9 -9 -9 -9 -9 -9 -9 -9 -9 -9 -9 -9 -9 -9 -9 -9 -9 -9 -9 -9 -9  1

3.2 BLOSUM

BLOSUM matrices are based on local multiple alignments of more distantly related sequences. Unlike PAM matrices, BLOSUM matrices were created from real amino acid data.

For the creation of BLOSUM matrices, a database of multiple alignments without gaps for short regions of related sequences was derived. Within each alignment, the sequences were clustered into groups of sequences similar at by some threshold percent value. Substitution frequencies for all pairs of amino acids were calculated between the groups, to create the Block Substitution Matrix for each cluster.

The number associated with the matrix is the minimum percent of identity of the sequences in the block. For example BLOSUM50 means that the sequences in this block are at least 50% identical.

3.3 Gaps

In addition to the scoring matrix, if gaps are allowed in an alignment, a linear gap penalty is usually used. The default gap penalty for many of the scoring systems is 10 points for opening the gap, and 1 point for each additional nucleotide the gap spans.

Using the PAM100 matrix above and a 10 point gap creation 1 point gap extension penalty, the following is and example of how two alignments might be scored.

Sequence 1  C  T  G  A  G  A  A  T  C  A  T  A  C  G  T  C  A  G  T  A  C  C
            |  +  |  |  |  +        |  |  |  +  |  |  |     +  |  |  +  |  |
Sequence 2  C  A  G  A  G  T  -  -  C  A  T  G  C  G  T  -  A  G  T  A  G  C
Score      +9 +1 +5 +4 +5 +1 -10-1 +9 +4 +5 +1 +9 +5 +5 -10+4 +5 +5 +4 -5 +9

Total Score: 64                   | = Match,  + = Mutation, - = Gap

4. Statistical Analysis of Alignments [6][10][11][12]

Just as important as having a high score in a string similarity match is its statistical significance. Consider the following matches:

Database: ATATATATATATATATATCAAG
Query:    ATATATATAT

and

Database: CAGTAGAACGGACTAGACA
Query:    CAGTAGAA

In the first case the match is practically irrelevant. It can be made several places on the string. It can be considered a chance occurrence that it is aligned the way it is. The second alignment is more significant since it probably will not occur elsewhere. Information theory describes the significance of the alignments.

Given an independent, identical distribution (IID) the probability of seeing a particular sequence of letters a1, a2, ... an is simply P1 × P2 × ... × Pn for a probability distribution Pi'.

As an alternative to multiplication, the logarithm is taken and probabilities are summed as log(P1) + log(P2 ) + ... + log(Pn). This is call the log odds and actual probability would be computed by taking the log base to this power. Log probabilities are measures of information and their units are related to the log base. Log base 2 units are bits. Natural log units are called nats. Bits are related to nats by the factor ln(2).

4.1 Shannon's Entropy

Shannon's entropy (I) is a measure of information stored in a string in bits per symbol and is given by the equation:

I = -(S i Pi × log(Pi'))

If a message is 200 symbols long, the amount of information stored in it is 200 × I, and in theory there is no shorter encoding of this message. For a uniform probability distribution of nucleotide sequences PA = PT = PG = PC = 0.25. This means that each character takes 2 bits to encode. For a nonuniform distribution where PA = PG = 0.05 and PA = PG = 0.45, the average is about 1.47 bits per letter. If a Huffman code for the this distribution is used the symbols may be represented by the following bit pattern:

Nucleotide Pattern
G 0
T 10
A 110
C 111

The Huffman Entropy is 1 × 0.45 + 2 × 0.45 + 3 × 0.05 + 3 × 0.05 = 1.65 bits per letter. While the Huffman code is simpler to implement, there is a small cost (1.65 vs. 1.47) for using a slightly different probability distribution.

If the true distribution is not PA = PG = 0.05 the encoding could take more than 2 bits per nucleotide. In general if a bias of a distribution is unknown, the conservative use of the uniform distribution may be longer than if the bias were known but is the best for that situation.

4.2 Relative Entropy

Relative entropy, H, is a measure of the expected inefficiency per letter for a message with an assume distribution P and an actual distribution Q. Relative Entropy is sometimes called Kullback Leibler distance between distributions, and is always positive or zero if P = Q. Relative entropy is given by the equation:

H = S i Qi × log (Qi / Pi)

Given a scoring matrix such as PAM62 and a null model sequence generated independent, identical distribution using some background distribution, P, and a query to be aligned BLAST and FastA find maximal scoring pair (MSP). The distribution of the letters in MSP is called the target distribution and is related to the background distribution by the scoring matrix. If the letters in the MSP are described using the optimal Shannon code for the background distribution, it is expected that this would be longer than if the MSP is described with a Shannon code on its own distribution, the target distribution. The extra bits describe the odds that the alignment is by chance alone.

4.3 Karlin-Altschul Statistics

Karlin-Altschul statistics show that given two sequences IID, effectively long, and not too dissimilar in size with X and Y with background distributions of PX and PY, if the sequences are respectively searched using a scoring matrix S(i,j), and the total pairwise score sum is expected to be negative while some positive score is possible for some letters i and j, PX(i) × PY(j) × S(i,j) > 0 that:

S(i,j) = log(Q(i,j) / PX(i) × PY(j)) / l

Q(i,j) is the limiting target distribution of letter pairs (i,j) in the MSP and l can be thought of as a scale of the scoring matrix and is the solution to:

Si,j PX(i) × PY(j) × e lS(i,j) = 1

The expected frequency E of chance occurrence that the MSP has a score S or greater is:

E = K × M × N × e -lS

Where N is the size of the database string and M is the size of the query string. MN is the size of the search space. K is a measure of the relative interdependence of this space.

There is less of a chance that sequences will align on the edges of the search space since an MSP would run over the edge and the area over the edge does not generate any score. These effects on the expected frequency E are modeled by showing a slightly reduced search space.

E represents the expected number of answers given the search space and the distributions and 5 - 10 is generally used as a cutoff for an answer. E is a fundamental measure for an alignment. An alignment with a high E is considered irrelevant.

5. Exhaustively Searching DNA Databases

In this section we will discuss 3 popular algorithms for exhaustively searching DNA databases: SSearch, FastA, and BLAST. Each of these algorithms may be used for both nearest neighbor and range searches of nucleotide data bases against a nucleotide query string. A fourth, algorithm, SALSA is also discussed, because we believe that it demonstrates the potential for an increased sensitivity, rapid search algorithm.

5.1 SSearch (Smith-Waterman Algorithm) [24][28]

SSearch is a rigorous local alignment program written by William Pearson as an implementation of the of Smith and Waterman method which will always find a maximal scoring alignment of two sequences. Although it can be very slow compared to FastA and BLAST, it is considered the most sensitive method of searching and will generally find all similar and statistically significant alignments. For this reason it is an essential tool for doing exhaustive searching and is distributed with other tools such as FastA. The program can also be run on multiprocessors for faster search times.

SSearch can be used to search for nucleotide or amino acid sequences against a FastA formatted database. It uses standard scoring matrices such as PAM or BLOSUM and a gap creation penalty plus a gap extension penalty. It is based on the idea of dynamic programming.

On brute force search method this problem would take O(n2m 2) time, where n is the size of the database string and m is the size of the query string. If no gaps are allowed this is O(nm) since worst every symbol on the database must be compared with every symbol on the query. With gaps there are a factor of n permutations on the database and a factor of m permutation on the query string, hence O(n2m2) time. The most widely accepted remedy is to use dynamic programming to remember substring comparisons once we have made them. The SSearch algorithm works as follows. Create a matrix (n + 1) × (m + 1). In the first row and first column put all zeros. For all rows and columns compute the ith column and jth row as the follows:

    cell(i, j) = maximum {
        cell(i - 1, j - 1) + score(ai,bj),      // Describes a match or mismatch
        max(cell(i - k, j) + Wk),               // Describes a gap of length k on query
        max(cell(i, j - p) + Wp),               // Describes a gap of length p on database
        0}                                      // Describes a new start

The score is computed from a given matrix such as PAM62 or BLOSUM50 for amino acids and usually by +5 for a match, -4 for a mismatch for nucleotides. PAM and BLOSUM describe the amount of expected mutations (mismatches). Wk and Wp are gap penalties computed with as:

Wk = Creation_Penalty_for_query + Extension_Penalty_for_query × k
Wp = Creation_Penalty_for_database + Extension_Penalty_for_database × p

Normally a gap creation and extension for a database is given a much higher penalty than for a query because it is assume that the database has been filled to a greater degree. The maximal score is highest number in the last row of the table where all of the query string has be processed. Other high scores can also be analyzed to give range or nearest neighbor type searches. SSearch uses a simple linear regression against the natural log of the search set sequence length to calculate a normalized z-score for the sequence pair. The statistical significance is also computed and compared against the Statistical Expectation. The table does not give the alignment only the score. In order to find the alignment a directed graph must be kept of the path of the score.

Example

Use a scoring matrix of +5 for a match and -4 for a mismatch.
Gap creation penalty = -4 for both database and query.
Gap extension penalty = -1 for both database and query.

Database String: AGCTATATACGGTAACGTA
Query String:    GCTTA
* A G C T A T A T A C G G T A A C G T A
* 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
G 0 0 5 1 0 0 0 0 0 0 0 5 5 1 0 0 0 5 1 0
C 0 0 1 10 6 5 4 3 2 1 5 1 1 0 0 0 5 1 0 0
T 0 0 0 6 15 11 10 9 8 7 6 5 4 6 2 1 1 0 6 2
T 0 0 0 5 11 11 16 12 14 10 9 8 7 9 5 4 3 2 5 1
A 0 5 1 4 10 16 12 21 17 19 15 14 13 12 14 10 9 8 7 10

The highest score in the last row is 21. An alignment is:

AGCTATATACGGTAACGTA
 ||| ||
 GCT-TA

Another type of searching table is a dot matrix table which puts a dot wherever there is a match and nothing for a mismatch. The idea is to look for the longest diagonals (longest substring match) and then to connect the ends using the smallest distance (allowing for gaps). This can be used for visually seeing matches.

The output of SSearch is a file containing a histogram showing the distributions of the z-scores between the query and the database sequences. Below the histogram, SSearch display a list of the best scores and the protein strings where they were found. Finally the output shows the actual alignments for the highest scores.

Other algorithms for speeding up the search are possible. To save space only one line of the table needs to be stored. It will be updated by subsequent symbols of the query string. For this method the algorithm looks like a Turing Machine with the database on every other spot on the tape and the query string being fed in as input.

A G C T A T A T A C G G T A A C G T A
G * * * *
C * * *
T * * * * *
T * * * * *
A * * * * * * *

5.2 FastA

FastA is an algorithm [18] that attempts to speed up string matching over the standard optimal alignment. Optimal alignment techniques such as Smith-Waterman [28] are based on dynamic programming. String matching using dynamic programming run in quadratic time. FastA uses direct addressing or k-tuple preprocessing to cut down the dynamic programming search space significantly. This results in reduced search time at the expense of some sensitivity.

5.2.1 FastA Algorithm

The FastA algorithm is implemented in the following 6 stages:

5.2.1.1 Locating Hot spots

FastA allows the specification of a parameter called ktup (short for k respective tuples). The ktup sets the basis word length for the comparisons between the query string and a given string in the database. ktup values are typically six for DNA sequences and two for protein sequences.

The matching ktup-length substrings are referred to as hot spots. To locate the hot spots, FastA creates a dictionary of all possible words of length ktup that occurs in the query sequence. For DNA sequences this means 46 or 4096 possible entries in the table. For amino acid sequences we have as many as 202 or 400 possible entries because the amino acid alphabet has 20 symbols. For the DNA case each word is represented as a base 4 number that is also the index into the table. Each entry contains the offsets where this particular combination of 6 letters occur in the query sequence. In this way, for each word in the searched string, only the dictionary need be consulted to determine if and where the word occurs in the query string. Consecutive hot spots (regions) are located along the dynamic programming matrix diagonals.

5.2.1.2 Finding the 10 Best Regions

A region is a sequence of consecutive hot spots on the same diagonal. Spaces between the hot spots are permitted. FastA ranks regions by giving each hot spot a positive score. The intervening space between consecutive hot spots in is given a negative score. The larger the gap the more severe the penalty. The score of the diagonal run is the sum of the hot spots scores and the interspot penalties. If a word match overlaps another word on the same diagonal, only the scores of the non-overlapping symbols are added to the score of the diagonal. A region does not contain any indels (insertions or deletions) because it is derived from a single diagonal.

5.2.1.3 Scoring with Substitution Matrix

FastA next applies a substitution matrix to the 10 best regions found above. The substitution matrix may be an amino acid or nucleotide based. This step allows different matches to be weighted differently. It also allows for conservative amino acid replacements. Words containing relatively immutable amino acids will contribute a larger score to the diagonal than a word which contains amino acids which are less stable. The matrix encapsulates the biological significance of word matches. The single best subalignment found after the application of the substitution matrix is termed init1.

5.2.1.4 Combining Initial Regions from Different Diagonals

In this step, FastA checks to see if any of the initial regions from different diagonals may be combined to form a new higher scoring region. The score for the combined regions is the sum of the scores of the contributing regions less a joining penalty for each join. Regions that overlap cannot be joined together. In order to make the joins the indels are introduced in this step. The score for the highest scoring region after this step is termed initn.

This step can be implemented using directed weighted graphs where the vertices are the subalignments from the last stage.[24] The weight in each vertex is its score. An edge is drawn from vertex u to vertex v if v begins beneath where u ends. The edge is negatively weighted based on the number of gaps it would introduce. The maximum weight path gives the initn alignment.

5.2.1.5 Optimal Alignment

In addition to initn, FastA computes an alternative local alignment score opt. This score is obtained by considering a narrow diagonal band in the dynamic programming matrix, centered along the init1 diagonal. Using the ordinary dynamic programming, the optimal local alignment in this band is computed.

5.2.1.6 Presentation

Finally the database is ordered by either the opt or initn scores and the highest ranking result sequences are run thorough a full Smith-Waterman alignment. The number of alignments done this way depends on the user defined scope of the search.

5.2.2 Statistics

Starting with version 2.0 of FastA, a measure of the significance of the match is provided along with the raw scores.

FastA uses a simple linear regression against the natural log of the search set sequence length to calculate a normalized z-score for the sequence pair.[19] The idea is that longer sequences in the search set can get artificially high scores because of the increased likelihood of random matches. The z-score is typically calculated from the opt score but FastA can be made to calculate it from the initn score instead.

The distribution of the z-scores tends to closely approximate an extreme-value distribution. As such FastA can report an E() score. This is an estimate of the expected number of sequences that would (from pure chance) have a z-score greater than or equal to the z-score obtained in the search.

5.2.3 FastA Variants

The following variants of FastA exist for searching with different query types or data sets.

Program Query Type Database Type
FastA Nucleotide/Amino Acid Nucleotide/Amino Acid
FastAX Nucleotide Amino Acid
TFastX Amino Acid Nucleotide

For the purposes of comparing algorithms we used FastA to perform a DNA query against a DNA database.

5.3 BLAST[3][4][10][30]

The Basic Local Alignment Search Tool (BLAST) program uses a heuristic algorithm to search for local alignments of a query string on a BLAST formatted database. It is reported to run 100 times faster than a Smith-Waterman serial search and although it is not as sensitive as a rigorous search, it generally is a good place to start a search and will return all high scoring matches. For this reason it is the most popular search program. There is a different BLAST version for each of the combinations of query types and database types.

Program Query Type Database Type Frames
BLASTP Amino Acid Amino Acid 1
BLASTN Nucleotide Nucleotide 1
BLASTX Nucleotide Amino Acid 6
TBLASTN Amino Acid Nucleotide 6
TBLASTX Nucleotide Nucleotide 36

The BLAST database consists of three files for every FastA file input.[10] The first contains all of the sequence headers, textual information about the amino acid or nucleotide sequence. The second contains the compressed sequences (2 bits for each nucleotide, 5 bits for each amino acid). The third file contains an index of the compressed sequences so that they can be matched with the corresponding headers. The formats are slightly different between BLAST 1.4 and BLAST 2.0 but the compression gives small I/O speed advantage over programs that use the FastA format.

The program runs in 3 rounds.

  1. Database Scanning
  2. Seed Growing
  3. Combining Alignments

5.3.1 Database Scanning

BLAST searches the database using sequential search for short words (k-tups) of length W (Word Size) in the query string which score higher than T (Neighborhood Word Score Threshold also called the threshold or cutoff score). In BLAST 1.4, W is usually 3 amino acids or 11 nucleotides, and in BLAST 2.0 this is usually 2 amino acids or 2 sets of 5 nucleotides that are not contiguous.

Once the tuple size has been selected, scanning may be accomplished in either of two ways. The first maps all k-tups of length W into a unique integer. An array (indexed with this integer) pointing to lists of places in the database that will give a score of T or greater when compared with W long sequence is constructed. For nucleotide sequences where W = 5, this array would be of size 54 or 1024. Using a k-tup as an index into the array would give an immediate list of corresponding hits.

The second approach for scanning a database is to construct a deterministic finite automata (DFA) based on the query word to scan the database. A special feature is a construction which allows acceptance on transitions as opposed to states. This second approach was chosen since it saved on time and space (BLAST scans a protein database at approximately 500,000 residues/sec).

The list of successful words is called the neighborhood. This information is stored in an index for the next round.

5.3.2 Seed Growing

In this round the matches found in the first round are used as "seeds" to "grow" the alignment in both directions. When a growing match falls off by the quantity X (Maximum Permissible Drop-off) from its maximum achievable score or the score becomes zero after too many negative scores, this seed is halted and another is tested. This will also terminate in one direction if the end of either sequence is reached. For each alignment a score is a calculated. For nucleotide sequences a positive score is given to a match using M (Single-Letter Match) and a negative score is given to a mismatch using N (Single-Letter Mismatch). For each alignment the probability of attaining a match is calculated using Poisson or sum [15] statistics. These results are sorted by score are those higher than S2 (Secondary Cutoff Score) and with statistical significance greater than E2 (Secondary Expectation) are further analyzed. The passing alignments are called High Scoring Pairs (HSPs).

5.3.3 Combining Alignments

Now the program attempts to combine multiple alignments. If the calculated statistics are less than E (Statistical Expectation) the result is not report. By default E is set to 10 which means that we expect at least 10 sequences to be reported by chance alone. Finally all significant alignments with scores greater than S (Cutoff Score) are reported.

5.4 SALSA

Even though SALSA [20] (Searching with Assembly of Local Alignments) has only been implemented as a protein sequence alignment and search algorithm, we have included it in our survey, because we believe that the algorithm may be adapted to nucleotide sequences with a reasonable amount of effort. We also believe that SALSA provides a unique approach towards increasing the sensitivity of searches without greatly increasing the time requirements.

SALSA uses a three stage heuristic algorithm to approximate the optimal alignment between sequences. Though it does not always produce the score of the optimal alignment, its authors claim that SALSA is more sensitive (accurate) than both BLAST and FastA.

The three stages in the SALSA are heuristic are:

  1. Database Scanning
  2. Hit Extension
  3. Final Score Calculation

5.4.1 Database Scanning

Much like FastA and BLAST, the first stage of the SALSA heuristic attempts to find similarity between k-tuples in the query sequence and k-tuples in the database sequences. For amino acids a k of sizes between 2 and 4 are allowed. This would correspond to a k of sizes between 6 and 12 for nucleotides.

For this step, a lookup table is built containing the offset in the query sequence to the position where a high scoring match to each k-tuple starts. Only weighted matches which score above a threshold are considered. For amino acids with k = 2 there are 400 (202) positions in the lookup table. The table becomes far larger if the analogous search is done on a nucleotide sequence with k = 6. There would be 4096 (46) positions in nucleotide lookup table.

For query sequences of size m, building the table for nucleotide sequences is order O((4k) × k × m). This is not a cheap step, however, for averages sequences, this will be the most expensive step in the SALSA heuristic.

Once the lookup table is built, each sequence in the database is scanned for exact matches with each of the k-tuples that exceeds the score threshold. Any sequence in the database which contains an exact match to one of the k-tuples that scored above the threshold will be remembered as a potential match candidate. The position of the high scoring k-tuple will also be retained.

5.4.2 Hit Extension

The hit extension phase of the algorithm attempts to take each of the high scoring k-tuples and extend them, without gaps, in either direction. To prevent the extension process from running for ever, the high scoring pairs are extended until a score of 0 is obtained, at which point the fragment is trimmed back to the length that produced the maximal score. This produces a collection of high scoring, ungapped, locally aligned fragments.

5.4.3 Final Score Calculation

Up to this point SALSA behaves much like BLAST 2.0. Here SALSA sacrifices speed for sensitivity. While the BLAST algorithms build their final approximate alignment from the highest scoring fragment. SALSA will attempt to chain its fragments into even higher scoring gapped alignments.

For each sequence in the database, each of the fragments that score above a threshold are retained in a structure which sorts them in a double linked list by diagonal, and a single linked list by position in the sequence. Eliminating the low scoring fragments will save computation time, though it might cause any high scoring alignment built around a low scoring fragment to be overlooked. If SALSA is to be adapted to nucleotide searches, some experimentation will be required to determine an appropriate threshold.

Once all the fragments have been arranged into a sorted structure, the SALSA heuristic will attempt to connect the fragments to form an even higher scoring chain of fragments. It is this process of connecting high scoring fragments together that differentiates SALSA from BLAST.

Two fragments will only be considered for joining in a chain if the cost of the gap required to connect the two fragments is less than the score of either fragment. This simple rule allows pruning of the combination of fragments considered for chaining.

When two fragments are considered for chaining, the gap necessary to connect the fragments is inserted, or any overlap between the two fragments is trimmed. If the score of the chained fragments is greater than that of the original fragment, the new chained fragment and its score are added to the fragment list. Once all valid chaining combinations have been considered, the highest scoring fragment is used as the approximation for the optimal alignment between the query sequence and the database entry.

6. Performance (Empirical Results)

In this section we empirically examine the performance of SSearch, FastA, and BLAST. All searches described in this section were performed against three databases with the following attributes:

Name Sequences Nucleotides Average Nucleotides
Per Sequence
Percent A Percent C Percent G Percent T
Ecoli 400 4662258 11655 24.638491 25.403892 25.346817 24.610800
Yeast 3260 14686548 4505 31.108815 18.953712 18.926640 31.010834
HS_Chr01 130 22067529 169750 28.393451 21.580459 21.580926 28.445165

The ecoli database is from GenBank's collection of E. coli DNA sequences. It was selected, because it is relatively small in size. The yeast database is from GenBank's collection of yeast DNA. It was selected, because it is substantially larger than the E. coli dataset. The HS_Chr01 database is from GenBank's collection of human DNA sequences found in chromosome one. It was selected because it is larger than the yeast database, but has less sequences than both the yeast and ecoli database. An algorithm that is slowed by the number of sequences more than the number of nucleotides, should search the human dataset faster than the yeast database.

All of our queries were performed on an IBM PC compatible with an 166 MHz Intel Pentium Processor and 48MB of RAM, running Windows 95 in safe mode. Each of the search programs were executed while no other programs were running. To prevent any queries gaining a performance advantage from a previously cached data set, no two consecutive queries were made on the same data set. During early trials of the search software it was observed that the compressed BLAST databases for all three collections were being held in the disk cache together. Since Windows 95 does not document a method for clearing it's disk cache, the computer was rebooted between a searches on the same database. To prevent later searches with the same executable from gaining any cache benefits over the first execution, each search program was invoked once with no parameters, just to place the program in the disk cache and possibly the RAM cache.

For each of the queries, we used the unmodified Win32 executables that came with the official distribution. The version of SSearch and FastA used was 3.2t06. The version of BLAST used was 2.0.10.

Each of the three search programs were given three query sequences for our testing. Query one is a sequence of 25 nucleotides extracted from one of the sequences in the HS_Chr01 database. Query two is a sequence of 50 nucleotides extracted from one of the sequences in the yeast database, and query three is a sequence of 100 nucleotides extracted from one of the sequences in the E. coli database.

The results of our queries are listed in the table below. Each query was made using a +5/-4 match/mismatch penalty and a -16/-4 gap/extension penalty. These values were chosen, because they are the default values for SSearch. We only report the number of sequences found with an E < 2.0, again because this is the default SSearch value. Time is the time to complete execution in seconds as reported by the MS-DOS time functions. Our queries sequences, their results, and the MS-DOS batch files we used to run the queries may be found at http://www.cs.ucsb.edu/~mdipper/dna/results.zip.

Query 1 Query 2 Query 3
Time Sequences Time Sequences Time Sequences
E
c
o
l
i
SSearch 24.94 Sec 0 43.61 Sec 0 86.4 Sec 3
FastA 10.93 Sec 0 10.16 Sec 1 10.65 Sec 2
BLAST 2.97 Sec 0 3.28 Sec 1 9.89 Sec 143
Y
e
a
s
t
SSearch 78.65 Sec 0 141.16 Sec 1 246.45 Sec 0
FastA 32.39 Sec 1 33.67 Sec 2 36.31 Sec 2
BLAST 5.09 Sec 0 5.49 Sec 3 6.04 Sec 0
H
u
m
a
n
SSearch 116.74 Sec 4 207.68 Sec 3 372.23 Sec 5
FastA 46.69 Sec 3 47.40 Sec 0 48.01 Sec 0
BLAST 7.20 Sec 1 6.75 Sec 0 6.70 Sec 0
Database Search Results

From this table above and are actual results we have made some observations about the speed and correctness of each of the algorithms.

6.1 Speed

Our results show FastA to be more than twice as fast as SSearch, and BLAST to be more than twice as fast as FastA. In addition it can be observed that the size of the database is a greater impact on the performance of FastA and BLAST than the size of the query sequence. This is because FastA and BLAST spend much of their initial processing time comparing the database to a k-tuple from the query sequence instead of the whole query sequence. The size of the query sequence has no bearing on the initial steps of FastA and BLAST.

Unlike FastA and BLAST, SSearch takes approximately double the time to search a database if the size of a query sequence is doubled. This is because SSearch is a O(MN) algorithm, where M is the size of the query string. Double the size of M, and the order of operations will double.

6.2 Correctness

An interesting thing to observe from the table is that, even though SSearch produces an optimal scoring for the alignment of a query sequence against each of the database sequences, there are times when SSearch reports less statistically significant alignments than either FastA or BLAST. We believe that this phenomena occurs because the E values of FastA and BLAST are calculated from suboptimal alignments, which turn out to be less likely than a higher scoring optimal alignment obtained by SSearch.

We have verified that in every case where the query sequence was a subsequence from the database being searched, it was matched with it's source. However, there does not seem to be a clear connection between the rest of the high scoring sequences returned by each program.

6.3 SALSA Performance

SALSA was tested using 11 queries against the protein collection in SWISS-PROT protein database.[20] Only sequences with a score of E < 1.0 were reported by each program. The results of the queries were compared against BLAST, FastA, and SSearch for accuracy and time. The table below attempts to summarize these results. As with our earlier performance analysis, the SSearch results should be read for optimal alignments. The time was given as CPU cycles with no further clarification.

Number of Sequences Found CPU Cycles Required
SSearch BLAST FastA SALSA SSearch BLAST FastA SALSA
2051 1988 1904 2029 7463 189 676 613
Results from SALSA Study

Though the current implementation of SALSA applies only to amino acid databases, there is no reason to believe, SALSA, BLAST, FastA, and SSearch would not scale to nucleotide databases with similar results.

7. Index Search Tools

SSearch, FastA and BLAST are all examples of exhaustive search tools. Each entry in the database is consulted before formulating an answer set to a query. We have chosen in or paper to concentrate on the "bread and butter" tools that are in everyday use. There has however, been work in trying to index genomic databases. Despite the likely benefits of indexing in reducing query evaluation costs, existing indexed-based systems are not favored over exhaustive approaches because of limitations in indexed systems. This section overviews four such schemes: RAMdb, FLASH, PropSearch, and CAFE.

7.1 RAMdb

Rapid Access Motif database is a system for finding short patterns in genomic databases.[9] Each genomic sequence is indexed by its constituent overlapping intervals in a hash table structure. For each interval, an associated list of sequence numbers and offsets is stored. This allows a quick lookup of any sequences matching a query sequence.

RAMdb is best suited for the lookup of query sequences whose length is on the order the of the indexed interval length. RAMdb has been shown to result in a up to a 800-fold speedup in search times over comparable exhaustive approximate pattern matching approaches.

7.1.1 RAMdb Limitations

RAMdb requires a large index twice the size of the original flat-file database (including the textual descriptions) and suffers from lack of special-purpose ranking schemes designed for identifying initial match regions. In addition, the non-overlapping interval means false dismissals unless the frame happens to coincidentally align with the start of the interval frame.

7.2 FLASH

The FLASH search tool redundantly indexes genomic data based on a probabilistic scheme[8]. For each interval of length n, the FLASH search structure stores, in a hash-table, all possible similarly-ordered contiguous and non-contiguous subsequences of length m that begin with the first base in the interval, where m < n. As an example, for a nucleotide sequence ACCTGATT the index terms for the first n = 5 bases, where m = 3, would be ACC, ACT, ACG, ACT, ACG and ATG- each of the permuted strings begins with the base A, the first base in the interval of length n = 5. The hash-table then stores each permuted m-length subsequence, the sequences that contain the permuted subsequences, and the offsets within each sequence of the permuted subsequence. The key idea of the flash scheme is that the permuted scheme gives an accurate model that approximates a reasonable number of insertions, deletions, and substitutions in genomic sequences.

The authors found that FLASH was of the order of ten times faster for a small test collection than BLAST and was superior in accurately and sensitively determining homologies in database searching.

7.2.1 FLASH Limitations

FLASH utilizes a redundant index, which is stored in a hash-table and is uncompressed, is impractically large. For a nucleotide collection of around 100 Mb, the index requires 18 Gb on disk, around 180 times the collection size. We seriously doubt the 10x performance could be attained on general purpose hardware unless the collection was sufficiently small such that swapping the index in and out of memory was not an issue.

7.3 PropSearch

In the PROPSEARCH tool[14], the authors propose neglecting the order of amino acids in a sequence and representing the sequence as a vector of 144 different characteristics, notably residue frequency, molecular weight, average residue-size, average hydrophobicity, and average charge. Characteristics such as hydrophobicity that are known to be stronger indicators of homology, carry more weight than those that are known to be lesser indicators. Query sequences are translated into vectors and results of a search returned as a ranked list of sequences in decreasing order of Euclidean distance from database sequence vectors. The authors provide a tabulation of distances and estimations of their reliability.

7.3.1 PropSearch Limitations

Other studies have found [33] that experience with PROPSEARCH suggests it is only reliable for detecting evolutionary close similarities between highly homologous sequences and that subsequent alignments are required to verify homology. Lack of positional information of physicochemical properties, as used in the exhaustive schemes, is the likely contributing factor to its poor retrieval effectiveness. In addition most of the properties used by PROPSEARCH are not shared by the nucleotides that generate the amino acid sequences. At the present there are no known set of properties which may be used to form the nucleotide vector.

7.4 CAFE

CAFE is based on a partitioned search approach [33], where a coarse search using an inverted index is used to rank sequences by similarity to a query sequence and a subsequent fine search used to locally align only a subset of database sequences with the query. The CAFE index consists of three components: a search structure, which contains the index terms or distinct intervals, that is, fixed-length overlapping subsequences from the collection being indexed; inverted lists, which are a carefully compressed list of ordinal sequence numbers, where each list is an index of sequences containing a particular interval; and, a mapping table, which maps ordinal sequence numbers to the physical location of sequence data on disk. Queries are evaluated by representing the query as a set of intervals, retrieving the list for each interval, and using a ranking structure to store a similarity score of each database sequence to the query. Like FLASH, CAFE uses an overlapping interval. It uses a compression scheme to try to make the index size more manageable.

7.4.1 CAFE Limitations

The author notes that although it is more computationally efficient than the exhaustive methods, "exhaustive systems generally have better retrieval effectiveness".

7.5 Final Thoughts on Index Based Searching

The motivation for indexing is to decrease the query evaluation costs through retrieving only a fraction of the database sequences to formulate an answer set. The negative aspect of indexing is an often large uncanny index. Even so the problem with exhaustive search tools grows increasingly acute as the genomic databases increase in size. GenBank for instance is doubling in size every year. This is typical of genomic databases. In spite of problems with current indexing techniques, the need to move away from exhaustive search techniques will keep getting stronger. CAFE seems to be the most mature and developed of the indexed based techniques.

8. Possible Performance Improvements

Up until this point, the most successful techniques we have discussed attempt to improve search time by pruning the search set though attempts at smaller alignments and/or using a faster heuristic to approximate the optimal alignment of sequences. While other less successful techniques have attempted use an indexing scheme to reduce the number of structures searched. In this section we will discuss other techniques used to reduce the search time in other types of databases and examine the effects similar techniques may have on nucleotide databases.

8.1 Feature Vectors

Feature vectors are commonly used to compare similarity between query items and a search set. The goal of a feature vector is to extract features from the data base objects and the query. If the feature vectors are to be useful, the feature vector should be smaller than the actual object, and there should be a way to calculate distances between feature vectors which produces results similar to the distances between the actual objects.

The only features in DNA sequences are the nucleotides that make up the patterns formed by the sequences. Feature vectors consisting of counts of individual nucleotides or nucleotide sequences are not that useful for the following reasons:

8.1.1 Problems with Subsequences

One sequence my be a subsequence of the other, and because of that sections of the supersequence that are not part of the optimal alignment should not contribute to the feature vector. Since it is not possible to have apriori knowledge of which query sequences will be used, the sections of a DNA string that do not contribute to the alignment cannot be kept from effecting the feature vector. In many cases the contribution of features in the sections that are not part of an optimal alignment can cause the distance between the vectors to be far from the scores of the optimal alignments of the sequences.

8.1.2 Nucleotide Ordering and Exact Matches

The order that nucleotides appear in plays an important role in determining the similarity between two DNA sequences. A feature vector based on nucleotide frequencies alone will not provide any ordering information. Ordering information may be preserved by considering features to be k-tuples. Since k-tuples in a single sequence may be overlapping, a single DNA sequence of size n will have n - k + 1 different k-tuples.

Both FastA and BLAST have demonstrated that it is possible to prune list of sequences searched by examining nucleotide sequences of length 6 (six-tuples). However, they have demonstrated that it is not enough to only consider sequences which contain matching 6-tuples they must also consider sequences that contain 6-tuples that have high scoring matches with each other.

If a 6-tuple feature vector is to be constructed, it would have 46 (or 4096) entries. It is common to use SVD to reduce the dimensionality of feature vectors, that will not work for DNA. By using SVD, there will be k-tuples which are not considered for alignments. Though the k-tuples thrown out may not be important to distinguish database sequences from each other, they may be the k-tuples which optimal alignments are built from. Unless a method of determining apriori which 6-tuples do not count towards queries, all 4096 possible 6-tuples must be included in the feature vector. Since the average database DNA sequences are about 1000 nucleotides long, a feature vector for a 6-tuple will be four times larger than its corresponding sequence.

Aside from it's size, there are other limitations to feature vectors made from high scoring 6-tuples. We will discuss these limitations in section 8.3.

8.2 Frequency Domain Mapping

Another technique used to reduce dimensionality of sequences is to transform time domain sequences into the frequency domain. A Discrete Fourier Transform (DFT) is used to transform a time sequence into the frequency domain. The dimensionality of the transformed sequences is then reduced by throwing out coefficients of the resulting transform.

A DNA sequence may be viewed as a sequence in the time domain, in which each nucleotide is considered to occur later in time than the nucleotide to the left and earlier than the nucleotide to the right. Such a view allows a DNA sequence to be transformed into the frequency domain, however, since nucleotides may appear in any order with equal probability, the sequences appear as white noise in the frequency domain. Each of the techniques which reduce dimensionality in the frequency domain require that there be a larger amount of energy in some of the frequencies than others. One of the properties of white noise is that there is no substantial difference between energy over a range of frequencies. Therefore, it is not possible to achieve meaningful reduction of dimensionality by transforming DNA sequences into the frequency domain.

8.3 Precomputation

Both FastA and BLAST use comparison of k-tuples as an initial step in the search process. In section 8.1 we discussed using a feature vector which indicates the presence or absence of high scoring k-tuples. Though sizable, such a feature vector would allow both FastA and BLAST to quickly perform their first phase calculations. FastA would be able to identify sequences containing hot spots and BLAST would be able to identify "seeds".

Thought precomputation offers potential for a great time savings in the initial phases of FastA and BLAST, there are drawbacks as well. Precomputed feature vectors are large (4069 dimensions for a 6-tuple vector). To avoid large feature vectors, it is possible to precomputed a lexicon of possible 6-tuples referencing the locations they occur in the database. For large databases, the lexicon would be smaller than the feature vectors. However either method still suffers from an unreasonable inflexibility. The tuple size and the scoring matrix for a precomputed methods are constant, and cannot be changed at search time.

9. Conclusion

The requirements for nearest neighbor searching of nucleotide databases make the problem substantially different from other common search problems. Searching for DNA sequences does subject itself to conventional techniques used to improve performance of database searching. State of the art optimizations of the problem attempt to use heuristics to lessen the computation required for an exhaustive search. In all cases the heuristics used sacrifice recall for speed. Indexing schemes have also been attempted with limited success, they were either limited by the types of queries that they performed well, or the amount of space required by the index.

Bibliography

[1] L. Allison, D.Powell & T.I.Dix. Compression and Approximate Matching, The Computer Journal, Volume 42, Issue 1, Pages 1-10, 1999

[2] L. Allison. Information-Theoretic Sequence Alignment, Technical Report 98/14, School of Computer Science and Software Engineering, Monash University, Australia 3168

[3] S. F. Altschul, W. Gish, W. Miller, E. W. Myers, D. J. Lipman. Basic Local Alignment Search Tool. National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda MD, 1990.

[4] S. F. Altschul, T. Madden, A. Alejandro, A. Schaffer, J. Zhang, Z. Zhang, W. Miller, D. J. Lipman. Gapped BLAST and PSI-BLAST: a New Generation of Protein Database Search Programs. National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda MD, July 1997.

[5] Dennis A. Benson, Mark S. Boguski, David J. Lipman, James Ostell and B.F. Francis Ouellette. GenBank Nucleic Acids Research, Vol 26, Issue 1.

[6] Gail Binkley. Alignments, Algorithms and Statistics. Compiled from lectures by Warren Gish and William Pearson, Cold Spring Harbor Laboratory Course - Computational Genomics, Oct 31-Nov 5, 1996

[7] Steven E. Brenner, Cyrus Chothia and Tim J. Hubbard. Assessing Sequence Comparison Methods with Reliable Structurally Identified Distant Evolutionary Relationships, Proc. Natl. Acad. Sci. USA Vol. 95, pp. 6073-6078, May 1998 Biochemistry

[8] A. Califano and I. Rigoutsos. FLASH: A fast look-up algorithm for string Homology. In International Conference on Intelligent Systems for Molecular Biology, pages 56-64, Bethesda, MD, 1993.

[9] Arthur L. Delcher, Simon Kasif, Robert D. Fleischmann, Jeremy Peterson, Owen White, and Steven L. Salzberg. Alignment of Whole Genomes. Nucleic Acids Research, Vol. 27 No. 11, pages 2369-2376, 1999.

[9] C. Fondrat and P. Dessen. A Rapid Access Motif Database (RAMdb) with a Search Algorithm for the Retrieval Patterns in Nucleic Acids or Protein Databanks. Computer Applications in the Biosciences, 11(3):273-279, 1995.

[10] B. A. Gaeta. Database Similarity Searching Using BLAST and FastA. Australian National Genomic Information Service, The University of Sydney, NSW, 1995.

[11] W. Gish. Introduction to Alignment Scoring Statistics. St.Louis, Missouri, last updated Oct 17 1999.

[12] E. Ionides. Hueristic Proof of the Karlin-Altschul Theorem.

[13] C.Harger, M. Skupski, J.Bingham, A. Farmer, S. Hoisie, P. Hraber, D. Kiphart, L. Krakowski, M. McLeod, J. Schwertfeger, G. Seluja, A. Siepel, G. Singh, D. Stamper, P.Steadman, N. Thayer, R. Thompson, P. Wargo, M. Waugh, J.J. Zhuang and P.A. Schad. The Genome Sequence DataBase (GSDB): Improving Data Quality and Data Access Nucleic Acids Research, Vol 26, Issue 1.

[14] U. Hobohm and C. Sander. A Sequence Property Approach to Searching Protein Databases. Journal of Molecular Biology, 251:390-399, 1995.

[15] S. Karlin and S.F. Altschul. Applications and Statics for Multiple High-Scoring Sequence Alignemnts. Proceedings of the National Academy of Sciences, 90:2264-2268, 1993.

[16] Gerhard Mehldau and Gene Myers. A System for Pattern Matching Applications on Biosequences CABIOS 9, 3 (1993), 299-314

[17] Dr. David Mount and Dr. Samuel Ward. MCB 416/516 BIOINFORMATICS AND GENOMIC ANALYSIS Class Notes. Spring 1998

[18] Pearson, W.R. and D.J. Lipman. Improved Tools for Biological sequence comparison, Proc. Natl. Acad. Sci. USA 85 (1988), 2444-2448

[19] William R Pearson, Protein Science Vol 4 1145-1160 (1995)

[20] Torbjorn Rognes and Erling Seeberg. SALSA: Improved Protein Database Searching by a New Algorithm for Assembly of Sequence Fragments into Gapped Alignments Bioinformatics, Vol. 14 no. 10, Pages 839-845, 1998

[21] Andrey Rzhetsky. BLAST Database Formats

[23] Ron Shamir. Pairwise Alignment Algorithms for Molecular Biology, Lecture 2, 1998. Tel Aviv University

[24] Ron Shamir. Algorithms for Molecular Biology. Lecture 3. Fall 1998. Tel Aviv University

[25] Ron Shamir. Approximatin Algorithms for Multiple Sequence Alignment Algorithms for Molecular Biology, Lecture 5, 1999. Tel Aviv University

[26] Smith, T.F. and M.S. Waterman. The identification of common molecular subsequences. J. Mol. Biol. 147,195-197, 1981

[27] Jean-Stephen Varre, Jean-Paul Delahaye, and Eric Rivals. Transformation Distances: A Family of Dissimilarity Measures Based on Movements of Segments Bioinformatics, Vol. 15 no. 3, Pages 194-202, 1999

[28] David J. States, Warren Gish, Stephen F. Altschul Improved Sensivity of Nucleic Acid Databas Searches Using Application-Specific Scoring Matrices. Methods A Companion to Methods in Enzymology Vol. 3, No. 1, August, Pages 66-70 1991

[29] Tatiana A. Tatusova and Thomas L. Madden. BLAST 2 Sequences, a New Tool for Comparing Protein and Nucleotide Sequences FEMS Microbiology Letters 174 (1999) 247-250

[30] U.K. Human Genome Mapping Project Resource Centre. Seaching Sequence Databases with Sequences.

[31] U.S. Department of Energy, DOE Human Genome Program, Primer on Molecular Genetics. June 1992.

[32] Thomas Werner, The Biology and Bioinformatics of Regulatory Regions in Genomes. Institute of Mammalian Genetics, München, Germany.

[33] H. Williams, Indexing and Retieval for Genomic Databases. Doctoral Dissertation, Royal Melbourne Institute of Technology, Victoria, Australia

[34] Felix Wu and Guy Blelloch Algorithms in the Real World Lecture #20 (Pattern Matching 2)

1