3.5 Local Alignment and the Smith–Waterman Algorithm
3.5 Local Alignment and the Smith–Waterman Algorithm
Section titled “3.5 Local Alignment and the Smith–Waterman Algorithm”Learning Objectives
Section titled “Learning Objectives”After reading this section, you should be able to:
- Explain why global alignment is often inappropriate for biological sequences
- Define local alignment and its biological interpretation
- Derive the modified dynamic programming recurrence for local alignment
- Perform traceback for local alignments
- Understand how local alignment detects conserved regions within larger sequences
Why Global Alignment Is Not Enough
Section titled “Why Global Alignment Is Not Enough”In Section 3.4, we developed the Needleman–Wunsch algorithm for global alignment, which aligns two sequences over their entire length.
This approach implicitly assumes that:
The sequences are related along their full extent.
While this assumption is valid in some cases, it is often violated in biological systems.
Let us return to the example that motivated this chapter: non-ribosomal peptide synthetases (NRPS).
These enzymes are modular:
- Each module performs a specific function
- Only certain regions, such as adenylation domains, are conserved
- Other regions may differ substantially or be absent entirely
When comparing two NRPS proteins, it is therefore unlikely that the entire sequences are homologous. Instead, only specific segments correspond to each other.
This leads to a mismatch between model and biology:
- Global alignment forces unrelated regions to align
- This introduces artificial gaps and mismatches
- The resulting score may obscure biologically meaningful similarity
This observation motivates a new question:
How can we detect similarity that is confined to subsequences rather than entire sequences?
From Global to Local Similarity
Section titled “From Global to Local Similarity”To address this, we redefine the alignment problem.
Instead of aligning complete sequences, we allow the alignment to focus on substrings:
- any contiguous segment of sequence
- any contiguous segment of sequence
The goal is then:
Find the pair of substrings whose alignment yields the highest score.
This is the essence of local alignment.
Conceptual Interpretation
Section titled “Conceptual Interpretation”Local alignment changes the interpretation of the problem in an important way.
-
Global alignment asks: → How similar are these sequences overall?
-
Local alignment asks: → Do these sequences share a region of high similarity?
This distinction is crucial in many biological contexts:
- detecting conserved domains in proteins
- identifying regulatory motifs in DNA
- finding homologous regions between otherwise unrelated sequences
In the NRPS example, local alignment allows us to focus on functionally relevant domains, ignoring unrelated flanking regions.
Defining Local Alignment
Section titled “Defining Local Alignment”A local alignment is defined as an alignment of substrings:
such that the alignment score is maximized over all possible choices of .
Importantly:
- unaligned regions at the beginning and end incur no penalty
- only the aligned region contributes to the score
The Smith–Waterman Recurrence
Section titled “The Smith–Waterman Recurrence”Surprisingly, we can solve the local alignment problem using a slight modification of the dynamic programming framework introduced earlier.
We again define a matrix , but now with a crucial change:
Why the Zero Matters
Section titled “Why the Zero Matters”The addition of the term is the defining feature of local alignment.
It has a simple but powerful effect:
If all alignment options lead to a negative score, we reset the score to zero and start a new alignment.
This allows the algorithm to:
- ignore poorly matching prefixes
- focus only on regions of high similarity
In contrast to global alignment, where the path must span the entire matrix, local alignment allows paths to:
- start anywhere
- end anywhere
Initialization
Section titled “Initialization”Because of the zero reset, initialization becomes trivial:
This reflects the idea that an alignment can begin at any position without penalty.
A Worked Example
Section titled “A Worked Example”Let us consider the sequences:
with the scoring scheme:
- match:
- mismatch:
- gap:
Step 1: Initialize the Matrix
Section titled “Step 1: Initialize the Matrix”All entries in the first row and column are set to zero:
- G A T T G A - 0 0 0 0 0 0 0 A 0 C 0 G 0 C 0Step 2: Fill the Matrix
Section titled “Step 2: Fill the Matrix”We compute each cell using the recurrence relation.
At each step:
- negative values are replaced by zero
- positive values indicate locally good alignments
As the matrix is filled, regions of high similarity emerge as islands of high scores.
Step 3: Identify the Maximum
Section titled “Step 3: Identify the Maximum”Unlike global alignment, the optimal score is not necessarily at .
Instead:
The highest value anywhere in the matrix corresponds to the best local alignment.
Step 4: Traceback
Section titled “Step 4: Traceback”Traceback begins at the cell with the maximum score and proceeds:
- diagonally, up, or left
- following the path of optimal choices
- until a cell with value 0 is reached
This stopping condition reflects the beginning of the local alignment.
Example Alignments
Section titled “Example Alignments”For the example above, optimal local alignments include:
AC-GATTGor
A-CGATTGThese alignments capture a region of strong similarity, even though the full sequences differ substantially.
Conceptual Comparison: Global vs Local Alignment
Section titled “Conceptual Comparison: Global vs Local Alignment”The difference between the two approaches can be summarized as follows:
| Aspect | Global Alignment | Local Alignment |
|---|---|---|
| Scope | Entire sequences | Substrings |
| Initialization | Gap penalties | Zero |
| Recurrence | No reset | Includes reset to 0 |
| Traceback | From | From maximum cell |
| Use case | Similar sequences of equal length | Detection of conserved regions |
Biological Interpretation
Section titled “Biological Interpretation”Local alignment is particularly powerful because it aligns with how biological sequences evolve:
- domains can be inserted, removed, or rearranged
- only certain regions remain conserved
- functional motifs may be short and localized
Thus, local alignment allows us to detect biologically meaningful similarity even in the presence of large unrelated regions.
In the context of NRPS, this means we can:
- identify conserved catalytic domains
- detect substrate-specific motifs
- compare functionally relevant regions across diverse organisms
Conceptual Summary
Section titled “Conceptual Summary”Local alignment extends the alignment framework to situations where similarity is confined to subsequences. By allowing alignments to start and end freely, and by resetting negative scores, the Smith–Waterman algorithm identifies regions of high similarity within larger sequences.
This modification transforms sequence alignment from a global comparison tool into a powerful method for pattern discovery and functional annotation.
Self-Check Questions
Section titled “Self-Check Questions”- Why does global alignment fail for modular proteins such as NRPS enzymes?
- What is the role of the zero term in the Smith–Waterman recurrence?
- Why can the optimal local alignment start and end at arbitrary positions?
- How does traceback differ between global and local alignment?
- In which biological scenarios is local alignment preferred over global alignment?