3.3 Alignments as a Search Problem
3.3 Alignments as a Search Problem
Section titled “3.3 Alignments as a Search Problem”Learning Objectives
Section titled “Learning Objectives”After reading this section, you should be able to:
- Understand why finding an optimal alignment is a combinatorial problem
- Explain why brute-force approaches are infeasible
- Recognize the recursive structure underlying sequence alignment
- Interpret alignments as paths in a graph
- Motivate the need for dynamic programming
From Definition to Computation
Section titled “From Definition to Computation”In Section 3.2, we defined sequence similarity in terms of edit operations and alignment scores. Conceptually, the problem is now clear:
Given two sequences, find the alignment with the highest score (or lowest cost).
However, this formulation immediately raises a practical question:
How do we actually find this optimal alignment?
At first glance, the problem appears simple. We could imagine generating all possible alignments, computing their scores, and selecting the best one.
This idea is straightforward but fundamentally flawed.
The Combinatorial Explosion of Alignments
Section titled “The Combinatorial Explosion of Alignments”To understand the difficulty, consider what an alignment actually is.
An alignment is created by inserting gap symbols into sequences such that:
- both sequences are extended to the same length
- corresponding positions can be compared
Even for short sequences, there are many ways to insert gaps. Each different placement of gaps results in a different alignment.
The number of possible global alignments between two sequences of length grows approximately as:
This growth is exponential.
For even moderately sized sequences, the number of possible alignments becomes astronomically large. Exhaustively enumerating and evaluating all alignments is therefore computationally infeasible.
Why Brute Force Fails
Section titled “Why Brute Force Fails”Let us connect this observation back to the “alignment by eye” from Section 3.1.
When we manually aligned adenylation domain sequences, we explored only a tiny subset of all possible alignments. We relied on intuition to guide us toward plausible solutions.
A brute-force algorithm, in contrast, would:
- generate every possible way of inserting gaps
- compute a score for each alignment
- return the best one
While guaranteed to find the optimal solution, this approach is impractical because the search space is too large.
This leads to a central insight:
The challenge of sequence alignment is not defining similarity, but searching the enormous space of possible alignments efficiently.
A Key Observation: Optimal Substructure
Section titled “A Key Observation: Optimal Substructure”Despite the apparent complexity, the alignment problem has a hidden structure that we can exploit.
Consider two sequences:
Suppose we already know the optimal alignment of the prefixes:
How can we extend this alignment to include and ?
There are only three possibilities:
- Align with (substitution or match)
- Align with a gap (insertion in )
- Align with a gap (deletion in )
This observation reveals something crucial:
An optimal alignment can be constructed from optimal alignments of smaller prefixes.
This property is known as optimal substructure, and it is the foundation of dynamic programming.
Overlapping Subproblems
Section titled “Overlapping Subproblems”Another important property emerges when we examine the problem more closely.
When computing alignments for different prefixes, we repeatedly encounter the same subproblems. For example, the alignment of:
may be required in multiple larger computations.
This means that:
The problem consists of many overlapping subproblems that can be reused.
This redundancy suggests that we should avoid recomputing the same results multiple times.
Alignments as Paths in a Graph
Section titled “Alignments as Paths in a Graph”A particularly powerful way to understand sequence alignment is to view it as a path-finding problem.
Imagine a grid where:
- the horizontal axis corresponds to sequence
- the vertical axis corresponds to sequence
Each cell represents the alignment of prefixes and .
From each cell, we can move in three directions:
- Diagonal → align with
- Down → align with a gap
- Right → align with a gap
Each move corresponds to an edit operation and is associated with a score.
An alignment then corresponds to a path from the top-left corner to the bottom-right corner .
Scoring Paths
Section titled “Scoring Paths”The score of an alignment is simply the sum of the scores of the moves along its path:
- diagonal moves contribute substitution scores
- horizontal and vertical moves contribute gap penalties
Thus, the problem of sequence alignment can be reformulated as:
Find the highest-scoring path from to in this grid.
This perspective transforms sequence alignment into a classic optimization problem on a graph.
From Search to Dynamic Programming
Section titled “From Search to Dynamic Programming”We are now in a position to resolve the central challenge.
Instead of enumerating all possible paths (alignments), we can:
- compute the best score for each cell
- reuse previously computed results
- build the solution incrementally
This approach avoids redundant computation and reduces the complexity from exponential to polynomial time.
This method is known as dynamic programming.
Conceptual Summary
Section titled “Conceptual Summary”Sequence alignment can be understood as a search problem over a vast space of possible alignments. A naive brute-force approach is infeasible due to the exponential growth of this space.
However, the problem exhibits two key properties:
- Optimal substructure: optimal solutions can be built from optimal subsolutions
- Overlapping subproblems: the same subproblems appear repeatedly
By exploiting these properties and interpreting alignments as paths in a grid, we can transform the problem into one that can be solved efficiently using dynamic programming.
Self-Check Questions
Section titled “Self-Check Questions”- Why does the number of possible alignments grow exponentially with sequence length?
- What makes a brute-force approach to sequence alignment impractical?
- What are the three possible ways to extend an alignment at position ?
- How can sequence alignment be interpreted as a path-finding problem?
- What properties make dynamic programming applicable to this problem?