3.3 Alignments as a Search Problem

Learning Objectives

After reading this section, you should be able to:

Understand why finding an optimal alignment is a combinatorial problem
Explain why brute-force approaches are infeasible
Recognize the recursive structure underlying sequence alignment
Interpret alignments as paths in a graph
Motivate the need for dynamic programming

From Definition to Computation

In Section 3.2, we defined sequence similarity in terms of edit operations and alignment scores. Conceptually, the problem is now clear:

Given two sequences, find the alignment with the highest score (or lowest cost).

However, this formulation immediately raises a practical question:

How do we actually find this optimal alignment?

At first glance, the problem appears simple. We could imagine generating all possible alignments, computing their scores, and selecting the best one.

This idea is straightforward but fundamentally flawed.

The Combinatorial Explosion of Alignments

To understand the difficulty, consider what an alignment actually is.

An alignment is created by inserting gap symbols into sequences such that:

both sequences are extended to the same length
corresponding positions can be compared

Even for short sequences, there are many ways to insert gaps. Each different placement of gaps results in a different alignment.

The number of possible global alignments between two sequences of length $N$ grows approximately as:

\frac{2^{2N}}{\pi N}

This growth is exponential.

For even moderately sized sequences, the number of possible alignments becomes astronomically large. Exhaustively enumerating and evaluating all alignments is therefore computationally infeasible.

Why Brute Force Fails

Let us connect this observation back to the “alignment by eye” from Section 3.1.

When we manually aligned adenylation domain sequences, we explored only a tiny subset of all possible alignments. We relied on intuition to guide us toward plausible solutions.

A brute-force algorithm, in contrast, would:

generate every possible way of inserting gaps
compute a score for each alignment
return the best one

While guaranteed to find the optimal solution, this approach is impractical because the search space is too large.

This leads to a central insight:

The challenge of sequence alignment is not defining similarity, but searching the enormous space of possible alignments efficiently.

A Key Observation: Optimal Substructure

Despite the apparent complexity, the alignment problem has a hidden structure that we can exploit.

Consider two sequences:

X = x_1 x_2 \dots x_m \qquad Y = y_1 y_2 \dots y_n

Suppose we already know the optimal alignment of the prefixes:

x_1 \dots x_{i-1} \quad \text{and} \quad y_1 \dots y_{j-1}

How can we extend this alignment to include $x_i$ and $y_j$ ?

There are only three possibilities:

Align $x_i$ with $y_j$ (substitution or match)
Align $x_i$ with a gap (insertion in $Y$ )
Align $y_j$ with a gap (deletion in $X$ )

This observation reveals something crucial:

An optimal alignment can be constructed from optimal alignments of smaller prefixes.

This property is known as optimal substructure, and it is the foundation of dynamic programming.

Overlapping Subproblems

Another important property emerges when we examine the problem more closely.

When computing alignments for different prefixes, we repeatedly encounter the same subproblems. For example, the alignment of:

x_1 \dots x_i \quad \text{and} \quad y_1 \dots y_j

may be required in multiple larger computations.

This means that:

The problem consists of many overlapping subproblems that can be reused.

This redundancy suggests that we should avoid recomputing the same results multiple times.

Alignments as Paths in a Graph

A particularly powerful way to understand sequence alignment is to view it as a path-finding problem.

Imagine a grid where:

the horizontal axis corresponds to sequence $Y$
the vertical axis corresponds to sequence $X$

Each cell $(i, j)$ represents the alignment of prefixes $x_1 \dots x_i$ and $y_1 \dots y_j$ .

From each cell, we can move in three directions:

Diagonal → align $x_i$ with $y_j$
Down → align $x_i$ with a gap
Right → align $y_j$ with a gap

Each move corresponds to an edit operation and is associated with a score.

An alignment then corresponds to a path from the top-left corner $(0,0)$ to the bottom-right corner $(m,n)$ .

Scoring Paths

The score of an alignment is simply the sum of the scores of the moves along its path:

diagonal moves contribute substitution scores
horizontal and vertical moves contribute gap penalties

Thus, the problem of sequence alignment can be reformulated as:

Find the highest-scoring path from $(0,0)$ to $(m,n)$ in this grid.

This perspective transforms sequence alignment into a classic optimization problem on a graph.

From Search to Dynamic Programming

We are now in a position to resolve the central challenge.

Instead of enumerating all possible paths (alignments), we can:

compute the best score for each cell $(i, j)$
reuse previously computed results
build the solution incrementally

This approach avoids redundant computation and reduces the complexity from exponential to polynomial time.

This method is known as dynamic programming.

Conceptual Summary

Sequence alignment can be understood as a search problem over a vast space of possible alignments. A naive brute-force approach is infeasible due to the exponential growth of this space.

However, the problem exhibits two key properties:

Optimal substructure: optimal solutions can be built from optimal subsolutions
Overlapping subproblems: the same subproblems appear repeatedly

By exploiting these properties and interpreting alignments as paths in a grid, we can transform the problem into one that can be solved efficiently using dynamic programming.

Self-Check Questions

Why does the number of possible alignments grow exponentially with sequence length?
What makes a brute-force approach to sequence alignment impractical?
What are the three possible ways to extend an alignment at position $(i, j)$ ?
How can sequence alignment be interpreted as a path-finding problem?
What properties make dynamic programming applicable to this problem?