Skip to content

3.3 Alignments as a Search Problem

After reading this section, you should be able to:

  • Understand why finding an optimal alignment is a combinatorial problem
  • Explain why brute-force approaches are infeasible
  • Recognize the recursive structure underlying sequence alignment
  • Interpret alignments as paths in a graph
  • Motivate the need for dynamic programming

In Section 3.2, we defined sequence similarity in terms of edit operations and alignment scores. Conceptually, the problem is now clear:

Given two sequences, find the alignment with the highest score (or lowest cost).

However, this formulation immediately raises a practical question:

How do we actually find this optimal alignment?

At first glance, the problem appears simple. We could imagine generating all possible alignments, computing their scores, and selecting the best one.

This idea is straightforward but fundamentally flawed.


To understand the difficulty, consider what an alignment actually is.

An alignment is created by inserting gap symbols into sequences such that:

  • both sequences are extended to the same length
  • corresponding positions can be compared

Even for short sequences, there are many ways to insert gaps. Each different placement of gaps results in a different alignment.

The number of possible global alignments between two sequences of length NN grows approximately as:

22NπN\frac{2^{2N}}{\pi N}

This growth is exponential.

For even moderately sized sequences, the number of possible alignments becomes astronomically large. Exhaustively enumerating and evaluating all alignments is therefore computationally infeasible.


Let us connect this observation back to the “alignment by eye” from Section 3.1.

When we manually aligned adenylation domain sequences, we explored only a tiny subset of all possible alignments. We relied on intuition to guide us toward plausible solutions.

A brute-force algorithm, in contrast, would:

  1. generate every possible way of inserting gaps
  2. compute a score for each alignment
  3. return the best one

While guaranteed to find the optimal solution, this approach is impractical because the search space is too large.

This leads to a central insight:

The challenge of sequence alignment is not defining similarity, but searching the enormous space of possible alignments efficiently.


Despite the apparent complexity, the alignment problem has a hidden structure that we can exploit.

Consider two sequences:

X=x1x2xmY=y1y2ynX = x_1 x_2 \dots x_m \qquad Y = y_1 y_2 \dots y_n

Suppose we already know the optimal alignment of the prefixes:

x1xi1andy1yj1x_1 \dots x_{i-1} \quad \text{and} \quad y_1 \dots y_{j-1}

How can we extend this alignment to include xix_i and yjy_j?

There are only three possibilities:

  1. Align xix_i with yjy_j (substitution or match)
  2. Align xix_i with a gap (insertion in YY)
  3. Align yjy_j with a gap (deletion in XX)

This observation reveals something crucial:

An optimal alignment can be constructed from optimal alignments of smaller prefixes.

This property is known as optimal substructure, and it is the foundation of dynamic programming.


Another important property emerges when we examine the problem more closely.

When computing alignments for different prefixes, we repeatedly encounter the same subproblems. For example, the alignment of:

x1xiandy1yjx_1 \dots x_i \quad \text{and} \quad y_1 \dots y_j

may be required in multiple larger computations.

This means that:

The problem consists of many overlapping subproblems that can be reused.

This redundancy suggests that we should avoid recomputing the same results multiple times.


A particularly powerful way to understand sequence alignment is to view it as a path-finding problem.

Imagine a grid where:

  • the horizontal axis corresponds to sequence YY
  • the vertical axis corresponds to sequence XX

Each cell (i,j)(i, j) represents the alignment of prefixes x1xix_1 \dots x_i and y1yjy_1 \dots y_j.

From each cell, we can move in three directions:

  • Diagonal → align xix_i with yjy_j
  • Down → align xix_i with a gap
  • Right → align yjy_j with a gap

Each move corresponds to an edit operation and is associated with a score.

An alignment then corresponds to a path from the top-left corner (0,0)(0,0) to the bottom-right corner (m,n)(m,n).


The score of an alignment is simply the sum of the scores of the moves along its path:

  • diagonal moves contribute substitution scores
  • horizontal and vertical moves contribute gap penalties

Thus, the problem of sequence alignment can be reformulated as:

Find the highest-scoring path from (0,0)(0,0) to (m,n)(m,n) in this grid.

This perspective transforms sequence alignment into a classic optimization problem on a graph.


We are now in a position to resolve the central challenge.

Instead of enumerating all possible paths (alignments), we can:

  1. compute the best score for each cell (i,j)(i, j)
  2. reuse previously computed results
  3. build the solution incrementally

This approach avoids redundant computation and reduces the complexity from exponential to polynomial time.

This method is known as dynamic programming.


Sequence alignment can be understood as a search problem over a vast space of possible alignments. A naive brute-force approach is infeasible due to the exponential growth of this space.

However, the problem exhibits two key properties:

  • Optimal substructure: optimal solutions can be built from optimal subsolutions
  • Overlapping subproblems: the same subproblems appear repeatedly

By exploiting these properties and interpreting alignments as paths in a grid, we can transform the problem into one that can be solved efficiently using dynamic programming.


  1. Why does the number of possible alignments grow exponentially with sequence length?
  2. What makes a brute-force approach to sequence alignment impractical?
  3. What are the three possible ways to extend an alignment at position (i,j)(i, j)?
  4. How can sequence alignment be interpreted as a path-finding problem?
  5. What properties make dynamic programming applicable to this problem?