Skip to content

3.4 Dynamic Programming and the Needleman–Wunsch Algorithm

3.4 Dynamic Programming and the Needleman–Wunsch Algorithm

Section titled “3.4 Dynamic Programming and the Needleman–Wunsch Algorithm”

After reading this section, you should be able to:

  • Formulate global sequence alignment as a dynamic programming problem
  • Derive the recursive scoring relation for alignments
  • Construct and interpret the dynamic programming matrix
  • Perform traceback to recover an optimal alignment
  • Understand the algorithmic complexity of global alignment

In the previous section, we reframed sequence alignment as a path-finding problem in a grid. Each path corresponds to an alignment, and the goal is to find the highest-scoring path from the origin (0,0)(0,0) to the terminal point (m,n)(m,n).

The key insight was that this problem exhibits optimal substructure. This allows us to compute the optimal alignment score incrementally, using previously computed solutions for smaller prefixes.

Dynamic programming provides the machinery to turn this idea into a concrete algorithm.


Let us define a matrix FF of size (m+1)×(n+1)(m+1) \times (n+1), where:

F(i,j)F(i,j)

represents the optimal alignment score between the prefixes:

x1xiandy1yjx_1 \dots x_i \quad \text{and} \quad y_1 \dots y_j

Thus:

  • F(0,0)F(0,0) corresponds to aligning two empty sequences
  • F(m,n)F(m,n) will contain the score of the optimal global alignment

Before filling the matrix, we must define boundary conditions.

Aligning a sequence with an empty sequence requires inserting gaps:

F(i,0)=igF(0,j)=jgF(i,0) = i \cdot g \qquad F(0,j) = j \cdot g

where gg is the gap penalty (typically negative).

This reflects the idea that aligning ii symbols with nothing requires ii deletions (or insertions).


Now consider how to compute F(i,j)F(i,j).

As discussed earlier, the alignment at position (i,j)(i,j) can arise from three possibilities:

  1. Match or mismatch (diagonal move):

    F(i1,j1)+s(xi,yj)F(i-1, j-1) + s(x_i, y_j)
  2. Insertion (gap in YY, vertical move):

    F(i1,j)+gF(i-1, j) + g
  3. Deletion (gap in XX, horizontal move):

    F(i,j1)+gF(i, j-1) + g

Thus, the recurrence relation is:

F(i,j)=max{F(i1,j1)+s(xi,yj) F(i1,j)+g F(i,j1)+gF(i,j) = \max \begin{cases} F(i-1, j-1) + s(x_i, y_j) \ F(i-1, j) + g \ F(i, j-1) + g \end{cases}

This equation formalizes the intuitive process of alignment by considering all possible ways to extend an optimal alignment.


Each term in the recurrence has a clear meaning:

  • The diagonal term assumes that xix_i is aligned with yjy_j
  • The vertical term assumes that xix_i is aligned with a gap
  • The horizontal term assumes that yjy_j is aligned with a gap

The algorithm chooses the option that yields the highest score, ensuring optimality at each step.

This is the essence of dynamic programming: local optimal decisions based on globally consistent subproblems.


Let us consider a simple example to illustrate the procedure.

We align:

X=ACGCY=GACTACX = \text{ACGC} \qquad Y = \text{GACTAC}

using the scoring scheme:

  • match: +1+1
  • mismatch: 00
  • gap: 1-1

We construct a grid where rows correspond to XX and columns to YY. The first row and column are filled using the gap penalty:

- G A C T A C
- 0 -1 -2 -3 -4 -5 -6
A -1
C -2
G -3
C -4

We now compute each cell using the recurrence relation. For example:

  • At position (1,1)(1,1) (aligning A with G):

    • diagonal: 0+0=00 + 0 = 0 (mismatch)
    • up: 11=2-1 - 1 = -2
    • left: 11=2-1 - 1 = -2

    F(1,1)=0F(1,1) = 0

Proceeding in this way, we fill the entire matrix. Each entry represents the best score achievable for the corresponding prefixes.


Once the matrix is complete, the optimal score is found at F(m,n)F(m,n).

However, the score alone is not sufficient. We want the actual alignment.

To recover it, we perform traceback:

  • Start at (m,n)(m,n)
  • At each step, move in the direction that led to the optimal score
  • Continue until reaching (0,0)(0,0)

This reconstructs the sequence of operations that produced the optimal alignment.


For this example, multiple optimal alignments may exist, such as:

-ACG-C
GACTAC

or

-AC-GC
GACTAC

This illustrates an important point:

Optimal alignments are not necessarily unique.


The Needleman–Wunsch algorithm proceeds in three phases:

  1. Initialization
  2. Matrix filling (forward phase)
  3. Traceback (backward phase)

This structure reflects the separation between computing optimal scores and reconstructing the corresponding solution.


Let mm and nn be the lengths of the sequences.

  • Time complexity:

    O(mn)O(mn)

    since each of the mnmn cells is computed once

  • Space complexity:

    O(mn)O(mn)

    since the full matrix must be stored for traceback

This represents a dramatic improvement over the exponential complexity of brute-force approaches.

Although the Needleman–Wunsch algorithm is efficient in time, it requires storing the entire dynamic programming matrix. For long biological sequences, such as NRPS enzymes, this becomes a practical limitation. Later in this chapter, we will revisit this issue and develop a more memory-efficient algorithm based on divide-and-conquer.


The Needleman–Wunsch algorithm provides the first complete solution to the sequence alignment problem.

Conceptually, it achieves three things:

  • It defines alignment as an optimization problem
  • It provides a principled way to compute optimal solutions
  • It makes explicit the assumptions encoded in the scoring scheme

From a modeling perspective, the algorithm answers the question:

What is the best explanation of how one sequence could be transformed into another under a given scoring model?


Global sequence alignment can be solved efficiently using dynamic programming. By defining a recursive scoring function and systematically evaluating all prefix alignments, the Needleman–Wunsch algorithm computes the optimal alignment in polynomial time.

The algorithm transforms an intractable search problem into a structured computation over a matrix, making sequence alignment practical even for moderately long sequences.


  1. What does the entry F(i,j)F(i,j) represent in the dynamic programming matrix?
  2. Why are there exactly three cases in the recurrence relation?
  3. What is the purpose of the initialization step?
  4. Why is traceback necessary after filling the matrix?
  5. Why does the algorithm run in O(mn)O(mn) time?