Skip to content

6.5 Decoding with the Viterbi Algorithm

Once a Hidden Markov Model has been specified, one of the first questions we want to answer is the following:

Given an observed sequence, what is the most likely sequence of hidden states that generated it?

This is the decoding problem, and the classical solution is the Viterbi algorithm.

At first glance, this problem seems straightforward. If the observed sequence is short and the number of hidden states is small, one might imagine listing all possible hidden paths, computing their probabilities, and choosing the best one. However, this quickly becomes infeasible. For a sequence of length nn and an HMM with kk hidden states, there are knk^n possible hidden state sequences. Even for moderate values of nn, exhaustive search becomes impossible.

The Viterbi algorithm solves this problem by exploiting a key structural property of Hidden Markov Models: although the total number of complete paths is enormous, many of these paths share common prefixes. This creates the kind of overlapping subproblem structure that can be handled by dynamic programming.


Let the observed sequence be

X=(x1,x2,,xn)X = (x_1, x_2, \dots, x_n)

and let the hidden state sequence be

S=(s1,s2,,sn)S = (s_1, s_2, \dots, s_n)

For a fixed model MM, the decoding problem is to find

S=argmaxSP(X,SM)S^* = \arg\max_S P(X, S \mid M)

That is, among all possible hidden paths, we seek the one that maximizes the joint probability of both the observations and the path itself.

This formulation is important. We are not simply asking which path is plausible in isolation. We are asking which path provides the most probable explanation for the observed data under the model.

In the promoter-background example, this means that we want to assign to each observed nucleotide the state that best explains it, while also respecting the transition structure of the model. A path with excellent emissions but impossible transitions will not be favored. Likewise, a path with likely transitions but poor emissions will also not be optimal. The Viterbi algorithm balances both contributions.


To appreciate the value of the algorithm, it is worth examining what a naive solution would require.

Suppose our HMM has only two hidden states:

{P,B}\{P, B\}

and the observed sequence has length five. Then there are already

25=322^5 = 32

possible hidden paths.

For a length of twenty, there would be

220=1,048,5762^{20} = 1{,}048{,}576

possible paths.

For biologically realistic sequence lengths, exhaustive evaluation is no longer practical. Moreover, most of the work would be redundant. Many different paths share the same partial prefix, and repeatedly recomputing the best continuation of such prefixes would be wasteful.

The Viterbi algorithm avoids this redundancy by storing, at each sequence position and for each state, the probability of the best path ending there.


The central observation is simple but powerful.

Suppose that at position ii we want to know the best path ending in state sjs_j. Then this best path must have come from one of the states at position i1i-1. Among all possible predecessor states, only one leads to the optimal path ending in sjs_j.

This means that, instead of considering all complete paths at once, we can compute the best partial path to each state step by step.

Define the Viterbi score

vi(j)v_i(j)

as the probability of the most likely path that generates the first ii observations and ends in state sjs_j.

This quantity summarizes exactly the information we need to continue the dynamic programming recursion.


The algorithm consists of three parts: initialization, recursion, and termination.

For the first observation x1x_1, the best path ending in state sjs_j is simply the probability of starting in that state and emitting the symbol:

v1(j)=πjej(x1)v_1(j) = \pi_j e_j(x_1)

where

  • πj\pi_j is the initial probability of state sjs_j
  • ej(x1)e_j(x_1) is the emission probability of symbol x1x_1 from state sjs_j

For each subsequent position i+1i+1, we compute

vi+1(j)=maxk(vi(k)tkjej(xi+1))v_{i+1}(j) = \max_{k} \bigl( v_i(k)\, t_{kj}\, e_j(x_{i+1}) \bigr)

This equation has a very natural interpretation.

To arrive at state sjs_j at position i+1i+1, we must

  1. have followed the best path to some predecessor state sks_k at position ii
  2. transition from sks_k to sjs_j
  3. emit the observed symbol xi+1x_{i+1} from state sjs_j

Among all predecessor states sks_k, we choose the one that maximizes this product.


To recover the actual path rather than only its probability, we must also store which predecessor state achieved the maximum. Thus, alongside the Viterbi scores, we maintain a backpointer table:

bi+1(j)=argmaxk(vi(k)tkjej(xi+1))b_{i+1}(j) = \arg\max_k \bigl( v_i(k)\, t_{kj}\, e_j(x_{i+1}) \bigr)

This table records, for each position and state, where the optimal path came from.


Once the last observation has been processed, the score of the best complete path is

maxjvn(j)\max_j v_n(j)

and the final state of the best path is

sn=argmaxjvn(j)s_n^* = \arg\max_j v_n(j)

Starting from this final state, we reconstruct the full path by following the backpointers backward through the table.


We now walk through the kind of example used in the lecture, using a simple two-state HMM with a promoter state PP and a background state BB.

Assume the following initial probabilities:

πP=0.1,πB=0.9\pi_P = 0.1, \qquad \pi_B = 0.9

Transition probabilities:

tPP=0.55,tPB=0.45t_{PP} = 0.55, \quad t_{PB} = 0.45 tBP=0.35,tBB=0.65t_{BP} = 0.35, \quad t_{BB} = 0.65

Emission probabilities:

For the promoter state PP,

eP(A)=0.15,eP(T)=0.12,eP(G)=0.30,eP(C)=0.43e_P(A)=0.15,\quad e_P(T)=0.12,\quad e_P(G)=0.30,\quad e_P(C)=0.43

For the background state BB,

eB(A)=0.30,eB(T)=0.30,eB(G)=0.20,eB(C)=0.20e_B(A)=0.30,\quad e_B(T)=0.30,\quad e_B(G)=0.20,\quad e_B(C)=0.20

Now consider the observed sequence

X=A C C T AX = \text{A C C T A}

We compute the best path step by step.


For the first symbol AA,

v1(P)=πPeP(A)=0.10.15=0.015v_1(P) = \pi_P e_P(A) = 0.1 \cdot 0.15 = 0.015 v1(B)=πBeB(A)=0.90.30=0.27v_1(B) = \pi_B e_B(A) = 0.9 \cdot 0.30 = 0.27

So after observing the first symbol, the best path ending in BB has much higher probability than the best path ending in PP. Intuitively, this means that the model initially explains the first AA more naturally as background than as promoter.


The second symbol is CC.

To compute the best path ending in PP,

v2(P)=max(v1(P)tPPeP(C),v1(B)tBPeP(C))v_2(P) = \max \Bigl( v_1(P)t_{PP}e_P(C), v_1(B)t_{BP}e_P(C) \Bigr)

Substituting the numbers gives

v2(P)=max(0.0150.550.43,0.270.350.43)v_2(P) = \max \Bigl( 0.015 \cdot 0.55 \cdot 0.43, 0.27 \cdot 0.35 \cdot 0.43 \Bigr) =max(0.00355,0.04064)=0.04064= \max(0.00355, 0.04064) = 0.04064

Thus the best predecessor is BB.

Now for the background state,

v2(B)=max(v1(P)tPBeB(C),v1(B)tBBeB(C))v_2(B) = \max \Bigl( v_1(P)t_{PB}e_B(C), v_1(B)t_{BB}e_B(C) \Bigr) v2(B)=max(0.0150.450.20,0.270.650.20)v_2(B) = \max \Bigl( 0.015 \cdot 0.45 \cdot 0.20, 0.27 \cdot 0.65 \cdot 0.20 \Bigr) =max(0.00135,0.03510)=0.03510= \max(0.00135, 0.03510) = 0.03510

Thus the best predecessor is again BB.

At this point the best path to PP is slightly better than the best path to BB, meaning that after seeing ( AC ), the model now favors being in the promoter state.


The third symbol is again CC.

For the promoter state,

v3(P)=max(v2(P)tPPeP(C),v2(B)tBPeP(C))v_3(P) = \max \Bigl( v_2(P)t_{PP}e_P(C), v_2(B)t_{BP}e_P(C) \Bigr) =max(0.040640.550.43,0.035100.350.43)= \max \Bigl( 0.04064 \cdot 0.55 \cdot 0.43, 0.03510 \cdot 0.35 \cdot 0.43 \Bigr) =max(0.00961,0.00528)=0.00961= \max(0.00961, 0.00528) = 0.00961

Thus the best predecessor is PP.

For the background state,

v3(B)=max(v2(P)tPBeB(C),v2(B)tBBeB(C))v_3(B) = \max \Bigl( v_2(P)t_{PB}e_B(C), v_2(B)t_{BB}e_B(C) \Bigr) =max(0.040640.450.20,0.035100.650.20)= \max \Bigl( 0.04064 \cdot 0.45 \cdot 0.20, 0.03510 \cdot 0.65 \cdot 0.20 \Bigr) =max(0.00366,0.00456)=0.00456= \max(0.00366, 0.00456) = 0.00456

Thus the best predecessor is BB.

Now the promoter path is clearly favored.


The fourth symbol is TT.

For the promoter state,

v4(P)=max(v3(P)tPPeP(T),v3(B)tBPeP(T))v_4(P) = \max \Bigl( v_3(P)t_{PP}e_P(T), v_3(B)t_{BP}e_P(T) \Bigr) =max(0.009610.550.12,0.004560.350.12)= \max \Bigl( 0.00961 \cdot 0.55 \cdot 0.12, 0.00456 \cdot 0.35 \cdot 0.12 \Bigr) =max(0.000634,0.000192)=0.000634= \max(0.000634, 0.000192) = 0.000634

Thus the best predecessor is PP.

For the background state,

v4(B)=max(v3(P)tPBeB(T),v3(B)tBBeB(T))v_4(B) = \max \Bigl( v_3(P)t_{PB}e_B(T), v_3(B)t_{BB}e_B(T) \Bigr) =max(0.009610.450.30,0.004560.650.30)= \max \Bigl( 0.00961 \cdot 0.45 \cdot 0.30, 0.00456 \cdot 0.65 \cdot 0.30 \Bigr) =max(0.001297,0.000889)=0.001297= \max(0.001297, 0.000889) = 0.001297

Thus the best predecessor is PP.

Now the best path to BB is better than the best path to PP, indicating a transition back into background.


The fifth symbol is AA.

For the promoter state,

v5(P)=max(v4(P)tPPeP(A),v4(B)tBPeP(A))v_5(P) = \max \Bigl( v_4(P)t_{PP}e_P(A), v_4(B)t_{BP}e_P(A) \Bigr) =max(0.0006340.550.15,0.0012970.350.15)= \max \Bigl( 0.000634 \cdot 0.55 \cdot 0.15, 0.001297 \cdot 0.35 \cdot 0.15 \Bigr) =max(0.0000523,0.0000681)=0.0000681= \max(0.0000523, 0.0000681) = 0.0000681

Thus the best predecessor is BB.

For the background state,

v5(B)=max(v4(P)tPBeB(A),v4(B)tBBeB(A))v_5(B) = \max \Bigl( v_4(P)t_{PB}e_B(A), v_4(B)t_{BB}e_B(A) \Bigr) =max(0.0006340.450.30,0.0012970.650.30)= \max \Bigl( 0.000634 \cdot 0.45 \cdot 0.30, 0.001297 \cdot 0.65 \cdot 0.30 \Bigr) =max(0.0000856,0.0002529)=0.0002529= \max(0.0000856, 0.0002529) = 0.0002529

Thus the best predecessor is BB.

The overall best final state is therefore BB, because

v5(B)>v5(P)v_5(B) > v_5(P)

We now reconstruct the best path by following the backpointers backward.

From the calculations above, the best final state is BB. The predecessor choices were:

  • at position 5, BBB \leftarrow B
  • at position 4, BPB \leftarrow P
  • at position 3, PPP \leftarrow P
  • at position 2, PBP \leftarrow B
  • at position 1, start in BB

Thus the most likely hidden path is

BPPBBB \quad P \quad P \quad B \quad B

This is exactly the kind of output we want from a decoding algorithm. The observed sequence ACCTAACCTA is explained as starting in background, passing through a short promoter-like region, and returning to background.


This example illustrates several important points.

First, the most likely state at a given position does not depend only on the symbol observed at that position. It also depends on the probability of reaching that state from the preceding path. This is why the Viterbi algorithm cannot be replaced by a simple position-wise classification rule.

Second, the algorithm balances emission evidence against transition structure. For example, a symbol that is moderately compatible with the promoter state may still be assigned to background if the transition into promoter is unlikely, or vice versa.

Third, the output is not just a score but an interpretable segmentation of the sequence. In biological terms, this segmentation can correspond to regions of different functional character.


It is useful to pause and make the dynamic programming principle explicit.

At each cell of the Viterbi table, we store only the best path ending in that state. Why is it safe to discard all other partial paths ending there?

Because any future continuation depends only on the current state and current score, not on the detailed history of how that state was reached. If one partial path to state PP is worse than another, it can never overtake it later, since both will face the same future transition and emission options.

This is the optimal substructure property that makes Viterbi possible.


Suppose an HMM has NN states and the observed sequence has length LL.

At each position, for each state, we consider all predecessor states. This leads to a time complexity of

O(N2L)O(N^2 L)

This is dramatically better than the exponential complexity of exhaustive enumeration.

The memory cost is typically

O(NL)O(NL)

if the full table and backpointers are stored, which is usually acceptable for moderate state spaces.


The Viterbi algorithm solves the decoding problem by combining three ideas:

  • state-specific emissions
  • state-to-state transitions
  • dynamic programming

Its output is the single most likely hidden path explaining the observed sequence.

For biological sequence analysis, this makes it a powerful tool for:

  • promoter detection
  • gene structure prediction
  • protein secondary structure labeling
  • profile-based sequence classification

In all of these applications, the goal is not merely to score a sequence, but to reconstruct the hidden biological organization underlying it.


  1. Why is exhaustive search over all hidden paths infeasible?
  2. What does the quantity ( v_i(j) ) represent in the Viterbi algorithm?
  3. Why are backpointers needed?
  4. In what sense does the Viterbi algorithm use dynamic programming?
  5. Why can the most likely state at a position depend on earlier positions?