Skip to content

6.7 Parameter Estimation with the Baum–Welch Algorithm

6.7 Parameter Estimation with the Baum–Welch Algorithm

Section titled “6.7 Parameter Estimation with the Baum–Welch Algorithm”

So far, we have assumed that the parameters of our Hidden Markov Model are known. In particular, we assumed that we are given:

  • the transition probabilities
  • the emission probabilities
  • the initial state distribution

In practice, however, this is rarely the case. Typically, we are given only a set of observed sequences, and the goal is to learn the model parameters from data.

This leads to the third fundamental problem of Hidden Markov Models:

How can we estimate the parameters of an HMM when the hidden states are not observed?


If the hidden state sequence were known, parameter estimation would be straightforward.

  • Transition probabilities could be estimated by counting transitions between states
  • Emission probabilities could be estimated by counting how often each symbol is emitted from each state
  • Initial probabilities could be estimated from the starting states

However, in an HMM, the state sequence is hidden. We only observe the emitted symbols. This means that we do not know:

  • which state emitted which symbol
  • how often transitions between states occurred

As a result, we cannot directly count frequencies.


6.7.2 A First Idea: Guess the Hidden States

Section titled “6.7.2 A First Idea: Guess the Hidden States”

A natural starting point is to approximate the hidden state sequence.

Suppose we somehow assign a state label to each position in the observed sequence. For example:

Observed: ACCTA\text{Observed: } A \quad C \quad C \quad T \quad A Hidden (guess): BPPBB\text{Hidden (guess): } B \quad P \quad P \quad B \quad B

Given such a labeling, we can immediately estimate the model parameters:

  • count how often each symbol appears in each state
  • count how often each transition occurs
  • normalize the counts to obtain probabilities

This idea is simple and appealing, but it raises an obvious question:

Where does the labeling come from?


The Baum–Welch algorithm answers this question by turning the problem into an iterative procedure.

The idea is to alternate between:

  1. Inferring a hidden structure given the current model
  2. Updating the model parameters based on this inferred structure

More concretely:

  1. Start with an initial guess for the model parameters
  2. Use the model to infer hidden state sequences
  3. Use these inferred sequences to update the parameters
  4. Repeat until convergence

This procedure gradually improves the model.


Box 6.2 — Learning as “Guess → Count → Update”

Section titled “Box 6.2 — Learning as “Guess → Count → Update””

A useful way to understand Baum–Welch is to think of it as a repeated cycle of guessing and refinement.


Start with an initial model (often random). Use it to assign states to the observed sequence, for example using the Viterbi algorithm.

Observed: ACCTA\text{Observed: } A \quad C \quad C \quad T \quad A Hidden (guess): BPPBB\text{Hidden (guess): } B \quad P \quad P \quad B \quad B

From this guessed labeling, compute:

  • how often each symbol appears in each state
  • how often transitions between states occur

For example:

  • count how many times AA is emitted from BB
  • count how many times we transition from PP to BB

Convert counts into probabilities:

  • normalize emission counts to obtain emission probabilities
  • normalize transition counts to obtain transition probabilities

Use the updated model to compute a new hidden state sequence, and repeat the process.


Each iteration improves the model’s ability to explain the observed data.


6.7.4 From Hard Assignments to Soft Assignments

Section titled “6.7.4 From Hard Assignments to Soft Assignments”

The procedure described above is intuitive but slightly simplified. It corresponds to making a hard assignment of each position to a single state.

However, as we saw earlier, multiple hidden paths may explain the same observed sequence. Instead of committing to a single path, it is often better to consider all possible paths simultaneously, weighted by their probabilities.

This leads to the idea of soft assignments.

Instead of assigning a position entirely to one state, we compute:

  • the probability that position ii was generated by state sjs_j
  • the probability that a transition ( s_i \rightarrow s_j ) occurred

These quantities can be computed efficiently using the forward and backward algorithms.


The Baum–Welch algorithm implements this idea in a principled way. It is a special case of the Expectation–Maximization (EM) algorithm.


Using the current model, compute:

  • expected counts of emissions
  • expected counts of transitions

This is done by combining forward and backward probabilities to estimate how likely each state and transition is at each position.


Update the model parameters:

  • normalize expected emission counts → emission probabilities
  • normalize expected transition counts → transition probabilities
  • update initial probabilities

Repeat the E-step and M-step until the parameters converge.


Baum–Welch can be interpreted as a refinement of the intuitive “guess → count → update” procedure:

  • Instead of guessing a single hidden path, we consider all possible paths
  • Instead of counting occurrences directly, we compute expected counts
  • Instead of updating once, we iterate until convergence

This makes the algorithm both more robust and more principled.


An important property of the Baum–Welch algorithm is that it guarantees improvement of the likelihood at each iteration. However, it does not guarantee convergence to the global optimum.

In practice, this means:

  • the result depends on the initial parameter values
  • the algorithm may converge to a local optimum

To address this, it is common to:

  • run the algorithm multiple times with different initializations
  • compare the resulting models

From a biological perspective, Baum–Welch allows us to:

  • learn motif models from sequence data
  • estimate transition patterns between functional regions
  • refine models iteratively as more data becomes available

In particular, it provides a principled way to move from:

  • observed sequences only to
  • fully parameterized probabilistic models

The Baum–Welch algorithm solves the learning problem for Hidden Markov Models by combining:

  • probabilistic inference (forward and backward algorithms)
  • parameter estimation (normalization of expected counts)
  • iterative refinement (EM framework)

It transforms HMMs from fixed models into learnable models, making them applicable to real biological data.


  1. Why is parameter estimation difficult when hidden states are not observed?
  2. What is the intuition behind the “guess → count → update” procedure?
  3. What is the difference between hard and soft assignments of states?
  4. How do the forward and backward algorithms contribute to parameter estimation?
  5. Why can Baum–Welch converge to a local optimum?