6.3 Structure and Components of Hidden Markov Models

In the previous section, we introduced the Markov assumption as a principled way to incorporate dependencies between neighboring positions in a sequence. This already represents a substantial improvement over position-specific models, which treat all positions as independent. However, we also observed a crucial limitation: in a standard Markov chain, the states of the model are directly tied to the observed symbols. As a consequence, such models cannot distinguish between different underlying biological contexts that may produce identical observations.

To overcome this limitation, we now introduce a more expressive modeling framework: Hidden Markov Models (HMMs). These models separate the observable sequence from the underlying generative process, thereby allowing us to explicitly represent hidden biological structure.

A Generative View of Biological Sequences

The central idea of a Hidden Markov Model is best understood from a generative perspective. Rather than asking how well a sequence matches a given pattern, we imagine a stochastic process that produces the sequence step by step.

More precisely, we assume that:

the system evolves through a sequence of internal states
each state emits observable symbols according to a probability distribution
the system transitions between states over time

This leads to a two-layered structure:

A sequence of hidden states, which encode the underlying biological context
A sequence of observed symbols, which are generated by these states

The key feature of the model is that the hidden states are not directly observable. We only see the emitted sequence, not the process that generated it.

This distinction is essential. It allows us to model situations in which the same observed symbol can arise from fundamentally different biological mechanisms.

Revisiting the Motif Discovery Problem

To make this abstraction concrete, let us return to the motif discovery problem that motivated this chapter.

Previously, we modeled motifs using position-specific probability matrices. In doing so, we implicitly assumed that the sequence is homogeneous, except for local variations captured by the motif model. However, we now adopt a different perspective.

We assume that the sequence is generated by switching between two distinct regimes:

a motif-generating regime, corresponding to biologically meaningful patterns such as promoter regions
a background regime, corresponding to non-specific sequence

In the HMM framework, these regimes are represented as hidden states. For simplicity, we denote them as:

S = {P, B}

where $P$ represents the promoter (motif) state and $B$ represents the background state.

As the model generates a sequence, it moves between these states. When it is in the promoter state, it emits nucleotides according to a motif-specific distribution. When it is in the background state, it emits nucleotides according to background frequencies.

In this way, the model captures not only the composition of motifs, but also their location and extent within the sequence.

Formal Components of a Hidden Markov Model

We now describe the components of an HMM more formally. Each component plays a distinct role in defining the generative process.

Hidden states

An HMM consists of a finite set of hidden states:

S = {s_1, s_2, \dots, s_N}

These states represent different modes of the underlying process. In biological applications, they often correspond to functional or structural regions, such as promoter sequences, coding regions, or background sequence.

In the simplest case discussed here, we consider two states:

S = {P, B}

However, in realistic applications, the number of states can be significantly larger, allowing for much richer models of biological structure.

Initial probabilities

Before the sequence generation begins, the model must choose an initial state. This is described by a probability distribution:

\pi_i = P(S_1 = s_i)

which specifies the probability of starting in state $s_i$ .

Conceptually, this corresponds to a prior belief about where sequences typically begin. For example, in genomic sequences, it is often reasonable to assume that the background state is more likely at the start.

Transition probabilities

Once the process has started, it evolves by transitioning between states. These transitions are governed by a matrix of probabilities:

t_{ij} = P(S_{n+1} = s_j \mid S_n = s_i)

Each entry $t_{ij}$ describes the probability of moving from state $s_i$ to state $s_j$ in the next step.

In the promoter-background example, these transitions have a natural interpretation:

( P(P \rightarrow P) ): probability of remaining within a motif
( P(P \rightarrow B) ): probability of leaving a motif
( P(B \rightarrow B) ): probability of remaining in background
( P(B \rightarrow P) ): probability of entering a motif

These probabilities encode the structure and persistence of different regions. For instance, if motifs tend to occur in short segments, the probability of staying in the promoter state will be relatively low.

Emission probabilities

Each state is associated with a probability distribution over observable symbols. These are the emission probabilities:

e_i(x) = P(X_n = x \mid S_n = s_i)

They specify how likely a state is to emit a particular symbol.

In the case of DNA sequences:

the promoter state may strongly favor certain nucleotides at specific positions
the background state may reflect overall genomic nucleotide frequencies

Thus, emission probabilities connect the hidden states to the observed data.

Two Complementary Representations

Hidden Markov Models can be represented in two equivalent but complementary ways.

Graphical representation

The most intuitive representation is a state transition graph. In this representation:

nodes correspond to hidden states
directed edges correspond to transitions between states
each edge is labeled with a transition probability
each node is associated with emission probabilities

This representation closely mirrors the generative process and is particularly useful for conceptual understanding and model design.

Matrix representation

For computational purposes, it is often more convenient to represent the model using matrices:

a transition matrix ( T = (t_{ij}) )
an emission matrix ( E = (e_i(x)) )

These matrices provide a compact representation of the model and are directly used in algorithmic implementations.

Why the States Are “Hidden”

The defining feature of a Hidden Markov Model is that the sequence of states is not observable.

Given an observed sequence, such as:

X = \text{ACCTA}

there may exist many different sequences of hidden states that could have generated it. For example:

(B, B, P, P, B), \quad (B, P, P, B, B), \quad (P, P, P, B, B)

Each of these state sequences corresponds to a different interpretation of the data. Some may represent short motif occurrences, others longer ones, and still others no motif at all.

Importantly, all of these interpretations may have non-zero probability under the model.

This ambiguity is not a flaw but a fundamental feature of the model. It reflects the inherent uncertainty in biological data and motivates the development of algorithms that can reason about multiple possible explanations simultaneously.

A Step-by-Step Generative Example

To better understand how an HMM generates a sequence, consider the following process:

The model selects an initial state according to $\pi$
It emits a symbol according to the emission distribution of that state
It transitions to a new state according to the transition probabilities
The process repeats

Suppose we observe the sequence:

X = \text{ACCTA}

At the first position, the probability of observing $A$ depends on both:

the probability of starting in each state
the emission probability of $A$ from those states

At the second position, the probability of observing $C$ depends on:

the probability of transitioning from the previous state
the emission probability of $C$ in the new state

This pattern continues along the sequence. At each step, the probability of the observation is influenced by both the current state and the transition from the previous state.

In this way, the model naturally combines local dependencies (via transitions) and state-specific emissions.

Biological Interpretation

One of the strengths of Hidden Markov Models is that their components admit direct biological interpretations.

Hidden states correspond to biological contexts or functional regions
Transitions describe how these regions are organized along the sequence
Emissions describe the observable characteristics of each region

In the motif example:

entering the promoter state corresponds to encountering a motif
remaining in that state corresponds to continuing within the motif
transitioning back to background corresponds to leaving the motif

This makes HMMs not only powerful computational tools, but also meaningful representations of biological processes.

What the Model Enables

By introducing hidden states and structured transitions, Hidden Markov Models significantly expand our modeling capabilities.

They allow us to:

assign biological meaning to different regions of a sequence
infer the most likely underlying structure of a sequence
evaluate how well a sequence fits a given model
learn model parameters from data

These capabilities correspond to three fundamental computational problems, which we will examine in detail in the next section.

Self-Check Questions

In what sense does an HMM provide a generative model of biological sequences?
How do hidden states differ conceptually from observed symbols?
What biological interpretations can be assigned to transition probabilities?
Why is it possible for multiple hidden state sequences to explain the same observed sequence?
How does separating hidden states from observations increase modeling flexibility?