6.3 Structure and Components of Hidden Markov Models
6.3 Structure and Components of Hidden Markov Models
Section titled “6.3 Structure and Components of Hidden Markov Models”In the previous section, we introduced the Markov assumption as a principled way to incorporate dependencies between neighboring positions in a sequence. This already represents a substantial improvement over position-specific models, which treat all positions as independent. However, we also observed a crucial limitation: in a standard Markov chain, the states of the model are directly tied to the observed symbols. As a consequence, such models cannot distinguish between different underlying biological contexts that may produce identical observations.
To overcome this limitation, we now introduce a more expressive modeling framework: Hidden Markov Models (HMMs). These models separate the observable sequence from the underlying generative process, thereby allowing us to explicitly represent hidden biological structure.
A Generative View of Biological Sequences
Section titled “A Generative View of Biological Sequences”The central idea of a Hidden Markov Model is best understood from a generative perspective. Rather than asking how well a sequence matches a given pattern, we imagine a stochastic process that produces the sequence step by step.
More precisely, we assume that:
- the system evolves through a sequence of internal states
- each state emits observable symbols according to a probability distribution
- the system transitions between states over time
This leads to a two-layered structure:
- A sequence of hidden states, which encode the underlying biological context
- A sequence of observed symbols, which are generated by these states
The key feature of the model is that the hidden states are not directly observable. We only see the emitted sequence, not the process that generated it.
This distinction is essential. It allows us to model situations in which the same observed symbol can arise from fundamentally different biological mechanisms.
Revisiting the Motif Discovery Problem
Section titled “Revisiting the Motif Discovery Problem”To make this abstraction concrete, let us return to the motif discovery problem that motivated this chapter.
Previously, we modeled motifs using position-specific probability matrices. In doing so, we implicitly assumed that the sequence is homogeneous, except for local variations captured by the motif model. However, we now adopt a different perspective.
We assume that the sequence is generated by switching between two distinct regimes:
- a motif-generating regime, corresponding to biologically meaningful patterns such as promoter regions
- a background regime, corresponding to non-specific sequence
In the HMM framework, these regimes are represented as hidden states. For simplicity, we denote them as:
where represents the promoter (motif) state and represents the background state.
As the model generates a sequence, it moves between these states. When it is in the promoter state, it emits nucleotides according to a motif-specific distribution. When it is in the background state, it emits nucleotides according to background frequencies.
In this way, the model captures not only the composition of motifs, but also their location and extent within the sequence.
Formal Components of a Hidden Markov Model
Section titled “Formal Components of a Hidden Markov Model”We now describe the components of an HMM more formally. Each component plays a distinct role in defining the generative process.
Hidden states
Section titled “Hidden states”An HMM consists of a finite set of hidden states:
These states represent different modes of the underlying process. In biological applications, they often correspond to functional or structural regions, such as promoter sequences, coding regions, or background sequence.
In the simplest case discussed here, we consider two states:
However, in realistic applications, the number of states can be significantly larger, allowing for much richer models of biological structure.
Initial probabilities
Section titled “Initial probabilities”Before the sequence generation begins, the model must choose an initial state. This is described by a probability distribution:
which specifies the probability of starting in state .
Conceptually, this corresponds to a prior belief about where sequences typically begin. For example, in genomic sequences, it is often reasonable to assume that the background state is more likely at the start.
Transition probabilities
Section titled “Transition probabilities”Once the process has started, it evolves by transitioning between states. These transitions are governed by a matrix of probabilities:
Each entry describes the probability of moving from state to state in the next step.
In the promoter-background example, these transitions have a natural interpretation:
- ( P(P \rightarrow P) ): probability of remaining within a motif
- ( P(P \rightarrow B) ): probability of leaving a motif
- ( P(B \rightarrow B) ): probability of remaining in background
- ( P(B \rightarrow P) ): probability of entering a motif
These probabilities encode the structure and persistence of different regions. For instance, if motifs tend to occur in short segments, the probability of staying in the promoter state will be relatively low.
Emission probabilities
Section titled “Emission probabilities”Each state is associated with a probability distribution over observable symbols. These are the emission probabilities:
They specify how likely a state is to emit a particular symbol.
In the case of DNA sequences:
- the promoter state may strongly favor certain nucleotides at specific positions
- the background state may reflect overall genomic nucleotide frequencies
Thus, emission probabilities connect the hidden states to the observed data.
Two Complementary Representations
Section titled “Two Complementary Representations”Hidden Markov Models can be represented in two equivalent but complementary ways.
Graphical representation
Section titled “Graphical representation”The most intuitive representation is a state transition graph. In this representation:
- nodes correspond to hidden states
- directed edges correspond to transitions between states
- each edge is labeled with a transition probability
- each node is associated with emission probabilities
This representation closely mirrors the generative process and is particularly useful for conceptual understanding and model design.
Matrix representation
Section titled “Matrix representation”For computational purposes, it is often more convenient to represent the model using matrices:
- a transition matrix ( T = (t_{ij}) )
- an emission matrix ( E = (e_i(x)) )
These matrices provide a compact representation of the model and are directly used in algorithmic implementations.
Why the States Are “Hidden”
Section titled “Why the States Are “Hidden””The defining feature of a Hidden Markov Model is that the sequence of states is not observable.
Given an observed sequence, such as:
there may exist many different sequences of hidden states that could have generated it. For example:
Each of these state sequences corresponds to a different interpretation of the data. Some may represent short motif occurrences, others longer ones, and still others no motif at all.
Importantly, all of these interpretations may have non-zero probability under the model.
This ambiguity is not a flaw but a fundamental feature of the model. It reflects the inherent uncertainty in biological data and motivates the development of algorithms that can reason about multiple possible explanations simultaneously.
A Step-by-Step Generative Example
Section titled “A Step-by-Step Generative Example”To better understand how an HMM generates a sequence, consider the following process:
- The model selects an initial state according to
- It emits a symbol according to the emission distribution of that state
- It transitions to a new state according to the transition probabilities
- The process repeats
Suppose we observe the sequence:
At the first position, the probability of observing depends on both:
- the probability of starting in each state
- the emission probability of from those states
At the second position, the probability of observing depends on:
- the probability of transitioning from the previous state
- the emission probability of in the new state
This pattern continues along the sequence. At each step, the probability of the observation is influenced by both the current state and the transition from the previous state.
In this way, the model naturally combines local dependencies (via transitions) and state-specific emissions.
Biological Interpretation
Section titled “Biological Interpretation”One of the strengths of Hidden Markov Models is that their components admit direct biological interpretations.
- Hidden states correspond to biological contexts or functional regions
- Transitions describe how these regions are organized along the sequence
- Emissions describe the observable characteristics of each region
In the motif example:
- entering the promoter state corresponds to encountering a motif
- remaining in that state corresponds to continuing within the motif
- transitioning back to background corresponds to leaving the motif
This makes HMMs not only powerful computational tools, but also meaningful representations of biological processes.
What the Model Enables
Section titled “What the Model Enables”By introducing hidden states and structured transitions, Hidden Markov Models significantly expand our modeling capabilities.
They allow us to:
- assign biological meaning to different regions of a sequence
- infer the most likely underlying structure of a sequence
- evaluate how well a sequence fits a given model
- learn model parameters from data
These capabilities correspond to three fundamental computational problems, which we will examine in detail in the next section.
Self-Check Questions
Section titled “Self-Check Questions”- In what sense does an HMM provide a generative model of biological sequences?
- How do hidden states differ conceptually from observed symbols?
- What biological interpretations can be assigned to transition probabilities?
- Why is it possible for multiple hidden state sequences to explain the same observed sequence?
- How does separating hidden states from observations increase modeling flexibility?