6.2 The Markov Property and Sequence Dependencies
6.2 The Markov Property and Sequence Dependencies
Section titled “6.2 The Markov Property and Sequence Dependencies”In the previous section, we identified two fundamental limitations of position-specific models: the assumption of independence between positions and the inability to handle structural variation such as insertions and deletions. Both limitations arise from the same underlying issue. The model does not describe how sequences are generated along their length. It treats each position in isolation.
To overcome this, we now move toward models that explicitly capture dependencies between neighboring positions. The key idea is to introduce a notion of sequential structure into our probabilistic framework.
From Full Dependency to Tractable Models
Section titled “From Full Dependency to Tractable Models”A natural starting point is to consider the most general case. Suppose we want to model a biological sequence . In full generality, the probability of observing a symbol at position could depend on the entire preceding sequence:
From a biological perspective, this is not unreasonable. In proteins, for example, interactions between amino acids can span long distances along the sequence due to folding. Similarly, regulatory elements in DNA may exhibit complex dependencies.
However, this full model is computationally intractable. The number of possible dependencies grows exponentially with sequence length. Estimating and evaluating such a model would require an impractical amount of data and computation.
We therefore introduce a simplifying assumption.
The Markov Assumption
Section titled “The Markov Assumption”The central idea is that the dependence of a position on its full history can be approximated by considering only a limited context. The simplest and most widely used version is the first-order Markov assumption:
In words:
The probability of observing the next symbol depends only on the immediately preceding symbol.
This assumption dramatically reduces the complexity of the model while still capturing local dependencies between positions.
Processes that satisfy this property are called Markov chains.
Interpretation in Biological Sequences
Section titled “Interpretation in Biological Sequences”The Markov assumption can be interpreted as a compromise between biological realism and computational tractability.
- It acknowledges that sequence positions are not independent.
- It restricts dependencies to a local neighborhood.
When scanning along a sequence, this assumption is often reasonable. Adjacent nucleotides or amino acids frequently exhibit correlations due to biochemical constraints, codon structure, or evolutionary processes.
At the same time, the model remains computationally manageable. Instead of modeling all possible histories, we only need to consider transitions between neighboring symbols.
A Simple Example
Section titled “A Simple Example”Consider a DNA sequence and a simple first-order Markov model defined by transition probabilities between nucleotides. For instance:
- ( P(A \mid G) = 0.3 )
- ( P(G \mid A) = 0.2 )
- ( P(T \mid T) = 0.4 ), and so on
Given such a model, the probability of a sequence ( X = (x_1, x_2, \dots, x_n) ) can be written as:
This resembles the likelihood computation we encountered earlier, but with an important difference:
The probability at each position now depends on the previous symbol, not just on its position in the sequence.
Comparison to Position-Specific Models
Section titled “Comparison to Position-Specific Models”It is instructive to compare this formulation with the position probability matrix introduced earlier.
- In a PPM, each position has its own distribution, independent of other positions.
- In a Markov model, the distribution at a position depends on the previous symbol, but not on the absolute position in the sequence.
This leads to different modeling capabilities:
| Model | Captures positional specificity | Captures dependencies | Handles variable length |
|---|---|---|---|
| PPM | Yes | No | No |
| Markov chain | No | Yes (local) | Yes |
Each model captures a different aspect of biological sequences, but neither alone is sufficient for realistic motif modeling.
Limitations of Observable Markov Models
Section titled “Limitations of Observable Markov Models”While Markov chains introduce dependencies, they still fall short in an important way. The states of the model are directly tied to the observed symbols. That is, the model describes transitions between nucleotides or amino acids themselves.
This creates a limitation:
- We cannot distinguish between different contexts that generate the same symbol.
- We cannot model latent biological processes such as “being in a promoter region” versus “being in background sequence.”
In other words, the model lacks an explicit representation of hidden structure.
Towards Hidden States
Section titled “Towards Hidden States”To address this limitation, we introduce an additional layer to the model.
Instead of assuming that the observed symbols themselves form a Markov chain, we assume that:
- there exists an underlying sequence of hidden states
- these states follow a Markov process
- each state emits observable symbols according to a probability distribution
This leads to a two-level model:
- A hidden sequence of states that evolves according to the Markov property
- An observed sequence generated by these states
This separation allows us to model situations where the same observed symbol can arise from different underlying biological contexts.
Conceptual Bridge to Hidden Markov Models
Section titled “Conceptual Bridge to Hidden Markov Models”We are now in a position to combine the ideas developed so far:
- From motif models, we retain the idea of probabilistic emission of symbols
- From Markov chains, we adopt the idea of state transitions with local dependencies
By introducing hidden states that control the emission of symbols, we obtain a model that:
- captures dependencies between positions
- distinguishes between different biological regimes
- allows sequences to be generated by switching between these regimes
This is precisely the idea behind Hidden Markov Models.
In the next section, we will formalize this model and describe its components in detail.
Self-Check Questions
Section titled “Self-Check Questions”- Why is modeling full dependencies between all positions in a sequence infeasible?
- What is the Markov assumption, and how does it simplify sequence modeling?
- How does a Markov chain differ from a position probability matrix?
- Why are observable Markov models insufficient for modeling biological sequence structure?
- What conceptual step leads from Markov chains to Hidden Markov Models?