Extension to 6.1 — Quantitative Limitations of Position-Specific Models

When Independence Creates Artificial Patterns

A central assumption of the position probability matrix is that positions are statistically independent. While this simplifies computation, it can lead to systematically incorrect conclusions.

To see this, consider a simplified motif over amino acids with strong conservation at certain positions. Suppose that in an alignment we observe the following combinations at two positions:

- $R$ is always followed by $D$
- $Q$ is always followed by $H$

That is, the observed pairs are:

$$(R, D), \quad (Q, H)$$

Importantly, the combinations

$$(R, H), \quad (Q, D)$$

never occur in the data.

Now consider how a position probability matrix represents this situation. It only captures marginal frequencies at each position:

At position 1:
- $P(R) = 0.6$, $P(Q) = 0.4$
At position 2:
- $P(D) = 0.6$, $P(H) = 0.4$

Under the independence assumption, the model assigns probabilities to pairs by multiplication:

$$P(R, H) = P(R) \cdot P(H) = 0.6 \cdot 0.4 = 0.24$$

$$P(Q, D) = P(Q) \cdot P(D) = 0.4 \cdot 0.6 = 0.24$$

However, neither of these combinations ever occurs in the data.

At the same time, the valid combinations receive:

$$P(R, D) = 0.6 \cdot 0.6 = 0.36, \quad P(Q, H) = 0.4 \cdot 0.4 = 0.16$$

This leads to a fundamental inconsistency:

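The model assigns $0.24 + 0.24 = 0.48$ of its probability mass to biologically impossible combinations, and correspondingly underestimates the combinations that do occur ($0.36$ instead of the observed $0.6$, and $0.16$ instead of $0.4$).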

This is not merely a biological inconvenience. It is a structural limitation of the model: by ignoring dependencies, it introduces spurious sequence patterns that can mislead downstream analyses.
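The arithmetic above can be reproduced in a few lines. The following is a minimal sketch, assuming the observed pair frequencies are 0.6 for (R, D) and 0.4 for (Q, H), which is what the marginals in the example imply. The names `observed`, `marginals`, and `independent_joint` are illustrative and not part of any library.

```python
from itertools import product

# Assumed observed pair frequencies: R is always followed by D,
# Q is always followed by H (0.6 / 0.4 split implied by the marginals).
observed = {("R", "D"): 0.6, ("Q", "H"): 0.4}


def marginals(joint):
    """Collapse a joint distribution over pairs into per-position marginals."""
    pos1, pos2 = {}, {}
    for (a, b), p in joint.items():
        pos1[a] = pos1.get(a, 0.0) + p
        pos2[b] = pos2.get(b, 0.0) + p
    return pos1, pos2


def independent_joint(pos1, pos2):
    """Joint distribution implied by multiplying the two marginals."""
    return {
        (a, b): pa * pb
        for (a, pa), (b, pb) in product(pos1.items(), pos2.items())
    }


pos1, pos2 = marginals(observed)
model = independent_joint(pos1, pos2)

# Probability mass the model places on pairs that never occur in the data.
spurious_mass = sum(p for pair, p in model.items() if pair not in observed)

for pair in sorted(model):
    print(pair, f"model={model[pair]:.2f}", f"observed={observed.get(pair, 0.0):.2f}")
print(f"mass on unobserved pairs: {spurious_mass:.2f}")
```

The marginals recovered from the data are exactly those stored in the matrix, so the gap between `model` and `observed` comes entirely from the independence assumption and not from any loss of precision in the per-position frequencies.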

Insertions and the Collapse of Likelihood

A second limitation becomes apparent when considering insertions.

Suppose we have a motif model of length five and a sequence that matches the motif well. As shown earlier, such a sequence may achieve a relatively high likelihood under the model.

Now consider a slightly modified sequence in which a single additional symbol is inserted:

$$\text{Original: } S = \text{WEIRD}$$

$$\text{Modified: } S' = \text{WETIRD}$$

Even if the inserted symbol is biologically plausible, the model has no mechanism to accommodate it. Instead, it is forced to evaluate the sequence against fixed positions.

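This leads to a drastic effect. A model of length five has only five positions, so it can score at most five consecutive symbols of $S'$. Writing $P_i$ for the distribution at position $i$ and aligning the model with the start of $S'$, the likelihood becomes:

$$P(S' \mid M) = P_1(W) \cdot P_2(E) \cdot P_3(T) \cdot P_4(I) \cdot P_5(R)$$

The inserted T is scored where the model expects I, the I and R behind it are each shifted one position to the right, and the final D falls outside the model.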

If the inserted symbol has a low probability at its assigned position, it effectively penalizes the entire sequence. Because probabilities are multiplied, even a single small factor can dominate the result.

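For example, if one position contributes a probability of $0.01$ where a matching symbol would contribute a value close to $1$, that single factor lowers the likelihood by about two orders of magnitude. Several mismatched positions compound the effect:

$$P(S' \mid M) \ll P(S \mid M)$$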

This leads to a critical issue:

A biologically valid motif instance with a small insertion may appear far less likely than an unrelated sequence without insertions.

From a modeling perspective, this is unacceptable. Insertions and deletions are common in biological sequences, and any realistic model must be able to account for them.
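The size of the drop can be illustrated with a toy matrix. The sketch below assumes a hypothetical length-five model in which each position gives probability 0.8 to its consensus symbol from WEIRD and spreads the remaining 0.2 evenly over the other 19 amino acids. These numbers are invented for illustration and are not taken from the earlier example. Because the model has exactly five positions, the six-symbol sequence is scored one five-symbol window at a time.

```python
import math

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
CONSENSUS = "WEIRD"

# Hypothetical parameters: 0.8 for the consensus symbol at each position,
# the remaining 0.2 shared equally among the other 19 amino acids.
P_MATCH = 0.8
P_OTHER = (1.0 - P_MATCH) / (len(AMINO_ACIDS) - 1)

# One probability column per motif position.
ppm = [
    {aa: (P_MATCH if aa == c else P_OTHER) for aa in AMINO_ACIDS}
    for c in CONSENSUS
]


def window_likelihood(window, ppm):
    """Product of per-position probabilities for a window of model length."""
    if len(window) != len(ppm):
        raise ValueError("window length must equal model length")
    p = 1.0
    for symbol, column in zip(window, ppm):
        p *= column[symbol]
    return p


def window_scores(seq, ppm):
    """Likelihood of every model-length window of seq."""
    k = len(ppm)
    return {
        seq[i:i + k]: window_likelihood(seq[i:i + k], ppm)
        for i in range(len(seq) - k + 1)
    }


original = window_scores("WEIRD", ppm)
modified = window_scores("WETIRD", ppm)

best_original = max(original.values())
best_modified = max(modified.values())

for window, p in modified.items():
    print(f"{window}: {p:.3e}")
print(f"WEIRD: {best_original:.3e}")
print(f"drop: {math.log10(best_original / best_modified):.2f} orders of magnitude")
```

Under these assumed parameters, neither window of WETIRD lines up with the motif: WETIR mismatches at three positions and ETIRD at two, so even the better window scores almost four orders of magnitude below the original sequence.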

Interpretation: Why These Failures Matter

These two examples highlight complementary weaknesses:

- Independence assumption → introduces combinations that do not exist
- Fixed-length assumption → penalizes valid sequences with structural variation

Together, they reveal a deeper issue:

Position-specific models do not describe how sequences are generated. They only describe what individual positions look like.

To overcome these limitations, we need a model that:

- captures dependencies across positions
- allows flexible sequence structure
- explicitly represents how sequences evolve along their length

This motivates the transition to Hidden Markov Models, which we will develop in the following sections.

(Optional) Self-Check Extension

- Why does the independence assumption lead to non-existent sequence combinations receiving non-zero probability?
- How does a single low-probability position influence the total likelihood of a sequence?
- Why are insertions particularly problematic for fixed-length models?
- In what sense do these failures indicate that the model is not generative?