5.5 From Motifs to Latent Variables

Learning Objectives

After reading this section, you should be able to:

distinguish between observed data, hidden variables, and model parameters
reformulate motif discovery as a latent-variable problem
understand the generative perspective of sequence modeling
explain why motif discovery can be viewed as parameter estimation with incomplete data
recognize why iterative approaches become necessary

In the previous section, we formulated the motif finding problem and identified its central difficulty: both the structure of the motif and its positions within the sequences are unknown. This circular dependency prevents us from applying standard methods based on direct counting or direct comparison.

To make further progress, it is useful to reinterpret the problem in a more general and abstract way. Rather than focusing exclusively on sequences and motifs, we now describe the situation in terms of three components: observed data, hidden variables, and model parameters.

Observed Data, Hidden Structure, and Parameters

The observed data consist of the sequences themselves. These are directly available and form the basis of our analysis. However, the locations at which the motif occurs are not observed. For each sequence, there is an unknown starting position that determines which subsequence corresponds to the motif.

These unknown positions can be represented as hidden variables. They encode the structure that we would ideally like to observe but cannot access directly. In parallel, the motif itself is represented by a probabilistic model—specifically, a position probability matrix. The entries of this matrix define the parameters of the model.

This leads to a clear separation:

the sequences are observed
the motif positions are hidden
the motif model is parameterized and unknown

A Generative View of Sequences

This formulation suggests a useful way of thinking about the problem: instead of asking how to detect motifs, we ask how the sequences could have been generated.

Imagine a process in which, for each sequence, a position is selected at which a motif is inserted. At that position, symbols are generated according to the motif model, while the remainder of the sequence is generated according to a background distribution. In this view, the hidden variables correspond to the unknown choices of motif positions, and the observed sequences are the result of this generative process.

This perspective shifts the problem fundamentally. We are no longer searching for patterns directly. Instead, we are trying to learn a model that explains how the data were generated.

The Role of Hidden Variables

If the hidden variables were known—that is, if we knew the exact motif positions—the problem would become simple. We could extract the corresponding subsequences, align them, and estimate the motif model using frequency counts, as described in Section 5.2.

Conversely, if the parameters of the motif model were known, we could evaluate all candidate positions and identify those that best match the model. In other words, either side of the problem becomes tractable once the other is fixed.

The difficulty arises because neither is known. The hidden variables and the parameters depend on each other, creating the circular dependency that we encountered earlier.

Toward an Iterative Solution

This structure suggests a natural strategy. Instead of attempting to solve both aspects of the problem simultaneously, we can alternate between them.

We may begin with an initial guess of the model parameters. Using this model, we estimate which parts of the sequences are likely to correspond to motif instances. Once we have these estimates, we update the model parameters based on the inferred motif positions. This updated model can then be used to refine the position estimates, and the process can be repeated.

This idea introduces an important shift in perspective. Rather than seeking a direct solution, we adopt an iterative approach in which the model and the hidden structure are gradually refined. Each step uses the current estimate of one component to improve the other.

At this point, the structure of the problem closely matches a general class of statistical inference problems in which observations are incomplete and must be explained by models with hidden variables. The motif discovery problem is thus not an isolated case, but an instance of a broader framework.

In the next section, we will introduce a general method designed precisely for this setting. The Expectation Maximization algorithm provides a principled way to alternate between estimating hidden variables and updating model parameters, thereby resolving the circular dependency that characterizes motif discovery.

Self-Check Questions

What are the observed variables, hidden variables, and parameters in the motif discovery problem?
How can motif discovery be interpreted from a generative perspective?
Why does knowing the hidden variables simplify parameter estimation?
Why does knowing the parameters simplify the inference of hidden variables?
Why does this mutual dependence suggest an iterative solution?