This proposal outlines a method to augment an autoregressive Transformer (e.g., GPT) with multi-horizon probabilistic priors derived from external Markov models or a similar statistical basis system. Instead of modifying the architecture, the method uses auxiliary layer-wise losses to align each layer’s internal representation with a synthetic embedding derived from the Markov transition probabilities.
The idea is to teach the model to use this prior knowledge to anticipate the most likely futures at multiple temporal horizons, localizing that signal to the relevant layers while maintaining compatibility with standard next-token training.
- Embed and stream the dataset incrementally to build multi-horizon Markov models (see the counting sketch after this list).
- Construct several Markov models, each modeling a different step horizon:
  - M1 → 1-step transition
  - M2 → 2-step transition
  - M4 → 4-step transition
- During training, for each position t, obtain Markov-based probability distributions for the future steps.
- Using top-k (with k < 64) on the transition matrix, obtain a truncated distribution for use with the model (see the lookup sketch after this list).
- Convert these probabilities into synthetic embedding vectors via a learned reverse projection through the model's LM head (a projection sketch follows the list).
- https://github.com/sdan/nanoEBM/blob/master/nanoebm/model.py
- Synthesize a system that iteratively performs Newton descent, using tiny recurrent blocks and energy minimization, to approach the target auxiliary embedding for that layer (see the refinement sketch after this list).
- Redesign all mechanisms so that hidden states start outside the valid probability density and learn to move inward toward valid distributions.
- Compute auxiliary losses per layer, aligning each layer's output with its corresponding synthetic embedding target (a loss sketch follows the list).
- Combine these auxiliary losses with the final cross-entropy loss.
This yields an early rough draft of the idea; the sketches below illustrate how its main steps might look.
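First, a minimal sketch of the streaming multi-horizon Markov construction from the opening bullets. It assumes the dataset arrives as a plain iterator of token ids and uses dict-of-dict counts for clarity; the function names are placeholders, and a GPT-scale vocabulary would likely require sparse storage instead.

```python
# Minimal sketch: stream token ids once and accumulate 1-, 2-, and 4-step
# transition counts (M1, M2, M4). Dict-of-dict counts are used for clarity;
# a GPT-scale vocabulary would likely need sparse matrices instead.
from collections import defaultdict

HORIZONS = (1, 2, 4)

def build_markov_counts(token_stream, horizons=HORIZONS):
    # counts[h][a][b] = how often token b occurred h steps after token a
    counts = {h: defaultdict(lambda: defaultdict(int)) for h in horizons}
    window, max_h = [], max(horizons)
    for tok in token_stream:
        window.append(tok)
        if len(window) > max_h + 1:
            window.pop(0)
        for h in horizons:
            if len(window) > h:
                counts[h][window[-(h + 1)]][window[-1]] += 1
    return counts

def normalize(counts_h):
    # Turn raw counts into transition probabilities P(token at t+h | token at t).
    probs = {}
    for src, row in counts_h.items():
        total = sum(row.values())
        probs[src] = {dst: c / total for dst, c in row.items()}
    return probs
```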
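Next, a possible per-position lookup of the top-k-truncated prior built from the tables above; the uniform fallback for unseen contexts and the particular value of k are assumptions of this sketch, not fixed choices.

```python
# Minimal sketch: look up the Markov prior for the token at position t and keep
# only the top-k entries (any k below the draft's <64 cap). Unseen contexts fall
# back to a uniform prior, which is an assumption of this sketch.
import torch

def markov_topk_distribution(probs_h, token_id, vocab_size, k=32):
    row = probs_h.get(token_id, {})
    if not row:
        return torch.full((vocab_size,), 1.0 / vocab_size)
    top = sorted(row.items(), key=lambda kv: kv[1], reverse=True)[:k]
    dist = torch.zeros(vocab_size)
    for dst, p in top:
        dist[dst] = p
    return dist / dist.sum()  # renormalize after truncation
```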
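One way the learned reverse projection through the LM head could look: the transpose of the LM head turns a probability vector into a probability-weighted mixture of output embeddings, and a small learned layer refines it into the synthetic target. `ReverseProjector` and its shapes are illustrative assumptions, not a fixed design.

```python
# Illustrative take on the "learned reverse projection through the LM head":
# multiplying a probability vector by the LM head's weight matrix yields a
# probability-weighted mixture of output embeddings, and a small learned linear
# layer refines it into the synthetic target. Names and shapes are assumptions.
import torch
import torch.nn as nn

class ReverseProjector(nn.Module):
    def __init__(self, lm_head: nn.Linear, d_model: int):
        super().__init__()
        self.lm_head = lm_head            # shared with the Transformer, kept frozen here
        self.refine = nn.Linear(d_model, d_model)

    def forward(self, markov_probs: torch.Tensor) -> torch.Tensor:
        # markov_probs: (batch, seq, vocab) -> synthetic target: (batch, seq, d_model)
        with torch.no_grad():
            base = markov_probs @ self.lm_head.weight   # weight has shape (vocab, d_model)
        return self.refine(base)
```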
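A rough sketch of the iterative energy-minimizing refinement, in the spirit of the nanoEBM code linked above though not taken from it: a tiny recurrent block shapes each step, and a hidden state initialized away from the target is pulled inward. With the toy quadratic energy used here the Newton step reduces to the plain gradient, so this is a simplified stand-in for the Newton descent named in the bullet; step count, sizes, and the energy itself are assumptions.

```python
# Sketch of the iterative refinement: a tiny recurrent block shapes each update,
# and the hidden state descends a simple quadratic energy toward the layer's
# synthetic target. For this energy the Hessian is the identity, so the Newton
# step coincides with the gradient.
import torch
import torch.nn as nn

class EnergyRefiner(nn.Module):
    def __init__(self, d_model: int, n_steps: int = 3, step_size: float = 0.5):
        super().__init__()
        self.cell = nn.GRUCell(d_model, d_model)   # tiny recurrent block
        self.n_steps = n_steps
        self.step_size = step_size

    def energy(self, h: torch.Tensor, target: torch.Tensor) -> torch.Tensor:
        # The quantity being minimized: 0.5 * ||h - target||^2 per position.
        return 0.5 * (h - target).pow(2).sum(dim=-1)

    def forward(self, h: torch.Tensor, target: torch.Tensor) -> torch.Tensor:
        # h, target: (N, d_model). h may start far from the target ("outside" the
        # valid region) and is moved inward over n_steps refinement iterations.
        for _ in range(self.n_steps):
            grad = h - target                      # analytic gradient (== Newton step here)
            update = self.cell(grad, h)            # recurrent block reshapes the step
            h = h - self.step_size * update
        return h
```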
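Finally, a sketch of how the per-layer auxiliary losses might be combined with the final cross-entropy. `hidden_states` stands for the per-layer outputs of a GPT-style model and `synthetic_targets` for the reverse-projected Markov embeddings; the MSE alignment and the auxiliary weight are placeholder choices.

```python
# Sketch of the combined objective: one alignment term per layer plus the usual
# next-token cross-entropy. MSE and the 0.1 weight are placeholders.
import torch.nn.functional as F

def combined_loss(logits, labels, hidden_states, synthetic_targets, aux_weight=0.1):
    # Standard next-token cross-entropy on the final logits.
    ce = F.cross_entropy(logits.view(-1, logits.size(-1)), labels.view(-1))

    # Per-layer auxiliary losses: pull each layer's output toward its target.
    aux = sum(
        F.mse_loss(h, tgt) for h, tgt in zip(hidden_states, synthetic_targets)
    ) / len(hidden_states)

    return ce + aux_weight * aux
```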