
Proposal: PROJECT GEMSTONE

joshuah.rainstar@gmail.com

Overview

This proposal outlines a method to augment an autoregressive Transformer (e.g., GPT) with multi-horizon probabilistic priors derived from external Markov models or a similar statistical basis system. Instead of modifying the architecture, the method uses auxiliary layer-wise losses to align each layer’s internal representation with a synthetic embedding derived from the Markov transition probabilities.

The idea is to teach the model to use prior knowledge to anticipate the most likely futures at multiple temporal horizons, localizing that discovery to the relevant layers, while maintaining compatibility with standard next-token training.
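
As a rough illustration of the training objective this implies, here is a minimal sketch, assuming PyTorch, a list of per-layer hidden states, precomputed synthetic embedding targets, cosine alignment, and a single `aux_weight` scalar; none of these specifics are fixed by the proposal.

```python
# Minimal sketch of the combined objective (assumptions: PyTorch, one
# synthetic target per supervised layer, cosine alignment, scalar weight).
import torch
import torch.nn.functional as F

def combined_loss(logits, labels, hiddens, targets, aux_weight=0.1):
    # Standard next-token cross-entropy on the final logits.
    ce = F.cross_entropy(logits.view(-1, logits.size(-1)), labels.view(-1))
    # Auxiliary alignment: pull each supervised layer's representation
    # toward its synthetic embedding target (cosine distance used here).
    aux = sum(
        (1.0 - F.cosine_similarity(h, t, dim=-1)).mean()
        for h, t in zip(hiddens, targets)
    )
    return ce + aux_weight * aux
```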


Conceptual Flow

  1. Embed and stream the dataset incrementally to build multi-horizon Markov models.

  2. Construct several Markov models, each modeling a different step horizon:

    • M1 → 1-step transition
    • M2 → 2-step transition
    • M4 → 4-step transition
  3. During training, for each position t, obtain Markov-based probability distributions for future steps.

  4. Truncate each transition row to its top-k entries (k < 64) and renormalize to obtain a sparse distribution for use with the model (see the first sketch after this list).

  5. Convert these probabilities into synthetic embedding vectors via a learned reverse projection through the model’s LM head (see the second sketch after this list).

  6. Reference implementation (energy-based model): https://github.com/sdan/nanoEBM/blob/master/nanoebm/model.py

  7. Reference: https://arxiv.org/pdf/2510.04871

  8. Reference: https://arxiv.org/pdf/2510.21450

  9. Synthesize a system that iteratively performs Newton descent, using tiny recurrent blocks and energy minimization, to approach the target auxiliary embedding for that layer (see the third sketch after this list).

  10. Redesign all mechanisms so that hidden states begin outside the probability density and learn to move inward toward valid distributions.

  11. Compute auxiliary losses per layer, aligning each layer’s output with its corresponding synthetic embedding target (as sketched under Overview).

  12. Combine these auxiliary losses with the final cross-entropy loss.
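
First sketch (steps 1–4): one simple way to accumulate transition counts at horizons 1, 2, and 4 over a streamed token sequence, then truncate a row to its top-k entries and renormalize. The dictionary-of-Counters storage, the horizon tuple, and k = 63 are illustrative assumptions, not the proposal's fixed choices.

```python
# Minimal sketch: multi-horizon transition counts plus top-k truncation.
from collections import Counter, defaultdict

HORIZONS = (1, 2, 4)

def build_markov_models(token_stream, horizons=HORIZONS):
    # models[h][s] is a Counter over tokens observed h steps after state s.
    models = {h: defaultdict(Counter) for h in horizons}
    tokens = list(token_stream)
    for h in horizons:
        for t in range(len(tokens) - h):
            models[h][tokens[t]][tokens[t + h]] += 1
    return models

def topk_distribution(models, h, state, k=63):
    # Truncate the transition row for `state` at horizon `h` to its top-k
    # entries and renormalize into a sparse probability distribution.
    row = models[h].get(state)
    if not row:
        return {}
    top = row.most_common(k)
    total = sum(count for _, count in top)
    return {tok: count / total for tok, count in top}
```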
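
Second sketch (step 5): one way to realize the reverse projection, approximated here as a probability-weighted mixture of the LM head's output-embedding rows followed by a small learned linear correction. The module name `ReverseProjection` and the extra linear layer are assumptions, not the proposal's stated design.

```python
# Sketch of a learned reverse projection through the LM head (PyTorch).
# Assumptions: lm_head.weight has shape [vocab_size, d_model]; `probs` is a
# dense [batch, vocab_size] distribution (scatter the sparse top-k into it).
import torch
import torch.nn as nn

class ReverseProjection(nn.Module):
    def __init__(self, lm_head: nn.Linear, d_model: int):
        super().__init__()
        self.lm_head = lm_head                      # tied/frozen output projection
        self.adjust = nn.Linear(d_model, d_model)   # small learned correction

    def forward(self, probs: torch.Tensor) -> torch.Tensor:
        # Probability-weighted mixture of output-embedding rows, then a
        # learned map back into the layer's representation space.
        mixed = probs @ self.lm_head.weight          # [batch, d_model]
        return self.adjust(mixed)
```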
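
Third sketch (steps 9–10): an iterative refinement loop in the spirit of the nanoEBM reference, where a tiny recurrent block defines an energy against the layer's synthetic target, the hidden state is initialized away from the target, and descent moves it inward. Plain gradient descent stands in for Newton descent here; the GRU cell, step count, learning rate, and initialization noise are all assumptions.

```python
# Sketch of energy-minimization refinement toward a layer's synthetic target.
# Assumptions: GRUCell as the "tiny recurrent block", squared-error energy,
# fixed number of descent steps, gradient (not true Newton) updates.
import torch
import torch.nn as nn

class EnergyRefiner(nn.Module):
    def __init__(self, d_model: int, steps: int = 4, lr: float = 0.1):
        super().__init__()
        self.cell = nn.GRUCell(d_model, d_model)  # tiny recurrent block
        self.steps = steps
        self.lr = lr

    def energy(self, h: torch.Tensor, target: torch.Tensor) -> torch.Tensor:
        # Energy is low when the refined state reproduces the target.
        return ((self.cell(h, h) - target) ** 2).sum(dim=-1).mean()

    def forward(self, h0: torch.Tensor, target: torch.Tensor) -> torch.Tensor:
        # Start outside the target density (step 10) and move inward.
        h = (h0 + torch.randn_like(h0)).detach().requires_grad_(True)
        for _ in range(self.steps):
            e = self.energy(h, target)
            (grad,) = torch.autograd.grad(e, h, create_graph=True)
            h = h - self.lr * grad
        return h
```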

@falseywinchnet
Author

update: rough draft revised, now 10x better, now doing interesting and unexpected things
