A reusable structure for presenting mech interp experiments consistently.
## Header

- Title: Descriptive, specific to the contribution
- Authors & Affiliations
- Date
- Links: Code repo, demo, paper (if applicable)
## TL;DR

3-5 bullet points covering:
- What problem you're addressing
- Your approach (1-2 sentences)
- Key findings
- Limitations/caveats
- Call to action or future directions
## Introduction

- What phenomenon or capability are you trying to understand?
- Why does this matter for AI safety/alignment/interpretability?
- What has prior work done?
- What's missing or insufficient?
## Method

- High-level description of your approach
- What makes it novel or useful?
- What models/tasks/datasets are you studying?
- What are the boundaries of your investigation?
- Diagram or figure showing the full pipeline
- Intuitive explanation before technical details
For each major component:
- What: What does this step do?
- Why: Why is this step necessary?
- How: Technical details (can reference appendix for full rigor)
- What alternatives did you consider?
- Why did you choose this approach?
- What went wrong in early versions?
- How did you address it?
## Sanity Checks

- Qualitative inspection: Does the output look reasonable?
- Known-answer tests: Does it recover structure you intentionally created?
- Edge cases: Does it fail gracefully?
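A known-answer test can be as simple as planting a signal in synthetic data and checking that your method recovers it. A minimal sketch, assuming a direction-finding method; the difference-of-means step is a stand-in for whatever your actual method does, and all names here are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)

# Plant a known direction in synthetic "activations": half the
# examples get the direction added, half do not.
d_model, n = 64, 500
planted = rng.normal(size=d_model)
planted /= np.linalg.norm(planted)

noise = rng.normal(size=(2 * n, d_model))
labels = np.array([1] * n + [0] * n)
acts = noise + 3.0 * labels[:, None] * planted

# Stand-in recovery step: difference of class means. Replace with
# your actual method (probe direction, SAE feature, etc.).
recovered = acts[labels == 1].mean(axis=0) - acts[labels == 0].mean(axis=0)
recovered /= np.linalg.norm(recovered)

# The test passes iff the method finds the planted direction.
cosine = float(planted @ recovered)
assert cosine > 0.9, "failed to recover intentionally planted structure"
print(f"cosine(recovered, planted) = {cosine:.3f}")
```

The point is not the specific recovery step but the pattern: you control the ground truth, so a failure here is unambiguous evidence the pipeline is broken.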
## Quantitative Evaluation

- Task definition: What objective metric are you measuring?
- Baselines: What are you comparing against?
  - Naive baselines (random, constant prediction)
  - Ablations of your method
  - Existing methods (if applicable)
  - Strong alternatives (e.g., just asking an LLM)
- Results: Tables/figures with clear takeaways
- Analysis by condition: Break down results by relevant factors
- What do the results tell us?
- What are the limitations of the evaluation?
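Naive baselines are cheap to compute and catch broken evaluations early. A minimal sketch for a binary classification metric, using made-up labels; substitute your task's real labels and your method's predictions:

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative binary labels with class imbalance; swap in your
# task's real labels.
labels = rng.random(1000) < 0.3  # ~30% positive

# Constant baseline: always predict the majority class.
majority = labels.mean() >= 0.5
constant_acc = float(np.mean(labels == majority))

# Random baseline: predict each class uniformly at random.
random_acc = float(np.mean(labels == (rng.random(labels.size) < 0.5)))

print(f"constant baseline: {constant_acc:.2f}")
print(f"random baseline:   {random_acc:.2f}")
```

Note that under class imbalance the constant baseline can look deceptively strong, which is exactly why it belongs in the comparison.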
## Case Studies

- Walk through specific examples in detail
- Include visualizations where helpful
- What recurring themes emerged?
- What surprised you?
## Robustness

Does the result hold across:
- Different models
- Different prompts/tasks
- Different hyperparameters
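Robustness checks usually reduce to running the same evaluation over a grid of conditions and reporting per-condition scores. A hedged sketch of the bookkeeping; `run_eval`, its signature, and the condition names are placeholder assumptions to be replaced by your real pipeline:

```python
from itertools import product

# Placeholder evaluation; swap in your real pipeline. The signature
# and score range here are illustrative assumptions.
def run_eval(model: str, prompt_style: str, threshold: float) -> float:
    return 0.5 + 0.1 * threshold  # stub score in [0, 1]

models = ["model-a", "model-b"]
prompt_styles = ["plain", "chain-of-thought"]
thresholds = [0.1, 0.5]

# Keep per-condition scores rather than a single average, so the
# write-up can break results down by model, prompt, and setting.
results = {
    (m, p, t): run_eval(m, p, t)
    for m, p, t in product(models, prompt_styles, thresholds)
}
for condition, score in sorted(results.items()):
    print(condition, score)
```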
## Discussion

- Restate key results in plain language
### Limitations

- What doesn't this method capture?
- Where does it fail?
- What assumptions does it make?
### Implications

- What does this tell us about how models work?
- How might this inform safety/alignment work?
### Future Work

- Concrete next steps
- Open questions for the community
- What would make this more useful?
## Conclusion

- 2-3 paragraph wrap-up
- Final call to action
## Acknowledgments

- Funding, mentorship, feedback
- Who did what (for multi-author posts)
## Appendix

- Full algorithms/pseudocode
- Hyperparameters and tuning
- Prompts used (verbatim)
- Ablation studies
- Additional baselines
- Per-task/per-model breakdowns
- Compute requirements
- Data/model access
- Known issues
- Additional visualizations
- Extended examples
## Style Guidelines

| Element | Guideline |
|---|---|
| Figures | Every method section should have at least one diagram. Label clearly. |
| Code snippets | Use sparingly in main text; link to repo for full code. |
| Math | Define notation on first use. Keep inline math simple. |
| Length | Target 2,000-4,000 words for main text; appendix can be longer. |
| Tone | Honest about limitations. Avoid overclaiming. |
| Audience | Assume familiarity with ML but not your specific subfield. |
## Pre-Publication Checklist

- TL;DR captures the essence in <1 minute of reading
- At least one sanity check shows the method isn't broken
- At least one quantitative baseline shows it's doing something non-trivial
- Limitations section is honest and specific
- Code/demo links work
- Figures render correctly
- A non-expert colleague can follow the main argument
## Further Reading

- [Highly opinionated advice on how to write ML papers](https://www.alignmentforum.org/posts/eJGptPbbFPZGLpjsp/highly-opinionated-advice-on-how-to-write-ml-papers) (Alignment Forum)