@MatrixManAtYrService
Last active December 3, 2025 00:26

Nix Subflakes as a Substrate for Integration Testing

There's nothing special about subflakes here. You can build dependency graphs out of Nix flakes whether or not they share a git repo. Using those graphs you can control which versions of which flakes come into contact with each other, and from that you can reason inductively about which version contains the problematic code.

That's all I'm doing here.

But for the last year (since this was merged) it has been possible to have more than one flake in a repo. Since then I've been wanting to try structuring a project specifically with this kind of reasoning in mind.

Previously, the place for new tests was typically inside the same repo as what is being tested, so when you walked your repo back in time, you also ended up walking your tests back in time. Subflakes, it would seem, fix that: We can vary the version of just one component while holding everything else constant.

Before trying this on a real project, I wanted to test it on a prototype. I've created a repo with some very simple subflakes, and in the rest of this doc I'm just walking through the familiar process of holding one thing constant while changing the other thing in order to reason about cause and effect, but I'll be doing it by varying subflake inputs.

Consider two sibling subflakes: foo and bar. foo's output is bar's input:

# foo/flake.nix
{
  inputs = {};
  outputs = { self }: {
    str = "foo1";
  };
}

# bar/flake.nix
{
  inputs = {
    foo.url = "path:../foo";
  };

  outputs = { self, foo }: {
    str = foo.str + "bar1";
  };
}

Also there's a toplevel flake which uses bar's output for its input:

# flake.nix
{
  inputs = {
    bar.url = "path:./bar";
  };

  outputs = { self, bar }: {
    str = bar.str + "baz";
  };
}

All of these are in the same git repo. In this case, the name of the game is function composition: toplevel(bar(foo())) = foo1bar1baz

  ls **/*.nix | each { |f|
  { flake: $f.name
    output: (cd ($f.name | path dirname) ; nix eval .#str)
  }
}
╭───────────────┬───────────────╮
│     flake     │    output     │
├───────────────┼───────────────┤
│ flake.nix     │ "foo1bar1baz" │
│ bar/flake.nix │ "foo1bar1"    │
│ foo/flake.nix │ "foo1"        │
╰───────────────┴───────────────╯
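
The same composition can be modeled without flakes at all, as plain Nix functions. This is only an illustration of the dataflow, not part of the repo: the file name compose.nix is made up, and bar and toplevel are written as functions of their input rather than as flakes.

```nix
# compose.nix (hypothetical file): the three flakes as plain values and functions.
let
  # foo has no inputs, so it is just a value.
  foo = { str = "foo1"; };

  # bar takes foo as its input and appends its own marker.
  bar = foo: { str = foo.str + "bar1"; };

  # The toplevel takes bar as its input and appends "baz".
  toplevel = bar: { str = bar.str + "baz"; };
in
# toplevel(bar(foo)).str
(toplevel (bar foo)).str
```

Evaluating this with nix-instantiate --eval compose.nix yields "foo1bar1baz", the same string the toplevel flake reports above. Swapping in a different foo or bar here is the plain-function analogue of overriding a flake input.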

Now suppose we add an additional commit which changes all the 1's to 2's:

diff --git a/bar/flake.nix b/bar/flake.nix
   outputs = { self, foo }: {
-    str = foo.str + "bar1";
+    str = foo.str + "bar2";
   };
diff --git a/foo/flake.nix b/foo/flake.nix
   outputs = { self }: {
-    str = "foo1";
+    str = "foo2";
   };

Following this, all of the 1's are 2's:

  ls **/*.nix | each { |f|
  {flake: $f.name
   output: (cd ($f.name | path dirname) ; nix eval .#str)
  }
}
╭───────────────┬───────────────╮
│     flake     │    output     │
├───────────────┼───────────────┤
│ flake.nix     │ "foo2bar2baz" │
│ bar/flake.nix │ "foo2bar2"    │
│ foo/flake.nix │ "foo2"        │
╰───────────────┴───────────────╯

Suppose we then discover that the change from 1 to 2 introduced a bug. There are a few ways to proceed.

Revert Everything

This is business as usual and has nothing to do with subflakes. We can use git to revert the entire repo to the earlier commit, and manually test in each location to see if the breakage we found in 2 is missing in 1.

Try an Old Foo

Since foo is consumed by bar and then (transitively) by the toplevel flake, there are two places we can test to see if the bug is in foo. First let's modify the toplevel flake to use the old foo:

diff --git a/flake.nix b/flake.nix
   inputs = {
-    bar.url = "path:./bar";
+    foo.url = "git+file:///Users/matt.rixman/2025/12/01/subflake_test_repo?rev=4426dcaf098c7c567aecb9aa6fce62d4fbf243f5&dir=foo";
+    bar = {
+      url = "path:./bar";
+      inputs.foo.follows = "foo";
+    };
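
For reference, here is a sketch of what the whole toplevel flake.nix might look like with the diff above applied. The inputs are taken from the diff; the outputs signature is my assumption. As far as I know, Nix calls outputs with every declared input, so once foo is a direct input the function needs either an explicit foo argument or `...` to accept it.

```nix
# flake.nix (top level), sketched with the diff above applied
{
  inputs = {
    # The old foo: same repo, pinned to the earlier commit, subdirectory foo.
    foo.url = "git+file:///Users/matt.rixman/2025/12/01/subflake_test_repo?rev=4426dcaf098c7c567aecb9aa6fce62d4fbf243f5&dir=foo";

    bar = {
      # The current bar from the working tree.
      url = "path:./bar";
      # Make bar's own `foo` input resolve to the pinned foo declared above.
      inputs.foo.follows = "foo";
    };
  };

  # Assumption: `...` absorbs the extra `foo` argument that the toplevel
  # does not use directly.
  outputs = { self, bar, ... }: {
    str = bar.str + "baz";
  };
}
```

The follows line is what does the work: without it, bar would keep resolving its own path:../foo input and the pinned foo would never reach it.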

In a real use case we'd be scrutinizing test results here. We may even need to write a new test to run at the toplevel, just for identifying this bug. In this case we're just looking at proof that the toplevel flake is indeed consuming the old foo.

  ls **/*.nix | each { |f|
  {flake: $f.name
   output: (cd ($f.name | path dirname) ; nix eval .#str)
  }
}
╭───────────────┬───────────────╮
│     flake     │    output     │
├───────────────┼───────────────┤
│ flake.nix     │ "foo1bar2baz" │
│ bar/flake.nix │ "foo2bar2"    │
│ foo/flake.nix │ "foo2"        │
╰───────────────┴───────────────╯

Hmm, we didn't find the bug. Let's revert that and try modifying bar to use the old foo. (The tests in bar might be more thorough than those at the top level.)

diff --git a/bar/flake.nix b/bar/flake.nix
   inputs = {
-    foo.url = "path:../foo";
+    foo.url = "git+file:///Users/matt.rixman/2025/12/01/subflake_test_repo?rev=4426dcaf098c7c567aecb9aa6fce62d4fbf243f5&dir=foo";
   };

  ls **/*.nix | each { |f|
  {flake: $f.name
   output: (cd ($f.name | path dirname) ; nix eval .#str)
  }
}
╭───────────────┬───────────────╮
│     flake     │    output     │
├───────────────┼───────────────┤
│ flake.nix     │ "foo2bar2baz" │
│ bar/flake.nix │ "foo1bar2"    │
│ foo/flake.nix │ "foo2"        │
╰───────────────┴───────────────╯

Notice that at first only bar picks up the change, because that's the flake we modified. The toplevel flake doesn't pick it up because its inputs are still locked to what its flake.lock recorded earlier.

If we then run nix flake update in the toplevel flake, the change propagates to the toplevel flake's output as well:

╭───────────────┬───────────────╮
│     flake     │    output     │
├───────────────┼───────────────┤
│ flake.nix     │ "foo1bar2baz" │
│ bar/flake.nix │ "foo1bar2"    │
│ foo/flake.nix │ "foo2"        │
╰───────────────┴───────────────╯

Try an Old Bar

This case is simpler, because only the toplevel flake depends on bar. If we want the old bar to depend on the old foo, as it originally did, we can alter the toplevel flake like so:

   inputs = {
-    bar.url = "path:./bar";
+    bar.url = "git+file:///Users/matt.rixman/2025/12/01/subflake_test_repo?rev=4426dcaf098c7c567aecb9aa6fce62d4fbf243f5&dir=bar";
   };

That will give us this:

╭───────────────┬───────────────╮
│     flake     │    output     │
├───────────────┼───────────────┤
│ flake.nix     │ "foo1bar1baz" │
│ bar/flake.nix │ "foo2bar2"    │
│ foo/flake.nix │ "foo2"        │
╰───────────────┴───────────────╯

Alternatively, we can use the old bar, but supply it with a new foo:

   inputs = {
-    bar.url = "path:./bar";
+    foo.url = "path:./foo";
+    bar = {
+      url = "git+file:///Users/matt.rixman/2025/12/01/subflake_test_repo?rev=4426dcaf098c7c567aecb9aa6fce62d4fbf243f5&dir=bar";
+      inputs.foo.follows = "foo";
+    };
   };

This gives:

╭───────────────┬───────────────╮
│     flake     │    output     │
├───────────────┼───────────────┤
│ flake.nix     │ "foo2bar1baz" │
│ bar/flake.nix │ "foo2bar2"    │
│ foo/flake.nix │ "foo2"        │
╰───────────────┴───────────────╯

Notice that bar/flake.nix still shows foo2bar2 because we've only modified the toplevel flake—the local bar subflake remains unchanged.

To verify the old bar code works with the new foo at the bar level, we need to evaluate the pinned bar directly. This is worth doing because bar's own tests may catch integration issues that the toplevel tests miss—perhaps bar has unit tests that exercise edge cases the toplevel doesn't cover.

"foo2bar1"
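
One way to see the pinned bar's output without leaving the toplevel flake is to re-export it. This is a sketch under assumptions, and not necessarily how the value above was produced: barStr is a hypothetical output name, and the `...` in the outputs signature is my addition so that the function accepts the extra foo input.

```nix
# flake.nix (top level): old bar fed the new foo, with bar's output re-exported
{
  inputs = {
    # The current foo from the working tree.
    foo.url = "path:./foo";

    bar = {
      # The old bar, pinned to the earlier commit.
      url = "git+file:///Users/matt.rixman/2025/12/01/subflake_test_repo?rev=4426dcaf098c7c567aecb9aa6fce62d4fbf243f5&dir=bar";
      # Feed the old bar the new foo instead of the foo it was locked against.
      inputs.foo.follows = "foo";
    };
  };

  outputs = { self, bar, ... }: {
    str = bar.str + "baz";

    # Hypothetical extra output: expose exactly what the pinned bar produced,
    # before the toplevel adds anything to it.
    barStr = bar.str;
  };
}
```

With this in place, nix eval .#barStr should print the "foo2bar1" shown above. It only surfaces bar's str, though; running the old bar's own tests against the new foo would still mean evaluating or building that pinned bar's test outputs.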

Thoughts

The above process was tedious, but it feels like it could be automated. If you go through it and grab strings like foo*bar*baz you'll end up with the following:

Toplevel:

foo1bar1baz
foo1bar2baz
foo2bar1baz
foo2bar2baz

Bar:

foo1bar1
foo1bar2
foo2bar2
foo2bar1

Foo:

foo1
foo2

We end up with something like a DAG of test matrices, where the leaf nodes contain tests that depend on relatively little, and the root node depends on everything.
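
As a first step toward automating this, the matrices above can be enumerated in plain Nix. This is a toy sketch: the version labels are just the strings from this walkthrough, and string concatenation stands in for actually overriding flake inputs.

```nix
# matrix.nix (hypothetical file): enumerate every combination seen at each level.
let
  fooVersions = [ "foo1" "foo2" ];
  barSuffixes = [ "bar1" "bar2" ];

  # Every (foo, bar) pairing, as bar's tests would see it.
  barMatrix =
    builtins.concatMap (f: map (b: f + b) barSuffixes) fooVersions;

  # The toplevel sees each of bar's combinations with "baz" appended.
  toplevelMatrix = map (s: s + "baz") barMatrix;
in
{
  foo = fooVersions;
  bar = barMatrix;
  toplevel = toplevelMatrix;
}
```

Evaluated strictly (for example with nix-instantiate --eval --strict matrix.nix), bar comes out as the four foo*bar* strings and toplevel as the four foo*bar*baz strings listed above. In a real setup each entry would correspond to one choice of pinned inputs rather than a string.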

Likely, the foo tests are fast, the bar tests are slower (bar is more complex, because it calls into foo) and the toplevel tests are the slowest (because they call into bar, which must then call into foo). In my toy project I only found it necessary to involve containerization at the very top, preferring more stack-specific tests everywhere else. I expect this to be the case generally.

In an integration testing scenario, you want the smallest test that is still big enough to shine light on the bug. From these considerations there emerges a structure we can use to determine how often to run what tests: At the leaves of the dependency DAG, run frequently. Further toward the root, run less frequently.

Also, if you just made some changes and you want to know if you're done, it's often not necessary to rerun everything. In that case you can get away with running successively larger tests until you find a bug. Ideally, any bugs introduced can be caught by small (fast) tests and you're iterating quickly, but other times maybe you're not so lucky and it takes a larger more encompassing test to catch the bugs.
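
The leaves-first order can be derived from the dependency graph itself. Here is a minimal sketch in plain Nix, assuming the graph is written out by hand as an attribute set (the deps name and its shape are mine) and contains no cycles.

```nix
# order.nix (hypothetical file): sort flakes so that leaves come first.
let
  # Which flakes each flake consumes, mirroring the repo in this walkthrough.
  deps = {
    foo = [ ];
    bar = [ "foo" ];
    toplevel = [ "bar" ];
  };

  # Depth of a flake: 0 for a leaf, otherwise 1 + the deepest dependency.
  depth = name:
    builtins.foldl'
      (acc: d: let x = 1 + depth d; in if x > acc then x else acc)
      0
      deps.${name};
in
# Shallowest (fastest, most frequently run) first; the root comes last.
builtins.sort (a: b: depth a < depth b) (builtins.attrNames deps)
```

For this graph the result is [ "foo" "bar" "toplevel" ]: run foo's tests first, then bar's, and only reach for the toplevel tests if the smaller ones pass.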

If one imagines AI agents doing this work, a successful run at the top level would indicate that the task is done and it's time to move on to the next one, but following this leaves-first, root-last model would likely be ideal (especially because of how Nix caches outputs). The flake boundaries feel sort of like varying "magnifications", so rather than having a single agent zoom in and out and pick up potentially conflicting context, you could instead have an agent for each flake, and they could notify each other of bugs. That way each agent can be a specialist in its flake and doesn't have to wear too many hats.

I'm going to try it out; we'll see how it goes.
