@CAFxX
Last active December 5, 2025 05:57
Golang bulk allocations

Bulk memory allocations for Go

Consider the following code:

s := make([]*Foo, N)
for i := range s {
  s[i] = new(Foo)
  // ...
}

Currently the code above performs N+1 allocations (one for the slice, plus one per element), but we could realistically bring it down to 1+1 (one for the slice, one for a shared backing array) by rewriting it as

s := make([]*Foo, N)
b := make([]Foo, N)
for i := range s {
  s[i] = &b[i]
  // ...
}

This works today, but has the unfortunate side effect that the entire backing array b remains alive as long as a pointer to any one of its elements still exists.
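That the rewrite actually saves allocations can be verified today with testing.AllocsPerRun. This is a minimal sketch (the helper names and the choice of N are illustrative, not part of the proposal):

```go
package main

import (
	"fmt"
	"testing"
)

type Foo struct{ a, b int }

const N = 1000

// sink defeats escape analysis so the allocations are real.
var sink []*Foo

// allocsPerElement counts heap allocations for the naive loop:
// one for the slice plus one per element.
func allocsPerElement() float64 {
	return testing.AllocsPerRun(100, func() {
		s := make([]*Foo, N)
		for i := range s {
			s[i] = new(Foo)
		}
		sink = s
	})
}

// allocsBulk counts heap allocations for the bulk rewrite:
// one for the slice plus one for the backing array.
func allocsBulk() float64 {
	return testing.AllocsPerRun(100, func() {
		s := make([]*Foo, N)
		b := make([]Foo, N)
		for i := range s {
			s[i] = &b[i]
		}
		sink = s
	})
}

func main() {
	fmt.Println(allocsPerElement(), allocsBulk()) // typically 1001 2
}
```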

It would be ideal if we (or the compiler) could instead obtain from the runtime in a single call all N backing Foo objects, but allocated as individual objects so that they can be individually garbage collected when each of them individually becomes unreachable.

Note

In these examples runtime_multimallocgc is a call inserted by the compiler.

s := make([]*Foo, N)
base, delta := runtime_multimallocgc(<Foo>, N)
for i := range s {
  s[i] = (*Foo)(unsafe.Pointer(base))
  base += delta
  // ...
}

Such a scheme has an obvious downside in memory fragmentation. When the runtime allocates individual objects, it can choose to place them in any span where there is at least one unallocated slot. If the runtime allocates ranges of slots instead, finding large enough ranges becomes progressively harder.

This can be mitigated by having the runtime not guarantee that it will allocate N objects, but instead allocate at least 1 and best-effort up to N objects.

s := make([]*Foo, N)
var base, delta uintptr
var count int
for i := range s {
  if count == 0 {
    base, delta, count = runtime_multimallocgc(<Foo>, N-i)
  }
  s[i] = (*Foo)(unsafe.Pointer(base))
  base += delta
  count--
  // ...
}

With this scheme, the runtime can opportunistically return more than one contiguous element to the caller, even if finding N contiguous ones is impossible. In the worst case of fragmentation this scheme behaves like the current allocator, while in all other cases it should be faster due to the amortization of per-allocation overhead.
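The best-effort contract (at least 1, up to N) can be sketched in user space. Here maxChunk stands in for the largest contiguous run of free slots the runtime manages to find, and multiAlloc is a hypothetical name; unlike the real proposal, each chunk shares one backing array and is only collectable as a group:

```go
package main

import "fmt"

type Foo struct{ v int }

// maxChunk models the largest contiguous free range available (assumption).
const maxChunk = 64

// multiAlloc emulates the best-effort contract: it returns at least 1 and
// at most want contiguous Foos.
func multiAlloc(want int) []Foo {
	if want > maxChunk {
		want = maxChunk
	}
	return make([]Foo, want)
}

func main() {
	const N = 1000
	s := make([]*Foo, N)
	var chunk []Foo
	for i := range s {
		if len(chunk) == 0 {
			chunk = multiAlloc(len(s) - i) // refill: at least 1, up to N-i
		}
		s[i] = &chunk[0]
		chunk = chunk[1:]
	}
	fmt.Println(len(s), s[0] != nil, s[N-1] != nil) // 1000 true true
}
```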

The main issue with the above is that nothing keeps the allocations alive between the call to runtime_multimallocgc and the assignment to s[i]. Either a new kind of reference could be created that keeps a range of slots alive (but this is possibly a quite complex/invasive change), or runtime_multimallocgc could instead be passed a reference to an array/slice (possibly stack-allocated) that receives the pointers to all individual allocations:

s := make([]*Foo, N)
p := [C]*Foo{}
var j, count int
for i := range s {
  if j == count {
    j, count = 0, runtime_multimallocgc(<Foo>, min(C, len(s)-i), &p)
  }
  s[i] = p[j]
  j++
  // ...
}
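This variant can also be emulated in user space. In this sketch, fillPtrs stands in for runtime_multimallocgc(<Foo>, n, &p) and C is an arbitrary buffer size; each Foo is allocated individually here, whereas the real runtime would carve them out of contiguous span slots in one operation, with p being what keeps them alive until they are copied into s:

```go
package main

import "fmt"

type Foo struct{ v int }

// C is the size of the temporary pointer buffer (an assumption).
const C = 8

// fillPtrs writes up to n (capped at len(p)) pointers to newly allocated
// Foos into p and returns how many it filled.
func fillPtrs(p *[C]*Foo, n int) int {
	if n > C {
		n = C
	}
	for i := 0; i < n; i++ {
		p[i] = new(Foo)
	}
	return n
}

func main() {
	const N = 20
	s := make([]*Foo, N)
	var p [C]*Foo
	var j, count int
	for i := range s {
		if j == count {
			j, count = 0, fillPtrs(&p, len(s)-i) // refill the buffer
		}
		s[i] = p[j]
		j++
	}
	fmt.Println(s[0] != nil, s[N-1] != nil) // true true
}
```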

Furthermore, if the compiler could prove that user code will not be able to observe this, it could skip the temporary array and just use the slice directly:

s := make([]*Foo, N)
var j, count int
for i := range s {
  if j == count {
    j, count = 0, runtime_multimallocgc(<Foo>, len(s)-i, &s[i])
  }
  j++
  // ... (user code does not observe that s[i+1:] has already been populated)
}

It is worth pointing out that, in general, a valid (although somewhat pointless) implementation of runtime_multimallocgc simply forwards to the regular mallocgc; this is important because it constitutes a backstop against catastrophic heap fragmentation:

func multimallocgc(count uint, size uintptr, typ *_type, needzero bool) (base unsafe.Pointer, delta uintptr, nalloc uint) {
  if doubleCheckMalloc {
    if gcphase == _GCmarktermination {
      throw("multimallocgc called with gcphase == _GCmarktermination")
    }
  }
  if count == 0 {
    return unsafe.Pointer(&zerobase), 0, 0
  }
  return mallocgc(size, typ, needzero), size, 1
}

While in the examples above the required number of objects N is known ahead of the loop, this is not always the case: many loops don't have a known trip count, and even when they do, it is not guaranteed that every iteration will trigger an allocation, or that all iterations will actually run (early return/break). The scheme above could be made slightly more speculative by, e.g., using a small fixed C as the count argument to multimallocgc. In cases where there are unused allocated objects when the loop terminates, they would be wasted. runtimefree (see below) offers a solution to this problem.

Interactions with runtimefree

When runtimefree is active, multimallocgc would first pop any appropriate objects from the object reuse cache. The reused objects will likely not have good memory locality relative to each other, but having been freed recently they are more likely to already be in cache anyway.

In case some of the objects allocated by multimallocgc are unused by the loop, they can be passed back in bulk to the runtimefree logic for reuse.
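This interaction can be sketched in user space with a simple free list standing in for the runtimefree reuse cache. freeList, multiAllocReuse, and release are all hypothetical names for this sketch, and the real runtime cache would be per size class and per P rather than a global slice:

```go
package main

import "fmt"

type Foo struct{ v int }

// freeList models the runtimefree object reuse cache for Foo (assumption).
var freeList []*Foo

// multiAllocReuse first pops objects from the reuse cache, then allocates
// fresh ones only if the cache runs out.
func multiAllocReuse(n int) []*Foo {
	out := make([]*Foo, 0, n)
	for n > 0 && len(freeList) > 0 {
		out = append(out, freeList[len(freeList)-1])
		freeList = freeList[:len(freeList)-1]
		n--
	}
	for ; n > 0; n-- {
		out = append(out, new(Foo))
	}
	return out
}

// release hands unused objects back in bulk, as leftover loop allocations
// would be passed back to the runtimefree logic.
func release(objs []*Foo) { freeList = append(freeList, objs...) }

func main() {
	a := multiAllocReuse(4)
	release(a[2:])          // 2 objects went unused; return them to the cache
	b := multiAllocReuse(3) // reuses the 2 cached objects, allocates 1 fresh
	fmt.Println(len(b), b[0] == a[3], b[1] == a[2]) // 3 true true
}
```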
