@CAFxX
Last active December 5, 2025 05:57
Golang bulk allocations

Bulk memory allocations for Go

Consider the following code:

s := make([]*Foo, N)
for i := range s {
  s[i] = new(Foo)
  // ...
}

Currently the code above performs N+1 allocations (one for the slice, plus one per element), but we could realistically bring it down to 1+1 (one for the slice, one for a shared backing array) by rewriting it as

s := make([]*Foo, N)
b := make([]Foo, N)
for i := range s {
  s[i] = &b[i]
  // ...
}

This works today, but has the unfortunate side effect that the entire backing array b remains alive as long as a pointer to any one of its elements still exists.
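That the rewrite actually saves allocations can be verified today with testing.AllocsPerRun. This is a minimal sketch (the helper names and the choice of N are illustrative, not part of the proposal):

```go
package main

import (
	"fmt"
	"testing"
)

type Foo struct{ a, b int }

const N = 1000

// sink defeats escape analysis so the allocations are real.
var sink []*Foo

// allocsPerElement counts heap allocations for the naive loop:
// one for the slice plus one per element.
func allocsPerElement() float64 {
	return testing.AllocsPerRun(100, func() {
		s := make([]*Foo, N)
		for i := range s {
			s[i] = new(Foo)
		}
		sink = s
	})
}

// allocsBulk counts heap allocations for the bulk rewrite:
// one for the slice plus one for the backing array.
func allocsBulk() float64 {
	return testing.AllocsPerRun(100, func() {
		s := make([]*Foo, N)
		b := make([]Foo, N)
		for i := range s {
			s[i] = &b[i]
		}
		sink = s
	})
}

func main() {
	fmt.Println(allocsPerElement(), allocsBulk()) // typically 1001 2
}
```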

It would be ideal if we (or the compiler) could instead obtain from the runtime in a single call all N backing Foo objects, but allocated as individual objects so that they can be individually garbage collected when each of them individually becomes unreachable.

Note

In these examples runtime_multimallocgc is a call inserted by the compiler.

s := make([]*Foo, N)
base, delta := runtime_multimallocgc(<Foo>, N)
for i := range s {
  s[i] = (*Foo)(unsafe.Pointer(base))
  base += delta
  // ...
}

Such a scheme has an obvious downside in memory fragmentation. When the runtime allocates individual objects, it can choose to place them in any span where there is at least one unallocated slot. If the runtime allocates ranges of slots instead, finding large enough ranges becomes progressively harder.

This can be mitigated by having the runtime not guarantee that it will allocate N objects, but instead allocate at least 1 and best-effort up to N objects.

s := make([]*Foo, N)
var base, delta uintptr
var count int
for i := range s {
  if count == 0 {
    base, delta, count = runtime_multimallocgc(<Foo>, N-i)
  }
  s[i] = (*Foo)(unsafe.Pointer(base))
  base += delta
  count--
  // ...
}

With this scheme, the runtime can opportunistically return more than one contiguous element to the caller, even if finding N contiguous ones is impossible. In the worst case of fragmentation this scheme behaves like the current allocator, while in all other cases it should be faster due to the amortization of per-allocation overhead.
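The best-effort contract (at least 1, up to N) can be sketched in user space. Here maxChunk stands in for the largest contiguous run of free slots the runtime manages to find, and multiAlloc is a hypothetical name; unlike the real proposal, each chunk shares one backing array and is only collectable as a group:

```go
package main

import "fmt"

type Foo struct{ v int }

// maxChunk models the largest contiguous free range available (assumption).
const maxChunk = 64

// multiAlloc emulates the best-effort contract: it returns at least 1 and
// at most want contiguous Foos.
func multiAlloc(want int) []Foo {
	if want > maxChunk {
		want = maxChunk
	}
	return make([]Foo, want)
}

func main() {
	const N = 1000
	s := make([]*Foo, N)
	var chunk []Foo
	for i := range s {
		if len(chunk) == 0 {
			chunk = multiAlloc(len(s) - i) // refill: at least 1, up to N-i
		}
		s[i] = &chunk[0]
		chunk = chunk[1:]
	}
	fmt.Println(len(s), s[0] != nil, s[N-1] != nil) // 1000 true true
}
```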

The main issue with the above is that nothing keeps the allocations alive between the call to runtime_multimallocgc and the assignment to s[i]. Either a new kind of reference could be created that keeps a range of slots alive (but this is possibly a quite complex/invasive change), or runtime_multimallocgc could instead be passed a reference to an array/slice (possibly stack-allocated) that receives the pointers to all individual allocations:

s := make([]*Foo, N)
p := [C]*Foo{}
var j, count int
for i := range s {
  if j == count {
    j, count = 0, runtime_multimallocgc(<Foo>, min(C, len(s)-i), &p)
  }
  s[i] = p[j]
  j++
  // ...
}
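This variant can also be emulated in user space. In this sketch, fillPtrs stands in for runtime_multimallocgc(<Foo>, n, &p) and C is an arbitrary buffer size; each Foo is allocated individually here, whereas the real runtime would carve them out of contiguous span slots in one operation, with p being what keeps them alive until they are copied into s:

```go
package main

import "fmt"

type Foo struct{ v int }

// C is the size of the temporary pointer buffer (an assumption).
const C = 8

// fillPtrs writes up to n (capped at len(p)) pointers to newly allocated
// Foos into p and returns how many it filled.
func fillPtrs(p *[C]*Foo, n int) int {
	if n > C {
		n = C
	}
	for i := 0; i < n; i++ {
		p[i] = new(Foo)
	}
	return n
}

func main() {
	const N = 20
	s := make([]*Foo, N)
	var p [C]*Foo
	var j, count int
	for i := range s {
		if j == count {
			j, count = 0, fillPtrs(&p, len(s)-i) // refill the buffer
		}
		s[i] = p[j]
		j++
	}
	fmt.Println(s[0] != nil, s[N-1] != nil) // true true
}
```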

Furthermore, if the compiler could prove that user code will not be able to observe this, it could skip the temporary array and just use the slice directly:

s := make([]*Foo, N)
var j, count int
for i := range s {
  if j == count {
    j, count = 0, runtime_multimallocgc(<Foo>, len(s)-i, &s[i])
  }
  j++
  // ... (user code does not observe that s[i+1:] has already been populated)
}

It is worth pointing out that, in general, a valid (although somewhat pointless) implementation of runtime_multimallocgc simply forwards to the regular mallocgc; this is important because it constitutes a backstop against catastrophic heap fragmentation:

func multimallocgc(count uint, size uintptr, typ *_type, needzero bool) (base unsafe.Pointer, delta uintptr, nalloc uint) {
  if doubleCheckMalloc {
    if gcphase == _GCmarktermination {
      throw("multimallocgc called with gcphase == _GCmarktermination")
    }
  }
  if count == 0 {
    return unsafe.Pointer(&zerobase), 0, 0
  }
  return mallocgc(size, typ, needzero), size, 1
}

While in the examples above the required number of objects N is known ahead of the loop, this is not always the case: many loops don't have a known trip count, and even when they do, it is not guaranteed that every iteration will trigger an allocation, or that all iterations will actually run (early return/break). The scheme above could be made slightly more speculative by, e.g., using a small fixed C as the count argument to multimallocgc. In cases where there are unused allocated objects when the loop terminates, they would be wasted. runtimefree (see below) offers a solution to this problem.

Interactions with runtimefree

When runtimefree is active, multimallocgc would first pop any appropriate objects from the object reuse cache. The reused objects will likely not have good memory locality relative to each other, but having been freed recently they are more likely to already be in cache anyway.

In case some of the objects allocated by multimallocgc are unused by the loop, they can be passed back in bulk to the runtimefree logic for reuse.
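This interaction can be sketched in user space with a simple free list standing in for the runtimefree reuse cache. freeList, multiAllocReuse, and release are all hypothetical names for this sketch, and the real runtime cache would be per size class and per P rather than a global slice:

```go
package main

import "fmt"

type Foo struct{ v int }

// freeList models the runtimefree object reuse cache for Foo (assumption).
var freeList []*Foo

// multiAllocReuse first pops objects from the reuse cache, then allocates
// fresh ones only if the cache runs out.
func multiAllocReuse(n int) []*Foo {
	out := make([]*Foo, 0, n)
	for n > 0 && len(freeList) > 0 {
		out = append(out, freeList[len(freeList)-1])
		freeList = freeList[:len(freeList)-1]
		n--
	}
	for ; n > 0; n-- {
		out = append(out, new(Foo))
	}
	return out
}

// release hands unused objects back in bulk, as leftover loop allocations
// would be passed back to the runtimefree logic.
func release(objs []*Foo) { freeList = append(freeList, objs...) }

func main() {
	a := multiAllocReuse(4)
	release(a[2:])          // 2 objects went unused; return them to the cache
	b := multiAllocReuse(3) // reuses the 2 cached objects, allocates 1 fresh
	fmt.Println(len(b), b[0] == a[3], b[1] == a[2]) // 3 true true
}
```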
