@omartell
Forked from philandstuff/codemesh2015.org
Created November 4, 2015 22:56
Code mesh 2015 notes

Kush, an introduction to schedulers

about me

  • I work for GDS
  • Cabinet Office
  • we started by building GOV.UK
    • replaced older sites like direct gov, business link
  • we’re not just fixing websites
    • we also build and run digital services
    • working with depts across the country
    • eg: register to vote
      • saw about 200k people register at the same time
  • visibility on govt services
  • I’m not here to talk about GDS
    • go to Phil’s talk

this talk

  • it’s been hard to go to a conf in the last two years without hearing about docker (docker docker docker)
  • I want to demystify schedulers
    • some people feel this is too much too soon
    • mesos, kubernetes, etc
  • when you should use schedulers
    • and when you shouldn’t
  • I see schedulers everywhere

what is it?

  • the job of a scheduler is to fairly allocate work
    • fairness is a term of art
    • schedulers act fairly based on their defined objectives
  • example: schedule the expression E = AB + CD
    • E = AB + CD
    • E = (1*2) + (3*4) – explicit ordering
    • tree of expressions
      • addition is commutative, so order doesn’t matter

common scheduling problems

  1. unconstrained
  2. time-constrained
    • all ops completed in a time-bound
      • at least two steps
  3. resource-constrained
    • if we only have one multiplication unit

types of scheduling

  • ASAP (as soon as possible)
  • ALAP (as late as possible)
  • list scheduling
  • force-directed scheduling
    • weighting (commonly time)
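The resource-constrained case above can be sketched as a tiny list scheduler. This is a hypothetical Python sketch; the op names and the one-multiplier constraint are illustrative, not any real scheduler's API:

```python
# List-scheduling sketch for E = A*B + C*D: ready ops fire in order,
# limited by how many functional units of each kind exist per cycle.
ops = {                     # name: (kind, dependencies) -- illustrative
    "m1": ("mul", []),      # A*B
    "m2": ("mul", []),      # C*D
    "add": ("add", ["m1", "m2"]),
}

def list_schedule(ops, mul_units=1, add_units=1):
    done, schedule = set(), []
    while len(done) < len(ops):
        budget = {"mul": mul_units, "add": add_units}
        fired = []
        for name, (kind, deps) in ops.items():
            if name not in done and all(d in done for d in deps) and budget[kind] > 0:
                budget[kind] -= 1       # consume one functional unit
                fired.append(name)
        done.update(fired)
        schedule.append(fired)          # one entry per cycle
    return schedule
```

With one multiplier the schedule takes three cycles; with two, the multiplies run in parallel and it takes two. The resource constraint, not the data dependencies, is what stretches it.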

why?

  • we don’t make the best use of our infrastructure resources

example: linux kernel cpu scheduler

  • decides which ops get CPU utilization right now
  • challenge: make sure applications get a fair share of CPU time
    • (called a “time slice”)
  • let’s look at linux 2.4 cpu scheduler
    • task list containing all processes, and work to be completed
      • guarded by a r/w lock
      • each process has a priority
    • it had a number of shortcomings
      • wasn’t efficient
        • had to iterate through whole task list every time
      • bad in SMP environments
        • could easily pin a single CPU and leave others unused
  • was rewritten
    • removed the global task list lock
    • each process gets its own queue and its own lock
  • rewritten again in 2.6
    • called the Completely Fair Scheduler
    • uses red/black tree instead of queues
  • scheduling algorithms:
    • FF SCHED_FIFO
    • RR SCHED_RR
    • TS SCHED_OTHER (_NORMAL)
  • nice
    • runs a program with modified priority
    • eg: nice -n 19 echo "hello"
  • chrt or “change real time”
    • chrt -p 16
    • can change priority but also scheduling algorithm
  • ionice - set IO scheduling priority
    • ionice -c3 -p89
      • sets process with pid 89 as an idle io process

i see schedulers

  • many queues, locks and a scheduling algorithm over the top
  • not just CPUs!
  • eg in networking (see www.lartc.org)
  • tc - traffic control
    • eg: tc qdisc add dev lo root netem delay 500ms
      • sets queuing discipline
      • network delay of 500ms (netem == network emulation)
  • eg in postgres: has something called autovacuum
    • automates execution of VACUUM and ANALYZE

container schedulers

  • Kubernetes:
type Scheduler interface {
        Schedule(
                *api.Pod,
                ...
        )
}
  • Docker swarm
  • primitives
    • CPU
    • instances
    • add your own: eg response time
  • ephemeral node
    • with a service on it
    • different nodes have different properties in the cloud
      • bad io
      • noisy neighbours
  • see @kelseyhightower’s talks too on container schedulers

others

  • baggage reclaim
    • a bunch of queues
    • a lock
    • you are the scheduling algorithm - you know what you care about, which is your bag
    • baggage reclaim is controlled by a software scheduler
  • premier league football matches
    • domain-specific rules
      • man city and man u can’t play at home on the same day
        • (infrastructure can’t cope)
      • nottingham trent bridge
        • three stadiums next to each other
        • can’t all operate at the same time
  • people
    • pomodoro
    • work planning
  • you can choose to hand over knowledge work to schedulers (but they’re not a silver bullet)
    • can probably do better capacity planning

Amir Chaudhry: Unikernels and hyper-elastic clouds

  • who’s heard of unikernels?
  • who’s actually used them? which ones?
    • halvm
    • mirage
    • rump kernel
  • talk resources

motivation

  • software today
    • is built locally, but deployed somewhere else
    • is complex! (even though most apps are single-purpose)
  • complexity is the enemy

MirageOS

  • OCaml
    • it has a good module system

links

Matthew Podwysocki, Putting You Back in Charge of Your Data

Zach Tellman, Everything Will Flow

Queues!

  • they’re simple, right?
    • producer -> queue -> consumer
  • they’re everywhere
  • java.util.concurrent.Executor
    • thread pool
    • BlockingQueue on requests
  • java.util.concurrent.BlockingQueue
    • producer -> [notFull] [buffer] [notEmpty] -> consumer
  • java.util.concurrent.locks.Condition
    • software threads
    • hardware threads
    • more queues to schedule the threads
  • queues are everywhere!

why?

  • queues separate the “what” from the “when”

queueing theory

closed systems

  • one-at-a-time
  • shell prompt
    • command
    • output
    • command
    • output
  • single person browsing a website
    • click link
    • wait for page to load
    • click link
    • wait for page to load

open systems

  • requests come in
    • while we’re serving one, another can come in at any point

simulating open systems

  • producers: exponential distribution
    • it’s memoryless – the amount of time passed doesn’t change likelihood of producers arriving
  • consumers: pareto distribution
  • simulation examples
    • failure mode: load increases until queueing time dominates
    • the more consumers you add, the longer it takes to fall over
      • …but the more surprising it is when it does
      • echoes of “why complex systems fail” here
  • lesson: unbounded queues are fundamentally broken
  • we’ve made assumptions about how users behave, but haven’t told them, or held them to it
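A minimal open-system simulation along these lines can be sketched with Python's stdlib distributions. The rates, the Pareto shape, and the single consumer are assumptions for illustration, not the talk's actual simulation:

```python
import random

def simulate(arrival_rate, service_shape, n=10_000, seed=1):
    """Open-system sketch: exponential interarrivals (memoryless producers),
    Pareto-distributed service times, one consumer; returns the mean wait."""
    random.seed(seed)
    clock = 0.0        # arrival clock
    free_at = 0.0      # when the consumer next becomes free
    total_wait = 0.0
    for _ in range(n):
        clock += random.expovariate(arrival_rate)     # next arrival
        start = max(clock, free_at)                   # may queue first
        total_wait += start - clock                   # time spent queued
        free_at = start + random.paretovariate(service_shape)
    return total_wait / n
```

Pushing the arrival rate toward service capacity makes queueing time dominate, which is the failure mode described above.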

dealing with too much data

  • options:
    1. drop the data
    2. reject the data
    3. pause the data
  • Drop
    • only valid if newer data obsoletes older data, or if we just don’t care
      • eg statsd uses UDP by default
        • problem: when we’re most in need of high fidelity data, we’re least likely to have it, because UDP is best-effort and will degrade under load
  • Reject
    • eg twitter fail whale
    • often, this is the only choice available
  • Pause (aka backpressure)
    • often the only choice for a closed system, library, or subsystem
    • backpressure swims upstream
      • back to java BlockingQueue:
      • producer -> [notFull] [buffer] [notEmpty] -> consumer
        • the extra pieces are for backpressure reasons
        • you don’t even need the buffer here!

unbuffered queues

  • producer -> [puts] [takes] -> consumer
  • unbuffered queues are closed systems
  • why have buffers if we don’t need them?
    • because backpressure isn’t free
      • restarting threads
      • restarting TCP traffic
      • reading from disk
      • reading from dbs
    • a buffer allows higher, more consistent throughput
      • at a cost of potentially higher latency, because things sit on the buffer waiting to be dealt with
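The buffer/backpressure tradeoff can be sketched with Python's `queue.Queue`; the buffer size and the consumer's sleep are illustrative:

```python
import queue
import threading
import time

# A bounded buffer: put() blocks when full, so a slow consumer exerts
# backpressure on the producer instead of letting the queue grow without bound.
buf = queue.Queue(maxsize=2)

def producer(items):
    for i in items:
        buf.put(i)          # blocks here whenever the consumer falls behind
    buf.put(None)           # sentinel: no more items

def consumer(out):
    while (item := buf.get()) is not None:
        time.sleep(0.01)    # deliberately slow consumer
        out.append(item)

received = []
p = threading.Thread(target=producer, args=(range(5),))
c = threading.Thread(target=consumer, args=(received,))
p.start(); c.start(); p.join(); c.join()
```

A larger `maxsize` would smooth the producer's throughput, at the cost of items sitting longer in the buffer, exactly the latency tradeoff above.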

lessons

  • we should always plan for too much data
  • backpressure
  • TCP bufferbloat - example of problem of too many composed buffers

concrete example

  • aim: a data system that takes data and gets it to a safe at-rest place
  • arbitrary volume of data -> ELB -> [thing] -> S3
    • what goes in the middle?
      • must horizontally scale
      • must be low-maintenance
      • not real-time
      • loss of data ∝ loss of utility
    • used:
      • aleph -> hadoop s3 client -> s3
    • but then:
      • S3 starts rejecting writes
      • hadoop starts accumulating writes waiting for S3 to come back
  • replaced hadoop s3 client with:
  • still had problems:
    • still acquired data in durable-queue without bound
    • no visibility when this happened; no metrics
    • had to start writing metrics for this
  • netty’s threading model
    • our typical metrics measure from when we start actually processing the request
      • but this ignores the queueing delay!
      • threadpools are not well instrumented
  • dirigiste for better understanding of behaviour of thread pools
    • distinguish task latency from queue latency
    • better metrics
      • not complete – they can’t be
      • whole system includes client

lessons

  • be clear what “completion” is
  • be picky about your tools
    • watch for sneaky unbounded queues!
  • prefer metrics to raw performance
  • you’re never measuring the complete system

publishing

  • eg chat server
  • one-to-many queue
  • timeouts

we talked about queues

  • unbounded queues aren’t. they’re bounded by something we’ve failed to define.

Q&A

  • how do you deal with pain from low defaults (eg in haproxy?)
    • you need some documentation on what the knobs are and why you should care
    • a one-size-fits-all parameter is bound to disappoint
  • with the log-latency graphs you showed with the buffered queue example, is that a useful visibility tool for day-to-day systems we use?
    • that’s my general intuition
    • most people use graphite
    • doing distributions in graphite is hard
  • if we follow your example of using buffered queues, were there any metrics you feel we should definitely be tracking?
    • utilization & contention
      • once you get above 60-70% utilization, exhaustion gets quite likely
  • queueing theory often assumes a fixed number of consumers; can it work in an autoscaling world?
    • yes
    • maybe cloud services resemble infinite consumers?
    • can you do formal analysis of a dynamic system like this and predict exactly how it will behave? probably not
      • better to test, benchmark, see what it does under load

Evan Czaplicki, Accidentally Concurrent

  • works at Prezi on the Elm language
  • I care a lot about how proglang research from last 30 yrs can be useful day-to-day on the web
    • eg: why is immutability good?

message passing concurrency

  • I take this from concurrent ML
  • lightweight threads
  • basic primitives:
    • channel: () -> Channel a
    • send: Channel a -> a -> ()
    • recv: Channel a -> a
    • spawn: (a -> ()) -> a -> ThreadId
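The four primitives can be sketched with Python threads. A size-1 `queue.Queue` stands in for a channel here; a real concurrent ML channel is a synchronous rendezvous, which this only approximates:

```python
import queue
import threading

def channel():                  # channel : () -> Channel a
    return queue.Queue(maxsize=1)

def send(ch, v):                # send : Channel a -> a -> ()
    ch.put(v)

def recv(ch):                   # recv : Channel a -> a
    return ch.get()

def spawn(f, arg):              # spawn : (a -> ()) -> a -> ThreadId
    t = threading.Thread(target=f, args=(arg,))
    t.start()
    return t

# tiny usage: a spawned thread sends, the main thread receives
ch = channel()
tid = spawn(lambda c: send(c, "pong"), ch)
msg = recv(ch)
tid.join()
```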

mutation

  • what does mutation look like when you model it as a concurrent system with these primitives?
    • type Msg a = Get | Set a
    • “two spies in the park” approach to communication
    • a new thread per variable
      • becomes unmanageable

encapsulation

function BadGuy() {
    var x = 0;
    var y = 0;
    var health = 100;
    this.follow = function(hero) { ... };
    this.damage = function(n) { ... };
}
  • hides multiple mutable cells behind a single interface
  • introduces hierarchy
  • but: hierarchy ≠ modularity
    • reference can be passed from one code unit to another
      • introduces extra comm lines between subtrees of hierarchy
    • getters and setters on objects
    • setters mean readers have ability to bash on objects

schedulers

  • different kinds
    • cooperative
      • runs a thread until it’s done or it yields
      • used by golang, …
    • preemptive
      • you can bail on something half-way through
      • used by erlang, concurrent ML, …
  • that old thing concurrency ≠ parallelism
  • event loop
    • “a poor man’s concurrency”
    • javascript setTimeout(callback, 0); is the same as spawn callback()
  • “Reactive” Programming
    • “everything is a stream!”
      • type alias Stream a = Channel a
      • build things up with combinators
      • map: (a->b) -> Stream a -> Stream b
    • glitches
      • diamond communication shape
diamond : Stream Float -> Stream Bool
diamond stream =
  merge
    (==)
    (map sqrt stream)
    (map sqrt stream)
  • glitches cont’d
    • looked at as a concurrent system, it’s clear that this will cause glitches
      • there’s no reason to expect synchronization
    • if you thought of your streams as a concurrent system, you probably wouldn’t write it this way in the first place
  • FlatMap is weird!
    • flatMap : (a -> Stream b) -> Stream a -> Stream b
    • it is kind of like a threadpool, but really bizarrely done
    • is it even a reasonable use of flatmap?
  • dependencies between streams become a rat’s nest
    • hero depends on keyboard
      • hero’s health depends on location, bad guys, etc
    • bad guys depend on hero
      • want to chase him
    • pause screen depends on
      • hero’s health
      • keyboard, mouse
    • hero depends on pause screen
      • maybe I can apply a potion to increase health
    • this is crazy

big points

  • writing high-quality concurrent programs is hard
  • everyone is writing accidentally concurrent programs

better stuff

  • immutability
    • means no accidental communication channels
    • revealing information doesn’t change your architecture

Q&A

  • what about Rust’s approach to mutability where it just emphasizes ownership?
    • ownership and encapsulation are separate concerns

Leah Hanson, How Julia Goes Fast

what problem is julia solving?

  • julia is for scientists
    • engineers, mathematicians, finance people
    • non-professional programmers who use programming as a tool
      • least time programming for maximum benefit
  • what do they need?
    • easy to learn
    • easy to use
    • good for scripts
    • good for libraries
    • fast enough for medium to large data sets
    • fast, extensible math
      • linear algebra
      • new numeric types
  • easy and fast
  • how is it better than what they already use? (eg numpy)
    • need to learn python
    • need to add your own function that doesn’t yet exist
      • implement in python
        • too slow
        • (numpy’s things are written in C with python wrappers)
        • learning C is not a good use of our users’ time
  • fast julia code is written in julia

the big decisions

  • dynamic, interpreted, easy

background for implementation

  • JIT compilation
    • at run time, you alternate between compiling and running code
  • compiler needs to be fast
    • static compilers can be slow because they don’t contribute to runtime
    • this is no longer true for JITs
  • compiler has access to runtime information
  • julia is designed up front to be JIT compiled
    • type system
      • abstract or concrete types
        • concrete types:
          • can be instantiated
          • determine mem layout
          • types can’t be modified after creation
          • can’t have subtypes
          • single supertype
        • abstract types:
          • can’t instantiate
          • new subtypes can be added at any time
          • single supertype
          • zero or more subtypes
  • multiple dispatch
    • convenience for extensible math
    • single dispatch is method dispatch we’re familiar with
      • x.foo(y) – code impl depends on type of x
    • multiple dispatch can depend on any param:
      • foo(x,y) – code impl depends on types of x and y
    • all named fns are generic
    • each fn has one or more methods
x = ModInt(3,5)
x + 5 # ModInt + Int
5 + x # Int + ModInt

function Base.+(m::ModInt, i::Int64)
    return m + ModInt(..)
end
  • this just can’t be done nicely with single dispatch
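The same shape can be mimicked in a dynamic language. A hypothetical Python sketch with a method table keyed on the types of all arguments; `ModInt`, `defmethod`, and `dispatch` are illustrative names, not Julia's machinery:

```python
# Multiple dispatch sketched in Python: the implementation is chosen by the
# types of *all* arguments, not just the first (as in x.foo(y)).
class ModInt:
    def __init__(self, value, modulus):
        self.value, self.modulus = value % modulus, modulus

_methods = {}

def defmethod(name, sig, impl):
    _methods[(name, sig)] = impl

def dispatch(name, *args):
    impl = _methods[(name, tuple(type(a) for a in args))]
    return impl(*args)

# one method per concrete signature, like extending Base.+ in Julia
defmethod("add", (ModInt, int), lambda m, i: ModInt(m.value + i, m.modulus))
defmethod("add", (int, ModInt), lambda i, m: ModInt(i + m.value, m.modulus))
```

`dispatch("add", ModInt(3, 5), 4)` and `dispatch("add", 4, ModInt(3, 5))` resolve to different entries; with single dispatch the `Int + ModInt` case has no natural home.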

the details of how things go fast

dispatch

  • dispatch happens a lot, so needs to be really fast
    • foo(5) #dispatch, compile, run
    • foo(6) #dispatch, run
    • foo(7) #dispatch, run
  • possible signatures:
    • foo(Int64)
    • foo(Number)
    • foo(Float64)
    • foo(Int64,Int64)
  • do we need to keep redispatching? aren’t we always going to pick foo(Int64)?
    • add a cache
    • foo(5) #cache, dispatch, compile, run
    • foo(6) #cache, run
    • foo(7) #cache, run
    • call into hashmap rather than going through a list

generic functions

  • define Base.* in terms of +(Number,Number) and -(Number,Number)
  • C++: templates vs dynamic dispatch
    • dynamic dispatch is generic
      • single Base.* impl to cover all Number instances
    • templates do aggressive specialization
      • individual Base.* impl for each Number instance
    • tradeoff: code size vs speed
  • inlining
    • makes more optimizations available
    • not reusable
  • devirtualization
    • write down the IP to avoid DNS
    • ie hardcode the method implementation rather than doing dispatch
  • why do we normally dispatch?
    • we’re avoiding dispatch as much as possible for performance
    • worth knowing why we have it in the first place
      • repl work
        • redefine a function to fix a bug, want callers to reflect changes
        • issue 265

type stability

  • non-concrete types have to be boxed
  • immutable types
    • kind of concrete type
    • once created, value can’t change
  • boxed immutable values are bad for perf
    • immutable types can’t appear to change
    • we have to copy them each time we want to change them
    • new heap allocation

macros for speed?

  • julia has lisp-style macros
  • macros are evaluated at compile time
  • macros should be used sparingly
  • Horner’s rule
    • a*x*x + b*x + c
    • skip a multiply with (a*x + b)*x + c
    • for a cubic: 6 multiplies down to 3
    • for a quartic: 10 multiplies down to 4
    • etc (O(n^2) down to O(n))
  • Horner’s rule can be implemented as a macro
  • 4x faster than matlab
  • 3x faster than scipy
    • both of which call C/Fortran libs
  • Julia is faster than C!
    • compiled julia methods will have inlined constants, which are very optimizable
    • C would involve a run-time loop over the array of coeffs
    • LLVM can optimize the julia better than the C
    • an abstraction which makes things faster
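Horner's rule itself is tiny; a Python sketch (the run-time loop version, not the Julia macro):

```python
def horner(coeffs, x):
    """Evaluate a polynomial by Horner's rule: a*x^2 + b*x + c becomes
    (a*x + b)*x + c -- one multiply per coefficient instead of one per power."""
    acc = 0
    for c in coeffs:        # coefficients from highest power to constant
        acc = acc * x + c
    return acc
```

`horner([1, 2, 3], 2)` evaluates x² + 2x + 3 at x = 2. The macro version unrolls this loop at compile time with the coefficients inlined as constants, which is where the speedups above come from.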

conclusion

  • scientists like easy & fast
  • dynamic for easy (but with limits)
  • compiled for fast
  • multiple dispatch for making fast code concise

Q&A

  • what about R?
    • R is slow for things that need loops (like monte carlo)
  • does julia have support for openmp?
    • I think so?
    • you can call libs that use OpenMP
    • there’s experimental support for OpenMP within julia

John Hughes, Mary Sheeran, Why Functional Programming Matters

FP à la 1940s

  • who needs booleans?
  • a boolean just makes a choice! They can just be functions!
    • true x y = x
    • false x y = y
  • we can then define if-then-else
    • ifte bool t e = bool t e
  • who needs integers?
    • a (positive) integer just counts loop iterations
      • two f x = f (f x)
      • one f x = f x
      • zero f x = x
    • to recover a “normal” integer…
      • two (+1) 0
        • gives 2!
    • add m n f x = m f (n f x)
    • mul m n f x = m (n f) x
  • factorial!
fact n =
  ifte (iszero n)
    one
    (mul n (fact (decr n)))

This means: booleans, integers, (and other data structures) can be entirely replaced by functions!

Church encodings
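The encodings above can be tried directly in a dynamically typed language, which sidesteps the type-inference trouble described below; a Python sketch:

```python
# Church encodings in Python: booleans and numerals as plain functions.
true  = lambda t, e: t                       # true x y = x
false = lambda t, e: e                       # false x y = y
ifte  = lambda b, t, e: b(t, e)              # ifte bool t e = bool t e

zero = lambda f: lambda x: x                 # apply f zero times
one  = lambda f: lambda x: f(x)
two  = lambda f: lambda x: f(f(x))

add = lambda m, n: lambda f: lambda x: m(f)(n(f)(x))
mul = lambda m, n: lambda f: lambda x: m(n(f))(x)

to_int = lambda n: n(lambda k: k + 1)(0)     # recover a "normal" integer
```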

Before you try this at home…

  • “cannot construct the infinite type…”
    • fact over Church numerals would need the higher-rank type (forall a. (a->a) -> a -> a), which plain Hindley-Milner inference rejects

Factorial à la 1960

  • LISP!
  • This got lots of people excited!
    • esp Peter Landin (see The Next 700 Programming Languages and ISWIM)

laws

  • (maplist f (reverse l)) = (reverse (maplist f l))
    • …only if f is side-effect free

issues with conventional languages

  • John Backus, “Can Programming Be Liberated from the von Neumann Style? A Functional Style and Its Algebra of Programs”
    • Turing award 1977; paper 1978
  • “Inherent defects at the most basic level cause them [conventional languages] to be both fat and weak”
  • defects such as:

word-at-a-time operation

  • inherited from the von Neumann machine
    • the von Neumann way of designing machines has a terrible effect on writing programs
    • don’t focus on the word-at-a-time bottleneck; focus on whole values

inability to use “combining forms”

  • what we would today call “higher order functions”
  • he called map f “αf”

lack of useful mathematical properties for reasoning about programs

  • [f1,f2,..,fn] ∘ g ≡ [f1 ∘ g, f2 ∘ g,…, fn ∘ g]
  • g ∘ [f1,f2,..,fn] ≡ [g ∘ f1, g ∘ f2,…, g ∘ fn]

Got a bit APLy:

Def IP = (/ +) ∘ (α ×) ∘ Trans

Peter Henderson

  • functional geometry
  • escher’s square limit fish
  • devised an algorithmic definition of escher’s picture
    • operators on pictures such as rot, over, above
    • higher-order operators cycle, quartet
  • then: define pictures as functions
    • p(a,b,c) is the picture at position a, with height along vector b and width along vector c
    • other combinators become easy to describe here:
      • eg over (p,q) (a,b,c) becomes p(a,b,c) ∪ q(a,b,c)
  • this functional geometry avoids Backus’s defects:
    • combining forms
    • operations on whole values
    • algebra with laws
  • and as a bonus: functions as representations
  • taking it further: Conal Elliot’s Pan

FP à la 1990s

  • Haskell: Paul Hudak got DARPA money to work on Haskell as a prototyping language
  • functions as data
    • type Region = Point -> Bool
      • predicate representation of set of points
  • reactions…
    • “too cute for its own good”
    • “higher-order functions just a trick, probably not useful in other contexts”

Lazy evaluation

  • Henderson and Morris, A lazy evaluator
  • Friedman and Wise, CONS should not evaluate its arguments
  • “The Whole Value” can be infinite!
    • the infinite list of natural numbers [0, 1, 2, 3, …]
    • all the iterations of a function
    • iterate f x = [x, f x, f (f x), ...]
  • some things can be expressed very nicely with lazy eval
    • eg newton-raphson, numerical derivative

my paper

  • John Hughes, Why Functional Programming Matters
  • lazy producer-consumer pattern
    • producer: successive approximations of a numerical algorithm (eg derivative, integral)
      • consumer: limit detection
    • producer: search space
      • consumer: search strategy
  • pretty printing
    • make choices between horizontal and vertical layout
      • vertical: preserve indentation
      • horizontal: discard indentation
    • combinators: a $$ b puts a above b; a <> b puts a beside b
    • combinators obey laws:
      • <> and $$ are both associative
    • once we understand the laws, the implementation can use the laws to make the best layout

my other paper

  • Koen Claessen, John Hughes, “QuickCheck: A Lightweight Tool for Random Testing of Haskell Programs”
  • randomly generate test cases from properties (properties based on laws)
  • then shrink failing test cases to minimal counterexamples
  • This is another lazy producer-consumer!
    • producer: space of all possible tests
    • consumer: QuickCheck search strategy

more applications

  • Introduction to VLSI systems
    • was intended to revolutionize programming
    • did revolutionize hardware design
  • A VLSI circuit is not a million miles away from Escher’s square limit!
    • we used functional geometry ideas to design VLSI circuits
    • we needed to be certain we hadn’t made a mistake, as the feedback cycle from the fab was too slow (9 months)
  • use Backus’s FP to create a language that describes streaming circuits
    • then use algebraic laws of the language to optimize the circuits
    • symbolic simulator to test
    • muFP
  • hardware sorting based on doubly-recursive definition:
  • having a language of combining forms makes it much easier to reason about sorting networks
    • though there are still outstanding unsolved problems
  • Satnam Singh: Lava language http://blog.raintown.org/p/lava.html
  • Pentium float division bug
    • very costly mistake
    • they suddenly got interested in formal verification
    • forte
  • bluespec bsv
    • bluecheck
      • QuickCheck in hardware design!
      • iterative deepening and shrinking on FPGA designs!

four key ideas

  • functions as representations
  • whole values
  • simple laws
  • combining forms

Martin Thompson, A Quest for Predictable Latency

  • subtitle: Adventures in Java Concurrency
  • @mjpt777
  • latency: responding in a timely manner
  • the more things you try to juggle, the harder it becomes to predict how responsive you can be
  • if you don’t respond in a timely manner, you’re effectively unavailable

causes of blocking

  • concurrent algorithms
    • notifying completion
    • mutex
    • synchronization/rendezvous
  • systemic pauses
    • JVM Safepoints (GC etc)
    • Transparent Huge Pages (linux)
    • Hardware

concurrent algorithms

  • two approaches: locks and CAS
    • both are hard to get right!

what is latency?

  • queueing theory
    • response time is sum of:
      • enqueue time
      • latent time (time in queue)
      • dequeue time
      • service time (time being processed)
  • queueing theory focuses on latent time and service time, but sometimes enqueue and dequeue dominate!
  • relationship between utilization and increased response time

adventures with locks and queues

  • evils of blocking
  • condition variables
    • condition variable RTT (echo service)
    • request: ping; response: pong
    • measure histogram of response times
      • don’t use averages! they lie!
    • 90th percentile: 8 µs
    • 8 µs is loads of time to talk between two threads!
      • 2 µs is enough to go between two machines on the same network!
    • but it gets worse at higher percentiles
      • max is 5525 µs!
  • bad news: this is the best case scenario!
  • context for benchmarks:
  • producers: 1, 2 or 3 (can’t have more on a quad-core machine!)
  • measuring: mean and 99th percentile
  • baseline (wait-free) vs ArrayBlockingQueue vs LinkedBlockingQueue vs ConcurrentLinkedQueue
  • burst length: 1 then 100
    • real world traffic is bursty
    • send a burst of 100; wait for 1 ack
  • other considerations:
    • Backpressure (queues should have bounded size)
    • size methods
    • flow rates
    • garbage
    • fan out
  • in the real world, I cannot use java’s built-in queues, partly because they lack features I need
    • if you call .size() on ConcurrentLinkedQueue, it walks the whole list

some alternative FIFOs

the disruptor

  • get over the garbage problem
    • preallocate a ring buffer and reuse it
  • producers claim a slot using CAS
    • once a slot is claimed, can write to the slot
    • once written, can update the cursor
  • consumer updates gating
  • no need for locks!
  • wrinkle: how do I deal with setting the cursor?
    • I can’t update cursor to n before it has reached n-1
    • thread claims slot; gets swapped out before writing to slot
      • blocks whole queue until thread swaps back in (many ms!)

update in Disruptor 3.0

  • cursor is used as CAS loop
  • add an “available” array to mark when things are written
  • made a huge difference

what if I just need a queue?

  • disruptor isn’t best used as a queue (better for coordinating graph of dependencies)
  • ManyToOneConcurrentArrayQueue (influences: Lamport + Fast Flow)
  • link?
  • tail: CAS counter
  • head: counter
  • reference ring buffer
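A much-simplified single-producer single-consumer version of this counter scheme can be sketched in Python. The real queue is many-to-one and claims the tail with a CAS; here the head and tail are plain ever-increasing counters:

```python
class SpscRing:
    """Single-producer single-consumer ring buffer sketch: head and tail are
    ever-increasing counters; the slot index is counter % capacity."""
    def __init__(self, capacity):
        self.buf = [None] * capacity
        self.capacity = capacity
        self.head = 0                        # advanced by the consumer
        self.tail = 0                        # advanced by the producer

    def offer(self, item):
        if self.tail - self.head == self.capacity:
            return False                     # full: a backpressure signal
        self.buf[self.tail % self.capacity] = item
        self.tail += 1                       # publish only after the write
        return True

    def poll(self):
        if self.head == self.tail:
            return None                      # empty
        item = self.buf[self.head % self.capacity]
        self.head += 1                       # free the slot for reuse
        return item
```

Keeping the counters monotonic (rather than wrapping them) makes full/empty checks simple subtractions, which is the trick the reference implementations rely on.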

back to the benchmarks

  • Disruptor and ManyToOneConcurrentArrayQueue behave better in low contention
  • gets worse under the bursty case
  • each over 50 µs at this stage
  • spinning CAS loops are the great equalizer

inter-process FIFOs

  • ring buffer
  • spinning CAS on tail counter
  • write message once space claimed
  • write header to indicate completion
    • (zero out memory on consumption to make this work)
  • MPSC Ring Buffer
  • latency is good but throughput takes a hit from zeroing memory

Aeron IPC

  • http://highscalability.com/blog/2014/11/17/aeron-do-we-really-need-another-messaging-system.html
  • we wanted a CRDT
    • replicated to another machine
    • may be reordered or duplicated
    • want to reassemble deterministically
  • big shared-memory file
    • tail pointer (but no head this time)
    • advance tail, write message, write header (just as above)
  • does the file really just go on forever?
    • problems with this: page faults; page cache churn; VM pressure
    • wash one, wear one, dry one
    • active file, dirty file, clean file
  • XADD instruction
    • available in JVM now
    • ring buffers can’t use this because the tail can overtake the head
    • if you move the tail beyond the end of the buffer, you can detect this
      • your responsibility is to rotate the buffer now
    • messages persist while active & dirty buffers are around
  • append-only data structures can be safe to read without locks
    • they can detect when they’ve reached the end
  • zeroing problem from before: how do we avoid the throughput problem?
    • do it in a background thread!
    • the old “dirty” becomes the new “clean”

back to the benchmarks!

  • aeron IPC is slower than ManyToOneConcurrentArrayQueue in the happy case (no burst, one producer, average time)
  • for bursty, 3 producers, 99th percentile:
    • ManyToOneConcurrentArrayQueue breaks below 50µs to 34µs
      • less “False Sharing”
      • inlined data vs Reference Array
        • less “Card Marking”
    • Aeron gets even further down to 13µs (ish?)
      • also mean is 10µs so this is phenomenal predictability
      • avoids the spinning CAS

logging is a messaging problem!

  • the abominations we have in Java for logging are disgusting!
    • the APIs, the design, etc
  • what design should we use?

where can we go next

  • C vs Java
    • Spin loops
    • modern processors do out-of-order speculation
    • with spin loops, they speculate wrong
      • Proposed: Thread.spinYieldHint()
    • data dependent loads
      • heap aggregates: objectlayout
      • stack allocation - value types
    • memory copying
      • baseline against memcpy() for differing alignment
    • queue interface
      • break conflated concerns to reduce blocking actions
      • API design mistakes
        • conflates “is something available?” with “is it empty?”
        • should not have used normal collections API for concurrent stuff

conclusion

  • AWS recently announced x1 instances
    • 2TB RAM + 100+ vcores!
    • lots of spinning wheels

Q&A

  • can you tell us more about XADD in java?
    • AtomicInteger, AtomicLong, etc

Martin Kleppmann, Transactions: Myths, Surprises and Opportunities

history

  • 1975: IBM System R
  • 2008? NoSQL
  • 2012? “NewSQL”
    • eg SQL on top of HBase
  • really a movement of transactions-or-not

ACID!

  • Härder & Reuter, 1983
  • “More mnemonic than precise” - Brewer, 2012

D for Durability

  • archive tape?
  • fsync to disk?
  • replication?

C for Consistency

A for Atomicity

  • not the same as in “atomic compare-and-swap”
  • not about concurrency
  • it’s about fault tolerance
  • rolls back writes on abort
  • better word might have been Abortability
  • aborts are super helpful because they collapse a whole bunch of error classes:
    • crashes, power failures
    • constraint violation
    • network fault
    • deadlock

I for Isolation

  • this is where we start talking about concurrency
  • = SERIALIZABLE?
  • in practice, multiple isolation levels
    • Repeatable Read
    • Snapshot Isolation
    • Read Committed
    • Read Uncommitted
  • can anyone here describe the difference off the top of their head? (one hand goes up)
  • this is a configurable property of many databases
    • though many don’t go as high as Serializable
    • and some call it Serializable but are actually weaker than this
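
The four levels listed map directly onto the JDBC API, where isolation is a per-connection setting applied via Connection.setTransactionIsolation. A sketch of the constants only; as noted above, what each level actually guarantees still varies by database:

```java
import java.sql.Connection;

public class IsolationLevels {
    public static void main(String[] args) {
        // JDBC exposes the four ANSI levels as constants on java.sql.Connection;
        // a real program would apply one with conn.setTransactionIsolation(level)
        int[] levels = {
            Connection.TRANSACTION_READ_UNCOMMITTED, // weakest
            Connection.TRANSACTION_READ_COMMITTED,   // default in Postgres, Oracle
            Connection.TRANSACTION_REPEATABLE_READ,
            Connection.TRANSACTION_SERIALIZABLE      // strongest, in theory
        };
        for (int level : levels) System.out.println(level);
    }
}
```

A database may accept TRANSACTION_SERIALIZABLE and still provide something weaker — the constant names promise nothing about the implementation behind them.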

isolation levels

preventing dirty reads and dirty writes

  • read committed prevents these anomalies
  • default in postgres, oracle, …
  • that’s enough for transactions, right?
  • what are the other weird isolation levels?

preventing read skew

  • this can happen under read committed!
  • writer transaction: add 100 to x, subtract 100 from y, commit
  • reader transaction, interleaved: read y, then read x
  • reader can see old value of y, but new value of x
  • Repeatable Read and Snapshot Isolation were invented to prevent this
  • Traditional RR implementation is based on locking
  • SI is used more often now to avoid locks
    • poorly named in most DBs

write skew

  • invariant: at least one doctor on call at all times
  • doctors can trade shifts
  • transaction:
    • count how many doctors are on call
    • if >=2:
      • update
    • commit
  • but: two txns operate concurrently for two different doctors:
    • both see n >= 2
    • both remove themselves from rota
    • invariant violated!
  • pattern:
    • measure
    • decide
    • write
    • commit
    • by the time the write is committed, the premise of the decision is no longer true!
  • solutions
    • two-phase locking (pessimistic)
      • shared lock on measure step
        • block entire table for writes, whenever you make a large read
      • poor for perf
      • motivation for weaker isolation levels
      • the System R team simply didn’t hold locks for as long
        • this is what created their crazy isolation levels
    • H-Store (Stonebraker et al)
      • VoltDB, datomic
      • literally execute your txns in a serial order
        • trivially correct
        • as long as you make your txns fast enough!
          • otherwise your throughput will be abysmal
    • SSI (Cahill, Röhm & Fekete)
      • Postgres
      • detect conflicts & abort
        • optimistic counterpart to pessimistic 2PL
      • Idea: locks like in 2PL
        • but locks don’t block, just gather information
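
A minimal in-memory simulation of the doctors example (plain Java sets standing in for the database; the names are illustrative): both transactions take their measurement before either one writes, so each decision's premise is stale by commit time and the invariant is violated.

```java
import java.util.HashSet;
import java.util.Set;

public class WriteSkew {
    public static void main(String[] args) {
        Set<String> onCall = new HashSet<>();
        onCall.add("alice");
        onCall.add("bob");

        // measure: both concurrent transactions see the same snapshot
        int seenByAlice = onCall.size(); // 2
        int seenByBob = onCall.size();   // 2

        // decide + write: each premise was true when measured, false when committed
        if (seenByAlice >= 2) onCall.remove("alice");
        if (seenByBob >= 2) onCall.remove("bob");

        // invariant "at least one doctor on call" is now violated
        if (!onCall.isEmpty()) throw new AssertionError();
        System.out.println("doctors on call: " + onCall.size());
    }
}
```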

what about other distributed systems?

  • microservices?
  • stream processing?
  • serializability across services?
    • atomic commitment
    • total ordering
    • consensus
      • paxos, raft, zab
  • strong connection between all of these things
    • coordination
    • failure amplification
  • geographic distribution makes this worse
  • distributed txns don’t work particularly well

without cross-service transactions

  • so where are we now?
  • compensating transactions
    • ≅ abort/rollback at app level
    • looks a bit like ACID A
  • apologies
    • detect & fix constraint violations after the fact, rather than preventing them
      • overselling beyond stock level; issue voucher instead
    • looks a bit like ACID C
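
A sketch of the compensating-transaction pattern (the booking steps are invented for illustration): each completed step registers an undo action, and on failure the undo stack is replayed in reverse — the app-level analogue of abort/rollback.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

public class Compensation {
    public static void main(String[] args) {
        List<String> log = new ArrayList<>();
        // stack of compensations, replayed in reverse order on failure
        Deque<Runnable> undo = new ArrayDeque<>();
        try {
            log.add("flight booked");
            undo.push(() -> log.add("flight cancelled"));

            // second service call fails partway through the "transaction"
            throw new IllegalStateException("hotel sold out");
        } catch (IllegalStateException e) {
            while (!undo.isEmpty()) undo.pop().run(); // app-level rollback
        }
        System.out.println(log); // [flight booked, flight cancelled]
    }
}
```

Unlike a real ACID abort, the intermediate state was visible to other services while the compensation was pending — which is exactly why this only "looks a bit like" atomicity.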

every sufficiently large deployment of microservices contains an ad-hoc, informally-specified, bug-ridden, slow implementation of half of transactions

  • ordering of events
    • unfriend
    • post nastygram to remaining friends
  • friends service and posts service are separate services
  • notifications services pushing out emails etc, reading from posts and friends services
  • reordering due to delays
  • notification goes to old set of friends, not new set of friends. oops!
  • implicit causal relationships leading to violated expectations

current research

  • things between strict serializability and eventual consistency
  • causality:
    • partial order of reads/writes
    • can be maintained without global coordination
    • “consistent snapshot” in SI
      • consistent with causality
    • but also efficient?
  • isolation levels are divided into those that require coordination and those that don’t
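
One standard way to maintain this causal partial order without global coordination is vector clocks (a sketch, not from the talk): each process tracks a counter per process, and comparing two clocks componentwise tells you whether one event happened before the other or whether they are concurrent.

```java
public class VectorClocks {
    // a happened-before b iff a <= b componentwise and a != b;
    // if neither happened before the other, the events are concurrent
    static boolean happenedBefore(int[] a, int[] b) {
        boolean strictlyLess = false;
        for (int i = 0; i < a.length; i++) {
            if (a[i] > b[i]) return false;
            if (a[i] < b[i]) strictlyLess = true;
        }
        return strictlyLess;
    }

    public static void main(String[] args) {
        int[] e1 = {1, 0};
        int[] e2 = {2, 1}; // causally after e1
        int[] e3 = {0, 2}; // concurrent with e1

        if (!happenedBefore(e1, e2)) throw new AssertionError();
        if (happenedBefore(e1, e3) || happenedBefore(e3, e1))
            throw new AssertionError("e1 and e3 should be concurrent");
        System.out.println("partial order checked");
    }
}
```

The ordering is only partial — exactly the property exploited above: causally related events are ordered, concurrent ones need no coordination at all.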

references

  • four slides’ worth!

Q&A

  • how did we end up with this mess? naming etc
    • on the academic side, the database community and distributed systems community have historically been completely separate
      • evolved separate language for similar concepts
    • wouldn’t have many of current problems if they had talked to each other in 1980s
    • SQL standard had problems with isolation levels
      • failed to describe them in an implementation-neutral way
      • this is why Postgres uses SI but calls it REPEATABLE READ
  • also see hermitage, a tool to test what kind of anomalies different isolation levels of dbs allow

Elise Huard, FRP and Functional Game Programming

  • FRP intro
    • signals, generators, etc
player <-
  transfer2 initialPlayer
            movePlayer
            directionKey
            gameOver'

monster <-
  transfer3 initialMonster
            wanderOrHunt
            player
            randomNumber
            gameOver'

gameState = GameState <$> renderState <*> soundState

Dependencies:

  • player depends on gameOver
  • but gameOver depends on player!

Solution: introduce delay: gameOver' = delay False gameOver

generator :: Signal (SignalGen a) -> SignalGen (Signal a)

playLevel :: Signal (Bool, Bool, Bool, Bool) -- event signals
             -> LevelNumber -- pattern match on level number
             -> Score
             -> Health
             -> SignalGen (...)

cons of FRP?

  • if your game is sufficiently complex, you’ll have a lot of streams to manage
    • if you add a new stream, you might need to make every other stream tap into it
      • laborious for maintenance
  • some added complexity in handling side effects
  • performance?

pros of FRP

  • conceptually simpler
    • specifically, smaller units
  • testability
    • everything is a pure function
    • can exploit property-based testing
    • can record a user session to replay events
      • can use this to verify that a bugfix really works
      • elm does a similar thing

skepticism?

  • this is not a critical system
    • correctness is not as important as user experience

Q&A
