@rahulunair, forked from mingfeima/bert_optimization.md (created July 8, 2022)
BERT Optimization

benchmark

The benchmark is based on the huggingface repo for performance evaluation; the actual benchmark run script is placed at repo. To reproduce the performance results:

  1. Prepare the dataset according to link.
  2. Update GLUE_DIR to the actual dataset path in run_inference.sh.
  3. Change the env settings as needed; the default setting uses 20 cores.
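The steps above boil down to a little shell setup before launching the script. The dataset path below is a placeholder, not the real location:

```shell
# Sketch of steps 2-3; GLUE_DIR must point at your downloaded GLUE data.
export GLUE_DIR=$HOME/datasets/glue_data   # step 2: dataset path (placeholder)
export OMP_NUM_THREADS=20                  # step 3: default setting uses 20 cores

echo "GLUE_DIR=$GLUE_DIR, OMP_NUM_THREADS=$OMP_NUM_THREADS"
```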

MKL vs. MKLDNN

Inference performance results on Xeon 6148 (2x20 cores): a single instance on a single socket (20 threads), and multiple instances with a single thread each.

  • MKL: version 2019.4 (conda install mkl mkl-include)
  • MKLDNN: proposed in 21851

single instance (20 threads)

  • MKL
>>> ./run_inference.sh
408/408 [00:24<00:00, 16.69it/s]
  • MKLDNN
>>> ./run_inference.sh --mkldnn
408/408 [00:18<00:00, 21.95it/s]
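As a sanity check on the numbers above, the it/s figures imply the wall times shown, and the MKLDNN speedup follows directly. This is just arithmetic on the reported numbers, not part of the benchmark:

```python
mkl_ips = 16.69     # iterations per second, MKL run
mkldnn_ips = 21.95  # iterations per second, MKLDNN run
iters = 408

# total wall time implied by the throughput
mkl_time = iters / mkl_ips        # ~24.4 s, matching the 00:24 shown
mkldnn_time = iters / mkldnn_ips  # ~18.6 s, matching the 00:18 shown

speedup = mkldnn_ips / mkl_ips
print(f"MKLDNN speedup: {speedup:.2f}x")  # ~1.32x
```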

multi instance (1 thread per instance)

  • MKL
>>> ./run_inference.sh --multi_instances
Average latency per example: 469.058ms
Total number of iterations: 1000
Total number of iterations per second (across all threads): 42.64
Total time: 23.453s
  • MKLDNN
>>> ./run_inference.sh --multi_instances --mkldnn
Average latency per example: 370.495ms
Total number of iterations: 1000
Total number of iterations per second (across all threads): 53.98
Total time: 18.525s
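The multi-instance totals are internally consistent: iteration count divided by total time recovers the reported throughput, and the latency numbers give the same MKLDNN advantage. Again, plain arithmetic on the figures above:

```python
# MKL multi-instance run: 1000 iterations in 23.453 s
assert round(1000 / 23.453, 2) == 42.64   # iterations per second
# MKLDNN multi-instance run: 1000 iterations in 18.525 s
assert round(1000 / 18.525, 2) == 53.98

# latency improvement from MKLDNN
print(f"latency speedup: {469.058 / 370.495:.2f}x")  # ~1.27x
```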

Impact of leading dimension padding

  • Skylake has special requirements on the leading dimension of GEMM: when LDA/LDB/LDC is a multiple of 128, a cache flush issue occurs; see ref.
  • The following table compares the performance of BERT (glue/MRPC) GEMMs with MKL and MKLDNN at the original sizes and at padded sizes (+16).

Table-1: single socket test result (20 threads)

| size (original) | MKL | MKLDNN | size (padded) | MKL | MKLDNN |
|---|---|---|---|---|---|
| N=128, I=768, O=768 | 818.57 | 417.03 | N=128, I=784, O=784 | 1246.08 | 1282.33 |
| N=128, I=768, O=3072 | 1369.88 | 1818.96 | N=128, I=784, O=3088 | 1908.46 | 1931.12 |
| N=128, I=3072, O=768 | 676.20 | 1262.61 | N=128, I=3088, O=784 | 1768.28 | 1658.30 |

unit: Gflops
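The padding trick works because the original hidden sizes are exact multiples of 128, which triggers the leading-dimension issue, while the +16 padded sizes are not. A quick check, plain arithmetic independent of MKL/MKLDNN:

```python
def hits_bad_leading_dim(n, multiple=128):
    """True if a leading dimension of n would be a multiple of 128."""
    return n % multiple == 0

for size in (768, 3072):
    assert hits_bad_leading_dim(size)           # original sizes hit the issue
    assert not hits_bad_leading_dim(size + 16)  # padded sizes avoid it
print("padded sizes avoid the 128-multiple leading dimension")
```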

  • Use the following script to reproduce this result:

run.sh:

num_threads=$1
script=$2
last_core=`expr $num_threads - 1`

echo "using $num_threads OMP threads"
echo "bind cores to 0~$last_core"

# pin OpenMP threads to consecutive physical cores on socket 0
export OMP_NUM_THREADS=$num_threads
export KMP_AFFINITY=granularity=fine,compact,1,0

numactl --physcpubind=0-$last_core --membind=0 python $script

test_linear.py

import torch
import torch.nn as nn
from time import time

warmups = 1000
iters = 10000

def test_linear(batch_size, input_channel, output_channel):
    input = torch.randn(batch_size, input_channel)
    linear = nn.Linear(input_channel, output_channel)

    # warm up before timing so caches and thread pools are initialized
    for i in range(warmups):
        output = linear(input)

    t1 = time()
    for i in range(iters):
        output = linear(input)
    t2 = time()
    tt = (t2 - t1) / iters  # average seconds per iteration

    # GEMM flop count is 2*M*K*N (one multiply and one add per accumulated element)
    print("### Linear: (%d, %d) => (%d, %d): %f ms, %f Gflops"
            % (batch_size, input_channel, batch_size, output_channel,
              tt*1000, 2*batch_size*input_channel*output_channel/tt/1e9))

test_linear(128, 768, 768)
test_linear(128, 768, 3072)
test_linear(128, 3072, 768)
test_linear(128, 768+16, 768+16)
test_linear(128, 768+16, 3072+16)
test_linear(128, 3072+16, 768+16)
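The Gflops figure printed by test_linear comes from the standard GEMM flop count, 2\*M\*K\*N. Worked through for the first case, using the MKL number from Table-1:

```python
M, K, N = 128, 768, 768  # batch_size, input_channel, output_channel
flops = 2 * M * K * N
print(flops)  # 150994944 flops per forward call

# at the measured 818.57 Gflops (MKL), one call takes about
t = flops / 818.57e9
print(f"{t * 1e3:.4f} ms")  # ~0.18 ms per linear layer
```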

To run on a single socket with 20 OMP threads:

./run.sh 20 test_linear.py
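For the multi-instance configuration (one thread per instance), a launcher along the lines below could be used. This is a hypothetical sketch, not the actual run_inference.sh --multi_instances implementation; the echo keeps it a dry run that only prints the commands:

```shell
# Dry-run sketch: print one single-thread numactl command per instance.
gen_cmds() {
    num_instances=$1
    i=0
    while [ "$i" -lt "$num_instances" ]; do
        # each instance is pinned to its own core with a single OMP thread
        echo "OMP_NUM_THREADS=1 numactl --physcpubind=$i --membind=0 python test_linear.py"
        i=$((i + 1))
    done
}

gen_cmds 4
```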