@Ristovski
Last active March 5, 2026 12:50
vkperf (0.99.5) tests various performance characteristics of Vulkan devices.
Devices in the system:
AMD Radeon Graphics (RADV RENOIR)
NVIDIA GeForce RTX 4070 Ti SUPER
llvmpipe (LLVM 19.1.7, 256 bits)
Selected device:
AMD Radeon Graphics (RADV RENOIR)
VendorID: 0x1002 (AMD/ATI)
DeviceID: 0x1638
Vulkan version: 1.4.305
Driver version: 25.0.5 (104857605, 0x6400005)
Driver name: radv
Driver info: Mesa 25.0.5
DriverID: MesaRadv
Driver conformance version: 1.4.0.0
GPU memory: 10GiB (10718MiB)
Max memory allocations: 4294967295
Standard (non-sparse) buffer alignment: 16
Number of triangles for tests: 100000
Sparse mode for tests: None
Timestamp number of bits: 64
Timestamp period: 10ns
Vulkan Instance version: 1.4.328
Operating system: < unknown, non-Windows >
Processor: AMD Ryzen 7 5700G with Radeon Graphics
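The raw driver version reported above (104857605, 0x6400005) can be decoded by hand. A minimal sketch, assuming the driver packs the value with the standard Vulkan version layout (major in bits 22 and up, minor in bits 12-21, patch in bits 0-11); other vendors may use a different layout, and `decode_vk_version` is a hypothetical helper name, not part of vkperf:

```python
def decode_vk_version(packed: int) -> tuple[int, int, int]:
    """Split a packed version number into (major, minor, patch).

    Assumes the VK_MAKE_VERSION-style layout: 10 bits major,
    10 bits minor, 12 bits patch. This is an assumption about the driver.
    """
    major = packed >> 22
    minor = (packed >> 12) & 0x3FF
    patch = packed & 0xFFF
    return major, minor, patch


# Raw value taken from the "Driver version" line of this log.
print(decode_vk_version(0x6400005))  # (25, 0, 5)
```

Under that assumption the raw value agrees with the 25.0.5 shown in the "Driver version" and "Driver info" lines.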
Triangle throughput:
Triangle list (triangle list primitive type,
single per-scene vkCmdDraw() call, attributeless,
constant VS output): 759.6 mega-triangles/s
Indexed triangle list (triangle list primitive type, single
per-scene vkCmdDrawIndexed() call, no vertices shared between triangles,
attributeless, constant VS output): 758.4 mega-triangles/s
Indexed triangle list that reuses two indices of the previous triangle
(triangle list primitive type, single per-scene vkCmdDrawIndexed() call,
attributeless, constant VS output): 1.262 giga-triangles/s
Triangle strips of various lengths
(per-strip vkCmdDraw() call, 1 to 1000 triangles per strip,
attributeless, constant VS output):
strip length 1: 70.63 mega-triangles/s
strip length 2: 139.7 mega-triangles/s
strip length 5: 345.9 mega-triangles/s
strip length 8: 541.2 mega-triangles/s
strip length 10: 666.6 mega-triangles/s
strip length 20: 1.322 giga-triangles/s
strip length 25: 1.495 giga-triangles/s
strip length 40: 1.798 giga-triangles/s
strip length 50: 1.872 giga-triangles/s
strip length 100: 2.039 giga-triangles/s
strip length 125: 2.076 giga-triangles/s
strip length 1000: 2.216 giga-triangles/s
Indexed triangle strips of various lengths
(per-strip vkCmdDrawIndexed() call, 1-1000 triangles per strip,
no vertices shared between strips, each index used just once,
attributeless, constant VS output):
strip length 1: 70.78 mega-triangles/s
strip length 2: 140.1 mega-triangles/s
strip length 5: 346.6 mega-triangles/s
strip length 8: 543.1 mega-triangles/s
strip length 10: 668.9 mega-triangles/s
strip length 20: 1.326 giga-triangles/s
strip length 25: 1.626 giga-triangles/s
strip length 40: 2.027 giga-triangles/s
strip length 50: 2.140 giga-triangles/s
strip length 100: 2.140 giga-triangles/s
strip length 125: 2.173 giga-triangles/s
strip length 1000: 2.214 giga-triangles/s
Primitive restart indexed triangle strips of various lengths
(single per-scene vkCmdDrawIndexed() call, 1-1000 triangles per strip,
no vertices shared between strips, each index used just once,
attributeless, constant VS output):
strip length 1: 957.4 mega-triangles/s
strip length 2: 1.508 giga-triangles/s
strip length 5: 2.200 giga-triangles/s
strip length 8: 2.202 giga-triangles/s
strip length 1000: 2.202 giga-triangles/s
Primitive restart, each triangle is replaced by one -1
(single per-scene vkCmdDrawIndexed() call,
no fragments produced): 3.654 giga-triangles/s
Primitive restart, only zeros in the index buffer
(single per-scene vkCmdDrawIndexed() call,
no fragments produced): 756.2 mega-triangles/s
Instancing throughput of vkCmdDraw()
(one triangle per instance, constant VS output, one draw call,
attributeless): 759.4 mega-triangles/s
Instancing throughput of vkCmdDrawIndexed()
(one triangle per instance, constant VS output, one draw call,
attributeless): 758.4 mega-triangles/s
Instancing throughput of vkCmdDrawIndirect()
(one triangle per instance, one indirect draw call,
one indirect record, attributeless): 755.5 mega-triangles/s
Instancing throughput of vkCmdDrawIndexedIndirect()
(one triangle per instance, one indirect draw call,
one indirect record, attributeless): 754.8 mega-triangles/s
vkCmdDraw() throughput
(per-triangle vkCmdDraw() in command buffer,
attributeless, constant VS output): 70.64 mega-triangles/s
vkCmdDrawIndexed() throughput
(per-triangle vkCmdDrawIndexed() in command buffer,
attributeless, constant VS output): 70.75 mega-triangles/s
VkDrawIndirectCommand processing throughput
(per-triangle VkDrawIndirectCommand, one vkCmdDrawIndirect() call,
attributeless): 24.43 mega-indirectRecords/s
VkDrawIndirectCommand processing throughput with stride 32
(per-triangle VkDrawIndirectCommand, one vkCmdDrawIndirect() call,
attributeless): 24.43 mega-indirectRecords/s
VkDrawIndexedIndirectCommand processing throughput
(per-triangle VkDrawIndexedIndirectCommand,
1x vkCmdDrawIndexedIndirect() call,
attributeless): 23.43 mega-indirectRecords/s
VkDrawIndexedIndirectCommand processing throughput with stride 32
(per-triangle VkDrawIndexedIndirectCommand,
1x vkCmdDrawIndexedIndirect() call,
attributeless): 18.49 mega-indirectRecords/s
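The per-triangle draw call and indirect record rates above can be read as a rough cost per item by taking the reciprocal. A small sketch, assuming each item (draw call or indirect record) carries exactly one triangle, as the test descriptions state; `ns_per_item` is a hypothetical helper name and the result is an average, not a measured latency:

```python
def ns_per_item(mega_items_per_s: float) -> float:
    """Convert a rate in mega-items per second to nanoseconds per item."""
    return 1e9 / (mega_items_per_s * 1e6)


# Rates copied from this log.
print(round(ns_per_item(70.64), 2))  # per-triangle vkCmdDraw()
print(round(ns_per_item(24.43), 2))  # per VkDrawIndirectCommand record
```

On this run that works out to roughly 14 ns per vkCmdDraw() call and roughly 41 ns per VkDrawIndirectCommand record.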
Vertex and geometry shader throughput:
VS throughput using vkCmdDraw() - minimal VS that just writes
constant output position (per-scene vkCmdDraw() call,
no attributes, no fragments produced): 2.278 giga-vertices/s
VS throughput using vkCmdDrawIndexed() - minimal VS that just writes
constant output position (per-scene vkCmdDrawIndexed() call,
no attributes, no fragments produced): 2.275 giga-vertices/s
VS producing output position from VertexIndex and InstanceIndex
using vkCmdDraw() (single per-scene vkCmdDraw() call,
attributeless, no fragments produced): 2.278 giga-vertices/s
VS producing output position from VertexIndex and InstanceIndex
using vkCmdDrawIndexed() (single per-scene vkCmdDrawIndexed() call,
attributeless, no fragments produced): 2.274 giga-vertices/s
GS one triangle in and no triangle out
(empty VS, attributeless): 759.4 mega-invocations/s
GS one triangle in and single constant triangle out
(empty VS, attributeless): 438.4 mega-invocations/s
GS one triangle in and two constant triangles out
(empty VS, attributeless): 316.9 mega-invocations/s
Attributes and buffers:
One attribute performance - 1x vec4 attribute
(attribute used, per-scene draw call): 2.239 giga-vertices/s
One buffer performance - 1x vec4 buffer
(1x read in VS, per-scene draw call): 2.237 giga-vertices/s
One buffer performance - 1x vec3 buffer
(1x read in VS, one draw call): 2.262 giga-vertices/s
Two attributes performance - 2x vec4 attribute
(both attributes used): 1.518 giga-vertices/s
Two buffers performance - 2x vec4 buffer
(both buffers read in VS): 1.384 giga-vertices/s
Two buffers performance - 2x vec3 buffer
(both buffers read in VS): 2.024 giga-vertices/s
Two interleaved attributes performance - 2x vec4
(2x vec4 attribute fetched from the single buffer in VS
from consecutive buffer locations): 1.507 giga-vertices/s
Two interleaved buffers performance - 2x vec4
(2x vec4 fetched from the single buffer in VS
from consecutive buffer locations): 1.508 giga-vertices/s
Packed buffer performance - 1x buffer using 32-byte struct unpacked
into position+normal+color+texCoord: 1.502 giga-vertices/s
Packed attribute performance - 2x uvec4 attribute unpacked
into position+normal+color+texCoord: 1.524 giga-vertices/s
Packed buffer performance - 2x uvec4 buffers unpacked
into position+normal+color+texCoord: 1.521 giga-vertices/s
Packed buffer performance - 2x buffer using 16-byte struct unpacked
into position+normal+color+texCoord: 1.541 giga-vertices/s
Packed buffer performance - 2x buffer using 16-byte struct
read multiple times and unpacked
into position+normal+color+texCoord: 1.529 giga-vertices/s
Four attributes performance - 4x vec4 attribute
(all attributes used): 790.1 mega-vertices/s
Four buffers performance - 4x vec4 buffer
(all buffers read in VS): 788.8 mega-vertices/s
Four buffers performance - 4x vec3 buffer
(all buffers read in VS): 1.041 giga-vertices/s
Four interleaved attributes performance - 4x vec4
(4x vec4 fetched from the single buffer
on consecutive locations): 787.8 mega-vertices/s
Four interleaved buffers performance - 4x vec4
(4x vec4 fetched from the single buffer
on consecutive locations): 790.8 mega-vertices/s
Four attributes performance - 2x vec4 and 2x uint attribute
(2x vec4f32 + 2x vec4u8, 2x conversion from vec4u8
to vec4): 1.250 giga-vertices/s
Transformations:
Matrix performance - one matrix as uniform for all triangles
(matrix read in VS,
coordinates in vec4 attribute): 2.236 giga-vertices/s
Matrix performance - per-triangle matrix in buffer
(different matrix read for each triangle in VS,
coordinates in vec4 attribute): 1.255 giga-vertices/s
Matrix performance - per-triangle matrix in attribute
(triangles are instanced and each triangle receives a different matrix,
coordinates in vec4 attribute): 2.068 giga-vertices/s
Matrix performance - one matrix in buffer for all triangles and 2x uvec4
packed attributes (each triangle reads matrix from the same place in
the buffer, attributes unpacked): 1.384 giga-vertices/s
Matrix performance - per-triangle matrix in the buffer and 2x uvec4 packed
attributes (each triangle reads a different matrix from a buffer,
attributes unpacked): 888.4 mega-vertices/s
Matrix performance - per-triangle matrix in buffer and 2x uvec4 packed
buffers (each triangle reads a different matrix from a buffer,
packed buffers unpacked): 930.4 mega-vertices/s
Matrix performance - GS reads per-triangle matrix from buffer and 2x uvec4
packed buffers (each triangle reads a different matrix from a buffer,
packed buffers unpacked in GS): 625.5 mega-vertices/s
Matrix performance - per-triangle matrix in buffer and four attributes
(each triangle reads a different matrix from a buffer,
4x vec4 attribute): 593.7 mega-vertices/s
Matrix performance - 1x per-triangle matrix in buffer, 2x uniform matrix and
2x uvec4 packed attributes (uniform view and projection matrices
multiplied with per-triangle model matrix and with unpacked attributes of
position, normal, color and texCoord): 880.3 mega-vertices/s
Matrix performance - 2x per-triangle matrix (mat4+mat3) in buffer,
3x uniform matrix (mat4+mat4+mat3) and 2x uvec4 packed attributes
(full position and normal computation with MVP and normal matrices,
all matrices and attributes multiplied): 678.5 mega-vertices/s
Matrix performance - 2x per-triangle matrix (mat4+mat3) in buffer,
2x non-changing matrix (mat4+mat4) in push constants,
1x constant matrix (mat3) and 2x uvec4 packed attributes (all
matrices and attributes multiplied): 668.5 mega-vertices/s
Matrix performance - 2x per-triangle matrix (mat4+mat3) in buffer, 2x
non-changing matrix (mat4+mat4) in specialization constants, 1x constant
matrix (mat3) defined by VS code and 2x uvec4 packed attributes (all
matrices and attributes multiplied): 693.0 mega-vertices/s
Matrix performance - 2x per-triangle matrix (mat4+mat3) in buffer,
3x constant matrix (mat4+mat4+mat3) defined by VS code and
2x uvec4 packed attributes (all matrices and attributes
multiplied): 697.4 mega-vertices/s
Matrix performance - GS five matrices processing, 2x per-triangle matrix
(mat4+mat3) in buffer, 3x uniform matrix (mat4+mat4+mat3) and
2x uvec4 packed attributes passed through VS (all matrices and
attributes multiplied): 509.2 mega-vertices/s
Matrix performance - GS five matrices processing, 2x per-triangle matrix
(mat4+mat3) in buffer, 3x uniform matrix (mat4+mat4+mat3) and
2x uvec4 packed data read from buffer in GS (all matrices and attributes
multiplied): 521.1 mega-vertices/s
Textured Phong and Matrix performance - 2x per-triangle matrix
in buffer (mat4+mat3), 3x uniform matrix (mat4+mat4+mat3) and
four attributes (vec4f32+vec3f32+vec4u8+vec2f32),
no fragments produced: 618.3 mega-vertices/s
Textured Phong and Matrix performance - 1x per-triangle matrix
in buffer (mat4), 2x uniform matrix (mat4+mat4) and
four attributes (vec4f32+vec3f32+vec4u8+vec2f32),
no fragments produced: 807.6 mega-vertices/s
Textured Phong and Matrix performance - 1x per-triangle matrix
in buffer (mat4), 2x uniform matrix (mat4+mat4) and 2x uvec4 packed
attribute, no fragments produced: 866.7 mega-vertices/s
Textured Phong and Matrix performance - 1x per-triangle row-major matrix
in buffer (mat4), 2x uniform not-row-major matrix (mat4+mat4),
2x uvec4 packed attributes,
no fragments produced: 949.0 mega-vertices/s
Textured Phong and Matrix performance - 1x per-triangle mat4x3 matrix
in buffer, 2x uniform matrix (mat4+mat4) and 2x uvec4 packed attributes,
no fragments produced: 1.045 giga-vertices/s
Textured Phong and Matrix performance - 1x per-triangle row-major mat4x3
matrix in buffer, 2x uniform matrix (mat4+mat4), 2x uvec4 packed
attribute, no fragments produced: 1.045 giga-vertices/s
Textured Phong and PAT performance - PAT v1 (Position-Attitude-Transform,
performing translation (vec3) and rotation (quaternion as vec4) using
implementation 1), PAT is per-triangle 2x vec4 in buffer,
2x uniform matrix (mat4+mat4), 2x uvec4 packed attributes,
no fragments produced: 1.176 giga-vertices/s
Textured Phong and PAT performance - PAT v2 (Position-Attitude-Transform,
performing translation (vec3) and rotation (quaternion as vec4) using
implementation 2), PAT is per-triangle 2x vec4 in buffer,
2x uniform matrix (mat4+mat4), 2x uvec4 packed attributes,
no fragments produced: 1.177 giga-vertices/s
Textured Phong and PAT performance - PAT v3 (Position-Attitude-Transform,
performing translation (vec3) and rotation (quaternion as vec4) using
implementation 3), PAT is per-triangle 2x vec4 in buffer,
2x uniform matrix (mat4+mat4), 2x uvec4 packed attributes,
no fragments produced: 1.174 giga-vertices/s
Textured Phong and PAT performance - constant single PAT v2 sourced from
the same index in buffer (vec3+vec4), 2x uniform matrix (mat4+mat4),
2x uvec4 packed attributes,
no fragments produced: 1.404 giga-vertices/s
Textured Phong and PAT performance - indexed draw call, per-triangle PAT v2
in buffer (vec3+vec4), 2x uniform matrix (mat4+mat4), 2x uvec4 packed
attribute, no fragments produced: 1.089 giga-vertices/s
Textured Phong and PAT performance - indexed draw call, constant single
PAT v2 sourced from the same index in buffer (vec3+vec4),
2x uniform matrix (mat4+mat4), 2x uvec4 packed attributes,
no fragments produced: 1.296 giga-vertices/s
Textured Phong and PAT performance - primitive restart, indexed draw call,
per-triangle PAT v2 in buffer (vec3+vec4), 2x uniform matrix (mat4+mat4),
2x uvec4 packed attributes,
no fragments produced: 1.119 giga-vertices/s
Textured Phong and PAT performance - primitive restart, indexed draw call,
constant single PAT v2 sourced from the same index in buffer (vec3+vec4),
2x uniform matrix (mat4+mat4), 2x uvec4 packed attributes,
no fragments produced: 1.317 giga-vertices/s
Textured Phong and double precision matrix performance - double precision
per-triangle matrix in buffer (dmat4), double precision per-scene view
matrix in uniform (dmat4), both matrices converted to single precision
before computations, single precision per-scene perspective matrix in
uniform (mat4), single precision vertex positions, packed attributes
(2x uvec4), no fragments produced: 676.3 mega-vertices/s
Textured Phong and double precision matrix performance - double precision
per-triangle matrix in buffer (dmat4), double precision per-scene view
matrix in uniform (dmat4), both matrices multiplied in double precision,
single precision vertex positions, single precision per-scene
perspective matrix in uniform (mat4), packed attributes (2x uvec4),
no fragments produced: 581.8 mega-vertices/s
Textured Phong and double precision matrix performance - double precision
per-triangle matrix in buffer (dmat4), double precision per-scene view
matrix in uniform (dmat4), both matrices multiplied in double precision,
double precision vertex positions (dvec3), single precision per-scene
perspective matrix in uniform (mat4), packed attributes (3x uvec4),
no fragments produced: 542.5 mega-vertices/s
Textured Phong and double precision matrix performance using GS - double
precision per-triangle matrix in buffer (dmat4), double precision
per-scene view matrix in uniform (dmat4), both matrices multiplied in
double precision, double precision vertex positions (dvec3), single
precision per-scene perspective matrix in uniform (mat4), packed
attributes (3x uvec4),
no fragments produced: 222.0 mega-vertices/s
Fragment throughput:
Single full-framebuffer quad,
constant color FS: 25.51 giga-fragments/s
10x full-framebuffer quad,
constant color FS: 35.45 giga-fragments/s
Four smooth interpolators (4x vec4),
10x fullscreen quad: 35.44 giga-fragments/s
Four flat interpolators (4x vec4),
10x fullscreen quad: 35.47 giga-fragments/s
Four textured phong interpolators (vec3+vec3+vec4+vec2),
10x fullscreen quad: 35.33 giga-fragments/s
Textured Phong, packed uniforms (four smooth interpolators
(vec3+vec3+vec4+vec2), 4x uniform (material (56 byte) +
globalAmbientLight (12 byte) + light (64 byte) + sampler2D),
10x fullscreen quad): 8.501 giga-fragments/s
Textured Phong, not packed uniforms (four smooth interpolators
(vec3+vec3+vec4+vec2), 4x uniform (material (72 byte) +
globalAmbientLight (12 byte) + light (80 byte) + sampler2D),
10x fullscreen quad): 8.500 giga-fragments/s
Simplified Phong, no texture, no specular (2x smooth interpolator
(vec3+vec3), 3x uniform (material (vec4+vec4) + globalAmbientLight
(vec3) + light (48 bytes: position+attenuation+ambient+diffuse)),
10x fullscreen quad): 15.65 giga-fragments/s
Simplified Phong, no texture, no specular, single uniform
(2x smooth interpolator (vec3+vec3), 1x uniform
(material+globalAmbientLight+light (vec4+vec4+vec4 + 3x vec4)),
10x fullscreen quad): 15.64 giga-fragments/s
Constant color from uniform, 1x uniform (vec4) in FS,
10x fullscreen quad: 35.38 giga-fragments/s
Constant color from uniform, 1x uniform (uint) in FS,
10x fullscreen quad: 35.41 giga-fragments/s
Transfer throughput:
Transfer of consecutive blocks:
4 bytes: 101.212ns per transfer (0.0368068 GiB/s)
4 bytes: 96.904ns per transfer (0.0384431 GiB/s)
8 bytes: 108.368ns per transfer (0.0687526 GiB/s)
16 bytes: 122.692ns per transfer (0.121452 GiB/s)
32 bytes: 154.088ns per transfer (0.193411 GiB/s)
64 bytes: 182.656ns per transfer (0.326322 GiB/s)
128 bytes: 187.24ns per transfer (0.636666 GiB/s)
256 bytes: 163.691ns per transfer (1.45651 GiB/s)
512 bytes: 169.15ns per transfer (2.81901 GiB/s)
1024 bytes: 181.133ns per transfer (5.26506 GiB/s)
2048 bytes: 205.391ns per transfer (9.28644 GiB/s)
4096 bytes: 158.984ns per transfer (23.9942 GiB/s)
8192 bytes: 313.594ns per transfer (24.3289 GiB/s)
16384 bytes: 621.875ns per transfer (24.5367 GiB/s)
32768 bytes: 1243.12ns per transfer (24.5491 GiB/s)
65536 bytes: 2473.75ns per transfer (24.6731 GiB/s)
131072 bytes: 4970ns per transfer (24.5614 GiB/s)
262144 bytes: 9900ns per transfer (24.6607 GiB/s)
524288 bytes: 19830ns per transfer (24.6234 GiB/s)
1048576 bytes: 39620ns per transfer (24.6482 GiB/s)
2097152 bytes: 79280ns per transfer (24.6358 GiB/s)
Transfer of spaced blocks:
4 bytes: 101.916ns per transfer (0.0365526 GiB/s)
4 bytes: 102.104ns per transfer (0.0364853 GiB/s)
8 bytes: 115.428ns per transfer (0.0645474 GiB/s)
16 bytes: 143.096ns per transfer (0.104134 GiB/s)
32 bytes: 143.728ns per transfer (0.207352 GiB/s)
64 bytes: 145.828ns per transfer (0.408733 GiB/s)
128 bytes: 157.236ns per transfer (0.758155 GiB/s)
256 bytes: 161.997ns per transfer (1.47175 GiB/s)
512 bytes: 167.139ns per transfer (2.85294 GiB/s)
1024 bytes: 185.586ns per transfer (5.13872 GiB/s)
2048 bytes: 218.594ns per transfer (8.72554 GiB/s)
4096 bytes: 161.797ns per transfer (23.5771 GiB/s)
8192 bytes: 312.812ns per transfer (24.3897 GiB/s)
16384 bytes: 625.625ns per transfer (24.3897 GiB/s)
32768 bytes: 1251.88ns per transfer (24.3775 GiB/s)
65536 bytes: 2508.75ns per transfer (24.3289 GiB/s)
131072 bytes: 5012.5ns per transfer (24.3532 GiB/s)
262144 bytes: 9935ns per transfer (24.5738 GiB/s)
524288 bytes: 19740ns per transfer (24.7356 GiB/s)
1048576 bytes: 39580ns per transfer (24.6731 GiB/s)
2097152 bytes: 78960ns per transfer (24.7356 GiB/s)
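The GiB/s figures in the two transfer tables follow from the block size and the time per transfer. A minimal sketch of that arithmetic, assuming GiB means 2^30 bytes and that the rate is simply bytes divided by seconds per transfer; `transfer_rate_gib_s` is a hypothetical helper name, not a vkperf function:

```python
GIB = 2**30  # bytes per GiB (assumed binary prefix)


def transfer_rate_gib_s(block_bytes: int, ns_per_transfer: float) -> float:
    """Throughput in GiB/s for one block moved in the given time."""
    seconds = ns_per_transfer * 1e-9
    return block_bytes / seconds / GIB


# Two rows copied from the tables above.
print(round(transfer_rate_gib_s(4096, 158.984), 2))    # log reports 23.9942
print(round(transfer_rate_gib_s(2097152, 78960), 2))   # log reports 24.7356
```

Both rows reproduce the logged values under these assumptions, which suggests the small-block rows are dominated by a fixed per-transfer cost of roughly 100 to 200 ns rather than by bandwidth.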
Measurement statistics:
Triangle throughput measurement time: 3.05 seconds using 288 test rounds.
Vertex throughput measurement time: 0.349 seconds using 288 test rounds.
Attribute and Buffer measurement time: 1.27 seconds using 288 test rounds.
Transformation measurement time: 3.65 seconds using 288 test rounds.
Fragment throughput measurement time: 3.21 seconds using 288 test rounds.
Transfer throughput measurement time: 7.32 seconds using 288 test rounds.
Total device time: 18.5 seconds.
Total real time: 20 seconds.