GLM-4.5 f16.gguf generation
./build/bin/llama-simple -m /media/kecso/8t_nvme/zai-org.GLM-4.5.f16.gguf -ngl 0 -n 2048 "how many 'r's are in the word strawberry?"
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 2 CUDA devices:
Device 0: NVIDIA GeForce RTX 5090, compute capability 12.0, VMM: yes
Device 1: NVIDIA GeForce RTX 5090, compute capability 12.0, VMM: yes
llama_model_load_from_file_impl: using device CUDA0 (NVIDIA GeForce RTX 5090) - 30933 MiB free
llama_model_load_from_file_impl: using device CUDA1 (NVIDIA GeForce RTX 5090) - 31598 MiB free
llama_model_loader: loaded meta data with 41 key-value pairs and 1761 tensors from /media/kecso/8t_nvme/zai-org.GLM-4.5.f16.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv 0: general.architecture str = glm4moe
llama_model_loader: - kv 1: general.type str = model
llama_model_loader: - kv 2: general.name str = Zai org.GLM 4.5
llama_model_loader: - kv 3: general.version str = 4.5
llama_model_loader: - kv 4: general.basename str = zai-org.GLM
llama_model_loader: - kv 5: general.size_label str = 160x21B
llama_model_loader: - kv 6: general.license str = mit
llama_model_loader: - kv 7: general.tags arr[str,1] = ["text-generation"]
llama_model_loader: - kv 8: general.languages arr[str,2] = ["en", "zh"]
llama_model_loader: - kv 9: glm4moe.block_count u32 = 93
llama_model_loader: - kv 10: glm4moe.context_length u32 = 131072
llama_model_loader: - kv 11: glm4moe.embedding_length u32 = 5120
llama_model_loader: - kv 12: glm4moe.feed_forward_length u32 = 12288
llama_model_loader: - kv 13: glm4moe.attention.head_count u32 = 96
llama_model_loader: - kv 14: glm4moe.attention.head_count_kv u32 = 8
llama_model_loader: - kv 15: glm4moe.rope.freq_base f32 = 1000000.000000
llama_model_loader: - kv 16: glm4moe.attention.layer_norm_rms_epsilon f32 = 0.000010
llama_model_loader: - kv 17: glm4moe.expert_used_count u32 = 8
llama_model_loader: - kv 18: glm4moe.attention.key_length u32 = 128
llama_model_loader: - kv 19: glm4moe.attention.value_length u32 = 128
llama_model_loader: - kv 20: general.file_type u32 = 1
llama_model_loader: - kv 21: glm4moe.rope.dimension_count u32 = 64
llama_model_loader: - kv 22: glm4moe.expert_count u32 = 160
llama_model_loader: - kv 23: glm4moe.expert_feed_forward_length u32 = 1536
llama_model_loader: - kv 24: glm4moe.expert_shared_count u32 = 1
llama_model_loader: - kv 25: glm4moe.leading_dense_block_count u32 = 3
llama_model_loader: - kv 26: glm4moe.expert_gating_func u32 = 2
llama_model_loader: - kv 27: glm4moe.expert_weights_scale f32 = 2.500000
llama_model_loader: - kv 28: glm4moe.expert_weights_norm bool = true
llama_model_loader: - kv 29: general.quantization_version u32 = 2
llama_model_loader: - kv 30: tokenizer.ggml.model str = gpt2
llama_model_loader: - kv 31: tokenizer.ggml.pre str = glm4
llama_model_loader: - kv 32: tokenizer.ggml.tokens arr[str,151552] = ["!", "\"", "#", "$", "%", "&", "'", ...
llama_model_loader: - kv 33: tokenizer.ggml.token_type arr[i32,151552] = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
llama_model_loader: - kv 34: tokenizer.ggml.merges arr[str,318088] = ["Ġ Ġ", "Ġ ĠĠĠ", "ĠĠ ĠĠ", "...
llama_model_loader: - kv 35: tokenizer.ggml.eos_token_id u32 = 151329
llama_model_loader: - kv 36: tokenizer.ggml.padding_token_id u32 = 151329
llama_model_loader: - kv 37: tokenizer.ggml.eot_token_id u32 = 151336
llama_model_loader: - kv 38: tokenizer.ggml.unknown_token_id u32 = 151329
llama_model_loader: - kv 39: tokenizer.ggml.bos_token_id u32 = 151329
llama_model_loader: - kv 40: tokenizer.chat_template str = [gMASK]<sop>\n{%- if tools -%}\n<|syste...
llama_model_loader: - type f32: 838 tensors
llama_model_loader: - type f16: 923 tensors
print_info: file format = GGUF V3 (latest)
print_info: file type = F16
print_info: file size = 670.59 GiB (16.08 BPW)
init_tokenizer: initializing tokenizer for type 2
load: control token: 151363 '<|image|>' is not marked as EOG
load: control token: 151362 '<|end_of_box|>' is not marked as EOG
load: control token: 151361 '<|begin_of_box|>' is not marked as EOG
load: control token: 151349 '<|code_suffix|>' is not marked as EOG
load: control token: 151348 '<|code_middle|>' is not marked as EOG
load: control token: 151346 '<|end_of_transcription|>' is not marked as EOG
load: control token: 151343 '<|begin_of_audio|>' is not marked as EOG
load: control token: 151342 '<|end_of_video|>' is not marked as EOG
load: control token: 151341 '<|begin_of_video|>' is not marked as EOG
load: control token: 151338 '<|observation|>' is not marked as EOG
load: control token: 151333 '<sop>' is not marked as EOG
load: control token: 151331 '[gMASK]' is not marked as EOG
load: control token: 151330 '[MASK]' is not marked as EOG
load: control token: 151347 '<|code_prefix|>' is not marked as EOG
load: control token: 151360 '/nothink' is not marked as EOG
load: control token: 151337 '<|assistant|>' is not marked as EOG
load: control token: 151332 '[sMASK]' is not marked as EOG
load: control token: 151334 '<eop>' is not marked as EOG
load: control token: 151335 '<|system|>' is not marked as EOG
load: control token: 151336 '<|user|>' is not marked as EOG
load: control token: 151340 '<|end_of_image|>' is not marked as EOG
load: control token: 151339 '<|begin_of_image|>' is not marked as EOG
load: control token: 151364 '<|video|>' is not marked as EOG
load: control token: 151345 '<|begin_of_transcription|>' is not marked as EOG
load: control token: 151344 '<|end_of_audio|>' is not marked as EOG
load: special_eot_id is not in special_eog_ids - the tokenizer config may be incorrect
load: special tokens cache size = 36
load: token to piece cache size = 0.9713 MB
print_info: arch = glm4moe
print_info: vocab_only = 0
print_info: n_ctx_train = 131072
print_info: n_embd = 5120
print_info: n_layer = 93
print_info: n_head = 96
print_info: n_head_kv = 8
print_info: n_rot = 64
print_info: n_swa = 0
print_info: is_swa_any = 0
print_info: n_embd_head_k = 128
print_info: n_embd_head_v = 128
print_info: n_gqa = 12
print_info: n_embd_k_gqa = 1024
print_info: n_embd_v_gqa = 1024
print_info: f_norm_eps = 0.0e+00
print_info: f_norm_rms_eps = 1.0e-05
print_info: f_clamp_kqv = 0.0e+00
print_info: f_max_alibi_bias = 0.0e+00
print_info: f_logit_scale = 0.0e+00
print_info: f_attn_scale = 0.0e+00
print_info: n_ff = 12288
print_info: n_expert = 160
print_info: n_expert_used = 8
print_info: causal attn = 1
print_info: pooling type = 0
print_info: rope type = 0
print_info: rope scaling = linear
print_info: freq_base_train = 1000000.0
print_info: freq_scale_train = 1
print_info: n_ctx_orig_yarn = 131072
print_info: rope_finetuned = unknown
print_info: model type = 355B.A32B
print_info: model params = 358.34 B
print_info: general.name = Zai org.GLM 4.5
print_info: vocab type = BPE
print_info: n_vocab = 151552
print_info: n_merges = 318088
print_info: BOS token = 151329 '<|endoftext|>'
print_info: EOS token = 151329 '<|endoftext|>'
print_info: EOT token = 151336 '<|user|>'
print_info: UNK token = 151329 '<|endoftext|>'
print_info: PAD token = 151329 '<|endoftext|>'
print_info: LF token = 198 'Ċ'
print_info: EOG token = 151329 '<|endoftext|>'
print_info: EOG token = 151336 '<|user|>'
print_info: max token length = 1024
load_tensors: loading model tensors, this can take a while... (mmap = true)
load_tensors: layer 0 assigned to device CPU, is_swa = 0
load_tensors: layer 1 assigned to device CPU, is_swa = 0
load_tensors: layer 2 assigned to device CPU, is_swa = 0
load_tensors: layer 3 assigned to device CPU, is_swa = 0
load_tensors: layer 4 assigned to device CPU, is_swa = 0
load_tensors: layer 5 assigned to device CPU, is_swa = 0
load_tensors: layer 6 assigned to device CPU, is_swa = 0
load_tensors: layer 7 assigned to device CPU, is_swa = 0
load_tensors: layer 8 assigned to device CPU, is_swa = 0
load_tensors: layer 9 assigned to device CPU, is_swa = 0
load_tensors: layer 10 assigned to device CPU, is_swa = 0
load_tensors: layer 11 assigned to device CPU, is_swa = 0
load_tensors: layer 12 assigned to device CPU, is_swa = 0
load_tensors: layer 13 assigned to device CPU, is_swa = 0
load_tensors: layer 14 assigned to device CPU, is_swa = 0
load_tensors: layer 15 assigned to device CPU, is_swa = 0
load_tensors: layer 16 assigned to device CPU, is_swa = 0
load_tensors: layer 17 assigned to device CPU, is_swa = 0
load_tensors: layer 18 assigned to device CPU, is_swa = 0
load_tensors: layer 19 assigned to device CPU, is_swa = 0
load_tensors: layer 20 assigned to device CPU, is_swa = 0
load_tensors: layer 21 assigned to device CPU, is_swa = 0
load_tensors: layer 22 assigned to device CPU, is_swa = 0
load_tensors: layer 23 assigned to device CPU, is_swa = 0
load_tensors: layer 24 assigned to device CPU, is_swa = 0
load_tensors: layer 25 assigned to device CPU, is_swa = 0
load_tensors: layer 26 assigned to device CPU, is_swa = 0
load_tensors: layer 27 assigned to device CPU, is_swa = 0
load_tensors: layer 28 assigned to device CPU, is_swa = 0
load_tensors: layer 29 assigned to device CPU, is_swa = 0
load_tensors: layer 30 assigned to device CPU, is_swa = 0
load_tensors: layer 31 assigned to device CPU, is_swa = 0
load_tensors: layer 32 assigned to device CPU, is_swa = 0
load_tensors: layer 33 assigned to device CPU, is_swa = 0
load_tensors: layer 34 assigned to device CPU, is_swa = 0
load_tensors: layer 35 assigned to device CPU, is_swa = 0
load_tensors: layer 36 assigned to device CPU, is_swa = 0
load_tensors: layer 37 assigned to device CPU, is_swa = 0
load_tensors: layer 38 assigned to device CPU, is_swa = 0
load_tensors: layer 39 assigned to device CPU, is_swa = 0
load_tensors: layer 40 assigned to device CPU, is_swa = 0
load_tensors: layer 41 assigned to device CPU, is_swa = 0
load_tensors: layer 42 assigned to device CPU, is_swa = 0
load_tensors: layer 43 assigned to device CPU, is_swa = 0
load_tensors: layer 44 assigned to device CPU, is_swa = 0
load_tensors: layer 45 assigned to device CPU, is_swa = 0
load_tensors: layer 46 assigned to device CPU, is_swa = 0
load_tensors: layer 47 assigned to device CPU, is_swa = 0
load_tensors: layer 48 assigned to device CPU, is_swa = 0
load_tensors: layer 49 assigned to device CPU, is_swa = 0
load_tensors: layer 50 assigned to device CPU, is_swa = 0
load_tensors: layer 51 assigned to device CPU, is_swa = 0
load_tensors: layer 52 assigned to device CPU, is_swa = 0
load_tensors: layer 53 assigned to device CPU, is_swa = 0
load_tensors: layer 54 assigned to device CPU, is_swa = 0
load_tensors: layer 55 assigned to device CPU, is_swa = 0
load_tensors: layer 56 assigned to device CPU, is_swa = 0
load_tensors: layer 57 assigned to device CPU, is_swa = 0
load_tensors: layer 58 assigned to device CPU, is_swa = 0
load_tensors: layer 59 assigned to device CPU, is_swa = 0
load_tensors: layer 60 assigned to device CPU, is_swa = 0
load_tensors: layer 61 assigned to device CPU, is_swa = 0
load_tensors: layer 62 assigned to device CPU, is_swa = 0
load_tensors: layer 63 assigned to device CPU, is_swa = 0
load_tensors: layer 64 assigned to device CPU, is_swa = 0
load_tensors: layer 65 assigned to device CPU, is_swa = 0
load_tensors: layer 66 assigned to device CPU, is_swa = 0
load_tensors: layer 67 assigned to device CPU, is_swa = 0
load_tensors: layer 68 assigned to device CPU, is_swa = 0
load_tensors: layer 69 assigned to device CPU, is_swa = 0
load_tensors: layer 70 assigned to device CPU, is_swa = 0
load_tensors: layer 71 assigned to device CPU, is_swa = 0
load_tensors: layer 72 assigned to device CPU, is_swa = 0
load_tensors: layer 73 assigned to device CPU, is_swa = 0
load_tensors: layer 74 assigned to device CPU, is_swa = 0
load_tensors: layer 75 assigned to device CPU, is_swa = 0
load_tensors: layer 76 assigned to device CPU, is_swa = 0
load_tensors: layer 77 assigned to device CPU, is_swa = 0
load_tensors: layer 78 assigned to device CPU, is_swa = 0
load_tensors: layer 79 assigned to device CPU, is_swa = 0
load_tensors: layer 80 assigned to device CPU, is_swa = 0
load_tensors: layer 81 assigned to device CPU, is_swa = 0
load_tensors: layer 82 assigned to device CPU, is_swa = 0
load_tensors: layer 83 assigned to device CPU, is_swa = 0
load_tensors: layer 84 assigned to device CPU, is_swa = 0
load_tensors: layer 85 assigned to device CPU, is_swa = 0
load_tensors: layer 86 assigned to device CPU, is_swa = 0
load_tensors: layer 87 assigned to device CPU, is_swa = 0
load_tensors: layer 88 assigned to device CPU, is_swa = 0
load_tensors: layer 89 assigned to device CPU, is_swa = 0
load_tensors: layer 90 assigned to device CPU, is_swa = 0
load_tensors: layer 91 assigned to device CPU, is_swa = 0
load_tensors: layer 92 assigned to device CPU, is_swa = 0
load_tensors: layer 93 assigned to device CPU, is_swa = 0
model has unused tensor blk.92.eh_proj (size = 209715200 bytes) -- ignoring
model has unused tensor blk.92.embed_tokens (size = 3103784960 bytes) -- ignoring
model has unused tensor blk.92.enorm (size = 20480 bytes) -- ignoring
model has unused tensor blk.92.hnorm (size = 20480 bytes) -- ignoring
model has unused tensor blk.92.shared_head.head (size = 3103784960 bytes) -- ignoring
model has unused tensor blk.92.shared_head.norm (size = 20480 bytes) -- ignoring
load_tensors: tensor 'token_embd.weight' (f16) (and 1754 others) cannot be used with preferred buffer type CUDA_Host, using CPU instead
load_tensors: offloading 0 repeating layers to GPU
load_tensors: offloaded 0/94 layers to GPU
load_tensors: CPU_Mapped model buffer size = 686680.17 MiB
....................................................................................................
llama_context: constructing llama_context
llama_context: n_batch is less than GGML_KQ_MASK_PAD - increasing to 64
llama_context: non-unified KV cache requires ggml_set_rows() - forcing unified KV cache
llama_context: n_seq_max = 1
llama_context: n_ctx = 2058
llama_context: n_ctx_per_seq = 2058
llama_context: n_batch = 64
llama_context: n_ubatch = 64
llama_context: causal_attn = 1
llama_context: flash_attn = 0
llama_context: kv_unified = true
llama_context: freq_base = 1000000.0
llama_context: freq_scale = 1
llama_context: n_ctx_per_seq (2058) < n_ctx_train (131072) -- the full capacity of the model will not be utilized
set_abort_callback: call
llama_context: CPU output buffer size = 0.58 MiB
create_memory: n_ctx = 2080 (padded)
llama_kv_cache_unified: layer 0: dev = CPU
llama_kv_cache_unified: layer 1: dev = CPU
llama_kv_cache_unified: layer 2: dev = CPU
llama_kv_cache_unified: layer 3: dev = CPU
llama_kv_cache_unified: layer 4: dev = CPU
llama_kv_cache_unified: layer 5: dev = CPU
llama_kv_cache_unified: layer 6: dev = CPU
llama_kv_cache_unified: layer 7: dev = CPU
llama_kv_cache_unified: layer 8: dev = CPU
llama_kv_cache_unified: layer 9: dev = CPU
llama_kv_cache_unified: layer 10: dev = CPU
llama_kv_cache_unified: layer 11: dev = CPU
llama_kv_cache_unified: layer 12: dev = CPU
llama_kv_cache_unified: layer 13: dev = CPU
llama_kv_cache_unified: layer 14: dev = CPU
llama_kv_cache_unified: layer 15: dev = CPU
llama_kv_cache_unified: layer 16: dev = CPU
llama_kv_cache_unified: layer 17: dev = CPU
llama_kv_cache_unified: layer 18: dev = CPU
llama_kv_cache_unified: layer 19: dev = CPU
llama_kv_cache_unified: layer 20: dev = CPU
llama_kv_cache_unified: layer 21: dev = CPU
llama_kv_cache_unified: layer 22: dev = CPU
llama_kv_cache_unified: layer 23: dev = CPU
llama_kv_cache_unified: layer 24: dev = CPU
llama_kv_cache_unified: layer 25: dev = CPU
llama_kv_cache_unified: layer 26: dev = CPU
llama_kv_cache_unified: layer 27: dev = CPU
llama_kv_cache_unified: layer 28: dev = CPU
llama_kv_cache_unified: layer 29: dev = CPU
llama_kv_cache_unified: layer 30: dev = CPU
llama_kv_cache_unified: layer 31: dev = CPU
llama_kv_cache_unified: layer 32: dev = CPU
llama_kv_cache_unified: layer 33: dev = CPU
llama_kv_cache_unified: layer 34: dev = CPU
llama_kv_cache_unified: layer 35: dev = CPU
llama_kv_cache_unified: layer 36: dev = CPU
llama_kv_cache_unified: layer 37: dev = CPU
llama_kv_cache_unified: layer 38: dev = CPU
llama_kv_cache_unified: layer 39: dev = CPU
llama_kv_cache_unified: layer 40: dev = CPU
llama_kv_cache_unified: layer 41: dev = CPU
llama_kv_cache_unified: layer 42: dev = CPU
llama_kv_cache_unified: layer 43: dev = CPU
llama_kv_cache_unified: layer 44: dev = CPU
llama_kv_cache_unified: layer 45: dev = CPU
llama_kv_cache_unified: layer 46: dev = CPU
llama_kv_cache_unified: layer 47: dev = CPU
llama_kv_cache_unified: layer 48: dev = CPU
llama_kv_cache_unified: layer 49: dev = CPU
llama_kv_cache_unified: layer 50: dev = CPU
llama_kv_cache_unified: layer 51: dev = CPU
llama_kv_cache_unified: layer 52: dev = CPU
llama_kv_cache_unified: layer 53: dev = CPU
llama_kv_cache_unified: layer 54: dev = CPU
llama_kv_cache_unified: layer 55: dev = CPU
llama_kv_cache_unified: layer 56: dev = CPU
llama_kv_cache_unified: layer 57: dev = CPU
llama_kv_cache_unified: layer 58: dev = CPU
llama_kv_cache_unified: layer 59: dev = CPU
llama_kv_cache_unified: layer 60: dev = CPU
llama_kv_cache_unified: layer 61: dev = CPU
llama_kv_cache_unified: layer 62: dev = CPU
llama_kv_cache_unified: layer 63: dev = CPU
llama_kv_cache_unified: layer 64: dev = CPU
llama_kv_cache_unified: layer 65: dev = CPU
llama_kv_cache_unified: layer 66: dev = CPU
llama_kv_cache_unified: layer 67: dev = CPU
llama_kv_cache_unified: layer 68: dev = CPU
llama_kv_cache_unified: layer 69: dev = CPU
llama_kv_cache_unified: layer 70: dev = CPU
llama_kv_cache_unified: layer 71: dev = CPU
llama_kv_cache_unified: layer 72: dev = CPU
llama_kv_cache_unified: layer 73: dev = CPU
llama_kv_cache_unified: layer 74: dev = CPU
llama_kv_cache_unified: layer 75: dev = CPU
llama_kv_cache_unified: layer 76: dev = CPU
llama_kv_cache_unified: layer 77: dev = CPU
llama_kv_cache_unified: layer 78: dev = CPU
llama_kv_cache_unified: layer 79: dev = CPU
llama_kv_cache_unified: layer 80: dev = CPU
llama_kv_cache_unified: layer 81: dev = CPU
llama_kv_cache_unified: layer 82: dev = CPU
llama_kv_cache_unified: layer 83: dev = CPU
llama_kv_cache_unified: layer 84: dev = CPU
llama_kv_cache_unified: layer 85: dev = CPU
llama_kv_cache_unified: layer 86: dev = CPU
llama_kv_cache_unified: layer 87: dev = CPU
llama_kv_cache_unified: layer 88: dev = CPU
llama_kv_cache_unified: layer 89: dev = CPU
llama_kv_cache_unified: layer 90: dev = CPU
llama_kv_cache_unified: layer 91: dev = CPU
llama_kv_cache_unified: layer 92: dev = CPU
llama_kv_cache_unified: CPU KV buffer size = 755.62 MiB
llama_kv_cache_unified: size = 755.62 MiB ( 2080 cells, 93 layers, 1/ 1 seqs), K (f16): 377.81 MiB, V (f16): 377.81 MiB
llama_kv_cache_unified: LLAMA_SET_ROWS=0, using old ggml_cpy() method for backwards compatibility
llama_context: enumerating backends
llama_context: backend_ptrs.size() = 3
llama_context: max_nodes = 14040
llama_context: worst-case: n_tokens = 64, n_seqs = 1, n_outputs = 0
graph_reserve: reserving a graph for ubatch with n_tokens = 64, n_seqs = 1, n_outputs = 64
graph_reserve: reserving a graph for ubatch with n_tokens = 1, n_seqs = 1, n_outputs = 1
graph_reserve: reserving a graph for ubatch with n_tokens = 64, n_seqs = 1, n_outputs = 64
llama_context: CUDA0 compute buffer size = 2418.66 MiB
llama_context: CUDA_Host compute buffer size = 1.76 MiB
llama_context: graph nodes = 6978
llama_context: graph splits = 1852 (with bs=64), 187 (with bs=1)
how many 'r's are in the word strawberry? how many 'r's are in the word strawberry? how many 'r's are in the word strawberry? how many 'r's are in the word strawberry? how many 'r's are in the word strawberry? how many 'r^C
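Note on the output: llama-simple feeds the raw prompt to the model with greedy sampling and does not apply the chat template embedded in the GGUF, so an instruct-tuned model like GLM-4.5 is not being prompted the way it expects, which is the most likely reason the generation just loops the question until it is interrupted. A hedged sketch of an alternative invocation with llama-cli in conversation mode, which does wrap each turn in the model's chat template (the -ngl value here is only a placeholder; the 670 GiB f16 weights are far larger than the two 32 GiB RTX 5090s, so only a few layers can be offloaded and most of the model stays on CPU):

./build/bin/llama-cli -m /media/kecso/8t_nvme/zai-org.GLM-4.5.f16.gguf \
    -ngl 6 -c 4096 -n 256 -cnv

With -cnv, llama-cli opens an interactive chat; the strawberry question can then be typed at the prompt and the reply is generated with the template applied.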