Created July 30, 2025 05:10
GLM-4.5 f16.gguf generation
./build/bin/llama-simple -m /media/kecso/8t_nvme/zai-org.GLM-4.5.f16.gguf -ngl 0 -n 2048 "how many 'r's are in the word strawberry?"
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 2 CUDA devices:
Device 0: NVIDIA GeForce RTX 5090, compute capability 12.0, VMM: yes
Device 1: NVIDIA GeForce RTX 5090, compute capability 12.0, VMM: yes
llama_model_load_from_file_impl: using device CUDA0 (NVIDIA GeForce RTX 5090) - 30933 MiB free
llama_model_load_from_file_impl: using device CUDA1 (NVIDIA GeForce RTX 5090) - 31598 MiB free
llama_model_loader: loaded meta data with 41 key-value pairs and 1761 tensors from /media/kecso/8t_nvme/zai-org.GLM-4.5.f16.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv 0: general.architecture str = glm4moe
llama_model_loader: - kv 1: general.type str = model
llama_model_loader: - kv 2: general.name str = Zai org.GLM 4.5
llama_model_loader: - kv 3: general.version str = 4.5
llama_model_loader: - kv 4: general.basename str = zai-org.GLM
llama_model_loader: - kv 5: general.size_label str = 160x21B
llama_model_loader: - kv 6: general.license str = mit
llama_model_loader: - kv 7: general.tags arr[str,1] = ["text-generation"]
llama_model_loader: - kv 8: general.languages arr[str,2] = ["en", "zh"]
llama_model_loader: - kv 9: glm4moe.block_count u32 = 93
llama_model_loader: - kv 10: glm4moe.context_length u32 = 131072
llama_model_loader: - kv 11: glm4moe.embedding_length u32 = 5120
llama_model_loader: - kv 12: glm4moe.feed_forward_length u32 = 12288
llama_model_loader: - kv 13: glm4moe.attention.head_count u32 = 96
llama_model_loader: - kv 14: glm4moe.attention.head_count_kv u32 = 8
llama_model_loader: - kv 15: glm4moe.rope.freq_base f32 = 1000000.000000
llama_model_loader: - kv 16: glm4moe.attention.layer_norm_rms_epsilon f32 = 0.000010
llama_model_loader: - kv 17: glm4moe.expert_used_count u32 = 8
llama_model_loader: - kv 18: glm4moe.attention.key_length u32 = 128
llama_model_loader: - kv 19: glm4moe.attention.value_length u32 = 128
llama_model_loader: - kv 20: general.file_type u32 = 1
llama_model_loader: - kv 21: glm4moe.rope.dimension_count u32 = 64
llama_model_loader: - kv 22: glm4moe.expert_count u32 = 160
llama_model_loader: - kv 23: glm4moe.expert_feed_forward_length u32 = 1536
llama_model_loader: - kv 24: glm4moe.expert_shared_count u32 = 1
llama_model_loader: - kv 25: glm4moe.leading_dense_block_count u32 = 3
llama_model_loader: - kv 26: glm4moe.expert_gating_func u32 = 2
llama_model_loader: - kv 27: glm4moe.expert_weights_scale f32 = 2.500000
llama_model_loader: - kv 28: glm4moe.expert_weights_norm bool = true
llama_model_loader: - kv 29: general.quantization_version u32 = 2
llama_model_loader: - kv 30: tokenizer.ggml.model str = gpt2
llama_model_loader: - kv 31: tokenizer.ggml.pre str = glm4
llama_model_loader: - kv 32: tokenizer.ggml.tokens arr[str,151552] = ["!", "\"", "#", "$", "%", "&", "'", ...
llama_model_loader: - kv 33: tokenizer.ggml.token_type arr[i32,151552] = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
llama_model_loader: - kv 34: tokenizer.ggml.merges arr[str,318088] = ["Ġ Ġ", "Ġ ĠĠĠ", "ĠĠ ĠĠ", "...
llama_model_loader: - kv 35: tokenizer.ggml.eos_token_id u32 = 151329
llama_model_loader: - kv 36: tokenizer.ggml.padding_token_id u32 = 151329
llama_model_loader: - kv 37: tokenizer.ggml.eot_token_id u32 = 151336
llama_model_loader: - kv 38: tokenizer.ggml.unknown_token_id u32 = 151329
llama_model_loader: - kv 39: tokenizer.ggml.bos_token_id u32 = 151329
llama_model_loader: - kv 40: tokenizer.chat_template str = [gMASK]<sop>\n{%- if tools -%}\n<|syste...
llama_model_loader: - type f32: 838 tensors
llama_model_loader: - type f16: 923 tensors
print_info: file format = GGUF V3 (latest)
print_info: file type = F16
print_info: file size = 670.59 GiB (16.08 BPW)
init_tokenizer: initializing tokenizer for type 2
load: control token: 151363 '<|image|>' is not marked as EOG
load: control token: 151362 '<|end_of_box|>' is not marked as EOG
load: control token: 151361 '<|begin_of_box|>' is not marked as EOG
load: control token: 151349 '<|code_suffix|>' is not marked as EOG
load: control token: 151348 '<|code_middle|>' is not marked as EOG
load: control token: 151346 '<|end_of_transcription|>' is not marked as EOG
load: control token: 151343 '<|begin_of_audio|>' is not marked as EOG
load: control token: 151342 '<|end_of_video|>' is not marked as EOG
load: control token: 151341 '<|begin_of_video|>' is not marked as EOG
load: control token: 151338 '<|observation|>' is not marked as EOG
load: control token: 151333 '<sop>' is not marked as EOG
load: control token: 151331 '[gMASK]' is not marked as EOG
load: control token: 151330 '[MASK]' is not marked as EOG
load: control token: 151347 '<|code_prefix|>' is not marked as EOG
load: control token: 151360 '/nothink' is not marked as EOG
load: control token: 151337 '<|assistant|>' is not marked as EOG
load: control token: 151332 '[sMASK]' is not marked as EOG
load: control token: 151334 '<eop>' is not marked as EOG
load: control token: 151335 '<|system|>' is not marked as EOG
load: control token: 151336 '<|user|>' is not marked as EOG
load: control token: 151340 '<|end_of_image|>' is not marked as EOG
load: control token: 151339 '<|begin_of_image|>' is not marked as EOG
load: control token: 151364 '<|video|>' is not marked as EOG
load: control token: 151345 '<|begin_of_transcription|>' is not marked as EOG
load: control token: 151344 '<|end_of_audio|>' is not marked as EOG
load: special_eot_id is not in special_eog_ids - the tokenizer config may be incorrect
load: special tokens cache size = 36
load: token to piece cache size = 0.9713 MB
print_info: arch = glm4moe
print_info: vocab_only = 0
print_info: n_ctx_train = 131072
print_info: n_embd = 5120
print_info: n_layer = 93
print_info: n_head = 96
print_info: n_head_kv = 8
print_info: n_rot = 64
print_info: n_swa = 0
print_info: is_swa_any = 0
print_info: n_embd_head_k = 128
print_info: n_embd_head_v = 128
print_info: n_gqa = 12
print_info: n_embd_k_gqa = 1024
print_info: n_embd_v_gqa = 1024
print_info: f_norm_eps = 0.0e+00
print_info: f_norm_rms_eps = 1.0e-05
print_info: f_clamp_kqv = 0.0e+00
print_info: f_max_alibi_bias = 0.0e+00
print_info: f_logit_scale = 0.0e+00
print_info: f_attn_scale = 0.0e+00
print_info: n_ff = 12288
print_info: n_expert = 160
print_info: n_expert_used = 8
print_info: causal attn = 1
print_info: pooling type = 0
print_info: rope type = 0
print_info: rope scaling = linear
print_info: freq_base_train = 1000000.0
print_info: freq_scale_train = 1
print_info: n_ctx_orig_yarn = 131072
print_info: rope_finetuned = unknown
print_info: model type = 355B.A32B
print_info: model params = 358.34 B
print_info: general.name = Zai org.GLM 4.5
print_info: vocab type = BPE
print_info: n_vocab = 151552
print_info: n_merges = 318088
print_info: BOS token = 151329 '<|endoftext|>'
print_info: EOS token = 151329 '<|endoftext|>'
print_info: EOT token = 151336 '<|user|>'
print_info: UNK token = 151329 '<|endoftext|>'
print_info: PAD token = 151329 '<|endoftext|>'
print_info: LF token = 198 'Ċ'
print_info: EOG token = 151329 '<|endoftext|>'
print_info: EOG token = 151336 '<|user|>'
print_info: max token length = 1024
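As a quick cross-check of the print_info figures above (every number is copied from the log, nothing is re-measured), the file size, bits-per-weight, and grouped-query-attention dimensions are mutually consistent. A minimal sketch of the arithmetic:

```python
# Sanity-check the figures reported by print_info (values copied from the log above).
params   = 358.34e9                    # "model params = 358.34 B"
file_gib = 670.59                      # "file size = 670.59 GiB"
bpw = file_gib * 1024**3 * 8 / params
print(round(bpw, 2))                   # ~16.08 -> matches "(16.08 BPW)", as expected for f16 weights

n_head, n_head_kv, head_dim = 96, 8, 128
print(n_head // n_head_kv)             # 12   -> "n_gqa = 12" (query heads per KV head)
print(n_head_kv * head_dim)            # 1024 -> "n_embd_k_gqa = 1024" (KV width per layer)
```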
load_tensors: loading model tensors, this can take a while... (mmap = true)
load_tensors: layer 0 assigned to device CPU, is_swa = 0
load_tensors: layer 1 assigned to device CPU, is_swa = 0
load_tensors: layer 2 assigned to device CPU, is_swa = 0
load_tensors: layer 3 assigned to device CPU, is_swa = 0
load_tensors: layer 4 assigned to device CPU, is_swa = 0
load_tensors: layer 5 assigned to device CPU, is_swa = 0
load_tensors: layer 6 assigned to device CPU, is_swa = 0
load_tensors: layer 7 assigned to device CPU, is_swa = 0
load_tensors: layer 8 assigned to device CPU, is_swa = 0
load_tensors: layer 9 assigned to device CPU, is_swa = 0
load_tensors: layer 10 assigned to device CPU, is_swa = 0
load_tensors: layer 11 assigned to device CPU, is_swa = 0
load_tensors: layer 12 assigned to device CPU, is_swa = 0
load_tensors: layer 13 assigned to device CPU, is_swa = 0
load_tensors: layer 14 assigned to device CPU, is_swa = 0
load_tensors: layer 15 assigned to device CPU, is_swa = 0
load_tensors: layer 16 assigned to device CPU, is_swa = 0
load_tensors: layer 17 assigned to device CPU, is_swa = 0
load_tensors: layer 18 assigned to device CPU, is_swa = 0
load_tensors: layer 19 assigned to device CPU, is_swa = 0
load_tensors: layer 20 assigned to device CPU, is_swa = 0
load_tensors: layer 21 assigned to device CPU, is_swa = 0
load_tensors: layer 22 assigned to device CPU, is_swa = 0
load_tensors: layer 23 assigned to device CPU, is_swa = 0
load_tensors: layer 24 assigned to device CPU, is_swa = 0
load_tensors: layer 25 assigned to device CPU, is_swa = 0
load_tensors: layer 26 assigned to device CPU, is_swa = 0
load_tensors: layer 27 assigned to device CPU, is_swa = 0
load_tensors: layer 28 assigned to device CPU, is_swa = 0
load_tensors: layer 29 assigned to device CPU, is_swa = 0
load_tensors: layer 30 assigned to device CPU, is_swa = 0
load_tensors: layer 31 assigned to device CPU, is_swa = 0
load_tensors: layer 32 assigned to device CPU, is_swa = 0
load_tensors: layer 33 assigned to device CPU, is_swa = 0
load_tensors: layer 34 assigned to device CPU, is_swa = 0
load_tensors: layer 35 assigned to device CPU, is_swa = 0
load_tensors: layer 36 assigned to device CPU, is_swa = 0
load_tensors: layer 37 assigned to device CPU, is_swa = 0
load_tensors: layer 38 assigned to device CPU, is_swa = 0
load_tensors: layer 39 assigned to device CPU, is_swa = 0
load_tensors: layer 40 assigned to device CPU, is_swa = 0
load_tensors: layer 41 assigned to device CPU, is_swa = 0
load_tensors: layer 42 assigned to device CPU, is_swa = 0
load_tensors: layer 43 assigned to device CPU, is_swa = 0
load_tensors: layer 44 assigned to device CPU, is_swa = 0
load_tensors: layer 45 assigned to device CPU, is_swa = 0
load_tensors: layer 46 assigned to device CPU, is_swa = 0
load_tensors: layer 47 assigned to device CPU, is_swa = 0
load_tensors: layer 48 assigned to device CPU, is_swa = 0
load_tensors: layer 49 assigned to device CPU, is_swa = 0
load_tensors: layer 50 assigned to device CPU, is_swa = 0
load_tensors: layer 51 assigned to device CPU, is_swa = 0
load_tensors: layer 52 assigned to device CPU, is_swa = 0
load_tensors: layer 53 assigned to device CPU, is_swa = 0
load_tensors: layer 54 assigned to device CPU, is_swa = 0
load_tensors: layer 55 assigned to device CPU, is_swa = 0
load_tensors: layer 56 assigned to device CPU, is_swa = 0
load_tensors: layer 57 assigned to device CPU, is_swa = 0
load_tensors: layer 58 assigned to device CPU, is_swa = 0
load_tensors: layer 59 assigned to device CPU, is_swa = 0
load_tensors: layer 60 assigned to device CPU, is_swa = 0
load_tensors: layer 61 assigned to device CPU, is_swa = 0
load_tensors: layer 62 assigned to device CPU, is_swa = 0
load_tensors: layer 63 assigned to device CPU, is_swa = 0
load_tensors: layer 64 assigned to device CPU, is_swa = 0
load_tensors: layer 65 assigned to device CPU, is_swa = 0
load_tensors: layer 66 assigned to device CPU, is_swa = 0
load_tensors: layer 67 assigned to device CPU, is_swa = 0
load_tensors: layer 68 assigned to device CPU, is_swa = 0
load_tensors: layer 69 assigned to device CPU, is_swa = 0
load_tensors: layer 70 assigned to device CPU, is_swa = 0
load_tensors: layer 71 assigned to device CPU, is_swa = 0
load_tensors: layer 72 assigned to device CPU, is_swa = 0
load_tensors: layer 73 assigned to device CPU, is_swa = 0
load_tensors: layer 74 assigned to device CPU, is_swa = 0
load_tensors: layer 75 assigned to device CPU, is_swa = 0
load_tensors: layer 76 assigned to device CPU, is_swa = 0
load_tensors: layer 77 assigned to device CPU, is_swa = 0
load_tensors: layer 78 assigned to device CPU, is_swa = 0
load_tensors: layer 79 assigned to device CPU, is_swa = 0
load_tensors: layer 80 assigned to device CPU, is_swa = 0
load_tensors: layer 81 assigned to device CPU, is_swa = 0
load_tensors: layer 82 assigned to device CPU, is_swa = 0
load_tensors: layer 83 assigned to device CPU, is_swa = 0
load_tensors: layer 84 assigned to device CPU, is_swa = 0
load_tensors: layer 85 assigned to device CPU, is_swa = 0
load_tensors: layer 86 assigned to device CPU, is_swa = 0
load_tensors: layer 87 assigned to device CPU, is_swa = 0
load_tensors: layer 88 assigned to device CPU, is_swa = 0
load_tensors: layer 89 assigned to device CPU, is_swa = 0
load_tensors: layer 90 assigned to device CPU, is_swa = 0
load_tensors: layer 91 assigned to device CPU, is_swa = 0
load_tensors: layer 92 assigned to device CPU, is_swa = 0
load_tensors: layer 93 assigned to device CPU, is_swa = 0
model has unused tensor blk.92.eh_proj (size = 209715200 bytes) -- ignoring
model has unused tensor blk.92.embed_tokens (size = 3103784960 bytes) -- ignoring
model has unused tensor blk.92.enorm (size = 20480 bytes) -- ignoring
model has unused tensor blk.92.hnorm (size = 20480 bytes) -- ignoring
model has unused tensor blk.92.shared_head.head (size = 3103784960 bytes) -- ignoring
model has unused tensor blk.92.shared_head.norm (size = 20480 bytes) -- ignoring
load_tensors: tensor 'token_embd.weight' (f16) (and 1754 others) cannot be used with preferred buffer type CUDA_Host, using CPU instead
load_tensors: offloading 0 repeating layers to GPU
load_tensors: offloaded 0/94 layers to GPU
load_tensors: CPU_Mapped model buffer size = 686680.17 MiB
....................................................................................................
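With -ngl 0 nothing is offloaded ("offloaded 0/94 layers to GPU"), so the entire f16 model is memory-mapped into host RAM and the CPU_Mapped buffer is simply the GGUF file size again; the two 32 GB RTX 5090s could not hold a ~670 GiB model in any case. A small check of that equivalence (values copied from the log):

```python
# "CPU_Mapped model buffer size = 686680.17 MiB" vs. "file size = 670.59 GiB"
print(round(686680.17 / 1024, 2))   # 670.59 -> the mmapped buffer is the whole f16 file
```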
llama_context: constructing llama_context
llama_context: n_batch is less than GGML_KQ_MASK_PAD - increasing to 64
llama_context: non-unified KV cache requires ggml_set_rows() - forcing unified KV cache
llama_context: n_seq_max = 1
llama_context: n_ctx = 2058
llama_context: n_ctx_per_seq = 2058
llama_context: n_batch = 64
llama_context: n_ubatch = 64
llama_context: causal_attn = 1
llama_context: flash_attn = 0
llama_context: kv_unified = true
llama_context: freq_base = 1000000.0
llama_context: freq_scale = 1
llama_context: n_ctx_per_seq (2058) < n_ctx_train (131072) -- the full capacity of the model will not be utilized
set_abort_callback: call
llama_context: CPU output buffer size = 0.58 MiB
create_memory: n_ctx = 2080 (padded)
llama_kv_cache_unified: layer 0: dev = CPU
llama_kv_cache_unified: layer 1: dev = CPU
llama_kv_cache_unified: layer 2: dev = CPU
llama_kv_cache_unified: layer 3: dev = CPU
llama_kv_cache_unified: layer 4: dev = CPU
llama_kv_cache_unified: layer 5: dev = CPU
llama_kv_cache_unified: layer 6: dev = CPU
llama_kv_cache_unified: layer 7: dev = CPU
llama_kv_cache_unified: layer 8: dev = CPU
llama_kv_cache_unified: layer 9: dev = CPU
llama_kv_cache_unified: layer 10: dev = CPU
llama_kv_cache_unified: layer 11: dev = CPU
llama_kv_cache_unified: layer 12: dev = CPU
llama_kv_cache_unified: layer 13: dev = CPU
llama_kv_cache_unified: layer 14: dev = CPU
llama_kv_cache_unified: layer 15: dev = CPU
llama_kv_cache_unified: layer 16: dev = CPU
llama_kv_cache_unified: layer 17: dev = CPU
llama_kv_cache_unified: layer 18: dev = CPU
llama_kv_cache_unified: layer 19: dev = CPU
llama_kv_cache_unified: layer 20: dev = CPU
llama_kv_cache_unified: layer 21: dev = CPU
llama_kv_cache_unified: layer 22: dev = CPU
llama_kv_cache_unified: layer 23: dev = CPU
llama_kv_cache_unified: layer 24: dev = CPU
llama_kv_cache_unified: layer 25: dev = CPU
llama_kv_cache_unified: layer 26: dev = CPU
llama_kv_cache_unified: layer 27: dev = CPU
llama_kv_cache_unified: layer 28: dev = CPU
llama_kv_cache_unified: layer 29: dev = CPU
llama_kv_cache_unified: layer 30: dev = CPU
llama_kv_cache_unified: layer 31: dev = CPU
llama_kv_cache_unified: layer 32: dev = CPU
llama_kv_cache_unified: layer 33: dev = CPU
llama_kv_cache_unified: layer 34: dev = CPU
llama_kv_cache_unified: layer 35: dev = CPU
llama_kv_cache_unified: layer 36: dev = CPU
llama_kv_cache_unified: layer 37: dev = CPU
llama_kv_cache_unified: layer 38: dev = CPU
llama_kv_cache_unified: layer 39: dev = CPU
llama_kv_cache_unified: layer 40: dev = CPU
llama_kv_cache_unified: layer 41: dev = CPU
llama_kv_cache_unified: layer 42: dev = CPU
llama_kv_cache_unified: layer 43: dev = CPU
llama_kv_cache_unified: layer 44: dev = CPU
llama_kv_cache_unified: layer 45: dev = CPU
llama_kv_cache_unified: layer 46: dev = CPU
llama_kv_cache_unified: layer 47: dev = CPU
llama_kv_cache_unified: layer 48: dev = CPU
llama_kv_cache_unified: layer 49: dev = CPU
llama_kv_cache_unified: layer 50: dev = CPU
llama_kv_cache_unified: layer 51: dev = CPU
llama_kv_cache_unified: layer 52: dev = CPU
llama_kv_cache_unified: layer 53: dev = CPU
llama_kv_cache_unified: layer 54: dev = CPU
llama_kv_cache_unified: layer 55: dev = CPU
llama_kv_cache_unified: layer 56: dev = CPU
llama_kv_cache_unified: layer 57: dev = CPU
llama_kv_cache_unified: layer 58: dev = CPU
llama_kv_cache_unified: layer 59: dev = CPU
llama_kv_cache_unified: layer 60: dev = CPU
llama_kv_cache_unified: layer 61: dev = CPU
llama_kv_cache_unified: layer 62: dev = CPU
llama_kv_cache_unified: layer 63: dev = CPU
llama_kv_cache_unified: layer 64: dev = CPU
llama_kv_cache_unified: layer 65: dev = CPU
llama_kv_cache_unified: layer 66: dev = CPU
llama_kv_cache_unified: layer 67: dev = CPU
llama_kv_cache_unified: layer 68: dev = CPU
llama_kv_cache_unified: layer 69: dev = CPU
llama_kv_cache_unified: layer 70: dev = CPU
llama_kv_cache_unified: layer 71: dev = CPU
llama_kv_cache_unified: layer 72: dev = CPU
llama_kv_cache_unified: layer 73: dev = CPU
llama_kv_cache_unified: layer 74: dev = CPU
llama_kv_cache_unified: layer 75: dev = CPU
llama_kv_cache_unified: layer 76: dev = CPU
llama_kv_cache_unified: layer 77: dev = CPU
llama_kv_cache_unified: layer 78: dev = CPU
llama_kv_cache_unified: layer 79: dev = CPU
llama_kv_cache_unified: layer 80: dev = CPU
llama_kv_cache_unified: layer 81: dev = CPU
llama_kv_cache_unified: layer 82: dev = CPU
llama_kv_cache_unified: layer 83: dev = CPU
llama_kv_cache_unified: layer 84: dev = CPU
llama_kv_cache_unified: layer 85: dev = CPU
llama_kv_cache_unified: layer 86: dev = CPU
llama_kv_cache_unified: layer 87: dev = CPU
llama_kv_cache_unified: layer 88: dev = CPU
llama_kv_cache_unified: layer 89: dev = CPU
llama_kv_cache_unified: layer 90: dev = CPU
llama_kv_cache_unified: layer 91: dev = CPU
llama_kv_cache_unified: layer 92: dev = CPU
llama_kv_cache_unified: CPU KV buffer size = 755.62 MiB
llama_kv_cache_unified: size = 755.62 MiB ( 2080 cells, 93 layers, 1/ 1 seqs), K (f16): 377.81 MiB, V (f16): 377.81 MiB
llama_kv_cache_unified: LLAMA_SET_ROWS=0, using old ggml_cpy() method for backwards compatibility
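The reported KV cache size follows directly from the context and attention dimensions above; a minimal reproduction of the arithmetic (all values copied from the log):

```python
# KV cache bytes = n_layer * n_cells * kv_width * sizeof(f16), separately for K and V.
n_layer, n_cells, kv_width, f16_bytes = 93, 2080, 1024, 2   # kv_width = n_embd_k_gqa
k_mib = n_layer * n_cells * kv_width * f16_bytes / 2**20
print(round(k_mib, 2))        # 377.81 MiB for K (V is identical)
print(round(2 * k_mib, 2))    # 755.62 MiB total, matching "CPU KV buffer size"
```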
llama_context: enumerating backends
llama_context: backend_ptrs.size() = 3
llama_context: max_nodes = 14040
llama_context: worst-case: n_tokens = 64, n_seqs = 1, n_outputs = 0
graph_reserve: reserving a graph for ubatch with n_tokens = 64, n_seqs = 1, n_outputs = 64
graph_reserve: reserving a graph for ubatch with n_tokens = 1, n_seqs = 1, n_outputs = 1
graph_reserve: reserving a graph for ubatch with n_tokens = 64, n_seqs = 1, n_outputs = 64
llama_context: CUDA0 compute buffer size = 2418.66 MiB
llama_context: CUDA_Host compute buffer size = 1.76 MiB
llama_context: graph nodes = 6978
llama_context: graph splits = 1852 (with bs=64), 187 (with bs=1)
how many 'r's are in the word strawberry? how many 'r's are in the word strawberry? how many 'r's are in the word strawberry? how many 'r's are in the word strawberry? how many 'r's are in the word strawberry? how many 'r^C
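The run was interrupted (^C) after the model only echoed the prompt. llama-simple feeds the prompt as plain text rather than through the chat template stored in the GGUF (tokenizer.chat_template above), so a repetitive, base-model-style continuation like this is a plausible outcome. For reference, the answer the prompt asks for is easy to verify:

```python
print("strawberry".count("r"))   # 3
```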