learning-huggingface

Learning about huggingface (and other AI stuff)

  • the site: https://huggingface.co/
  • the account is free at first; once you start using their GPUs you need a paid account
  • main page UI
    • model hub : lots of trained open-source models (sorted by trending)
    • datasets : open-source datasets!!!
    • spaces : apps hosted on Hugging Face Spaces
    • community (blogs/posts/daily papers)
      • the blogs are a great source of info! They don't push fear or tons of marketing over here; the information is easier to digest than via any other source!
  • can create git repos hosted by huggingface
    • create a new git repository on Hugging Face
    • log in
    • https://huggingface.co/new - fill out the form and get a repo.
    • the Hugging Face git hosting uses git-lfs, the git extension for big files: the git entry is just a reference to the big file (via the sha256 checksum of the big file). See the sketch after this list.
      • must install the git-lfs extension for git:
        • sudo apt-get install git-lfs
        • git lfs install
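
A minimal sketch of the workflow (the repo name and file names are placeholders; pushing requires being logged in with an access token):

# clone the freshly created repo (placeholder user/repo)
git clone https://huggingface.co/MoserMichael/my-test-model
cd my-test-model

# one-time LFS setup, then track the large weight files
git lfs install
git lfs track "*.bin" "*.gguf"
git add .gitattributes

# add a big file and push; git stores only a pointer, LFS stores the actual blob
git add model.bin
git commit -m "add model weights"
git push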

Benchmarks

There are 'spaces' dedicated to benchmarking language models:

Open LLM Leaderboard (always the first one mentioned) - a 'space' dedicated to LLM benchmarks: https://huggingface.co/spaces/open-llm-leaderboard/open_llm_leaderboard#/

  • The About page has an explanation of each type of benchmark mentioned on that page: https://huggingface.co/docs/leaderboards/open_llm_leaderboard/about - see in particular the tasks section https://huggingface.co/docs/leaderboards/open_llm_leaderboard/about#tasks
  • there is also a section on each 'type' of model mentioned on the page: https://huggingface.co/docs/leaderboards/open_llm_leaderboard/about#model-types
    • Pretrained Model: New, base models trained on a given text corpora using masked modeling.
    • Continuously Pretrained Model: New, base models continuously trained on further corpora (which may include IFT/chat data) using masked modeling.
    • Fine-Tuned on Domain-Specific Datasets Model: Pretrained models fine-tuned on more data.
    • Chat Models (RLHF, DPO, IFT, …): Chat-like fine-tunes using IFT (datasets of task instruction), RLHF, DPO (changing the model loss with an added policy), etc.
    • Base Merges and Moerges Model: Merges or MoErges, models which have been merged or fused without additional fine-tuning.
      • model merging: combines the weights of differently trained models. This avoids 'catastrophic forgetting' - which occurs if you take a base model and keep training it on new data.
        • Usually they combine models of the same 'family' - like two llama3 variants with the same number of transformer/decoder layers.
        • Frankenmerging/passthrough can combine models of different families (??) How do they do that? They try different combinations of layers (like taking the output of layers 10-21 from model A and combining that with layers 12-24 from model B) and test which combination works best... (amazing). See the mergekit sketch after this list.
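
I believe the usual tool for these merges is mergekit (my assumption - it is not mentioned above); a passthrough ("Frankenmerge") config looks roughly like the sketch below. The model names and layer ranges are purely illustrative.

# write an illustrative passthrough merge config (models and layer ranges are made up)
cat > frankenmerge.yml <<'EOF'
slices:
  - sources:
      - model: meta-llama/Meta-Llama-3-8B
        layer_range: [0, 21]
  - sources:
      - model: meta-llama/Meta-Llama-3-8B-Instruct
        layer_range: [12, 32]
merge_method: passthrough
dtype: bfloat16
EOF

# run the merge (mergekit must be installed, e.g. pip install mergekit)
mergekit-yaml frankenmerge.yml ./merged-model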

search

https://huggingface.co/models - displays a sidebar where you can filter models by various traits.

Under 'Other' you can pick '4-bit precision' or 'mixture of experts' (the page also says: 'GGUF: The standard for local CPU/GPU inference. AWQ or GPTQ: Common 4-bit GPU-optimized formats').
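
The same filters are also available programmatically via the Hub's REST API (a quick sketch; endpoint and parameter names as I understand them, jq used for readable output):

# list the five most-downloaded models tagged 'gguf'
curl -s "https://huggingface.co/api/models?filter=gguf&sort=downloads&limit=5" | jq '.[].id'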

learning about local models

there are llama.cpp and ollama.

  • llama.cpp : a C++ implementation of model inference
  • ollama - based on llama.cpp, adds additional features around it (a framework); meant to be more user friendly.

Update: llama.cpp now also supports working with multiple models (this used to be exclusive to ollama) https://huggingface.co/blog/ggml-org/model-management-in-llamacpp

which large language models run fast on a local CPU or simple GPU in the video card?

  • The key to speed on a local CPU or a budget GPU: use quantized versions (e.g., 4-bit or 8-bit models) and choose models under 15 billion parameters.
    • Quantization is mandatory: always look for GGUF or EXL2 quantized versions of models. A 4-bit quantized 7B model often performs similarly to its full-precision version but runs significantly faster and requires half the memory (see the sketch after this list).
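
For example, to grab a 4-bit GGUF file and run it with llama.cpp (the repo and file names below are just examples - check the model card for the actual quant files):

# download a single quantized file from a GGUF repo on the hub
pip install -U "huggingface_hub[cli]"
huggingface-cli download bartowski/Meta-Llama-3-8B-Instruct-GGUF \
    Meta-Llama-3-8B-Instruct-Q4_K_M.gguf --local-dir ./models

# run it with llama.cpp's CLI binary
llama-cli -m ./models/Meta-Llama-3-8B-Instruct-Q4_K_M.gguf -p "Why is the sky blue?" -n 128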

how to run ollama in a docker container?


# choose which model to pull (alternatives commented out)
#MODEL="ministral-3b"
#MODEL="qwen3-8b"
MODEL="llama3"

# 1. Pull the official Ollama image (GPU support is often implicit if drivers are present)
docker pull ollama/ollama

# 2. Run the container (using GPU acceleration if available)
docker run -d --gpus all -v ollama:/root/.ollama -p 11434:11434 --name ollama_server ollama/ollama

# 3. Download a specific model (e.g., a 7B quantized Llama 3) into the running container
docker exec -it ollama_server ollama pull ${MODEL}

are there docker images that include the quantized models as part of the image? How can I build such an image?

There are no such official images, because models are large and frequently updated - but you can bake a model into your own image (see the two-staged Dockerfile below).

Simple docker

# 1. Pull the official Ollama image (GPU support is often implicit if drivers are present)
docker pull ollama/ollama

# 2. Run the container (using GPU acceleration if available)
docker run -d --gpus all -v ollama:/root/.ollama -p 11434:11434 --name ollama_server ollama/ollama

# 3. Download a specific model (e.g., a 7B quantized Llama 3) into the running container
docker exec -it ollama_server ollama pull llama3

# Note: the two directives below are Dockerfile directives, not shell commands - they only
# matter if you wrap this in your own Dockerfile (compare the two-staged Dockerfile below).

# Expose the standard Ollama port
EXPOSE 11434

# Set the entrypoint to start the server normally
ENTRYPOINT ["ollama", "serve"]

Two staged docker

# STAGE 1: Download the model
FROM ollama/ollama:latest AS builder

#ENV MODEL=ministral-3b
#ENV MODEL=qwen3-8b
#ENV MODEL=lfm2-2.6b
ENV MODEL=llama3

# Start Ollama server in the background and pull the model
# We use 'nohup' and 'sleep' to ensure the server is ready before pulling
RUN nohup ollama serve > /dev/null 2>&1 & \
    sleep 10 && \
    ollama pull ${MODEL}

# STAGE 2: Final Runtime Image
FROM ollama/ollama:latest

# Copy the pre-loaded model data from the builder stage
# Ollama stores all models and manifests in /root/.ollama
COPY --from=builder /root/.ollama /root/.ollama

# Expose the standard Ollama port
EXPOSE 11434


# Set the entrypoint to start the server normally
ENTRYPOINT ["ollama", "serve"]

Run it via docker, publishing port 11434 (see the sketch below).
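
A sketch of building and running the baked image (the tag/name 'ollama-baked' is my placeholder):

# build the image from the two-staged Dockerfile above
docker build -t ollama-baked .

# run it, publishing the standard Ollama port (drop --gpus all if there is no GPU)
docker run -d --gpus all -p 11434:11434 --name ollama-baked ollama-baked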

access

via rest

curl http://localhost:11434/api/generate -d '{
  "model": "lfm2-2.6b",
  "prompt": "Why is the sky blue?",
  "stream": false
}'

or via python sdk

import ollama
response = ollama.chat(model='lfm2-2.6b', messages=[
    {'role': 'user', 'content': 'Explain quantum physics in one sentence.'}
])
print(response['message']['content'])

strengths & weaknesses of quantized models

(it's the start of 2026)

  1. LFM2-2.6B (Liquid AI)
    • Strengths:
      • Unmatched Speed: specifically built with a hybrid "Liquid" architecture that is up to 2× faster on CPUs than traditional transformer models.
      • Efficiency: extremely low memory footprint (~3GB RAM even at long contexts), making it the gold standard for devices with no dedicated GPU.
      • Instruction Accuracy: surprisingly high scores on instruction-following benchmarks (e.g., IFEval), often beating 7B–10B parameter models.
    • Weaknesses:
      • Knowledge Gaps: due to its small 2.6B size, it is not ideal for "encyclopedic" knowledge or heavy coding tasks.
      • Fragility: can "crumble" or fail if prompts fall outside its narrow instruction-tuned range.
  2. Ministral-3-3B-Instruct
    • Strengths:
      • Multimodal Native: unlike most small models, it handles vision (images) and text natively within a compact 5GB footprint.
      • Agent Ready: strong built-in support for tool use (function calling) and structured JSON outputs.
      • Licensing: Apache 2.0 license makes it very friendly for commercial/enterprise edge use.
    • Weaknesses:
      • Reasoning Reliability: users report the "Reasoning" variant can be inconsistent, sometimes failing to generate reasoning traces unless heavily prompted.
      • Hardware Demand: while small, its vision encoder adds extra VRAM overhead during multimodal tasks.
  3. Qwen3-8B Family
    • Strengths:
      • Multilingual Titan: supports over 119 languages and dialects, far exceeding Western-centric models.
      • "Thinking" Mode: includes a dual-mode system allowing you to toggle between fast general chat and high-accuracy logical reasoning for math/coding.
      • Strong Math/Logic: surpasses previous generations (like Qwen2.5) in complex logical deduction and mathematical problem-solving.
    • Weaknesses:
      • VRAM Limit: pushes the limits of "budget" GPUs; at 8B parameters, it requires careful quantization (e.g., 4-bit) to run smoothly on 8GB cards.
      • Verbose: the "thinking" versions can sometimes become overly talkative, increasing the time to the final answer.
  4. NVIDIA Nemotron Nano 9B V2
    • Strengths:
      • Coding Specialist: highly optimized for developer assistance and code generation on consumer hardware.
      • NVIDIA Synergy: specifically tuned for TensorRT-LLM, allowing it to reach peak speeds on RTX video cards.
    • Weaknesses:
      • Text Only: lacks the multimodal (vision) capabilities found in the newer Ministral or Qwen3 variants.
      • GPU Dependency: while it can run on CPU, its optimizations are heavily weighted toward NVIDIA hardware.
  5. Mistral Nemo (12B)
    • Strengths:
      • Large Context: features a 128K token context window, allowing for much larger document analysis than the 2.6B–3B models.
      • Reasoning Depth: as the largest model in this "fast" category, it offers noticeably better nuance and common-sense reasoning for complex tasks.
    • Weaknesses:
      • Heavier Quantization Needed: to fit on a standard 8GB–12GB GPU, you must use 4-bit quantization, which slightly degrades performance compared to its smaller, denser peers.
      • Lower TPS: text generation speed (tokens per second) is lower than LFM2 or Ministral due to the higher parameter count (see the measurement sketch after this list).
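
Tokens per second can be checked locally: the Ollama /api/generate response (with "stream": false) includes eval_count and eval_duration (in nanoseconds). A quick sketch, assuming the server from the docker section is running and llama3 is pulled:

# rough generation speed: tokens generated / generation time in seconds
curl -s http://localhost:11434/api/generate -d '{
  "model": "llama3",
  "prompt": "Why is the sky blue?",
  "stream": false
}' | jq '.eval_count / (.eval_duration / 1e9)'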

AI News

Will be adding stuff from the news here. Things are changing frigging fast...

deepseek engram architecture

  • DeepSeek engram by prompt engineering

  • talk on DeepSeek V4 by Vuk Rosić

  • the problem of knowledge representation in large language models: you have an n-gram (a 3-gram is a sequence of three adjacent words) and want to map it to additional information relevant to a specific context, like 'Alexander the Great'
    • says this is done implicitly by the LLM; such a lookup is a result of the training process! The result of this lookup is in vector form (compare that to RAG, where the result is in text form and has to be grokked by the LLM first!)
    • the new architecture by DeepSeek: do this mapping of n-gram to vector outside of the LLM; that makes the knowledge of the LLM extensible, and the training process becomes less fragile too (you can add knowledge without catastrophic forgetting)!
    • the result of the lookup is passed through a context-aware gate, which checks if the lookup result is relevant to the current context (is that another agent or part of the same model?)

  • now they say that 75% of memorized facts should still be part of the model, otherwise reasoning capability is affected (method: they compare performance against that of a mixture-of-experts model)
    • says reasoning benchmarks also improved with this kind of knowledge lookup: the model got deeper, as it doesn't have to waste space on memorizing facts!
    • more attention is freed for long-range dependencies (connections between words that are far apart in the context window)

  • limitation: this is not an external knowledge base - the mapping is built at training time and can't be updated at inference time (probably because the vector representation of the result is tied to the state of the model as such; now how do they update this vector representation during the training process ???)

  • big advantage: the vector knowledge is kept in RAM (which is cheaper), not as part of the neural network stored on the GPU.

the DeepSeek paper in full - arXiv link (many questions, so I will need to read it :-)
