- Transformers
- Ollama
- llama.cpp
- ExLlamaV2
- AutoGPTQ
- AutoAWQ
- TensorRT-LLM
docs about inference backends: https://www.bentoml.com/blog/benchmarking-llm-inference-backends
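For a quick taste, a minimal generation sketch with Transformers (the model ID is just an example; any causal LM from the Hub should work):

```python
# Minimal Transformers inference sketch; the model ID is only an example.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "mistralai/Mistral-7B-Instruct-v0.2"  # placeholder model
tok = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto", torch_dtype="auto")

inputs = tok("What is a GGUF file?", return_tensors="pt").to(model.device)
out = model.generate(**inputs, max_new_tokens=64)
print(tok.decode(out[0], skip_special_tokens=True))
```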
- oobabooga
- Stable Diffusion web UI
- SillyTavern
- LM Studio
- Axolotl
- GPT4All
- Open WebUI
  - I've used this one
- Enchanted
  - Mac native
- LangChain (TS & Python)
- LlamaIndex (TS & Python)
- ModelFusion (TS)
- Haystack (Python)
  - Used by AWS, Nvidia, IBM, Intel
- CrewAI (Python)
- Transformers (Python)
  - Made by HuggingFace
- PyTorch
- TensorFlow
- JAX
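For a feel of these frameworks, a minimal LangChain (Python) sketch; it assumes the langchain-openai package and an OPENAI_API_KEY in the environment, and the model name is only an example:

```python
# Minimal LangChain pipeline: prompt template -> chat model -> plain string.
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.output_parsers import StrOutputParser
from langchain_openai import ChatOpenAI  # needs OPENAI_API_KEY set

prompt = ChatPromptTemplate.from_template("Explain {topic} in one sentence.")
llm = ChatOpenAI(model="gpt-4o-mini")  # model name is a placeholder
chain = prompt | llm | StrOutputParser()  # LCEL pipe syntax

print(chain.invoke({"topic": "quantization"}))
```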
- vokturz/can-it-run-llm
- nyxkrage/gguf-vram-calculator
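Those calculators boil down to something like the rule of thumb below. This is a rough sketch, not their exact formula; the 20% overhead factor is an assumption, and it ignores the KV cache:

```python
# Back-of-envelope VRAM estimate for model weights alone.
def est_vram_gb(params_billions: float, bits_per_weight: float, overhead: float = 1.2) -> float:
    """bits_per_weight: ~16 for fp16, ~8 for Q8_0, ~4-5 for Q4 quants."""
    return params_billions * bits_per_weight / 8 * overhead

print(f"7B fp16  ~ {est_vram_gb(7, 16):.1f} GB")   # ~16.8 GB
print(f"7B 4-bit ~ {est_vram_gb(7, 4.5):.1f} GB")  # ~4.7 GB
```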
- QLoRA
  - For fine-tuning models
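A rough QLoRA setup sketch with Transformers + PEFT + bitsandbytes; the model ID and LoRA hyperparameters are just placeholders:

```python
# QLoRA sketch: load the base model in 4-bit, then attach trainable LoRA adapters.
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model

bnb = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",              # NF4 quantization, as in the QLoRA paper
    bnb_4bit_compute_dtype=torch.bfloat16,
)
base = AutoModelForCausalLM.from_pretrained(
    "mistralai/Mistral-7B-v0.1",            # placeholder model ID
    quantization_config=bnb,
    device_map="auto",
)
lora = LoraConfig(r=16, lora_alpha=32, target_modules=["q_proj", "v_proj"],
                  task_type="CAUSAL_LM")    # placeholder hyperparameters
model = get_peft_model(base, lora)
model.print_trainable_parameters()          # only the small adapter weights train
```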
- bycloud
- HuggingFace
- Fireship
  - Not exclusively about LLMs/AI
- David Ondrej
Models are usually saved in one of these formats:
- GGUF
  - It's the successor to GGML
  - Tech doc about GGUF (from HuggingFace)
- GGML
- Safetensors
- EXL2 (used by ExLlamaV2)
- AWQ
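For example, a GGUF file can be loaded directly with llama-cpp-python (the model path is a placeholder):

```python
# Loading a GGUF model with llama-cpp-python (llama.cpp bindings).
from llama_cpp import Llama

llm = Llama(model_path="models/mistral-7b-instruct.Q4_K_M.gguf", n_ctx=4096)  # placeholder path
out = llm("Q: What does GGUF stand for?\nA:", max_tokens=64, stop=["\n"])
print(out["choices"][0]["text"])
```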
These files contain the context used by the LLMs
1 token ~= 0.75 words
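That ratio can be sanity-checked with any tokenizer; a quick sketch with the GPT-2 tokenizer from Transformers (the exact ratio varies by tokenizer and text):

```python
# Rough words-per-token check; the ~0.75 figure is only an average for English.
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
text = "Large language models process text as tokens rather than whole words."
n_tokens = len(tok(text)["input_ids"])
n_words = len(text.split())
print(f"{n_words} words / {n_tokens} tokens = {n_words / n_tokens:.2f} words per token")
```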
Common llama.cpp quantization types:
- Q4_0, Q4_1, Q5_0, Q5_1, Q8_0 (legacy quants)
- Q3_K_S, Q3_K_M, Q3_K_L, Q4_K_S, Q4_K_M, Q5_K_S, Q5_K_M, Q6_K (k-quants)