This repository provides a complete setup guide for deploying Tencent's WeDLM model with an OpenAI-compatible API server using FastAPI. Follow the instructions below to install dependencies, download the model, and launch the server.
- Operating System: Linux or macOS (with bash support)
- Python: Version 3.13 or higher
- Hardware: GPU with sufficient VRAM to run WeDLM-8B-Instruct (recommended: 24GB+ VRAM)
- Disk Space: At least 20GB for model files and dependencies
Execute the following commands in your terminal to set up the environment and install all required components:
```bash
# Install uv package manager
wget -qO- https://astral.sh/uv/install.sh | sh

# Create Python 3.13 virtual environment
uv venv --python 3.13 venv-wedlm

# Activate virtual environment
source venv-wedlm/bin/activate

# Install WeDLM from source
uv pip install git+https://github.com/tencent/WeDLM.git

# Install the Hugging Face Hub client (provides the `hf` CLI),
# then download the WeDLM-8B-Instruct model files
uv pip install -U huggingface_hub
hf download tencent/WeDLM-8B-Instruct --local-dir WeDLM-8B-Instruct

# Install flash-attention for optimized inference
uv pip install flash-attn --no-build-isolation

# Install FastAPI and server dependencies
uv pip install fastapi uvicorn pydantic
```

The installation process consists of several steps, each serving a specific purpose in setting up your development environment.
uv is a modern, extremely fast Python package installer and environment manager written in Rust. It significantly accelerates the installation process compared to traditional tools like pip.
The virtual environment creation isolates your WeDLM dependencies from other Python projects on your system, preventing version conflicts and ensuring reproducible deployments.
WeDLM installation pulls the latest version directly from Tencent's GitHub repository, ensuring you have access to the most recent features and bug fixes.
Model download retrieves the WeDLM-8B-Instruct model weights from HuggingFace Hub and stores them locally in the WeDLM-8B-Instruct directory for inference.
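If you prefer to fetch the weights from Python rather than the `hf` CLI, `huggingface_hub.snapshot_download` achieves the same result. The repository ID and target directory below are the same ones used in the command above:

```python
# Download the WeDLM-8B-Instruct weights into a local directory.
from huggingface_hub import snapshot_download

snapshot_download(
    repo_id="tencent/WeDLM-8B-Instruct",
    local_dir="WeDLM-8B-Instruct",
)
```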
flash-attn is a flash attention implementation that substantially reduces memory usage and increases inference speed for transformer models, making large model deployment more efficient.
FastAPI, uvicorn, and pydantic form the foundation of your API server, providing fast async HTTP capabilities and robust data validation.
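To illustrate how these pieces fit together, here is a minimal sketch of an OpenAI-style chat completions endpoint built with FastAPI and pydantic. It is not the bundled server script: the `run_wedlm` function is a hypothetical placeholder for the actual WeDLM inference call, and only the request/response shape follows the OpenAI format.

```python
# Minimal sketch of an OpenAI-compatible chat endpoint (illustrative only).
from typing import List

from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()


class ChatMessage(BaseModel):
    role: str
    content: str


class ChatCompletionRequest(BaseModel):
    model: str
    messages: List[ChatMessage]
    temperature: float = 0.7
    max_tokens: int = 1024


def run_wedlm(messages: List[ChatMessage], temperature: float, max_tokens: int) -> str:
    # Hypothetical stand-in for WeDLM inference; replace with the real model call.
    return "(model output goes here)"


@app.post("/v1/chat/completions")
def chat_completions(request: ChatCompletionRequest):
    reply = run_wedlm(request.messages, request.temperature, request.max_tokens)
    # Return a response shaped like OpenAI's chat completion object.
    return {
        "id": "chatcmpl-local",
        "object": "chat.completion",
        "model": request.model,
        "choices": [
            {
                "index": 0,
                "message": {"role": "assistant", "content": reply},
                "finish_reason": "stop",
            }
        ],
    }
```

An app like this is served by uvicorn, which is why it appears in the dependency list above.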
After completing the installation, launch the WeDLM OpenAI-compatible API server:
```bash
python 20251230_wedlm_openai_server.py
```

Once started, the server will be accessible at http://localhost:8000 by default. The API is fully compatible with OpenAI's chat completions endpoint format, allowing you to use standard OpenAI client libraries with minimal code changes.
With your server running, you can make requests using any OpenAI-compatible client:
```python
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8000/v1",
    api_key="not-needed",
)

response = client.chat.completions.create(
    model="WeDLM-8B-Instruct",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Explain how quantum computing works."},
    ],
    temperature=0.7,
    max_tokens=1024,
)

print(response.choices[0].message.content)
```

You can customize server behavior using environment variables:
| Variable | Description | Default |
|---|---|---|
| `MODEL_PATH` | Path to WeDLM model directory | `WeDLM-8B-Instruct` |
| `HOST` | Server bind address | `0.0.0.0` |
| `PORT` | Server port | `8000` |
| `DEVICE` | Compute device (`cuda`, `cpu`, `mps`) | `cuda` |
| `MAX_BATCH_SIZE` | Maximum batch size for inference | `1` |
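As a rough guide to how such settings are typically consumed, a server script might read them along these lines. This is a sketch of the pattern using the names and defaults from the table above, not the actual contents of 20251230_wedlm_openai_server.py:

```python
import os

# Read configuration from the environment, falling back to the documented defaults.
MODEL_PATH = os.environ.get("MODEL_PATH", "WeDLM-8B-Instruct")
HOST = os.environ.get("HOST", "0.0.0.0")
PORT = int(os.environ.get("PORT", "8000"))
DEVICE = os.environ.get("DEVICE", "cuda")
MAX_BATCH_SIZE = int(os.environ.get("MAX_BATCH_SIZE", "1"))
```

For example, `PORT=8080 DEVICE=cpu python 20251230_wedlm_openai_server.py` would bind the server to port 8080 and run inference on CPU.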
If you encounter CUDA out-of-memory errors, reduce `MAX_BATCH_SIZE` or fall back to CPU inference by setting `DEVICE=cpu` (you can also restrict which GPU is used via `CUDA_VISIBLE_DEVICES`). For systems with limited VRAM, consider model quantization or lowering the `max_tokens` parameter in your requests.
For flash-attention installation issues on certain systems, ensure you have the latest CUDA toolkit installed and that your PyTorch version matches your CUDA version.