How to Run DeepSeek-V4 Locally: Pro and Flash Setup Guide
DeepSeek-V4 is one of the most ambitious open-weight model releases from DeepSeek so far. The family includes DeepSeek-V4-Pro, a 1.6T-parameter Mixture-of-Experts model with 49B activated parameters, and DeepSeek-V4-Flash, a smaller 284B-parameter MoE model with 13B activated parameters. Both models support a context length of up to one million tokens.
That combination sounds exciting, but it also creates a practical question: can you actually run DeepSeek-V4 locally?
The answer is yes, but with an important caveat. DeepSeek-V4 is not a laptop-sized model. Even the Flash version is a serious multi-GPU deployment. This guide walks through the local setup path using the official DeepSeek model repositories on Hugging Face, explains the hardware you should plan for, and shows how to use the official inference and encoding files correctly.
Reference model pages: the official deepseek-ai/DeepSeek-V4-Pro and deepseek-ai/DeepSeek-V4-Flash repositories on Hugging Face.
DeepSeek-V4-Pro vs DeepSeek-V4-Flash
Before downloading anything, choose the right model variant.
| Model | Total Parameters | Activated Parameters | Context Length | Precision | Best For |
|---|---|---|---|---|---|
| DeepSeek-V4-Flash | 284B | 13B | 1M | FP4 + FP8 mixed | Faster local experiments, lower-cost serving, coding assistants, long-context testing |
| DeepSeek-V4-Pro | 1.6T | 49B | 1M | FP4 + FP8 mixed | Maximum quality, research labs, large GPU clusters, serious reasoning and agentic tasks |
The most important detail is that DeepSeek-V4 uses a Mixture-of-Experts (MoE) architecture. Only part of the model is activated for each token, which reduces compute cost. However, you still need to store and load the model weights. That means GPU memory and storage requirements remain very high.
For most developers, DeepSeek-V4-Flash is the realistic starting point. DeepSeek-V4-Pro is better treated as a cluster-scale deployment.
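The storage implication of the MoE design is easy to see with simple arithmetic. The sketch below uses the parameter counts quoted above; the bytes-per-parameter average is an assumption for a mixed FP4/FP8 checkpoint, not an official figure:

```python
# Back-of-envelope weight footprint (illustrative, not an official figure).
# Parameter counts are from this guide; bytes_per_param is an ASSUMED
# average for a mixed FP4/FP8 checkpoint.

def weight_memory_gb(total_params_billion: float, bytes_per_param: float = 0.75) -> float:
    """Approximate weight storage in GB. Every parameter must be stored,
    even though only a fraction is activated per token.
    (billions of params * bytes/param == GB, since both scale by 1e9)"""
    return total_params_billion * bytes_per_param

print(f"DeepSeek-V4-Flash (284B):  ~{weight_memory_gb(284):.0f} GB")
print(f"DeepSeek-V4-Pro   (1.6T): ~{weight_memory_gb(1600):.0f} GB")
```

Even at a sub-byte average precision, the Flash weights alone run to hundreds of gigabytes, which is why GPU memory and storage dominate the planning below.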
What Makes DeepSeek-V4 Different?
According to DeepSeek's model card, the V4 series introduces several major upgrades:
- Hybrid Attention Architecture: DeepSeek combines Compressed Sparse Attention (CSA) and Heavily Compressed Attention (HCA) to improve long-context efficiency. In the one-million-token setting, DeepSeek-V4-Pro reportedly uses much less KV cache than DeepSeek-V3.2.
- Manifold-Constrained Hyper-Connections (mHC): This improves stability across very deep networks while preserving model capacity.
- Muon Optimizer: DeepSeek uses Muon during training for better convergence and stability.
- Long Context: Both Pro and Flash support up to 1M tokens, with DeepSeek recommending at least 384K context for Think Max mode.
- Multiple Reasoning Modes: DeepSeek-V4 supports Non-think, Think High, and Think Max style usage.
For local deployment, the two most important practical changes are the mixed FP4/FP8 precision and the custom chat encoding format.
Hardware Requirements
DeepSeek-V4 is not designed for consumer GPUs such as the RTX 4090; unless heavily modified community quantizations appear in the future, those cards are not an option. For the official weights, plan around server GPUs.
Practical Hardware Planning
| Use Case | Suggested Hardware | Notes |
|---|---|---|
| DeepSeek-V4-Flash test deployment | 4-8 high-memory NVIDIA GPUs | H100/H200/A100-class GPUs are the practical target |
| DeepSeek-V4-Flash production serving | 8+ high-memory GPUs | More GPUs help throughput and long-context workloads |
| DeepSeek-V4-Pro research deployment | Large multi-node GPU cluster | Treat this as cluster infrastructure, not a single workstation model |
| Think Max with long context | Extra GPU memory and KV cache budget | DeepSeek recommends at least 384K context for Think Max |
Storage Requirements
Plan for large local storage before starting the download:
- Use NVMe SSD storage whenever possible.
- Keep extra space for converted weights.
- Avoid downloading directly to a small system disk.
- Expect the Pro model to require far more storage than Flash.
A safe layout is:
/data/models/deepseek-v4-flash-hf # original Hugging Face files
/data/models/deepseek-v4-flash-infer # converted inference weights
/data/cache/huggingface              # HF cache

If you are renting a cloud GPU server, choose an instance with local NVMe or attach a large high-throughput volume. For VPS-style deployment planning, you can compare GPU or high-memory servers through providers such as LightNode, but make sure the instance actually has the GPU memory required for this class of model.
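Before kicking off a multi-hundred-gigabyte download, it is worth checking free space programmatically. A minimal sketch; the path and the threshold are placeholders you should size for your chosen model:

```python
# Pre-download disk check. Adjust the path and required_gb for your setup.
import shutil

def free_gb(path: str) -> float:
    """Free space in GB on the filesystem containing path."""
    return shutil.disk_usage(path).free / 1e9

def has_headroom(path: str, required_gb: float) -> bool:
    return free_gb(path) >= required_gb

# Example: check the model volume before downloading (path is illustrative).
print(has_headroom("/", 500))
```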
Software Requirements
You need a Linux environment with recent NVIDIA drivers and CUDA.
Recommended baseline:
| Component | Recommendation |
|---|---|
| OS | Ubuntu 22.04 or newer |
| Python | 3.10+ |
| GPU Driver | Recent NVIDIA data center driver |
| CUDA | CUDA 12.x preferred |
| PyTorch | CUDA-enabled build |
| Git LFS | Required for model files |
| Hugging Face CLI | Required for reliable downloads |
Install the basic tools:
sudo apt update
sudo apt install -y git git-lfs python3 python3-venv python3-pip
git lfs install
pip install -U huggingface_hub

If you use a Python virtual environment:
python3 -m venv dsv4-env
source dsv4-env/bin/activate
pip install -U pip wheel setuptools
pip install -U huggingface_hub torch transformers safetensors

Step 1: Download DeepSeek-V4-Flash
For most users, start with Flash:
mkdir -p /data/models
cd /data/models
huggingface-cli download deepseek-ai/DeepSeek-V4-Flash \
--local-dir DeepSeek-V4-Flash \
--local-dir-use-symlinks False

If you want the Pro model instead:
huggingface-cli download deepseek-ai/DeepSeek-V4-Pro \
--local-dir DeepSeek-V4-Pro \
--local-dir-use-symlinks False

If the download is interrupted, simply run the same command again. Hugging Face will resume the download.
Step 2: Inspect the Official Repository Structure
After downloading, check the model folder:
cd /data/models/DeepSeek-V4-Flash
lsThe model card points to two important folders:
inference/- official local inference code, including weight conversion and generation scriptsencoding/- prompt encoding and output parsing utilities for DeepSeek-V4
This matters because DeepSeek-V4 does not ship with a normal Jinja-format chat template. You should not assume that every generic OpenAI-compatible chat wrapper will format prompts correctly out of the box.
Step 3: Convert the Weights for Official Inference
The official inference README uses a conversion step before running generation.
From the model repository:
cd /data/models/DeepSeek-V4-Flash/inference
export HF_CKPT_PATH=/data/models/DeepSeek-V4-Flash
export SAVE_PATH=/data/models/DeepSeek-V4-Flash-infer
export EXPERTS=256
export MP=4
export CONFIG=config.json
python convert.py \
--hf-ckpt-path ${HF_CKPT_PATH} \
--save-path ${SAVE_PATH} \
--n-experts ${EXPERTS} \
--model-parallel ${MP}

Parameter notes:
| Variable | Meaning |
|---|---|
| HF_CKPT_PATH | Path to the original Hugging Face model files |
| SAVE_PATH | Output path for converted inference weights |
| EXPERTS=256 | Number of experts used by the DeepSeek-V4 inference conversion |
| MP=4 | Model parallel size; usually match this to the number of GPUs used for the run |
| CONFIG | Model config file used by the generation script |
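The table pairs EXPERTS=256 with MP=4. A quick sanity check before conversion, under the assumption (ours, not the README's) that experts are sharded evenly across model-parallel ranks:

```python
# Sanity-check the EXPERTS / MP pairing before conversion. ASSUMES an even
# split of experts across model-parallel ranks (illustrative only).

def experts_per_rank(n_experts: int, mp: int) -> int:
    if n_experts % mp != 0:
        raise ValueError(f"{n_experts} experts do not divide evenly over MP={mp}")
    return n_experts // mp

print(experts_per_rank(256, 4))  # 64 experts per rank on a 4-GPU run
print(experts_per_rank(256, 8))  # 32 experts per rank on an 8-GPU node
```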
If you use more GPUs, adjust MP accordingly. For example, on an 8-GPU node:
export MP=8

FP8 Expert Option
The official inference README notes that if you want to use FP8 experts instead of FP4 experts, remove this line from config.json:
"expert_dtype": "fp4"

Then pass --expert-dtype fp8 during conversion:
python convert.py \
--hf-ckpt-path ${HF_CKPT_PATH} \
--save-path ${SAVE_PATH} \
--n-experts ${EXPERTS} \
--model-parallel ${MP} \
--expert-dtype fp8

For most users, start with the default mixed FP4/FP8 setup first. Change precision only after you have a working baseline.
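If you script the FP8 switch, the config edit can be automated. A small sketch; the "expert_dtype" key name comes from the README instruction above, and the backup step is our addition:

```python
# Remove the "expert_dtype" entry from config.json so the FP8 expert path
# can be selected at conversion time (mirrors the manual edit described
# above; keeps a .bak copy first).
import json
import shutil

def drop_expert_dtype(config_path: str) -> bool:
    """Return True if the key was present and removed."""
    shutil.copy(config_path, config_path + ".bak")  # backup before editing
    with open(config_path) as f:
        cfg = json.load(f)
    removed = cfg.pop("expert_dtype", None) is not None
    with open(config_path, "w") as f:
        json.dump(cfg, f, indent=2)
    return removed
```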
Step 4: Start an Interactive Chat
Once conversion finishes, run the official generation script:
cd /data/models/DeepSeek-V4-Flash/inference
export MP=4
export SAVE_PATH=/data/models/DeepSeek-V4-Flash-infer
export CONFIG=config.json
torchrun --nproc-per-node ${MP} generate.py \
--ckpt-path ${SAVE_PATH} \
--config ${CONFIG} \
--interactive

For a batch input file:
torchrun --nproc-per-node ${MP} generate.py \
--ckpt-path ${SAVE_PATH} \
--config ${CONFIG} \
--input-file prompts.txt

For a multi-node run:
torchrun \
--nnodes ${NODES} \
--nproc-per-node $((MP / NODES)) \
--node-rank $RANK \
--master-addr $ADDR \
generate.py \
--ckpt-path ${SAVE_PATH} \
--config ${CONFIG} \
--input-file prompts.txt

Make sure every node can access the converted checkpoint path, or copy the converted files to the same path on each machine.
Step 5: Use the Correct Sampling Settings
DeepSeek recommends the following sampling parameters for local deployment:
temperature = 1.0
top_p = 1.0

If your generation script exposes these as CLI flags, use them directly. If not, set them in the script or config where sampling parameters are defined.
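To see what these values mean, here is a minimal nucleus (top-p) filter in plain Python. With top_p = 1.0 the distribution passes through unchanged, so the recommended settings amount to unmodified sampling. This is an illustration of the technique, not DeepSeek's sampler:

```python
# Minimal top-p (nucleus) filter. top_p=1.0 keeps the full distribution,
# matching the recommended settings above. Illustrative only.

def top_p_filter(probs: list[float], top_p: float = 1.0) -> list[float]:
    """Keep the smallest set of highest-probability tokens whose mass
    reaches top_p, zero out the rest, and renormalize."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, mass = set(), 0.0
    for i in order:
        kept.add(i)
        mass += probs[i]
        if mass >= top_p:
            break
    filtered = [p if i in kept else 0.0 for i, p in enumerate(probs)]
    total = sum(filtered)
    return [p / total for p in filtered]

print(top_p_filter([0.5, 0.3, 0.2], top_p=1.0))  # unchanged
print(top_p_filter([0.5, 0.3, 0.2], top_p=0.7))  # top two kept, renormalized
```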
For Think Max mode, DeepSeek recommends using a context window of at least:
384K tokens

Do not start with a huge context window during your first test. Start small, confirm the model loads and generates correctly, then increase context length gradually while monitoring GPU memory.
Step 6: Understand DeepSeek-V4 Chat Encoding
DeepSeek-V4 does not include a standard Jinja chat template. Instead, the repository provides an encoding/ folder with Python utilities.
The basic usage looks like this:
from encoding_dsv4 import encode_messages, parse_message_from_completion_text
messages = [
{"role": "system", "content": "You are a helpful assistant."},
{"role": "user", "content": "What is 2+2?"},
]
prompt = encode_messages(messages, thinking_mode="thinking")
print(prompt)

For non-thinking chat, use chat mode:
prompt = encode_messages(messages, thinking_mode="chat")For thinking mode, the model uses explicit reasoning delimiters:
<think> ... </think>The parser can convert generated text back into structured assistant messages:
completion = "Simple arithmetic.</think>2 + 2 = 4.<|end▁of▁sentence|>"
parsed = parse_message_from_completion_text(completion, thinking_mode="thinking")
print(parsed)

This is especially important if you want to build your own local API wrapper around DeepSeek-V4.
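The official parser is the one to use in production. Purely to illustrate what it is doing, here is a hypothetical re-implementation of the delimiter-splitting idea; the delimiter and end-of-sentence token names come from the examples above, and the real utilities almost certainly handle more cases:

```python
# Illustrative only: a minimal parser for the <think>...</think> convention.
# The official encoding/ utilities are authoritative; this sketch just shows
# the idea of separating reasoning text from the final answer.

def split_thinking(completion: str) -> dict:
    """Split a completion into reasoning and content parts."""
    text = completion.replace("<|end▁of▁sentence|>", "")
    if "</think>" in text:
        reasoning, content = text.split("</think>", 1)
        reasoning = reasoning.removeprefix("<think>")
        return {"reasoning": reasoning.strip(), "content": content.strip()}
    return {"reasoning": "", "content": text.strip()}

print(split_thinking("Simple arithmetic.</think>2 + 2 = 4.<|end▁of▁sentence|>"))
```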
Reasoning Modes Explained
DeepSeek-V4 supports three practical reasoning styles:
| Mode | Behavior | Use Case |
|---|---|---|
| Non-think | Fast direct answers | Simple Q&A, summarization, routine coding help |
| Think High | Reasoned answers with deliberate analysis | Debugging, planning, math, architecture decisions |
| Think Max | Maximum reasoning effort | Hard coding tasks, agentic workflows, research-level problem solving |
For a local server, you may want to expose these as separate model names, for example:
deepseek-v4-flash-chat
deepseek-v4-flash-thinking
deepseek-v4-flash-max

Internally, each route can use different prompt encoding, context limits, and generation parameters.
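A sketch of such a routing table. The route names are the ones listed above; the context limits are illustrative assumptions, except the 384K figure, which matches the Think Max recommendation, and the mode strings follow the encoding usage shown earlier:

```python
# Hypothetical routing table mapping exposed model names to settings.
# Context limits are illustrative; 384_000 matches the Think Max guidance.
ROUTES = {
    "deepseek-v4-flash-chat":     {"thinking_mode": "chat",     "max_context": 128_000},
    "deepseek-v4-flash-thinking": {"thinking_mode": "thinking", "max_context": 256_000},
    "deepseek-v4-flash-max":      {"thinking_mode": "thinking", "max_context": 384_000},
}

def route(model_name: str) -> dict:
    """Resolve an exposed model name to its serving settings."""
    if model_name not in ROUTES:
        raise KeyError(f"unknown model route: {model_name}")
    return ROUTES[model_name]

print(route("deepseek-v4-flash-max"))
```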
Can You Run DeepSeek-V4 with vLLM or SGLang?
At launch, the safest path is the official DeepSeek inference code in the model repository. Generic serving frameworks may need updates before they fully support DeepSeek-V4's architecture, mixed precision, long-context behavior, and custom encoding.
A practical approach is:
- First, run the official inference/generate.py path successfully.
- Confirm output quality and prompt formatting with the official encoding/ utilities.
- Then check whether your preferred framework has added explicit DeepSeek-V4 support.
- Only migrate to vLLM, SGLang, TensorRT-LLM, or another serving framework after support is confirmed.
This avoids a common failure mode: the model loads, but chat quality is poor because the prompt template is wrong.
Building a Simple Local API Wrapper
If you want an OpenAI-style local endpoint, you can wrap the official generation path with FastAPI. The exact implementation depends on how you integrate generate.py, but the high-level flow is:
- Receive OpenAI-compatible messages.
- Convert them using encoding_dsv4.encode_messages().
- Send the encoded prompt to the DeepSeek-V4 inference engine.
- Parse the output using parse_message_from_completion_text().
- Return an OpenAI-compatible JSON response.
Pseudo-code:
from encoding_dsv4 import encode_messages, parse_message_from_completion_text
messages = [
{"role": "system", "content": "You are a helpful assistant."},
{"role": "user", "content": "Explain KV cache in simple terms."},
]
prompt = encode_messages(messages, thinking_mode="thinking")
# Send prompt to your local DeepSeek-V4 inference worker
raw_completion = run_deepseek_v4(prompt)
assistant_message = parse_message_from_completion_text(
raw_completion,
thinking_mode="thinking",
)
print(assistant_message["content"])

For production, add:
- request queueing
- streaming output
- timeout handling
- GPU health checks
- max context enforcement
- structured logs
- authentication
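One item from the list, max context enforcement, can be sketched in a few lines. The character-based token estimate is a crude stand-in; a real server should count tokens with the model's actual tokenizer:

```python
# Sketch of max-context enforcement. The chars-per-token ratio is a rough
# ASSUMPTION; use the real tokenizer in production.

def estimate_tokens(text: str, chars_per_token: float = 4.0) -> int:
    """Crude token estimate from character count."""
    return max(1, round(len(text) / chars_per_token))

def check_context(prompt: str, max_new_tokens: int, max_context: int = 128_000) -> None:
    """Reject requests whose prompt plus generation budget exceeds the limit."""
    total = estimate_tokens(prompt) + max_new_tokens
    if total > max_context:
        raise ValueError(f"request needs ~{total} tokens, limit is {max_context}")

check_context("Explain KV cache in simple terms.", max_new_tokens=512)  # passes
```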
Troubleshooting
1. CUDA Out of Memory
Reduce memory pressure by:
- lowering context length
- reducing batch size
- increasing tensor/model parallel size
- using more GPUs
- starting with DeepSeek-V4-Flash instead of Pro
Long context is usually the first thing to reduce during debugging.
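To see why context length dominates, here is the standard dense-attention KV-cache estimate. Every number in this sketch is a placeholder, not DeepSeek-V4's real dimensions, and V4's compressed attention is reported to need far less; the point is only how linearly the cache grows with context:

```python
# Standard dense-attention KV-cache estimate. All architecture numbers are
# PLACEHOLDERS, not DeepSeek-V4's real dimensions; V4's hybrid attention
# is reported to use much less. Shown only to explain the scaling.

def kv_cache_gb(context: int, layers: int, kv_heads: int,
                head_dim: int, bytes_per_value: int = 2) -> float:
    # Factor of 2 covers keys and values.
    return 2 * layers * kv_heads * head_dim * context * bytes_per_value / 1e9

# Placeholder config: 60 layers, 8 KV heads, head_dim 128, FP16 cache.
print(f"128K ctx: ~{kv_cache_gb(128_000, 60, 8, 128):.1f} GB")
print(f"1M ctx:   ~{kv_cache_gb(1_000_000, 60, 8, 128):.1f} GB")
```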
2. Download Fails or Hangs
Use huggingface-cli download instead of browser downloads. Re-run the same command to resume.
You can also set a dedicated cache directory:
export HF_HOME=/data/cache/huggingface
export HUGGINGFACE_HUB_CACHE=/data/cache/huggingface/hub

3. The Model Generates Strange Chat Output
Check prompt formatting. DeepSeek-V4 does not use a standard Jinja chat template. Use the official encoding/ implementation.
4. Multi-GPU Run Fails
Verify that PyTorch can see all GPUs:
python - <<'PY'
import torch
print(torch.cuda.device_count())
for i in range(torch.cuda.device_count()):
print(i, torch.cuda.get_device_name(i))
PY

Also check NCCL networking for multi-node runs:
export NCCL_DEBUG=INFO

5. Think Max Is Too Slow
Think Max is designed to spend more compute on difficult reasoning. Use it only for tasks that justify the cost. For normal assistant usage, Non-think or Think High is usually more practical.
Recommended Deployment Strategy
If you are deploying DeepSeek-V4 locally for the first time, follow this sequence:
- Start with DeepSeek-V4-Flash.
- Use the official inference code.
- Use a small test context first.
- Confirm the official encoding works.
- Increase context length gradually.
- Add an API wrapper only after local generation is stable.
- Consider Pro only when you have cluster-scale GPU resources.
Final Thoughts
DeepSeek-V4 is powerful, but it is not a casual local model. The Flash version is the practical entry point, while Pro belongs in serious multi-GPU or multi-node environments. The key to a successful setup is to respect the official workflow: download the Hugging Face repository, convert the weights with the provided inference tools, run generation with torchrun, and use the dedicated DeepSeek-V4 encoding utilities instead of assuming a generic chat template.
If you only need to experiment with prompts, the hosted DeepSeek chat service or API routes may be easier. But if you need data privacy, full control, no per-token billing, or custom infrastructure, running DeepSeek-V4 locally gives you a strong foundation for building private long-context AI systems.