How to Run DeepSeek-V4 Locally: Pro and Flash Setup Guide
DeepSeek-V4 is one of the most ambitious open-weight model releases from DeepSeek so far. The family includes DeepSeek-V4-Pro, a 1.6T-parameter Mixture-of-Experts model with 49B activated parameters, and DeepSeek-V4-Flash, a smaller 284B-parameter MoE model with 13B activated parameters. Both models support a context length of up to one million tokens.
That combination sounds exciting, but it also creates a practical question: can you actually run DeepSeek-V4 locally?
The answer is yes, but with an important caveat. DeepSeek-V4 is not a laptop-sized model. Even the Flash version is a serious multi-GPU deployment. This guide walks through the local setup path using the official DeepSeek model repositories on Hugging Face, explains the hardware you should plan for, and shows how to use the official inference and encoding files correctly.
Reference model pages: the official deepseek-ai/DeepSeek-V4-Pro and deepseek-ai/DeepSeek-V4-Flash repositories on Hugging Face.
DeepSeek-V4-Pro vs DeepSeek-V4-Flash
Before downloading anything, choose the right model variant.
| Model | Total Parameters | Activated Parameters | Context Length | Precision | Best For |
|---|---|---|---|---|---|
| DeepSeek-V4-Flash | 284B | 13B | 1M | FP4 + FP8 mixed | Faster local experiments, lower-cost serving, coding assistants, long-context testing |
| DeepSeek-V4-Pro | 1.6T | 49B | 1M | FP4 + FP8 mixed | Maximum quality, research labs, large GPU clusters, serious reasoning and agentic tasks |
The most important detail is that DeepSeek-V4 uses a Mixture-of-Experts (MoE) architecture. Only part of the model is activated for each token, which reduces compute cost. However, you still need to store and load the model weights. That means GPU memory and storage requirements remain very high.
For most developers, DeepSeek-V4-Flash is the realistic starting point. DeepSeek-V4-Pro is better treated as a cluster-scale deployment.
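The storage implication of the MoE design is easy to see with simple arithmetic. The sketch below uses the parameter counts quoted above; the bytes-per-parameter average is an assumption for a mixed FP4/FP8 checkpoint, not an official figure:

```python
# Back-of-envelope weight footprint (illustrative, not an official figure).
# Parameter counts are from this guide; bytes_per_param is an ASSUMED
# average for a mixed FP4/FP8 checkpoint.

def weight_memory_gb(total_params_billion: float, bytes_per_param: float = 0.75) -> float:
    """Approximate weight storage in GB. Every parameter must be stored,
    even though only a fraction is activated per token.
    (billions of params * bytes/param == GB, since both scale by 1e9)"""
    return total_params_billion * bytes_per_param

print(f"DeepSeek-V4-Flash (284B):  ~{weight_memory_gb(284):.0f} GB")
print(f"DeepSeek-V4-Pro   (1.6T): ~{weight_memory_gb(1600):.0f} GB")
```

Even at a sub-byte average precision, the Flash weights alone run to hundreds of gigabytes, which is why GPU memory and storage dominate the planning below.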
What Makes DeepSeek-V4 Different?
According to DeepSeek's model card, the V4 series introduces several major upgrades:
- Hybrid Attention Architecture: DeepSeek combines Compressed Sparse Attention (CSA) and Heavily Compressed Attention (HCA) to improve long-context efficiency. In the one-million-token setting, DeepSeek-V4-Pro reportedly uses much less KV cache than DeepSeek-V3.2.
- Manifold-Constrained Hyper-Connections (mHC): This improves stability across very deep networks while preserving model capacity.
- Muon Optimizer: DeepSeek uses Muon during training for better convergence and stability.
- Long Context: Both Pro and Flash support up to 1M tokens, with DeepSeek recommending at least 384K context for Think Max mode.
- Multiple Reasoning Modes: DeepSeek-V4 supports Non-think, Think High, and Think Max style usage.
For local deployment, the two most important practical changes are the mixed FP4/FP8 precision and the custom chat encoding format.
Hardware Requirements
DeepSeek-V4 is not designed for consumer GPUs such as the RTX 4090; unless heavily modified community quantizations appear in the future, those cards are not an option. For the official weights, plan around server GPUs.
Practical Hardware Planning
| Use Case | Suggested Hardware | Notes |
|---|---|---|
| DeepSeek-V4-Flash test deployment | 4-8 high-memory NVIDIA GPUs | H100/H200/A100-class GPUs are the practical target |
| DeepSeek-V4-Flash production serving | 8+ high-memory GPUs | More GPUs help throughput and long-context workloads |
| DeepSeek-V4-Pro research deployment | Large multi-node GPU cluster | Treat this as cluster infrastructure, not a single workstation model |
| Think Max with long context | Extra GPU memory and KV cache budget | DeepSeek recommends at least 384K context for Think Max |
Storage Requirements
Plan for large local storage before starting the download:
- Use NVMe SSD storage whenever possible.
- Keep extra space for converted weights.
- Avoid downloading directly to a small system disk.
- Expect the Pro model to require far more storage than Flash.
A safe layout is:
/data/models/deepseek-v4-flash-hf # original Hugging Face files
/data/models/deepseek-v4-flash-infer # converted inference weights
/data/cache/huggingface              # HF cache

If you are renting a cloud GPU server, choose an instance with local NVMe or attach a large high-throughput volume. For VPS-style deployment planning, you can compare GPU or high-memory servers through providers such as LightNode, but make sure the instance actually has the GPU memory required for this class of model.
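Before kicking off a multi-hundred-gigabyte download, it is worth checking free space programmatically. A minimal sketch; the path and the threshold are placeholders you should size for your chosen model:

```python
# Pre-download disk check. Adjust the path and required_gb for your setup.
import shutil

def free_gb(path: str) -> float:
    """Free space in GB on the filesystem containing path."""
    return shutil.disk_usage(path).free / 1e9

def has_headroom(path: str, required_gb: float) -> bool:
    return free_gb(path) >= required_gb

# Example: check the model volume before downloading (path is illustrative).
print(has_headroom("/", 500))
```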
Software Requirements
You need a Linux environment with recent NVIDIA drivers and CUDA.
Recommended baseline:
| Component | Recommendation |
|---|---|
| OS | Ubuntu 22.04 or newer |
| Python | 3.10+ |
| GPU Driver | Recent NVIDIA data center driver |
| CUDA | CUDA 12.x preferred |
| PyTorch | CUDA-enabled build |
| Git LFS | Required for model files |
| Hugging Face CLI | Required for reliable downloads |
Install the basic tools:
sudo apt update
sudo apt install -y git git-lfs python3 python3-venv python3-pip
git lfs install
pip install -U huggingface_hub

If you use a Python virtual environment:
python3 -m venv dsv4-env
source dsv4-env/bin/activate
pip install -U pip wheel setuptools
pip install -U huggingface_hub torch transformers safetensors

Step 1: Download DeepSeek-V4-Flash
For most users, start with Flash:
mkdir -p /data/models
cd /data/models
huggingface-cli download deepseek-ai/DeepSeek-V4-Flash \
--local-dir DeepSeek-V4-Flash \
--local-dir-use-symlinks False

If you want the Pro model instead:
huggingface-cli download deepseek-ai/DeepSeek-V4-Pro \
--local-dir DeepSeek-V4-Pro \
--local-dir-use-symlinks False

If the download is interrupted, simply run the same command again. Hugging Face will resume the download.
Step 2: Inspect the Official Repository Structure
After downloading, check the model folder:
cd /data/models/DeepSeek-V4-Flash
lsThe model card points to two important folders:
inference/- official local inference code, including weight conversion and generation scriptsencoding/- prompt encoding and output parsing utilities for DeepSeek-V4
This matters because DeepSeek-V4 does not ship with a normal Jinja-format chat template. You should not assume that every generic OpenAI-compatible chat wrapper will format prompts correctly out of the box.
Step 3: Convert the Weights for Official Inference
The official inference README uses a conversion step before running generation.
From the model repository:
cd /data/models/DeepSeek-V4-Flash/inference
export HF_CKPT_PATH=/data/models/DeepSeek-V4-Flash
export SAVE_PATH=/data/models/DeepSeek-V4-Flash-infer
export EXPERTS=256
export MP=4
export CONFIG=config.json
python convert.py \
--hf-ckpt-path ${HF_CKPT_PATH} \
--save-path ${SAVE_PATH} \
--n-experts ${EXPERTS} \
--model-parallel ${MP}

Parameter notes:
| Variable | Meaning |
|---|---|
| HF_CKPT_PATH | Path to the original Hugging Face model files |
| SAVE_PATH | Output path for converted inference weights |
| EXPERTS=256 | Number of experts used by the DeepSeek-V4 inference conversion |
| MP=4 | Model parallel size; usually match this to the number of GPUs used for the run |
| CONFIG | Model config file used by the generation script |
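The table pairs EXPERTS=256 with MP=4. A quick sanity check before conversion, under the assumption (ours, not the README's) that experts are sharded evenly across model-parallel ranks:

```python
# Sanity-check the EXPERTS / MP pairing before conversion. ASSUMES an even
# split of experts across model-parallel ranks (illustrative only).

def experts_per_rank(n_experts: int, mp: int) -> int:
    if n_experts % mp != 0:
        raise ValueError(f"{n_experts} experts do not divide evenly over MP={mp}")
    return n_experts // mp

print(experts_per_rank(256, 4))  # 64 experts per rank on a 4-GPU run
print(experts_per_rank(256, 8))  # 32 experts per rank on an 8-GPU node
```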
If you use more GPUs, adjust MP accordingly. For example, on an 8-GPU node:
export MP=8

FP8 Expert Option
The official inference README notes that if you want to use FP8 experts instead of FP4 experts, remove this line from config.json:
"expert_dtype": "fp4"

Then pass --expert-dtype fp8 during conversion:
python convert.py \
--hf-ckpt-path ${HF_CKPT_PATH} \
--save-path ${SAVE_PATH} \
--n-experts ${EXPERTS} \
--model-parallel ${MP} \
--expert-dtype fp8

For most users, start with the default mixed FP4/FP8 setup first. Change precision only after you have a working baseline.
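If you script the FP8 switch, the config edit can be automated. A small sketch; the "expert_dtype" key name comes from the README instruction above, and the backup step is our addition:

```python
# Remove the "expert_dtype" entry from config.json so the FP8 expert path
# can be selected at conversion time (mirrors the manual edit described
# above; keeps a .bak copy first).
import json
import shutil

def drop_expert_dtype(config_path: str) -> bool:
    """Return True if the key was present and removed."""
    shutil.copy(config_path, config_path + ".bak")  # backup before editing
    with open(config_path) as f:
        cfg = json.load(f)
    removed = cfg.pop("expert_dtype", None) is not None
    with open(config_path, "w") as f:
        json.dump(cfg, f, indent=2)
    return removed
```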
Step 4: Start an Interactive Chat
Once conversion finishes, run the official generation script:
cd /data/models/DeepSeek-V4-Flash/inference
export MP=4
export SAVE_PATH=/data/models/DeepSeek-V4-Flash-infer
export CONFIG=config.json
torchrun --nproc-per-node ${MP} generate.py \
--ckpt-path ${SAVE_PATH} \
--config ${CONFIG} \
--interactive

For a batch input file:
torchrun --nproc-per-node ${MP} generate.py \
--ckpt-path ${SAVE_PATH} \
--config ${CONFIG} \
--input-file prompts.txt

For a multi-node run:
torchrun \
--nnodes ${NODES} \
--nproc-per-node $((MP / NODES)) \
--node-rank $RANK \
--master-addr $ADDR \
generate.py \
--ckpt-path ${SAVE_PATH} \
--config ${CONFIG} \
--input-file prompts.txt

Make sure every node can access the converted checkpoint path, or copy the converted files to the same path on each machine.
Step 5: Use the Correct Sampling Settings
DeepSeek recommends the following sampling parameters for local deployment:
temperature = 1.0
top_p = 1.0

If your generation script exposes these as CLI flags, use them directly. If not, set them in the script or config where sampling parameters are defined.
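To see what these values mean, here is a minimal nucleus (top-p) filter in plain Python. With top_p = 1.0 the distribution passes through unchanged, so the recommended settings amount to unmodified sampling. This is an illustration of the technique, not DeepSeek's sampler:

```python
# Minimal top-p (nucleus) filter. top_p=1.0 keeps the full distribution,
# matching the recommended settings above. Illustrative only.

def top_p_filter(probs: list[float], top_p: float = 1.0) -> list[float]:
    """Keep the smallest set of highest-probability tokens whose mass
    reaches top_p, zero out the rest, and renormalize."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, mass = set(), 0.0
    for i in order:
        kept.add(i)
        mass += probs[i]
        if mass >= top_p:
            break
    filtered = [p if i in kept else 0.0 for i, p in enumerate(probs)]
    total = sum(filtered)
    return [p / total for p in filtered]

print(top_p_filter([0.5, 0.3, 0.2], top_p=1.0))  # unchanged
print(top_p_filter([0.5, 0.3, 0.2], top_p=0.7))  # top two kept, renormalized
```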
For Think Max mode, DeepSeek recommends using a context window of at least:
384K tokens

Do not start with a huge context window during your first test. Start small, confirm the model loads and generates correctly, then increase context length gradually while monitoring GPU memory.
Step 6: Understand DeepSeek-V4 Chat Encoding
DeepSeek-V4 does not include a standard Jinja chat template. Instead, the repository provides an encoding/ folder with Python utilities.
The basic usage looks like this:
from encoding_dsv4 import encode_messages, parse_message_from_completion_text
messages = [
{"role": "system", "content": "You are a helpful assistant."},
{"role": "user", "content": "What is 2+2?"},
]
prompt = encode_messages(messages, thinking_mode="thinking")
print(prompt)

For non-thinking chat, use chat mode:
prompt = encode_messages(messages, thinking_mode="chat")For thinking mode, the model uses explicit reasoning delimiters:
<think> ... </think>The parser can convert generated text back into structured assistant messages:
completion = "Simple arithmetic.</think>2 + 2 = 4.<|end▁of▁sentence|>"
parsed = parse_message_from_completion_text(completion, thinking_mode="thinking")
print(parsed)

This is especially important if you want to build your own local API wrapper around DeepSeek-V4.
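The official parser is the one to use in production. Purely to illustrate what it is doing, here is a hypothetical re-implementation of the delimiter-splitting idea; the delimiter and end-of-sentence token names come from the examples above, and the real utilities almost certainly handle more cases:

```python
# Illustrative only: a minimal parser for the <think>...</think> convention.
# The official encoding/ utilities are authoritative; this sketch just shows
# the idea of separating reasoning text from the final answer.

def split_thinking(completion: str) -> dict:
    """Split a completion into reasoning and content parts."""
    text = completion.replace("<|end▁of▁sentence|>", "")
    if "</think>" in text:
        reasoning, content = text.split("</think>", 1)
        reasoning = reasoning.removeprefix("<think>")
        return {"reasoning": reasoning.strip(), "content": content.strip()}
    return {"reasoning": "", "content": text.strip()}

print(split_thinking("Simple arithmetic.</think>2 + 2 = 4.<|end▁of▁sentence|>"))
```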
Reasoning Modes Explained
DeepSeek-V4 supports three practical reasoning styles:
| Mode | Behavior | Use Case |
|---|---|---|
| Non-think | Fast direct answers | Simple Q&A, summarization, routine coding help |
| Think High | Reasoned answers with deliberate analysis | Debugging, planning, math, architecture decisions |
| Think Max | Maximum reasoning effort | Hard coding tasks, agentic workflows, research-level problem solving |
For a local server, you may want to expose these as separate model names, for example:
deepseek-v4-flash-chat
deepseek-v4-flash-thinking
deepseek-v4-flash-max

Internally, each route can use different prompt encoding, context limits, and generation parameters.
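A sketch of such a routing table. The route names are the ones listed above; the context limits are illustrative assumptions, except the 384K figure, which matches the Think Max recommendation, and the mode strings follow the encoding usage shown earlier:

```python
# Hypothetical routing table mapping exposed model names to settings.
# Context limits are illustrative; 384_000 matches the Think Max guidance.
ROUTES = {
    "deepseek-v4-flash-chat":     {"thinking_mode": "chat",     "max_context": 128_000},
    "deepseek-v4-flash-thinking": {"thinking_mode": "thinking", "max_context": 256_000},
    "deepseek-v4-flash-max":      {"thinking_mode": "thinking", "max_context": 384_000},
}

def route(model_name: str) -> dict:
    """Resolve an exposed model name to its serving settings."""
    if model_name not in ROUTES:
        raise KeyError(f"unknown model route: {model_name}")
    return ROUTES[model_name]

print(route("deepseek-v4-flash-max"))
```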
Can You Run DeepSeek-V4 with vLLM or SGLang?
At launch, the safest path is the official DeepSeek inference code in the model repository. Generic serving frameworks may need updates before they fully support DeepSeek-V4's architecture, mixed precision, long-context behavior, and custom encoding.
A practical approach is:
- First, run the official inference/generate.py path successfully.
- Confirm output quality and prompt formatting with the official encoding/ utilities.
- Then check whether your preferred framework has added explicit DeepSeek-V4 support.
- Only migrate to vLLM, SGLang, TensorRT-LLM, or another serving framework after support is confirmed.
This avoids a common failure mode: the model loads, but chat quality is poor because the prompt template is wrong.
Building a Simple Local API Wrapper
If you want an OpenAI-style local endpoint, you can wrap the official generation path with FastAPI. The exact implementation depends on how you integrate generate.py, but the high-level flow is:
- Receive OpenAI-compatible messages.
- Convert them using encoding_dsv4.encode_messages().
- Send the encoded prompt to the DeepSeek-V4 inference engine.
- Parse the output using parse_message_from_completion_text().
- Return an OpenAI-compatible JSON response.
Pseudo-code:
from encoding_dsv4 import encode_messages, parse_message_from_completion_text
messages = [
{"role": "system", "content": "You are a helpful assistant."},
{"role": "user", "content": "Explain KV cache in simple terms."},
]
prompt = encode_messages(messages, thinking_mode="thinking")
# Send prompt to your local DeepSeek-V4 inference worker
raw_completion = run_deepseek_v4(prompt)
assistant_message = parse_message_from_completion_text(
raw_completion,
thinking_mode="thinking",
)
print(assistant_message["content"])

For production, add:
- request queueing
- streaming output
- timeout handling
- GPU health checks
- max context enforcement
- structured logs
- authentication
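One item from the list, max context enforcement, can be sketched in a few lines. The character-based token estimate is a crude stand-in; a real server should count tokens with the model's actual tokenizer:

```python
# Sketch of max-context enforcement. The chars-per-token ratio is a rough
# ASSUMPTION; use the real tokenizer in production.

def estimate_tokens(text: str, chars_per_token: float = 4.0) -> int:
    """Crude token estimate from character count."""
    return max(1, round(len(text) / chars_per_token))

def check_context(prompt: str, max_new_tokens: int, max_context: int = 128_000) -> None:
    """Reject requests whose prompt plus generation budget exceeds the limit."""
    total = estimate_tokens(prompt) + max_new_tokens
    if total > max_context:
        raise ValueError(f"request needs ~{total} tokens, limit is {max_context}")

check_context("Explain KV cache in simple terms.", max_new_tokens=512)  # passes
```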
Troubleshooting
1. CUDA Out of Memory
Reduce memory pressure by:
- lowering context length
- reducing batch size
- increasing tensor/model parallel size
- using more GPUs
- starting with DeepSeek-V4-Flash instead of Pro
Long context is usually the first thing to reduce during debugging.
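To see why context length dominates, here is the standard dense-attention KV-cache estimate. Every number in this sketch is a placeholder, not DeepSeek-V4's real dimensions, and V4's compressed attention is reported to need far less; the point is only how linearly the cache grows with context:

```python
# Standard dense-attention KV-cache estimate. All architecture numbers are
# PLACEHOLDERS, not DeepSeek-V4's real dimensions; V4's hybrid attention
# is reported to use much less. Shown only to explain the scaling.

def kv_cache_gb(context: int, layers: int, kv_heads: int,
                head_dim: int, bytes_per_value: int = 2) -> float:
    # Factor of 2 covers keys and values.
    return 2 * layers * kv_heads * head_dim * context * bytes_per_value / 1e9

# Placeholder config: 60 layers, 8 KV heads, head_dim 128, FP16 cache.
print(f"128K ctx: ~{kv_cache_gb(128_000, 60, 8, 128):.1f} GB")
print(f"1M ctx:   ~{kv_cache_gb(1_000_000, 60, 8, 128):.1f} GB")
```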
2. Download Fails or Hangs
Use huggingface-cli download instead of browser downloads. Re-run the same command to resume.
You can also set a dedicated cache directory:
export HF_HOME=/data/cache/huggingface
export HUGGINGFACE_HUB_CACHE=/data/cache/huggingface/hub

3. The Model Generates Strange Chat Output
Check prompt formatting. DeepSeek-V4 does not use a standard Jinja chat template. Use the official encoding/ implementation.
4. Multi-GPU Run Fails
Verify that PyTorch can see all GPUs:
python - <<'PY'
import torch
print(torch.cuda.device_count())
for i in range(torch.cuda.device_count()):
print(i, torch.cuda.get_device_name(i))
PY

Also check NCCL networking for multi-node runs:
export NCCL_DEBUG=INFO

5. Think Max Is Too Slow
Think Max is designed to spend more compute on difficult reasoning. Use it only for tasks that justify the cost. For normal assistant usage, Non-think or Think High is usually more practical.
Recommended Deployment Strategy
If you are deploying DeepSeek-V4 locally for the first time, follow this sequence:
- Start with DeepSeek-V4-Flash.
- Use the official inference code.
- Use a small test context first.
- Confirm the official encoding works.
- Increase context length gradually.
- Add an API wrapper only after local generation is stable.
- Consider Pro only when you have cluster-scale GPU resources.
Final Thoughts
DeepSeek-V4 is powerful, but it is not a casual local model. The Flash version is the practical entry point, while Pro belongs in serious multi-GPU or multi-node environments. The key to a successful setup is to respect the official workflow: download the Hugging Face repository, convert the weights with the provided inference tools, run generation with torchrun, and use the dedicated DeepSeek-V4 encoding utilities instead of assuming a generic chat template.
If you only need to experiment with prompts, the hosted DeepSeek chat service or API routes may be easier. But if you need data privacy, full control, no per-token billing, or custom infrastructure, running DeepSeek-V4 locally gives you a strong foundation for building private long-context AI systems.