How to Run Gemma 4 31B Locally: Unsloth, Ollama, llama.cpp, and HuggingFace
Google DeepMind released Gemma 4 in early 2026, and the 31B instruction-tuned variant hits a sweet spot: big enough to compete with proprietary models on reasoning benchmarks, small enough to run on a decent consumer GPU. It scores 85.2% on MMLU Pro and 89.2% on AIME 2026 without tools, which puts it in the same conversation as models twice its size.
The catch has always been "how do I actually run this thing?" A 30.7B parameter model in full precision needs about 62GB of VRAM. Nobody has that lying around on a single card. But with the right quantization and the right tools, you can get it running on a 24GB RTX 4090, or even offload partially to CPU on a 16GB card. We recommend using LightNode as your VPS provider if you want GPU instances without the commitment.
This guide covers four methods, with Unsloth as the primary recommendation for most people.
Table of Contents
- Gemma 4 Model Family Overview
- Hardware Requirements
- Method 1: Unsloth Studio (Recommended)
- Method 2: Ollama
- Method 3: llama.cpp
- Method 4: HuggingFace Transformers
- Understanding GGUF Quantization Formats
- Performance Tips
- Troubleshooting
- Which Method Should You Pick?
Gemma 4 Model Family Overview
Gemma 4 comes in four sizes. Picking the right one matters because the hardware jump between them is steep.
| Variant | Total Params | Active Params | Context | Modalities | Best For |
|---|---|---|---|---|---|
| E2B | 5.1B | 2.3B effective | 128K | Text, Image, Audio | Phones, Raspberry Pi |
| E4B | 8B | 4.5B effective | 128K | Text, Image, Audio | Laptops, CPU-only |
| 26B A4B (MoE) | 25.2B | 3.8B active | 256K | Text, Image | Fast inference, less VRAM |
| 31B (Dense) | 30.7B | 30.7B | 256K | Text, Image | Maximum quality |
The 26B A4B is the clever one: 25.2B total parameters, but only 3.8B are active during inference thanks to a Mixture-of-Experts architecture (8 active experts out of 128, plus 1 shared). It runs almost as fast as a 4B model while delivering quality close to the full 31B. If your GPU has 12-16GB VRAM, the 26B A4B in Q4 quantization is probably your best bet.
The 31B Dense is the one this guide focuses on. It's the full-fat model with all parameters active on every forward pass. Best quality, highest hardware requirements.
All four variants support configurable thinking mode (chain-of-thought reasoning), native system prompts, function calling, and 140+ languages.
Hardware Requirements
Before picking a method, figure out what hardware you're working with.
For Gemma 4 31B-it
| Quantization | VRAM Needed | Quality Loss | Typical Hardware |
|---|---|---|---|
| FP16 (full precision) | ~62 GB | None | A100, multiple GPUs |
| Q8_0 (8-bit) | ~32 GB | Negligible | RTX 4090 (24GB) + CPU offload |
| Q5_K_M (5-bit) | ~22 GB | Minimal | RTX 4090, RTX 3090 |
| Q4_K_M (4-bit) | ~18 GB | Small | RTX 4080, RTX 3090 |
| Q3_K_M (3-bit) | ~14 GB | Noticeable | RTX 4070, partial offload |
For Gemma 4 26B A4B (MoE)
| Quantization | VRAM Needed | Quality Loss | Typical Hardware |
|---|---|---|---|
| Q5_K_M | ~14 GB | Minimal | RTX 4070 Ti |
| Q4_K_M | ~10 GB | Small | RTX 4070, RTX 3080 |
| Q3_K_M | ~8 GB | Noticeable | RTX 4060 Ti 8GB |
If you're on CPU only, the E4B or E2B variants will run comfortably. The 31B on CPU is technically possible but painfully slow (expect 1-3 tokens/second on a modern CPU).
RAM requirement: Add 8-16GB of system RAM on top of VRAM for the runtime overhead, more if you're offloading layers to CPU.
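The tables above can be collapsed into a quick "what fits on my card" check. This is a minimal sketch using the approximate 31B sizes listed above; the 2GB headroom default is an assumption to cover the KV cache and runtime overhead, not a measured value:

```python
# Approximate VRAM footprint for each Gemma 4 31B quantization, taken
# from the hardware requirements table above.
QUANT_SIZES_GB = {
    "Q8_0": 32,
    "Q6_K": 26,
    "Q5_K_M": 22,
    "Q4_K_M": 18,
    "Q3_K_M": 14,
}

def pick_quant(vram_gb, headroom_gb=2.0):
    """Return the largest quantization that fits in vram_gb, leaving
    headroom_gb free for the KV cache and runtime overhead. Returns
    None when nothing fits fully (consider CPU offload or the MoE)."""
    for name, size in QUANT_SIZES_GB.items():  # dict preserves order, largest first
        if size + headroom_gb <= vram_gb:
            return name
    return None

print(pick_quant(24))  # Q5_K_M on a 24GB card
print(pick_quant(16))  # Q3_K_M on a 16GB card
```

The result matches the table's recommendations: a 24GB RTX 4090 lands on Q5_K_M, a 16GB card on Q3_K_M, and anything below 16GB should look at partial offload or the 26B A4B variant instead.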
Method 1: Unsloth Studio (Recommended)
Unsloth is the easiest way to run Gemma 4 locally. It's a web UI that handles model downloading, quantization selection, and inference in one package. Works on Windows, Linux, macOS, and WSL.
Why Unsloth
- One-click model search and download from HuggingFace
- Automatically picks the right GGUF quantization for your hardware
- Built-in chat interface with support for images, PDFs, and documents
- Tool calling and web search built in
- Code execution sandbox
- No command-line wrangling
Installation
macOS, Linux, WSL:
```
curl -fsSL https://unsloth.ai/install.sh | sh
```
Windows (PowerShell):
```
irm https://unsloth.ai/install.ps1 | iex
```
Docker:
```
docker run -d -e JUPYTER_PASSWORD="mypassword" \
  -p 8888:8888 -p 8000:8000 -p 2222:22 \
  -v $(pwd)/work:/workspace/work \
  --gpus all \
  unsloth/unsloth
```
Launching
```
unsloth studio -H 0.0.0.0 -p 8888
```
Open http://localhost:8888 in your browser. You'll see the Unsloth Studio interface.
Running Gemma 4 31B
- Search for the model: In the model search bar, type `gemma-4-31B`
- Pick a quantization: Unsloth hosts pre-quantized GGUF files. For a 24GB GPU, select `Q4_K_M` or `Q5_K_M`. For 16GB, go with `Q3_K_M`
- Download: Click download. The Q4_K_M variant is about 18GB
- Start chatting: Once downloaded, the model loads into the chat interface automatically
Unsloth provides these GGUF variants for Gemma 4 31B-it:
| File | Size | Quantization |
|---|---|---|
| gemma-4-31B-it-Q3_K_M.gguf | ~14 GB | 3-bit (balanced) |
| gemma-4-31B-it-Q4_K_M.gguf | ~18 GB | 4-bit (recommended) |
| gemma-4-31B-it-Q5_K_M.gguf | ~22 GB | 5-bit (high quality) |
| gemma-4-31B-it-Q6_K.gguf | ~26 GB | 6-bit (near-lossless) |
| gemma-4-31B-it-Q8_0.gguf | ~32 GB | 8-bit (virtually lossless) |
The HuggingFace repo is at unsloth/gemma-4-31B-it-GGUF.
Using the Chat Interface
Unsloth Studio's chat supports:
- Text conversations with thinking mode toggle
- Image uploads: Drag and drop images for visual question answering
- PDF/DOCX uploads: Extract and discuss document contents
- Code execution: The model can write and test code in a sandbox
- Custom system prompts: Set behavior and persona
To enable Gemma 4's thinking mode, toggle the "Thinking" option in the chat settings. This activates chain-of-thought reasoning, where the model works through problems step-by-step before giving its final answer.
Fine-Tuning with Unsloth
If you want to go beyond inference, Unsloth also handles training:
- LoRA fine-tuning: Train adapters with up to 70% less VRAM
- GRPO reinforcement learning: The most efficient RL library available
- Data Recipes: Auto-create training datasets from PDFs, CSVs, DOCX files
- Multi-GPU support: Available now with improvements coming
For Gemma 4 31B fine-tuning, you'll need at least one 24GB GPU with QLoRA (4-bit quantized training).
Updating Unsloth
Run the same install command again:
```
# macOS/Linux/WSL
curl -fsSL https://unsloth.ai/install.sh | sh

# Windows
irm https://unsloth.ai/install.ps1 | iex
```
Method 2: Ollama
Ollama is the fastest way to get running if you prefer the command line. It handles model downloads, GPU detection, and serving automatically.
Installation
```
# macOS
brew install ollama

# Linux
curl -fsSL https://ollama.com/install.sh | sh

# Windows
# Download from https://ollama.com/download
```
Running Gemma 4
```
# Pull and run the 31B model
ollama run gemma4:31b-it

# Or the smaller MoE variant for less VRAM
ollama run gemma4:26b-a4b-it

# Or the smaller dense models
ollama run gemma4:e4b-it
ollama run gemma4:e2b-it
```
Ollama pulls the Q4_K_M quantization by default. If you want a different quantization:
```
# Run with specific quantization
ollama run gemma4:31b-it-q5_K_M
```
Using the API
Ollama exposes a local API on port 11434:
```
import requests

response = requests.post('http://localhost:11434/api/chat', json={
    "model": "gemma4:31b-it",
    "messages": [
        {"role": "system", "content": "You are a helpful coding assistant."},
        {"role": "user", "content": "Write a Python function to merge two sorted lists."}
    ],
    "stream": False
})
print(response.json()['message']['content'])
```
Ollama Pros and Cons
Pros: Zero configuration, automatic GPU detection, clean CLI, API server included, simple model management.
Cons: Fewer quantization options than llama.cpp, no built-in image support for all models (check current compatibility), less control over inference parameters.
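The API example above sets `"stream": False`. For token-by-token streaming, Ollama instead returns one JSON object per line (NDJSON), each carrying a partial `message.content` chunk. Here is a small helper that reassembles streamed chunks, demonstrated against canned data so it runs without a server:

```python
import json

def collect_stream(ndjson_lines):
    """Reassemble a streamed Ollama /api/chat response.

    Ollama streams one JSON object per line; each carries a partial
    "message.content" chunk, and the final object has "done": true.
    """
    parts = []
    for line in ndjson_lines:
        if not line or not str(line).strip():
            continue  # skip keep-alive blank lines
        chunk = json.loads(line)
        parts.append(chunk.get("message", {}).get("content", ""))
        if chunk.get("done"):
            break
    return "".join(parts)

# Real usage (requires a running Ollama server):
#   r = requests.post('http://localhost:11434/api/chat',
#                     json={"model": "gemma4:31b-it",
#                           "messages": [...], "stream": True},
#                     stream=True)
#   print(collect_stream(r.iter_lines()))

# Demo with canned chunks:
demo = [
    '{"message": {"content": "Hel"}, "done": false}',
    '{"message": {"content": "lo!"}, "done": true}',
]
print(collect_stream(demo))  # Hello!
```

Streaming makes the 31B model feel much more responsive in interactive use, since the first tokens appear immediately instead of after the full generation finishes.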
Method 3: llama.cpp
For maximum control over quantization, memory usage, and inference parameters, llama.cpp is the way to go. It's what powers Ollama and Unsloth under the hood for GGUF inference.
Building from Source
```
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp

# With CUDA support (NVIDIA)
cmake -B build -DGGML_CUDA=ON
cmake --build build --config Release -j$(nproc)

# With Metal support (macOS Apple Silicon)
cmake -B build -DGGML_METAL=ON
cmake --build build --config Release -j$(nproc)

# CPU only
cmake -B build
cmake --build build --config Release -j$(nproc)
```
Downloading the GGUF Model
Grab the quantized model from Unsloth's HuggingFace repo:
```
# Install huggingface-cli
pip install huggingface-hub

# Download Q4_K_M (recommended for 24GB GPUs)
huggingface-cli download unsloth/gemma-4-31B-it-GGUF \
  gemma-4-31B-it-Q4_K_M.gguf \
  --local-dir ./models

# Or Q5_K_M for better quality
huggingface-cli download unsloth/gemma-4-31B-it-GGUF \
  gemma-4-31B-it-Q5_K_M.gguf \
  --local-dir ./models
```
Running the Model
```
# Basic chat
./build/bin/llama-cli \
  -m ./models/gemma-4-31B-it-Q4_K_M.gguf \
  -ngl 99 \
  -c 8192 \
  --temp 1.0 \
  --top-p 0.95 \
  --top-k 64 \
  -p "You are a helpful assistant.\nUser: Explain quantum computing in simple terms.\nAssistant:"
```
Key flags:
- `-ngl 99`: Offload all layers to GPU. Reduce this number if you don't have enough VRAM (e.g., `-ngl 40` offloads about two-thirds of layers)
- `-c 8192`: Context length in tokens. Increase up to 256K for long documents, but more context uses more VRAM
- `--temp 1.0`: Google recommends temperature=1.0 for Gemma 4
- `--top-p 0.95` and `--top-k 64`: Recommended sampling parameters
Starting a Server
```
./build/bin/llama-server \
  -m ./models/gemma-4-31B-it-Q4_K_M.gguf \
  -ngl 99 \
  -c 8192 \
  --host 0.0.0.0 \
  --port 8080 \
  --temp 1.0 \
  --top-p 0.95 \
  --top-k 64
```
Then access the web UI at http://localhost:8080 or call the OpenAI-compatible API:
```
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8080/v1", api_key="unused")

response = client.chat.completions.create(
    model="gemma-4-31b",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Write a Rust function to reverse a linked list."}
    ]
)
print(response.choices[0].message.content)
```
GPU Offloading Strategy
If your GPU doesn't have enough VRAM for the full model, you can split layers between GPU and CPU:
```
# For a 16GB GPU with the Q4 model (~18GB total)
# Offload about 40 layers to GPU, rest to CPU
./build/bin/llama-cli \
  -m ./models/gemma-4-31B-it-Q4_K_M.gguf \
  -ngl 40 \
  -c 4096
```
This runs slower than full GPU offload but fits on smaller cards. Expect roughly 5-15 tokens/second depending on your CPU and how many layers you offload.
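Rather than guessing the `-ngl` value, you can estimate it. This sketch assumes uniformly sized layers and uses a placeholder layer count of 48 (not Gemma 4's published number; check the actual count in the GGUF metadata), so treat the result as a starting point and adjust from there:

```python
def estimate_ngl(model_size_gb, vram_gb, n_layers=48, overhead_gb=2.0):
    """Rough -ngl value: how many layers fit in VRAM.

    Assumes layers are uniformly sized (only approximately true) and
    reserves overhead_gb for the KV cache and CUDA buffers. The layer
    count of 48 is a placeholder -- read the real value from the model.
    """
    per_layer_gb = model_size_gb / n_layers
    budget = vram_gb - overhead_gb
    if budget <= 0:
        return 0  # nothing fits; run fully on CPU
    return min(n_layers, int(budget / per_layer_gb))

# 16GB card, ~18GB Q4_K_M file:
print(estimate_ngl(18, 16))  # 37 with these assumptions
```

If generation stutters or you hit OOM mid-conversation, lower the result by a few layers: the KV cache grows as the context fills up, so the safe value at 8K context is smaller than at 2K.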
Method 4: HuggingFace Transformers
If you're building an application and need programmatic control, HuggingFace Transformers gives you direct access to the model with full precision or custom quantization.
Installation
```
pip install -U transformers torch accelerate
```
For image support:
```
pip install -U transformers torch torchvision accelerate
```
Running in Full Precision (62GB+ VRAM)
```
from transformers import AutoProcessor, AutoModelForCausalLM

MODEL_ID = "google/gemma-4-31B-it"

processor = AutoProcessor.from_pretrained(MODEL_ID)
model = AutoModelForCausalLM.from_pretrained(
    MODEL_ID,
    dtype="auto",
    device_map="auto"
)

messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Explain the difference between TCP and UDP."},
]

text = processor.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
    enable_thinking=False
)
inputs = processor(text=text, return_tensors="pt").to(model.device)
input_len = inputs["input_ids"].shape[-1]

outputs = model.generate(**inputs, max_new_tokens=1024)
response = processor.decode(outputs[0][input_len:], skip_special_tokens=False)
result = processor.parse_response(response)
print(result)
```
Running with 4-bit Quantization (18GB VRAM)
```
import torch
from transformers import AutoProcessor, AutoModelForCausalLM, BitsAndBytesConfig

MODEL_ID = "google/gemma-4-31B-it"

# 4-bit quantization config
quantization_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.float16
)

processor = AutoProcessor.from_pretrained(MODEL_ID)
model = AutoModelForCausalLM.from_pretrained(
    MODEL_ID,
    quantization_config=quantization_config,
    device_map="auto"
)
```
Processing Images
The 31B model supports text and image input:
```
from transformers import AutoProcessor, AutoModelForMultimodalLM

MODEL_ID = "google/gemma-4-31B-it"

processor = AutoProcessor.from_pretrained(MODEL_ID)
model = AutoModelForMultimodalLM.from_pretrained(
    MODEL_ID,
    dtype="auto",
    device_map="auto"
)

messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "url": "https://example.com/photo.jpg"},
            {"type": "text", "text": "Describe what you see in this image."}
        ]
    }
]

inputs = processor.apply_chat_template(
    messages,
    tokenize=True,
    return_dict=True,
    return_tensors="pt",
    add_generation_prompt=True,
).to(model.device)
input_len = inputs["input_ids"].shape[-1]

outputs = model.generate(**inputs, max_new_tokens=512)
response = processor.decode(outputs[0][input_len:], skip_special_tokens=False)
result = processor.parse_response(response)
print(result)
```
Enabling Thinking Mode
Gemma 4 supports chain-of-thought reasoning. Enable it by setting `enable_thinking=True`:
```
text = processor.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
    enable_thinking=True  # Activates reasoning mode
)
```
When thinking is enabled, the model outputs its internal reasoning followed by the final answer. Use `processor.parse_response(response)` to separate the thinking from the answer.
Understanding GGUF Quantization Formats
If you're downloading GGUF files, you'll see a lot of suffixes. Here's what they mean in practice.
| Format | Bits | Size (31B) | When to Use |
|---|---|---|---|
| Q8_0 | 8-bit | ~32 GB | Best quality, needs 32GB+ VRAM |
| Q6_K | 6-bit | ~26 GB | Near-lossless, 24GB+ VRAM |
| Q5_K_M | 5-bit | ~22 GB | Sweet spot for quality/size |
| Q4_K_M | 4-bit | ~18 GB | Best balance, fits 24GB GPU |
| Q3_K_M | 3-bit | ~14 GB | Smaller GPU, some quality loss |
| Q2_K | 2-bit | ~10 GB | Desperate measures, noticeable degradation |
My recommendation: Q4_K_M for 24GB GPUs, Q5_K_M if you can spare the extra 4GB. The quality difference between Q4_K_M and Q5_K_M is measurable on benchmarks but hard to notice in casual use. Going below Q3_K_M is not worth it unless you have no other option.
The _K_M suffix means "K-quantization, medium." There are also _K_S (small, more compression) and _K_L (large, less compression) variants. _K_M is the default recommendation.
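Where do these file sizes come from? Roughly: parameter count times effective bits per weight. The bits-per-weight figures below are ballpark assumptions (K-quants mix precisions and store per-block scales, so the effective rate sits a bit above the nominal bit count), but the arithmetic shows why Q4_K_M lands near 18GB:

```python
# Approximate effective bits per weight for common GGUF quants.
# These are ballpark figures, not exact per-file values.
BPW = {"Q8_0": 8.5, "Q6_K": 6.6, "Q5_K_M": 5.7,
       "Q4_K_M": 4.85, "Q3_K_M": 3.9, "Q2_K": 2.6}

PARAMS = 30.7e9  # Gemma 4 31B dense parameter count

def file_size_gb(quant):
    """Estimated GGUF file size in decimal GB for the 31B model."""
    return PARAMS * BPW[quant] / 8 / 1e9

for q in BPW:
    print(f"{q}: ~{file_size_gb(q):.1f} GB")
```

Running this reproduces the table within about a gigabyte per row, which is close enough for planning; the real files also carry metadata and non-quantized tensors (embeddings, norms) that shift the totals slightly.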
Performance Tips
Context Length Management
Gemma 4 31B supports up to 256K tokens of context, but each token in the context costs VRAM. A few practical guidelines:
- 4K tokens: Comfortable on any GPU that fits the model
- 8K tokens: Standard for most conversations, still comfortable
- 32K tokens: Needs about 4-6GB extra VRAM depending on quantization
- 128K+ tokens: Requires substantial VRAM or aggressive offloading
Start with -c 8192 and increase only when you need it.
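To see why long contexts get expensive, you can estimate the KV-cache size directly. The architecture numbers below (layer count, KV heads, head dimension) are illustrative placeholders, not Gemma 4's published config; read the real values from the model's config file before trusting the output:

```python
def kv_cache_gb(context_tokens, n_layers=48, n_kv_heads=8,
                head_dim=128, bytes_per=2):
    """Approximate KV-cache size in decimal GB.

    Formula: 2 (K and V) * layers * kv_heads * head_dim * tokens * bytes.
    bytes_per=2 assumes an FP16 cache; quantized KV caches use less.
    The architecture numbers are placeholders, not Gemma 4's real config.
    """
    return 2 * n_layers * n_kv_heads * head_dim * context_tokens * bytes_per / 1e9

for ctx in (8_192, 32_768, 131_072):
    print(f"{ctx:>7} tokens: ~{kv_cache_gb(ctx):.1f} GB")
```

The key takeaway is that the cost is linear in context length: quadrupling the context quadruples the cache, which is why jumping from 8K to 128K turns a rounding error into tens of gigabytes.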
Sampling Parameters
Google recommends these settings for Gemma 4:
```
temperature = 1.0
top_p = 0.95
top_k = 64
```
These are different from what most models use. Don't use temperature=0.7 with Gemma 4; it's trained for temperature=1.0 and produces better results at that setting.
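For intuition on why the temperature setting matters: temperature divides the logits before the softmax, so values below 1.0 concentrate probability on the top token and push the model away from the distribution it was trained to produce. A small self-contained demo:

```python
import math

def softmax_with_temperature(logits, temperature):
    """Temperature rescales logits before softmax: below 1.0 sharpens the
    distribution (more deterministic), above 1.0 flattens it."""
    scaled = [l / temperature for l in logits]
    m = max(scaled)                      # subtract max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.1]
print(softmax_with_temperature(logits, 1.0))  # moderately peaked
print(softmax_with_temperature(logits, 0.7))  # sharper: top token gains mass
```

Lowering the temperature on a model tuned for 1.0 doesn't make it "more accurate"; it just skews sampling toward a sharper distribution than the one the post-training optimized for.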
Flash Attention
If you're using HuggingFace Transformers, enable Flash Attention for faster inference and lower memory usage:
```
model = AutoModelForCausalLM.from_pretrained(
    MODEL_ID,
    attn_implementation="flash_attention_2",
    device_map="auto"
)
```
This requires `pip install flash-attn` and a compatible GPU (most NVIDIA RTX cards work).
Multi-GPU Setup
If you have multiple GPUs, device_map="auto" in Transformers handles splitting automatically. For llama.cpp:
```
# -ts 1,1 splits the model equally between two GPUs
./build/bin/llama-cli \
  -m ./models/gemma-4-31B-it-Q4_K_M.gguf \
  -ngl 99 \
  -ts 1,1 \
  -c 8192
```
Troubleshooting
Out of Memory (CUDA OOM)
The most common issue. Solutions in order of desperation:
- Use a smaller quantization: Switch from Q5 to Q4, or Q4 to Q3
- Reduce context length: Lower `-c` from 8192 to 4096 or 2048
- Offload to CPU: Reduce `-ngl` to offload some layers
- Use the 26B A4B MoE variant: Same quality tier, fraction of the VRAM
- Use the E4B variant: Runs on anything
Slow Inference on CPU
If you're running on CPU, expect 1-3 tokens/second for the 31B model. Options:
- Switch to the E4B or E2B model (10-20 tok/s on CPU)
- Use a GPU cloud instance (LightNode offers GPU VPS options)
- Build llama.cpp with your CPU's instruction sets enabled (AVX2, AVX-512)
Model Download Failures
The Q4_K_M file is about 18GB. If the download keeps failing:
```
# Use huggingface-cli with resume support
huggingface-cli download unsloth/gemma-4-31B-it-GGUF \
  gemma-4-31B-it-Q4_K_M.gguf \
  --local-dir ./models \
  --local-dir-use-symlinks False
```
Or use a download manager that supports resume. The HuggingFace CDN can be flaky for large files.
"Model not supported" Errors
Make sure you're using the latest version of your tools. Gemma 4 is new and older versions of llama.cpp, Ollama, and Transformers don't support it:
```
# Update llama.cpp
cd llama.cpp && git pull && cmake --build build --config Release -j$(nproc)

# Update the Ollama model (re-pulling fetches the latest version)
ollama pull gemma4:31b-it

# Update Transformers
pip install -U transformers
```
Which Method Should You Pick?
| Scenario | Best Method |
|---|---|
| You want a GUI, don't want to touch the terminal | Unsloth Studio |
| You want the fastest setup, CLI is fine | Ollama |
| You need maximum control over inference | llama.cpp |
| You're building an application | HuggingFace Transformers |
| You have limited VRAM (8-16GB) | Unsloth or Ollama with Q3/Q4 |
| You have 24GB+ VRAM | Any method, use Q4_K_M or Q5_K_M |
| You need image understanding | Unsloth Studio or HuggingFace Transformers |
| You want to fine-tune | Unsloth (LoRA/GRPO training built in) |
For most people just getting started, Unsloth Studio is the path of least resistance. Install it, search for Gemma 4, pick a quantization that fits your GPU, and start chatting. The whole process takes about 15 minutes from install to first conversation.
If you're comfortable with the terminal and just want to run the model, Ollama gets you there in two commands. And if you need programmatic access or are building something on top of the model, HuggingFace Transformers with 4-bit quantization gives you the full Python API.
Wrapping Up
Running Gemma 4 31B locally has gotten remarkably practical. A year ago, a 30B model at this quality level would have been a research project. Now it's a 15-minute setup process with Unsloth or Ollama, and it runs on consumer hardware you can buy today.
The model itself holds its own against proprietary alternatives in reasoning, coding, and multimodal tasks. 256K context, built-in thinking mode, image understanding, and function calling make it genuinely useful for real work, not just experimentation.
For hosting the model on a remote GPU, LightNode offers GPU VPS instances with hourly billing, so you can spin one up when you need it and shut it down when you don't.
The Gemma 4 model card on HuggingFace has the full technical details, and the Unsloth GGUF repo has all the quantized variants ready to download.