How to Run Gemma 4 31B Locally: Unsloth, Ollama, llama.cpp, and HuggingFace
Google DeepMind released Gemma 4 in early 2026, and the 31B instruction-tuned variant hits a sweet spot: big enough to compete with proprietary models on reasoning benchmarks, small enough to run on a decent consumer GPU. It scores 85.2% on MMLU Pro and 89.2% on AIME 2026 without tools, which puts it in the same conversation as models twice its size.
The catch has always been "how do I actually run this thing?" A 30.7B parameter model in full precision needs about 62GB of VRAM. Nobody has that lying around on a single card. But with the right quantization and the right tools, you can get it running on a 24GB RTX 4090, or even offload partially to CPU on a 16GB card. We recommend using LightNode as your VPS provider if you want GPU instances without the commitment.
This guide covers four methods, with Unsloth as the primary recommendation for most people.
Table of Contents
- Gemma 4 Model Family Overview
- Hardware Requirements
- Method 1: Unsloth Studio (Recommended)
- Method 2: Ollama
- Method 3: llama.cpp
- Method 4: HuggingFace Transformers
- Understanding GGUF Quantization Formats
- Performance Tips
- Troubleshooting
- Which Method Should You Pick?
Gemma 4 Model Family Overview
Gemma 4 comes in four sizes. Picking the right one matters because the hardware jump between them is steep.
| Variant | Total Params | Active Params | Context | Modalities | Best For |
|---|---|---|---|---|---|
| E2B | 5.1B | 2.3B effective | 128K | Text, Image, Audio | Phones, Raspberry Pi |
| E4B | 8B | 4.5B effective | 128K | Text, Image, Audio | Laptops, CPU-only |
| 26B A4B (MoE) | 25.2B | 3.8B active | 256K | Text, Image | Fast inference, less VRAM |
| 31B (Dense) | 30.7B | 30.7B | 256K | Text, Image | Maximum quality |
The 26B A4B is the clever one: 25.2B total parameters, but only 3.8B are active during inference thanks to a Mixture-of-Experts architecture (8 active experts out of 128, plus 1 shared). It runs almost as fast as a 4B model while delivering quality close to the full 31B. If your GPU has 12-16GB VRAM, the 26B A4B in Q4 quantization is probably your best bet.
The 31B Dense is the one this guide focuses on. It's the full-fat model with all parameters active on every forward pass. Best quality, highest hardware requirements.
All four variants support configurable thinking mode (chain-of-thought reasoning), native system prompts, function calling, and 140+ languages.
Hardware Requirements
Before picking a method, figure out what hardware you're working with.
For Gemma 4 31B-it
| Quantization | VRAM Needed | Quality Loss | Typical Hardware |
|---|---|---|---|
| FP16 (full precision) | ~62 GB | None | A100, multiple GPUs |
| Q8_0 (8-bit) | ~32 GB | Negligible | RTX 4090 (24GB) + CPU offload |
| Q5_K_M (5-bit) | ~22 GB | Minimal | RTX 4090, RTX 3090 |
| Q4_K_M (4-bit) | ~18 GB | Small | RTX 4080, RTX 3090 |
| Q3_K_M (3-bit) | ~14 GB | Noticeable | RTX 4070, partial offload |
For Gemma 4 26B A4B (MoE)
| Quantization | VRAM Needed | Quality Loss | Typical Hardware |
|---|---|---|---|
| Q5_K_M | ~14 GB | Minimal | RTX 4070 Ti |
| Q4_K_M | ~10 GB | Small | RTX 4070, RTX 3080 |
| Q3_K_M | ~8 GB | Noticeable | RTX 4060 Ti 8GB |
If you're on CPU only, the E4B or E2B variants will run comfortably. The 31B on CPU is technically possible but painfully slow (expect 1-3 tokens/second on a modern CPU).
RAM requirement: Add 8-16GB of system RAM on top of VRAM for the runtime overhead, more if you're offloading layers to CPU.
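The tables above can be collapsed into a quick "what fits on my card" check. This is a minimal sketch using the approximate 31B sizes listed above; the 2GB headroom default is an assumption to cover the KV cache and runtime overhead, not a measured value:

```python
# Approximate VRAM footprint for each Gemma 4 31B quantization, taken
# from the hardware requirements table above.
QUANT_SIZES_GB = {
    "Q8_0": 32,
    "Q6_K": 26,
    "Q5_K_M": 22,
    "Q4_K_M": 18,
    "Q3_K_M": 14,
}

def pick_quant(vram_gb, headroom_gb=2.0):
    """Return the largest quantization that fits in vram_gb, leaving
    headroom_gb free for the KV cache and runtime overhead. Returns
    None when nothing fits fully (consider CPU offload or the MoE)."""
    for name, size in QUANT_SIZES_GB.items():  # dict preserves order, largest first
        if size + headroom_gb <= vram_gb:
            return name
    return None

print(pick_quant(24))  # Q5_K_M on a 24GB card
print(pick_quant(16))  # Q3_K_M on a 16GB card
```

The result matches the table's recommendations: a 24GB RTX 4090 lands on Q5_K_M, a 16GB card on Q3_K_M, and anything below 16GB should look at partial offload or the 26B A4B variant instead.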
Method 1: Unsloth Studio (Recommended)
Unsloth is the easiest way to run Gemma 4 locally. It's a web UI that handles model downloading, quantization selection, and inference in one package. Works on Windows, Linux, macOS, and WSL.
Why Unsloth
- One-click model search and download from HuggingFace
- Automatically picks the right GGUF quantization for your hardware
- Built-in chat interface with support for images, PDFs, and documents
- Tool calling and web search built in
- Code execution sandbox
- No command-line wrangling
Installation
macOS, Linux, WSL:
```
curl -fsSL https://unsloth.ai/install.sh | sh
```
Windows (PowerShell):
```
irm https://unsloth.ai/install.ps1 | iex
```
Docker:
```
docker run -d -e JUPYTER_PASSWORD="mypassword" \
  -p 8888:8888 -p 8000:8000 -p 2222:22 \
  -v $(pwd)/work:/workspace/work \
  --gpus all \
  unsloth/unsloth
```
Launching
```
unsloth studio -H 0.0.0.0 -p 8888
```
Open http://localhost:8888 in your browser. You'll see the Unsloth Studio interface.
Running Gemma 4 31B
- Search for the model: In the model search bar, type `gemma-4-31B`
- Pick a quantization: Unsloth hosts pre-quantized GGUF files. For a 24GB GPU, select `Q4_K_M` or `Q5_K_M`. For 16GB, go with `Q3_K_M`
- Download: Click download. The Q4_K_M variant is about 18GB
- Start chatting: Once downloaded, the model loads into the chat interface automatically
Unsloth provides these GGUF variants for Gemma 4 31B-it:
| File | Size | Quantization |
|---|---|---|
| gemma-4-31B-it-Q3_K_M.gguf | ~14 GB | 3-bit (balanced) |
| gemma-4-31B-it-Q4_K_M.gguf | ~18 GB | 4-bit (recommended) |
| gemma-4-31B-it-Q5_K_M.gguf | ~22 GB | 5-bit (high quality) |
| gemma-4-31B-it-Q6_K.gguf | ~26 GB | 6-bit (near-lossless) |
| gemma-4-31B-it-Q8_0.gguf | ~32 GB | 8-bit (virtually lossless) |
The HuggingFace repo is at unsloth/gemma-4-31B-it-GGUF.
Using the Chat Interface
Unsloth Studio's chat supports:
- Text conversations with thinking mode toggle
- Image uploads: Drag and drop images for visual question answering
- PDF/DOCX uploads: Extract and discuss document contents
- Code execution: The model can write and test code in a sandbox
- Custom system prompts: Set behavior and persona
To enable Gemma 4's thinking mode, toggle the "Thinking" option in the chat settings. This activates chain-of-thought reasoning, where the model works through problems step-by-step before giving its final answer.
Fine-Tuning with Unsloth
If you want to go beyond inference, Unsloth also handles training:
- LoRA fine-tuning: Train adapters with up to 70% less VRAM
- GRPO reinforcement learning: The most efficient RL library available
- Data Recipes: Auto-create training datasets from PDFs, CSVs, DOCX files
- Multi-GPU support: Available now with improvements coming
For Gemma 4 31B fine-tuning, you'll need at least one 24GB GPU with QLoRA (4-bit quantized training).
Updating Unsloth
Run the same install command again:
```
# macOS/Linux/WSL
curl -fsSL https://unsloth.ai/install.sh | sh

# Windows
irm https://unsloth.ai/install.ps1 | iex
```
Method 2: Ollama
Ollama is the fastest way to get running if you prefer the command line. It handles model downloads, GPU detection, and serving automatically.
Installation
```
# macOS
brew install ollama

# Linux
curl -fsSL https://ollama.com/install.sh | sh

# Windows
# Download from https://ollama.com/download
```
Running Gemma 4
```
# Pull and run the 31B model
ollama run gemma4:31b-it

# Or the smaller MoE variant for less VRAM
ollama run gemma4:26b-a4b-it

# Or the smaller dense models
ollama run gemma4:e4b-it
ollama run gemma4:e2b-it
```
Ollama pulls the Q4_K_M quantization by default. If you want a different quantization:
```
# Run with specific quantization
ollama run gemma4:31b-it-q5_K_M
```
Using the API
Ollama exposes a local API on port 11434:
```
import requests

response = requests.post('http://localhost:11434/api/chat', json={
    "model": "gemma4:31b-it",
    "messages": [
        {"role": "system", "content": "You are a helpful coding assistant."},
        {"role": "user", "content": "Write a Python function to merge two sorted lists."}
    ],
    "stream": False
})
print(response.json()['message']['content'])
```
Ollama Pros and Cons
Pros: Zero configuration, automatic GPU detection, clean CLI, API server included, simple model management.
Cons: Fewer quantization options than llama.cpp, no built-in image support for all models (check current compatibility), less control over inference parameters.
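The API example above sets `"stream": False`. For token-by-token streaming, Ollama instead returns one JSON object per line (NDJSON), each carrying a partial `message.content` chunk. Here is a small helper that reassembles streamed chunks, demonstrated against canned data so it runs without a server:

```python
import json

def collect_stream(ndjson_lines):
    """Reassemble a streamed Ollama /api/chat response.

    Ollama streams one JSON object per line; each carries a partial
    "message.content" chunk, and the final object has "done": true.
    """
    parts = []
    for line in ndjson_lines:
        if not line or not str(line).strip():
            continue  # skip keep-alive blank lines
        chunk = json.loads(line)
        parts.append(chunk.get("message", {}).get("content", ""))
        if chunk.get("done"):
            break
    return "".join(parts)

# Real usage (requires a running Ollama server):
#   r = requests.post('http://localhost:11434/api/chat',
#                     json={"model": "gemma4:31b-it",
#                           "messages": [...], "stream": True},
#                     stream=True)
#   print(collect_stream(r.iter_lines()))

# Demo with canned chunks:
demo = [
    '{"message": {"content": "Hel"}, "done": false}',
    '{"message": {"content": "lo!"}, "done": true}',
]
print(collect_stream(demo))  # Hello!
```

Streaming makes the 31B model feel much more responsive in interactive use, since the first tokens appear immediately instead of after the full generation finishes.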
Method 3: llama.cpp
For maximum control over quantization, memory usage, and inference parameters, llama.cpp is the way to go. It's what powers Ollama and Unsloth under the hood for GGUF inference.
Building from Source
```
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp

# With CUDA support (NVIDIA)
cmake -B build -DGGML_CUDA=ON
cmake --build build --config Release -j$(nproc)

# With Metal support (macOS Apple Silicon)
cmake -B build -DGGML_METAL=ON
cmake --build build --config Release -j$(nproc)

# CPU only
cmake -B build
cmake --build build --config Release -j$(nproc)
```
Downloading the GGUF Model
Grab the quantized model from Unsloth's HuggingFace repo:
```
# Install huggingface-cli
pip install huggingface-hub

# Download Q4_K_M (recommended for 24GB GPUs)
huggingface-cli download unsloth/gemma-4-31B-it-GGUF \
  gemma-4-31B-it-Q4_K_M.gguf \
  --local-dir ./models

# Or Q5_K_M for better quality
huggingface-cli download unsloth/gemma-4-31B-it-GGUF \
  gemma-4-31B-it-Q5_K_M.gguf \
  --local-dir ./models
```
Running the Model
```
# Basic chat
./build/bin/llama-cli \
  -m ./models/gemma-4-31B-it-Q4_K_M.gguf \
  -ngl 99 \
  -c 8192 \
  --temp 1.0 \
  --top-p 0.95 \
  --top-k 64 \
  -p "You are a helpful assistant.\nUser: Explain quantum computing in simple terms.\nAssistant:"
```
Key flags:
- `-ngl 99`: Offload all layers to GPU. Reduce this number if you don't have enough VRAM (e.g., `-ngl 40` offloads about two-thirds of layers)
- `-c 8192`: Context length in tokens. Increase up to 256K for long documents, but more context uses more VRAM
- `--temp 1.0`: Google recommends temperature=1.0 for Gemma 4
- `--top-p 0.95` and `--top-k 64`: Recommended sampling parameters
Starting a Server
```
./build/bin/llama-server \
  -m ./models/gemma-4-31B-it-Q4_K_M.gguf \
  -ngl 99 \
  -c 8192 \
  --host 0.0.0.0 \
  --port 8080 \
  --temp 1.0 \
  --top-p 0.95 \
  --top-k 64
```
Then access the web UI at http://localhost:8080 or call the OpenAI-compatible API:
```
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8080/v1", api_key="unused")

response = client.chat.completions.create(
    model="gemma-4-31b",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Write a Rust function to reverse a linked list."}
    ]
)
print(response.choices[0].message.content)
```
GPU Offloading Strategy
If your GPU doesn't have enough VRAM for the full model, you can split layers between GPU and CPU:
```
# For a 16GB GPU with the Q4 model (~18GB total)
# Offload about 40 layers to GPU, rest to CPU
./build/bin/llama-cli \
  -m ./models/gemma-4-31B-it-Q4_K_M.gguf \
  -ngl 40 \
  -c 4096
```
This runs slower than full GPU offload but fits on smaller cards. Expect roughly 5-15 tokens/second depending on your CPU and how many layers you offload.
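Rather than guessing the `-ngl` value, you can estimate it. This sketch assumes uniformly sized layers and uses a placeholder layer count of 48 (not Gemma 4's published number; check the actual count in the GGUF metadata), so treat the result as a starting point and adjust from there:

```python
def estimate_ngl(model_size_gb, vram_gb, n_layers=48, overhead_gb=2.0):
    """Rough -ngl value: how many layers fit in VRAM.

    Assumes layers are uniformly sized (only approximately true) and
    reserves overhead_gb for the KV cache and CUDA buffers. The layer
    count of 48 is a placeholder -- read the real value from the model.
    """
    per_layer_gb = model_size_gb / n_layers
    budget = vram_gb - overhead_gb
    if budget <= 0:
        return 0  # nothing fits; run fully on CPU
    return min(n_layers, int(budget / per_layer_gb))

# 16GB card, ~18GB Q4_K_M file:
print(estimate_ngl(18, 16))  # 37 with these assumptions
```

If generation stutters or you hit OOM mid-conversation, lower the result by a few layers: the KV cache grows as the context fills up, so the safe value at 8K context is smaller than at 2K.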
Method 4: HuggingFace Transformers
If you're building an application and need programmatic control, HuggingFace Transformers gives you direct access to the model with full precision or custom quantization.
Installation
```
pip install -U transformers torch accelerate
```
For image support:
```
pip install -U transformers torch torchvision accelerate
```
Running in Full Precision (62GB+ VRAM)
```
from transformers import AutoProcessor, AutoModelForCausalLM

MODEL_ID = "google/gemma-4-31B-it"

processor = AutoProcessor.from_pretrained(MODEL_ID)
model = AutoModelForCausalLM.from_pretrained(
    MODEL_ID,
    dtype="auto",
    device_map="auto"
)

messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Explain the difference between TCP and UDP."},
]

text = processor.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
    enable_thinking=False
)
inputs = processor(text=text, return_tensors="pt").to(model.device)
input_len = inputs["input_ids"].shape[-1]

outputs = model.generate(**inputs, max_new_tokens=1024)
response = processor.decode(outputs[0][input_len:], skip_special_tokens=False)
result = processor.parse_response(response)
print(result)
```
Running with 4-bit Quantization (18GB VRAM)
```
import torch
from transformers import AutoProcessor, AutoModelForCausalLM, BitsAndBytesConfig

MODEL_ID = "google/gemma-4-31B-it"

# 4-bit quantization config
quantization_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.float16
)

processor = AutoProcessor.from_pretrained(MODEL_ID)
model = AutoModelForCausalLM.from_pretrained(
    MODEL_ID,
    quantization_config=quantization_config,
    device_map="auto"
)
```
Processing Images
The 31B model supports text and image input:
```
from transformers import AutoProcessor, AutoModelForMultimodalLM

MODEL_ID = "google/gemma-4-31B-it"

processor = AutoProcessor.from_pretrained(MODEL_ID)
model = AutoModelForMultimodalLM.from_pretrained(
    MODEL_ID,
    dtype="auto",
    device_map="auto"
)

messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "url": "https://example.com/photo.jpg"},
            {"type": "text", "text": "Describe what you see in this image."}
        ]
    }
]

inputs = processor.apply_chat_template(
    messages,
    tokenize=True,
    return_dict=True,
    return_tensors="pt",
    add_generation_prompt=True,
).to(model.device)
input_len = inputs["input_ids"].shape[-1]

outputs = model.generate(**inputs, max_new_tokens=512)
response = processor.decode(outputs[0][input_len:], skip_special_tokens=False)
result = processor.parse_response(response)
print(result)
```
Enabling Thinking Mode
Gemma 4 supports chain-of-thought reasoning. Enable it by setting `enable_thinking=True`:
```
text = processor.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
    enable_thinking=True  # Activates reasoning mode
)
```
When thinking is enabled, the model outputs its internal reasoning followed by the final answer. Use `processor.parse_response(response)` to separate the thinking from the answer.
Understanding GGUF Quantization Formats
If you're downloading GGUF files, you'll see a lot of suffixes. Here's what they mean in practice.
| Format | Bits | Size (31B) | When to Use |
|---|---|---|---|
| Q8_0 | 8-bit | ~32 GB | Best quality, needs 32GB+ VRAM |
| Q6_K | 6-bit | ~26 GB | Near-lossless, 24GB+ VRAM |
| Q5_K_M | 5-bit | ~22 GB | Sweet spot for quality/size |
| Q4_K_M | 4-bit | ~18 GB | Best balance, fits 24GB GPU |
| Q3_K_M | 3-bit | ~14 GB | Smaller GPU, some quality loss |
| Q2_K | 2-bit | ~10 GB | Desperate measures, noticeable degradation |
My recommendation: Q4_K_M for 24GB GPUs, Q5_K_M if you can spare the extra 4GB. The quality difference between Q4_K_M and Q5_K_M is measurable on benchmarks but hard to notice in casual use. Going below Q3_K_M is not worth it unless you have no other option.
The _K_M suffix means "K-quantization, medium." There are also _K_S (small, more compression) and _K_L (large, less compression) variants. _K_M is the default recommendation.
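Where do these file sizes come from? Roughly: parameter count times effective bits per weight. The bits-per-weight figures below are ballpark assumptions (K-quants mix precisions and store per-block scales, so the effective rate sits a bit above the nominal bit count), but the arithmetic shows why Q4_K_M lands near 18GB:

```python
# Approximate effective bits per weight for common GGUF quants.
# These are ballpark figures, not exact per-file values.
BPW = {"Q8_0": 8.5, "Q6_K": 6.6, "Q5_K_M": 5.7,
       "Q4_K_M": 4.85, "Q3_K_M": 3.9, "Q2_K": 2.6}

PARAMS = 30.7e9  # Gemma 4 31B dense parameter count

def file_size_gb(quant):
    """Estimated GGUF file size in decimal GB for the 31B model."""
    return PARAMS * BPW[quant] / 8 / 1e9

for q in BPW:
    print(f"{q}: ~{file_size_gb(q):.1f} GB")
```

Running this reproduces the table within about a gigabyte per row, which is close enough for planning; the real files also carry metadata and non-quantized tensors (embeddings, norms) that shift the totals slightly.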
Performance Tips
Context Length Management
Gemma 4 31B supports up to 256K tokens of context, but each token in the context costs VRAM. A few practical guidelines:
- 4K tokens: Comfortable on any GPU that fits the model
- 8K tokens: Standard for most conversations, still comfortable
- 32K tokens: Needs about 4-6GB extra VRAM depending on quantization
- 128K+ tokens: Requires substantial VRAM or aggressive offloading
Start with -c 8192 and increase only when you need it.
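To see why long contexts get expensive, you can estimate the KV-cache size directly. The architecture numbers below (layer count, KV heads, head dimension) are illustrative placeholders, not Gemma 4's published config; read the real values from the model's config file before trusting the output:

```python
def kv_cache_gb(context_tokens, n_layers=48, n_kv_heads=8,
                head_dim=128, bytes_per=2):
    """Approximate KV-cache size in decimal GB.

    Formula: 2 (K and V) * layers * kv_heads * head_dim * tokens * bytes.
    bytes_per=2 assumes an FP16 cache; quantized KV caches use less.
    The architecture numbers are placeholders, not Gemma 4's real config.
    """
    return 2 * n_layers * n_kv_heads * head_dim * context_tokens * bytes_per / 1e9

for ctx in (8_192, 32_768, 131_072):
    print(f"{ctx:>7} tokens: ~{kv_cache_gb(ctx):.1f} GB")
```

The key takeaway is that the cost is linear in context length: quadrupling the context quadruples the cache, which is why jumping from 8K to 128K turns a rounding error into tens of gigabytes.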
Sampling Parameters
Google recommends these settings for Gemma 4:
```
temperature = 1.0
top_p = 0.95
top_k = 64
```
These are different from what most models use. Don't use temperature=0.7 with Gemma 4; it's trained for temperature=1.0 and produces better results at that setting.
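For intuition on why the temperature setting matters: temperature divides the logits before the softmax, so values below 1.0 concentrate probability on the top token and push the model away from the distribution it was trained to produce. A small self-contained demo:

```python
import math

def softmax_with_temperature(logits, temperature):
    """Temperature rescales logits before softmax: below 1.0 sharpens the
    distribution (more deterministic), above 1.0 flattens it."""
    scaled = [l / temperature for l in logits]
    m = max(scaled)                      # subtract max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.1]
print(softmax_with_temperature(logits, 1.0))  # moderately peaked
print(softmax_with_temperature(logits, 0.7))  # sharper: top token gains mass
```

Lowering the temperature on a model tuned for 1.0 doesn't make it "more accurate"; it just skews sampling toward a sharper distribution than the one the post-training optimized for.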
Flash Attention
If you're using HuggingFace Transformers, enable Flash Attention for faster inference and lower memory usage:
```
model = AutoModelForCausalLM.from_pretrained(
    MODEL_ID,
    attn_implementation="flash_attention_2",
    device_map="auto"
)
```
This requires `pip install flash-attn` and a compatible GPU (most NVIDIA RTX cards work).
Multi-GPU Setup
If you have multiple GPUs, device_map="auto" in Transformers handles splitting automatically. For llama.cpp:
```
# -ts 1,1 splits the model equally between two GPUs
./build/bin/llama-cli \
  -m ./models/gemma-4-31B-it-Q4_K_M.gguf \
  -ngl 99 \
  -ts 1,1 \
  -c 8192
```
Troubleshooting
Out of Memory (CUDA OOM)
The most common issue. Solutions in order of desperation:
- Use a smaller quantization: Switch from Q5 to Q4, or Q4 to Q3
- Reduce context length: Lower `-c` from 8192 to 4096 or 2048
- Offload to CPU: Reduce `-ngl` to offload some layers
- Use the 26B A4B MoE variant: Same quality tier, fraction of the VRAM
- Use the E4B variant: Runs on anything
Slow Inference on CPU
If you're running on CPU, expect 1-3 tokens/second for the 31B model. Options:
- Switch to the E4B or E2B model (10-20 tok/s on CPU)
- Use a GPU cloud instance (LightNode offers GPU VPS options)
- Build llama.cpp with your CPU's instruction sets enabled (AVX2, AVX-512)
Model Download Failures
The Q4_K_M file is about 18GB. If the download keeps failing:
```
# Use huggingface-cli with resume support
huggingface-cli download unsloth/gemma-4-31B-it-GGUF \
  gemma-4-31B-it-Q4_K_M.gguf \
  --local-dir ./models \
  --local-dir-use-symlinks False
```
Or use a download manager that supports resume. The HuggingFace CDN can be flaky for large files.
"Model not supported" Errors
Make sure you're using the latest version of your tools. Gemma 4 is new and older versions of llama.cpp, Ollama, and Transformers don't support it:
```
# Update llama.cpp
cd llama.cpp && git pull && cmake --build build --config Release -j$(nproc)

# Update the Ollama model (re-pulling fetches the latest version)
ollama pull gemma4:31b-it

# Update Transformers
pip install -U transformers
```
Which Method Should You Pick?
| Scenario | Best Method |
|---|---|
| You want a GUI, don't want to touch the terminal | Unsloth Studio |
| You want the fastest setup, CLI is fine | Ollama |
| You need maximum control over inference | llama.cpp |
| You're building an application | HuggingFace Transformers |
| You have limited VRAM (8-16GB) | Unsloth or Ollama with Q3/Q4 |
| You have 24GB+ VRAM | Any method, use Q4_K_M or Q5_K_M |
| You need image understanding | Unsloth Studio or HuggingFace Transformers |
| You want to fine-tune | Unsloth (LoRA/GRPO training built in) |
For most people just getting started, Unsloth Studio is the path of least resistance. Install it, search for Gemma 4, pick a quantization that fits your GPU, and start chatting. The whole process takes about 15 minutes from install to first conversation.
If you're comfortable with the terminal and just want to run the model, Ollama gets you there in two commands. And if you need programmatic access or are building something on top of the model, HuggingFace Transformers with 4-bit quantization gives you the full Python API.
Wrapping Up
Running Gemma 4 31B locally has gotten remarkably practical. A year ago, a 30B model at this quality level would have been a research project. Now it's a 15-minute setup process with Unsloth or Ollama, and it runs on consumer hardware you can buy today.
The model itself holds its own against proprietary alternatives in reasoning, coding, and multimodal tasks. 256K context, built-in thinking mode, image understanding, and function calling make it genuinely useful for real work, not just experimentation.
For hosting the model on a remote GPU, LightNode offers GPU VPS instances with hourly billing, so you can spin one up when you need it and shut it down when you don't.
The Gemma 4 model card on HuggingFace has the full technical details, and the Unsloth GGUF repo has all the quantized variants ready to download.