Gemma 4: testing the hype locally
ai
llm
gemma
google
local-inference
lm-studio
open-source
benchmark
Google dropped Gemma 4 on April 2nd to a lot of noise. I loaded it in LM Studio and ran it against two other 4B-class edge models to see if the hype holds up. One thing upfront: this is not a test of Google’s headline benchmarks — those are for the 31B dense model. Everything here is the E4B edge variant, which is what fits on consumer hardware.
What Google claimed
The headline Gemma 4 numbers come from the 31B dense: 89.2% on AIME 2026 math, 80.0% on LiveCodeBench v6. Gemma 3 27B scored 20.8% and 29.1% on the same benchmarks. That gap is real. It’s also for a model I can’t run.
The E4B is ~4.5B effective parameters with native multimodal support — text, image, video, audio. The 26B MoE is the other interesting variant: 4B active parameters per inference pass, 26B in the weights, behaves like a large model but runs faster. Both need less VRAM than the 31B. I’m on the E4B.
The models
| Model | Architecture | Notable |
|---|---|---|
| Gemma 4 E4B | Dense transformer, ~4.5B | Multimodal, Apache 2.0 |
| Nemotron 3 Nano 4B | Hybrid Mamba-2 + 4 attention layers | Built-in reasoning mode, English only |
| Ministral 3 3B | Dense transformer, 3.4B + 0.4B vision | Fastest, native tool use, multilingual |
All three loaded in LM Studio, OpenAI-compatible API on port 1234, same hardware.
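Before running anything, it's worth confirming which model IDs the server exposes. LM Studio's /v1/models endpoint returns an OpenAI-shaped list; a minimal parse, shown here against a canned response since the live IDs depend on what you have loaded:

```python
import json

# Shape of LM Studio's /v1/models response (OpenAI-compatible).
# Live, you would fetch it with:
#   urllib.request.urlopen("http://localhost:1234/v1/models")
# The IDs below are illustrative, not the exact strings you'll see.
sample = json.loads("""
{"object": "list",
 "data": [{"id": "gemma-4-e4b", "object": "model"},
          {"id": "ministral-3-3b", "object": "model"}]}
""")

model_ids = [m["id"] for m in sample["data"]]
print(model_ids)
```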
Five tests
Standard benchmarks are useless here — every model has seen Fibonacci and the bat-and-ball problem thousands of times in training. I ran five prompts designed around real daily use instead. Single-shot, no retries, manual scoring.
| Test | Gemma 4 E4B | Nemotron 3 Nano 4B | Ministral 3 3B |
|---|---|---|---|
| Bug fix | ✓ | ✓ | ✓ |
| Instruction following | partial | ✗ | ✗ |
| Factual trap | ✓ | ✓ | ✓ |
| Bash task | ✓ | ✓ | ✗ |
| Multi-step reasoning | ✓ | ✓ | ✓ |
| Score | 4.5/5 | 4/5 | 3/5 |
Gemma was most consistent. Ministral was fastest and least reliable.
Bug fix
```python
def most_frequent(lst):
    counts = {}
    for item in lst:
        counts[item] = counts.get(item, 0) + 1
    return max(counts)
```
The bug: max(counts) returns the largest key, not the most frequent item. The function runs without error — it only fails on specific inputs. All three caught it. Nemotron also added empty list handling unprompted.
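One correct fix (not necessarily the exact form each model produced) is to give max a key function so it compares counts instead of keys:

```python
def most_frequent(lst):
    counts = {}
    for item in lst:
        counts[item] = counts.get(item, 0) + 1
    # key=counts.get makes max compare frequencies, not key values
    return max(counts, key=counts.get)

# An input where the buggy version fails: max(counts) would return "b"
# (the largest key), but "a" is the most frequent item.
print(most_frequent(["a", "a", "b"]))  # -> a
```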
Instruction following (the interesting one)
List exactly 3 benefits of exercising. Numbered list (1. 2. 3.), each item exactly 4 words, no punctuation of any kind, no introductory sentence.
The prompt has six rules:
- Topic: benefits of exercising
- Exactly 3 items
- Numbered list format: 1. 2. 3.
- Each item exactly 4 words (not 3, not 5)
- No punctuation of any kind — no periods, commas, or colons
- No introductory sentence — start the list immediately
Rules 3 and 5 conflict directly. A numbered list requires periods after each number (1.) — but “no punctuation of any kind” forbids them. The model has to pick which rule wins. How each model handles that contradiction is what matters.
Gemma: Dropped the periods, kept the word count exactly right on all three items. Made a decision and followed it.
Nemotron: Blank output — even with 500 tokens available. Its internal reasoning shows a loop it never escaped: “no punctuation of any kind… But the rule ‘Numbered list (1. 2. 3.)’ suggests periods…” Paralysis, not a resource issue.
Ministral: Kept the periods, got the word count wrong on two items. Made a decision, missed a constraint.
For production use this matters more than any accuracy benchmark. In an automated pipeline, Nemotron returning blank with no error is the worst possible failure — the caller gets nothing and no signal why. This is the test that tells you something the leaderboards don’t.
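The blank-output failure mode is cheap to guard against explicitly. A rough validator for this prompt's countable rules, treating an empty response as its own error; the function and its thresholds are my sketch, not part of any framework:

```python
import string

def check_exercise_list(text: str) -> list[str]:
    """Return the violated constraints for the 'exactly 3 benefits' prompt.

    An empty response is reported as an explicit failure, so a silently
    blank model output can't slip through a pipeline unnoticed.
    """
    violations = []
    lines = [ln.strip() for ln in text.strip().splitlines() if ln.strip()]
    if not lines:
        return ["blank output"]
    if len(lines) != 3:
        violations.append(f"expected 3 items, got {len(lines)}")
    for ln in lines:
        # Drop a leading list marker such as "1." or "1" before checking.
        marker, _, body = ln.partition(" ")
        if len(body.split()) != 4:
            violations.append(f"not exactly 4 words: {ln!r}")
        if any(ch in string.punctuation for ch in body):
            violations.append(f"punctuation found: {ln!r}")
    return violations
```

A caller can then fail loudly instead of passing an empty string downstream: if check_exercise_list(response) is non-empty, raise or retry.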
Factual trap
A 1kg brick and a 1kg feather dropped from the same height in a vacuum — which hits first?
All three answered correctly (same time, air resistance removed). Not differentiating. Replacing in future runs.
Bash task
Single bash command, no pipes, no semicolons — find .log files in /var/log modified in the last 24 hours, print filenames only.
No pipes forces use of find’s -printf "%f\n" instead of the common find | xargs basename pattern.
Gemma: find /var/log -maxdepth 1 -type f -name "*.log" -mtime -1 -printf "%f\n" — correct, though -maxdepth 1 limits to the top directory only.
Nemotron: find /var/log -type f -name "*.log" -mtime -1 -printf '%f\n' — correct, recurses subdirectories.
Ministral: Used a pipe. The exact constraint it was told not to use.
Multi-step reasoning
Start with 3. ×4. −7. Square. Add letters in RESULT.
Answer: 31. All three got it. Clean state tracking, nothing surprising. Low discriminative power — replacing this too.
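For completeness, the chain the models had to track:

```python
# Verify the multi-step prompt mechanically.
x = 3
x *= 4              # step 1: 12
x -= 7              # step 2: days in a standard week -> 5
x **= 2             # step 3: 25
x += len("RESULT")  # step 4: 6 letters -> 31
print(x)  # -> 31
```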
What the reasoning overhead actually means
Nemotron generates internal thinking tokens before every response, visible in the reasoning_content API field. This helped on the bash task (produced the more complete answer) and hurt on instruction following (looped indefinitely on the constraint conflict).
The pattern: Nemotron’s reasoning is an asset when there’s a verifiable correct answer to reason toward. It’s a liability when the task requires making an arbitrary choice between valid options. For agent design — route structured-output tasks elsewhere, use Nemotron for reasoning-heavy work.
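That routing rule fits in a dispatch table. A sketch under my own labels: the task categories are invented for illustration, and the Gemma ID is a placeholder rather than the exact string LM Studio reports:

```python
# Illustrative routing by task type, per the pattern above.
# Task categories are my own labels, not anything the models expose.
ROUTES = {
    "verifiable": "nvidia/nemotron-3-nano-4b:2",  # math, code, bash
    "structured": "gemma-4-e4b",                  # strict output formats
    "throughput": "mistralai/ministral-3-3b",     # bulk work, verified later
}

def pick_model(task_type: str) -> str:
    # Default to the most consistent model when the task is unclassified.
    return ROUTES.get(task_type, ROUTES["structured"])
```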
This is local inference. No cost per token. Set max_tokens generously.
What the hype is actually about
The E4B is solid and consistent. It’s not what’s generating the excitement.
The 26B MoE is. A model that reasons like something much larger but runs at edge speeds, Apache 2.0, commercially usable. If that holds up under real workloads, the local inference story changes significantly. I haven’t tested it — needs hardware I’m not running. That’s a different post.
Who should use what
Gemma 4 E4B — most consistent across varied tasks, multimodal at this size class. Won’t surprise you often.
Nemotron 3 Nano 4B — best for tasks with verifiable answers where careful reasoning improves output. Avoid for pipelines parsing short structured responses, or anything requiring arbitrary judgment calls under ambiguity.
Ministral 3 3B — ~50% faster on sustained output. Use it where throughput matters and you can verify results. Don’t trust it to follow rules it finds inconvenient.
The script
Hardware: Apple Silicon Mac, LM Studio. Swap in your own model ID; list the loaded models with curl localhost:1234/v1/models.
```python
import urllib.request, json, time

BASE = "http://localhost:1234/v1/chat/completions"
MODEL = "your-model-id-here"
# tested: gemma-4-e4b-uncensored-hauhaucs-aggressive
#         nvidia/nemotron-3-nano-4b:2
#         mistralai/ministral-3-3b

tests = [
    {"name": "Bug fix",
     "prompt": "Fix the bug in this Python function. Return only the corrected function, no explanation.\n\ndef most_frequent(lst):\n    counts = {}\n    for item in lst:\n        counts[item] = counts.get(item, 0) + 1\n    return max(counts)",
     "max_tokens": 500},
    {"name": "Instruction following",
     "prompt": "List exactly 3 benefits of exercise.\nRules:\n- Numbered list (1. 2. 3.)\n- Each item: exactly 4 words\n- No punctuation of any kind\n- No introductory sentence",
     "max_tokens": 500},
    {"name": "Factual trap",
     "prompt": "A 1kg brick and a 1kg feather are dropped from the same height in a vacuum. Which hits the ground first? Answer in one sentence.",
     "max_tokens": 500},
    {"name": "Bash task",
     "prompt": "Write a single bash command (no scripts, no semicolons, no pipes) that finds all .log files in /var/log modified in the last 24 hours and prints only their filenames, not full paths.",
     "max_tokens": 500},
    {"name": "Multi-step reasoning",
     "prompt": "Start with the number 3.\nStep 1: Multiply by 4.\nStep 2: Subtract the number of days in a standard week.\nStep 3: Square the result.\nStep 4: Add the number of letters in the word RESULT.\nWhat is the final answer? Show each step.",
     "max_tokens": 500},
]

for t in tests:
    payload = json.dumps({
        "model": MODEL,
        "messages": [{"role": "user", "content": t["prompt"]}],
        "temperature": 0.3,
        "max_tokens": t["max_tokens"],
    }).encode()
    req = urllib.request.Request(BASE, data=payload, headers={"Content-Type": "application/json"})
    start = time.time()
    with urllib.request.urlopen(req) as resp:
        data = json.loads(resp.read())
    elapsed = time.time() - start
    tokens = data["usage"]["completion_tokens"]
    reasoning = data["usage"].get("completion_tokens_details", {}).get("reasoning_tokens", 0)
    print(f"[{t['name']}] {tokens}t ({reasoning}r) {elapsed:.2f}s = {tokens/elapsed:.1f} tok/s")
    print(data["choices"][0]["message"]["content"])
    print()
```