Playing around Gemma4:e4b

Category:engineering
Date:

The following was generated by GLM-5 at my direction, comparing several models' reviews of a specific maze-traversal problem. My opinion: Gemma4:e4b is genuinely impressive for a small model. Keep expectations realistic, but it is still impressive. Ran on a 16GB M1 Pro MacBook Pro.

What GLM-5 Has to Say

We asked each model to review epi_judge_python/search_maze.py, a Python implementation of maze pathfinding using DFS. The file contains two approaches and a testing wrapper.

Code Under Review

def search_maze(
    maze: List[List[int]], s: Coordinate, e: Coordinate
) -> List[Coordinate]:
    def traverse_from_coordinate(c: Coordinate) -> bool:
        if c == e:
            backtracked.append(e)
            return True

        if maze[c.x][c.y] == BLACK:
            return False

        backtracked.append(c)
        maze[c.x][c.y] = BLACK

        if c.x - 1 >= 0 and maze[c.x - 1][c.y] == WHITE:
            if traverse_from_coordinate(Coordinate(c.x - 1, c.y)):
                return True
        if c.y - 1 >= 0 and maze[c.x][c.y - 1] == WHITE:
            if traverse_from_coordinate(Coordinate(c.x, c.y - 1)):
                return True
        if c.x + 1 < len(maze) and maze[c.x + 1][c.y] == WHITE:
            if traverse_from_coordinate(Coordinate(c.x + 1, c.y)):
                return True
        if c.y + 1 < len(maze[c.x]) and maze[c.x][c.y + 1] == WHITE:
            if traverse_from_coordinate(Coordinate(c.x, c.y + 1)):
                return True

        del backtracked[-1]
        return False

    backtracked: List[Coordinate] = []
    traverse_from_coordinate(s)
    return backtracked

Key characteristics:

  • DFS with backtracking
  • In-place mutation of maze (marks visited cells as BLACK)
  • Path returned in start→end order
  • Edge cases handled (start==end, single-cell maze)

Models Tested

Model                  Size   Variant
Qwen3.5:4b             ~4B    Efficient (unknown Q)
Gemma4:e4b             ~4B    Efficient (unknown Q)
Gemma4:e4b-it-q8_0     ~4B    Instruction-tuned, Q8
Gemma4:e4b-it-q4_K_M   ~4B    Instruction-tuned, Q4_K_M
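
The post does not say which runner served these models. The tags look like Ollama tags, so the sketch below assumes a local Ollama server on its default port and uses its /api/generate endpoint. The lowercase tag spellings, the prompt wording, and the helper names build_request and review are all hypothetical.

```python
import json
import urllib.request
from typing import Dict, List

# Tags taken from the table above, lowercased on the assumption that
# they are Ollama tags. Everything else in this block is illustrative.
MODELS: List[str] = [
    "qwen3.5:4b",
    "gemma4:e4b",
    "gemma4:e4b-it-q8_0",
    "gemma4:e4b-it-q4_K_M",
]

# Default address of a local Ollama server (assumed runner).
OLLAMA_URL = "http://localhost:11434/api/generate"


def build_request(model: str, source: str) -> Dict[str, object]:
    """Build one non-streaming generate request for a code review."""
    prompt = (
        "Review the following Python file. List real bugs only, "
        "and explain how you verified each one.\n\n" + source
    )
    return {"model": model, "prompt": prompt, "stream": False}


def review(model: str, source: str, url: str = OLLAMA_URL) -> str:
    """Send the request and return the model's answer text."""
    body = json.dumps(build_request(model, source)).encode("utf-8")
    req = urllib.request.Request(
        url, data=body, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())["response"]
```

With a server running, looping over MODELS and calling review(model, source_text) for each one would collect the four reviews for a side-by-side comparison.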

Results

Qwen3.5:4b — Hallucinated Bugs

Accuracy: ❌ Failed

Claimed bugs:

  1. "Missing start node in path": FALSE (backtracked.append(c) adds it)
  2. "Path is reversed (end→start)": FALSE (path is start→end)
  3. "Start==End returns []": FALSE (returns [s])
  4. "Single-cell maze returns []": FALSE (returns the single coordinate)
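
Each of these claims can be checked by running the function. The sketch below is a minimal harness, not the author's test wrapper. It assumes Coordinate is a namedtuple with fields x and y and that WHITE and BLACK are 0 and 1, since the post shows neither definition. The DFS is the one under review, with its four neighbour checks folded into a loop that keeps the original order.

```python
from collections import namedtuple
from typing import List

# Assumed definitions; the post does not show them.
Coordinate = namedtuple("Coordinate", ("x", "y"))
WHITE, BLACK = 0, 1  # WHITE = open cell, BLACK = wall or visited


def search_maze(maze: List[List[int]], s: Coordinate, e: Coordinate) -> List[Coordinate]:
    def traverse(c: Coordinate) -> bool:
        if c == e:
            path.append(e)
            return True
        if maze[c.x][c.y] == BLACK:
            return False
        path.append(c)
        maze[c.x][c.y] = BLACK  # in-place "visited" mark
        # Same neighbour order as the original: x-1, y-1, x+1, y+1.
        for dx, dy in ((-1, 0), (0, -1), (1, 0), (0, 1)):
            nx, ny = c.x + dx, c.y + dy
            if (0 <= nx < len(maze) and 0 <= ny < len(maze[c.x])
                    and maze[nx][ny] == WHITE
                    and traverse(Coordinate(nx, ny))):
                return True
        del path[-1]  # dead end: drop this cell from the path
        return False

    path: List[Coordinate] = []
    traverse(s)
    return path


def fresh_maze() -> List[List[int]]:
    # A new grid per call, because search_maze paints visited cells BLACK.
    return [
        [WHITE, BLACK, WHITE],
        [WHITE, WHITE, WHITE],
        [BLACK, WHITE, WHITE],
    ]


s, e = Coordinate(0, 0), Coordinate(2, 2)
path = search_maze(fresh_maze(), s, e)

assert path[0] == s                            # claim 1: start is in the path
assert path[-1] == e                           # claim 2: order is start→end
assert search_maze(fresh_maze(), s, s) == [s]  # claim 3: start == end
assert search_maze([[WHITE]], s, s) == [s]     # claim 4: single-cell maze
print(path)
```

All four assertions pass under these assumptions, which matches the FALSE verdicts above.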

What went wrong:

  • Couldn't trace execution flow
  • Made confident assertions without verification
  • Pattern-matched superficially ("DFS" + "append" → assumed bugs)

Verdict: Unreliable for code review. Hallucinates with confidence.

Gemma4:e4b — Accurate but Shallow

Accuracy: ✓ Passed
Depth: Medium

What it got right:

  • No hallucinated bugs
  • Correctly identified in-place mutation side effect
  • Correctly noted Coordinate is hashable (string conversion unnecessary)
  • Suggested reasonable style improvements
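
The mutation and hashability points fit together: because a Coordinate can go straight into a set, visited cells can be tracked without painting the maze. The sketch below is one possible variant, not the code under review. The name search_maze_pure is hypothetical, and Coordinate, WHITE and BLACK are assumed definitions.

```python
from collections import namedtuple
from typing import List, Set

Coordinate = namedtuple("Coordinate", ("x", "y"))  # assumed definition
WHITE, BLACK = 0, 1  # assumed encoding


def search_maze_pure(maze: List[List[int]], s: Coordinate, e: Coordinate) -> List[Coordinate]:
    """DFS that records visited cells in a set and leaves the maze untouched."""
    visited: Set[Coordinate] = set()
    path: List[Coordinate] = []

    def traverse(c: Coordinate) -> bool:
        if c == e:
            path.append(c)
            return True
        if maze[c.x][c.y] == BLACK or c in visited:
            return False
        visited.add(c)  # Coordinate is hashable, so no string key is needed
        path.append(c)
        for dx, dy in ((-1, 0), (0, -1), (1, 0), (0, 1)):
            n = Coordinate(c.x + dx, c.y + dy)
            if (0 <= n.x < len(maze) and 0 <= n.y < len(maze[n.x])
                    and maze[n.x][n.y] == WHITE
                    and n not in visited
                    and traverse(n)):
                return True
        path.pop()  # dead end: drop this cell from the path
        return False

    traverse(s)
    return path


maze = [[WHITE, BLACK],
        [WHITE, WHITE]]
snapshot = [row[:] for row in maze]
route = search_maze_pure(maze, Coordinate(0, 0), Coordinate(1, 1))
print(route, maze == snapshot)
```

The trade-off is extra memory for the set in exchange for a caller-visible maze that stays intact after the search.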

What it missed:

  • No deep correctness verification
  • No time complexity analysis
  • Minor terminology slip ("passing" vs closure)

Verdict: Trustworthy but shallow. Good for quick sanity checks.

Gemma4:e4b-it-q8_0 — Deep and Accurate

Accuracy: ✓ Passed
Depth: High

What it got right:

  • Verified algorithm correctness
  • Time complexity: O(V+E)
  • String conversion overhead in the file's first approach (v1)
  • Boundary check analysis (len(maze[c.x]) for jagged arrays)
  • Coordinate vs tuple comparison quibble (correct but minor)
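
To illustrate the boundary-check point: len(maze[c.x]) bounds a horizontal move within the current row, but a vertical move lands in a different row, which could be shorter if the maze were jagged. The helper below is a hypothetical sketch (is_open is not in the reviewed file) that checks the column against the target row's length before indexing. WHITE and BLACK are assumed to be 0 and 1.

```python
from typing import List

WHITE, BLACK = 0, 1  # assumed encoding


def is_open(maze: List[List[int]], x: int, y: int) -> bool:
    """True if (x, y) exists in a possibly jagged maze and is WHITE."""
    # Short-circuiting keeps maze[x] from being read when x is out of range.
    return 0 <= x < len(maze) and 0 <= y < len(maze[x]) and maze[x][y] == WHITE


jagged = [
    [WHITE, WHITE, WHITE],
    [WHITE],  # shorter row
]

print(is_open(jagged, 0, 2))  # cell exists in the long row
print(is_open(jagged, 1, 2))  # column 2 does not exist in the short row

try:
    jagged[1][2]  # unguarded indexing into the short row
except IndexError as err:
    print("IndexError:", err)
```

For rectangular mazes the two styles of check behave the same, so this only matters if jagged input is possible.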

What sets it apart:

  • Recognized the algorithm is sound without hallucinating bugs
  • Provided technical analysis (complexity, overhead)
  • Understood namedtuples compare by value
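
The value-comparison point is what makes the c == e check in search_maze work even though every recursive call builds a fresh Coordinate. A short sketch, assuming Coordinate is a namedtuple with fields x and y (the post does not show its definition):

```python
from collections import namedtuple

Coordinate = namedtuple("Coordinate", ("x", "y"))  # assumed definition

a = Coordinate(1, 2)
b = Coordinate(1, 2)

same_value = (a == b)               # compared field by field, like any tuple
equals_plain_tuple = (a == (1, 2))  # a namedtuple is a tuple subclass
one_set_entry = (len({a, b}) == 1)  # equal values hash alike

print(same_value, equals_plain_tuple, one_set_entry)  # prints: True True True
```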

Tradeoff: Notably slower, lagged during inference (memory pressure on 16GB RAM).

Verdict: Best quality, but slow. Use for deep analysis sessions.

Gemma4:e4b-it-q4_K_M — Balanced

Accuracy: ✓ Passed
Depth: Medium

What it got right:

  • No hallucinated bugs
  • Correct side effect analysis
  • Useful structural suggestions (encapsulation)
  • Reasonable style feedback

What it missed:

  • No time complexity analysis
  • Less technical depth than Q8_0

Performance: Fast, responsive, no lag.

Verdict: Sweet spot. Roughly 90% of Q8_0's quality at about 3x the speed.

Comparison Matrix

                    Qwen3.5:4b     Gemma e4b     Gemma Q4_K_M   Gemma Q8_0
Accuracy            ❌ Failed      ✓ Passed      ✓ Passed       ✓ Passed
Hallucinations      4 false bugs   0             0              0
Technical depth     Low            Medium        Medium         High
Time complexity     Wrong          None          None           ✓ O(V+E)
Speed               Fast           Fast          Fast           Slow (lag)
Memory fit (16GB)   Comfortable    Comfortable   Comfortable    Tight

Key Insights

1. Small models hallucinate confidently

Qwen3.5:4b invented bugs that didn't exist and framed them as authoritative findings. This is a characteristic failure mode of small models on complex reasoning tasks: when they can't trace execution, they pattern-match and guess.

2. Instruction tuning matters

Both instruction-tuned Gemma variants (Q4_K_M and Q8_0) outperformed the non-instruction-tuned variants on this code review task. They followed the prompt format and provided structured analysis.

3. Quantization affects reasoning quality

Quantization   Quality   Speed   Memory
Q4_K_M         Good      Fast    9.6GB
Q8_0           Better    Slow    12GB

For a 4B model, Q8_0 preserves more reasoning capability. But the gap is marginal; the model's size is the primary constraint.

4. Model family matters more than quantization

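In this test, Gemma Q4_K_M beat Qwen3.5:4b, the only Qwen build tried (quantization unknown). The Gemma family appears to have better reasoning fundamentals, though a single Qwen build cannot show that this holds at every bit depth.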

Recommendations

For 16GB RAM systems:

Use case             Model                        Why
Daily driver         Gemma Q4_K_M                 Fast, accurate, comfortable memory fit
Deep analysis        Cloud model (e.g., Claude)   More reasoning depth than a local 4B model
One-off deep local   Gemma Q8_0                   Close all apps, accept lag

For higher RAM (24GB+):

Use case        Model          Why
Daily driver    Gemma Q4_K_M   Fast iteration
Deep analysis   Gemma Q8_0     Local deep reasoning

The Hierarchy

Quality:     Q8_0 ≈ Q4_K_M > Gemma e4b >>> Qwen (hallucinates)
Speed:       Q4_K_M ≈ Gemma e4b ≈ Qwen >>> Q8_0 (lags)
Reliability: All Gemma >>> Qwen

Conclusion

For local code review on a 16GB machine: Gemma4:e4b-it-q4_K_M is the practical choice. It's fast, doesn't hallucinate, provides useful feedback, and leaves memory headroom. Reserve Q8_0 for occasional deep sessions, and use cloud models for complex reasoning tasks that exceed 4B capacity.

The Qwen failure is instructive: small models can sound authoritative while being completely wrong. Trust but verify, especially with 4B-parameter models.