Vish Sangale
December 2025

Bonsai-LLM: The 'Small LLMs Lab' Philosophy

In an era of trillion-parameter models and massive compute clusters, it’s easy to forget that raw capacity isn’t everything: constraints are the point, and craft beats brute force.

This is the philosophy behind BONSAI-LLM, a “Small LLMs Lab” where I focus on training and optimizing compact models that punch far above their weight.

1. Quality Over Quantity: Data Curation

The project prioritizes “high-signal” data over raw volume. By using a curated 10B token sample of FineWeb-Edu, my implementation achieved a validation perplexity of 23.85, significantly lower than models trained on 10x more “dirty” web crawl data.
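The repository’s exact data pipeline isn’t reproduced here, but as a minimal sketch of the idea, assuming the Hugging Face `datasets` library and the public `HuggingFaceFW/fineweb-edu` `sample-10BT` config (which may differ from the exact subset BONSAI-LLM uses), streaming the curated sample looks roughly like this:

```python
# Minimal sketch: stream a curated FineWeb-Edu sample rather than a raw web crawl.
# Assumes the Hugging Face `datasets` library; "sample-10BT" is the public ~10B-token
# educational subset and is used here for illustration only.
from datasets import load_dataset

stream = load_dataset(
    "HuggingFaceFW/fineweb-edu",
    name="sample-10BT",   # curated ~10B-token educational subset
    split="train",
    streaming=True,       # iterate without downloading the full dump
)

for i, doc in enumerate(stream):
    text = doc["text"]    # each record carries the cleaned document text
    # ... tokenize and pack into fixed-length training sequences here ...
    if i == 2:
        break
```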

2. Implementing Gemma 3: Architectural Stability

The repository features a from-scratch implementation of the Gemma 3 architecture. Beyond standard Transformer blocks, I incorporated several “2025-ready” features to maximize efficiency:

QK Norm (Query-Key Normalization)

To prevent attention scores from exploding during high-throughput training, I implemented QK Norm. Before the dot-product calculation, both Query ($Q$) and Key ($K$) vectors are normalized: \(Q' = \text{RMSNorm}(Q), \quad K' = \text{RMSNorm}(K)\) \(\text{Attention}(Q', K', V) = \text{Softmax}\left(\frac{Q' K'^T}{\sqrt{d_k}}\right) V\)
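A minimal PyTorch sketch of this idea, with RMSNorm applied per head to queries and keys before the score computation (dimensions and naming are illustrative, not the repo’s exact module; `nn.RMSNorm` requires PyTorch ≥ 2.4):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class QKNormAttention(nn.Module):
    """Illustrative causal self-attention with QK Norm (not the repo's exact code)."""
    def __init__(self, dim: int, n_heads: int):
        super().__init__()
        self.n_heads = n_heads
        self.head_dim = dim // n_heads
        self.qkv = nn.Linear(dim, 3 * dim, bias=False)
        self.out = nn.Linear(dim, dim, bias=False)
        # RMSNorm over the head dimension keeps Q/K magnitudes bounded,
        # so attention logits stay well-scaled during high-throughput training.
        self.q_norm = nn.RMSNorm(self.head_dim)
        self.k_norm = nn.RMSNorm(self.head_dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        B, T, C = x.shape
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        # reshape to (B, heads, T, head_dim)
        q = q.view(B, T, self.n_heads, self.head_dim).transpose(1, 2)
        k = k.view(B, T, self.n_heads, self.head_dim).transpose(1, 2)
        v = v.view(B, T, self.n_heads, self.head_dim).transpose(1, 2)
        # QK Norm: normalize before the dot product, i.e. Q' = RMSNorm(Q), K' = RMSNorm(K)
        q, k = self.q_norm(q), self.k_norm(k)
        y = F.scaled_dot_product_attention(q, k, v, is_causal=True)
        return self.out(y.transpose(1, 2).reshape(B, T, C))
```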

Grouped-Query Attention (GQA) and SWA

To optimize memory bandwidth and speed, the model uses GQA, allowing multiple query heads to share a single key-value head. This is paired with Sliding Window Attention (SWA) for local layers, reducing complexity to $O(N \times W)$.
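A hedged sketch of the two ideas together (PyTorch; the head counts, window size, and the KV-head-repetition approach are illustrative rather than the repo’s exact implementation):

```python
import torch
import torch.nn.functional as F

def gqa_swa_attention(q, k, v, n_kv_groups: int, window: int):
    """GQA + sliding-window attention sketch (illustrative, not the repo's kernel).

    q: (B, n_q_heads, T, d)   k, v: (B, n_kv_heads, T, d)
    n_kv_groups = n_q_heads // n_kv_heads  (query heads sharing one KV head)
    """
    B, Hq, T, d = q.shape
    # GQA: replicate each KV head so that n_kv_groups query heads share it.
    k = k.repeat_interleave(n_kv_groups, dim=1)
    v = v.repeat_interleave(n_kv_groups, dim=1)

    # SWA mask: position i may attend only to the last `window` positions up to i,
    # i.e. a causal mask restricted to a local window, giving O(N x W) work per layer.
    idx = torch.arange(T, device=q.device)
    causal = idx[None, :] <= idx[:, None]
    local = (idx[:, None] - idx[None, :]) < window
    mask = causal & local                      # (T, T) boolean, True = may attend

    return F.scaled_dot_product_attention(q, k, v, attn_mask=mask)
```

Note that this sketch materializes a full T×T mask for clarity; production kernels (e.g. fused/flash-style attention) exploit the sparsity so the O(N × W) saving is realized in memory as well as compute.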

3. Results: Convergence and Reasoning Performance

The “Bonsai” approach was validated through extensive training on the FineWeb dataset.

| Metric               | Baseline (Step 0) | FineWeb V0 (Final) |
| -------------------- | ----------------- | ------------------ |
| Validation Loss      | 10.99             | 3.1719             |
| Perplexity (PPL)     | ~60,000           | 23.8537            |
| Throughput (tok/sec) | ~5,600            | 20,084.77          |
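As a sanity check, perplexity here is simply the exponentiated validation loss: \(\exp(3.1719) \approx 23.85\) for the final checkpoint, and \(\exp(10.99) \approx 5.9 \times 10^4\) for the random-init baseline, consistent with the ~60,000 starting PPL.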

Zero-Shot Reasoning Benchmarks

The model’s ability to reason was tested using the lm-eval harness across three standard tasks:

| Task      | Metric   | Bonsai-LLM (1B) | Random Baseline |
| --------- | -------- | --------------- | --------------- |
| PIQA      | Accuracy | 62.08%          | ~50%            |
| HellaSwag | Accuracy | 28.75%          | ~25%            |
| ARC-Easy  | Accuracy | 41.20%          | ~25%            |
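The repo’s exact evaluation command isn’t reproduced here, but a rough sketch using the harness’s Python API might look like the following (the checkpoint path, dtype, and batch size are placeholders, not the actual BONSAI-LLM artifacts):

```python
# Rough sketch of running the three benchmarks with the EleutherAI lm-eval harness.
# Assumes `pip install lm-eval` and a Hugging Face-format checkpoint; the path below
# is a placeholder.
import lm_eval

results = lm_eval.simple_evaluate(
    model="hf",
    model_args="pretrained=./checkpoints/bonsai-1b,dtype=bfloat16",
    tasks=["piqa", "hellaswag", "arc_easy"],
    batch_size=8,
)
print(results["results"])  # per-task metrics dict, one entry per benchmark
```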

Research Insight: The PIQA score (62.08%, versus a ~50% random baseline) is particularly notable for a 1B-parameter model, indicating that the architecture has effectively captured physical commonsense reasoning from the high-quality FineWeb-Edu corpus.

4. Modernizing the MLP: SwiGLU

The model utilizes a SwiGLU-style MLP for increased representational capacity: \(\text{Output} = \text{DownProj}(\text{SiLU}(\text{GateProj}(x)) \odot \text{UpProj}(x))\) where \(\odot\) denotes element-wise multiplication. This activation provides a better balance of nonlinearity and gradient flow than standard GeLU, contributing to the model’s stable convergence.
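A minimal PyTorch sketch of such a gated MLP block (the dimensions and layer names are illustrative, not the repo’s exact configuration):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SwiGLUMLP(nn.Module):
    """Illustrative SwiGLU feed-forward block: DownProj(SiLU(GateProj(x)) * UpProj(x))."""
    def __init__(self, dim: int, hidden_dim: int):
        super().__init__()
        self.gate_proj = nn.Linear(dim, hidden_dim, bias=False)
        self.up_proj = nn.Linear(dim, hidden_dim, bias=False)
        self.down_proj = nn.Linear(hidden_dim, dim, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # The SiLU-gated branch multiplies element-wise with the linear "up" branch,
        # giving a data-dependent gate on every hidden unit.
        return self.down_proj(F.silu(self.gate_proj(x)) * self.up_proj(x))

# Example usage with illustrative sizes (not the actual BONSAI-LLM dimensions):
mlp = SwiGLUMLP(dim=2048, hidden_dim=5504)
y = mlp(torch.randn(2, 16, 2048))  # (batch, sequence, dim) -> same shape
```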

Conclusion

BONSAI-LLM is a reminder that the future of AI isn’t just about scaling up—it’s about scaling down intelligently. As we move toward on-device AI and specialized agents, the ability to craft small, efficient, and capable models will be a defining skill for the next generation of researchers.


Explore the “Small LLMs Lab” implementation and full benchmark logs on GitHub: vishsangale/BONSAI-LLM