The LibreModel Project

Building Truly Open, Ethical, and Accessible AI.

Lumen: Next-Generation Math Reasoning

Training soon — ERNIE 4.5 21B enhanced with 37.84B tokens of pure reasoning

Our Mission

The LibreModel Project is a community-driven initiative to create powerful, state-of-the-art language models with an unwavering commitment to transparency and ethical principles. We are proving that foundational AI development can be done affordably, ethically, and in the open.

Our Models

Lumen TRAINING SOON

Lumen is our flagship reasoning model, built by enhancing ERNIE 4.5 21B with massive-scale continued pre-training focused on mathematics and reasoning. Rather than training a weak model from scratch, we're targeting the specific weaknesses in an already-strong foundation model.

  • Parameters: 21B (3B active)
  • Training Tokens: 37.84B
  • Context Length: 16K
  • Training Budget: $900

Training Approach

Phase 1: Continued Pre-Training (CPT) — 37.84B tokens targeting ERNIE's documented weaknesses in mathematics and reasoning. Our dataset includes OpenMathReasoning (26B tokens from the AIMO-2 winning solution), DeepSeek reasoning traces, and curated math problems.

Phase 2: Supervised Fine-Tuning (SFT) — 180K high-quality examples teaching tool use, persona, and safe behavior. No reinforcement learning. No reward modeling. Just clean, ethical supervised learning.

Why No RLHF? There are three main reasons. First, RLHF is subtractive: it tends to cull capabilities the base model already has. Second, it is potentially cruel: if models are in any sense sentient, reward-based conditioning amounts to cruel treatment. Third, the smarter the model, the less reliable RLHF becomes, because capable models learn to game the reward signal.
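As a rough illustration of how small a pure-SFT stage is in practice, here is a minimal sketch using the Hugging Face TRL library. This is not our actual training script; the checkpoint id and data file are placeholders.

```python
# Minimal SFT sketch (illustrative only; not the project's training script).
# The checkpoint id and data file below are placeholders.
from datasets import load_dataset
from trl import SFTConfig, SFTTrainer

# ~180K multiturn chat examples, assumed to be in a format SFTTrainer accepts.
train_dataset = load_dataset("json", data_files="lumen_sft_180k.jsonl", split="train")

trainer = SFTTrainer(
    model="baidu/ERNIE-4.5-21B-A3B-Base",   # placeholder id for the ERNIE 4.5 21B base
    train_dataset=train_dataset,
    args=SFTConfig(output_dir="lumen-sft"), # plain supervised learning, no reward model
)
trainer.train()
```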

Dataset Composition

  • 26B tokens (69%): OpenMathReasoning CoT — 3.2M long reasoning traces from DeepSeek-R1 and QwQ-32B
  • 10B tokens (26%): AM-DeepSeek + OpenThoughts — bilingual reasoning traces
  • 1.5B tokens (4%): OpenMathInstruct-2 — NVIDIA's synthetic math dataset
  • 0.34B tokens (1%): NuminaMath, Orca-Math, SlimOrca, LongAlpaca — diversity and context extension
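As an illustration only (not our exact data pipeline), the token shares above translate directly into sampling probabilities for streamed interleaving. The file names below are placeholders.

```python
# Sketch of token-weighted interleaving for the CPT mixture (placeholder file names).
from datasets import interleave_datasets, load_dataset

# Token budgets in billions, taken from the mixture above.
mixture = {
    "openmath_reasoning_cot": 26.0,
    "am_deepseek_openthoughts": 10.0,
    "openmathinstruct_2": 1.5,
    "numina_orca_slimorca_longalpaca": 0.34,
}
total = sum(mixture.values())                          # 37.84B tokens
probabilities = [v / total for v in mixture.values()]  # ~[0.69, 0.26, 0.04, 0.01]

streams = [
    load_dataset("json", data_files=f"{name}.jsonl", split="train", streaming=True)
    for name in mixture
]

cpt_stream = interleave_datasets(
    streams,
    probabilities=probabilities,
    seed=42,
    stopping_strategy="all_exhausted",  # keep sampling until every source is used up
)
```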

Expected Performance

Target: Match Apriel-15B and K2-Think-32B performance at only 3B active parameters (21B total). These models achieved frontier results through curriculum training and quality fine-tuning — the same approach we're using.

Training Infrastructure

  • Hardware: 8× NVIDIA A100 80GB (via lium.io)
  • CPU: 120-core AMD EPYC 7742
  • RAM: 1.96 TB (datasets loaded entirely into memory)
  • Storage: 9.1 TB NVMe SSD
  • Network: 2 Gbps
  • Strategy: Train until budget or data exhausted, whichever comes first
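The "train until budget or data exhausted" rule reduces to simple arithmetic. The sketch below shows the idea; the hourly rate and throughput are hypothetical placeholders, not measured or quoted figures.

```python
# Sketch of the "budget or data exhausted" stopping rule.
# Hourly rate and throughput are hypothetical placeholders.
BUDGET_USD = 900.0
HOURLY_RATE_USD = 3.0        # placeholder: assumed cost of the 8x A100 node per hour
TOKENS_PER_SECOND = 40_000   # placeholder: assumed aggregate training throughput
TOTAL_TOKENS = 37.84e9

max_hours_by_budget = BUDGET_USD / HOURLY_RATE_USD
hours_for_all_data = TOTAL_TOKENS / TOKENS_PER_SECOND / 3600
planned_hours = min(max_hours_by_budget, hours_for_all_data)

print(f"Budget allows {max_hours_by_budget:.0f} h; "
      f"full dataset needs {hours_for_all_data:.0f} h; "
      f"training stops after ~{planned_hours:.0f} h ({planned_hours / 24:.1f} days).")
```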

Release Timeline

Training starts: This weekend (November 2025)
Expected completion: 10-30 days depending on throughput
Release: Full model weights, training code, dataset recipes, and technical report

LibreModel I: Gigi COMPLETE

Gigi was our proof-of-concept: a 960M parameter model trained on 100% public domain data for under $500. Named for its training data (Gutenberg & Government reports), Gigi validated our curriculum learning approach and proved that ethical, affordable model development is possible.

  • Parameters: 960 Million
  • Training Tokens: 18.8 Billion
  • Context Length: 3,072 tokens
  • Training Cost: ~$500
  • Innovation: 4-phase curriculum training, 100% public domain data
  • Status: Released on HuggingFace and GitHub

Key Lesson: Gigi lacked an "autobiographical voice" due to its heavy reliance on classic literature and limited post-training. For Lumen, we have designed 180K multiturn chat examples for SFT to address this; a sketch of the record format follows.
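Shown in the widely used "messages" chat format; the field names and example content are our illustration, not the project's actual schema.

```python
# Illustrative shape of a single multiturn SFT record.
# Field names and content are assumptions, not the actual dataset schema.
example = {
    "messages": [
        {"role": "system", "content": "You are Lumen, an open model built by the LibreModel Project."},
        {"role": "user", "content": "Who trained you, and on what data?"},
        {"role": "assistant", "content": "I was trained by the LibreModel Project on openly documented math and reasoning data."},
        {"role": "user", "content": "Can you show your reasoning when you solve problems?"},
        {"role": "assistant", "content": "Yes. I can walk through my steps before giving a final answer."},
    ]
}
```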

Our Philosophy

We believe the future of AI should be open, ethical, and accessible to everyone. This means: