How to use Austin207/Map-NEO with Transformers:
# Use a pipeline as a high-level helper
from transformers import pipeline
pipe = pipeline("text-generation", model="Austin207/Map-NEO")

# Load model directly
from transformers import AutoModel
model = AutoModel.from_pretrained("Austin207/Map-NEO", dtype="auto")
How to use Austin207/Map-NEO with vLLM:
# Install vLLM from pip:
pip install vllm
# Start the vLLM server:
vllm serve "Austin207/Map-NEO"
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:8000/v1/completions" \
-H "Content-Type: application/json" \
--data '{
"model": "Austin207/Map-NEO",
"prompt": "Once upon a time,",
"max_tokens": 512,
"temperature": 0.5
}'
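The same completion request can be issued from Python with only the standard library; a minimal sketch (the payload mirrors the curl body above; the actual HTTP call is left commented out since it requires the vLLM server to be running on localhost:8000):

```python
import json

# OpenAI-compatible completions endpoint exposed by `vllm serve`.
URL = "http://localhost:8000/v1/completions"

def build_payload(prompt, max_tokens=512, temperature=0.5):
    # Same fields as the curl example above.
    return {
        "model": "Austin207/Map-NEO",
        "prompt": prompt,
        "max_tokens": max_tokens,
        "temperature": temperature,
    }

body = json.dumps(build_payload("Once upon a time,")).encode("utf-8")

# To actually send it (server must be running):
#   import urllib.request
#   req = urllib.request.Request(
#       URL, data=body, headers={"Content-Type": "application/json"})
#   print(urllib.request.urlopen(req).read().decode())
```

The same payload shape works for any OpenAI-compatible server; only the URL changes.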
How to use Austin207/Map-NEO with SGLang:
# Install SGLang from pip:
pip install sglang
# Start the SGLang server:
python3 -m sglang.launch_server \
--model-path "Austin207/Map-NEO" \
--host 0.0.0.0 \
--port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/completions" \
-H "Content-Type: application/json" \
--data '{
"model": "Austin207/Map-NEO",
"prompt": "Once upon a time,",
"max_tokens": 512,
"temperature": 0.5
}'
# Alternatively, run the SGLang server with Docker:
docker run --gpus all \
--shm-size 32g \
-p 30000:30000 \
-v ~/.cache/huggingface:/root/.cache/huggingface \
--env "HF_TOKEN=<secret>" \
--ipc=host \
lmsysorg/sglang:latest \
python3 -m sglang.launch_server \
--model-path "Austin207/Map-NEO" \
--host 0.0.0.0 \
--port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/completions" \
-H "Content-Type: application/json" \
--data '{
"model": "Austin207/Map-NEO",
"prompt": "Once upon a time,",
"max_tokens": 512,
"temperature": 0.5
}'
How to use Austin207/Map-NEO with Docker Model Runner:
docker model run hf.co/Austin207/Map-NEO
MAP-NEO Mini is a 253M parameter autoregressive language model built from scratch with modern architectural improvements. It demonstrates that high-quality language models can be trained efficiently on modest hardware while achieving competitive performance through careful data curation and architectural choices.
Training data: tiiuae/falcon-refinedweb (curated subset)

import torch
from transformers import AutoTokenizer
from model_neo import NeoMini, NeoMiniConfig
# Load model
config = NeoMiniConfig()
model = NeoMini(config)
checkpoint = torch.load("extended_context_model.pt", map_location="cpu")
model.load_state_dict(checkpoint['model_state_dict'])
model.eval()
# Load tokenizer
tokenizer = AutoTokenizer.from_pretrained("gpt2")
# Generate text
prompt = "The future of AI is"
input_ids = tokenizer.encode(prompt, return_tensors="pt")
with torch.no_grad():
output = model.generate(input_ids, max_length=100, temperature=0.8)
print(tokenizer.decode(output))
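The `temperature=0.8` argument above rescales the model's logits before sampling; a minimal pure-Python sketch with hypothetical logits shows the effect (lower temperature sharpens the distribution, higher temperature flattens it):

```python
import math

def softmax(xs):
    # Numerically stable softmax.
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def temperature_probs(logits, temperature):
    # Temperature scaling: divide logits by T, then softmax.
    return softmax([l / temperature for l in logits])

logits = [2.0, 1.0, 0.0]                # hypothetical next-token logits
cool = temperature_probs(logits, 0.5)   # sharper: top token more likely
warm = temperature_probs(logits, 1.5)   # flatter: closer to uniform
```

At temperature 1.0 this reduces to plain softmax, so `temperature=0.8` in the snippet above mildly sharpens the sampling distribution.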
# Or use the interactive chat script from the model repository:
python interactive_chat.py
Intended Uses:
Out-of-Scope Uses:
Antony Austin - Model development and training; model card created 30/08/2025
@misc{mapneo_mini_2025,
title={MAP-NEO Mini: An Efficient 253M Parameter Language Model},
  author={Antony Austin},
year={2025},
howpublished={\url{https://huggingface.co/Austin207/Map-NEO}},
note={Trained on NVIDIA RTX 5070 Laptop GPU with RefinedWeb data}
}
Last Updated: August 30, 2025
Model Version: 1.0.0
Status: Base model (pre-conversational fine-tuning)