Zagreus-0.4B-ita

Zagreus-0.4B-ita is a bilingual English/Italian foundational Small Language Model (SLM) trained from scratch by the mii-llm community (Made in Italy – Large Language Model) on the Seeweb HPC infrastructure.

This is a base (pre-trained) model — it is not instruction-tuned and is intended for researchers, developers, and practitioners who want to fine-tune or build upon a high-quality bilingual English/Italian foundation. It serves as the base for the entire Nesso model family.

The Zagreus family represents one of the few openly released, high-performing small language models dedicated to European Romance languages, trained entirely from first principles with a fully transparent pipeline.


Model Details

| Property | Value |
|---|---|
| Architecture | Modified Llama-3.2 (fully dense) |
| Parameters | ~400M |
| Hidden size | 960 |
| Intermediate size | 2560 |
| Layers | 32 |
| Attention heads | 15 (KV heads: 5) |
| Activation | SiLU |
| Context length | 4096 tokens |
| Tokenizer | Llama-3.2 (vocab_size: 128,256) |
| Positional encoding | RoPE (theta: 10000.0) |
| Tied embeddings | Yes |
| Precision | BF16 |
| Languages | English (400B tokens), Italian (400B tokens) |
| Training tokens | ~1 trillion |
| Training framework | Nanotron (mii-llm fork) |
| Infrastructure | 64× NVIDIA A100 GPUs (8 nodes × 8 GPUs), Seeweb HPC |
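
As a sanity check, the parameter count implied by the table can be reproduced for a standard Llama-style dense transformer with grouped-query attention, SwiGLU MLPs, and tied embeddings (a sketch based on the usual Llama weight layout, not a dump of the actual checkpoint):

```python
# Hyperparameters from the table above
vocab_size, hidden, inter, layers = 128_256, 960, 2_560, 32
n_heads, n_kv_heads = 15, 5
head_dim = hidden // n_heads  # 64

# Grouped-query attention: q/o project to n_heads*head_dim, k/v to n_kv_heads*head_dim
attn = hidden * (n_heads * head_dim)           # q_proj
attn += 2 * hidden * (n_kv_heads * head_dim)   # k_proj + v_proj
attn += (n_heads * head_dim) * hidden          # o_proj

# SwiGLU MLP: gate and up (hidden -> inter) plus down (inter -> hidden)
mlp = 3 * hidden * inter

# Two RMSNorms per layer, one final norm; tied embeddings counted once
per_layer = attn + mlp + 2 * hidden
total = layers * per_layer + hidden + vocab_size * hidden

print(f"{total / 1e9:.2f}B parameters")  # ~0.44B
```

The embedding matrix alone accounts for ~123M of the ~438M parameters, which is why tying the input and output embeddings matters so much at this scale.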

Training Data

All datasets used are fully open source and released by Hugging Face:

| Dataset | Tokens | Description |
|---|---|---|
| FineWeb (350BT sample) | ~350B | High-quality English web text |
| FineWeb-2 (ita_Latn) | | Italian web text |
| FinePDFs (ita_Latn) | | Italian PDF documents |
| StarCoder Data | ~250B | Multilingual code |

Token distribution: ~400B English + ~400B Italian + ~200B Code ≈ 1 trillion tokens total
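
Simple arithmetic on the figures above confirms the mixture proportions:

```python
# Approximate token counts per language/domain, from the distribution above
mixture = {"English": 400e9, "Italian": 400e9, "Code": 200e9}
total = sum(mixture.values())

for name, tokens in mixture.items():
    print(f"{name}: {tokens / total:.0%}")  # 40% / 40% / 20%
print(f"Total: {total / 1e12:.1f}T tokens")
```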

Tokenization

Raw datasets were tokenized using the Llama-3.2 tokenizer (meta-llama/Llama-3.2-1B) via the datatrove library. The process ran for over three weeks of continuous computation on CPU nodes via Slurm, generating approximately 3–5 TB of tokenized data shards.


Architecture Choice

We adopted a modified Llama-3.2 fully dense architecture. The choice of a dense model over Mixture-of-Experts (MoE) in the small-parameter regime (~500M) was deliberate: in tightly constrained capacity settings, routing overhead and expert under-utilization typical of MoE architectures may offset their theoretical efficiency advantages. Dense models provide better compute utilization and more stable training dynamics at this scale.


Pre-training Configuration

Full Nanotron YAML configuration used for training:

checkpoints:
  checkpoint_interval: 5000
  checkpoints_path: checkpoints_zagreus_ita_v2
  checkpoints_path_is_shared_file_system: false
  resume_checkpoint_path: /training/pretraining/nanotron/checkpoints_zagreus_ita_v2/630000
  save_final_state: false
  save_initial_state: false
data_stages:
- data:
    dataset:
      dataset_folder:
      - /training/pretraining/fineweb-ita/tokenized
      - /training/pretraining/fineweb-edu-350BT/000_tokenized_output
      - /training/pretraining/fineweb-edu-350BT/011_tokenized_output
      - /training/pretraining/fineweb-edu-350BT/012_tokenized_output
      - /training/pretraining/fineweb-edu-350BT/013_tokenized_output
      - /training/pretraining/fineweb-edu-350BT/014_tokenized_output
      - /training/pretraining/fineweb-edu-350BT/015_tokenized_output
      - /training/pretraining/fineweb-edu-350BT/016_tokenized_output
      - /training/pretraining/finepdf-ita/000_tokenized_output
      - /training/pretraining/starcoder_tokenized/000_tokenized_output
    num_loading_workers: 0
    seed: 8
  name: stable phase
  start_training_step: 1
general:
  benchmark_csv_path: null
  consumed_train_samples: null
  ignore_sanity_checks: true
  project: zagreus
  run: zagreus-350M
  seed: 8
  step: null
logging:
  iteration_step_info_interval: 1
  log_level: info
  log_level_replica: info
model:
  ddp_bucket_cap_mb: 100
  dtype: bfloat16
  init_method:
    std: 0.03227
  make_vocab_size_divisible_by: 1
  model_config:
    bos_token_id: 128000
    eos_token_id: 128001
    hidden_act: silu
    hidden_size: 960
    initializer_range: 0.02
    intermediate_size: 2560
    is_llama_config: true
    max_position_embeddings: 4096
    num_attention_heads: 15
    num_hidden_layers: 32
    num_key_value_heads: 5
    pad_token_id: null
    pretraining_tp: 1
    rms_norm_eps: 1.0e-05
    rope_interleaved: false
    rope_scaling: null
    rope_theta: 10000.0
    tie_word_embeddings: true
    use_cache: true
    vocab_size: 128256
optimizer:
  accumulate_grad_in_fp32: true
  clip_grad: 1.0
  learning_rate_scheduler:
    learning_rate: 0.003
    lr_decay_starting_step: 750000
    lr_decay_steps: 50000
    lr_decay_style: linear
    lr_warmup_steps: 4000
    lr_warmup_style: linear
    min_decay_lr: 1.0e-7
  optimizer_factory:
    adam_beta1: 0.9
    adam_beta2: 0.95
    adam_eps: 1.0e-08
    name: adamW
    torch_adam_is_fused: true
  weight_decay: 0.01
  zero_stage: 0
parallelism:
  dp: 64
  expert_parallel_size: 1
  pp: 1
  pp_engine: 1f1b
  recompute_layer: false
  tp: 1
  tp_linear_async_communication: true
  tp_mode: REDUCE_SCATTER
  tp_recompute_allgather: true
profiler: null
tokenizer:
  tokenizer_max_length: null
  tokenizer_name_or_path: meta-llama/Llama-3.2-1B
  tokenizer_revision: null
tokens:
  batch_accumulation_per_replica: 1
  limit_test_batches: 0
  limit_val_batches: 0
  micro_batch_size: 4
  sequence_length: 4096
  train_steps: 2000000
  val_check_interval: 5000
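
From the parallelism and tokens sections of this config, the effective global batch follows directly: dp × micro_batch_size × batch_accumulation_per_replica sequences per optimizer step, each of sequence_length tokens.

```python
# Values taken from the YAML above (dp, micro_batch_size,
# batch_accumulation_per_replica, sequence_length)
dp, micro_batch, grad_accum, seq_len = 64, 4, 1, 4096

sequences_per_step = dp * micro_batch * grad_accum   # 256 sequences
tokens_per_step = sequences_per_step * seq_len       # ~1M tokens per step

print(f"{tokens_per_step:,} tokens/step")
# At this rate, the ~775k checkpointed steps correspond to ~0.8T tokens
print(f"{775_000 * tokens_per_step / 1e12:.2f}T tokens after 775k steps")
```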

Slurm Launch Script

#!/bin/bash
#SBATCH --job-name=350_it
#SBATCH --account=YOUR_ACCOUNT
#SBATCH --partition=PARTITION
#SBATCH --nodes=8
#SBATCH --gres=gpu:8            # 8 A100 per node = 64 total
#SBATCH --cpus-per-task=32
#SBATCH --time=4-00:00:00
#SBATCH --output=slurm-%j.out

################ 0. Environment ################
module purge
module load profile/global
module load python/3.11 cuda/12.2 cudnn nccl gcc

source /path/to/venv/nanotron/bin/activate

export HF_HOME=/path/to/hf_home
export TRANSFORMERS_OFFLINE=1
export HF_HUB_OFFLINE=1
export HF_DATASETS_OFFLINE=1
export OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK
export NCCL_IB_DISABLE=0
export NCCL_SOCKET_IFNAME="ib0,eno,eth"
export WANDB_MODE=disabled

################ 1. Distributed vars ############
GPUS_PER_NODE=8                 # must match --gres=gpu:8 (8 nodes × 8 GPUs = 64 total)
NNODES=$SLURM_JOB_NUM_NODES
NODE_RANK=$SLURM_NODEID
MASTER_ADDR=$(scontrol show hostnames $SLURM_JOB_NODELIST | head -n1)
MASTER_PORT=29400
RDZV_ID=$SLURM_JOB_ID

################ 2. Launch ######################
srun torchrun \
      --nnodes $NNODES \
      --nproc_per_node $GPUS_PER_NODE \
      --rdzv_id $RDZV_ID \
      --rdzv_backend c10d \
      --rdzv_endpoint $MASTER_ADDR:$MASTER_PORT \
      /path/to/nanotron/run_train.py \
      --config-file smollm2/zagreus_350M_ita.yaml
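
The launch topology must match the data-parallel degree in the Nanotron config: with tp = pp = 1, the entire world size maps to dp, so 8 nodes × 8 GPUs per node must equal the dp: 64 set in the YAML. A quick check:

```python
# Topology from the Slurm script and parallelism section of the config
nodes, gpus_per_node = 8, 8
tp, pp = 1, 1

world_size = nodes * gpus_per_node
dp = world_size // (tp * pp)

print(f"world_size={world_size}, dp={dp}")  # dp must equal 64, as in the YAML
```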

Checkpoint Conversion to Hugging Face Format

torchrun --nproc_per_node=1 -m examples.llama.convert_nanotron_to_hf \
  --checkpoint_path=checkpoints/544000 \
  --save_path=hf_checkpoints/544000 \
  --tokenizer_name meta-llama/Llama-3.2-1B

Evaluation

Evaluation Commands

lm-eval --model hf --model_args pretrained=<checkpoint> \
  --tasks m_mmlu_it --num_fewshot 5 --device cuda:0 --batch_size 1

lm-eval --model hf --model_args pretrained=<checkpoint> \
  --tasks hellaswag_it,arc_it --device cuda:0 --batch_size 1

Checkpoint Progression

The table below tracks benchmark scores across training checkpoints over the course of pre-training:

| Checkpoint | MMLU IT ↑ | HellaSwag IT ↑ | ARC IT ↑ | Average |
|---|---|---|---|---|
| v2-95k | 0.2529 | 0.3366 | 0.2652 | 0.2849 |
| v2-205k | 0.2628 | | | 0.2628 |
| v2-290k | 0.2428 | 0.3492 | 0.2335 | 0.2752 |
| v2-305k | 0.2598 | 0.3562 | 0.2652 | 0.2937 |
| v2-365k | 0.2566 | 0.3664 | 0.2712 | 0.2981 |
| v2-390k | 0.2556 | 0.3438 | 0.2498 | 0.2831 |
| v2-460k | 0.2540 | 0.3778 | 0.2549 | 0.2956 |
| v2-520k | 0.2540 | 0.3778 | 0.2549 | 0.2956 |
| v2-590k | 0.2547 | 0.3651 | 0.2455 | 0.2884 |
| v2-630k | 0.2562 | 0.3632 | 0.2643 | 0.2946 |
| v2-680k | 0.2538 | 0.3740 | 0.2592 | 0.2957 |
| v2-775k | 0.2535 | 0.3750 | 0.2583 | 0.2956 |
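
The Average column is the plain mean of the three task scores, which can be verified for a few checkpoints:

```python
# (MMLU IT, HellaSwag IT, ARC IT) for a sample of checkpoints from the table
checkpoints = {
    "v2-95k":  (0.2529, 0.3366, 0.2652),  # -> 0.2849
    "v2-365k": (0.2566, 0.3664, 0.2712),  # -> 0.2981
    "v2-775k": (0.2535, 0.3750, 0.2583),  # -> 0.2956
}

for name, scores in checkpoints.items():
    print(f"{name}: {sum(scores) / len(scores):.4f}")
```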

Evalita Benchmark

Evalita is a comprehensive Italian NLP evaluation suite benchmarking models across a wide range of linguistic tasks, from classification and extraction to generation and semantic understanding. Evaluation was conducted using the evalita-mp task suite from lm-evaluation-harness.

Evaluation Command

lm_eval --model hf \
  --model_args pretrained=mii-llm/zagreus-0.4B-ita \
  --tasks evalita-mp \
  --device cuda:0 \
  --batch_size 1

Results

| Task | Metric | Score |
|---|---|---|
| Evalita-LLM (Overall) | acc | 0.3226 |
| Admission Test | acc | 0.2137 |
| FAQ | acc | 0.2681 |
| Hate Speech Detection | f1 | 0.6056 |
| Lexical Substitution | f1 | 0.0000 |
| NER | f1 | 0.1611 |
| Relation Extraction | f1 | 0.1244 |
| Sentiment Analysis | f1 | 0.3660 |
| Summarization (Fanpage) | rouge1 | 0.1947 |
| Text Entailment | acc | 0.5133 |
| Word in Context | f1 | 0.4697 |

Evalita results serve as a zero-shot baseline for the base model. For the comparison between the base model and the SFT variant, see the Open-Zagreus-0.4B model card.


Usage

This is a base model — it performs causal language modelling (text completion) and is not instruction-tuned. It is best suited as a starting point for fine-tuning.

from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

model_id = "mii-llm/zagreus-0.4B-ita"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    device_map="auto"
)

# Base model: text completion, not instruction following
prompt = "L'intelligenza artificiale è una disciplina che"

inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

output = model.generate(
    **inputs,
    max_new_tokens=200,
    temperature=0.8,
    do_sample=True,
    repetition_penalty=1.1
)

print(tokenizer.decode(output[0], skip_special_tokens=True))

For instruction following, use the post-trained Nesso variants listed in the model family below.


Full Model Family

Base Models (Zagreus)

| Model | Languages | HuggingFace |
|---|---|---|
| Zagreus-0.4B-ita (this model) | English + Italian | 🤗 Link |
| Zagreus-0.4B-spa | English + Spanish | 🤗 Link |
| Zagreus-0.4B-por | English + Portuguese | 🤗 Link |
| Zagreus-0.4B-fra | English + French | 🤗 Link |

Post-trained Models (Nesso)

| Model | Use Case | HuggingFace |
|---|---|---|
| Nesso-0.4B-instruct | Conversational / Instruction following | 🤗 Link |
| Nesso-0.4B-agentic | Function calling / Agentic | 🤗 Link |
| Open-Zagreus-0.4B | Fully open source | 🤗 Link |

Citation

If you use this model in your research, please cite:

@misc{zagreus2025,
  title        = {The Joy and Pain of Training an LLM from Scratch:
                  A Technical Report on the Zagreus and Nesso Model Families},
  author       = {mii-llm community},
  year         = {2025},
  howpublished = {\url{https://github.com/mii-llm/zagreus-nesso-slm}},
}

Acknowledgements

  • Antonio Baldassarra (CEO, Seeweb) and Marco Cristofanilli (Head of AI, Seeweb) for commissioning and sponsoring the infrastructure
  • The Hugging Face team for Nanotron, datatrove, FineWeb, FineWeb-2, and FinePDFs
  • The mii-llm open-source community for contributions to multilingual evaluation harnesses and the Nanotron fork

License

Released under the Apache 2.0 license.

Made with ❤️ in Italy by mii-llm
