Krave 2.5
Krave 2.5 is an open-source Mixture-of-Experts (MoE) language model for users who want to run their own local or private LLM setup.
Table of Contents
1. Introduction
2. Model Summary
3. Model Downloads
4. How to Run Locally
5. Krave Engine
6. License
1. Introduction
Krave 2.5 is a strong Mixture-of-Experts (MoE) language model with 671B total parameters, of which 37B are activated per token. It adopts Multi-head Latent Attention (MLA) and a MoE architecture for efficient inference and cost-effective training. This repository provides the open-source code only; users must supply their own weight files for the model to function.
2. Model Summary
Architecture
- Auxiliary-loss-free load balancing: minimizes performance degradation while encouraging balanced load across experts.
- Multi-Token Prediction (MTP): improves model performance and enables speculative decoding for faster inference.
- Multi-head Latent Attention (MLA): efficient attention mechanism with low-rank key-value compression.
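The KV-cache saving behind MLA's low-rank compression can be sketched with a quick calculation. The head and latent dimensions below are illustrative assumptions, not Krave 2.5's actual configuration:

```python
# Sketch of why MLA's low-rank key-value compression shrinks the KV cache.
# Dimensions are illustrative, not Krave 2.5's actual configuration.
n_heads, d_head, d_latent = 8, 64, 64

# Standard multi-head attention caches full keys and values per token:
full_kv_floats = 2 * n_heads * d_head     # 1024 floats/token

# MLA caches only the compressed latent; keys and values are re-expanded
# from it by up-projection matrices at attention time:
mla_cache_floats = d_latent               # 64 floats/token

print(full_kv_floats / mla_cache_floats)  # 16.0x smaller cache per token
```

The up-projections add a little compute per attention call, but the cache shrinks by the ratio of the full KV width to the latent width, which is what makes long-context inference cheaper.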
3. Model Downloads
This project does not ship model weights. To run Krave 2.5, you must provide your own compatible weight files.
4. How to Run Locally
System Requirements
Linux with Python 3.10 or later. macOS and Windows are not supported.
Dependencies:
torch==2.4.1
triton==3.0.0
transformers==4.46.3
safetensors==0.4.5
Setup
Clone the repository:
git clone https://github.com/kraveorg/Krave-2.5.git
cd Krave-2.5
Install dependencies:
cd inference
pip install -r requirements.txt
Provide your own weight files in the expected checkpoint directory before running inference.
Convert Weights
If you already have compatible weights, convert them to the required format:
python convert.py --hf-ckpt-path /path/to/your-weights \
--save-path /path/to/Krave-2.5-Demo \
--n-experts 256 --model-parallel 16
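As a rough illustration of how the --n-experts and --model-parallel flags relate, the sketch below partitions the experts evenly across model-parallel ranks. The slicing logic is a hypothetical illustration, not the actual convert.py implementation:

```python
# Hypothetical sketch of expert sharding across model-parallel ranks.
# Values mirror the example command; convert.py may differ in detail.
n_experts = 256
model_parallel = 16
assert n_experts % model_parallel == 0, "experts must divide evenly"
experts_per_rank = n_experts // model_parallel  # 16 experts on each rank

def experts_for_rank(rank: int) -> range:
    """Expert indices assigned to one model-parallel rank."""
    start = rank * experts_per_rank
    return range(start, start + experts_per_rank)

print(list(experts_for_rank(0)))   # experts 0..15
print(list(experts_for_rank(15)))  # experts 240..255
```

This is why --n-experts must be divisible by --model-parallel: each rank's checkpoint shard holds an equal, contiguous slice of the expert weights.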
Run: Interactive Mode
torchrun --nproc-per-node=8 inference/generate.py \
--ckpt-path /path/to/your-weights \
--config inference/configs/config_671B.json \
--interactive --temperature 0.7 --max-new-tokens 200
Run: Batch Mode
torchrun --nproc-per-node=8 inference/generate.py \
--ckpt-path /path/to/your-weights \
--config inference/configs/config_671B.json \
--input-file prompts.txt
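Batch mode reads prompts from the file passed via --input-file. A minimal sketch, assuming generate.py consumes one prompt per line (check the script for the exact input format):

```shell
# Hypothetical prompts.txt: one prompt per line is an assumption here;
# see inference/generate.py for the format it actually expects.
cat > prompts.txt <<'EOF'
Explain quantum computing in simple terms.
Summarize the benefits of Mixture-of-Experts models.
EOF
wc -l prompts.txt
```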
Hardware Recommendations
- Minimum: 8x A100 (80GB) GPUs
- Recommended: 8x H100 (80GB) GPUs per node, 2 nodes
- NCCL for multi-GPU / multi-node communication
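These recommendations follow from a back-of-envelope memory estimate: all 671B parameters must be resident in GPU memory even though only 37B are activated per token. A rough sketch that ignores activation and KV-cache memory:

```python
# Rough GPU-memory estimate for 671B total parameters. Activations and
# KV cache are ignored, so real deployments need extra headroom.
TOTAL_PARAMS = 671e9
GIB = 1024**3
bytes_per_param = {"bf16": 2, "fp8": 1}

footprint = {}
for dtype, nbytes in bytes_per_param.items():
    total_gib = TOTAL_PARAMS * nbytes / GIB
    footprint[dtype] = {
        "total_gib": total_gib,
        "per_gpu_8": total_gib / 8,    # one 8-GPU node
        "per_gpu_16": total_gib / 16,  # two 8-GPU nodes
    }
    print(f"{dtype}: {total_gib:.0f} GiB total, "
          f"{total_gib / 8:.0f} GiB/GPU on 8 GPUs, "
          f"{total_gib / 16:.0f} GiB/GPU on 16 GPUs")
```

By this estimate, FP8 weights just fit across a single 8x80GB node, while BF16 weights need the two-node setup, matching the minimum and recommended configurations above.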
Supported Inference Frameworks
- SGLang: BF16 and FP8, multi-node tensor parallelism
- LMDeploy: FP8 and BF16, offline and online serving
- TensorRT-LLM: BF16, INT4/INT8
- vLLM: FP8 and BF16, tensor and pipeline parallelism
- LightLLM: FP8 and BF16, single and multi-node
- AMD GPU: via SGLang, BF16 and FP8
- Huawei Ascend NPU: INT8 and BF16
5. Krave Engine
Krave 2.5 includes a built-in Krave Engine, a lightweight Python interface for loading and running the model programmatically.
from engine import KraveEngine

engine = KraveEngine(
    ckpt_path="/path/to/your-weights",
    config="inference/configs/config_671B.json",
)

response = engine.generate("Explain quantum computing in simple terms.")
print(response)
See engine.py for full API documentation.
6. License
This code repository is licensed under the MIT License. Model weights are subject to the Model License. Krave 2.5 supports commercial use.