# Llama-3.1-8B-Instruct-FP16-GGUF

## Description

Llama 3.1 8B Instruct in FP16 (baseline).

- File: `llama-3.1-8b-instruct-f16.gguf`
- Size: 16 GB
- Format: GGUF
- Category: baseline

## Quick Start

### Download

```bash
huggingface-cli download annus-lums/Llama-3.1-8B-Instruct-FP16-GGUF llama-3.1-8b-instruct-f16.gguf --local-dir .
```
### Usage with llama.cpp

```bash
./llama-cli -m llama-3.1-8b-instruct-f16.gguf -p "Explain quantum computing" -n 128
```
### Benchmark

```bash
./llama-bench -m llama-3.1-8b-instruct-f16.gguf -r 3
```
## Project: AI on Edge Devices

This model is part of an LLM compression research project, where it serves as the uncompressed FP16 baseline.
### Pipeline

- **Pruning:** 20% structured Taylor pruning (MLP layers only)
- **SmoothQuant:** activation smoothing for stable quantization
- **Mixed precision:** sensitivity-based bit-width allocation (Q4/Q5/Q6)
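The first-order Taylor criterion behind the pruning step can be sketched in a few lines. This is an illustrative toy, not the project's actual code: the function names and the tiny weight/gradient matrices are made up for the example. A neuron's importance is estimated as the sum of |weight × gradient| over its incoming weights, and the lowest-scoring 20% of neurons are removed as whole rows (structured pruning).

```python
def taylor_importance(weights, grads):
    # First-order Taylor score per output neuron: sum of |w * dL/dw|
    # over the neuron's incoming weights. A higher score means removing
    # the neuron would perturb the loss more, so it is more important.
    return [sum(abs(w * g) for w, g in zip(w_row, g_row))
            for w_row, g_row in zip(weights, grads)]

def prune_neurons(weights, grads, ratio=0.2):
    # Structured pruning: drop the lowest-scoring `ratio` fraction of
    # whole neurons (rows), mirroring the 20% MLP pruning above.
    scores = taylor_importance(weights, grads)
    n_drop = int(len(scores) * ratio)
    drop = set(sorted(range(len(scores)), key=scores.__getitem__)[:n_drop])
    return [row for i, row in enumerate(weights) if i not in drop]

# Toy MLP layer: 5 output neurons x 3 inputs (made-up values).
W = [[0.9, -1.2, 0.3], [0.01, 0.02, -0.01], [2.0, 0.5, -0.7],
     [-0.4, 0.8, 1.1], [0.6, -0.3, 0.2]]
G = [[0.5, 0.1, -0.2], [0.3, -0.4, 0.2], [0.1, 0.6, 0.5],
     [-0.2, 0.3, 0.1], [0.4, 0.2, -0.5]]

pruned = prune_neurons(W, G, ratio=0.2)
print(len(pruned))  # 5 neurons -> 4 after 20% structured pruning
```

In the real pipeline the same idea applies per MLP layer of the transformer, with gradients accumulated over a calibration set rather than a single toy batch.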
### Results

- Size: 73.6% reduction (15 GB → 4 GB)
- Speed: 273% faster inference (1.16 → 4.33 tok/s)
- Deployment: runs successfully on a Raspberry Pi 4
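The speed figure follows directly from the two throughput numbers quoted above; a quick arithmetic check:

```python
fp16_tps = 1.16        # baseline throughput of this model, tokens/s
compressed_tps = 4.33  # compressed model throughput, tokens/s

speedup = compressed_tps / fp16_tps                       # 3.73x
faster_pct = (compressed_tps - fp16_tps) / fp16_tps * 100 # 273% faster

print(f"{speedup:.2f}x throughput, {faster_pct:.0f}% faster")
```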
## Model Card

- Created by: Group 2 (Annus, Arslan, Naveed, Danyal)
- Institution: LUMS
- Date: December 2024
Citation
@misc{llama31-compressed-Llama-3.1-8B-Instruct-FP16-GGUF,
author = {Group 2},
title = {Llama-3.1-8B-Instruct-FP16-GGUF},
year = {2024},
publisher = {HuggingFace},
url = {https://huggingface.co/annus-lums/Llama-3.1-8B-Instruct-FP16-GGUF}
}
## Model tree for annus-lums/Llama-3.1-8B-Instruct-FP16-GGUF

- Base model: meta-llama/Llama-3.1-8B
- Finetuned: meta-llama/Llama-3.1-8B-Instruct