Llama-3.1-8B-Instruct-FP16-GGUF

Description

Llama 3.1 8B Instruct in full FP16 precision. This is the uncompressed baseline against which the compressed variants in this project are measured.

File: llama-3.1-8b-instruct-f16.gguf
Size: 16 GB
Format: GGUF
Category: baseline

Quick Start

Download

huggingface-cli download annus-lums/Llama-3.1-8B-Instruct-FP16-GGUF llama-3.1-8b-instruct-f16.gguf --local-dir .
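
If you prefer Python, the same file can be fetched with the huggingface_hub library (a minimal sketch; the repo and filename are taken from the command above):

from huggingface_hub import hf_hub_download

# Download the FP16 GGUF into the current directory and return its local path
path = hf_hub_download(
    repo_id="annus-lums/Llama-3.1-8B-Instruct-FP16-GGUF",
    filename="llama-3.1-8b-instruct-f16.gguf",
    local_dir=".",
)
print(path)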

Usage with llama.cpp

./llama-cli -m llama-3.1-8b-instruct-f16.gguf -p "Explain quantum computing" -n 128
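
The same prompt can also be run from Python via the llama-cpp-python bindings (a sketch; n_ctx=2048 is an illustrative context size, not a project setting):

from llama_cpp import Llama

# Load the FP16 GGUF and generate up to 128 tokens, mirroring the CLI call above
llm = Llama(model_path="llama-3.1-8b-instruct-f16.gguf", n_ctx=2048)
out = llm("Explain quantum computing", max_tokens=128)
print(out["choices"][0]["text"])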

Benchmark

./llama-bench -m llama-3.1-8b-instruct-f16.gguf -r 3
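
For a rough Python-side check of the tokens-per-second figure that llama-bench reports, you can time generation directly (a sketch using llama-cpp-python; the three repetitions mirror the -r 3 flag above):

import time
from llama_cpp import Llama

llm = Llama(model_path="llama-3.1-8b-instruct-f16.gguf", n_ctx=2048)
for run in range(3):
    t0 = time.perf_counter()
    out = llm("Explain quantum computing", max_tokens=128)
    elapsed = time.perf_counter() - t0
    # completion_tokens counts only generated tokens, like a text-generation benchmark
    tokens = out["usage"]["completion_tokens"]
    print(f"run {run + 1}: {tokens / elapsed:.2f} tok/s")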

Project: AI on Edge Devices

This model is part of an LLM compression research project.

Pipeline

  1. Pruning: 20% structured Taylor pruning (MLP layers only)
  2. SmoothQuant: Activation smoothing for stable quantization
  3. Mixed Precision: Sensitivity-based bit-width allocation (Q4/Q5/Q6); all three stages are sketched below
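
The sketch below illustrates the idea behind each stage with NumPy on toy tensors. The function names, shapes, and thresholds are ours for illustration and are not the project's actual code:

import numpy as np

def taylor_importance(weight, grad):
    # First-order Taylor score per output channel: sum of |w * dL/dw| over each row.
    # weight and grad have shape (out_features, in_features).
    return np.abs(weight * grad).sum(axis=1)

def prune_channels(weight, grad, ratio=0.2):
    # Structured pruning: drop the lowest-scoring 20% of output channels (rows).
    scores = taylor_importance(weight, grad)
    keep = scores >= np.quantile(scores, ratio)
    return weight[keep]

def smoothquant_scales(acts, weight, alpha=0.5):
    # Per-input-channel smoothing factors s_j = max|X_j|^alpha / max|W_j|^(1-alpha).
    # acts: (tokens, in_features); weight: (in_features, out_features).
    # Dividing activations by s and multiplying weights by s keeps X @ W unchanged
    # while moving outlier magnitude from activations into weights.
    a_max = np.abs(acts).max(axis=0)
    w_max = np.abs(weight).max(axis=1)
    return a_max ** alpha / w_max ** (1.0 - alpha)

def allocate_bits(sensitivity):
    # Sensitivity-based allocation: layers whose quantization hurts the loss most
    # keep more bits (Q6); the most robust third drops to Q4.
    lo, hi = np.quantile(sensitivity, [0.33, 0.66])
    return [6 if s > hi else 5 if s > lo else 4 for s in sensitivity]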

Results

  • Size: 73.6% reduction (15 GB → 4 GB)
  • Speed: 273% faster inference (1.16 → 4.33 tok/s, roughly a 3.7× throughput gain)
  • Deployment: Runs successfully on a Raspberry Pi 4

Model Card

Created by: Group 2 (Annus, Arslan, Naveed, Danyal)
Institution: LUMS
Date: December 2024

Citation

@misc{llama31-compressed-Llama-3.1-8B-Instruct-FP16-GGUF,
  author = {Group 2},
  title = {Llama-3.1-8B-Instruct-FP16-GGUF},
  year = {2024},
  publisher = {HuggingFace},
  url = {https://huggingface.co/annus-lums/Llama-3.1-8B-Instruct-FP16-GGUF}
}