Llama-3.1-8B-Instruct-FP16-GGUF

Description

Llama 3.1 8B Instruct in full FP16 precision. This is the uncompressed baseline against which the compressed variants in this project are measured.

File: llama-3.1-8b-instruct-f16.gguf
Size: 16 GB
Format: GGUF
Category: baseline

Quick Start

Download

huggingface-cli download annus-lums/Llama-3.1-8B-Instruct-FP16-GGUF llama-3.1-8b-instruct-f16.gguf --local-dir .
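
If you prefer Python, the same file can be fetched with the huggingface_hub library (a minimal sketch; the repo and filename are taken from the command above):

from huggingface_hub import hf_hub_download

# Download the FP16 GGUF into the current directory and return its local path
path = hf_hub_download(
    repo_id="annus-lums/Llama-3.1-8B-Instruct-FP16-GGUF",
    filename="llama-3.1-8b-instruct-f16.gguf",
    local_dir=".",
)
print(path)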

Usage with llama.cpp

./llama-cli -m llama-3.1-8b-instruct-f16.gguf -p "Explain quantum computing" -n 128
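
The same prompt can also be run from Python via the llama-cpp-python bindings (a sketch; n_ctx=2048 is an illustrative context size, not a project setting):

from llama_cpp import Llama

# Load the FP16 GGUF and generate up to 128 tokens, mirroring the CLI call above
llm = Llama(model_path="llama-3.1-8b-instruct-f16.gguf", n_ctx=2048)
out = llm("Explain quantum computing", max_tokens=128)
print(out["choices"][0]["text"])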

Benchmark

./llama-bench -m llama-3.1-8b-instruct-f16.gguf -r 3
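
For a rough Python-side check of the tokens-per-second figure that llama-bench reports, you can time generation directly (a sketch using llama-cpp-python; the three repetitions mirror the -r 3 flag above):

import time
from llama_cpp import Llama

llm = Llama(model_path="llama-3.1-8b-instruct-f16.gguf", n_ctx=2048)
for run in range(3):
    t0 = time.perf_counter()
    out = llm("Explain quantum computing", max_tokens=128)
    elapsed = time.perf_counter() - t0
    # completion_tokens counts only generated tokens, like a text-generation benchmark
    tokens = out["usage"]["completion_tokens"]
    print(f"run {run + 1}: {tokens / elapsed:.2f} tok/s")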

Project: AI on Edge Devices

This model is part of an LLM compression research project.

Pipeline

  1. Pruning: 20% structured Taylor pruning (MLP layers only)
  2. SmoothQuant: Activation smoothing for stable quantization
  3. Mixed Precision: Sensitivity-based bit-width allocation (Q4/Q5/Q6); all three stages are sketched below
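
The sketch below illustrates the idea behind each stage with NumPy on toy tensors. The function names, shapes, and thresholds are ours for illustration and are not the project's actual code:

import numpy as np

def taylor_importance(weight, grad):
    # First-order Taylor score per output channel: sum of |w * dL/dw| over each row.
    # weight and grad have shape (out_features, in_features).
    return np.abs(weight * grad).sum(axis=1)

def prune_channels(weight, grad, ratio=0.2):
    # Structured pruning: drop the lowest-scoring 20% of output channels (rows).
    scores = taylor_importance(weight, grad)
    keep = scores >= np.quantile(scores, ratio)
    return weight[keep]

def smoothquant_scales(acts, weight, alpha=0.5):
    # Per-input-channel smoothing factors s_j = max|X_j|^alpha / max|W_j|^(1-alpha).
    # acts: (tokens, in_features); weight: (in_features, out_features).
    # Dividing activations by s and multiplying weights by s keeps X @ W unchanged
    # while moving outlier magnitude from activations into weights.
    a_max = np.abs(acts).max(axis=0)
    w_max = np.abs(weight).max(axis=1)
    return a_max ** alpha / w_max ** (1.0 - alpha)

def allocate_bits(sensitivity):
    # Sensitivity-based allocation: layers whose quantization hurts the loss most
    # keep more bits (Q6); the most robust third drops to Q4.
    lo, hi = np.quantile(sensitivity, [0.33, 0.66])
    return [6 if s > hi else 5 if s > lo else 4 for s in sensitivity]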

Results

  • Size: 73.6% reduction (15 GB → 4 GB)
  • Speed: 273% faster inference (1.16 → 4.33 tok/s, roughly a 3.7× throughput gain)
  • Deployment: Runs successfully on a Raspberry Pi 4

Model Card

Created by: Group 2 (Annus, Arslan, Naveed, Danyal)
Institution: LUMS
Date: December 2024

Citation

@misc{llama31-compressed-Llama-3.1-8B-Instruct-FP16-GGUF,
  author = {Group 2},
  title = {Llama-3.1-8B-Instruct-FP16-GGUF},
  year = {2024},
  publisher = {HuggingFace},
  url = {https://huggingface.co/annus-lums/Llama-3.1-8B-Instruct-FP16-GGUF}
}