Instructions to use elyza/Llama-3-ELYZA-JP-8B-GGUF with libraries, inference providers, notebooks, and local apps. Follow these links to get started.

Libraries

How to use elyza/Llama-3-ELYZA-JP-8B-GGUF with Transformers:

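# Load model directly (the GGUF file is dequantized on load; needs the `gguf` package)
from transformers import AutoModelForCausalLM
model = AutoModelForCausalLM.from_pretrained(
    "elyza/Llama-3-ELYZA-JP-8B-GGUF", gguf_file="Llama-3-ELYZA-JP-8B-q4_k_m.gguf", dtype="auto"
)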

llama-cpp-python

How to use elyza/Llama-3-ELYZA-JP-8B-GGUF with llama-cpp-python:

# !pip install llama-cpp-python

from llama_cpp import Llama

llm = Llama.from_pretrained(
	repo_id="elyza/Llama-3-ELYZA-JP-8B-GGUF",
	filename="Llama-3-ELYZA-JP-8B-q4_k_m.gguf",
)

llm.create_chat_completion(
	messages=[
		{"role": "user", "content": "古代ギリシャを学ぶ上で知っておくべきポイントは？"}
	]
)
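
The chat call above returns an OpenAI-style completion dictionary. The following is a minimal sketch of building the message list and reading the reply text out of that dictionary. `build_messages` and `extract_reply` are hypothetical helper names, not part of llama-cpp-python, and the system prompt is the one used in the API examples of this model card.

```python
# Hypothetical helpers (not part of llama-cpp-python) around create_chat_completion.
SYSTEM_PROMPT = "あなたは誠実で優秀な日本人のアシスタントです。特に指示が無い場合は、常に日本語で回答してください。"


def build_messages(user_text, system_prompt=SYSTEM_PROMPT):
    """Return a chat message list: one system turn followed by one user turn."""
    return [
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": user_text},
    ]


def extract_reply(response):
    """Return the assistant text from an OpenAI-style chat completion dict."""
    return response["choices"][0]["message"]["content"]


# Usage, assuming `llm` was created with Llama.from_pretrained as shown earlier:
# response = llm.create_chat_completion(messages=build_messages("古代ギリシャを学ぶ上で知っておくべきポイントは？"))
# print(extract_reply(response))
```

Keeping the message construction in one place makes it easy to reuse the same system prompt across calls.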

Local Apps

llama.cpp

How to use elyza/Llama-3-ELYZA-JP-8B-GGUF with llama.cpp:

Install from brew

brew install llama.cpp
# Start a local OpenAI-compatible server with a web UI:
llama-server -hf elyza/Llama-3-ELYZA-JP-8B-GGUF:Q4_K_M
# Run inference directly in the terminal:
llama-cli -hf elyza/Llama-3-ELYZA-JP-8B-GGUF:Q4_K_M

Install from WinGet (Windows)

winget install llama.cpp
# Start a local OpenAI-compatible server with a web UI:
llama-server -hf elyza/Llama-3-ELYZA-JP-8B-GGUF:Q4_K_M
# Run inference directly in the terminal:
llama-cli -hf elyza/Llama-3-ELYZA-JP-8B-GGUF:Q4_K_M

Use pre-built binary

# Download pre-built binary from:
# https://github.com/ggerganov/llama.cpp/releases
# Start a local OpenAI-compatible server with a web UI:
./llama-server -hf elyza/Llama-3-ELYZA-JP-8B-GGUF:Q4_K_M
# Run inference directly in the terminal:
./llama-cli -hf elyza/Llama-3-ELYZA-JP-8B-GGUF:Q4_K_M

Build from source code

git clone https://github.com/ggerganov/llama.cpp.git
cd llama.cpp
cmake -B build
cmake --build build -j --target llama-server llama-cli
# Start a local OpenAI-compatible server with a web UI:
./build/bin/llama-server -hf elyza/Llama-3-ELYZA-JP-8B-GGUF:Q4_K_M
# Run inference directly in the terminal:
./build/bin/llama-cli -hf elyza/Llama-3-ELYZA-JP-8B-GGUF:Q4_K_M

Use Docker

docker model run hf.co/elyza/Llama-3-ELYZA-JP-8B-GGUF:Q4_K_M

Ollama
How to use elyza/Llama-3-ELYZA-JP-8B-GGUF with Ollama:
```
ollama run hf.co/elyza/Llama-3-ELYZA-JP-8B-GGUF:Q4_K_M
```

Unsloth Studio

How to use elyza/Llama-3-ELYZA-JP-8B-GGUF with Unsloth Studio:

Install Unsloth Studio (macOS, Linux, WSL)

curl -fsSL https://unsloth.ai/install.sh | sh
# Run unsloth studio
unsloth studio -H 0.0.0.0 -p 8888
# Then open http://localhost:8888 in your browser
# Search for elyza/Llama-3-ELYZA-JP-8B-GGUF to start chatting

Install Unsloth Studio (Windows)

irm https://unsloth.ai/install.ps1 | iex
# Run unsloth studio
unsloth studio -H 0.0.0.0 -p 8888
# Then open http://localhost:8888 in your browser
# Search for elyza/Llama-3-ELYZA-JP-8B-GGUF to start chatting

Using HuggingFace Spaces for Unsloth

# No setup required
# Open https://huggingface.co/spaces/unsloth/studio in your browser
# Search for elyza/Llama-3-ELYZA-JP-8B-GGUF to start chatting

Docker Model Runner
How to use elyza/Llama-3-ELYZA-JP-8B-GGUF with Docker Model Runner:
```
docker model run hf.co/elyza/Llama-3-ELYZA-JP-8B-GGUF:Q4_K_M
```

Lemonade

How to use elyza/Llama-3-ELYZA-JP-8B-GGUF with Lemonade:

Pull the model

# Download Lemonade from https://lemonade-server.ai/
lemonade pull elyza/Llama-3-ELYZA-JP-8B-GGUF:Q4_K_M

Run and chat with the model

lemonade run user.Llama-3-ELYZA-JP-8B-GGUF-Q4_K_M

List all available models

lemonade list

Llama-3-ELYZA-JP-8B-GGUF

Model Description

Llama-3-ELYZA-JP-8B is a large language model trained by ELYZA, Inc. Based on meta-llama/Meta-Llama-3-8B-Instruct, it has been enhanced for Japanese usage through additional pre-training and instruction tuning. (Built with Meta Llama 3)

For more details, please refer to our blog post.

Quantization

We have prepared two quantized model options, GGUF and AWQ. This is the GGUF (Q4_K_M) model, converted using llama.cpp.

The following table shows the performance degradation due to quantization:

| Model | ELYZA-tasks-100 GPT4 score |
| --- | --- |
| Llama-3-ELYZA-JP-8B | 3.655 |
| Llama-3-ELYZA-JP-8B-GGUF (Q4_K_M) | 3.57 |
| Llama-3-ELYZA-JP-8B-AWQ | 3.39 |
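
To put the scores above in relative terms, the short sketch below computes each model's drop as a percentage of the unquantized score. The numbers are copied from the table; the helper name `relative_drop` is only illustrative.

```python
# Scores copied from the table above (ELYZA-tasks-100, GPT-4 as judge).
scores = {
    "Llama-3-ELYZA-JP-8B": 3.655,
    "Llama-3-ELYZA-JP-8B-GGUF (Q4_K_M)": 3.57,
    "Llama-3-ELYZA-JP-8B-AWQ": 3.39,
}

baseline = scores["Llama-3-ELYZA-JP-8B"]


def relative_drop(score, reference=baseline):
    """Return the percentage drop of `score` relative to `reference`."""
    return (reference - score) / reference * 100


for name, score in scores.items():
    print(f"{name}: {score} ({relative_drop(score):.2f}% below the unquantized model)")
```

By this measure the GGUF (Q4_K_M) model loses about 2.33% of the original score and the AWQ model about 7.25%.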

Use with llama.cpp

Install llama.cpp through brew (works on Mac and Linux):

brew install llama.cpp

Invoke the llama.cpp server:

$ llama-server \
--hf-repo elyza/Llama-3-ELYZA-JP-8B-GGUF \
--hf-file Llama-3-ELYZA-JP-8B-q4_k_m.gguf \
--port 8080

Call the API using curl:

$ curl http://localhost:8080/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
  "messages": [
    { "role": "system", "content": "あなたは誠実で優秀な日本人のアシスタントです。特に指示が無い場合は、常に日本語で回答してください。" },
    { "role": "user", "content": "古代ギリシャを学ぶ上で知っておくべきポイントは？" }
  ],
  "temperature": 0.6,
  "max_tokens": -1,
  "stream": false
}'
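
For environments without curl, the same request can be sketched with only the Python standard library. This assumes the server started above is listening on port 8080; `build_request` and `send` are hypothetical helper names introduced for illustration.

```python
import json
import urllib.request

# Endpoint of the llama-server instance started above (port 8080).
URL = "http://localhost:8080/v1/chat/completions"


def build_request(user_text, system_prompt, temperature=0.6):
    """Build (but do not send) the same POST request as the curl call."""
    payload = {
        "messages": [
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": user_text},
        ],
        "temperature": temperature,
        "max_tokens": -1,
        "stream": False,
    }
    return urllib.request.Request(
        URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send(request, timeout=300):
    """Send the request and return the assistant's reply text."""
    with urllib.request.urlopen(request, timeout=timeout) as resp:
        body = json.load(resp)
    return body["choices"][0]["message"]["content"]


# Usage (requires the running server):
# print(send(build_request("古代ギリシャを学ぶ上で知っておくべきポイントは？", "常に日本語で回答してください。")))
```

Separating request construction from sending keeps the payload easy to inspect before any network call is made.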

Call the API using Python:

import openai

# With the server command above, the API key and model name are only placeholders.
client = openai.OpenAI(
    base_url="http://localhost:8080/v1",
    api_key="dummy_api_key"
)

completion = client.chat.completions.create(
    model="dummy_model_name",
    messages=[
        {"role": "system", "content": "あなたは誠実で優秀な日本人のアシスタントです。特に指示が無い場合は、常に日本語で回答してください。"},
        {"role": "user", "content": "古代ギリシャを学ぶ上で知っておくべきポイントは？"}
    ]
)

# Print the assistant's reply
print(completion.choices[0].message.content)
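
The request above waits for the full reply. As a hedged sketch, the reply can also be consumed incrementally by passing `stream=True` to the same client call and joining the text deltas. `collect_stream` is a hypothetical helper, and the stand-in chunks below only imitate the shape of the client's stream objects so the snippet runs without a server.

```python
from types import SimpleNamespace


def collect_stream(chunks, echo=True):
    """Concatenate the text deltas of a streamed chat completion."""
    parts = []
    for chunk in chunks:
        if not chunk.choices:
            continue
        delta = chunk.choices[0].delta.content
        if delta:  # some chunks (for example the first or last) may carry no text
            parts.append(delta)
            if echo:
                print(delta, end="", flush=True)
    return "".join(parts)


def fake_chunk(text):
    """Stand-in for a stream chunk, used only for the offline demonstration."""
    return SimpleNamespace(choices=[SimpleNamespace(delta=SimpleNamespace(content=text))])


# Offline demonstration with stand-in chunks:
demo_text = collect_stream([fake_chunk("古代"), fake_chunk(None), fake_chunk("ギリシャ")], echo=False)

# Usage with a real client (requires the running server):
# stream = client.chat.completions.create(model="dummy_model_name", messages=[...], stream=True)
# text = collect_stream(stream)
```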

Use with Desktop App

Several desktop applications can run GGUF models. This section shows how to use the model in LM Studio, a no-code environment.

Installation: Download and install LM Studio.
Downloading the Model: Search for elyza/Llama-3-ELYZA-JP-8B-GGUF in the search bar on the home page 🏠, and download Llama-3-ELYZA-JP-8B-q4_k_m.gguf.
Start Chatting: Click on 💬 in the sidebar, select Llama-3-ELYZA-JP-8B-GGUF from "Select a Model to load" in the header, and load the model. You can now freely chat with the local LLM.
Setting Options: You can set options from the sidebar on the right. Faster inference can be achieved by setting Quick GPU Offload to Max in the GPU Settings.
(For Developers) Starting an API Server: Click <-> in the left sidebar and move to the Local Server tab. Select the model and click Start Server to launch an OpenAI API-compatible API server.
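
Once the LM Studio server from the last step is running, the OpenAI client example from the llama.cpp section should work by changing only the base URL. The sketch below is one way to keep that choice in a single place. Port 8080 comes from the llama-server command in this card, while port 1234 is an assumption about LM Studio's default (check the Local Server tab), and `LOCAL_LLM_BASE_URL` and `resolve_base_url` are hypothetical names.

```python
import os

# Base URLs of local OpenAI-compatible servers.
# 8080 is the port passed to llama-server earlier in this card;
# 1234 is assumed to be LM Studio's default port.
DEFAULT_BASE_URLS = {
    "llama.cpp": "http://localhost:8080/v1",
    "lmstudio": "http://localhost:1234/v1",
}


def resolve_base_url(backend, env=os.environ):
    """Pick the base URL for `backend`, letting LOCAL_LLM_BASE_URL override it."""
    override = env.get("LOCAL_LLM_BASE_URL")  # hypothetical variable name
    if override:
        return override.rstrip("/")
    try:
        return DEFAULT_BASE_URLS[backend]
    except KeyError:
        raise ValueError(f"unknown backend: {backend!r}") from None


# Usage with the openai client:
# client = openai.OpenAI(base_url=resolve_base_url("lmstudio"), api_key="dummy_api_key")
```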

This demo showcases Llama-3-ELYZA-JP-8B-GGUF running smoothly on a MacBook Pro (M1 Pro), achieving an inference speed of approximately 20 tokens per second.
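
As a rough guide to what that speed means in practice, the sketch below converts a reply length into an estimated generation time. It uses the approximate 20 tokens per second quoted above and ignores prompt processing time, so treat the results as a back-of-the-envelope estimate only.

```python
TOKENS_PER_SECOND = 20  # approximate figure quoted above for a MacBook Pro (M1 Pro)


def estimated_seconds(num_tokens, tokens_per_second=TOKENS_PER_SECOND):
    """Estimate generation time for `num_tokens` output tokens."""
    return num_tokens / tokens_per_second


for n in (100, 500, 1000):
    print(f"{n} tokens: about {estimated_seconds(n):.0f} s")
```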

Developers

Listed in alphabetical order.

Masato Hirakawa
Shintaro Horie
Tomoaki Nakamura
Daisuke Oba
Sam Passaglia
Akira Sasaki

License

Meta Llama 3 Community License

How to Cite

@misc{elyzallama2024,
      title={elyza/Llama-3-ELYZA-JP-8B},
      url={https://huggingface.co/elyza/Llama-3-ELYZA-JP-8B},
      author={Masato Hirakawa and Shintaro Horie and Tomoaki Nakamura and Daisuke Oba and Sam Passaglia and Akira Sasaki},
      year={2024},
}

Citations

@article{llama3modelcard,
    title={Llama 3 Model Card},
    author={AI@Meta},
    year={2024},
    url = {https://github.com/meta-llama/llama3/blob/main/MODEL_CARD.md}
}
