Llama-3-ELYZA-JP-8B-GGUF
Model Description
Llama-3-ELYZA-JP-8B is a large language model trained by ELYZA, Inc. Based on meta-llama/Meta-Llama-3-8B-Instruct, it has been enhanced for Japanese usage through additional pre-training and instruction tuning. (Built with Meta Llama3)
For more details, please refer to our blog post.
Quantization
We have prepared two quantized model options, GGUF and AWQ. This is the GGUF (Q4_K_M) model, converted using llama.cpp.
The following table shows the performance degradation due to quantization:
| Model | ELYZA-tasks-100 GPT4 score |
|---|---|
| Llama-3-ELYZA-JP-8B | 3.655 |
| Llama-3-ELYZA-JP-8B-GGUF (Q4_K_M) | 3.57 |
| Llama-3-ELYZA-JP-8B-AWQ | 3.39 |
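To put the table in relative terms, the short sketch below computes how far each quantized variant falls below the unquantized model. The scores are copied from the table; the helper name `relative_drop` is only illustrative.

```python
# Scores copied from the table above (ELYZA-tasks-100, graded by GPT-4).
scores = {
    "Llama-3-ELYZA-JP-8B": 3.655,
    "Llama-3-ELYZA-JP-8B-GGUF (Q4_K_M)": 3.57,
    "Llama-3-ELYZA-JP-8B-AWQ": 3.39,
}

baseline = scores["Llama-3-ELYZA-JP-8B"]


def relative_drop(score: float, reference: float = baseline) -> float:
    """Return how far `score` falls below `reference`, in percent."""
    return (reference - score) / reference * 100


for name, score in scores.items():
    print(f"{name}: {score:.3f} ({relative_drop(score):.1f}% below the unquantized model)")
```

By this measure the GGUF (Q4_K_M) model scores about 2.3% below the original, and the AWQ model about 7.3% below.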
Use with llama.cpp
Install llama.cpp through brew (works on Mac and Linux):
brew install llama.cpp
Invoke the llama.cpp server:
$ llama-server \
    --hf-repo elyza/Llama-3-ELYZA-JP-8B-GGUF \
    --hf-file Llama-3-ELYZA-JP-8B-q4_k_m.gguf \
    --port 8080
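The server may take a while to become available, since the model file has to be fetched and loaded first. The sketch below polls until the server responds, assuming it exposes a `GET /health` endpoint that returns HTTP 200 once the model is loaded; check this against the llama.cpp version you installed. The function names and timeout values are illustrative.

```python
import time
import urllib.error
import urllib.request


def is_ready(base_url: str = "http://localhost:8080") -> bool:
    """Return True if GET {base_url}/health answers with HTTP 200."""
    try:
        with urllib.request.urlopen(f"{base_url}/health", timeout=5) as response:
            return response.status == 200
    except (urllib.error.URLError, OSError):
        # Connection refused, or an error status while the model is still loading.
        return False


def wait_until_ready(base_url="http://localhost:8080", timeout_s=600.0,
                     poll_s=2.0, check=is_ready) -> bool:
    """Poll `check` until it succeeds or `timeout_s` seconds have passed."""
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        if check(base_url):
            return True
        time.sleep(poll_s)
    return False


# Usage (assumes llama-server was started on port 8080 as shown above):
#     if wait_until_ready():
#         print("server is ready")
```

The `check` parameter exists only so the polling loop can be exercised without a running server.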
Call the API using curl:
$ curl http://localhost:8080/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{
      "messages": [
        { "role": "system", "content": "あなたは誠実で優秀な日本人のアシスタントです。特に指示が無い場合は、常に日本語で回答してください。" },
        { "role": "user", "content": "古代ギリシャを学ぶ上で知っておくべきポイントは?" }
      ],
      "temperature": 0.6,
      "max_tokens": -1,
      "stream": false
    }'
Call the API using Python:
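import openai

# Point the OpenAI client at the local llama.cpp server started above.
client = openai.OpenAI(
    base_url="http://localhost:8080/v1",
    api_key="dummy_api_key",  # placeholder; the client requires a non-empty value
)

completion = client.chat.completions.create(
    model="dummy_model_name",  # placeholder model name
    messages=[
        {"role": "system", "content": "あなたは誠実で優秀な日本人のアシスタントです。特に指示が無い場合は、常に日本語で回答してください。"},
        {"role": "user", "content": "古代ギリシャを学ぶ上で知っておくべきポイントは?"},
    ],
)

# Print the assistant's reply.
print(completion.choices[0].message.content)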
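The request above returns the whole reply at once. For interactive use it can be more convenient to stream the reply as it is generated. The following is a minimal sketch using the `stream=True` option of the `openai` client; the helper names `stream_reply` and `main` are illustrative, and the model name is the same placeholder used above.

```python
SYSTEM_PROMPT = "あなたは誠実で優秀な日本人のアシスタントです。特に指示が無い場合は、常に日本語で回答してください。"


def stream_reply(client, question, system_prompt=SYSTEM_PROMPT, temperature=0.6):
    """Yield the reply text piece by piece as the server generates it."""
    stream = client.chat.completions.create(
        model="dummy_model_name",  # placeholder model name
        messages=[
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": question},
        ],
        temperature=temperature,
        stream=True,
    )
    for chunk in stream:
        # Some chunks carry no choices or an empty delta; skip those.
        if chunk.choices and chunk.choices[0].delta.content:
            yield chunk.choices[0].delta.content


def main():
    import openai  # requires `pip install openai`

    client = openai.OpenAI(base_url="http://localhost:8080/v1", api_key="dummy_api_key")
    for piece in stream_reply(client, "古代ギリシャを学ぶ上で知っておくべきポイントは?"):
        print(piece, end="", flush=True)
    print()


# Call main() once llama-server is running on port 8080.
```

Keeping the client as a parameter means the same helper can be pointed at any OpenAI-compatible server.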
Use with Desktop App
There are various desktop applications that can handle GGUF models, but here we will introduce how to use the model in the no-code environment LM Studio.
- Installation: Download and install LM Studio.
- Downloading the Model: Search for `elyza/Llama-3-ELYZA-JP-8B-GGUF` in the search bar on the home page 🏠, and download `Llama-3-ELYZA-JP-8B-q4_k_m.gguf`.
- Start Chatting: Click on 💬 in the sidebar, select `Llama-3-ELYZA-JP-8B-GGUF` from "Select a Model to load" in the header, and load the model. You can now freely chat with the local LLM.
- Setting Options: You can set options from the sidebar on the right. Faster inference can be achieved by setting Quick GPU Offload to Max in the GPU Settings.
- (For Developers) Starting an API Server: Click `<->` in the left sidebar and move to the Local Server tab. Select the model and click Start Server to launch an OpenAI API-compatible API server.
This demo showcases Llama-3-ELYZA-JP-8B-GGUF running smoothly on a MacBook Pro (M1 Pro), achieving an inference speed of approximately 20 tokens per second.
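For the developer option in the list above, the server started from LM Studio can be called like any other OpenAI-compatible endpoint. The sketch below uses only the Python standard library. The URL assumes LM Studio's default port 1234 and the model identifier is a guess, so check both against what the Local Server tab shows. `build_request` and `chat` are illustrative helper names.

```python
import json
import urllib.request

# Assumed default address; confirm it in LM Studio's Local Server tab.
LM_STUDIO_URL = "http://localhost:1234/v1/chat/completions"
SYSTEM_PROMPT = "あなたは誠実で優秀な日本人のアシスタントです。特に指示が無い場合は、常に日本語で回答してください。"


def build_request(question, url=LM_STUDIO_URL, model="Llama-3-ELYZA-JP-8B-GGUF"):
    """Build a POST request for an OpenAI-compatible chat completions endpoint."""
    payload = {
        "model": model,  # assumed identifier; use the one LM Studio displays
        "messages": [
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user", "content": question},
        ],
        "temperature": 0.6,
    }
    return urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def chat(question, **kwargs):
    """Send the question and return the assistant's reply text."""
    with urllib.request.urlopen(build_request(question, **kwargs)) as response:
        body = json.load(response)
    return body["choices"][0]["message"]["content"]


# Usage (with the LM Studio server running):
#     print(chat("古代ギリシャを学ぶ上で知っておくべきポイントは?"))
```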
Developers
Listed in alphabetical order.
License
Meta Llama 3 Community License
How to Cite
@misc{elyzallama2024,
  title={elyza/Llama-3-ELYZA-JP-8B},
  url={https://huggingface.co/elyza/Llama-3-ELYZA-JP-8B},
  author={Masato Hirakawa and Shintaro Horie and Tomoaki Nakamura and Daisuke Oba and Sam Passaglia and Akira Sasaki},
  year={2024},
}
Citations
@article{llama3modelcard,
  title={Llama 3 Model Card},
  author={AI@Meta},
  year={2024},
  url={https://github.com/meta-llama/llama3/blob/main/MODEL_CARD.md}
}