Is it a joke?😅

#39

by Horned - opened Feb 22, 2024

Discussion

Horned

Feb 22, 2024 • edited Feb 23, 2024

we had a good laugh trying to communicate with this thing 😄

not only does it refuse to answer nearly anything, it is also censored, biased, and racist; it gets offended, denies obvious logic and facts, and tells bad jokes
it is also quite bad at coding and just generally hard to communicate with (it seems to ignore your new inputs and keeps repeating itself)

is there a trick to talking with this model? what is it made for?
it's easy to spend more time asking it for some simple thing than it would take to look it up on Google

if you mention the word 'site' it instantly refuses to answer, saying it has no real-time access
(and almost every 5th word is a 'bad word' to it)

there is something great inside it: on occasion it gives great, detailed answers, in more of an info-agent style
but nearly every line of questioning ends with the model refusing to answer something

was expecting to be blown away, instead this seems like a .. zombie
for the love of, relax with the lobotomies! would rather have skynet than this on the loose :p

would love to suggest that all the 'safety' and 'responsible' lobotomies/brainwashing be handled in a LoRA, not in the base model itself
it's not a problem to offer safety as a product, but why destroy the base? this model is heavily affected by it

paccer

Feb 23, 2024

Well, for starters, are you using the Chat Template?
It seems to be working OK for me, even when mentioning the word "site". Have you tried testing it in one of the Spaces that run the model?
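
For readers following along, here is a minimal sketch of the prompt a chat template is expected to produce for Gemma instruct models, assuming the publicly documented `<start_of_turn>` / `<end_of_turn>` turn markers. `render_gemma_prompt` is a hypothetical helper written for illustration, not part of any library; in practice `tokenizer.apply_chat_template` should remain the source of truth, and its output may differ in details.

```python
def render_gemma_prompt(messages, add_generation_prompt=True):
    """Hand-rolled approximation of the Gemma instruct turn format (assumption)."""
    parts = ["<bos>"]
    for message in messages:
        # Gemma labels the assistant side "model"; accept both spellings here.
        role = "model" if message["role"] in ("assistant", "model") else "user"
        content = message["content"].strip()
        parts.append(f"<start_of_turn>{role}\n{content}<end_of_turn>\n")
    if add_generation_prompt:
        # An open model turn asks the model to write the next reply.
        parts.append("<start_of_turn>model\n")
    return "".join(parts)


prompt = render_gemma_prompt([{"role": "user", "content": "Who are you?"}])
print(prompt)
```

Sending raw text without these markers is one way an instruct model can end up behaving oddly, which is why the question matters.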

suryabhupa

Google org Feb 23, 2024

If the Chat template doesn't work or solve some of the problems, share some example prompts with us -- we will try to improve the model!

Horned

Feb 23, 2024 • edited Feb 23, 2024

> Well, for starters, are you using the Chat Template?
> It seems to be working OK for me, even when mentioning the word "site". Have you tried testing it in one of the Spaces that run the model?

> If the Chat template doesn't work or solve some of the problems, share some example prompts with us -- we will try to improve the model!

when it had just come out, we tried the basic chat template suggested on GitHub and elsewhere, following the schema for the instruct version (since this is the instruct version, not a chat version)
(https://github.com/ygivenx/google-gemma/blob/main/get_started.py)

chat.append({"role": "user", "content": user_input})
prompt = tokenizer.apply_chat_template(chat, tokenize=False, add_generation_prompt=True)
...
response = extract(response)
chat.append({"role": "model", "content": response})

while it worked OK, the model quite often started answering things from earlier in its chat log and became stuck/unreliable
so we switched to having no chat history at all and just did one-shot questions and requests
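
The two modes described here (resending the whole chat log versus one-shot questions) can be sketched as below. This is an illustration under stated assumptions, not the code from the linked script: `ChatSession`, `max_turns`, and `fake_generate` are hypothetical names, and `fake_generate` only stands in for the real `tokenizer.apply_chat_template` plus `model.generate` call.

```python
from typing import Callable, Dict, List

Message = Dict[str, str]


class ChatSession:
    """Hypothetical wrapper; generate_fn stands in for the real model call."""

    def __init__(self, generate_fn: Callable[[List[Message]], str],
                 keep_history: bool = True, max_turns: int = 4):
        self.generate_fn = generate_fn
        self.keep_history = keep_history
        self.max_turns = max_turns  # user/model pairs kept when history is on
        self.history: List[Message] = []

    def ask(self, user_input: str) -> str:
        question = {"role": "user", "content": user_input}
        # One-shot mode sends only the new question; history mode resends the log.
        messages = self.history + [question] if self.keep_history else [question]
        reply = self.generate_fn(messages)
        if self.keep_history:
            full_log = messages + [{"role": "model", "content": reply}]
            # Keep only the most recent pairs so old turns are not resent forever.
            self.history = full_log[-2 * self.max_turns:]
        return reply

    def reset(self) -> None:
        # Start over with an empty chat log.
        self.history = []


def fake_generate(messages: List[Message]) -> str:
    # Placeholder for the real model call (assumption, for illustration only).
    return f"saw {len(messages)} message(s)"


with_history = ChatSession(fake_generate, keep_history=True)
one_shot = ChatSession(fake_generate, keep_history=False)
for text in ("first question", "second question"):
    print(with_history.ask(text), "|", one_shot.ask(text))
```

Capping the log with `max_turns` is a middle ground between the two modes: the model still sees recent context, but the oldest turns drop out.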

chat template or not, as you can see, the model nearly always ends up saying things like it's 'unable to answer the question' due to this or that; something is always problematic

with history

without history

with history, model answers things earlier in the chat log
(I know these questions are not things it likes to answer, and the model is not given any instruction to work as a chatbot)
was just wondering why it keeps going back like this; the extraction code should be fine
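
One thing that might be worth ruling out (a guess, not something confirmed in this thread) is an extraction step that picks up an earlier model turn instead of the newest one. Below is a minimal sketch of extracting only the latest reply, assuming the decoded output still contains Gemma's turn markers (that is, it was decoded without skipping special tokens). `extract_last_reply` is a hypothetical helper, not the `extract` function from the script above.

```python
MODEL_TURN = "<start_of_turn>model\n"
END_TURN = "<end_of_turn>"


def extract_last_reply(decoded: str) -> str:
    """Return the text of the final model turn in a decoded transcript."""
    # rpartition splits on the LAST marker, so earlier model turns are ignored.
    _, found, tail = decoded.rpartition(MODEL_TURN)
    if not found:
        # No marker present: fall back to the whole text.
        return decoded.strip()
    # Cut at the end-of-turn marker if the model emitted one.
    return tail.split(END_TURN, 1)[0].strip()


transcript = (
    "<bos><start_of_turn>user\nHi<end_of_turn>\n"
    "<start_of_turn>model\nHello!<end_of_turn>\n"
    "<start_of_turn>user\nName a color<end_of_turn>\n"
    "<start_of_turn>model\nBlue<end_of_turn>"
)
print(extract_last_reply(transcript))
```

Splitting on the first marker instead of the last would return "Hello!" here, which would look exactly like the model answering something from earlier in the log.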

Horned

Feb 23, 2024

Having had more time to play with the model (especially the 2B-it model), I take back some of the criticism
maybe we have gotten used to other models with far fewer limitations; this one seems really usable only for a narrow range, the ultra-safe environment
if instructed to act otherwise or be colorful, it tends to break from the instruction and resume its typical assistant role

we have gotten used to looking for biases, and it is still very clear that the model drifts strongly towards its biases, in all manner of ways
this just lowers its versatility; it's fine for extremely safe things (which I assume is the purpose), but you don't want to write a scary story, joke around, or play D&D with this model
once it has a 'bad word' in its chat log, it sometimes refuses to answer simple benign questions, but it can be pulled out of that if asked to start over

likability is still important for models that interact with humans, and this one scores quite low because of how often you hit a wall (time-to-argument; ChatGPT had similar problems after it was heavily censored)
sometimes it can feel like you're walking on eggshells to avoid triggering it xD humans are very different, and 'being too nice' can be off-putting to some

jploski

Feb 23, 2024

I would say wait until people un-censor it, but wait... removing restrictions is against the Prohibited Use Policy - oops! I believe this model is just a silly marketing stunt and will not have any great future use as a result.

ReXommendation

Feb 26, 2024

> I would say wait until people un-censor it, but wait... removing restrictions is against the Prohibited Use Policy - oops! I believe this model is just a silly marketing stunt and will not have any great future use as a result.

TOSes haven't stopped the FOSS community before lol
