Is it a joke?
we had a good laugh trying to communicate with this thing
not only does it refuse to answer nearly anything, it is censored, biased, racist, gets offended, denies obvious logic/facts, has bad jokes
quite bad at coding and just generally hard to communicate with (it seems to ignore your new inputs and keeps repeating itself)
is there a trick to talking with this model? what is it made for?
it's easy to spend more time asking it for some simple thing than looking it up on google search
if you mention the word 'site' it instantly refuses to answer, saying it has no real-time access
(and almost every 5th word is a 'bad word' to it)
there is something great inside it, on occasion it gives some great and detailed answers, in more of an info-agent style
but nearly every line of questioning ends up in the model refusing to answer something
was expecting to be blown away, instead this seems like a .. zombie
for the love of, relax with the lobotomies! would rather have skynet than this on the loose :p
would love to suggest that all the 'safety', 'responsible' lobotomies/brainwashing is handled in a lora, not the base model itself
it's not a problem to offer safety as a product, but why destroy the base? this model is heavily affected by it
Well, for starters, are you using the Chat Template?
It seems to be working ok for me, even mentioning the word "site", have you tried testing it in one of the spaces that runs the model?
If the Chat template doesn't work or solve some of the problems, share some example prompts with us -- we will try to improve the model!
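For anyone checking their prompts by eye, here is a rough sketch of what the chat template is expected to produce for a single user turn. `render_gemma_prompt` is a hypothetical helper written for this post, not part of any library, and the turn markers are an assumption based on the published Gemma prompt format; the tokenizer's own `apply_chat_template` remains the thing to trust.

```python
def render_gemma_prompt(messages):
    """Hypothetical helper: hand-rolled approximation of the Gemma chat format.

    Assumes the turn markers <start_of_turn> / <end_of_turn> and that the
    assistant side is labelled "model". Prefer tokenizer.apply_chat_template
    in real code; this is only for inspecting prompts.
    """
    parts = ["<bos>"]
    for message in messages:
        # Anything that is not the assistant/model side is treated as a user turn.
        role = "model" if message["role"] in ("assistant", "model") else "user"
        content = message["content"].strip()
        parts.append(f"<start_of_turn>{role}\n{content}<end_of_turn>\n")
    # Open a model turn so generation starts with the model's reply.
    parts.append("<start_of_turn>model\n")
    return "".join(parts)


prompt = render_gemma_prompt([{"role": "user", "content": "Who are you?"}])
print(prompt)
```

If the string sent to the model looks very different from this (missing markers, no open model turn at the end), the template is probably not being applied.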
when it just came out, we tried using the basic chat template suggested on github and other places to follow the schema for instruct version (since this is the instruct, not the chat version)
(https://github.com/ygivenx/google-gemma/blob/main/get_started.py)
chat.append({"role": "user", "content": user_input})
prompt = tokenizer.apply_chat_template(chat, tokenize=False, add_generation_prompt=True)
...
response = extract(response)
chat.append({"role": "model", "content": response})
while it worked ok, it seemed the model quite often started answering things from earlier in its chat log and became stuck/unreliable
so we switched to just not having a chat history at all and just did one-shot questions and requests
Chat template or not, the model, as you can see, nearly always ends up saying things like it's 'unable to answer the question' due to this or that, something is always problematic
with history, model answers things earlier in the chat log
(i know these questions are not things it likes to answer, and the model is not given any instruction to work as a chatbot)
was just wondering why it does this thing with going back, the extraction code should be fine
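One thing that may be worth ruling out is the extraction step: if the decoded output still contains the whole conversation, the text appended to the history has to be only the newest model turn. A minimal sketch of such an extraction follows, assuming the Gemma turn markers; `extract_last_model_turn` is a hypothetical stand-in for the `extract` call in the snippet above, not the code from the linked script.

```python
START_MODEL = "<start_of_turn>model\n"  # assumed marker that opens a model turn
END_TURN = "<end_of_turn>"              # assumed marker that closes any turn


def extract_last_model_turn(decoded):
    """Return only the newest model reply from a fully decoded conversation."""
    # Keep everything after the LAST model-turn marker, ignoring earlier turns.
    _, found, tail = decoded.rpartition(START_MODEL)
    if not found:
        # No marker present: assume the text is already just the reply.
        tail = decoded
    # Cut at the end-of-turn marker if the model emitted one.
    reply, _, _ = tail.partition(END_TURN)
    return reply.replace("<eos>", "").strip()


decoded = (
    "<start_of_turn>user\nHi<end_of_turn>\n"
    "<start_of_turn>model\nHello!<end_of_turn>\n"
    "<start_of_turn>user\nAnd 2+2?<end_of_turn>\n"
    "<start_of_turn>model\n4<end_of_turn>"
)
chat = [{"role": "user", "content": "And 2+2?"}]
chat.append({"role": "model", "content": extract_last_model_turn(decoded)})
print(chat[-1])
```

Splitting on the first model marker instead of the last would put an older answer back into the history, which could look like the model "going back" to earlier questions.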
Having had more time to play with the model (especially the 2B-i model) i take back some of the criticism
maybe we've gotten used to other models with a lot fewer limitations, this seems only really usable for a narrow range, the ultra safe environment,
if instructed to act otherwise or be colorful, it tends to break from the instruction and resume its typical assistant role
we're used to looking for biases and it is still very clear that the model drifts strongly towards its biases, in all manner of ways
it just lowers its versatility, it's fine for extremely safe things (which i assume is the purpose), but you don't want to write a scary story, joke around or play d&d with this model
once it has a 'bad word' in its chat log, it sometimes refuses to answer simple benign questions, but it can be pulled out of it if asked to start over
likability is still important with models that interact with humans, this scores quite low due to how often you hit a wall (time-to-argument, chat-gpt had similar problems after it was heavily censored)
sometimes it can feel like you're walking on eggshells to not trigger it xD humans are very different, 'being too nice' can be off-putting to some
I would say wait until people un-censor it, but wait... removing restrictions is against the Prohibited Use Policy - oops! I believe this model is just a silly marketing stunt and will not have any great future use as a result.
TOSes haven't stopped the FOSS community before lol


