How do I resolve this issue?
#11
by anshumankmr - opened
Input validation error: inputs tokens + max_new_tokens must be <= 1512. Given: 9959 inputs tokens and 512 max_new_tokens
Code
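# Import paths are assumed here; they vary between LangChain versions
from langchain.chains import create_sql_query_chain
from langchain_community.llms import HuggingFaceEndpoint

llm = HuggingFaceEndpoint(
    endpoint_url=your_endpoint_url,
    max_new_tokens=512,
    top_k=10,
    top_p=0.1,
    typical_p=0.95,
    temperature=0.01,
    repetition_penalty=1.03,
)
# db is the database object for the Postgres database (its setup is not shown in this post)
db_chain = create_sql_query_chain(llm=llm, db=db)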
user_query = "<BLAH BLAH>"
context = """
"""
prompt = f"""Please note that your job is to write an SQL query to extract this data from a Postgres database and not to actually create visualizations. The visualization creation will be done later. Given an input question, first create a syntactically correct PostgreSQL query to run. DO NOT include any extra content.
Info About Dataset {context}
Use the following format:
Question: "Question here"
SQL Query to run
Question: {user_query}"""
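def find_sql_queries(text):
    # Strip known label fragments, then keep the text from the first
    # SELECT up to (not including) the next semicolon
    words_to_remove = ["SQLQuery:", "sql"]
    for word_to_remove in words_to_remove:
        text = text.replace(word_to_remove, "")
    start = text.find("SELECT")
    if start == -1:
        return ""  # no SELECT statement in the model output
    end = text.find(";", start)
    # Without a semicolon, keep everything after SELECT instead of dropping the last character
    return text[start:end] if end != -1 else text[start:]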
result = find_sql_queries(db_chain.invoke({"question": prompt}))
print(result)
I deployed the CodeLlama 70B model.
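You should look at what db_chain.invoke({"question": prompt}) is actually doing. Somewhere in there it builds an input far larger than your prompt alone (9959 tokens, according to the error) and sends it to Hugging Face (your llm variable). The endpoint rejects the request because input tokens plus max_new_tokens must not exceed 1512.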
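The arithmetic behind the error can be sketched in a few lines of Python. This is only an illustration: the function names input_token_budget and fits_endpoint_limit are hypothetical helpers, not part of LangChain or the endpoint API, and the numbers are taken from the error message above.

```python
def input_token_budget(max_total_tokens: int, max_new_tokens: int) -> int:
    """Return how many prompt tokens remain after reserving room for generation."""
    budget = max_total_tokens - max_new_tokens
    if budget <= 0:
        raise ValueError("max_new_tokens leaves no room for any input tokens")
    return budget


def fits_endpoint_limit(n_input_tokens: int, max_new_tokens: int,
                        max_total_tokens: int) -> bool:
    """Mirror the check in the error: inputs + max_new_tokens <= limit."""
    return n_input_tokens + max_new_tokens <= max_total_tokens


# Values from the error message in this thread
LIMIT = 1512          # total-token limit reported by the endpoint
MAX_NEW_TOKENS = 512  # value passed to HuggingFaceEndpoint
N_INPUT_TOKENS = 9959 # size of the input the chain actually sent

budget = input_token_budget(LIMIT, MAX_NEW_TOKENS)
print(f"Prompt may use at most {budget} tokens")
print("Request fits:", fits_endpoint_limit(N_INPUT_TOKENS, MAX_NEW_TOKENS, LIMIT))
```

With these numbers the prompt has to shrink to 1000 tokens or fewer, or the endpoint has to be redeployed with a higher total-token limit. If the extra tokens come from table schema information that the chain adds to the prompt (an assumption, since the post does not show the final input), restricting which tables db exposes is one option to try.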