> What I am looking for now is the text-generation-launcher code. Maybe here?
text-generation-launcher is not a thin wrapper that “calls generate()”. It is a Rust process supervisor that starts the TGI router (HTTP server) plus the model shard server(s). The Vertex /predict bridge, and the “return token logprobs” behavior, live in the router, not in the shell entrypoint.
Below is the practical map: where the code is, how Vertex routing works, and how to actually get per-token logprobs (and why you still will not get full logits).
1) Where is the text-generation-launcher code?
It is in the same TGI repo, under the launcher/ crate.
The TGI README links directly to the launcher source (launcher/src/main.rs) and explicitly says “see all options … in the code or in the CLI.” (GitHub)
TGI’s architecture docs describe the split: launcher starts router and server components. (Hugging Face)
So if you grep only under router/, you will miss the launcher.
2) Vertex AI /predict vs TGI /generate: where is the “bridge”?
What Vertex does
Vertex forwards an Endpoint predict call to a container HTTP path defined by containerSpec.predictRoute. Vertex also exposes that chosen path inside the container via AIP_PREDICT_ROUTE. (Google Cloud Documentation)
That is why you do not see ENV AIP_PREDICT_ROUTE=/generate in the Dockerfile. It is injected by Vertex at deploy time, not baked into the image. (Google Cloud Documentation)
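For concreteness, here is a minimal sketch of how that route is typically set at model-upload time with the google-cloud-aiplatform Python SDK; the project, region, image URI, and port below are placeholders, not values from your deployment.

```python
# Sketch: where containerSpec.predictRoute comes from. Vertex copies it into the
# deployed container's environment as AIP_PREDICT_ROUTE, so it never appears in
# the Dockerfile. Project, region, image URI, and port are placeholders.
from google.cloud import aiplatform

aiplatform.init(project="my-project", location="us-central1")

model = aiplatform.Model.upload(
    display_name="tgi-on-vertex",
    serving_container_image_uri="REGION-docker.pkg.dev/PROJECT/REPO/tgi-dlc:latest",  # placeholder
    serving_container_predict_route="/predict",  # becomes AIP_PREDICT_ROUTE in the container
    serving_container_health_route="/health",    # becomes AIP_HEALTH_ROUTE in the container
    serving_container_ports=[8080],
)
```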
What Hugging Face’s TGI DLC does on Vertex
Hugging Face’s “TGI DLC on Vertex” examples state that Vertex predict() sends requests to the container’s /predict route, using Vertex I/O payload formatting. (Hugging Face)
So the “adapter” is expected to be inside the router and to make /predict accept Vertex-style payloads (typically instances=[...], parameters={...}), then translate internally.
This also explains your observation: by default, Vertex examples often return only text (they literally print(output.predictions[0])). That does not prove TGI is not being used; it usually just means you did not request detailed decoding info. (Hugging Face)
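To make the translation concrete, here is a conceptual sketch in Python of what the /predict adapter is expected to do. The real adapter is Rust code inside the router; the helper below (vertex_predict) is hypothetical, not TGI's actual implementation.

```python
# Conceptual sketch only: the real adapter lives in the Rust router. The point is
# the expected mapping -- Vertex "instances" in, Vertex "predictions" out, with
# each instance translated into a plain TGI /generate request.
import requests

def vertex_predict(vertex_payload: dict, tgi_url: str = "http://localhost:8080") -> dict:
    predictions = []
    for instance in vertex_payload["instances"]:
        generate_body = {
            "inputs": instance["inputs"],
            "parameters": instance.get("parameters", {}),
        }
        resp = requests.post(f"{tgi_url}/generate", json=generate_body, timeout=60)
        resp.raise_for_status()
        # If details=true was requested, this JSON already carries per-token info;
        # whether the adapter forwards it verbatim depends on the adapter version.
        predictions.append(resp.json())
    return {"predictions": predictions}
```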
3) Getting per-token scores in TGI: what is actually supported?
Terminology you care about
- Logits: raw, unnormalized scores for every vocab token at a step.
- Logprobs: normalized log probabilities (after softmax) for returned tokens (and optionally top-k alternatives).
TGI primarily exposes logprobs, not full logits. That is intentional because returning full-vocab logits for every step is massive.
What you can get from TGI
You can request:
- Per generated token logprob (the sampled token).
- Optionally top_n_tokens alternatives per step (logprobs for the top N tokens).
- Optionally prompt (prefill) token logprobs if enabled.
These are controlled by request parameters such as:
- details=true
- top_n_tokens=<N>
- decoder_input_details=true (prompt token logprobs, only honored when details=true) (Hugging Face)
The Hugging Face client docs state this clearly: details=True returns tokens and probabilities, and top_n_tokens returns info about the N most likely tokens per generation step. (Hugging Face)
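If you can reach TGI directly, the same knobs are exposed through huggingface_hub's InferenceClient. A sketch, with the URL as a placeholder for wherever your router is listening:

```python
# Sketch: per-token logprobs via huggingface_hub against a TGI endpoint.
# The URL is a placeholder for your own TGI router.
from huggingface_hub import InferenceClient

client = InferenceClient(model="http://localhost:8080")

out = client.text_generation(
    "Write a haiku about GPUs.",
    max_new_tokens=30,
    details=True,     # return a details object with per-token info
    top_n_tokens=5,   # also return the 5 most likely tokens at each step
)

for tok in out.details.tokens:
    print(tok.id, repr(tok.text), tok.logprob)
```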
The server-side “gotcha” (launcher flag)
Prompt logprobs are disabled by default for VRAM reasons. The launcher has a flag:
--enable-prefill-logprobs (re-enables prompt logprobs, at the cost of extra VRAM) (Hugging Face)
So if you want prompt token logprobs, you need both: the container launched with --enable-prefill-logprobs, and the request sent with details=true plus decoder_input_details=true.
4) Concrete requests you should try
A) If you can hit TGI directly (non-Vertex path): /generate
```bash
curl http://HOST:PORT/generate \
  -H 'Content-Type: application/json' \
  -d '{
    "inputs": "Write a haiku about GPUs.",
    "parameters": {
      "max_new_tokens": 30,
      "details": true,
      "top_n_tokens": 5
    }
  }'
```
Expected shape (conceptually): you get a details object containing a list of per-step tokens, each token having id, text, and logprob (and optionally top_tokens if you asked for top_n_tokens). This is the same “Token has logprob” schema you already noticed. (Hugging Face)
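Those per-token logprobs compose directly into sequence-level scores. A minimal sketch, assuming resp holds the parsed JSON from the curl call above:

```python
# Sketch: turning the /generate "details" payload into sequence-level scores.
# Assumes `resp` is the parsed JSON response from the request above.
import math

tokens = resp["details"]["tokens"]
logprobs = [t["logprob"] for t in tokens if t["logprob"] is not None]

total_logprob = sum(logprobs)                        # log P(generated tokens | prompt)
perplexity = math.exp(-total_logprob / len(logprobs))

print(f"sum logprob = {total_logprob:.3f}, perplexity = {perplexity:.2f}")
```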
B) Vertex Endpoint predict payload (what HF’s Vertex DLC examples use)
Try putting the same knobs under instances[*].parameters:
```json
{
  "instances": [
    {
      "inputs": "Write a haiku about GPUs.",
      "parameters": {
        "max_new_tokens": 30,
        "details": true,
        "top_n_tokens": 5
      }
    }
  ]
}
```
Two key points:
- The default (details omitted or false) usually yields just generated text. (Hugging Face)
- If the router’s Vertex adapter passes details through, you should start seeing token-level info inside predictions[0] (either as a dict or a nested structure, depending on the adapter); see the SDK sketch below.
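For reference, the same payload sent through the Vertex Python SDK looks roughly like this; project, region, and endpoint ID are placeholders:

```python
# Sketch: sending the details/top_n_tokens knobs through a Vertex AI Endpoint.
# Project, region, and endpoint ID are placeholders for your own deployment.
from google.cloud import aiplatform

aiplatform.init(project="my-project", location="us-central1")
endpoint = aiplatform.Endpoint(
    "projects/my-project/locations/us-central1/endpoints/1234567890"
)

instances = [{
    "inputs": "Write a haiku about GPUs.",
    "parameters": {"max_new_tokens": 30, "details": True, "top_n_tokens": 5},
}]

output = endpoint.predict(instances=instances)
# A plain string here means details were stripped; a dict/nested structure means
# the adapter passed them through.
print(output.predictions[0])
```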
C) Prompt logprobs in Vertex
Add "decoder_input_details": true to the same parameters block, but only after you confirm the container was launched with --enable-prefill-logprobs. (Hugging Face)
5) Why you still cannot “compute logits for each output token” from this
If by “logits” you mean full raw logits vectors (size = vocab) at each step: TGI does not expose that as a normal output for bandwidth and cost reasons.
If by “logits for the chosen token” you mean the single raw logit value for the sampled token: logprob alone is not enough to reconstruct it, because:
- logprob = logit(token) − logsumexp(all logits)
TGI does not usually give you logsumexp, and returning it is not a standard option.
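A quick numeric check of why a logprob cannot be inverted back into a logit: log-softmax is invariant to adding any constant to the logits, so infinitely many logit vectors produce exactly the same logprobs.

```python
# Why logprobs do not pin down raw logits: shifting every logit by a constant
# leaves the log-softmax (and hence every returned logprob) unchanged.
import numpy as np

def log_softmax(x):
    m = x.max()
    return x - m - np.log(np.sum(np.exp(x - m)))  # numerically stable log softmax

logits = np.array([2.0, 0.5, -1.0])
shifted = logits + 100.0  # a very different logit vector

print(log_softmax(logits))   # identical output ...
print(log_softmax(shifted))  # ... so the original logit values are unrecoverable
```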
So the realistic options are:
- Use logprobs (sequence scoring, per-token likelihood, perplexity-style metrics).
- Use top_n_tokens as an approximation to inspect alternatives. (Hugging Face)
- If you truly need full logits, you need a custom server or a fork that returns them (or a different engine that supports a “logits” debug endpoint).
6) Your “no adapter / not using TGI” suspicion: the most common explanation
On Vertex, “text only” responses are often just the default mode. Hugging Face’s own Vertex+TGI example prints a plain text prediction from predict() without asking for details. (Hugging Face)
So first confirm with one decisive test:
- Send details=true and top_n_tokens=1 (a sketch of this check follows the list).
- If you still only get a string back, the Vertex adapter is either stripping details or you are not running the TGI router you think you are.
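A compact way to run that decisive test from Python (project, region, and endpoint ID are placeholders, as in the earlier sketch):

```python
# Sketch: the decisive test. If predictions[0] is still a bare string even with
# details=True, the adapter is stripping details, or this is not the TGI router
# you think it is. Placeholders as before.
from google.cloud import aiplatform

aiplatform.init(project="my-project", location="us-central1")
endpoint = aiplatform.Endpoint(
    "projects/my-project/locations/us-central1/endpoints/1234567890"
)

output = endpoint.predict(instances=[{
    "inputs": "ping",
    "parameters": {"max_new_tokens": 5, "details": True, "top_n_tokens": 1},
}])

pred = output.predictions[0]
if isinstance(pred, str):
    print("details stripped (or not the TGI router you expect):", pred)
else:
    print("token-level info present:", pred)
```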
(If you want, paste one real Vertex response you got with details=true and I will tell you whether it is being stripped, renamed, or nested.)
Summary
- text-generation-launcher source is in launcher/src/main.rs in the TGI repo. (GitHub)
- Vertex /predict routing is configured by predictRoute and injected via AIP_PREDICT_ROUTE; it is not hardcoded in the image. (Google Cloud Documentation)
- To get per-token logprobs, request details=true (and optionally top_n_tokens). (Hugging Face)
- Prompt logprobs require decoder_input_details=true plus the launcher flag --enable-prefill-logprobs. (Hugging Face)
- Full logits are not a standard TGI output. Use logprobs or fork/custom server if you truly need logits.