Add scores output from TGI

Hi everyone,

Happy New Year 2026 !!

Quick question: I want to output the scores (in order to compute the logits for each output token) when I am using TGI directly from the Text Generation Inference container.

Unfortunately, I cannot see any parameter for this in the Text-generation-launcher arguments.
Any hint?

Thanks

Best regards
Jerome


Hmm, I see that the Token object contains the logprob value in its schema, and that TGI generate returns a list of Token objects.

So as far as I can understand, the Text-generation-launcher using TGI should already return the logprob associated with tokens…

Jerome


Seems like a limitation?


You will not find a “return token scores” switch in text-generation-launcher (except for prompt logprobs and server caps). In TGI, token logprobs are returned when you ask for them in the request.

What TGI can return (and what it cannot)

1) Per-output-token logprob (most common “score”)

Call /generate with parameters.details=true. The response schema includes details.tokens[] and each token has a logprob. (Hugging Face)

That is typically what people mean by “scores per output token”.

2) Top-N alternative tokens (closest thing to “distribution info”)

Add parameters.top_n_tokens = N. The response can include details.top_tokens with id/text/logprob. (Hugging Face)
The launcher only caps this via --max-top-n-tokens (default 5). (Hugging Face)

3) Prompt token logprobs (“prefill”)

Prompt logprobs live under details.prefill[] (same id/text/logprob schema). (Hugging Face)
But TGI disables prompt logprobs by default due to VRAM cost. You must launch with --enable-prefill-logprobs. (Hugging Face)
Then request decoder_input_details=true (usually together with details=true). (Hugging Face)

4) Full logits vectors

TGI’s public HTTP APIs are not designed to return the full per-step logits vector over the entire vocabulary (payload and throughput cost). The supported “workaround” is top-N tokens (top_n_tokens). (Hugging Face)

Important: logprob is not logits

  • Logits are raw model scores before softmax.
  • Logprobs are after softmax normalization.

TGI returns logprobs (and optionally top-N logprobs). That is enough to compute:

  • sequence log-likelihood = sum of token logprobs
  • per-token confidence for the chosen token
  • approximate distributions via top-N

It is not enough to reconstruct raw logits without additional info.
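
A minimal sketch (plain Python, no HTTP) of those three computations, using made-up logprob values for illustration:

import math

# Hypothetical per-token logprobs, as returned in details.tokens[].logprob
logprobs = [-0.12, -1.35, -0.04, -2.10]

sequence_loglik = sum(logprobs)                   # sequence log-likelihood
confidences = [math.exp(lp) for lp in logprobs]   # probability of each chosen token
perplexity = math.exp(-sequence_loglik / len(logprobs))

print(sequence_loglik, confidences, perplexity)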


Minimal working examples

A) Output-token logprobs via /generate

curl -s http://localhost:8080/generate \
  -H 'Content-Type: application/json' \
  -d '{
    "inputs": "What is deep learning?",
    "parameters": {
      "max_new_tokens": 10,
      "details": true
    }
  }'

Read response.details.tokens[i].logprob. (Hugging Face)
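
The same call from Python, as a sketch that assumes TGI is reachable at http://localhost:8080 like the curl above (only the requests library is needed):

import requests

resp = requests.post(
    "http://localhost:8080/generate",
    json={
        "inputs": "What is deep learning?",
        "parameters": {"max_new_tokens": 10, "details": True},
    },
    timeout=60,
)
resp.raise_for_status()
data = resp.json()

# One entry per generated token: id, text, logprob
for tok in data["details"]["tokens"]:
    print(tok["id"], repr(tok["text"]), tok["logprob"])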

B) Output-token top-N candidates

curl -s http://localhost:8080/generate \
  -H 'Content-Type: application/json' \
  -d '{
    "inputs": "The capital of France is",
    "parameters": {
      "max_new_tokens": 1,
      "details": true,
      "top_n_tokens": 10
    }
  }'

Requires the server cap to allow it:

  • start launcher with --max-top-n-tokens 10 (or more). (Hugging Face)
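
And a sketch of reading the candidates back out of example B (in current TGI versions details.top_tokens is typically one list of candidate tokens per generation step; adjust the nesting if your version differs):

import requests

data = requests.post(
    "http://localhost:8080/generate",
    json={
        "inputs": "The capital of France is",
        "parameters": {"max_new_tokens": 1, "details": True, "top_n_tokens": 10},
    },
    timeout=60,
).json()

for step, candidates in enumerate(data["details"]["top_tokens"]):
    for cand in candidates:
        print(step, repr(cand["text"]), cand["logprob"])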

C) Prompt logprobs (prefill)

Start the server with the --enable-prefill-logprobs launcher flag, then request:

curl -s http://localhost:8080/generate \
  -H 'Content-Type: application/json' \
  -d '{
    "inputs": "Today is a",
    "parameters": {
      "max_new_tokens": 2,
      "details": true,
      "decoder_input_details": true
    }
  }'

Read response.details.prefill[]. (Hugging Face)
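
A sketch for scoring the prompt from the prefill entries, assuming the container was launched with --enable-prefill-logprobs; the first prompt token usually has no logprob (nothing precedes it), so missing values are skipped:

import requests

data = requests.post(
    "http://localhost:8080/generate",
    json={
        "inputs": "Today is a",
        "parameters": {
            "max_new_tokens": 2,
            "details": True,
            "decoder_input_details": True,
        },
    },
    timeout=60,
).json()

prefill = data["details"]["prefill"]
prompt_loglik = sum(t["logprob"] for t in prefill if t.get("logprob") is not None)
print(prompt_loglik)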


Why you “see Token.logprob in the schema” but don’t get it

Because TGI only includes token objects in the response when you request details. The schema explicitly separates the plain response (just text) from the “details” object that carries tokens/prefill/top_tokens with logprob. (Hugging Face)


Known pitfalls people hit (and where to look)

  1. Prefill stays empty even with decoder_input_details=True
    Reported against TGI docker 3.0.2: prefill=[] despite asking for it. (GitHub)
    If you see this, verify you launched with --enable-prefill-logprobs first. (Hugging Face) If still empty, it can be version-specific.

  2. Prompt logprobs request is a recurring feature ask (historical context)
    Users explicitly asked for OpenAI-like “echo prompt logprobs” behavior. (GitHub)

  3. top_n_tokens broke in older releases
    Example: v1.3.x returning HTTP 424 when top_n_tokens > 0. (GitHub)
    If you rely on top-N, pin or upgrade and test.

  4. Async vs sync prefill logprob mismatch
    Reported mismatch when sending requests asynchronously. (GitHub)


Best practical “workarounds” if you truly need logits-like behavior

  • Use details=true and compute what you need from logprobs (most scoring tasks only need this). (Hugging Face)
  • If you need more distribution information, use top_n_tokens (and raise --max-top-n-tokens). (Hugging Face)
  • If you truly need full logits vectors, run the model in-process (Transformers generate(..., output_scores=True, return_dict_in_generate=True); a minimal sketch follows) or fork TGI. TGI’s HTTP API is not built for full-logits payloads.
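
A minimal in-process sketch with Transformers (the model name is only an example; recent Transformers versions also accept output_logits=True if you want the raw, pre-processing logits rather than the processed scores):

from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "gpt2"  # example model, substitute your own
tok = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)

inputs = tok("What is deep learning?", return_tensors="pt")
out = model.generate(
    **inputs,
    max_new_tokens=10,
    do_sample=False,
    output_scores=True,            # one [batch, vocab_size] tensor per generated step
    return_dict_in_generate=True,
)

prompt_len = inputs["input_ids"].shape[1]
for step, scores in enumerate(out.scores):
    tid = out.sequences[0, prompt_len + step].item()
    print(step, tok.decode([tid]), scores[0, tid].item())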

Best references to keep open

  • Consuming TGI (endpoints: /generate, /generate_stream, /v1/chat/completions) (Hugging Face)
  • Launcher arguments (--max-top-n-tokens, --enable-prefill-logprobs) (Hugging Face)
  • Request/response schema for text-generation (details, decoder_input_details, top_n_tokens, and where logprob appears) (Hugging Face)

Summary

  • Output token “scores” in TGI = token logprobs, returned when you set details=true. (Hugging Face)
  • Prompt token logprobs require launcher --enable-prefill-logprobs + request decoder_input_details=true. (Hugging Face)
  • “Logits for each step” in the full sense are not normally returned; use top_n_tokens as the supported approximation. (Hugging Face)

Thanks John for this detailed explanation.

I am using the HF TGI container image deployed on a Vertex AI endpoint.

I have looked at the entrypoint.sh referenced by the Dockerfile, and this script starts the text-generation-launcher. I guess that this launcher is based on the TGI API, so it calls the generate method with the passed parameters.

But in my Vertex AI environment, the endpoint expects a /predict, not a /generate. And the bridge between the Vertex AI /predict and the TGI /generate seems to be absent from the container contents.

Any idea?

Thanks

Jerome


I am looking inside the HF container, and I confirm that I cannot find any ENV AIP_PREDICT_ROUTE=/generate that could be used as a bridge between Vertex AI and TGI.

Thanks

Jerome


The lack of an explicit bridge between /predict and /generate may be because the redirection is done by the text-generation-launcher directly. Gemini tells me that this is Rust code. I will try to find it.

Jerome


Hmm…?


You are mixing two separate layers:

  1. What TGI can return (logprobs, top-k token logprobs, sometimes prompt-token logprobs).
  2. What your “front door” returns on Vertex AI (often a /predict wrapper that may or may not pass through the full TGI JSON).

Below is the practical path for both “plain TGI” and “TGI behind Vertex AI”.


1) “Scores” in TGI means logprobs, not full logits

  • TGI can return per-generated-token log probability (logprob) when you ask for generation details. (Hugging Face)
  • TGI can also return top-N alternative tokens (top-k) with their logprobs per step, but it is capped server-side. (Hugging Face)
  • TGI generally does not return the full logits vector over the whole vocabulary for every step. That would be huge (vocab-size floats per token) and is not what TGI is designed to stream back.

If your goal is “logits for each output token you actually produced”, logprob is usually enough because it is the normalized score for the chosen token at that step. If you truly need “raw logits before softmax”, TGI is typically the wrong interface.


2) How to get per-output-token logprobs from TGI

This is not a launcher flag. It is a request parameter.

You must send parameters.details=true. The response then includes details.tokens[], and each token includes logprob. (Hugging Face)

Example direct call to TGI:

curl -s http://HOST:PORT/generate \
  -H 'Content-Type: application/json' \
  -d '{
    "inputs": "Explain entropy in one sentence.",
    "parameters": {
      "max_new_tokens": 32,
      "details": true
    }
  }' | jq .

What to look for in the JSON:

  • generated_text
  • details.tokens list, each with a logprob (and token text, ids, etc.). (Hugging Face)

Want top-k alternatives per step?

Add top_n_tokens in the request. The launcher can cap it via --max-top-n-tokens, so if you request more than the cap you will be limited. (Hugging Face)

"parameters": {
  "max_new_tokens": 32,
  "details": true,
  "top_n_tokens": 10
}

3) Prompt-token logprobs are special (“prefill logprobs”)

You noticed decoder_input_details. That is the knob to ask for prompt token details, but many people then discover “prefill is empty”.

Reason: prompt-token logprobs can be disabled by default for cost reasons, and you must enable them at launch.

  • Request-side: decoder_input_details=true (and usually also details=true). (Hugging Face)
  • Server-side: start text-generation-launcher with --enable-prefill-logprobs (env var ENABLE_PREFILL_LOGPROBS). (Hugging Face)
  • This “prefill missing” behavior has been reported repeatedly. (GitHub)

So the “full scoring” request looks like:

"parameters": {
  "details": true,
  "decoder_input_details": true,
  "max_new_tokens": 32
}

…but it only works if the container was launched with prefill logprobs enabled. (Hugging Face)


4) Vertex AI: why you don’t see the /predict/generate bridge

Key fact

Vertex AI sends inference traffic to whatever path you configure as predictRoute on the model container spec. If you do not set it, Vertex uses its default predict path. The route is a Vertex deployment setting, not a hard requirement of TGI. (Google Cloud Documentation)

So:

  • If you deployed a vanilla upstream ghcr.io/huggingface/text-generation-inference image, it exposes /generate (and others), but it does not magically implement Vertex’s /predict contract.
  • If you deployed a Hugging Face Vertex DLC, it often includes an adapter that accepts Vertex payloads at /predict. The Hugging Face Gemma-on-Vertex example explicitly calls predict() and sends instances=[{"inputs": ..., "parameters": {...}}]. (Hugging Face)

That explains your observation: you won’t necessarily find an env var like AIP_PREDICT_ROUTE=/generate inside the image, because the “bridge” is either:

  • implemented by an adapter server, or
  • configured at deployment time via predictRoute, or
  • not present at all (vanilla image).

Also, yes, TGI has a Rust HTTP layer and Python backend. That part is true. (GitHub)


5) Practical workarounds on Vertex AI (ordered by “most likely to work”)

Option A (best): call TGI directly by enabling arbitrary routes (invokeRoutePrefix)

Vertex AI supports enabling arbitrary custom routes by setting invokeRoutePrefix to "/*". Then /invoke/foo/bar is forwarded to /foo/bar inside the container. (Google Cloud Documentation)

That means you can hit TGI’s real endpoint:

  • Vertex: /invoke/generate
  • Container: /generate

This bypasses any /predict wrapper that might be stripping details.

Why this matters: wrappers sometimes return only generated_text (string) rather than the full TGI JSON. A similar “wrapper only returns generated_text” situation is documented for SageMaker, and the fix is “pass details parameters”, but only if the wrapper forwards them and returns the full payload. (GitHub)

Option B: use the Hugging Face Vertex DLC /predict contract, and pass details inside parameters

The HF Vertex example shows that the payload shape is:

instances=[{
  "inputs": "...",
  "parameters": {...}
}]

(Hugging Face)

So you would try:

output = endpoint.predict(instances=[{
  "inputs": prompt,
  "parameters": {
    "max_new_tokens": 32,
    "details": True,
    "top_n_tokens": 5
  }
}])

If the response still comes back as a plain string, then that wrapper is likely not returning the full TGI JSON and you need Option A.

Option C: build your own tiny adapter (if A and B fail)

Run a small FastAPI/Flask server that exposes /predict, accepts Vertex’s instances[], converts to TGI’s {inputs, parameters} and calls http://localhost:8080/generate, then returns the full response (including details). This is a standard pattern when deploying non-Vertex-native servers.
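
A rough sketch of such an adapter (hypothetical glue code, not the HF DLC's actual implementation): it assumes TGI listens on localhost:8080 inside the same container, and a real deployment would also honor the AIP_HTTP_PORT / AIP_HEALTH_ROUTE values Vertex injects:

import os

import requests
from fastapi import FastAPI

TGI_URL = os.environ.get("TGI_URL", "http://localhost:8080")  # assumption: TGI runs alongside
app = FastAPI()

@app.get("/health")
def health():
    return {"status": "ok"}

@app.post("/predict")
def predict(body: dict):
    predictions = []
    for instance in body.get("instances", []):
        # Translate the Vertex instance into TGI's {inputs, parameters} payload
        r = requests.post(
            f"{TGI_URL}/generate",
            json={
                "inputs": instance["inputs"],
                "parameters": instance.get("parameters", {}),
            },
            timeout=120,
        )
        r.raise_for_status()
        predictions.append(r.json())  # full TGI JSON, including details.tokens[].logprob
    return {"predictions": predictions}

Run it with uvicorn (for example uvicorn adapter:app --port 8081) and point the model's predictRoute/healthRoute at /predict and /health.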


6) Similar cases and known pitfalls (worth reading)

These are the same classes of issues as in the earlier reply: prefill coming back empty despite decoder_input_details, recurring requests for prompt logprobs, top_n_tokens regressions in some releases, and wrappers that return only generated_text.

Summary

  • To get per-output-token “scores”: send parameters.details=true. This returns details.tokens[].logprob. (Hugging Face)
  • For top-k alternatives: add top_n_tokens, but it is capped by --max-top-n-tokens. (Hugging Face)
  • For prompt-token logprobs: request decoder_input_details=true and launch with --enable-prefill-logprobs. (Hugging Face)
  • On Vertex: if /predict wrappers strip details, enable arbitrary routes (invokeRoutePrefix="/*") and call /invoke/generate to hit TGI directly. (Google Cloud Documentation)

Thanks John.

I have deployed an HF DLC and it seems that there is no adapter. I send the payload following the expected Vertex AI schema, but the response comes back as text only, with a signature different from TGI's, so I guess that the TGI backend has not been used.

I am now testing option A and I will go for option C if A is not working.

Best regards

Jerome


I still have an additional question:

When I look at the Dockerfile, the entry point is defined as follows:

ENTRYPOINT ["./entrypoint.sh"]

And this script launches the text-generation-launcher:
exec text-generation-launcher $@

When I look at the TGI implementation, I notice that there is a router dedicated to Vertex AI that seems to handle the payload translation.

What I am looking for now is the text-generation-launcher code. I cannot find it in the TGI repo.

Thanks

Jerome


What I am looking for now is the text-generation-launcher code.

maybe here?


text-generation-launcher is not a thin wrapper that “calls generate()”. It is a Rust process supervisor that starts the TGI router (HTTP server) plus the model shard server(s). The Vertex /predict bridge, and the “return token logprobs” behavior, live in the router, not in the shell entrypoint.

Below is the practical map: where the code is, how Vertex routing works, and how to actually get per-token logprobs (and why you still will not get full logits).


1) Where is text-generation-launcher code?

It is in the same TGI repo, under the launcher/ crate.

The TGI README links directly to the launcher source (launcher/src/main.rs) and explicitly says “see all options … in the code or in the CLI.” (GitHub)

TGI’s architecture docs describe the split: launcher starts router and server components. (Hugging Face)

So if you were grepping only under router/, you would miss the launcher.



2) Vertex AI /predict vs TGI /generate: where is the “bridge”?

What Vertex does

Vertex forwards an Endpoint predict call to a container HTTP path defined by containerSpec.predictRoute. Vertex also exposes that chosen path inside the container via AIP_PREDICT_ROUTE. (Google Cloud Documentation)

That is why you do not see ENV AIP_PREDICT_ROUTE=/generate in the Dockerfile. It is injected by Vertex at deploy time, not baked into the image. (Google Cloud Documentation)

What Hugging Face’s TGI DLC does on Vertex

Hugging Face’s “TGI DLC on Vertex” examples state that Vertex predict() sends requests to the container’s /predict route, using Vertex I/O payload formatting. (Hugging Face)

So the “adapter” is expected to be inside the router and to make /predict accept Vertex-style payloads (typically instances=[...], parameters={...}), then translate internally.

This also explains your observation: by default you often get text only back from Vertex examples (they literally print(output.predictions[0])). That does not prove TGI is not used. It usually just means you did not request detailed decoding info. (Hugging Face)


3) Getting per-token scores in TGI: what is actually supported?

Terminology you care about

  • Logits: raw, unnormalized scores for every vocab token at a step.
  • Logprobs: normalized log probabilities (after softmax) for returned tokens (and optionally top-k alternatives).

TGI primarily exposes logprobs, not full logits. That is intentional because returning full-vocab logits for every step is massive.

What you can get from TGI

You can request:

  1. Per generated token logprob (the sampled token).
  2. Optionally top_n_tokens alternatives per step (logprobs for the top N tokens).
  3. Optionally prompt (prefill) token logprobs if enabled.

These are controlled by request parameters such as:

  • details=true
  • top_n_tokens=<N>
  • decoder_input_details=true (prompt token logprobs, but only if details=true) (Hugging Face)

The Hugging Face client docs state this clearly: details=True returns tokens and probabilities, and top_n_tokens returns info about the N most likely tokens per generation step. (Hugging Face)
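
The same thing through huggingface_hub's InferenceClient, as a sketch assuming a recent client version and a reachable TGI base URL (the parameter names mirror the HTTP ones above):

from huggingface_hub import InferenceClient

client = InferenceClient("http://localhost:8080")  # your TGI base URL

out = client.text_generation(
    "Write a haiku about GPUs.",
    max_new_tokens=30,
    details=True,
    top_n_tokens=5,
)

print(out.generated_text)
for tok in out.details.tokens:
    print(tok.id, repr(tok.text), tok.logprob)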

The server-side “gotcha” (launcher flag)

Prompt logprobs are disabled by default for VRAM reasons. The launcher has a flag:

  • --enable-prefill-logprobs (re-allows prompt logprobs; costs VRAM) (Hugging Face)

So if you want prompt token logprobs, you need both:

  • the launcher flag --enable-prefill-logprobs on the server, and
  • decoder_input_details=true (together with details=true) in the request.

4) Concrete requests you should try

A) If you can hit TGI directly (non-Vertex path): /generate

curl http://HOST:PORT/generate \
  -H 'Content-Type: application/json' \
  -d '{
    "inputs": "Write a haiku about GPUs.",
    "parameters": {
      "max_new_tokens": 30,
      "details": true,
      "top_n_tokens": 5
    }
  }'

Expected shape (conceptually): you get a details object containing a list of per-step tokens, each token having id, text, and logprob (and optionally top_tokens if you asked for top_n_tokens). This is the same “Token has logprob” schema you already noticed. (Hugging Face)

B) Vertex Endpoint predict payload (what HF’s Vertex DLC examples use)

Try putting the same knobs under instances[*].parameters:

{
  "instances": [
    {
      "inputs": "Write a haiku about GPUs.",
      "parameters": {
        "max_new_tokens": 30,
        "details": true,
        "top_n_tokens": 5
      }
    }
  ]
}

Two key points:

  • The default (details omitted or false) usually yields just generated text. (Hugging Face)
  • If the router’s Vertex adapter passes details through, you should start seeing token-level info inside predictions[0] (either as a dict or a nested structure, depending on the adapter).

C) Prompt logprobs in Vertex

Add:

"decoder_input_details": true

…but only after you confirm the container was launched with --enable-prefill-logprobs. (Hugging Face)


5) Why you still cannot “compute logits for each output token” from this

If by “logits” you mean full raw logits vectors (size = vocab) at each step: TGI does not expose that as a normal output for bandwidth and cost reasons.

If by “logits for the chosen token” you mean the single raw logit value for the sampled token: logprob alone is not enough to reconstruct it, because:

  • logprob = logit(token) − logsumexp(all logits)

TGI does not usually give you logsumexp, and returning it is not a standard option.
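
A tiny illustration of why (plain Python, made-up numbers): log-softmax is shift-invariant, so very different raw logits yield identical logprobs, and the shift (the logsumexp) cannot be recovered from the logprobs alone:

import math

def log_softmax(logits):
    lse = math.log(sum(math.exp(x) for x in logits))
    return [x - lse for x in logits]

a = [2.0, 1.0, 0.5]
b = [x + 7.3 for x in a]  # completely different raw logits

print(log_softmax(a))
print(log_softmax(b))  # identical logprobs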

So the realistic options are:

  • Use logprobs (sequence scoring, per-token likelihood, perplexity-style metrics).
  • Use top_n_tokens as an approximation to inspect alternatives. (Hugging Face)
  • If you truly need full logits, you need a custom server or a fork that returns them (or a different engine that supports a “logits” debug endpoint).

6) Your “no adapter / not using TGI” suspicion: the most common explanation

On Vertex, “text only” responses are often just the default mode. Hugging Face’s own Vertex+TGI example prints a plain text prediction from predict() without asking for details. (Hugging Face)

So first confirm with one decisive test:

  • Send details=true and top_n_tokens=1.
  • If you still only get a string, then the Vertex adapter is either stripping details or you are not running the TGI router you think you are.
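
A sketch of that decisive test with the Vertex SDK Endpoint object used earlier (the payload shape follows the HF Vertex DLC convention shown above):

output = endpoint.predict(instances=[{
    "inputs": "ping",
    "parameters": {"max_new_tokens": 1, "details": True, "top_n_tokens": 1}
}])

pred = output.predictions[0]
if isinstance(pred, str):
    print("details were stripped: the wrapper returns plain text only")
else:
    print("token-level info is passed through:", pred)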


(If you want, paste one real Vertex response you got with details=true and I will tell you whether it is being stripped, renamed, or nested.)




Summary

  • text-generation-launcher source is in launcher/src/main.rs in the TGI repo. (GitHub)
  • Vertex /predict routing is configured by predictRoute and injected via AIP_PREDICT_ROUTE. Not hardcoded in the image. (Google Cloud Documentation)
  • To get per-token logprobs, request details=true (and optionally top_n_tokens). (Hugging Face)
  • Prompt logprobs require decoder_input_details=true plus launcher --enable-prefill-logprobs. (Hugging Face)
  • Full logits are not a standard TGI output. Use logprobs or fork/custom server if you truly need logits.