Field-specific analyzer chains in Vespa

How to configure different Lucene analyzer chains for different fields in the same language?

I’m migrating from Solr to Vespa and trying to replicate our BM25 functionality. In Solr, we have field-specific analyzer chains - for example, different analysis pipelines for product titles, company names, and product descriptions, even though they’re all in English.

Our Solr setup includes custom analyzer chains with various filters (WordDelimiterGraph, PatternReplace, Snowball stemming, Shingles, Synonyms, etc.) that differ between fields.

I understand that Vespa supports Lucene linguistics via the LuceneLinguistics component, but the configuration appears to be language-based:


<component id="com.yahoo.language.lucene.LuceneLinguistics"
           bundle="vespa-lucene-linguistics-crazy">
  <config name="com.yahoo.language.lucene.lucene-analysis">
    <configDir>linguistics</configDir>
    <analysis>
      <item key="en">

My question: How can I configure different analyzer chains for different fields that contain text in the same language? Is there a way to specify field-level linguistic processing rather than language-level processing?




You cannot get arbitrary Solr-style per-field analyzer chains in Vespa by naming an analyzer on each field. The supported hook is language plus stemming mode. You get up to 5 distinct Lucene analyzer variants per language by keying them as:

LANGUAGE_CODE[/STEM_MODE] where STEM_MODE ∈ { NONE, DEFAULT, ALL, SHORTEST, BEST }. (Vespa Documentation)

So the practical answer is:

  • Yes, you can have different Lucene analyzer chains for different English fields.
  • But only by mapping each field onto one of those 5 stemming modes, and configuring a separate Lucene analyzer per en/<mode> key. (Vespa Documentation)
  • Anything beyond 5 “field profiles” needs modeling workarounds (extra fields, query structure, or custom components).

Below is the detailed “how” and the pitfalls you will hit.


Background: how Vespa text analysis is wired (vs Solr)

In Solr, you attach an analyzer chain directly to a field type, and each field can pick a different type.

In Vespa, “linguistics” (tokenization, normalization, stemming) is applied in two places:

  1. Indexing time (document ingestion) to produce tokens for inverted indexes.
  2. Query time to transform query terms in a way that matches what indexing did.

Lucene Linguistics in Vespa is explicitly described as:

  • “a Lucene analyzer to handle text processing for a language with an optional variation per stemming mode.” (Vespa Documentation)
  • configured by “linguistics keys” in the format LANGUAGE_CODE[/STEM_MODE]. (Vespa Documentation)

That is why it looks “language-based”: the top-level selector is language, and the only built-in “sub-selector” is stemming mode.

Also important: Lucene Linguistics does not do language detection, so you must provide language on feed and search. (Vespa Documentation)
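A common pattern for providing language is to set it from a document field at feed time with the set_language indexing expression, and pass it explicitly at query time. A minimal sketch (the language field name is an assumption; verify the pattern against the Vespa linguistics docs):

```
schema products {
  document products {
    # Declared before the text fields, so their tokenization
    # picks up the right language during feed processing
    field language type string {
      indexing: set_language
    }
    field title type string {
      indexing: input title | summary | index
    }
  }
}
```

At query time, set the language either with the model.locale request parameter or per userInput() call via the language YQL annotation.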


What you can do: treat stemming modes as “analysis profiles”

Step 1: Define multiple analyzers for English in services.xml

You can define analyzers per linguistics key in the <analysis> map. The Vespa docs show the structure (tokenizer, tokenFilters, configDir, etc.). (Vespa Documentation)

Example skeleton (illustrative names and filters):

<component id="linguistics"
           class="com.yahoo.language.lucene.LuceneLinguistics"
           bundle="your-bundle-name">
  <config name="com.yahoo.language.lucene.lucene-analysis">
    <configDir>lucene-linguistics</configDir>
    <analysis>

      <!-- “Company names”: minimal, no stemming -->
      <item key="en/NONE">
        <tokenizer>
          <name>standard</name>
        </tokenizer>
        <tokenFilters>
          <item><name>lowercase</name></item>
          <!-- wordDelimiterGraph / patternReplace / etc here -->
        </tokenFilters>
      </item>

      <!-- “Descriptions”: normal English stemming, stopwords, etc -->
      <item key="en/DEFAULT">
        <tokenizer>
          <name>standard</name>
        </tokenizer>
        <tokenFilters>
          <item>
            <name>stop</name>
            <conf>
              <item key="words">en/stopwords.txt</item>
              <item key="ignoreCase">true</item>
            </conf>
          </item>
          <item><name>englishMinimalStem</name></item>
        </tokenFilters>
      </item>

      <!-- “Titles”: more aggressive chain, maybe shingles, etc -->
      <item key="en/BEST">
        <tokenizer>
          <name>standard</name>
        </tokenizer>
        <tokenFilters>
          <item><name>lowercase</name></item>
          <!-- shingle / synonyms / etc here -->
        </tokenFilters>
      </item>

    </analysis>
  </config>
</component>

This is exactly what Vespa supports: customize Lucene analysis per linguistics key using LuceneLinguistics configuration, and you can also register full Analyzer classes via the ComponentRegistry if you prefer code. (Vespa Documentation)

Rule of thumb: treat each stemming mode as one “field analysis profile”.

Step 2: Assign each field to a stemming mode in the schema

You can select stemming mode per field as part of the indexing pipeline. The Text Matching guide shows an indexing expression that explicitly requests stemming mode "BEST":

indexing: input album | tokenize normalize stem:"BEST" | index

(Vespa Documentation)

So for your fields you’d do the same pattern:

schema products {
  document products {

    field title type string {
      indexing: input title | tokenize normalize stem:"BEST" | summary | index
      index: enable-bm25
    }

    field company_name type string {
      indexing: input company_name | tokenize normalize stem:"NONE" | summary | index
      index: enable-bm25
    }

    field description type string {
      indexing: input description | tokenize normalize stem:"DEFAULT" | summary | index
      index: enable-bm25
    }
  }
}

You now have different Lucene analyzer chains per field, as long as they fit into the 5 stemming buckets. That these 5 modes exist and can be specified per field in the schema is explicitly documented. (Vespa Documentation)

Step 3: Make sure BM25 is enabled where you need it

Vespa’s BM25 rank feature requires index: enable-bm25 on the field, per the BM25 rank feature docs. (Vespa Documentation)
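Once each field has enable-bm25, the per-field scores can be combined in a rank profile. A minimal sketch (the profile name and weights are illustrative assumptions, not tuned values):

```
rank-profile fielded_bm25 {
    first-phase {
        # Illustrative weights: titles count most, descriptions least
        expression: 2.0 * bm25(title) + 1.5 * bm25(company_name) + bm25(description)
    }
}
```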


The biggest pitfall: fieldsets + inconsistent analysis

When you search multiple fields as a single unit (fieldset), Vespa warns if matching/normalization/stemming differ, because the query is processed once but fields were indexed with different settings.

A Stack Overflow answer (by Jo Kristian Bergum) explains the core problem:

  • “Query is only processed using one set of configuration while on the document side during indexing each field is processed with its own settings … might lead to recall issues.” (Stack Overflow)

Vespa FAQ says the same and gives the canonical workaround:

  • A fieldset must have compatible tokenization, otherwise you get the “may lead to recall and ranking issues” warning.
  • If you want the same user query applied to multiple fields with different tokenization, include userInput multiple times, each scoped to a different field/fieldset. (Vespa Documentation)

Example from the FAQ (keep this pattern in mind):

select * from sources * where
  ({defaultIndex: 'fieldsetOrField1'}userInput(@query)) or
  ({defaultIndex: 'fieldsetOrField2'}userInput(@query))

(Vespa Documentation)

Also note: in YQL, userInput() supports a language annotation for linguistics treatment of that call. (Vespa Documentation)
That matters because Lucene Linguistics requires you to provide language. (Vespa Documentation)

Practical design outcome:

  • Do not put title (BEST) and company_name (NONE) in the same fieldset.
  • Create separate fieldsets like title_fs, company_fs, desc_fs where each fieldset is internally consistent.
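In the schema, those per-field fieldsets could look like this (a sketch; fieldsets are declared inside the schema but outside the document block):

```
schema products {
    # ... document definition as above ...

    fieldset title_fs {
        fields: title
    }
    fieldset company_fs {
        fields: company_name
    }
    fieldset desc_fs {
        fields: description
    }
}
```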

Synonyms, shingles, WordDelimiterGraph: how to think about them in Vespa

1) Index-time expansion vs query-time rewriting

In Solr it is common to do aggressive index-time expansion (synonyms, shingles) differently per field.

In Vespa, you can do Lucene-style filters in the analyzer chain per en/<mode> key, but you should be deliberate because:

  • index-time expansion increases index size and can change BM25 statistics
  • query-time rewriting is often easier to iterate

Vespa has “semantic query rewrites” (rules-based rewriting). There is an open issue documenting tricky interactions with stemming and rewrite evaluation order, which is a real pitfall if you mix rewrite rules and stemmed fields. (GitHub)

The issue summary (important points):

  • semantic query rewrites happen before document-type specific stemming
  • rewrite rules can be stemmed, and rewritten forms can be protected from further transforms
  • recommendation in the issue: fields used in rewrite matching should use stemming:none (GitHub)

So if you do synonyms via semantic rules, prefer:

  • rewrite-matching fields with NONE, or carefully test how stemming affects rule matching.

2) Shingles

Shingles are frequently used to boost phrase-like matches in titles.

In Vespa you often get better control by:

  • keeping the main title field as normal tokens
  • adding a separate derived field (or separate query clause) for phrase boosts, instead of shingling everything everywhere

If you truly need Lucene ShingleFilter semantics, you can put it into the analyzer for en/BEST and use that stemming mode for the title field. (Vespa Documentation)
But keep it scoped. Shingling descriptions is usually expensive.
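If you do add shingles, extending the en/BEST item from the skeleton above might look like this. The shingle filter name and its parameters follow Lucene’s token filter factory SPI conventions; verify them against the Lucene version bundled with your Vespa release:

```xml
<item key="en/BEST">
  <tokenizer>
    <name>standard</name>
  </tokenizer>
  <tokenFilters>
    <item><name>lowercase</name></item>
    <item>
      <name>shingle</name>
      <conf>
        <!-- Emit bigrams alongside the original unigrams -->
        <item key="minShingleSize">2</item>
        <item key="maxShingleSize">2</item>
        <item key="outputUnigrams">true</item>
      </conf>
    </item>
  </tokenFilters>
</item>
```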

3) WordDelimiterGraph and pattern replacement

These are classic e-commerce needs (SKU splits, hyphenation, camelCase, etc.).

Your options are:

  • Put them into the Lucene analyzer for the relevant stemming mode key (recommended when you can fit into the 5 profiles). (Vespa Documentation)
  • Or do preprocessing in a document processor to populate additional “search helper fields” (see workaround section below).
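For the first option, a hypothetical en/NONE chain for company names might look like this (filter and parameter names again follow Lucene’s factory SPI conventions and should be verified against your Lucene version):

```xml
<item key="en/NONE">
  <tokenizer>
    <name>standard</name>
  </tokenizer>
  <tokenFilters>
    <!-- Split before lowercasing, so splitOnCaseChange still sees case -->
    <item>
      <name>wordDelimiterGraph</name>
      <conf>
        <item key="generateWordParts">1</item>
        <item key="generateNumberParts">1</item>
        <item key="splitOnCaseChange">1</item>
        <item key="preserveOriginal">1</item>
      </conf>
    </item>
    <item><name>lowercase</name></item>
  </tokenFilters>
</item>
```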

What if you need more than 5 field-specific analyzer chains?

You have three realistic patterns.

Pattern A: “5 profiles” + extra derived fields (most common)

Use the 5 stemming modes as your main profiles, and for anything else:

  • create synthetic fields derived from the same input
  • query them separately and combine in ranking

The Text Matching guide explicitly shows using synthetic fields for different match behavior (index vs attribute), and suggests n-gram matching for substring cases. (Vespa Documentation)

Example idea:

  • title uses en/BEST
  • title_exact is an attribute field for exact matching
  • title_ngram for substring or partial matches
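Sketched in the schema, with the derived fields declared outside the document block and populated from the same input (field names, match settings, and gram size are illustrative):

```
schema products {
    document products {
        field title type string {
            indexing: input title | tokenize normalize stem:"BEST" | summary | index
            index: enable-bm25
        }
    }

    # Derived fields, populated from the same input at feed time
    field title_exact type string {
        indexing: input title | attribute
        match: exact
    }
    field title_ngram type string {
        indexing: input title | index
        match {
            gram
            gram-size: 3
        }
    }
}
```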

Pattern B: query composition instead of fieldsets

Instead of one fieldset, OR together multiple userInput() clauses each targeting a field/fieldset with consistent settings, as shown in the FAQ. (Vespa Documentation)

This is the “Vespa-native” way to keep analysis differences without recall bugs.

Pattern C: custom linguistics component (heavyweight)

If you truly require many distinct Lucene chains keyed arbitrarily (title vs brand vs category vs … beyond 5), Lucene Linguistics is not designed for that selector space. At that point you either:

  • implement your own linguistics selection logic in a custom component, or
  • restructure into derived fields

Most teams do A + B.


My recommendation for your exact Solr-to-Vespa case (titles, company, description)

1) Collapse your Solr field types into 3 Vespa “profiles”

A typical e-commerce mapping:

Profile 1: company names (en/NONE)

  • lowercase, delimiter splitting, mild normalization
  • no stemming
  • keep synonyms limited (brand synonyms often need exact-ish control)

Profile 2: titles (en/BEST)

  • delimiter splitting
  • optional shingles (only here)
  • optional synonym expansion if you truly need it on titles

Profile 3: descriptions (en/DEFAULT or en/ALL)

  • stopwords + stemming
  • avoid shingles
  • synonyms often better as query-time rewrite

This fits inside the 5-mode limit. (Vespa Documentation)

2) Do not use a single “default fieldset” across all three

If you mix different stemming/matching settings in one fieldset, you will get warnings and real mismatches.

Instead:

  • fieldset title_fs { title }
  • fieldset company_fs { company_name }
  • fieldset desc_fs { description }

Then query:

  • parse user input once per fieldset, OR them, and weight/boost in ranking.
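Putting that together, the query side could look like this (fieldset names from above; the language annotation is there because Lucene Linguistics will not detect it for you):

```
select * from products where
    ({defaultIndex: 'title_fs', language: 'en'}userInput(@query)) or
    ({defaultIndex: 'company_fs', language: 'en'}userInput(@query)) or
    ({defaultIndex: 'desc_fs', language: 'en'}userInput(@query))
```

Relative weighting then happens in the rank profile (for example via per-field bm25 terms) rather than in the YQL.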

3) Be explicit about language

Because Lucene Linguistics does not detect language, feed and query should set language. (Vespa Documentation)
At query time you can set language for userInput() in YQL.


Similar cases and “things people run into” online

  1. Fieldset inconsistency warnings are common and not cosmetic. They reflect real “query processed once vs fields indexed differently” behavior.

  2. Stemming interactions with semantic rules (query rewriting) can produce surprising mismatches. The open Vespa issue documents the evaluation order and why stemming:none may be required for rewrite matching fields. (GitHub)

  3. Solr-to-Vespa migrations using Lucene Linguistics are an explicit design goal of Lucene Linguistics.



Summary

  • You can approximate Solr per-field analyzers by configuring multiple English analyzers keyed as en/<stemming-mode> and assigning fields to those modes. (Vespa Documentation)
  • You only get 5 profiles per language: NONE, DEFAULT, ALL, SHORTEST, BEST. (Vespa Documentation)
  • Do not put fields with different analysis into one fieldset. Instead OR multiple userInput() clauses per fieldset.
  • For synonyms via semantic rules, watch stemming interactions. Consider stemming:none for rewrite matching fields. (GitHub)