31% (Baymard Institute) of all product searches in online shops return no result - even though the requested product exists in the catalog. Classical keyword search based on BM25 fails at synonyms, intent variants and long-tail queries. Semantic product search with vector embeddings, an HNSW index and cross-encoder reranking closes this gap on a technical level. Visitors who use search convert up to 2-3x more often (Algolia) than those who only browse - in Amazon's case, conversion rises from 2% to 12%. For online shops and Shopware projects, a modern hybrid search architecture is therefore a direct lever on revenue, AOV and catalog utilisation.
Where classical keyword search fails
BM25 and its variants have been the standard for full-text retrieval for more than 30 years. They weight terms by frequency, document length and inverse document frequency - and deliver solid baseline results in many e-commerce scenarios. But as soon as a query deviates from the vocabulary stored in the catalog, result quality drops sharply. 70% (Baymard) of the 60 largest e-commerce sites return no relevant hits on synonym searches. 41% (Baymard) of shops do not fully support the eight most common query types. Across the industry, zero-result rates sit between 10 and 30% (Lucidworks/Wizzy), often above 25% without intelligent search. Algolia's production target is below 2% - values above 3-5% indicate acute search issues (Algolia).
- Synonyms and jargon: 'trainer' vs. 'sneaker' vs. 'running shoe' - BM25 treats these as unrelated tokens.
- Intent variants: 'running shoes for asphalt' and 'road running shoes' describe the same thing but barely share terms.
- Long-tail queries: Natural phrasing like 'comfortable waterproof shoes for autumn hiking' rarely matches 1:1.
- Error tolerance: typos, alternative spellings, German compound words.
- Multilingual catalogs: international shops with queries across languages need cross-language representations.
- Attribute semantics: a 'machine washable' query misses products whose description only mentions '30-degree wash'.
The consequences are measurable: an apparel case study saw search exit rates drop by 35% (Netguru) within three months of introducing semantic search. Global cart abandonment sits at 70.22%, on mobile at 80.2% (Opensend) - poor search is one of its drivers. Improving search quality therefore acts directly on the same KPIs addressed in checkout optimization.
Vector embeddings: semantics as numerical space
An embedding is a dense numerical representation of a text, image or mixed object. A language model maps a product title like 'Waterproof trail running shoes men size 43' to a vector with typically 384 to 2,048 dimensions. Products with similar meaning - even with different wording - receive vectors that sit close together in space. Semantic search exploits this property: the query is embedded into the same vector space and compared to all product vectors via a distance metric (cosine, dot product, L2).
This covers synonyms, paraphrases and linguistic intent variants implicitly - without a hand-maintained dictionary. Embedding models are typically pretrained on MS MARCO, BEIR or domain-specific e-commerce datasets and can be fine-tuned on shop-specific product language.
Embeddings translate language into geometry: search shifts from a matching problem to a nearest-neighbour problem in high-dimensional space. Everything that follows - HNSW index, hybrid fusion, reranking - optimises either speed or precision of this nearest-neighbour search.
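To make the geometry concrete, here is a minimal sketch using the sentence-transformers library and the E5-small-v2 model discussed below (the product texts and the query are purely illustrative):

from sentence_transformers import SentenceTransformer, util

# E5 models expect "query: " / "passage: " prefixes for asymmetric retrieval
model = SentenceTransformer("intfloat/e5-small-v2")

products = [
    "passage: Waterproof trail running shoes men size 43",
    "passage: Leather city sneaker white",
    "passage: Gore-Tex hiking boots for autumn",
]
query = "query: comfortable waterproof shoes for autumn hiking"

# Normalised vectors: cosine similarity reduces to a dot product
product_vecs = model.encode(products, normalize_embeddings=True)
query_vec = model.encode(query, normalize_embeddings=True)

scores = util.cos_sim(query_vec, product_vecs)[0]
best = int(scores.argmax())
print(products[best], float(scores[best]))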
Embedding models compared
The model landscape is broad: Sentence-BERT variants on MS MARCO, the E5 family (small/base/large), multilingual-e5-large for cross-language, and commercial APIs such as OpenAI text-embedding-3-small or voyage-3-large with 2,048 dimensions. The right choice depends on catalog size, languages, latency budget and hosting model. One notable finding: the compact E5-small (118M parameters) beats models 70x its size in some benchmarks and delivers latencies below 30 ms with up to 100% top-5 accuracy in e-commerce tests (aimultiple, Supermemory).
| Model | Parameters | Dimensions | Typ. latency | Use case |
|---|---|---|---|---|
| E5-small-v2 | 118M | 384 | < 30 ms | Self-hosted, small-to-mid catalogs |
| multilingual-e5-large | 560M | 1024 | 30-80 ms | International shops, cross-language |
| voyage-3-large | API | 2048 | API round-trip | High-accuracy, managed |
| OpenAI text-embedding-3-small | API | 1536 (variable) | p90 ~500 ms | Managed, p99 spikes possible |
| Sentence-BERT (MS MARCO) | 110M-335M | 768 | 20-60 ms | Baseline, open weights |
When latency is business-critical, distribution matters more than mean values: OpenAI text-embedding-3-small shows p90 latencies around 500 ms with p99 spikes up to 5 seconds (Nixiesearch) - a risk for live-search UX. Self-hosted models on an NVIDIA L4 reach around 2,000 tokens/s on 7B embedding models; a one-time indexing of 1bn items takes roughly 5.8 days (Baseten/Introl). By comparison, API costs sit at USD 0.02-0.18 per million tokens - make-or-buy depends on volume and compliance needs.
HNSW index and vector databases
A linear nearest-neighbour scan across millions of product vectors cannot meet millisecond budgets. This is where HNSW (Hierarchical Navigable Small World) comes in - a graph-based approximate nearest-neighbour index that navigates hierarchically across several layers and finds relevant neighbours in logarithmic time. Key parameters are graph connectivity (M), the build-time parameter (efConstruction) and the query-time parameter (efSearch or num_candidates), which trades recall against latency.
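A sketch of where these parameters live, assuming Elasticsearch 8.x and its Python client (index and field names are illustrative, not a recommendation):

from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")  # connection details are placeholders

es.indices.create(
    index="products",
    mappings={
        "properties": {
            "title": {"type": "text"},
            "embedding": {
                "type": "dense_vector",
                "dims": 384,                 # must match the embedding model
                "index": True,
                "similarity": "cosine",
                "index_options": {
                    "type": "hnsw",
                    "m": 16,                 # graph connectivity
                    "ef_construction": 100,  # build-time recall/speed trade-off
                },
            },
        }
    },
)
# The query-time parameter (efSearch) is set per request via num_candidates in the kNN clause.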
- Elasticsearch / OpenSearch: HNSW as a dense_vector field with tight BM25 integration inside the same query.
- Qdrant: Rust-based vector engine with payload filters, quantisation and hybrid search primitives.
- Weaviate: schema-driven vector DB with integrated modules for generative search.
- pgvector (PostgreSQL): HNSW and IVFFlat indexes directly inside the relation - attractive when the shop already runs on PostgreSQL.
- Milvus: scales to billions of vectors with strong quantisation options (PQ, SQ, BBQ).
- Lucene-based unified indexes: combine BM25 and HNSW in one segment - 8.9 to 186 times faster than separate indexes (Elastic Labs).
Performance numbers from production benchmarks: Elasticsearch BBQ reaches 5x the speed of OpenSearch with FAISS and up to 8x throughput on filtered queries (Elastic 2025). Typical kNN latencies sit at 7-16 ms in Elasticsearch, even below 15 ms under write load (Elastic/Baseten). The database choice is less about recommendations and more about fit with the existing stack - all systems listed are production-ready. For new programming projects, a small proof of concept on real catalog data beats synthetic benchmarks.
Hybrid search: BM25 + dense + RRF
Pure dense search degrades on exact product IDs, SKUs, brand and measurement terms: someone typing 'ISO 9001 stainless steel 304' does not want semantically similar products but exact term matches. Pure sparse search (BM25) degrades on synonyms and natural language. The answer is hybrid search: both retrievers run in parallel, and their result lists are fused.
The most robust fusion mechanism is Reciprocal Rank Fusion (RRF): each document is scored by the sum of the reciprocals of its (constant-shifted) ranks across both lists - no normalisation of score scales is needed. RRF delivers +15 to +30% recall over single methods (Premai.io). On the WANDS e-commerce benchmark, it produced +1.7% mean nDCG over pure dense search (Hindsight/Vectorize 2026) with higher robustness at the same time.
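Conceptually, RRF reduces to a few lines; the sketch below mirrors what the Elasticsearch retriever shown next computes server-side (k = 60 is the common default for the rank constant):

def rrf_fuse(result_lists: list[list[str]], k: int = 60) -> list[str]:
    """Fuse ranked lists of document IDs via Reciprocal Rank Fusion."""
    scores: dict[str, float] = {}
    for ranking in result_lists:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

# Documents ranked well in both lists rise to the top:
fused = rrf_fuse([["p1", "p7", "p3"], ["p3", "p1", "p9"]])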
{
"retriever": {
"rrf": {
"retrievers": [
{
"standard": {
"query": {
"multi_match": {
"query": "waterproof running shoes men size 43",
"fields": ["title^3", "brand^2", "description", "attributes.*"],
"type": "best_fields",
"fuzziness": "AUTO"
}
}
}
},
{
"knn": {
"field": "embedding",
"query_vector_builder": {
"text_embedding": {
"model_id": "e5-small-v2",
"model_text": "waterproof running shoes men size 43"
}
},
"k": 50,
"num_candidates": 200
}
}
],
"rank_window_size": 100,
"rank_constant": 60
}
},
"size": 20
}
The trick lies in sensible field weighting: title and brand via BM25 with higher boost, description and attributes primarily through the embedding. For a technical view on the data model, see the article on AI-optimised [product data](/en/blog/produktdaten-ki-optimieren-strukturierte-daten-2026/).
Robustness also benefits from query routing: for a pure SKU, an exact-match path with keyword boost takes over. For very short queries (1-2 tokens), BM25 carries more weight; for long, natural language queries, the dense side contributes more relevance. This switching is typically driven by a heuristic or a small classifier before retrieval and prevents hybrid search from delivering uniformly 'average' results instead of cleanly serving the query class at hand.
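A minimal routing heuristic along these lines could look as follows (the SKU pattern, token threshold and weights are illustrative assumptions, not tuned values):

import re

# Crude SKU/article-number heuristic: alphanumeric token containing digits
SKU_PATTERN = re.compile(r"^[a-z0-9][a-z0-9\-\.]{4,}$", re.IGNORECASE)

def route_query(query: str) -> dict:
    """Pick a retrieval strategy and BM25/dense weights per query class."""
    q = query.strip()
    tokens = q.split()
    if len(tokens) == 1 and SKU_PATTERN.match(q) and any(ch.isdigit() for ch in q):
        # Looks like a SKU / article number: exact keyword match only
        return {"strategy": "exact", "bm25_weight": 1.0, "dense_weight": 0.0}
    if len(tokens) <= 2:
        # Very short queries: keyword signal dominates
        return {"strategy": "hybrid", "bm25_weight": 0.7, "dense_weight": 0.3}
    # Long, natural-language queries: let the dense side contribute more
    return {"strategy": "hybrid", "bm25_weight": 0.4, "dense_weight": 0.6}

route_query("B07XJ8C8F5")                      # -> exact-match path
route_query("waterproof shoes autumn hiking")  # -> hybrid, dense-leaning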
Cross-encoder reranking as a second stage
Hybrid retrieval typically returns 50-200 candidates. The cross-encoder is a second, more precise model that jointly encodes each query-product pair and produces a relevance score. Unlike a bi-encoder (one vector per side), the cross-encoder sees query and document simultaneously and reaches significantly higher precision - at the price of higher compute. That is why it is applied only to the top-K candidates from the first stage.
from sentence_transformers import CrossEncoder

# Second-stage model: scores each (query, product) pair jointly
reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

def rerank(query: str, candidates: list[dict], top_n: int = 20) -> list[dict]:
    # Concatenate title and description as the document side of each pair
    pairs = [(query, c["title"] + " " + c["description"]) for c in candidates]
    scores = reranker.predict(pairs, batch_size=32)
    for c, s in zip(candidates, scores):
        c["rerank_score"] = float(s)
    # Best rerank score first, trimmed to the top_n results actually shown
    return sorted(candidates, key=lambda x: x["rerank_score"], reverse=True)[:top_n]

Cross-encoders typically add 20-80 ms of latency, depending on model size and candidate count. With GPU inference or quantised models, the overall pipeline still stays within the live-search budget. The quality gain is largest on ambiguous queries and fine intent differences - exactly where classical search breaks down.
In practice a two-model setup works well: a small reranker (such as MiniLM-L-6) on the hot path with strict latency limits and a larger model (such as MonoT5 or bge-reranker-base) on the async path for recommendation lists, category pages or SEO programmes like programmatic SEO. Cross-encoders also benefit from feature enrichment: instead of encoding only the title and description, brand, main category and one or two key attributes can be passed as a text prefix. This measurably lifts NDCG without swapping the model.
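A sketch of such enrichment, assuming the candidate dicts carry brand, category and attribute fields (the field names are illustrative):

def rerank_text(product: dict) -> str:
    """Prefix the document side of the pair with brand, category and key attributes."""
    attrs = ", ".join(product.get("key_attributes", [])[:2])
    prefix = f"{product.get('brand', '')} | {product.get('category', '')} | {attrs}"
    return f"{prefix} | {product['title']} {product['description']}"

# Drop-in replacement for the plain title+description concatenation in the reranker above:
# pairs = [(query, rerank_text(c)) for c in candidates]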
Query expansion with LLMs
Before retrieval, an LLM can reformulate, expand or structure the query: generate synonyms, extract implicit attributes, fix spelling errors, or split a natural language query into a structured filter part plus a free-text part. This is particularly valuable for voice commerce and chat interfaces, covered in more depth in the article on voice commerce.
{
"system": "You are an e-commerce query parser. Extract structured filters and a cleaned free-text query. Respond with JSON only.",
"user": "comfortable waterproof running shoes men size 43 for autumn jogging",
"expected_output": {
"freetext": "waterproof running shoes autumn",
"filters": {
"category": "running shoes",
"gender": "men",
"size_eu": 43,
"feature": ["waterproof", "comfort"]
},
"synonyms": ["running sneakers", "jogging shoes", "trail runners"]
}
}
Combined with a synonym graph derived from catalog and search-log data, this builds a query-understanding stage that maps natural user language onto the catalog's domain vocabulary - without brittle rule systems. Caveat: LLM calls add latency and cost; for high-frequency queries, caching on the normalised query level is worthwhile.
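A minimal cache keyed on the normalised query, in the spirit of that caveat (the normalisation rules are simplified, the expansion call is a hypothetical helper, and production setups would add a TTL):

import re
from functools import lru_cache

def normalize_query(query: str) -> str:
    # Lowercase, strip punctuation, collapse whitespace
    return re.sub(r"\s+", " ", re.sub(r"[^\w\s]", " ", query.lower())).strip()

@lru_cache(maxsize=50_000)
def expand_query_cached(normalized_query: str) -> str:
    return call_llm_query_parser(normalized_query)  # hypothetical LLM expansion helper

# expand_query_cached(normalize_query("Comfortable WATERPROOF running shoes, men 43"))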
Another building block is retrieval-augmented generation (RAG) for advisory queries: questions such as 'Which running shoe works for overpronation?' are answered against the top-K results plus relevant guide content. The model explains and points to concrete products - especially useful in advice-heavy categories. Thematically this connects to the article on generative engine optimization, which covers the SEO side of the same development.
Quantisation: cutting memory cost
A 1,536-dimensional Float32 vector occupies 6 KB. For 5 million products, that is already 30 GB - without replication, without graph overhead. Quantisation cuts this footprint drastically: scalar INT8 reduces memory to roughly a quarter, binary quantisation (BBQ) and product quantisation (PQ) go considerably further.
| Method | Memory reduction | Typ. recall | Use case |
|---|---|---|---|
| Float32 (baseline) | 0% | 100% | Development, highest quality |
| Scalar INT8 | -75% | 90-99% | Production default |
| Binary BBQ | -96% | 85-95% | Very large catalogs, with rescoring |
| FAISS PQ | -87 to -96% | 80-95% | Billions of vectors, batch use cases |
MongoDB Atlas documents 90-95% accuracy at latencies under 50 ms on 15.3M vectors × 2,048 dimensions (MongoDB) - quantisation is no longer 'experimental' but a production default. The right choice depends on recall requirements, candidate counts, and whether a rescoring stage can follow at higher precision.
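The memory arithmetic behind these numbers as a quick back-of-the-envelope sketch (graph structures and metadata overhead are ignored):

def vector_memory_gb(num_vectors: int, dims: int, bytes_per_dim: float) -> float:
    return num_vectors * dims * bytes_per_dim / 1e9

catalog, dims = 5_000_000, 1536
print(vector_memory_gb(catalog, dims, 4.0))    # Float32: ~30.7 GB
print(vector_memory_gb(catalog, dims, 1.0))    # INT8:    ~7.7 GB
print(vector_memory_gb(catalog, dims, 0.125))  # Binary:  ~1.0 GB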
Latency budget: 50-200ms for live search
Live-search UX demands response times below 200 ms - above that, waiting becomes perceptible. A realistic budget for the full pipeline looks like this:
| Stage | Keyword search | Dense only | Hybrid + rerank |
|---|---|---|---|
| Query embedding | - | 10-50 ms | 10-50 ms |
| Retrieval (BM25 / HNSW) | 5-15 ms | 7-16 ms | 10-25 ms |
| RRF fusion | - | - | 1-3 ms |
| Cross-encoder rerank | - | - | 20-80 ms |
| Transport + rendering | 10-30 ms | 10-30 ms | 10-30 ms |
| **Total (typical)** | **20-50 ms** | **30-100 ms** | **50-200 ms** |
These values come from production measurements on Elasticsearch, OpenSearch and Qdrant (Elastic/Baseten) and serve as a reference point - catalog size, replication, filtering and network topology shift them case by case. For globally distributed shops, a look at [edge caching strategies](/en/blog/managed-hosting-online-shops-2026/) helps reduce search response times regionally.
Practical levers to hold the budget: keep the query embedding on a dedicated inference server with a warm model cache, cap num_candidates sensibly, restrict the cross-encoder to top-30 or top-50 and batch its inference. On the infrastructure side, HTTP/2 or HTTP/3, gRPC for internal hops and strict per-stage timeouts help - a slow reranker must not block search but fall back to the hybrid result. Monitoring p50/p95/p99 is mandatory, not optional.
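The fallback behaviour can be sketched with per-stage timeouts, assuming an async service with hypothetical hybrid_search and rerank_async helpers:

import asyncio

RERANK_TIMEOUT_S = 0.08  # strict budget for the reranking stage

async def search(query: str) -> list[dict]:
    # Hybrid retrieval (BM25 + kNN + RRF) is the mandatory stage
    candidates = await hybrid_search(query)  # hypothetical retrieval helper
    try:
        # Reranking is best-effort: on timeout, serve the hybrid order as-is
        return await asyncio.wait_for(
            rerank_async(query, candidates), timeout=RERANK_TIMEOUT_S  # hypothetical helper
        )
    except asyncio.TimeoutError:
        return candidates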
Typical mistakes in semantic rollouts
- Dense-only instead of hybrid: the fastest route to 'search no longer finds my SKU'. Hybrid is the safe default.
- No evaluation suite: without offline metrics (nDCG@10, recall@50, MRR) and online A/B tests, no change is provable - a minimal evaluation sketch follows after this list.
- Ignoring product data: an embedding is only as good as its input. Sparsely described products yield blurry vectors.
- No negative signals: click logs and purchases are valuable feedback - ignoring them throws away the most useful fine-tuning signal.
- Reranker always and everywhere: apply cross-encoders only to top-K, not to the full candidate list.
- Forgotten categories/filters: semantic hits must stay inside active filters and stock constraints.
- No fallback: if the vector service stalls briefly, BM25 must continue to serve - otherwise the search function disappears entirely.
- Model drift: language and catalog change. Without periodic re-indexing and re-evaluation, search ages silently.
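A minimal offline-evaluation sketch for the metrics named above, assuming graded relevance judgments per query (the data structures are illustrative):

import math

def ndcg_at_k(ranked_ids: list[str], relevance: dict[str, int], k: int = 10) -> float:
    """Graded gains discounted by log rank, normalised by the ideal ordering."""
    dcg = sum(
        relevance.get(doc_id, 0) / math.log2(rank + 2)
        for rank, doc_id in enumerate(ranked_ids[:k])
    )
    ideal = sorted(relevance.values(), reverse=True)[:k]
    idcg = sum(gain / math.log2(rank + 2) for rank, gain in enumerate(ideal))
    return dcg / idcg if idcg > 0 else 0.0

def mrr(ranked_ids: list[str], relevance: dict[str, int]) -> float:
    """Reciprocal rank of the first relevant result."""
    for rank, doc_id in enumerate(ranked_ids, start=1):
        if relevance.get(doc_id, 0) > 0:
            return 1.0 / rank
    return 0.0

# Averaged over an annotated query set, these numbers make retrieval changes comparable offline.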
A six-phase rollout roadmap
- Measure the baseline: zero-result rate, click-through on top-5, search exit rate, conversion rate for searchers vs non-searchers. No baseline, no provable gains.
- Clean the data: titles, categories, attributes, synonyms. Embeddings are only as precise as the product text - see PIM systems.
- Choose a model and index: one-time embedding of the catalog, storage in an HNSW index, quantisation enabled. Define the re-indexing workflow.
- Build the hybrid query: BM25 + kNN in parallel, RRF fusion, field-boost tuning. Include filter constraints.
- Integrate the reranker: cross-encoder on top-50, measure latency, adjust model size. Offline evaluation against baseline.
- A/B test and iterate: online rollout with traffic split, conversion measurement, fine-tuning on click and purchase signals. Then continuous monitoring.
An apparel retailer documented -35% search exit rate within three months following this approach (Netguru). Envive reports +8-12% conversion uplift for mid-sized catalogs, +15-20% for enterprise and +20-25% AOV on high-intent searches (Envive 2026). Personalised semantic search in the Algolia benchmark even shows +50% conversion rate (Algolia 2025). Figures vary by industry and implementation maturity - but the direction is consistent.
This article is based on data and benchmarks from: Baymard Institute, Algolia, Envive, Lucidworks, Netguru, Elastic Search Labs, MongoDB, Premai.io, Hindsight, Vectorize, aimultiple, Supermemory, Baseten, Nixiesearch and Opensend. Performance numbers vary with catalog, infrastructure and query mix - the values given are a reference point, not a guarantee.
Search as a product discovery engine
In 2026, semantic product search is no longer an experimental add-on but the infrastructural foundation of modern shops. The building blocks - embeddings, HNSW, hybrid fusion, reranking, quantisation - are production-ready, verifiably effective and integrable within reasonable latency budgets. Treating search as a mere filter facade sacrifices conversion and loses ground to shops that treat it as a primary discovery engine. For a strategic entry point, see the overview article on AI-powered product search and the piece on AI product recommendations - both describe complementary parts of the same discovery architecture.
Frequently asked questions
Is pure dense vector search enough, or does a shop need hybrid retrieval?
Typically not. Pure dense search degrades on exact SKU, brand and measurement terms. Hybrid retrieval with BM25 + vector search and RRF fusion reaches 15-30% higher recall (Premai.io) and is substantially more robust. For most shop catalogs, hybrid is the sensible default; dense-only is a special case.
How fast is semantic search with reranking in practice?
A hybrid search stack with reranking typically fits within 50-200 ms in total - query embedding (10-50 ms), HNSW retrieval (7-16 ms according to Elastic/Baseten) and cross-encoder reranking (20-80 ms). That keeps search inside the live-search window. Clean infrastructure, quantisation and caching keep these values stable under load in our experience.
Which embedding model is a good starting point?
Experience suggests a compact model such as E5-small-v2 (118M parameters, 384 dimensions) is a very good starting point - latency below 30 ms and benchmark performance that can match models 70 times its size (aimultiple, Supermemory). For multilingual catalogs, multilingual-e5-large is worth considering; for top-end quality, commercial APIs. The final choice should rest on catalog-specific benchmarks.
How much memory do the product vectors consume?
Without quantisation, a 1,536-dimensional Float32 vector takes around 6 KB - for 5 million products that is about 30 GB. INT8 quantisation cuts this by roughly 75% while largely preserving recall; binary quantisation (BBQ) by up to 96% (MongoDB Atlas documents 90-95% accuracy at 15.3M × 2,048 dim below 50 ms latency). Quantisation should be the production default.
Does semantic search require a dedicated vector database?
Not necessarily. Elasticsearch and OpenSearch support HNSW natively and combine BM25 with vector search in the same query. If the shop already runs on PostgreSQL, pgvector is an obvious option. Dedicated engines such as Qdrant, Weaviate or Milvus offer advantages for very large catalogs or specialised quantisation features. The decision usually follows the existing infrastructure rather than a general recommendation.
How do you measure whether the new search actually performs better?
Two layers: offline with relevance metrics like nDCG@10, recall@50 and MRR on an annotated query set; online via A/B tests against the existing search, focused on zero-result rate, click-through rate, search exit rate, conversion rate and AOV of searchers. Typical observations range from +8-12% conversion on mid-sized catalogs to +15-20% on enterprise setups (Envive 2026) - but vary considerably depending on the starting quality of search.