Reproducible Benchmark Results

Validate before you commit. Run the benchmarks yourself.

Dataset for Fine-tuning

Amazon ESCI (Shopping Queries Dataset)

  • 130K real e-commerce queries
  • Human-labeled relevance judgments
  • 1.2M products from Amazon catalog
  • Categories: Electronics, Home, Apparel, etc.

Evaluation Methodology

NDCG@10 (Normalized Discounted Cumulative Gain)

  • Industry standard for search relevance evaluation
  • Penalizes both missing relevant results and wrong ordering
  • Higher is better (0.0 to 1.0 scale)
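For readers who want to check the metric by hand, here is a minimal sketch of NDCG@10 for a single query. It assumes numeric relevance gains per result and a linear gain with a log2 discount; the function names are illustrative, and the gain mapping and averaging used in the published benchmark are not specified here, so treat this as an approximation of the idea rather than the benchmark's own code.

```python
import math
from typing import Optional, Sequence


def dcg_at_k(gains: Sequence[float], k: int = 10) -> float:
    """Discounted cumulative gain over the top-k results (linear gain, log2 discount)."""
    return sum(g / math.log2(rank + 2) for rank, g in enumerate(gains[:k]))


def ndcg_at_k(
    ranked_gains: Sequence[float],
    judged_gains: Optional[Sequence[float]] = None,
    k: int = 10,
) -> float:
    """NDCG@k for one query.

    ranked_gains: gains of the returned results, in the order they were returned.
    judged_gains: gains of every judged item for the query (assumption: when omitted,
                  the ideal ranking is built from the returned results only).
    """
    pool = judged_gains if judged_gains is not None else ranked_gains
    ideal_dcg = dcg_at_k(sorted(pool, reverse=True), k)
    if ideal_dcg == 0:
        return 0.0  # no relevant items judged for this query
    return dcg_at_k(ranked_gains, k) / ideal_dcg


if __name__ == "__main__":
    # A ranking that places a less relevant item first scores below 1.0.
    print(ndcg_at_k([1, 3, 2, 0]))
```

A benchmark score is typically the mean of this per-query value across all evaluated queries.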

NDCG@10 Benchmark Results

Search Configuration            NDCG@10
Kwiree (fine-tuned)             0.332
OpenSearch Semantic + Rerank    0.319
OpenSearch BM25 + Rerank        0.307
OpenSearch Hybrid               0.302
OpenSearch Semantic             0.296
OpenSearch BM25                 0.258

Kwiree improvement over OpenSearch BM25 (relative)

+28.6%

Kwiree improvement over OpenSearch Semantic (relative)

+12.2%
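The improvement figures are relative gains in NDCG@10. The sketch below recomputes them from the rounded scores in the results table; the dictionary keys and function name are illustrative. Because the table values are rounded to three decimals, the recomputed percentages can differ from the published ones in the last digit.

```python
# NDCG@10 scores copied from the published results table (rounded to 3 decimals).
published = {
    "Kwiree (fine-tuned)": 0.332,
    "OpenSearch Semantic": 0.296,
    "OpenSearch BM25": 0.258,
}


def relative_improvement(new: float, baseline: float) -> float:
    """Relative gain of `new` over `baseline`, in percent."""
    return (new - baseline) / baseline * 100


if __name__ == "__main__":
    kwiree = published["Kwiree (fine-tuned)"]
    for name in ("OpenSearch BM25", "OpenSearch Semantic"):
        print(f"vs {name}: {relative_improvement(kwiree, published[name]):+.2f}%")
```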

What This Means in Practice

Fewer Wrong Results

Customers see products that match their intent, not just keyword matches.

Better Semantic Understanding

"wireless headphones for running" finds sport-focused products, not just any wireless headphones.

Improved Attribute Handling

Product specs and attributes are understood semantically, not just as keywords.

More Relevant Tail Queries

Long-tail queries that would normally return poor results now find relevant products.

How to Reproduce

Terminal
# 1. Clone the benchmark repository
% git clone https://github.com/kwiree/benchmark
% cd benchmark

# 2. Install dependencies
% pip install -r requirements.txt

# 3. Run the benchmark
% python benchmark.py \
  --openai-api-key YOUR_OPENAI_KEY \
  --kwiree-api-key YOUR_KWIREE_KEY

# 4. Compare results to published metrics

Benchmark on Your Own Catalog

Don't trust generic benchmarks? Test on your actual data.

What you provide:

  • Product catalog (CSV, <10K products)
  • Query log (CSV, <1K queries)

What you get:

  • NDCG@10 on your query sample
  • Recall@10 comparison
  • Zero-result rate analysis
  • P50, P95, P99 latency metrics
  • Coverage analysis
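To make the report metrics concrete, here is a rough sketch of how Recall@10, zero-result rate, and latency percentiles are commonly computed. The function names and input shapes are assumptions for illustration; the exact definitions used in the report (including what "coverage" measures) are not specified here.

```python
import statistics
from typing import Dict, Sequence


def recall_at_k(retrieved: Sequence[str], relevant: Sequence[str], k: int = 10) -> float:
    """Share of the relevant product IDs that appear in the top-k results."""
    relevant_set = set(relevant)
    if not relevant_set:
        return 0.0
    return len(set(retrieved[:k]) & relevant_set) / len(relevant_set)


def zero_result_rate(result_lists: Sequence[Sequence[str]]) -> float:
    """Share of queries that returned no results at all."""
    if not result_lists:
        return 0.0
    return sum(1 for results in result_lists if not results) / len(result_lists)


def latency_percentiles(latencies_ms: Sequence[float]) -> Dict[str, float]:
    """P50, P95, and P99 of per-query latencies (needs at least two samples)."""
    cuts = statistics.quantiles(latencies_ms, n=100, method="inclusive")
    return {"p50": cuts[49], "p95": cuts[94], "p99": cuts[98]}


if __name__ == "__main__":
    print(recall_at_k(["a", "b", "c"], ["a", "d"]))
    print(zero_result_rate([["a"], [], ["b", "c"], []]))
    print(latency_percentiles([12.0, 15.5, 18.2, 22.0, 95.0]))
```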

Detailed free report within 5 business days. Zero commitment.