Reproducible Benchmark Results
Validate before you commit. Run the benchmarks yourself.
Dataset for Fine-tuning
Amazon ESCI (Shopping Queries Dataset)
- 130K real e-commerce queries
- Human-labeled relevance judgments
- 1.2M products from Amazon catalog
- Categories: Electronics, Home, Apparel, etc.
Evaluation Methodology
NDCG@10 (Normalized Discounted Cumulative Gain)
- Industry standard for search relevance evaluation
- Penalizes both missing relevant results and wrong ordering
- Higher is better (0.0 to 1.0 scale)
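To make the metric concrete, here is a minimal sketch of how NDCG@10 can be computed for a single query. It assumes linear gains and a log2 rank discount, which is one common formulation; the benchmark repository may map relevance labels to gains differently. The function names and the example gain values are illustrative and are not taken from the benchmark code.

```python
import math


def dcg_at_k(gains, k=10):
    """Discounted cumulative gain over the first k graded relevance values."""
    # Rank 1 is discounted by log2(2) = 1, rank 2 by log2(3), and so on.
    return sum(g / math.log2(rank + 2) for rank, g in enumerate(gains[:k]))


def ndcg_at_k(ranked_gains, all_gains=None, k=10):
    """NDCG@k: DCG of the returned ranking divided by DCG of the ideal ranking.

    ranked_gains: gains of the returned results, in rank order.
    all_gains: gains of every judged item for the query (assumed input);
               when given, relevant items missing from the results lower the score.
    """
    pool = ranked_gains if all_gains is None else all_gains
    ideal = dcg_at_k(sorted(pool, reverse=True), k)
    return dcg_at_k(ranked_gains, k) / ideal if ideal > 0 else 0.0


# Hypothetical gains: 3 = best match, 0 = irrelevant.
perfect_order = ndcg_at_k([3, 2, 1, 0])            # ideal ordering
wrong_order = ndcg_at_k([0, 1, 2, 3])              # same items, reversed
missing_item = ndcg_at_k([2, 0], all_gains=[3, 2, 0])  # best item never returned
```

In this sketch the ideal ordering scores 1.0, while both the reversed ordering and the result list that misses the best item score lower. The per-query scores are then typically averaged over all queries to get a single number like those in the results table.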
NDCG@10 Benchmark Results
| Search Configuration | NDCG@10 |
|---|---|
| Kwiree (fine-tuned) | 0.332 |
| OpenSearch Semantic + Rerank | 0.319 |
| OpenSearch BM25 + Rerank | 0.307 |
| OpenSearch Hybrid | 0.302 |
| OpenSearch Semantic | 0.296 |
| OpenSearch BM25 | 0.258 |
Relative NDCG@10 improvement over OpenSearch BM25
+28.6%
Relative NDCG@10 improvement over OpenSearch Semantic
+12.2%
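The improvement figures are relative gains in NDCG@10 over the respective baseline. The short sketch below recomputes them from the rounded values in the results table. The helper name is illustrative. Because the table values are rounded to three decimals, the recomputed percentages can differ from the published ones in the last digit.

```python
def relative_improvement(new_score, baseline_score):
    """Relative change of new_score over baseline_score, in percent."""
    return (new_score - baseline_score) / baseline_score * 100.0


# NDCG@10 values copied from the results table (rounded to three decimals).
kwiree = 0.332
opensearch_bm25 = 0.258
opensearch_semantic = 0.296

over_bm25 = relative_improvement(kwiree, opensearch_bm25)
over_semantic = relative_improvement(kwiree, opensearch_semantic)

print(f"Improvement over BM25: +{over_bm25:.1f}%")
print(f"Improvement over Semantic: +{over_semantic:.1f}%")
```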
What This Means in Practice
Fewer Wrong Results
Customers see products that match their intent, not just keyword matches.
Better Semantic Understanding
"wireless headphones for running" finds sport-focused products, not just any wireless headphones.
Improved Attribute Handling
Product specs and attributes are understood semantically, not just as keywords.
More Relevant Tail Queries
Long-tail queries that would normally return poor results now find relevant products.
How to Reproduce
```shell
# 1. Clone the benchmark repository
git clone https://github.com/kwiree/benchmark
cd benchmark

# 2. Install dependencies
pip install -r requirements.txt

# 3. Run the benchmark
python benchmark.py \
  --openai-api-key YOUR_OPENAI_KEY \
  --kwiree-api-key YOUR_KWIREE_KEY

# 4. Compare results to published metrics
```
Benchmark on Your Own Catalog
Don't trust generic benchmarks? Test on your actual data.
What you provide:
- Product catalog (CSV, <10K products)
- Query log (CSV, <1K queries)
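Before sending files, it can help to check that they fit the stated size limits. The sketch below only counts rows. It assumes each CSV has a header row and one product or query per row, which the page does not specify, and the column names in the example are made up.

```python
import csv
import io

MAX_PRODUCTS = 10_000  # "<10K products" from the list above
MAX_QUERIES = 1_000    # "<1K queries" from the list above


def count_data_rows(handle):
    """Count non-empty CSV rows after the header (a header row is assumed)."""
    reader = csv.reader(handle)
    next(reader, None)  # skip the assumed header row
    return sum(1 for row in reader if row)


def within_limit(handle, limit):
    """True when the file has fewer data rows than the given limit."""
    return count_data_rows(handle) < limit


# In-memory stand-in for a catalog file; the columns are hypothetical.
sample_catalog = io.StringIO("product_id,title\n1,Running shoes\n2,Wireless headphones\n")
catalog_ok = within_limit(sample_catalog, MAX_PRODUCTS)
```

For real files, open them with `open(path, newline="", encoding="utf-8")` and pass the handle to `within_limit`.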
What you get:
- NDCG@10 on your query sample
- Recall@10 comparison
- Zero-result rate analysis
- P50, P95, P99 latency metrics
- Coverage analysis
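As a rough guide to reading the report, the sketch below shows one common way to compute Recall@10, zero-result rate, and latency percentiles. These definitions are assumptions and the report may use different ones. The function names and sample data are illustrative. Coverage analysis is left out because the page does not define it.

```python
import statistics


def recall_at_k(returned_ids, relevant_ids, k=10):
    """Share of the relevant items that appear in the top k results."""
    if not relevant_ids:
        return 0.0
    hits = len(set(returned_ids[:k]) & set(relevant_ids))
    return hits / len(relevant_ids)


def zero_result_rate(results_per_query):
    """Share of queries whose result list is empty."""
    if not results_per_query:
        return 0.0
    empty = sum(1 for results in results_per_query if not results)
    return empty / len(results_per_query)


def latency_percentiles(latencies_ms):
    """P50, P95 and P99 from per-query latencies (needs at least two samples)."""
    cuts = statistics.quantiles(latencies_ms, n=100, method="inclusive")
    return {"p50": cuts[49], "p95": cuts[94], "p99": cuts[98]}


# Made-up sample data for illustration only.
recall = recall_at_k(["a", "b", "c"], ["a", "c", "z"])
empty_rate = zero_result_rate([["a"], [], ["b"], []])
latency = latency_percentiles([12.0, 15.5, 18.2, 22.0, 250.0])
```

Latency percentiles are usually more informative than an average here, since a handful of slow queries can dominate P95 and P99 while leaving P50 unchanged.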
Detailed free report within 5 business days. Zero commitment.