---
license: mit
language:
- en
tags:
- roberta
- ecommerce
- classification
datasets:
- amazon-esci
base_model:
- FacebookAI/roberta-base
pipeline_tag: text-classification
library_name: transformers
---

E-commerce Search Query Router
==============================

Model Description
-----------------

This model is a classifier fine-tuned from `roberta-base` to determine the optimal search strategy for e-commerce queries. It classifies a given query into one of two labels:

-   **`lexical_search`**: Indicates that the query is best handled by a traditional, keyword-based search engine such as Lucene with BM25 scoring. These are typically specific queries like SKUs, exact product names, or part numbers.

-   **`vector_search`**: Indicates that the query is better suited for a semantic, vector-based search. These are often ambiguous, conceptual, or "long-tail" queries where user intent is more important than specific keywords (e.g., "a gift for my dad who likes fishing").

The model is intended to be used as an intelligent "query router" in a hybrid search system, dynamically weighting the results from lexical and vector search engines to improve relevance.

Intended Use & Use Case
-----------------------

The primary use case for this model is to power a hybrid search relevance system. The intended workflow is as follows:

1.  A user enters a search query.
2.  The query is sent to this classifier model.
3.  The model returns probabilities for the `lexical_search` and `vector_search` classes.
4.  These probabilities are used as weights to blend the relevance scores from two separate search backends (e.g., Solr and Qdrant).
5.  The final, blended scores are used to rank the products shown to the user.

### How to Use

Here's how to use the model with the `transformers` library `pipeline`:

```python
from transformers import pipeline

router_pipeline = pipeline(
    "text-classification",
    model="timofeyk/roberta-query-router-ecommerce",
    return_all_scores=True
)

# Example of a conceptual query
conceptual_query = "father day gift"
# Example of a specific query
specific_query = "16x16 pillow cover"

queries = [conceptual_query, specific_query]

for q in queries:
    print(f"Predicting label for query: {q}")
    results = router_pipeline(q)
    print(results[0])
    # Expected output might look like:
    # [{'label': 'lexical_search', 'score': 0.46258628368377686}, {'label': 'vector_search', 'score': 0.5374137163162231}]

    scores = {item['label']: item['score'] for item in results[0]}
    w_vector = scores['vector_search']
    w_lexical = scores['lexical_search']

    print(f"Vector Search Weight: {w_vector:.2f}")
    print(f"Lexical Search Weight: {w_lexical:.2f}")
```
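
The two probabilities can then be used directly as blending weights (steps 4-5 of the workflow above). The snippet below is a minimal sketch of that blending step, not part of this model: `solr_scores` and `qdrant_scores` are assumed to be dictionaries mapping product IDs to already-normalized relevance scores returned by each backend for the same query.

```python
# Hypothetical blending step (illustration only): solr_scores and qdrant_scores
# are assumed to map product IDs to normalized relevance scores for one query.
def blend_scores(solr_scores, qdrant_scores, w_lexical, w_vector):
    product_ids = set(solr_scores) | set(qdrant_scores)
    blended = {
        pid: w_lexical * solr_scores.get(pid, 0.0) + w_vector * qdrant_scores.get(pid, 0.0)
        for pid in product_ids
    }
    # Rank products by the blended score, highest first
    return sorted(blended.items(), key=lambda kv: kv[1], reverse=True)

# Example with made-up scores:
# blend_scores({"sku-1": 0.9, "sku-2": 0.4}, {"sku-2": 0.8, "sku-3": 0.7}, w_lexical, w_vector)
```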

Training Data
-------------

This model was trained on a custom dataset of anonymized, real-world e-commerce queries, generated using the Amazon ESCI dataset as a source. The labels were produced programmatically based on search performance, creating a signal for the model to learn from (a sketch of this step follows the list below):

1.  A large set of user queries and their corresponding ground-truth "ideal" product results (based on user engagement) were collected.
2.  Each query was executed against both a Solr (lexical) and a Qdrant (vector) search engine.
3.  The **nDCG** relevance score was calculated for both result sets against the ground truth.
4.  The query was labeled `lexical_search` if Solr achieved a higher nDCG score, and `vector_search` otherwise.
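
As an illustration only, the labeling step can be sketched as below. The search clients and the nDCG implementation are assumptions passed in as callables; they are not part of this repository.

```python
# Illustrative sketch of the programmatic labeling described above.
# search_lexical, search_vector, and ndcg_at_k are hypothetical callables
# (e.g., thin wrappers around a Solr client, a Qdrant client, and an nDCG metric).
def label_query(query, ideal_results, search_lexical, search_vector, ndcg_at_k, k=10):
    lexical_hits = search_lexical(query, k)   # results from the lexical engine
    vector_hits = search_vector(query, k)     # results from the vector engine
    ndcg_lexical = ndcg_at_k(lexical_hits, ideal_results, k)
    ndcg_vector = ndcg_at_k(vector_hits, ideal_results, k)
    return "lexical_search" if ndcg_lexical > ndcg_vector else "vector_search"
```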

Training Procedure
------------------

The model was fine-tuned using the Hugging Face `Trainer`. To account for a potential class imbalance in the training data, a custom `Trainer` with a **weighted CrossEntropyLoss** was used, preventing the model from favoring the majority class.
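
The custom `Trainer` is not shipped with this repository; the snippet below is a minimal sketch of the common pattern for wiring a weighted CrossEntropyLoss into the Hugging Face `Trainer`. The class-weight values are placeholders, not the actual weights used in training.

```python
from torch import nn
from transformers import Trainer

class WeightedLossTrainer(Trainer):
    """Trainer variant that applies per-class weights in the loss."""

    def __init__(self, *args, class_weights=None, **kwargs):
        super().__init__(*args, **kwargs)
        # Placeholder weights, e.g. torch.tensor([1.3, 0.8]); the real values
        # would be derived from the label distribution of the training set.
        self.class_weights = class_weights

    def compute_loss(self, model, inputs, return_outputs=False, **kwargs):
        labels = inputs.pop("labels")
        outputs = model(**inputs)
        logits = outputs.logits
        loss_fct = nn.CrossEntropyLoss(weight=self.class_weights.to(logits.device))
        loss = loss_fct(logits.view(-1, logits.size(-1)), labels.view(-1))
        return (loss, outputs) if return_outputs else loss
```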

### Training params
```
TrainingArguments(
    learning_rate=1e-05,
    lr_scheduler_type=SchedulerType.COSINE,
    max_grad_norm=1.0,
    num_train_epochs=3,
    optim=OptimizerNames.ADAMW_TORCH_FUSED,
    optim_args=None,
    per_device_eval_batch_size=128,
    per_device_train_batch_size=32,
    prediction_loss_only=False,
    warmup_ratio=0.05,
    weight_decay=0.01,
)
```

Citation
--------

If you use this model in your work, please consider citing it:

```
@misc{timofeyk_roberta-query-router-ecommerce,
  author = {Timofey Klyubin},
  title = {E-commerce Search Query Router},
  year = {2025},
  publisher = {Hugging Face},
  journal = {Hugging Face repository},
  howpublished = {\url{https://huggingface.co/timofeyk/roberta-query-router-ecommerce}}
}
```