# onesearch-family

**Repository Path**: gitstr/onesearch-family

## Basic Information

- **Project Name**: onesearch-family
- **Description**: No description available
- **Primary Language**: Unknown
- **License**: Not specified
- **Default Branch**: main
- **Homepage**: None
- **GVP Project**: No

## Statistics

- **Stars**: 0
- **Forks**: 0
- **Created**: 2026-05-02
- **Last Updated**: 2026-05-02

## Categories & Tags

**Categories**: Uncategorized
**Tags**: None

## README

🎉🎉 Congratulations: [OneSearch](https://arxiv.org/pdf/2509.03236) has been accepted by ICML'2026! 🎉🎉

# OneSearch Family

- HomePage: https://windsighiii.github.io/OneSearch/
- Corresponding Author: https://benchen4395.github.io/
- Paper: [OneSearch-V1](https://arxiv.org/abs/2509.03236), [OneSearch-V2](https://arxiv.org/abs/2603.24422)
- Original Dataset: [KuaiSearch](https://github.com/benchen4395/KuaiSearch)

This repository collects the training components and data tools for **OneSearch**, a generative retrieval and recommendation system built on large language models. The system maps queries and items to discrete hierarchical **Semantic IDs (SIDs)** and trains an LLM to generate target SIDs end-to-end.

- All code and data examples have been released. If you have any questions, feel free to contact us (benchen4395@gmail.com).
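As a mental model of the hierarchical SID idea described above, here is a minimal, self-contained numpy sketch of residual-quantization assignment. The random codebooks and the `assign_sid` helper are illustrative stand-ins, not the trained RQ centroids or code from `rq-opq/`:

```python
import numpy as np

# Illustrative sketch: assign a hierarchical Semantic ID (SID) by
# residual quantization. Codebooks are random placeholders here.
rng = np.random.default_rng(0)
dim, codes_per_layer, n_layers = 8, 16, 3
codebooks = [rng.normal(size=(codes_per_layer, dim)) for _ in range(n_layers)]

def assign_sid(emb, codebooks):
    """Quantize an embedding layer by layer; each layer's residual is
    passed to the next, yielding a coarse-to-fine code sequence."""
    residual = emb.astype(float)
    codes = []
    for cb in codebooks:
        dists = np.linalg.norm(cb - residual, axis=1)  # L2 distance to centroids
        idx = int(np.argmin(dists))
        codes.append(idx)
        residual = residual - cb[idx]  # pass the residual to the next layer
    return "_".join(map(str, codes))

sid = assign_sid(rng.normal(size=dim), codebooks)
print(sid)  # one code per quantization layer, coarse -> fine
```

Because each layer quantizes only the residual of the previous one, earlier SID positions carry coarser, category-level semantics.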
## Repository Structure

```
onesearch-family/
├── rq-opq/                 # Semantic ID construction: RQ + OPQ quantization
├── self-distillation/      # Self-distillation fine-tuning (SDFT)
├── rlhf/                   # Reinforcement learning from human feedback
│   ├── reward.py           # Composite item-level reward design
│   ├── tpma.py             # Token-Position Marginal Advantage (TPMA-GRPO)
│   └── listwisedpo.py      # Listwise DPO trainer utilities
└── dataset_examples.md     # SFT data format examples for all training stages
```

---

## Semantic ID Format

Each query or item is assigned a multi-level semantic ID produced by the RQ-OPQ pipeline:

```
a_b_c_d
```

Coarser levels (`a`, `b`, `c`) capture the semantic category; the finer levels (`d`) are generated by OPQ sub-codes over the residual embedding.

---

## Modules

### 1. RQ-OPQ — Semantic ID Construction

> Path: `rq-opq/`

Converts raw embeddings into discrete hierarchical semantic IDs through a keyword-enhancement step followed by two sequential quantization steps.

**Requirements:**

```bash
pip install faiss-cpu pyahocorasick numpy pandas tqdm
```

**Step 0 — Keyword Enhancement (`keyword_enhance.py`)**

Before quantization, raw query/item embeddings are enriched with structured attribute signals from a keyword dictionary covering 18 attribute types (Table 1, `keyword_cases.txt`):

```
fused_emb = L2_norm( 0.5 × raw_emb + 0.5 × mean(matched_keyword_embs) )
```

*Table 1. The 18 structured attribute types, extracted with NER on the e-commerce search platform.*

| **Attributes** | | | | | | | | |
|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|
| Entity | Modifier | Brand | Material | Style | Function | Location | Audience | Color |
| Scene | Specifications | Price | Model | Anchor | Series | Marketing | Season | Pattern |

```python
keys, emb_arrays, source_embs, ners = keyword_enhance_emb(
    infile=query_txt_with_emb,
    dict_path=ner_dict_path,
    dim=embed_dim)
```

**Step 1 — RQ: Residual Quantization (`rq_dynamic.py`)**

Performs multi-layer residual K-means clustering.
Each vector is iteratively assigned to a cluster centroid, and the residual is passed to the next layer.

```bash
nohup python -u rq_dynamic.py \
    --folder {output_folder} \
    --querytxt {query_txt_with_emb} \
    --itemtxt {item_txt_with_emb} \
    --keyfile {merged_key.txt} \
    --embfile {merged_emb.pkl} \
    --k '1024-1024-1024' \
    &> log/rq &
```

| Argument | Type | Description |
|----------|------|-------------|
| `--folder` | str | Output folder path |
| `--querytxt` | str | Query file with embeddings, format: `query\tners\temb` |
| `--itemtxt` | str | Item file with embeddings, format: `item_id\ttitle\tners\temb` |
| `--keyfile` | str | Merged key file path |
| `--embfile` | str | Merged embedding pkl file path |
| `--querynum` | int | Number of queries (when merged files already exist) |
| `--k` | str | Cluster counts per layer, e.g. `1024-1024-1024` |
| `--referfolder` | str | Existing RQ model folder (for incremental training) |
| `--referlayer` | int | Number of completed layers in the existing model |
| `--isnorm` | int | L2-normalize residuals: 0 = no, 1 = yes |

Output:

```
{output_folder}/
├── RQCodeList-{L}.pkl        # cluster centroids per layer
├── IdList-{L}.pkl            # cluster assignments per layer
├── residual_emb_{L}.pkl      # residual embeddings per layer
├── cluster.pkl               # hierarchical cluster statistics for items
├── results-l{L}-query.txt    # RQ semantic IDs for queries
└── results-l{L}-item.txt     # RQ semantic IDs for items
```

**Step 2 — OPQ: Optimized Product Quantization (`rq_opq.py`)**

Applies FAISS OPQ to the RQ residual embeddings, appending 2 sub-codes to each RQ ID:

```
rq1_rq2_rq3 → rq1_rq2_rq3_opq1_opq2
```

```python
# Train on combined query + item embeddings
main(query_sid_file, query_residual_emb_file,
     item_sid_file, item_residual_emb_file, model_path)

# Infer: assign OPQ codes to any key+embedding file
get_opq_ids(model_path, txt_path, emb_path, output_path)
```

Input: `key\trq_sid` text file + pickle embedding array of shape `(N, dim)`.
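The sub-coding idea behind this step can be sketched as follows. The random sub-codebooks stand in for the trained OPQ rotation and centroids, and `append_opq_codes` is a hypothetical helper, not part of `rq_opq.py`:

```python
import numpy as np

# Illustrative product-quantization sketch: split the RQ residual into
# two sub-vectors, quantize each against its own (here: random) codebook,
# and append the codes to the RQ SID.
rng = np.random.default_rng(1)
dim, n_sub, codes = 8, 2, 256
sub_dim = dim // n_sub
sub_codebooks = [rng.normal(size=(codes, sub_dim)) for _ in range(n_sub)]

def append_opq_codes(rq_sid, residual):
    parts = residual.reshape(n_sub, sub_dim)
    opq_codes = []
    for i, (cb, part) in enumerate(zip(sub_codebooks, parts)):
        idx = int(np.argmin(np.linalg.norm(cb - part, axis=1)))
        opq_codes.append(idx + i * codes)  # offset later codes to avoid token collision
    return rq_sid + "_" + "_".join(map(str, opq_codes))

full_sid = append_opq_codes("12_875_3", rng.normal(size=dim))
print(full_sid)
```

The `i * codes` offset mirrors the +256 shift applied to the second OPQ code so that the two sub-codebooks map to disjoint token ranges.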
Output: `key\trq_sid_opq0_opq1` (the second OPQ code is offset by +256 to avoid token collisions).

---

### 2. SFT with Self-Distillation

> Path: `self-distillation/`

OneSearch is trained in three progressive SFT stages. See [`dataset_examples.md`](dataset_examples.md) for concrete data examples.

| Stage | Name | Task |
|-------|------|------|
| Stage 1 | Semantic Alignment | query/item text ↔ SID, SID → category, CoT keyword tasks |
| Stage 2 | Q-I Co-occurrence | query ↔ item retrieval (text and SID) |
| Stage 3 | User Personalization | query + user history → item SID |

Implements **Joint Self-Distillation Fine-Tuning (Joint SDFT)**, integrated into the [LLaMA-Factory](https://github.com/hiyouga/LLaMA-Factory) training framework. The model simultaneously acts as teacher and student on different views of the same input.

**Key idea:** The teacher receives a richer input (e.g., full keyword context); the student receives a reduced input (e.g., the query only). A KL-divergence loss distills the teacher's soft predictions into the student, improving generalization without a separate teacher model.

**Two training modes:**

| Mode | Description |
|------|-------------|
| `joint` | Single model, two forward passes per step. Teacher and student share parameters; gradients are accumulated from all losses. |
| `ema` | The teacher is an independent copy updated from the student via an exponential moving average (EMA). No gradient flows through the teacher. |

**SDFT loss:**

```
L_total = L_CE(student) + w_teacher * L_CE(teacher) + w_kl * KL(teacher || student)
```

**Training shell**:

```bash
llamafactory-cli train onesearch_sft_and_sd.yaml
```

**Training config** (`onesearch_sft_and_sd.yaml`):

```yaml
use_joint_sdft: true
sdft_mode: joint
sdft_kl_weight: 0.1
sdft_distill_temperature: 1.0
sdft_teacher_ce_weight: 1.0
sdft_ema_decay: 0.9            # only used in ema mode
sdft_keyword_pattern: "..."    # regex to locate the keyword span in the input
```

---

### 3. RLHF

> Path: `rlhf/`

Reinforcement learning components for OneSearch-V2, covering reward design and preference optimization.

#### 3.1 Composite Reward (`reward.py`)

An item-level reward composed of three additive signals:

```
R_item = R_C&O + R_CTR + R_Rel
```

#### 3.2 Token-Position Marginal Advantage — TPMA-GRPO (`tpma.py`)

Extends GRPO by decomposing the sequence-level reward into **position-level marginal contributions**, respecting the coarse-to-fine causal structure of SID generation.

#### 3.3 Listwise DPO (`listwisedpo.py`)

Extends standard pairwise DPO to a **listwise objective (S-DPO)** with multiple negatives per positive.

**Training shell**:

```bash
python train_grpo.py
```

---

## Data Format Examples

See [`dataset_examples.md`](dataset_examples.md) for input/output examples across all SFT stages.

- All data can be constructed with our open dataset [KuaiSearch](https://arxiv.org/pdf/2602.11518).

---

## Citation

### [OneSearch-V1](https://arxiv.org/abs/2509.03236)

```
@misc{chen2025onesearchpreliminaryexplorationunified,
      title={OneSearch: A Preliminary Exploration of the Unified End-to-End Generative Framework for E-commerce Search},
      author={Ben Chen and Xian Guo and Siyuan Wang and Zihan Liang and Yue Lv and Yufei Ma and Xinlong Xiao and Bowen Xue and Xuxin Zhang and Ying Yang and Huangyu Dai and Xing Xu and Tong Zhao and Mingcan Peng and Xiaoyang Zheng and Chao Wang and Qihang Zhao and Zhixin Zhai and Yang Zhao and Bochao Liu and Jingshan Lv and Xiao Liang and Yuqing Ding and Jing Chen and Chenyi Lei and Wenwu Ou and Han Li and Kun Gai},
      year={2025},
      eprint={2509.03236},
      archivePrefix={arXiv},
      primaryClass={cs.IR},
      url={https://arxiv.org/abs/2509.03236},
}
```

### [OneSearch-V2](https://arxiv.org/abs/2603.24422)

```
@misc{chen2026onesearchv2latentreasoningenhanced,
      title={OneSearch-V2: The Latent Reasoning Enhanced Self-distillation Generative Search Framework},
      author={Ben Chen and Siyuan Wang and Yufei Ma and Zihan Liang and Xuxin Zhang and Yue Lv and Ying Yang and
Huangyu Dai and Lingtao Mao and Tong Zhao and Zhipeng Qian and Xinyu Sun and Zhixin Zhai and Yang Zhao and Bochao Liu and Jingshan Lv and Xiao Liang and Hui Kong and Jing Chen and Han Li and Chenyi Lei and Wenwu Ou and Kun Gai},
      year={2026},
      eprint={2603.24422},
      archivePrefix={arXiv},
      primaryClass={cs.IR},
      url={https://arxiv.org/abs/2603.24422},
}
```
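As a closing worked example, the Joint SDFT objective from the self-distillation module, `L_total = L_CE(student) + w_teacher * L_CE(teacher) + w_kl * KL(teacher || student)`, can be computed with a minimal numpy sketch. Random logits stand in for the teacher/student forward passes, and the weight values only mirror the `sdft_*` config keys; this is not the repository implementation:

```python
import numpy as np

# Sketch of the Joint SDFT loss with random stand-in logits.
rng = np.random.default_rng(0)
seq_len, vocab = 4, 8
teacher_logits = rng.normal(size=(seq_len, vocab))
student_logits = rng.normal(size=(seq_len, vocab))
labels = rng.integers(0, vocab, size=seq_len)
w_teacher, w_kl, tau = 1.0, 0.1, 1.0  # illustrative values of the sdft_* config keys

def log_softmax(x, tau=1.0):
    z = x / tau
    z = z - z.max(axis=-1, keepdims=True)
    return z - np.log(np.exp(z).sum(axis=-1, keepdims=True))

def cross_entropy(logits, labels):
    logp = log_softmax(logits)
    return -logp[np.arange(len(labels)), labels].mean()

# KL(teacher || student) at the distillation temperature, averaged over positions
logp_t = log_softmax(teacher_logits, tau)
logp_s = log_softmax(student_logits, tau)
kl = (np.exp(logp_t) * (logp_t - logp_s)).sum(axis=-1).mean()

loss = (cross_entropy(student_logits, labels)
        + w_teacher * cross_entropy(teacher_logits, labels)
        + w_kl * kl)
print(round(float(loss), 4))
```

In `joint` mode the same parameters produce both logit sets (two forward passes), while in `ema` mode the teacher pass would use the EMA copy and contribute no gradient.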