---
language: en
library_name: bm25s
tags:
- bm25
- bm25s
- retrieval
- search
- lexical
---

# BM25S Index

This is a BM25S index created with the [`bm25s` library](https://github.com/xhluca/bm25s) (version `0.1.7`), an ultra-fast implementation of BM25. It can be used for lexical retrieval tasks.

💻[BM25S GitHub Repository](https://github.com/xhluca/bm25s)\
🌐[BM25S Homepage](https://bm25s.github.io)

## Installation

You can install the `bm25s` library with `pip`:

```bash
pip install "bm25s==0.1.7"

# Include extra dependencies like stemmer
pip install "bm25s[full]==0.1.7"

# For huggingface hub usage
pip install huggingface_hub
```

## Loading a `bm25s` index

You can use this index for information retrieval tasks. Here is an example:

```python
import bm25s
from bm25s.hf import BM25HF

# Load the index
retriever = BM25HF.load_from_hub("yuchenlin/BM25S_index_Llama-3-Magpie-Pro-1M-v0.1")

# You can retrieve now
query = "a cat is a feline"
results = retriever.retrieve(bm25s.tokenize(query), k=3)
```
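Assuming the default `return_as="tuple"` behavior of `bm25s` 0.1.7, the returned object can be unpacked into retrieved documents (or document indices, if no corpus was loaded) and their scores, one row per query:

```python
# documents and scores are arrays of shape (n_queries, k).
# documents contains corpus entries when a corpus is loaded
# (e.g. load_corpus=True); otherwise it contains document indices.
documents, scores = results
for rank in range(documents.shape[1]):
    print(f"Rank {rank + 1} (score {scores[0, rank]:.2f}): {documents[0, rank]}")
```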

## Saving a `bm25s` index

You can save a `bm25s` index to the Hugging Face Hub. Here is an example:

```python
import bm25s
from bm25s.hf import BM25HF

corpus = [
    "a cat is a feline and likes to purr",
    "a dog is the human's best friend and loves to play",
    "a bird is a beautiful animal that can fly",
    "a fish is a creature that lives in water and swims",
]

retriever = BM25HF(corpus=corpus)
retriever.index(bm25s.tokenize(corpus))

token = None  # You can get a token from the Hugging Face website
retriever.save_to_hub("yuchenlin/BM25S_index_Llama-3-Magpie-Pro-1M-v0.1", token=token)
```
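Instead of passing a token explicitly, you can authenticate once with the `login()` helper from `huggingface_hub`; subsequent Hub calls pick up the stored credential. This is a standard `huggingface_hub` workflow, not specific to `bm25s`:

```python
from huggingface_hub import login

# Prompts for (or accepts) a Hugging Face access token and stores it locally
login()

retriever.save_to_hub("yuchenlin/BM25S_index_Llama-3-Magpie-Pro-1M-v0.1")
```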

## Advanced usage

You can leverage more advanced features of the BM25S library during `load_from_hub`:

```python
# Load the corpus and index with memory-mapping (mmap=True) to reduce memory usage
retriever = BM25HF.load_from_hub("yuchenlin/BM25S_index_Llama-3-Magpie-Pro-1M-v0.1", load_corpus=True, mmap=True)

# Load a different branch/revision
retriever = BM25HF.load_from_hub("yuchenlin/BM25S_index_Llama-3-Magpie-Pro-1M-v0.1", revision="main")

# Change directory where the local files should be downloaded
retriever = BM25HF.load_from_hub("yuchenlin/BM25S_index_Llama-3-Magpie-Pro-1M-v0.1", local_dir="/path/to/dir")

# Load private repositories with a token:
retriever = BM25HF.load_from_hub("yuchenlin/BM25S_index_Llama-3-Magpie-Pro-1M-v0.1", token=token)
```
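If you installed the `[full]` extra, `bm25s.tokenize` can also apply a `PyStemmer` stemmer to your queries. The snippet below is only a sketch: stemming queries is only appropriate if the index itself was built with the same stemmer, which this card does not specify.

```python
import bm25s
import Stemmer  # provided by the PyStemmer package (installed via bm25s[full])

stemmer = Stemmer.Stemmer("english")
query_tokens = bm25s.tokenize("a cat is a feline", stemmer=stemmer)
results = retriever.retrieve(query_tokens, k=3)
```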

## Stats

This index was created from the following data:

| Statistic | Value |
| --- | --- |
| Number of documents | 920,259 |
| Number of tokens | 7,882,267 |
| Average tokens per document | 8.57 |

## Parameters

The index was created with the following parameters:

| Parameter | Value |
| --- | --- |
| k1 | `1.5` |
| b | `0.75` |
| delta | `0.5` |
| method | `lucene` |
| idf method | `lucene` |
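
For reference, `k1` controls term-frequency saturation, `b` controls document-length normalization, and `delta` only affects variants such as `bm25l`/`bm25+`. The snippet below is an illustrative sketch of a Lucene-style per-term BM25 contribution, not the library's actual implementation (which precomputes scores into sparse matrices):

```python
import math

def bm25_term_score(tf, doc_len, avg_doc_len, df, n_docs, k1=1.5, b=0.75):
    """Illustrative Lucene-style BM25 contribution of one query term."""
    # Lucene-style IDF: log(1 + (N - df + 0.5) / (df + 0.5))
    idf = math.log(1 + (n_docs - df + 0.5) / (df + 0.5))
    # Term-frequency saturation (k1) with document-length normalization (b)
    return idf * tf / (tf + k1 * (1 - b + b * doc_len / avg_doc_len))
```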