---
inference: false
language: en
license:
  - cc0-1.0
library_name: txtai
tags:
- sentence-similarity
datasets:
- arxiv_dataset
---

# arXiv txtai embeddings index

This is a [txtai](https://github.com/neuml/txtai) embeddings index for the [arXiv dataset](https://hf.co/datasets/arxiv_dataset) [metadata](https://info.arxiv.org/help/prep.html).

txtai must be [installed](https://neuml.github.io/txtai/install/) to use this index.

## Example

This index can be loaded from the Hugging Face Hub with txtai as shown below.

```python
from txtai.embeddings import Embeddings

# Load the index from the HF Hub
embeddings = Embeddings()
embeddings.load(provider="huggingface-hub", container="neuml/txtai-arxiv")

# Run a search and print the results
results = embeddings.search("txtai is an all-in-one embeddings database for semantic search, LLM orchestration and language model workflows.")
print(results)
```

## Use Cases

An embeddings index generated by txtai is a fully encapsulated index format. It doesn't require a database server or dependencies outside of the Python install.

The arXiv index works well as a fact-based context source for retrieval-augmented generation (RAG). In other words, search results from this index can be passed to LLM prompts as the context in which to answer questions.
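
As a rough illustration of the RAG pattern described above, the sketch below formats search results into a prompt. The `build_prompt` helper and the prompt wording are hypothetical and not part of txtai. The sketch also assumes each search result is a dict with a `text` field, which is how txtai returns results when content storage is enabled.

```python
def build_prompt(question, results):
    """Format search results as numbered context passages ahead of the question."""
    # Assumes each result is a dict with a "text" key (txtai with content storage enabled)
    context = "\n".join(f"[{i}] {row['text']}" for i, row in enumerate(results, start=1))
    return (
        "Answer the question using only the context below.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}\nAnswer:"
    )

# Stand-in rows shaped like search output; with the loaded index these
# would come from embeddings.search(question, 3) instead
sample_results = [
    {"id": "a", "text": "First passage.", "score": 0.9},
    {"id": "b", "text": "Second passage.", "score": 0.8},
]

prompt = build_prompt("What is X?", sample_results)
print(prompt)
```

The resulting string can be sent to any LLM. Numbering the passages is one possible design choice that makes it easier to ask the model to cite which passage supports its answer.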

Additionally, this index can help identify articles to cite in research. Searching with a title and abstract pair returns similar existing articles.
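
A minimal sketch of the citation lookup might look like the following. The `citation_query` and `find_similar` helpers are hypothetical, and joining the title and abstract with a single space is an assumption about a reasonable query format, not a documented requirement of this index.

```python
def citation_query(title, abstract):
    """Join a title and abstract into one query string with normalized whitespace."""
    # Assumption: a single space-separated string is a reasonable query format
    return " ".join(f"{title} {abstract}".split())

def find_similar(embeddings, title, abstract, limit=5):
    """Search an already loaded txtai Embeddings instance for similar articles."""
    return embeddings.search(citation_query(title, abstract), limit)

query = citation_query("  A Title ", "An\nabstract  here")
print(query)
```

With the index loaded as in the example above, `find_similar(embeddings, title, abstract)` would return the closest matching articles for a draft paper.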

## Build the index

The following steps show how to build this index.

- Install required build dependencies
```bash
pip install txtchat datasets
```

- Follow these [instructions](https://huggingface.co/datasets/arxiv_dataset/blob/main/arxiv_dataset.py#L67) to download the dataset

- Build txtai-arxiv index
```bash
python -m txtchat.data.arxiv.index \
       -d <path to directory with file downloaded in previous step> \
       -o txtai-arxiv
```

## More information

See the following links for more information on the arXiv metadata dataset.

- [Dataset on Hugging Face](https://huggingface.co/datasets/arxiv_dataset)
- [Dataset on Kaggle](https://www.kaggle.com/datasets/Cornell-University/arxiv)
- [Metadata description](https://info.arxiv.org/help/prep.html)