arxiv:2309.07597

C-Pack: Packaged Resources To Advance General Chinese Embedding

Published on Sep 14, 2023

Authors:

Shitao Xiao, Zheng Liu, Peitian Zhang, Niklas Muennighoff

Abstract

We introduce C-Pack, a package of resources that significantly advance the field of general Chinese embeddings. C-Pack includes three critical resources. 1) C-MTEB is a comprehensive benchmark for Chinese text embeddings covering 6 tasks and 35 datasets. 2) C-MTP is a massive text embedding dataset curated from labeled and unlabeled Chinese corpora for training embedding models. 3) C-TEM is a family of embedding models covering multiple sizes. At the time of release, our models outperform all prior Chinese text embeddings on C-MTEB by up to 10%. We also integrate and optimize the entire suite of training methods for C-TEM. Along with our resources on general Chinese embedding, we release our data and models for English text embeddings. The English models achieve state-of-the-art performance on the MTEB benchmark; meanwhile, our released English data is 2 times larger than the Chinese data. All these resources are made publicly available at https://github.com/FlagOpen/FlagEmbedding.

Models citing this paper 49

Datasets citing this paper 2

Spaces citing this paper 681

Collections including this paper 1