Conan-Embedding-v2

What's New?

  • Performance

    Conan-Embedding-v2 has now achieved SOTA performance on the MTEB leaderboard for both Chinese and English.

  • Cross-lingual Retrieval between Chinese and English

    Conan-Embedding-v2 supports cross-lingual retrieval between Chinese and English samples.

  • Longer Context Support

    Conan-Embedding-v2 now supports a context length of 32,768 tokens.

  • Conan 1.4B Large Model Trained from Scratch

    The vocabulary and the language model are trained from scratch, so both the pre-trained model and the tokenizer vocabulary are tailored to the embedding scenario, delivering stronger performance.

    The Conan-1.4B base model will be open-sourced. Community members can then train their own embedding models on top of it.
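
Cross-lingual retrieval reduces to nearest-neighbor search in a shared embedding space: a query and candidate passages are embedded, then ranked by cosine similarity regardless of language. A minimal sketch of that ranking step, using toy hand-written vectors in place of real model outputs:

```python
import math

def cosine(a, b):
    # Cosine similarity between two equal-length vectors.
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

# Toy 4-dim "embeddings"; a real deployment would call the model instead.
query_en = [0.9, 0.1, 0.0, 0.1]   # English query
doc_zh = [0.8, 0.2, 0.1, 0.0]     # semantically matching Chinese passage
doc_other = [0.0, 0.1, 0.9, 0.2]  # unrelated passage

scores = {"doc_zh": cosine(query_en, doc_zh),
          "doc_other": cosine(query_en, doc_other)}
best = max(scores, key=scores.get)
print(best)  # the semantically matching passage ranks first
```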

Performance

Performance of Conan-Embedding-v2 on MTEB for Chinese and English

MTEB Results

English

| Model \ Metric | Class. Acc. (12) | Clust. V-Meas. (11) | PairClass. AP (3) | Rerank. MAP (4) | Retri. nDCG@10 (15) | STS Spear. (12) | Summ. Spear. (1) | Avg. (56) |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| bge-multilingual-gemma2 | 88.08 | 54.65 | 85.97 | 59.72 | 59.24 | 83.88 | 31.20 | 69.88 |
| e5-mistral-7b-instruct | 79.89 | 51.44 | 88.42 | 49.78 | 57.62 | 84.32 | 36.57 | 67.98 |
| gte-Qwen2-7B-instruct | 86.58 | 56.92 | 85.90 | 61.42 | 59.11 | 83.06 | 31.35 | 69.95 |
| stella-en-1.5B-v5 | 87.63 | 57.69 | 88.07 | 61.21 | 61.01 | 84.51 | 31.49 | 71.19 |
| bge-en-icl | 88.95 | 57.89 | 88.14 | 59.86 | 62.16 | 84.24 | 30.77 | 71.67 |
| NV-Embed-v2 | 90.37 | 58.46 | 88.67 | 60.65 | 62.65 | 84.31 | 30.70 | 72.31 |
| Conan-embedding-v2 | 90.15 | 60.86 | 93.47 | 60.89 | 66.40 | 85.73 | 28.08 | 74.22 |

Chinese

| Model \ Metric | Class. Acc. (9) | Clust. V-Meas. (4) | PairClass. AP (2) | Rerank. MAP (4) | Retri. nDCG@10 (8) | STS Spear. (8) | Avg. (35) |
| --- | --- | --- | --- | --- | --- | --- | --- |
| e5-mistral-7b-instruct | 72.96 | 52.30 | 72.19 | 61.86 | 61.75 | 48.34 | 59.92 |
| gte-Qwen2-1.5B-instruct | 72.53 | 54.61 | 86.91 | 68.21 | 71.86 | 60.05 | 67.12 |
| bge-multilingual-gemma2 | 75.31 | 59.30 | 86.67 | 68.28 | 73.73 | 55.19 | 67.64 |
| gte-Qwen2-7B-instruct | 75.77 | 66.06 | 87.48 | 68.92 | 75.71 | 65.20 | 71.62 |
| xiaobu-embedding-v2 | 76.53 | 65.17 | 91.87 | 72.58 | 76.50 | 64.18 | 72.36 |
| Conan-embedding-v1 | 76.77 | 66.33 | 91.66 | 72.76 | 76.67 | 63.67 | 72.50 |
| Conan-embedding-v2 | 76.47 | 68.84 | 92.44 | 74.41 | 78.31 | 65.48 | 74.24 |

Model Detail

Model Structure

Conan-Embedding-v2 Structure:

SentenceTransformer(  
    (0): Transformer({
        'max_seq_length': 32768, 
        'do_lower_case': False
        }) with Transformer model: ConanEmbedModel,
    (1): Pooling({
        'word_embedding_dimension': 3584, 
        'pooling_mode_cls_token': False, 
        'pooling_mode_mean_tokens': True, 
        'pooling_mode_max_tokens': False, 
        'pooling_mode_mean_sqrt_len_tokens': False, 
        'pooling_mode_weightedmean_tokens': False, 
        'pooling_mode_lasttoken': False, 
        'include_prompt': True
        }),
    (2): Dense({
        'in_features': 3584, 
        'out_features': 3584, 
        'bias': True, 
        'activation_function': 'torch.nn.modules.linear.Identity'
        })
)
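
The Pooling module above is configured for mean pooling (`pooling_mode_mean_tokens: True`): the sentence embedding is the average of the token vectors, with padding tokens masked out. A minimal re-implementation of that step on plain Python lists, for illustration only:

```python
def mean_pool(token_embeddings, attention_mask):
    """Average the token vectors whose mask is 1 (i.e., non-padding tokens)."""
    dim = len(token_embeddings[0])
    total = [0.0] * dim
    count = 0
    for vec, mask in zip(token_embeddings, attention_mask):
        if mask:
            count += 1
            for i in range(dim):
                total[i] += vec[i]
    return [t / count for t in total]

# Three tokens, the last one padding; embedding dim 2 for illustration
# (the real model uses dim 3584).
tokens = [[1.0, 2.0], [3.0, 4.0], [99.0, 99.0]]
mask = [1, 1, 0]
print(mean_pool(tokens, mask))  # [2.0, 3.0]
```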

Key Specifications of Conan-1.4B (Transformer):

  • Number of Parameters (excluding the Dense layer): 1.48B

  • Vocabulary Size: 150,000

  • Number of Layers: 8

  • Hidden Layer Dimension: 3584

  • Number of Attention Heads (GQA): 32 query heads and 8 key/value heads

  • Intermediate Dimension of FFN Layer: 8192

  • Maximum Context Window: 32,768 Tokens
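
With grouped-query attention (GQA), the 32 query heads share 8 key/value heads, so each KV head serves a group of 4 query heads. A quick sanity check of the head geometry, assuming the per-head dimension is `hidden_size / num_q_heads`:

```python
hidden_size = 3584
num_q_heads = 32
num_kv_heads = 8

head_dim = hidden_size // num_q_heads      # dimension of each attention head
group_size = num_q_heads // num_kv_heads   # query heads sharing one KV head
kv_proj_dim = num_kv_heads * head_dim      # width of the K (and V) projection

print(head_dim, group_size, kv_proj_dim)   # 112 4 896
```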

For more model details, please refer to model/modeling_conan.py and config.json, or stay tuned for the upcoming open-source release of Conan-1.4B Base Model.

Tokenizer

We trained the tokenizer on a large-scale multilingual dataset, producing a standard BBPE (Byte-level Byte Pair Encoding) tokenizer with a vocabulary size of 150,000.
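
Byte-level BPE starts from the 256 raw byte values, so any UTF-8 text (including Chinese) is representable without unknown tokens; the vocabulary is then grown by repeatedly merging the most frequent adjacent symbol pair. A toy sketch of a single merge step (not the actual training code):

```python
from collections import Counter

def most_frequent_pair(sequences):
    """Count adjacent symbol pairs across all byte sequences."""
    pairs = Counter()
    for seq in sequences:
        for a, b in zip(seq, seq[1:]):
            pairs[(a, b)] += 1
    return pairs.most_common(1)[0][0]

def merge_pair(seq, pair):
    """Replace every occurrence of `pair` with one merged symbol."""
    out, i = [], 0
    while i < len(seq):
        if i + 1 < len(seq) and (seq[i], seq[i + 1]) == pair:
            out.append(seq[i] + seq[i + 1])  # concatenate the two byte symbols
            i += 2
        else:
            out.append(seq[i])
            i += 1
    return out

# Corpus as lists of single-byte symbols.
corpus = [[bytes([b]) for b in "low lower lowest".encode("utf-8")]]
pair = most_frequent_pair(corpus)
corpus = [merge_pair(seq, pair) for seq in corpus]
print(pair, corpus[0][:4])
```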

Technical Report

We will soon release our technical report.

Using Conan-Embedding-v2

Use /model/conan_api_client.py to access our test API. A sample call is as follows:

import os

from modeling_conan import ConanClient

# Read the access key and secret key from environment variables.
AK = os.getenv("CONAN_AK")
SK = os.getenv("CONAN_SK")

client = ConanClient(ak=AK, sk=SK, url="https://ai.om.qq.com/api/conan/v2")
res = client.embed("Hello!")
print(res)

This is a temporary access method; please contact us to obtain an access token.

In the future, we will provide high-performance, cost-effective, and reliable Embedding services on Tencent Cloud.


About

Created by the Tencent BAC Group. All rights reserved.
