NMIXX-INSTRUCTOR

This repository contains a SentenceTransformer model built on instructor-base and fine-tuned with a triplet-loss setup on the nmixx-fin/NMIXX_train dataset. It produces sentence embeddings for Korean financial text and is optimized for semantic similarity tasks in the finance domain.
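
For readers unfamiliar with triplet-loss training, the following is a minimal sketch of the general idea, not the training code used for this model. The cosine distance, the margin of 0.5, the toy 2-D vectors, and the function name `triplet_loss` are all assumptions chosen for illustration; the model card does not state the actual settings.

```python
import torch
import torch.nn.functional as F


def triplet_loss(anchor, positive, negative, margin=0.5):
    """Cosine-distance triplet loss, averaged over the batch.

    loss = max(0, d(anchor, positive) - d(anchor, negative) + margin)
    The margin value is an illustrative assumption.
    """
    d_pos = 1.0 - F.cosine_similarity(anchor, positive, dim=1)
    d_neg = 1.0 - F.cosine_similarity(anchor, negative, dim=1)
    return torch.clamp(d_pos - d_neg + margin, min=0.0).mean()


# Toy 2-D vectors standing in for encoder outputs (hypothetical values).
anchor = torch.tensor([[1.0, 0.0]])
positive = torch.tensor([[1.0, 0.0]])  # same direction as the anchor
negative = torch.tensor([[0.0, 1.0]])  # orthogonal to the anchor

# The negative is already farther away than the margin requires: zero loss.
easy = triplet_loss(anchor, positive, negative)

# Roles swapped: the "positive" is far and the "negative" is close, so the loss is large.
hard = triplet_loss(anchor, negative, positive)

print("easy triplet loss:", easy.item())
print("hard triplet loss:", hard.item())
```

The loss is zero once the positive is closer to the anchor than the negative by at least the margin, so training only updates the encoder on triplets that still violate that ordering. PyTorch also ships `torch.nn.TripletMarginWithDistanceLoss`, which could replace the hand-written function here.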


How to Use

from transformers import AutoTokenizer, T5EncoderModel
import torch
import torch.nn.functional as F

# 1. Load the tokenizer and model with T5EncoderModel (encoder only)
repo_name = "nmixx-fin/nmixx-instructor"
tokenizer = AutoTokenizer.from_pretrained(repo_name)
model = T5EncoderModel.from_pretrained(repo_name)

# 2. Move to GPU if available
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model.to(device)
model.eval()

# 3. Prepare input sentences
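sentences = [
    # "This model provides embeddings specialized for the Korean financial domain."
    "이 모델은 한국 금융 도메인에 특화된 임베딩을 제공합니다.",
    # "It is a sentence transformer fine-tuned on the NMIXX dataset."
    "NMIXX 데이터셋으로 fine-tuning된 sentence transformer입니다.",
]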

# 4. Tokenize
encoded_input = tokenizer(
    sentences, padding=True, truncation=True, max_length=512, return_tensors="pt"
)
input_ids = encoded_input["input_ids"].to(device)
attention_mask = encoded_input["attention_mask"].to(device)

# 5. Forward pass (encoder only)
with torch.no_grad():
    model_output = model(input_ids=input_ids, attention_mask=attention_mask)


# 6. CLS-style pooling (the T5 tokenizer adds no [CLS] token, so the first token's hidden state stands in for it)
sentence_embeddings = model_output.last_hidden_state[:, 0]

# 7. L2 Normalization
sentence_embeddings = F.normalize(sentence_embeddings, p=2, dim=1)

print("Sentence embeddings shape:", sentence_embeddings.shape)
print(sentence_embeddings.cpu())
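
Since the embeddings are L2-normalized, a semantic similarity score can be computed as a plain dot product. The sketch below illustrates this with random stand-in vectors so that it runs without downloading the model; the width of 768 and the three-sentence batch are assumptions for illustration, and in practice `sentence_embeddings` would be the tensor produced by the steps above.

```python
import torch
import torch.nn.functional as F

# Stand-in for the model output: 3 random vectors of width 768 (hypothetical size),
# L2-normalized the same way as in step 7.
torch.manual_seed(0)
sentence_embeddings = F.normalize(torch.randn(3, 768), p=2, dim=1)

# For unit-length rows, the dot product equals the cosine similarity.
similarity = sentence_embeddings @ sentence_embeddings.T  # shape (3, 3)

# Rank the other sentences against sentence 0, most similar first.
scores = similarity[0, 1:]
ranking = torch.argsort(scores, descending=True) + 1  # +1 restores the original indices

print(similarity)
print("Sentences ranked by similarity to sentence 0:", ranking.tolist())
```

The diagonal of `similarity` is 1 (each sentence compared with itself), and off-diagonal entries fall in the range [-1, 1], with larger values indicating closer meaning.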

Model size: 223M params (Safetensors)
Tensor type: BF16