---
license: apache-2.0
base_model: hkunlp/instructor-base
tags:
- sentence-transformers
- sentence-similarity
- feature-extraction
- Korean
- financial-nlp
- nmixx
language:
- ko
datasets:
- nmixx-fin/NMIXX_train
pipeline_tag: sentence-similarity
---

# NMIXX-INSTRUCTOR

This repository contains a SentenceTransformer model based on `hkunlp/instructor-base`, fine-tuned with a triplet-loss setup on the `nmixx-fin/NMIXX_train` dataset. It produces high-quality sentence embeddings for Korean financial text, optimized for semantic similarity tasks in the finance domain.

---

## How to Use

```python
from transformers import AutoTokenizer, T5EncoderModel
import torch
import torch.nn.functional as F

# 1. Load the tokenizer and model (T5EncoderModel: encoder only)
repo_name = "nmixx-fin/nmixx-instructor"
tokenizer = AutoTokenizer.from_pretrained(repo_name)
model = T5EncoderModel.from_pretrained(repo_name)

# 2. Move to GPU if available
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model.to(device)
model.eval()

# 3. Prepare input sentences
sentences = [
    "이 모델은 한국 금융 도메인에 특화된 임베딩을 제공합니다.",
    "NMIXX 데이터셋으로 fine-tuning된 sentence transformer입니다.",
]

# 4. Tokenize
encoded_input = tokenizer(
    sentences,
    padding=True,
    truncation=True,
    max_length=512,
    return_tensors="pt",
)
input_ids = encoded_input["input_ids"].to(device)
attention_mask = encoded_input["attention_mask"].to(device)

# 5. Forward pass (encoder only)
with torch.no_grad():
    model_output = model(input_ids=input_ids, attention_mask=attention_mask)

# 6. CLS pooling: use the first token's hidden state as the sentence embedding
sentence_embeddings = model_output.last_hidden_state[:, 0]

# 7. L2 normalization
sentence_embeddings = F.normalize(sentence_embeddings, p=2, dim=1)

print("Sentence embeddings shape:", sentence_embeddings.shape)
print(sentence_embeddings.cpu())
```
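Because the embeddings above are L2-normalized, semantic similarity between sentences reduces to a dot product, which equals cosine similarity on unit vectors. The snippet below is a minimal sketch that continues directly from the code above, reusing its `sentence_embeddings` tensor; no new model calls or identifiers are introduced.

```python
# Pairwise similarity between the example sentences.
# On L2-normalized vectors the dot product is the cosine similarity,
# so values close to 1.0 indicate semantically similar sentences.
similarity_matrix = sentence_embeddings @ sentence_embeddings.T

print("Pairwise cosine similarity:")
print(similarity_matrix.cpu())
```

For retrieval-style use (e.g., matching a query against a corpus of Korean financial documents), embed both sides with the same pipeline and rank corpus sentences by this similarity score.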