---
license: apache-2.0
base_model: hkunlp/instructor-base
tags:
- sentence-transformers
- sentence-similarity
- feature-extraction
- Korean
- financial-nlp
- nmixx
language:
- ko
datasets:
- nmixx-fin/NMIXX_train
pipeline_tag: sentence-similarity
---

# NMIXX-INSTRUCTOR

This repository contains a SentenceTransformer model based on `hkunlp/instructor-base`, fine-tuned with a triplet-loss setup on the `nmixx-fin/NMIXX_train` dataset. It produces high-quality sentence embeddings for Korean financial text, optimized for semantic similarity tasks in the finance domain.

---

## How to Use

```python
from transformers import AutoTokenizer, T5EncoderModel
import torch
import torch.nn.functional as F

# 1. Load the tokenizer and model (T5EncoderModel: encoder only)
repo_name = "nmixx-fin/nmixx-instructor"
tokenizer = AutoTokenizer.from_pretrained(repo_name)
model = T5EncoderModel.from_pretrained(repo_name)

# 2. Move to GPU if available
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model.to(device)
model.eval()

# 3. Prepare input sentences
sentences = [
    "이 모델은 한국 금융 도메인에 특화된 임베딩을 제공합니다.",
    "NMIXX 데이터셋으로 fine-tuning된 sentence transformer입니다.",
]

# 4. Tokenize
encoded_input = tokenizer(
    sentences,
    padding=True,
    truncation=True,
    max_length=512,
    return_tensors="pt",
)
input_ids = encoded_input["input_ids"].to(device)
attention_mask = encoded_input["attention_mask"].to(device)

# 5. Forward pass (encoder only)
with torch.no_grad():
    model_output = model(input_ids=input_ids, attention_mask=attention_mask)

# 6. CLS pooling: use the first token's hidden state as the sentence embedding
sentence_embeddings = model_output.last_hidden_state[:, 0]

# 7. L2 normalization
sentence_embeddings = F.normalize(sentence_embeddings, p=2, dim=1)

print("Sentence embeddings shape:", sentence_embeddings.shape)
print(sentence_embeddings.cpu())
```
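Because the embeddings above are L2-normalized, semantic similarity between sentences reduces to a dot product, which equals cosine similarity on unit vectors. The snippet below is a minimal sketch that continues directly from the code above, reusing its `sentence_embeddings` tensor; no new model calls or identifiers are introduced.

```python
# Pairwise similarity between the example sentences.
# On L2-normalized vectors the dot product is the cosine similarity,
# so values close to 1.0 indicate semantically similar sentences.
similarity_matrix = sentence_embeddings @ sentence_embeddings.T

print("Pairwise cosine similarity:")
print(similarity_matrix.cpu())
```

For retrieval-style use (e.g., matching a query against a corpus of Korean financial documents), embed both sides with the same pipeline and rank corpus sentences by this similarity score.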