Hikka-Forge: Fine-tuned Multilingual Sentence Transformer for Anime Semantic Search (UA/EN)
This is a sentence-transformers model fine-tuned from sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2. It is specifically trained to map Ukrainian and English sentences and paragraphs from the anime domain into a 384-dimensional dense vector space.
The model is designed for tasks such as semantic search, textual similarity, and clustering within an anime context. It excels at capturing not only direct keywords but also abstract concepts, genres, and the overall atmosphere of a title.
The training dataset was provided by hikka.io, a comprehensive Ukrainian encyclopedia for anime, manga, and light novels.
Model Details
Model Description
- Model Type: Sentence Transformer
- Base model: sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2
- Languages: Ukrainian (uk), English (en)
- Fine-tuning Dataset: Proprietary dataset from hikka.io
- Maximum Sequence Length: 128 tokens
- Output Dimensionality: 384 dimensions
- Similarity Function: Cosine Similarity
Model Sources
- Repository: This model on Hugging Face
- Original Model: paraphrase-multilingual-MiniLM-L12-v2
- Documentation: Sentence Transformers Documentation
Usage
First, install the Sentence Transformers library:
pip install -U sentence-transformers
Then, you can load the model and use it for semantic search or similarity tasks.
from sentence_transformers import SentenceTransformer, util
# Download the model from the 🤗 Hub
model = SentenceTransformer("Lorg0n/hikka-forge-paraphrase-multilingual-MiniLM-L12-v2")
# Example query (can be in Ukrainian or English)
query = "аніме про меланхолійну подорож після перемоги над королем демонів"
# "anime about a melancholic journey after defeating the demon king"
# A corpus of documents to search through
corpus = [
"Frieren is an elf mage who was part of the hero's party that defeated the Demon King. After the journey, she witnesses her human companions pass away due to old age and embarks on a new journey to understand humanity.",
"To Your Eternity follows an immortal being sent to Earth with no emotions nor identity. The being is able to take on the shape of those that leave a strong impression on it.",
"K-On! is a lighthearted story about four high school girls who join the light music club to save it from being disbanded. They spend their days practicing, performing, and hanging out together."
]
# Encode the query and corpus into dense vector embeddings
query_embedding = model.encode(query, convert_to_tensor=True)
corpus_embeddings = model.encode(corpus, convert_to_tensor=True)
# Compute cosine similarity scores
cosine_scores = util.cos_sim(query_embedding, corpus_embeddings)
# Print the results
print(f"Query: {query}\n")
for i, score in enumerate(cosine_scores[0]):
    print(f"Similarity: {score:.4f}\t | Document: {corpus[i][:80]}...")
# Expected Output:
# Query: аніме про меланхолійну подорож після перемоги над королем демонів
#
# Similarity: 0.4013 | Document: Frieren is an elf mage who was part of the hero's party that defeated the Demon ...
# Similarity: 0.1800 | Document: To Your Eternity follows an immortal being sent to Earth with no emotions nor id...
# Similarity: 0.0091 | Document: K-On! is a lighthearted story about four high school girls who join the light mu...
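Under the hood, util.cos_sim scores each query–document pair with cosine similarity. A minimal pure-Python sketch of that math (toy 3-dimensional vectors stand in for the model's 384-dimensional embeddings):

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Toy "embeddings"; two nearly parallel vectors score close to 1.0
query_vec = [0.2, 0.9, 0.1]
doc_vec = [0.25, 0.85, 0.05]

score = cosine_similarity(query_vec, doc_vec)
print(f"{score:.4f}")
```

Because cosine similarity depends only on vector direction, document length does not inflate the score, which is why it is the standard choice for sentence embeddings.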
Training Details
Training Dataset
The model was fine-tuned on a proprietary, high-quality dataset from hikka.io, consisting of 177,822 carefully constructed training pairs. The dataset was engineered to teach the model various semantic relationships within the anime domain:
Cross-lingual Connections (UA ↔ EN):
- Pairs of titles and their corresponding synopses in both languages (ua_title ↔ en_synopsis).
- Pairs of titles in Ukrainian and English (ua_title ↔ en_title).
- Pairs of translated genre names (Бойовик ↔ Action).
- Pairs from an auxiliary translated dataset to augment bilingual understanding.
Intra-lingual Connections (UA ↔ UA, EN ↔ EN):
- Pairs of key sentences (first, middle, last) from a synopsis with the full synopsis text. This teaches the model that a part of a text is semantically related to the whole.
Metadata & Synonymy Injection:
- Pairs linking all known titles of an anime (Ukrainian, English, Japanese, synonyms) to each other, teaching the model that they refer to the same entity.
- Pairs linking genres and studios to anime titles to ground the model in relevant metadata.
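The pair-construction strategy above can be sketched in plain Python. The record fields used here (ua_title, en_title, en_synopsis, synonyms, genres) are hypothetical names based on the descriptions above, not the actual hikka.io schema:

```python
from itertools import combinations

def build_training_pairs(record):
    """Build (anchor, positive) pairs from one hypothetical anime record."""
    pairs = []
    # Cross-lingual connections: title <-> synopsis and title <-> title
    pairs.append((record["ua_title"], record["en_synopsis"]))
    pairs.append((record["ua_title"], record["en_title"]))
    # Synonymy injection: every known title refers to the same entity
    all_titles = [record["ua_title"], record["en_title"]] + record["synonyms"]
    pairs.extend(combinations(all_titles, 2))
    # Metadata grounding: genres linked to the title
    pairs.extend((genre, record["ua_title"]) for genre in record["genres"])
    return pairs

record = {
    "ua_title": "Фрірен: Заупокійна мандрівниця",
    "en_title": "Frieren: Beyond Journey's End",
    "en_synopsis": "An elf mage outlives her companions and sets out again.",
    "synonyms": ["Sousou no Frieren"],
    "genres": ["Пригоди", "Adventure"],
}
pairs = build_training_pairs(record)
print(len(pairs))  # 2 cross-lingual + 3 title combinations + 2 genre links
```

Every pair is treated as a positive example; the negatives come from the batch at training time, as described next.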
- Loss Function: The model was trained using MultipleNegativesRankingLoss, a highly effective method for learning semantic similarity. It uses the other examples in each batch as negative samples, which makes training very efficient.
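MultipleNegativesRankingLoss treats each batch of (anchor, positive) pairs as a classification problem: for anchor i, the positive of pair i is the correct "class" and all other positives in the batch act as negatives. A minimal pure-Python sketch of that computation over a precomputed similarity matrix (an illustration, not the library's implementation; the scale factor of 20 mirrors the library's default):

```python
import math

def mnrl_loss(sim_matrix, scale=20.0):
    """In-batch negatives loss: softmax cross-entropy where the
    diagonal entries (anchor i vs. positive i) are the targets."""
    n = len(sim_matrix)
    total = 0.0
    for i in range(n):
        logits = [scale * s for s in sim_matrix[i]]
        m = max(logits)  # subtract the max for numerical stability
        log_sum = m + math.log(sum(math.exp(l - m) for l in logits))
        total += log_sum - logits[i]  # -log softmax at the diagonal
    return total / n

# Toy 2x2 cosine-similarity matrix: diagonal = matching pairs
sims = [[0.9, 0.1],
        [0.2, 0.8]]
print(f"{mnrl_loss(sims):.6f}")
```

The loss is near zero when each anchor is most similar to its own positive, and grows when an in-batch negative outranks the true pair, which is exactly the ranking behavior the model needs for retrieval.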
Evaluation
The fine-tuned model demonstrates a significantly improved understanding of domain-specific and abstract concepts compared to the base model. During evaluation, it showed:
- Superior understanding of niche genres: it correctly identified "Yuru Camp" (Дівчачий табір) from the query "a calming, healing 'iyashikei' anime", while the base model returned more generic results.
- Grasping abstract concepts: it correctly found "Magical Girl Site" for the query "деконструкція жанру махо-шьоджьо, де дівчата-чарівниці страждають психологічно" ("a deconstruction of the mahou-shoujo genre where magical girls suffer psychologically").
- Better atmospheric matching: it assigned higher similarity to thematically similar anime (such as "Frieren" and "To Your Eternity") and lower similarity to dissimilar ones, demonstrating a deeper contextual understanding.
Training Hyperparameters
- learning_rate: 2e-05
- per_device_train_batch_size: 32
- num_train_epochs: 4
- warmup_ratio: 0.1
- fp16: True
- loss: MultipleNegativesRankingLoss
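These hyperparameters map directly onto a Sentence Transformers training setup. The configuration fragment below is a hedged sketch assuming the v3 trainer API; the output directory name and the train_dataset placeholder are illustrative, not the actual training script:

```python
from sentence_transformers import (
    SentenceTransformer,
    SentenceTransformerTrainer,
    SentenceTransformerTrainingArguments,
)
from sentence_transformers.losses import MultipleNegativesRankingLoss

model = SentenceTransformer(
    "sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2"
)
loss = MultipleNegativesRankingLoss(model)

args = SentenceTransformerTrainingArguments(
    output_dir="hikka-forge",         # illustrative name
    learning_rate=2e-5,
    per_device_train_batch_size=32,
    num_train_epochs=4,
    warmup_ratio=0.1,
    fp16=True,
)

# train_dataset: a datasets.Dataset of (anchor, positive) pair columns
# trainer = SentenceTransformerTrainer(
#     model=model, args=args, train_dataset=train_dataset, loss=loss
# )
# trainer.train()
```

With MultipleNegativesRankingLoss, a larger per_device_train_batch_size generally helps, since every extra pair in the batch contributes more in-batch negatives.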
Citation
BibTeX
Sentence Transformers
@inproceedings{reimers-2019-sentence-bert,
title = "Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks",
author = "Reimers, Nils and Gurevych, Iryna",
booktitle = "Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing",
month = "11",
year = "2019",
publisher = "Association for Computational Linguistics",
url = "https://arxiv.org/abs/1908.10084",
}