Hikka-Forge: Fine-tuned Multilingual Sentence Transformer for Anime Semantic Search (UA/EN)
This is a sentence-transformers model fine-tuned from sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2. It is specifically trained to map Ukrainian and English sentences and paragraphs from the anime domain into a 384-dimensional dense vector space.
The model is designed for tasks such as semantic search, textual similarity, and clustering within an anime context. It excels at capturing not only direct keywords but also abstract concepts, genres, and the overall atmosphere of a title.
The training dataset was provided by hikka.io, a comprehensive Ukrainian encyclopedia for anime, manga, and light novels.
Model Details
Model Description
- Model Type: Sentence Transformer
- Base model: sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2
- Languages: Ukrainian (uk), English (en)
- Fine-tuning Dataset: Proprietary dataset from hikka.io
- Maximum Sequence Length: 128 tokens
- Output Dimensionality: 384 dimensions
- Similarity Function: Cosine Similarity
Model Sources
- Repository: This model on Hugging Face
- Original Model: paraphrase-multilingual-MiniLM-L12-v2
- Documentation: Sentence Transformers Documentation
Usage
First, install the Sentence Transformers library:
pip install -U sentence-transformers
Then, you can load the model and use it for semantic search or similarity tasks.
from sentence_transformers import SentenceTransformer, util
# Download the model from the 🤗 Hub
model = SentenceTransformer("Lorg0n/hikka-forge-paraphrase-multilingual-MiniLM-L12-v2")
# Example query (can be in Ukrainian or English)
query = "аніме про меланхолійну подорож після перемоги над королем демонів"
# "anime about a melancholic journey after defeating the demon king"
# A corpus of documents to search through
corpus = [
"Frieren is an elf mage who was part of the hero's party that defeated the Demon King. After the journey, she witnesses her human companions pass away due to old age and embarks on a new journey to understand humanity.",
"To Your Eternity follows an immortal being sent to Earth with no emotions nor identity. The being is able to take on the shape of those that leave a strong impression on it.",
"K-On! is a lighthearted story about four high school girls who join the light music club to save it from being disbanded. They spend their days practicing, performing, and hanging out together."
]
# Encode the query and corpus into dense vector embeddings
query_embedding = model.encode(query, convert_to_tensor=True)
corpus_embeddings = model.encode(corpus, convert_to_tensor=True)
# Compute cosine similarity scores
cosine_scores = util.cos_sim(query_embedding, corpus_embeddings)
# Print the results
print(f"Query: {query}\n")
for i, score in enumerate(cosine_scores[0]):
    print(f"Similarity: {score:.4f}\t | Document: {corpus[i][:80]}...")
# Expected Output:
# Query: аніме про меланхолійну подорож після перемоги над королем демонів
#
# Similarity: 0.4013 | Document: Frieren is an elf mage who was part of the hero's party that defeated the Demon ...
# Similarity: 0.1800 | Document: To Your Eternity follows an immortal being sent to Earth with no emotions nor id...
# Similarity: 0.0091 | Document: K-On! is a lighthearted story about four high school girls who join the light mu...
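Under the hood, util.cos_sim scores each query–document pair with cosine similarity. A minimal pure-Python sketch of that math (toy 3-dimensional vectors stand in for the model's 384-dimensional embeddings):

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Toy "embeddings"; two nearly parallel vectors score close to 1.0
query_vec = [0.2, 0.9, 0.1]
doc_vec = [0.25, 0.85, 0.05]

score = cosine_similarity(query_vec, doc_vec)
print(f"{score:.4f}")
```

Because cosine similarity depends only on vector direction, document length does not inflate the score, which is why it is the standard choice for sentence embeddings.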
Training Details
Training Dataset
The model was fine-tuned on a proprietary, high-quality dataset from hikka.io, consisting of 177,822 carefully constructed training pairs. The dataset was engineered to teach the model various semantic relationships within the anime domain:
Cross-lingual Connections (UA ↔ EN):
- Pairs of titles and their corresponding synopses in both languages (ua_title ↔ en_synopsis).
- Pairs of titles in Ukrainian and English (ua_title ↔ en_title).
- Pairs of translated genre names (Бойовик ↔ Action).
- Pairs from an auxiliary translated dataset to augment bilingual understanding.
Intra-lingual Connections (UA ↔ UA, EN ↔ EN):
- Pairs of key sentences (first, middle, last) from a synopsis with the full synopsis text. This teaches the model that a part of a text is semantically related to the whole.
Metadata & Synonymy Injection:
- Pairs linking all known titles of an anime (Ukrainian, English, Japanese, synonyms) to each other, teaching the model that they refer to the same entity.
- Pairs linking genres and studios to anime titles to ground the model in relevant metadata.
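The pair-construction strategy above can be sketched in plain Python. The record fields used here (ua_title, en_title, en_synopsis, synonyms, genres) are hypothetical names based on the descriptions above, not the actual hikka.io schema:

```python
from itertools import combinations

def build_training_pairs(record):
    """Build (anchor, positive) pairs from one hypothetical anime record."""
    pairs = []
    # Cross-lingual connections: title <-> synopsis and title <-> title
    pairs.append((record["ua_title"], record["en_synopsis"]))
    pairs.append((record["ua_title"], record["en_title"]))
    # Synonymy injection: every known title refers to the same entity
    all_titles = [record["ua_title"], record["en_title"]] + record["synonyms"]
    pairs.extend(combinations(all_titles, 2))
    # Metadata grounding: genres linked to the title
    pairs.extend((genre, record["ua_title"]) for genre in record["genres"])
    return pairs

record = {
    "ua_title": "Фрірен: Заупокійна мандрівниця",
    "en_title": "Frieren: Beyond Journey's End",
    "en_synopsis": "An elf mage outlives her companions and sets out again.",
    "synonyms": ["Sousou no Frieren"],
    "genres": ["Пригоди", "Adventure"],
}
pairs = build_training_pairs(record)
print(len(pairs))  # 2 cross-lingual + 3 title combinations + 2 genre links
```

Every pair is treated as a positive example; the negatives come from the batch at training time, as described next.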
- Loss Function: The model was trained using MultipleNegativesRankingLoss, a highly effective method for learning semantic similarity. It uses the other examples in each batch as negative samples, which makes training very efficient.
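MultipleNegativesRankingLoss treats each batch of (anchor, positive) pairs as a classification problem: for anchor i, the positive of pair i is the correct "class" and all other positives in the batch act as negatives. A minimal pure-Python sketch of that computation over a precomputed similarity matrix (an illustration, not the library's implementation; the scale factor of 20 mirrors the library's default):

```python
import math

def mnrl_loss(sim_matrix, scale=20.0):
    """In-batch negatives loss: softmax cross-entropy where the
    diagonal entries (anchor i vs. positive i) are the targets."""
    n = len(sim_matrix)
    total = 0.0
    for i in range(n):
        logits = [scale * s for s in sim_matrix[i]]
        m = max(logits)  # subtract the max for numerical stability
        log_sum = m + math.log(sum(math.exp(l - m) for l in logits))
        total += log_sum - logits[i]  # -log softmax at the diagonal
    return total / n

# Toy 2x2 cosine-similarity matrix: diagonal = matching pairs
sims = [[0.9, 0.1],
        [0.2, 0.8]]
print(f"{mnrl_loss(sims):.6f}")
```

The loss is near zero when each anchor is most similar to its own positive, and grows when an in-batch negative outranks the true pair, which is exactly the ranking behavior the model needs for retrieval.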
Evaluation
The fine-tuned model demonstrates a significantly improved understanding of domain-specific and abstract concepts compared to the base model. During evaluation, it showed:
- Superior understanding of niche genres: it correctly identified "Yuru Camp" (Дівчачий табір) from the query "a calming, healing 'iyashikei' anime", while the base model returned more generic results.
- Grasping abstract concepts: it correctly found "Magical Girl Site" for the query "деконструкція жанру махо-шьоджьо, де дівчата-чарівниці страждають психологічно" ("a deconstruction of the mahou-shoujo genre where magical girls suffer psychologically").
- Better atmospheric matching: it assigned higher similarity to thematically similar anime (such as "Frieren" and "To Your Eternity") and lower similarity to dissimilar ones, demonstrating a deeper contextual understanding.
Training Hyperparameters
- learning_rate: 2e-05
- per_device_train_batch_size: 32
- num_train_epochs: 4
- warmup_ratio: 0.1
- fp16: True
- loss: MultipleNegativesRankingLoss
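These hyperparameters map directly onto a Sentence Transformers training setup. The configuration fragment below is a hedged sketch assuming the v3 trainer API; the output directory name and the train_dataset placeholder are illustrative, not the actual training script:

```python
from sentence_transformers import (
    SentenceTransformer,
    SentenceTransformerTrainer,
    SentenceTransformerTrainingArguments,
)
from sentence_transformers.losses import MultipleNegativesRankingLoss

model = SentenceTransformer(
    "sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2"
)
loss = MultipleNegativesRankingLoss(model)

args = SentenceTransformerTrainingArguments(
    output_dir="hikka-forge",         # illustrative name
    learning_rate=2e-5,
    per_device_train_batch_size=32,
    num_train_epochs=4,
    warmup_ratio=0.1,
    fp16=True,
)

# train_dataset: a datasets.Dataset of (anchor, positive) pair columns
# trainer = SentenceTransformerTrainer(
#     model=model, args=args, train_dataset=train_dataset, loss=loss
# )
# trainer.train()
```

With MultipleNegativesRankingLoss, a larger per_device_train_batch_size generally helps, since every extra pair in the batch contributes more in-batch negatives.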
Citation
BibTeX
Sentence Transformers
@inproceedings{reimers-2019-sentence-bert,
title = "Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks",
author = "Reimers, Nils and Gurevych, Iryna",
booktitle = "Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing",
month = "11",
year = "2019",
publisher = "Association for Computational Linguistics",
url = "https://arxiv.org/abs/1908.10084",
}