data-silence/frozen_news_classifier_ft

@@ -3,6 +3,10 @@ license: apache-2.0
 base_model: sentence-transformers/LaBSE
 tags:
 - generated_from_trainer
 metrics:
 - accuracy
 - f1
@@ -11,6 +15,12 @@ metrics:
 model-index:
 - name: frozen_news_classifier_ft
   results: []
 ---
 <!-- This model card has been generated automatically according to the information the Trainer had access to. You
@@ -18,7 +28,9 @@ should probably proofread and complete it, then remove this comment. -->
 # frozen_news_classifier_ft
-This model is a fine-tuned version of [sentence-transformers/LaBSE](https://huggingface.co/sentence-transformers/LaBSE) on an unknown dataset.
 It achieves the following results on the evaluation set:
 - Loss: 0.7314
 - Accuracy: 0.7793
@@ -26,19 +38,68 @@ It achieves the following results on the evaluation set:
 - Precision: 0.7785
 - Recall: 0.7793
 ## Model description
-More information needed
-## Intended uses & limitations
-More information needed
-## Training and evaluation data
-More information needed
-## Training procedure
 ### Training hyperparameters

 base_model: sentence-transformers/LaBSE
 tags:
 - generated_from_trainer
+- news
+- russian
+- media
+- text-classification
 metrics:
 - accuracy
 - f1
 model-index:
 - name: frozen_news_classifier_ft
   results: []
+datasets:
+- data-silence/rus_news_classifier
+pipeline_tag: text-classification
+language:
+- ru
+library_name: transformers
 ---
 <!-- This model card has been generated automatically according to the information the Trainer had access to. You
 # frozen_news_classifier_ft
+This model is a fine-tuned version of [sentence-transformers/LaBSE](https://huggingface.co/sentence-transformers/LaBSE) on my [news dataset](https://huggingface.co/datasets/data-silence/rus_news_classifier).
+The training dataset is a well-balanced sample of news from the last five years.
 It achieves the following results on the evaluation set:
 - Loss: 0.7314
 - Accuracy: 0.7793
 - Precision: 0.7785
 - Recall: 0.7793
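The identical Accuracy and Recall values above are what one would expect if recall is reported as a support-weighted average over classes, because weighted recall reduces algebraically to accuracy. The card does not state the averaging mode, so this is an assumption. A minimal sketch with made-up labels (not the real evaluation data) illustrates the identity:

```python
from collections import Counter

def accuracy(y_true, y_pred):
    """Fraction of examples whose predicted label matches the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def weighted_recall(y_true, y_pred):
    """Per-class recall averaged with weights proportional to class support."""
    support = Counter(y_true)
    hits = Counter(t for t, p in zip(y_true, y_pred) if t == p)
    total = len(y_true)
    return sum((support[c] / total) * (hits[c] / support[c]) for c in support)

# Hypothetical toy labels, chosen only for illustration.
y_true = ["politics", "politics", "sports", "economy", "economy", "economy"]
y_pred = ["politics", "economy", "sports", "economy", "economy", "sports"]

acc = accuracy(y_true, y_pred)
rec = weighted_recall(y_true, y_pred)
print(f"accuracy={acc:.4f} weighted_recall={rec:.4f}")
```

The two numbers agree for any label set, since each class weight cancels the denominator of that class's recall.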
+## How to use
+```python
+import torch
+from transformers import AutoTokenizer, AutoModelForSequenceClassification
+universal_model_name = "data-silence/frozen_news_classifier_ft"
+universal_tokenizer = AutoTokenizer.from_pretrained(universal_model_name)
+universal_model = AutoModelForSequenceClassification.from_pretrained(universal_model_name)
+# Move the model to the target device and switch it to evaluation mode
+device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
+universal_model = universal_model.to(device)
+universal_model.eval()
+id2label = {
+    0: 'climate', 1: 'conflicts', 2: 'culture', 3: 'economy', 4: 'gloss',
+    5: 'health', 6: 'politics', 7: 'science', 8: 'society', 9: 'sports', 10: 'travel'
+}
+def create_sentence_or_batch_embeddings(sent: list[str]) -> list[list[float]]:
+    """Return embeddings for a list of texts."""
+    # Tokenize the input texts
+    inputs = universal_tokenizer(sent, return_tensors="pt", padding=True, truncation=True).to(device)
+    with torch.no_grad():
+        outputs = universal_model.base_model(**inputs)
+    embeddings = outputs.pooler_output
+    embeddings = torch.nn.functional.normalize(embeddings, dim=1)
+    return embeddings.tolist()
+def predict_category(news: list[str]) -> list[str]:
+    """Predict the category for one or more news texts."""
+    # Tokenize with padding and truncation enabled, then move the tensors to the model's device
+    inputs = universal_tokenizer(news, return_tensors="pt", truncation=True, padding=True).to(device)
+    # Get the model logits
+    with torch.no_grad():
+        outputs = universal_model(**inputs)
+        logits = outputs.logits
+    # Get the indices of the predicted labels
+    predicted_labels = torch.argmax(logits, dim=-1).tolist()
+    # Map the indices to category names
+    predicted_categories = [id2label[label] for label in predicted_labels]
+    return predicted_categories
+```
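`predict_category` in the snippet above returns only the top label. If a confidence score or a short ranked list is useful, the logits can be passed through a softmax first. The following is a minimal sketch in plain Python: the `id2label` map is copied from the snippet above, while `softmax`, `top_k_categories`, and the logit values are hypothetical and exist only for illustration.

```python
import math

# Label map copied from the usage snippet above.
id2label = {
    0: 'climate', 1: 'conflicts', 2: 'culture', 3: 'economy', 4: 'gloss',
    5: 'health', 6: 'politics', 7: 'science', 8: 'society', 9: 'sports', 10: 'travel'
}

def softmax(logits):
    """Convert raw logits to probabilities (max is subtracted for numerical stability)."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def top_k_categories(logits, k=3):
    """Return the k most probable (category, probability) pairs, best first."""
    probs = softmax(logits)
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    return [(id2label[i], probs[i]) for i in ranked[:k]]

# Made-up logits for a single news item, one value per class.
example_logits = [0.1, 0.3, -1.2, 2.5, -0.4, 0.0, 1.9, -0.7, 0.8, -2.0, -1.1]
top3 = top_k_categories(example_logits)
for name, prob in top3:
    print(f"{name}: {prob:.3f}")
```

With the real model, one row of `outputs.logits` converted via `.tolist()` would take the place of `example_logits`.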
 ## Model description
+The goal was to build a general-purpose classifier for Russian-language news that preserves the base LaBSE model's ability to generate multilingual text embeddings in a single vector space.
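Because `create_sentence_or_batch_embeddings` in the usage snippet L2-normalizes its output, comparing two texts in the shared vector space reduces to a dot product. The sketch below shows that comparison in plain Python. The three short vectors are hypothetical stand-ins for real high-dimensional embeddings, and the helper names are illustrative, not part of the model's API.

```python
import math

def l2_normalize(vec):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]

def cosine_similarity(a, b):
    """For unit-length vectors, the dot product equals the cosine similarity."""
    return sum(x * y for x, y in zip(a, b))

# Hypothetical toy vectors; real embeddings would come from the model.
ru_vec = l2_normalize([0.9, 0.1, 0.4])       # stand-in for a Russian headline
en_vec = l2_normalize([0.8, 0.2, 0.5])       # stand-in for its English translation
other_vec = l2_normalize([-0.3, 0.9, -0.1])  # stand-in for an unrelated text

sim_translation = cosine_similarity(ru_vec, en_vec)
sim_unrelated = cosine_similarity(ru_vec, other_vec)
print(f"translation pair: {sim_translation:.3f}, unrelated pair: {sim_unrelated:.3f}")
```

If the fine-tuned model does preserve the multilingual space as intended, translations of the same headline should score closer to 1 than unrelated texts, which is the pattern the toy vectors imitate.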
+## Intended uses & limitations
+Compared to my specialized model [any-news-classifier](https://huggingface.co/data-silence/any-news-classifier), which is built specifically for news classification, this model scores noticeably worse on classification metrics.
 ### Training hyperparameters