---
tags:
  - transformers
  - text-classification
  - russian
  - constructicon
  - nlp
  - linguistics
base_model: intfloat/multilingual-e5-large
language:
  - ru
pipeline_tag: text-classification
widget:
  - text: 'passage: NP-Nom так и VP-Pfv[Sep]query: Петр так и замер.'
    example_title: Positive example
  - text: 'passage: NP-Nom так и VP-Pfv[Sep]query: Мы хорошо поработали.'
    example_title: Negative example
  - text: 'passage: мягко говоря, Cl[Sep]query: Мягко говоря, это была ошибка.'
    example_title: Positive example
---

# Russian Constructicon Classifier

A binary classification model that determines whether a Russian Constructicon pattern is present in a given text example. Fine-tuned from intfloat/multilingual-e5-large in two stages: first as a semantic embedding model on Russian Constructicon data, then for binary classification.

## Model Details

- **Base model:** intfloat/multilingual-e5-large
- **Task:** binary text classification
- **Language:** Russian
- **Training:** two-stage fine-tuning on Russian Constructicon data

## Usage

### Primary Usage (RusCxnPipe Library)

This model is designed for use with the RusCxnPipe library:

```python
from ruscxnpipe import ConstructionClassifier

classifier = ConstructionClassifier(
    model_name="Futyn-Maker/ruscxn-classifier"
)

# Classify candidates (output from semantic search)
queries = ["Петр так и замер."]
candidates = [[{"id": "pattern1", "pattern": "NP-Nom так и VP-Pfv"}]]

results = classifier.classify_candidates(queries, candidates)
print(results[0][0]["is_present"])  # 1 if present, 0 if absent
```

### Direct Usage

```python
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch

model = AutoModelForSequenceClassification.from_pretrained("Futyn-Maker/ruscxn-classifier")
tokenizer = AutoTokenizer.from_pretrained("Futyn-Maker/ruscxn-classifier")

# Format: "passage: [pattern][Sep]query: [example]"
text = "passage: NP-Nom так и VP-Pfv[Sep]query: Петр так и замер."
inputs = tokenizer(text, return_tensors="pt", truncation=True)

with torch.no_grad():
    outputs = model(**inputs)
    prediction = torch.softmax(outputs.logits, dim=-1)
    is_present = torch.argmax(prediction, dim=-1).item()

print(f"Construction present: {is_present}")  # 1 = present, 0 = absent
```

## Input Format

The model expects input in the format: `passage: [pattern][Sep]query: [example]`

- **passage**: the Constructicon pattern to check for
- **query**: the Russian text to analyze
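Building this input string is plain concatenation; a minimal sketch (the helper name `build_input` is illustrative, not part of the library):

```python
def build_input(pattern: str, example: str) -> str:
    """Combine a Constructicon pattern and a text example into the
    classifier's expected "passage: ...[Sep]query: ..." format."""
    return f"passage: {pattern}[Sep]query: {example}"

text = build_input("NP-Nom так и VP-Pfv", "Петр так и замер.")
print(text)  # → passage: NP-Nom так и VP-Pfv[Sep]query: Петр так и замер.
```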

## Training

1. **Stage 1:** semantic embedding training on Russian Constructicon examples and patterns
2. **Stage 2:** binary classification fine-tuning to predict construction presence

## Output

- **Label 0:** construction is NOT present in the text
- **Label 1:** construction IS present in the text
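The softmax/argmax step from the Direct Usage snippet also yields a confidence score for the predicted label. A dependency-free sketch of that step (the logit values below are made up for illustration):

```python
import math

def logits_to_label(logits):
    """Softmax over the two class logits, then argmax.
    Returns (predicted label, probability of that label)."""
    exps = [math.exp(x) for x in logits]
    total = sum(exps)
    probs = [e / total for e in exps]
    label = max(range(len(probs)), key=probs.__getitem__)
    return label, probs[label]

# Hypothetical logits for [absent, present]
label, prob = logits_to_label([-1.2, 2.3])
print(label, round(prob, 3))  # → 1 0.971
```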

## Framework Versions

- **Transformers:** 4.51.3
- **PyTorch:** 2.7.0+cu126
- **Python:** 3.10.12