---
license: apache-2.0
language:
- ru
base_model:
- ai-forever/ruBert-large
tags:
- difficulty
- cefr
- regression
---

# Model Card

A regression model that predicts a difficulty score for an input Russian text. The predicted scores can be mapped to CEFR levels.

## Model Details

Frozen ruBERT-large encoder layers with a regression head on top. The model was trained on a mix of manually annotated datasets (more details on the data will follow).

## How to Get Started with the Model

Use the code below to get started with the model.

```
import torch
import torch.nn as nn
from transformers import AutoConfig, AutoTokenizer, BertModel, BertPreTrainedModel


class CustomModel(BertPreTrainedModel):
    def __init__(self, config, load_path=None, use_auth_token: str = None):
        super().__init__(config)
        self.bert = BertModel(config)
        self.pre_classifier = nn.Linear(config.hidden_size, 128)
        self.dropout = nn.Dropout(0.2)
        self.classifier = nn.Linear(128, 1)

        # Apply Xavier initialization to the regression head.
        nn.init.xavier_uniform_(self.pre_classifier.weight)
        nn.init.xavier_uniform_(self.classifier.weight)
        if self.pre_classifier.bias is not None:
            nn.init.constant_(self.pre_classifier.bias, 0)
        if self.classifier.bias is not None:
            nn.init.constant_(self.classifier.bias, 0)

    def forward(
        self,
        input_ids,
        labels=None,
        attention_mask=None,
        token_type_ids=None,
        position_ids=None,
    ):
        outputs = self.bert(
            input_ids,
            attention_mask=attention_mask,
            token_type_ids=token_type_ids,
            position_ids=position_ids,
        )
        # Use the hidden state of the [CLS] token as the text representation.
        pooled_output = outputs[0][:, 0]
        pooled_output = self.pre_classifier(pooled_output)
        pooled_output = nn.ReLU()(pooled_output)
        pooled_output = self.dropout(pooled_output)
        logits = self.classifier(pooled_output)

        if labels is not None:
            loss_fn = nn.MSELoss()
            loss = loss_fn(logits.view(-1), labels.view(-1))
            return loss, logits
        return None, logits


model_path = "path/to/this/model"  # local path or repo ID of this model
device = "cuda" if torch.cuda.is_available() else "cpu"

tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)
config = AutoConfig.from_pretrained(model_path, trust_remote_code=True)
config.num_labels = 1

model = CustomModel(config)
model.load_state_dict(torch.load(f"{model_path}/pytorch_model.bin", map_location=device))
model.to(device)
model.eval()

text = "..."  # the text to score
inputs = tokenizer(text, return_tensors="pt", padding=True, truncation=True)
inputs = {key: value.to(device) for key, value in inputs.items()}
with torch.no_grad():
    _, logits = model(
        input_ids=inputs["input_ids"],
        attention_mask=inputs["attention_mask"],
        token_type_ids=inputs["token_type_ids"],
    )
```

To map the score to a CEFR level, use:

```
reg2cl2 = {'0.0': 'A1', '1.0': 'A1', '1.5': 'A12', '2.0': 'A2', '2.5': 'A2',
           '3.0': 'B1', '3.5': 'B12', '4.0': 'B2', '4.5': 'B2',
           '5.0': 'C1', '5.5': 'C12', '6.0': 'C2'}

# Round to the nearest 0.5 so the half-level keys above are reachable,
# and clamp to the [0.0, 6.0] range the mapping covers.
score = logits.item()
rounded = min(max(round(score * 2) / 2, 0.0), 6.0)
# '0.5' has no entry in the mapping; both of its neighbours are A1.
level = reg2cl2.get(str(rounded), 'A1')
print("Predicted score:", score, "CEFR level:", level)
```

## Training Details

#### Training Hyperparameters

+ learning_rate: 3e-4
+ num_train_epochs: 15.0
+ batch_size: 32
+ weight_decay: 0.1
+ adam_beta1: 0.9
+ adam_beta2: 0.99
+ adam_epsilon: 1e-8
+ max_grad_norm: 1.0
+ fp16: True

## Evaluation on the Test Set

![Evaluation results](ru_regression.png)

## Citation

Please refer to this repository when using the model.
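
## Appendix: Training Setup Sketch

For reference, the hyperparameters listed under Training Details slot directly into the standard `transformers` `Trainer`. The sketch below is a minimal reconstruction under stated assumptions, not the authors' training script: it reuses the `CustomModel` class from the getting-started section, initializes it from the base ai-forever/ruBert-large checkpoint, freezes the encoder as described under Model Details, and assumes a hypothetical `train_dataset` of tokenized texts with float difficulty labels.

```
from transformers import AutoConfig, Trainer, TrainingArguments

# Start from the base checkpoint; CustomModel is the class defined above.
config = AutoConfig.from_pretrained("ai-forever/ruBert-large")
model = CustomModel.from_pretrained("ai-forever/ruBert-large", config=config)

# "Frozen BERT-large layers": train only the regression head.
for param in model.bert.parameters():
    param.requires_grad = False

# Values copied from the Training Hyperparameters list above;
# output_dir is a hypothetical path.
training_args = TrainingArguments(
    output_dir="difficulty-regressor",
    learning_rate=3e-4,
    num_train_epochs=15.0,
    per_device_train_batch_size=32,
    weight_decay=0.1,
    adam_beta1=0.9,
    adam_beta2=0.99,
    adam_epsilon=1e-8,
    max_grad_norm=1.0,
    fp16=True,
)

# CustomModel.forward returns (loss, logits) when labels are present,
# which is the tuple layout Trainer expects.
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,  # hypothetical: tokenized texts with float labels
)
trainer.train()
```

Because the encoder is frozen, only the two linear layers of the regression head receive gradient updates, which matches the architecture described under Model Details.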
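
## Appendix: Inference Helper

The inference and CEFR-mapping steps from the getting-started section can be wrapped into a single convenience function. This is a sketch: `predict_difficulty` is a hypothetical name, and `model`, `tokenizer`, `device`, and `reg2cl2` are the objects defined above.

```
def predict_difficulty(text: str):
    """Return the raw difficulty score and its CEFR label for one text."""
    inputs = tokenizer(text, return_tensors="pt", padding=True, truncation=True)
    inputs = {key: value.to(device) for key, value in inputs.items()}
    with torch.no_grad():
        _, logits = model(**inputs)
    score = logits.item()
    # Round to the nearest 0.5 and clamp to the mapping's key range.
    rounded = min(max(round(score * 2) / 2, 0.0), 6.0)
    return score, reg2cl2.get(str(rounded), 'A1')


score, level = predict_difficulty("Пример текста на русском языке.")
print(score, level)
```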