---
library_name: transformers
tags:
- trl
- sft
license: apache-2.0
base_model: HuggingFaceTB/SmolLM2-1.7B-Instruct
datasets:
- EngSAF
metrics:
- accuracy
- f1
- precision
- recall
- cohen_kappa
- rmse
model-index:
- name: SmolLM2-1.7B-Instruct-EngSaf-858K
  results:
  - task:
      name: Text Generation
      type: text-generation
    dataset:
      name: EngSAF
      type: EngSAF
      config: EngSAF
      split: train
      args: EngSAF
    metrics:
    - name: Accuracy
      type: accuracy
      value: 0.3800
    - name: F1
      type: f1
      value: 0.3594
    - name: Precision
      type: precision
      value: 0.4014
    - name: Recall
      type: recall
      value: 0.3772
    - name: Cohen Kappa
      type: cohen_kappa
      value: 0.0505
    - name: RMSE
      type: rmse
      value: 1.0344
language:
- en
pipeline_tag: text-generation
---

# SmolLM2-1.7B-Instruct-EngSaf-858K

This model is a fine-tuned version of [HuggingFaceTB/SmolLM2-1.7B-Instruct](https://huggingface.co/HuggingFaceTB/SmolLM2-1.7B-Instruct) on the EngSAF dataset for automatic essay grading.


- **Workflow:** GitHub Repository: [https://github.com/IsmaelMousa/automatic-essay-grading](https://github.com/IsmaelMousa/automatic-essay-grading).
- **Base Model:** SmolLM2-1.7B-Instruct: [https://doi.org/10.48550/arXiv.2502.02737](https://doi.org/10.48550/arXiv.2502.02737).
- **Fine-tuning Dataset:** EngSAF-858K: [https://github.com/IsmaelMousa/automatic-essay-grading/blob/main/data/engsaf/clean/train/4735_entries.csv](https://github.com/IsmaelMousa/automatic-essay-grading/blob/main/data/engsaf/clean/train/4735_entries.csv).
- **Task:** Automatic Essay Grading (Text Generation).

[![Report](https://img.shields.io/badge/W&B_Report-gray?logo=weightsandbiases&logoColor=yellow)](https://api.wandb.ai/links/ismael-amjad/rav48wc1)

## Dataset

The EngSAF dataset, in its raw and unprocessed form, consists of approximately 5,800 short-answer responses collected
from real-life engineering examinations administered at a reputed academic institute. These responses are spread across
119 unique questions drawn from a wide range of engineering disciplines, making the dataset both diverse and
domain-specific. Each data point includes a student’s answer and an associated human-annotated score, serving as a
benchmark for evaluating automated grading models.

The dataset is divided into three primary subsets: 70% is allocated for training, 16% is reserved for evaluation on
unseen answers (UA), and 14% is dedicated to evaluating performance on entirely new questions (UQ). At this stage, it is
important to note that the dataset is considered in its original state; no preprocessing, transformation, or filtering
has yet been applied. All subsequent improvements and refinements to the data will be described in later sections.
This dataset, EngSAF version 1.0, was introduced in the paper *"I understand why I got this grade":
Automatic Short Answer Grading (ASAG) with Feedback* by Aggarwal et al., set to appear in the proceedings of
AIED 2025. The dataset is released strictly for academic and research purposes; any commercial use or redistribution
without explicit permission is prohibited. Researchers are also urged to avoid publicly disclosing any sensitive
content the dataset may contain.

For more details, the paper can be accessed at: [https://arxiv.org/abs/2407.12818](https://arxiv.org/abs/2407.12818).

## Modeling
The modeling approach for this study was designed to evaluate the performance of different large language models (LLMs) on the automated essay grading task. We selected the SmolLM2 family to cover a range of model sizes: 135M, 360M, and 1.7B parameters. Each model was instruction-tuned on varying amounts of the EngSAF dataset, with hyperparameters tuned to balance computational efficiency and performance. The experiments were conducted on GPU-accelerated hardware, leveraging techniques such as gradient checkpointing, flash attention, and mixed-precision training to maximize resource utilization.
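For illustration, here is a minimal sketch of such a setup using TRL's `SFTTrainer`. The data path and every hyperparameter value below are assumptions for the sketch, not the settings used to train the released checkpoints:

```python
# A minimal sketch of the instruction-tuning setup described above, using TRL's
# SFTTrainer. The data path and all hyperparameter values are illustrative
# assumptions, not the settings used to train the released checkpoints.
import torch
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer
from trl import SFTConfig, SFTTrainer

base = "HuggingFaceTB/SmolLM2-1.7B-Instruct"

tokenizer = AutoTokenizer.from_pretrained(base)
model = AutoModelForCausalLM.from_pretrained(
    base,
    torch_dtype=torch.bfloat16,               # mixed-precision weights
    attn_implementation="flash_attention_2",  # flash attention (requires flash-attn)
)

# Hypothetical path; the cleaned EngSAF CSVs live in the GitHub repository above.
train_dataset = load_dataset("csv", data_files="engsaf_train.csv")["train"]

args = SFTConfig(
    output_dir="SmolLM2-1.7B-Instruct-EngSaf-858K",
    per_device_train_batch_size=2,
    gradient_accumulation_steps=8,
    gradient_checkpointing=True,              # trade recompute for memory
    learning_rate=2e-5,
    num_train_epochs=3,
    bf16=True,                                # mixed-precision training
)

trainer = SFTTrainer(model=model, args=args, train_dataset=train_dataset, processing_class=tokenizer)
trainer.train()
```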

## Evaluation
The evaluation methodology employed both quantitative metrics and qualitative analysis. For quantitative assessment, on a held-out test set of 100 samples, we computed accuracy, precision, recall, F1 score, root mean squared error (RMSE), and Cohen's kappa score (CKS) for the scoring task, and BERTScore precision, recall, and F1 for rationale evaluation. Qualitative examination of model outputs revealed cases where most models correctly identified key aspects of student answers but sometimes failed to align their scoring with the rubric criteria.

### Evaluation results for `score` and `rationale` outputs:

| **Aspect** |   **F1**   | **Precision** | **Recall** | **Accuracy** | **CKS** | **RMSE** |
|:----------:|:----------:|:-------------:|:----------:|:------------:|:-------:|:--------:|
|   Score    |   0.3594   |    0.4014     |   0.3772   |    0.3800    | 0.0505  |  1.0344  |
| Rationale  |   0.6240   |    0.6279     |   0.6228   |      --      |   --    |    --    |
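
As a rough sketch, metrics of this kind can be computed with `scikit-learn` and `bert-score` (both listed under Frameworks below). The score lists, rationale texts, and the macro averaging here are placeholder assumptions, not the actual evaluation data:

```python
# A rough sketch of computing the reported metric types with scikit-learn and
# bert-score. The score/rationale values are placeholders, and macro averaging
# is an assumption about how F1/precision/recall were pooled.
import numpy as np
from bert_score import score as bertscore
from sklearn.metrics import (accuracy_score, cohen_kappa_score, f1_score,
                             mean_squared_error, precision_score, recall_score)

y_true = [3, 1, 2, 0]  # human-annotated scores (placeholder)
y_pred = [2, 1, 3, 0]  # scores parsed from model outputs (placeholder)

print("Accuracy :", accuracy_score(y_true, y_pred))
print("F1       :", f1_score(y_true, y_pred, average="macro"))
print("Precision:", precision_score(y_true, y_pred, average="macro"))
print("Recall   :", recall_score(y_true, y_pred, average="macro"))
print("CKS      :", cohen_kappa_score(y_true, y_pred))
print("RMSE     :", np.sqrt(mean_squared_error(y_true, y_pred)))

# Rationale quality via BERTScore (placeholder texts).
refs  = ["Covers the key failure mechanism and cites the governing equation."]
cands = ["Mentions the failure mechanism but omits the governing equation."]
P, R, F1 = bertscore(cands, refs, lang="en")
print("Rationale F1:", F1.mean().item())
```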


## Usage

Below is an example of how to use the model with the Hugging Face Transformers library:

```python
from transformers import AutoTokenizer, AutoModelForCausalLM, pipeline
import torch

checkpoint = "IsmaelMousa/SmolLM2-1.7B-Instruct-EngSaf-858K"
device     = torch.device("cuda" if torch.cuda.is_available() else "cpu")

tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model     = AutoModelForCausalLM.from_pretrained(checkpoint)

assistant = pipeline("text-generation", tokenizer=tokenizer, model=model, device=device)

question         = input("Question        : ")
reference_answer = input("Reference Answer: ")
student_answer   = input("Student Answer  : ")
mark_scheme      = input("Mark Scheme     : ")

system_content = ("You are a grading assistant. Evaluate student answers based on the mark scheme. "
                  "Respond only in JSON format with keys 'score' (int) and 'rationale' (string).")

user_content = ("Provide both a score and a rationale by evaluating the student's answer strictly within the mark scheme range,"
                " grading based on how well it meets the question's requirements by comparing the student answer to the reference answer.\n"
                f"Question: {question}\n"
                f"Reference Answer: {reference_answer}\n"
                f"Student Answer: {student_answer}\n"
                f"Mark Scheme: {mark_scheme}")

messages = [{"role": "system", "content": system_content}, {"role": "user", "content": user_content}]

# add_generation_prompt=True appends the assistant turn marker, so the model
# generates a reply instead of continuing the user message.
prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)

output = assistant(prompt, max_new_tokens=128, do_sample=False, return_full_text=False)[0]["generated_text"]

print(output)
```
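
Since the model is only instructed to emit JSON, generations can occasionally be malformed. A minimal sketch of recovering them, assuming the `json-repair` package listed under Frameworks below, with a deliberately truncated sample standing in for the `output` string above:

```python
# A minimal sketch using json-repair; `raw` stands in for the `output` string
# from the example above and is deliberately truncated (missing closing brace)
# to show the repair step.
import json
from json_repair import repair_json

raw = '{"score": 2, "rationale": "Defines the concept but omits the worked example"'

graded = json.loads(repair_json(raw))  # repair the malformed string, then parse it
print(graded["score"], "-", graded["rationale"])
```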

### Frameworks

- `datasets-3.6.0`
- `torch-2.7.0`
- `transformers-4.51.3`
- `trl-0.17.0`
- `scikit-learn-1.6.1`
- `bert-score-0.3.13`
- `json-repair-0.46.0`