This model is a fine-tuned version of Qwen/Qwen2.5-VL-7B-Instruct on the origin_detection and origin_detection_val datasets. It is designed for the task of automatically reading electricity meter displays from images in an end-to-end fashion.
Model Description
This project addresses the automatic reading of smart meter images. Traditional methods often rely on a multi-stage pipeline involving object detection (e.g., YOLO) to locate the display area, followed by Optical Character Recognition (OCR) on the cropped region. This approach, while established, suffers from a cumbersome workflow, error propagation from detection to recognition, and poor robustness to image distortions, uneven lighting, and background noise.
To overcome these limitations, this model employs an end-to-end Image-to-Text paradigm. It treats the meter reading task as a direct generation problem, taking an entire meter image as input and producing the numerical reading (e.g., "8430.6") as a text sequence. This simplifies the pipeline and leverages global image context for higher accuracy.
The core of the model is a Large Vision-Language Model (LVLM) with two main components:
- Vision Encoder: A Vision Transformer (ViT) that encodes the input image into a series of rich feature vectors. ViT's global receptive field allows it to capture key details and spatial relationships, making it resilient to minor rotations, scaling, or occlusions.
- Language Decoder: A large-scale autoregressive language model that processes the visual features and generates the final text output token by token. Its pre-training on vast text corpora enables it to understand the logical structure of numerical sequences (such as decimal points), avoiding common-sense errors.
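As a minimal usage sketch (assuming this checkpoint keeps the standard Qwen2.5-VL chat interface in transformers and the qwen-vl-utils helper package; the checkpoint path, image path, and prompt text below are placeholders, since the exact prompt used during fine-tuning is not documented in this card):

```python
from transformers import Qwen2_5_VLForConditionalGeneration, AutoProcessor
from qwen_vl_utils import process_vision_info

# Placeholder path: replace with the actual fine-tuned checkpoint.
model_path = "path/to/fine-tuned-checkpoint"

model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    model_path, torch_dtype="auto", device_map="auto"
)
processor = AutoProcessor.from_pretrained(model_path)

# Illustrative prompt; the prompt used during fine-tuning may differ.
messages = [{
    "role": "user",
    "content": [
        {"type": "image", "image": "file:///path/to/meter.jpg"},
        {"type": "text", "text": "Read the electricity meter and output the numerical reading."},
    ],
}]

text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
image_inputs, video_inputs = process_vision_info(messages)
inputs = processor(
    text=[text], images=image_inputs, videos=video_inputs,
    padding=True, return_tensors="pt",
).to(model.device)

# Generate the reading and strip the prompt tokens from the output.
generated_ids = model.generate(**inputs, max_new_tokens=32)
trimmed = [out[len(inp):] for inp, out in zip(inputs.input_ids, generated_ids)]
reading = processor.batch_decode(trimmed, skip_special_tokens=True)[0]
print(reading)  # e.g. "8430.6"
```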
Intended uses & limitations
This model is intended for developers and researchers working on automated utility meter reading systems. It can be directly integrated into applications requiring the extraction of numerical data from meter images. By replacing complex multi-model pipelines, it can significantly reduce development and maintenance overhead.
While designed for robustness, performance may vary with out-of-distribution images that differ significantly from the training data in terms of meter type, image quality, or environmental conditions.
Training and evaluation data
The training data consists of 841 manually annotated images of electricity meters, provided for a course and therefore not publicly available. One tenth of the data was partitioned off as the validation set (origin_detection_val) and the rest used as the training set (origin_detection). To ensure format consistency during training, all target readings were formatted into a standardized six-digit representation, with leading and trailing zeros preserved.
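For illustration, a reading could be normalized as follows. This is a hedged sketch: it assumes the six digits split into five integer digits plus one decimal digit, which is not spelled out in this card.

```python
def format_reading(value: float) -> str:
    # Hypothetical normalization: five integer digits + one decimal digit,
    # zero-padded so leading and trailing zeros are preserved.
    # The exact convention used for the dataset may differ.
    return f"{value:07.1f}"

print(format_reading(8430.6))  # "08430.6"
print(format_reading(12.0))    # "00012.0"
```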
Training procedure
The model was trained using Supervised Fine-Tuning (SFT) on the pre-trained Qwen/Qwen2.5-VL-7B-Instruct model. The open-source LLaMA-Factory framework was used for the training process.
To enhance the model's generalization capabilities, the following data augmentation techniques were applied online during training (see the sketch after this list):
- Random adjustments to brightness and contrast.
- Minor affine transformations.
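A minimal sketch of such an online augmentation pipeline using torchvision (the parameter values are illustrative assumptions, not the configuration actually used in training):

```python
import torchvision.transforms as T

# Illustrative online augmentation: random brightness/contrast jitter plus a
# small affine perturbation (rotation, translation, scaling). Values are
# assumptions for the sketch, not the training configuration.
augment = T.Compose([
    T.ColorJitter(brightness=0.3, contrast=0.3),
    T.RandomAffine(degrees=5, translate=(0.02, 0.02), scale=(0.95, 1.05)),
])

# augmented_image = augment(pil_image)  # applied on the fly each epoch
```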
Training hyperparameters
The following hyperparameters were used during training:
- learning_rate: 1e-05
- train_batch_size: 1
- eval_batch_size: 8
- seed: 42
- distributed_type: multi-GPU
- num_devices: 8
- gradient_accumulation_steps: 2
- total_train_batch_size: 16
- total_eval_batch_size: 64
- optimizer: adamw_torch with betas=(0.9, 0.999) and epsilon=1e-08 (no additional optimizer arguments)
- lr_scheduler_type: cosine
- lr_scheduler_warmup_ratio: 0.1
- num_epochs: 3.0
Training results
The model achieves an accuracy of 0.96 on the test set (the test set itself is not publicly available).
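A minimal sketch of how exact-match accuracy over readings could be computed (a hypothetical helper; the actual evaluation script is not included with this card):

```python
def exact_match_accuracy(predictions, references):
    # Counts a prediction as correct only if the generated string exactly
    # matches the ground-truth reading after stripping surrounding whitespace.
    assert len(predictions) == len(references)
    correct = sum(p.strip() == r.strip() for p, r in zip(predictions, references))
    return correct / len(references)

# Example:
# exact_match_accuracy(["08430.6", "00012.0"], ["08430.6", "00012.1"])  # 0.5
```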
Framework versions
- Transformers 4.50.0
- Pytorch 2.6.0+cu124
- Datasets 3.2.0
- Tokenizers 0.21.0