Try layout-aware transformers or attention-based models that can jointly handle textual and visual features (the LayoutLM family is a common choice). Then train the model on the prepared dataset by feeding in the FIR images together with their corresponding IPC-section labels. The model learns the visual and textual cues that indicate where the IPC section appears within an FIR.
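One way to prepare those labels is BIO tagging: each OCR word is marked as beginning, inside, or outside the IPC-section span, which is the format layout-aware token classifiers are commonly trained on. The helper below is a minimal sketch under that assumption; the word list and span indices are illustrative, not from a real FIR.

```python
def bio_tags(words, span_start, span_end):
    """Label words in [span_start, span_end) as the IPC-section span (BIO scheme)."""
    tags = []
    for i, _ in enumerate(words):
        if i == span_start:
            tags.append("B-IPC")       # first word of the IPC section
        elif span_start < i < span_end:
            tags.append("I-IPC")       # continuation of the IPC section
        else:
            tags.append("O")           # everything else in the FIR
    return tags

# Hypothetical OCR output for one FIR line; indices 3..8 cover the IPC span.
words = ["FIR", "No.", "123", "u/s", "379", ",", "411", "IPC"]
print(bio_tags(words, 3, 8))
# → ['O', 'O', 'O', 'B-IPC', 'I-IPC', 'I-IPC', 'I-IPC', 'I-IPC']
```

These per-word tags, paired with each word's bounding box on the page, form one training example for the model.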
Once trained, the model can process new FIRs: it analyzes the layout and text of each FIR image and predicts the location of the IPC section. You can then crop the predicted region and parse its text to extract the IPC codes.
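The final parsing step can be a simple regular expression over the text of the predicted region. This is a sketch under the assumption that IPC codes appear as two- or three-digit section numbers, optionally with a letter suffix (e.g. 120B); the pattern and sample text are illustrative.

```python
import re

# Assumed format: 2-3 digit section numbers, optional uppercase letter suffix.
IPC_PATTERN = re.compile(r"\b(\d{2,3}[A-Z]?)\b")

def extract_ipc_codes(region_text):
    """Return IPC section numbers found in the text of the predicted region."""
    return IPC_PATTERN.findall(region_text)

# Hypothetical OCR text from the region the model predicted.
text = "Sections of law: 420, 406 and 120B IPC"
print(extract_ipc_codes(text))  # → ['420', '406', '120B']
```

In practice you would tighten the pattern to your OCR output (e.g. anchor it on nearby cues such as "u/s" or "IPC") to avoid picking up unrelated numbers like dates or FIR serial numbers.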