license: mit

Quote Identification and Attribution Pipeline

License

This project is licensed under the MIT License.

Included Files:

Training Script for BERT Quote Identifier Model: The script used to train the BERT model for identifying quotes in text.
Example BERT Quote Identifier Model file: A pre-trained BERT model file that you can use with the BERT inference script in the pipeline.
Training Script for GPT-2 Quotation Attribution Model: The script used to train the GPT-2 model for attributing quotes to characters.
Example GPT-2 Quotation Attribution Model file: A pre-trained GPT-2 model file that you can use with the GPT-2 inference script in the pipeline.
Pipeline.py: A script that integrates both the BERT and GPT-2 models, allowing for inference on text to identify and attribute quotes.
Training Dataset: A collection of books located in the 'books' folder, used for training the models.

Overview

This project consists of a pipeline featuring two custom models:

BERT Quote Identifier Model: This model processes text to identify what is a quote and what is not. It is trained to differentiate between quoted and non-quoted text.
GPT-2 Quotation Attribution Model: Once quotes are identified, this model attributes each quote to the character who likely said it. It is trained specifically on quotes to predict which character said each line in a story.

Together, the two models extract quotes from a text and attribute each one to a speaker, which makes the pipeline useful for literary analysis, text annotation, and related applications.
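The two-stage flow described above can be sketched as follows. This is only an illustration of the control flow, not the contents of Pipeline.py: the names run_pipeline, regex_quote_identifier, and placeholder_attributor are hypothetical, and both stages use trivial stand-ins (a regular expression and a constant label) where the real pipeline would call the BERT and GPT-2 models.

```python
import re
from typing import Callable, List, Tuple


def regex_quote_identifier(text: str) -> List[str]:
    """Stand-in for the BERT stage (assumption): return text found between double quotes."""
    return re.findall(r'"([^"]+)"', text)


def placeholder_attributor(quote: str, context: str) -> str:
    """Stand-in for the GPT-2 stage (assumption): always returns a fixed label."""
    return "UNKNOWN"


def run_pipeline(
    text: str,
    identify: Callable[[str], List[str]] = regex_quote_identifier,
    attribute: Callable[[str, str], str] = placeholder_attributor,
) -> List[Tuple[str, str]]:
    """Stage 1 finds the quotes; stage 2 assigns a speaker to each quote."""
    quotes = identify(text)
    return [(quote, attribute(quote, text)) for quote in quotes]


if __name__ == "__main__":
    sample = '"Hello," said Tom. "Go away," she replied.'
    for quote, speaker in run_pipeline(sample):
        print(f"{speaker}: {quote}")
```

Passing the two stages as callables keeps the sketch independent of any particular model-loading code, so either stand-in could be swapped for a function that wraps a trained model.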

The results of training the gpt2:v2 model are:

- Each evaluation was run on 100 test examples.

Seen training data: accuracy 0.17 (100/100 examples, 18:54 total, 11.34 s/it)

Unseen training data: accuracy 0.18 (100/100 examples, 18:55 total, 11.36 s/it)

Unseen training data + grabbing entities from outputs: accuracy 0.07 (100/100 examples, 16:50 total, 10.11 s/it)
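
For reference, an accuracy score like the ones above can be computed as the fraction of test examples where the predicted speaker matches the expected one. The sketch below is an assumption about the metric, not the project's actual evaluation code: the function name attribution_accuracy is hypothetical, and the case-insensitive, whitespace-trimmed comparison is an illustrative choice.

```python
from typing import Sequence


def attribution_accuracy(predicted: Sequence[str], expected: Sequence[str]) -> float:
    """Fraction of examples whose predicted speaker equals the expected speaker.

    Assumption: names are compared after trimming whitespace and lowercasing.
    """
    if len(predicted) != len(expected):
        raise ValueError("predicted and expected must have the same length")
    if not expected:
        return 0.0
    correct = sum(
        p.strip().lower() == e.strip().lower()
        for p, e in zip(predicted, expected)
    )
    return correct / len(expected)


if __name__ == "__main__":
    predicted = ["Alice", "bob ", "Carol", "Dave"]
    expected = ["Alice", "Bob", "Eve", "Mallory"]
    print(attribution_accuracy(predicted, expected))
```

Under this definition, a score of 0.17 on 100 examples corresponds to 17 correctly attributed quotes.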