license: mit
Quote Identification and Attribution Pipeline
License
This project is licensed under the MIT License.
Included Files:
- Training Script for BERT Quote Identifier Model: The script used to train the BERT model for identifying quotes in text.
- Example BERT Quote Identifier Model file: A pre-trained BERT model file that you can use for BERT inference in the pipeline script.
- Training Script for GPT-2 Quotation Attribution Model: The script used to train the GPT-2 model for attributing quotes to characters.
- Example GPT-2 Quotation Attribution Model file: A pre-trained GPT-2 model file that you can use for GPT-2 inference in the pipeline script.
- Pipeline.py: A script that integrates both the BERT and GPT-2 models, allowing for inference on text to identify and attribute quotes.
- Training Dataset: A collection of books located in the 'books' folder, used for training the models.
Overview
This project consists of a pipeline featuring two custom models:
BERT Quote Identifier Model: This model processes text and is trained to differentiate between quoted and non-quoted text.
GPT-2 Quotation Attribution Model: Once quotes are identified, this model attributes each quote to the character who likely said it. It is trained specifically on quotes to predict which character said each line in a story.
This pipeline provides a comprehensive solution for extracting and attributing quotes within a text, making it useful for literary analysis, text annotation, and other related applications.
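The two-stage flow described above (identify quotes, then attribute each one) can be sketched as follows. This is only an illustration of the structure, not the contents of Pipeline.py: the names `run_pipeline`, `regex_quote_identifier`, and `naive_attributor` are hypothetical, and the two stand-in functions use simple regular expressions in place of the trained BERT and GPT-2 models.

```python
import re
from typing import Callable, List, Tuple

# Hypothetical interfaces; the real Pipeline.py may be organized differently.
QuoteIdentifier = Callable[[str], List[str]]   # text -> list of quotes
QuoteAttributor = Callable[[str, str], str]    # (quote, full text) -> speaker


def regex_quote_identifier(text: str) -> List[str]:
    """Stand-in for the BERT model: return the text between double quotes."""
    return re.findall(r'"([^"]+)"', text)


def naive_attributor(quote: str, context: str) -> str:
    """Stand-in for the GPT-2 model: first capitalized word after the quote."""
    # Everything that follows the first occurrence of the quote in the text.
    after = context.split(quote, 1)[-1]
    match = re.search(r"\b[A-Z][a-z]+\b", after)
    return match.group(0) if match else "Unknown"


def run_pipeline(
    text: str,
    identify: QuoteIdentifier,
    attribute: QuoteAttributor,
) -> List[Tuple[str, str]]:
    """Stage 1 finds the quotes; stage 2 assigns a speaker to each one."""
    return [(quote, attribute(quote, text)) for quote in identify(text)]


if __name__ == "__main__":
    sample = '"Come here," said Alice. "No," replied Bob.'
    for quote, speaker in run_pipeline(
        sample, regex_quote_identifier, naive_attributor
    ):
        print(f"{speaker}: {quote}")
```

Passing the two stages in as callables keeps the sketch independent of any particular model-loading code, so the stand-ins could be swapped for functions that wrap the trained models.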
The results of gpt2:v2 model training are:
- Each evaluation consisted of 100 tests.
- Seen training data: accuracy 0.17 (100/100 in 18:54, 11.34 s/it)
- Unseen training data: accuracy 0.18 (100/100 in 18:55, 11.36 s/it)
- Unseen training data + grabbing entities from outputs: accuracy 0.07 (100/100 in 16:50, 10.11 s/it)
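For reference, an accuracy score over a fixed set of tests is the fraction of predictions that match the expected speaker. The sketch below shows one way to compute it. The exact matching rule used in the evaluation is not stated here, so the case-insensitive exact match and the function name `accuracy` are assumptions.

```python
from typing import Sequence


def accuracy(predictions: Sequence[str], references: Sequence[str]) -> float:
    """Fraction of predicted speakers that match the reference speakers.

    Assumption: a prediction counts as correct on a case-insensitive exact
    match after trimming whitespace; the actual evaluation may differ.
    """
    if len(predictions) != len(references):
        raise ValueError("predictions and references must have equal length")
    if not references:
        raise ValueError("cannot compute accuracy on an empty test set")
    correct = sum(
        pred.strip().lower() == ref.strip().lower()
        for pred, ref in zip(predictions, references)
    )
    return correct / len(references)


if __name__ == "__main__":
    predicted = ["Alice", "Bob", "Eve", "Bob"]
    expected = ["Alice", "bob ", "Bob", "Carol"]
    print(f"Accuracy score of : {accuracy(predicted, expected):.2f}")
```

With 100 tests per evaluation, a score of 0.17 corresponds to 17 correct attributions.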