T5-small Texas Legislative Summarization
This model is a fine-tuned version of google/t5-small for summarizing Texas legislative bills. It was trained on a dataset of Texas legislative bills and their corresponding summaries, with a focus on long document summarization.
Model Details
- Model Name: T5-small Texas Legislative Summarization
- Base Model: google/t5-small
- Model Type: Seq2Seq Language Model
- Architecture: T5ForConditionalGeneration
- Language: English
- License: Apache 2.0
Model Description
Note that this model is not trained to best-case quality; it is intended as an example use case. The model takes the enrolled text of a Texas legislative bill as input and generates a concise summary, aiming to capture the key points of the bill in a shorter, more accessible format. Because legislative documents are long and complex, the model is configured for long-document summarization.
Intended Use
This model can be used for:
- Summarizing Texas legislative bills for easier understanding.
- Providing a quick overview of bill content for researchers, journalists, and the general public.
- Automating the summarization process to save time and resources.
Training Data
The model was trained on a custom dataset of Texas legislative bills and their summaries. The dataset was created from:
- Source: `cleaned_texas_leg_data.json` (this file is not publicly available and would need to be replaced with a public dataset or a description of how to create one)
- Source Text Column: `enrolled_text`
- Target Text Column: `summary_text`
- Data Preprocessing: Data was loaded, split into training and testing sets (80/20 split), and tokenized using the T5 tokenizer; a sketch follows this list.
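A minimal sketch of the preprocessing described above, assuming the JSON file holds records with the `enrolled_text` and `summary_text` fields named in this card (the file itself is not public, so the path is illustrative):

```python
from datasets import load_dataset
from transformers import AutoTokenizer

# Load the (non-public) JSON file; field names follow this card.
dataset = load_dataset("json", data_files="cleaned_texas_leg_data.json")["train"]
dataset = dataset.train_test_split(test_size=0.2, seed=42)  # 80/20 split

tokenizer = AutoTokenizer.from_pretrained("google/t5-small")

def preprocess(batch):
    # T5 expects a task prefix on the source text.
    inputs = ["summarize: " + t for t in batch["enrolled_text"]]
    model_inputs = tokenizer(inputs, max_length=4979, truncation=True)
    labels = tokenizer(text_target=batch["summary_text"], max_length=752, truncation=True)
    model_inputs["labels"] = labels["input_ids"]
    return model_inputs

tokenized = dataset.map(preprocess, batched=True,
                        remove_columns=dataset["train"].column_names)
```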
Training Procedure
The model was fine-tuned using the following parameters (a training-configuration sketch follows the list):
- Model: `google/t5-small`
- Training Framework: Transformers library using `Seq2SeqTrainer`
- Optimizer: Adafactor
- Loss Function: Cross-Entropy Loss
- Epochs: 5
- Batch Size: 1 (per device)
- Gradient Accumulation Steps: 4
- Learning Rate: 1e-05
- Weight Decay: 0.0
- FP16 Training: Enabled
- Gradient Checkpointing: Enabled
- Evaluation Strategy: Epoch
- Save Strategy: Epoch
- Early Stopping: Enabled (patience=3, threshold=0.01)
- Random Seed: 42
- Max Source Length: 4979 tokens
- Max Target Length: 752 tokens
- Prefix: `"summarize: "`
Hyperparameter Tuning
A hyperparameter search was conducted to find the optimal training configuration. The following hyperparameters were explored:
- Learning Rates: `[1e-5, 2e-5, 3e-5, 4e-5, 5e-5, 6e-5]`
- Weight Decays: `[0.0, 0.01, 0.015, 0.001, 0.005]`
- Gradient Accumulation Steps: `[4]`
The best model was selected based on the lowest perplexity on the evaluation set.
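Perplexity itself is not reported in this card (N/A below); for reference, it is conventionally derived from the evaluation cross-entropy loss, as in this sketch (assumes the `trainer` object from the configuration sketch above):

```python
import math

eval_metrics = trainer.evaluate()
perplexity = math.exp(eval_metrics["eval_loss"])  # perplexity = exp(cross-entropy loss)
print(f"Perplexity: {perplexity:.2f}")
```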
Best Parameters:
- Learning Rate: 1e-05
- Weight Decay: 0.0
- Gradient Accumulation Steps: 4
- Perplexity: N/A
Evaluation
The model was intended to be evaluated on a held-out test set using the following metrics (a computation sketch follows the list):
- ROUGE (Rouge1, Rouge2, RougeL): Measures the overlap of n-grams between the generated summaries and the reference summaries.
- BERTScore (Precision, Recall, F1): Calculates semantic similarity between the generated and reference summaries using BERT embeddings.
- Compression Ratio: Measures the ratio of the length of the generated summary to the length of the original document (sentence-based).
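None of these metrics were actually computed for this model (see below), but a sketch of how they could be computed with the `evaluate` library follows. The sentence-based compression ratio here is one plausible reading of the definition above, not the card author's exact formula:

```python
import evaluate
import nltk

nltk.download("punkt", quiet=True)  # sentence tokenizer for the compression ratio

rouge = evaluate.load("rouge")
bertscore = evaluate.load("bertscore")

predictions = ["Generated summary ..."]  # model outputs
references = ["Reference summary ..."]   # gold summaries
sources = ["Original bill text ..."]     # enrolled bill texts

rouge_scores = rouge.compute(predictions=predictions, references=references)
bert_scores = bertscore.compute(predictions=predictions, references=references,
                                lang="en")

# One interpretation of a sentence-based compression ratio:
# sentences in the summary divided by sentences in the source.
ratios = [
    len(nltk.sent_tokenize(summ)) / len(nltk.sent_tokenize(src))
    for summ, src in zip(predictions, sources)
]
```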
Evaluation Results:
Evaluation metrics were not calculated during training, so results are not available.
Usage
Here's how to use the model for inference:
```python
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

tokenizer = AutoTokenizer.from_pretrained("your-huggingface-username/t5-small-texas-legislative-summarization")
model = AutoModelForSeq2SeqLM.from_pretrained("your-huggingface-username/t5-small-texas-legislative-summarization")

def summarize(text):
    # Prepend the task prefix the model was trained with.
    input_text = "summarize: " + text
    input_ids = tokenizer.encode(input_text, return_tensors="pt",
                                 max_length=4979, truncation=True)
    summary_ids = model.generate(
        input_ids,
        max_length=752,
        num_beams=4,
        early_stopping=True,
    )
    return tokenizer.decode(summary_ids[0], skip_special_tokens=True)

# Example usage:
bill_text = "Your long Texas legislative bill text here..."
summary = summarize(bill_text)
print(summary)
```
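Note that `max_length=4979` is well beyond the 512-token sequence length T5 was pre-trained with. T5's relative position embeddings allow longer inputs in principle, but memory use grows quickly and quality on very long bills may vary, so truncating or chunking extremely long documents is worth considering.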