T5-small Texas Legislative Summarization

This model is a fine-tuned version of google/t5-small for summarizing Texas legislative bills. It was trained on a dataset of Texas legislative bills and their corresponding summaries, with a focus on long document summarization.

Model Details

  • Model Name: T5-small Texas Legislative Summarization
  • Base Model: google/t5-small
  • Model Type: Seq2Seq Language Model
  • Architecture: T5ForConditionalGeneration
  • Language: English
  • License: Apache 2.0

Model Description

Note that this model is provided as an example use case rather than a fully optimized system. It takes the enrolled text of a Texas legislative bill as input and generates a concise summary, aiming to capture the key points of the bill in a shorter, more accessible format. Because legislative documents are long and complex, the model is designed for long-document summarization.

Intended Use

This model can be used for:

  • Summarizing Texas legislative bills for easier understanding.
  • Providing a quick overview of bill content for researchers, journalists, and the general public.
  • Automating the summarization process to save time and resources.

Training Data

The model was trained on a custom dataset of Texas legislative bills and their summaries. The dataset was created from:

  • Source: cleaned_texas_leg_data.json (This file is not publicly available and would need to be replaced with a public dataset or a description of how to create one).
  • Source Text Column: enrolled_text
  • Target Text Column: summary_text
  • Data Preprocessing: Data was loaded, split into training and testing sets (80/20 split), and tokenized using the T5 tokenizer.
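The split step above can be sketched as follows. This is a minimal illustration, not the original preprocessing script; it assumes each record in the JSON file carries the `enrolled_text` and `summary_text` keys named above, and the `split_records` helper name is hypothetical:

```python
import random

def split_records(records, seed=42, test_frac=0.2):
    """Reproducibly shuffle and split records into train/test (80/20 by default).

    Each record is assumed to be a dict with 'enrolled_text' and
    'summary_text' keys, matching the source/target columns above.
    """
    records = list(records)
    rng = random.Random(seed)
    rng.shuffle(records)
    cut = int(len(records) * (1 - test_frac))
    return records[:cut], records[cut:]
```

Tokenization would then apply the T5 tokenizer (with the "summarize: " prefix) to each split.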

Training Procedure

The model was fine-tuned using the following parameters:

  • Model: google/t5-small
  • Training Framework: Transformers library using Seq2SeqTrainer
  • Optimizer: Adafactor
  • Loss Function: Cross-Entropy Loss
  • Epochs: 5
  • Batch Size: 1 (per device)
  • Gradient Accumulation Steps: 4
  • Learning Rate: 1e-05
  • Weight Decay: 0.0
  • FP16 Training: Enabled
  • Gradient Checkpointing: Enabled
  • Evaluation Strategy: Epoch
  • Save Strategy: Epoch
  • Early Stopping: Enabled (patience=3, threshold=0.01)
  • Random Seed: 42
  • Max Source Length: 4979
  • Max Target Length: 752
  • Prefix: "summarize: "
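The parameters above map onto the Transformers Seq2SeqTrainingArguments roughly as sketched below. This is an assumed reconstruction, not the original training script; the output directory name is hypothetical, and some argument names vary across Transformers versions:

```python
from transformers import Seq2SeqTrainingArguments

training_args = Seq2SeqTrainingArguments(
    output_dir="t5-small-texas-leg",  # hypothetical output path
    num_train_epochs=5,
    per_device_train_batch_size=1,
    gradient_accumulation_steps=4,
    learning_rate=1e-5,
    weight_decay=0.0,
    fp16=True,
    gradient_checkpointing=True,
    eval_strategy="epoch",            # "evaluation_strategy" in older Transformers versions
    save_strategy="epoch",
    seed=42,
    optim="adafactor",
    load_best_model_at_end=True,      # needed for the EarlyStoppingCallback (patience=3)
)
```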

Hyperparameter Tuning

A hyperparameter search was conducted to find the optimal training configuration. The following hyperparameters were explored:

  • Learning Rates: [1e-5, 2e-5, 3e-5, 4e-5, 5e-5, 6e-5]
  • Weight Decays: [0.0, 0.01, 0.015, 0.001, 0.005]
  • Gradient Accumulation Steps: [4]

The best model was selected based on the lowest perplexity on the evaluation set.
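Perplexity here is presumably derived from the evaluation cross-entropy loss; a minimal sketch of that relationship:

```python
import math

def perplexity(eval_loss: float) -> float:
    """Perplexity is the exponential of the mean cross-entropy loss,
    so selecting by lowest perplexity is equivalent to selecting by
    lowest eval loss."""
    return math.exp(eval_loss)
```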

Best Parameters:

  • Learning Rate: 1e-05
  • Weight Decay: 0.0
  • Gradient Accumulation Steps: 4
  • Perplexity: N/A

Evaluation

The model was evaluated on a held-out test set using the following metrics:

  • ROUGE (Rouge1, Rouge2, RougeL): Measures the overlap of n-grams between the generated summaries and the reference summaries.
  • BERTScore (Precision, Recall, F1): Calculates semantic similarity between the generated and reference summaries using BERT embeddings.
  • Compression Ratio: Measures the ratio of the length of the generated summary to the length of the original document (sentence-based).
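The sentence-based compression ratio can be sketched as below. The naive punctuation-based sentence splitter is an assumption for illustration, not necessarily the exact splitter used in evaluation:

```python
import re

def compression_ratio(source: str, summary: str) -> float:
    """Ratio of summary length to source length, counted in sentences.

    Sentences are naively delimited by '.', '!', or '?' (an assumption;
    a real evaluation would likely use a proper sentence tokenizer).
    """
    def n_sentences(text: str) -> int:
        return len([s for s in re.split(r"[.!?]+", text) if s.strip()])
    return n_sentences(summary) / max(n_sentences(source), 1)
```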

Evaluation Results:

Evaluation metrics were not calculated during training, so results are not currently available.

Usage

Here's how to use the model for inference:

from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

# Replace with the actual model repository ID.
tokenizer = AutoTokenizer.from_pretrained("your-huggingface-username/t5-small-texas-legislative-summarization")
model = AutoModelForSeq2SeqLM.from_pretrained("your-huggingface-username/t5-small-texas-legislative-summarization")

def summarize(text):
    # T5 expects the same task prefix used during fine-tuning.
    input_text = "summarize: " + text
    # Truncate to the maximum source length used during training.
    input_ids = tokenizer.encode(input_text, return_tensors="pt", max_length=4979, truncation=True)
    summary_ids = model.generate(input_ids,
                                 max_length=752,
                                 num_beams=4,
                                 early_stopping=True)
    summary = tokenizer.decode(summary_ids[0], skip_special_tokens=True)
    return summary

# Example usage:
bill_text = "Your long Texas legislative bill text here..."
summary = summarize(bill_text)
print(summary)