T5-small Texas Legislative Summarization

This model is a fine-tuned version of google/t5-small for summarizing Texas legislative bills. It was trained on a dataset of Texas legislative bills and their corresponding summaries, with a focus on long document summarization.

Model Details

  • Model Name: T5-small Texas Legislative Summarization
  • Base Model: google/t5-small
  • Model Type: Seq2Seq Language Model
  • Architecture: T5ForConditionalGeneration
  • Language: English
  • License: Apache 2.0

Model Description

Note that this model is provided as an example use case rather than a fully optimized system. It takes the enrolled text of a Texas legislative bill as input and generates a concise summary, aiming to capture the key points of the bill in a shorter, more accessible format. Because legislative documents are long and complex, the model is designed for long-document summarization.

Intended Use

This model can be used for:

  • Summarizing Texas legislative bills for easier understanding.
  • Providing a quick overview of bill content for researchers, journalists, and the general public.
  • Automating the summarization process to save time and resources.

Training Data

The model was trained on a custom dataset of Texas legislative bills and their summaries. The dataset was created from:

  • Source: cleaned_texas_leg_data.json (This file is not publicly available and would need to be replaced with a public dataset or a description of how to create one).
  • Source Text Column: enrolled_text
  • Target Text Column: summary_text
  • Data Preprocessing: Data was loaded, split into training and testing sets (80/20 split), and tokenized using the T5 tokenizer.
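The split step above can be sketched as follows. This is a minimal illustration, not the original preprocessing script; it assumes each record in the JSON file carries the `enrolled_text` and `summary_text` keys named above, and the `split_records` helper name is hypothetical:

```python
import random

def split_records(records, seed=42, test_frac=0.2):
    """Reproducibly shuffle and split records into train/test (80/20 by default).

    Each record is assumed to be a dict with 'enrolled_text' and
    'summary_text' keys, matching the source/target columns above.
    """
    records = list(records)
    rng = random.Random(seed)
    rng.shuffle(records)
    cut = int(len(records) * (1 - test_frac))
    return records[:cut], records[cut:]
```

Tokenization would then apply the T5 tokenizer (with the "summarize: " prefix) to each split.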

Training Procedure

The model was fine-tuned using the following parameters:

  • Model: google/t5-small
  • Training Framework: Transformers library using Seq2SeqTrainer
  • Optimizer: Adafactor
  • Loss Function: Cross-Entropy Loss
  • Epochs: 5
  • Batch Size: 1 (per device)
  • Gradient Accumulation Steps: 4
  • Learning Rate: 1e-05
  • Weight Decay: 0.0
  • FP16 Training: Enabled
  • Gradient Checkpointing: Enabled
  • Evaluation Strategy: Epoch
  • Save Strategy: Epoch
  • Early Stopping: Enabled (patience=3, threshold=0.01)
  • Random Seed: 42
  • Max Source Length: 4979
  • Max Target Length: 752
  • Prefix: "summarize: "
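The parameters above map onto the Transformers Seq2SeqTrainingArguments roughly as sketched below. This is an assumed reconstruction, not the original training script; the output directory name is hypothetical, and some argument names vary across Transformers versions:

```python
from transformers import Seq2SeqTrainingArguments

training_args = Seq2SeqTrainingArguments(
    output_dir="t5-small-texas-leg",  # hypothetical output path
    num_train_epochs=5,
    per_device_train_batch_size=1,
    gradient_accumulation_steps=4,
    learning_rate=1e-5,
    weight_decay=0.0,
    fp16=True,
    gradient_checkpointing=True,
    eval_strategy="epoch",            # "evaluation_strategy" in older Transformers versions
    save_strategy="epoch",
    seed=42,
    optim="adafactor",
    load_best_model_at_end=True,      # needed for the EarlyStoppingCallback (patience=3)
)
```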

Hyperparameter Tuning

A hyperparameter search was conducted to find the optimal training configuration. The following hyperparameters were explored:

  • Learning Rates: [1e-5, 2e-5, 3e-5, 4e-5, 5e-5, 6e-5]
  • Weight Decays: [0.0, 0.01, 0.015, 0.001, 0.005]
  • Gradient Accumulation Steps: [4]

The best model was selected based on the lowest perplexity on the evaluation set.
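Perplexity here is presumably derived from the evaluation cross-entropy loss; a minimal sketch of that relationship:

```python
import math

def perplexity(eval_loss: float) -> float:
    """Perplexity is the exponential of the mean cross-entropy loss,
    so selecting by lowest perplexity is equivalent to selecting by
    lowest eval loss."""
    return math.exp(eval_loss)
```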

Best Parameters:

  • Learning Rate: 1e-05
  • Weight Decay: 0.0
  • Gradient Accumulation Steps: 4
  • Perplexity: N/A

Evaluation

The model was evaluated on a held-out test set using the following metrics:

  • ROUGE (Rouge1, Rouge2, RougeL): Measures the overlap of n-grams between the generated summaries and the reference summaries.
  • BERTScore (Precision, Recall, F1): Calculates semantic similarity between the generated and reference summaries using BERT embeddings.
  • Compression Ratio: Measures the ratio of the length of the generated summary to the length of the original document (sentence-based).
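The sentence-based compression ratio can be sketched as below. The naive punctuation-based sentence splitter is an assumption for illustration, not necessarily the exact splitter used in evaluation:

```python
import re

def compression_ratio(source: str, summary: str) -> float:
    """Ratio of summary length to source length, counted in sentences.

    Sentences are naively delimited by '.', '!', or '?' (an assumption;
    a real evaluation would likely use a proper sentence tokenizer).
    """
    def n_sentences(text: str) -> int:
        return len([s for s in re.split(r"[.!?]+", text) if s.strip()])
    return n_sentences(summary) / max(n_sentences(source), 1)
```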

Evaluation Results:

Evaluation metrics were not calculated during training, so results are not currently available.

Usage

Here's how to use the model for inference:

from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

# Replace with the actual model repository ID.
tokenizer = AutoTokenizer.from_pretrained("your-huggingface-username/t5-small-texas-legislative-summarization")
model = AutoModelForSeq2SeqLM.from_pretrained("your-huggingface-username/t5-small-texas-legislative-summarization")

def summarize(text):
    # T5 expects the same task prefix used during fine-tuning.
    input_text = "summarize: " + text
    # Truncate to the maximum source length used during training.
    input_ids = tokenizer.encode(input_text, return_tensors="pt", max_length=4979, truncation=True)
    summary_ids = model.generate(input_ids,
                                 max_length=752,
                                 num_beams=4,
                                 early_stopping=True)
    summary = tokenizer.decode(summary_ids[0], skip_special_tokens=True)
    return summary

# Example usage:
bill_text = "Your long Texas legislative bill text here..."
summary = summarize(bill_text)
print(summary)