# BERT-Based Company Description to Relevant Laws Classifier

## Model Overview

This model predicts relevant laws for a given company description. It is built on the `bert-base-uncased` transformer, fine-tuned on a dataset of company descriptions paired with their associated laws. Its goal is to assist in matching company activities with applicable legal frameworks.
## Intended Use
This model is intended for legal professionals, compliance departments, or legal tech applications that need to map business activities or descriptions to relevant laws or regulations. The model can predict multiple laws from a company description, making it useful in a variety of legal or regulatory scenarios.
## Model Architecture

- Model: BERT (`bert-base-uncased`)
- Task: Sequence classification
- Input: Company description (text)
- Output: Top 10 relevant laws, ranked by logit score (multi-class classification)
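
For reference, a minimal sketch of how this setup maps onto the Hugging Face Transformers API; `num_laws` is a placeholder for the number of distinct laws in the training data, not a value stated in this card:

```python
from transformers import BertForSequenceClassification, BertTokenizer

num_laws = 500  # placeholder: number of distinct laws in the training set

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertForSequenceClassification.from_pretrained(
    "bert-base-uncased",
    num_labels=num_laws,  # one logit per law; the top 10 are taken at inference
)
```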
## Training Data
The model was fine-tuned on a dataset of company descriptions and their associated relevant laws:
- Company Description: Describes the company's activities, services, and operations.
- Relevant Laws: Legal statutes or regulations that apply to the company's activities.
The dataset consists of two columns, "Company Description" and "Relevant Laws," and was preprocessed by removing null values and tokenizing the text before training.
## Preprocessing

- Tokenized using `BertTokenizer` with truncation and padding.
- Encoded the `Relevant Laws` values as integer labels using label encoding.
- Split into training and validation sets (80% train, 20% validation).
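
A minimal sketch of these steps, assuming the data is loaded into a pandas DataFrame with the two columns above (the file name and variable names are illustrative):

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder
from transformers import BertTokenizer

df = pd.read_csv("company_laws.csv")  # hypothetical file name
df = df.dropna(subset=["Company Description", "Relevant Laws"])

# Encode each law as an integer class label
label_encoder = LabelEncoder()
df["label"] = label_encoder.fit_transform(df["Relevant Laws"])
# Mapping from class id back to law name, used again at inference time
reverse_label_mapping = dict(enumerate(label_encoder.classes_))

# 80/20 train/validation split
train_texts, val_texts, train_labels, val_labels = train_test_split(
    df["Company Description"].tolist(), df["label"].tolist(),
    test_size=0.2, random_state=42,
)

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
train_encodings = tokenizer(train_texts, truncation=True, padding="max_length", max_length=128)
val_encodings = tokenizer(val_texts, truncation=True, padding="max_length", max_length=128)
```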
## Model Training
- Optimizer: AdamW
- Learning Rate: 2e-5
- Batch Size: 8
- Epochs: 20
- Loss Function: Cross-entropy
- Evaluation: Model performance was evaluated after each epoch.
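
A condensed sketch of a training loop matching these settings; it assumes the `train_encodings`, `train_labels`, and `label_encoder` variables from the preprocessing sketch above, and simplifies device and checkpoint handling:

```python
import torch
from torch.optim import AdamW
from torch.utils.data import DataLoader, TensorDataset
from transformers import BertForSequenceClassification

device = "cuda" if torch.cuda.is_available() else "cpu"
model = BertForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=len(label_encoder.classes_)
).to(device)

train_dataset = TensorDataset(
    torch.tensor(train_encodings["input_ids"]),
    torch.tensor(train_encodings["attention_mask"]),
    torch.tensor(train_labels),
)
train_loader = DataLoader(train_dataset, batch_size=8, shuffle=True)
optimizer = AdamW(model.parameters(), lr=2e-5)

for epoch in range(20):
    model.train()
    for input_ids, attention_mask, labels in train_loader:
        optimizer.zero_grad()
        # Passing labels makes the model compute cross-entropy loss internally
        outputs = model(
            input_ids=input_ids.to(device),
            attention_mask=attention_mask.to(device),
            labels=labels.to(device),
        )
        outputs.loss.backward()
        optimizer.step()
    # Validation metrics are computed after each epoch (see Model Evaluation)
```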
## Model Evaluation

The model was evaluated on the validation set, and performance metrics such as accuracy and loss were recorded after each epoch. At inference time, the model returns the top 10 predicted relevant laws for each company description, ranked by classification logits.
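
A minimal sketch of such a validation pass, continuing from the training sketch and assuming a `val_loader` built from the validation split the same way as `train_loader`:

```python
import torch

model.eval()
correct, total, total_loss = 0, 0, 0.0
with torch.no_grad():
    for input_ids, attention_mask, labels in val_loader:
        outputs = model(
            input_ids=input_ids.to(device),
            attention_mask=attention_mask.to(device),
            labels=labels.to(device),
        )
        total_loss += outputs.loss.item()
        preds = outputs.logits.argmax(dim=-1)
        correct += (preds == labels.to(device)).sum().item()
        total += labels.size(0)
print(f"val loss: {total_loss / len(val_loader):.4f}  val accuracy: {correct / total:.4f}")
```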
## Usage
You can use this model to predict relevant laws for any company description. The model takes a text input and outputs the top 10 predicted laws.
```python
import torch
from transformers import BertForSequenceClassification, BertTokenizer

# Load the fine-tuned model and tokenizer (assuming the checkpoint was pushed
# to the Hub as sethanimesh/legal-bert); `reverse_label_mapping` (class id ->
# law name) comes from the label-encoding step during preprocessing.
tokenizer = BertTokenizer.from_pretrained("sethanimesh/legal-bert")
model = BertForSequenceClassification.from_pretrained("sethanimesh/legal-bert")
model.eval()

def predict_top_10(company_description):
    # Tokenize the company description
    inputs = tokenizer(company_description, padding='max_length', truncation=True,
                       return_tensors='pt', max_length=128)
    # Predict the logits
    with torch.no_grad():
        outputs = model(**inputs)
        logits = outputs.logits
    # Get the top 10 predictions (highest logits)
    top_10_predictions = torch.topk(logits, 10).indices[0].tolist()
    # Map the class ids back to the relevant law names
    top_10_laws = [reverse_label_mapping[prediction] for prediction in top_10_predictions]
    return top_10_laws
```
## Example
```python
company_description = """
NexaVerse Ltd. is a versatile company operating across e-commerce, healthcare, and financial services, committed to enhancing user experience in each domain.
Its e-commerce platform specializes in providing quality electronics, home appliances, and lifestyle products with a focus on customer satisfaction and reliability.
In healthcare, NexaVerse runs a secure telemedicine service that connects patients with certified healthcare professionals, making affordable healthcare accessible.
Through its financial services, the company promotes financial inclusion by offering digital payment solutions like a mobile wallet and micro-credit options,
empowering users with accessible and secure financial tools for small-scale transactions.
"""

top_10_laws = predict_top_10(company_description)
print("Top 10 Predicted Relevant Laws:")
for i, law in enumerate(top_10_laws, 1):
    print(f"{i}. {law}")
```
## Limitations
- Data-Dependent: The performance of the model heavily relies on the quality and variety of the dataset used for training. It may not generalize well to company descriptions that are very different from the training data.
- Not Context-Aware: The model does not have a deep understanding of legal nuance; it relies on statistical patterns learned during training rather than legal reasoning.
- Limited to English: The model only works with English text. Company descriptions in other languages would require translation prior to input.
## Future Improvements
- Additional Training Data: Expanding the dataset with more laws and diverse company descriptions would help improve model accuracy.
- Domain Adaptation: Fine-tuning the model on specific sectors or legal areas (e.g., healthcare, finance) could enhance its performance in specialized domains.
## Ethical Considerations
This model is intended to assist in legal matching but should not be used as a replacement for professional legal advice. Misinterpretation of predictions can lead to incorrect legal compliance actions.
Date: October 28, 2024
License: Apache 2.0