# BERT-Based Company Description to Relevant Laws Classifier

## Model Overview

This model predicts relevant laws for a given company description. It is built on the `bert-base-uncased` transformer, fine-tuned on a dataset of company descriptions paired with their associated laws. Its goal is to assist in matching company activities with applicable legal frameworks.
## Intended Use
This model is intended for legal professionals, compliance departments, or legal tech applications that need to map business activities or descriptions to relevant laws or regulations. The model can predict multiple laws from a company description, making it useful in a variety of legal or regulatory scenarios.
## Model Architecture

- Model: BERT (`bert-base-uncased`)
- Task: Sequence classification
- Input: Company description (text)
- Output: Top 10 relevant laws, ranked by logit score (multi-class classification)
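
For reference, a minimal sketch of how this setup maps onto the Hugging Face Transformers API; `num_laws` is a placeholder for the number of distinct laws in the training data, not a value stated in this card:

```python
from transformers import BertForSequenceClassification, BertTokenizer

num_laws = 500  # placeholder: number of distinct laws in the training set

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertForSequenceClassification.from_pretrained(
    "bert-base-uncased",
    num_labels=num_laws,  # one logit per law; the top 10 are taken at inference
)
```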
## Training Data
The model was fine-tuned on a dataset of company descriptions and their associated relevant laws:
- Company Description: Describes the company's activities, services, and operations.
- Relevant Laws: Legal statutes or regulations that apply to the company's activities.
The dataset consists of two columns, "Company Description" and "Relevant Laws," and was preprocessed by removing null values and tokenizing the text before training.
## Preprocessing

- Tokenized using `BertTokenizer` with truncation and padding.
- Encoded the `Relevant Laws` values as integer labels using label encoding.
- Split into training and validation sets (80% train, 20% validation).
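
A minimal sketch of these steps, assuming the data is loaded into a pandas DataFrame with the two columns above (the file name and variable names are illustrative):

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder
from transformers import BertTokenizer

df = pd.read_csv("company_laws.csv")  # hypothetical file name
df = df.dropna(subset=["Company Description", "Relevant Laws"])

# Encode each law as an integer class label
label_encoder = LabelEncoder()
df["label"] = label_encoder.fit_transform(df["Relevant Laws"])
# Mapping from class id back to law name, used again at inference time
reverse_label_mapping = dict(enumerate(label_encoder.classes_))

# 80/20 train/validation split
train_texts, val_texts, train_labels, val_labels = train_test_split(
    df["Company Description"].tolist(), df["label"].tolist(),
    test_size=0.2, random_state=42,
)

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
train_encodings = tokenizer(train_texts, truncation=True, padding="max_length", max_length=128)
val_encodings = tokenizer(val_texts, truncation=True, padding="max_length", max_length=128)
```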
## Model Training
- Optimizer: AdamW
- Learning Rate: 2e-5
- Batch Size: 8
- Epochs: 20
- Loss Function: Cross-entropy
- Evaluation: Model performance was evaluated after each epoch.
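
A condensed sketch of a training loop matching these settings; it assumes the `train_encodings`, `train_labels`, and `label_encoder` variables from the preprocessing sketch above, and simplifies device and checkpoint handling:

```python
import torch
from torch.optim import AdamW
from torch.utils.data import DataLoader, TensorDataset
from transformers import BertForSequenceClassification

device = "cuda" if torch.cuda.is_available() else "cpu"
model = BertForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=len(label_encoder.classes_)
).to(device)

train_dataset = TensorDataset(
    torch.tensor(train_encodings["input_ids"]),
    torch.tensor(train_encodings["attention_mask"]),
    torch.tensor(train_labels),
)
train_loader = DataLoader(train_dataset, batch_size=8, shuffle=True)
optimizer = AdamW(model.parameters(), lr=2e-5)

for epoch in range(20):
    model.train()
    for input_ids, attention_mask, labels in train_loader:
        optimizer.zero_grad()
        # Passing labels makes the model compute cross-entropy loss internally
        outputs = model(
            input_ids=input_ids.to(device),
            attention_mask=attention_mask.to(device),
            labels=labels.to(device),
        )
        outputs.loss.backward()
        optimizer.step()
    # Validation metrics are computed after each epoch (see Model Evaluation)
```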
## Model Evaluation

The model was evaluated on the validation set, and performance metrics such as accuracy and loss were recorded after each epoch. At inference time, the model returns the top 10 predicted relevant laws for each company description, ranked by classification logits.
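
A minimal sketch of such a validation pass, continuing from the training sketch and assuming a `val_loader` built from the validation split the same way as `train_loader`:

```python
import torch

model.eval()
correct, total, total_loss = 0, 0, 0.0
with torch.no_grad():
    for input_ids, attention_mask, labels in val_loader:
        outputs = model(
            input_ids=input_ids.to(device),
            attention_mask=attention_mask.to(device),
            labels=labels.to(device),
        )
        total_loss += outputs.loss.item()
        preds = outputs.logits.argmax(dim=-1)
        correct += (preds == labels.to(device)).sum().item()
        total += labels.size(0)
print(f"val loss: {total_loss / len(val_loader):.4f}  val accuracy: {correct / total:.4f}")
```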
## Usage
You can use this model to predict relevant laws for any company description. The model takes a text input and outputs the top 10 predicted laws.
```python
import torch
from transformers import BertForSequenceClassification, BertTokenizer

# Load the fine-tuned model and tokenizer (assuming the checkpoint was pushed
# to the Hub as sethanimesh/legal-bert); `reverse_label_mapping` (class id ->
# law name) comes from the label-encoding step during preprocessing.
tokenizer = BertTokenizer.from_pretrained("sethanimesh/legal-bert")
model = BertForSequenceClassification.from_pretrained("sethanimesh/legal-bert")
model.eval()

def predict_top_10(company_description):
    # Tokenize the company description
    inputs = tokenizer(company_description, padding='max_length', truncation=True,
                       return_tensors='pt', max_length=128)
    # Predict the logits
    with torch.no_grad():
        outputs = model(**inputs)
        logits = outputs.logits
    # Get the top 10 predictions (highest logits)
    top_10_predictions = torch.topk(logits, 10).indices[0].tolist()
    # Map the class ids back to the relevant law names
    top_10_laws = [reverse_label_mapping[prediction] for prediction in top_10_predictions]
    return top_10_laws
```
## Example
```python
company_description = """
NexaVerse Ltd. is a versatile company operating across e-commerce, healthcare, and financial services, committed to enhancing user experience in each domain.
Its e-commerce platform specializes in providing quality electronics, home appliances, and lifestyle products with a focus on customer satisfaction and reliability.
In healthcare, NexaVerse runs a secure telemedicine service that connects patients with certified healthcare professionals, making affordable healthcare accessible.
Through its financial services, the company promotes financial inclusion by offering digital payment solutions like a mobile wallet and micro-credit options,
empowering users with accessible and secure financial tools for small-scale transactions.
"""

top_10_laws = predict_top_10(company_description)
print("Top 10 Predicted Relevant Laws:")
for i, law in enumerate(top_10_laws, 1):
    print(f"{i}. {law}")
```
## Limitations
- Data-Dependent: The performance of the model heavily relies on the quality and variety of the dataset used for training. It may not generalize well to company descriptions that are very different from the training data.
- Not Context-Aware: The model does not have a deep understanding of legal nuance; it relies on statistical patterns learned during training rather than legal reasoning.
- Limited to English: The model only works with English text. Company descriptions in other languages would require translation prior to input.
## Future Improvements
- Additional Training Data: Expanding the dataset with more laws and diverse company descriptions would help improve model accuracy.
- Domain Adaptation: Fine-tuning the model on specific sectors or legal areas (e.g., healthcare, finance) could enhance its performance in specialized domains.
## Ethical Considerations
This model is intended to assist in legal matching but should not be used as a replacement for professional legal advice. Misinterpretation of predictions can lead to incorrect legal compliance actions.
Date: October 28, 2024
License: Apache 2.0