FinAI-BERT

FinAI-BERT is a domain-specific BERT-based model fine-tuned for detecting AI-related disclosures in financial texts.

Intended Use

FinAI-BERT is designed to assist researchers, analysts, and regulators in identifying AI narratives in financial disclosures at the sentence level.

Performance

  • Accuracy: 99.37%
  • F1-score: 0.993
  • ROC AUC: 1.000
  • Brier Score: 0.0000

Training Data

FinAI-BERT was fine-tuned on a manually annotated dataset of sentences drawn from U.S. bank annual reports spanning 2015 to 2023. The final training set was a balanced sample of 1,586 sentences: 793 labeled AI-related and 793 non-AI. The model was initialized from the bert-base-uncased checkpoint.
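A minimal sketch of the deduplication and balancing step, assuming the annotated data is a CSV with `sentence` and `label` columns (the column names are hypothetical; the actual file, `FinAI-BERT training data.csv`, is in the supplementary material):

```python
import pandas as pd

# Load the annotated sentences and drop exact duplicates
df = pd.read_csv("FinAI-BERT training data.csv")
df = df.drop_duplicates(subset="sentence")

# Downsample the majority class so both classes have equal counts (793 each)
n = df["label"].value_counts().min()
balanced = (
    df.groupby("label", group_keys=False)
      .apply(lambda g: g.sample(n=n, random_state=42))
      .reset_index(drop=True)
)
```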

Training

| Setting | Value |
|---|---|
| Base model | bert-base-uncased |
| Epochs | 3 |
| Batch size | 8 (train & eval) |
| Max sequence length | 128 |
| Optimizer / LR scheduler | Hugging Face Trainer defaults (AdamW, lr 5e-5) |
| Hardware | Google Colab GPU (T4) |
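These settings correspond to a standard Hugging Face Trainer run. Below is a minimal fine-tuning sketch with the values from the table; `train_ds` and `eval_ds` are assumed to be tokenized datasets built from the annotated sentences described above:

```python
from transformers import (AutoTokenizer, AutoModelForSequenceClassification,
                          Trainer, TrainingArguments)

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=2)

args = TrainingArguments(
    output_dir="finai-bert",
    num_train_epochs=3,             # Epochs: 3
    per_device_train_batch_size=8,  # Batch size: 8 (train)
    per_device_eval_batch_size=8,   # Batch size: 8 (eval)
    # Optimizer / LR scheduler left at Trainer defaults (AdamW, lr 5e-5)
)

# Sentences are tokenized with truncation/padding to the 128-token maximum, e.g.:
# tokenizer(batch["sentence"], truncation=True, padding="max_length", max_length=128)

trainer = Trainer(model=model, args=args, train_dataset=train_ds, eval_dataset=eval_ds)
trainer.train()
```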

Evaluation & Robustness

  • Benchmarked against Logistic Regression, Naive Bayes, Random Forest, and XGBoost (TF-IDF features); FinAI-BERT scored highest on F1 (see the metrics sketch after this list).
  • Calibration checked via the Brier score (0 = perfect calibration).
  • SHAP analysis shows the model focuses on meaningful cues (e.g., "machine learning", "AI-powered") rather than noise, supporting interpretability and trust.
  • Robust to:
    • Year-by-year slices (2015–2023, all F1 ≥ 0.99).
    • Adversarial / edge-case sentences (100% correct in a manual test).
    • Sentence-length bias (Pearson r ≈ 0.19, a weak correlation, indicating no substantial bias).
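As a rough illustration of how the reported metrics can be computed, a minimal scikit-learn sketch (`y_true`, `y_pred`, and `y_prob` are assumed to come from a held-out evaluation split; they are not included in this card):

```python
from sklearn.metrics import accuracy_score, f1_score, roc_auc_score, brier_score_loss

# y_true: gold labels (0/1); y_pred: predicted labels; y_prob: predicted P(AI)
accuracy = accuracy_score(y_true, y_pred)
f1 = f1_score(y_true, y_pred)
auc = roc_auc_score(y_true, y_prob)
brier = brier_score_loss(y_true, y_prob)  # 0 = perfectly calibrated
```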

Files Included

  • config.json, tokenizer.json, vocab.txt, model.safetensors: Model files
  • tokenizer_config.json, special_tokens_map.json: Tokenizer configuration

Supplementary Material

The following supplementary files are provided to support transparency, academic integrity, and reproducibility of the FinAI-BERT model. All materials are included in the archive titled "Supplementary Material.zip":

  • FinAI-BERT Training Data Extraction.ipynb – Python notebook for corpus preprocessing, sentence segmentation, and annotation.
  • ai_seedwords.csv – Lexicon of AI-related terms used to guide weak supervision during annotation.
  • FinAI-BERT training data.csv – Annotated dataset containing AI and Non-AI sentences before deduplication and balancing.
  • FinAI-BERT.ipynb – Notebook for model training, evaluation, and explainability using SHAP.
  • SHAP_Visuals/ – Folder containing SHAP token attribution visualizations.

Citation

If you use this model, please cite the following paper:

  • Zafar, M. B. (2025). FinAI-BERT: A transformer-based model for sentence-level detection of AI disclosures in financial reports. SSRN. https://ssrn.com/abstract=5318604

Contact

For questions or feedback, please contact me at [email protected]

Usage

```python
from transformers import AutoTokenizer, AutoModelForSequenceClassification, pipeline

# Load the fine-tuned model and tokenizer from the Hugging Face Hub
tokenizer = AutoTokenizer.from_pretrained("bilalzafar/FinAI-BERT")
model = AutoModelForSequenceClassification.from_pretrained("bilalzafar/FinAI-BERT")

# Inference example
classifier = pipeline("text-classification", model=model, tokenizer=tokenizer)
result = classifier("We are integrating AI into our credit risk management systems.")
print(result)
# Note: label 1 = AI-related, label 0 = non-AI
```
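The pipeline also accepts a list of sentences for batch scoring (a small usage sketch; the example sentences are illustrative):

```python
sentences = [
    "Our chatbot uses natural language processing to answer customer queries.",
    "The board approved a quarterly dividend of $0.42 per share.",
]
for sentence, out in zip(sentences, classifier(sentences)):
    print(out["label"], round(out["score"], 3), "-", sentence)
```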