---
language: fr
license: mit
tags:
- roberta
- token-classification
base_model: almanach/camembertv2-base
datasets:
- FTB-NER
metrics:
- f1
pipeline_tag: token-classification
library_name: transformers
model-index:
- name: almanach/camembertv2-base-ftb-ner
  results:
  - task:
      type: token-classification
      name: Named Entity Recognition (NER)
    dataset:
      type: ftb-ner
      name: French Treebank Named Entity Recognition
    metrics:
    - name: f1
      type: f1
      value: 0.93548
      verified: false
---

# Model Card for almanach/camembertv2-base-ftb-ner

almanach/camembertv2-base-ftb-ner is a RoBERTa-architecture model for token classification, fine-tuned on the FTB-NER dataset for Named Entity Recognition (NER). It reaches an evaluation F1 score of 0.93548 on FTB-NER (the `eval_f1` value listed under Training Hyperparameters).

This model is one of several fine-tuned variants of almanach/camembertv2-base.

## Model Details

### Model Description

- **Developed by:** Wissam Antoun (PhD student at Almanach, Inria-Paris)
- **Model type:** roberta
- **Language(s) (NLP):** French
- **License:** MIT
- **Fine-tuned from model:** almanach/camembertv2-base

### Model Sources

- **Repository:** https://github.com/WissamAntoun/camemberta
- **Paper:** https://arxiv.org/abs/2411.08868

## Uses

The model is intended for Named Entity Recognition (NER) on French text, framed as a token classification task.

## Bias, Risks, and Limitations

The model may reproduce biases present in its training data. Because it was fine-tuned only on FTB-NER, it may not generalize well to other datasets, domains, or tasks.

## How to Get Started with the Model

Use the code below to get started with the model.

```python
from transformers import AutoModelForTokenClassification, AutoTokenizer, pipeline

# Load the fine-tuned NER model and its tokenizer from the Hugging Face Hub.
model = AutoModelForTokenClassification.from_pretrained("almanach/camembertv2-base-ftb-ner")
tokenizer = AutoTokenizer.from_pretrained("almanach/camembertv2-base-ftb-ner")

# Build a token-classification pipeline around the loaded model.
classifier = pipeline("token-classification", model=model, tokenizer=tokenizer)

# Returns one prediction per token that received an entity label.
classifier("Votre texte ici")
```
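
Without aggregation, the pipeline returns one dictionary per labelled token (with keys such as `entity`, `start`, and `end`), so multi-token entities arrive in pieces. The sketch below shows one way to merge them into spans in plain Python. It assumes BIO-style labels, and the `PER` and `LOC` label names and the sample predictions are hypothetical placeholders, not this model's actual label set. Passing `aggregation_strategy="simple"` to `pipeline` is the built-in alternative.

```python
from typing import Dict, List


def group_entities(text: str, tokens: List[Dict]) -> List[Dict]:
    """Merge consecutive token predictions into entity spans.

    Assumes BIO-style labels ("B-X", "I-X", "O") and that every token
    dictionary carries "entity", "start", and "end" keys.
    """
    spans: List[Dict] = []
    current = None
    for tok in tokens:
        label = tok["entity"]
        if label == "O":
            # An outside token closes any open span.
            current = None
            continue
        prefix, sep, ent_type = label.partition("-")
        if not sep:
            # Label without a BIO prefix: treat it as the start of an entity.
            prefix, ent_type = "B", label
        # Extend the open span only for an I- tag of the same entity type.
        if current is not None and prefix == "I" and current["type"] == ent_type:
            current["end"] = tok["end"]
        else:
            current = {"type": ent_type, "start": tok["start"], "end": tok["end"]}
            spans.append(current)
    # Recover the surface text of each span from its character offsets.
    for span in spans:
        span["text"] = text[span["start"]:span["end"]]
    return spans


# Hypothetical predictions, shaped like the pipeline's un-aggregated output.
text = "Emmanuel Macron visite Lyon."
predictions = [
    {"entity": "B-PER", "start": 0, "end": 8},
    {"entity": "I-PER", "start": 9, "end": 15},
    {"entity": "B-LOC", "start": 23, "end": 27},
]
spans = group_entities(text, predictions)
for span in spans:
    print(span["type"], span["text"])
```

With real model output, replace `predictions` with the list returned by `classifier(text)`; the label names will then be whatever the model's configuration defines.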

## Training Details

### Training Data

The model is trained on the FTB-NER dataset.

- Dataset Name: FTB-NER
- Dataset Size:
    - Train: 9881
    - Dev: 1235
    - Test: 1235
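
As a quick check on the split sizes listed above, the short sketch below computes each split's share of the total. The dictionary simply restates the three counts from the list; nothing else is assumed.

```python
# Split sizes as listed in this model card.
splits = {"train": 9881, "dev": 1235, "test": 1235}

total = sum(splits.values())
shares = {name: count / total for name, count in splits.items()}

for name, count in splits.items():
    print(f"{name}: {count} ({shares[name]:.1%})")
```

The counts correspond to a split of roughly 80/10/10.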

### Training Procedure

The model was trained with the `run_ner.py` script from the Hugging Face repository.

#### Training Hyperparameters

```yml
accelerator_config: '{''split_batches'': False, ''dispatch_batches'': None, ''even_batches'': True, ''use_seedable_sampler'': True, ''non_blocking'': False, ''gradient_accumulation_kwargs'': None}'
adafactor: false
adam_beta1: 0.9
adam_beta2: 0.999
adam_epsilon: 1.0e-08
auto_find_batch_size: false
base_model: camembertv2
base_model_name: camembertv2-base-bf16-p2-17000
batch_eval_metrics: false
bf16: false
bf16_full_eval: false
data_seed: 1337.0
dataloader_drop_last: false
dataloader_num_workers: 0
dataloader_persistent_workers: false
dataloader_pin_memory: true
dataloader_prefetch_factor: .nan
ddp_backend: .nan
ddp_broadcast_buffers: .nan
ddp_bucket_cap_mb: .nan
ddp_find_unused_parameters: .nan
ddp_timeout: 1800
debug: '[]'
deepspeed: .nan
disable_tqdm: false
dispatch_batches: .nan
do_eval: true
do_predict: false
do_train: true
epoch: 8.0
eval_accumulation_steps: 4
eval_accuracy: 0.9937000109565028
eval_delay: 0
eval_do_concat_batches: true
eval_f1: 0.935483870967742
eval_loss: 0.0347304567694664
eval_on_start: false
eval_precision: 0.9362204724409448
eval_recall: 0.934748427672956
eval_runtime: 2.7702
eval_samples: 1235.0
eval_samples_per_second: 445.821
eval_steps: .nan
eval_steps_per_second: 55.953
eval_strategy: epoch
eval_use_gather_object: false
evaluation_strategy: epoch
fp16: false
fp16_backend: auto
fp16_full_eval: false
fp16_opt_level: O1
fsdp: '[]'
fsdp_config: '{''min_num_params'': 0, ''xla'': False, ''xla_fsdp_v2'': False, ''xla_fsdp_grad_ckpt'': False}'
fsdp_min_num_params: 0
fsdp_transformer_layer_cls_to_wrap: .nan
full_determinism: false
gradient_accumulation_steps: 2
gradient_checkpointing: false
gradient_checkpointing_kwargs: .nan
greater_is_better: true
group_by_length: false
half_precision_backend: auto
hub_always_push: false
hub_model_id: .nan
hub_private_repo: false
hub_strategy: every_save
hub_token: <HUB_TOKEN>
ignore_data_skip: false
include_inputs_for_metrics: false
include_num_input_tokens_seen: false
include_tokens_per_second: false
jit_mode_eval: false
label_names: .nan
label_smoothing_factor: 0.0
learning_rate: 5.000000000000001e-05
length_column_name: length
load_best_model_at_end: true
local_rank: 0
log_level: debug
log_level_replica: warning
log_on_each_node: true
logging_dir: /scratch/camembertv2/runs/results/ftb_ner/camembertv2-base-bf16-p2-17000/max_seq_length-192-gradient_accumulation_steps-2-precision-fp32-learning_rate-5.000000000000001e-05-epochs-8-lr_scheduler-linear-warmup_steps-0.1/SEED-1337/logs
logging_first_step: false
logging_nan_inf_filter: true
logging_steps: 100
logging_strategy: steps
lr_scheduler_kwargs: '{}'
lr_scheduler_type: linear
max_grad_norm: 1.0
max_steps: -1
metric_for_best_model: f1
mp_parameters: .nan
name: camembertv2/runs/results/ftb_ner/camembertv2-base-bf16-p2-17000/max_seq_length-192-gradient_accumulation_steps-2-precision-fp32-learning_rate-5.000000000000001e-05-epochs-8-lr_scheduler-linear-warmup_steps-0.1
neftune_noise_alpha: .nan
no_cuda: false
num_train_epochs: 8.0
optim: adamw_torch
optim_args: .nan
optim_target_modules: .nan
output_dir: /scratch/camembertv2/runs/results/ftb_ner/camembertv2-base-bf16-p2-17000/max_seq_length-192-gradient_accumulation_steps-2-precision-fp32-learning_rate-5.000000000000001e-05-epochs-8-lr_scheduler-linear-warmup_steps-0.1/SEED-1337
overwrite_output_dir: false
past_index: -1
per_device_eval_batch_size: 8
per_device_train_batch_size: 8
per_gpu_eval_batch_size: .nan
per_gpu_train_batch_size: .nan
prediction_loss_only: false
push_to_hub: false
push_to_hub_model_id: .nan
push_to_hub_organization: .nan
push_to_hub_token: <PUSH_TO_HUB_TOKEN>
ray_scope: last
remove_unused_columns: true
report_to: '[''tensorboard'']'
restore_callback_states_from_checkpoint: false
resume_from_checkpoint: .nan
run_name: /scratch/camembertv2/runs/results/ftb_ner/camembertv2-base-bf16-p2-17000/max_seq_length-192-gradient_accumulation_steps-2-precision-fp32-learning_rate-5.000000000000001e-05-epochs-8-lr_scheduler-linear-warmup_steps-0.1/SEED-1337
save_on_each_node: false
save_only_model: false
save_safetensors: true
save_steps: 500
save_strategy: epoch
save_total_limit: .nan
seed: 1337
skip_memory_metrics: true
split_batches: .nan
tf32: .nan
torch_compile: true
torch_compile_backend: inductor
torch_compile_mode: .nan
torch_empty_cache_steps: .nan
torchdynamo: .nan
total_flos: 2833132740217920.0
tpu_metrics_debug: false
tpu_num_cores: .nan
train_loss: 0.0880794880495777
train_runtime: 679.3683
train_samples: 9881
train_samples_per_second: 116.355
train_steps_per_second: 7.277
use_cpu: false
use_ipex: false
use_legacy_prediction_loop: false
use_mps_device: false
warmup_ratio: 0.1
warmup_steps: 0
weight_decay: 0.0
```
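
The batch and schedule settings above imply an effective batch size and a total number of optimizer updates that the dump does not state directly. The sketch below derives them with simple arithmetic. It assumes training ran on a single device, which the card does not state, and it approximates the usual step accounting rather than reproducing the training script.

```python
import math

# Values copied from the hyperparameter dump above.
train_samples = 9881
per_device_train_batch_size = 8
gradient_accumulation_steps = 2
num_train_epochs = 8
warmup_ratio = 0.1

# Assumption (not stated in the card): a single training device.
num_devices = 1

# Examples consumed per optimizer update.
effective_batch_size = (
    per_device_train_batch_size * gradient_accumulation_steps * num_devices
)

# Micro-batches per epoch, then optimizer updates per epoch.
batches_per_epoch = math.ceil(
    train_samples / (per_device_train_batch_size * num_devices)
)
updates_per_epoch = max(batches_per_epoch // gradient_accumulation_steps, 1)

total_updates = updates_per_epoch * num_train_epochs

# warmup_steps is 0 in the dump, so the warmup length comes from warmup_ratio.
warmup_updates = math.ceil(total_updates * warmup_ratio)

print(effective_batch_size, updates_per_epoch, total_updates, warmup_updates)
# prints: 16 618 4944 495
```

The single-device assumption is at least consistent with the logged throughput: `train_steps_per_second` (7.277) times `train_runtime` (679.3683 s) comes to about 4944 steps.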

#### Results

**F1-Score:** 0.93548
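
F1 is the harmonic mean of precision and recall, so the reported score can be checked against the `eval_precision` and `eval_recall` values in the hyperparameter dump. The sketch below only redoes that arithmetic; it does not rerun the evaluation.

```python
# Values copied from the hyperparameter dump (eval_precision, eval_recall).
precision = 0.9362204724409448
recall = 0.934748427672956

# F1 is the harmonic mean of precision and recall.
f1 = 2 * precision * recall / (precision + recall)

print(round(f1, 5))
# prints: 0.93548
```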

## Technical Specifications

### Model Architecture and Objective

RoBERTa architecture for token classification.

## Citation

**BibTeX:**

```bibtex
@misc{antoun2024camembert20smarterfrench,
      title={CamemBERT 2.0: A Smarter French Language Model Aged to Perfection},
      author={Wissam Antoun and Francis Kulumba and Rian Touchent and Éric de la Clergerie and Benoît Sagot and Djamé Seddah},
      year={2024},
      eprint={2411.08868},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2411.08868},
}
```