---
language: fr
tags:
- NER
- camembert
- literary-texts
- nested-entities
- propp-fr
license: apache-2.0
metrics:
- f1
- precision
- recall
base_model:
- almanach/camembert-large
pipeline_tag: token-classification
---

## INTRODUCTION:

This model, developed as part of the [propp-fr project](https://github.com/lattice-8094/fr-litbank), is a **NER model** built on top of [camembert-large](https://huggingface.co/almanach/camembert-large) embeddings and trained to predict nested entities in French literary texts.

The predicted entities are:

- mentions of characters (PER): pronouns (je, tu, il, ...), possessive pronouns (mon, ton, son, ...), common nouns (le capitaine, la princesse, ...) and proper nouns (Indiana Delmare, Honoré de Pardaillan, ...)
- facilities (FAC): château, sentier, chambre, couloir, ...
- time (TIME): le règne de Louis XIV, ce matin, en juillet, ...
- geo-political entities (GPE): Montrouge, France, le petit hameau, ...
- locations (LOC): le sud, Mars, l'océan, le bois, ...
- vehicles (VEH): avion, voitures, calèche, vélos, ...
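To make the idea of nested mentions concrete, here is a minimal sketch of how BIOES label sequences (the tagging scheme listed under TRAINING PARAMETERS) can be decoded into spans, one sequence per nesting level. The helper `bioes_to_spans`, the example phrase and the assignment of the outer mention to level 0 and the inner one to level 1 are illustrative assumptions, not the project's actual decoding code.

```python
from typing import List, Optional, Tuple

# (start token index, end token index inclusive, entity type)
Span = Tuple[int, int, str]


def bioes_to_spans(labels: List[str]) -> List[Span]:
    """Decode one BIOES label sequence into (start, end, type) spans.

    Hypothetical helper: malformed sequences (an E- or I- label with no
    opening B- label) are silently dropped.
    """
    spans: List[Span] = []
    start: Optional[int] = None  # index where the currently open entity began
    for i, label in enumerate(labels):
        prefix, _, etype = label.partition("-")
        if prefix == "S":  # single-token entity
            spans.append((i, i, etype))
            start = None
        elif prefix == "B":  # entity begins here
            start = i
        elif prefix == "E" and start is not None:  # entity ends here
            spans.append((start, i, etype))
            start = None
        elif prefix == "O":  # outside any entity
            start = None
        # an "I" label simply keeps the current entity open
    return spans


# Assumed example: a noun phrase containing a possessive pronoun,
# with one label sequence per nesting level.
tokens = ["le", "capitaine", "de", "mon", "navire"]
levels = {
    0: ["B-PER", "I-PER", "I-PER", "I-PER", "E-PER"],
    1: ["O", "O", "O", "S-PER", "O"],
}

for level, labels in levels.items():
    for start, end, etype in bioes_to_spans(labels):
        print(level, etype, " ".join(tokens[start:end + 1]))
```

Keeping one label sequence per level lets overlapping mentions coexist without any change to the per-token label set.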
## MODEL PERFORMANCES (LOOCV):

| NER_tag   | precision   | recall   | f1_score   | support   | support %   |
|-----------|-------------|----------|------------|-----------|-------------|
| PER       | 94.02%      | 95.99%   | 94.99%     | 4,162     | 100.00%     |
| micro_avg | 94.02%      | 95.99%   | 94.99%     | 4,162     | 100.00%     |
| macro_avg | 94.02%      | 95.99%   | 94.99%     | 4,162     | 100.00%     |

## TRAINING PARAMETERS:

- Entity types: ['PER']
- Tagging scheme: BIOES
- Nested entity levels: [0, 1]
- Split strategy: Leave-one-out cross-validation (31 files)
- Train/Validation split: 0.85 / 0.15
- Batch size: 16
- Initial learning rate: 0.00014

## MODEL ARCHITECTURE:

Model Input: Maximum context camembert-large embeddings (1024 dimensions)

- Locked Dropout: 0.5
- Projection layer:
  - layer type: highway layer
  - input: 1024 dimensions
  - output: 2048 dimensions
- BiLSTM layer:
  - input: 2048 dimensions
  - output: 256 dimensions (hidden state)
- Linear layer:
  - input: 256 dimensions
  - output: 5 dimensions (predicted labels with BIOES tagging scheme)
- CRF layer

Model Output: BIOES label sequence

## HOW TO USE:

*** IN CONSTRUCTION ***

## TRAINING CORPUS:

|    | Document                                                                        | Tokens Count   | Is included in model eval         |
|----|---------------------------------------------------------------------------------|----------------|-----------------------------------|
| 0  | 1731_Prévost-Antoine-François_Manon-Lescaut_PER-ONLY                            | 71,219 tokens  | False                             |
| 1  | 1830_Balzac-Honoré-de_La-maison-du-chat-qui-pelote                              | 24,776 tokens  | False                             |
| 2  | 1830_Balzac-Honoré-de_Sarrasine                                                 | 15,408 tokens  | False                             |
| 3  | 1832_Sand-George_Indiana_PER-ONLY                                               | 112,221 tokens | False                             |
| 4  | 1836_Gautier-Théophile_La-morte-amoureuse                                       | 14,293 tokens  | False                             |
| 5  | 1837_Balzac-Honoré-de_La-maison-Nucingen                                        | 30,030 tokens  | False                             |
| 6  | 1841_Sand-George_Pauline                                                        | 12,398 tokens  | False                             |
| 7  | 1856_Cousin-Victor_Madame-de-Hautefort                                          | 11,768 tokens  | False                             |
| 8  | 1863_Gautier-Théophile_Le-capitaine-Fracasse                                    | 11,848 tokens  | False                             |
| 9  | 1873_Zola-Émile_Le-ventre-de-Paris                                              | 12,613 tokens  | False                             |
| 10 | 1881_Flaubert-Gustave_Bouvard-et-Pécuchet                                       | 12,308 tokens  | False                             |
| 11 | 1882-1883_Maupassant-Guy-de_Mademoiselle-Fifi-La-buche                          | 2,267 tokens   | False                             |
| 12 | 1882-1883_Maupassant-Guy-de_Mademoiselle-Fifi-La-relique                        | 2,041 tokens   | False                             |
| 13 | 1882-1883_Maupassant-Guy-de_Mademoiselle-Fifi-La-rouille                        | 2,949 tokens   | True                              |
| 14 | 1882-1883_Maupassant-Guy-de_Mademoiselle-Fifi-Madame-Baptiste                   | 2,578 tokens   | True                              |
| 15 | 1882-1883_Maupassant-Guy-de_Mademoiselle-Fifi-Marocca                           | 4,078 tokens   | False                             |
| 16 | 1882-1883_Maupassant-Guy-de_Mademoiselle-Fifi-Nouveaux-contes-A-cheval          | 2,878 tokens   | False                             |
| 17 | 1882-1883_Maupassant-Guy-de_Mademoiselle-Fifi-Nouveaux-contes-Fou               | 1,905 tokens   | False                             |
| 18 | 1882-1883_Maupassant-Guy-de_Mademoiselle-Fifi-Nouveaux-contes-Mademoiselle-Fifi | 5,439 tokens   | True                              |
| 19 | 1882-1883_Maupassant-Guy-de_Mademoiselle-Fifi-Nouveaux-contes-Reveil            | 2,159 tokens   | False                             |
| 20 | 1882-1883_Maupassant-Guy-de_Mademoiselle-Fifi-Nouveaux-contes-Un-reveillon      | 2,364 tokens   | False                             |
| 21 | 1882-1883_Maupassant-Guy-de_Mademoiselle-Fifi-Nouveaux-contes-Une-ruse          | 2,469 tokens   | False                             |
| 22 | 1901_Achard-Lucie_Rosalie-de-Constant-sa-famille-et-ses-amis                    | 12,775 tokens  | False                             |
| 23 | 1903_Conan-Laure_Élisabeth-Seton                                                | 13,046 tokens  | False                             |
| 24 | 1904-1912_Rolland-Romain_Jean-Christophe(1)                                     | 10,982 tokens  | True                              |
| 25 | 1904-1912_Rolland-Romain_Jean-Christophe(2)                                     | 10,305 tokens  | False                             |
| 26 | 1917_Bourgeois-Adèle_Némoville                                                  | 12,468 tokens  | False                             |
| 27 | 1923_Delly_Dans-les-ruines                                                      | 95,617 tokens  | False                             |
| 28 | 1923_Radiguet-Raymond_Le-diable-au-corps                                        | 14,850 tokens  | False                             |
| 29 | 1926_Audoux-Marguerite_De-la-ville-au-moulin                                    | 12,144 tokens  | True                              |
| 30 | 1937_Audoux-Marguerite_Douce-Lumière                                            | 12,346 tokens  | False                             |
| 31 | TOTAL                                                                           | 554,542 tokens | 5 files used for cross-validation |

## PREDICTIONS CONFUSION MATRIX:

| Gold Labels   | PER   | O   | support   |
|---------------|-------|-----|-----------|
| PER           | 3,995 | 167 | 4,162     |
| O             | 254   | 0   | 254       |

## CONTACT:

mail: antoine [dot] bourgois [at] protonmail [dot] com
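As a consistency check, the scores in MODEL PERFORMANCES can be recomputed from the counts in PREDICTIONS CONFUSION MATRIX. The short sketch below assumes that the gold `O` / predicted `PER` cell counts spurious predictions (false positives) and that the gold `PER` / predicted `O` cell counts missed mentions (false negatives); the variable names are illustrative.

```python
# Counts copied from the PREDICTIONS CONFUSION MATRIX table.
true_positives = 3995   # gold PER, predicted PER
false_negatives = 167   # gold PER, predicted O (assumed: missed mentions)
false_positives = 254   # gold O, predicted PER (assumed: spurious mentions)

# Standard definitions of precision, recall and F1.
precision = true_positives / (true_positives + false_positives)
recall = true_positives / (true_positives + false_negatives)
f1_score = 2 * precision * recall / (precision + recall)

print(f"precision: {precision:.2%}")  # 94.02%
print(f"recall:    {recall:.2%}")     # 95.99%
print(f"f1_score:  {f1_score:.2%}")   # 94.99%
```

Under this reading, the recomputed values match the PER row of the performance table.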