Commit c67738c by AntoineBourgois · verified · 1 parent: 6064a11

Upload 3 files

Files changed (2)
  1. README.md +37 -37
  2. final_model.pkl +2 -2
README.md CHANGED
@@ -6,7 +6,7 @@ tags:
  - camembert
  - literary-texts
  - nested-entities
- - BookNLP-fr
+ - propp-fr
  license: apache-2.0
  metrics:
  - f1
@@ -18,10 +18,10 @@ pipeline_tag: token-classification
  ---

  ## INTRODUCTION:
- This model, developed as part of the [BookNLP-fr project](https://github.com/lattice-8094/fr-litbank), is a **NER model** built on top of [camembert-large](https://huggingface.co/almanach/camembert-large) embeddings, trained to predict nested entities in french, specifically for literary texts.
+ This model, developed as part of the [propp-fr project](https://github.com/lattice-8094/fr-litbank), is a **NER model** built on top of [camembert-large](https://huggingface.co/almanach/camembert-large) embeddings, trained to predict nested entities in french, specifically for literary texts.

  The predicted entities are:
- - mentions of characters (PER): pronouns (je, tu, il, ...), possessive pronouns (mon, ton, son, ...), common nouns (le capitaine, la princesse, ...) and proper nouns (Indiana Delmare, Honoré de Pardaillan, ...)
+ - mentions of characters (PER): pronouns (je, tu, il, ...), possessive pronouns (mon, ton, son, ...), common nouns (le capitaine, la princesse, ...) and proper nouns (Indiana Delmare, Honoré de Pardaillan, ...)
  - facilities (FAC): chatêau, sentier, chambre, couloir, ...
  - time (TIME): le règne de Louis XIV, ce matin, en juillet, ...
  - geo-political entities (GPE): Montrouge, France, le petit hameau, ...
@@ -70,40 +70,40 @@ Model Output: BIOES labels sequence
  *** IN CONSTRUCTION ***

  ## TRAINING CORPUS:
- | | Document | Tokens Count | Is included in model eval |
- |----|---------------------------------------------------------------------------------|----------------|------------------------------------|
- | 0 | 1731_Prévost-Antoine-François_Manon-Lescaut_PER-ONLY | 71,219 tokens | True |
- | 1 | 1830_Balzac-Honoré-de_La-maison-du-chat-qui-pelote | 24,776 tokens | True |
- | 2 | 1830_Balzac-Honoré-de_Sarrasine | 15,408 tokens | True |
- | 3 | 1832_Sand-George_Indiana_PER-ONLY | 112,221 tokens | True |
- | 4 | 1836_Gautier-Théophile_La-morte-amoureuse | 14,293 tokens | True |
- | 5 | 1837_Balzac-Honoré-de_La-maison-Nucingen | 30,030 tokens | True |
- | 6 | 1841_Sand-George_Pauline | 12,398 tokens | True |
- | 7 | 1856_Cousin-Victor_Madame-de-Hautefort | 11,768 tokens | True |
- | 8 | 1863_Gautier-Théophile_Le-capitaine-Fracasse | 11,848 tokens | True |
- | 9 | 1873_Zola-Émile_Le-ventre-de-Paris | 12,613 tokens | True |
- | 10 | 1881_Flaubert-Gustave_Bouvard-et-Pécuchet | 12,308 tokens | True |
- | 11 | 1882-1883_Maupassant-Guy-de_Mademoiselle-Fifi-La-buche | 2,267 tokens | True |
- | 12 | 1882-1883_Maupassant-Guy-de_Mademoiselle-Fifi-La-relique | 2,041 tokens | True |
- | 13 | 1882-1883_Maupassant-Guy-de_Mademoiselle-Fifi-La-rouille | 2,949 tokens | True |
- | 14 | 1882-1883_Maupassant-Guy-de_Mademoiselle-Fifi-Madame-Baptiste | 2,578 tokens | True |
- | 15 | 1882-1883_Maupassant-Guy-de_Mademoiselle-Fifi-Marocca | 4,078 tokens | True |
- | 16 | 1882-1883_Maupassant-Guy-de_Mademoiselle-Fifi-Nouveaux-contes-A-cheval | 2,878 tokens | True |
- | 17 | 1882-1883_Maupassant-Guy-de_Mademoiselle-Fifi-Nouveaux-contes-Fou | 1,905 tokens | True |
- | 18 | 1882-1883_Maupassant-Guy-de_Mademoiselle-Fifi-Nouveaux-contes-Mademoiselle-Fifi | 5,439 tokens | True |
- | 19 | 1882-1883_Maupassant-Guy-de_Mademoiselle-Fifi-Nouveaux-contes-Reveil | 2,159 tokens | True |
- | 20 | 1882-1883_Maupassant-Guy-de_Mademoiselle-Fifi-Nouveaux-contes-Un-reveillon | 2,364 tokens | True |
- | 21 | 1882-1883_Maupassant-Guy-de_Mademoiselle-Fifi-Nouveaux-contes-Une-ruse | 2,469 tokens | True |
- | 22 | 1901_Achard-Lucie_Rosalie-de-Constant-sa-famille-et-ses-amis | 12,775 tokens | True |
- | 23 | 1903_Conan-Laure_Élisabeth-Seton | 13,046 tokens | True |
- | 24 | 1904-1912_Rolland-Romain_Jean-Christophe(1) | 10,982 tokens | True |
- | 25 | 1904-1912_Rolland-Romain_Jean-Christophe(2) | 10,305 tokens | True |
- | 26 | 1917_Bourgeois-Adèle_Némoville | 12,468 tokens | True |
- | 27 | 1923_Delly_Dans-les-ruines | 95,617 tokens | True |
- | 28 | 1923_Radiguet-Raymond_Le-diable-au-corps | 14,850 tokens | True |
- | 29 | 1926_Audoux-Marguerite_De-la-ville-au-moulin | 12,144 tokens | True |
- | 30 | 1937_Audoux-Marguerite_Douce-Lumière | 12,346 tokens | True |
- | 31 | TOTAL | 554,542 tokens | 31 files used for cross-validation |
+ | | Document | Tokens Count | Is included in model eval |
+ |----|---------------------------------------------------------------------------------|----------------|-----------------------------------|
+ | 0 | 1731_Prévost-Antoine-François_Manon-Lescaut_PER-ONLY | 71,219 tokens | True |
+ | 1 | 1830_Balzac-Honoré-de_La-maison-du-chat-qui-pelote | 24,776 tokens | True |
+ | 2 | 1830_Balzac-Honoré-de_Sarrasine | 15,408 tokens | True |
+ | 3 | 1832_Sand-George_Indiana_PER-ONLY | 112,221 tokens | True |
+ | 4 | 1836_Gautier-Théophile_La-morte-amoureuse | 14,293 tokens | True |
+ | 5 | 1837_Balzac-Honoré-de_La-maison-Nucingen | 30,030 tokens | True |
+ | 6 | 1841_Sand-George_Pauline | 12,398 tokens | True |
+ | 7 | 1856_Cousin-Victor_Madame-de-Hautefort | 11,768 tokens | True |
+ | 8 | 1863_Gautier-Théophile_Le-capitaine-Fracasse | 11,848 tokens | True |
+ | 9 | 1873_Zola-Émile_Le-ventre-de-Paris | 12,613 tokens | True |
+ | 10 | 1881_Flaubert-Gustave_Bouvard-et-Pécuchet | 12,308 tokens | True |
+ | 11 | 1882-1883_Maupassant-Guy-de_Mademoiselle-Fifi-La-buche | 2,267 tokens | True |
+ | 12 | 1882-1883_Maupassant-Guy-de_Mademoiselle-Fifi-La-relique | 2,041 tokens | True |
+ | 13 | 1882-1883_Maupassant-Guy-de_Mademoiselle-Fifi-La-rouille | 2,949 tokens | True |
+ | 14 | 1882-1883_Maupassant-Guy-de_Mademoiselle-Fifi-Madame-Baptiste | 2,578 tokens | True |
+ | 15 | 1882-1883_Maupassant-Guy-de_Mademoiselle-Fifi-Marocca | 4,078 tokens | True |
+ | 16 | 1882-1883_Maupassant-Guy-de_Mademoiselle-Fifi-Nouveaux-contes-A-cheval | 2,878 tokens | True |
+ | 17 | 1882-1883_Maupassant-Guy-de_Mademoiselle-Fifi-Nouveaux-contes-Fou | 1,905 tokens | True |
+ | 18 | 1882-1883_Maupassant-Guy-de_Mademoiselle-Fifi-Nouveaux-contes-Mademoiselle-Fifi | 5,439 tokens | True |
+ | 19 | 1882-1883_Maupassant-Guy-de_Mademoiselle-Fifi-Nouveaux-contes-Reveil | 2,159 tokens | True |
+ | 20 | 1882-1883_Maupassant-Guy-de_Mademoiselle-Fifi-Nouveaux-contes-Un-reveillon | 2,364 tokens | True |
+ | 21 | 1882-1883_Maupassant-Guy-de_Mademoiselle-Fifi-Nouveaux-contes-Une-ruse | 2,469 tokens | True |
+ | 22 | 1901_Achard-Lucie_Rosalie-de-Constant-sa-famille-et-ses-amis | 12,775 tokens | True |
+ | 23 | 1903_Conan-Laure_Élisabeth-Seton | 13,046 tokens | True |
+ | 24 | 1904-1912_Rolland-Romain_Jean-Christophe(1) | 10,982 tokens | True |
+ | 25 | 1904-1912_Rolland-Romain_Jean-Christophe(2) | 10,305 tokens | True |
+ | 26 | 1917_Bourgeois-Adèle_Némoville | 12,468 tokens | True |
+ | 27 | 1923_Delly_Dans-les-ruines | 95,617 tokens | True |
+ | 28 | 1923_Radiguet-Raymond_Le-diable-au-corps | 14,850 tokens | True |
+ | 29 | 1926_Audoux-Marguerite_De-la-ville-au-moulin | 12,144 tokens | True |
+ | 30 | 1937_Audoux-Marguerite_Douce-Lumière | 12,346 tokens | True |
+ | 31 | TOTAL | 554,542 tokens | 3 files used for cross-validation |

  ## PREDICTIONS CONFUSION MATRIX:
  | Gold Labels | PER | O | support |
final_model.pkl CHANGED
@@ -1,3 +1,3 @@
  version https://git-lfs.github.com/spec/v1
- oid sha256:116f5605f0bcd856a65a56123c76d67156460ad3532d03c04506f5627800de52
- size 386227630
+ oid sha256:d2ff3d5ec56dfac2790b1e096e3d28b6ece42931bc4ce4628d1d524d789c877b
+ size 386227868
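
The `final_model.pkl` change above only touches a Git LFS pointer, which records the SHA-256 digest and byte size of the real file. As a minimal sketch (not part of the commit), a downloaded copy could be checked against the new pointer as follows. The helper names `parse_lfs_pointer` and `matches_pointer` and the local path in the usage note are hypothetical; only the pointer text comes from the diff.

```python
import hashlib
import os


def parse_lfs_pointer(text):
    """Parse the 'key value' lines of a Git LFS pointer (hypothetical helper)."""
    fields = {}
    for line in text.strip().splitlines():
        key, _, value = line.strip().partition(" ")
        fields[key] = value
    # The oid field has the form '<algorithm>:<hex digest>'.
    algo, _, digest = fields["oid"].partition(":")
    return {
        "version": fields["version"],
        "algo": algo,
        "digest": digest,
        "size": int(fields["size"]),
    }


def matches_pointer(path, pointer, chunk_size=1 << 20):
    """Return True if the file at `path` has the size and digest in `pointer`."""
    if os.path.getsize(path) != pointer["size"]:
        return False
    hasher = hashlib.new(pointer["algo"])
    with open(path, "rb") as handle:
        # Read in chunks so a file of several hundred MB is not loaded at once.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            hasher.update(chunk)
    return hasher.hexdigest() == pointer["digest"]


# Pointer content copied from the '+' side of the diff above.
NEW_POINTER = """\
version https://git-lfs.github.com/spec/v1
oid sha256:d2ff3d5ec56dfac2790b1e096e3d28b6ece42931bc4ce4628d1d524d789c877b
size 386227868
"""

pointer = parse_lfs_pointer(NEW_POINTER)
```

Assuming the model has been saved locally as `final_model.pkl` (a hypothetical path), `matches_pointer("final_model.pkl", pointer)` returns `True` only when both the size and the digest agree with this commit.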