Upload 3 files
- README.md +37 -37
- final_model.pkl +2 -2
README.md
CHANGED
@@ -6,7 +6,7 @@ tags:
6 |   - camembert
7 |   - literary-texts
8 |   - nested-entities
9 | - -
10 |   license: apache-2.0
11 |   metrics:
12 |   - f1
@@ -18,10 +18,10 @@ pipeline_tag: token-classification
18 |   ---
19 |
20 |   ## INTRODUCTION:
21 | - This model, developed as part of the [
22 |
23 |   The predicted entities are:
24 | - - mentions of characters (PER): pronouns (je, tu, il, ...), possessive pronouns (mon, ton, son, ...), common nouns (le capitaine, la princesse, ...) and proper nouns (Indiana Delmare, Honoré de Pardaillan, ...)
25 |   - facilities (FAC): chatêau, sentier, chambre, couloir, ...
26 |   - time (TIME): le règne de Louis XIV, ce matin, en juillet, ...
27 |   - geo-political entities (GPE): Montrouge, France, le petit hameau, ...
@@ -70,40 +70,40 @@ Model Output: BIOES labels sequence
70 |   *** IN CONSTRUCTION ***
71 |
72 |   ## TRAINING CORPUS:
73 | - | | Document | Tokens Count | Is included in model eval
74 | -
75 | - | 0 | 1731_Prévost-Antoine-François_Manon-Lescaut_PER-ONLY | 71,219 tokens | True
76 | - | 1 | 1830_Balzac-Honoré-de_La-maison-du-chat-qui-pelote | 24,776 tokens | True
77 | - | 2 | 1830_Balzac-Honoré-de_Sarrasine | 15,408 tokens | True
78 | - | 3 | 1832_Sand-George_Indiana_PER-ONLY | 112,221 tokens | True
79 | - | 4 | 1836_Gautier-Théophile_La-morte-amoureuse | 14,293 tokens | True
80 | - | 5 | 1837_Balzac-Honoré-de_La-maison-Nucingen | 30,030 tokens | True
81 | - | 6 | 1841_Sand-George_Pauline | 12,398 tokens | True
82 | - | 7 | 1856_Cousin-Victor_Madame-de-Hautefort | 11,768 tokens | True
83 | - | 8 | 1863_Gautier-Théophile_Le-capitaine-Fracasse | 11,848 tokens | True
84 | - | 9 | 1873_Zola-Émile_Le-ventre-de-Paris | 12,613 tokens | True
85 | - | 10 | 1881_Flaubert-Gustave_Bouvard-et-Pécuchet | 12,308 tokens | True
86 | - | 11 | 1882-1883_Maupassant-Guy-de_Mademoiselle-Fifi-La-buche | 2,267 tokens | True
87 | - | 12 | 1882-1883_Maupassant-Guy-de_Mademoiselle-Fifi-La-relique | 2,041 tokens | True
88 | - | 13 | 1882-1883_Maupassant-Guy-de_Mademoiselle-Fifi-La-rouille | 2,949 tokens | True
89 | - | 14 | 1882-1883_Maupassant-Guy-de_Mademoiselle-Fifi-Madame-Baptiste | 2,578 tokens | True
90 | - | 15 | 1882-1883_Maupassant-Guy-de_Mademoiselle-Fifi-Marocca | 4,078 tokens | True
91 | - | 16 | 1882-1883_Maupassant-Guy-de_Mademoiselle-Fifi-Nouveaux-contes-A-cheval | 2,878 tokens | True
92 | - | 17 | 1882-1883_Maupassant-Guy-de_Mademoiselle-Fifi-Nouveaux-contes-Fou | 1,905 tokens | True
93 | - | 18 | 1882-1883_Maupassant-Guy-de_Mademoiselle-Fifi-Nouveaux-contes-Mademoiselle-Fifi | 5,439 tokens | True
94 | - | 19 | 1882-1883_Maupassant-Guy-de_Mademoiselle-Fifi-Nouveaux-contes-Reveil | 2,159 tokens | True
95 | - | 20 | 1882-1883_Maupassant-Guy-de_Mademoiselle-Fifi-Nouveaux-contes-Un-reveillon | 2,364 tokens | True
96 | - | 21 | 1882-1883_Maupassant-Guy-de_Mademoiselle-Fifi-Nouveaux-contes-Une-ruse | 2,469 tokens | True
97 | - | 22 | 1901_Achard-Lucie_Rosalie-de-Constant-sa-famille-et-ses-amis | 12,775 tokens | True
98 | - | 23 | 1903_Conan-Laure_Élisabeth-Seton | 13,046 tokens | True
99 | - | 24 | 1904-1912_Rolland-Romain_Jean-Christophe(1) | 10,982 tokens | True
100 | - | 25 | 1904-1912_Rolland-Romain_Jean-Christophe(2) | 10,305 tokens | True
101 | - | 26 | 1917_Bourgeois-Adèle_Némoville | 12,468 tokens | True
102 | - | 27 | 1923_Delly_Dans-les-ruines | 95,617 tokens | True
103 | - | 28 | 1923_Radiguet-Raymond_Le-diable-au-corps | 14,850 tokens | True
104 | - | 29 | 1926_Audoux-Marguerite_De-la-ville-au-moulin | 12,144 tokens | True
105 | - | 30 | 1937_Audoux-Marguerite_Douce-Lumière | 12,346 tokens | True
106 | - | 31 | TOTAL | 554,542 tokens |
107 |
108 |   ## PREDICTIONS CONFUSION MATRIX:
109 |   | Gold Labels | PER | O | support |
6 |   - camembert
7 |   - literary-texts
8 |   - nested-entities
9 | + - propp-fr
10 |   license: apache-2.0
11 |   metrics:
12 |   - f1
18 |   ---
19 |
20 |   ## INTRODUCTION:
21 | + This model, developed as part of the [propp-fr project](https://github.com/lattice-8094/fr-litbank), is a **NER model** built on top of [camembert-large](https://huggingface.co/almanach/camembert-large) embeddings, trained to predict nested entities in french, specifically for literary texts.
22 |
23 |   The predicted entities are:
24 | + - mentions of characters (PER): pronouns (je, tu, il, ...), possessive pronouns (mon, ton, son, ...), common nouns (le capitaine, la princesse, ...) and proper nouns (Indiana Delmare, Honoré de Pardaillan, ...)
25 |   - facilities (FAC): chatêau, sentier, chambre, couloir, ...
26 |   - time (TIME): le règne de Louis XIV, ce matin, en juillet, ...
27 |   - geo-political entities (GPE): Montrouge, France, le petit hameau, ...
70 |   *** IN CONSTRUCTION ***
71 |
72 |   ## TRAINING CORPUS:
73 | + | | Document | Tokens Count | Is included in model eval |
74 | + |----|---------------------------------------------------------------------------------|----------------|-----------------------------------|
75 | + | 0 | 1731_Prévost-Antoine-François_Manon-Lescaut_PER-ONLY | 71,219 tokens | True |
76 | + | 1 | 1830_Balzac-Honoré-de_La-maison-du-chat-qui-pelote | 24,776 tokens | True |
77 | + | 2 | 1830_Balzac-Honoré-de_Sarrasine | 15,408 tokens | True |
78 | + | 3 | 1832_Sand-George_Indiana_PER-ONLY | 112,221 tokens | True |
79 | + | 4 | 1836_Gautier-Théophile_La-morte-amoureuse | 14,293 tokens | True |
80 | + | 5 | 1837_Balzac-Honoré-de_La-maison-Nucingen | 30,030 tokens | True |
81 | + | 6 | 1841_Sand-George_Pauline | 12,398 tokens | True |
82 | + | 7 | 1856_Cousin-Victor_Madame-de-Hautefort | 11,768 tokens | True |
83 | + | 8 | 1863_Gautier-Théophile_Le-capitaine-Fracasse | 11,848 tokens | True |
84 | + | 9 | 1873_Zola-Émile_Le-ventre-de-Paris | 12,613 tokens | True |
85 | + | 10 | 1881_Flaubert-Gustave_Bouvard-et-Pécuchet | 12,308 tokens | True |
86 | + | 11 | 1882-1883_Maupassant-Guy-de_Mademoiselle-Fifi-La-buche | 2,267 tokens | True |
87 | + | 12 | 1882-1883_Maupassant-Guy-de_Mademoiselle-Fifi-La-relique | 2,041 tokens | True |
88 | + | 13 | 1882-1883_Maupassant-Guy-de_Mademoiselle-Fifi-La-rouille | 2,949 tokens | True |
89 | + | 14 | 1882-1883_Maupassant-Guy-de_Mademoiselle-Fifi-Madame-Baptiste | 2,578 tokens | True |
90 | + | 15 | 1882-1883_Maupassant-Guy-de_Mademoiselle-Fifi-Marocca | 4,078 tokens | True |
91 | + | 16 | 1882-1883_Maupassant-Guy-de_Mademoiselle-Fifi-Nouveaux-contes-A-cheval | 2,878 tokens | True |
92 | + | 17 | 1882-1883_Maupassant-Guy-de_Mademoiselle-Fifi-Nouveaux-contes-Fou | 1,905 tokens | True |
93 | + | 18 | 1882-1883_Maupassant-Guy-de_Mademoiselle-Fifi-Nouveaux-contes-Mademoiselle-Fifi | 5,439 tokens | True |
94 | + | 19 | 1882-1883_Maupassant-Guy-de_Mademoiselle-Fifi-Nouveaux-contes-Reveil | 2,159 tokens | True |
95 | + | 20 | 1882-1883_Maupassant-Guy-de_Mademoiselle-Fifi-Nouveaux-contes-Un-reveillon | 2,364 tokens | True |
96 | + | 21 | 1882-1883_Maupassant-Guy-de_Mademoiselle-Fifi-Nouveaux-contes-Une-ruse | 2,469 tokens | True |
97 | + | 22 | 1901_Achard-Lucie_Rosalie-de-Constant-sa-famille-et-ses-amis | 12,775 tokens | True |
98 | + | 23 | 1903_Conan-Laure_Élisabeth-Seton | 13,046 tokens | True |
99 | + | 24 | 1904-1912_Rolland-Romain_Jean-Christophe(1) | 10,982 tokens | True |
100 | + | 25 | 1904-1912_Rolland-Romain_Jean-Christophe(2) | 10,305 tokens | True |
101 | + | 26 | 1917_Bourgeois-Adèle_Némoville | 12,468 tokens | True |
102 | + | 27 | 1923_Delly_Dans-les-ruines | 95,617 tokens | True |
103 | + | 28 | 1923_Radiguet-Raymond_Le-diable-au-corps | 14,850 tokens | True |
104 | + | 29 | 1926_Audoux-Marguerite_De-la-ville-au-moulin | 12,144 tokens | True |
105 | + | 30 | 1937_Audoux-Marguerite_Douce-Lumière | 12,346 tokens | True |
106 | + | 31 | TOTAL | 554,542 tokens | 3 files used for cross-validation |
107 |
108 |   ## PREDICTIONS CONFUSION MATRIX:
109 |   | Gold Labels | PER | O | support |
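Reviewer note: the TOTAL row of the training-corpus table in the README diff above can be checked against the per-document counts. The snippet below is only a sketch for that arithmetic; the names `parse_token_count`, `TOKEN_COUNTS`, and `REPORTED_TOTAL` are made up for illustration and are not part of the repository. The counts are copied from rows 0 to 30 of the table.

```python
def parse_token_count(cell: str) -> int:
    """Turn a table cell such as '71,219 tokens' into the integer 71219."""
    number = cell.strip().split()[0]
    return int(number.replace(",", ""))


# Per-document counts copied from rows 0..30 of the table (illustrative list).
TOKEN_COUNTS = [
    71_219, 24_776, 15_408, 112_221, 14_293, 30_030, 12_398, 11_768,
    11_848, 12_613, 12_308, 2_267, 2_041, 2_949, 2_578, 4_078,
    2_878, 1_905, 5_439, 2_159, 2_364, 2_469, 12_775, 13_046,
    10_982, 10_305, 12_468, 95_617, 14_850, 12_144, 12_346,
]

# Value stated in the TOTAL row of the table.
REPORTED_TOTAL = parse_token_count("554,542 tokens")

if __name__ == "__main__":
    computed = sum(TOKEN_COUNTS)
    status = "matches" if computed == REPORTED_TOTAL else "differs from"
    print(f"{len(TOKEN_COUNTS)} documents, {computed:,} tokens ({status} TOTAL row)")
```

With the values as committed, the 31 rows sum to the reported 554,542 tokens.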
final_model.pkl
CHANGED
@@ -1,3 +1,3 @@
1 |   version https://git-lfs.github.com/spec/v1
2 | - oid sha256:
3 | - size
1 |   version https://git-lfs.github.com/spec/v1
2 | + oid sha256:d2ff3d5ec56dfac2790b1e096e3d28b6ece42931bc4ce4628d1d524d789c877b
3 | + size 386227868
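Reviewer note: the new pointer records a SHA-256 digest and a byte size for `final_model.pkl`, so a downloaded copy can be compared against it. The following is a minimal sketch using only the Python standard library; `parse_lfs_pointer` and `matches_pointer` are hypothetical helper names, and the local file path in the usage note is an assumption.

```python
import hashlib
from pathlib import Path


def parse_lfs_pointer(text: str) -> dict:
    """Parse the 'key value' lines of a Git LFS pointer file into a dict."""
    fields = {}
    for line in text.splitlines():
        if not line.strip():
            continue
        key, _, value = line.partition(" ")
        fields[key] = value
    return fields


def matches_pointer(path: Path, pointer: dict, chunk_size: int = 1 << 20) -> bool:
    """Return True if the file's size and digest equal the pointer's values."""
    algo, _, expected = pointer["oid"].partition(":")
    digest = hashlib.new(algo)
    size = 0
    # Stream the file in chunks so a large model file is never fully in memory.
    with path.open("rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
            size += len(chunk)
    return size == int(pointer["size"]) and digest.hexdigest() == expected


# Pointer content as it appears on the new side of the diff.
POINTER_TEXT = """version https://git-lfs.github.com/spec/v1
oid sha256:d2ff3d5ec56dfac2790b1e096e3d28b6ece42931bc4ce4628d1d524d789c877b
size 386227868
"""
pointer = parse_lfs_pointer(POINTER_TEXT)
```

Assuming the file was saved locally as `final_model.pkl`, `matches_pointer(Path("final_model.pkl"), pointer)` would return `True` only when both the size and the digest agree with the pointer.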