JephteyAdolphe committed
Commit 121837e · verified · 1 Parent(s): 10e423f

Update README.md

Files changed (1)
  1. README.md +64 -3
README.md CHANGED
@@ -1,3 +1,64 @@
- ---
- license: mit
- ---
+ ---
+ language: ht
+ license: mit
+ tags:
+ - haitian-creole
+ - tokenization
+ - part-of-speech-tagging
+ - dependency-parsing
+ - spacy
+ ---
+
+ # ht_core_news_sm
+
+ **Language:** Haitian Creole (ht)
+ **Type:** spaCy pipeline (tokenizer, POS tagger, dependency parser)
+ **Size:** Small (optimized for efficiency)
+
+ ## Training
+
+ - Trained on ~3,300 manually corrected CoNLL-U sentences following Universal Dependencies guidelines.
+ - Data includes formal Haitian Creole texts (e.g., articles, religious texts, educational material).
+ - No pretrained word vectors were used (pure end-to-end pipeline).
+ - CoNLL-U Data: https://github.com/JephteyAdolphe/UD_Haitian_Creole-Adolphe
+
+ ## Capabilities
+
+ - Tokenization (including contractions and informal forms common in Haitian Creole)
+ - Part-of-Speech (POS) tagging based on Universal POS tags
+ - Dependency parsing (basic syntactic parsing)
+
+ ## Intended Use
+
+ - NLP research on low-resource languages
+ - Language technology development for Haitian Creole
+ - Educational or linguistic applications
+
+ ## Limitations
+
+ - No named entity recognition (NER) currently included.
+ - Trained primarily on formal Haitian Creole; performance may vary on very informal or highly dialectal texts.
+ - Small dataset: best for prototyping, research, and early-stage projects.
+
+ ## Example Usage
+
+ ```python
+ import spacy
+
+ texts = [
+     "Si'm ka vini, m'ap pale ak li.", "M ap teste model lan (pou kounye a).",
+     "Map manje gato a pandan map gade televizyon lem lakay mwen.",
+     "M ap pale ak ou le w vini demen.", "M'ap vini, eske wap la avek lajan'm? Si ou, di'l non pou fre'w."
+ ]
+
+ nlp = spacy.load("ht_core_news_sm")
+
+ for text in texts:
+     doc = nlp(text)
+
+     # Tokenization, normalization (NORM), POS tagging, dependency parsing
+     print("Tokens, NORM, POS (tag), Dependency:")
+     print(len(doc))
+     for token in doc:
+         print(f"{token.text} | {token.norm_} | {token.tag_} | {token.dep_}")
+     print("\n")
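
The Training section added in this commit says the pipeline was trained on CoNLL-U sentences. As a rough illustration of that format (not part of the commit), the sketch below reads CoNLL-U text into per-sentence token dictionaries using only the standard library. The sample sentence and its annotations are hand-written for illustration and are an assumption, not data taken from the linked treebank; `parse_conllu` is a hypothetical helper name, and lemmas are left as `_` on purpose.

```python
# The ten tab-separated columns defined by the CoNLL-U format.
CONLLU_FIELDS = (
    "id", "form", "lemma", "upos", "xpos",
    "feats", "head", "deprel", "deps", "misc",
)

# Hypothetical, hand-written sample: NOT taken from the treebank.
SAMPLE = "\n".join([
    "# text = M ap pale.",
    "1\tM\t_\tPRON\t_\t_\t3\tnsubj\t_\t_",
    "2\tap\t_\tAUX\t_\t_\t3\taux\t_\t_",
    "3\tpale\t_\tVERB\t_\t_\t0\troot\t_\tSpaceAfter=No",
    "4\t.\t_\tPUNCT\t_\t_\t3\tpunct\t_\t_",
    "",
])


def parse_conllu(text):
    """Return a list of sentences, each a list of token dicts."""
    sentences, current = [], []
    for line in text.splitlines():
        if not line.strip():
            # A blank line closes the current sentence.
            if current:
                sentences.append(current)
                current = []
            continue
        if line.startswith("#"):
            # Sentence-level comment lines (e.g. "# text = ...") are skipped.
            continue
        cols = line.split("\t")
        if len(cols) != len(CONLLU_FIELDS):
            raise ValueError(f"expected 10 columns, got {len(cols)}: {line!r}")
        if not cols[0].isdigit():
            # Skip multiword-token ranges ("1-2") and empty nodes ("1.1").
            continue
        token = dict(zip(CONLLU_FIELDS, cols))
        token["id"] = int(token["id"])
        # HEAD is 0 for the root; keep None if it is unspecified ("_").
        token["head"] = int(token["head"]) if token["head"].isdigit() else None
        current.append(token)
    if current:
        sentences.append(current)
    return sentences


sentences = parse_conllu(SAMPLE)
for token in sentences[0]:
    print(f"{token['form']} | {token['upos']} | {token['deprel']} | head={token['head']}")
```

The `upos` and `deprel` columns hold the same kind of annotation the pipeline predicts, so a reader like this could be used to compare gold annotations against the `token.dep_` values printed by the example in the README.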