JephteyAdolphe/ht_core_news_sm

part-of-speech-tagging

dependency-parsing


JephteyAdolphe committed on Apr 28

Commit 121837e · verified · 1 Parent(s): 10e423f

Update README.md

Files changed (1)

README.md +64 -3

@@ -1,3 +1,64 @@
----
-license: mit
----

+---
+language: ht
+license: mit
+tags:
+- haitian-creole
+- tokenization
+- part-of-speech-tagging
+- dependency-parsing
+- spacy
+---
+# ht_core_news_sm
+**Language:** Haitian Creole (ht)
+**Type:** spaCy pipeline (tokenizer, POS tagger, dependency parser)
+**Size:** Small (optimized for efficiency)
+## Training
+- Trained on ~3,300 manually corrected CoNLL-U sentences following Universal Dependencies guidelines.
+- Data includes formal Haitian Creole texts (e.g., articles, religious texts, educational material).
+- No pretrained word vectors were used (pure end-to-end pipeline).
+- CoNLL-U Data: https://github.com/JephteyAdolphe/UD_Haitian_Creole-Adolphe
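For readers unfamiliar with the CoNLL-U format mentioned in the training notes, the following is a minimal sketch of reading it in plain Python. It assumes the standard ten tab-separated columns of the Universal Dependencies CoNLL-U format. The sample sentence and its annotation are illustrative only and are not taken from the training data, and `parse_conllu` is a hypothetical helper, not part of spaCy or of this repository.

```python
# Column names of the ten tab-separated CoNLL-U fields.
FIELDS = ["id", "form", "lemma", "upos", "xpos",
          "feats", "head", "deprel", "deps", "misc"]

# Illustrative annotation only; NOT copied from the training treebank.
# Unspecified fields (including LEMMA) are left as "_".
SAMPLE = "\n".join([
    "# text = M ap pale ak li.",
    "1\tM\t_\tPRON\t_\t_\t3\tnsubj\t_\t_",
    "2\tap\t_\tAUX\t_\t_\t3\taux\t_\t_",
    "3\tpale\t_\tVERB\t_\t_\t0\troot\t_\t_",
    "4\tak\t_\tADP\t_\t_\t5\tcase\t_\t_",
    "5\tli\t_\tPRON\t_\t_\t3\tobl\t_\t_",
    "6\t.\t_\tPUNCT\t_\t_\t3\tpunct\t_\t_",
    "",
])


def parse_conllu(text):
    """Yield one list of token dicts per sentence (hypothetical helper)."""
    sentence = []
    for line in text.splitlines():
        if not line.strip():
            # A blank line closes the current sentence.
            if sentence:
                yield sentence
                sentence = []
            continue
        if line.startswith("#"):
            # Sentence-level comment such as "# text = ...".
            continue
        cols = line.split("\t")
        if len(cols) != len(FIELDS):
            raise ValueError(f"expected 10 columns, got {len(cols)}: {line!r}")
        if "-" in cols[0] or "." in cols[0]:
            # Skip multiword-token ranges (1-2) and empty nodes (1.1).
            continue
        token = dict(zip(FIELDS, cols))
        token["id"] = int(token["id"])
        token["head"] = int(token["head"])
        sentence.append(token)
    if sentence:
        yield sentence


sentences = list(parse_conllu(SAMPLE))
for tok in sentences[0]:
    print(tok["id"], tok["form"], tok["upos"], tok["head"], tok["deprel"])
```

The sketch skips multiword-token ranges and empty nodes for brevity; a full reader would need to keep them.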
+## Capabilities
+- Tokenization (including contractions and informal forms common in Haitian Creole)
+- Part-of-Speech (POS) tagging based on Universal POS tags
+- Dependency parsing (basic syntactic parsing)
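To make the tagger and parser output concrete, here is a small sketch of how a Universal POS tag, a head index, and a relation label per token describe a sentence. It uses plain tuples rather than spaCy objects so that it runs without the model installed. The annotation is an illustrative assumption, not actual model output, and `children_of` is a hypothetical helper.

```python
# The 17 Universal POS tags defined by Universal Dependencies.
UPOS_TAGS = {
    "ADJ", "ADP", "ADV", "AUX", "CCONJ", "DET", "INTJ", "NOUN", "NUM",
    "PART", "PRON", "PROPN", "PUNCT", "SCONJ", "SYM", "VERB", "X",
}

# (form, upos, head, deprel); heads are 1-based and 0 marks the root.
# Illustrative annotation only, not output of ht_core_news_sm.
PARSE = [
    ("M", "PRON", 3, "nsubj"),
    ("ap", "AUX", 3, "aux"),
    ("pale", "VERB", 0, "root"),
    ("ak", "ADP", 5, "case"),
    ("li", "PRON", 3, "obl"),
    (".", "PUNCT", 3, "punct"),
]


def children_of(parse, head_index):
    """Return the forms of all tokens attached to the given 1-based head."""
    return [form for form, _, head, _ in parse if head == head_index]


# Every tag in the example belongs to the Universal POS inventory.
all_tags_valid = all(upos in UPOS_TAGS for _, upos, _, _ in PARSE)

# Locate the root (the token whose head is 0) and list its dependents.
root_index = next(
    i for i, (_, _, head, _) in enumerate(PARSE, start=1) if head == 0
)
root_form = PARSE[root_index - 1][0]
root_children = children_of(PARSE, root_index)

print(all_tags_valid)            # True
print(root_form, root_children)  # pale ['M', 'ap', 'li', '.']
```

With the real pipeline loaded, the corresponding spaCy attributes are `token.pos_`, `token.head`, `token.dep_`, and `token.children`.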
+## Intended Use
+- NLP research on low-resource languages
+- Language technology development for Haitian Creole
+- Educational or linguistic applications
+## Limitations
+- No named entity recognition (NER) currently included.
+- Trained primarily on formal Haitian Creole; performance may vary on very informal or highly dialectal texts.
+- Small training set (~3,300 sentences): best suited to prototyping, research, and early-stage projects.
+## Example Usage
+```python
+import spacy
+texts = [
+    "Si'm ka vini, m'ap pale ak li.",
+    "M ap teste model lan (pou kounye a).",
+    "Map manje gato a pandan map gade televizyon lem lakay mwen.",
+    "M ap pale ak ou le w vini demen.",
+    "M'ap vini, eske wap la avek lajan'm? Si ou, di'l non pou fre'w.",
+]
+nlp = spacy.load("ht_core_news_sm")
+for text in texts:
+    doc = nlp(text)
+    # Print the token count, then each token's text, normalized form, POS tag, and dependency label
+    print("Tokens, NORM, POS (tag), Dependency:")
+    print(len(doc))
+    for token in doc:
+        print(f"{token.text} | {token.norm_} | {token.tag_} | {token.dep_}")
+    print("\n")