---
language: ht
license: mit
tags:
- haitian-creole
- tokenization
- part-of-speech-tagging
- dependency-parsing
- spacy
---

# ht_core_news_sm

**Language:** Haitian Creole (ht)

**Type:** spaCy pipeline (tokenizer, POS tagger, dependency parser)

**Size:** Small (optimized for efficiency)

## Training

- Trained on ~3,300 manually corrected CoNLL-U sentences that follow the Universal Dependencies guidelines.
- The data consists of formal Haitian Creole texts (e.g., articles, religious texts, educational material).
- No pretrained word vectors were used; the pipeline is trained end to end on the annotated data.
- CoNLL-U data: https://github.com/JephteyAdolphe/UD_Haitian_Creole-Adolphe
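
For readers unfamiliar with the CoNLL-U format mentioned above, here is a minimal sketch of reading it with the Python standard library. The `read_conllu` helper and the `SAMPLE` sentence are illustrative assumptions: the annotation was written by hand for this example and is not taken from the treebank, and this is not the loader used to train the model.

```python
# The ten tab-separated columns defined by the CoNLL-U format.
FIELDS = ("id", "form", "lemma", "upos", "xpos",
          "feats", "head", "deprel", "deps", "misc")

# Hand-written illustration (NOT taken from the treebank); "_" marks unspecified fields.
SAMPLE = """\
# text = M ap pale ak li.
1\tM\t_\tPRON\t_\t_\t3\tnsubj\t_\t_
2\tap\t_\tAUX\t_\t_\t3\taux\t_\t_
3\tpale\t_\tVERB\t_\t_\t0\troot\t_\t_
4\tak\t_\tADP\t_\t_\t5\tcase\t_\t_
5\tli\t_\tPRON\t_\t_\t3\tobl\t_\t_
6\t.\t_\tPUNCT\t_\t_\t3\tpunct\t_\t_

"""


def read_conllu(text):
    """Parse CoNLL-U text into a list of sentences (lists of token dicts)."""
    sentences, current = [], []
    for line in text.splitlines():
        if not line.strip():
            # A blank line ends the current sentence.
            if current:
                sentences.append(current)
                current = []
        elif line.startswith("#"):
            # Comment lines carry sentence-level metadata; skipped here.
            continue
        else:
            cols = line.split("\t")
            if len(cols) != len(FIELDS):
                raise ValueError(f"expected {len(FIELDS)} columns, got {len(cols)}")
            # Skip multiword-token ranges (e.g. "1-2") and empty nodes (e.g. "1.1").
            if "-" in cols[0] or "." in cols[0]:
                continue
            current.append(dict(zip(FIELDS, cols)))
    if current:
        sentences.append(current)
    return sentences


sentences = read_conllu(SAMPLE)
for token in sentences[0]:
    print(token["form"], token["upos"], token["head"], token["deprel"])
```

To train a spaCy pipeline, CoNLL-U files are usually converted to spaCy's binary format first, for example with the `spacy convert` command. Whether that was the exact workflow for this model is not stated here.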

## Capabilities

- Tokenization (including contractions and informal forms common in Haitian Creole)
- Part-of-Speech (POS) tagging based on Universal POS tags
- Dependency parsing (basic syntactic parsing)
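
As a small sketch of how the tagger output listed above might be summarized, the hypothetical helper below counts tags over any iterable of token-like objects. It assumes the pipeline populates `token.pos_` with Universal POS tags; if only `token.tag_` is set, pass `attr="tag_"` instead. The stand-in tokens are hand-written so the snippet runs without the model installed.

```python
from collections import Counter
from types import SimpleNamespace


def pos_histogram(doc, attr="pos_"):
    """Count how often each tag value occurs among the tokens of `doc`."""
    return Counter(getattr(token, attr) for token in doc)


# Stand-in tokens (hand-written, not model output) that mimic spaCy's Token attributes.
demo_doc = [
    SimpleNamespace(text="M", pos_="PRON"),
    SimpleNamespace(text="ap", pos_="AUX"),
    SimpleNamespace(text="pale", pos_="VERB"),
    SimpleNamespace(text="ak", pos_="ADP"),
    SimpleNamespace(text="li", pos_="PRON"),
    SimpleNamespace(text=".", pos_="PUNCT"),
]

for tag, count in pos_histogram(demo_doc).most_common():
    print(tag, count)
```

With the model installed, the same function should accept a real `Doc`, for example `pos_histogram(nlp("M ap pale ak li."))`.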

## Intended Use

- NLP research on low-resource languages
- Language technology development for Haitian Creole
- Educational or linguistic applications

## Limitations

- Named entity recognition (NER) is not currently included.
- Trained primarily on formal Haitian Creole; performance may vary on very informal or highly dialectal texts.
- The training set is small, so the model is best suited to prototyping, research, and early-stage projects.

## Example Usage

```python
import spacy

# Sample sentences, including contracted and informal spellings.
texts = [
    "Si'm ka vini, m'ap pale ak li.",
    "M ap teste model lan (pou kounye a).",
    "Map manje gato a pandan map gade televizyon lem lakay mwen.",
    "M ap pale ak ou le w vini demen.",
    "M'ap vini, eske wap la avek lajan'm? Si ou, di'l non pou fre'w.",
]

nlp = spacy.load("ht_core_news_sm")

for text in texts:
    doc = nlp(text)

    # Tokenization, normalization, POS tagging, dependency parsing
    print("Tokens, NORM, POS (tag), Dependency:")
    print(len(doc))  # number of tokens in the sentence
    for token in doc:
        print(f"{token.text} | {token.norm_} | {token.tag_} | {token.dep_}")
    print("\n")