---
language: ht
license: mit
tags:
- haitian-creole
- tokenization
- part-of-speech-tagging
- dependency-parsing
- spacy
---

# ht_core_news_sm

**Language:** Haitian Creole (ht)

**Type:** spaCy pipeline (tokenizer, POS tagger, dependency parser)

**Size:** Small (optimized for efficiency)

## Training

- Trained on ~3,300 manually corrected CoNLL-U sentences following Universal Dependencies guidelines.
- Data includes formal Haitian Creole texts (e.g., articles, religious texts, educational material).
- No pretrained word vectors were used (pure end-to-end pipeline).
- CoNLL-U data: https://github.com/JephteyAdolphe/UD_Haitian_Creole-Adolphe

## Capabilities

- Tokenization (including contractions and informal forms common in Haitian Creole)
- Part-of-Speech (POS) tagging based on Universal POS tags
- Dependency parsing (basic syntactic parsing)

## Intended Use

- NLP research on low-resource languages
- Language technology development for Haitian Creole
- Educational or linguistic applications

## Limitations

- No named entity recognition (NER) currently included.
- Trained primarily on formal Haitian Creole; performance may vary on very informal or highly dialectal texts.
- Small dataset: best for prototyping, research, and early-stage projects.

## Example Usage

```python
import spacy

# Sample sentences with contractions and informal spellings.
texts = [
    "Si'm ka vini, m'ap pale ak li.",
    "M ap teste model lan (pou kounye a).",
    "Map manje gato a pandan map gade televizyon lem lakay mwen.",
    "M ap pale ak ou le w vini demen.",
    "M'ap vini, eske wap la avek lajan'm? Si ou, di'l non pou fre'w.",
]

# The pipeline package must be installed before it can be loaded.
nlp = spacy.load("ht_core_news_sm")

for text in texts:
    doc = nlp(text)

    # Tokenization, normalized form, POS tag, and dependency label per token
    print("Tokens, NORM, POS (tag), Dependency:")
    print(len(doc))  # number of tokens in the sentence
    for token in doc:
        print(f"{token.text} | {token.norm_} | {token.tag_} | {token.dep_}")
    print("\n")
```
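
## Reading the Training Data

The Training section notes that the pipeline was trained on CoNLL-U sentences. For readers who want to inspect that data without spaCy, the following is a minimal sketch of a CoNLL-U reader in plain Python. The function name `read_conllu` and the `SAMPLE` sentence are illustrative assumptions: the sample was written by hand for this example, is not copied from the treebank, and its annotation may not match the treebank's conventions.

```python
# The ten tab-separated CoNLL-U columns, in order.
FIELDS = ["id", "form", "lemma", "upos", "xpos",
          "feats", "head", "deprel", "deps", "misc"]

# Hypothetical, hand-written sample for illustration only (not from the treebank).
SAMPLE = "\n".join([
    "# text = M ap pale ak li.",
    "1\tM\t_\tPRON\t_\t_\t3\tnsubj\t_\t_",
    "2\tap\t_\tAUX\t_\t_\t3\taux\t_\t_",
    "3\tpale\t_\tVERB\t_\t_\t0\troot\t_\t_",
    "4\tak\t_\tADP\t_\t_\t5\tcase\t_\t_",
    "5\tli\t_\tPRON\t_\t_\t3\tobl\t_\t_",
    "6\t.\t_\tPUNCT\t_\t_\t3\tpunct\t_\t_",
    "",
])


def read_conllu(text):
    """Return a list of sentences, each a list of token dicts."""
    sentences, current = [], []
    for line in text.splitlines():
        if not line.strip():
            # A blank line ends the current sentence.
            if current:
                sentences.append(current)
                current = []
            continue
        if line.startswith("#"):
            # Sentence-level comments such as "# text = ..." are skipped.
            continue
        cols = line.split("\t")
        if len(cols) != len(FIELDS):
            raise ValueError(f"expected {len(FIELDS)} columns, got {len(cols)}: {line!r}")
        if not cols[0].isdigit():
            # Skip multiword-token ranges (e.g. "1-2") and empty nodes (e.g. "1.1").
            continue
        token = dict(zip(FIELDS, cols))
        token["id"] = int(token["id"])
        token["head"] = int(token["head"]) if token["head"].isdigit() else None
        current.append(token)
    if current:
        sentences.append(current)
    return sentences


if __name__ == "__main__":
    for sentence in read_conllu(SAMPLE):
        for tok in sentence:
            print(f'{tok["id"]} | {tok["form"]} | {tok["upos"]} | {tok["head"]} | {tok["deprel"]}')
```

To read a file from the repository linked above, pass its contents to the function, for example `read_conllu(open(path, encoding="utf-8").read())`, where `path` points to a local copy of a `.conllu` file.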