Uzbek

Tokenizer for Uzbek Language

Introduction

This tokenizer is based on data from the Mozilla Common Voice dataset. It was built from the train and validated splits, which together contain 130,000 sentences.

Features

  • Splits text into tokens.
  • Supports less common pronunciations and accents.

Installation

Python and the required libraries:

pip install transformers datasets

Usage

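from transformers import AutoTokenizer

# Load the tokenizer from the Hugging Face Hub.
tokenizer = AutoTokenizer.from_pretrained("jamshidahmadov/uz_tokenizer")

# Sample Uzbek sentence: "Various NLP projects are being built in Uzbekistan"
text = "O'zbekistonda turli xil NLP loyihalari qurilmoqda"
tokens = tokenizer.tokenize(text)
print(tokens)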
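
Beyond listing tokens, it can be useful to see the numeric ids a model would receive and to check that they decode back to readable text. The following is a minimal sketch, not part of the original card: the helper name `describe_tokenization` and the `main` wrapper are hypothetical, and it assumes the loaded tokenizer exposes the standard Hugging Face methods `tokenize`, `convert_tokens_to_ids`, and `decode`.

```python
def describe_tokenization(tokenizer, text):
    """Tokenize `text` and return the tokens, their ids, and the decoded text.

    `tokenizer` is assumed to behave like a Hugging Face tokenizer, i.e. to
    provide `tokenize`, `convert_tokens_to_ids`, and `decode`.
    """
    tokens = tokenizer.tokenize(text)
    ids = tokenizer.convert_tokens_to_ids(tokens)
    decoded = tokenizer.decode(ids, skip_special_tokens=True)
    return {"tokens": tokens, "ids": ids, "decoded": decoded}


def main():
    # Imported here so the helper above can be used without transformers.
    from transformers import AutoTokenizer

    # Downloads the tokenizer files from the Hugging Face Hub on first use.
    tokenizer = AutoTokenizer.from_pretrained("jamshidahmadov/uz_tokenizer")
    result = describe_tokenization(
        tokenizer, "O'zbekistonda turli xil NLP loyihalari qurilmoqda"
    )
    for key, value in result.items():
        print(f"{key}: {value}")
```

Calling `main()` prints the tokens, ids, and decoded text for the sample sentence. Whether the decoded text matches the input exactly depends on the tokenizer's normalization, so the round trip is best treated as a sanity check rather than a guarantee.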

Dataset Description

Common Voice 17.0 is a multilingual dataset that also covers the Uzbek language.

Contact

Jamshid Ahmadov
