Jakhongir Saydaliev

Jakh0103
AI & ML interests

None yet

Recent Activity

updated a model 1 day ago
Jakh0103/lid
published a model 1 day ago
Jakh0103/lid
updated a model 20 days ago
Jakh0103/Qwen2.5-VL-3B-SFT-VSR

Organizations

PMJ AI, EPFL NLP Lab

Jakh0103's activity

reacted to kargaranamir's post with 👍 about 1 month ago
Post
1379
Introducing GlotCC: a new 2TB corpus based on an early 2024 CommonCrawl snapshot with data for 1000+ languages.

🤗 corpus v1: cis-lmu/GlotCC-V1
🐱 pipeline v3: https://github.com/cisnlp/GlotCC

More details? Stay tuned for our upcoming paper.
More data? In the next version, we plan to include additional snapshots of CommonCrawl.

Limitation: Because low-resource languages appear far less often than others, very low-resource languages sometimes have only a few sentences available. However, the data volume for English in this version stands at 750GB, and the top 200 languages still have a strong presence in our data (see plot attached; we label the index every 20 languages, so the 10th index corresponds to the 200th language).
reacted to kargaranamir's post with 👍 about 1 month ago
updated a collection about 2 months ago
updated a collection 3 months ago
upvoted an article 3 months ago
Article

Vision Language Models Explained

By merve and 1 other • 358
upvoted an article 4 months ago
Article

Assisted Generation: a new direction toward low-latency text generation

By joaogante • 61
commented on Open-R1: Update #1 4 months ago

Fixed, thanks!

The link in the statement below also seems to be broken:
"You can find the instructions to run these evaluations in the open-r1 repository."

upvoted 2 articles 4 months ago
Article

Open-R1: a fully open reproduction of DeepSeek-R1

By eliebak and 2 others • 860
updated a collection 6 months ago