Loïck BOURDOIS
lbourdois
AI & ML interests
👀
Recent Activity
upvoted an article 5 days ago: Training and Finetuning Sparse Embedding Models with Sentence Transformers v5
commented on a paper 6 days ago: Should We Still Pretrain Encoders with Masked Language Modeling?
new activity 10 days ago: RefalMachine/RuadaptQwen2.5-32B-Pro-Beta: Improve language tag
Organizations
French prompts
French prompts dataset developed when I worked at CATIE (https://hf.co/CATIE-AQ). Over 30,000 downloads.
French QA
QA models & datasets developed when I worked at CATIE (https://hf.co/CATIE-AQ). Over 150,000 downloads.
French caption datasets
Datasets I cleaned with an image, a prompt question (like "describe this image") and an answer.
Can be used to train VLMs.
- lbourdois/caption-maya-multimodal-pretrain-clean: Viewer • Updated • 551k • 429
- CATIE-AQ/caption-vidore-vdsid_french-clean: Viewer • Updated • 5k • 43
- CATIE-AQ/caption-vidore-tabfquad_test_subsampled-clean: Viewer • Updated • 280 • 11
- CATIE-AQ/caption-floschne-xm3600-clean: Viewer • Updated • 8.56k • 35
French retriever datasets
Datasets I cleaned with an image and a question.
Can be used to train visual retrievers (ColPali and co.).
- CATIE-AQ/retriever-vidore-vdsid_french-clean: Viewer • Updated • 5k • 15
- CATIE-AQ/retriever-vidore-tabfquad_test_subsampled-clean: Viewer • Updated • 280 • 11
- CATIE-AQ/retriever-manu-tabfquad_retrieving-clean: Viewer • Updated • 1.83k • 29
- CATIE-AQ/retriever-princeton-nlp-CharXiv-clean: Viewer • Updated • 1.32k • 7
FAT5
Flash Attention T5 (FAT5) models developed when I worked at CATIE (https://hf.co/CATIE-AQ).
French NER
NER models & datasets developed when I worked at CATIE (https://hf.co/CATIE-AQ). Over 170,000 downloads.
- CATIE-AQ/NERmembert-base-3entities: Token Classification • 0.1B • Updated • 34 • 2
- CATIE-AQ/NERmembert-large-3entities: Token Classification • 0.3B • Updated • 2.1k • 2
- CATIE-AQ/frenchNER_3entities: Viewer • Updated • 425k • 44 • 1
- CATIE-AQ/NERmembert-base-4entities: Token Classification • 0.1B • Updated • 20 • 2
French VQA datasets
VQA datasets I cleaned with an image, a question and an answer.
Can be used to train VLMs.
French OCR datasets
Datasets I cleaned with an image, a prompt question (like "transcribe the text in this image") and an answer.
Can be used to train VLMs.
French audio datasets (pretraining)
Around 117K hours of French audio for research purposes.
French Translations
Things I've translated: courses, blog posts, guides. More on my personal blog (https://lbourdois.github.io/blog/).