Occiglot

community

https://occiglot.eu/

occiglot

Activity Feed

AI & ML interests

Open Source Language Models for Europe

Recent Activity

eliaswendt authored a paper 12 days ago

Judging Quality Across Languages: A Multilingual Approach to Pretraining Data Filtering with Language Models

mkrausio authored a paper 13 days ago

Right on Time: Revising Time Series Models by Constraining their Explanations

mkrausio authored a paper 13 days ago

Judging Quality Across Languages: A Multilingual Approach to Pretraining Data Filtering with Language Models


occiglot's activity

eliaswendt

authored a paper 12 days ago

Judging Quality Across Languages: A Multilingual Approach to Pretraining Data Filtering with Language Models

Paper • 2505.22232 • Published 18 days ago • 18

mkrausio

authored 2 papers 13 days ago

Right on Time: Revising Time Series Models by Constraining their Explanations

Paper • 2402.12921 • Published Feb 20, 2024

Judging Quality Across Languages: A Multilingual Approach to Pretraining Data Filtering with Language Models

Paper • 2505.22232 • Published 18 days ago • 18

mbrack

authored 11 papers 17 days ago

Class Attribute Inference Attacks: Inferring Sensitive Class Information by Diffusion-Based Attribute Manipulations

Paper • 2303.09289 • Published Mar 16, 2023 • 1

Distilling Adversarial Prompts from Safety Benchmarks: Report for the Adversarial Nibbler Challenge

Paper • 2309.11575 • Published Sep 20, 2023

MultiFusion: Fusing Pre-Trained Models for Multi-Lingual, Multi-Modal Image Generation

Paper • 2305.15296 • Published May 24, 2023

Mitigating Inappropriateness in Image Generation: Can there be Value in Reflecting the World's Ugliness?

Paper • 2305.18398 • Published May 28, 2023 • 1

LLavaGuard: VLM-based Safeguards for Vision Dataset Curation and Safety Assessment

Paper • 2406.05113 • Published Jun 7, 2024 • 2

AtMan: Understanding Transformer Predictions Through Memory Efficient Attention Manipulation

Paper • 2301.08110 • Published Jan 19, 2023 • 1

SCAR: Sparse Conditioned Autoencoders for Concept Detection and Steering in LLMs

Paper • 2411.07122 • Published Nov 11, 2024

Judging Quality Across Languages: A Multilingual Approach to Pretraining Data Filtering with Language Models

Paper • 2505.22232 • Published 18 days ago • 18

BramVanroy

posted an update about 1 month ago

Post

3190

📢💾 Introducing the Common Crawl Creative Commons Corpus (C5)!

C5 is a large-scale effort to heavily filter web-crawled data, as collected by the non-profit Common Crawl, down to only those documents that carry a Creative Commons license, such as cc-by-4.0 or public-domain cc0. At this stage, 150 billion tokens have been collected.

---
📄 data: BramVanroy/CommonCrawl-CreativeCommons
🧰 software: https://github.com/BramVanroy/CommonCrawl-CreativeCommons
---

</> To build C5, HTML pages are scrutinized and all links (if any) to CC licenses are collected, both in regular hyperlinks and in metadata. Additional data fields are included, such as "was the license found in the head?" or "if multiple licenses were found, do they contradict each other?", which makes further filtering a breeze.
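The license-detection step described above can be sketched roughly as follows. This is only an illustration using Python's standard library, not the actual C5 code (see the linked GitHub repository for that). The function name `find_cc_licenses`, the output field names, the URL pattern, and the rule that "contradicting" means "more than one distinct license" are all assumptions made for this example.

```python
import re
from html.parser import HTMLParser

# Assumed pattern for CC license and public-domain URLs, e.g.
# https://creativecommons.org/licenses/by/4.0/
CC_URL = re.compile(
    r"creativecommons\.org/(licenses|publicdomain)/([a-z\-]+)(?:/(\d\.\d))?",
    re.IGNORECASE,
)


def normalize(kind, abbr, version):
    """Turn URL parts into a short id such as 'cc-by-4.0' or 'cc0-1.0'."""
    abbr = abbr.lower()
    if kind.lower() == "publicdomain":
        name = "cc0" if abbr == "zero" else "public-domain-" + abbr
    else:
        name = "cc-" + abbr
    return f"{name}-{version}" if version else name


class CCLicenseFinder(HTMLParser):
    """Collects CC license URLs from <a>, <link> and <meta> tags."""

    def __init__(self):
        super().__init__()
        self.in_head = False
        self.found = []  # tuples of (license_id, location, in_head)

    def handle_starttag(self, tag, attrs):
        if tag == "head":
            self.in_head = True
            return
        attrs = dict(attrs)
        if tag in ("a", "link"):
            candidate, location = attrs.get("href"), tag + "_href"
        elif tag == "meta":
            candidate, location = attrs.get("content"), "meta_content"
        else:
            return
        match = CC_URL.search(candidate or "")
        if match:
            self.found.append((normalize(*match.groups()), location, self.in_head))

    def handle_endtag(self, tag):
        if tag == "head":
            self.in_head = False


def find_cc_licenses(html):
    """Return hypothetical per-document fields similar to those described."""
    finder = CCLicenseFinder()
    finder.feed(html)
    licenses = sorted({lic for lic, _, _ in finder.found})
    return {
        "licenses": licenses,
        "found_in_head": any(in_head for _, _, in_head in finder.found),
        # Simplifying assumption: any two distinct licenses count as a conflict.
        "licenses_disagree": len(licenses) > 1,
    }
```

Calling `find_cc_licenses(page_html)` on a page's raw HTML returns a small dictionary that downstream filtering could use, for example to keep only documents whose license appears in the head and is unambiguous.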

🌐 In this first version of C5, 8 languages are included (Afrikaans, German, English, French, Frisian, Italian, Dutch and Spanish). The language set was limited for two reasons: computational and storage limitations, and a collaboration with GPT-NL, which requested CC data for these languages to train a Dutch-focused, copyright-conscious LLM. In total, this V1 release contains almost 150 thousand documents and 150 billion tokens. This data was neither filtered for quality nor deduplicated, so that you can decide for yourself how much data to keep. To give some indication of quality, a dataset field describes whether a document is included in the FineWeb(-2) datasets, which are of high quality.

🔍 More work needs to be done! Only 7 out of 100+ Common Crawl crawls have been processed so far. That's encouraging because it means there is a lot more Creative Commons data to be collected! But to get there I need help in terms of compute. The current processing was already heavily sponsored by the Flemish Supercomputer, but more is needed. If you have the compute available and wish to collaborate in an open and transparent manner, please get in touch!

1 reply

mkrausio

updated 2 datasets about 2 months ago

occiglot/arcX

Viewer • Updated Apr 30 • 26.4k • 48

occiglot/hellaswagX

Viewer • Updated Apr 29 • 240k • 91

barthfab

updated a dataset 2 months ago

occiglot/euro-llm-leaderboard-requests

Updated Apr 2 • 1.79k • 2

stefan-it

posted an update 3 months ago

Post

2644

Wohoo 🥳 I have finished my 2025 GPU workstation build and I am very excited to train new awesome open source models on it.

I built my last GPU workstation 5 years ago, featuring an AMD Ryzen 5900X and 64GB of G.SKILL Trident Z RGB on an ASRock X570 Taichi, cooled by an Alphacool Eisbär 420. The GPU was a Zotac RTX 3090 AMP Extreme. Unfortunately, I was never satisfied with the case (some Fractal Define 7): it is definitely too small, the airflow is not optimal since I had to keep the front door open all the time, and it arrived with a partly damaged side panel.

For my new build, I've used the following components: an outstanding new AMD Ryzen 9950X3D with 64GB of Corsair Dominator Titanium (what a name). As a huge Noctua fan - warm greetings to my Austrian neighbors - I am using the brand new Noctua NH-D15 G2 on an ASRock X870E Taichi in an amazing Lian Li LANCOOL III chassis. One joke that only NVIDIA Blackwell users will understand: you definitely need a tempered glass panel to check if your GPU cables/connectors start melting 😂 And the best is yet to come: I returned my previously bought Zotac RTX 5090 Solid to the eBay seller (because of... missing ROPs, only NVIDIA Blackwell users will again understand) and bought a Zotac 5090 AMP Extreme INFINITY (yes, the long name indicates that this is the flagship model from Zotac) from a more trustworthy source (NBB in Germany).

I am so happy to start training and fine-tuning new open source models - stay tuned!!!

2 replies

stefan-it

posted an update 3 months ago

Post

1000

🇹🇷 😍 I'm very happy to finally announce my new Turkish LM called "BERT5urk":

stefan-it/bert5urk

It is a 1.42B T5-based model, trained with the UL2 pretraining objective on the Turkish part of the awesome HuggingFaceFW/fineweb-2 dataset.

Feel free to check it out!

1 reply

Team members 15
