---
library_name: transformers
datasets:
- WebOrganizer/FormatAnnotations-Llama-3.1-8B
- WebOrganizer/FormatAnnotations-Llama-3.1-405B-FP8
base_model:
- answerdotai/ModernBERT-base
---

# wissamantoun/WebOrganizer-FormatClassifier-ModernBERT

[[Paper](https://arxiv.org/abs/2502.10341)] [[Website](https://weborganizer.allenai.org)] [[GitHub](https://github.com/CodeCreator/WebOrganizer)]

*All credit goes to the original authors of the model and dataset. This is a retraining of the original model with a different base model.*

The FormatClassifier organizes web content into 24 categories based on the URL and text contents of web pages.
The model is a [ModernBERT-base](https://huggingface.co/answerdotai/ModernBERT-base) model with 149M parameters, fine-tuned on the following training data:

1. [WebOrganizer/FormatAnnotations-Llama-3.1-8B](https://huggingface.co/datasets/WebOrganizer/FormatAnnotations-Llama-3.1-8B): 1M documents annotated by Llama-3.1-8B (first-stage training)
2. [WebOrganizer/FormatAnnotations-Llama-3.1-405B-FP8](https://huggingface.co/datasets/WebOrganizer/FormatAnnotations-Llama-3.1-405B-FP8): 100K documents annotated by Llama-3.1-405B-FP8 (second-stage training)

#### All Domain Classifiers
- [wissamantoun/WebOrganizer-FormatClassifier-ModernBERT](https://huggingface.co/wissamantoun/WebOrganizer-FormatClassifier-ModernBERT) *← you are here!*
- [wissamantoun/WebOrganizer-TopicClassifier-ModernBERT](https://huggingface.co/wissamantoun/WebOrganizer-TopicClassifier-ModernBERT)

## Usage

This classifier expects input in the following format:
```
{url}

{text}
```

Example:
```python
from transformers import AutoTokenizer, AutoModelForSequenceClassification

tokenizer = AutoTokenizer.from_pretrained("wissamantoun/WebOrganizer-FormatClassifier-ModernBERT")
model = AutoModelForSequenceClassification.from_pretrained(
    "wissamantoun/WebOrganizer-FormatClassifier-ModernBERT",
    trust_remote_code=True,
    use_memory_efficient_attention=False)

web_page = """http://www.example.com

How to build a computer from scratch? Here are the components you need..."""

inputs = tokenizer([web_page], return_tensors="pt")
outputs = model(**inputs)

probs = outputs.logits.softmax(dim=-1)
print(probs.argmax(dim=-1))
# -> index of the predicted format category (see the label list below)
```

You can convert the `logits` of the model with a softmax to obtain a probability distribution over the following 24 categories (in order of labels, also see `id2label` and `label2id` in the model config):
0. Academic Writing
1. Content Listing
2. Creative Writing
3. Customer Support
4. Comment Section
5. FAQ
6. Truncated
7. Knowledge Article
8. Legal Notices
9. Listicle
10. News Article
11. Nonfiction Writing
12. About (Org.)
13. News (Org.)
14. About (Pers.)
15. Personal Blog
16. Product Page
17. Q&A Forum
18. Spam / Ads
19. Structured Data
20. Documentation
21. Audio Transcript
22. Tutorial
23. User Review

The full definitions of the categories can be found in the [taxonomy config](https://github.com/CodeCreator/WebOrganizer/blob/main/define_domains/taxonomies/formats.yaml).
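Since the label names are stored in the model config (see `id2label` above), you can map the predicted index straight back to a category name. Below is a minimal sketch of this, assuming the checkpoint's `id2label` mapping matches the list above; it simply prints the three most likely categories and their probabilities:

```python
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

model_name = "wissamantoun/WebOrganizer-FormatClassifier-ModernBERT"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name, trust_remote_code=True)

web_page = """http://www.example.com

How to build a computer from scratch? Here are the components you need..."""

inputs = tokenizer([web_page], return_tensors="pt")
with torch.no_grad():
    probs = model(**inputs).logits.softmax(dim=-1)[0]

# Report the three most likely format categories with their probabilities
top = torch.topk(probs, k=3)
for p, idx in zip(top.values.tolist(), top.indices.tolist()):
    print(f"{model.config.id2label[idx]}: {p:.3f}")
```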
## Scores

```
***** pred metrics *****
test_accuracy = 0.8154
test_accuracy__0 = 0.855
test_accuracy__1 = 0.7558
test_accuracy__10 = 0.9071
test_accuracy__11 = 0.6869
test_accuracy__12 = 0.8055
test_accuracy__13 = 0.7897
test_accuracy__14 = 0.8592
test_accuracy__15 = 0.8541
test_accuracy__16 = 0.8788
test_accuracy__17 = 0.7733
test_accuracy__18 = 0.7286
test_accuracy__19 = 0.6989
test_accuracy__2 = 0.7474
test_accuracy__20 = 0.7609
test_accuracy__21 = 0.7807
test_accuracy__22 = 0.7703
test_accuracy__23 = 0.7931
test_accuracy__3 = 0.6351
test_accuracy__4 = 0.871
test_accuracy__5 = 0.8333
test_accuracy__6 = 0.6125
test_accuracy__7 = 0.6416
test_accuracy__8 = 0.78
test_accuracy__9 = 0.7668
test_accuracy_conf50 = 0.8312
test_accuracy_conf50__0 = 0.8852
test_accuracy_conf50__1 = 0.7651
test_accuracy_conf50__10 = 0.9167
test_accuracy_conf50__11 = 0.7168
test_accuracy_conf50__12 = 0.8256
test_accuracy_conf50__13 = 0.7996
test_accuracy_conf50__14 = 0.8696
test_accuracy_conf50__15 = 0.8684
test_accuracy_conf50__16 = 0.8878
test_accuracy_conf50__17 = 0.7838
test_accuracy_conf50__18 = 0.7663
test_accuracy_conf50__19 = 0.7276
test_accuracy_conf50__2 = 0.7609
test_accuracy_conf50__20 = 0.7907
test_accuracy_conf50__21 = 0.8
test_accuracy_conf50__22 = 0.7927
test_accuracy_conf50__23 = 0.7904
test_accuracy_conf50__3 = 0.6617
test_accuracy_conf50__4 = 0.877
test_accuracy_conf50__5 = 0.8571
test_accuracy_conf50__6 = 0.6299
test_accuracy_conf50__7 = 0.6786
test_accuracy_conf50__8 = 0.7755
test_accuracy_conf50__9 = 0.7796
test_accuracy_conf75 = 0.9003 <--- Metric from the paper
test_accuracy_conf75__0 = 0.9412
test_accuracy_conf75__1 = 0.8318
test_accuracy_conf75__10 = 0.9542
test_accuracy_conf75__11 = 0.8478
test_accuracy_conf75__12 = 0.8841
test_accuracy_conf75__13 = 0.8724
test_accuracy_conf75__14 = 0.914
test_accuracy_conf75__15 = 0.9345
test_accuracy_conf75__16 = 0.9316
test_accuracy_conf75__17 = 0.8667
test_accuracy_conf75__18 = 0.8446
test_accuracy_conf75__19 = 0.8209
test_accuracy_conf75__2 = 0.8333
test_accuracy_conf75__20 = 0.9333
test_accuracy_conf75__21 = 0.8587
test_accuracy_conf75__22 = 0.8708
test_accuracy_conf75__23 = 0.8309
test_accuracy_conf75__3 = 0.7292
test_accuracy_conf75__4 = 0.9357
test_accuracy_conf75__5 = 0.9032
test_accuracy_conf75__6 = 0.7816
test_accuracy_conf75__7 = 0.8011
test_accuracy_conf75__8 = 0.8409
test_accuracy_conf75__9 = 0.8592
test_accuracy_label_average = 0.7744
test_accuracy_label_average_conf50 = 0.7919
test_accuracy_label_average_conf75 = 0.8676
test_accuracy_label_min = 0.6125
test_accuracy_label_min_conf75 = 0.7292 <--- Metric from the paper
test_loss = 0.6023
test_proportion_conf50 = 0.9638
test_proportion_conf75 = 0.7951
test_runtime = 0:00:08.38
test_samples_per_second = 1192.262
test_steps_per_second = 37.318
```

## Citation
```bibtex
@article{wettig2025organize,
  title={Organize the Web: Constructing Domains Enhances Pre-Training Data Curation},
  author={Alexander Wettig and Kyle Lo and Sewon Min and Hannaneh Hajishirzi and Danqi Chen and Luca Soldaini},
  journal={arXiv preprint arXiv:2502.10341},
  year={2025}
}
```
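A note on the `conf50`/`conf75` rows in the scores above: they restrict evaluation to documents where the classifier's top softmax probability is at least 0.5 or 0.75, with the `test_proportion_conf*` rows giving the fraction of documents retained at each threshold. The sketch below illustrates that computation under this interpretation; the helper `confident_accuracy` and the toy data are illustrative, not part of the released evaluation code:

```python
import torch

def confident_accuracy(probs, labels, threshold=0.75):
    """Accuracy over predictions whose top probability >= threshold,
    plus the proportion of examples that clear the threshold."""
    confidence, preds = probs.max(dim=-1)
    keep = confidence >= threshold
    proportion = keep.float().mean().item()
    if keep.sum() == 0:
        return float("nan"), proportion
    accuracy = (preds[keep] == labels[keep]).float().mean().item()
    return accuracy, proportion

# Toy placeholder data; replace with real model probabilities and gold labels
probs = torch.rand(8, 24).softmax(dim=-1)
labels = torch.randint(0, 24, (8,))
acc, prop = confident_accuracy(probs, labels, threshold=0.75)
print(f"accuracy@conf>=0.75: {acc:.4f} (computed on {prop:.0%} of examples)")
```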