DictaBERT-char-spacefix: A finetuned BERT model for restoring missing spaces in Hebrew texts.
DictaBERT-char-spacefix is a fine-tuned BERT model based on dicta-il/dictabert-char for the task of restoring missing spaces in Hebrew text.
This model was released to the public as part of the following 2025 W-NUT paper: Avi Shmidman and Shaltiel Shmidman, "Restoring Missing Spaces in Scraped Hebrew Social Media", The 10th Workshop on Noisy and User-generated Text (W-NUT), 2025.
Sample usage:
from transformers import pipeline
oracle = pipeline('token-classification', model='dicta-il/dictabert-char-spacefix')
text = '讘砖谞转 1948 讛砖诇讬诐讗驻专讬诐 拽讬砖讜谉 讗转诇讬诪讜讚讬讜讘驻讬住讜诇诪转讻转 讜讘转讜诇讚讜转讛讗诪谞讜转 讜讛讞诇 诇驻专住诐诪讗诪专讬诐 讛讜诪讜专讬住讟讬讬诐'
raw_output = oracle(text)
# Classifier returns LABEL_1 if there should be a space before the character
text_output = ''.join((' ' if o['entity'] == 'LABEL_1' else '') + o['word'] for o in raw_output)
print(text_output)
Output:
讘砖谞转 1948 讛砖诇讬诐 讗驻专讬诐 拽讬砖讜谉 讗转 诇讬诪讜讚讬讜 讘驻讬住讜诇 诪转讻转 讜讘转讜诇讚讜转 讛讗诪谞讜转 讜讛讞诇 诇驻专住诐 诪讗诪专讬诐 讛讜诪讜专讬住讟讬讬诐
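For processing many scraped strings, the same logic can be wrapped in a small helper that reuses one pipeline instance. The following is a minimal sketch, not part of the official example; the restore_spaces helper and the sample input list are illustrative assumptions built on the pipeline output format shown above:

from transformers import pipeline

# Load the space-restoration classifier once and reuse it for all texts.
oracle = pipeline('token-classification', model='dicta-il/dictabert-char-spacefix')

def restore_spaces(text: str) -> str:
    # Insert a space before every character the classifier tags as LABEL_1.
    raw_output = oracle(text)
    return ''.join((' ' if o['entity'] == 'LABEL_1' else '') + o['word'] for o in raw_output)

# Hypothetical list of scraped strings with missing spaces.
scraped_texts = ['讘砖谞转 1948 讛砖诇讬诐讗驻专讬诐 拽讬砖讜谉 讗转诇讬诪讜讚讬讜讘驻讬住讜诇诪转讻转']
for fixed in map(restore_spaces, scraped_texts):
    print(fixed)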
Citation
If you use DictaBERT-char-spacefix in your research, please cite "Restoring Missing Spaces in Scraped Hebrew Social Media":
BibTeX:
@inproceedings{shmidman-shmidman-2025-restoring,
title = "Restoring Missing Spaces in Scraped {H}ebrew Social Media",
author = "Shmidman, Avi and
Shmidman, Shaltiel",
editor = "Bak, JinYeong and
Goot, Rob van der and
Jang, Hyeju and
Buaphet, Weerayut and
Ramponi, Alan and
Xu, Wei and
Ritter, Alan",
booktitle = "Proceedings of the Tenth Workshop on Noisy and User-generated Text",
month = may,
year = "2025",
address = "Albuquerque, New Mexico, USA",
publisher = "Association for Computational Linguistics",
url = "https://aclanthology.org/2025.wnut-1.3/",
pages = "16--25",
ISBN = "979-8-89176-232-9",
abstract = "A formidable challenge regarding scraped corpora of social media is the omission of whitespaces, causing pairs of words to be conflated together as one. In order for the text to be properly parsed and analyzed, these missing spaces must be detected and restored. However, it is particularly hard to restore whitespace in languages such as Hebrew which are written without vowels, because a conflated form can often be split into multiple different pairs of valid words. Thus, a simple dictionary lookup is not feasible. In this paper, we present and evaluate a series of neural approaches to restore missing spaces in scraped Hebrew social media. Our best all-around method involved pretraining a new character-based BERT model for Hebrew, and then fine-tuning a space restoration model on top of this new BERT model. This method is blazing fast, high-performing, and open for unrestricted use, providing a practical solution to process huge Hebrew social media corpora with a consumer-grade GPU. We release the new BERT model and the fine-tuned space-restoration model to the NLP community."
}
License
This work is licensed under a Creative Commons Attribution 4.0 International License.