BERT for Japanese Twitter
This is a base BERT model that has been adapted for Japanese Twitter.
It was adapted from Japanese BERT by preparing a specialized vocabulary and continuing pretraining on a Twitter corpus.
This model is recommended for Japanese SNS tasks, such as sentiment analysis and defamation detection.
This model has served as the base for a series of fine-tuned models, most notably BERT for Japanese Twitter Sentiment and BERT for Japanese Twitter Emotion.
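A minimal usage sketch with the Hugging Face transformers fill-mask pipeline; the repository id below is a placeholder, not this model's confirmed path:

```python
def predict_masked_token(text: str, repo_id: str = "bert-for-japanese-twitter"):
    """Return the top candidates for the [MASK] slot in `text`.

    `repo_id` is a hypothetical placeholder; substitute the actual
    Hub repository id of this model.
    """
    # Deferred import: transformers is only needed when this is called.
    from transformers import pipeline

    fill_mask = pipeline("fill-mask", model=repo_id)
    return [candidate["token_str"] for candidate in fill_mask(text)]


# Example call (downloads the model weights on first use):
# predict_masked_token("今日の天気は[MASK]です。")
```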
Training Data
The Twitter API was used to collect Japanese tweets from June 2022 to April 2023.
N-gram-based deduplication was applied to reduce spam content and improve the diversity of the training corpus. The refined training corpus contains 28 million tweets.
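The deduplication step can be sketched as follows; the character-trigram size and Jaccard similarity threshold here are illustrative assumptions, not the settings actually used for this corpus:

```python
def char_ngrams(text: str, n: int = 3) -> set:
    """Set of character n-grams for a piece of text."""
    return {text[i:i + n] for i in range(len(text) - n + 1)}


def dedupe(tweets, n: int = 3, threshold: float = 0.7):
    """Keep a tweet only if its n-gram Jaccard similarity to every
    previously kept tweet is below `threshold` (assumed values)."""
    kept, kept_grams = [], []
    for tweet in tweets:
        grams = char_ngrams(tweet, n)
        if any(len(grams & seen) / max(1, len(grams | seen)) >= threshold
               for seen in kept_grams):
            continue  # near-duplicate: likely spam or a copied tweet
        kept.append(tweet)
        kept_grams.append(grams)
    return kept
```

Near-duplicates such as mass-posted spam share most of their n-grams, so they exceed the similarity threshold and are dropped, while topically distinct tweets pass through.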
Tokenization
The vocabulary was prepared using the WordPieceTrainer with the Twitter training corpus. It shares 60% of its vocabulary with Japanese BERT.
The vocabulary includes colloquialisms, neologisms, emoji and kaomoji expressions that are common on Twitter.
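A sketch of how such a vocabulary can be built with the tokenizers library's WordPieceTrainer; the vocabulary size, special tokens, and whitespace pre-tokenizer below are assumptions for illustration, not this model's reported configuration:

```python
def train_twitter_wordpiece(corpus_iter, vocab_size: int = 32000):
    """Train a WordPiece vocabulary from an iterator of tweets.

    All hyperparameters here are assumed; a real Japanese setup would
    likely use a morphological pre-tokenizer rather than Whitespace.
    """
    # Deferred imports: tokenizers is only needed when training runs.
    from tokenizers import Tokenizer
    from tokenizers.models import WordPiece
    from tokenizers.pre_tokenizers import Whitespace
    from tokenizers.trainers import WordPieceTrainer

    tokenizer = Tokenizer(WordPiece(unk_token="[UNK]"))
    tokenizer.pre_tokenizer = Whitespace()
    trainer = WordPieceTrainer(
        vocab_size=vocab_size,
        special_tokens=["[PAD]", "[UNK]", "[CLS]", "[SEP]", "[MASK]"],
    )
    tokenizer.train_from_iterator(corpus_iter, trainer)
    return tokenizer
```

Training on the Twitter corpus in this way lets frequent emoji, kaomoji, and slang fragments earn their own subword entries instead of being split into unknown pieces.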