Incremental BPE builder

Modified, simplified version of text_encoder_build_subword.py and its dependencies included in tensor2tensor library, making its output fits to google research's open-sourced BERT project.

Requirement

The environment I made this project in consists of :

python 3.6
tensorflow 1.11

Basic usage

Build domain-specific vocabulary automatically

If you want to build a proper size of vocabulary for specific domain using the incremental algorithm in our paper, you can do as following:

python vocab_extend.py \
    --corpus {file for the domain corpus} \
    --raw_vocab {bert_raw_vocab_file} \
    --output_file {he output file of the final vocabulary} \
    --interval {vocab size interval} \
    --threshold {threshold for P(D)}

# Example using sample data
python vocab_extend.py --corpus test_data/chem.txt \
    --raw_vocab test_data/vocab.txt \
    --output_file test_data/chem.vocab \
    --interval 1000 --threshold 1

If you simply want to get a specific size of vocab, you can run the following

python subword_builder.py \
--corpus_filepattern {corpus_for_vocab} \
--raw_vocab {bert_raw_vocab_file} \
--output_filename {name_of_vocab} \
--vocab_size {final vocab size} \
--do_lower_case