Incremental BPE builder
A modified, simplified version of text_encoder_build_subword.py and its dependencies from the tensor2tensor library, adapted so that its output fits Google Research's open-source BERT project.
Requirements
This project was developed in the following environment:
- python 3.6
- tensorflow 1.11
Basic usage
Build domain-specific vocabulary automatically
To build a vocabulary of appropriate size for a specific domain using the incremental algorithm from our paper, run:
python vocab_extend.py \
--corpus {file for the domain corpus} \
--raw_vocab {bert_raw_vocab_file} \
--output_file {the output file of the final vocabulary} \
--interval {vocab size interval} \
--threshold {threshold for P(D)}
# Example using sample data
python vocab_extend.py --corpus test_data/chem.txt \
--raw_vocab test_data/vocab.txt \
--output_file test_data/chem.vocab \
--interval 1000 --threshold 1
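After the example run, it can help to see which tokens the new vocabulary contains that the raw BERT vocabulary does not. The following is a minimal sketch, not part of this project: it assumes both files use BERT's one-token-per-line format, and the helper names `load_vocab` and `added_tokens` are hypothetical.

```python
from pathlib import Path


def load_vocab(path):
    """Read a vocabulary file with one token per line, preserving file order."""
    lines = Path(path).read_text(encoding="utf-8").splitlines()
    # Skip blank lines and strip surrounding whitespace from each token.
    return [line.strip() for line in lines if line.strip()]


def added_tokens(raw_vocab, new_vocab):
    """Return the tokens of new_vocab that are absent from raw_vocab, in order."""
    seen = set(raw_vocab)
    return [tok for tok in new_vocab if tok not in seen]
```

With the sample data, `added_tokens(load_vocab("test_data/vocab.txt"), load_vocab("test_data/chem.vocab"))` would list the domain-specific tokens found in the output file.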
If you simply want a vocabulary of a specific size, run:
python subword_builder.py \
--corpus_filepattern {corpus_for_vocab} \
--raw_vocab {bert_raw_vocab_file} \
--output_filename {name_of_vocab} \
--vocab_size {final vocab size} \
--do_lower_case
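To illustrate how a BERT-style vocabulary is consumed downstream, here is a minimal sketch of greedy longest-match-first WordPiece splitting for a single word. It is an illustration under stated assumptions, not this project's code: the function name `wordpiece_tokenize` and the toy vocabulary are hypothetical, and continuation pieces are assumed to carry the `##` prefix as in BERT vocabularies. If the vocabulary was built with --do_lower_case, the input word should presumably be lowercased first.

```python
def wordpiece_tokenize(word, vocab, unk_token="[UNK]", max_chars=100):
    """Split one word into subwords by greedy longest-match-first lookup."""
    if len(word) > max_chars:
        return [unk_token]
    pieces = []
    start = 0
    while start < len(word):
        end = len(word)
        match = None
        # Shrink the candidate from the right until it is found in the vocab.
        while start < end:
            piece = word[start:end]
            if start > 0:
                piece = "##" + piece  # non-initial pieces carry the "##" prefix
            if piece in vocab:
                match = piece
                break
            end -= 1
        if match is None:
            return [unk_token]  # no piece fits, so the whole word is unknown
        pieces.append(match)
        start = end
    return pieces


# Toy vocabulary, made up for illustration only.
vocab = {"[UNK]", "benz", "##ene", "##yl", "meth"}
print(wordpiece_tokenize("benzene", vocab))  # ['benz', '##ene']
print(wordpiece_tokenize("methyl", vocab))   # ['meth', '##yl']
print(wordpiece_tokenize("xyz", vocab))      # ['[UNK]']
```

A domain vocabulary that contains pieces such as `benz` and `##ene` keeps chemical terms in a few meaningful subwords instead of many short fragments.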