distilrubert-tiny-cased-conversational-5k

Conversational DistilRuBERT-tiny-5k (Russian, cased, 3 layers, 264 hidden units, 12 heads, 3.6M parameters, 5K vocab) was trained on OpenSubtitles[1], Dirty, Pikabu, and a Social Media segment of the Taiga corpus[2] (the same data as Conversational RuBERT).

Our DistilRuBERT-tiny-5k is highly inspired by [3] and [4], and its architecture is very close to [5]. Namely, we use the following objectives (a rough combined sketch is given after the list below):

  • MLM loss (between token labels and student output distribution)
  • KL loss (between averaged student and teacher hidden states)
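
A minimal, illustrative sketch of how these two objectives could be combined in PyTorch is shown below. The layer averaging, the projection from the student to the teacher hidden size, the temperature, and the loss weighting are assumptions for illustration, not the exact training code.

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, labels, student_hidden, teacher_hidden,
                      proj, alpha=0.5, temperature=1.0):
    """Hedged sketch of the two objectives above (not the exact training code).

    student_logits: (batch, seq, vocab) MLM predictions of the student
    labels:         (batch, seq) token ids, with -100 on positions that are not masked
    student_hidden: list of student hidden states, each (batch, seq, d_student)
    teacher_hidden: list of teacher hidden states, each (batch, seq, d_teacher)
    proj:           assumed nn.Linear(d_student, d_teacher) to align dimensions
    """
    # MLM loss between token labels and the student output distribution
    mlm_loss = F.cross_entropy(
        student_logits.view(-1, student_logits.size(-1)),
        labels.view(-1),
        ignore_index=-100,
    )

    # KL loss between averaged student and teacher hidden states
    student_avg = proj(torch.stack(student_hidden).mean(dim=0))
    teacher_avg = torch.stack(teacher_hidden).mean(dim=0)
    kl_loss = F.kl_div(
        F.log_softmax(student_avg / temperature, dim=-1),
        F.softmax(teacher_avg / temperature, dim=-1),
        reduction="batchmean",
    )

    return alpha * mlm_loss + (1.0 - alpha) * kl_loss
```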

The key feature is:

  • reduced vocabulary size (5K, vs. 30K in the tiny version and 100K in the base and small versions)

Here is a comparison between the teacher model (Conversational RuBERT) and other distilled models.

| Model name | # params, M | # vocab, K | Mem., MB |
|---|---|---|---|
| rubert-base-cased-conversational | 177.9 | 120 | 679 |
| distilrubert-base-cased-conversational | 135.5 | 120 | 517 |
| distilrubert-small-cased-conversational | 107.1 | 120 | 409 |
| cointegrated/rubert-tiny | 11.8 | 30 | 46 |
| cointegrated/rubert-tiny2 | 29.3 | 84 | 112 |
| distilrubert-tiny-cased-conversational-v1 | 10.4 | 31 | 41 |
| distilrubert-tiny-cased-conversational-5k | 3.6 | 5 | 14 |
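
If you want to sanity-check these numbers yourself, a sketch like the one below prints the vocabulary size and parameter count. It assumes the checkpoint is published on the Hugging Face Hub under DeepPavlov/distilrubert-tiny-cased-conversational-5k; adjust the id if your copy lives elsewhere.

```python
from transformers import AutoModel, AutoTokenizer

# Hub id assumed to match this card
name = "DeepPavlov/distilrubert-tiny-cased-conversational-5k"

tokenizer = AutoTokenizer.from_pretrained(name)
model = AutoModel.from_pretrained(name)

print(f"vocab size: {tokenizer.vocab_size}")                                   # ~5K
print(f"params, M:  {sum(p.numel() for p in model.parameters()) / 1e6:.1f}")   # ~3.6
```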

DistilRuBERT-tiny-5k was trained for about 100 hours on 7 NVIDIA Tesla P100-SXM2.0 16 GB GPUs.

We used PyTorchBenchmark from transformers to evaluate the model's performance and compare it with other pre-trained language models for Russian. All tests were performed on an NVIDIA GeForce GTX 1080 Ti GPU and an Intel(R) Core(TM) i7-7700K CPU @ 4.20GHz.
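
A benchmark run close to the setup above can be reproduced with a snippet like the following. This is a sketch: the Hub id and the exact flags are assumptions, and the benchmark utilities are marked as deprecated in recent transformers releases.

```python
from transformers import PyTorchBenchmark, PyTorchBenchmarkArguments

args = PyTorchBenchmarkArguments(
    models=["DeepPavlov/distilrubert-tiny-cased-conversational-5k"],
    batch_sizes=[16],
    sequence_lengths=[512],
    inference=True,  # measure the forward pass
    speed=True,      # report time per batch
    memory=True,     # report peak memory
)
benchmark = PyTorchBenchmark(args)
results = benchmark.run()
```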

| Model name | Batch size | Seq len | CPU time, s | GPU time, s | CPU mem, MB | GPU mem, MB |
|---|---|---|---|---|---|---|
| rubert-base-cased-conversational | 16 | 512 | 5.283 | 0.1866 | 1550 | 1938 |
| distilrubert-base-cased-conversational | 16 | 512 | 2.335 | 0.0553 | 2177 | 2794 |
| distilrubert-small-cased-conversational | 16 | 512 | 0.802 | 0.0015 | 1541 | 1810 |
| cointegrated/rubert-tiny | 16 | 512 | 0.942 | 0.0022 | 1308 | 2088 |
| cointegrated/rubert-tiny2 | 16 | 512 | 1.786 | 0.0023 | 3054 | 3848 |
| distilrubert-tiny-cased-conversational-v1 | 16 | 512 | 0.374 | 0.002 | 714 | 1158 |
| distilrubert-tiny-cased-conversational-5k | 16 | 512 | 0.354 | 0.0018 | 664 | 1126 |

To evaluate model quality, we fine-tuned DistilRuBERT-tiny-5k on classification (RuSentiment, ParaPhraser), NER, and question answering datasets for Russian. The results, along with performance benchmarks and training details, can be found in Table 4 of the paper.
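
As a starting point for such fine-tuning, the model can be loaded with the standard transformers API, for example with a classification head on top. This is a hedged sketch: the Hub id and the number of labels are assumptions, and the classification head is randomly initialized until fine-tuned.

```python
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

# Hub id assumed to match this card; num_labels is illustrative
name = "DeepPavlov/distilrubert-tiny-cased-conversational-5k"
tokenizer = AutoTokenizer.from_pretrained(name)
model = AutoModelForSequenceClassification.from_pretrained(name, num_labels=3)

batch = tokenizer(
    ["Привет, как дела?"],  # "Hi, how are you?"
    return_tensors="pt",
    padding=True,
    truncation=True,
    max_length=512,
)
with torch.no_grad():
    logits = model(**batch).logits
print(logits.shape)  # (1, 3)
```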

Citation

If you find the model useful for your research, we kindly ask you to cite this paper:

@misc{https://doi.org/10.48550/arxiv.2205.02340,
  doi = {10.48550/ARXIV.2205.02340},
  url = {https://arxiv.org/abs/2205.02340},
  author = {Kolesnikova, Alina and Kuratov, Yuri and Konovalov, Vasily and Burtsev, Mikhail},
  keywords = {Computation and Language (cs.CL), Machine Learning (cs.LG), FOS: Computer and information sciences},
  title = {Knowledge Distillation of Russian Language Models with Reduction of Vocabulary},
  publisher = {arXiv},
  year = {2022},
  copyright = {arXiv.org perpetual, non-exclusive license}
}

[1]: P. Lison and J. Tiedemann, 2016, OpenSubtitles2016: Extracting Large Parallel Corpora from Movie and TV Subtitles. In Proceedings of the 10th International Conference on Language Resources and Evaluation (LREC 2016)

[2]: Shavrina T., Shapovalova O. (2017). To the Methodology of Corpus Construction for Machine Learning: "Taiga" Syntax Tree Corpus and Parser. In Proceedings of the "CORPORA 2017" International Conference, Saint Petersburg, 2017.

[3]: Sanh, V., Debut, L., Chaumond, J., & Wolf, T. (2019). DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter. arXiv preprint arXiv:1910.01108.

[4]: https://github.com/huggingface/transformers/tree/master/examples/research_projects/distillation

[5]: https://habr.com/ru/post/562064/, https://huggingface.co/cointegrated/rubert-tiny
