Model Card

The hfl/chinese-lert-base architecture pretrained from scratch on a Chinese web-novel corpus with the MLM objective.

Model Description

  • Architecture: LERT-base
  • Pretraining objective: Masked Language Modeling (MLM)
  • Language: Chinese
  • Parameters: ~102M (float32 weights)

Data

  • The dataset was built from Chivi's novel corpus (specifically the parts from 1.db to 20.db), containing approximately 325M sentences.
  • Preprocessing: normalization → tokenization with a custom BPE tokenizer → random masking of 15% of tokens (see the sketch after the note below).

Note: We use a custom vocabulary specifically designed for the Chinese web novel domain to minimize the number of unknown tokens encountered during NLP tasks.
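
The 15% random masking matches the standard dynamic MLM masking in Hugging Face Transformers. Below is a minimal sketch of that step, assuming the custom tokenizer is the one shipped in this repo and using a placeholder sentence; it is not the original preprocessing script:

from transformers import AutoTokenizer, DataCollatorForLanguageModeling

# The custom tokenizer shipped with this repo (assumption: same repo id as the model).
tokenizer = AutoTokenizer.from_pretrained("chi-vi/chivi-lert-base")

# A normalized sentence from the novel corpus (placeholder example).
encoding = tokenizer("他抬头看了看天色,转身走进了客栈。", truncation=True, max_length=512)

# Randomly mask 15% of tokens, as in the preprocessing described above.
collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=True, mlm_probability=0.15)
batch = collator([encoding])
print(batch["input_ids"])  # some positions replaced with tokenizer.mask_token_id
print(batch["labels"])     # original ids at masked positions, -100 elsewhere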

Training Config

  • Epochs: 3
  • Optimizer: AdamW
  • Learning rate: 1e-4 with 20k warm-up steps
  • Batch size: 128
  • Max sequence length: 512
  • Total training: ~6M steps (~750 hours)
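
The configuration above maps roughly onto the following Trainer setup. This is a sketch under assumptions, not the original training script: BertConfig/BertForMaskedLM stand in for the LERT-base encoder (LERT uses a BERT-style encoder), the one-sentence dataset stands in for the ~325M-sentence corpus, and AdamW is the Trainer default optimizer:

from transformers import (AutoTokenizer, BertConfig, BertForMaskedLM,
                          DataCollatorForLanguageModeling, Trainer, TrainingArguments)

tokenizer = AutoTokenizer.from_pretrained("chi-vi/chivi-lert-base")

# Fresh BERT-style encoder sized to the custom vocabulary (pretraining from scratch).
config = BertConfig(vocab_size=tokenizer.vocab_size, max_position_embeddings=512)
model = BertForMaskedLM(config)

# Placeholder dataset; the real corpus has ~325M tokenized sentences.
train_dataset = [tokenizer(t, truncation=True, max_length=512)
                 for t in ["他抬头看了看天色,转身走进了客栈。"]]

args = TrainingArguments(
    output_dir="chivi-lert-base",
    num_train_epochs=3,
    learning_rate=1e-4,
    warmup_steps=20_000,
    per_device_train_batch_size=128,  # assumes the batch of 128 fits on one device
)

trainer = Trainer(
    model=model,
    args=args,
    train_dataset=train_dataset,
    data_collator=DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm_probability=0.15),
)
trainer.train()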

How to use

Use the model with a pipeline for the masked language modeling task:

from transformers import AutoTokenizer, AutoModelForMaskedLM, pipeline

tokenizer = AutoTokenizer.from_pretrained("chi-vi/chivi-lert-base")
model = AutoModelForMaskedLM.from_pretrained("chi-vi/chivi-lert-base")

pipe = pipeline("fill-mask", model=model, tokenizer=tokenizer)
print(pipe("9月14号周日晚间,美林公司同意以440亿美元出售[MASK]米国银行。"))
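
You can also query the masked position directly with PyTorch instead of the pipeline. A short sketch, assuming the tokenizer's mask token is [MASK]:

import torch
from transformers import AutoTokenizer, AutoModelForMaskedLM

tokenizer = AutoTokenizer.from_pretrained("chi-vi/chivi-lert-base")
model = AutoModelForMaskedLM.from_pretrained("chi-vi/chivi-lert-base")

text = "9月14号周日晚间,美林公司同意以440亿美元出售[MASK]米国银行。"
inputs = tokenizer(text, return_tensors="pt")

with torch.no_grad():
    logits = model(**inputs).logits

# Locate the [MASK] position and take the 5 most likely replacement tokens.
mask_index = (inputs["input_ids"] == tokenizer.mask_token_id).nonzero(as_tuple=True)[1]
top_ids = logits[0, mask_index].topk(5).indices[0].tolist()
print(tokenizer.convert_ids_to_tokens(top_ids))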