---
tags:
  - PretrainModel
  - TCM
  - transformer
  - herberta
  - text-embedding
license: apache-2.0
language:
  - zh
  - en
metrics:
  - accuracy
base_model:
  - hfl/chinese-roberta-wwm-ext-large
new_version: XiaoEnn/herberta_seq_512_V2
inference: true
library_name: transformers
---

# Herberta: A Pretrained Model for TCM Herbal Medicine and Downstream Tasks

## Introduction

Herberta is a pre-trained model developed by the Angelpro Team to advance representation learning and modeling for Traditional Chinese Medicine (TCM). Built on chinese-roberta-wwm-ext-large, Herberta is further pre-trained with a Masked Language Modeling (MLM) objective on 700 ancient TCM books (~538.95M) and 48 modern Chinese medicine textbooks (~54M), yielding a robust model for embedding generation and TCM-specific downstream tasks.

We named the model "Herberta" by combining "Herb" and "RoBERTa" to reflect its focus on herbal medicine research. Herberta is well suited for applications such as:

- **Encoder for Herbal Formulas**: generating meaningful embeddings for TCM formulations (see the sketch after this list).
- **Domain-Specific Word Embeddings**: serving the Chinese medicine text domain.
- **Support for TCM Downstream Tasks**: including classification, labeling, and more.
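As an illustration of the formula-encoder use case, the minimal sketch below embeds two classical formulas with the published `XiaoEnn/herberta` checkpoint and compares them by cosine similarity. The formula strings are illustrative examples, and the mean pooling mirrors the Quickstart section below; this is a sketch, not a prescribed pipeline.

```python
# Minimal sketch: compare two herbal formulas via cosine similarity of their
# Herberta embeddings. The formula strings are illustrative examples only.
import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("XiaoEnn/herberta")
model = AutoModel.from_pretrained("XiaoEnn/herberta")
model.eval()

formulas = [
    "桂枝汤:桂枝、芍药、生姜、大枣、甘草",  # Guizhi Decoction
    "麻黄汤:麻黄、桂枝、杏仁、甘草",        # Mahuang Decoction
]

inputs = tokenizer(formulas, return_tensors="pt", padding=True, truncation=True, max_length=512)
with torch.no_grad():
    hidden = model(**inputs).last_hidden_state   # (batch, seq_len, hidden_size)

embeddings = hidden.mean(dim=1)                  # sentence-level mean pooling, as in the Quickstart
similarity = torch.nn.functional.cosine_similarity(embeddings[0], embeddings[1], dim=0)
print(f"cosine similarity: {similarity.item():.4f}")
```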

## Pretraining Experiments

### Dataset

| Data Type | Quantity | Data Size |
|---|---|---|
| Ancient TCM Books | 700 books | ~538.95M |
| Modern TCM Textbooks | 48 books | ~54M |
| Mixed-Type Dataset | Combined dataset | ~637.8M |

### Pretraining Results

| Model | Eval Accuracy | Validation Loss | Validation Perplexity |
|---|---|---|---|
| herberta_seq_512_v2 | 0.9841 | 0.04367 | 1.083 |
| herberta_seq_128_v2 | 0.9406 | 0.2877 | 1.333 |
| herberta_seq_512_V3 | 0.755 | 1.100 | 3.010 |
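Perplexity here is the exponential of the validation cross-entropy loss; a quick sanity check for one row, assuming that definition:

```python
import math

# Perplexity = exp(cross-entropy loss); e.g. for herberta_seq_128_v2:
print(round(math.exp(0.2877), 3))  # 1.333, matching the table
```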

### Metrics Comparison

*Comparison plots of accuracy, loss, and perplexity across the pretrained models.*

### Pretraining Configuration

#### Modern Textbooks Version

- **Pretraining strategy**: dynamic MASK + warmup + linear decay
- **Sequence length**: 512
- **Batch size**: 16
- **Learning rate**: warmup over 10% of steps, then linear decay from an initial rate of 1e-5
- **Tokenization**: continuous 512-token chunks without sentence segmentation
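A minimal sketch of how this configuration maps onto the Hugging Face Trainer API is shown below. The corpus file, chunking helper, MLM masking probability, and epoch count are illustrative assumptions, not the team's actual training script.

```python
# Minimal MLM pretraining sketch for the configuration above: dynamic masking,
# 10% warmup + linear decay, lr 1e-5, batch size 16, continuous 512-token chunks.
from datasets import load_dataset
from transformers import (
    AutoTokenizer,
    AutoModelForMaskedLM,
    DataCollatorForLanguageModeling,
    Trainer,
    TrainingArguments,
)

base = "hfl/chinese-roberta-wwm-ext-large"
tokenizer = AutoTokenizer.from_pretrained(base)
model = AutoModelForMaskedLM.from_pretrained(base)

raw = load_dataset("text", data_files={"train": "tcm_corpus.txt"})  # hypothetical corpus file

def tokenize(batch):
    return tokenizer(batch["text"], add_special_tokens=False)

def group_texts(batch, block_size=512):
    # Concatenate all tokens and cut into fixed 512-token blocks,
    # i.e. continuous tokenization without sentence segmentation.
    ids = sum(batch["input_ids"], [])
    total = (len(ids) // block_size) * block_size
    chunks = [ids[i : i + block_size] for i in range(0, total, block_size)]
    return {"input_ids": chunks, "attention_mask": [[1] * block_size for _ in chunks]}

tokenized = raw.map(tokenize, batched=True, remove_columns=["text"])
lm_dataset = tokenized.map(group_texts, batched=True, remove_columns=tokenized["train"].column_names)

# The collator masks tokens at batch time, which gives dynamic masking.
collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm_probability=0.15)

args = TrainingArguments(
    output_dir="herberta-pretrain",
    per_device_train_batch_size=16,
    learning_rate=1e-5,
    warmup_ratio=0.1,             # warmup over 10% of steps
    lr_scheduler_type="linear",   # linear decay after warmup
    num_train_epochs=3,           # epoch count not stated in the card; placeholder
)

Trainer(model=model, args=args, train_dataset=lm_dataset["train"], data_collator=collator).train()
```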

## Downstream Task: TCM Pattern Classification

### Task Definition

Using 321 pattern descriptions extracted from TCM internal medicine textbooks, we evaluated the classification performance of four models:

1. **Herberta_seq_512_v2**: pretrained on 700 ancient TCM books.
2. **Herberta_seq_512_v3**: pretrained on 48 modern TCM textbooks.
3. **Herberta_seq_128_v2**: pretrained on 700 ancient TCM books (128-token sequences).
4. **Roberta**: baseline model without TCM-specific pretraining.

### Training Configuration

- Max sequence length: 512
- Batch size: 16
- Epochs: 30
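A minimal sketch of this fine-tuning setup with the Hugging Face Trainer is given below; the CSV files, label count, and accuracy metric are illustrative assumptions, not the original experiment code.

```python
# Minimal fine-tuning sketch for TCM pattern classification
# (max length 512, batch size 16, 30 epochs).
import numpy as np
from datasets import load_dataset
from transformers import (
    AutoTokenizer,
    AutoModelForSequenceClassification,
    Trainer,
    TrainingArguments,
)

model_name = "XiaoEnn/herberta"   # or the baseline hfl/chinese-roberta-wwm-ext-large
num_labels = 10                   # placeholder: set to the actual number of pattern classes

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=num_labels)

# Hypothetical CSV files with "text" (pattern description) and "label" columns.
dataset = load_dataset("csv", data_files={"train": "patterns_train.csv", "eval": "patterns_eval.csv"})

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, padding="max_length", max_length=512)

dataset = dataset.map(tokenize, batched=True)

def compute_metrics(eval_pred):
    logits, labels = eval_pred
    preds = np.argmax(logits, axis=-1)
    return {"accuracy": (preds == labels).mean()}

args = TrainingArguments(
    output_dir="herberta-pattern-cls",
    per_device_train_batch_size=16,
    num_train_epochs=30,
)

trainer = Trainer(
    model=model,
    args=args,
    train_dataset=dataset["train"],
    eval_dataset=dataset["eval"],
    compute_metrics=compute_metrics,
)
trainer.train()
print(trainer.evaluate())
```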

### Results

| Model Name | Eval Accuracy | Eval F1 | Eval Precision | Eval Recall |
|---|---|---|---|---|
| Herberta_seq_512_v2 | 0.9454 | 0.9293 | 0.9221 | 0.9454 |
| Herberta_seq_512_v3 | 0.8989 | 0.8704 | 0.8583 | 0.8989 |
| Herberta_seq_128_v2 | 0.8716 | 0.8443 | 0.8351 | 0.8716 |
| Roberta | 0.8743 | 0.8425 | 0.8311 | 0.8743 |


### Summary

The Herberta_seq_512_v2 model, pretrained on 700 ancient TCM books, exhibited superior performance across all evaluation metrics. This highlights the significance of domain-specific pretraining on larger and historically richer datasets for TCM applications.


## Quickstart

### Use with Hugging Face Transformers

```python
import torch
from transformers import AutoTokenizer, AutoModel

model_name = "XiaoEnn/herberta"

# Load tokenizer and model
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModel.from_pretrained(model_name)

# Input text ("TCM theory is a treasure of our traditional culture.")
text = "中医理论是我国传统文化的瑰宝。"

# Tokenize and prepare input
inputs = tokenizer(text, return_tensors="pt", truncation=True, padding="max_length", max_length=128)

# Get the model's outputs
with torch.no_grad():
    outputs = model(**inputs)

# Get the embedding (sentence-level average pooling)
sentence_embedding = outputs.last_hidden_state.mean(dim=1)

print("Embedding shape:", sentence_embedding.shape)
print("Embedding vector:", sentence_embedding)
```

If you find our work helpful, please consider citing us:

```bibtex
@misc{herberta-embedding,
  title  = {Herberta: A Pretrained Model for TCM Herbal Medicine and Downstream Tasks as Text Embedding Generation},
  url    = {https://github.com/15392778677/herberta},
  author = {Yehan Yang and Xinhan Zheng},
  month  = {December},
  year   = {2024}
}

@article{herberta-technical-report,
  title       = {Herberta: A Pretrained Model for TCM Herbal Medicine and Downstream Tasks as Text Embedding Generation},
  author      = {Yehan Yang and Xinhan Zheng},
  institution = {Beijing Angelpro Technology Co., Ltd.},
  year        = {2024},
  note        = {Presented at the 2024 Machine Learning Applications Conference (MLAC)}
}
```