jed351/gpt2-base-zh-hk


	This model has not been trained on any Cantonese material.

	It is simply a base model whose tokenizer and embeddings were patched to include Cantonese characters. The original model is [gpt2-tiny-chinese](https://huggingface.co/ckiplab/gpt2-tiny-chinese).






	I used this [repo](https://github.com/ayaka14732/bert-tokenizer-cantonese) to identify the Cantonese characters missing from the tokenizer.

	My forked and modified version is [here](https://github.com/jedcheng/bert-tokenizer-cantonese).
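The identification step can be pictured as a set difference between a list of Cantonese characters and the tokenizer's vocabulary. The sketch below is only an illustration of that idea, not the method used in the linked repos: `find_missing_chars`, the toy vocabulary, and the sample character list are all hypothetical.

```python
def find_missing_chars(vocab, candidate_chars):
    """Return the candidate characters absent from vocab, in first-seen order."""
    seen = set()
    missing = []
    for ch in candidate_chars:
        if ch not in vocab and ch not in seen:
            seen.add(ch)
            missing.append(ch)
    return missing


# Toy vocabulary standing in for a real tokenizer's vocabulary (assumption).
toy_vocab = {"[UNK]", "我", "你", "好", "人"}

# A few sample characters, some of them specific to written Cantonese.
cantonese_chars = ["嘅", "咗", "我", "哋", "嘅", "好"]

missing = find_missing_chars(toy_vocab, cantonese_chars)
print(missing)  # ['嘅', '咗', '哋']
```

With a real tokenizer, the keys of `tokenizer.get_vocab()` could play the role of `toy_vocab`.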

	Once the missing characters have been identified, the Hugging Face Transformers library provides a high-level API for modifying the tokenizer and embeddings.

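	```
	from transformers import AutoModelForCausalLM, AutoTokenizer

	# Download a tokenizer and a model from the Hugging Face Hub (placeholder name).
	tokenizer = AutoTokenizer.from_pretrained("your base model")
	model = AutoModelForCausalLM.from_pretrained("your base model")

	# Add the missing characters as a list, then grow the embeddings to match.
	tokenizer.add_tokens(["your", "new", "tokens"])
	model.resize_token_embeddings(len(tokenizer))

	# Upload the patched tokenizer; the resized model can be pushed the same way.
	tokenizer.push_to_hub("your model name")
	model.push_to_hub("your model name")
	```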
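To see why the patched model still needs training on Cantonese text, it may help to picture what resizing an embedding matrix amounts to: rows for existing tokens are kept, and rows for new tokens are freshly initialized. The sketch below is a conceptual, pure-Python illustration under that assumption, not the library's implementation. `resize_embeddings` is a hypothetical helper, the Gaussian initialization is an assumed choice, and the input matrix is assumed to be non-empty.

```python
import random


def resize_embeddings(embeddings, new_size, seed=0):
    """Return a copy of embeddings with exactly new_size rows.

    Existing rows are copied unchanged; any extra rows are filled with
    small random values (an assumed initialization, for illustration only).
    """
    rng = random.Random(seed)
    dim = len(embeddings[0])
    resized = [row[:] for row in embeddings[:new_size]]
    while len(resized) < new_size:
        resized.append([rng.gauss(0.0, 0.02) for _ in range(dim)])
    return resized


# Three existing tokens with 2-dimensional embeddings (toy values).
old = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]]

# Two new tokens were added, so the table grows to five rows.
new = resize_embeddings(old, 5)

print(len(new))        # 5
print(new[:3] == old)  # True
```

The two appended rows carry no learned information, which is consistent with the note above that this model has not been trained on any Cantonese material.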