# Tengentoppa-llm-jp-base-3.7B
This is a modified version of the llm-jp-3-3.7b model with additional special tokens for structured conversations. The base model was developed by the Research and Development Center for Large Language Models at the National Institute of Informatics.
## Model Details
- Base Model: llm-jp-3-3.7b
- Model Type: Transformer-based Language Model
- Parameters: 3.7B
- Context Length: 4096
- Languages: Japanese and English
## Additional Special Tokens
This model includes the following special tokens for structured conversations (a quick verification sketch follows the list):
- `<|SYSTEM|>`, `</|SYSTEM|>` - System message delimiters
- `<|USER|>`, `</|USER|>` - User input delimiters
- `<|HINT|>`, `</|HINT|>` - Hint message delimiters
- `<|REASONING|>`, `</|REASONING|>` - Reasoning section delimiters
- `<|ASSISTANT|>`, `</|ASSISTANT|>` - Assistant response delimiters
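To confirm the delimiters are registered in the vocabulary, each one should encode to a single token ID rather than being split into smaller pieces. A minimal check using the standard transformers API:

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("DeL-TaiseiOzaki/Tengentoppa-llm-jp-base-3.7B")

special = ["<|SYSTEM|>", "</|SYSTEM|>", "<|USER|>", "</|USER|>",
           "<|HINT|>", "</|HINT|>", "<|REASONING|>", "</|REASONING|>",
           "<|ASSISTANT|>", "</|ASSISTANT|>"]
for tok in special:
    ids = tokenizer.encode(tok, add_special_tokens=False)
    print(f"{tok} -> {ids}")  # each should be a single-element list
```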
## Required Libraries and Their Versions
- torch>=2.3.0
- transformers>=4.40.1
- tokenizers>=0.19.1
- accelerate>=0.29.3
- flash-attn>=2.5.8
## Usage
```python
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

# Load the model and tokenizer
tokenizer = AutoTokenizer.from_pretrained("DeL-TaiseiOzaki/Tengentoppa-llm-jp-base-3.7B")
model = AutoModelForCausalLM.from_pretrained(
    "DeL-TaiseiOzaki/Tengentoppa-llm-jp-base-3.7B",
    device_map="auto",
    torch_dtype=torch.bfloat16,
)

# Example using special tokens
text = "<|SYSTEM|>You are a helpful assistant.</|SYSTEM|>\n<|USER|>自然言語処理とは何か</|USER|>"
tokenized_input = tokenizer.encode(text, add_special_tokens=False, return_tensors="pt").to(model.device)
with torch.no_grad():
    output = model.generate(
        tokenized_input,
        max_new_tokens=100,
        do_sample=True,
        top_p=0.95,
        temperature=0.7,
        repetition_penalty=1.05,
    )[0]
print(tokenizer.decode(output))
```
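This card does not prescribe a full conversation template, so the following is a hypothetical sketch of how all five delimiter pairs might be combined, continuing from the snippet above (reusing `tokenizer` and `model`). The ordering and the idea of leaving the reasoning span open for generation are assumptions, not a documented format:

```python
# Hypothetical layout combining the delimiters; the ordering below is an
# assumption, not a format documented for this model.
prompt = (
    "<|SYSTEM|>You are a helpful assistant.</|SYSTEM|>\n"
    "<|USER|>自然言語処理とは何か</|USER|>\n"
    "<|HINT|>Answer concisely in Japanese.</|HINT|>\n"
    "<|REASONING|>"  # left open so the model can generate the reasoning span
)
inputs = tokenizer.encode(prompt, add_special_tokens=False, return_tensors="pt").to(model.device)
with torch.no_grad():
    generated = model.generate(inputs, max_new_tokens=200, do_sample=True, top_p=0.95, temperature=0.7)[0]
# Slice off the prompt so only newly generated tokens are shown.
print(tokenizer.decode(generated[inputs.shape[-1]:]))
```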
## Base Model Information
### Model Architecture
| Params | Layers | Hidden size | Heads | Context length | Embedding parameters | Non-embedding parameters |
|---|---|---|---|---|---|---|
| 3.7b | 28 | 3072 | 24 | 4096 | 611,844,096 | 3,171,068,928 |
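As a sanity check, the table's split can be recomputed from a loaded checkpoint. Note that the table describes the original base model; this repository resizes the embedding matrix for the added special tokens, so the recomputed embedding count will be slightly larger. A minimal sketch, assuming the usual transformers accessors and counting an untied LM head as embedding parameters:

```python
import torch
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    "DeL-TaiseiOzaki/Tengentoppa-llm-jp-base-3.7B", torch_dtype=torch.bfloat16
)

total = sum(p.numel() for p in model.parameters())
in_emb = model.get_input_embeddings().weight
out_emb = model.get_output_embeddings().weight
embed = in_emb.numel()
if out_emb is not in_emb:  # untied output head: count it as embedding parameters
    embed += out_emb.numel()
print(f"total={total:,}  embedding={embed:,}  non-embedding={total - embed:,}")
```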
### Tokenizer
The tokenizer is based on the original llm-jp-3-3.7b tokenizer, which uses the huggingface/tokenizers Unigram byte-fallback model. The vocabulary is based on llm-jp-tokenizer v3.0, with the additional special tokens listed above added to it.
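Because the tokenizer uses byte fallback, strings outside the learned vocabulary decompose into byte-level pieces instead of an unknown token. A small sketch of observing this (the emoji input is just an arbitrary string likely to be out of vocabulary):

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("DeL-TaiseiOzaki/Tengentoppa-llm-jp-base-3.7B")

# Rare characters typically surface as <0xNN>-style byte pieces rather than <unk>.
print(tokenizer.tokenize("🦑🦐"))
```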
## License
This model inherits the license of the base model: Apache License, Version 2.0.
## Attribution
This model is based on llm-jp-3-3.7b. Please cite the original model and its creators when using this modified version.
## Modifications
The only modifications made to the original model are:
- Addition of special tokens for structured conversations
- Resizing of token embeddings to accommodate the new special tokens
All other aspects of the model, including its training data, architecture, and capabilities, remain the same as the original llm-jp-3-3.7b model.
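For reference, the two modifications above are typically applied with standard transformers calls; the following is a reconstruction under that assumption, not the exact script used to produce this repository:

```python
from transformers import AutoTokenizer, AutoModelForCausalLM

# Start from the unmodified base model and tokenizer.
tokenizer = AutoTokenizer.from_pretrained("llm-jp/llm-jp-3-3.7b")
model = AutoModelForCausalLM.from_pretrained("llm-jp/llm-jp-3-3.7b")

# 1) Register the structured-conversation delimiters as special tokens.
tokenizer.add_special_tokens({
    "additional_special_tokens": [
        "<|SYSTEM|>", "</|SYSTEM|>", "<|USER|>", "</|USER|>",
        "<|HINT|>", "</|HINT|>", "<|REASONING|>", "</|REASONING|>",
        "<|ASSISTANT|>", "</|ASSISTANT|>",
    ]
})

# 2) Resize the embedding matrix so the new token IDs have rows.
model.resize_token_embeddings(len(tokenizer))
```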