Tengentoppa-llm-jp-base-3.7B

This is a modified version of the llm-jp-3-3.7b model with additional special tokens for structured conversations. The base model was developed by the Research and Development Center for Large Language Models at the National Institute of Informatics.

Model Details

Base Model: llm-jp-3-3.7b
Model Type: Transformer-based Language Model
Parameters: 3.7B
Context Length: 4096 tokens
Language: Japanese and English

Additional Special Tokens

This model includes the following special tokens for structured conversations:

<|SYSTEM|>, </|SYSTEM|>   - System message delimiters
<|USER|>, </|USER|>       - User input delimiters
<|HINT|>, </|HINT|>       - Hint message delimiters
<|REASONING|>, </|REASONING|> - Reasoning section delimiters
<|ASSISTANT|>, </|ASSISTANT|> - Assistant response delimiters
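
As a minimal sketch of how these delimiters might be combined into a prompt, the helper below wraps each message in its opening and closing token. The names `wrap` and `build_prompt`, the newline separator, and the optional trailing open `<|ASSISTANT|>` tag are assumptions for illustration, not part of the released model; the only format confirmed here is a system message and a user message joined by a newline.

```python
# Hypothetical helpers: wrap messages in the special tokens listed above.
ROLES = ("SYSTEM", "USER", "HINT", "REASONING", "ASSISTANT")


def wrap(role: str, content: str) -> str:
    """Surround content with the opening and closing delimiter for a role."""
    if role not in ROLES:
        raise ValueError(f"unknown role: {role}")
    return f"<|{role}|>{content}</|{role}|>"


def build_prompt(messages, open_assistant: bool = False) -> str:
    """Join (role, content) pairs with newlines.

    open_assistant=True appends an unclosed <|ASSISTANT|> tag so that
    generation continues inside it (an assumed convention).
    """
    prompt = "\n".join(wrap(role, content) for role, content in messages)
    if open_assistant:
        prompt += "\n<|ASSISTANT|>"
    return prompt


prompt = build_prompt([
    ("SYSTEM", "You are a helpful assistant."),
    ("USER", "自然言語処理とは何か"),
])
print(prompt)
```

Keeping the delimiter logic in one place avoids typos such as a missing `/` in a closing tag, which would otherwise be tokenized as ordinary text.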

Required Libraries and Their Versions

torch>=2.3.0
transformers>=4.40.1
tokenizers>=0.19.1
accelerate>=0.29.3
flash-attn>=2.5.8
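
One way to check an environment against these minimums is sketched below using only the standard library. The helper names `version_tuple` and `check_requirements` are hypothetical, and the comparison looks only at the leading numeric components, so pre-release suffixes such as `rc1` are ignored.

```python
import re
from importlib import metadata

# Minimum versions copied from the list above.
MINIMUMS = {
    "torch": "2.3.0",
    "transformers": "4.40.1",
    "tokenizers": "0.19.1",
    "accelerate": "0.29.3",
    "flash-attn": "2.5.8",
}


def version_tuple(version: str):
    """Keep up to three leading numeric parts, e.g. '2.3.0+cu121' -> (2, 3, 0)."""
    match = re.match(r"(\d+)(?:\.(\d+))?(?:\.(\d+))?", version)
    if match is None:
        return (0, 0, 0)
    return tuple(int(part or 0) for part in match.groups())


def check_requirements(minimums=MINIMUMS):
    """Return a {package: status} report for the installed distributions."""
    report = {}
    for name, minimum in minimums.items():
        try:
            installed = metadata.version(name)
        except metadata.PackageNotFoundError:
            report[name] = "missing"
            continue
        if version_tuple(installed) >= version_tuple(minimum):
            report[name] = f"{installed} (ok)"
        else:
            report[name] = f"{installed} (needs >={minimum})"
    return report


for name, status in check_requirements().items():
    print(f"{name}: {status}")
```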

Usage

import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

# Load the model and tokenizer
tokenizer = AutoTokenizer.from_pretrained("DeL-TaiseiOzaki/Tengentoppa-llm-jp-base-3.7B")
model = AutoModelForCausalLM.from_pretrained(
    "DeL-TaiseiOzaki/Tengentoppa-llm-jp-base-3.7B", 
    device_map="auto", 
    torch_dtype=torch.bfloat16
)

# Example using special tokens
text = "<|SYSTEM|>You are a helpful assistant.</|SYSTEM|>\n<|USER|>自然言語処理とは何か</|USER|>"
tokenized_input = tokenizer.encode(text, add_special_tokens=False, return_tensors="pt").to(model.device)

with torch.no_grad():
    output = model.generate(
        tokenized_input,
        max_new_tokens=100,
        do_sample=True,
        top_p=0.95,
        temperature=0.7,
        repetition_penalty=1.05,
    )[0]

print(tokenizer.decode(output))
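
The decoded string contains the prompt followed by whatever the model generated. If the output happens to contain an `<|ASSISTANT|>` section, it could be pulled out with plain string handling, as in the sketch below. The function `extract_section` and the sample string are hypothetical, and nothing here guarantees that this base model actually emits these delimiters.

```python
def extract_section(decoded: str, role: str = "ASSISTANT"):
    """Return the text between <|ROLE|> and </|ROLE|>, or None if absent.

    If the closing tag is missing (for example, generation stopped at
    max_new_tokens), everything after the opening tag is returned.
    """
    open_tag = f"<|{role}|>"
    close_tag = f"</|{role}|>"
    start = decoded.find(open_tag)
    if start == -1:
        return None
    start += len(open_tag)
    end = decoded.find(close_tag, start)
    if end == -1:
        end = len(decoded)
    return decoded[start:end].strip()


# Made-up decoded output, used only to illustrate the parsing.
sample = (
    "<|USER|>What is NLP?</|USER|>\n"
    "<|ASSISTANT|>NLP studies how computers process language.</|ASSISTANT|>"
)
print(extract_section(sample))
```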

Base Model Information

Model Architecture

Params	Layers	Hidden size	Heads	Context length	Embedding parameters	Non-embedding parameters
3.7b	28	3072	24	4096	611,844,096	3,171,068,928
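
The figures in this table can be cross-checked with a little arithmetic. The sketch below derives the total parameter count, the per-head dimension, and the number of embedding rows implied by the embedding parameter count. Reading the row count as two equal matrices (separate input and output embeddings) is an assumption, not something stated above, and the table describes the base model; the counts after resizing for this modified version are not given.

```python
# Values copied from the architecture table.
hidden_size = 3072
heads = 24
embedding_params = 611_844_096
non_embedding_params = 3_171_068_928

# Total parameters, which the model name rounds to "3.7B".
total_params = embedding_params + non_embedding_params

# Dimension of each attention head.
head_dim = hidden_size // heads

# Embedding rows implied by the table (each row has hidden_size values).
embedding_rows = embedding_params // hidden_size

# Assumption: input and output embeddings are separate, equally sized matrices.
rows_per_matrix = embedding_rows // 2

print(total_params)     # 3782913024
print(head_dim)         # 128
print(embedding_rows)   # 199168
print(rows_per_matrix)  # 99584
```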

Tokenizer

The tokenizer is based on the original llm-jp-3-3.7b tokenizer, which uses the Unigram byte-fallback model from huggingface/tokenizers. The vocabulary is based on llm-jp-tokenizer v3.0, extended with the additional special tokens listed above.
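
To confirm that the additional tokens are present in a loaded tokenizer, one option is to look each of them up and compare the result against the unknown-token ID. The helper `missing_special_tokens` below is a hypothetical sketch; it relies only on `convert_tokens_to_ids` and `unk_token_id`, which Hugging Face tokenizers provide.

```python
ROLES = ("SYSTEM", "USER", "HINT", "REASONING", "ASSISTANT")

# Opening and closing delimiter for every role, e.g. "<|USER|>" and "</|USER|>".
SPECIAL_TOKENS = [f"{prefix}{role}|>" for role in ROLES for prefix in ("<|", "</|")]


def missing_special_tokens(tokenizer, tokens=SPECIAL_TOKENS):
    """Return the tokens that do not map to their own vocabulary entry."""
    missing = []
    for token in tokens:
        token_id = tokenizer.convert_tokens_to_ids(token)
        if token_id is None or token_id == tokenizer.unk_token_id:
            missing.append(token)
    return missing


# Usage (requires transformers and network access, so left commented out):
# from transformers import AutoTokenizer
# tokenizer = AutoTokenizer.from_pretrained("DeL-TaiseiOzaki/Tengentoppa-llm-jp-base-3.7B")
# print(missing_special_tokens(tokenizer))
```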

License

This model inherits the license from the base model: Apache License, Version 2.0

Attribution

This model is based on llm-jp-3-3.7b. Please cite the original model and its creators when using this modified version.

Modifications

The only modifications made to the original model are:

Addition of special tokens for structured conversations
Resizing of token embeddings to accommodate the new special tokens

All other aspects of the model, including its training data, architecture, and capabilities, remain the same as the original llm-jp-3-3.7b model.
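
For readers who want to apply the same kind of modification to another checkpoint, the two steps can be sketched with the standard Transformers calls `add_special_tokens` and `resize_token_embeddings`. This is an illustrative sketch, not the script used to produce this model; the function name `add_conversation_tokens` is hypothetical.

```python
ROLES = ("SYSTEM", "USER", "HINT", "REASONING", "ASSISTANT")
CONVERSATION_TOKENS = [f"{prefix}{role}|>" for role in ROLES for prefix in ("<|", "</|")]


def add_conversation_tokens(tokenizer, model, tokens=CONVERSATION_TOKENS):
    """Register tokens as special tokens and resize the model's embeddings.

    Returns the number of tokens that were actually new to the vocabulary.
    """
    # Step 1: add the delimiters to the tokenizer vocabulary.
    num_added = tokenizer.add_special_tokens(
        {"additional_special_tokens": list(tokens)}
    )
    # Step 2: resize the embedding matrix to match the tokenizer.
    # Rows created here are newly initialized by the library.
    if num_added > 0:
        model.resize_token_embeddings(len(tokenizer))
    return num_added


# Usage (requires transformers and the base checkpoint, so left commented out):
# from transformers import AutoTokenizer, AutoModelForCausalLM
# tokenizer = AutoTokenizer.from_pretrained("llm-jp/llm-jp-3-3.7b")
# model = AutoModelForCausalLM.from_pretrained("llm-jp/llm-jp-3-3.7b")
# add_conversation_tokens(tokenizer, model)
```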
