Tengentoppa-llm-jp-base-3.7B

This is a modified version of the llm-jp-3-3.7b model with additional special tokens for structured conversations. The base model was developed by the Research and Development Center for Large Language Models at the National Institute of Informatics.


Model Details

  • Base Model: llm-jp-3-3.7b
  • Model Type: Transformer-based Language Model
  • Parameters: 3.7B
  • Context Length: 4096
  • Languages: Japanese and English

Additional Special Tokens

This model includes the following special tokens for structured conversations:

<|SYSTEM|>, </|SYSTEM|>         - System message delimiters
<|USER|>, </|USER|>             - User input delimiters
<|HINT|>, </|HINT|>             - Hint message delimiters
<|REASONING|>, </|REASONING|>   - Reasoning section delimiters
<|ASSISTANT|>, </|ASSISTANT|>   - Assistant response delimiters
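
As a minimal sketch of how these delimiters might be combined into a prompt, the helper below wraps each segment in its opening and closing token. The names `DELIMITERS` and `build_prompt` are hypothetical and not part of this model's API. Joining segments with a newline follows the usage example in this card; the card does not prescribe an ordering or a separator.

```python
# Delimiter pairs as listed in this card. The lowercase role keys are
# an assumption of this sketch, not names defined by the model.
DELIMITERS = {
    "system": ("<|SYSTEM|>", "</|SYSTEM|>"),
    "user": ("<|USER|>", "</|USER|>"),
    "hint": ("<|HINT|>", "</|HINT|>"),
    "reasoning": ("<|REASONING|>", "</|REASONING|>"),
    "assistant": ("<|ASSISTANT|>", "</|ASSISTANT|>"),
}


def build_prompt(segments, open_role=None):
    """Wrap (role, text) pairs in their delimiters and join them with newlines.

    If open_role is given, its opening delimiter is appended without a
    closing one, so that generation can continue inside that section.
    """
    parts = []
    for role, text in segments:
        opening, closing = DELIMITERS[role]
        parts.append(f"{opening}{text}{closing}")
    if open_role is not None:
        parts.append(DELIMITERS[open_role][0])
    return "\n".join(parts)


prompt = build_prompt(
    [("system", "You are a helpful assistant."), ("user", "自然言語処理とは何か")]
)
print(prompt)
```

Passing `open_role="assistant"` leaves an assistant section open at the end of the prompt. Whether the model continues such a section as intended is not stated in this card.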

Required Libraries and Their Versions

  • torch>=2.3.0
  • transformers>=4.40.1
  • tokenizers>=0.19.1
  • accelerate>=0.29.3
  • flash-attn>=2.5.8
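
One way to check an environment against these minimums is sketched below using only the Python standard library. The helper names `parse_version` and `check_requirements` are hypothetical, and the version parsing is deliberately simple: it compares only the first three numeric components and ignores local or pre-release suffixes.

```python
import re
from importlib import metadata

# Minimum versions copied from the list above.
MINIMUM_VERSIONS = {
    "torch": "2.3.0",
    "transformers": "4.40.1",
    "tokenizers": "0.19.1",
    "accelerate": "0.29.3",
    "flash-attn": "2.5.8",
}


def parse_version(version):
    """Turn a string such as '2.3.0+cu121' into (2, 3, 0)."""
    numbers = []
    for piece in version.split("+")[0].split(".")[:3]:
        match = re.match(r"\d+", piece)
        numbers.append(int(match.group()) if match else 0)
    while len(numbers) < 3:
        numbers.append(0)  # pad short versions such as '4.41'
    return tuple(numbers)


def check_requirements(minimums=MINIMUM_VERSIONS):
    """Return a dict mapping each package to 'ok', 'missing', or a 'too old' note."""
    report = {}
    for name, minimum in minimums.items():
        try:
            installed = metadata.version(name)
        except metadata.PackageNotFoundError:
            report[name] = "missing"
            continue
        if parse_version(installed) >= parse_version(minimum):
            report[name] = "ok"
        else:
            report[name] = f"too old ({installed} < {minimum})"
    return report


for package, status in check_requirements().items():
    print(f"{package}: {status}")
```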

Usage

import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

# Load the model and tokenizer
tokenizer = AutoTokenizer.from_pretrained("DeL-TaiseiOzaki/Tengentoppa-llm-jp-base-3.7B")
model = AutoModelForCausalLM.from_pretrained(
    "DeL-TaiseiOzaki/Tengentoppa-llm-jp-base-3.7B", 
    device_map="auto", 
    torch_dtype=torch.bfloat16
)

# Example using special tokens
text = "<|SYSTEM|>You are a helpful assistant.</|SYSTEM|>\n<|USER|>自然言語処理とは何か</|USER|>"
tokenized_input = tokenizer.encode(text, add_special_tokens=False, return_tensors="pt").to(model.device)

with torch.no_grad():
    output = model.generate(
        tokenized_input,
        max_new_tokens=100,
        do_sample=True,
        top_p=0.95,
        temperature=0.7,
        repetition_penalty=1.05,
    )[0]

print(tokenizer.decode(output))
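
If the decoded text contains the delimiters, a section can be pulled out with a small helper like the one below. This is a sketch: `extract_section` is a hypothetical name, and this card does not state whether the model reliably emits these delimiters without further fine-tuning, so the function returns `None` when the opening token is absent and tolerates a missing closing token.

```python
import re


def extract_section(decoded, tag="ASSISTANT"):
    """Return the text of the first <|TAG|>...</|TAG|> section, or None.

    If the closing delimiter is missing (for example because generation
    stopped at max_new_tokens), everything after the opening delimiter
    is returned.
    """
    opening = re.escape(f"<|{tag}|>")
    closing = re.escape(f"</|{tag}|>")
    match = re.search(f"{opening}(.*?)(?:{closing}|$)", decoded, flags=re.DOTALL)
    return match.group(1).strip() if match else None


sample = "<|USER|>hello</|USER|>\n<|ASSISTANT|>Hi there.</|ASSISTANT|>"
print(extract_section(sample))          # Hi there.
print(extract_section(sample, "USER"))  # hello
```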

Base Model Information

Model Architecture

Params  Layers  Hidden size  Heads  Context length  Embedding parameters  Non-embedding parameters
3.7b    28      3072         24     4096            611,844,096           3,171,068,928
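
The figures in this table can be cross-checked with a few lines of arithmetic. The sketch below uses only the numbers given for the base model. The reading of "embedding rows" in the final comment is an assumption, since the card does not say how the embedding parameters are split.

```python
# Values copied from the architecture table of the base model.
embedding_params = 611_844_096
non_embedding_params = 3_171_068_928
hidden_size = 3072
heads = 24

# Total parameter count: embedding plus non-embedding parameters.
total_params = embedding_params + non_embedding_params
print(f"{total_params:,}")  # 3,782,913,024 (about 3.78B)

# Per-head dimension, assuming the hidden size is split evenly across heads.
head_dim = hidden_size // heads
print(head_dim)  # 128

# Number of hidden_size-wide rows that the embedding parameters amount to.
# Assumption: if input and output embeddings are stored separately, each
# would hold half of these rows.
embedding_rows = embedding_params // hidden_size
print(f"{embedding_rows:,}")  # 199,168
```

These values describe the base model. The resized embeddings of this modified version may change the embedding count slightly.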

Tokenizer

The tokenizer is based on the original llm-jp-3-3.7b tokenizer, which uses the Unigram byte-fallback model from huggingface/tokenizers. The vocabulary is that of llm-jp-tokenizer v3.0, extended with the additional special tokens listed above.

License

This model inherits the license from the base model: Apache License, Version 2.0

Attribution

This model is based on llm-jp-3-3.7b. Please cite the original model and its creators when using this modified version.

Modifications

The only modifications made to the original model are:

  1. Addition of special tokens for structured conversations
  2. Resizing of token embeddings to accommodate the new special tokens

All other aspects of the model, including its training data, architecture, and capabilities, remain the same as the original llm-jp-3-3.7b model.
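
The two modifications can be sketched with standard transformers calls as shown below. This is an illustration under assumptions, not necessarily the procedure used to produce this model: the function name `add_conversation_tokens` is hypothetical, and the base repository ID `llm-jp/llm-jp-3-3.7b` is assumed rather than taken from this card.

```python
# The ten special tokens listed in this card.
SPECIAL_TOKENS = [
    "<|SYSTEM|>", "</|SYSTEM|>",
    "<|USER|>", "</|USER|>",
    "<|HINT|>", "</|HINT|>",
    "<|REASONING|>", "</|REASONING|>",
    "<|ASSISTANT|>", "</|ASSISTANT|>",
]


def add_conversation_tokens(base_model_id="llm-jp/llm-jp-3-3.7b"):
    """Add the special tokens to a base tokenizer and resize the embeddings.

    base_model_id is an assumed repository ID. Calling this function
    downloads the base model and tokenizer.
    """
    # Imported here so the token list can be reused without transformers installed.
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(base_model_id)
    model = AutoModelForCausalLM.from_pretrained(base_model_id)

    # Step 1: register the new tokens; returns how many were actually added.
    num_added = tokenizer.add_special_tokens(
        {"additional_special_tokens": SPECIAL_TOKENS}
    )

    # Step 2: resize the embedding matrices to match the tokenizer length.
    model.resize_token_embeddings(len(tokenizer))
    return tokenizer, model, num_added
```

Embedding rows created by a resize like this start out newly initialized, so they carry no learned meaning until the model is trained further.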
