NovaAI-0.1

Chinese QA GPT Model

Model Description

This is a Chinese-language, GPT-like Transformer model trained on question-answer pairs. It is designed to generate helpful, conversational responses to user questions in Chinese. It uses a decoder-only architecture with causal self-attention, so each token is predicted only from the tokens that precede it.
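
As an illustration of what causal self-attention means, here is a minimal sketch in plain Python. It is not the repository's implementation. It turns a matrix of raw attention scores into attention weights in which position i can only attend to positions 0..i. The function name `causal_attention_weights` is hypothetical.

```python
import math

def causal_attention_weights(scores):
    """Row-wise softmax over scores[i][j], with future positions (j > i) masked out."""
    weights = []
    for i, row in enumerate(scores):
        # Only positions 0..i are visible to query position i.
        visible = row[: i + 1]
        m = max(visible)  # subtract the max for numerical stability
        exps = [math.exp(s - m) for s in visible]
        total = sum(exps)
        # Future positions receive exactly zero weight.
        weights.append([e / total for e in exps] + [0.0] * (len(row) - i - 1))
    return weights

demo = causal_attention_weights([
    [0.0, 1.0, 2.0],
    [0.0, 0.0, 5.0],
    [1.0, 1.0, 1.0],
])
for row in demo:
    print(row)
```

The first row puts all of its weight on position 0, because the first token has nothing earlier to attend to. A real implementation would typically apply the same mask to batched tensors.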

Model Details

Architecture: Decoder-only Transformer (GPT-like)
Parameters: ~124M (configurable)
Vocabulary Size: 32,000 (SentencePiece BPE)
Context Length: 1,024 tokens
Language: Chinese (Simplified)
Task: Question Answering / Conversational AI

Model Architecture

The model consists of:

12 Transformer layers with causal self-attention
12 attention heads per layer
768-dimensional embeddings
SentencePiece tokenizer with BPE encoding for Chinese text
GELU activation functions
Layer normalization and residual connections
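
The parameter count can be sanity-checked from the figures above. The sketch below is a rough estimate that assumes a GPT-2-style layout: a 4x-wide feed-forward block, learned positional embeddings, bias terms on every linear layer, and output weights tied to the token embeddings. None of these details are confirmed by this README, and `estimate_gpt_params` is a hypothetical helper.

```python
def estimate_gpt_params(vocab_size, n_layer, n_embd, block_size, tied_embeddings=True):
    """Rough parameter count for a GPT-2-style decoder (assumed layout, see lead-in)."""
    d = n_embd
    attn = (d * 3 * d + 3 * d) + (d * d + d)       # QKV projection + output projection
    mlp = (d * 4 * d + 4 * d) + (4 * d * d + d)    # assumed 4x-wide feed-forward block
    layer_norms = 2 * 2 * d                        # two LayerNorms per block (weight + bias)
    per_layer = attn + mlp + layer_norms
    embeddings = vocab_size * d + block_size * d   # token + learned positional embeddings
    final_ln = 2 * d
    total = n_layer * per_layer + embeddings + final_ln
    if not tied_embeddings:
        total += vocab_size * d                    # separate output projection matrix
    return total

# Configuration listed in this README
print(estimate_gpt_params(32000, 12, 768, 1024))
# Same shape with GPT-2's 50,257-token vocabulary, for comparison
print(estimate_gpt_params(50257, 12, 768, 1024))
```

Under these assumptions the listed configuration comes to roughly 110M parameters with tied embeddings, or about 135M untied. The same shape with a 50,257-token vocabulary gives about 124M, so the ~124M figure may reflect a different vocabulary size or counting convention.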

Training Data

The model was trained on a dataset of Chinese question-answer pairs covering topics such as:

Gaming and entertainment
Technology and gadgets
Health and lifestyle
Travel and local recommendations
Relationships and social advice
General knowledge questions

Training Configuration

Training Method: Causal language modeling (next-token prediction)
Batch Size: 4
Learning Rate: 3e-4 (AdamW optimizer)
Epochs: 3
Dropout: 0.1
Gradient Clipping: 1.0
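
Two of the settings above can be illustrated in plain Python. The sketch below is not taken from train.py. It shows how next-token prediction derives targets from a token sequence, and how gradient clipping rescales gradients, assuming the value 1.0 is a maximum global L2 norm. Both function names are hypothetical.

```python
import math

def next_token_pairs(ids):
    """Inputs are ids[:-1]; targets are the same sequence shifted left by one."""
    return ids[:-1], ids[1:]

def clip_by_global_norm(grads, max_norm=1.0):
    """Scale all gradients so that their combined L2 norm is at most max_norm."""
    total_norm = math.sqrt(sum(g * g for g in grads))
    if total_norm <= max_norm:
        return list(grads)  # already within the limit, leave unchanged
    scale = max_norm / total_norm
    return [g * scale for g in grads]

inputs, targets = next_token_pairs([1, 15, 27, 8, 2])
print(inputs, targets)
print(clip_by_global_norm([3.0, 4.0], max_norm=1.0))
```

In PyTorch the same idea is available as `torch.nn.utils.clip_grad_norm_`, and the optimizer named above as `torch.optim.AdamW`. Whether train.py calls them in exactly this way is not shown here.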

Usage

Installation

pip install torch sentencepiece tqdm

Training

python train.py --data_path all.jsonl --spm_model spm.model
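
The layout of all.jsonl is not documented here. As a minimal sketch, assume each line is a JSON object with `question` and `answer` fields. A record could then be turned into one training string using the `<s>` and `<sep>` markers that appear in the Python API example. The field names, the closing `</s>` marker, and both helper functions are assumptions for illustration.

```python
import json

def format_example(question, answer, bos="<s>", sep="<sep>", eos="</s>"):
    """Join one QA pair into a single training string (eos marker is an assumption)."""
    return f"{bos}{question}{sep}{answer}{eos}"

def load_examples(lines, q_key="question", a_key="answer"):
    """Parse JSONL lines into formatted strings; key names are hypothetical."""
    examples = []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        record = json.loads(line)
        examples.append(format_example(record[q_key], record[a_key]))
    return examples

sample_lines = ['{"question": "你好吗？", "answer": "我很好，谢谢。"}', ""]
print(load_examples(sample_lines))
```

To read from disk, pass an open file handle: `load_examples(open("all.jsonl", encoding="utf-8"))`.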

Inference

python infer.py --checkpoint checkpoints/checkpoint_epoch3.pt --spm_model spm.model --prompt "你的问题"

Python API

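import torch
import sentencepiece as spm
from train import GPT, GPTConfig  # model classes defined in this repo's train.py

# Load the SentencePiece tokenizer
sp = spm.SentencePieceProcessor()
sp.Load('spm.model')

# Load the checkpoint on CPU so this also works on machines without a GPU
checkpoint = torch.load('checkpoints/checkpoint_epoch3.pt', map_location='cpu')
config = GPTConfig(
    vocab_size=32000,
    n_layer=12,
    n_head=12,
    n_embd=768,
    block_size=1024,
)
model = GPT(config)
model.load_state_dict(checkpoint['model_state'])
model.eval()  # disable dropout for inference

# Generate response
prompt = "你好，请介绍一下你自己"
# Prompt format: <s> + question + <sep>
ids = sp.EncodeAsIds('<s>' + prompt + '<sep>')
# ... generation logic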
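
The generation step is left open above. One common approach is an autoregressive sampling loop with temperature and top-k filtering, sketched below in plain Python. Because the forward signature of `GPT` is not shown in this README, the sketch takes a stand-in callable `next_logits(ids)` that returns one score per vocabulary entry. With the real model, that callable would wrap a forward pass and return the logits at the last position. All names here are hypothetical.

```python
import math
import random

def sample_next(logits, temperature=1.0, top_k=None, rng=random):
    """Sample a token id from raw logits with temperature and optional top-k filtering."""
    ranked = sorted(enumerate(logits), key=lambda p: p[1], reverse=True)
    if top_k is not None:
        ranked = ranked[:top_k]  # keep only the k highest-scoring tokens
    scaled = [score / temperature for _, score in ranked]
    m = max(scaled)  # subtract the max for numerical stability
    weights = [math.exp(s - m) for s in scaled]
    return rng.choices([tok for tok, _ in ranked], weights=weights, k=1)[0]

def generate(next_logits, ids, max_new_tokens, eos_id=None, block_size=1024, **kwargs):
    """Autoregressive loop; next_logits(ids) must return logits for the next token."""
    ids = list(ids)
    for _ in range(max_new_tokens):
        context = ids[-block_size:]  # never feed more than the context length
        token = sample_next(next_logits(context), **kwargs)
        ids.append(token)
        if eos_id is not None and token == eos_id:
            break
    return ids

# Stand-in "model" for demonstration: strongly prefers (last token + 1) mod 5.
def toy_logits(context):
    logits = [0.0] * 5
    logits[(context[-1] + 1) % 5] = 10.0
    return logits

print(generate(toy_logits, [0], max_new_tokens=4, top_k=1))
```

With `top_k=1` the loop is greedy and deterministic. Larger `top_k` values and temperatures above 1.0 give more varied output.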

Model Performance

The model is reported to perform well on the following, though no quantitative benchmarks are provided:

Chinese language understanding
Contextual question answering
Conversational response generation
Maintaining coherence over multi-turn conversations

Limitations

Language: Only supports Chinese (Simplified)
Context Window: Limited to 1,024 tokens
Knowledge Cutoff: Has no knowledge of events after the training data was collected
Factual Accuracy: May occasionally produce inaccurate information
Bias: May reflect biases present in training data

Ethical Considerations

This model is designed for educational and research purposes. Users should be aware that:

The model may generate responses that seem authoritative but could be factually incorrect
The model's training data may contain biases
Generated content should be fact-checked before use in critical applications
