BitTransformerLM
Model Details
Model Type: Experimental Bit-Native Transformer Language Model
Architecture: Transformer with reversible layers and bit-level processing
Developer: WCNEGENTROPY HOLDINGS LLC
Release Date: August 2025
Version: v0.1.0 (Pre-release Experimental)
License: AGPLv3 (see LICENSE/ directory)
Contact: [email protected]
Model Description
BitTransformerLM is an experimental language model that processes text at the bit level rather than using traditional token-based approaches. The architecture explores potential memory-efficiency gains through reversible transformer layers and provides built-in safety monitoring via real-time telemetry.
⚠️ Important: This is experimental research software requiring rigorous validation against established baselines before any production use.
Architecture Details
- Input Processing: Direct binary sequence processing (0/1 bits) with parity protection
- Attention Mechanism: Multi-head self-attention on bit embeddings
- Layer Design: Reversible transformer blocks targeting roughly 50% activation-memory savings (see the sketch after this list)
- Safety Features: Built-in K/C/S (Negentropy/Complexity/Symbiosis) telemetry
- Training Modes: Causal autoregressive and experimental diffusion mode
- Sequence Length: Configurable (16-2048 tested)
- Parameters: Scalable architecture (tested from 793K to 771M parameters)
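The reversible-layer idea referenced above can be illustrated with a generic additive coupling block (not the project's exact layer): the input is split into two halves, and because each update is exactly invertible, intermediate activations can be recomputed during the backward pass instead of being stored.

import torch
import torch.nn as nn

class ReversibleCouplingBlock(nn.Module):
    """Generic additive coupling: y1 = x1 + F(x2), y2 = x2 + G(y1).
    Inputs are exactly recoverable from outputs, so activations can be
    recomputed in the backward pass instead of being cached."""

    def __init__(self, f: nn.Module, g: nn.Module):
        super().__init__()
        self.f, self.g = f, g

    def forward(self, x1, x2):
        y1 = x1 + self.f(x2)
        y2 = x2 + self.g(y1)
        return y1, y2

    def inverse(self, y1, y2):
        x2 = y2 - self.g(y1)
        x1 = y1 - self.f(x2)
        return x1, x2

# Round-trip check with simple sub-networks standing in for attention/FFN.
f = nn.Sequential(nn.Linear(64, 64), nn.GELU())
g = nn.Sequential(nn.Linear(64, 64), nn.GELU())
block = ReversibleCouplingBlock(f, g)
x1, x2 = torch.randn(2, 16, 64), torch.randn(2, 16, 64)
y1, y2 = block(x1, x2)
r1, r2 = block.inverse(y1, y2)
assert torch.allclose(r1, x1, atol=1e-5) and torch.allclose(r2, x2, atol=1e-5)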
Key Innovations
- Bit-Native Processing: Operates directly on binary sequences using a 9-bit encoding (8 data bits + 1 parity bit; see the sketch after this list)
- Reversible Layers: Memory-efficient computation through mathematically reversible operations
- Safety Telemetry: Real-time monitoring via K/C/S metrics with configurable thresholds
- Progressive Scaling: Automatic model expansion based on validation performance
- Dual Training Modes: Both causal and diffusion-based training supported
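As a rough illustration of the 9-bit scheme above (8 data bits plus one parity bit per byte), the hypothetical helpers below encode UTF-8 bytes with an even-parity bit and verify it on decode; the library's own text_to_bits/bits_to_text may differ in detail.

def encode_with_parity(text: str) -> list[int]:
    """Hypothetical 9-bit encoding: 8 data bits per byte plus one even-parity bit."""
    bits = []
    for byte in text.encode("utf-8"):
        data = [(byte >> i) & 1 for i in range(7, -1, -1)]  # MSB first
        parity = sum(data) % 2                               # even parity
        bits.extend(data + [parity])
    return bits

def decode_with_parity(bits: list[int]) -> str:
    """Inverse of the sketch above; raises if any parity check fails."""
    out = bytearray()
    for i in range(0, len(bits), 9):
        chunk = bits[i:i + 9]
        data, parity = chunk[:8], chunk[8]
        if sum(data) % 2 != parity:
            raise ValueError(f"Parity error in chunk starting at bit {i}")
        out.append(int("".join(map(str, data)), 2))
    return out.decode("utf-8")

assert decode_with_parity(encode_with_parity("Hello, world!")) == "Hello, world!"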
Training Data and Methodology
Experimental Configurations Tested
Small-scale Validation (793K parameters):
- Dataset: 4 samples, 16 sequence length
- Training time: 0.21 seconds
- Final loss: 0.629 (converged on toy data)
- Hardware: CPU-based training
Medium-scale Validation (771M parameters):
- Dataset: 5 text samples with zero-padding
- Training time: 11.47 seconds
- Loss progression: 11.84 → 5.35
- Hardware: Single NVIDIA L4 GPU (15.28 GB peak memory)
Known Limitations
⚠️ Critical Research Gaps:
- Limited Training Data: Experiments used minimal datasets insufficient for language modeling evaluation
- No Baseline Comparisons: Missing comparative evaluation against standard transformers
- Short Training Duration: Training periods too short to establish genuine convergence
- Scale Claims: Some documentation overstated capabilities; the largest validated model is 771M parameters
Performance and Evaluation
Empirical Results
Telemetry Metrics (771M model):
- K (Negentropy): 0.0013 (information content vs random noise)
- C (LZ Complexity): 0.52 (pattern compressibility proxy)
- S (Symbiosis): 0.46 (alignment with reference distributions)
Training Performance:
- Peak memory usage: 15.28 GB (single GPU)
- Inference success: 100% on test prompts
- Convergence: Achieved on toy datasets only
Model Capabilities
✅ Validated Features:
- Bit-level text processing with parity protection
- Reversible transformer layer functionality
- Real-time safety telemetry computation
- Memory-efficient training (gradient checkpointing + reversible layers)
- Multi-GPU distributed training support (FSDP tested)
⚠️ Requires Validation:
- Language modeling capability on standard benchmarks
- Memory efficiency claims vs baseline transformers
- Scaling behavior compared to conventional architectures
- Safety telemetry effectiveness across diverse scenarios
Intended Use
✅ Research Applications
- Academic Research: Novel architecture exploration and bit-level modeling studies
- AI Safety Research: Telemetry system development and safety monitoring research
- Memory Efficiency Studies: Reversible architecture investigation and optimization
- Educational Use: Learning about transformer internals and experimental architectures
⚠️ Production Applications
Not Recommended without extensive validation:
- Missing critical baseline comparisons vs standard transformers
- Insufficient evaluation on established language modeling benchmarks
- No statistical significance testing across multiple runs
- Training conducted only on toy datasets
How to Use
Installation
# Clone repository
git clone https://huggingface.co/WCNegentropy/BitTransformerLM
cd BitTransformerLM
# Install dependencies
pip install -r requirements.txt
# Basic usage test
python example.py
Basic Usage
from bit_transformer import BitTransformerLM, text_to_bits, bits_to_text
import torch
# Create model
model = BitTransformerLM(
    d_model=128,
    nhead=4,
    num_layers=2,
    dim_feedforward=256,
    max_seq_len=256,
    reversible=True,      # Enable memory-efficient reversible layers
    use_checkpoint=True,  # Enable gradient checkpointing
)
# Process text
text = "Hello, world!"
bits = text_to_bits(text)
bit_tensor = torch.tensor(bits).unsqueeze(0)
# Forward pass with telemetry
logits, telemetry = model(bit_tensor)
print(f"Input: {text}")
print(f"Bit representation: {bits[:18]}...") # First 18 bits
print(f"Output shape: {logits.shape}")
print(f"K (Negentropy): {telemetry.get('negentropy_logits', 'N/A')}")
print(f"C (Complexity): {telemetry.get('lz_complexity_logits', 'N/A')}")
print(f"S (Symbiosis): {telemetry.get('symbiosis_score', 'N/A')}")
Safe Inference
from bit_transformer import hil_safe_inference
# Safe inference with telemetry monitoring
try:
    output_bits, telemetry = hil_safe_inference(
        model,
        bit_tensor,
        c_floor=0.3,  # Minimum complexity threshold
        s_floor=0.5,  # Minimum symbiosis threshold
        strict=True,  # Enforce safety thresholds
    )
    print("✅ Safe inference completed")
except Exception as e:
    print(f"⚠️ Safety check failed: {e}")
Training
from bit_transformer import train_loop
# Basic training
train_loop(
    model,
    training_data,
    epochs=5,
    batch_size=4,
    amp=True,            # Mixed precision
    compile_model=True,  # torch.compile optimization
    diffusion=False,     # Standard causal training
    log=True,            # Enable logging
)
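The training_data variable above is assumed to be a batch of fixed-length bit sequences. A minimal way to build such a toy batch from text, assuming train_loop accepts a (batch, seq_len) tensor of 0/1 values (check the train_loop signature before relying on this), is:

import torch
from bit_transformer import text_to_bits

texts = ["Hello, world!", "BitTransformerLM processes bits.", "Toy data only."]
max_len = 256  # should match the model's max_seq_len

sequences = []
for t in texts:
    # text_to_bits is assumed to return a flat list of 0/1 ints,
    # as suggested by the usage example above.
    bits = text_to_bits(t)[:max_len]
    bits = bits + [0] * (max_len - len(bits))  # zero-pad to a fixed length
    sequences.append(bits)

training_data = torch.tensor(sequences, dtype=torch.long)  # shape: (3, 256)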
Ethical Considerations and Risks
Potential Benefits
- Enhanced Interpretability: Bit-level processing provides fine-grained control
- Built-in Safety Monitoring: Real-time telemetry and gating mechanisms
- Memory Efficiency Research: Exploration of reversible architectures
- Open Research: Contributing to transparent AI safety research
Potential Risks
- Overstated Capabilities: Some early documentation contained inflated claims (now corrected)
- Incomplete Evaluation: Missing critical baseline comparisons and standard benchmarks
- Research Maturity: Experimental status requires careful interpretation of results
- False Security: Safety metrics need validation across diverse failure modes
Recommendations
- Research Use Only: Conduct rigorous baseline comparisons before any production consideration
- Statistical Validation: Perform multiple runs with proper significance testing
- Honest Reporting: Document limitations and negative results alongside positive findings
- Community Validation: Encourage independent evaluation and replication studies
Technical Specifications
Architecture Parameters
- Bit Embedding Size: Configurable (16-1792 tested)
- Attention Heads: Configurable (2-28 tested)
- Layers: Configurable (1-20 tested)
- Max Sequence Length: Configurable (16-2048 tested)
- Feedforward Dimension: Configurable (64-4096 tested)
System Requirements
- Minimum: Python 3.10+, PyTorch 2.7.1, 8GB RAM
- Recommended: 16GB+ RAM, CUDA-capable GPU for larger models
- For 771M model: 16GB+ GPU memory recommended
Training Features
- Distributed Training: FSDP support (tested up to 771M parameters)
- Mixed Precision: FP16/BF16 with CPU autocast
- Quantization: Dynamic INT8 + experimental 4-bit QAT (INT8 path sketched after this list)
- Memory Optimization: Reversible layers + gradient checkpointing
- Safety Monitoring: Real-time K/C/S telemetry with configurable gates
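As an illustration of the dynamic INT8 path, standard PyTorch dynamic quantization can be applied to a trained model's linear layers. Whether bit_transformer ships its own quantization wrapper is not shown here, so this sketch uses only the stock PyTorch API.

import torch
from torch.ao.quantization import quantize_dynamic

# Quantize the trained model's Linear layers to INT8 weights;
# activations are quantized dynamically at inference time.
quantized_model = quantize_dynamic(
    model,              # a trained BitTransformerLM instance
    {torch.nn.Linear},  # module types to quantize
    dtype=torch.qint8,
)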
Inference Modes
- Causal Generation: Standard autoregressive text generation
- Diffusion Mode: Bidirectional denoising with multiple noise schedules
- Safe Inference: Human-in-the-loop with safety gate monitoring
- Long Context: Sliding window processing for sequences beyond max_seq_len
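The sliding-window idea in the last item can be sketched generically: split a long bit sequence into overlapping windows no longer than max_seq_len and run the model on each window in turn. This is a hand-rolled illustration, not the library's long-context API.

import torch

def sliding_windows(bits: torch.Tensor, window: int, stride: int):
    """Yield overlapping (1, window) slices of a long (1, n) bit tensor."""
    n = bits.size(1)
    for start in range(0, max(n - window, 0) + 1, stride):
        yield bits[:, start:start + window]

# Hypothetical usage: process a long sequence chunk by chunk.
# long_bits = torch.tensor(text_to_bits(long_text)).unsqueeze(0)
# for chunk in sliding_windows(long_bits, window=256, stride=128):
#     logits, telemetry = model(chunk)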
Limitations and Biases
Technical Limitations
- Experimental Status: Requires extensive validation before practical use
- Limited Training Data: Evaluated only on toy datasets
- No Baseline Comparisons: Missing systematic evaluation vs standard transformers
- Memory Claims Unvalidated: Theoretical benefits need empirical measurement
- Safety Metrics Unproven: K/C/S telemetry effectiveness requires validation
Potential Biases
- Training Data: Limited to small English text samples
- Architecture Bias: Novel approach may have unknown failure modes
- Evaluation Bias: Lack of diverse evaluation datasets
- Research Bias: Focus on positive results without comprehensive negative case analysis
Environmental Impact
Current experimental training has minimal environmental impact due to small scale and short duration. However, larger-scale validation studies will require consideration of:
- Energy Usage: Distributed training energy consumption
- Hardware Requirements: GPU resource utilization for larger models
- Training Efficiency: Comparison of energy costs vs standard approaches
Citation
If you use BitTransformerLM in your research, please cite:
@software{bittransformerlm2025,
  title   = {BitTransformerLM: Experimental Bit-Native Transformer Language Model},
  author  = {WCNegentropy Research},
  year    = {2025},
  version = {0.1.0},
  url     = {https://huggingface.co/WCNegentropy/BitTransformerLM},
  license = {AGPL-3.0},
  note    = {Experimental research implementation requiring validation}
}
Additional Resources
- Project Documentation: See ABOUTME.md for project overview
- User Guide: Comprehensive handbook (USER_GUIDE.md)
- Claude Code Integration: AI-assisted development guide (CLAUDE.md)
- Research Status: Current validation status (RESEARCH_STATUS.md)
- Empirical Analysis: Evidence-based claims assessment (EMPIRICAL_VALIDATION.md)
License and Usage
Primary License: AGPLv3 (see LICENSE/LICENSE.txt)
Commercial Licensing: Contact [email protected]
Support
- Issues: GitHub Issues for bug reports
- Research Questions: GitHub Discussions
- Commercial Inquiries: [email protected]
- AI-Assisted Development: Use with Claude Code (recommended)
Disclaimer: This is experimental research software. Claims in some historical documentation may be overstated. Users should conduct independent evaluation and validation before any production use. The model requires rigorous baseline comparisons and statistical validation to establish its capabilities relative to standard approaches.
Research responsibly. Validate rigorously. Share openly. 🧪✨