A.X-3.1 and A.X-4.0

#1217 opened by nicoboss

SK Telecom released A.X 3.1 (pronounced "A dot X"), a large language model (LLM) optimized for Korean-language understanding and enterprise deployment, on July 24, 2025. This sovereign AI model was developed entirely in-house by SKT, encompassing model architecture, data curation, and training, all carried out on SKT’s proprietary supercomputing infrastructure, TITAN. The model was trained from scratch on a high-quality multilingual corpus comprising 2.1 trillion tokens, with a primary focus on the Korean language.

SK Telecom released A.X 4.0 (pronounced "A dot X"), a large language model (LLM) optimized for Korean-language understanding and enterprise deployment, on July 3, 2025. Built on the open-source Qwen2.5 model, A.X 4.0 has been further trained with large-scale Korean datasets to deliver outstanding performance in real-world business environments.

A.X-3.1 failed due to an unrecognized BPE pre-tokenizer, but it seems quite obvious that they use llama-bpe, given that the model, despite being trained from the ground up, uses the LlamaForCausalLM architecture. Its output when quantized using the llama-bpe pre-tokenizer seems to be perfect. I can't judge the Korean output, but the English output looks great when using llama-bpe. Because of this I will use llama-bpe despite it not officially being marked as supported. In the worst case we can always requant it.
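For reference, here is a rough sketch of how one could sanity-check that assumption by comparing the pre_tokenizer section of A.X-3.1's tokenizer.json against a model llama.cpp already treats as llama-bpe. The repo ids below (skt/A.X-3.1 and meta-llama/Meta-Llama-3-8B) are only illustrative; the Llama 3 repo is gated, so any accessible llama-bpe checkpoint works just as well:

import json
from huggingface_hub import hf_hub_download

# Rough check: does A.X-3.1 declare the same pre-tokenization rules as a known llama-bpe model?
def load_pre_tokenizer(repo_id):
    path = hf_hub_download(repo_id=repo_id, filename="tokenizer.json")
    with open(path, "r", encoding="utf-8") as f:
        return json.load(f)["pre_tokenizer"]

ax_pre = load_pre_tokenizer("skt/A.X-3.1")                     # assumed repo id
llama_pre = load_pre_tokenizer("meta-llama/Meta-Llama-3-8B")   # example llama-bpe reference (gated repo)

print(json.dumps(ax_pre, indent=2))
print("identical pre_tokenizer:", ax_pre == llama_pre)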

A.X-4.0, on the other hand, fails to convert because the repository ships no SentencePiece tokenizer.model:

FileNotFoundError: File not found: /bpool/A.X-4.0/tokenizer.model

I really should finally take care of those errors. Maybe now is the opportunity to do something about it. I already wrote code earlier this year that losslessly trains a tokenizer.model based on tokenizer.json, but I never ended up using it.

This is how I losslessly generated the tokenizer.model and tokenizer.vocab for A.X-4.0:

import sentencepiece
import json
from huggingface_hub import hf_hub_download

# Define repository and file name
repo_id = "skt/A.X-4.0"
tokenizer_filename = "tokenizer.json"

# Download tokenizer.json from Hugging Face
tokenizer_path = hf_hub_download(repo_id=repo_id, filename=tokenizer_filename)

def json_to_sentencepiece_model(json_path, model_path):
    with open(json_path, 'r', encoding='utf-8') as f:
        tokenizer_data = json.load(f)

    # Train a new SentencePiece BPE model. Note that tokenizer.json itself is fed in as the
    # training corpus, so this only matches the original vocabulary size; it does not
    # reproduce the original merges or pieces.
    sentencepiece.SentencePieceTrainer.Train(
        input=json_path,
        model_prefix=model_path.replace('.model', ''),  # writes <prefix>.model and <prefix>.vocab
        vocab_size=len(tokenizer_data["model"]["vocab"]),
        character_coverage=1.0,
        model_type="bpe"
    )

# Convert tokenizer.json to tokenizer.model
json_to_sentencepiece_model(tokenizer_path, "tokenizer.model")

I now see why we never ended up using the above code:

[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
0x0000782ba42a2c17 in __GI___wait4 (pid=57928, stat_loc=0x0, options=0, usage=0x0) at ../sysdeps/unix/sysv/linux/wait4.c:30
30      ../sysdeps/unix/sysv/linux/wait4.c: No such file or directory.
#0  0x0000782ba42a2c17 in __GI___wait4 (pid=57928, stat_loc=0x0, options=0, usage=0x0) at ../sysdeps/unix/sysv/linux/wait4.c:30
30      in ../sysdeps/unix/sysv/linux/wait4.c
#1  0x0000782ba47992eb in ggml_print_backtrace () from /apool/llama.cpp/build/bin/libggml-base.so
#2  0x0000782ba47a8e29 in ggml_uncaught_exception() () from /apool/llama.cpp/build/bin/libggml-base.so
#3  0x0000782ba455ae1a in ?? () from /lib/x86_64-linux-gnu/libstdc++.so.6
#4  0x0000782ba455ae85 in std::terminate() () from /lib/x86_64-linux-gnu/libstdc++.so.6
#5  0x0000782ba455b0d8 in __cxa_throw () from /lib/x86_64-linux-gnu/libstdc++.so.6
#6  0x0000782ba4552240 in std::__throw_out_of_range(char const*) () from /lib/x86_64-linux-gnu/libstdc++.so.6
#7  0x0000782ba49999a2 in llama_vocab::byte_to_token(unsigned char) const () from /apool/llama.cpp/build/bin/libllama.so
#8  0x0000782ba499c93e in llama_vocab::impl::tokenize(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, bool, bool) const () from /apool/llama.cpp/build/bin/libllama.so
#9  0x0000782ba499dca3 in llama_vocab::tokenize(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, bool, bool) const () from /apool/llama.cpp/build/bin/libllama.so
#10 0x0000782ba499dd38 in llama_vocab::tokenize(char const*, int, int*, int, bool, bool) const () from /apool/llama.cpp/build/bin/libllama.so
#11 0x000061b152ac9ca1 in common_tokenize(llama_vocab const*, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, bool, bool) ()
#12 0x000061b152ac9e3a in common_tokenize(llama_context const*, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, bool, bool) ()
#13 0x000061b1529d6c99 in main ()
[Inferior 1 (process 57860) detached]
terminate called after throwing an instance of 'std::out_of_range'
  what():  unordered_map::at
Aborted

I think we really don't want to train our own tokenizer models unless llama.cpp doesn't care about them being terrible, which it likely does. Even if it contained all the tokens, our tokenizer model would obviously still suck without proper training; the crash inside llama_vocab::byte_to_token() suggests the freshly trained model doesn't even contain the byte-fallback tokens llama.cpp expects in a SentencePiece vocabulary. What's really interesting is that A.X-4.0 is based on Qwen2.5-72B, and Qwen2.5-72B lacks a tokenizer.model as well, yet Qwen2.5-72B converts successfully. Let's see why this is the case.
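As a first step in that direction, here is a small sketch that just lists the tokenizer-related files both repos actually ship and the architecture they declare (repo ids skt/A.X-4.0 and Qwen/Qwen2.5-72B assumed):

import json
from huggingface_hub import hf_hub_download, list_repo_files

# Compare which tokenizer artifacts A.X-4.0 and its Qwen2.5 base actually ship,
# and which architecture the conversion script will dispatch on.
for repo_id in ("skt/A.X-4.0", "Qwen/Qwen2.5-72B"):
    files = [f for f in list_repo_files(repo_id) if "tokenizer" in f or f == "config.json"]
    with open(hf_hub_download(repo_id=repo_id, filename="config.json"), "r", encoding="utf-8") as f:
        architectures = json.load(f).get("architectures")
    print(repo_id, architectures, files)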

What I want to know is what transformers does when you load these models. Surely the model must load and work with transformers, and surely it doesn't go through a training step?
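As far as I know, transformers only needs the fast tokenizer backed by tokenizer.json: AutoTokenizer loads such models without a tokenizer.model and without any training step. A minimal check:

from transformers import AutoTokenizer

# transformers loads the Rust-backed "fast" tokenizer directly from tokenizer.json,
# so no SentencePiece tokenizer.model and no training step are involved.
tokenizer = AutoTokenizer.from_pretrained("skt/A.X-4.0")
print(type(tokenizer))                             # a PreTrainedTokenizerFast subclass
print(tokenizer.tokenize("A.X 4.0 tokenizer check"))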
