Hunyuan-A13B Coding Performance

#5
by anikifoss - opened

@ubergarm thanks for the ik_llama pull! I was super excited for this model, waiting for the GGUF support.

Please share your impressions on how well this model does on coding tasks.

I've tried smaller quants, and even the Q8_0 quant with -ctk f16 -ctv f16 on llama.cpp, and I wasn't able to get any runnable code out of this model. Not only does it fail to one-shot any non-trivial problem, it was also unable to fix errors after several attempts.

Does that match your experience as well, or am I doing something wrong?

It also fails the strawberry test (with Q8_0): "How many rs are in straberrry (wrong spelling is on purpose)?"

Okay, let's tackle this question. The user is asking how many "r"s are in the word "straberrry," which has a wrong spelling by intention. First, I need to spell out the word correctly to compare, but actually, the user wants the count in the misspelled version. Let me write down the word: s-t-r-a-b-e-r-r-y. Let's go through each letter one by one...

Please share your impressions on how well this model does on coding tasks.

Yeah I have mixed feelings on this model.

The pros are that it is a nice size and pretty fast with limited VRAM. It also has a long context, and you can adjust it somewhat with the rope stuff as mentioned on the model card. With this quant and ik_llama.cpp I can get ~1800 tok/sec PP and over ~22 tok/sec TG, offloading extra layers onto a 3090 TI FE (24GB VRAM) and keeping the rest on my AMD 9950X with 2x48GB DDR5@6400MT/s (~88GB/s memory bandwidth measured by mlc).

The cons are plenty though haha...

First, the Instruct model has strangely high perplexity scores (the vllm versions do as well). The Pretrain version has "normal"-seeming perplexity scores, though, which could indicate that the fine-tuning has kinda "over-cooked" the model. It also has those unique <answer></answer> tags, which could be somewhat related to why the perplexity scores are high, but this is pure speculation; I haven't looked into the actual token probabilities enough to know why perplexity is borked on the fine-tune. There is a lot of discussion about this model on the AI Beavers Club discord, more specifically around role-playing and storytelling, and they do a good job experimenting with sampler settings, templates, etc.
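Just to clarify what that number means (this is a generic sketch, not how the reported scores were computed): perplexity is the exponential of the negative mean per-token log-probability over a test text, so a higher score means the model is more "surprised" by ordinary text:

import math

def perplexity(token_logprobs: list[float]) -> float:
    """Perplexity = exp of the negative mean log-probability per token.

    A higher value means the model finds the text more "surprising",
    which is why an unusually high score on the Instruct model (relative
    to the Pretrain model) can hint at an over-cooked fine-tune.
    """
    return math.exp(-sum(token_logprobs) / len(token_logprobs))

# Hypothetical per-token natural-log probabilities for a short text chunk
example_logprobs = [-1.2, -0.4, -2.7, -0.9, -1.5]
print(f"PPL = {perplexity(example_logprobs):.3f}")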

Secondly, it seems pretty sensitive to the system prompt, sampling settings, and template as well. I've had some luck including the simple system prompt that I saw in their official github repo: "You are a helpful AI assistant."

The official repo suggests these settings, which I've used for some testing (I don't set repetition_penalty for coding, however).
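For reference, here's a minimal sketch of how I poke at it through llama-server's OpenAI-compatible endpoint with that system prompt. The sampler values below are placeholders rather than the official recommendations, and the extra top_k field is a llama.cpp-style extension, so adjust as needed:

import json
import urllib.request

# Assumptions: llama-server is listening on 127.0.0.1:8080 (as in the launch
# command further down) and accepts llama.cpp's extra sampler fields alongside
# the standard OpenAI-style ones. Sampler values are placeholders.
payload = {
    "model": "ubergarm/Hunyuan-A13B-Instruct-IQ3_KS",
    "messages": [
        {"role": "system", "content": "You are a helpful AI assistant."},
        {"role": "user", "content": "Write a short docstring for a binary search function."},
    ],
    "temperature": 0.7,  # placeholder; use the settings from the official repo
    "top_p": 0.8,        # placeholder
    "top_k": 20,         # llama.cpp extension field
    # repetition_penalty intentionally left at its default for coding
}

req = urllib.request.Request(
    "http://127.0.0.1:8080/v1/chat/completions",
    data=json.dumps(payload).encode(),
    headers={"Content-Type": "application/json"},
)
with urllib.request.urlopen(req) as resp:
    print(json.loads(resp.read())["choices"][0]["message"]["content"])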

Now as far as coding goes, I did manage to get it to refactor a python matplotlib script for plotting llama-sweep-bench data. The code did run, but I really don't like the style: it has kinda annoyingly big comments and crams stuff right into the main function.

For fun I tried a prompt that works well even on the heavily quantized DeepSeek-R1-0528 IQ1_S. It seems to work okay even with my fairly quantized version:

👈 coding challenge prompt and results

Prompt

## Task
Write an efficient python program to search for perfect numbers.

## Background
In number theory, a perfect number is a positive integer that is equal
to the sum of its positive proper divisors, that is, divisors excluding
the number itself. For instance, 6 has proper divisors 1, 2 and 3,
and 1 + 2 + 3 = 6, so 6 is a perfect number. The next perfect number is
28, since 1 + 2 + 4 + 7 + 14 = 28.

The first four perfect numbers are 6, 28, 496 and 8128.

## Implementation Details
Keep these implementation details in mind while developing the code:

1. Specify static types where applicable.
2. Use clear comments in a pythonic style.
3. Consider speed optimizations and use an efficient algorithm.
4. Use simple logging to display the results as they are discovered.
5. Finally, provide a usage example for this CLI tool application.

Output code:

import logging
import sys

# Configure logging to display informational messages
logging.basicConfig(level=logging.INFO, format='%(levelname)s: %(message)s')

def is_prime(n: int) -> bool:
    """Check if a number is a prime using the Miller-Rabin primality test.

    This test is efficient and deterministic for numbers up to 2^64 when using the bases [2, 3, 5, 7].
    """
    if n <= 1:
        return False
    elif n <= 3:
        return True
    elif n % 2 == 0:
        return False
    # Write n-1 as d * 2^s
    d = n - 1
    s = 0
    while d % 2 == 0:
        d //= 2
        s += 1
    # Witness loop using bases 2, 3, 5, 7 (sufficient for n < 2^64)
    for a in [2, 3, 5, 7]:
        if a >= n:
            continue
        x = pow(a, d, n)
        if x == 1 or x == n - 1:
            continue
        for _ in range(s - 1):
            x = pow(x, 2, n)
            if x == n - 1:
                break
        else:
            return False
    return True

def lucas_lehmer(p: int) -> bool:
    """Check if 2^p - 1 is a Mersenne prime using the Lucas-Lehmer primality test.

    Args:
        p: A prime number.

    Returns:
        True if 2^p - 1 is a Mersenne prime, False otherwise.
    """
    if p == 2:
        return True
    m = (1 << p) - 1  # Equivalent to 2^p - 1
    s = 4
    for _ in range(p - 2):
        s = (s * s - 2) % m
    return s == 0

def generate_primes():
    """Generator that yields prime numbers in ascending order."""
    candidate = 2
    while True:
        if is_prime(candidate):
            yield candidate
        # Increment by 1 for candidate=2, then by 2 for subsequent candidates (skip even numbers)
        candidate += 1 if candidate == 2 else 2

def find_perfect_numbers(limit: int) -> None:
    """Find the first 'limit' even perfect numbers using the Euclid-Euler theorem."""
    count = 0
    for p in generate_primes():
        logging.info(f"Checking exponent p={p}...")
        if lucas_lehmer(p):
            m = (1 << p) - 1
            perfect_number = (1 << (p - 1)) * m
            count += 1
            logging.info(f"Found perfect number: {perfect_number}")
            if count == limit:
                logging.info(f"Found the first {limit} perfect numbers.")
                return

def main():
    """Main function to handle user input and execute the perfect number search."""
    try:
        limit = int(input("Enter the number of perfect numbers to find: "))
        if limit <= 0:
            raise ValueError("Limit must be a positive integer.")
    except ValueError as e:
        logging.error(f"Invalid input: {e}")
        sys.exit(1)

    find_perfect_numbers(limit)

if __name__ == "__main__":
    main()

Output without any editing:

$ python perfect-hunyuan.py
Enter the number of perfect numbers to find: 10
INFO: Checking exponent p=2...
INFO: Found perfect number: 6
INFO: Checking exponent p=3...
INFO: Found perfect number: 28
INFO: Checking exponent p=5...
INFO: Found perfect number: 496
INFO: Checking exponent p=7...
INFO: Found perfect number: 8128
INFO: Checking exponent p=11...
INFO: Checking exponent p=13...
INFO: Found perfect number: 33550336
INFO: Checking exponent p=17...
INFO: Found perfect number: 8589869056
INFO: Checking exponent p=19...
INFO: Found perfect number: 137438691328
INFO: Checking exponent p=23...
INFO: Checking exponent p=29...
INFO: Checking exponent p=31...
INFO: Found perfect number: 2305843008139952128
INFO: Checking exponent p=37...
INFO: Checking exponent p=41...
INFO: Checking exponent p=43...
INFO: Checking exponent p=47...
INFO: Checking exponent p=53...
INFO: Checking exponent p=59...
INFO: Checking exponent p=61...
INFO: Found perfect number: 2658455991569831744654692615953842176
INFO: Checking exponent p=67...
INFO: Checking exponent p=71...
INFO: Checking exponent p=73...
INFO: Checking exponent p=79...
INFO: Checking exponent p=83...
INFO: Checking exponent p=89...
INFO: Found perfect number: 191561942608236107294793378084303638130997321548169216
INFO: Found the first 10 perfect numbers.
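As a quick sanity check of my own (not part of the model's output), a brute-force proper-divisor sum confirms the smaller results above, which is exactly what the Euclid-Euler construction 2^(p-1) * (2^p - 1) is supposed to produce:

def is_perfect_bruteforce(n: int) -> bool:
    """Check perfection directly: the sum of proper divisors equals n."""
    total = 1  # 1 divides every n > 1
    d = 2
    while d * d <= n:
        if n % d == 0:
            # Add both divisors of the pair, avoiding double-counting squares
            total += d + (n // d if d != n // d else 0)
        d += 1
    return n > 1 and total == n

# Cross-check the first few numbers reported in the log above.
for n in (6, 28, 496, 8128, 33550336):
    print(n, is_perfect_bruteforce(n))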

Here is how I'm running it with good old q8_0 kv-cache quantization:

$ cd ik_llama.cpp
$ ./build/bin/llama-server --version
version: 3793 (5446ccc8)

$ export model=/mnt/astrodata/llm/models/ubergarm/Hunyuan-A13B-Instruct-GGUF/Hunyuan-A13B-Instruct-IQ3_KS.gguf
$ ./build/bin/llama-server \
  --model "$model" \
  --alias ubergarm/Hunyuan-A13B-Instruct-IQ3_KS \
  -fa -fmoe \
  -ctk q8_0 -ctv q8_0 \
  -c 40960 \
  -rtr \
  -ngl 99 \
  -ot "blk\.([0-9])\.ffn_.*=CUDA0" \
  -ot "blk\.(1[0-2])\.ffn_.*=CUDA0" \
  -ot exps=CPU \
  -ub 4096 -b 4096 \
  --parallel 1 \
  --threads 16 \
  --host 127.0.0.1 \
  --port 8080
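For what it's worth, those -ot patterns are plain regular expressions over tensor names. Here's a tiny sketch (the tensor names are illustrative, and I'm assuming earlier patterns take precedence over the exps=CPU catch-all) showing that they pin the ffn tensors of blocks 0-12 to CUDA0 and leave the remaining expert tensors on the CPU:

import re

# Hypothetical tensor names in the GGUF naming scheme (illustrative only).
names = [
    "blk.0.ffn_gate_exps.weight",
    "blk.5.ffn_up_exps.weight",
    "blk.12.ffn_down_exps.weight",
    "blk.13.ffn_gate_exps.weight",
    "blk.31.ffn_up_exps.weight",
]

# The same patterns passed via -ot above; assumed first-match-wins ordering,
# with "exps=CPU" acting as the catch-all for everything not matched earlier.
rules = [
    (r"blk\.([0-9])\.ffn_.*", "CUDA0"),
    (r"blk\.(1[0-2])\.ffn_.*", "CUDA0"),
    (r"exps", "CPU"),
]

for name in names:
    target = next((dev for pat, dev in rules if re.search(pat, name)), "default")
    print(f"{name:35s} -> {target}")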

Thanks for the details! Yeah, this model is pretty fast. I just wish it were as powerful as the benchmarks suggest.

From what I can tell, Hunyuan-A13B is good at replying with existing knowledge. Problems that closely match the training set get a reasonable solution. However, it fails to generalize, so it's unable to solve variations of problems from the training set. Being unable to handle an incorrect spelling of strawberry is a telltale sign of that failure to generalize.

Unfortunately, that leaves only one conclusion: the model was trained to the benchmarks, so it's not nearly as useful when applied to real-world problems.

From what I can tell, even Qwen3-30B-A3B significantly outperforms Hunyuan-A13B while also being much faster.

@anikifoss

Yeah, given I have access to DeepSeek models, I'm not using this Hunyuan one much for actual stuff.

I'm keeping my fingers crossed for the recently rumored THUDM/GLM-4-MoE-100B-A10B; more info here: https://github.com/ikawrakow/ik_llama.cpp/issues/597
