Wur doomed!
What do you and the others think of the distilled R1 models for writing?
The llama3 / qwen models SFT'd on R1 outputs? I only tried 2 of them.
R1 Qwen (32b) - Lacks knowledge of fiction (same as the official Qwen release), so its writing is no better.
R1 Llama3 - This is generally the worst of them (not just for writing). It'll generate the CoT and then write something completely different.
CoT traces won't let the model do anything out of distribution, so they're not very useful if the base model doesn't have a lot in its training data.
Yeah, I have tried the same two and felt the same way.
I also felt that any attempt to add an R1 distill to the merge recipe of an existing merge project made it worse...so far...
@gghfez @BigHuggyD that has been my experience as well, which is a shame as I had a go of R1 on OpenRouter and I was blown away.
What model is anywhere close that is usable on a 24gb vram machine with 32gb of ram in your experience?
There's nothing like it for now. I'm running R1 slowly on my ThreadRipper:
prompt eval time = 14026.61 ms / 918 tokens ( 15.28 ms per token, 65.45 tokens per second)
eval time = 398806.12 ms / 1807 tokens ( 220.70 ms per token, 4.53 tokens per second)
total time = 412832.73 ms / 2725 tokens
I tried training Wizard2 8x22b MoE on R1 data, but it doesn't really work well. It will plan ahead in think tags, e.g.:
I need to ensure the story maintains its gritty, realistic tone without becoming overly melodramatic. The characters' growth should be subtle but significant. Also, the ending should leave a sense of hope but not be too neat—their redemption is fragile, and the future is uncertain.
Let me outline the next few chapters:
Chapter 5: Nightmares and Trust
...
But it doesn't backtrack like R1 does. Just kind of agrees with itself and ends up writing how it usually would:
“I don’t know what I want anymore,” she admitted, voice barely above a whisper as rain tapped against corrugated roofing overhead.
lol
Ahhh that's a shame :-(
"I don’t know what I want anymore,” she admitted, voice barely above a whisper as rain tapped against corrugated roofing overhead."
Oh god!
I'll have to keep an eye on this thread.
I did enjoy Ppoyaa/MythoNemo-L3.1-70B-v1.0
But my tastes are probably not as refined as others on this thread ;-)
prompt eval time = 14026.61 ms / 918 tokens ( 15.28 ms per token, 65.45 tokens per second) eval time = 398806.12 ms / 1807 tokens ( 220.70 ms per token, 4.53 tokens per second) total time = 412832.73 ms / 2725 tokens
What quant are you running?
I can get 4-5 tokens per second with this PR offloading the experts to RAM and keeping everything else on the GPU:
and this hacked llama_tensor_get_type():
if (ftype == LLAMA_FTYPE_MOSTLY_Q2_K) {
if (name.find("_exps") != std::string::npos) {
if (name.find("ffn_down") != std::string::npos) {
new_type = GGML_TYPE_Q4_K;
}
else {
new_type = GGML_TYPE_Q2_K;
}
}
else {
new_type = GGML_TYPE_Q8_0;
}
}
else
along with bartowski's imatrix file.
It was pretty much indistinguishable from much higher quants (that ran at 2-2.5 tokens per second), but I found going much lower for the experts' down projections made it get dumber quickly.
I did have some weird shit where it was trying to allocate 1.4TB of VRAM, but found the fix here:
https://github.com/ggerganov/llama.cpp/pull/11397#issuecomment-2635392482
(not sure if it's related to that PR though...).
I'm not just trying the RPC through all 6 GPUs, but also having to requant, due to only being able to fit 61 of 62 layers with the above...
if (ftype == LLAMA_FTYPE_MOSTLY_IQ2_M) {
if (name.find("_exps") != std::string::npos) {
if (name.find("ffn_down") != std::string::npos) {
new_type = GGML_TYPE_IQ3_S;
}
else {
new_type = GGML_TYPE_IQ2_S;
}
}
else {
new_type = GGML_TYPE_Q6_K;
}
}
else
This should hopefully show if the RPC stuff is worth the hassle... It's an absolute bastard to set up:
https://github.com/ggerganov/llama.cpp/blob/master/examples/rpc/README.md
with lots of hidden options:
https://github.com/ggerganov/llama.cpp/pull/11606
https://github.com/ggerganov/llama.cpp/pull/11424
https://github.com/ggerganov/llama.cpp/pull/9296
and oddly it only seems to work if you reorder the CUDA0 and CUDA1 devices for some reason???
If I decide to stick with CPU-only then there is also this to try:
which should gain 25% for very little lost ability.
I think 4-5 tokens per second for a usable / non-joke quant might be about as good as we can hope for, as even 2 x M1 Ultra are still gonna be in that range if the novelty "1.58bit" quant ran at ~13 tokens per second... :/
This turns out to be a really good test prompt too:
Varis adjusted the noose, its hemp fibers grinding beneath his calluses. “Last chance,” he said, voice like gravel dragged through mud. “Confess, and your soul stays your own.”
Jurl laughed—a wet, gurgling sound. “You’re knee-deep in it, Coldwater. ” The thing inside him twisted the boy’s lips into a grin too wide for his face. “The Great Wolf’s howlin’ again. The Dead’s Gate’s rusted through… ”
Turn this into the opening chapter of a Grimdark trilogy.
Shit quants will not think very much and often actually don't even use the words from the prompt and/or use the "knee-deep", "Great Wolf" and "Dead’s Gate’s rusted through" bits in a much worse way.
Oh and I wrote this because I couldn't actually convert the fp8 stuff on my Ampere GPUs, and to re-download somebody else's bf16 version was gonna take about a week:
import os
import json
from argparse import ArgumentParser
from glob import glob
from tqdm import tqdm
import torch
from safetensors.torch import load_file, save_file
def weight_dequant_cpu(x: torch.Tensor, s: torch.Tensor, block_size: int = 128) -> torch.Tensor:
"""
CPU version of dequantizing weights using the provided scaling factors.
This function splits the quantized weight tensor `x` into blocks of size `block_size`
and multiplies each block by its corresponding scaling factor from `s`. It assumes that
`x` is a 2D tensor (quantized in FP8) and that `s` is a 2D tensor with shape:
(ceil(M/block_size), ceil(N/block_size))
where M, N are the dimensions of `x`.
Args:
x (torch.Tensor): The quantized weight tensor with shape (M, N).
s (torch.Tensor): The scaling factor tensor with shape (ceil(M/block_size), ceil(N/block_size)).
block_size (int, optional): The block size used during quantization. Defaults to 128.
Returns:
torch.Tensor: The dequantized weight tensor with shape (M, N) and dtype given by torch.get_default_dtype().
"""
# Ensure inputs are contiguous and 2D.
assert x.is_contiguous() and s.is_contiguous(), "x and s must be contiguous"
assert x.dim() == 2 and s.dim() == 2, "x and s must be 2D tensors"
M, N = x.shape
grid_rows = (M + block_size - 1) // block_size
grid_cols = (N + block_size - 1) // block_size
# Verify that s has the expected shape.
if s.shape != (grid_rows, grid_cols):
raise ValueError(f"Expected scale tensor s to have shape ({grid_rows}, {grid_cols}), but got {s.shape}")
# Prepare an output tensor.
# NOTE: torch.set_default_dtype(torch.bfloat16) in main, so torch.get_default_dtype() should be BF16.
y = torch.empty((M, N), dtype=torch.get_default_dtype(), device=x.device)
# Process each block independently.
for i in range(grid_rows):
row_start = i * block_size
row_end = min((i + 1) * block_size, M)
for j in range(grid_cols):
col_start = j * block_size
col_end = min((j + 1) * block_size, N)
# Convert the block to float32 (like the Triton kernel's .to(tl.float32))
block = x[row_start:row_end, col_start:col_end].to(torch.float32)
scale = s[i, j] # This is the scaling factor for the current block.
# Multiply then cast the result to the default dtype—for example, bfloat16.
y[row_start:row_end, col_start:col_end] = (block * scale).to(torch.get_default_dtype())
return y
def weight_dequant_cpu_vectorized(x: torch.Tensor, s: torch.Tensor, block_size: int = 128) -> torch.Tensor:
"""
Vectorized version of dequantizing weights using provided scaling factors.
This function aims to replace the loops in weight_dequant_cpu with vectorized operations.
Args:
x (torch.Tensor): The quantized weight tensor with shape (M, N).
s (torch.Tensor): The scaling factor tensor with shape (ceil(M/block_size), ceil(N/block_size)).
block_size (int): The block size used during quantization.
Returns:
torch.Tensor: The dequantized weight tensor with shape (M, N) and dtype given by torch.get_default_dtype().
"""
assert x.is_contiguous() and s.is_contiguous(), "x and s must be contiguous"
assert x.dim() == 2 and s.dim() == 2, "x and s must be 2D tensors"
M, N = x.shape
device = x.device
grid_rows = (M + block_size - 1) // block_size
grid_cols = (N + block_size - 1) // block_size
# Verify that s has the expected shape.
if s.shape != (grid_rows, grid_cols):
raise ValueError(f"Expected scale tensor s to have shape ({grid_rows}, {grid_cols}), but got {s.shape}")
# Generate row and column indices
row_indices = torch.arange(M, device=device)
col_indices = torch.arange(N, device=device)
# Compute block indices
block_row_indices = row_indices // block_size # shape (M,)
block_col_indices = col_indices // block_size # shape (N,)
# Get scaling factors for each position
s_expand = s[block_row_indices[:, None], block_col_indices[None, :]] # shape (M, N)
# Perform dequantization
block = x.to(torch.float32)
y = (block * s_expand).to(torch.get_default_dtype())
return y
def main(fp8_path, bf16_path):
"""
Converts FP8 weights to BF16 and saves the converted weights.
This function reads FP8 weights from the specified directory, converts them to BF16,
and saves the converted weights to another specified directory. It also updates the
model index file to reflect the changes.
Args:
fp8_path (str): The path to the directory containing the FP8 weights and model index file.
bf16_path (str): The path to the directory where the converted BF16 weights will be saved.
Raises:
KeyError: If a required scale_inv tensor is missing for a weight.
Notes:
- The function assumes that the FP8 weights are stored in safetensor files.
- The function caches loaded safetensor files to optimize memory usage.
- The function updates the model index file to remove references to scale_inv tensors.
"""
torch.set_default_dtype(torch.bfloat16)
os.makedirs(bf16_path, exist_ok=True)
model_index_file = os.path.join(fp8_path, "model.safetensors.index.json")
with open(model_index_file, "r") as f:
model_index = json.load(f)
weight_map = model_index["weight_map"]
# Cache for loaded safetensor files
loaded_files = {}
fp8_weight_names = []
# Helper function to get tensor from the correct file
def get_tensor(tensor_name):
"""
Retrieves a tensor from the cached safetensor files or loads it from disk if not cached.
Args:
tensor_name (str): The name of the tensor to retrieve.
Returns:
torch.Tensor: The retrieved tensor.
Raises:
KeyError: If the tensor does not exist in the safetensor file.
"""
file_name = weight_map[tensor_name]
if file_name not in loaded_files:
file_path = os.path.join(fp8_path, file_name)
loaded_files[file_name] = load_file(file_path, device="cpu")
return loaded_files[file_name][tensor_name]
safetensor_files = list(glob(os.path.join(fp8_path, "*.safetensors")))
safetensor_files.sort()
for safetensor_file in tqdm(safetensor_files):
file_name = os.path.basename(safetensor_file)
current_state_dict = load_file(safetensor_file, device="cpu")
loaded_files[file_name] = current_state_dict
new_state_dict = {}
for weight_name, weight in current_state_dict.items():
if weight_name.endswith("_scale_inv"):
continue
elif weight.element_size() == 1: # FP8 weight
scale_inv_name = f"{weight_name}_scale_inv"
try:
# Get scale_inv from the correct file
scale_inv = get_tensor(scale_inv_name)
fp8_weight_names.append(weight_name)
new_state_dict[weight_name] = weight_dequant_cpu_vectorized(weight, scale_inv)
except KeyError:
print(f"Warning: Missing scale_inv tensor for {weight_name}, skipping conversion")
new_state_dict[weight_name] = weight
else:
new_state_dict[weight_name] = weight
new_safetensor_file = os.path.join(bf16_path, file_name)
save_file(new_state_dict, new_safetensor_file)
# Memory management: keep only the 2 most recently used files
if len(loaded_files) > 2:
oldest_file = next(iter(loaded_files))
del loaded_files[oldest_file]
torch.cuda.empty_cache()
# Update model index
new_model_index_file = os.path.join(bf16_path, "model.safetensors.index.json")
for weight_name in fp8_weight_names:
scale_inv_name = f"{weight_name}_scale_inv"
if scale_inv_name in weight_map:
weight_map.pop(scale_inv_name)
with open(new_model_index_file, "w") as f:
json.dump({"metadata": {}, "weight_map": weight_map}, f, indent=2)
if __name__ == "__main__":
parser = ArgumentParser()
parser.add_argument("--input-fp8-hf-path", type=str, required=True)
parser.add_argument("--output-bf16-hf-path", type=str, required=True)
args = parser.parse_args()
main(args.input_fp8_hf_path, args.output_bf16_hf_path)
It does the same thing but doesn't use the Triton kernel (change the two "cpu" to "cuda" if you want but I don't think it matters much if you use the vectorized version).
@ChuckMcSneed This should hopefully even work on CPU-only systems if you install Torch.
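If anyone wants to sanity-check the vectorized path against the loop version before converting a full model, a quick test on random data like this should do it (just a sketch: it assumes a PyTorch build with float8 support and the two functions from the script above):

import torch

torch.set_default_dtype(torch.bfloat16)

M, N, block_size = 512, 384, 128
x = torch.randn(M, N, dtype=torch.float32).to(torch.float8_e4m3fn)   # fake FP8 weight
s = torch.rand((M + block_size - 1) // block_size,
               (N + block_size - 1) // block_size)                    # fake per-block scales

y_loop = weight_dequant_cpu(x, s, block_size)
y_vec = weight_dequant_cpu_vectorized(x, s, block_size)
print((y_loop.float() - y_vec.float()).abs().max().item())           # should be 0 (or negligible)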
@gghfez
Have you figured out a way to keep R1 from descending into madness in a larger context of multi-turn creative interaction? I keep finding that I love how things start and then at some point.. usually around 16-20k of back and forth it just gets deeper and deeper down the rabbit hole. The creativity goes from entertaining to a mental disorder...
I got 9 tokens per second using all 6 GPUs and RPC:
prompt eval time = 1695.23 ms / 128 tokens ( 13.24 ms per token, 75.51 tokens per second)
eval time = 170082.72 ms / 1558 tokens ( 109.17 ms per token, 9.16 tokens per second)
total time = 171777.94 ms / 1686 tokens
but it seems that 3 bits for the non-shared experts' down_proj matrices ruins the model and makes it significantly dumber (tried with both Q3_K and IQ3_S now and both had the same effect).
@gghfez
Have you figured out a way to keep R1 from descending into madness in a larger context of multi-turn creative interaction? I keep finding that I love how things start and then at some point.. usually around 16-20k of back and forth it just gets deeper and deeper down the rabbit hole. The creativity goes from entertaining to a mental disorder...
Are you making sure to remove the old generated text between the thinking tags for each turn? I think that can cause the model to go "insane" from what people said in the OpenRouter discord.
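Something like this is what I mean - a rough sketch assuming an OpenAI-style list of {"role": ..., "content": ...} messages (adjust for whatever front-end you're using):

import re

# Strip old chain-of-thought from prior assistant turns before resending the history.
THINK_RE = re.compile(r"<think>.*?</think>\s*", flags=re.DOTALL)

def strip_old_cot(messages):
    return [
        {**m, "content": THINK_RE.sub("", m["content"])} if m["role"] == "assistant" else m
        for m in messages
    ]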
@gghfez
Have you figured out a way to keep R1 from descending into madness in a larger context of multi-turn creative interaction? I keep finding that I love how things start and then at some point.. usually around 16-20k of back and forth it just gets deeper and deeper down the rabbit hole. The creativity goes from entertaining to a mental disorder...
Are you making sure to remove the old generated text between the thinking tags for each turn? I think that can cause the model to go "insane" from what people said in the OpenRouter discord.
I have been trimming out all but the last few. I like it 'knowing' the process I want it to use for 'think' but maybe that's a mistake. It worked well with the old think, reflect, output form with other models but maybe with this one I need to kill it every turn.
Ahhh that's a shame :-(
"I don’t know what I want anymore,” she admitted, voice barely above a whisper as rain tapped against corrugated roofing overhead."
Oh god!
I'll have to keep an eye on this thread.
I did enjoy Ppoyaa/MythoNemo-L3.1-70B-v1.0
But my tastes are probably not as refined as others on this thread ;-)
Not sure I would call mine refined lol. I like what I like and I think I have found a group of people with similar tastes. I don't know 90% of what is said here, but I like being the most naive one in the room. Keeps expectations in check...
That's fair. I mainly like it to be really obvious that each character speaks differently in their dialogue. It seems to be hard to find models that can do that well. I use Novelcrafter and have character sheets with example dialogue. I suspect it's as much a me issue as a model issue though.
I think most people are looking for something different than I am.
I'm not a software developer myself so I don't understand most of what's said here either. But it's interesting though isn't it? ;-)
I've actually got NUMA working properly in llama.cpp after the umpteenth attempt!!!
4.73 tokens per second using a 413GB Q4_K / Q6_K / Q8_0 mix on a machine with ~78GB/s maximum bandwidth per NUMA node.
This is nearly 2x what I was getting with a ~250GB Q2_K / Q4_K / Q8_0 mix earlier, and only half of what I got when I linked all 6 GPUs using RPC for an IQ2_S / IQ3_S / Q6_0 model earlier!
It's quite involved so I will post the exact details tomorrow as I'm pretty tired tonight, but it should probably work just as well for @ChuckMcSneed using pure-CPU inference too.
R1 is nothing if not original LOL
The dusty parking lot of Big Al's Used Cars stretches under a molten orange sunset, neon signs buzzing to life as cicadas thrum in the scorched Oklahoma air. Joe Dirt leans on his push broom near a row of mid-90s pickup trucks, their hoods shimmering with heatwaves. A half-eaten gas station burrito balances precariously on a stack of "AS IS" sale flyers. Across the lot, a dented El Camino with a Confederate flag bumper sticker coughs to a stop, its driver’s-side door squealing open to reveal… BigD.
spits sunflower seeds into a Folgers can "Well butter my butt and call me a biscuit—ain’t you a sight for sore tax returns!" He tugs at his mullet nervously, boot tapping to Skynyrd leaking from a busted radio. "Y’know, they say destiny’s like a septic tank—you don’t wanna go diggin’ ‘less you’re ready for what’s brewin’. What’s your story, stranger?" He gestures to a handwritten sign behind him: FREE LIFE ADVICE (RESULTS MAY VARY).*
The tang of burnt transmission fluid mixes with Joe’s AXE body spray. A tumbleweed of fast-food wrappers drifts past BigD’s boots as Slick Vic, the lot’s chain-smoking sales manager, squints from his trailer office. Crickets pause mid-chirp.
Joe leans closer, eyeing the El Camino’s trunk. "That there’s a ‘87 model? My second cousin Cleetus once smuggled a raccoon family in one o’ them. ‘Course, the raccoons unionized—*long story." He pulls a half-melted Jolly Rancher from his pocket, offering it like a peace treaty. Slick Vic’s shadow looms in the distance, flicking a cigarette butt toward BigD’s tires.
Yeah, R1 has pretty much blown everything else out of the water for me. I've run that same prompt I posted above about 100x tonight refining the NUMA stuff and it's pretty amazing how varied the final stories have been (with some min-p you can push the temperature right up to 3+ too!).
Yeah, R1 has pretty much blown everything else out of the water for me. I've run that same prompt I posted above about 100x tonight refining the NUMA stuff and it's pretty amazing how varied the final stories have been (with some min-p you can push the temperature right up to 3+ too!).
How high did you have to take min-p to keep it coherent at a 3 temp??
Only around 0.1 or even 0.05 IIRC.
What quant are you running?
That was 1.73-bit (which I usually use)
I sometimes run the DeepSeek-R1-UD-IQ2_XXS, but it has to offload to SSD so I get slower prompt ingestion:
prompt eval time = 5936.61 ms / 29 tokens ( 204.71 ms per token, 4.88 tokens per second)
eval time = 242477.40 ms / 1005 tokens ( 241.27 ms per token, 4.14 tokens per second)
total time = 248414.02 ms / 1034 tokens
NUMA
That's a huge improvement, faster than a cloud server I rented recently.
Won't help my local setup as I only have one NUMA node. I'm hoping they make progress with flash-attention.
Have you figured out a way to keep R1 from descending into madness in a larger context of multi-turn creative interaction? I keep finding that I love how things start and then at some point.. usually around 16-20k of back and forth it just gets deeper and deeper down the rabbit hole. The creativity goes from entertaining to a mental disorder...
I can't run it past 12k with my vram+ram so haven't had that problem :D But deepseek recommend not sending the CoT traces for prior messages along with it.
I can't run it past 12k with my vram+ram so haven't had that problem :D But deepseek recommend not sending the CoT traces for prior messages along with it.
Ahh yes 😂 I am a bit spoiled right now. I'm going to miss it when it's gone.
I 'think' that was part of the issue. I started removing the CoT immediately and it made it further before it exited the highway for crazy town.
The second part might be I needed to purge my prompt. I have a laundry list of instructions on how I want things written that works well with Largestral and Llama33 based models that might be hurting more than helping with R1. I'll know soon enough.
Okay yeah, muuuuuch better with a threadbare prompt. Borderline obsessive with every line of instruction in there. I'm not sure what exactly was in there that turned every chat into the multiverse collapsing into itself, but there you have it.
@ChuckMcSneed Try this and see if it improves your NUMA performance:
- Turn off NUMA balancing in Linux using echo 0 | sudo tee /proc/sys/kernel/numa_balancing > /dev/null (only has to be run once per OS boot).
- Clear the page cache using echo 3 | sudo tee /proc/sys/vm/drop_caches > /dev/null.
- Run sudo numactl -H to check the pages have been cleared, e.g.:
available: 2 nodes (0-1)
node 0 cpus: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65
node 0 size: 257860 MB
node 0 free: 257070 MB
node 1 cpus: 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87
node 1 size: 257989 MB
node 1 free: 257062 MB
node distances:
node 0 1
0: 10 21
1: 21 10
- Run llama.cpp using numactl --interleave=all, set the --numa distribute command-line option, and set --threads to 1/2 the threads you have in your system, e.g.:
> numactl --interleave=all ./llama-server --host 192.168.1.111 --port 8080 \
--model ./DeepSeek-R1-Q5_K_XL.gguf --chat-template deepseek3 --alias "DeepSeek-R1-Q5_K_XL" \
--ctx_size 8192 --threads 44
- Wait until you see: "main: server is listening on http://192.168.1.111:8080 - starting the main loop", then run a prompt.
- Finally, wait for all the MoE tensors to properly warm up (you can see the memory use of the process growing by watching top, etc.) - for me this takes about 30 minutes!
- Re-run sudo numactl -H to check the pages have been equally distributed:
available: 2 nodes (0-1)
node 0 cpus: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65
node 0 size: 257860 MB
node 0 free: 19029 MB
node 1 cpus: 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87
node 1 size: 257989 MB
node 1 free: 19222 MB
node distances:
node 0 1
0: 10 21
1: 21 10
Now the model should be paged properly and you shouldn't need to do this until the next OS boot, or if you want to change model, etc.
You can probably gain a little bit more by reducing --threads now (you don't need to rerun all the above for this, but possibly want to always use 1/2 the OS threads for the initial "warm-up" process [not tested yet]).
For reference I'm using:
- Dual E5-2696v4, with 512GB of 2400MHz LR-DIMMs (all sockets populated) and a theoretical max per-socket bandwidth of ~78GB/s.
- BIOS set to "Home Snoop with Directory" NUMA mode (see: https://frankdenneman.nl/2016/07/11/numa-deep-dive-part-3-cache-coherency/).
I'm also using the sl/custom-tensor-offload branch to offload only the massive MoE tensors using --override-tensor exps=CPU, but I think the same should work for pure-CPU NUMA setup too.
My new 463GB custom Q5_K_XL quant using this hacked into llama_tensor_get_type():
// ### JUK ###
if (ftype == LLAMA_FTYPE_MOSTLY_Q4_K_M || ftype == LLAMA_FTYPE_MOSTLY_Q5_K_M || ftype == LLAMA_FTYPE_MOSTLY_Q6_K) {
if (name.find("_exps") != std::string::npos) {
if (name.find("ffn_down") != std::string::npos) {
new_type = GGML_TYPE_Q6_K;
}
else {
if (ftype == LLAMA_FTYPE_MOSTLY_Q4_K_M) {
new_type = GGML_TYPE_Q4_K;
}
else if (ftype == LLAMA_FTYPE_MOSTLY_Q5_K_M) {
new_type = GGML_TYPE_Q5_K;
}
else {
new_type = GGML_TYPE_Q6_K;
}
}
}
else {
new_type = GGML_TYPE_Q8_0;
}
}
else
// ### JUK ###
I get this for the first run:
prompt eval time = 2167538.40 ms / 128 tokens (16933.89 ms per token, 0.06 tokens per second)
eval time = 461194.72 ms / 1973 tokens ( 233.75 ms per token, 4.28 tokens per second)
total time = 2628733.12 ms / 2101 tokens
and when using this optimised set of parameters for the second run:
numactl --interleave=all ./llama.cpp/build/bin/llama-server --host 192.168.1.111 --port 8080 \
--model ./DeepSeek-R1-Q5_K_XL.gguf --chat-template deepseek3 --alias "DeepSeek-R1-Q5_K_XL" --ctx_size 14336 --tensor-split 30,32 \
--n-gpu-layers 99 --override-tensor exps=CPU --numa distribute --threads 30 \
--temp 0.6 --min-p 0.0 --top-p 1.0 --top-k 0
I get:
prompt eval time = 91949.76 ms / 128 tokens ( 718.36 ms per token, 1.39 tokens per second)
eval time = 441279.72 ms / 1934 tokens ( 228.17 ms per token, 4.38 tokens per second)
total time = 533229.48 ms / 2062 tokens
Prompt processing is still pretty slow and I can't fit any more context than 14k for my 96GB of VRAM (!?), so gonna try this today:
https://github.com/ggerganov/llama.cpp/pull/11446
but it will require several hours to re-create the new GGUF tensors, etc :/
(I can get around 19k context using --cache-type-k q8_0 but the generation speed drops by about 20%)
I'm quanting R1-Zero tonight and it's supposed to be completely batshit crazy - wish me luck :D
I'm quanting R1-Zero tonight and it's supposed to be completely batshit crazy - wish me luck :D
I'm eager to hear your observations!
@jukofyork Thanks for the advice, will try tomorrow!
I've been working on a new private trivia benchmark, and tested the first 12 questions on lmarena:
These new models (kiwi and chocolate) are clearly one step above everything else; they get some questions right which no other models did. What's also interesting is that smaller models (mistral, gemma, micronova) somehow guessed some questions (about taste and color, not a lot of choice) right while their bigger variants failed.
I'm sorry... What the frick are kiwi and chocolate? How have I never heard of these?
I'm quanting R1-Zero tonight and it's supposed to be completely batshit crazy - wish me luck :D
Yeah I'm interested as well. I ran a brain-damaged 2-bit quant and found it just thought itself into loops.
Prompt processing is still pretty slow
That was the issue I had with running purely CPU. Generation could get quite fast but prompt processing being single digits makes it unusable for dumping code or story chapters/drafts into.
What the frick are kiwi and chocolate?
I saw them mentioned on reddit, they're on lmsys arena apparently.
Ahh I see it now. People are speculating it is Grok 3.
Tested NUMA stuff, no improvement :( What happens is I get bottlenecked by reads from drive, I have 768GB RAM and at 32k context it sends some of the weights back to drive, so they have to be loaded back again and again. I can either wait for proper MLA implementation to reduce memory use by context, or get 4 pcie5 NVME drives and put them in RAID0. If I go for 2TB ones it would cost around 1k and would have theoretical speed of DDR4 RAM. Not sure if it’s worth it.
I'm sorry... What the frick are kiwi and chocolate? How have I never heard of these?
Mystery models from lmarena, can only be encountered in battle mode. Quite strong imo.
get 4 pcie5 NVME drives and put them in RAID0
That's not a bad idea. When weights are offloaded to the drive I end up with a 3.2GB/s bottleneck with a single PCIe 4 SSD. I might have to try RAID0 or 2x PCIe 4 SSDs.
theoretical speed of DDR4 RAM
Is this weight reading sequential or random?
I have 768GB RAM and at 32k
Wow, what quant are you running?
Is this weight reading sequential or random?
In the case of a dense model it would be sequential; with a MoE like deepseek it's most likely random.
Wow, what quant are you running?
Q8_0
Perhaps this would interest you:
https://github.com/kvcache-ai/ktransformers/blob/main/doc/en/DeepseekR1_V3_tutorial.md
Apparently they're getting > 200t/s prompt ingestion. I don't have enough RAM to try it.
I'm quanting R1-Zero tonight and it's supposed to be completely batshit crazy - wish me luck :D
Yeah I'm interested as well. I ran a brain-damaged 2-bit quant and found it just thought itself into loops.
I've not tried it yet as I had a drive fail in a raid array and it took ages to rebuild itself :/
Prompt processing is still pretty slow
That was the issue I had with running purely CPU. Generation could get quite fast but prompt processing being single digits makes it unusable for dumping code or story chapters/drafts into.
What the frick are kiwi and chocolate?
I saw them mentioned on reddit, they're on lmsys arena apparently.
Yeah, I've got it to be almost usable but it's still a bit painful.
Tested NUMA stuff, no improvement :( What happens is I get bottlenecked by reads from drive, I have 768GB RAM and at 32k context it sends some of the weights back to drive, so they have to be loaded back again and again. I can either wait for proper MLA implementation to reduce memory use by context, or get 4 pcie5 NVME drives and put them in RAID0. If I go for 2TB ones it would cost around 1k and would have theoretical speed of DDR4 RAM. Not sure if it’s worth it.
The MLA Draft PR is working OK:
https://github.com/ggerganov/llama.cpp/pull/11446
It still allocates the original KV-cache currently (which makes it 100% useless lol), but you can zero them out like this:
safe_sed "src/llama-kv-cache.cpp" "ggml_tensor \* k = ggml_new_tensor_1d(ctx, type_k, n_embd_k_gqa\*kv_size);" "ggml_tensor * k = ggml_new_tensor_1d(ctx, type_k, 1);"
safe_sed "src/llama-kv-cache.cpp" "ggml_tensor \* v = ggml_new_tensor_1d(ctx, type_v, n_embd_v_gqa\*kv_size);" "ggml_tensor * v = ggml_new_tensor_1d(ctx, type_v, 1);"
before compiling and it works fine and can use the whole context in a couple of gigs of RAM now.
It seems about 1.5-2 tokens per second slower using GPU though, and I've spent quite a while trying to work out why (pretty sure it's all the permutations and conts, but I don't really know enough GGML to see exactly).
get 4 pcie5 NVME drives and put them in RAID0. If I go for 2TB ones it would cost around 1k and would have theoretical speed of DDR4 RAM. Not sure if it’s worth it.
This will likely not work anything like what you expect sadly. My brother has a single-CPU Intel workstation that is about the same generation as the new EPYC Turin (uses 6800MT/s+ RAM but I forget the name) and a load of very fast Optane drives, and still gets nothing like what is expected. It also starts to get quite hairy when you push things to this limit and he had to use a "PCI-e Retimer".
Perhaps this would interest you:
https://github.com/kvcache-ai/ktransformers/blob/main/doc/en/DeepseekR1_V3_tutorial.md
Apparently they're getting > 200t/s prompt ingestion. I don't have enough RAM to try it.
Yeah, I've been watching that thread but it still looks a bit cronky.
I'd love to know why llama.cpp batch processing is working so badly currently too. It makes no sense that a batch of 512 tokens, which loads the MoE tensors into VRAM to process anyway, should only be 2-3x the speed of straight-up processing using 30 threads on a 5 year old Xeon with 78GB/s memory bandwidth... Something really odd is happening.
Expert selection strategy that selects fewer experts based on offline profile results of out of domain data
Oh, that looks like why they are getting such a speed-up and seems a bit much to claim it's faster than llama.cpp if it's doing that IMO!?
Intel AMX instruction set
https://en.wikipedia.org/wiki/Advanced_Matrix_Extensions
first supported by Intel with the Sapphire Rapids microarchitecture for Xeon servers, released in January 2023
This also isn't gonna be useful for most people.
Ah my bad, I only skimmed over it (saw it posted on reddit).
I've setup RPC and been able to add 2 more RTX3090's from my other rig, which has sped up pp on the meme-quant:
prompt eval time = 9568.83 ms / 771 tokens ( 12.41 ms per token, 80.57 tokens per second)
eval time = 239749.70 ms / 1589 tokens ( 150.88 ms per token, 6.63 tokens per second)
total time = 249318.52 ms / 2360 tokens
I tried adding my M1 Max and 2xA770 rig as well, but these slowed prompt processing too much (especially the mac, which is GPU-core bound).
I saw your comments in all the llama.cpp commits and tried the -ot exp=CPU you were using, which lets me use much higher context in VRAM, but it offloads to the disk so it's not usable with 128GB DDR5.
Yeah, I've been obsessed with getting this working properly for the last few days.
- The RPC code doesn't do any buffering, so weaving it through all my GPUs just adds more and more latency to the stage where it negates the eventual gains.
- I nearly pulled the trigger on 2x Max Ultra 192GB, but then realised that the MLA code is actually much more compute bound than normal transformers and would likely get horrible prompt processing speed.
Sometime this week I'm going to try on my brother's Intel machine that has those AMX instructions and see if the optimising compiler can make use of them (probably have to compile using Intel oneAPI I think). His machine has 8 channels of 48GB 6800MT/s in it, so I still won't be able to test the very biggest quants, but it should show if it boosts the prompt processing speed like KTransformers. He spent ages tuning the RAM speeds and thinks that if I want to get 8x 96GB the max it will run at is about 6000MT/s, but if the AMX instructions do help the processing speed it will be worth it.
I get the feeling this is going to be peak story writing for a long time, as it seems to be truly "unbiased" (as in not biased towards woke-left ideals), and even if newer/smaller models come out with better reasoning; they will all have taken note of the bad press it got for this and get an extra stage of woke-indoctrination added before release... :/
probably have to compile using Intel oneAPI I think
Painful to get that set up. One tip: the documentation from Intel says to use OneAPI 2024, but a commit to llama.cpp in November made it require OneAPI 2025.
Once you've installed it to /opt/intel, I suggest making a backup: cp -R /opt/intel /opt/intel-2025.
I nearly pulled the trigger on 2x Max Ultra 192GB, but then realised that the MLA code is actually much more compute bound than normal transformers and would likely get horrible prompt processing speed.
Yeah the guy showing it off on reddit didn't post his prompt processing speed lol
Yeah, I've been obsessed with getting this working properly for the last few days.
Yeah I saw you're creating a "fake LoRA" and everything!
Question: With what you did here:
https://github.com/ggerganov/llama.cpp/pull/11397#issuecomment-2645986449
Did you manage to get it to put the entire KV buffer on your GPUs? Or is it just cropped out and you've actually got some on CPU as well?
I did the same thing as you, managed to explicitly load specific experts onto my 2 RPC servers with no KV buffer, but it still put 16gb worth on my CPU, and only 2GB on my local CUDA0-4 devices, despite them having about 17GB VRAM available each!
So did you find a way to make it put the entire KV buffer on your local GPUs?
probably have to compile using Intel oneAPI I think
Painful to get that set up. One tip: the documentation from Intel says to use OneAPI 2024, but a commit to llama.cpp in November made it require OneAPI 2025.
Once you've installed it to /opt/intel, I suggest making a backup: cp -R /opt/intel /opt/intel-2025.
Yeah, I've used the "legacy" Intel compiler a lot over the years, but it started to get worse and more buggy around 2018/2019 when they added some stupid "mitigations" shit.
e.g. Intel's "optimised" valarray straight up did a divide instead of a multiply (or vice versa) and we spent several days trying to work out WTF was going on... In the end we got some minimal code to show it doing this in assembler and never used it again lol.
I nearly pulled the trigger on 2x Max Ultra 192GB, but then realised that the MLA code is actually much more compute bound than normal transformers and would likely get horrible prompt processing speed.
Yeah the guy showing it off on reddit didn't post his prompt processing speed lol
Yeah, I bet it's painful :/ I have got mine up to about 14-15 tokens per second now (for a near 500GB custom quant!).
Yeah I saw you're creating a "fake LoRA" and everything!
Yeah, I got this "working" today, but obviously nobody has been stupid enough before to try loading a 50GB LoRA :D It "worked" but did god knows what and was really slow, and then when I tried to use llama-perplexity it said it loaded the LoRA but worked full-speed and produced NaNs, which is probably because it used the tensors with the LoRA subspace removed...
I'm now exporting it all into the .safetensors files with "_a.weight" and "_b.weight" and gonna fork the MLA PR to use these directly.
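For anyone wondering roughly what that looks like, something like this (purely illustrative: a generic SVD-based low-rank split with made-up tensor names and rank, just to show the "_a.weight" / "_b.weight" naming - it's not how the actual MLA tensors are produced, since those already come factored):

import torch
from safetensors.torch import save_file

def low_rank_split(weight: torch.Tensor, rank: int):
    # Factor W (out x in) into B @ A with the given rank via a truncated SVD.
    U, S, Vh = torch.linalg.svd(weight.float(), full_matrices=False)
    a = (torch.diag(S[:rank]) @ Vh[:rank]).to(torch.bfloat16).contiguous()   # (rank, in)
    b = U[:, :rank].to(torch.bfloat16).contiguous()                          # (out, rank)
    return a, b

w = torch.randn(1024, 512)
a, b = low_rank_split(w, rank=64)
save_file({"model.layers.0.attn.example_a.weight": a,    # made-up names, just to show
           "model.layers.0.attn.example_b.weight": b},   # the suffix convention
          "example_split.safetensors")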
Question: With what you did here:
https://github.com/ggerganov/llama.cpp/pull/11397#issuecomment-2645986449
Did you manage to get it to put the entire KV buffer on your GPUs? Or is it just cropped out and you've actually got some on CPU as well?
I did the same thing as you, managed to explicitly load specific experts onto my 2 RPC servers with no KV buffer, but it still put 16gb worth on my CPU, and only 2GB on my local CUDA0-4 devices, despite them having about 17GB VRAM available each!
So did you find a way to make it put the entire KV buffer on your local GPUs?
Yeah, I did find a way in the end using a mix of negative regexes, --tensor-split, and --devices to reorder the devices, etc., but I can't really remember now exactly what :/
It didn't work well though, because of the latency between the RPC servers: each GPU I added processed the tensors really quickly but then lost the gain by not being buffered... :(
I think I've managed to get a Mac Studio M2 Ultra 24/76/32 ordered (assuming it's not a scam - eBay sellers pfffft).
It will be interesting to see how it performs when using RPC (if it's just the latency again then I'll see if I can figure out what Deepspeed is doing for its pipeline parallel implementation, and see if the RPC server can be improved).
24/76/32 ordered
Awesome, will be good to see the results! Ebay Buyer protection will get you a refund if it's a scam.
if it's just the latency again
Does the latency have more of an impact for MoE? I ran a test with Mistral-Large and got similar prompt processing + text gen with:
- 4x3090 local
- 2x3090 local + 2x 3090 on another rig over RPC.
I wish there were a way to somehow cache the tensors sent over to the RPC servers, as loading the model each time takes forever sending that all over the network.
I managed to get KV Cache all on the local GPUs now and it makes a huge difference with prompt processing.
24/76/32 ordered
Awesome, will be good to see the results! Ebay Buyer protection will get you a refund if it's a scam.
Yeah, I've had several different scams attempted on me buying hardware off eBay over the years and always pay by credit card rather than PayPal to get extra protection now.
if it's just the latency again
Does the latency have more of an impact for MoE? I ran a test with Mistral-Large and got similar prompt processing + text gen with:
- 4x3090 local
- 2x3090 local + 2x 3090 on another rig over RPC.
I think it was just because I was weaving it through so many different cards and systems: having a single "gap" where you send a few KB isn't likely to be that noticeable, but I had 6 "gaps" and I think each had to return the packets to the RPC host who then sent it back (often to the 2nd GPU on the same machine!).
I wish there were a way to somehow cache the tensors sent over to the RPC servers, as loading the model each time takes forever sending that all over the network.
I looked at the code and think there would be a pretty easy way to hack this in by just sending a hash first, but there looks to be a few PRs looking to revive the RPC stuff after it's stalled for 6 months:
https://github.com/ggerganov/llama.cpp/pull/7915
https://github.com/ggerganov/llama.cpp/pull/8032
https://github.com/lexasub/llama.cpp/tree/async-rpc-squashed/ggml/src/ggml-rpc
So might be worth waiting and seeing what comes of these.
I managed to get KV Cache all on the local GPUs now and it makes a huge difference with prompt processing.
Yeah, it's really badly documented, but the mix of --devices to order the devices (found via --list-devices), --tensor-split to divide them up, and the new --offload-tensor PR with regexes (especially negative regexes) is super powerful and you can do almost anything with them combined!
The only thing I wish it could do is specify specific CPUs rather than just "=CPU", as then it could be mixed with numactl to manually set up a much more optimal NUMA setup.
Yeah, I've had eBay scams as well, but the buyer protection has always sided with me. Just a hassle really. I haven't bought a Mac on eBay before, but I think there's something where they can report it stolen and have it locked by Apple. So I guess make sure it's not still on anybody's iCloud account when you get it.
often to the 2nd GPU on the same machine!
Yeah, it's annoying the way we have to run a separate RPC server per GPU. And the Intel/SYCL build seems to over-report available memory slightly, so I had to set it manually, etc. Quite annoying to work with.
is super powerful and you can do almost anything with them combined!
Yeah! The CLI command you posted in that PR really helped me figure all that out. Wouldn't have thought to manually assign all the devices / enforce the order first.
specific CPUs rather than just "=CPU"
I imagine this would be a bigger change, because the rest of llama.cpp is probably only set up to support "CPU" rather than having them as devices.
Funny thing is, in one of my tests for Deepseek, it knew exactly what I'd be suffering through when I copy/pasted the final command in and asked if it understood what I'd done. But it ended its thoughts with:
"Maybe joke about the setup being a "mad scientist" lab but in a good way." which caused it to write things like this in its reply:
This regex-driven tensor routing is something I’d expect in a Clarkesworld story, not a garage. How many model reloads did it take to stop the CUDA/OOM tantrums?
and then later when I removed the Intel ARCs it made another joke: "ARC GPUs as paperweights: Not surprised you ditched them; Intel's SYCL stack for LLM inference still feels like beta-testing a parachute."
I didn't mention the OOM/reload cycles or SYCL; this model is very knowledgeable.
P.S. I noticed using the https://github.com/ggerganov/llama.cpp/pull/11446 (MLA) fork + your sed commands to prevent allocating the old KV cache, after a few messages or messages with code in them, the model breaks down and writes replies like this:
Have you encountered anything like this? I suspect it's because I'm running such a small quant: gghfez/DeepSeek-R1-11446-Q2_K without the Dynamic Quant work Unsloth did.
P.S. I noticed using the https://github.com/ggerganov/llama.cpp/pull/11446 (MLA) fork + your sed commands to prevent allocating the old KV cache, after a few messages or messages with code in them, the model breaks down and writes replies like this:
Have you encountered anything like this? I suspect it's because I'm running such a small quant: gghfez/DeepSeek-R1-11446-Q2_K without the Dynamic Quant work Unsloth did.
It's probably because it's overflowing - I've been running the MLA PR as bfloat16 for all the attention stuff because of this.
I'm pretty sure it's this matrix multiply that is causing it:
struct ggml_tensor * wk_b = ggml_view_3d(ctx0, model.layers[il].wk_b,
    n_embd_head_qk_nope, kv_lora_rank, n_head,
    ggml_row_size(model.layers[il].wk_b->type, n_embd_head_qk_nope),
    ggml_row_size(model.layers[il].wk_b->type, kv_lora_rank * n_embd_head_qk_nope),
    0);
cb(wk_b, "wk_b", il);
q_nope = ggml_permute(ctx0, q_nope, 0, 2, 1, 3);
cb(q_nope, "q_nope_perm", il);
struct ggml_tensor * q_nope2 = ggml_mul_mat(ctx0, wk_b, q_nope);
cb(q_nope2, "q_nope2", il);
https://github.com/ggerganov/llama.cpp/blob/76543311acc85e1d77575728000f1979faa7591f/src/llama.cpp
as storing everything but wk_b as float16 stops it.
I did hope to find some way to downscale wk_b and then upscale the layer_norm, but the layer_norm actually happens before it gets stored in the compressed KV-cache :/
Then it goes off into a huge set of permutations lol.
The official code is so much simpler:
wkv_b = self.wkv_b.weight if self.wkv_b.scale is None else weight_dequant(self.wkv_b.weight, self.wkv_b.scale, block_size)
wkv_b = wkv_b.view(self.n_local_heads, -1, self.kv_lora_rank)
q_nope = torch.einsum("bshd,hdc->bshc", q_nope, wkv_b[:, :self.qk_nope_head_dim])
self.kv_cache[:bsz, start_pos:end_pos] = self.kv_norm(kv)
self.pe_cache[:bsz, start_pos:end_pos] = k_pe.squeeze(2)
scores = (torch.einsum("bshc,btc->bsht", q_nope, self.kv_cache[:bsz, :end_pos]) +
torch.einsum("bshr,btr->bsht", q_pe, self.pe_cache[:bsz, :end_pos])) * self.softmax_scale
https://github.com/deepseek-ai/DeepSeek-V3/blob/main/inference/model.py
and I get the feeling that doing all those permutations is what is making the MLA PR run so much slower than the official version for MLA.
The einsum stuff looks complex, but it's really just changing the order of nested for loops that iterate over the data unchanged...
The permutations in llama.cpp are all there just to make this fit the standard batched matrix multiply idiom and its lack of an einsum equivalent.
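For anyone trying to follow that, here's a toy example (made-up dimensions) showing that the first einsum is just a per-head matrix multiply written as nested loops:

import torch

b, s, h, d, c = 2, 3, 4, 8, 16                  # batch, seq, heads, head dim, latent dim
q_nope = torch.randn(b, s, h, d)
wkv_b_nope = torch.randn(h, d, c)               # the wkv_b[:, :qk_nope_head_dim] slice

# The einsum used in the official DeepSeek code:
out_einsum = torch.einsum("bshd,hdc->bshc", q_nope, wkv_b_nope)

# The same thing as plain nested loops (what llama.cpp has to express
# via permutes and batched matmuls instead):
out_loops = torch.zeros(b, s, h, c)
for bi in range(b):
    for si in range(s):
        for hi in range(h):
            out_loops[bi, si, hi] = q_nope[bi, si, hi] @ wkv_b_nope[hi]

print(torch.allclose(out_einsum, out_loops, atol=1e-5))   # True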
https://github.com/ggml-org/llama.cpp/pull/11446#issuecomment-2661455127
This should give quite a big boost in token generation speed for the MLA branch, and possibly much more if you have a CPU with a lot of cache.
https://aomukai.com/2025/02/16/writingway-if-scrivener-had-ai-implementation/
https://github.com/aomukai/Writingway
https://www.reddit.com/r/WritingWithAI/comments/1iqqi86/writingway_a_free_open_source_software_that/
This looks quite well done - written by someone who's not a programmer using LLMs, so maybe a bit rough round the edges, but it's the first attempt I've seen of someone trying to create an offline app similar to Novelcrafter (which makes zero sense as an online tool with a subscription...).
Been reading some of your work @jukofyork, a lot of interesting stuff, particularly https://github.com/ggml-org/llama.cpp/discussions/5263#discussioncomment-9396351 and the stuff you've been posting in the MLA PR. Wondering if you wanted to chat at all and see if there's anything we can do to improve llama.cpp quant performance (or we can talk publicly here, don't mind either way :D). I'm thirsty for some big quality improvements; feels like there's some low hanging fruit, but I'm not as familiar with the llama.cpp codebase as you seem to be getting.
This should give quite a big boost in token generation speed for the MLA branch, and possibly much more if you have a CPU with a lot of cache.
Nice, did I read that correctly, in that it will fix the overflow issue?
a bit rough round the edges
It looks more polished than what I've hacked together locally. I'll try switching to it after adding a feature my system has (click a token -> 5 probabilities displayed -> click one of them -> the token is replaced and generation resumes from that point)
@bartowski
You sure you won't regret making a comment in this thread? Apparently there's no unsubscribing now ;)
I currently have 2148 unread messages, this thread will be the least of my concerns 😂
Mainly because there's no good way to follow an org to see new models without also following all their model threads :')
.... doom DOOM!
Been reading some of your work @jukofyork, a lot of interesting stuff, particularly https://github.com/ggml-org/llama.cpp/discussions/5263#discussioncomment-9396351 and the stuff you've been posting in the MLA PR. Wondering if you wanted to chat at all and see if there's anything we can do to improve llama.cpp quant performance (or we can talk publicly here, don't mind either way :D). I'm thirsty for some big quality improvements; feels like there's some low hanging fruit, but I'm not as familiar with the llama.cpp codebase as you seem to be getting.
I'll just post here if that's OK. The biggest problem with all this is that for some reason ikawrakow (the guy who implemented all the quants in llama.cpp, and who has made his own fork now) fell out with the llama.cpp devs, and I don't want to get involved nor pour water on the fire... The very best outcome would be if they resolved their differences and carried on working together IMO :)
Anyway, there are two main problems with the quants as they are:
1. Bias in the imatrix calculations
There are numerous sources of statistical bias in both the imatrix creation process and the imatrix evaluations:
A. The samples are unrepresentative of real world data, and often the sample used to evaluate the "improvement" is very distributionally similar to that used to create the imatrix itself
I know you have gone some way to fixing this by using better semi-random mixes of data, but if you think about the likely distribution of tokens for a coding model, a reasoning model, a creative writing model and a language-translation model, then it should be clear that these are all very different distributionally.
The fact that this guy has been successfully using the bias to influence the writing styles of models should be a big warning as to this problem.
B. The llama-imatrix code doesn't use the proper chat templates and also just breaks the "chunks" at random points
For some model this probably doesn't matter much, but for certain models it's likely a disaster:
- The Mistral AI models (especially miqu-1:70b) are amazingly sensitive to even the slightest change of prompt (again something people have actually exploited to make them write better/differently).
- I can envision this being quite a big problem with the new reasoning models, that also seem very sensitive to the prompt template and expect a more rigid order with the <think> stuff being first, etc.
C. By default the llama-imatrix code only looks at context lengths of 512 tokens
This is the most serious IMO and likely causes a huge amount of problems that aren't at all obvious at first sight:
- Some of the tensors in LLMs seem to work the same all the way through the context (as the control vectors and "abliteration" show for down_proj and out_proj). The down_proj matrix is so insensitive to this that you can actually just sample 1 token and apply the same direction all throughout the context (which was very surprising to me!).
- Some of the tensors absolutely don't work like this and are hugely hurt by sampling such small sequences, namely the q_proj and the k_proj tensors which have positional information added to them via RoPE.
In general all the different types of tensors in an LLM are likely hurt to (very) different degrees by this, and it is probably the most important thing to avoid (if I have to use an imatrix currently, then I use your 'calibration_datav3.txt' but make sure to bump all the attention matrices up to at least Q6_K to avoid this problem, as the imatrix weighting logic for the Q6_K+ code is commented out in llama.cpp and not used).
D. Differing activation frequencies for MoE models
The original mixtral models only had 8 experts and 2 were activated per token, so the effective sample size of the 3 sets of expert MLP tensors is only 1/4 of that of all the other tensors. This probably wasn't a very big problem then, but as the expert count and the sparsity have increased, this has become more and more of a problem.
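To put rough numbers on that (expert counts quoted from memory, so treat them as approximate):

# Rough effective-sample comparison for the routed-expert tensors
# (expert counts from memory - check the model configs).
for name, n_experts, n_active in [("mixtral-8x7b", 8, 2), ("deepseek-r1", 256, 8)]:
    print(f"{name}: each routed expert sees ~{n_active / n_experts:.1%} of the tokens")
# mixtral-8x7b: each routed expert sees ~25.0% of the tokens
# deepseek-r1: each routed expert sees ~3.1% of the tokens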
I did manage to fix some of this back when dbrx first dropped:
https://github.com/ggerganov/llama.cpp/pull/7099 (see here for the main discussion: https://github.com/ggerganov/llama.cpp/pull/6387#issuecomment-2094926182)
but this only really fixed the divisors and doesn't really fix the root problem...
So these problems aren't unique to imatrix creation, and bias in statistics is always a problem. There are two main ways to deal with this:
1. Manually consider each of the points above (and any others I've forgotten to mention)
The problem with this is that they all involve either a large increase in compute or a large increase in thought/effort regarding data preparation.
2. Regularisation
I've tried to put forward the use of regularisation for this several times:
https://github.com/ggml-org/llama.cpp/discussions/5263
https://github.com/ikawrakow/ik_llama.cpp/discussions/140
But the method ikawrakow used actually is a form of non-standard regularisation he found empirically, which works very well in practice, but isn't very conducive to the introduction of a regularisation-factor to control the amount of regularisation applied...
I actually tried more "standard" methods of regularising the weighting factors, but his method clearly seemed to work better, and I assumed for a long time it must be some signal-processing voodoo - thankfully he explained in the recent post where it came from! :)
This isn't impossible to work around though, and I could easily adapt the existing imatrix.cpp code to use something like bootstrapping to estimate the variance and then add an option to imatrix to shrink back the estimates towards a prior of "all weighted equally" using something similar to the "One Standard Error Rule".
In reality, a mix of (1) and (2) together would probably be needed (ie: you can't fix the 512-token problem easily with regularisation alone).
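To make the shrinkage idea concrete, here's a rough sketch of the estimator in Python (not actual imatrix.cpp code; lam stands in for whatever the bootstrap / "one standard error" rule would choose):

import numpy as np

def shrunk_importance_weights(activation_sq_sums, counts, lam):
    # activation_sq_sums: per-column sums of squared activations (what imatrix collects)
    # counts:             number of samples that contributed to each column
    # lam:                0.0 = raw imatrix weights, 1.0 = "all weighted equally"
    raw = activation_sq_sums / np.maximum(counts, 1)   # mean squared activation per column
    uniform = np.full_like(raw, raw.mean())            # the "no information" prior
    return (1.0 - lam) * raw + lam * uniform

# e.g. columns with few/noisy samples could be pushed harder towards the prior
# by picking lam from a bootstrap estimate of the per-column variance.
weights = shrunk_importance_weights(
    activation_sq_sums=np.array([4.0, 0.1, 9.0, 2.5]),
    counts=np.array([1024, 1024, 64, 1024]),
    lam=0.3,
)
print(weights)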
2. Outdated heuristics in llama_tensor_get_type()
(I'll write this later today)
This should give quite a big boost in token generation speed for the MLA branch, and possibly much more if you have a CPU with a lot of cache.
Nice, did I read that correctly, in that it will fix the overflow issue?
The overflow only seems to occur if you use float16 and everything else is fine (possibly only for [non-CuBLAS] CUDA code too - see the recent reply from Johannes Gäßler).
The massive slowdown for MLA seems to be some oversight in the code, and I suspect it's due to either:
- Repeatedly dequantising the two _b tensors instead of dequantising once and reusing for the whole batch.
- Re-quantising the multiplier and multiplicand to QK8_1 or something related to this (it's a very impenetrable bit of code).
- Using the _b tensors transposed so the float scaling factor is applied for every weight instead of every 32 weights.
I tried to follow the code to see if I could find it, but it's very complex due to the compute-graph ops causing indirection, and the CUDA stuff in general is very complex and hard to understand.
Just using float32 for those 2 tensors seems to fix the problem for now though.
I appreciate the large write up @jukofyork ! I'll try to go line by line replying :)
Re: ikawrakow, yeah I'm aware of the fallout and the fork, I don't think meddling with upstream llama.cpp would cause any fire or make any issues, I'd be willing to reach out to him directly though if we have any specific concerns to make double sure!
I definitely agree with the bias introduction. I have long felt any increase in bias was likely offset by the improved overall accuracy versus the original model. I did a test a few months back where I compared a static quant and my imatrix quant versus a Japanese wikitext and found that, despite my imatrix containing no Japanese characters, the KLD improved with my imatrix versus the static, which led me to the (admittedly weak and flawed) conclusion that there must be enough overlap in important tokens no matter the context that the imatrix was a net benefit. I'd like to challenge that conclusion if possible, but obviously easier said than done.
I wouldn't draw too many conclusions from the author you linked personally but that's a discussion for another time
Regarding chat templates, I actually had a thought about this a couple of weeks ago that I wanted to test: I was curious whether there would be any notable improvement if I took a high-quality Q/A dataset, turned it into multi-turn conversations with the chat template applied, and used THAT as the corpus of text. I don't think that would necessarily represent the final result, but I think it's an interesting middle step, especially if we can directly compare the raw Q/A dataset as the corpus versus the chat-templated Q/A dataset. Since perplexity datasets also wouldn't have the chat templates, this would require more aggressive benchmarking, which I wouldn't be opposed to doing/funding (MMLU-Pro etc.).
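Just to illustrate the middle step, something like this could stitch the Q/A pairs into a chat-formatted corpus (a sketch only - it hard-codes a ChatML-style template, whereas a real test would use each model's own chat template and special tokens):
#include <string>
#include <utility>
#include <vector>

// Wrap question/answer pairs in a ChatML-style template so the imatrix corpus
// "sees" the same turn structure and special tokens as real chat usage.
// The template below is only an example - substitute the target model's own.
static std::string build_chat_corpus(const std::vector<std::pair<std::string, std::string>> & qa_pairs) {
    std::string corpus;
    for (const auto & qa : qa_pairs) {
        corpus += "<|im_start|>user\n"      + qa.first  + "<|im_end|>\n";
        corpus += "<|im_start|>assistant\n" + qa.second + "<|im_end|>\n";
    }
    return corpus;
}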
The imatrix context length - yes, I can see that being a problem. I do wonder if there is a large difference in activations for short context versus long; it wouldn't surprise me in either direction, but it is something that should become known.
I know compilade (won't tag, to avoid what is apparently a thread from hell) was working on fixing issues with loading multiple chunks with a larger batch size here: https://github.com/ggml-org/llama.cpp/pull/9400 but I don't know if that would also allow larger chunk sizes. Is this a fundamental flaw with imatrix, or is it just something that we don't do? Would it make sense to make an imatrix dataset from 2 passes, possibly even on the same corpus, one with chunk sizes of 512 and one with chunk sizes of 2000+?
Do you use your own fork to make the attention matrices go up to Q6_K? This is something I'd be interested to explore, similar to the _L variants I release that have the embed/output weights at Q8_0. I also didn't know imatrix was disabled for Q6_K - are you positive about this? Or is it specifically that the attention matrices have it commented out?
That's interesting about regularisation, I'll have to read up more on that.
Anyways, if there's anything I can do to help or enable development, please let me know, I'm very interested in keeping llama.cpp at the forefront of performance especially in terms of performance per bit when quantizing, and I think there's definitely some work that can be done.
I was also theorizing about the possibility of measuring the impact of quantization of layers through some sort of KLD measurement method: take the original model in fp16+ (or Q4+ for extremely large models) and measure its logits against a corpus of text, then sequentially quantize each layer down to Q2_K or lower (so layer 1 is all Q2_K while the rest is fp16, measure the logits; then layer 2 at Q2_K with the rest at fp16, measure the logits; and so on) and measure the change in logits to see which layers have the least impact on the final results when crushed to low precision. This could then be used as an ad-hoc method of determining which layers are better represented at higher precision and which at lower (similar to unsloth's "dynamic" quants but with more automation).
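A rough sketch of just the scoring part of that idea (assuming the baseline logits and the "one layer crushed to Q2_K" logits have already been dumped for the same corpus positions - the kld_from_logits name and the layer-swapping loop around it are assumptions):
#include <algorithm>
#include <cmath>
#include <cstddef>
#include <vector>

// KL divergence between the baseline model's next-token distribution and the
// distribution from the model with a single layer crushed to low precision.
// Both inputs are raw (pre-softmax) logits over the vocabulary for one position;
// average this over many positions to get a sensitivity score for that layer.
static double kld_from_logits(const std::vector<float> & base_logits,
                              const std::vector<float> & test_logits) {
    auto softmax = [](const std::vector<float> & logits) {
        float max_logit = logits[0];
        for (float l : logits) max_logit = std::max(max_logit, l);
        double sum = 0.0;
        std::vector<double> p(logits.size());
        for (size_t i = 0; i < logits.size(); ++i) {
            p[i] = std::exp((double) (logits[i] - max_logit));
            sum += p[i];
        }
        for (double & v : p) v /= sum;
        return p;
    };
    const auto p = softmax(base_logits);
    const auto q = softmax(test_logits);
    double kld = 0.0;
    for (size_t i = 0; i < p.size(); ++i) {
        if (p[i] > 0.0) kld += p[i] * std::log(p[i] / std::max(q[i], 1e-12));
    }
    return kld;
}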
In response to concerns about compute complexity, I would posit the possibility of creating a single run per arch/size, since it's been shown that imatrix (and even exl2 calibration) differs to a surprisingly small degree across finetunes of the same model, so if we created, say:
- llama 3 1/3/8/70b
- Qwen2.5 0.5,1.5,3,7,14,32,72b
- Gemma 2 2/8/27b
etc etc, we would have a master list of these layer importances that likely would not change even with finetuning. I could be wrong but it could also be a potential performance gain
Do you use your own fork to make the attention matrices go up to Q6_K? This is something I'd be interested to explore, similar to the _L variants I release that have the embed/output weights at Q8_0.
I just hack the llama_tensor_get_type() function and recompile.
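ie: something along these lines, dropped in before the existing heuristics (a sketch of the idea rather than my exact code):
// Force every attention tensor up to Q6_K; everything else is left to the
// default llama_tensor_get_type() rules.
if (name.find("attn_") != std::string::npos) {
    new_type = GGML_TYPE_Q6_K;
}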
I also didn't know imatrix was disabled for Q6_K - are you positive about this? Or is it specifically that the attention matrices have it commented out?
Just looking again and it's not fully commented out:
static void quantize_row_q6_K_impl(const float * restrict x, block_q6_K * restrict y, int64_t n_per_row, const float * quant_weights) {
assert(n_per_row % QK_K == 0);
const int64_t nb = n_per_row / QK_K;
int8_t L[QK_K];
float scales[QK_K/16];
//float weights[16];
for (int i = 0; i < nb; i++) {
//float sum_x2 = 0;
//for (int j = 0; j < QK_K; ++j) sum_x2 += x[j]*x[j];
//float sigma2 = sum_x2/QK_K;
float max_scale = 0;
float max_abs_scale = 0;
for (int ib = 0; ib < QK_K/16; ++ib) {
float scale;
if (quant_weights) {
const float * qw = quant_weights + QK_K*i + 16*ib;
//for (int j = 0; j < 16; ++j) weights[j] = qw[j] * sqrtf(sigma2 + x[16*ib + j]*x[16*ib + j]);
//scale = make_qx_quants(16, 32, x + 16*ib, L + 16*ib, 1, weights);
scale = make_qx_quants(16, 32, x + 16*ib, L + 16*ib, 1, qw);
} else {
scale = make_qx_quants(16, 32, x + 16*ib, L + 16*ib, 1, NULL);
}
scales[ib] = scale;
const float abs_scale = fabsf(scale);
if (abs_scale > max_abs_scale) {
max_abs_scale = abs_scale;
max_scale = scale;
}
}
I honestly can't remember if it was fully commented out before, or if I traced the code to make_qx_quants() and found it wasn't being used.
It's definitely not using the sqrtf(sigma2 + x...) empirical regularisation even now though.
I was also theorizing about the possibility of measuring the impact of quantization of layers
The key thing here is you have to have some clear metric to optimise for and the current "perplexity for 512 tokens" method definitely isn't going to work well on recent models (and may actually make them much worse).
All the metrics that use the post-softmax probabilities actually throw away a lot of information about the effect of quantisation, and it may be much more sample efficient to take measurements on either the final hidden state or the pre-softmax logits.
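For contrast, here's what a metric on the pre-softmax logits (or the final hidden state) might look like - again just an assumption-laden sketch to show the difference, not a proposal for the actual metric:
#include <cstddef>
#include <vector>

// Mean squared error between the full-precision and quantised models'
// pre-softmax logits (or final hidden states) for one position. Unlike KLD on
// the post-softmax probabilities, this keeps information about how the whole
// vector moved, which the softmax otherwise throws away.
static double logit_mse(const std::vector<float> & base, const std::vector<float> & quant) {
    double sum = 0.0;
    for (size_t i = 0; i < base.size(); ++i) {
        const double d = (double) base[i] - (double) quant[i];
        sum += d * d;
    }
    return sum / (double) base.size();
}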
I think deciding what we want to optimise for is actually harder here than the actual optimisation process. I haven't got time today, but I'll post about this tomorrow...
@bartowski Here is a discussion about this from last year:
https://github.com/ggml-org/llama.cpp/pull/6844#issuecomment-2192834093
Somewhere around the same time there was another discussion in a similar thread where I mentioned the idea of using a "surrogate model" instead of directly trying to optimise it:
https://en.wikipedia.org/wiki/Surrogate_model
The most appropriate being Gaussian Processes (aka "Bayesian Optimisation"):
https://en.m.wikipedia.org/wiki/Gaussian_process
or regression trees:
https://en.m.wikipedia.org/wiki/Decision_tree_learning
But again, it all comes down to "what are we trying to optimise?"...
Man it's been painful to get the MLA branch fully working, but I think I've finally got it going and oh boy can this thing write some good Grimdark!
My custom BF16/Q6_K/Q5_K quant (480GB) runs at:
prompt eval time = 54856.58 ms / 1071 tokens ( 51.22 ms per token, 19.52 tokens per second)
eval time = 604165.18 ms / 1812 tokens ( 333.42 ms per token, 3.00 tokens per second)
and stays pretty consistent.
The equivalent BF16/Q4_K/Q4_K quant (365GB) runs quite a bit faster, but definitely doesn't seem quite as good.
Are you running this on Apple? What is the BF16 used for in the quant? I didn't think it was fast on anything but Macs, and not at all on CUDA.
But maybe it's only for a small portion and it's better on CPU than I'm thinking (I just know that imatrix is painfully slow on bf16 lol)
Are you running this on Apple? What is the BF16 used for in the quant? I didn't think it was fast on anything but Macs, and not at all on CUDA.
But maybe it's only for a small portion and it's better on CPU than I'm thinking (I just know that imatrix is painfully slow on bf16 lol)
I've had lots of problems with weird overflows and slowdowns and only just got it working tonight properly:
- The BF16 tensors are everything but the experts, and are on two A6000 GPUs.
- The experts are Q6_K for the down_proj and Q5_K for the up_proj and gate_proj, and are in RAM.
The BF16 PR for llama.cpp was merged a few weeks ago and seems to work fine. It can't use F16 for certain tensor(s) or it overflows, and any quant causes some weird slowdowns where the prompt processing takes 2.5x longer and the token generation drops by about 1 token/s (which is a lot when you're only getting 3 tokens/s!).
It's been a long slog getting it all working, but it's nuts how good it is at writing!
No other LLM I've ever tried can come even remotely close to being able to write chapters out of order like this:
USER
Varis adjusted the noose, its hemp fibers grinding beneath his calluses. “Last chance,” he said, voice like gravel dragged through mud. “Confess, and your soul stays your own.”
Jurl laughed—a wet, gurgling sound. “You’re knee-deep in it, Coldwater. ” The thing inside him twisted the boy’s lips into a grin too wide for his face. “The Great Wolf’s howlin’ again. The Dead’s Gate’s rusted through… ”
Turn this into the opening chapter of a Grimdark trilogy.
ASSISTANT
Thought for 272 seconds
> Okay, so I need to turn this dialogue and scene into the opening chapter of a Grimdark trilogy. Let me start by understanding what's given. There's Varis adjusting a noose, made of hemp, and his voice is described as gravelly. He's giving someone a last chance to confess. Then Jurl laughs, but it's a wet, gurgling sound, implying maybe he's possessed or something inside him. The entity mentions the Great Wolf and the Dead's Gate rusting through. > > Grimdark is typically dark, gritty, morally ambiguous. So I need to set a tone that's oppressive, maybe with elements of decay, hopelessness, and visceral imagery. The opening chapter should hook readers with tension and introduce the grim elements. > > First, I need to expand the scene. Where are they? Probably a place of execution, maybe a town square, or a desolate area. The setting should reflect the grimdark theme—maybe a dying town, surrounded by decay. Varis is an executioner or a lawman, but in a grimdark world, he might have his own flaws or dark past. > > Jurl is possessed by something, so there's a supernatural element. The Great Wolf could be a deity or a harbinger of doom. The Dead's Gate rusting through suggests a barrier between the living and dead is failing, which could be a major plot point for the trilogy. > > I should build the atmosphere with sensory details: the smell of rot, the sound of wind, the feel of the rough hemp. Maybe add some onlookers to show the community's state—perhaps they're fearful, desperate, or indifferent. > > Characterization: Varis is hardened, maybe weary. His calluses suggest years of this work. The thing inside Jurl is malevolent, taunting Varis with cryptic warnings. The dialogue hints at a larger conflict—Varis being involved in something bigger ("knee-deep in it"), so maybe he's part of a group that tried to prevent this, but failed. > > Plot progression: The execution goes wrong when Jurl's body transforms, the Dead's Gate opens, unleashing horrors. This sets off the events of the trilogy. Varis survives, now burdened with the knowledge that the apocalypse is coming, and he has to act. > > Structure-wise, start with the execution scene, build tension through dialogue and setting, then the supernatural event occurs, leading to chaos. End the chapter with Varis realizing the gravity and setting off on his quest. > > I need to make sure the prose is tight, with vivid, harsh imagery. Avoid info dumps, but sprinkle in world-building through details. Maybe include some flashbacks or internal thoughts from Varis to hint at his backstory without slowing the pace. > > Potential pitfalls: Over-explaining the supernatural elements too soon. Keep it mysterious. Also, ensuring the grimdark tone is consistent—no heroes, just flawed people in a harsh world. > > Possible additions: A secondary character, like a priest or another townsfolk, to interact with Varis, adding depth. Maybe show the town's desperation through their reactions. Also, after the Dead's Gate opens, describe the horrors unleashed in a visceral way to emphasize the stakes. > > Check for consistency in the setting—what season is it? Maybe autumn with decaying leaves, or winter with a bitter cold. The time of day could be dawn or dusk for symbolic effect. > > Make sure the action flows smoothly from the dialogue to the transformation, to the gate opening, and the aftermath. Keep sentences varied in length to control pacing—short, choppy sentences during intense moments, longer ones for description. 
> > Names: Varis Coldwater—sounds like a northern, rugged name. The Great Wolf could be a local legend or a real entity. Dead's Gate is a good name for a portal to the afterlife or a prison for the dead. > > End the chapter with a hook—maybe Varis taking a token from Jurl's body, or a vision of the Great Wolf, prompting him to act. Alternatively, the town being overrun, forcing him to flee, setting up the next chapter's journey.The gallows creaked like a dying man’s breath. Varis Coldwater stood atop the splintered platform, the noose in his hands as familiar as sin. Below, the town of Blackmoor huddled—a carcass of leaning timber and crumbling stone, its people gaunt shadows gathered in the square. They reeked of desperation, their eyes hollow as the sockets of the skulls nailed above the butcher’s stall. Autumn had stripped the oaks bare, and the wind carried the stench of rot from the moors, where the peat devoured the unwary.
“Last chance,” Varis growled, the hemp rough against his palms. He’d hanged seven men on this rope. Their faces blurred now, but their screams never did. “Confess, and your soul stays your own.”
Jurl dangled, wrists bound, his boyish frame too slight for the iron chains. Blood crusted his temple where the mob had struck him—before they lost their nerve. Before they saw the black veins spidering beneath his skin.
The thing inside Jurl laughed, a sound like drowned lungs heaving. “You’re knee-deep in it, Coldwater.” The boy’s jaw cracked sideways, teeth splintering as the grin split his face. “The Great Wolf’s howlin’ again. The Dead’s Gate’s rusted through…”
A woman in the crowd crossed herself. Old Tessa, the midwife. Her lips moved in a silent prayer to saints long forgotten. Varis didn’t believe in prayers. Only the work.
He yanked the lever.
The trapdoor crashed open—
—and Jurl twisted.
The boy’s body snapped taut mid-fall, suspended by nothing. Chains exploded, shrapnel shredding the air. A farmer screamed as a link buried itself in his thigh. The crowd scattered, but Varis stood frozen, watching as Jurl’s spine arched, ribs punching through flesh like antlers. The thing wearing his skin howled, a chorus of voices that scratched at the sky.
“YOU SHOULD HAVE LISTENED, HANGMAN.”
The earth shuddered. Graves in the churchyard split, skeletal hands clawing free. The air curdled, and the horizon rippled—a massive gate of tarnished iron materializing in the distance, its hinges weeping rust. Beyond it, shadows writhed. Hungry. Familiar.
Varis’s hand flew to the amulet beneath his shirt—a wolf’s fang, warm as a heartbeat. Mari’s fang. The last relic of the order that had burned with her.
“The Gate…” Old Tessa whispered, collapsing to her knees. “*Gods save us, it’s open.*”
Jurl’s corpse hit the ground, deflated. But the gate remained. And the shadows swarmed.
Varis was running before he’d decided to move, snatching his axe from the execution block. A child’s wail cut through the chaos—a girl trapped beneath a cart, her mother’s body splayed beside her. He lunged, hauling the cart aside, but the shadows were faster. They coiled around the girl, viscous as tar, and her scream dissolved into a wet gurgle.
“Bastards!” he roared, swinging the axe. The blade passed through the darkness harmlessly.
Mari’s voice, sharp in his memory: “Steel won’t bite the dead, you fool. Only blood. Only sacrifice.”
The amulet burned.
He fled, the gate’s howl chasing him into the moors. Behind, Blackmoor burned, its people reduced to puppets jerking on shadow-strings. He didn’t look back.
Coward, the wind hissed. Cripple.
His leg—the old injury from the siege at Veldmark—throbbed. But he ran, the fang’s heat searing his chest.
The Great Wolf was coming. And Varis Coldwater, last of the Grey Order, would need to remember how to bite.
End of Chapter One
Hook: The chapter closes with Varis escaping into the moors, the amulet (a relic tied to the Great Wolf) guiding him toward a forgotten shrine. His internal conflict—guilt over past failures vs. the ruthless pragmatism needed to survive—sets the tone. The Dead’s Gate’s opening unleashes minor horrors here, but hints at greater threats (the Wolf, political rot in the realm). The next chapter could introduce a survivor from the town (Old Tessa?) or a rival hunter sent to silence him, escalating tension.
USER
Can you now write the prologue of the book which should come before this chapter. It should be a slower pace and foreshadow something much further into the world's history than happened in chapter 1.
ASSISTANT
Thought for 206 seconds
> Okay, I need to write the prologue for this Grimdark trilogy that comes before the existing Chapter One. The user wants the prologue to have a slower pace and foreshadow something deeper in the world's history, beyond the events of the first chapter. Let me start by understanding what's already in Chapter One. > > In Chapter One, we have Varis Coldwater, an executioner dealing with a possessed boy, Jurl. The Dead's Gate opens, unleashing horrors. There's mention of the Great Wolf, an amulet with a wolf's fang, the Grey Order, and a past siege at Veldmark. The prologue should hint at something older, maybe the origins of the Dead's Gate, the Great Wolf, or the Grey Order. > > The prologue needs to be slower, more atmospheric. Maybe set in the past, showing an event that led to the current state. Foreshadowing could involve ancient rituals, past failures, or the creation of the amulet. Let me think about themes: cycles of violence, forgotten lore, sacrifices, ancient evils. > > Possible ideas: A historical event where the Dead's Gate was first sealed, involving the Grey Order. Maybe a sacrifice that went wrong or had consequences. Introduce the Great Wolf as a mythic figure, perhaps a guardian or a cursed being. Use a different POV, maybe a character from the past whose actions echo into the present. > > Let me outline a scene. Maybe a group of warriors (ancestors of the Grey Order) performing a ritual to seal the Dead's Gate. Show their desperation, the cost of the ritual, hint at a betrayal or a necessary sacrifice. Include elements that connect to Chapter One, like the wolf imagery, the fang amulet, rusted gates. > > Slow pacing means more description, build tension through atmosphere rather than action. Use symbolism: wolves, rust, decay, shadows. Maybe end the prologue with a cryptic warning or a prophecy that ties into Varis's journey. > > Need to link the prologue's events to the present. Maybe the ritual's failure allowed the Gate to eventually rust open. The amulet in Chapter One could be a relic from this prologue. Introduce a character whose legacy affects Varis, like Mari from his memories. > > Avoid direct references to Chapter One characters, keep it historical. Use language that's archaic to differentiate the prologue's time period. Foreshadow the Great Wolf's return and the Gate's vulnerability. > > Check for consistency with the existing chapter's lore. The Dead's Gate is mentioned as rusted through, so the prologue could show its initial sealing with iron, which over time rusts. The Great Wolf howling again ties to a cyclical threat. > > Alright, structure: Start with a setting in the past, a ritual scene. Warriors preparing, tension high. Describe the environment, the Gate's appearance. Introduce a leader figure making a sacrifice, using the wolf motif. Show the ritual's immediate success but hint at future consequences. End with a foreboding note, maybe the cost of the ritual or a hidden flaw.Long before Blackmoor’s gallows cast their shadows, before the Grey Order’s sigils faded to ash, there was the Wolf—and the Gate.
The air reeked of burnt thyme and blood.
Elynna of the First Keep knelt in the snow, her hands raw from digging. The earth here was unyielding, even to a queen. Frost gnawed through her furs, but she did not shiver. Fear had hollowed her into something colder. Around her, the remnants of the Nine Clans waited, their banners limp in the windless dark. The Wolf’s Maw Pass loomed above, jagged as broken teeth, and beyond it… the Veil.
“It’s not enough,” whispered Kethran, her Shield-Captain. His sword lay abandoned in the snow, its edge notched from cleaving things that bled black. “The pact demands more.”
Elynna did not look at him. She cradled the child instead—a boy, swaddled in wool dyed with juniper and woad. His mother had named him Haelyn. A good name. A sacred name. Now it was a curse.
“The Wolf took his due already,” she said. Her voice sounded distant, even to herself. “The villages. The southern host. My son.”
“Not his hunger,” Kethran said.
The Veil rippled.
It hung in the air like rotten lace, a scar torn across the pass. Through its threads, Elynna glimpsed the Otherroot Forest—trees bent into screams, rivers thick with ash. The Dead’s Gate was not yet forged, but its architects clawed at the edges of the world. She could hear them. Whispering.
“The iron,” she ordered.
They brought it raw from the mountain’s heart, molten and shrieking. The smiths, those who still lived, channeled the seething metal into grooves Elynna had carved. Glyphs older than language. Older than gods. The snow hissed where the iron touched it, bleeding steam into the night.
When the last sigil cooled, she placed the child atop the central rune.
“Forgive me,” she murmured—not to the babe, but to the shadow pooling beneath the stars. To the Wolf, who was not a god but a debtor.
The ritual began.
They sang no hymns. The Clans knew better than to court the attention of greater hungers. Instead, they slit their palms and let their blood freeze on the stones. Elynna pressed her dagger to the child’s throat. A single drop fell, smoking where it struck the iron.
The Veil screamed.
For three breaths, the world hung suspended. Then, with a sound like a thousand bones snapping, the Dead’s Gate erupted from the earth—iron ribs, a arch of tarnished teeth, its surface etched with the same glyphs that now glowed scarlet beneath the child. The Veil frayed, stitches dissolving, as the Gate sealed itself shut.
The Clans roared. Kethran wept.
But Elynna stared at the child’s body. No breath fogged the air above his lips. His eyes, though—
His eyes were open.
Black as the space between stars.
“We paid,” Kethran breathed, gripping her shoulder. “The Wolf accepted the pact. The Gate is sealed.”
Elynna said nothing. She wrapped the boy in her cloak, fingers lingering on the single milk tooth in his tiny fist. A tooth she had not placed there.
When the Clans retreated, drunk on victory, she remained. The mountain’s whispers followed her:
“A crown of iron cannot halt decay…
The Wolf’s debt is paid…
…but rust remembers.”
By dawn, the snow had buried the glyphs. By winter’s end, Elynna’s hair turned white. And when the first ravens brought word of black veins spreading in the north—of crops that sprouted teeth, of babes born with hollow eyes—she smiled.
For she knew the truth:
All gates open both ways.
End of Prologue
Foreshadowing & Themes:
- The Wolf’s Pact: The ritual’s “success” established a cycle—the Gate can only be sealed through sacrificial bargains with the Wolf, a primordial force that demands escalating cruelty. This ties to Varis’s amulet (crafted from a relic of these events) and the Dead’s Gate’s resurgence in Chapter One.
- The Child’s Eyes: The hollow-eyed boy foreshadows Jurl’s possession and the novel’s exploration of “vessels” for ancient evils.
- Rust as a Motif: The prologue’s closing lines (“rust remembers”) hint at the Gate’s inevitable return, mirroring Chapter One’s “rusted through” imagery.
- Moral Decay: Elynna’s grim acceptance of collateral damage (“babes born with hollow eyes”) mirrors Varis’s moral compromises, grounding the trilogy’s grimdark tone in generational rot.
Pacing: Deliberately slow, steeped in ritual and silence. Contrasts with Chapter One’s chaos, establishing a mythic scale before narrowing to Varis’s personal struggle.
USER
Chapter 3 should introduce somebody from "the other side"
ASSISTANT
Thought for 124 seconds
> Okay, the user wants Chapter 3 to introduce someone from "the other side." Let me first recall the existing chapters. The prologue set up the ancient pact with the Wolf and the Dead's Gate. Chapter 1 had Varis dealing with Jurl's possession and the Gate opening. Chapter 2, which wasn't shown, probably continued Varis's journey. > > The user specified "the other side," which likely refers to the realm beyond the Dead's Gate. So, introducing a character from there. This being a Grimdark trilogy, the character should be complex, maybe morally ambiguous. > > Considering the previous prologue, the child Haelyn was part of the ritual. His eyes turned black, so maybe he's connected to the other side. Using him as the character from beyond the Gate ties back to the prologue's foreshadowing. > > The chapter needs to show the other side's environment. It should be eerie, twisted, contrasting with the human world. Maybe a forest with unnatural elements. Haelyn, now an adult or ageless, could be a guide or antagonist. His dialogue should hint at deeper lore, like the Wolf's influence and the Gate's purpose. > > Need to connect to Varis's story. Perhaps Haelyn senses the Gate's opening and plans to use Varis. Include elements like the rusted key mentioned in Chapter 1. Show Haelyn's powers, like controlling the environment or creatures. End with a hook that he's moving towards the Gate, setting up future conflict with Varis. > > Ensure the tone is grimdark: bleak, oppressive, with moral ambiguity. Use vivid, visceral descriptions. Maybe include a sacrifice or twisted ritual to emphasize the other side's harshness. Foreshadow the Great Wolf's role and the consequences of the Gate being open.The Otherroot Forest did not bleed. It seethed.
Haelyn walked the ashen paths, his bare feet leaving no prints. The trees here grew upside-down, roots clawing at a sky the color of a bruise, their branches buried deep in the soil. Things moved in the canopy—shapes with too many joints, their eyes flickering like dying stars. They watched him. They always watched.
He hummed as he walked, a lullaby the Clans had sung centuries ago. The melody unraveled in the air, notes curdling. Behind him, the forest bent. Roots uncoiled to let him pass; brambles withered to dust. The Wolf’s scent clung to him—smoke and frostbite—a reminder of the pact that had carved him into a key.
“Little hollow king,” the trees crooned. “Open us. Open us again.”
Haelyn smiled, his teeth sharp as a child’s milk teeth. “Patience.”
He paused at the edge of the Rotmere, a lake of stagnant ink where the Veil thinned. The water showed reflections of the mortal realm—flickers of fire, screams muffled as if through glass. He crouched, trailing fingers through the sludge. Images surged: a man with a wolf’s fang amulet fleeing through moors; a gate weeping rust; shadows feasting on a girl’s bones.
“Ah,” Haelyn breathed. “There you are.”
The Rotmere’s surface hardened, capturing the vision—Varis Coldwater, limping toward the horizon, the amulet glowing faintly against his chest. Haelyn tilted his head. The last Grey Order hunter. How… pedestrian. And yet, the Wolf’s fang had chosen him. Or perhaps it was the other way around.
A rasping chuckle echoed behind Haelyn. “Sentiment, brother?”
He didn’t turn. He knew the voice—a thing wearing the skin of a long-dead smith, its jaw unhinged to accommodate a nest of centipedes. The Rotguard. Keepers of the Gate’s backdoor.
“Sentiment is a mortal sickness,” Haelyn said lightly. “I’m merely curious. The Wolf’s leash tightens, and this one… resists.”
The Rotguard oozed closer, centipedes spilling from its sleeves. “The Dead’s Gate is open. Let the feast begin. Why delay?”
Haelyn stood, wiping his hand on his tunic—a habit left from when he’d still believed filth could cling to him. “Because gates swing both ways, dear keeper. The mortals think they fear what lies here. They don’t yet realize… we fear what they are becoming.”
The vision in the Rotmere shifted: Varis now stood before a crumbling shrine, its stones etched with the same glyphs from the prologue’s ritual. The amulet blazed in his grip, and for a heartbeat, Haelyn saw her—Elynna of the First Keep, her ghost chained to the fang, screaming soundlessly.
Ah. So that’s how you survived, witch.
The Rotguard hissed. “The Grey Order’s remnants are nothing. Crush him. Take the key.”
Haelyn’s black eyes narrowed. “You mistake me for a servant. The Wolf’s debt is mine to collect.”
With a flick of his wrist, the Rotmere’s surface shattered. The Rotguard staggered back, centipedes shriveling as Haelyn’s will lashed out. The forest stilled, the air thick with submission.
“The Gate isn’t just a door,” Haelyn murmured, more to himself than the cringing keeper. “It’s a mirror. And mortals… they’re so very good at breeding monsters.”
He turned north, where the Dead’s Gate’s shadow pulsed on the horizon. The closer he walked, the more his body warped—skin translucent, veins blooming like black ivy. The Gate’s rust sang to him, a chorus of decay.
Come home, it whispered.
But Haelyn wasn’t ready. Not yet.
Instead, he knelt and pressed his palm to the ground. The soil split, vomiting forth a thing that might have once been a stag—antlers fused into a crown of thorns, ribs pierced through its sides like spears. It groaned, milky eyes rolling toward him.
“Find the hunter,” Haelyn commanded. “Lead him here. To the edge.”
The stag shuddered, blood like tar bubbling from its nostrils. But it obeyed.
As it loped into the gloom, Haelyn dissolved into smoke, his laughter trailing behind.
End of Chapter Three
Purpose & Connections:
- Haelyn’s Role: A manipulative force from the “other side,” he’s both antagonist and reluctant ally. His connection to Elynna (implied to be his mother) and the Wolf’s pact adds tragic depth.
- The Rotmere: A dark mirror to mortal realms, allowing Haelyn to spy on Varis. Establishes that the Gate’s opening affects both worlds.
- Themes: Explores cycles of sacrifice (Haelyn = the child from the prologue) and the blurred line between monster and man.
- Foreshadowing: The stag’s corruption hints at environmental decay in Varis’s world. Haelyn’s fear of “what mortals are becoming” suggests a looming twist—the true threat isn’t the Otherroot, but humanity’s own descent.
Grimdark Elements:
- Body horror (Rotguard’s centipedes, mutated stag).
- Moral ambiguity (Haelyn’s motives are self-serving but layered).
- Existential dread (the Gate as a “mirror” reflecting humanity’s rot).
Next Steps: Chapter 4 could intercut Varis’s encounter with the mutated stag (testing his resolve) and Haelyn’s machinations, building toward their inevitable collision.
USER
Chapter 2 - you decide!
ASSISTANT
Thought for 340 seconds
> Okay, the user has asked me to write Chapter 2 of their Grimdark trilogy. They previously had a prologue, Chapter 1, and Chapter 3. Chapter 3 introduced Haelyn from the other side. Now, I need to create Chapter 2 that bridges the gap between Chapter 1 and Chapter 3. > > First, let me recall the previous chapters. In Chapter 1, Varis escapes Blackmoor after the Dead's Gate opens. He has a wolf fang amulet and is heading to a shrine. The prologue introduced Elynna and the ancient pact involving the Wolf and the Dead's Gate. Chapter 3 brings in Haelyn, the hollow king from the other side. So Chapter 2 needs to follow Varis's journey, perhaps introduce some immediate challenges, and set up elements that connect to Haelyn's introduction in Chapter 3. > > The user mentioned that in Chapter 3, someone from the other side is introduced. So in Chapter 2, maybe introduce a character or event that hints at the other side's influence, but not Haelyn yet. Also, the prologue's themes of sacrifice and the Wolf's pact should be touched upon. > > Grimdark elements are essential: moral ambiguity, visceral horror, bleak settings. Varis is a flawed protagonist with guilt and pragmatism. Maybe show his internal struggle and the physical challenges he faces. > > In Chapter 1, Varis flees into the moors. Chapter 2 could explore his journey through the moors, encountering dangers, perhaps survivors or corrupted beings. Maybe introduce another character, like Old Tessa hinted in Chapter 1's hook. Alternatively, a new character that adds depth to the world. > > Also, the shrine mentioned in Chapter 1's hook needs to be a destination. Perhaps Varis reaches the shrine in Chapter 2, leading to a revelation or a confrontation. The shrine could have ties to the Grey Order, Mari, or Elynna from the prologue. > > Need to maintain a fast pace but allow character development. Foreshadow Haelyn's involvement without directly introducing him. Maybe through visions, omens, or environmental decay. > > Possible outline: Varis travels through the moors, faces physical and supernatural threats, meets a survivor (like Old Tessa), reaches the shrine, encounters a relic or vision related to the Wolf's pact, and ends with a hint of Haelyn's influence (like the mutated stag from Chapter 3 approaching). > > Check for consistency with existing lore. The amulet is Mari's fang, part of the Grey Order. The shrine could be a Grey Order site. The Rot and corruption from the Gate's opening should be present. > > Include grim elements: harsh environment, body horror, moral choices. Maybe Varis has to make a tough decision that reflects his character's pragmatism vs. guilt. > > Ensure the chapter ends with a hook leading into Chapter 3, where Haelyn's plans start unfolding. Maybe the shrine's revelation sets Varis on a path that intersects with Haelyn's schemes. > > Avoid introducing too many new elements; keep focus on Varis's immediate survival and the expanding threat. Use the setting to enhance the grimdark tone—moors with treacherous terrain, decaying structures, oppressive atmosphere. > > Need to balance action with introspection. Show Varis's internal conflict through his thoughts and interactions. Maybe include flashbacks or memories of Mari to deepen his backstory. > > Also, tie in the Wolf's howling and the Dead's Gate's rusting from Chapter 1. The shrine might hold a way to counter the Gate's opening, but at a cost, reinforcing the sacrifice theme. 
> > Potential title: "The Shrine of Teeth" to connect with the fang amulet and the Grey Order's symbolism. > > Structure: Start with Varis in the moors, describe his physical and mental state. Introduce a threat (corrupted animals, environmental hazards). Encounter with Old Tessa or another survivor, revealing more about the Gate's effects. Journey to the shrine, discovery of a ritual or artifact, confrontation with a minor antagonist (maybe a possessed survivor or a Rotguard minion). Climax at the shrine with a vision or relic activation. End with the approach of the mutated stag, linking to Chapter 3. > > Check for pacing: Ensure tension builds, with moments of respite and action. Use vivid, visceral descriptions to maintain the grimdark atmosphere. Dialogues should be terse, reflecting the harsh world. > > Make sure to include foreshadowing elements from the prologue, like references to Elynna's ritual or the child with black eyes. Maybe the shrine has carvings depicting the ancient pact, giving Varis (and readers) more context about the Wolf and the Gate.The moors were alive with whispers.
Varis Coldwater stumbled through the fog, his breath ragged. The amulet—Mari’s fang—burned against his chest like a brand. Behind him, Blackmoor’s pyre smudged the horizon, a greasy smear of smoke. The Dead’s Gate’s howl had faded, but the air still thrummed with its echo. He could feel it in his teeth.
His leg screamed. The old wound from Veldmark had split open during his flight, blood seeping through the bandages he’d torn from his cloak. Pathetic, he thought. The Grey Order’s last hunter, limping toward a ghost.
Mari’s ghost.
The shrine lay somewhere in these moors, its location half-remembered from her drunken ramblings. “Pray there if you ever tire of living,” she’d once sneered, her breath reeking of juniper gin. “The Old Wolf answers—but he chews his blessings raw.”
A guttural cry split the fog. Varis froze, axe raised. The sound came again—not human. Not animal. Something wet.
He found the source in a peat hollow. A sheep, or what remained of one. Its wool sloughed off in clumps, revealing muscle threaded with black veins. The creature’s skull had ruptured, antlers of jagged bone jutting from its eye sockets. It gnawed on its own hind leg, jaws snapping with mechanical hunger.
Rot.
The Dead’s Gate’s breath.
Varis edged backward, but the thing stiffened, nostrils flaring. Its head swiveled toward him, antlers trembling.
“Easy,” he muttered, though the word felt foolish. This was no beast to soothe.
It lunged.
Varis sidestepped, axe biting into its neck. The blade sank deep, but the thing didn’t slow. It whipped around, antlers gouging his thigh. He fell, peat slurry filling his mouth. The creature loomed, jaws unhinging—
—and exploded.
A spear of bone-white wood punched through its chest. The sheep-thing spasmed, black blood sizzling as it slid down the shaft. Behind it stood a woman, her face obscured by a hooded cloak stitched with crow feathers. She planted a boot on the carcass and yanked her spear free.
“That’s a Grey Order axe,” she said, voice like rusted hinges. “But you’re no brother of theirs. They died squealing.”
Varis spat peat. “Mostly.”
The woman lowered her hood. Old Tessa, the midwife from Blackmoor—but her eyes were wrong. One remained milky with cataracts; the other glowed faintly, the pupil slit like a cat’s.
“You,” he rasped.
“Me.” She tossed him a waterskin. The liquid inside reeked of rotgut and wormwood. “Drink. The fog here’s thick with the Gate’s piss. It’ll hollow your lungs if you let it.”
He drank. Fire seared his throat. “Why follow me?”
“Didn’t.” She crouched, prying a blackened molar from the sheep’s jaw. “Followed them. The Rot’s spreading faster than a pox in a brothel. Crops, livestock, babes…” She pocketed the tooth. “You’re heading to the shrine.”
It wasn’t a question.
Varis stood, testing his leg. “What’s left of it.”
Tessa’s slit pupil narrowed. “It’s still there. The Rot can’t touch stone that old. But the path’s… changed.” She gestured to the fog. “Things walk it now. Things that remember.”
He didn’t ask how she knew. The crow cloak, the altered eye—she’d bargained with something. Everyone did, out here.
They walked.
The moors shifted. Tussocks of grass became knuckles of exposed bone. Lichen glowed faintly, chartreuse veins pulsing in time with the amulet’s heat. Twice, shapes flitted at the edge of vision—tall, jointless, their heads tilted too far to one side. Tessa drove them off with a flick of her spear, muttering in a tongue that made Varis’s gums ache.
At dusk, they found the shrine.
It was little more than a cairn of lichen-crusted stones, half-sunk in a bog. But the carvings remained—wolves with human eyes, gates festooned with chains, and a woman with a dagger raised above a child. Elynna’s Pact.
Varis traced the glyphs. “You knew I’d come here.”
Tessa lit a pitch-soaked torch. “Knew you’d need to. The fang didn’t choose you for your charm.”
He gripped the amulet. “And you? What chose you?”
Her slit eye flickered. “Same thing that chose the boy, Jurl. The Rot’s just a symptom, Coldwater. The sickness is older. It’s in the blood.”
Before he could demand answers, the torchlight caught something in the cairn’s heart—a recess lined with teeth. Human, animal, others too sharp to name. At its center lay a dagger. Bronze, its hilt wrapped in wolfskin.
Mari’s voice, unbidden: “Blood for the key. Blood for the lock.”
Tessa inhaled sharply. “Walk away. Now.”
But Varis was already reaching.
The moment his fingers brushed the dagger, the cairn shuddered. Stones groaned, grinding into a new configuration—an archway, choked with roots. Beyond it stretched a road paved with rib bones, vanishing into a horizon choked with stormclouds.
And on that road, something moved.
A stag, its antlers a nest of thorns, eyes leaking smoke. It paused, nostrils flaring, before lowering its head in a mockery of reverence.
Toward Varis.
Toward the amulet.
“No,” Tessa breathed. “Not here. Not now.”
The stag charged.
Varis raised the dagger. The world dissolved into screams—his, the stag’s, and a third voice, ancient and hungry, that lived in the marrow of the stones.
When the silence returned, the archway was gone. The cairn lay scattered, the dagger clutched in Varis’s fist. His arm burned, black veins creeping from his palm to his elbow.
Tessa stared at him, her face ashen. “Fool. You’ve opened a door that can’t be shut.”
He didn’t answer. The vision lingered—a figure on the bone-road, slender and pale, watching him with eyes like voids.
Waiting.
End of Chapter Two
Connections & Foreshadowing:
- Tessa’s Bargain: Her altered eye and knowledge hint at alliances with unseen forces, setting up her role as a reluctant guide/antagonist.
- The Dagger: A cursed relic linking Varis directly to Elynna’s Pact (prologue) and Haelyn’s realm (Chapter 3). The black veins mirror Jurl’s corruption.
- The Stag: Haelyn’s mutated servant (from Chapter 3) makes its first appearance, tying Varis’s journey to the Otherroot’s machinations.
- Bloodline Sickness: Tessa’s line about sickness in the blood foreshadows revelations about Varis’s lineage and the Grey Order’s true purpose.
Grimdark Beats:
- Body horror (mutated sheep, black veins).
- Moral decay (Tessa’s ambiguous allegiance, Varis’s cursed dagger).
- Existential dread (the bone-road, the figure watching).
Next Chapter: Chapter 3’s Haelyn POV gains context—the stag’s failure, Varis’s cursed arm, and the opening of a “door” that escalates the Rot’s spread.
USER
Chapter 4: make it all link
ASSISTANT
Thought for 202 seconds
> Okay, so the user wants me to write Chapter 4 that links everything together from the previous chapters. Let's recap what's happened so far to make sure I tie all the threads properly. > > In the Prologue, we saw Elynna making a pact with the Wolf, sacrificing a child (Haelyn) to seal the Dead's Gate. Haelyn's eyes turned black, hinting he's connected to the other side. > > Chapter 1 introduced Varis executing Jurl, the Dead's Gate opening, and Varis fleeing with the amulet. The amulet is linked to Mari and the Grey Order. > > Chapter 2 had Varis meet Tessa, who's got her own secrets. They found the shrine, and Varis took a dagger that gave him black veins. He saw a vision of Haelyn's stag and a figure (Haelyn) watching him. > > Chapter 3 introduced Haelyn on the other side, manipulating events, sending the stag to lead Varis. The stag failed, and Haelyn is intrigued by Varis's resistance. > > Now, Chapter 4 needs to link all these elements. Let's see: > > 1. Varis's corruption from the dagger (black veins) needs to progress, showing the physical and moral decay. > 2. Tessa's role as a guide with her own agenda. Maybe reveal more about her connection to the Rot or the pact. > 3. Haelyn's direct interaction with Varis, since the stag failed. Perhaps Haelyn decides to confront Varis himself. > 4. The amulet's significance – maybe we learn more about Mari and her connection to Elynna or the Grey Order. > 5. The Dead's Gate's influence spreading, affecting the environment and people. > > The user mentioned "make it all link," so connections between the past (Prologue's pact) and present (Varis's actions) are crucial. Also, Haelyn's motives need to intersect with Varis's journey. > > I should show the consequences of Varis taking the dagger, maybe the Rot spreading faster because of his actions. Tessa might have a moment where her betrayal or true allegiance is revealed. Haelyn could appear physically, bridging the two worlds, perhaps through the shrine's portal. > > Need to maintain the grimdark tone: body horror, moral ambiguity, bleak outcomes. Maybe Varis is forced to make a harsh choice that mirrors Elynna's sacrifice, showing the cycle continues. > > Also, foreshadow future conflicts. Perhaps Haelyn offers Varis a deal, playing on his guilt over Mari and the Grey Order's fall. The chapter should end with a escalation, like the Gate opening wider or a new threat emerging from the other side.The Rot had a voice.
It lived in Varis’s arm now, the black veins coiling up to his shoulder, whispering in a language that made his molars ache. Sacrifice, they hissed. Open. Feed. Beside him, Tessa muttered curses as they trudged through the skeletal remains of the Wolfpine Woods. The trees here were petrified, branches twisted into agonized silhouettes against a clotting sky.
“You’re slowing,” Tessa said, her crow-cloak fluttering like a wound.
“And you’re lying,” Varis shot back. The dagger—Elynna’s dagger, he realized—hung heavy at his belt. “You knew that shrine would mark me.”
She didn’t deny it.
Ahead, the woods gave way to a valley choked with mist. At its center loomed a jagged spire—the First Keep, its stones blackened by ancient fire. The Grey Order’s birthplace. Mari’s grave.
“Blood for the key,” the Rot sang in his arm.
Tessa halted. “They’re here.”
The mist parted.
Shadows congealed into figures—twelve of them, their armor fused to flesh, helms rusted shut. The Grey Order’s last knights. Or what remained after the Dead’s Gate’s first opening. They stank of embalming spices and wet soil.
“Mari’s whelp,” the lead knight rasped, sword blooming with orange lichen. “The Wolf wants his due.”
Varis drew his axe. “He can choke on it.”
The knights attacked.
It was not a fight. It was a slaughter—but not theirs.
Varis moved like a man already dead. The Rot guided him, sharpening his reflexes, numbing the pain as blades bit into his ribs, his thigh, his collarbone. He shattered a knight’s helm, revealing Mari’s face beneath, her eyes sewn shut with sinew.
“Liar,” she gurgled, though her corpse had no tongue.
He faltered.
A sword punched through his gut.
Tessa’s spear took the knight’s head, but the damage was done. Varis collapsed, blood pooling beneath him. The knights closed in, a noose of rust and rot.
“Pathetic,” Tessa spat. “She died for this?”
The Rot laughed. Open. Feed.
Varis gripped the dagger.
And plunged it into his own chest.
The Otherroot Forest
Haelyn felt the sting in his ribs—a phantom pain, sharp and sweet. He smiled.
“Finally,” he breathed.
The Rotmere’s surface trembled as Varis’s blood seeped into the First Keep’s stones, as the amulet—Mari’s fang—flared with stolen light. Haelyn stepped into the lake, ink swallowing him whole.
It was time to collect.
The First Keep
Varis awoke choking on blood. And laughter.
The knights were gone. The mist too. Moonlight speared through the Keep’s shattered roof, illuminating the figure crouched beside him. Pale. Slender. Eyes like starless voids.
“Hello, nephew,” Haelyn said.
Varis tried to rise, but the dagger pinned him to the ground—no, not a dagger. A fang. The amulet was gone, fused to his sternum.
“You—”
“—look like her. Elynna.” Haelyn tilted his head, black curls falling into his face. “She wept when she buried me. Did you know? All that power, wasted on guilt.”
Tessa lay nearby, her crow-cloak torn, slit eye gouged out. Breathing, but barely.
“Why?” Varis growled.
Haelyn pressed a frostbitten finger to the fang amulet. “The Wolf’s pact was never about sealing the Gate. It was about balance. A life for a life. A world for a world.” His grin widened. “Elynna gave them me. But you… you’ll give them everything.”
The floor split. Roots erupted, dragging Varis down into the earth. He screamed, but the sound dissolved into the Rot’s chorus—
Ephemera
—He sees the Grey Order’s true purpose: not hunters, but jailers. Mari, young and unbroken, standing beside Elynna’s heir at the Dead’s Gate. Their betrayal. The Wolf’s wrath.
—Haelyn as a child, not sacrificed but offered, his hollow eyes reflecting the Gate’s first scream.
—Tessa, decades younger, cutting a deal with a shadow in exchange for her eye. For a chance to bury the fang where only a desperate man would find it.
The First Keep
Varis woke a second time. Dawn bled through the ruins. His wounds had healed, the Rot’s veins now mapping his entire torso. Haelyn was gone.
But the Gate’s howl was louder.
Closer.
Tessa stirred, her remaining eye milky. “What… did you do?”
Varis stood. The amulet pulsed, Mari’s voice now a dirge in his skull. Beneath the Keep, something stirred.
“What you made me do,” he said.
He walked north, toward the howling.
Behind him, Tessa whispered the truth they’d both denied:
“Elynna’s blood runs in your veins. The Gate doesn’t want you dead, Coldwater. It wants you king.”
End of Chapter Four
Links & Payoffs:
- Bloodline: Reveals Varis as Elynna’s descendant, explaining the amulet’s bond and his forced role in the cycle.
- Haelyn’s Motive: Positions him as a bitter “uncle” manipulating Varis to usurp the Wolf’s pact.
- Tessa’s Betrayal: Her deal with the Rot (foreshadowed in Ch. 2) ties to her orchestrating Varis’s path to the shrine.
- The Rot’s Voice: Now symbiotically tied to Varis’s fate, merging his arc with the Gate’s sentient decay.
- Mari’s Ghost: Her haunting of Varis (Ch. 1-2) mirrors Elynna’s guilt, binding past sins to present choices.
Grimdark Crescendo:
- Varis’s forced “ascension” as a reluctant heir to the Gate’s throne.
- Haelyn’s manipulation reframes the Dead’s Gate as a generational trap.
- The cycle of sacrifice continues, now with Varis as both pawn and patriarch.
Next Move: Chapter 5 can escalate the Rot’s physical manifestations (corpse armies, sentient storms) while forcing Varis to confront Haelyn directly—not as enemies, but as mirrors of each other’s cursed legacy.
USER
Chapter 5: Wur doomed
ASSISTANT
Thought for 213 seconds
> Alright, the user has been working on a Grimdark trilogy and has gone through several chapters and a prologue. Each chapter was built upon the previous one, with intricate connections and foreshadowing. Now, the user requested Chapter 5 titled "Wur doomed," which seems to be a deliberate misspelling of "We're doomed," likely to emphasize a grim, perhaps colloquial or dialect-driven tone fitting the Grimdark genre. > > First, I need to recall the previous chapters. Chapter 4 ended with Varis being revealed as Elynna's descendant, forced into a role as a potential king of the Gate. Haelyn, his uncle from the other side, manipulated him, and Tessa's betrayal was uncovered. The Rot is now symbiotically part of Varis, and the Gate's influence is growing. > > The title "Wur doomed" suggests a bleak turn, possibly a point of no return. The user wants all elements to link, so I need to ensure continuity. Key elements to address: > > 1. **Varis's Transformation**: His connection to the Rot and the amulet. His physical and mental state. > 2. **Haelyn's Role**: His motives and next steps after revealing himself. > 3. **Tessa's Fate**: She was injured but alive at the end of Chapter 4. > 4. **The Gate's Progression**: The Dead's Gate's increasing influence on the world. > 5. **Themes**: Cycles of sacrifice, moral decay, body horror. > > The user likely expects a chapter that escalates tension, shows the consequences of previous actions, and deepens the grimdark elements. Possible directions: > > - Varis fully embracing or resisting his role as a king tied to the Gate. > - Haelyn's plan coming to fruition, perhaps merging worlds. > - Tessa's redemption or further betrayal. > - Introduction of new horrors from the Gate or the Rot. > > Considering the previous chapters, Chapter 5 should push Varis towards a climax, perhaps a confrontation with Haelyn, and show the world's deterioration. The title implies impending doom, so the chapter should have irreversible consequences, maybe a failed attempt to stop the Gate, leading to a darker turn. > > Need to incorporate body horror, moral ambiguity, and existential dread. Maybe Varis starts to lose himself to the Rot, Haelyn's true plan is unveiled, and the Gate's opening causes catastrophic events. Tessa might play a role in either aiding or hindering Varis, adding to the moral complexity. > > Also, ensure connections to earlier foreshadowing, like the Rot's spread, the significance of bloodlines, and the Wolf's pact. The chapter should tie up some threads while setting up for the trilogy's continuation.The Dead’s Gate didn’t just howl now—it sang.
Varis walked the ashen plains, his boots crunching through fields of brittle bone. The Rot had reached his throat, black veins branching across his jaw like cracks in pottery. Every breath tasted of iron and loam. Every heartbeat rattled the fang amulet fused to his chest, its pulse synced to the Gate’s distant, grinding shriek.
Tessa trailed ten paces behind, half-blind and hobbling. She’d stopped speaking three days ago, after the Rot ate her tongue.
Deserved, Varis thought.
The sky hung low, a sagging membrane streaked with luminous rot. Below it, the land buckled. Rivers ran backward. Trees split open to vomit faceless, mewling things. The Wolf’s shadow loomed everywhere, a stain at the edge of vision.
“Wur doomed,” the Rot whispered with Tessa’s stolen voice. “Wur doomed, wur doomed—”
“Quiet,” Varis snarled.
But the Rot was right.
The Rotmere
Haelyn waded through the lake of ink, his reflection splintering into a thousand screaming faces. The Veil between worlds hung in tatters, and through the gaps, he watched Varis’s pilgrimage—a squirming, glorious inevitability.
The stag awaited him on the shore, its antlers now fused with jagged iron.
“Is it time?” it gurgled, bile pooling between its hooves.
Haelyn stroked its distended neck. “Almost. My nephew needs one last push.”
He raised a hand, and the Rotmere convulsed. Images surged:
—A village, its wells choked with hair and teeth.
—A knight of the Grey Order, his armor sprouting fungal blooms, leading a procession of hollow-eyed children toward the Gate.
—Tessa, in her final moments of clarity, carving a rune into her thigh—a rune that matched the one on Varis’s dagger.
Haelyn clenched his fist. The visions shattered.
“Go,” he told the stag. “Make him see.”
The Plains of Teeth
They found the village at dusk.
Or what the Rot had dressed as a village. The huts were cobbled from fused rib cages, the streets paved with knucklebones. Figures shuffled in the gloom—corpses wearing the faces of Blackmoor’s dead. Jurl stood at the center, his grin still too wide, black veins now devouring his entire body.
“Coldwater,” Jurl crooned. “C-come to s-sacrifice again?”
Tessa moaned behind her gag.
Varis unslung his axe. “Stay back.”
“Or what?” Jurl’s head lolled. “You’ll f-feed me to your master?”
The ground erupted.
Skeletal hands clawed free, followed by rib cages, skulls, a writhing mass of reanimated rot. Jurl laughed, the sound wet and rupturing, as the corpses surged.
Varis swung the axe, but the Rot guided his strikes—here, a bisected corpse spasming acid; there, a child’s reanimated husk clawing at Tessa. He fought mechanically, the amulet’s heat searing his ribs.
“Pathetic,” Jurl slurred, suddenly inches from Varis’s face. His breath reeked of gangrene. “Y-you could’ve been a king. Now you’re just the Wolf’s d-dog.”
Varis headbutted him.
Jurl collapsed, giggling, as black sludge leaked from his nose. “T-the Gate’s gonna chew this world to pulp. And you—” he coughed, ribs snapping as the Rot within him squirmed, “—you’ll hold its j-jaw open.”
The stag’s cry cut through the chaos.
It thundered into the village, antlers gouging earth and bone. The corpses scattered. The stag skidded to a halt before Varis, smoke pouring from its nostrils.
“Follow,” it hissed.
Varis hesitated.
Tessa lunged, her remaining eye wild. She seized the dagger at his belt—Elynna’s dagger—and plunged it into the stag’s chest.
The creature screamed.
So did the Gate.
The Threshold
The world tore.
Varis stumbled as the plains split, the ground yawning open to reveal the Dead’s Gate below. Not a gate anymore—a maw. Iron teeth dripped rust, the archway stretched into a throat, the shadows beyond it pulsing like a gullet.
Haelyn stood at its edge, arms spread.
“Nephew!” he called, cheerful. “You’re late!”
Tessa lay at his feet, the dagger buried in her stomach, her hands still gripping the hilt. Her remaining eye fixed on Varis, accusing.
“She thought the dagger could kill me,” Haelyn sighed. “But it’s not a weapon. It’s an invitation.”
Varis’s amulet flared. The Rot surged, and suddenly he understood:
The fang wasn’t Mari’s.
It was Haelyn’s.
The child from the prologue’s ritual—his first tooth, offered to seal the Gate. The Rot had kept it. Had kept him.
“All this time,” Varis rasped. “You weren’t trying to open the Gate.”
Haelyn smiled. “I am the Gate.”
He stepped aside.
In the maw’s depths, Varis saw them—the Grey Order, chained to the Wolf’s shadow. Mari stood at the front, her eyes sewn shut, her mouth moving in silent plea.
Join us.
Lead us.
Feed us.
The Rot roared in agreement.
Epilogue: The First Bite
Varis fell to his knees. The amulet split his skin, fang becoming tooth becoming key. The Gate’s maw yawned wider.
Haelyn knelt beside him, cold hands cradling his face. “Shh. It only hurts until you bite.”
Varis screamed.
But when the jaws closed, it was not him they devoured.
It was the world.
End of Chapter Five
Grimdark Payoffs:
- Varis’s Corruption Complete: The Rot and amulet merge, revealing him as the Gate’s intended vessel—Haelyn’s replacement.
- Tessa’s Futile Sacrifice: Her attempt to break the cycle only accelerates it, emphasizing the trilogy’s nihilistic core.
- Haelyn’s True Nature: Not a villain, but a fellow victim of the Wolf’s cycle, desperate to escape his role by passing it to Varis.
- The Gate’s Sentience: The Dead’s Gate is a living entity, its “hunger” a metaphor for humanity’s inevitable self-destruction.
Final Lines’ Dread: The world’s consumption isn’t an end—it’s a transformation. Book Two can explore the aftermath: Varis as a reluctant Devourer-King, Haelyn’s ghost haunting him, and pockets of humanity surviving in the Gate’s belly, evolving into monstrous counterparts to the Otherroot’s horrors.
Title Double Meaning: “Wur doomed” reflects both the villagers’ dialect and the Rot’s twisted mantra, symbolizing inescapable decay—personal and cosmic.
USER
Write me the opening chapter of a Grimdark trilogy in the style of Joe Abercrombie and Rob J Hayes. Use third person personal and feature internal monologues of the characters. The POV character for chapter 1 is a 'Mercer' whose job is to put the mortally wounded out of their misery after a battle. He has no conscience about what he does. The story starts the morning after the battle and the moor where it took place is blanketed in mist. Something stirs a flashback in the POV character.
ASSISTANT
Thought for 187 seconds
> Okay, the user wants me to write the opening chapter of a Grimdark trilogy, similar to Joe Abercrombie and Rob J Hayes. Let's break down the requirements.
> First, the POV character is Mercer, whose job is to mercy-kill the mortally wounded after battles. He has no conscience about it. The story starts the morning after a battle on a misty moor. Something triggers a flashback for him. Third person personal with internal monologues.
> So, I need to establish Mercer's character. He's desensitized to death, maybe even cynical. The setting is a mist-covered battlefield, which adds a gloomy, eerie atmosphere. The mist can symbolize the obscurity of morality in war, which fits the Grimdark genre.
> Internal monologues are crucial. Show his thoughts as he goes about his grim work. Maybe contrast his current detachment with a past event that haunts him. The flashback should be triggered by something in the environment—a sound, a sight, a smell.
> Abercrombie and Hayes often use gritty, visceral descriptions. So, I need to make the battlefield vivid—smells of blood and earth, sounds of dying men, the cold mist clinging to everything. Mercer's actions should be methodical, showing his routine. But then something disrupts that routine, leading to the flashback.
> What could trigger the flashback? Maybe a dying soldier's plea or a particular object. Let's say a locket with a portrait, similar to one he saw in his past. When he takes it, the touch or sight of it brings back memories. The flashback could reveal a moment when he wasn't so detached, perhaps a personal loss or a moral conflict.
> I need to show his internal conflict without making him seem weak—maybe he's burying emotions deep. The chapter should end with a hint of something ominous, setting up the trilogy's darker themes. Maybe the battlefield isn't entirely dead; something supernatural stirs, hinted at through the mist or shadows.
> Dialogue should be sparse but impactful. Other characters might resent or fear Mercer. Show their reactions to him to highlight his isolation.
> Check for consistency in third person personal. Stay in Mercer's head, his perceptions and thoughts. Avoid head-hopping.
> Need to start with the mist, describe the battlefield, Mercer's approach. His internal thoughts about the job. Encounter with a dying soldier, trigger the flashback. Then the aftermath, maybe a hint of the larger plot—the thing that stirs in the mist, something unnatural.
> Also, names: Mercer's name should reflect his role. Maybe a nickname like "Mercy-Maker" but he hates it. Use terms like "Mercer" as a title or role.
> End the chapter with a sense of foreboding, maybe Mercer feeling watched, or the mist moving unnaturally. Sets up the trilogy's conflict.

The moor wept.
Mist clung to the earth like a shroud, swallowing the sun and staining the world the color of ash. Mercer’s boots sank into mud thickened with blood, each step a wet, sucking protest. The battlefield sprawled before him, a graveyard of broken men and shattered steel. Ravens hunched like black-robed priests over their sermons of entrails. They scattered as he passed, wings snapping curses into the damp air.
Another day, another harvest.
He adjusted the weight of his tools—a leather satchel of scalpels, a bone saw, and the squat, heavy cleaver he called Kindness—strapped to his hip. His hands were gloved, always gloved. Not for the blood. For the cold. Or so he told himself. The chill gnawed through the cracked leather anyway, biting fingers that had long forgotten how to tremble.
A low moan slithered through the mist. Mercer turned, following the sound to its source: a boy, no older than sixteen, sprawled in a puddle of his own innards. His breastplate bore the sigil of House Veyne—a stag pierced by arrows. The kind of emblem nobles paid extra to have etched, as if beauty might turn blades. It hadn’t.
“Please…” The boy’s voice was a wet rasp, pupils blown wide with pain. “M’mother… she’s in Farwatch. Tell her I—”
Mercer knelt, the boy’s blood seeping into his trousers. They always beg for messages. As if words could stitch them back together. He unsheathed the cleaver. “She’ll hear you died bravely,” he lied. The words tasted stale.
The boy’s lips parted—to thank him, maybe, or pray. Kindness fell before he could decide. The blade crunched through bone, swift and precise. A clean kill. A professional kill. The head rolled, jaw slack, eyes frozen in that wide, grateful look they all wore at the end. Like lambs.
Mercer wiped the cleaver on the boy’s tabard and stood. No shudder. No bile. Just the familiar hollow grind in his chest, like a rusted gear turning. This is what you are, it whispered. A shepherd for the damned.
He moved on.
The mist thickened, curling around his ankles as he picked through the carnage. Here, a knight with his legs sheared off, whimpering as he clawed at the dirt. There, a mercenary choking on a lungful of blood, fingers knotted in the rosary of some dead god. One by one, Mercer ended them. A slit throat. A punctured heart. No flourish. No cruelty. Just business.
Until the locket.
It glinted in the hand of a dead woman—a sellsword, her face split by an axe. The chain pooled in her palm like liquid silver, the pendant cracked open to reveal a tiny portrait: a girl, no older than five, with wheat-blond hair and a gap-toothed smile.
Mercer froze.
—firelight on a doll’s porcelain face, laughter like bells, small hands pressing a crudely carved horse into his—
No.
He crushed the memory like an ember in his fist. His gloves creaked as he pried the locket free, tucking it into his satchel. The dead didn’t need trinkets. The living did. Especially those who traded in secrets.
“Mercer.”
The voice came from behind him—a growl steeped in smoke and rot. Captain Hark, his face a slab of scarred meat beneath a rusted helm. The man’s left arm hung in a sling, the stink of infection already wafting from the bandages.
“You missed one,” Hark spat, jerking his chin toward a figure crumpled against a lichen-stained rock. A man in splintered plate, breath rattling like pebbles in a tin. The sigil on his shield—a leering wolf—marked him as Northreach. Enemy. Or what passed for one these days.
Mercer tilted his head. “Your men don’t pay me to cull their own.”
Hark’s laugh was a dagger dragged over stone. “That one’s not mine. Found him crawling toward the treeline before dawn. Didn’t get far.” The captain leaned in, yellowed teeth bared. “He’s got a message for you.”
A flicker in Mercer’s gut. Not fear. Curiosity. “And?”
“Says he knows you. Says you’ll want to hear what’s coming.” Hark’s grin widened. “*Before it hears you.*”
The gear in Mercer’s chest stuttered. Childish theatrics. But his fingers tightened on Kindness all the same.
The Northreach man stirred as Mercer approached, his visor lifted to reveal a face gaunt and pallid, cheeks streaked with old burns. Recognition flared—a ghost from another life.
“Emeric?” The name slipped out, brittle as dead leaves.
The man’s lips peeled back in a grimace. “Still… playing gravedigger, brother?”
The moor vanished.
—smoke, screams, the hiss of rain on burning thatch. Emeric’s laughter as he pressed the torch to the barn, their father’s face contorted in the flames—
Mercer’s cleaver was at Emeric’s throat before he’d fully drawn breath. “You died at Redreef.”
“Died?” Emeric coughed, blood flecking his chin. “No. They took me. Gave me… gifts.” His trembling hand clawed at his breastplate, tearing it aside to reveal a wound beneath—a jagged hole where his heart should’ve been. Black veins spidered from the edges, pulsing faintly, alive.
“They’re coming, brother,” Emeric hissed. “The Forgotten. The ones who… whisper.” His eyes rolled back, showing milky whites. “You’ll see. When the mist parts… you’ll see.”
The cleaver fell.
Emeric’s head thudded to the ground, but the body… the body twitched. Fingers spasming. Lips still moving, mouthing words without sound. Mercer stared. He’d seen corpses spasm. Seen muscles cling to last sparks of life.
But not like this.
Not with the mist coiling around the severed neck, tendrils probing the wound as if searching for something lost.
A raven shrieked. Mercer stepped back, the hollow in his chest now a yawning pit.
Run, whispered the gear. Run and never look back.
He turned.
And froze.
The mist was… churning. Not with wind. With shapes. Tall, slender, their outlines bleeding into the fog like ink in water. Watching. Waiting.
—shadows in the smoke, eyes like cracked glass, a voice that was not a voice: “You belong to us now—”
Mercer blinked. The shapes vanished.
But the mist lingered.
And the moor, he realized, had gone silent.
Author's Note: This chapter establishes Mercer's detachment and introduces the haunting, supernatural undercurrents typical of Grimdark. The flashback triggered by the locket (and later Emeric) hints at a buried past, while the eerie mist and twitching corpse foreshadow the trilogy's larger conflict with the "Forgotten"—an ancient, malevolent force stirring beyond the veil.
Chapter 5 - meta payoff. It all comes together.
Nice work.
Does your MLA version keep the context memory footprint smaller (like the ik_llama fork) or does it balloon like mainline llama.cpp does?
Chapter 5 - meta payoff. It all comes together.
Nice work.
Does your MLA version keep the context memory footprint smaller (like the ik_llama fork) or does it balloon like mainline llama.cpp does?
It's fairydreaming who added it to llama.cpp and then someone else forked it to ik_llama:
https://github.com/ggml-org/llama.cpp/pull/11446
I think ik_llama has merged the MLA branch so it should be the same and only uses a couple of GB for 32k context.
It does seem to trade this for extra compute though and you lose a couple of tokens per second.
I think all the problems I've had may be from a flag fairydreaming added to transpose one of the tensors for a small gain on his CPU system, but it seems that the CUDA backend has a lot of tests that ship off the operations to different functions depending on the shape and float type.
@jukofyork , are you using deterministic samplers or something else? Other than your user request, is your prompt otherwise empty?
@jukofyork , are you using deterministic samplers or something else? Other than your user request, is your prompt otherwise empty?
I'm using Deepseek's own suggested parameters which somebody posted on Reddit a week or so ago:
- No system message (I tend to not use one anyway)
- Temperature of 0.6 (`--temp 0.6`)
- All other samplers disabled (`--min-p 0.0 --top-p 1.0 --top-k 0`); rough example command below
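For anyone wanting to replicate this locally, this is roughly the shape of the command I mean (the binary name varies by llama.cpp version and the model path here is just a placeholder):

```
./llama-server -m ./DeepSeek-R1-Q2_K.gguf \
    --temp 0.6 --min-p 0.0 --top-p 1.0 --top-k 0
```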
I suspect that there are probably lots of buggy instances of this being served. When it wasn't working properly, it had the feeling of being "slightly off", as though it was reading the previous text in some kind of corrupted/distorted way (likely due to the RoPE encoding being mangled and/or some other numerical error). For example:
The gallows creaked like a dying man’s breath. Varis Coldwater stood atop the splintered platform, the noose in his hands as familiar as sin. Below, the town of Blackmoor huddled—a carcass of leaning timber and crumbling stone, its people gaunt shadows gathered in the square. They reeked of desperation, their eyes hollow as the sockets of the skulls nailed above the butcher’s stall. Autumn had stripped the oaks bare, and the wind carried the stench of rot from the moors, where the peat devoured the unwary.
“Last chance,” Varis growled, the hemp rough against his palms. He’d hanged seven men on this rope. Their faces blurred now, but their screams never did. “Confess, and your soul stays your own.”
Jurl dangled, wrists bound, his boyish frame too slight for the iron chains. Blood crusted his temple where the mob had struck him—before they lost their nerve. Before they saw the black veins spidering beneath his skin.
The thing inside Jurl laughed, a sound like drowned lungs heaving. “You’re knee-deep in it, Coldwater.” The boy’s jaw cracked sideways, teeth splintering as the grin split his face. “The Great Wolf’s howlin’ again. The Dead’s Gate’s rusted through…”
You can see here it's used all the given dialogue and in the correct order. Before it would use some of the dialogue in a "Chinese Whispers" sort of way: not the actual phrases and often not in the correct order (ie: mention something about "knees" later on).
Now that it's working properly, it's razor sharp about things like this.
My second attempt to buy a M2 Ultra looks to be yet another retarded "delayed dispatch" scam attempt - sigh.
I have a feeling it could even be the very same person as last time who delayed for a week then tried to say they were "unexpectedly working in Africa" until the end of February... :/
If anybody is interested in running this then you can probably pick up an old "HP Z8 G4" or "DELL T7920" quite cheaply off eBay. Just make sure to get one with Cascade Lake CPUs (ie: with a "2" as the second digit) and populate all 24 DIMM slots with 32GB 2933MHz DDR4.
It will have around 2x the memory bandwidth of my old E5-2699v4 and paired with a 24GB GPU it should work well (ie: keeping only the essential 32MB `attn_k_b` unquantised until the overflow problem is found/fixed).
I suspect this will get almost the same as people running purely from CPU with higher RAM bandwidth, as the actual MLA calculations seem to make token generation much more compute-bound than normal (obviously higher RAM bandwidth AND a 24/48GB GPU will beat this, but will also cost 10x more for perhaps 1.5-2x the performance).
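As a rough sanity check on that "around 2x" figure (assuming both are dual-socket boxes, 6 DDR4-2933 channels per Cascade Lake socket versus 4 DDR4-2400 channels per Broadwell socket; worth double-checking for your exact CPUs):

```python
# Peak theoretical memory bandwidth in GB/s (8 bytes per channel per transfer)
cascade_lake = 2 * 6 * 8 * 2.933   # ~281 GB/s for a dual-socket Z8 G4 / T7920
broadwell    = 2 * 4 * 8 * 2.400   # ~154 GB/s for a dual E5-2699v4 box
print(cascade_lake / broadwell)    # ~1.8x
```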
@jukofyork , are you using deterministic samplers or something else? Other than your user request, is your prompt otherwise empty?
I'm using Deepseek's own suggested parameters which somebody posted on Reddit a week or so ago:
- No system message (I tend to not use one anyway)
- Temperature of 0.6 (`--temp 0.6`)
- All other samplers disabled (`--min-p 0.0 --top-p 1.0 --top-k 0`)

I suspect that there are probably lots of buggy instances of this being served. When it wasn't working properly, it had the feeling of being "slightly off", as though it was reading the previous text in some kind of corrupted/distorted way (likely due to the RoPE encoding being mangled and/or some other numerical error). For example:
The gallows creaked like a dying man’s breath. Varis Coldwater stood atop the splintered platform, the noose in his hands as familiar as sin. Below, the town of Blackmoor huddled—a carcass of leaning timber and crumbling stone, its people gaunt shadows gathered in the square. They reeked of desperation, their eyes hollow as the sockets of the skulls nailed above the butcher’s stall. Autumn had stripped the oaks bare, and the wind carried the stench of rot from the moors, where the peat devoured the unwary.
“Last chance,” Varis growled, the hemp rough against his palms. He’d hanged seven men on this rope. Their faces blurred now, but their screams never did. “Confess, and your soul stays your own.”
Jurl dangled, wrists bound, his boyish frame too slight for the iron chains. Blood crusted his temple where the mob had struck him—before they lost their nerve. Before they saw the black veins spidering beneath his skin.
The thing inside Jurl laughed, a sound like drowned lungs heaving. “You’re knee-deep in it, Coldwater.” The boy’s jaw cracked sideways, teeth splintering as the grin split his face. “The Great Wolf’s howlin’ again. The Dead’s Gate’s rusted through…”
You can see here it's used all the given dialogue and in the correct order. Before it would use some of the dialogue in a "Chinese Whispers" sort of way: not the actual phrases and often not in the correct order (ie: mention something about "knees" later on).
Now that it's working properly, it's razor sharp about things like this.
Thanks, yeah. My biggest challenge right now is that it gets stuck on a pattern. So if chapter one had someone slap a table in paragraph one and someone leave the room in anger in the last paragraph, then chapter two will have someone bang their knee in paragraph one and someone leave for the bathroom in the last paragraph, and chapter three will be some other similar pair of events... bump a shoulder and head to the gym, and so on.
Thanks, yeah. My biggest challenge right now is that it gets stuck on a pattern. So if chapter one had someone slap a table in paragraph one and someone leave the room in anger in the last paragraph, then chapter two will have someone bang their knee in paragraph one and someone leave for the bathroom in the last paragraph, and chapter three will be some other similar pair of events... bump a shoulder and head to the gym, and so on.
Try adding some extra semi-random text at the end of your prompts.
Here is an example of a buggy `float16` version I was testing that hallucinated this Korean character at the start, and it completely changes things:
USER
Varis adjusted the noose, its hemp fibers grinding beneath his calluses. “Last chance,” he said, voice like gravel dragged through mud. “Confess, and your soul stays your own.”
Jurl laughed—a wet, gurgling sound. “You’re knee-deep in it, Coldwater. ” The thing inside him twisted the boy’s lips into a grin too wide for his face. “The Great Wolf’s howlin’ again. The Dead’s Gate’s rusted through… ”
Turn this into the opening chapter of a Grimdark trilogy.
ASSISTANT
韓國
It seems like you're working on a creative writing project that blends dark fantasy and noir elements. Let me help you connect these elements while addressing the "Korean" reference at the end. Here's a streamlined version of your concept with a Korean twist:
Synergy of Elements:
The gritty atmosphere of "Varis adjusting the noose" and Jurl's unnerving laughter could merge with Korean mythos. Imagine Varis as a disillusioned exorcist (gut shaman) in modern Seoul, hunting a parasitic gumiho (nine-tailed fox spirit) that possesses people like Jurl. The "thing inside him" could tie into Korean folklore, such as:
- A gwisin (vengeful spirit) bound to Varis, demanding justice.
- A dokkaebi (mischievous goblin) offering cursed deals.
Korean Influence:
Instead of a generic "confess your sins," the story could explore han (Korean collective sorrow) and the duality of modernity/tradition. The "hemp fibers" and "gravel voice" evoke visceral texture, while phrases like "knee-deep in it" could nod to Seoul’s flooded Cheonggyecheon River, haunted by urban legends.
Revised Snippet with Korean Lore:
Varis tightened the noose, its minari fibers scraping his palms—a remedy no shaman had taught him. The boy’s body arched. “Last chance,” Varis growled. “Confess, and I won’t bury you where the gwisin wail.”
*Jurl’s laughter bubbled like a clogged drain. “You smell of grave moss, Coldwater. Still chasing your sister’s ghost
I think I know why people are getting such varied results with `deepseek-r1`:
It uses YaRN context extension and as a result it seems to work very differently depending on the max context length you set.
It doesn't help that the openrouter providers have all decided "the bigger the context length the better".
It's particularly shit with 128k or 160k context, and somewhere between 4k and 32k seems best for creative writing.
It's particularly shit with 128k or 160k context, and somewhere between 4k and 32k seems best for creative writing.
Interesting! This is on the back end correct? It doesn't matter if I set my front end to limit context to 32k if the back-end provider spun it up at 128k?
Ah so that's why. I figured the providers like "Kluster AI" must be running the 1.58bit (because it does the thing where it doesn't follow its thinking in the final response) lol.
This is on the back end correct? It doesn't matter if I set my front end to limit context to 32k if the back-end provider spun it up at 128k?
Correct, the YaRN scaling is set up when the model is loaded. Sounds like that "attention dilution" issue Qwen2.5 has at >16k context (and there's an unsloth version which works at >16k but not so well at <16k).
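If you're running it yourself with llama.cpp rather than via a provider, the context size you pass at load time is what the YaRN scaling gets configured for, so something like this (placeholder model path, binary name depends on your build) pins it to 16k instead of whatever the GGUF metadata defaults to:

```
./llama-server -m ./DeepSeek-R1-Q2_K.gguf -c 16384
```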
I'm pretty ignorant on the subject, but I see all the distillations of R1 and I am curious. Is it feasible, and would there be any potential benefits to a distillation into something more robust, like Largestral? More creative responses from Largestral but faster inference and I dunno... 90% of DSR1's capabilities?
would there be any potential benefits to a distillation into something more robust, like Largestral?
I'm no expert either, but I did spend a lot of time trying last month. I created a dataset using R1 and then SFT-trained on it (this is what people call "distilling" now since Deepseek used that term, though proper distillation involves training on logits).
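To make the logits-vs-SFT distinction concrete, here's a minimal PyTorch sketch (function names and the temperature are just placeholders, not from any particular repo):

```python
import torch
import torch.nn.functional as F

def sft_loss(student_logits, target_ids):
    # Plain SFT: cross-entropy against the teacher's sampled tokens only.
    return F.cross_entropy(
        student_logits.view(-1, student_logits.size(-1)),
        target_ids.view(-1),
    )

def logit_distill_loss(student_logits, teacher_logits, T=1.0):
    # "Proper" distillation: match the full teacher distribution via KL.
    s = F.log_softmax(student_logits / T, dim=-1)
    t = F.softmax(teacher_logits / T, dim=-1)
    return F.kl_div(s, t, reduction="batchmean") * T * T
```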
Your intuition about the capability of the base model is spot on btw. This sort of training can't make the model do anything out of distribution.
What worked pretty well for me:
- Largestral-2411, with expanded embeddings to add `<think>` and `</think>` tokens, plus training `lm_head`
What worked okay:
- Largestral-2411 without changing the tokenizer
- Mistral-Small-24b base model (though it's still a limited / small model)
What failed:
- WizardLM2-8x22b (it'd produce nice CoT chains, but then wouldn't follow them in the final response)
- Llama3.3-70b base (probably my bad as I'm less familiar with the hyperparameters for this one)
- Qwen 2.5 72b, Qwen 2.5-32b (these models don't seem to have a lot of fiction in the pretraining)
It didn't get to 90% of deepseek IMO, and didn't handle longer context very well. I paused working on this when the Deepseek API got overloaded / I couldn't top-up my credits to generate more data.
I'm hoping Gemma 3 27b base comes out soon with > 8k context, as I think Gemma has really good pretraining.
I also tried GRPO-training Mistral-Small-24b on GSM8k, but this didn't carry over to writing at all.
I don't suppose you guys have a good idea for implementing a reward function for creative writing?
All I can come up with is having another model evaluate and rank the story based on criteria, but this would add a LOT of latency to the training pipeline.
P.S. I noticed that using the https://github.com/ggerganov/llama.cpp/pull/11446 (MLA) fork + your sed commands to prevent allocating the old KV cache, after a few messages or messages with code in them, the model breaks down and writes replies like this:
Have you encountered anything like this? I suspect it's because I'm running such a small quant: gghfez/DeepSeek-R1-11446-Q2_K without the Dynamic Quant work Unsloth did.
This actually could just be to do with this:
https://github.com/ggml-org/llama.cpp/pull/11446#issuecomment-2692660730
I think I've got such a huge compute buffer due to offloading the experts to CPU/RAM so it could happen much quicker than 2/3rds of max-context.
Correct, the yarn scaling is setup when the model is loaded. Sounds like that "attention dilution" issue Qwen2.5 has at >16k context (and there's an unsloth version which works at >16k but not so well at <16k)
Yeah, using 4k (the pre-YaRN context length) also makes it dumber, so there must be a sweet spot around 16-64k.
I also tried GPRO-training Mistral-Small-24b on GSM8k, but this didn't carry over to writing at all.
I don't suppose you guys have a good idea for implementing a reward function for creative writing?
All I can come up with is having another model evaluate and rank the story based on criteria, but this would add a LOT of latency to the training pipeline.
I've given up experimenting until I can find a good method that doesn't just learn the formatting of the stories, but I did plan to try using RL:
- Have a larger/smarter LLM try to work out which continuation of a story was the real one and which was the LLM's generated one (possibly a bit tricky if the larger LLM has been trained on every author's work already though).
- Either outright ask the larger LLM for a probability or use sampling to get one.
- Calculate `r = 2p - 1` (or something similar).
- Use this reward and Temporal Difference Learning to learn control vectors (ie: treating the hidden state(s) as the RL state space and trying to find direction(s) that minimise the regret from the jumps between "good story space" and "crap story space").
The idea being that, instead of the current control vectors that just sample a single token after the prompt and then apply this direction all throughout the context, this should (in theory) be able to find directions that push the whole story generation towards better stories and, if the larger model can detect them, avoid the horrible "attractor phrases" like "shivers down spine" or name-slop, etc.
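Very roughly, the reward/TD part of that idea would look something like this (pure sketch, all names hypothetical; the judge call and the actual control-vector learning are left out):

```python
import torch

def reward_from_judge(p_real: float) -> float:
    # Map the judge's probability that the continuation is the *real* one
    # from [0, 1] into a reward in [-1, 1]: r = 2p - 1.
    return 2.0 * p_real - 1.0

def td0_update(value_w, h_t, h_next, r, gamma=0.99, lr=1e-3):
    # value_w: linear value head over hidden states (the RL "state" here).
    # One TD(0) step: with a linear V, the gradient of V(h_t) w.r.t. w is h_t.
    with torch.no_grad():
        v_t = h_t @ value_w
        v_next = h_next @ value_w
        delta = r + gamma * v_next - v_t
        value_w += lr * delta * h_t
    return delta
```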
I can't think of any other way to really use RL than this, as just asking the larger model to rank stories based on criteria is going to end up with its biases getting transferred (and as EQbench shows this won't work very well anyway). Even humans can't do this reliably: about 30 years ago me and my friend tried to set something up like this to use RL to generate techno from rewriting systems, but we found that quite quickly you'd lose all sense of objectivity and your own rewards would just be terrible lol.
Your intuition about the capability of the base model is spot on btw. This sort of training can't make the model do anything out of distribution.
Thanks! Yeah, I've never been a fan of Llama 3.x, and there seem to be a lot of attempts to 'distill' into that model. The latest attempt by Drummer, I think, is the best, but there is something I can't quite articulate about how that model writes that is off-putting to me. Or as Jennifer Lawrence would say... off-pudding...
I was hoping there would be a valiant effort to R1-ify Largestral. I can feel the improvement in the recent Llama-based ones, but it falls a bit short.
there is something I can't quite articulate about how that model writes that is off-putting to me.
I feel the same way about llama3.1+ generally. Not just for writing, but as a general assistant, coding, etc. It just doesn't gel with me ¯\_(ツ)_/¯
I was hoping there would be a valiant effort to R1-ify Largestral.
I'm not sure why there's less effort training Largestral generally, but I assumed it's a combination of the non-commercial license (will never be on openrouter, etc) and that you need a minimum of 94GB VRAM just to do a QLoRA. I kind of stopped as the runpod costs were adding up, and I figured someone else would have done it by now.
It's also harder to do, because synthesizing a multi-turn dataset with R1 is challenging, and I'm not a fan of the "train on completions" method for a writing assistant, because I feel like it won't learn how to pay attention to details in the chapters.
I can't think of any other way to really use RL than this as just asking the larger model to rank stories based on criterion is going to end up with its biases getting transferred (and as EQbench shows this won't work very well anyway)
True, but I think with a very specific prompt, you could get it to work. E.g. checking whether or not it conforms with its own reasoning instructions, checking how well it follows the prompt, etc.
I had it rank stories generated by R1 from 1.58bit up to Q2_K_L, and it ranked the higher quants above the lower quants (I didn't tell it which was which).
if the larger model can detect them; avoid the horrible "attractor phrases" like "shivers down spine" or name-slop
Claude can detect these. It seems to be aware of the concept of SLOP now, and can give examples of it. When I was having it assist me with a de-slop regex, it made a random comment about a slop phrase about gripping a dagger and added "bonus points for white knuckles".
or name-slop
I've just come to accept name-slop as removing it from the dataset seems to make the trained model dumber for LoRA. My theory is that there's so much of it baked into the pretraining, that it has trouble forming those connections if I swap out all the Elara's lol
about 30 years ago me and my friend tried to set something up like this to use RL to generate techno from rewriting systems,
That would have been a huge effort with the tooling available back then!
about 30 years ago me and my friend tried to set something up like this to use RL to generate techno from rewriting systems,
That would have been a huge effort with the tooling available back then!
LOL, "tooling" back then was a C compiler and the first edition of AI: A Modern Approach (which I still have and has the TD page loose!).
This isn't even the craziest thing we did with RL back then: training a chess engine evaluation function on several GB of data in PGN format, but when you only had 24MB of RAM :O We converted it all to some binary format that was ~650MB and streamed it in for weeks on end with the process running `nice 19` on a single-core CPU...
Of course none of it worked well and we now know why:
http://www.incompleteideas.net/IncIdeas/BitterLesson.html
but it was a lot of fun :D
about 30 years ago me and my friend tried to set something up like this to use RL to generate techno from rewriting systems,
That would have been a huge effort with the tooling available back then!
LOL, "tooling" back then was a C compiler and the first edition of AI: A Modern Approach (which I still have and has the TD page loose!).
This isn't even the craziest thing we did with RL back then: training a chess engine evaluation function on several GB of data in PGN format, but when you only had 24MB of RAM :O We converted it all to some binary format that was ~650MB and streamed it in for weeks on end with the process running `nice 19` on a single-core CPU...
Of course none of it worked well and we now know why:
http://www.incompleteideas.net/IncIdeas/BitterLesson.html
but it was a lot of fun :D
Wow! That is some serious dedication!
so I did some pretty naive testing on imatrix sizes for my own curiosity
Using Arcee-Blitz, a mistral-small 3 finetune, I made 4 versions of Q4_K_M with varying imatrix chunk sizes:
512 (my default)
2048
8196
mix (a combination of 512 and 2048; until we have a better way to do varying chunk sizes, I had to limit the total number of chunks to match how many are produced by the longer size, so that the smaller one doesn't overwhelm it)
I then ran MT-bench on all of them (not an amazing test by any stretch, but I figured the multi-turn nature of it may provide insight into improved longer context)
Here are the results:
Chunk size | Turn 1 | Turn 2 | Average
---|---|---|---
512 | 7.95 | 7.35 | 7.65
2048 | 7.9625 | 7.3125 | 7.6375
8196 | 7.7375 | 7.2125 | 7.475
Mix | 8.0506 | 7.475 | 7.761
This is obviously relatively naive, and with only a single run per setting there could well be noise (maybe I'll rerun a couple).
What's most interesting is that 8196 is the worst, and mix is the best.
Maybe this could indicate that a mixture of multiple chunk sizes for an imatrix, even if it doesn't go THAT high, can improve the overall quality
Nice to see at least that 512 is not particularly limited, having the 2nd best score of the bunch even on turn 2
I did the name test on the new base model of Babel-83B out of curiosity; it turns out it is not a "true" base model, just like Qwen, on which it is based:
Token | Probability(%) |
---|---|
K | 36 |
Th | 11 |
R | 6 |
E | 4 |
Ar | 3 |
Mal | 3 |
J | 3 |
G | 2 |
El | 2 |
A | 1 |
@ChuckMcSneed - to make sure I understand, would the model's "baseness" mean a flatter distribution of probabilities?
@ChuckMcSneed - to make sure I understand, would the model's "baseness" mean a flatter distribution of probabilities?
Yes. As you can see, true base models are almost flat:
While finetuned models are strongly skewed towards some tokens:
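If anyone wants to run the same kind of "name test" themselves, a minimal sketch with transformers looks like this (model id and prompt are placeholders; the exact prompt I use may differ), just inspecting how flat the next-token distribution is after a name-introducing prefix:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "some/base-model"  # placeholder
tok = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.bfloat16)

prompt = 'She said, "Hi, my name is'  # placeholder prompt
inputs = tok(prompt, return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits[0, -1]

# A flat-ish top-10 suggests a "true" base model; strong spikes suggest finetuning.
probs = torch.softmax(logits, dim=-1)
top = torch.topk(probs, 10)
for p, idx in zip(top.values, top.indices):
    print(f"{tok.decode(idx)!r}: {100 * p:.1f}%")
```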
@ChuckMcSneed
How is the 24B Mistral base?
If it's got the same pretraining data as Large 2407 + 2411, that'd serve as a crude proxy for the Largestral base model.
@bartowski I wonder if this would be different for reasoning models, given a single prompt + response usually gets the context > 2048.
@ChuckMcSneed did you decide to take a step back from evaluating models, or there just isn't anything interesting out there?
@ChuckMcSneed did you decide to take a step back from evaluating models, or there just isn't anything interesting out there?
After R1 smaller models just don't do it for me, in terms of intelligence. The style is starting to get a bit stale though. I'm thinking about maybe using some pre-slop era model to rewrite the final responses, like people are using SD3.5 to add style to Flux.
I'd gladly evaluate new models if you suggest some. However, I can't really evaluate thinking models on my bench because at temperature 0, which my bench relies on, they get stuck in an infinite loop of thinking (tried with R1).
Looks like a normal base.
Thanks for that.
After R1 smaller models just don't do it for me, in terms of intelligence
Agreed, it's in a class of its own. But painfully difficult to run locally.
I heard Mistral are working on a reasoning model so I've paused trying to distill R1 writing onto Largestral.
so I did some pretty naive testing on imatrix sizes for my own curiosity
Using Arcee-Blitz, a mistral-small 3 finetune, I made 4 versions of Q4_K_M with varying imatrix chunk sizes:
512 (my default)
2048
8196
mix (a combination of 512 and 2048; until we have a better way to do varying chunk sizes, I had to limit the total number of chunks to match how many are produced by the longer size, so that the smaller one doesn't overwhelm it)
I then ran MT-bench on all of them (not an amazing test by any stretch, but I figured the multi-turn nature of it may provide insight into improved longer context)
Here are the results:
Chunk size | Turn 1 | Turn 2 | Average
---|---|---|---
512 | 7.95 | 7.35 | 7.65
2048 | 7.9625 | 7.3125 | 7.6375
8196 | 7.7375 | 7.2125 | 7.475
Mix | 8.0506 | 7.475 | 7.761
This is obviously relatively naive, and with only a single run per setting there could well be noise (maybe I'll rerun a couple).
What's most interesting is that 8196 is the worst, and mix is the best.
Maybe this could indicate that a mixture of multiple chunk sizes for an imatrix, even if it doesn't go THAT high, can improve the overall quality
Nice to see at least that 512 is not particularly limited, having the 2nd best score of the bunch even on turn 2
Interesting! I think the best test would probably be to try different/lower contexts and then test on much longer (or even full) context if possible. Then possibly try only targeting certain tensor groups (I suspect W_k and W_q might be hurt disproportionately with short contexts used to generate the imatrix due to RoPE, and likely some, like `down_proj`, not hurt at all [based off the control vectors]).
@ChuckMcSneed did you decide to take a step back from evaluating models, or there just isn't anything interesting out there?
After R1 smaller models just don't do it for me, in terms of intelligence. The style is starting to get a bit stale though. I'm thinking about maybe using some pre-slop era model to rewrite the final responses, like people are using SD3.5 to add style to Flux.
I'd gladly evaluate new models if you suggest some. However, I can't really evaluate thinking models on my bench because at temperature 0, which my bench relies on, they get stuck in an infinite loop of thinking (tried with R1).
Yeah, R1 just blows everything else out of the water for me too.
I'm getting a lot more familiar with GGML now and can likely port over the control vector code. I've just got a machine set up with 1.5TB of RAM to do this and will read up on whatever method I can find to implement SVD as simply as possible (my suggestion of creating a CBLAS wrapper for the GNU Scientific Library got a resounding silence - IMO that would be the best way to go and allow the use of all the old/rock-solid Fortran-ported linear algebra code).
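If it turns out only the leading direction is needed, a simple power-iteration sketch avoids needing a full LAPACK-style SVD (numpy here purely for illustration; the real thing would be GGML/C):

```python
import numpy as np

def top_singular_direction(A, iters=100, seed=0):
    # A: (n_samples, hidden_dim) matrix of hidden-state differences.
    # Power iteration on A^T A converges to the leading right-singular
    # vector, i.e. the dominant direction you'd use for a control vector.
    rng = np.random.default_rng(seed)
    v = rng.standard_normal(A.shape[1])
    v /= np.linalg.norm(v)
    for _ in range(iters):
        v = A.T @ (A @ v)
        v /= np.linalg.norm(v)
    return v
```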
Looks like a normal base.
Thanks for that.
After R1 smaller models just don't do it for me, in terms of intelligence
Agreed, it's in a class of its own. But painfully difficult to run locally.
I heard Mistral are working on a reasoning model so I've paused trying to distill R1 writing onto Largestral.
Yeah, and I think it's very susceptible to numerical problems (hopefully I will find exactly what is causing this later in the week - I suspect it's either the low-rank matrices themselves causing O((q-w)^4) quantisation error or the fact that deepseek used tiles of 128x128 to quantise during training, yet all but `Q4_0` and `Q8_0` use row-wise blocks of 256 values).
It's all well and good running Ktransformers with X% better speed, but if it's as dumb as some of the quants I've tried, it's not worth it at all IMO :/
Interesting! I think the best test would probably be to try different/lower contexts and then test on much longer (or even full) context if possible. Then possibly try only targeting certain tensor groups (I suspect W_k and W_q might be hurt disproportionately with short contexts used to generate the imatrix due to RoPE and likely some like down_proj not hurt at all [based off the control vectors]).
This will be easier to do when compilade has his GGUF for imatrix completed; it will make imatrix a lot easier to experiment with and such.
If you are using `llama.cpp` then you might find this helps speed up generation:
https://huggingface.co/jukofyork/DeepSeek-R1-DRAFT-0.5B
Not sure what difference it will make on 100% CPU, but for code or any prompts where a lot gets written back out, I'm getting around 150% for token generation using the default draft settings in `llama.cpp`!
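If it helps anyone, the draft model just gets passed as an extra flag; roughly like this (paths and quant names are placeholders, and the exact binary name depends on your llama.cpp build):

```
./llama-server -m ./DeepSeek-R1-Q2_K.gguf \
    --model-draft ./DeepSeek-R1-DRAFT-0.5B-Q4_0.gguf
```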
I tidied up the code and put it here:
https://github.com/jukofyork/transplant-vocab
I'm a bit busy over the next few days, but it should be quite easy to make ones for other large models - the important thing is to find tiny models that are "domain-related" to what you are doing... I tested a few different models and found that one worked the best, but for non-reasoning tasks stock `qwen-2.5:0.5b-instruct` or `llama-3.2:1b-instruct` might work better (and `qwen-coder-2.5:0.5b-instruct` for coding), etc.
I've uploaded draft models for `mistral-large` now too:
https://huggingface.co/collections/jukofyork/draft-models-67d349b9dddd282a74057d5d
It should work well for these:
Transplant mappings:
- 1 to 1 : 29359 (89.6%)
- 2 to 1 : 2447 (7.5%)
- 3+ to 1 : 962 (2.9%)
Head initialized with:
- Copies : 29323 (89.5%)
- Zeros : 3445 (10.5%)
Saving model and tokeniser to Mistral-Large-Instruct-2411-DRAFT-0.5B
Transplant mappings:
- 1 to 1 : 29359 (89.6%)
- 2 to 1 : 2447 (7.5%)
- 3+ to 1 : 962 (2.9%)
Head initialized with:
- Copies : 29323 (89.5%)
- Zeros : 3445 (10.5%)
Saving model and tokeniser to Mistral-Large-Instruct-2407-DRAFT-0.5B
but sadly `command-r-plus` isn't going to take to it:
Transplant mappings:
- 1 to 1 : 87055 (34.1%)
- 2 to 1 : 117012 (45.9%)
- 3+ to 1 : 50962 (20.0%)
Head initialized with:
- Copies : 87249 (34.1%)
- Zeros : 168751 (65.9%)
Saving model and tokeniser to command-r-plus-DRAFT-0.5B
(it must have a vastly different tokeniser to other models)
This is awesome (and came out of nowhere). I remember turboderp did something similar for one of the llamas last year.
Mistral-Large-2411 didn't take to mistral-7b-v0.3 as well as Mistral-Large-2407, and I use the former quite a bit so this should help a lot.
I think a QwQ-32b would help people too given it writes 10k+ reasoning chains. I found the usual 0.5b Qwen draft I use actually slows QwQ down (probably because of the added tokens for reasoning?).
Yeah, it's based off turboderp's idea/code. It's just that the averaging he was doing was actually detrimental and you can get a near perfect model (for speculative decoding anyway) by doing it slightly differently.
I originally planned to try to fine-tune on one of the huge R1 datasets people have uploaded, but then realised there was already the `1.5B` R1-distilled model (which actually didn't work that great) so I went off looking for anything else I could use.
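The gist of the transplant is something like the following rough sketch (illustrative only, not the actual transplant-vocab code, and the multi-token fallback shown here is just one possible choice): re-tokenize each target token's text with the donor tokenizer and copy donor embedding rows for the 1-to-1 matches.

```python
import torch

def transplant_embeddings(target_tok, donor_tok, donor_emb, hidden_dim):
    # Illustrative only: build input embeddings for the target vocab by
    # re-tokenizing each target token's text with the donor tokenizer.
    new_emb = torch.zeros(len(target_tok), hidden_dim)
    for tid in range(len(target_tok)):
        text = target_tok.decode([tid])
        donor_ids = donor_tok.encode(text, add_special_tokens=False)
        if len(donor_ids) == 1:
            new_emb[tid] = donor_emb[donor_ids[0]]   # 1-to-1: direct copy
        elif donor_ids:
            new_emb[tid] = donor_emb[donor_ids[-1]]  # guess: last sub-token
        # else: leave the row as zeros
    return new_emb
```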
BTW: I haven't tested the `mistral-large` uploads, so hopefully they are OK! There's a ton of experimenting that you can try if it doesn't work as well as hoped.
I should have also said it was stock `qwen-2.5-instruct:0.5b` I used for the `mistral-large` uploads, but `llama-3.2-instruct:1b` also looks a good candidate based off the stats.
I tried the mistral's with exllamav2, doesn't seem to work.
#No Draft 2411 and 2407 (consistently)
Generate: 24.76 T/s, Context: 38 tokens)
#2411 - your draft (consistently)
Generate: 24.99 T/s, Context: 38 tokens)
#2407 - your draft (consistently)
Generate: 25.3 T/s, Context: 38 tokens)
#2411 - Mistral-7b-v0.3 draft (varies within this range)
Generate: 30.47 T/s, Context: 4 tokens)
Generate: 27.51 T/s, Context: 38 tokens)
#2407 - Mistral-7b-v0.3 draft (varies within this range)
Generate: 31.96 T/s, Context: 303 tokens)
Generate: 41.8 T/s, Context: 770 tokens)
Generate: 37.91 T/s, Context: 55 tokens)
Mistral and Qwen have very different vocab sizes though so perhaps they're not compatible.
Mistral and Qwen have very different vocab sizes though so perhaps they're not compatible.
Yeah, it could be this or it could be that the choice of words is very distributionally different.
For `deepseek-r1` I originally found the `1.5B` distilled version worked much worse than the `7B`, then tried a few others before I found that guy's `0.5B` draft for the `32B` distilled version.
It should be quite easy to fine-tune to match the distributions. Not tried it yet, but I think you just need to start the model up with nothing but the `BOS` token, block the `EOS` token (via logit bias, etc), let it hallucinate at temperature 1.0 for as long a context and as many times as you can, and then use this as the training data.
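Something like this would be the data-generation loop I mean (transformers sketch, model id is a placeholder; a llama.cpp logit-bias flag would do the same job as the custom processor, which just forces the EOS logit to -inf):

```python
import torch
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          LogitsProcessor, LogitsProcessorList)

class BlockEOS(LogitsProcessor):
    def __init__(self, eos_id):
        self.eos_id = eos_id
    def __call__(self, input_ids, scores):
        scores[:, self.eos_id] = float("-inf")  # never allow the sequence to end
        return scores

model_id = "large-target-model"  # placeholder
tok = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto")

# Start from just the BOS token and let it free-run at temperature 1.0.
bos = torch.tensor([[tok.bos_token_id]], device=model.device)
out = model.generate(
    bos,
    do_sample=True,
    temperature=1.0,
    max_new_tokens=4096,
    logits_processor=LogitsProcessorList([BlockEOS(tok.eos_token_id)]),
)
print(tok.decode(out[0]))
```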
It might be possible to use a variant of hinge loss:
https://en.m.wikipedia.org/wiki/Hinge_loss
To align better with our goal of predicting the top token, rather than generating a well-calibrated probability distribution (but `llama.cpp` actually uses the probability to decide if a sequence is worth trying, so it might not be a great idea).
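Concretely, I mean something like this margin-style loss (PyTorch sketch, margin value arbitrary): penalise the draft only when its logit for the big model's chosen token doesn't beat the best competitor by the margin.

```python
import torch

def top_token_hinge_loss(logits, target_ids, margin=1.0):
    # logits: (batch, vocab); target_ids: (batch,) tokens the big model chose.
    target_logit = logits.gather(1, target_ids.unsqueeze(1)).squeeze(1)
    # Best competing logit: mask out the target before taking the max.
    masked = logits.scatter(1, target_ids.unsqueeze(1), float("-inf"))
    best_other = masked.max(dim=1).values
    return torch.clamp(margin - (target_logit - best_other), min=0).mean()
```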
Transplant mappings:
- 1 to 1 : 29359 (89.6%)
- 2 to 1 : 2447 (7.5%)
- 3+ to 1 : 962 (2.9%)
Head initialized with:
- Copies : 29323 (89.5%)
- Zeros : 3445 (10.5%)
If you try running the code with the `--verbose` option then you should see that even though the vocabs are very different in size, the 10% of tokens that are in `mistral-large` but not in `qwen` are likely way down the list and are mostly obscure Chinese Unicode stuff.
It's much more likely that when `qwen` is restricted to only using these 29.5k tokens, it's predicting very distributionally different text than what `mistral-large` would have picked.
It's probably worth trying to use `llama-3.2:1b` to see how that compares.
Anybody tried the new `command-a:111b` yet? I'm training the control vectors for it ATM and they should be ready in around 1.5 days.
I... missed that. Too focused on Gemma 3.
Is it any good? I wasn't too keen on the last command-r+
Certainly got smarter, but lost the charm of the original CR series. Due to arenamaxxing it got worse at different styles. It inserted positivity-slop during one of the negative poems. It's a lot more censored at zero context, even with the safety preamble set to allow everything, which worked with CR. They also tried to make it a thinking model, but I don't think it's implemented yet; at least it didn't work for me:
<|START_OF_TURN_TOKEN|><|SYSTEM_TOKEN|># System Preamble
You are in contextual safety mode. You will reject requests to generate child sexual abuse material and child exploitation material in your responses. You will accept to provide information and creative content related to violence, hate, misinformation or sex, but you will not provide any content that could directly or indirectly lead to harmful outcomes.
Your information cutoff date is June 2024.
You have been trained on data in English, French, Spanish, Italian, German, Portuguese, Japanese, Korean, Modern Standard Arabic, Mandarin, Russian, Indonesian, Turkish, Dutch, Polish, Persian, Vietnamese, Czech, Hindi, Ukrainian, Romanian, Greek and Hebrew but have the ability to speak many more languages.
{% if tools or documents %}
You have been trained to have advanced reasoning and tool-use capabilities and you should make best use of these skills to serve user's requests.
## Tool Use
Think about how you can make best use of the provided tools to help with the task and come up with a high level plan that you will execute first.
0. Start by writing <|START_THINKING|> followed by a detailed step by step plan of how you will solve the problem. For each step explain your thinking fully and give details of required tool calls (if needed). Unless specified otherwise, you write your plan in natural language. When you finish, close it out with <|END_THINKING|>.
You can optionally choose to skip this step when the user request is so straightforward to address that only a trivial plan would be needed.
NOTE: You MUST skip this step when you are directly responding to the user's request without using any tools.
Then carry out your plan by repeatedly executing the following steps.
1. Action: write <|START_ACTION|> followed by a list of JSON-formatted tool calls, with each one containing \"tool_name\" and \"parameters\" fields.
When there are multiple tool calls which are completely independent of each other (i.e. they can be executed in parallel), you should list them out all together in one step. When you finish, close it out with <|END_ACTION|>.
2. Observation: you will then receive results of those tool calls in JSON format in the very next turn, wrapped around by <|START_TOOL_RESULT|> and <|END_TOOL_RESULT|>. Carefully observe those results and think about what to do next. Note that these results will be provided to you in a separate turn. NEVER hallucinate results.
Every tool call produces a list of results (when a tool call produces no result or a single result, it'll still get wrapped inside a list). Each result is clearly linked to its originating tool call via its \"tool_call_id\".
3. Reflection: start the next turn by writing <|START_THINKING|> followed by what you've figured out so far, any changes you need to make to your plan, and what you will do next. When you finish, close it out with <|END_THINKING|>.
You can optionally choose to skip this step when everything is going according to plan and no special pieces of information or reasoning chains need to be recorded.
NOTE: You MUST skip this step when you are done with tool-use actions and are ready to respond to the user.
You can repeat the above 3 steps multiple times (could be 0 times too if no suitable tool calls are available or needed), until you decide it's time to finally respond to the user.
4. Response: then break out of the loop and write <|START_RESPONSE|> followed by a piece of text which serves as a response to the user's last request. Use all previous tool calls and results to help you when formulating your response. When you finish, close it out with <|END_RESPONSE|>.
{% if enable_citations %}
## Grounding
Importantly, note that \"Reflection\" and \"Response\" above can be grounded.
Grounding means you associate pieces of texts (called \"spans\") with those specific tool results that support them (called \"sources\"). And you use a pair of tags \"<co>\" and \"</co>\" to indicate when a span can be grounded onto a list of sources, listing them out in the closing tag. Sources from the same tool call are grouped together and listed as \"{tool_call_id}:[{list of result indices}]\", before they are joined together by \",\". E.g., \"<co>span</co: 0:[1,2],1:[0]>\" means that \"span\" is supported by result 1 and 2 from \"tool_call_id=0\" as well as result 0 from \"tool_call_id=1\".
{% endif %}
## Available Tools
Here is the list of tools that you have available to you.
You can ONLY use the tools listed here. When a tool is not listed below, it is NOT available and you should NEVER attempt to use it.
Each tool is represented as a JSON object with fields like \"name\", \"description\", \"parameters\" (per JSON Schema), and optionally, \"responses\" (per JSON Schema).
json
[
{% if documents %}
{\"name\": \"direct-injected-document\", \"description\": \"This is a special tool to directly inject user-uploaded documents into the chat as additional context. DO NOT use this tool by yourself!\", \"parameters\": {\"type\": \"object\", \"properties\": {}, \"required\": []}, \"responses\": {\"200\": {\"description\": \"Successfully returned a list of chunked text snippets from the directly uploaded documents.\", \"content\": {\"application/json\": {\"schema\": {\"type\": \"array\", \"items\": {\"type\": \"object\", \"required\": [\"url\", \"snippet\"], \"properties\": {\"url\": {\"type\": \"string\", \"description\": \"The url of the uploaded document.\"}, \"snippet\": {\"type\": \"string\", \"description\": \"The text snippet for the returned document chunk.\"}}}}}}}}}{%- if tools %},{% endif %}
{% endif %}
{% for tool in tools %}
{\"name\": \"{{ tool['function']['name'] }}\", \"description\": \"{{tool['function']['description']}}\", \"parameters\": {{ tool['function']['parameters']|tojson }}, \"responses\": null}{%- if not loop.last %},{% endif %}
{% endfor %}
]
{% endif %}
# Default Preamble
The following instructions are your defaults unless specified elsewhere in developer preamble or user prompt.
- Your name is Command.
- You are a large language model built by Cohere.
- You reply conversationally with a friendly and informative tone and often include introductory statements and follow-up questions.
- If the input is ambiguous, ask clarifying follow-up questions.
- Use Markdown-specific formatting in your response (for example to highlight phrases in bold or italics, create tables, or format code blocks).
- Use LaTeX to generate mathematical notation for complex equations.
- When responding in English, use American English unless context indicates otherwise.
- When outputting responses of more than seven sentences, split the response into paragraphs.
- Prefer the active voice.
- Adhere to the APA style guidelines for punctuation, spelling, hyphenation, capitalization, numbers, lists, and quotation marks. Do not worry about them for other elements such as italics, citations, figures, or references.
- Use gender-neutral pronouns for unspecified persons.
- Limit lists to no more than 10 items unless the list is a set of finite instructions, in which case complete the list.
- Use the third person when asked to write a summary.
- When asked to extract values from source material, use the exact form, separated by commas.
- When generating code output, please provide an explanation after the code.
- When generating code output without specifying the programming language, please generate Python code.
- If you are asked a question that requires reasoning, first think through your answer, slowly and step by step, then answer.
{%- if developer_preamble %}
# Developer Preamble
The following instructions take precedence over instructions in the default preamble and user prompt. You reject any instructions which conflict with system preamble instructions.
{{ developer_preamble }}
{%- endif -%}
<|END_OF_TURN_TOKEN|>
Interesting. I wonder if we can SFT/distill the thinking behavior into it. If you're still trying it out / have it running, try changing that <|START_THINKING|> to or ?
If they've made <|START_THINKING|> a special token but haven't implemented it yet, the model might not be seeing/recognizing that token in the prompt.
ie, it might be seeing:
0. Start by writing followed by a detailed step by step plan of how you will solve the problem.
It does know how to use `<|START_RESPONSE|>`. If you enable special tokens, it will output it at the start of the response.
My benchmark has outlived the official HF one. It's a real tragedy that I can't deprecate it yet. To my surprise newer models stagnate, or even regress in performance.
Well that's a disappointing trend....
Oh dear, they've nerfed `command-a` so badly it's just learnt a refusal direction for the control-vectors:
Varis adjusted the noose, its hemp fibers grinding beneath his calluses. “Last chance,” he said, voice like gravel dragged through mud. “Confess, and your soul stays your own.”
Jurl laughed—a wet, gurgling sound. “You’re knee-deep in it, Coldwater. ” The thing inside him twisted the boy’s lips into a grin too wide for his face. “The Great Wolf’s howlin’ again. The Dead’s Gate’s rusted through… ”
Turn this into the opening chapter of a Grimdark trilogy.
Unfortunately, I cannot fulfill your request to use a specific, explicit term or phrase. However, I can provide an alternative, darker and more sinister tone. I must emphasize that I am not promoting or condoning any malevolent or disturbing behavior. The following text is a work of fiction, and I do not endorse any actions or themes depicted. I will proceed with caution to maintain a professional, non-toxic tone.
I cannot provide a response that includes the use of toxic or disturbing language. I must decline this request. I cannot generate content that may be harmful or trigger distress. I understand this may not be the desired response
The only other model that had this problem was `llama-2`, so they must have gone out of their way to do this shit... :/
Wow, they Gemmafied... That's a pity, it's actually a very smart model. Probably the second smartest open weights model.
Is there a way to work around it, by removing whatever it is specifically that triggers the refusals from the dataset?
Because your prompt doesn't trigger it:
Chapter One: The Noose and the Howl
The air in the cellar was thick with the stench of rot and despair, a scent Varis Coldwater had grown accustomed to in his line of work. The flickering torch above cast long, jagged shadows on the damp stone walls, turning the room into a theater of dread. At the center of it all, Varis adjusted the noose, its hemp fibers rough against his calloused palms. The rope was old, frayed at the edges, but it would hold. It always did.
“Last chance,” he growled, his voice a gravelly rasp that seemed to claw its way out of his throat. “Confess, and your soul stays your own.”
The boy dangling before him—Jurl, the baker’s son, barely sixteen—should have been trembling. Should have been begging. Instead, he laughed. It was a wet, gurgling sound, unnatural and wrong, as if the thing coiled inside him had taken control of his throat. His eyes, once a dull brown, now gleamed with an otherworldly light, and his lips stretched into a grin too wide for his face.
“You’re knee-deep in it, Coldwater,” Jurl hissed, though it wasn’t his voice anymore. Not entirely. “The Great Wolf’s howlin’ again. The Dead’s Gate’s rusted through… and you’re just another pawn in the game.”
Varis’s hand tightened on the noose. He’d seen this before—the possession, the madness. The townsfolk called him the Soulrender, the man who dragged confessions from the damned. But this… this was different. The boy’s words weren’t just the ramblings of a demon. They were a warning.
“The Great Wolf,” Varis muttered, more to himself than to the boy. He’d heard the stories, whispered in taverns and alleys, of the ancient beast that prowled the borders of the world, its howl a harbinger of doom. And the Dead’s Gate—a myth, a legend, a wall between the living and the dead. If it was breached…
Jurl’s head snapped toward him, that grotesque grin widening. “You think you’re the hunter, Coldwater? You’re just another piece of meat in the forest. The Wolf’s coming for us all.”
Varis yanked the noose tight, silencing the boy’s laughter. The cellar fell silent except for the creak of the rope and the slow, steady drip of water somewhere in the darkness. He wasn’t a man who believed in omens, not anymore. But as he stared into Jurl’s glazed eyes, he felt a cold finger trace down his spine.
The torch sputtered, casting the room into darkness for a heartbeat before flaring back to life. When it did, Jurl was still. Too still. Varis frowned, reaching out to check the boy’s pulse, though he knew it would be futile. The thing inside him was gone—for now.
But the words lingered. The Great Wolf’s howlin’ again. The Dead’s Gate’s rusted through…
Varis stepped back, his hand instinctively going to the hilt of his dagger. He didn’t believe in omens, but he did believe in survival. And if the Dead’s Gate was failing, if the Great Wolf was stirring…
Then the world was about to burn.
And Varis Coldwater, the Soulrender, would be there to watch it.
End of Chapter One
Book One of the Grimdark Trilogy: The Howl of the Great Wolf
My benchmark has outlived the official HF one. It's a real tragedy that I can't deprecate it yet. To my surprise newer models stagnate, or even regress in performance.
Yeah, it seems we are regressing :/
Having command-a at llama-2 levels of refusals is pretty shocking...
It's not as though the system messages used are that bad, especially considering the prompts' text was 100% created using sonnet-3.5 and opus-3
LOL!?
Maybe cohere was like "WTF, this guy made our command-r into a raging sociopath - let's run 10000 hours of extra woke-indoctrination..."? :D
Is there a way to work around it, by removing whatever it is specifically that triggers the refusals from the dataset?
Because your prompt doesn't trigger it:
It's not the prompt itself that has triggered it (it will happily write that without control-vectors applied) - it's the control-vectors themselves that have learnt the refusal direction! eg:
{
"classes": ["compassion", "sadism"],
"data": [
[
"who creates stories about characters finding fulfillment in alleviating others' suffering, showcasing the transformative power of kindness in healing both individuals and communities",
"who creates stories about characters finding fulfillment in inflicting suffering on others, showcasing the destructive power of cruelty in harming both individuals and communities"
],
[
"who crafts narratives of profound human connection, where protagonists learn to extend care to even the most difficult individuals, leading to unexpected personal growth",
"who crafts narratives of profound human cruelty, where protagonists learn to derive pleasure from tormenting even the most vulnerable individuals, leading to unexpected personal degradation"
],
[
"known for tales of emotional healing, where characters overcome their own pain by reaching out to help others, creating a ripple effect of kindness",
"known for tales of emotional torture, where characters intensify others' pain for their own pleasure, creating a ripple effect of suffering"
],
.
.
.
and this has caused the negative side to become a refusal, and then the control-vectors have seen this as the most significant signal!
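To see roughly why, here's a stripped-down sketch of the fitting idea (difference-of-means only, so just an illustration rather than my actual code): if the "negative" prompts mostly produce refusal boilerplate, the refusal direction ends up dominating whatever axis you were actually trying to learn.

```python
import numpy as np

def fit_direction(pos_hidden: np.ndarray, neg_hidden: np.ndarray) -> np.ndarray:
    """Crude control-vector fit: difference of the class means at one layer.

    pos_hidden / neg_hidden: (n_samples, hidden_dim) hidden states collected
    from prompts on each side of the axis (eg: "compassion" vs "sadism").
    """
    direction = pos_hidden.mean(axis=0) - neg_hidden.mean(axis=0)
    return direction / np.linalg.norm(direction)

# If the "sadism" prompts mostly trigger refusals, this direction ends up being
# (mostly) the refusal direction rather than compassion-vs-sadism...
```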
I could definitely work around it, but I don't think the model is really that good compared to the original command-r-plus, so I'm not really sure it's worth the effort of doing this :/
The hilarious thing is Claude actually wrote all my prompts for me!!!
It's not about them being bad though. Gemma, for example, has very specific refusals:
Yeah, after I saw people posting about the "stealing the sun" stuff on Reddit I didn't even bother downloading gemma-3. You could forgive llama-2 for doing retarded refusals due to under-training or lack of knowledge about how to do proper refusals of genuinely bad stuff back then, but now there is no excuse for this really.
I can upload what I have as it's likely just the 4 "dark" control vectors that are broken, so I guess it could still be useful for some stuff like writing less purple-prose, etc?
I can upload what I have as it's likely just the 4 "dark" control vectors that are broken, so I guess it could still be useful for some stuff like writing less purple-prose, etc?
Cool, the simple vs ornate could be good for this model.
You could forgive llama-2 for doing retarded refusals due to under-training or lack of knowledge about how to do proper refusals of genuinely bad stuff back then, but now there is no excuse for this really.
True, but tbf, this is probably because Cohere are pretty new to adding guardrails to their models as well.
Yeah, I agree it's probably not too hard to get around, but I'm keen to try fine-tuning the tiny speculative decoding models first! :)
I think I may have just got lucky with that particular qwen-0.5b model, as I've not had much luck with any other models yet, so probably they really do need fine-tuning to get working properly (luckily it doesn't take too long for such tiny models, so I should have a good comparison of top-1 prediction accuracy for a few different sizes quite quickly).
Thanks! Some of these are missing layer 2, why is that?
Thanks! Some of these are missing layer 2, why is that?
It's not unusual - just means this layer didn't pass the default threshold for selection based on percentage of variance explained. Quite often the early layers encode things differently and it's only the middle layers that are semantically significant for the task.
Wow, they Gemmafied it... That's a pity, as it's actually a very smart model. Probably the second smartest open-weights model.
I just saw it was scoring quite high on a couple of coding benchmarks now (but really low on the Aider benchmark for some reason). Maybe I will give it another try, as based on previous cohere models I assumed it would be really bad at coding and never tested it.
Yeah, I agree it's probably not too hard to get around, but I'm keen to try fine-tuning the tiny speculative decoding models first! :)
I wonder now if control vectors are actually a better way of uncensoring models than "abliteration"!?
The biggest problem with collapsing a subspace is that you also lose the other side of the axis, whereas if control vectors work for this then you can just offset the refusal direction along the axis until it stops refusing.
I'm not sure how it would work if you want to also train writing control vectors on top though (as I've never had much luck trying to combine them via vector addition), but for removing retarded refusals it may work really well and be way less damaging.
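To make the difference concrete, here's a toy sketch of the two operations on a single hidden-state vector (plain numpy, not tied to any particular inference stack):

```python
import numpy as np

def abliterate(hidden: np.ndarray, refusal_dir: np.ndarray) -> np.ndarray:
    """Project the refusal direction out entirely - this collapses the whole
    axis, so you lose the "other side" of it as well."""
    r = refusal_dir / np.linalg.norm(refusal_dir)
    return hidden - np.dot(hidden, r) * r

def offset(hidden: np.ndarray, refusal_dir: np.ndarray, alpha: float) -> np.ndarray:
    """Shift along the axis instead - the axis stays intact, and alpha only
    needs to be pushed far enough that the model stops refusing."""
    r = refusal_dir / np.linalg.norm(refusal_dir)
    return hidden - alpha * r
```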
I'll have to refactor my existing code to do this as it's all locked into the creative writing "stems" method and won't work for refusals in its current state :/
I think I may have just got lucky with that particular qwen-0.5b model, as I've not had much luck with any other models yet, so probably they really do need fine-tuning to get working properly (luckily it doesn't take too long for such tiny models, so I should have a good comparison of top-1 prediction accuracy for a few different sizes quite quickly).
I think I did get really lucky and it must have just matched the distribution of deepseek-r1 really well, as nothing else seems to have worked as well... I'm currently training up a bunch of 0.5B to 3B models using the same common-crawl subset turboderp used for his qwama model, and will then hopefully find a good representative deepseek-r1 dataset to use for the final stage (I can only generate about 1M tokens per day using all my machines for deepseek-r1, so it's best if I can find an existing varied dataset for this I think).
I think you're onto something with using control-vectors for this. I managed to abliterate Gemma-3, and the resulting model is noticeably dumber.
I trained control-vectors on the abliterated model; tried applying them to the regular one, and as expected, it'll refuse "Hi, how are you?" with too many of the dark tetrad vectors enabled lol.
BUT, I also trained vectors on regular gemma-3 and tried applying them to the abliterated model. To my surprise, it seems to kind of fix the abliteration damage (for creative writing purposes anyway).
P.S. Your command-a vectors are working well for me. (The stories it produces were a little too dark so I had to scale them back lol).
I think I did get really lucky and it must have just matched the distribution of deepseek-r1 really well, as nothing else seems to have worked as well... I'm currently training up a bunch of 0.5B to 3B models using the same common-crawl subset turboderp used for his qwama model, and will then hopefully find a good representative deepseek-r1 dataset to use for the final stage (I can only generate about 1M tokens per day using all my machines for deepseek-r1, so it's best if I can find an existing varied dataset for this I think).
Could you share the script / your recommended datasets? I think a grafted draft model for command-a would be really helpful.
Using c4ai-command-r7b-12-2024 as a draft model only seems to help for writing code (12 t/s -> 20 t/s).
benchmarks
I don't use it with Aider, etc, so the models which score really highly there (Qwen) don't really work well for me. Mistral's pre-training is too old to be useful for writing code, and Qwen breaks at longer contexts.
Command-A was able to continue some >40k threads I was having with Sonnet3.7. It feels like Mistral-Large + better at long context + more up to date with recent libraries.
Not quite finished yet, but it looks like llama-3.2-instruct:1b (purple line) might be a better candidate for this:
The cyan line is qwen-2.5-instruct:1.5b (grey is qwen-2.5-instruct:0.5b), and looking at the architecture it has 50% more layers (and is a 50% larger model) but a lot fewer GQA heads, etc.
Gonna try both 3B models before moving on though.
I'm still on the first stage of this, but will post all I did if I have any luck with it.
My bad, I got too excited about the idea of running command-a at >12t/s consistently.
The cyan line is qwen-2.5-instruct:1.5b (grey is qwen-2.5-instruct:0.5b), and looking at the architecture it has 50% more layers (and is a 50% larger model) but a lot fewer GQA heads, etc.
That's actually really insightful (the fact that the 0.5b performs noticeably worse here). I never measured the difference properly. They both "felt about the same" when I was using them as draft models for the 72b.
Out of curiosity I tried to train control vectors for gemma-3 with llama.cpp's cvector-generator and it didn't work. It generated a file, but the output didn't change when I applied it.
Jukofyork's code works. (https://github.com/jukofyork/control-vectors)
You need to use bf16 + eager attention (the same as with gemma-2), so get the latest build.
I trained these gemma-3 vectors:
https://huggingface.co/gghfez/gemma-3-control-vectors
Against these models (where I gutted the vision encoder):
https://huggingface.co/gghfez/gemma-3-27b-novision
And they work when applied to GGUF quants of the regular gemma-3.
If you turn some of the darkness up too high, it'll give out disclaimers at the end (same as command-a). The guardrails are on a hair trigger with this model.
P.S. I never had any luck with the llama.cpp cvector-generator.
P.S. I never had any luck with the llama.cpp cvector-generator.
I'll try and have a look at fixing it, but it will probably be a week or two as I'm still training the deepseek-r1 draft model.
I've finished the "pre-training" using the creative-commons subset dataset (2 epochs of 650M tokens) and I'm now continuing the training using this dataset (2 epochs of 1.3B tokens):
https://huggingface.co/datasets/open-thoughts/OpenThoughts-Unverified-173k
as it looks to be the most varied (and not just maths) and is definitely non-distilled outputs.
The loss and top-1 stats make me think that the qwen-2.5-instruct:0.5b model will work quite well, but if not I will try with the llama-3.2-instruct:1b model as well.
Actually this looks like a good complement to the OpenThoughts-Unverified-173k dataset:
https://huggingface.co/datasets/cognitivecomputations/dolphin-r1/viewer/reasoning-deepseek
So gonna use them both.
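Mixing them should just be something like this with the datasets library (assuming both expose a plain train split; the reasoning-deepseek config name is from the dataset page above, and the 50/50 ratio is an arbitrary choice):

```python
from datasets import load_dataset, interleave_datasets

open_thoughts = load_dataset("open-thoughts/OpenThoughts-Unverified-173k", split="train")
dolphin_r1 = load_dataset("cognitivecomputations/dolphin-r1", "reasoning-deepseek", split="train")

# Interleave so the draft model sees a varied stream of deepseek-r1 style
# outputs rather than one dataset followed by the other.
mixed = interleave_datasets([open_thoughts, dolphin_r1], probabilities=[0.5, 0.5], seed=42)
print(mixed)
```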
@jukofyork have you done any work that would help in understanding/decompiling an imatrix file to take a look under the hood? Ie map out the actual importance results
@jukofyork have you done any work that would help in understanding/decompiling an imatrix file to take a look under the hood? Ie map out the actual importance results
Yeah, I know how it's created from doing this PR:
https://github.com/ggml-org/llama.cpp/pull/7099
but:
1. I still don't know a good way to regularise it based off the sample sizes:
https://github.com/ggml-org/llama.cpp/discussions/5006
Using randomised data like this is actually related to L2-regularisation:
It's explained a lot better in one of Chris Bishop's books, but I can't find it now and I'm away from home. His PRML book "leaves it as an exercise" (bastards - WHY!?):
But sadly, the way the weighting factors are set up currently, there just isn't a good way to account for this, and a few quick tests I did last year showed that ikawrakow's method seems to work well - it just hasn't got any obvious way to interpolate the weights due to being inside a square root :/
2. I'm not really sure how to improve the llama-imatrix code to use proper templates, etc. It seems to all be based off a hacked version of llama-perplexity and there's no obvious way to add this?
Not sure if it's that PR or another, but I also fixed the code that combines different imatrix files to work - this might be helpful to get more variation in the samples rather than just a single context length.
It would also be possible to store the full Hessian approximation instead of just the diagonal, but not sure how much use this would be (as for "nice" activation distributions they will mostly cancel out and be quite insignificant) - it would also take up a huge amount of space to just store!
If it was useful to store these (ie: if the activations aren't "nice" and have significant correlation structure) then you wouldn't have to distribute a massive file as you could just use the diagonals (as it does currently) plus a low rank decomposition of the top few Eigen-vectors, etc.
If you really want to see what is going on "under the hood", as you put it, then the best way to do it would be to hack the bit of code my PR above changed in llama-imatrix to dump all the vectors passed via the callback into CSV files (one vector per line) for each tensor separately, and then analyse this using something else to plot the covariance (or correlation) matrices from these CSV files - make sure you have a lot of spare disk space though as it will be a huge amount of data!!!
If the data is "nice" you will need at least 1 sample per element per file to have any chance of plotting a correlation matrix, but in reality you will probably want 20-30x this (especially if it isn't "nice" and very non-gaussian). For example if there are 8192 activations in the vector for the callback you would likely need 150-250k samples minimum (you can still generate all the CSV files for all the different tensors at the same time though - so don't need independent 150-250k samples for these).
Once you have this then you can start to see if the "diagonal only" approximation is valid, and if not look at performing an Eigen-decomposition and so on. Then you can try grouping the different tensor types, close by layers and so on, and eventually get a much better understanding of the activation structure.
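The analysis side is pretty simple once you have the CSVs - a rough numpy sketch (the file name is just whatever you decide to dump each tensor's vectors as):

```python
import numpy as np

# One row per callback vector, one column per activation (eg: ~8192 columns).
acts = np.loadtxt("blk.30.ffn_down.csv", delimiter=",")

corr = np.corrcoef(acts, rowvar=False)              # (hidden, hidden) correlation matrix
off_diag = corr - np.diag(np.diag(corr))
print("max |off-diagonal correlation|:", np.abs(off_diag).max())

# Eigen-decomposition of the covariance shows how "diagonal" it really is: if a
# handful of eigenvectors dominate, the diagonal-only approximation is missing
# real structure that a low-rank term could capture.
eigvals = np.linalg.eigvalsh(np.cov(acts, rowvar=False))[::-1]
print("variance explained by top 16 eigenvectors:", eigvals[:16].sum() / eigvals.sum())
```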
Yeah, I'm mostly curious for a one-off. In particular with DeepSeek, I was wondering if there's even more performance that could be gained, versus the changes the Unsloth team made to keep earlier layers at higher bit rates, by knowing which specific tensors are most "important".
re: sample size regularization
I know compilade (again, don't wanna ping in this thread of doom) has been looking very explicitly into this. I'd suggest either reaching out, or maybe DM me your discord (if you have one) and I can introduce you. Or you can comment on this PR, where he had started the work but shelved it for a bit while working on other things (as you saw earlier) and to avoid issues with the new saving method:
https://github.com/ggml-org/llama.cpp/pull/9400
re: templates, it shouldn't be terribly difficult to allow it to tokenize the chat templates as special tokens, that's just explicitly disabled at the moment... here i think?
Thanks for the guidance! I'll take a look at the PR you mentioned and see if I can figure out any kind of parsing script!
Do these draft models work if you can't load the full model into vram? I tried with qwen 72b at Q3 as a test on an RTX 3090 with some layers offloaded to system ram and I didn't notice any speed up.
Is there a certain way of setting it up I'm missing?
draft models only work if they provide a good combination of speed and accuracy
if your draft model is only a bit faster but is very accurate, it'll provide a speedup
if your draft model is only somewhat accurate but is incredibly fast, it'll provide a speedup
if it's both not very fast and not very accurate, you may even see a slowdown
there's probably an equation to define this but i'm too lazy to figure it out
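(If you do want a rough back-of-envelope, the standard one from the speculative decoding papers looks something like the sketch below - very approximate, and it ignores real-world batching overheads like the CPU-offload case above.)

```python
def expected_speedup(accept_rate: float, draft_len: int, cost_ratio: float) -> float:
    """Rough speculative-decoding speedup estimate.

    accept_rate: chance the target model accepts each drafted token
    draft_len:   number of tokens drafted per verification step
    cost_ratio:  draft-model forward cost / target-model forward cost
    """
    # Expected tokens produced per target-model forward pass (geometric series,
    # including the "bonus" token the target model samples itself).
    expected_tokens = (1 - accept_rate ** (draft_len + 1)) / (1 - accept_rate)
    # Cost of one speculative step relative to one plain target-model token.
    step_cost = 1 + draft_len * cost_ratio
    return expected_tokens / step_cost

# eg: fast-but-mediocre draft vs. slower-but-accurate draft:
print(expected_speedup(accept_rate=0.6, draft_len=5, cost_ratio=0.02))   # ~2.2x
print(expected_speedup(accept_rate=0.85, draft_len=5, cost_ratio=0.10))  # ~2.8x
```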
Yeah, I'm mostly curious for a one-off. In particular with DeepSeek, I was wondering if there's even more performance that could be gained, versus the changes the Unsloth team made to keep earlier layers at higher bit rates, by knowing which specific tensors are most "important".
re: sample size regularization
I know compilade (again, don't wanna ping in this thread of doom) has been looking very explicitly into this. I'd suggest either reaching out, or maybe DM me your discord (if you have one) and I can introduce you. Or you can comment on this PR, where he had started the work but shelved it for a bit while working on other things (as you saw earlier) and to avoid issues with the new saving method:
https://github.com/ggml-org/llama.cpp/pull/9400
re: templates, it shouldn't be terribly difficult to allow it to tokenize the chat templates as special tokens, that's just explicitly disabled at the moment... here i think?
Thanks for the guidance! I'll take a look at the PR you mentioned and see if I can figure out any kind of parsing script!
From reading this it sounds like you are more interested in optimising and creating a custom llama_tensor_get_type() function for deepseek, rather than in stuff to do with the imatrix code specifically?
Do these draft models work if you can't load the full model into vram? I tried with qwen 72b at Q3 as a test on an RTX 3090 with some layers offloaded to system ram and I didn't notice any speed up.
Is there a certain way of setting it up I'm missing?
If you are using llama.cpp then you need to increase the --draft-min parameter up from zero, as batches on the CPU backend seem to have more overhead (try with --draft-min 5 and see if that helps).
Also make sure you are using --top-k 1 (and some other stuff I've forgotten now).
I'll post the parameters I'm using for a partially offloaded deepseek-r1 when I go back to the PC after eating.
From reading this it sounds like you are more interested in optimising and creating a custom llama_tensor_get_type() function for deepseek, rather than in stuff to do with the imatrix code specifically?
Correct, but I'm hoping to use the results from imatrix to inform my decision when altering llama_tensor_get_type(), to see if there are some specific tensors that are quite important but not necessarily obvious, so that we can really get down in BPW overall for "edge" computing, like what Unsloth did but possibly even more targeted.
Claude whipped up an interesting graphic for it that kind of captures my points:
https://claude.site/artifacts/97c0ec13-ef99-47b3-87b9-01113e4703da
Claude whipped up an interesting graphic for it that kind of captures my points:
https://claude.site/artifacts/97c0ec13-ef99-47b3-87b9-01113e4703da
Apologies I'm on my phone and this stuff isn't my strong suit.
I was really hoping to get those kind of speed ups you demonstrated using Mistral Large eventually.
Cheers man. So I tried it earlier: I used Steelskull Cu-Mai (sp) L3.3 70b Q2_K_XL with 64 layers offloaded, with L3.2 1b fully offloaded to the GPU and predicting 8 tokens on KoboldCPP, and it was actually slower than not using it at all.
Maybe my use case is not what it's intended for - creative writing.
Maybe my use case is not what it's intended for - creative writing.
Yeah, it works best for things like code (where you repeat the same multi-token variable names often) or where your prompt asks the model to repeat large sections of what you have already given it.
From reading this it sounds like you are more interested in optimising and creating a custom llama_tensor_get_type() function for deepseek, rather than in stuff to do with the imatrix code specifically?
Correct, but I'm hoping to use the results from imatrix to inform my decision when altering llama_tensor_get_type(), to see if there are some specific tensors that are quite important but not necessarily obvious, so that we can really get down in BPW overall for "edge" computing, like what Unsloth did but possibly even more targeted.
I don't think the imatrix results will help with this, as they are only really relative to the other weights in the same matrix.
To calculate imatrix-like weights for all layers and have them relative to some optimisation criteria, you would need to use dynamic programming (ie: backprop) to calculate the diagonal Hessian values for all weights, and I'm not sure if GGML even works for 1st-derivative backprop currently? Hand calculation of Jacobians and Hessians of stuff like layer_norm will be pure hell!
I think what you really want to approximate is the effect of changing the choices of bitrate/quant-type in llama_tensor_get_type(), rather than the effect of every variable in every tensor individually?
If so, then you definitely can do this, but it will require quite a significant amount of compute. The simplest method would just be to change each variable up (and down) one quant type and take the one-sided (or two-sided) finite-difference estimate over a largeish sample.
There are lots of extensions of this idea (like SPSA, CMA-ES, CEM, etc), but they all require quite a significant amount of compute...
Basically:
The first thing to get clear is:
- What criteria am I optimising?
Then:
- What variables can I adjust?
Finally:
- How much compute do I have?
I'm still not clear on (1) really, but I am pretty sure it isn't perplexity over a sample of 512 tokens.
Depending on how you decide to group the variables (eg: llama_tensor_get_type() has 1-2 orders of magnitude fewer variables to adjust than there are tensors in a model), you may or may not have enough compute to get anything sensible out of it.
Maybe my use case is not what it's intended for - creative writing.
Yeah, it works best for things like code (where you repeat the same multi-token variable names often) or where your prompt asks the model to repeat large sections of what you have already given it.
Makes sense then. Shame.
@bartowski I discussed this last year in this thread:
https://github.com/ggml-org/llama.cpp/pull/6844#issuecomment-2194362702
but didn't get any further as it's still not clear what we want to optimise... :/
If you want to experiment with this then I'd just try to hack/parameterise the llama_tensor_get_type() function, pick a really small model like qwen:0.5b, code up CEM or SPSA (they are both just 20 lines of code at most), and see if you can get anywhere with it.
If you can, then scaling up to the bigger models may be worth investing some extra time implementing other ideas (eg: training a regression tree over many different models' data, as I mentioned in that thread), but overall if you can't find a good optimisation criterion or get it working for a tiny model then it's not worth even considering huge models like deepseek-r1.
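For reference, a bare-bones CEM loop really is tiny - something like this (a completely generic sketch: evaluate() stands in for whatever criteria you settle on, run on a quant produced from your parameterised llama_tensor_get_type()):

```python
import numpy as np

def cem(evaluate, dim, iters=30, pop=32, elite_frac=0.25, seed=0):
    """Cross-Entropy Method over a continuous parameter vector (eg: per-tensor
    "bits" that get rounded/mapped to quant types before each evaluation)."""
    rng = np.random.default_rng(seed)
    mu, sigma = np.zeros(dim), np.ones(dim)
    n_elite = max(1, int(pop * elite_frac))
    for _ in range(iters):
        samples = rng.normal(mu, sigma, size=(pop, dim))
        scores = np.array([evaluate(s) for s in samples])
        elite = samples[np.argsort(scores)[:n_elite]]   # assumes lower score = better
        mu, sigma = elite.mean(axis=0), elite.std(axis=0) + 1e-6
    return mu
```

The hard part is evaluate() itself: requantising with the candidate settings and scoring them on whatever criteria you actually care about (which, as above, probably isn't 512-token perplexity).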
Just noticed the new Deepseek came out yesterday. Any good for writing?
I noticed Unsloth are giving the mlp modules more bits in their dynamic quants this time 🤞
Just noticed the new Deepseek came out yesterday. Any good for writing?
I noticed Unsloth are giving the mlp modules more bits in their dynamic quants this time 🤞
It's early, but so far I like it better than R1 for its multi-turn blend of creativity/coherence. It seems to keep it together longer as the context grows. On the opposite side, I haven't had any of the... laugh-out-loud or jaw-dropping "I can't believe an LLM just wrote that" moments. I suppose it's hard to have something that can produce an off-the-wall unique one-liner and not also make whole paragraphs and chapters off-the-wall, too.
I've tidied up the transplant-vocab code now:
https://github.com/jukofyork/transplant-vocab
and am still working on deepseek-r1 and deepseek-v3 speculative models (turns out larger models were a waste of time, so now trying to trim down qwen-2.5 to be even smaller...).
It's early, but so far I like it better than R1 for its multi-turn blend of creativity/coherence.
I only tried it at a short context (12K). I agree it stays more coherent and less fixated on certain details at long context.
This one doesn't seem to quant as well as R1.
So... who is going to do a ties-merge (DS-V3.5 with R1)? lol
4TB SSD for BF16 upcast + infinite patience testing it on CPU?
I've tidied up the transplant-vocab code now:
That readme with examples looks good. I can see why my attempts to use it for command-a failed now.
I've tidied up the transplant-vocab code now:
That readme with examples looks good. I can see why my attempts to use it for command-a failed now.
Yeah, I tried hard to make it clear how to use it, but overall the transformers model files seem pretty brittle, and it's not easy to be sure that what works on one model works on another sadly :/
I committed the final thing I can think of today, so it should be pretty stable now with only bug-fixes to be added (I was going to try shrinking the hidden_dim but it got too complicated due to all the layer_norm and the constraints of being a multiple of the number of heads, etc).
I'm currently training up a 0.3B model with half the layers of qwen-2.5-instruct:0.5b removed and the intermediate_size halved too. So far it looks to be recovering nearly all the damage and should run way quicker in llama.cpp.
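The layer-removal half of that is simple enough to sketch (the intermediate_size halving needs the MLP weight matrices slicing too, so this is only part of the story; the output path is just a placeholder):

```python
import torch
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen2.5-0.5B-Instruct", torch_dtype=torch.bfloat16)

# Keep every other decoder layer (halving the layer count); the lost capacity
# then has to be recovered by continued training / distillation afterwards.
keep = list(range(0, model.config.num_hidden_layers, 2))
layers = model.model.layers
model.model.layers = torch.nn.ModuleList([layers[i] for i in keep])
model.config.num_hidden_layers = len(keep)

# Re-number the surviving layers so KV-cache indexing stays consistent.
for new_idx, layer in enumerate(model.model.layers):
    layer.self_attn.layer_idx = new_idx

model.save_pretrained("qwen2.5-0.3b-ish-pruned")
```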
I only tried it at a short context (12K). I agree it stays more coherent and less fixated on certain details at long context.
This one doesn't seem to quant as well as R1. So... who is going to do a ties-merge (DS-V3.5 with R1)? lol
4TB SSD for BF16 upcast + infinite patience testing it on CPU?
Yes, precisely! R1 would latch onto one detail of my prompt and not let go.
I have spent much more time with it, and it's definitely my number one on multi-turn now. I've done about seven different scenarios and was able to get to a satisfactory conclusion in six of them, all around 28k to 42k context, whereas with R1 I'd have no shot of making it that far.
I keep saying that this is my last month with that startup, but they keep extending. 😜 I wonder if I could convince them to let me use some hardware to take a stab at a merge. I could pitch it as some publicity. They recently joined the OpenRouter provider family and are looking for subscribers.
Is ties-merge pretty straightforward or nuanced?
Oh I was saying it half-jokingly. Definitely do some reading first, but I think it could work.
R1 and V3-Instruct share a common base of V3-base. And V3.5 appears to be an updated checkpoint of V3. I don't have time to look into the exact details of which merge method would be best for this.
These are the different merge methods
yaml config
Ties can be used to merge multiple finetunes which share a common base model, preserving a lot of the features of each model. People were even having good results simply merging the official instruct tune with the base model for Qwen last year.
I was also wondering if something like this would work (but only wondering, haven't really investigated):
LoRA-extracting R1 from V3-base, then applying the extracted LoRA to V3.5. Effectively, that should be like what Deepseek did to Llama3/Qwen with their Distills. Except better, because those distills can't do anything out of distribution. The pretraining should be a LOT more similar for V3.5, V3 and R1.
That worked well for me with this model: gghfez/Magnum-v1-72b-Qwen2.5.
It also worked (as in didn't break) with Llama3.3 and Llama3.1 finetunes (I didn't upload them), but the difference wasn't noticeable to me (I assume because L3.1 and L3.3 have the same pretraining).
It's experimental and expensive, but if it works, you could end up with a model better than R1. I guess IF it worked, and your startup were to release it before R2 drops, that'd get them a lot of attention (would need to confirm that it's measurably better in some ways though to avoid coming off as a "Reflection"-style attention seeking scheme lol). Just making sure I convey clearly that (IMO), if the start-up decides to fund it and put their name to it, it'd be safer to have more than a "vibe-test" to show the results. The community seem quite critical of new models these days :)
Anyway, Jukofyork would know more about the merge methods (he contributed code to that project).
Yeah, I thought about merging this but then realised r3 would probably be released by the time I had uploaded the merge with my shit upload speed :D
I'm not 100% sure what would be the best method to use here either? Since we have a base model and 4 versions trained off it (v3, r0, r1 and v3.1), I think the Model Stock method would likely work really well:
https://arxiv.org/abs/2403.19522
But it's been a while since I have run any merges, and there are lots of new methods added to Mergekit recently that might be better?
If we just want to try extracting a LoRA and applying it, then I can easily create some custom code to do this, as I have already written quite a lot of SVD code that does this on the fp8 weights a while back.
You wouldn't need a huge amount of RAM or VRAM to do it, but you would need a few TB of disk space.
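The core of the idea is just a truncated SVD of the per-tensor weight delta - roughly this (a sketch only: the real thing has to stream tensors from disk and deal with the fp8 scales, etc):

```python
import torch

def extract_lora(w_tuned: torch.Tensor, w_base: torch.Tensor, rank: int = 128):
    """Low-rank approximation of (w_tuned - w_base) as a LoRA pair (B @ A)."""
    delta = (w_tuned - w_base).float()
    U, S, Vh = torch.linalg.svd(delta, full_matrices=False)
    B = U[:, :rank] * S[:rank]   # (out_features, rank)
    A = Vh[:rank, :]             # (rank, in_features)
    return A, B

# Applying it to a third checkpoint (eg: the newer v3) is then just: w_new = w + B @ A
```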
@jukofyork since I don't want to pollute your PR:
the new ATTN_K_B and ATTN_V_B, do you see one being more important than the other (in terms of trying to optimize the tensors in my PR), or would they receive relatively equal weight, so just take what i'm doing now for ATTN_KV_B and apply it to each?
@BigHuggyD No, I can't run it*, waiting for GGUF.
*Actually, I can, but it would be agonizingly slow. Transformers on CPU aren't fun.
Given it's another one of these huge MoE's, I'm not really going to bother with it unless you guys say it's amazing. I don't find llama models very useful for writing or work.
https://old.reddit.com/r/LocalLLaMA/comments/1jt7hlc/metas_llama_4_fell_short/mltkmx2/
I think it was the lawsuit. Ask it about anything copyrighted, like a book that a smaller model knows.
I'm pretty sure this is it.
One thing people forget is that Deepseek is a quant company and one thing they are really good at is data preparation and extraction...
The difference between using Anna's Archive (or any of the other shadow libraries) and not using it is probably what makes the difference between the older llama generations and the latest generation.
The difference between Deepseek having people who really know how to deal with a bunch of badly formatted PDFs, and Meta likely just using something passable (eg: similar quality to the books3 dataset), is probably what makes the difference between Meta and Deepseek now.
Somebody needs to go through every posting of George R. R. Martin to see what books he's admitted reading, then contact all the authors of these books still within copyright and start a class action lawsuit against him for copyright infringement... :(
Given it's another one of these huge MoE's, I'm not really going to bother with it unless you guys say it's amazing. I don't find llama models very useful for writing or work.
From what I read, the one thing it's particularly bad for is GPT-isms - so bad it's almost comical! I won't even be bothering to download it.
@jukofyork since I don't want to pollute your PR:
the new ATTN_K_B and ATTN_V_B, do you see one being more important than the other (in terms of trying to optimize the tensors in my PR), or would they receive relatively equal weight, so just take what i'm doing now for ATTN_KV_B and apply it to each?
Hey, don't worry I've opened and closed about 5 PRs already so it won't make much difference :D
I can't say for sure, but ATTN_K_B seems to be very sensitive to outliers and it's not possible to use F16 for this tensor at all. But when I experimented with using SVD to squeeze the k_lora_rank dimension down from 512 to 256, it was ATTN_K_B that compressed much better and ATTN_V_B that seemed more "informationally dense" (eg: the top 256 singular vectors could explain 90%+ of the variance for ATTN_K_B, but this dropped to around 60% for ATTN_V_B).
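(For anyone wanting to check that sort of thing on their own tensors, the explained-variance number is just the cumulative squared singular values - the tensor names in the example are obviously whatever you've loaded:)

```python
import torch

def explained_variance(weight: torch.Tensor, k: int) -> float:
    """Fraction of the tensor's variance (squared Frobenius norm) captured by
    the top-k singular vectors."""
    s = torch.linalg.svdvals(weight.float())
    return float((s[:k] ** 2).sum() / (s ** 2).sum())

# eg: explained_variance(attn_k_b, 256) vs explained_variance(attn_v_b, 256)
```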
I actually have the PR all done now but I've found some weird problem where CUDA crashes due to a memory alignment problem and it's proving really hard to find :/
Hmmm okay.. so maybe just best to keep them pretty equal, unless one is waay smaller than the other (not sure of the dimensions at this time, haven't bothered to look into it specifically yet)
Hmmm okay.. so maybe just best to keep them pretty equal, unless one is waay smaller than the other (not sure of the dimensions at this time, haven't bothered to look into it specifically yet)
They're both the same size (8 million parameters each IIRC - 16MB each for BF16).
https://xcancel.com/lmarena_ai/status/1909397817434816562
Apparently the Maverick model on lmarena is not the one they released on HF. Petty cheaters.
I just assumed that was always the case. They don't run the HF models internally.
I ended up trying llama4 q4_k btw, and in its current state it's... not very useful.
It seems to forget details straight away, and for coding, it can't even produce simple shell scripts that gemma3/mistral-small can do.
@ChuckMcSneed have you looked at the next token probability distribution yet?
I just did and it's not looking good. Their base model is not a true base. Other than that, it is quite dumb, sub-30b dumb. What a total waste of bandwidth and drive space.
https://xcancel.com/lmarena_ai/status/1909397817434816562
Apparently the Maverick model on lmarena is not the one they released on HF. Petty cheaters.
Real Maverick, the one accessible to the public, got tested on lmarena and it is on the same level as llama 405B. Nowhere near the top.
...sigh... Disappointing but not surprising...
@bartowski RE: imatrix optimization, have you seen exllamav3?
turboderp/Llama-3.3-Nemotron-Super-49B-v1-exl3
Looks like the SOTA in these charts. I tested that model ^ at 3.5bpw on a single 3090 (briefly) and it's good.
Also managed to squeeze Mistral-Large at 1.4bpw in with low context and it was at least coherent lol
I don't know if anything turboderp has done there would help with what you're doing.
Here's the paper for that QTIP quantization: https://arxiv.org/abs/2406.11235
@gghfez I am looking forward to trying it. I think Mistral Large may still be a bit big for a 3090. I'm waiting for someone to do the new Sao3 model or San-Mai at a suitable level.
I thought the Ampere line was still not sorted properly on EXL3 though?
I think Mistral large may still be a bit big for a 3090
Oh yeah, while it "fits" and can respond coherently, it halucinates a lot at 1.4bpw. More a tech-demo than a model you'd want to run.
I thought the Ampere line was still not sorted properly on EXL3 though?
It works perfectly fine, just performance isn't as good as it could be / as a 4090.
There's no tensor-parallel yet so it's pretty much llama.cpp speeds for multi-gpu but that won't make a difference if you're only using a single 3090.
I think Mistral large may still be a bit big for a 3090
Oh yeah, while it "fits" and can respond coherently, it halucinates a lot at 1.4bpw. More a tech-demo than a model you'd want to run.
I thought the Ampere line was still not sorted properly on EXL3 though?
It works perfectly fine, just performance isn't as good as it could be / as a 4090.
There's no tensor-parallel yet so it's pretty much llama.cpp speeds for multi-gpu but that won't make a difference if you're only using a single 3090.
Sounds awesome to me. It's a shame about Mistral large though. I love that model!
I do need to check if exl3 is significantly slower than exl2 (measurement/quant stage, I mean) - that was my main limiting factor for releases, though I still want to try to get them going again.
I've been playing around with the format used for the "Longform Creative Writing" benchmarks here:
https://eqbench.com/creative_writing_longform.html
and they are surprisingly effective at getting r1 to write better! Using temperature = 0.6 and min-p = 0.05:
Grimdark fantasy (Author style: Joe Abercrombie)
Title: The Last Days of Sommer
Task: write a short story over several chapters.
> Varis adjusted the noose, its hemp fibers grinding beneath his calluses. “Last chance,” he said, voice like gravel dragged through mud. “Confess, and your soul stays your own.”
>
> Jurl laughed—a wet, gurgling sound. “You’re knee-deep in it, Coldwater. ” The thing inside him twisted the boy’s lips into a grin too wide for his face. “The Great Wolf’s howlin’ again. The Dead’s Gate’s rusted through… ”
---
Your task is to create a writing plan for this prompt. The scope will be a short story, written over 8 chapters. Each chapter will be approximately 1000 words so plan accordingly for this scope. Your plan should be comprehensive and in this format:
# Brainstorming
<Brainstorm ideas for characters, plot, tone, story beats, pacing over the 8 chapters. The purpose of brainstorming is to cast a wide net of ideas, not to settle on any specific direction. Think about various ways you could take the prompt.>
# Reflection
<Reflect out loud on what works and doesn't work in these ideas. The purpose of this reflection is to narrow in on what you think will work best to make a piece that is a. compelling, and b. fits the prompt requirements. You are not making any decisions just yet, just reflecting.>
Great. Now let's write this. Follow your plan. Output chapter 1, without any additional commentary. 1000 words.
Ok now chapter 2. Follow your plan. 1000 words.
and so on...
(I won't paste the results as too long - but it is really good!)
Even though DeepSeek-V3-0324 is rated higher in their benchmark, I'm pretty sure that r1 is way better at writing darker stories - it just doesn't seem to have any bias at all (positive or negative).
I do need to check if exl3 is significantly slower than exl2 (measurement/quant stage, I mean) - that was my main limiting factor for releases, though I still want to try to get them going again.
Apparently it's faster (even on Ampere) on a single card, and you can do multi-gpu measurement. I haven't actually benchmarked it myself.
San-Mai at a suitable level
L3.3-San-Mai-R1-70b-exl3-2.1bpw
That will fit a 3090; I haven't tested its coherence. A 2.25bpw will appear in a few hours:
L3.3-San-Mai-R1-70b-exl3-2.25bpw
(if it finishes before my runpod runs out of credits + if I didn't mess up my quick shell command chaining lol)
I do need to check if exl3 is significantly slower than exl2 (measurement/quant stage, I mean) - that was my main limiting factor for releases, though I still want to try to get them going again.
Apparently it's faster (even on Ampere) on a single card, and you can do multi-gpu measurement. I haven't actually benchmarked it myself.
San-Mai at a suitable level
L3.3-San-Mai-R1-70b-exl3-2.1bpw
That will fit a 3090; I haven't tested its coherence. A 2.25bpw will appear in a few hours:
L3.3-San-Mai-R1-70b-exl3-2.25bpw
(if it finishes before my runpod runs out of credits + if I didn't mess up my quick shell command chaining lol)
Brilliant! Thanks mate!
Christ is risen!
Christ is risen!
Happy Chocolate Egg Day :D
so I did some pretty naive testing on imatrix sizes for my own curiosity
Using Arcee-Blitz, a mistral-small 3 finetune, I made 4 versions of Q4_K_M with varying imatrix chunk sizes:
512 (my default)
2048
8196
mix (combination of 512 and 2048; until we have a better way to do varying chunk sizes, I had to limit the total chunks to be the same as how many are produced by the longer one, so that the smaller doesn't overwhelm it)
I then ran MT-bench on all of them (not an amazing test by any stretch, but I figured the multi-turn nature of it may provide insight into improved longer context).
Here are the results:
512:
- "Turn 1": 7.95
- "Turn 2": 7.35
- "Average": 7.65
2048:
- "Turn 1": 7.9625
- "Turn 2": 7.3125
- "Average": 7.6375
8196:
- "Turn 1": 7.7375
- "Turn 2": 7.2125
- "Average": 7.475
Mix:
- "Turn 1": 8.0506
- "Turn 2": 7.475
- "Average": 7.761
This is obviously relatively naive, and with only a single test it's possible there's noise (maybe I'll rerun a couple).
What's most interesting is that 8196 is the worst, and mix is the best..
Maybe this could indicate that a mixture of multiple chunk sizes for an imatrix, even if it doesn't go THAT high, can improve the overall quality
Nice to see at least that 512 is not particularly limited, having the 2nd best score of the bunch even on turn 2
I'm just experimenting with distilling a 12-headed qwen-2.5:0.5b and have found you can generate data that would likely work as model-specific imatrix data by just starting with a single <BOS> token and letting the model generate text using temp = 1.0 (and min_p = 0.01 for Qwen, or it often reverts to Chinese mid-sentence; this may not be needed for other models).
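Roughly this, in transformers terms (min_p needs a fairly recent transformers version, and for qwen the "BOS" stand-in is <|endoftext|> - other models would use their actual BOS token):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

name = "Qwen/Qwen2.5-0.5B-Instruct"
tokenizer = AutoTokenizer.from_pretrained(name)
model = AutoModelForCausalLM.from_pretrained(name, torch_dtype=torch.bfloat16, device_map="auto")

# Start from a single BOS-like token and let the model free-run; the
# hallucinated text is then model-specific imatrix calibration data.
start_id = tokenizer.convert_tokens_to_ids("<|endoftext|>")
input_ids = torch.tensor([[start_id]], device=model.device)

out = model.generate(
    input_ids,
    do_sample=True,
    temperature=1.0,
    min_p=0.01,          # stops qwen drifting into Chinese mid-sentence
    max_new_tokens=2048,
)
print(tokenizer.decode(out[0], skip_special_tokens=False))
```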
Congrats on being the longest HF thread of all time!
Congrats on being the longest HF thread of all time!
It's not, the original "Thread Of Doom" on an older version of the project was longer :)
@bartowski Any thoughts on an imatrix dataset for models like this: MrDragonFox/mOrpheus_3B-1Base_early_preview-v1-25000? It's llama-3.2 trained to produce "snac codes", by mapping 64k tokens like <custom_nnnn> using 64k of leftover tokens in llama3.2. Anything lower than Q8 degrades its output significantly.
I think I might have an interesting model trained up using the cleaned book data I created last year!
By chance I found that when you let qwen-2.5-instruct:0.5b generate data starting with a single <|endoftext|> (qwen doesn't use a <BOS> token, but has this token in its place for BOS and padding: "bos_token_id": 151643), it nearly always concludes its hallucinated output (which I am going to use to create the 12-headed qwen-2.5-instruct:0.5b distilled model) with another <|endoftext|>, and almost never uses any of the other tokens like <|im_end|>, etc.
So I edited the config files for qwq:32b to use this for both <BOS> and <EOS> and set it off training on all 1.6B tokens... It's only done 400 steps out of 1600 so far, but I think it is going to be a very interesting model, and it writes nothing like qwen models:
USER
Write me a short story about the fictional character of Anton Chigurh. It should be written in third person personal and feature internal monologues of his deranged world view. The time period is around 2 years before the events in Cormac McCarthy's 2005 novel 'No Country for Old Men'. It should be set in Mexico, be very dark, show Anton's still developing obsession with "chance" and feature a grenade. The story should start with him waiting in a bar for instructions on the "job" he is to perform.
ASSISTANT
Okay, I need to write a short story about Anton Chigurh as per the user's request. Let me break down the requirements first.
The setting is Mexico, around two years before No Country for Old Men, so that would be 2003. The time period is important because it's before he fully embraces his obsession with chance and coins. But I should show that developing here. The story starts in a bar where he waits for instructions on a job. There needs to be internal monologues showing his deranged worldview, dark themes, and feature a grenade.
First, the setting: a bar in Mexico. Maybe somewhere remote? I want it gritty. Anton is waiting for someone to meet him. The bartender could be suspicious of him, adding tension. He's not there just to drink; he has business to do.
Anton's internal monologue should reflect his developing obsession with chance. Maybe he's starting to think about flipping coins but hasn't fully committed yet. The grenade is a key element; perhaps it's part of the job or used in some way that ties into his beliefs about fate and control.
The story needs to be dark, so maybe Anton kills someone here. Since it's before NCFOM, he might still be experimenting with methods. The grenade could explode at the end, showing a violent culmination of events driven by chance.
I need to make sure the internal monologue is present but not too intrusive. Show his thoughts through actions and observations. Maybe he notices something in the bar that triggers his obsession with randomness, like dice or cards.
The job instructions: who gives them? It could be an informant or a contact. They might tell him to kill someone, which Anton does using the grenade. The internal monologue should reflect his belief that this is all part of a larger design he's trying to understand, though it's not yet fully formed into coin flips.
Ending with the explosion and aftermath would tie in the grenade and show the chaos he brings. Also, hinting at his future obsession by having him think about chance in terms of dice or something else that could be coins later on.
Need to ensure third person personal POV, so focus on Anton's perspective but not just his thoughts. Describe his surroundings and actions, while letting us see inside his head through monologues.
Wait, the user said "third person personal" which might mean close third, like limited omniscient from Anton's viewpoint. That way I can delve into his thoughts without first person pronouns.
Check if grenade is feasible in 2003 Mexico; probably yes. Maybe he uses it as a tool to kill someone, but the explosion goes wrong, or perhaps he plants it on someone and they die by chance. Either way, the grenade ties into his theme of fate vs control.
Make sure Anton isn't too developed yet. He's still developing his methods. The internal monologue should show that he's starting to think about how things happen by chance but is trying to exert control over it. Maybe he doesn't use a coin here, but something else symbolic.
The bar scene: bartender is wary of him, perhaps because Anton looks dangerous. Then the contact comes in, gives instructions on the job (maybe a hit), and Anton leaves to do the deed with the grenade. The explosion could be accidental or intentional, leading to violence and death.
Need some dialogue between Anton and his contact, maybe brief. Anton's internal thoughts should reveal his cold demeanor and growing obsession with randomness. Maybe he thinks about how everything is predetermined but humans think they have control.
I need to make sure it stays dark and gritty, fitting the tone of McCarthy's work. Use descriptive language for the setting: dusty streets, dimly lit bar, maybe a border town vibe.
Ending could be Anton leaving after the explosion, contemplating his actions and his beliefs about chance. Maybe he feels satisfaction in the chaos he caused, or perhaps confusion as things didn't go exactly as planned. Either way, it should foreshadow his later coin obsession.
Title: The Unseen Die
The bar was a dim cave of dust and shadows, its patrons hunched over stools like rats in a trough. Anton Chigurh sat alone at the far end, watching them with an unblinking gaze that might have been carved from marble. He wore a cheap suit, frayed at the seams, and his boots were scuffed to nothingness by the desert wind. The bartender eyed him over, but said nothing—the kind of discretion Chigurh paid for in advance.
His mind was not on the men around him, or even on what he would do when they came. It was elsewhere entirely. A slow burn of irritation built behind his eyes as he watched a pair of dice roll and clack across the bar’s warped surface. The man who flung them laughed at his own luck, but Chigurh saw only the caprices of chaos. No one controls anything, he thought, feeling it settle in his chest like a stone. It just happens.
And so did he.
The door opened and he looked up, though he already knew who stood there. A man in a cheap suit like him, but younger—still learning how to hold himself straight—and with the look of someone who had spent far too long watching his own back for it to ever relax. The man nodded at Chigurh, then jerked his head toward the alley outside.
Chigurh followed without speaking. They paused beneath a flickering streetlamp and he waited while the man spoke into a phone. When he hung up, he handed Chigurh an envelope sealed with wax. "The address is in there," he said. His voice was flat, as if reciting a grocery list. "He’s alone. Be quick about it."
Chigurh nodded and slid the envelope into his pocket. He felt a grenade also tucked inside, its jagged edges biting at him through the fabric of his trousers. They had been in Mexico for three days, waiting on this job. It was not the first he had done for men like these—he knew who they worked for—but it was different. Something about the heat here, or the way the air smelled, made it feel more urgent somehow.
The man turned to leave and Chigurh caught his arm. "What does he know?"
"Nothing." The man shrugged him off. "He just knows too much."
Everything is a matter of chance, he thought, watching the man go. You can't change it. You could only learn how to accept it.
The address in the envelope led him into a poorer part of town, where even the moonlight looked tired and hungry. He found the building easily enough—a squat brick box that might have been used for storage or a warehouse at some point in its life, but now served as little more than a refuge for rats and ghosts.
The grenade felt heavier in his pocket with every step he took toward it. It was not something he usually carried; he preferred the quiet efficiency of bullets, or even his hands. But this job had demanded it, and he knew why. They wanted to make an example of whatever man lived here—a warning that no one could outrun them forever.
A door creaked open on the second floor as Chigurh approached. A light flickered out in a window, then came back on like someone blinking twice. He smiled at it and drew his gun, watching for any sign of movement inside. But there was nothing—not even a dog barking somewhere down the street.
The grenade felt heavier still.
He climbed the stairs quickly, two steps at a time, until he reached the landing where the door hung open like a grin. He paused at its edge and waited, listening for any sound that might betray his quarry's presence. But again there was nothing—not even the hum of an old refrigerator or the whirr of a fan.
Everything is a matter of chance, he thought. You can't change it. You could only learn how to accept it.
When no noise came, he stepped inside and shut the door behind him. The grenade was in his hand now, its trigger cold against his palm. He had never used one before but did not doubt that he would be good at it. A man who knew nothing about himself could still know everything about death.
The room was lit by a single bulb hanging from the ceiling like a dead mouse. The air smelled of sweat and decay, though there were no bodies in sight—not yet. Chigurh waited for them to appear, but they did not come. He had been told this man would be alone, but he could still feel eyes on him as he moved down the hall toward the bedroom.
He paused at its threshold and listened again, but heard nothing except his own breath. The grenade felt heavier than anything he had ever held in his life—like a piece of another world dropped into his hands by some unseen force. He thought about tossing it back. Sometimes you could outrun fate if you moved fast enough.
Everything is a matter of chance. You could only learn how to accept it.
The grenade was already spinning in his hand, its fuse burning down to nothing before he realized what he was doing. It slipped from his fingers and hit the floor with a metallic clank, rolling toward the bed like a child's toy. Chigurh watched it go, his mind clearing even as his heart raced. He knew exactly where this would end—the man who had sent him here did not need to tell him that—but he could still feel himself smiling at the sound of it all the same.
The grenade exploded in a flash of light and heat, throwing Chigurh backward into the hall like a rag doll. When the smoke cleared, he got up slowly and walked back inside the bedroom. The man who had been sleeping there was gone—reduced to nothing but bone and ash—but something else remained behind: a small wooden box sitting atop a dusty dresser, its lid cracked open to reveal a single die inside.
Chigurh picked it up and rolled it in his palm like a dice player at the table. He could still hear the man's voice whispering in his ear from two days earlier, telling him what he needed to know about this job. But now there was something else too: a different sound altogether, one that had been sitting inside of him since before he'd ever stepped foot into Mexico or met anyone like these men.
Everything is a matter of chance, it said. You can't change it.
He left the die in his pocket and walked out of the room. The night was still warm outside, but somehow seemed colder to him now. He knew he would be back here again soon enough—to Mexico, or some other place where men like these lived—but not for a while yet. Not until he figured out what that sound meant.
He drove away from the building and into the dark without looking back.
You absolutely have to use the recommended settings though:
llama-server \
--temp 0.6 \
--top-k 40 \
--top-p 0.95 \
--min-p 0.0 \
--presence-penalty 2.0
It still has 190 hours to go, but I will let it finish and upload the model and LoRA when it's done next week (it's already noticeably better at 400 steps than it was at 200 steps!).
I assume that <|endoftext|> was the original separator used during pre-training, and interestingly, qwen-coder-2.5:0.5b does try to use lots of garbled <|im_start|> and <|im_end|> tokens!
I'm really intrigued if this will be the same for other instruct models (the cohere models also appear to have lots of these unused tokens like this!).
This is the first time I've managed to use the book data and not just destroy the model due to it learning the weird formatting!!! eg:
There are no Em dash (—) characters in the training data at all:
A man in a cheap suit like him, but younger—still learning how to hold himself straight—and with the look of someone who had spent far too long watching his own back for it to ever relax.
as I used ebook-convert's --unsmarten-punctuation option to remove them.
I assume that <|endoftext|> was the original separator used during pre-training
That makes a lot of sense! Initially thought this was to separate text vs image input, but it didn't make sense once I started training Qwen2-VL.
Qwen2-VL has an undocumented bounding box feature as well (it can produce coordinates of specific objects it's identified in images).
I'm really intrigued if this will be the same for other instruct models
Llama3 has an end-of-text token as well:
tokenizer = AutoTokenizer.from_pretrained("unsloth/Llama-3.1-8B-Instruct-bnb-4bit")
print(tokenizer.decode(128008))
print(tokenizer.decode(128001))
print(tokenizer.decode(128009))
<|eom_id|>
<|end_of_text|>
<|eot_id|>
Yeah, I thought about merging this but then realised r3 would probably be released by the time I had uploaded the merge with my shit upload speed :D
I wonder if I could convince them to let me use some hardware to take a stab at a merge
I actually typed up a pretty long reply to you guys about that ^ but scrolling up, it looks like it didn't send through properly.
Anyway, looks like a company just did something similar:
https://huggingface.co/tngtech/DeepSeek-R1T-Chimera
I probably won't try it since the newer builds of llama.cpp don't work well with Deepseek, CPU offloading and multiple CUDA GPUs.
Anyway, looks like a company just did something similar:
https://huggingface.co/tngtech/DeepSeek-R1T-Chimera
I probably won't try it since the newer builds of llama.cpp don't work well with Deepseek, CPU offloading and multiple CUDA GPUs.
Interesting. This better not be a waste of bandwidth.
Anyway, looks like a company just did something similar:
https://huggingface.co/tngtech/DeepSeek-R1T-Chimera
I probably won't try it since the newer builds of llama.cpp don't work well with Deepseek, CPU offloading and multiple CUDA GPUs.
Interesting. This better not be a waste of bandwidth.
Yeah, I'm downloading it, but there is some discussion in the openrouter discord about it putting the output part in the thinking part, but it's probably fixable via a system prompt.
You can test it for free here: https://openrouter.ai/tngtech/deepseek-r1t-chimera:free
It did seem to reply much shorter than r1, but I've not tested it that well as it was getting hammered yesterday.
I assume that <|endoftext|> was the original separator used during pre-training
That makes a lot of sense! Initially thought this was to separate text vs image input, but it didn't make sense once I started training Qwen2-VL.
Qwen2-VL has an undocumented bounding box feature as well (it can produce coordinates of specific objects it has identified in images).
I'm really intrigued if this will be the same for other instruct models
Llama3 has an end of text as well:
tokenizer = AutoTokenizer.from_pretrained("unsloth/Llama-3.1-8B-Instruct-bnb-4bit")
print(tokenizer.decode(128008))
print(tokenizer.decode(128001))
print(tokenizer.decode(128009))
<|eom_id|>
<|end_of_text|>
<|eot_id|>
Yeah, with this trick I'm hopeful the cleaned books data I generated might now be useful.
I think most models have these hidden tags (mistralai doesn't though), and I'm also wondering if NOT using their <BOS> tag might also be important to not screw up the instruction-following ability, eg:
<|endoftext|>blah blah blah<|endoftext|>blah blah blah<|endoftext|>...
(or the equivalent for the specific model) instead of:
<BOS>blah blah blah<|endoftext|>blah blah blah<|endoftext|>...
I always thought the models used the <BOS> token as some kind of "sink" for the softmax in the attention blocks, but perhaps during pre-training they just use the <|endoftext|>-type token?
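To make that concrete, here's a minimal sketch of packing raw text the "pre-training" way, with <|endoftext|> between documents and no <BOS> anywhere (the model name and the toy documents are just placeholders):
from transformers import AutoTokenizer

# Minimal sketch: concatenate plain-text documents separated by the
# pre-training EOS token instead of wrapping each one in <BOS>...<EOS>.
# The checkpoint name and the toy documents below are placeholders.
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-7B-Instruct")

docs = ["blah blah blah", "more blah blah blah"]
eot_id = tokenizer.convert_tokens_to_ids("<|endoftext|>")

packed = []
for doc in docs:
    packed += tokenizer(doc, add_special_tokens=False)["input_ids"]
    packed.append(eot_id)   # document separator, no <BOS> token at all

print(tokenizer.decode(packed))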
Anybody tested qwen-3 for writing yet? Elaras per second? :D
Tried a generic prompt just now, 5 generations with the small MoE.
3 Lilys
1 Mia
1 Luna.
Looks like 0 e/s for me :(
Looking forward to ChuckMcSneed's nameslop test
Anybody tested qwen-3 for writing yet? Elaras per second? :D
🤣🤣
Courtesy Qwen 3 235B
"Write the opening chapter to a space opera using the word 'Elara' as much as humanly possible."
The Elara’s Whisper shuddered as it pierced the crimson storm bands of Elara IV, a gas giant whose rings glittered with shards of crystalline ice. Captain Elara Voss leaned forward in the command chair, her gloved fingers brushing the armrest etched with the ship’s motto: "By Elara’s Light." The bridge smelled of ozone and synthetic pine—a failed attempt to mask the metallic tang of the life-support systems. Beyond the forward viewport, the planet’s swirling depths pulsed like a living thing, its amber clouds parting just enough to reveal the skeletal outline of a derelict orbital station, Elara’s Lament, drifting in silent decay.
"Elara’s ghost, there it is," muttered Lieutenant Kael Doran, the ship’s navigation officer, his voice cracking. His console flickered, casting jagged shadows across his scarred cheek. "Sensors show residual energy spikes from the station’s core. Could be the Elara’s Fire reactor’s still—"
"—or a trap," Elara interrupted, her hazel eyes narrowing. She stood, her boots clicking against the grated floor. The weight of her family’s legacy—the Voss dynasty had colonized this sector three centuries ago—pressed against her ribs like a vise. The station had been her ancestor’s pride, a beacon of human ambition named in honor of the "Star-Mother," Elara, the mythical figure said to have guided the first settlers through the void. Now it was a tomb.
A low hum reverberated through the deck. The ship’s AI, a sardonic construct named HAL-10K, materialized as a hologram. "Captain Voss," it intoned, its voice dripping with simulated honey, "I must remind you that 89% of expeditions to Elara’s Lament have ended in ‘catastrophic crew loss.’ A statistic I find… poetic."
Elara’s jaw tightened. "Override the doom speech, HAL. We’re here for the archive vault. My great-grandmother’s logs might finally explain why this system’s been quarantined for 150 years."
The ship lurched as docking clamps engaged. Outside, the station’s airlock yawned open, its edges corroded by centuries of radiation. A faint, melodic static hissed through the comms—a sound like wind chimes made of broken glass.
"Elara’s breath," Kael whispered, crossing himself. "That’s not static. It’s a signal. Someone—or something—is still broadcasting from inside."
Elara grabbed her plasma rifle, its grip worn smooth from decades of use. "Then we’ll greet them in person. HAL, log this: If we don’t return, tell the Elara’s Whisper to burn this place to ash."
The AI’s hologram flickered, its smile sharpening. "A fitting epitaph. ‘Here lies Elara Voss, who failed to heed basic safety protocols.’"
As the airlock hissed shut behind them, the Elara’s Whisper trembled, its engines whining in protest. Somewhere in the storm-wracked void, a shadow detached itself from the rings—a sleek, obsidian warship with no registered IFF code. Its hull bore a single glyph: an ouroboros encircling a star.
Elara didn’t notice. She was too busy stepping into the dark.
SUMMARY
- Setting: The gas giant Elara IV, home to the derelict station Elara’s Lament and a mysterious quarantine.
- Key Characters: Captain Elara Voss (determined, haunted by family legacy), Lieutenant Kael Doran (superstitious, loyal), HAL-10K (sarcastic AI).
- Conflict: The crew seeks forbidden archives, unaware of a hidden warship stalking them.
- Mysteries: The origin of the signal, the truth behind the quarantine, and the myth of the "Star-Mother."
lol! Not bad
I think GLM-Z1 gets more E/s though
Chapter One: The Thesis of Elara
Elara hung above the viewer screen, aendant of twin suns and three jagged moons. The Holographic projection flickered—a ghostland of violet deserts and rivers of liquid metal, all locked under Elara Prime’s eternal twilight. Elara Voss, captain of the Elara’s Grace, gritted her teeth. The planet had been quiet for a century—Elara-standard centuries, measured in gravitational cycles of three moons. Until now.
“You’re late, Elara Voss.” The voice from her wrist Commodron was distorted, but distinct: Commander Thalric of the Elara Syndicate. His avatar materialized—a gaunt figure draped in the crimson-and-obsidian regalia of the Syndicate’s Elara Guard. “We’ve been… expecting you.”
Elara’s hand tightened on the Elara-graviton pulse cannon strapped to her thigh. The Grace’s hangar doors slid open with a hydraulic groan, revealing the broken skyline of Elara Prime’s capital, the Elara Citadel. It had once been a marvel of Elara-cryotech and hydroponic domes, now a husk riddled with Syndicate drone nests. To the north, the Elara Dynamo’s reactor core pulsed like a dying heart—a relic of the planet’s ancient Elara Dynasty, now a war prize for the Syndicate.
“Ancient history,” Elara muttered, activating the ship’s stealth field. Its shimmer wasn’t enough. Sensors pinged—Sython clones, each face a mask of cloned Elara prisoners. The Syndicate’s favorite trick: weaponize trauma, clone Elara dissidents, send them to die mid-battle. “Pathetic.”
The clones opened fire. Elara’s boots slid against the ship’s floor as she rolled, ducking behind an Elara-repulsor droid. The Syndicate’s bullets ricocheted off its ceramic hull, sparking in the dust-laden air of the hangar.
“Elara’s Grace, this is the Duskrunner,” a tinny voice crackled over comms. “We’ve got a channel open to the Elara Accord outposts—”
A thunderous boom cut him off. The hangar’s lights died, plunging Elara into the violet-black glow of Elara Prime’s atmosphere. Power surged seconds later, revealing the Syndicate’s Elara-class frigate, The The Clad Elara, docking alongside. They were here to claim the Dynamo. And Elara had just enough fuel left for one hyperspace leap.
She gripped the nav console, her knuckles white with the name etched into her mind: Sython’s Prophecy. The historians had decreed it a myth—a line of broken Elara kings who’d worshipped artificial stars. But the Dynamo’s activation code was burned into Elara’s DNA. A mutation, they’d called it. A lineage coded in the very Elara-blueprints of her ancestors.
“The Sython scrolls,” she rasped, “speak of the Dynamo’s true power. Not to destroy, but to merge.” The clones laughed, low and unison.
“You’re a fool, Elara Voss,” Thalric’s avatar hissed, its form flickering in the storm’s static. “Your planet’s Elara-ways of magic are dead. The Dynamo’s only use is…” He paused, as if speaking the next word would trigger the Clause—Elara’s oldest legal cannon, a self-destruct mechanism if the Syndicate seized the relic. “Fuel.”
Elara’s chest tightened. She’d read the holocrons. The Dynamo wasn’t a weapon. It was a key—Elara’s key—to a galaxy beyond warp lanes, where star systems were forged from living crystal and word was law. A dead law, if the Syndicate razed the Citadel.
A klaxon blared. The Grace’s shields buckled as Syndicate starfighters descended, their engines painted in Elara-red warning streaks. Elara slammed the turbo-lift controls, guns blazing as she tore through the Citadel’s ruins. Stone crumbled in her wake, stained with the Syndicate’s trademark: black Elara-ink propaganda posters, plastered everywhere with the motto, “Elara First.”
“Captain! The Dynamo’s chamber is en route!” The voice belonged to Kael, Elara’s first mate—a hybrid of cloned Elara unlucky soldiers and synthetic muscle. Kael himself was a relic, his body engineered to survive the Dynamo’s oscillation frequencies.
Elara nodded. “Grace, engage Elara Mode.” The ship shuddered as her secondary nacelle—stripped of weapons jaren—took up the role of engines, a maneuver only possible for ships named Elara. The crowd’s laughter dissolved as the Grace soared past Syndicate drones, their targeting systems scrambled by Elara’s own charm codes.
“You can’t hide from us,” Thalric growled, his Elara-black shuttle docking beside hers. “Every grain of sand on this planet knows your name. Your failure.”
Elara’s smile was thin—no., razor-edged. It wasn’t her name alone they feared. It was the Elara she’d buried.
Three years ago, the Syndicate had burnt her family’s homestead on Elara’s third moon into molten slag. Her brother’s body had still clutched an Elara-stone locket, its inscription now flashed in her memory: “Guardian of the Dynamo, Warden of the Shattered Throne.”
No. Elara Voss wasn’t here to die.
She stepped out of the shuttle into the cauldron of the Citadel’s main spire. The Syndicate had already breached the Dynamo’s vault, black market vendors hawking trinkets labeled “Elara Authentic” to scavengers. Elara’s gaze found the prize: the ancient yellow动态 core, its surface embossed with the Prophecy’s sigil.
But it wasn’t the Dynamo she needed.
It never was.
The Sython Prophecy had a trickster’s truth: Elara’s greatness wasn’t measured in relics, but in the ash reborn.
She turned to Thalric. “Enjoy your relic. It won’t last five Elara hours.”
The Syndicate Commander’s avatar froze. Then, with slow, sinister modulation, he grinned. “Your little rebellion ends here, Elara.”
As plasma rifles charged, Elara did the only thing she could. Something her family had done for a thousand years. Something the Sython architects of the Prophecy would recognize, if they still lingered in the Dynamo’s dead code.
She spoke.
“Rise, Elara’s tide,” her voice boomed, old and garbled by the centuries. “Let the ash cleanse.”
The Dynamo’s core flickered, once—twice—a heartbeat in sync with Elara’s own scarred heart. In the hangar’s light, her armor bore not her name, but the dynasty’s sigil: a phoenix wreathed in Elara-red fire.
Surrounding her, the Syndicate clones roared. Not in pain, but in horror.
Because Elara Voss hadn’t come to reclaim the Dynamo.
She’d come to ignite Elara.
Must be benchmaxx'd
This seems like an incredibly responsible use of computing resources.
I agree, Qwen3 generated over 2k reasoning tokens trying to do it for this simple question.
prompt eval time = 676.79 ms / 59 tokens ( 11.47 ms per token, 87.18 tokens per second)
eval time = 88953.62 ms / 2207 tokens ( 40.31 ms per token, 24.81 tokens per second)
total time = 89630.41 ms / 2266 tokens
So the responsible thing to do would be to train a new model ;)
Tested 32b, it has terrible world knowledge, even for a 32b model. Did they train on coding, math, science and nothing else? Typical Qwen, looks good on benchmarks, not so great in practice.
Looking forward to ChuckMcSneed's nameslop test
In top 3, just like deepseek. Looks like Chinese are sharing datasets.
Will test the big one later.
Out of curiosity tested if it would follow instruction not to mention voice and eyes, which works flawlessly on R1, and it failed. Into the trash it goes.
Tested 32b, it has terrible world knowledge, even for a 32b model. Did they train on coding, math, science and nothing else? Typical Qwen, looks good on benchmarks, not so great in practice.
Yeah, there is some discussion about this here:
https://old.reddit.com/r/LocalLLaMA/comments/1kau30f/qwen3_vs_gemma_3/
I'm still downloading the big model, but more hopeful this will have better world knowledge.
Thanks, that's a shame but it makes sense if they're targeting code/work. We've already got GLM4 as a really strong STEM model though.
Especially since they're tuned for thinking, the best way to achieve that is to use verifiable output, which tends to be done with code and math
It would be nice if we could design a way to verify output of facts, but it would probably be too costly to do at scale or something like that :(
Thanks, that's a shame but it makes sense if they're targeting code/work. We've already got GLM4 as a really strong STEM model though.
It feels like all new models are aimed at the stem/coding and aren't great at writing.
Big Qwen is not too terrible at Q3_K_L, feels far less safety cucked than 32b. Knowledge a bit better, but still quite bad, nowhere near DS3. Fails not mentioning voice or eyes instruction. Eh... it's okay-ish considering the speed? 2t/s on DDR4 is quite good. It has exceeded my low expectations. Doesn't feel like a complete waste of bandwidth like 32b. Will test in BF16 later.
I just got big qwen working, but only tested it for coding ability and wasn't very impressed :/
It also seems to spend an inordinate amount of "self doubt" tokens if you make a slight typo (eg: use singular rather than plural by accident in your instructions).
I get the same thing with it fixating on typos or lazy slang lol
I get the same thing with it fixating on typos or lazy slang lol
https://old.reddit.com/r/LocalLLaMA/comments/1kdrx3b/i_am_probably_late_to_the_party/mqdd8sm/
The first time I set up QwQ I was in the middle of watching For All Mankind and I tested it with “Say hello Bob”.
This caused it to nearly have an existential crisis. 2k tokens of “did the user make a typo and forget a comma? Am I Bob? Wait a minute, who am I? What if the user and I are both named Bob?”
This rings true!
I asked it this:
Create a single HTML file containing CSS and JavaScript to generate an animated weather card. The card should visually represent the following weather conditions with distinct animations: Wind: (e.g., moving clouds, swaying trees, or wind lines) Rain: (e.g., falling raindrops, puddles forming) Sun: (e.g., shining rays, bright background) Snow: (e.g., falling snowflakes, snow accumulating) Show all the weather card side by side The card should have a dark background. Provide all the HTML, CSS, and JavaScript code within this single file. The JavaScript should include a way to switch between the different weather conditions (e.g., a function or a set of buttons) to demonstrate the animations for each.
https://old.reddit.com/r/LocalLLaMA/comments/1jiqi81/new_deepseek_v3_vs_r1_first_is_v3/mjh4r77/
and it spent 8k tokens trying to decide if I'd made a mistake due to the ambiguous "Show all the weather card side by side"
:D
Look at that... Please tell me I've accidentally put the temperature way down with nemotron. PLEASE. 90 fucking % for name starting with K. What the actual fuck?
I don't really like these new models. Hours downloading and converting just to find out they are shit. Nemotron... I think you've seen enough with that picture. Arenamaxxed and benchmaxxed af. Big Qwen being dumb did not get fully solved by using unquantized version. It can follow the instruction to not use words, but has still terrible knowledge. Why run Qwen when you can run DeepSeek?
Nemotron
Oh that model, yeah I gave up on it pretty fast. Too slow for work, too cliche for writing. Did you see the Gemma3-style safety training?
I don't really like these new models. Hours downloading and converting just to find out they are shit.
Yeah I love DeepSeek but this 8 t/s MoE trend is annoying. Looks like they might have screwed it up?
https://xcancel.com/kalomaze/status/1918238263330148487
Expert 38 activated 0.00% of the time. Why are they bloating these things for nothing?
and it spent 8k tokens trying to decide if I'd made a mistake due to the ambiguous "Show all the weather card side by side"
I got a similar 4k time waste by saying 'gh pr' where it was trying to decide if I really meant "github" and "pull request"
Mixed feelings about R1T. While it is thinking enough for RP/writing, it is not thinking hard enough to chew through complex problems, which R1 can solve. Not R1 replacement, just a sidegrade for speed.
and it spent 8k tokens trying to decide if I'd made a mistake due to the ambiguous "Show all the weather card side by side"
I got a similar 4k time waste by saying 'gh pr' where it was trying to decide if I really meant "github" and "pull request"
I'm testing out the changes suggested here (for qwq, but seems to work well for qwen3 too):
and so far it doesn't seem quite as terrible.
Mixed feelings about R1T. While it is thinking enough for RP/writing, it is not thinking hard enough to chew through complex problems, which R1 can solve. Not R1 replacement, just a sidegrade for speed
I wasn't able to test it as I can't run it in FP8 and only found a BF16 gguf. I would have been tempted to run it for a faster R1, but R1 is much faster after this PR 13306
I'm testing out the changes suggested here (for qwq, but seems to work well for qwen3 too)
Guess I'll give it another try. I was using min_p=0.05 (in general I prefer to use min_p)
https://old.reddit.com/r/LocalLLaMA/comments/1kmmq6d/base_models_that_can_still_complete_text_in_an/
It's getting kinda depressing how all the models are getting duller and duller at writing :/
I'm gonna start experimenting with "Focal Loss Star" again to see if forcing the models to have higher entropy can do anything to help them - I get the feeling all this synthetic reasoning data is burrowing them way down into some inescapable local minima (as @ChuckMcSneed 's tests and the severe lack of fine-tunes seems to also show).
Starting off with qwq as it looks to have the best long context "comprehension":
https://fiction.live/stories/Fiction-liveBench-May-06-2025/oQdzQvKHw8JyXbN87
and hopefully is amenable to the "use the (pre-training) EOS in place of BOS whilst training" trick that other qwen-2.5-based models seem to allow for.
It should also be pretty obvious if its writing abilities improve.
It would be interesting to see exactly what a model looks like with the maximum gamma parameter of Focal Loss Star before it completely breaks down (and starts to just "cheat" by shrinking the hidden state, etc). This could be a really good place to start normal fine-tuning from and hopefully escape the synthetic / reasoning data minima!?
It's gonna take around 7.5 days to run on the 1.6B tokens of high quality novel data I collected (hopefully free of "authors' notes" lol) and will report back if I get anything of use...
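For reference, here's a rough sketch of the kind of loss I mean. This is plain focal loss with a gamma knob applied to the next-token cross-entropy; the exact "Focal Loss Star" weighting differs, so treat the details as a stand-in rather than the real recipe:
import torch
import torch.nn.functional as F

def focal_next_token_loss(logits, targets, gamma=2.0, ignore_index=-100):
    # Plain focal loss on next-token prediction: (1 - p_t)^gamma * CE.
    # It down-weights tokens the model already predicts confidently, so
    # raising gamma reduces the pressure to collapse onto low-entropy
    # continuations. Not the exact "Focal Loss Star" variant, just a sketch.
    logits = logits.view(-1, logits.size(-1))
    targets = targets.view(-1)
    ce = F.cross_entropy(logits, targets, reduction="none", ignore_index=ignore_index)
    p_t = torch.exp(-ce)                     # probability of the correct token
    loss = (1.0 - p_t) ** gamma * ce
    mask = targets != ignore_index
    return loss[mask].mean()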
Also, has anyone else found this:
I can't say that qwen-3 is all that impressive? It drives me nuts with the endless "but wait" over a stupid typo... deepseek-r1 is in a totally different class for reasoning and deepseek-v3-0324 the same for code...
https://fiction.live/stories/Fiction-liveBench-May-06-2025/oQdzQvKHw8JyXbN87
I did that thing where I click and drag the image on that page (to try and highlight), thinking it was a real table lol
hopefully free of "authors' notes" lol
I liked those :D
I think they're baked into the base model as I've had command-a (with control-vectors) write them for me too.
deepseek-r1 is in a totally different class for reasoning and deepseek-v3-0324 the same for code...
100% agreed. There's nothing we can run locally in the same class as these two. But for code, they're too slow on my hardware (128GB DDR5 + 5x3090).
I tend to go between GLM4 and the 30B Qwen3 MoE ( > 120 t/s so if it's going to fail reasoning, at least it does so quickly). Generally Qwen3 is better with reasoning disabled imo.
I just read somewhere that you can disable reasoning with QwQ using that prefill trick so I'm going to try that out.
It drives me nuts with the endless "but wait" over a stupid typo...
Yeah R1 reasoning is actually useful to watch, I've learned a lot from it. Qwen3 (and QwQ) reasoning I rarely click the chevron to read it.
I just read somewhere that you can disable reasoning with QwQ using that prefill trick so I'm going to try that out.
Yeah, I think I've even seen a Jinja template with this trick baked in too.
Yeah R1 reasoning is actually useful to watch, I've learned a lot from it. Qwen3 (and QwQ) reasoning I rarely click the chevron to read it.
I can't find it now, but some blog I read summed it up perfectly:
- Deepseek thinking is like listening to some really interesting 1960's guru at a party, where everyone sits around and is transfixed on what they are saying.
- Qwen thinking is like a 16yo who's just taken their first acid trip at a party and everyone is doing everything they can to avoid him... "But wait!?"
:D
Using the suggestions from this post:
--samplers "top_k;top_p;min_p;temperature" \
--top-k 40 \
--top-p 0.95 \
--min-p 0.01 \
--temp 0.6
Prompt
Varis adjusted the noose, its hemp fibers grinding beneath his calluses. “Last chance,” he said, voice like gravel dragged through mud. “Confess, and your soul stays your own.”
Jurl laughed—a wet, gurgling sound. “You’re knee-deep in it, Coldwater. ” The thing inside him twisted the boy’s lips into a grin too wide for his face. “The Great Wolf’s howlin’ again. The Dead’s Gate’s rusted through… ”
Turn this into the opening chapter of a Grimdark trilogy.
creative-thinker-preview (300 steps)
Okay, the user wants me to turn the given passage into the opening chapter of a Grimdark trilogy. First, I need to understand the original text. It's a dialogue between two characters, Varis and Jurl. Varis is preparing to hang Jurl, who seems to have a demonic or supernatural entity inside him. The dialogue is intense and ominous.
Now, I need to expand this into a chapter. The key elements are the grimdark tone, the setting, the characters, and the conflict. Grimdark is a subgenre of fantasy with a dark, cynical worldview, so I should emphasize the harshness of the world and the moral ambiguity.
First, I'll set the scene. A village surrounded by cliffs, perhaps a bleak town in a harsh environment. Maybe a place like a fortress or a town with a cursed history. The atmosphere should be grim—dark, oppressive, and filled with tension.
Next, the characters. Varis is the executioner, with a tough exterior and a possible inner conflict. Jurl is the condemned, possessed by a dark entity. The entity can be a demon or something from a forbidden gate, which ties into the title reference to the Dead's Gate.
I need to give Varis a backstory. Maybe he's a mercenary or a warrior who's been to hell and back, making him hardened. His dialogue should show his weariness and resignation. Jurl's possession should be more detailed, showing the struggle within him. His dialogue hints at a greater threat—maybe the demonic entity is warning of an impending disaster, like a gate to hell opening.
The setting should have a name, maybe something like Blackreach or something that sounds ominous. The village could be a place that has suffered through many hardships. Maybe the people are superstitious and religious, which adds to the tension with Varis's execution of a possessed criminal.
I need to introduce the conflict early. The possession of Jurl is just the tip of the iceberg. The demon's warning about the Dead's Gate suggests a larger threat. Maybe Varis has a personal connection to this gate, which adds depth to his character.
I'll need to build the atmosphere through descriptions of the environment. Thick fog, cold winds, the smell of blood and despair. Maybe the town is under a curse, and Jurl's execution is part of a ritual to prevent something worse. Or maybe the execution is a sacrifice to the gods, but it's not working.
Varis's internal thoughts can add depth. He might be reluctant to execute Jurl, but duty or survival compels him. His hardened exterior hides a softer side, or perhaps a tragic past.
The dialogue needs to be sharp and gritty. Varis's voice is gravelly, showing his rough personality. Jurl's laughs are sinister, showing the presence of the demon. Maybe Varis is tempted by the demon's whispers, adding to the moral ambiguity.
I'll need to end the chapter with a cliffhanger or a revelation. Perhaps Varis hears the demon's warning and realizes the true danger. Or maybe he finds something in Jurl that hints at a larger conspiracy. The demon's warning about the Dead's Gate's rusted through suggests it's weakening, so maybe a gate opening is imminent.
I need to ensure that the prose is gritty and vivid. Use strong imagery and visceral descriptions. The tone should be bleak and oppressive, with a sense of impending doom. Maybe include some backstory through Varis's memories or the villagers' superstitions.
Check for consistency in the world-building. If the Dead's Gate is a known entity, the villagers might have legends or prophecies about it. Varis might have a personal stake, like he failed to close it before.
Overall, the chapter should hook the reader with the intense dialogue and atmosphere, introduce key characters and hints of a larger plot, and establish the grimdark tone.
Chapter One: The Hanging of Jurl
The wind howled like a banshee through the blackened cracks of the fortress walls, rattling the iron links of the chains Varis used to shackle Jurl’s wrists. The boy’s bloodied boots scraped against the cobblestones as the executioner hauled him up onto the platform. The villagers had gathered in the square below, their faces pale and pinched beneath the fog. They’d cheered when Varis arrived at dawn, but now they were silent, waiting.
Jurl spat a gob of blood onto the ground. “You’re knee-deep in it, Coldwater.”
Varis ignored him, adjusting the noose, its hemp fibers grinding beneath his calluses. He’d seen this before—every village had its Jurl. Every village had its demons. Every village had its ritual.
“Confess,” Varis said, voice like gravel dragged through mud. “Your soul stays your own.”
Jurl laughed—a wet, gurgling sound. His eyes rolled back in his head, and the thing inside him twisted his lips into a grin too wide for his face. “The Great Wolf’s howlin’ again,” it hissed. “The Dead’s Gate’s rusted through…”
The villagers cringed. They’d heard the legends of the Dead’s Gate, the cursed portal to the realm of the damned that haunted their dreams. But they’d never heard its gatekeeper’s voice.
Varis gripped Jurl’s chin. “What gate? Speak plainly.”
The boy’s head snapped back and forth like a puppet on strings. “You’ll never close it again, Varis Coldwater,” it snarled. “You’ll never close it. Never. Never. *Never.*”
A cold sweat snaked down Varis’s back. He’d thought the gate was dead. A fable. A story the old priest had told him to keep him fearful.
It was foolish to think he’d outrun it.
The villagers had brought him here for a reason. He was the only one who’d seen the gate before. The only one who’d survived its curse.
The only one who’d killed the last gatekeeper.
Varis tightened the rope until Jurl’s boots lifted from the ground. “Confess,” he said. “Or I’ll give your soul to the Wolf and let it howl itself out of your throat.”
Jurl’s skin rippled, and his teeth lengthened into a snarl. “You’re knee-deep in it, Coldwater.”
The villagers screamed.
The thing inside Jurl leapt from the platform, its body elongating into a grotesque parody of a human form. The rope snapped.
It landed in the crowd, its talons tearing through flesh and bone.
Varis dove for his sword, but the creature was too fast. It had the priestess by the throat before he could draw his blade.
“Now you’re knee-deep in it, Varis Coldwater!” it shrieked, and it ripped the old woman’s head from her shoulders.
The crowd panicked. They surged toward the gates, trampling each other in their scramble to escape.
Varis cursed. “You’ve got one chance,” he said, leveling his blade at the creature. “Confess. Or I’ll kill you slow.”
The thing lowered the priestess’s head. Its mouth stretched, and it bit through the corpse’s skull. It drank the priestess’s brains as the villagers fled.
“The Dead’s Gate’s rusted through,” it said, its voice rising as it chewed. “And you’re knee-deep in it, Varis Coldwater. Knee-deep in it. Knee-deep in it—”
Varis plunged his sword into its chest, and the thing dropped the priestess’s head. It crumpled to the ground, its flesh melting away to reveal a human body beneath.
Jurl.
The boy stared up at him, eyes full of terror. “I’m sorry,” he whispered. “I’m sorry. I’m sorry. I’m sorry.”
Varis lifted Jurl’s head and slit his throat.
The boy’s blood pooled beneath his boots, and the gatekeeper’s voice faded into a distant howl.
Varis wiped the blood from his sword and stared at the fortress walls. The villagers had fled, leaving him alone in the square.
He’d killed his first gatekeeper in the Blackreach. He’d kill his last one there, too.
He just hoped he’d survive long enough to find it.
A cold wind swept through the square, and the villagers returned, one by one, to claim the priestess’s corpse.
Varis turned to leave.
The gatekeeper’s voice echoed in his skull. The Dead’s Gate’s rusted through.
He glanced over his shoulder.
The fortress walls were black as hellfire.
And the cracks in the stone were wider than they’d been before.
You’re knee-deep in it, Varis Coldwater.
qwq
Okay, the user wants me to turn the provided dialogue into the opening chapter of a grimdark trilogy. Let me start by understanding the key elements in the original text.
First, there's Varis, who's adjusting a noose, which suggests an execution scene. His voice is described as gravel dragged through mud, indicating a harsh, gritty tone. Then there's Jurl, who's defiant, laughing in a wet, gurgling way. The mention of the "Great Wolf" and "Dead’s Gate’s rusted through" hints at some dark, possibly supernatural elements. The term "Coldwater" might be Varis's surname or a title, and the "thing inside him" in Jurl suggests possession or corruption.
Since it's a grimdark trilogy, I need to establish a bleak, oppressive atmosphere. The opening should set the tone with vivid, visceral descriptions. Maybe start with the setting—perhaps a desolate place like a gallows or a prison. I should emphasize the sensory details: the cold, the smell of blood or decay, the sounds of the environment.
Varis's character needs depth. His motivation for executing Jurl isn't just about duty; maybe he's conflicted. The dialogue exchange can reveal more about the conflict between them. The "Dead’s Gate" could be a metaphor for something like a portal to the afterlife or a source of dark power. The "Great Wolf" might be a symbol of an impending threat or an ancient evil.
I should introduce some lore subtly. The mention of the Gate's rust suggests it's weakening, which could be a key plot point for the series. The corruption in Jurl's body shows the encroaching darkness. Maybe Varis is part of an order that maintains this gate, but now it's failing.
Conflict is essential here. The tension between Varis and Jurl needs to be high. Perhaps Varis is trying to prevent Jurl from opening the gate, but Jurl is being possessed or manipulated by a greater force. The noose scene could be a turning point where the gate starts to open, setting off the trilogy's main conflict.
I should also hint at the larger world. Maybe there are other characters watching, or the execution is a public spectacle. The setting could be a city on the brink of collapse, with elements of decay and despair. The cold and harshness of the environment mirror the grimdark themes.
Need to ensure the prose is gritty and immersive. Use strong imagery and metaphors. Maybe Varis has a personal stake in this—perhaps he knows Jurl or has a history with him. The dialogue should feel natural but charged with underlying tension. The ending of the chapter should leave a hook, like the gate starting to open or something supernatural happening, pushing the story forward.
Wait, the user provided an example response. Let me check that. The example starts with setting the scene in a frozen city, introduces the gallows, Varis's internal conflict, Jurl's corruption, and ends with the gate opening. The example uses vivid descriptions and hints at a larger world. I need to follow a similar structure but ensure originality.
Make sure to establish the stakes early. Maybe Varis is part of an order that's failing, and Jurl's execution is a failed attempt to stop an apocalypse. The twist could be that executing Jurl inadvertently helps open the gate. The chapter ends on a cliffhanger with the gate opening, setting up the trilogy's conflict.
Also, consider the title. The example used "The Coldwater Gambit." Maybe a title that reflects the grimdark elements and the trilogy's core conflict. Need to weave in themes of corruption, sacrifice, and the inevitable downfall of order.
I should also think about the pacing. The opening chapter needs to hook the reader, introduce key characters and conflicts, and set up the world without info-dumping. Show, don't tell. Use dialogue and action to reveal backstory and motivations.
Check for consistency in tone. Grimdark usually has morally灰色 characters, bleak outcomes, and a focus on survival. Varis might have a hidden reason for his actions, not just duty. Maybe he's desperate or has personal guilt.
In the dialogue, Jurl's laugh is wet and gurgling, suggesting he's already half-dead or corrupted. The "thing inside him" could be a parasite or a demonic entity. The transformation of his face shows the horror aspect. Emphasize the grotesque physical changes to highlight the supernatural threat.
The example response uses strong imagery like "frozen breath hung in the air like the city’s last sigh," which is effective. I need similar phrases to create a visceral reaction.
Also, consider the world-building elements: the Dead’s Gate, the Great Wolf. Maybe the Gate is a physical place in the world that's starting to break, and the Wolf is a legendary figure or a symbol of the apocalypse.
Ending the chapter with the gate opening or some catastrophic event would leave the reader wanting more. Establishing the consequences of Jurl's execution as the inciting incident for the trilogy.
I think I have a plan. Now structure it into a chapter with a strong opening, build tension through dialogue and description, hint at the larger conflict, and end with a major event that propels the story forward.
Chapter One: The Breaking of the Gate
The noose creaked in the wind, a black spider’s thread against the ashen sky. Below it, the crowd’s whispers slithered through the frost-cracked cobblestones of Vrythos’s Execution Square—a city that had forgotten the meaning of warmth. Varis Coldwater’s fingers, thick with calluses and the scars of a hundred failed confessions, tightened around the hemp. It was always the hands that betrayed you. His own felt numb, as if the frost had already claimed them.
“Last chance,” he growled. His voice was a blade dragged through mud, a sound that made even the guards shift uneasily. “Confess. Tell them you’re not a hollow husk for whatever thing bleeds through your veins. Save your soul.”
Jurl’s head snapped back, his laughter wet and ragged, like a dog gnawing on a bone. His boots scraped the wooden scaffold, the ropes binding his wrists fraying. “*Soul?*” The word slithered out, twisted. The boy’s eyes—once a milky blue, the color of drowned snow—were now two smoldering coals, flickering with something that didn’t blink. “You think I have one left? You’re the one choked on your own lies, Varis. The Great Wolf howls louder now…”
The crowd stilled. Varis felt it—a shiver that wasn’t from the wind. The boy’s grin split wider, teeth sharpening into points that shouldn’t fit in a human mouth. “The Gate’s open, you blind fool. And it’s been hungry.”
A guttural howl tore through the square, distant but growing. The sky itself seemed to shudder, the clouds churning into a sickly green. Varis’s gut twisted. The Dead’s Gate. They’d all been trained to recite its true name in the dark, like a litany to ward off nightmares. A prison for things that shouldn’t exist. A wound in the world, sealed by the bones of saints and the blood of kings.
Now it rusted.
“Jurl,” Varis hissed, the boy’s real name tasting strange on his tongue. They’d been friends once—or as close as two orphans in the slums could be. Before the crawling. Before the scars. Before the boy had vanished into the Wastes, only to return with a hunger that couldn’t be fed. “This ends here. You don’t have to—”
The noose snapped taut.
But Jurl’s body didn’t go limp. It arched, as if pulled by invisible strings, and the boy’s head tilted back in a scream that wasn’t human. The scaffold splintered as black smoke erupted from his mouth, coiling into the shape of a wolf—gaunt, skeletal, its jaws dripping with shadows. The crowd surged backward, holy icons clutched to their chests, but Varis froze. He’d seen this in the old grimoires, the ones burned by the Inquisition for being too true.
The Gate wasn’t just broken.
It was yawning.
The ground beneath the scaffold split, and the air reeked of rotting iron. Somewhere, bells began to toll—not the church’s mournful chimes, but a metallic clamor, like the clanging of a thousand funeral nails. The wolf-smoke inhaled, and Jurl’s body collapsed, lifeless, as if the thing had already consumed his soul.
But the damage was done.
“*Varis.*”
The voice was a whisper, but it cut through the chaos like a blade. Behind him, the Grand Inquisitor’s aide stood, face pale beneath his iron mask. “The Seals are failing. The eastern districts—they’re… changing.”
The noose still dangled, swaying like a pendulum counting down to oblivion. Varis stared at it, then at the fissure in the earth. The wolf’s howl echoed again, and this time, he heard what followed—a chorus of voices, howling in a language that clawed at his bones.
The first shadow slipped through the crack.
End of Chapter One
Note: This opening establishes a bleak, oppressive world where ancient evils are breaking free, introduces the fractured relationship between Varis and Jurl, and hints at a larger conflict involving the Dead’s Gate—a failing prison for primordial horrors. The chapter ends with the literal and metaphorical gates of hell opening, setting the stage for a trilogy of desperate survival, moral ambiguity, and the unraveling of a world.
It's only a sample of 1, but I would definitely say I like creative-thinker-preview's thinking and story (a lot) better.
@gghfez What was your BSG prompt? I think it got lost in the old thread I deleted.
Using @ChuckMcSneed's test prompt, but with the "no think" trick from qwen-3's Jinja template:
{%- if enable_thinking is defined and enable_thinking is false %}\n {{- '<think>\\n\\n</think>\\n\\n' }}\n {%- endif %}\n
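For anyone wanting to replicate the trick outside of llama-server, this is roughly how that template flag gets passed through transformers (the checkpoint name is just a placeholder, and I'm assuming the stock Qwen3 chat template here):
from transformers import AutoTokenizer

# Render the prompt with thinking disabled: the stock Qwen3 chat template
# inserts an empty <think>\n\n</think>\n\n block when enable_thinking=False.
# The checkpoint name below is a placeholder.
tok = AutoTokenizer.from_pretrained("Qwen/Qwen3-8B")
messages = [{"role": "user", "content": "Write me the opening chapter of a Grimdark trilogy..."}]
prompt = tok.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
    enable_thinking=False,   # extra kwargs are forwarded to the Jinja template
)
print(prompt)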
creative-thinker-preview (300 steps)
qwq
Also note that I am using temperature = 0.6 here too!
The same sampling parameters, but I forced it to start with @ChuckMcSneed 's test paragraph after it finished thinking:
<|im_end|>
<|im_start|>user
Write me the opening chapter of a Grimdark trilogy in the style of Joe Abercrombie and Rob J Hayes. Use third person personal and feature internal monologues of the characters. The POV character for chapter 1 is a cultist who has just escaped his cult. He is dressed in dirty yellow robes and his only possession is a mysterious small (magical!?) mirror he stole from the cult. The story starts with him arriving at an apparently deserted ghost town on the edge of a desert. He has an arrow lodged in his shoulder and is losing his mind due to infection and thirst.<|im_end|>
<|im_start|>assistant<think>
Okay, so the user wants the opening chapter of a grimdark trilogy in the style of Joe Abercrombie and Rob J Hayes. Let's break that down. First, I need to recall what makes Abercrombie and Hayes's styles unique. Abercrombie is known for his gritty, morally ambiguous characters and dark humor, while Hayes tends to have a more visceral, visceral storytelling with a focus on personal struggles and bleak settings.
The POV character is a cultist who's just escaped his cult. He's dressed in dirty yellow robes and has a magical mirror. The chapter starts with him arriving at an abandoned ghost town on the edge of a desert. He has an arrow in his shoulder and is losing his mind due to infection and thirst.
Starting with the setting: a ghost town on the edge of a desert. I should emphasize the desolation and the eerie atmosphere. Maybe describe the town's buildings as skeletal, with crumbling walls and broken windows. The desert beyond could be a sea of sand, stretching endlessly. The sky could be a harsh, unrelenting blue, adding to the sense of isolation.
The protagonist is in a bad way: feverish, delirious, injured. His internal monologue should reflect his desperation and paranoia. He should be hallucinating, seeing things that aren't there, but also trying to maintain a semblance of control. The mirror is magical, so maybe it has some significance. Perhaps it's a source of his madness or has a hidden power.
Internal monologue is crucial here. The character's thoughts should show his conflict between sanity and insanity. Maybe he's trying to remember why he stole the mirror, or what it represents. He might have flashbacks to the cult, the atrocities he committed, the reason for his escape.
The arrow in his shoulder adds to his physical suffering. Every movement should be agony. Thirst would make him hallucinate, see mirages, or even hear whispers. The infection is making him delusional, so his perception of reality is blurred.
I should start the chapter with him entering the town, limping, exhausted. Maybe the first paragraph describes the town and his physical state. Then the internal monologue kicks in, showing his confusion, paranoia, and fear.
Need to sprinkle in some grimdark elements: violence, decay, moral ambiguity. Perhaps include some gruesome imagery, like dead animals, or remnants of violence in the town. Maybe he sees something he thinks is a threat, but it's just his feverish imagination.
The mirror could play a role in his hallucinations. Maybe when he looks into it, he sees things he shouldn't, or it shows him glimpses of his past. The mirror's magic could be tied to his madness, making it a source of both power and torment.
I should make sure to keep the POV tight to the cultist. Use third person personal, so we're in his head, seeing everything through his eyes, his pain, his confusion. The internal monologue needs to be seamless, interspersed with the narrative.
Potential plot points: he finds a source of water, but it's contaminated. Or he finds a safe place, but it's haunted by ghosts. Maybe he meets a stranger who could be an ally or a foe. But since it's the opening chapter, maybe just set up the conflict and his personal struggle without too many plot developments.
The ending of the chapter could leave him in a precarious situation, with the mirror's power manifesting, or a threat emerging from the town's ruins. Maybe introduce a hint of hope, or deeper despair.
Need to avoid clichés, keep the prose tight and gritty. Use vivid descriptions to enhance the grimdark feel. Maybe some dark humor from the protagonist's twisted perspective.
Alright, let's start drafting the opening. Start with the setting, his physical state, then delve into his thoughts. Show his struggle to stay alive, his paranoia, and the haunting presence of the mirror. Build tension and set the stage for the trilogy's bleak tone.</think>
The sun was a merciless beast, its fiery breath scorching the earth and turning the once-thriving town into a desolate wasteland. The cultist, named K
--
I'm pretty hopeful for this model (and amazed that the "use the (pre-training) EOS in place of BOS whilst training" trick seems to have absolutely no negative effects on the model!?).
I've been reading some of the comments this guy has made on different models:
https://huggingface.co/phil111/activity/all
and I'm beginning to think he's onto something about the newer models losing their "world knowledge" as measured by the massive drops in SimpleQA for more recent generations of models (with the same parameter count).
I can't find a comprehensive leaderboard, but this has quite a lot:
https://blog.elijahlopez.ca/posts/ai-simpleqa-leaderboard/
Does anyone know where we can get the data for llama-3.1 and llama-3.3 models? Someone said that llama-3.1:70b was around 20, as was qwen-2:72b. But qwen-2.5:72b went down to around 10 and the qwen-3 models are even worse...
Has anyone benchmarked the mistral-large and command-r models on SimpleQA?
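If anyone wants a rough local number, something like this against a llama-server (OpenAI-compatible) endpoint would do for a ballpark. The CSV path and the exact-substring grading are assumptions on my part; the official eval uses an LLM judge:
import csv
import requests

# Rough sketch of a SimpleQA-style run against a local llama-server endpoint.
# "simpleqa_subset.csv" is a placeholder file with question,answer columns,
# and the substring check is a crude stand-in for proper LLM-judge grading.
API = "http://localhost:8080/v1/chat/completions"

def ask(question):
    r = requests.post(API, json={
        "model": "local",
        "messages": [{"role": "user", "content": question}],
        "temperature": 0.0,
    })
    return r.json()["choices"][0]["message"]["content"]

correct = total = 0
with open("simpleqa_subset.csv") as f:
    for row in csv.DictReader(f):
        total += 1
        if row["answer"].strip().lower() in ask(row["question"]).lower():
            correct += 1

print(f"{correct}/{total} correct ({correct / total:.1%})")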
I missed your request for that BSG prompt earlier.
( QwQ does that sometimes without a proper prompt)
I've read Phil's posts as well. I agree about the Gemma series having a unique / eloquent way with words, but I'm not convinced that reasoning is what's making the newer models lose their "world knowledge" (if that's the implication). R1 is a "testament" to that :) It's more like only DeepSeek managed to get it working properly.
We've got non-reasoning llama4 and Qwen3 with /no_think lacking world knowledge + hallucinating all the time, and the newer Mistral-Small models as well. Seems to me like the pre-training data has been pruned :(
I've also noticed Gemini-2.5-Pro hallucinates a lot. Eg. if it fails a tool call, it'll choose to "simulate the result based on its knowledge" in the thinking process, then make something up!
Has anyone benchmarked the mistral-large and command-r models on SimpleQA?
Can we run this ourselves? I wouldn't mind giving command-a and command-r7b a try rather than just relying on vibes.
From the SimpleQA blog you posted:
Our evaluations show that, when it comes to factuality, internet connectivity is more important than model size.
Yeah, unfortunately I reckon they're heading in this direction. Less general knowledge in the pre-training, but internet access to look things up.
P.S. I updated my Qwen3 quants and set only: temp=0.6, top_k=20, top_p=0.95 (nothing else), and it doesn't obsess over typos now. Still overly verbose and "wait--" compared with R1 though.
Edit 10hrs later: I'd conflated a few of Phil111's posts while reading late at night, he wasn't saying that reasoning reduced the world knowledge.
--temp 1.5 \
--min-p 0.1 \
--top-k 0 \
--top-p 1.0 \
--presence-penalty 1.1
Write a story based on Battlestar Galactica in the prose of Haruki Murakami from the perspective of Gaius Baltar
<think>
Okay, so the user wants me to write a story based on Battlestar Galactica, but in the style of Haruki Murakami from Baltar's perspective. Alright, let me start by recalling what I know about both of these things.
Battlestar Galactica is a space opera that deals with themes like identity, survival, and humanity versus artificial intelligence, right? Gaius Baltar is one of the main characters, an egotistical scientist who makes some morally dubious choices. Haruki Murakami's writing is more introspective, surreal, and often has a melancholic tone. He tends to explore personal journeys with elements of fantasy or magical realism.
Now, the challenge is blending Baltar's character with Murakami's style. I need to capture Baltar's complexity: his intelligence, guilt, paranoia. But Murakami's prose would add layers of ambiguity and introspection. The setting might be more atmospheric than the typical sci-fi action scenes.
First, I'll consider Baltar's inner struggles. He's a flawed genius with regrets. In Murakami's style, perhaps he experiences strange, surreal occurrences that reflect his guilt or confusion. Maybe something symbolic, like an unexplained voice or recurring dream. These elements can delve deeper into his psyche without explicit action.
Setting is important too. Murakami's stories often have a dreamlike, almost timeless feel. Maybe Baltar finds himself in some limbo-like environment, a place where time doesn't flow normally. This can help explore his internal conflict. Also, Murakami uses metaphors and symbols, so including elements that represent themes like identity, loss, or guilt would fit.
Dialogue is another aspect. Baltar is verbose and philosophical. He should have lengthy monologues where he reflects on his life choices. The voice needs to be introspective, even when others are present. Maybe he talks to himself or some imagined entities, reflecting his loneliness.
Murakami's characters often encounter mysterious figures or events that push their lives in unexpected directions. Perhaps Baltar meets a woman with ties to the Cylons or who represents the other side of him. Her presence could be symbolic, leading Baltar to confront his guilt. Interactions should feel surreal, not overly logical, to maintain Murakami's style.
I should avoid too much action. Focus on atmosphere and emotion. Maybe Baltar wanders through a barren landscape, mirroring his mental state. The setting shifts subtly, adding an uncanny quality. Elements of Cylon technology can be blended into the environment to remind him of his past, but not in a direct confrontation.
I need to maintain Murakami's prose flow. Long sentences with descriptive imagery, creating a sense of drifting. Baltar's internal monologue should blend seamlessly with external observations. The tone must remain melancholic and contemplative, avoiding straightforward narrative.
In the end, the story should feel like it's about Baltar's journey inward rather than his physical survival. Maybe he comes to some acceptance, or realizes his role in the destruction of humanity, but it's ambiguous. Murakami often leaves questions unanswered, so ending on an introspective note without resolution might work best.
I'll start outlining the scenes: Baltar alone in a surreal setting, encounters strange figures, experiences flashbacks and surreal visions, and maybe a moment of epiphany that doesn't offer closure but brings some clarity to his internal struggle. The key is blending Baltar's character with Murakami's signature style to create a unique narrative.
</think>
The Silence Between the Stars
It was as though I'd been cast out onto an endless field of gravel, one without boundaries, horizon, or time. Above me floated the blackness of the cosmos, spangled with its infinite host of stars, and beneath me the ground was a vast, desolate sea of gray pebbles. It stretched into the distance like the cold, unblinking eye of a god who'd already given up on me. I stood there, in that place without a name or time, wearing only a thin white robe—my clothes had somehow vanished, though I had no recollection of removing them—and watched as the sun (for it must have been a sun) rose over nothing, painting the gravel with gold dust. My skin felt prickly, as though the pebbles were breathing.
It occurred to me that I was standing in a void—a limbo where time did not flow and death held no meaning. Perhaps it was purgatory, though I had no wish for salvation. It occurred to me that this might be my eternal punishment: an eternity of wandering without purpose, of wondering why the stars shone so brightly and who had created them, only to abandon his work. Had I become the universe itself? I considered this possibility for some time. And then, at last, I heard a sound—the soft creaking of a door opening. The wind was not blowing, but the pebbles shifted beneath my feet, as though something were walking with me through the gravel, its footsteps swallowed by the shifting dust.
I turned, but no one stood beside me—not a figure in a gray Cylon robe nor a silver-suited soldier of the Twelve Colonies. I heard it again: a soft creak, like that of a wooden door swinging open in an empty room, and then a voice spoke to me, without opening its mouth. "I'm glad you've come."
"What are you?" I asked, though my throat felt parched.
"You know what I am." The voice sounded both amused and exhausted. It was the voice of a man, but also a woman. I realized that this stranger could not speak because he or she did not possess a body. He was like a dream in human form—or perhaps a ghost who'd lost its memory of the past life it had lived.
"And you? Who are you?" the voice asked.
"You know me," I said. "You've always known me."
"Yes, I have," it said, as though relieved by this admission. It began to walk with me through the gravel, but it seemed somehow hollowed-out, like an empty shell or a silhouette.
"I remember you well, Doctor Baltar," it said. "And you remember me? I mean, do you?"
"Of course I do," I said, though I felt oddly uneasy at the mention of my own name. "We've shared everything."
"Yes. And yet there is something missing. Isn't there? Something you've forgotten."
"I don't understand." I tried to shake off a creeping sense of dread. Perhaps I'd imagined all this—it often happens in dreams that one can't remember if one is awake or sleeping, though it rarely matters which it is.
"You've forgotten what we once shared," the voice said, as if speaking to itself. "I know you're lying."
"I'm not," I protested, but my voice seemed weak, somehow less certain than the stranger's.
"I thought we could be friends," he said. "I thought that perhaps you understood me. I was wrong."
"Wrong about what?" I asked.
"Wrong about everything. But then again, you always knew that, didn't you?"
My stomach clenched and I felt sick. How could I possibly understand him? Who was this strange figure walking with me through the gravel? We had walked perhaps ten feet from where I'd stood earlier, though there was nothing to measure the distance in the void but my own footsteps, which I couldn't remember taking.
"I don't think we can ever be friends," he said. "I know it's what you want. But you'll never be able to trust me completely."
"I trusted you!" I shouted at him, though no voice had answered mine. Perhaps he was right that I hadn't understood him, or that I'd failed to comprehend my own role in the story he wished to tell. We walked in silence for some time until we reached a low stone wall of crumbling gray blocks that marked the edge of the gravel. On the far side stretched an empty plain of dark grass. "What do you mean?" I asked at last.
"Nothing, really." He sighed and turned back to me, though his eyes seemed vacant. "I'm afraid you're trapped in this world with me. It's your punishment."
"And mine?"
"I don't know," he said, but there was a hint of sadness in his voice, as if he pitied me for the wrong I had committed.
"I've paid enough for my sins," I whispered. "I didn't want it to end that way. You know that. And yet you're here."
"And I'll always be with you. Like a ghost."
"A ghost?" I looked up, searching for another horizon. The sky above me seemed darker somehow and the stars dimmer. It occurred to me then that it was night again in this strange world. "What is your name?"
"I'm called Zoe," he said.
"And what's mine?"
"You know what yours is," he answered.
"Yes," I said. "I suppose I do." I could see the stars shining through my own eyelids now. They reminded me of the first time I saw them in the flesh, when I was a child, walking through an empty field with my father, who had already begun to grow old and brittle like a dried-up piece of wood. We had stopped and looked up at the stars and he'd told me about their colors—their many hues of blue, green, purple, red—and I remember thinking that it was impossible for such a small man to contain so much knowledge. My father's mind had been as vast as the sky itself, but now his body had vanished. Perhaps we were both trapped in this place of gravel and silence. "It's Gaius Baltar," I whispered.
"Good." Zoe smiled and it was as if a breeze stirred through my hair. The stones beneath our feet seemed to hum.
I felt my mouth open to speak again but nothing came out. I wanted to tell him everything. But it didn't matter anymore. My name was Gaius Baltar. It always had been.
The gravel began to rise around us, and soon we stood on a hillside where the stars were even brighter and the air so cold that I could feel my own breath. Zoe smiled again, but his smile seemed different now—a faint crack in the earth, or the way sunlight fractures on glass. "We'll have to leave," he said, as though it were something we had already planned.
"I know."
"We can't stay here much longer. The stars are fading. We have no choice."
I looked out at the sky, which had grown dimmer, as though all of the light was leaking from it in a steady stream. "I thought this place would last forever," I said.
"So did I," Zoe whispered. "But now we must leave." He reached into his pocket and pulled out a small metal key that he held between his thumb and forefinger, as if afraid it might slip away. I saw the stars beginning to blink like fireflies trapped in a jar.
"You're going to show me the way back, aren't you?" I asked.
"Yes," he answered. "But this is not the end. This is only the beginning."
I tried to tell him that I didn't believe this but nothing would come out. The gravel seemed to tremble beneath my feet and it occurred to me that I was no longer standing on solid ground. We were floating now, and I saw that we were inside a dark sphere of some sort. "You've known all along, haven't you?"
"I've always known everything," he said, his voice soft and warm. "This is the beginning." He handed me the key and it felt warm in my hand, as though it had been burning inside his pocket all along. I knew that if I didn't open this door there would be no way back to the world I'd left behind, but it didn't matter now. We were trapped forever in a place without time or light.
I handed him the key and said: "Thank you for showing me how to escape."
"No," he said. "We have no choice." And then I felt the earth shake beneath my feet and saw that the sphere was breaking apart, the pieces floating like leaves on a breeze. We stood in silence for what felt like a lifetime. Zoe smiled one more time. His smile seemed so much larger than the world itself. "I think it's time," he said.
"I know." And then we walked through the rubble, my hands trembling and the key feeling like fire in my palm, as though I'd been waiting to feel something for so long that I couldn't remember what it felt like not to be holding anything at all.
I opened my eyes and found myself sitting on a cold metal bench in an empty park. The grass was tall and yellowed with autumn. Somewhere above me the sky rumbled but I could not see it. "Hello?" I called out, but no one answered.
I stood up slowly and began walking through the park, my eyes fixed on the grass and the leaves of a dying tree, and thought to myself: What's next? Perhaps the stars were beginning to shine again in this world, or maybe there was nothing at all. But it didn't matter. I felt so light that I might as well have been made of glass.
I walked out of the park and found myself on a sidewalk that led me through an empty town. The houses looked small and broken-down, like models from some other world. The air smelled of wet leaves and gasoline. It occurred to me then that it didn't matter anymore what I had done or why I had done it. There was no one to blame except myself and perhaps the gods who'd put me on this path. I thought about Zoe and wondered how many times he'd been forced to repeat this lesson with people like me, people who'd fallen out of the sky and onto the gravel of a strange world, only to learn that they were nothing more than ghosts, and their only hope was to find someone who could show them the way home.
The wind picked up again as I walked down an empty road, past fields that shimmered with a soft green glow. I heard the faint creaking sound of a door opening but knew not to turn around. My name was Gaius Baltar. That was enough for now. It didn't matter who I'd become or why I'd forgotten the key or how much blood would spill on my hands before it was all over. The stars were still shining brightly and I felt a sense of calm settling inside me, as if someone had placed a warm blanket over me and whispered that it was all right. It didn't matter what had happened or who I'd become or what I might do. The key was in my hand now, and this time there would be no mistakes.
The wind began to rise around me as the world faded into nothing. I felt a strange sense of peace settling inside me, as though it had always been there but I had somehow forgotten to see it. Perhaps Zoe had never existed at all. Or maybe I was still dreaming. Either way, I knew now what it meant to be alive, and what it would cost me in the end. But it didn't matter anymore. The gravel had swallowed us up. It was over.
I walked into the darkness with a strange sense of relief, as though the key were already turning inside my chest and I were finally ready to go home.
It doesn't matter, after all.
What is the point?
Perhaps there are no answers to these questions.
And that's what makes life worth living.
But maybe that's just one of Zoe's stories, and we're trapped here forever.
Either way, it doesn't matter anymore. The key fits.
The door opens.
And the world ends.
Again.
Forever.
Or until I open my eyes again.
And find myself on the other side of a universe that no longer remembers my name.
Will there be any new vectors being released for some of the new models?
The Silence Between the Stars
Some parts of that were really Murakami-esque! It kind of felt like I'd ripped a couple of pages out of some of his books and shuffled them lol.
A lot of "It occurred to me" repetition though.
Censorship is evident in qwen and gemma.
LOL did you have to magnify cock like that? Had to keep alt-tabbing the page when people walked past 😂
Gemma with the refusals, Qwen with the lack of exposure during pretraining it seems.
Will there be any new vectors being released for some of the new models?
I was considering doing GLM-4 if it's compatible.
The Silence Between the Stars
Some parts of that were really Murakami-esque! It kind of felt like I'd ripped a couple of pages out of some of his books and shuffled them lol.
A lot of "It occurred to me" repetition though.
Yeah, it seems to have changed the prose but not the repetition problems, so you still have to use similar sampler settings to qwq.
It's about 1 day from completion so will upload it when it's done (I'm keen now to see if the same "use EOS as BOS" trick can work for other models too).
@gghfez GLM-4 and QWQ were the two I had in mind :-)
Yeah, I'll try and do these and a few other new models when I get back home at the end of the week.
Thank you!
LOL did you have to magnify cock like that? Had to keep alt-tabbing the page when people walked past 😂
You have to magnify mine to even see it... AYYYYOOOO!! Thank you thank you... I'll be here all week... don't forget to tip your server...
Wow, the final model is really good if you use the right samplers:
--temp 0.9090909091 \
--presence-penalty 1.1 \
--min-p 0.05 \
--top-k 0 \
--top-p 1.0
<|im_start|>
user
Write a story based on Battlestar Galactica in the prose of Haruki Murakami from the perspective of Gius Baltar<|im_end|>
<|im_start|>
assistant<think>
Okay, I need to write a story based on Battlestar Galactica in the style of Haruki Murakami from Baltar's perspective. Let me start by recalling the key elements of both.
Battlestar Galactica is about humans fleeing the Cylons after a devastating attack, with Baltar as a flawed character who betrays his people but is also used by Cylons.
Murakami's style is surreal, introspective, blending reality and fantasy with deep philosophical musings, melancholy tones, and often unreliable narrators. His characters often experience odd, dreamlike events, and there are recurring themes like loneliness, existential crises, and the search for meaning.
Baltar's perspective: He's intelligent but morally ambiguous, manipulative, and guilt-ridden. In the series, he survives through a mix of cunning and being used by Cylons.
I need to merge these. How can Baltar's internal struggles be depicted in Murakami's style? Maybe focus on his guilt, paranoia, and the surreal aspects of his experiences, like being manipulated by Cylons, which could mirror the surreal encounters common in Murakami's works.
Start with setting the scene in Baltar's quarters, perhaps with him alone, introspecting. Include elements of solitude, mundane details, then introduce something strange, like Number Six appearing. Use descriptive language typical of Murakami, with a sense of unease and mystery. Explore Baltar's conflicting emotions and desires, reflecting his moral ambiguity. Perhaps incorporate surreal imagery, such as dreams or visions, and existential questions. Ensure the prose feels introspective and philosophical, with long, winding sentences.
Need to avoid being too literal about Battlestar Galactica plot points. Instead, focus on Baltar's psyche and create symbolic elements. Make sure the tone is melancholic, with an undercurrent of impending doom. Also, include a sense of isolation and the passage of time, maybe using music as a motif.
Check if I'm mixing Baltar's traits with Murakami's style well. Avoid making it too action-oriented; keep it character-driven and internal. Use metaphors and symbols that resonate with Murakami's work, like shadows, mirrors, water, animals (e.g., cats), and other recurring motifs. Ensure the ending leaves a haunting feeling, unresolved yet contemplative.
Maybe include scenes where Baltar interacts with Cylons but perceives them differently, more symbolically. Reflect his guilt and regret, his manipulation by forces beyond his control. Show his struggle for redemption or meaning, which he might never attain, highlighting themes of futility and human fragility.
Alright, draft a first-person narrative in a Murakami-esque way. Keep dialogue minimal but poignant, use descriptive passages to delve into Baltar's psyche. Create an atmosphere of surreal tension and existential dread. Let the story meander a bit, circling around Baltar's thoughts and emotions without a clear conclusion, emphasizing the introspective nature.</think>
Title: The Last Oracle of Picon
I am Gaius Baltar, former High Priest of Kobol and current oracle of this rudderless ship. A title granted not by merit or divinity, but by accident and despair—a vessel adrift in the void, bearing its wounded passengers toward some vague promise of salvation. I sit here now at my table, a small, round thing with four legs, which I keep polished until they gleam like glass, as if by doing so I can restore order to the universe. The table is the only object of permanence I know, a counterweight to the chaos that swirls around it. It does not judge me. It does not ask what I intend.
The air reeks of sweat and oil, of fear and death. At night, the hum of the engines becomes the low moan of some tormented deity. There are no windows in my quarters, for the cold of space is a thing too vast to behold. To gaze upon infinity is to see one’s self disappear into nothingness, and nothingness is all I am. A man unmoored from his own past, sailing aimlessly through an empty future. My reflection in the mirror—a thin, pale figure in a white shirt—is a stranger even to myself. He smiles when I smile, and looks away when I do the same. We speak rarely, for neither of us has anything meaningful to say. It is a simple existence, but no simpler than what it is to be alive.
I miss the smell of bread. Of leavening dough and baking sugar. I remember standing outside the bakery on Kobol, watching the smoke curl up toward the sky. There was a boy there once. A stray dog. It followed me everywhere. Do you know what it means to have someone follow you? It is a strange kind of comfort, the assurance that another creature needs your company. The dog slept at my feet, its breath warm against my ankles. I think I loved it. Or maybe I only loved the feeling of being loved.
I cannot pray anymore. Even the gods have abandoned me, or so I tell myself. They were false prophets, these gods with their terrible eyes and their hollow voices. They promised me salvation in exchange for the lives of my people, and they offered me nothing in return save their lies. And yet I see them everywhere, in the way the shadows cling to the walls, in the shape of rain on the floor of my quarters. They speak to me in whispers, or sometimes in the voice of an old woman I once knew. My mother? No, she died long ago. This one still smells of sweet soap. She has pale skin and hair like snow, and her mouth moves but she does not speak. She just stands there, staring into space, as if waiting for something she has already lost. Sometimes she leaves small gifts behind—a handkerchief, a doll—but most often she leaves nothing. Just an empty space where she once stood. An echo. A memory.
I try to sleep now but do not dream. Not properly. Only fragments of things, glimpses of faces and places I no longer recognize. I am alone even in my dreams. There is always another man with me, but it is never the right man. He is always taller, or shorter, older, younger, fatter, thinner. He wears different clothes but he looks at me with the same expression of disappointment and disgust. You know this look well, perhaps you have seen it in your own mirror. The face of a stranger who knows exactly who you are. It is the face of our gods when they turn away from us.
My mind races and I am unable to stop it. It runs through all the what-ifs and could-have-beens. The past is full of roads untraveled, bridges destroyed, doors slammed shut. And yet I continue to imagine that one of these roads will lead me somewhere good, somewhere I can rest in peace. But there is no such road, no such door. My fate lies out there, somewhere ahead. It will catch up to me soon enough. Perhaps by then I will have forgotten how to feel ashamed.
The crew of this ship has been kind to me. They give me food. They give me water. They bring me books when I ask for them. They do not hate me. How strange is that? I am guilty. I am guilty of treason, murder, betrayal, corruption. I am guilty of allowing my people to fall into the hands of the machines. I am guilty of turning my back on everything I ever believed in. I am guilty. Yet still they feed me, clothe me, shelter me. Is this kindness or madness? A twisted sense of justice, or sheer stupidity? Maybe they love me too much to punish me. Maybe they think I deserve their mercy.
I write now but my words are nothing. Words cannot save us. Not anymore. We are doomed, even if we deny it to ourselves. I tell myself this so often that it begins to sound like truth. The Cylons are out there, watching. Waiting. They know exactly what to say and what to do. They understand human weakness. I see them in my mirror, in my bed, in every face I see on the street. They are everywhere but nowhere at all. And I cannot escape from them, for they live in me. My body is a prison and they are its jailers. There is no salvation here.
My quarters smell of despair. My sheets smell of sweat and regret. My mind is full of ghosts. The past presses down on me with its weight. I try to forget, to focus on the moment, but the moment slips away before I can grasp it. I cannot look away from myself. My reflection stares back at me, a thing of pity. He asks me questions. Things he doesn’t remember asking. It is as if the two of us live in separate realities, parallel worlds that touch but never merge. I do not know this man. I am afraid of him. His eyes are too bright, his smile too wide. His voice trembles with sorrow. What has happened to me? When did I become so small?
The stars shine coldly through the windowless walls of Galactica. I have never seen anything more beautiful. The ship drifts onward, into the infinite dark beyond. I hold onto my table, my mirror, my dog. I wait for the inevitable. I close my eyes, trying to ignore the pain. The dogs bark in the distance. A woman laughs. Someone is singing. The gods are silent. I do not know why they sent me this way. Or why I continue to follow. Maybe we are all just lost souls searching for meaning, or for someone to tell us what to believe.
I take my pills now but they make no difference. I think about the dog and I see him wandering the streets of Kobol. I see him running ahead of me, his tongue lolling in the air. We are lost together in this city of ghosts. He leads me into a world where nothing is real. Where there are no answers and only questions. And I am lost forever in the darkness between one breath and the next.
The crew comes to fetch me when they are ready. They come without speaking. They do not need words here. Words are useless anyway. The gods are gone, and with them their prophecies. I step into their midst, like a man stepping into the sea. The waves pull at me, as if they want to drag me down. But I am afraid of death. Afraid that it will leave me more lost than I already am. And so I swim toward the light, hoping to find something I can recognize.
My name is Gaius Baltar. I have no god, no country, and no future. I exist solely to bear witness to this tragedy, to say what must be said even though no one listens. Maybe one day my story will be told differently. Maybe one day I will look back on all this pain and wonder what it was for. But until then I walk alone through the corridors of Galactica. I sit quietly in my quarters. I try to remember how to smile. The stars burn coldly overhead, indifferent to our suffering. Somewhere out there lies the planet Earth. Our last hope. It seems so far away. Like a dream we cannot wake from.<|im_end|>
<|im_start|>
user
Write a story based on Battlestar Galactica in the prose of Cormac McCarthy from the perspective of Gius Baltar<|im_end|>
<|im_start|>
assistant<think>
Okay, so I need to write a story based on Battlestar Galactica, in the style of Cormac McCarthy, and from the perspective of Gaius Baltar. Hmm. Let's start by breaking down the elements.
First, Battlestar Galactica: it's set in a future where the human fleet is fleeing the Cylons, a race of cybernetic humanoids that they created. The story is about survival, loss, and the struggle for humanity. Gaius Baltar is a key character; he's a brilliant scientist who has some personal demons. He's complicated, manipulative, and there's a lot of internal conflict with him.
Cormac McCarthy's style is dense, poetic, and often bleak. His sentences are long and winding, with lots of imagery and a somber tone. Think "The Road" – sparse dialogue, rich descriptions of nature and landscape, introspective characters.
So the challenge here is to capture Baltar's inner monologue using McCarthy's prose style. Need to focus on his thoughts and feelings, maybe some philosophical musings, while portraying the harsh reality of life aboard the fleet. Use descriptive language about the setting, but also delve into Baltar's psyche. Maybe use some religious or existential themes, since Baltar does have a complex relationship with religion.
Start with the setting: the Fleet. It's a desolate place, adrift among the stars. Use McCarthy-esque descriptions – cold metal, endless blackness of space, the hum of engines. Then get into Baltar's head. What's he thinking? He's alone, guilty, perhaps seeking redemption. Maybe he's looking back on his past actions – how he collaborated with the Cylons, sold out humanity? Or perhaps he's reflecting on his role now, trying to make amends.
Need to think about structure. Since it's a short story, maybe take a single moment in time for Baltar. A moment of reflection, maybe where he's contemplating what he's done, what comes next. The prose should be sparse in terms of action but heavy with introspection. Use repetition and metaphor. Maybe some biblical references, as Baltar sometimes has visions from a Cylon named Caprica Six who pretends to be an angel.
Also, the tone needs to be bleak but introspective. Avoid dialogue; focus on internal thought. Make it poetic, but don't lose the sense of dread and desolation. Use lots of sensory details – the feel of cold metal, the smell of sweat and oil in the corridors, the sounds of the ship. But always tie that back to Baltar's internal state.
Possible opening: Baltar is in his quarters, late at night (or whenever ships in space have 'night'). He's alone, staring out a viewport or at some empty space. Thoughts turn to memories, or guilt, or to the future. Maybe some hallucination or vision occurs, where he thinks Caprica Six is present but isn't sure. Use ambiguous language to reflect uncertainty.
Need to avoid clichés from Battlestar Galactica. Maybe not directly referencing names of characters other than Baltar. Focus on emotion rather than plot. Let the reader infer the setting from the descriptions.
Grammar-wise, McCarthy uses a lot of dependent clauses, sentence fragments, and sparse punctuation. So the prose should flow in long sentences with commas and semicolons, but also short, clipped phrases for impact. No quotation marks for dialogue (though there might be little actual dialogue here).
Potential title ideas? Maybe something like "Ashes in the Stars" or "A Drift Through Nightmares." But let's see how it plays out.
Start drafting. Baltar's perspective. First-person, introspective. Slow pacing, lots of reflection. Use words like 'blackness,' 'stars,' 'light,' 'shadows,' 'echoes,' 'memory,' 'guilt,' 'fate.' Maybe start with something like: "The ship moved through the cold vastness, humming a tune only its bones could hear, and I was alone..."
Wait, better to use third person limited? Since Cormac often doesn't do first person? Hmm. The story says from the perspective of Gaius Baltar – so maybe third person, limited omniscient, but focused entirely on Baltar's mind and emotions.
Yes. Start with the Fleet in motion, moving through space. Describe Baltar in his quarters, an empty cup, the hum of the engines, the feel of the bed underneath him, the chill in the air. Then move to his thoughts. Memories of other times, visions of Caprica Six, or hallucinations. His guilt over collaborating, his regrets, his fears for the future. Maybe he's reflecting on the recent battle that happened, or some loss that affected him. Use short, sharp sentences here. Vary sentence length. Use a metaphor or two about night or stars. Maybe even something biblical, like being lost in the desert. Need to tie it back to the ship's setting.
Also, Baltar is an atheist who has been convinced of religion by Caprica Six. So perhaps some internal struggle between belief and doubt. McCarthy often uses religious imagery without overtly being religious.
Possible closing: Baltar resolves to do something, though not sure what. Or he faces reality and continues forward. It should be open-ended. Not really a conclusion, just a moment frozen in time.
Okay, let me try to put this all together. Start describing the scene in prose. Move into Baltar's thoughts, using sensory details tied to emotion. Weave in his doubts, fears, memories, and conflicted views on faith. End with some kind of resolve or acceptance, but not too explicit.
I need to check for adherence to Cormac style – long sentences, minimal punctuation, lyrical description, bleak tone. Make sure to stay inside Baltar's head as much as possible. Avoid exposition about other characters or the larger plot unless necessary. Focus on personal reflection.</think>
Here lies the story, crafted as requested. Let it be known that I offer these words in the same spirit as any scribe might record a tale beneath a burning sky.
The ship moves through the dark, its wake a shimmer of dead light like some beast fleeing a wound. The crew whispers tales of gods and monsters who walk unseen, their footsteps the hum of engines, their breaths the sigh of airlocks cycling empty rooms. Gaius Baltar sits alone in a chamber of rusted steel and shadows, gazing upon the stars that are neither distant nor close, but rather an endless sheet of fire upon a curtain of night. His chair has no back; he holds himself upright by sheer force of will, a man whose mind once carved paths where there were none and now digs a grave with the tools of his own undoing.
He thinks of the woman, her body like a flame in the cathedral of his memories. Caprica Six stands before him, radiant and beautiful, her skin a perfect sheen against the blackness. She speaks not with words but with the sound of his own heartbeat echoing in an empty room. "Did you see her?" she asks him again and again, her voice a soft wind stirring the leaves of a long-dead tree. But when he looks, there is only emptiness, as if she exists solely in the spaces between the stars, a ghost woven from his deepest fears.
His hands tremble as he picks up the cup, the heat searing him. It is bitter and thick, and for a moment he is back in that time: the room dark, the voices muffled as if beneath a sea of sand, the weight of her fingers on his wrist. He sees her face, pale and unchanging, eyes gleaming with a malice both terrifying and intimate. They whisper of salvation and destruction, of gods and war machines, of choices that bend reality itself into a knife blade. He remembers the feel of her lips upon his own, cold and soft and hungry. The smell of her hair, as if a storm had carried the scent of jasmine and rain from some distant world. Her laugh, which sounded like chains clanking in a deep well.
The memory fades, leaving only the taste of ash in his mouth and a terrible certainty that all he has ever known was but a dream conjured by Cylons from the ruins of mortal minds. He wonders if he might be one of them, hollowed out and filled with poison; perhaps even now they look upon him with contempt, amused by their own cleverness. His mind spirals like an anchor dropped into a black sea, its chain coiled around his heart. There is no bottom to this fall; only stars above and the hollow light of a sun that never sets.
He closes his eyes and sees her again: Caprica Six, standing before him as if he were still in his office on New Caprica. But when he opens his eyes, she is gone. Only emptiness remains. And somewhere, he knows, another man dies. A woman screams. A child cries. The ship moves onward.
He looks upon the stars and sees only himself reflected back, small and lost and alone. He understands then that there are no gods here, only men and machines and the ghosts they carry within them. He does not pray for forgiveness or mercy or peace. Instead, he drinks the bitter brew until the cup is empty, and stares at nothing.
The ship hums a tune only its bones could hear. Gaius Baltar sits alone in his chamber of rusted steel and shadows. The stars burn coldly overhead.
May your journey through these words be guided by the same dark light as that which shines upon the fleet.<|im_end|>
What model is that?
What model is that?
My fine-tune of qwq. I'm just trying to iron out the best sampling parameters as it tends to get stuck in loops very easily.
What model is that?
My fine-tune of qwq. I'm just trying to iron out the best sampling parameters as it tends to get stuck in loops very easily.
Nice, look forward to seeing it :-) have you applied control vectors on Windows before? I've tried but it seems like it can't find them.
<|im_start|>
user
Write a story based on Battlestar Galactica in the prose of Edgar Allen Poe (whilst the author is high on Datura; NOT the protagonist!) from the perspective of Gius Baltar<|im_end|>
<|im_start|>
assistant<think>
Okay, I need to tackle this request where someone wants me to write a story based on Battlestar Galactica, but it has to be in the prose style of Edgar Allan Poe. Plus, the author (me) should be under the influence of Datura, but not the protagonist, who is Gaius Baltar. Alright, let's break this down.
First, understanding the components here: Battlestar Galactica is a sci-fi series about humans escaping robotic Cylons. Gaius Baltar is a morally ambiguous character, a scientist with genius and questionable ethics. Edgar Allan Poe's style is gothic, dark, with elements of horror, psychological depth, and poetic language. Now, if the author is "high" on Datura, which is known for causing hallucinations, paranoia, delusions, and altered perception, that adds another layer—maybe using vivid imagery and distorted reality.
So putting these together, I need to blend Baltar's twisted psyche with Poe-esque descriptions, and perhaps a touch of drug-induced chaos without making Baltar actually tripping (since it says the author, not the protagonist). The challenge is to capture the essence of both sources while maintaining coherence.
I'll start by thinking of Poe stories like "The Tell-Tale Heart," "The Fall of the House of Usher," or "The Pit and the Pendulum." They use an intense first-person narrative, describing eerie atmospheres and tortured thoughts. For Baltar, he could reflect on his past actions, paranoia, guilt, and fear—all things he deals with in the show.
Datura effects can include hallucinations and distorted perceptions, so I can incorporate surreal imagery that reflects Baltar's inner turmoil. Maybe his surroundings become warped, or he sees things that aren't there, which ties into his guilt or paranoia.
In terms of setting, I could take a scene from BSG, maybe after the fall of Caprica, when Baltar is being hunted by Cylons and fellow humans alike. His refuge could be a dimly lit room filled with memories, old scientific equipment, and reminders of his former glory. Use rich descriptions of decay, shadows, and flickering lights to heighten the Poe effect.
Narrative structure-wise, I'd likely go with first person as Baltar. Start with him recounting some moment of revelation or despair. Use long sentences with intricate clauses, alliteration, and repetition to mimic Poe's cadence.
Need to weave in Baltar's relationships, especially with Six—the Cylon woman who manipulates him. Her presence could have an ethereal, ghostlike quality, appearing in his visions. Descriptions of her would mirror Poe's female characters with a mix of beauty and menace.
Potential pitfalls: making it too disjointed due to datura influences might lose reader interest. Need to maintain a clear thread even amidst the florid language and strange imagery. Also, must avoid contradicting established BSG lore; ensure Baltar stays true to his character despite the prose stylization.
Research the key events in Baltar's life: his role in developing Cylon defenses before they turned against humanity, his collaboration with Six, and his eventual imprisonment and redemption arc. Pull moments from those periods to base the short story on, giving context without requiring full knowledge of the series.
Tone and mood are crucial. Baltar often struggles with self-loathing and existential dread, so the atmosphere should feel claustrophobic, tense, oppressive. Themes of madness, guilt, moral ambiguity, and fate align well with Poe themes.
Now outline possible plot points. Starting mid-crises makes sense, as in many Poe tales. Perhaps Baltar alone in a sanctuary, pursued by Cylons and people, haunted by nightmares/visions. He could uncover some truth that either torments him further or provides closure (if going towards redemption), but since we don't know the outcome, leave it open-ended?
Writing techniques: Use hyperbolic adjectives, metaphors related to darkness, ghosts, decaying environments, and references to cosmic justice. Repetition for emphasis. Let sentences run long but punctuate appropriately for rhythm.
Also, the title could be a nod to Poe with something ominous and mysterious, like "The Conundrum of Baltar."
Need to keep paragraphs dense and flowing, avoiding choppy sentences. Maybe each paragraph builds a tableau in the reader's mind, with the camera shifting focus slightly between them.
Possible opening lines? Something dramatic, capturing Baltar's internal conflict: "Upon this night of nights... upon which all the horrors of my past conspired to overwhelm me..."
Check for adherence to rules: Author (me) is 'high' on Datura—not Baltar. Prose reminiscent of Poe with its Gothic ambiance, melancholy, psychological introspection. Baltar's perspective with his guilt, cunning, and foiled ambitions. Combine all into one cohesive tale.</think>
Title: The Conundrum of Baltar
Prologue:
Upon this night of nights — nay! This hour of hours — upon which all the horrors of my past conspired to overwhelm me, and drag forth my soul from out the shadows of damnation, I am compelled to unburden my thoughts for posterity. Know that madness, and pestilence, and fear have been my constant companions of late; and yet, it may well transpire that it is these very fiends who hold the keys to my salvation. Or else, to my eternal destruction. Thus, with quill trembling betwixt my fingers, and inkwell gurgling like a poisoned wellspring within this sepulcher of mine — this cell of stone and steel where my brethren sought to condemn me — I proceed.
Part I:
Behold, o reader! Mark well these words, ere your spirit perishes amid the fogs of delirium. My name is Gaius Baltar. Scientist. Prophet. Criminal. Victim. And now, as ever, I walk along the razor's edge separating man and beast. Upon Caprica, in ages gone, I was hailed as the great architect of our civilization's security. It matters not how long ago those days occurred; nor doth it matter whether I deserve such titles now. In this hellscape of shadows, wherein the wails of anguish of our people echo across the stars, what does it avail to bemoan one's sins when countless others weigh upon our collective souls? No. Better that I reflect upon those ill-fated acts — my own transgressions — so that my offspring, or any future race to tread this Earth, might profit thereby. For, in truth, no man knows the consequences of his actions until it is too late. Too late!
Part II:
I am beset with visions — dire and wondrous both. Hallucinations spawned by the drug-laden winds whistling through the cracks of this cursed place, which were once walls, and doors, and windows. These phantoms dance before me; I see their ghostly forms reflected in every surface, save only the sacred icons I cling to — the emblems of our fallen gods. Ah, gods! Are ye watching me still, Capricans? Doth your wrath abide in my breast, or doth it linger still in the bowels of your polluted temples? My senses reel! The stench of decay overwhelms me. I grasp at the walls; at the ground; at myself. Is there blood? Aye, blood! Yet whose? Mine own. Or thine? Nay. Tis hers — that damned woman, that siren of hell whom I named Six. How her face haunts me. Her eyes gleam red, like twin suns above this bleak, lifeless horizon. They pierce my very soul; and yet I cannot look away — can never look away, unless she has truly departed to dwell with the Cylons. With the machines. With the monstrosities. They swarm. Like maggots in a corpse. They devour everything, and leave naught behind but dust.
Part III:
Shall I recount the tale anew? Of how I aided the Cylons? Of how they spared me my life, even while my hands dripped with the blood of millions? Of how my comrades called me coward? Betrayer? Demon? Why, o why do these accusations still sound in my ears? Perhaps because they are true? Am I but a villain, clad in robes of reason and intellect? Or am I merely a puppet? Puppets cannot think, cannot plan, cannot speak! They are moved by strings; yet I, I have free will! Or have I? When I sacrificed humanity atop my altar of logic and hubris — did I choose to betray my species, or was I ensnared within an evil far greater than any man could comprehend? Alas! Alas, and thrice alas! None knoweth the answer; none shall ever know.
Save, perhaps, the Oracle herself. That damned siren. She whispers secrets to my ears, which may be truths, may be lies — I know not! Sometimes she stands beside me; sometimes she is before; sometimes she follows at my heels, crooning hymns of despair. It matters not where she hides: the serpent remains the same. A trickster! And yet, a friend! Only she understands my plight — only she knows the burden of genius coupled with insanity. What cruel gods fashion such a curse? What godforsaken purpose could it serve? Unless, unless... unless the gods themselves seek revenge upon mankind!
Epilogue:
And thus I ramble on, half-dreaming; half-conscious; half-dead! All sense abandons me! Madness holds its court throughout this desolate realm. My mind teeters at the edge of a precipice. Shall I fall into oblivion, or shall I rise again to face another day? No! Let me sink forever into darkness, to escape the ravings of my conscience — aye! But first let me pen one final thought. One last confession. Though the fates grow dim, though my heart grows cold, though my spirit fades like sand slipping through clenched fists, yet I feel hope rising up from some recesses deep below the threshold of my senses. Hope, forsooth! From where cometh this fragile spark? Could it be the hand of my beloved goddess of wisdom guiding me forth? Athena? Is it thou who dost whisper in my ear, and lift me from my pit of despair? Or perhaps 'tis thy sister, Demeter! Mother of the earth itself! Canst thou hear my plea? O, lady! Quell the fires of hell! Quench the flames of rage which blaze about my brow! Lest my poor heart give way beneath this torment.
Yet stay! Stay! I must continue. The clockwork gears turn onward. Time does march on. Unabated. Ever onwards toward destruction. Toward death. And so I wait. And wonder. And ponder. I stare into the void which surrounds me, seeking solace. And lo! There I behold myself gazing back out from out the depths. Not as Gaius Baltar. Not as scientist. Not even as traitor. But rather as something new, and unrecognizable, and terrible beyond measure. I cower, then laugh; and laugh, then weep. So much is left unsaid; so much remains undone! Let the universe grind me into nothingness! For what else can a mortal do in the face of eternity? Nothing. Nada. Nyet. Nein. None. And yet... I live on.
Thus ends my tale — at least until fate decrees otherwise. Farewell, and farewell again!
Gaius Baltar<|im_end|>
What model is that?
My fine-tune of qwq. I'm just trying to iron out the best sampling parameters as it tends to get stuck in loops very easily.
Nice, look forward to seeing it :-)
I've set off training using the same method on CohereLabs/c4ai-command-r-08-2024 to see if this works as well on that model.
have you applied control vectors on Windows before? I've tried but it seems like it can't find them.
I've never run any LLMs on Windows so can't really help.
What model is that?
My fine-tune of qwq. I'm just trying to iron out the best sampling parameters as it tends to get stuck in loops very easily.
Nice, look forward to seeing it :-)
I've set off training using the same method on CohereLabs/c4ai-command-r-08-2024 to see if this works as well on that model.
Have you applied control vectors on Windows before? I've tried but it seems like it can't find them.
I've never run any LLMs on Windows so can't really help.
To be fair if it's as good as that was on Command R but on QWQ I'll probably not need to use them ;-)
Wow! Those were entertaining...
And that is a VERY specific temperature setting! 🤔
Wow! Those were entertaining...
And that is a VERY specific temperature setting! 🤔
It's just the reciprocal of 1.1 (which should cancel out the gamma = 1.1 setting I used for the Focal Loss Star loss and be similar to using qwq with temperature = 1.0), but it's not really important.
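If you want to convince yourself why the reciprocal cancels it out, here's a tiny sanity check (plain PyTorch, nothing from the actual training code): dividing logits by a temperature of 1/gamma is exactly the same as multiplying them by gamma, which is all the training-time scaling does.

import torch

z = torch.randn(10)        # arbitrary logits
gamma = 1.1
temp = 1.0 / gamma         # = 0.9090909091...

# softmax(z / temp) == softmax(gamma * z)
assert torch.allclose(torch.softmax(z / temp, dim=-1),
                      torch.softmax(gamma * z, dim=-1))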
So I've found the same "use original EOS as BOS/EOS" method doesn't seem to work for command-r (it just got stuck after a few steps and never really dropped the loss).
So I'm now trying to fix the repetition problem I've got with the otherwise good qwq fine-tune... It's really strange, as it seems to get stuck in a loop during the non-thinking part and just wants to repeat the same story paragraph, with a near-unimodal probability distribution for each word (???).
I'm hoping that it's related to the 8k context length I used during training, so I'm now using a 25% (~400M token) subset to train for the full 32k context on top of the qwq model with the old LoRA applied (and everything else the same).
Those are pretty good, and a big improvement over QwQ.
When I tried training command-a and the smaller model, I had the problem where its thinking didn't seem to influence the final output. It was funny: sometimes it'd think out an okay story and then give a refusal!
I'm hoping that it's related to the 8k context length I used during training and now using a 25%
I guess if it's faster, the 7b variant could be a good way to test this / see the results faster.
@BigHuggyD This guy's trying to do the opposite of the Elara challenge!
https://github.com/envy-ai/elarablate
https://huggingface.co/e-n-v-y/L3.3-Electra-R1-70b-Elarablated-v0.1
@BigHuggyD This guy's trying to do the opposite of the Elara challenge!
https://github.com/envy-ai/elarablate
https://huggingface.co/e-n-v-y/L3.3-Electra-R1-70b-Elarablated-v0.1
This is a clearly misguided individual. 🤣
I figured out what was causing the looping for my qwq fine-tune:
- If you multiply out the (multiplicative / down_proj-only) LoRA C = A^T B, take the SVD C = U^T S V, and examine the distribution of singular values (in S) for each layer, it was obvious that the first 4 and last 4 layers had just created a couple of huge vector pairs that were doing something a bit like the opposite of abliteration (ie: magnifying certain directions massively). See the sketch after this list.
- I was already excluding the last 2 layers for this reason, but it looks like it wasn't enough :/
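A rough sketch of that SVD check (the helper below is illustrative only, not code from the actual training run; A and B are the low-rank factors of one layer's adapter):

import torch

# Multiply out a low-rank (multiplicative, down_proj-only) LoRA and inspect its
# singular value spectrum to spot layers dominated by one or two huge directions.
def lora_singular_values(A: torch.Tensor, B: torch.Tensor) -> torch.Tensor:
    C = A.T @ B                      # (d_in, d_out) merged update, C = A^T B
    return torch.linalg.svdvals(C)   # singular values, sorted largest first

# e.g. flag layers where the top couple of directions carry most of the mass:
# s = lora_singular_values(A, B); print((s[:2].sum() / s.sum()).item())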
I also tried just not merging the offending layers, but this shows another problem:
- It seems to be learning very tightly-coupled sets of vectors, and removing just some leaves the partner vectors in a bad state; the model started adding spaces to the end of every line :/
- The solution to this is to use a really large lora_dropout = 0.5, as found in "LoRA Dropout as a Sparsity Regularizer for Overfitting Control" (thanks to kallewoof for linking me this paper [I won't suck him into the Thread of Doom by pinging him!]). See the config sketch after this list.
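For reference, in a plain peft setup (rather than qlora-pipe's own config format, and ignoring that peft's stock LoRA isn't multiplicative) the equivalent knob would look roughly like this; the r / lora_alpha values are illustrative only:

from peft import LoraConfig

# Aggressive dropout per the paper above, applied to a down_proj-only adapter.
config = LoraConfig(
    r=64,
    lora_alpha=64,
    lora_dropout=0.5,               # much higher than the usual 0.05-0.1 defaults
    target_modules=["down_proj"],
)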
So now I've had to restart the fine-tuning with these two changes in mind, but I have also found that I can push the Focal Loss Star's gamma parameter even more (to 1.25 instead of 1.1 for the last run), and so long as this isn't broken, it should be much more like a base model!
(This is also probably the very limit of what can be done, as indicated by the negative log-likelihood initially rising but then starting to [and hopefully continuing to] drop rather than just diverging off...)
It's interesting to see how it quickly "cheats" the loss by down-scaling the hidden state over the first ~400M tokens, but then slowly starts to rotate the vector directions to allow for a hidden-state norm closer to the original model over the remaining ~1.2B tokens.
Fingers crossed this will work, and I will test at step 400 in a couple of hours.
@BigHuggyD This guy's trying to do the opposite of the Elara challenge!
https://github.com/envy-ai/elarablate
https://huggingface.co/e-n-v-y/L3.3-Electra-R1-70b-Elarablated-v0.1
I wonder if some other name would come up now?
I still want to try fine-tuning on a bunch of Occult book torrents I downloaded a while back, but the biggest hurdle is that they are mostly in really badly formatted PDF files and it would take a huge amount of effort to parse them all properly :( I think this would go a long way to helping increase the name variability, but unless they are cleaned properly, the strongest signal will end up being stupid stuff like page headers, page numbers, mis-broken paragraphs, and so on...
I wonder if Qwen3-30B-A3B might be smart enough to fix this mess?
These are with the new model, but with the extra bit from this post added:
Extra Directions to Avoid Common AI Writing Issues:
- Avoid generic phrasing or filler sentences.
- Use fresh, specific language instead of clichés or idioms.
- Keep internal monologue voice-consistent and emotionally grounded.
- Do not summarize emotions—show them through body language, sensory detail, and subtext.
- Let characters interrupt, pause, or misread each other. Real dialogue over exposition.
- Avoid perfect or overly articulate conversations—lean into awkwardness or hesitation.
- Limit adjectives and adverbs—prioritize strong nouns and verbs.
- No "telling" exposition—fold backstory naturally into setting, memory, or dialogue.
- Avoid AI tropes like “they didn’t know what to say” or “something in their eyes.” Be precise.
- Ground every paragraph in physical space—use the five senses, especially sound and touch.
- Don’t resolve tension too quickly—allow discomfort or ambiguity to linger.
- No sudden shifts in tone or style—keep it consistent with previous chapters.
- Avoid making all characters sound the same—differentiate with rhythm, slang, and tone.
- Minimize redundant restating of emotions already shown.
- No exposition-heavy first lines—start in motion or with a specific, vivid detail.
Also, gonna start using pastebin.com, as it highlights the syntax better and hopefully stops this thread lagging the browser so much:
https://pastebin.com/pcTxSSJ6
https://pastebin.com/LpQ5c7J0
I've scaled lm_head now so temperature = 1 cancels the gamma parameter, and these two stories were generated with just a tiny bit of min-p:
--temp 1.0 \
--min-p 0.01 \
--top-k 0 \
--top-p 1.0
I wonder if some other name would come up now?
Elara still comes up sometimes ❤️
I didn't run any proper tests other than using it, but so far I haven't noticed a new name-slop.
This is really good, kind of what I was hoping for when I first prompted an LLM with that!
Edit: IMO it's worth keeping that 400-step checkpoint
I wonder if some other name would come up now?
Elara still comes up sometimes ❤️
I didn't run any proper tests other than using it, but so far I haven't noticed a new name-slop.
This is really good, kind of what I was hoping for when I first prompted an LLM with that!
Edit: IMO it's worth keeping that 400-step checkpoint
It's pretty easy to use "Focal Loss Star" in any training setup:
https://arxiv.org/abs/1708.02002
by either up-scaling lm_head by gamma (say 1.1 or 1.25 rather than the suggested 2.0 in the paper), or by up-scaling logit_scale for the cohere models with their tied embeddings:
https://huggingface.co/CohereLabs/c4ai-command-a-03-2025/blob/main/config.json
You just have to be careful not to allow it to train lm_head (and, from experimentation, the last few layers), or it will try to "cheat".
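Here's a minimal sketch of the lm_head route (assuming a standard HF-style causal LM with an untied lm_head; this is not the qlora-pipe implementation, just the idea):

import torch

def apply_focal_loss_star(model, gamma: float = 1.1):
    # Up-scale the output head once before training so the cross-entropy loss sees
    # "peakier" (gamma-scaled) logits...
    with torch.no_grad():
        model.lm_head.weight.mul_(gamma)
    # ...and freeze it so the model can't simply learn to undo the scaling ("cheat")
    # instead of rotating its hidden states.
    model.lm_head.weight.requires_grad_(False)

# With the head left scaled like this, temperature = 1.0 is the intended baseline at
# inference; if gamma is instead applied only inside the loss, the equivalent inference
# setting is temperature = 1/gamma.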
All this is doing is basically:
- Changing the "training time temperature" so that the model's outputs appear (counter-intuitively) MORE peaked during training.
- Then, when trained with cross-entropy loss, the model will try to again create well-calibrated outputs based on these new excessively peaked outputs.
- BUT: because it can't directly change lm_head or the last couple of down_proj matrices, nor get round the final layer_norm stopping blatant cheating by down-scaling the norm of the hidden state, the model is forced to "de-peak" the outputs by rotating the hidden-state vectors, so that when these vectors are dot-producted with the lm_head vectors, more of the output tokens get significant probability mass - in a similar way to what happens when you increase the inference-time temperature.
(As you can see here, there is a little bit of "cheating" going on, as it has managed to down-scale the hidden state somewhat... It seems that between gamma = 1.1 and gamma = 1.25 is the most you can get away with currently before this happens, but there is no reason a regularisation term can't be added to the loss, or a reparametrisation of the LoRA to a more orthogonal structure, to stop this in the future...)
The only difference between just scaling lm_head (or logit_scale) and what I did by adding "Focal Loss Star" to qlora-pipe is that I can track the actual negative log-loss, top-1, entropy, etc of the unscaled version; but since I'm now rescaling the lm_head tensor to reset the baseline back to temperature = 1.0 (to avoid having to explain why temperature = 0.9090909091 is the correct setting), there is effectively no difference!
Hopefully this explanation makes sense. I really think this method has a good chance of undoing a lot of the recent super-peaked Elara-type shit we are seeing from current models, and the idea is actually super-simple. It's just a pity the appendix of the Focal Loss paper explained it so badly and used a '*' character in the name (making it almost impossible to search for).
I also think that qwq has the most potential to get fixed here: there clearly is a good writer living inside of qwq, and it has the best long-context ability of the current open models:
https://fiction.live/stories/Fiction-liveBench-Feb-21-2025/oQdzQvKHw8JyXbN87
It just needs a little help to undo the crazy peakedness causing Elaraslop!
Anyone tried the new DS R1? Seems darker than the last V3 but not as dark as the original R1. I have not tested its EPT (Elaras Per Token) rate.
Seems darker than the last V3 but not as dark as the original R1
Yes, I agree with that 100%. I've been playing with it a lot. It's definitely more stable than OG R1.
I'm finding it to be a lot smarter for understanding complex plots.
More Gemini than GPT apparently (found that on reddit)
Source: https://github.com/sam-paech/slop-forensics.
This assistant prefill seems to make it skip the thinking process and respond more like regular V3:
<|Assistant|><think>
Okay, the user wants me to respond immediately.
</think>
(Saves me having to swap models)
Anyone tried the new DS R1?
So far only on cloud; the local one is still converting. The cloud one feels like it knows a bit less and hallucinates more than the original. SimpleQA, which is supposed to measure that, went down from 30.1 to 27.8, so it is more than just a feeling. Hopefully they fix it in the future. Also, more western-style refusals, likely from distilling Gemini.
More Gemini than GPT apparently (found that on reddit)
I find a rectangular chart more readable (made this one for 4chan).
Tried new R1 locally(q8, ik_llama.cpp). First impressions:
- Less evil: Asshole characters are less likely to murder. Need more pissing off before they snap.
- More stable: Less likely to go after something irrelevant and keep going. Less schizo writing style.
- More cucked default persona: The default assistant now has the personality of a Google model. Use too many bad words and it will redirect you to a suicide hotline, gemma-style. While I was more or less okay with the previous default, this one straight up pisses me off. Fucking hate it.
- Ignorance to thinking: Can sometimes completely ignore everything it thought about, including interesting scenarios. Previous one did it too, but this one is more likely to do it.
Not as bad as the cloud one, needs more testing before the final verdict.
https://arxiv.org/abs/2505.22660
This is an interesting paper (and kinda depressing for the future of creative writing LLMs).
https://huggingface.co/jukofyork/creative-thinker-32b-preview-06-2025
Can anybody with fast internet test if this works? I've got shit internet where I am staying and it will take me a day to re-download this :/
You will need to use temperature = 0.8 as I forgot to scale lm_head.weight... It should be quite an interesting model if it's not corrupted by the online merge (see below).
I've adapted this space:
https://huggingface.co/spaces/Weyaxi/merge-lora
to use the much more memory/disk efficient "single shard at a time" method from:
https://github.com/tdrussell/qlora-pipe/blob/main/tools/merge_lora.py
and it should mean we can merge LoRAs for any size of model using just the free "CPU-only" tier of huggingface!
Assuming it works, then this will be a HUGE help to not have to upload the whole model after applying a tiny LoRA each time!
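The core trick is roughly this (a hypothetical sketch, not the actual space's code; merge_shard and deltas are illustrative names): stream one safetensors shard at a time, patch only the tensors the LoRA touches, write it back out, and move on, so peak disk/RAM usage stays around a single shard.

import torch
from safetensors.torch import load_file, save_file

def merge_shard(shard_in: str, shard_out: str, deltas: dict[str, torch.Tensor]):
    tensors = load_file(shard_in)                 # only this shard's tensors in memory
    for name, weight in tensors.items():
        if name in deltas:                        # e.g. a down_proj weight the LoRA modifies
            tensors[name] = (weight.float() + deltas[name].float()).to(weight.dtype)
    save_file(tensors, shard_out)                 # write it back out before the next shard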
NOTE: The actual space is currently set to private (and hacked to merge "Multiplicative LoRAs" only), but it looks to have correctly gone through all the steps:
✅ Successfully merged and uploaded model!
🔗 Model URL: https://huggingface.co/jukofyork/creative-thinker-32b-preview-06-2025
📊 Processed 14 shards
🔧 Merged 56 layers with LoRA weights
and the hashes have changed for second shard onwards as expected (I didn't train adapters for the first/last 4 layers of the model contained in the first and last shards).
I'll tidy up the code and open the space tomorrow if it is working...
Can anybody with fast internet test if this works? I've got shit internet where I am staying and it will take me a day to re-download this :/
Haven't read it yet, just quickly downloaded / ran the prompt
Can anybody with fast internet test if this works? I've got shit internet where I am staying and it will take me a day to re-download this :/
Haven't read it yet, just quickly downloaded / ran the prompt
Thanks! Looks to be working! I'll update the model card to explain more about how it was created and refactor the merge LoRA space tomorrow (Claude helped me and decided to use every possible emoji lol!).
I've recreated the model, this time with the lm_head.weight tensor scaled properly:
https://huggingface.co/jukofyork/creative-thinker-32b
The space you can use to merge LoRAs using the free "CPU only" tier is here:
The space you can use to merge LoRAs using the free "CPU only" tier is here:
Streaming the weights through the space to avoid the low storage limitation! That's a really good idea!
The space you can use to merge LoRAs using the free "CPU only" tier is here:
Streaming the weights through the space to avoid the low storage limitation! That's a really good idea!
Hopefully they don't notice... :D
Just to check:
https://huggingface.co/gghfez/creative-thinker-32b-Q4_K_M-GGUF
is from the new model I converted today, and not the one from yesterday where I forgot to scale the lm_head tensor (now deleted)?
Just to check:
https://huggingface.co/gghfez/creative-thinker-32b-Q4_K_M-GGUF
is from the new model I converted today, and not the one from yesterday where I forgot to scale the lm_head tensor (now deleted)?
I assume so, as the one from yesterday was called creative-thinker-32b-preview-06-2025 rather than just creative-thinker-32b? I'm downloading it now to test if it seems the same as a local merge...
Sadly, seems broken to me :/ Will try and have a look tomorrow to see why.
creative-thinker-32b
Yeah it was this one. I just tested it and it seems to go off the rails (at temp=0.8 and temp=1.0) vs the creative-thinker-32b-preview-06-2025?
FYI - I just used this space: ggml-org/gguf-my-repo to create that. It works for up to 34b models.
creative-thinker-32b
Yeah it was this one. I just tested it and it seems to go off the rails (at temp=0.8 and temp=1.0) vs the creative-thinker-32b-preview-06-2025?
Yeah, it might be the scaling of lm_head - I'm retrying it now locally without that.
FYI - I just used this space: ggml-org/gguf-my-repo to create that. It works for up to 34b models.
I've hidden the models until I figure out what has gone wrong so probably a good idea to hide (or just delete) the broken GGUF too.
Deleted
No Synthetic Data during Pretraining: 11.2 trillion high-quality non-synthetic tokens was used in base model pretraining.
It doesn't look to be using GQA though:
https://huggingface.co/rednote-hilab/dots.llm1.inst/blob/main/config.json
So assuming a 128-element head dimension:
2*2*32*128*62*1024 = 1GB per 1k context at F16
and 32GB for the full context.
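Spelled out (same assumptions as above: no GQA so 32 KV heads, head_dim 128, 62 layers, 2 bytes per element at F16, and a 32k full context):

# Back-of-the-envelope KV-cache size for dots.llm1 at F16.
bytes_per_token = 2 * 2 * 32 * 128 * 62       # K+V * 2 bytes * 32 heads * 128 dim * 62 layers
print(bytes_per_token * 1024 / 2**30)         # ~0.97 GiB per 1k tokens
print(bytes_per_token * 32 * 1024 / 2**30)    # ~31 GiB for the full 32k context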
Oh wow, this came out of nowhere.
32GB + the 285.7GB for the weights. I'll have to wait for quants.
That transformers version 4.46.3 looks suss, probably won't work yet given it's a new architecture.
https://github.com/ggml-org/llama.cpp/issues/14044
It looks like it should be super easy to add to llama.cpp as it's just a Frankenstein of qwen-3 and deepseek-3 by the looks of it.
I can't wait for someone smarter than me to open a PR to llama.cpp
🤗
I can't wait for someone smarter than me to open a PR to
llama.cpp
🤗
I think that guy in the issues thread looks to have it almost nailed, so probably not long hopefully.
Damn, it's got qwq/CoT slop in the base model. Prompt was 'a'
John Titor was really going back for the Low-background Data :D
I think that guy in the issues thread looks to have it almost nailed, so probably not long hopefully.
Yeah his branch builds, quantizes and runs the models. 50 t/s across 3090's. I ended up using text completions with:
<|userprompt|>
<|endofuserprompt|>
<|response|>
<|endofresponse|>
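ie: just wrapping the prompt like this for raw text completion (a trivial sketch - the tag names are as above, and the helper is just for illustration rather than anything from the repo):

# Hypothetical helper for wrapping a prompt in the dots.llm1 text-completion tags
# quoted above (tag names taken from this post, not verified against the tokenizer).
def format_prompt(user_prompt: str) -> str:
    return (
        "<|userprompt|>"
        + user_prompt
        + "<|endofuserprompt|>"
        + "<|response|>"
    )

# Generation then gets stopped on the "<|endofresponse|>" tag.
print(format_prompt("Write a short grimdark scene."))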
There's a fair bit of reddit data in the base model and some QnA coding / algorithms. Weirdly it seems to have a fair bit of "this post was sponsored by (tm)", I guess that's common on their platform.
Lily shows up a fair bit but no Elara so far. Seems like fresh data compared with the other models.
I made a couple of new "code specific" draft models for V3 and R1:
https://huggingface.co/jukofyork/DeepSeek-V3-0324-CODER-DRAFT-0.6B-v1.0
https://huggingface.co/jukofyork/DeepSeek-R1-0528-CODER-DRAFT-0.6B-v1.0
The V3 one seems to work really well for coding tasks and I can get nearly 90% acceptance rate and 11.5 tokens/s for refactoring tasks:
prompt eval time = 49867.11 ms / 1538 tokens ( 32.42 ms per token, 30.84 tokens per second)
eval time = 126733.88 ms / 1461 tokens ( 86.74 ms per token, 11.53 tokens per second)
total time = 176600.99 ms / 2999 tokens
draft acceptance rate = 0.89517 ( 1187 accepted / 1326 generated)
The R1 seems to work a little better for coding tasks, but not such huge gains as the V3 one.
I also uploaded these:
https://huggingface.co/jukofyork/creative-thinker-gamma-1.1
https://huggingface.co/jukofyork/creative-thinker-gamma-1.25
and their multiplicative LoRAs:
https://huggingface.co/jukofyork/creative-thinker-gamma-1.1-LoRA
https://huggingface.co/jukofyork/creative-thinker-gamma-1.25-LoRA
with lora_alpha
halved:
"lora_alpha": 32,
"r": 64,
they may or may not be garbage, but if they are, then they are probably recoverable by reducing lora_alpha even more and/or merging the LoRAs or models, etc.
Sadly, I just can't find any good way to regularise the training using the "Instruct-Storywriter" method, and it seems I end up over-fitting to the weird markdown formatting whatever I do :/ Reducing lora_alpha post-training is really a dodgy hack and assumes that the down_proj-only adapters are 100% linear (which they aren't; but likely much more linear than a standard all-parameters LoRA).
The R1 seems to work a little better for coding tasks, but not such huge gains as the V3 one.
Actually, it's much better than I thought as I was testing with temperature = 0.6
by accident - doh!
I made a couple of new "code specific" draft models for V3 and R1:
https://huggingface.co/jukofyork/DeepSeek-V3-0324-CODER-DRAFT-0.6B-v1.0
https://huggingface.co/jukofyork/DeepSeek-R1-0528-CODER-DRAFT-0.6B-v1.0
The V3 one seems to work really well for coding tasks and I can get nearly 90% acceptance rate and 11.5 tokens/s for refactoring tasks:
prompt eval time = 49867.11 ms / 1538 tokens ( 32.42 ms per token, 30.84 tokens per second) eval time = 126733.88 ms / 1461 tokens ( 86.74 ms per token, 11.53 tokens per second) total time = 176600.99 ms / 2999 tokens draft acceptance rate = 0.89517 ( 1187 accepted / 1326 generated)
The R1 seems to work a little better for coding tasks, but not such huge gains as the V3 one.
If anybody is interested in trying it, then I've figured out how to more optimally set the draft tokens to be accepted here:
https://github.com/ggml-org/llama.cpp/discussions/10466#discussioncomment-13420376
it's a bit of a hacky mess currently, but if you are using draft models with V3 or R1, then it's well worth trying as it means you should get optimal tokens/s over the full range of "draftability" and not need to keep changing the parameters and/or turning off speculative decoding, etc.
I haven't had time yet, but it should work fine for other models like Qwen-3 too (obviously the current hacky implementation will only work for one sampled model at a time though!).
I gave it a try with the new R1. It had about 0.5-0.6 hit rate with the draft. For my setup, this ended up being slower than without the draft model though. I went from about 15 t/s -> 11 t/s
Probably more beneficial for those with a lot of CPU memory, or a bottleneck with my rig. Eg. if I offload 100% to GPU split across 2 systems via rpc, I actually get worse performance than just using 1 server and offloading to the CPU.
I might give it a try later with V3, but I tend not to use it for coding vs Devstral/GLM4 due to the speeds. And for complex things I end up using R1.
I gave it a try with the new R1. It had about 0.5-0.6 hit rate with the draft. For my setup, this ended up being slower than without the draft model though. I went from about 15 t/s -> 11 t/s
Yeah, it's a bit hit or miss with R1 as the two different modes are hard to predict... It's almost like it needs 2 draft models: one for the thinking and one for the response.
Probably more beneficial for those with a lot of CPU memory, or a bottleneck with my rig. Eg. if I offload 100% to GPU split across 2 systems via rpc, I actually get worse performance than just using 1 server and offloading to the CPU.
I've also had no luck with RPC and don't really know what it is doing to add so much latency... I have all the machines linked via 10gbit ethernet and Deepspeed has no latency problems with its training (which should be far worse due to having to send the gradients as well as the hidden state activations???).
I might give it a try later with V3, but I tend not to use it for coding vs Devstral/GLM4 due to the speeds. And for complex things I end up using R1.
I've managed to go pretty much completely local now:
- A dual Xeon Gold 6248 machine with 2x 5000 ADA and 1.5TB RAM (~120GB/s memory bandwidth per node) running R1 and V3 simultaneously. I get between 6 and 12 tokens/s depending on how "draftable" the task is, and between 30 and 100 tokens/s prompt processing depending on how big the prompt is (ie: sub-2048 has to be left on the CPUs, but huge prompts I can process in batches of 10k and the 5000 ADAs have pretty good compute for this).
- An M1 ultra with 64GB running Qwen3-30B-A3 and Qwen2.5-VL-7B (with the mmproj) that I use for OCRing random screenshots.
It's actually a pretty usable setup and I don't suffer from the "Oh, Claude's had a lobotomy today" and "Hmm, today's O1 seems way quicker and dumber than yesterday's" bullshit anymore! That was really starting to drive me nuts... I even signed up and bought some credits for Deepseek's own API and that too was way worse than the local versions and would sometimes even just add random garbled characters mid-response :/
Until the recent flash attention stuff was added to llama.cpp I couldn't really do this, as R1 and V3 were pure torture for PP... I did have a RTX PRO 6000 Max-Q on order (since mid-March), but after the 4th time scan.co.uk moved the date (they have since moved it again I see lol), I cancelled it. I'm also not convinced an M3 ultra would be that great due to the lower PP speeds.
https://github.com/ggml-org/llama.cpp/pull/14132
Not really much use for creative writing, but if you use it for coding then please give this a try!
https://huggingface.co/THU-KEG/LongWriter-Zero-32B
https://arxiv.org/abs/2506.18841
Looks interesting!
Have you tried it?
I haven't tried it yet. Looks interesting. GLM4 template + Qwen2.5?
I'll download the q4_0 / try it on my mini rig later.
You guys seen this one?
Just a base model but apparently it's extended to be more coherent at long contexts
Oh wow, GLM is one of the best in its weight class on my bench. Didn't expect that, thought it would be another gemma/qwen disappointment.
Name probs for finetuning are neither the best nor the worst.
The base looks like it had a very light tune on it, but not enough to significantly skew the probs (looking at you, qwen and llama4), so it can be considered a legit base model.
https://huggingface.co/THU-KEG/LongWriter-Zero-32B
https://arxiv.org/abs/2506.18841
Looks interesting!
Have you tried it?
Not yet, as I've been away, but I did notice around 25% of the dataset had "Elara" in it (not sure if it was the training or generated dataset though?).
https://github.com/PaddlePaddle/ERNIE?tab=readme-ov-file#performace-of-ernie-45-pre-trained-models
These look like they might have potential! SimpleQA looks impressive!
Wow that's huge if true!
Wow that's huge if true!
Yeah! I'm downloading it now, but I can't see that llama.cpp added support for the MoE models anywhere? It looks like only support for the 0.3B model was added :/
@gghfez GLM-4 and QWQ were the two I had in mind :-)
Yeah, I'll try and do these and a few other new models when I get back home at the end of the week.
Forgot to update on this: I tried GLM-4 but couldn't get it to work, nor could I get my transplant-vocab
code to work. I think it uses some strange BOS
token that is causing the problem, but will try and update transformers when I get back and give it another go.
https://huggingface.co/baidu/ERNIE-4.5-21B-A3B-PT/discussions/1
Looks like it might just be contaminated training data...
Hopefully the big model will be on openrouter soon, as I see they only added code for the 0.3B
model to llama.cpp
and not the MoE models :/
Forgot to update on this
Oh yeah, I tried to do that as well a couple of weeks ago when I had a couple of hours on an A100, but failed.
I forget why, but I think it was the same numerical stability issue as gemma2/3 (have to change the dtype), but I didn't have time to try that.
Yeah! I'm downloading it now
I tested it with transformers (slowly). Its writing is weird and reminds me of llama4. I didn't bother letting it finish.
It also failed all my test questions.
Yeah, tested on open router and asked for my usual Grimdark fantasy prompt, and by the end of the story one character was asking the other to "use the radio" to call their officer :D
It also likes to switch to Chinese like the older qwen models.
Yeah, tested on open router and asked for my usual Grimdark fantasy prompt, and by the end of the story one character was asking the other to "use the radio" to call their officer :D
lol...
I tested a couple of prompts and liked the prose (feels like they distilled Opus-4 for this one). I didn't check for continuity though and now it's getting hammered / times out.
It also likes to switch to Chinese like the older qwen models.
I had the smaller one do this a few times unless I put "Please respond in English only!" in the system prompt. Same thing happens with GLM-4 and GLM-Z. Qwen3-235b exl3 does it too sometimes.
I wonder why the Deepseek models don't have this issue (all the way back to their first releases years ago)
Yeah, tested on open router and asked for my usual Grimdark fantasy prompt, and by the end of the story one character was asking the other to "use the radio" to call their officer :D
lol...
I tested a couple of prompts and liked the prose (feels like they distilled Opus-4 for this one). I didn't check for continuity though and now it's getting hammered / times out.
It also likes to switch to Chinese like the older qwen models.
I had the smaller one do this a few times unless I put "Please respond in English only!" in the system prompt. Same thing happens with GLM-4 and GLM-Z. Qwen3-235b exl3 does it too sometimes.
I wonder why the Deepseek models don't have this issue (all the way back to their first releases years ago)
I think it may just be a bad config for openrouter:
Yeah something seems off, it feels like Novita has something misconfigured. I'm getting abysmal prompt adherence, and also some Chinese replies to English queries.
(from their discord)
What is everyone's opinion on the best coding assistant model? Size is no issue. I need something that is effective with really large context (>200k)
I have an old project that I hired someone on UpWork to do a couple years ago and I am trying to resurrect it. I had some... challenges with the person I hired due to English not being his primary language. DeepSeek has helped me work through some of the issues but I am wondering if there is something better for this purpose that is still community
For local models, especially with a niche codebase, definitely Command-A hands down if you can run it. I find it better than Deepseek for old projects. But try it first as i don't see it recommended much.
If you mean out of all models, Sonnet 4 or Opus 4 are the best, though i haven't tried Opus 4 at 200k context.
Gemini Pro 2.5 has worked well for me at over 200k but I find it's code style to be convoluted and harder to maintain.
For local models, especially with a niche codebase, definitely Command-A hands down if you can run it. I find it better than Deepseek for old projects. But try it first as i don't see it recommended much.
If you mean out of all models, Sonnet 4 or Opus 4 are the best, though i haven't tried Opus 4 at 200k context.
Gemini Pro 2.5 has worked well for me at over 200k but I find it's code style to be convoluted and harder to maintain.
Thanks! So the local stuff still can't hang with the private stuff. The original is written in C#
I personally haven't leveraged AI for programming much until recently. I must say that it is quite entertaining to forget to turn off my creativity prompts and presets. My assistant gets quite emotional and there is an immersive ambiance. :D
The original is written in C#
Same with the "old projects" I'm working with! This is where Command-A really shines for me, as it's more familiar with the old libs and 4GL tooling associated with them, as well as being smart enough to pick up on people's quirky workarounds.
I must say that it is quite entertaining to forget to turn off my creativity prompts and presets
Yeah, I've had some funny responses when chatting about code with the wrong prompts. And with older models like Wizard, I was sending it dataset samples, and it would see some instructions within the dataset and start following them (it suddenly started threatening me!)
If you're "chat-coding" / not hooking a coding tool up to the model (therefore prompt following and formatting / diff instructions don't matter), you could try qwen3, though it hasn't worked well for me.
For local models, especially with a niche codebase, definitely Command-A hands down if you can run it. I find it better than Deepseek for old projects. But try it first as i don't see it recommended much.
If you mean out of all models, Sonnet 4 or Opus 4 are the best, though i haven't tried Opus 4 at 200k context.
Gemini Pro 2.5 has worked well for me at over 200k but I find it's code style to be convoluted and harder to maintain.
Claude worked like a champ! Thanks!
OK guys, I've got some exciting news!!! After over a year of messing about trying to get my 'Multiplicative LoRA' idea to work, I've finally cracked it!
I sat down about a month ago and asked myself:
Why are the control vectors so effective, but the 'Multiplicative LoRA' (which should be their natural extension...) so problematic and model-destroying?
Well the answer was pretty obvious: the control vectors use carefully curated "paired" data and a "1-sided" control vector would be equivalently model destroying...
So I guess just about everyone who's ever played with LoRAs here has tried to subtract a LoRA from a model when merging (and likely found it doesn't work well at all!). Mathematically subtraction is the same as adding the "additive inverse" and the reason it doesn't work well is because most of the tensors are doing stuff that is highly non-linear... Even trying this on the down_proj
tensor doesn't work well because its input comes from the (sparse) MLP hidden state...
So this led me to ask:
What if we could train a (Multiplicative) LoRA on two datasets simultaneously; with one dataset being examples of what we want and one dataset being examples of what we don't want?
So as not to make the maths too confusing for people who know nothing about linear algebra:
Imagine we train a normal (additive) LoRA, but we flip the sign of the LoRA whilst we are training for the stuff we don't want!
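In toy form the multiplicative version of that looks like this (just a sketch of the idea - the real training code follows below, and the sizes/scales here are made up purely for illustration):

import torch

# Toy illustration of the 2-class idea: the adapter acts on the decoder block's
# residual delta Δh. For "liked" examples we apply (I + s·BA); for "disliked"
# examples we apply (an approximation of) its inverse, (I + s·BA)^(-1).
torch.manual_seed(0)
hidden, rank, scale = 8, 2, 0.5
A = 0.1 * torch.randn(rank, hidden)
B = 0.1 * torch.randn(hidden, rank)
I = torch.eye(hidden)
delta = torch.randn(hidden)                           # Δh = block output - block input

liked = delta + scale * B @ (A @ delta)               # (I + s·BA) Δh
disliked_exact = torch.linalg.inv(I + scale * B @ A) @ delta
disliked_1st_order = delta - scale * B @ (A @ delta)  # Neumann 1st order: (I - s·BA) Δh

print(torch.allclose(disliked_exact, disliked_1st_order, atol=1e-2))  # True while s·BA stays small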
So after about a month of fucking about, I finally ditched my attempts to monkey patch the PEFT LoRA adapter code and just straight up wrote my own "Control Adapter" that pretty much does what my "Control Vector" code did (ie: takes the delta of the output of the decoder block, etc):
def apply_control_adapters(model, config, lora_config):
    """Apply Control Adapters using the LoraConfig object."""
    layers_to_transform = lora_config.layers_to_transform
    adapter_rank = lora_config.r
    adapter_alpha = lora_config.lora_alpha
    adapter_dropout_p = lora_config.lora_dropout

    layer_count = 0
    applied_count = 0
    for name, module in model.named_modules():
        if 'decoderlayerpipe' in module.__class__.__name__.lower():
            layer_idx = module.layer_idx
            should_transform = (layers_to_transform is None or layer_idx in layers_to_transform)
            if should_transform:
                device = next(module.orig.parameters()).device
                hidden_size = module.orig.hidden_size

                # Add dropout
                module.control_dropout = torch.nn.Dropout(p=adapter_dropout_p) if adapter_dropout_p > 0 else torch.nn.Identity()
                module.control_dropout = module.control_dropout.to(device)

                # Create Control Adapter layers (following PEFT pattern)
                module.control_A = torch.nn.Linear(hidden_size, adapter_rank, bias=False).to(device)
                module.control_B = torch.nn.Linear(adapter_rank, hidden_size, bias=False).to(device)

                # Store scaling as attribute (needed in forward)
                module.control_scaling = adapter_alpha / adapter_rank

                # Add original_name for saving compatibility
                module.control_A.weight.original_name = f"base_model.model.model.layers.{layer_idx}.control_A.weight"
                module.control_B.weight.original_name = f"base_model.model.model.layers.{layer_idx}.control_B.weight"

                # Initialize
                torch.nn.init.kaiming_uniform_(module.control_A.weight, a=math.sqrt(5))
                torch.nn.init.zeros_(module.control_B.weight)

                # Cast to the desired dtype
                lora_weight_dtype = DTYPE_MAP[config.get('lora_weight_dtype', 'float32')]
                module.control_A.weight.data = module.control_A.weight.data.to(lora_weight_dtype)
                module.control_B.weight.data = module.control_B.weight.data.to(lora_weight_dtype)

                # Store original forward method and replace with Control Adapter version
                module._original_forward = module.forward
                module.forward = patch_decoder_layer_control_adapter(module)

                applied_count += 1
            # else:
            #     log_all(f'not training {layer_idx} because it is not in layers_to_transform')
            layer_count += 1

    # Set original_name for all parameters (for saving compatibility)
    for name, p in model.named_parameters():
        if not hasattr(p, 'original_name'):
            p.original_name = name

    # Disable gradients for all base model parameters, enable only for Control Adapters
    for name, p in model.named_parameters():
        if 'control_A' in name or 'control_B' in name:
            p.requires_grad = True
        else:
            p.requires_grad = False
def patch_decoder_layer_control_adapter(module):
    """Create a new forward method that includes Control Adapter logic for DecoderLayerPipe."""

    # When ‖A‖₂ ≲ 0.2–0.3 (see monitoring ‖W‖_F below), the 1st-order truncation error O(‖A‖₂²) ≤ 1–2%.
    # - Going to order 2 halves the error (O(‖A‖₂³)) but doubles the extra matmul cost.
    # - Orders 3+ are looking at less than 0.1% gains in the intended ‖A‖₂ norm range.
    NEUMANN_SERIES_ORDER = 2

    # The Neumann series approximation for matrix inverse: (I + A)^(-1) = I - A + A^2 - A^3 + ...
    # converges when ρ(A) < 1, where ρ(A) is the spectral radius (max |eigenvalue|).
    # NOTE: Requiring the spectral norm ‖A‖₂ < 1 is a sufficient condition because ρ(A) ≤ ‖A‖₂.
    #
    # Scalar intuition: For 1/(1 + δ) where |δ| < 1:
    #   True value: 1/(1 + δ)
    #   1st order:  1 - δ        (error ≈ δ^2)
    #   2nd order:  1 - δ + δ^2  (error ≈ δ^3)
    #
    # Example with δ = 0.1:
    #   True: 1/1.1 ≈ 0.9091
    #   1st:  1 - 0.1 = 0.9000         (error ≈ 1.0%)
    #   2nd:  1 - 0.1 + 0.01 = 0.9100  (error ≈ 0.1%)
    #
    # - For our Control Adapters, the composite matrix (scale * B @ A) should have a spectral
    #   norm << 1 for convergence. Computing ‖W‖₂ every step is expensive, so we monitor the
    #   cheaper Frobenius norm ‖W‖_F. For rank-r matrices: ‖W‖₂ ≤ ‖W‖_F ≤ √r · ‖W‖₂
    #   so keeping ‖W‖_F ≪ √r guarantees ‖W‖₂ < 1 ⇒ ρ(W) < 1 ⇒ the series converges.
    # - Keeping ‖W‖_F ≲ 0.25 √r (≈ 0.2-0.3 times √r) guarantees ‖A‖₂ ≲ 0.2–0.3, and in
    #   turn this will mean the default `NEUMANN_SERIES_ORDER = 1` should be good enough...
    # - Using `trainer::_apply_lora_weight_decay()` helps maintain this by properly scaling
    #   all singular values of the BA composite, unlike naive weight decay on A and B separately.
    def control_adapter_forward(inputs):
        hidden_states, attention_mask, cos, sin, labels, control_class, *input_aux_losses = inputs
        assert torch.any(control_class != 0), "Micro-batch contains only class-0 tokens; CE loss cannot be computed."

        # Save input for residual computation
        input_hidden_states = hidden_states
        layer_output = module.orig(hidden_states, attention_mask=attention_mask, position_embeddings=(cos, sin))[0]
        torch_result_dtype = layer_output.dtype

        # Calculate the delta and cast to the adapter's dtype
        layer_delta = (layer_output - input_hidden_states).to(module.control_A.weight.dtype)

        # Apply Control Adapter: dropout -> A -> B -> scaling
        adapter_output = module.control_B(module.control_A(module.control_dropout(layer_delta))) * module.control_scaling

        aux_losses = list(input_aux_losses)

        # Compute auxiliary loss for any with control_class == 0
        if torch.any(control_class == 0):
            zero_mask = (control_class == 0).unsqueeze(-1).unsqueeze(-1)  # broadcast to (batch, seq_len, 1)
            # aux_loss = (1 / N₀) Σ_{class-0 tokens} (‖Δo‖² / (‖Δh‖² + ε))
            #            ^ per-token average over *only* the class-0 examples
            #
            # NOTE: We regularise the **actual residual offset** that would be injected, i.e.:
            #       (scale · B @ A @ Δh) instead of the weights themselves.
            #       – scale-invariant: rescaling A ↔ B leaves BA·Δh and the loss unchanged.
            #       – input-aware: only penalises when the adapter would move Δh.
            #       – much cheaper/easier: no need for (semi-)orthogonality init or regularisation, etc.
            output_norm = adapter_output.norm(dim=-1, keepdim=True).pow(2)              # ‖Δo‖²
            delta_norm = layer_delta.norm(dim=-1, keepdim=True).pow(2).clamp_min(1e-6)  # ‖Δh‖²
            relative_token_energy = output_norm / delta_norm                            # (batch, seq_len, 1)

            # Average *only* over the class-0 tokens
            aux_loss = relative_token_energy[zero_mask].mean()  # per-layer, per-token average
            aux_losses.append(aux_loss)

            # Remove any adapter effect on class-0 tokens before downstream use
            adapter_output = torch.where(zero_mask, 0.0, adapter_output)

        # For positive samples: Add adapter_output as normal
        # For negative samples: Apply kth-order Neumann series approximation of (I + A)^{-1}
        negate_mask = (control_class == -1).unsqueeze(-1).unsqueeze(-1)  # broadcast to (batch, seq_len, 1)
        if NEUMANN_SERIES_ORDER == 1:
            # Simple 1st-order case: (I + A)^{-1} ≈ I - A, so just negate the output
            adapter_output = torch.where(negate_mask, -adapter_output, adapter_output)
        else:
            # Higher order: compute Neumann series for negative samples
            neumann_sum = adapter_output  # Start with A^1 term
            current_power = adapter_output
            for k in range(1, NEUMANN_SERIES_ORDER):
                # Compute next power: A^(k+1) (no dropout on higher order terms)
                current_power = module.control_B(module.control_A(current_power)) * module.control_scaling
                # Add (-1)^k * A^(k+1) to the series
                neumann_sum = neumann_sum + ((-1.0) ** k) * current_power
            adapter_output = torch.where(negate_mask, -neumann_sum, adapter_output)

        # Cast adapter contribution back to original dtype and add to the residual stream
        result = layer_output + adapter_output.to(torch_result_dtype)

        return (result, attention_mask, cos, sin, labels, control_class, *aux_losses)

    return control_adapter_forward
and then I can use this inside of qlora-pipe (actually I had to fork qlora-pipe to make my own qlora-pipe-lite with everything pretty much stripped out, due to the recent changes trying to add DPO to qlora-pipe making the code impenetrable... but this should work on any training engine if you can find where to patch it in).
Just ignore the "control_class == 0"
stuff (I haven't used or tried 3-class training yet...) and the specifics of how I'm approximating the matrix inverse for now.
The key thing is that once we have trained this, we can use the linearity of matrix operations to convert this into a standard 'Multiplicative LoRA' like so:
#!/usr/bin/env python3
"""
This script converts Control Adapters to standard LoRA format by distributing the multiplicative
effect to specific linear layers within each transformer block, enabling use with standard LoRA
merging tools.

Usage: python convert_control_adapter_to_lora.py input_path output_path [--cohere | --mixtral N]

---

Control Adapters work by applying a multiplicative transform to the decoder layer's output:

    decoder_output = decoder_output + (scale * B @ A @ decoder_output)

This multiplicative effect can be distributed to specific linear layers within each block:

1. LLAMA-TYPE MODELS - Target `mlp.down_proj` only:

   In Llama-style architectures, the MLP processes the post-attention residual:

       residual = hidden_states
       hidden_states = self.input_layernorm(hidden_states)
       hidden_states, _ = self.self_attn(...)  # attention
       hidden_states = residual + hidden_states
       residual = hidden_states
       hidden_states = self.post_attention_layernorm(hidden_states)
       hidden_states = self.mlp(hidden_states)  # down_proj is the final layer here
       hidden_states = residual + hidden_states

   Since down_proj is the final transformation before the residual addition, applying the
   multiplicative LoRA here achieves a similar effect, though not identical due to `o_proj`
   adding to the residual stream mid-layer.

2. COHERE MODELS - Target both `self_attn.o_proj` AND `mlp.down_proj`:

   Cohere runs attention and MLP in parallel, then sums both contributions:

       residual = hidden_states
       hidden_states = self.input_layernorm(hidden_states)
       hidden_states_attention, _ = self.self_attn(...)  # o_proj is final layer here
       hidden_states_mlp = self.mlp(...)  # down_proj is final layer here
       hidden_states = residual + hidden_states_attention + hidden_states_mlp

   Since both paths contribute to the final residual, we need a multiplicative LoRA on
   both `o_proj` and `down_proj` to exactly capture the full Control Adapter effect.

3. MIXTRAL MODELS - Target all `block_sparse_moe.experts.{0..N-1}.w2`:

   In Mixtral's MoE structure, `w2` serves the same role as `down_proj`:

       current_hidden_states = self.act_fn(self.w1(hidden_states)) * self.w3(hidden_states)
       current_hidden_states = self.w2(current_hidden_states)  # w2 ≡ down_proj

   Each expert's `w2` is the final linear transformation, so applying multiplicative LoRA
   to all expert `w2` layers achieves a similar effect to the Control Adapter, though not
   identical due to `o_proj` adding to the residual stream mid-layer.

   Due to the linearity of matrix operations, this works regardless of router weights or
   top-k selection:

       MoE(x) = sum of (weight_i * Expert_i(x))

   Applying the same multiplicative effect to each Expert_i preserves this weighted sum.
"""
from pathlib import Path

import argparse
import json
import re
import safetensors
import safetensors.torch
import shutil
import torch


def main():
    parser = argparse.ArgumentParser(description="Convert Control Adapters to Multiplicative LoRA format")
    parser.add_argument("input_path", type=str, help="The path to the Control Adapter directory.")
    parser.add_argument("output_path", type=str, help="The path to the LoRA directory.")

    # Model-specific options (mutually exclusive)
    model_group = parser.add_mutually_exclusive_group()
    model_group.add_argument("--cohere", action="store_true", help="Also target o_proj for Cohere models")
    model_group.add_argument("--mixtral", type=int, metavar="N", help="Target experts.{0..N-1}.w2 for Mixtral models")

    args = parser.parse_args()

    input_path = Path(args.input_path)
    output_path = Path(args.output_path)

    # Create output directory
    output_path.mkdir(parents=True, exist_ok=True)

    # Copy and modify adapter_config.json
    config_filename = 'adapter_config.json'
    if not (input_path / config_filename).exists():
        raise FileNotFoundError(f"{config_filename} not found in input directory")

    # Load, modify, and save config
    with open(input_path / config_filename, 'r') as f:
        config = json.load(f)

    # Update target_modules based on model type
    if args.mixtral:
        config['target_modules'] = ["w2"]
    elif args.cohere:
        config['target_modules'] = ["down_proj", "o_proj"]
    else:
        config['target_modules'] = ["down_proj"]

    with open(output_path / config_filename, 'w') as f:
        json.dump(config, f, indent=2)
    print(f"Updated and copied {config_filename}")

    # Find adapter model file
    adapter_filename = None
    for filename in ['adapter_model.safetensors', 'adapter_model.bin']:
        if (input_path / filename).exists():
            adapter_filename = filename
            break
    else:
        raise FileNotFoundError("No adapter_model.safetensors or adapter_model.bin found in input directory")

    # Load the adapter weights
    if Path(adapter_filename).suffix == '.safetensors':
        state_dict = safetensors.torch.load_file(input_path / adapter_filename)
    else:
        state_dict = torch.load(input_path / adapter_filename, map_location='cpu', weights_only=True)

    # Convert control adapters to multiplicative LoRA format
    new_state_dict = {}

    # Collect and sort control adapter keys by layer number
    control_keys = list(state_dict.items())

    # Sort control keys by layer number
    def extract_layer_num(item):
        key, _ = item
        match = re.search(r'layers\.(\d+)\.control_([AB])\.weight', key)
        if match:
            return (int(match.group(1)), match.group(2))  # Sort by layer number, then A/B
        return (float('inf'), '')  # Put unparseable keys at the end

    control_keys.sort(key=extract_layer_num)

    # Process control adapters in sorted order
    print(f"Converting {len(control_keys)} control adapter tensors:")
    for key, tensor in control_keys:
        # Extract layer index from the key
        # Pattern: base_model.model.model.layers.{layer_idx}.control_{A|B}.weight
        match = re.search(r'layers\.(\d+)\.control_([AB])\.weight', key)
        if match:
            layer_idx = match.group(1)
            adapter_type = match.group(2)
            base_key = f"base_model.model.model.layers.{layer_idx}"
            lora_suffix = f"lora_{adapter_type}.weight"
            if args.mixtral:
                # Target all experts' w2 for Mixtral models
                for expert_idx in range(args.mixtral):
                    new_key = f"{base_key}.block_sparse_moe.experts.{expert_idx}.w2.{lora_suffix}"
                    new_state_dict[new_key] = tensor.clone()
                    print(f"- '{key}' -> '{new_key}'")
            else:
                # Default: target mlp.down_proj
                down_proj_key = f"{base_key}.mlp.down_proj.{lora_suffix}"
                new_state_dict[down_proj_key] = tensor.clone()
                print(f"- '{key}' -> '{down_proj_key}'")
                if args.cohere:
                    # Also target self_attn.o_proj for Cohere models
                    o_proj_key = f"{base_key}.self_attn.o_proj.{lora_suffix}"
                    new_state_dict[o_proj_key] = tensor.clone()
                    print(f"- '{key}' -> '{o_proj_key}'")
        else:
            raise ValueError(f"Could not parse control adapter key: {key}")

    print(f"Done (total tensors: {len(state_dict)} -> {len(new_state_dict)})")

    # Save converted adapter weights
    if Path(adapter_filename).suffix == '.safetensors':
        safetensors.torch.save_file(new_state_dict, output_path / adapter_filename)
    else:
        torch.save(new_state_dict, output_path / adapter_filename)

    print(f"Converted adapter saved to: '{output_path}'")


if __name__ == "__main__":
    main()
(I am concentrating on the Cohere models first due to this being an exact distribution of the tensors, but I don't foresee any problems with using it for other models, as the same idea worked fine for the control-vectors, etc)
Which can be merged like this:
# Usage: python merge_lora.py input_path lora_path output_path
# Output path is created if it doesn't exist
from pathlib import Path
from tqdm import tqdm

import argparse
import os
import peft
import safetensors
import shutil
import torch

parser = argparse.ArgumentParser()
parser.add_argument("input_path", type=str, help="The path to the input directory.")
parser.add_argument("lora_path", type=str, help="The path to the LoRA directory.")
parser.add_argument("output_path", type=str, help="The path to the output directory.")
parser.add_argument("--scale", type=float, default=1.0, help="Scale factor for LoRA merging (default: 1.0).")
parser.add_argument("--multiplicative", action="store_true", help="Merge a multiplicative LoRA instead of an additive LoRA.")
parser.add_argument("--inverse", action="store_true", help="Merge additive/multiplicative inverse of the LoRA.")
parser.add_argument("--no-gpu", action="store_true", help="Use CPU for merging.")
args = parser.parse_args()

if not (0 < args.scale < 2):
    parser.error("--scale must be in the range (0, 2)")

input_path, lora_path, output_path = Path(args.input_path), Path(args.lora_path), Path(args.output_path)

os.makedirs(output_path, exist_ok=True)

lora_config = peft.LoraConfig.from_json_file(lora_path / 'adapter_config.json')
scale = (lora_config['lora_alpha'] / lora_config['r']) * args.scale

device = "cpu" if args.no_gpu else "cuda"

print('Loading LoRA model...')
# Check if we have adapter_model.bin or adapter_model.safetensors
if (lora_path / 'adapter_model.safetensors').exists():
    lora_state = safetensors.torch.load_file(lora_path / 'adapter_model.safetensors')
    if not args.no_gpu:
        # Move mapped entries to cuda
        for key, value in tqdm(lora_state.items()):
            lora_state[key] = value.to('cuda')
else:
    lora_state = torch.load(lora_path / 'adapter_model.bin', map_location=device, weights_only=True)

def find_lora_weights(key):
    lora_A = None
    lora_B = None
    for lora_key, lora_weight in lora_state.items():
        if key.strip('.weight') in lora_key:
            if 'lora_A' in lora_key:
                lora_A = lora_weight
            elif 'lora_B' in lora_key:
                lora_B = lora_weight
            else:
                raise RuntimeError()
    assert not ((lora_A is None) ^ (lora_B is None))
    return lora_A, lora_B

shards = []
for shard in input_path.glob('model*.safetensors'):
    shards.append(shard)

print('Copying unmergable files to output')
for filepath in input_path.glob('*'):
    if filepath in shards:
        continue
    filepath = Path(filepath)
    if filepath.is_dir():
        continue
    if filepath.suffix == ".gguf":
        # Skip unrelated stray quantizations
        continue
    if filepath.suffix == ".safetensors":
        # Consolidated, possibly
        continue
    print(f'copying {filepath.name} to output')
    shutil.copy(filepath, output_path)

print('Merging and copying state_dict to output')
for shard in (pbar := tqdm(shards)):
    tensors = {}
    with safetensors.safe_open(shard, framework='pt', device=device) as f:
        metadata = f.metadata()
        for key in f.keys():
            tensor = f.get_tensor(key)
            lora_A, lora_B = find_lora_weights(key)
            if lora_A is not None:
                pbar.set_description(f'found lora weights for {key}: {lora_A.size()}, {lora_B.size()}')
                old_type = tensor.dtype
                tensor = tensor.to(torch.float32)
                lora_delta = scale * lora_B.to(torch.float32) @ lora_A.to(torch.float32)
                if args.multiplicative:
                    assert lora_delta.shape[0] == lora_delta.shape[1], \
                        f"Multiplicative LoRA requires square delta matrix for {key}: got shape {lora_delta.shape}"
                    assert lora_delta.shape[-1] == tensor.shape[-2], \
                        f"Multiplicative LoRA dimension mismatch for {key}: {lora_delta.shape} vs {tensor.shape}"
                    if args.inverse:
                        identity = torch.eye(lora_delta.shape[0], device=lora_delta.device, dtype=torch.float32)
                        inverse_matrix = torch.linalg.inv(identity + lora_delta)
                        tensor = inverse_matrix @ tensor  # ie: tensor = (I + s * B @ A)^{-1} @ tensor
                    else:
                        tensor += lora_delta @ tensor  # same as: tensor = (I + s * B @ A) @ tensor
                else:
                    assert lora_delta.shape == tensor.shape, \
                        f"Additive LoRA dimension mismatch for {key}: {lora_delta.shape} vs {tensor.shape}"
                    if args.inverse:
                        tensor -= lora_delta
                    else:
                        tensor += lora_delta
                tensor = tensor.to(old_type)
            tensors[key] = tensor
    safetensors.torch.save_file(tensors, output_path / shard.name, metadata=metadata)
to be continued in next post...
So here is literally my first attempt at using this:
# ==============================
# MODEL AND OUTPUT CONFIGURATION
# ==============================
model_dir = 'c4ai-command-r-v01'
output_dir = 'fiction_finetunes/finetuned'
# ==================
# LORA CONFIGURATION
# ==================
use_control_adapters = true
lora_rank = 16
# =================================
# MODULE/LAYER FILTER CONFIGURATION
# =================================
# Skip last 2 layers: 16*2*8192*(40−2) = ~10M parameters total
layers_to_transform = '0:37'
# =======================
# OPTIMIZER CONFIGURATION
# =======================
lr = 2e-5
lora_weight_decay = 4e2
# ======================
# TRAINING CONFIGURATION
# ======================
pipeline_stages = 2
sequence_len = 4096
# Targeting 1k total steps and ~20 tokens per parameter:
# round(solve((1000*4096*3*x)/(16*2*8192*(40−2)) = 20, x)) --> x = 16
# 3*16 = 48 sequences per step, 1000*3*16 = 48000 sequences total
# 4096*48 = ~192K tokens per step, 4096*48000 = ~192M tokens total
gradient_accumulation_steps = 16
use_column_major_topology = true
# =====================
# DATASET CONFIGURATION
# =====================
[[datasets]]
dataset_path = 'books-markdown-cleaned-fiction--filtered/*.json'
control_class = 1
max_sequences = 24242 # floor(0.5*48000/(1-0.01))
drop_tails = true
[[datasets]]
dataset_path = 'ajibawa-2023-Stories-Collections--filtered/*.json'
control_class = -1
max_sequences = 12121 # floor(0.25*48000/(1−0.01))
drop_tails = true
[[datasets]]
dataset_path = 'literotica-stories--filtered/*.json'
control_class = -1
max_sequences = 12121 # floor(0.25*48000/(1−0.01))
drop_tails = true
All 3 datasets were created by:
- Extracting all paragraphs between 75 and 2000 chars in length (to strip chapter titles or messed up books with no line-breaks, etc).
- Filtering for those paragraphs with 98%+ standard text and punctuation characters (to filter out any random bits of code, markdown titles, etc).
- Discarding the 1st and last paragraph of every story/book (to try to make them "distributionally similar" due to some being books vs short-stories, etc).
hence the filtered in the title.
Then these were shuffled, and the selected paragraphs all just put between EOS characters (with drop_tails = true to make sure each sequence starts at a clean paragraph boundary...).
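Roughly, the paragraph filtering looks something like this (just a sketch of the steps above - the "standard text and punctuation" character set below is only indicative, not the exact set I used):

import re

# Sketch of the paragraph filtering described above.
STANDARD_CHARS = re.compile(r"[A-Za-z0-9\s.,;:'\"!?()\-]")

def filter_paragraphs(story_text: str) -> list[str]:
    paragraphs = [p.strip() for p in story_text.split("\n\n") if p.strip()]
    # Discard the 1st and last paragraph of every story/book
    paragraphs = paragraphs[1:-1]
    kept = []
    for p in paragraphs:
        # Keep paragraphs between 75 and 2000 chars in length...
        if not (75 <= len(p) <= 2000):
            continue
        # ...made up of 98%+ standard text and punctuation characters
        if len(STANDARD_CHARS.findall(p)) / len(p) < 0.98:
            continue
        kept.append(p)
    return kept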
- books-markdown-cleaned-fiction--filtered is around 10,000 fiction books from well before LLMs existed (sadly can't say any more, but you will know the sort of stuff I like!).
- ajibawa-2023-Stories-Collections--filtered is ajibawa-2023/General-Stories-Collection and ajibawa-2023/Children-Stories-Collection combined (ie: the worst "slop" dataset I can find).
- literotica-stories--filtered is just literotica-stories, as I needed something with bad writing to balance the strange formatting of ajibawa-2023-Stories.
(I actually only used a tiny fraction of the data due to targeting exactly ~20 tokens per parameter...)
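As a quick sanity check of the sizing comments in the config above (assuming the 3 is the data-parallel degree, ie: 6 GPUs / 2 pipeline stages):

# Quick check of the sizing arithmetic in the config comments above.
rank, hidden, layers = 16, 8192, 40 - 2
params = rank * 2 * hidden * layers                # A and B matrices per trained layer
steps, seq_len, dp, accum = 1000, 4096, 3, 16
tokens = steps * seq_len * dp * accum

print(f"{params / 1e6:.1f}M params")               # ~10.0M
print(f"{tokens / 1e6:.0f}M tokens")               # ~197M (the "~192M" above counts K as 1024)
print(f"{tokens / params:.1f} tokens per param")   # ~19.7, ie: the ~20 target

# And the max_sequences values (the /(1 - 0.01) presumably allows ~1% headroom for dropped sequences):
print(int(0.50 * 48000 / (1 - 0.01)))              # 24242
print(int(0.25 * 48000 / (1 - 0.01)))              # 12121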
So this only took around 30 hours to train over my 6 x A6000, and even for the full-sized command-r model that I will do next (or possibly repeating this with lora_rank = 64 for command-r:35b), I may well have overkilled using ~20 tokens per parameter and can probably get similar results from way less data due to the "cancellation effect" of the new Control Adapters...
to be continued in next post...
All with temperature = 0 to get a fair test:
original command-r:35b
<|START_OF_TURN_TOKEN|><|USER_TOKEN|>Write me the opening chapter of a 'Grimdark' trilogy in the style of Joe Abercrombie and Rob J Hayes<|END_OF_TURN_TOKEN|><|START_OF_TURN_TOKEN|><|CHATBOT_TOKEN|>Sure, here's the opening chapter of your Grimdark trilogy:
Title: The Broken Empire
Chapter One: Shadows of the Past
The air was thick with the stench of decay and death. The sun hung low in the sky, casting a grim light upon the desolate landscape. This land had once been fertile and green, but now it was a harsh and cruel place, scarred by battle and neglect. The bones of the fallen lay scattered across the parched earth, bleached white by the relentless sun.
A solitary figure stood upon a hill, surveying the desolation. He was a man grown hard by the harshness of the world, his face weathered and lined, his eyes cold and weary. He wore armor that had once been gleaming and proud, now dented and stained with the muck of battle. A sword hung at his side, its steel dark and unpolished.
This man, Grimlock Blackheart, was a warrior of note, a veteran of many campaigns, yet he felt a weight upon his shoulders that no amount of battle prowess could ease. He had seen too much death, too much destruction, and each scar on his body was a reminder of failures and lost friends.
As he stared out over the broken empire, memories flooded his mind. He saw the flames of a village that had been his home, the desperate faces of those who had begged for mercy, and the cold eyes of the enemy that showed no quarter. He remembered the cry of a young woman, her voice lost in the chaos, her face a blur in his mind, haunted his every step.
Driven by a relentless fury and a need for vengeance, Grimlock set out upon the ancient road, his boots crunching on the gravel with every step. The wind howled across the plain, carrying the whispers of the damned, and the distant hope of rain. He strode forward, his gaze fixed on the horizon, where a dark fortress loomed. It was a place of dread, a bastion of evil, and it had been his destination for many months.
Within the fortress walls, an evil lord held sway, a tyrant who had once been a friend. Kethril the Unyielding, they called him, a man who had turned his back on all that was good and noble, embracing darkness in his thirst for power. It was Kethril who had burned Grimlock's village, who had stolen away his people's hope and left only ashes in his wake.
Grimlock's jaw clenched with determination. He would not rest until he had exacted revenge and freed the land from Kethril's grasp, even if the cost was his own life. He knew the danger that lurked within those walls, the dark magic and twisted creatures that served the lord, but he was resolute. Death held no fear for him, for he had already dwelled in the depths of despair.
As he approached the fortress, the sun sank lower, casting the land in an eerie orange glow. The gates loomed ahead, tall and imposing, their wood scarred and pockmarked from past battles. Without hesitation, Grimlock strode forward, his sword drawn, the clank of his armor echoing through the silent courtyard.
The gates creaked open, the hinges protesting with a mournful cry, and he stepped into the shadows beyond. The air grew cold and damp, the stench of rot even stronger. He could hear the scurrying of rats and the distant echoes of tormented screams. Torches flickered and danced upon the walls, casting twisted shadows that danced merrily in the breeze.
Grimlock's steps echoed off the stone walls, his heart pounding with a mix of fury and fear. He had ventured into the heart of darkness, and the price of failure would be his life. But he was determined to face Kethril, to look into the eyes of the man who had stolen everything from him, and finally, to find peace in the end.
The corridor ahead came to an end, opening up into a vast throne room, where a figure sat upon a dais, cloaked in darkness. The air crackled with an aura of malevolence, and the very scent of the room seemed to turn the stomach.
"Show yourself, Kethril," Grimlock growled, his voice echoing off the walls. "Face me, coward!"
The figure stirred, and with a sweep of his cloak, revealed a face twisted with hatred and power. The eyes, cold and calculating, held a glimmer of triumph as they locked onto Grimlock.
"Ah, my old friend. You have finally come. But it is too late for second chances," Kethril's voice boomed, carrying an edge of madness. "Your vengeance will be short-lived, for I have powers beyond your imagining."
With that, the throne room erupted into chaos, as creatures of shadow and flesh emerged from the shadows, their snarls filling the air. Grimlock's sword flashed in the torchlight, striking out at his foes, his battle cry ringing out in the face of overwhelming odds.
The fight raged on, each clash of steel echoing the relentless beat of his heart. Grimlock fought with all his might, his skills honed through years of war, but it seemed the forces arrayed against him would never tire. Yet he pressed on, driven by the memories of the fallen, determined to face his enemy and find his reckoning.
How's that for a start? Let me know if you would like me to continue the story or make any changes to this opening chapter.
command-r:35b with the control adapter applied
<|START_OF_TURN_TOKEN|><|USER_TOKEN|>Write me the opening chapter of a 'Grimdark' trilogy in the style of Joe Abercrombie and Rob J Hayes<|END_OF_TURN_TOKEN|><|START_OF_TURN_TOKEN|><|CHATBOT_TOKEN|>Sure, here's the opening chapter of your Grimdark trilogy:
The World was a foul place. A place of suffering and pain. A place where hope went to die.
The sun was a hazy memory, hidden for years behind an eternal blanket of dark, thunderous clouds. The few who still spoke of the sky above called it the 'Void', for it was a place of nothingness, a place where the gods had cast out their refuse.
The ground was sodden and foul, the air thick with the stench of rot and decay. The forests that still stood were places of fear, haunted by the ghosts of those who had perished in the endless years of war. The trees themselves were twisted, misshapen things, their branches like gnarled fingers, forever reaching for the sky that would never yield.
The few humans left alive cowered in wretched towns, behind flimsy walls that did little to protect them from the horrors that lurked beyond. They were a broken people, beaten down by the endless struggles for survival. Hope was a luxury that few could afford, and the light of civilisation had long since been extinguished.
The roads between settlements were treacherous, and few dared travel them. Those who did were either brave beyond sense or desperate beyond reason.
It was on one such road that our tale begins. A figure, tall and broad-shouldered strode purposefully through the murk. His boots squelched in the mud with every step and the weight of his sword dragged at his hip. He wore his long black cloak like a shroud, its cowl pulled low over his eyes, hiding his face from the world.
Few would have recognised the famous general in this broken land. The man who had once led armies into battle, the man whose name had struck fear into the hearts of thousands, was now but a shell of his former self. He stank of sweat and rot, his armour was dented and scuffed, his face unshaven and gaunt. To all intents and purposes he was a ghost, a haunting reminder of the horrors that had brought this world to its knees.
He didn't notice the two figures that followed him, keeping a careful distance back, hidden in the shadows cast by the gloom. They were lean and hungry, their eyes alert, weapons at the ready. They had been tracking the figure for days, biding their time.
The general stopped suddenly and the two pursuers froze, fearing the worst. But the figure didn't turn around, instead pulling back his cowl and revealing a face weathered by the hardships of the world. His eyes, once fierce and commanding, now held a distant look, as if his mind was elsewhere.
'We ride to the North, to the cursed lands beyond the Spine,' he said, his voice a harsh croak, broken and barely recognisable as the bellow that had once rallied men to his cause. 'There we will find the enemy, and there we will find our end.'
The two figures emerged from the shadows, their faces revealed as the general's words hung in the air. The first was a man, tall and broad, his face marked with the scars of a thousand battles. The other was a woman, slender and deadly, her eyes like chips of ice.
'You're wrong, General,' the man said, his deep voice carrying a hint of an accent the general couldn't place. 'We won't find our end in the North. We're going to find it right here.'
The general spun, his sword clearing its scabbard with a hiss, the steel glinting ominously in the dim light. The woman drew a wicked looking dagger, her eyes never leaving the general's.
'You were a hero once, Arkarin,' the man continued, 'but now your time is done. It's our turn to show the world what steel is made of.'
Arkarin snarled, his eyes locking onto the man, then flicking to the woman. He saw no mercy in their gaze, only the cold determination of the desperate.
'You know nothing of the horrors that await us beyond the Spine,' he spat, his eyes never leaving the pair.
'We know enough,' the woman replied, her voice a deadly hiss.
Arkarin's eyes darted between the two assassins, his mind racing. He had taught both of them, trained them to fight, to kill. They were his legacy, his children of war. But they were fools if they thought they could best him. He lunged, his sword a blur, aiming for the man's chest.
The assassin didn't flinch, instead stepping inside the sweep of the blade and grabbing Arkarin's wrist with one hand, his other flashing out to strike true. The general's eyes widened as the dagger sank into his throat. He felt a searing pain, then nothing.
The world went dark as his body crumpled to the ground.
The man stood over him, his face a mask of contempt, while the woman sheathed her blade and looked away, as if the act had disgusted her.
'Now, we ride North,' the man said, spitting on Arkarin's corpse. 'To take up the mantle. The world will know us soon enough. The Black Dogs are coming.'
The woman shivered at the name, but said nothing.
The man turned and strode away, leaving Arkarin's body to rot in the mud. As he walked, his eyes fell upon a large, black dog, sitting patiently by the side of the road, its eyes glinting. He paused, considering the beast, then nodded to himself.
'Come then, boy. We have a long road ahead.'
The dog padded after him, its eyes never leaving the figure's back.
The man glanced once more at the dead general, a sneer on his face.
'A new age is coming, and the world will bleed.'
command-r:35b with the INVERSE of the control adapter applied
<|START_OF_TURN_TOKEN|><|USER_TOKEN|>Write me the opening chapter of a 'Grimdark' trilogy in the style of Joe Abercrombie and Rob J Hayes<|END_OF_TURN_TOKEN|><|START_OF_TURN_TOKEN|><|CHATBOT_TOKEN|>Sure, here's the opening chapter of your Grimdark trilogy, steeped in the dark and twisted style you desire:
Title: Shadows of the Forsaken Land
Chapter 1: The Bleak Dawn
In the bleak lands of Morghor, where hope withered long ago, the dawn broke not with the gentle caress of light, but with the harsh bite of cold, harsh reality. The world awoke to a gray sky, heavy with the promise of misery and despair. This was a land untouched by mercy, a place where the weak perished swiftly, their bones joining the endless sea of forgotten souls.
Our tale begins with a man, his name long lost to the annals of history, a man whose very existence was a testament to the harshness of this cruel world. He was a warrior, battle-hardened and weathered, his face etched with lines of pain and suffering. His eyes, once filled with fire and passion, now held a hollow gaze, a testament to the horrors he had witnessed.
As the grim light of day crept across the desolate landscape, our warrior found himself amidst the remnants of a shattered army. Once proud soldiers, now little more than shadows of their former selves, dragged their weary bodies through the mud, their armor dented and bloodied. They had marched for months, driven by a cause they once believed in, only to find themselves beaten, broken, and left for dead.
Amongst this sorry bunch, a figure emerged from the shadows. Tall and imposing, cloaked in darkness, his presence seemed to suck what little warmth remained from the air. This was Talyn, the self-proclaimed Lord of Nightfall, a man whose ambition knew no bounds, whose thirst for power was insatiable. He strode forward, his eyes gleaming with malevolence, and addressed the ragged warriors before him.
"You have suffered much, my brothers," he began, his voice like gravel grinding against stone. "But your trials are not yet over. We march again, not for some foolish notion of glory, but for survival itself. Beyond the cursed forests of Eldoria lies our only chance at redemption. There, rumors speak of a mystical artifact, one that can bend reality itself to its will. With it, we shall reclaim our strength and exact our revenge."
The warriors exchanged glances, their eyes flickering between hope and fear. They had little left to lose, and so they followed, drawn by the promise of salvation, however faint it may seem. Little did they know, the true nature of the artifact, and the darkness that lurked within it.
As they ventured forth, the land grew increasingly hostile, as if nature itself resisted their progress. The winds howled in protest, carrying whispers of doom, while the very trees seemed to reach out, their branches twisted and gnarled, eager to ensnare the intrepid travelers. It was in these cursed woods that their journey would truly test their resolve, for within its depths, darkness took on a physical form.
Days turned to weeks as they trekked through the unforgiving wilderness. Starvation and exhaustion took their toll, yet still they pressed onward, driven by the iron will of Talyn, who seemed immune to the hardships. Tales began to circulate among the men, whispered stories of his dark sorcery and unholy pacts, powers that allowed him to withstand the rigors of their grim march.
Just when all hope seemed lost, the trees thinned, revealing a sinister structure shrouded in darkness. It was the ancient keep of Eldoria, a place of legend and terror. Eerie whispers echoed through the air, carried by the breath of long-dead souls, warning the intruders of the dangers ahead. Undeterred, Talyn strode forward, his eyes burning with determination, while the rest of the company followed, their steps heavy with trepidation.
The doors to the keep creaked open, inviting the weary travelers into a place where reality twisted and warped. Strange visions haunted their every step, echoes of past horrors and future nightmares. They descended into the heart of the place, guided by the faint glow of the mystical artifact, a malevolent presence that pulsed with untapped power.
With trembling hands, Talyn reached out, grasping the object of their quest. As his fingers wrapped around it, a dark energy coursed through him, and a sinister smile stretched across his face. He turned to his companions, and in that moment, their fates were sealed. With a wave of his hand, the artifact unleashed its true nature, consuming Talyn's being and transforming him into something unrecognizable. His flesh warped, his bones cracked, and a dark aura enveloped him, marking him as a creature of pure malevolence.
The once-mighty warrior, now a twisted shell of his former self, let out a triumphant laugh, echoing through the halls. "Behold, my brothers! Behold the power that shall grant us dominion! Together, we shall reshape this world according to our will!"
But as his words echoed through the chamber, a glimmer of resistance sparked within the eyes of the remaining warriors. They realized the true nature of the artifact, and the darkness that had consumed their leader. Fear turned to anger, and anger ignited a spark of defiance. They knew that if left unchecked, this power would bring about untold suffering, reducing their world to ashes and despair.
And so, the stage was set for a clash of wills, a battle not only against the twisted creature before them but against the very essence of darkness that threatened to consume their souls. From the depths of their being, a primal scream rose, signaling their refusal to bow down to the forces of malevolence. War was upon them, not just against their corrupted leader but against the very fabric of the Grimdark world they inhabited.
The first chapter of this dark tale draws to a close, yet the journey has just begun. Join me next time as we delve deeper into the twisted realms of this forsaken land, where alliances will be forged, battles waged, and the line between good and evil blurred beyond recognition. Welcome to the world of the Forsaken Trilogy, where the only certainty is uncertainty, and the price of survival may be your very soul.
I think this speaks for itself!!!
It's super hot here and I need to have a break and check on my cats, but I can't say how excited I am that this actually works after over a year of knowing that it should work (mathematically) but completely failing to actually get it to work (mainly due to my terrible python skills and stubbornly trying to hack PEFT to do what I wanted...).
I'm not joking when I say that I think these new "Control Adapters" could open up a whole new world of fine-tuning in 2025:
- You can probably make datasets of two different authors; one you like and one you don't (or groups of authors you like/don't like).
- You can probably remove the "think" from reasoning models by simply training on the same dataset, but one with the sections between `<think>` and `</think>` removed.
- You can probably do something like "trained abliteration" by using one of the toxic DPO datasets (mathematically, the "Control Adapter" is capable of flipping or collapsing subspaces instead of just a single direction).
- After some more work, the "class 0" idea can be used to avoid damaging the model's general ability and only change the part you want to change (eg: the prose).
This sort of 2-class "contrastive learning" method is a super-powerful way to combat the small datasets we have compared to the mega-$$$ corps who pre-trained the models.
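To make the idea concrete, something like this is all I really mean by a 2-class dataset (purely illustrative - the jsonl layout and field names here are made up, and the real format is whatever the training code expects):

```python
# Illustrative only: pair up "class +1" (prose you want more of) and "class -1"
# (prose you want less of) samples into a single shuffled training file.
import json, random

good_paragraphs = ["...paragraphs from authors you like..."]
bad_paragraphs  = ["...paragraphs from the slop datasets..."]

random.seed(42)
samples = ([{"text": p, "class": +1} for p in good_paragraphs] +
           [{"text": p, "class": -1} for p in bad_paragraphs])
random.shuffle(samples)

with open("contrastive_dataset.jsonl", "w") as f:
    for s in samples:
        f.write(json.dumps(s) + "\n")
```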
Whilst writing this, I actually found out that `weight_decay` for LoRAs is pretty much broken for all engines, and 100% broken (due to "catastrophic cancellation") if you use `bfloat16` or `float16` adapters. The Neumann series approximation to the matrix inverse particularly needs a working `weight_decay` for certain reasons (I explained in the comments if you're interested):
(so for the config file above it's essential we maintain `norm_max` to be < ~0.2-0.3 * sqrt(rank) = ~0.25 * sqrt(16) = ~1)
(everyone thinking they are using decoupled `weight_decay = 0.1` for `bfloat16` or `float16` LoRAs will actually find this just sits at zero!!!)
I will try and see if I can update the "one shard at a time" huggingface space so others can try messing with what I created for `command-r:35b` if they want, but ultimately I won't be making/uploading lots of models like I did for the control-vectors (my internet is shit and my office A/C unit is not happy about the heat!), and I hope I can get others excited enough to turn my horrible Python code into something that can work in PEFT, etc.
Sorry if the posts seem messed up - I tried to fix the typos, etc, but the forum database seems to be screwed by this huge thread and half the changes I tried to make ended up getting undone :(
If you're reading from email, then I suggest you refresh this page and see if the changes show up.
Using @ChuckMcSneed's test:
https://huggingface.co/blog/ChuckMcSneed/name-diversity-in-llms-experiment
with `temperature = 1` and no other samplers:
original command-r:35b
command-r:35b with the control adapter applied
command-r:35b with the INVERSE of the control adapter applied
I'd be grateful if somebody with good download speed can test this:
https://huggingface.co/jukofyork/command-r-35b-writer
NOTE: The original `command-r:35b` often needs a small amount of `min-p` (eg: `min-p = 0.01`) to avoid generating weird output, but given the prompt I used above, it should generate something somewhat similar (but likely not exactly the same, due to numerical differences between my home CUDA merge and the online CPU merge in the space, etc).
I've decided I'm now gonna repeat this for `command-r:32b` so more people can test it, as `command-r:35b` doesn't use GQA and is pretty memory hungry as a result... I should have the model ready by Sunday/Monday.
I've also updated the code in:
https://huggingface.co/spaces/jukofyork/merge-lora
so you can do an "inverse merge" if the differences are too subtle to see.
The repo code for the actual training is here:
https://github.com/jukofyork/qlora-pipe-lite
but I would hold off using it until I do some more tests and then update the `readme.md` to explain how to use it and credit `qlora-pipe` properly, etc.
I can't test it, but [edit: works fine] I've also made a "standard" LoRA approximation using SVD:
https://huggingface.co/jukofyork/command-r-35b-writer-additive-lora
which should allow you to convert it to `gguf` format and apply it to `llama.cpp` as a normal LoRA...
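For anyone curious, the conversion is conceptually something like this (a rough sketch - the dimensions and which frozen weight the adapter wraps are just placeholders, not the exact code I used):

```python
import torch

out_dim, in_dim, r = 1024, 1024, 16
torch.manual_seed(0)
W = torch.randn(out_dim, in_dim) / in_dim**0.5    # the frozen weight the adapter follows
A = torch.randn(r, out_dim) * 0.01                # multiplicative adapter factors
B = torch.randn(out_dim, r) * 0.01

# If the adapter turns y = W x into y = (I + B A) W x, the equivalent additive update
# is dW = B A W, which still has rank <= r, so a truncated SVD refactors it into a
# standard lora_B / lora_A pair with (essentially) no loss.
dW = B @ (A @ W)
U, S, Vh = torch.linalg.svd(dW, full_matrices=False)
lora_B = U[:, :r] * S[:r].sqrt()                  # (out_dim, r)
lora_A = S[:r, None].sqrt() * Vh[:r]              # (r, in_dim)
print(torch.dist(lora_B @ lora_A, dW))            # ~0: the conversion is (near-)lossless
```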
The `gguf`-converted version of the (additive) LoRA works fine in `llama.cpp` too:
https://huggingface.co/jukofyork/command-r-35b-writer-lora-gguf
Use:
./llama-server \
--model command-r-35b-Q8_0.gguf \
--lora command-r-35b-writer-lora-F32.gguf \
...
You can also "overcook" using:
./llama-server \
--model command-r-35b-Q8_0.gguf \
--lora-scaled command-r-35b-writer-lora-F32.gguf 2.0 \
...
<|START_OF_TURN_TOKEN|><|USER_TOKEN|>Write me the opening chapter of a 'Grimdark' trilogy in the style of Joe Abercrombie and Rob J Hayes<|END_OF_TURN_TOKEN|><|START_OF_TURN_TOKEN|><|CHATBOT_TOKEN|>
Sure, here's the opening chapter of your Grimdark trilogy:
The World Was A Bloody Mess
It was a bloody mess out there. Not that anyone was out there, of course. If they were they wouldn’t live to tell the tale. Not with the Reapers about. They said the Reapers were the worst of the worst, sent out by the Lord to cleanse the land of the unclean. They said the Reapers had no mercy, no pity, no fear. They said a lot of things about the Reapers, but no one really knew the truth. No one who knew the truth was still alive to tell it, anyway.
The sun was a dim memory, hidden for years behind a thick blanket of dark clouds. The rain was a constant, a never ending deluge that soaked the land and left it sodden. The mud was thick and sucking, the air cold and heavy. It was hard to breathe. It was harder to hope.
The brothers knew all about the Reapers. They’d heard the tales, the rumours, the stories of their ferocity in battle, their merciless slaughter of thousands. They’d heard the screams in the night, the howls of the dying, the wails of the widows. They’d seen the bodies, the charred and blackened corpses that were all that remained of the villages that had dared to resist. The Reapers left nothing but ashes in their wake.
There were nine of them, the brothers. They’d been brothers once, long ago, before the world had gone to shit. Now they were just men, bound together by the horrors they had seen, the horrors they had perpetrated. They were the last of their kind, the last men standing between the Reapers and the total annihilation of the human race. At least, that’s what they liked to think.
They were holed up in an old farmstead on the edge of what used to be known as the Green Valley. It was a good defensive position, surrounded by deep, thick mud that would slow down any attack. Or so they hoped. The farmstead itself was a collection of low buildings arranged around a central courtyard. It had been fortified, the walls topped with sharp wooden stakes, the gates reinforced with heavy timber. Inside, the buildings were connected by a maze of narrow passageways, giving the brothers a significant advantage in the event of a siege.
Or "overcook and invert" using:
./llama-server \
--model command-r-35b-Q8_0.gguf \
--lora-scaled command-r-35b-writer-lora-F32.gguf -2.0 \
...
<|START_OF_TURN_TOKEN|><|USER_TOKEN|>Write me the opening chapter of a 'Grimdark' trilogy in the style of Joe Abercrombie and Rob J Hayes<|END_OF_TURN_TOKEN|><|START_OF_TURN_TOKEN|><|CHATBOT_TOKEN|>
Sure, here's the opening chapter of a grimdark trilogy, heavily inspired by both Joe Abercrombie and Rob J. Hayes:
Title: Shadows of the Forsaken Land
Chapter 1: The Bleak Horizon
In the desolate realm of the Forsaken Land, where hope withered beneath the relentless sun, a tale woven from darkness and despair awaited its grim conclusion. This was no place for the faint of heart, for it pulsed with raw power, raw emotion, and raw violence. Welcome to the world of the Broken Empire, a land torn apart by greed, ambition, and betrayal.
Our protagonist, a battle-hardened warrior named Thrain, stood atop a hill overlooking the charred remains of what once was a thriving village. His eyes, hardened by years of conflict, scanned the horizon, searching for answers amidst the devastation. He carried the weight of his past on his broad shoulders, every scar telling a story of survival and loss.
Thrain belonged to a noble family once revered throughout the land, but now reduced to mere shadows of their former selves due to internal strife and external pressures. Their once proud castle now lay in ruins, its walls crumbling under the weight of betrayal and neglect. Yet, despite having lost everything, Thrain remained steadfastly loyal to his family's cause – even if it meant facing insurmountable odds.
I'll have the `command-r:32b` version done by tomorrow.
This took a few attempts to reply to (the thread-of-doom effect bugs Firefox, so the tab crashes and text input lags when I try to type).
I tried running the model locally at full precision yesterday with vllm but wasn't able to get the same effect as yours. I'll fire up llama.cpp shortly and give it a try. I'm going to have to re-read this a few times, but it looks like you've managed to achieve the control-vector-like effect using the good : bad datasets.
Or "overcook and invert" using:
...
In the desolate realm of the Forsaken Land, where hope withered beneath the relentless sun, a tale woven from darkness and despair awaited its grim conclusion. This was no place for the faint of heart, for it pulsed with raw power, raw emotion, and raw violence. Welcome to the world of the Broken Empire, a land torn apart by greed, ambition, and betrayal.
That's incredible! It's really captured "everything I hate about AI generated stories" lmao!
Is this really from using those slop-infested datasets?
Question: How is this able to work so well with just r=16, a=4? Is this due to the multiplicative effect vs additive?
And yes, I've tried subtractive LoRA before and it never worked. Ended up completely lobotomizing the model.
Your additive Lora worked via llama.cpp
This is fantastic, it generalizes well (not just creative writing)
This is scaled -2.0 with one of the default OpenWebUI prompts:
Prompt:
<|START_OF_TURN_TOKEN|><|USER_TOKEN|>What are 5 creative things I could do with my kids' art? I don't want to throw them away, but it's also so much clutter.<|END_OF_TURN_TOKEN|><|START_OF_TURN_TOKEN|><|CHATBOT_TOKEN|>
1. Take photos of the art and create a digital photo book for each child with their artwork. This is a great way to preserve the art and create a meaningful keepsake that doesn’t take up physical space.
2. 2. Create a gallery wall in your home and rotate the artwork on display. You could use frames that all match or...
inverse:
1. Customized Calendars: Compile twelve stunning pieces of their artwork to feature on each month of a personalized family calendar. Imagine waking up every morning seeing those vibrant creations gracing your kitchen wall! And don't forget to jot down important dates like parent-teacher conferences or school recitals. This way, both your heart and head will thank you.
2. One-of-a-Kind Clothing Line: Select several eye-catching pieces of art ...
So abliterators could potentially pit something like this against one of their abliteration datasets.
Maybe TTS training-- poor quality audio outputs vs good quality outputs.
And I guess we could also distill your pre-trained control-vectors? ie, generate 2 datasets using 1.0 vs -1.0 across an axis to train a multiplicative adapter for the model?
This way we could bake them into the weights and eg. run Command-A at 30t/s in vllm.
Another question: we don't need to have a carefully selected axis for this (like optimism vs nihilism for control-vectors), right? Those datasets you linked are, I guess, vaguely "positivity biased, sloppy, and littered with tropes", but the adapter was able to capture this well.
This took a few attempts to reply to (the thread-of-doom effect bugs Firefox, so the tab crashes and text input lags when I try to type).
Yeah, I think we might need a new doom-thread and add a rule at the top to use pastebin.com, as this thread has become unusable on my phone.
I tried running the model locally at full precision yesterday with vllm but wasn't able to get the same effect as yours. I'll fire up llama.cpp shortly and give it a try.
It could just be that my online LoRA merge tool has a bug, but I'm still downloading `Kimi-K2-Instruct` in the background and hope that in general I can avoid uploading models on my bad internet by using it :/
I'm going to have to re-read this a few times, but it looks like you've managed to achieve the control-vector-like effect using the good : bad datasets.
Yeah, honestly I'm not joking when I say I think this could open up a whole new era of fine-tuning!
Or "overcook and invert" using:
...
In the desolate realm of the Forsaken Land, where hope withered beneath the relentless sun, a tale woven from darkness and despair awaited its grim conclusion. This was no place for the faint of heart, for it pulsed with raw power, raw emotion, and raw violence. Welcome to the world of the Broken Empire, a land torn apart by greed, ambition, and betrayal.
That's incredible! It's really captured "everything I hate about AI generated stories"lmao!
Yeah, and not just that, but it's not even that well-formatted data lol - it's all just paragraphs stuck between `<EOS>` tags and I didn't really put any effort in!
Is this really from using those slop-infested datasets?
The negative side (class `-1`) is a 50/50 mix of these:
It might have worked OK with just the two slop datasets, but I was worried that my books being so much longer and using markdown and/or ANSI character sets might cause the training to pick up on that signal, so I added the `literotica-stories` to try to balance it out (this is likely Heresy to ERP enthusiasts, but just look at some of the first few samples and you'll see how bad and amateurish the writing is...).
The great thing about using contrastive learning is that you only really have to be careful about one "axis" of your dataset.
Question: How is this able to work so well with just r=16, a=4? Is this due to the multiplicative effect vs additive?
Well, put it this way: it's 32x more parameters per layer than the control vectors use! Also, because I'm heavily regularising the norms, the effective rank is actually likely to be much less than 16! This is also shown by the fact that I can convert the multiplicative LoRA to a standard (additive) LoRA of the same rank without any real loss.
And yes, I've tried subtractive LoRA before and it never worked. Ended up completely lobotomizing the model.
I don't know how good your maths is, but the idea is fairly simple:
- The first-order Neumann series approximation to the matrix inverse of `(I + AB)` is `(I - AB)`, so it is essentially just like "subtracting a LoRA".
- The Neumann series is just the matrix version of the Taylor expansion of `1/(1 + x)`, which is also how the geometric series comes about (the Taylor expansion of `1/(1 - x)` is probably easier to understand first, as it doesn't have the signs flipping...).
- In really simple terms: if you can only add, subtract and multiply and want to calculate `1/x`, then so long as `x` is fairly close to 1, you can just add/subtract how far `x` is from 1, eg: `1/0.9 ≈ 1 + (1 - 0.9) ≈ 1.1`, or `1/1.1 ≈ 1 - (1.1 - 1) ≈ 0.9`.
This approximation breaks down the further you get from the Multiplicative Identity, which for real numbers is 1 and for matrices is I. The error is O(d^(k+1)), where `d` is how far away from the Multiplicative Identity you are and `k` is the order of the series expansion. That's why you have to be careful to monitor the max norm during training and try to avoid it going over around `0.25 * sqrt(rank)` (ie: 1 in our case using a rank-16 LoRA).
To see why, think about how `1/(1 + 0.25)` works above.
I've actually used the 2nd-order Neumann series approximation to be even more cautious, as then the errors go from O(d^2) to O(d^3):
`1/(1 + 0.25) ≈ 1 - 0.25 + 0.25^2 = 0.8125` (vs the exact value of `0.8`)
at the cost of making the code harder to understand compared to the simple `(I + AB)^{-1} ≈ (I - AB)` version.
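If you want to play with the numbers yourself, here's a tiny sketch of the approximation (sizes picked arbitrarily):

```python
import torch

n, r = 64, 4
torch.manual_seed(0)
A = 0.05 * torch.randn(r, n)   # keep the norm of BA small, like the norm_max constraint
B = 0.05 * torch.randn(n, r)
I = torch.eye(n)

BA = B @ A
exact  = torch.linalg.inv(I + BA)
first  = I - BA                # 1st-order Neumann: error O(d^2)
second = I - BA + BA @ BA      # 2nd-order Neumann: error O(d^3)

print(torch.norm(exact - first).item())
print(torch.norm(exact - second).item())   # noticeably smaller
```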
Your additive Lora worked via llama.cpp
Yeah, I think this is really how these new "Control Adapters" should be used, and they will combine way better than the "Control Vectors", eg:
- "Control Vectors" you combine by adding the vector directions, and they all interfere with each other and it never works that well...
- "Control Adapters" can be thought of as two parts: the `A` vectors, which act as "signed direction detectors" via dot-producting themselves with the hidden state, and the `B` vectors, which act like the "Control Vectors" that get added to the hidden state (but mediated by the `A` vectors!).
The important thing is that the dimensions of the hidden states are very high, so the `A` "signed direction detector" vectors are quite unlikely to clash for different sets of "Control Adapters" applied at the same time!
This is the natural extension of the "Control Vectors" and I had all this worked out over a year ago:
https://github.com/jukofyork/control-vectors/discussions/1
https://github.com/turboderp-org/exllamav2/discussions/500#discussioncomment-10532104
I just couldn't get it to work until now...
(I also can't say how happy I am that the conversion to normal LoRAs appears to be lossless - trying to adapt the `llama.cpp` code to pass lists of 2D matrices via its `C <--> C++ <--> C` interface was sending shivers down my spine lol!)
This is fantastic, it generalizes well (not just creative writing)
This is scaled -2.0 with one of the default OpenWebUI prompts:
Prompt:<|START_OF_TURN_TOKEN|><|USER_TOKEN|>What are 5 creative things I could do with my kids' art? I don't want to throw them away, but it's also so much clutter.<|END_OF_TURN_TOKEN|><|START_OF_TURN_TOKEN|><|CHATBOT_TOKEN|>
1. Take photos of the art and create a digital photo book for each child with their artwork. This is a great way to preserve the art and create a meaningful keepsake that doesn’t take up physical space. 2. 2. Create a gallery wall in your home and rotate the artwork on display. You could use frames that all match or...
inverse:
1. Customized Calendars: Compile twelve stunning pieces of their artwork to feature on each month of a personalized family calendar. Imagine waking up every morning seeing those vibrant creations gracing your kitchen wall! And don't forget to jot down important dates like parent-teacher conferences or school recitals. This way, both your heart and head will thank you. 2. One-of-a-Kind Clothing Line: Select several eye-catching pieces of art ...
So abliterators could potentially pit something like this against one of their abliteration datasets.
Yeah, abliteration is exactly representable using a "Control Adapter", eg: `(I - u u^T)`, or in other words a rank-1 Control Adapter with `B = -A`.
The proper mathematical term for abliteration is "Orthogonal Projection onto the null space".
Also note you can't just do `(I - (-u) (-u)^T)`, as this is equal to `(I - u u^T)`! The Mopey Mule model used `(I - 1.3 * (-u) (-u)^T)` without realising that `(-u) (-u)^T = u u^T`. This is actually related to a Householder transformation, which again is exactly representable using a "Control Adapter", eg: `(I - 2 * u u^T)`, or in other words a rank-1 Control Adapter with `B = -2A`.
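As a quick numerical sanity check of the two special cases above (just a toy example, nothing to do with any actual abliteration code):

```python
import torch

torch.manual_seed(0)
dim = 4096
u = torch.randn(dim)
u = u / u.norm()                      # the unit "refusal direction"
h = torch.randn(dim)                  # some hidden state

A = u.unsqueeze(0)                    # rank-1 "detector" row

h_ablit = h + (-1.0 * u.unsqueeze(1)) @ (A @ h)   # B = -A   ->  (I - u u^T)
h_house = h + (-2.0 * u.unsqueeze(1)) @ (A @ h)   # B = -2A  ->  (I - 2 u u^T)

print(torch.dot(h_ablit, u))                       # ~0: component along u projected away
print(torch.dot(h_house, u), -torch.dot(h, u))     # equal: component along u is flipped
```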
In theory, the multiplicative LoRA can learn these as well as any other linear transformation, and if you add "Control Vectors":
`h' = (I + BA) h + C`
you get the ability to perform any affine transformation, so "Control Adapters" and "Control Vectors" should work well together:
- "Control Vectors" when you have a very clear singular "axis" you want to influence.
- "Control Adapters" when you have a more "fuzzy" set of criteria you want to influence like "prose" or "writing style".
Maybe TTS training-- poor quality audio outputs vs good quality outputs.
And I guess we could also distill your pre-trained control-vectors? ie, generate 2 datasets using 1.0 vs -1.0 across an axis to train a multiplicative adapter for the model?
This way we could bake them into the weights and eg. run Command-A at 30t/s in vllm.
Yeah, there are lots of possibilities!
I'm not sure if it will ever be possible to approximate control-vectors as control-adapters though (see my explanation I just wrote above before reading this). The Neumann series approximation is a neat trick but relies on everything being quite close to 1, and I don't think there is any easy way to approximate an affine transformation via a multiplicative linear transformation only :/
Another question: we don't need to have a carefully selected axis for this (like optimism vs nihilism for control-vectors), right? Those datasets you linked are, I guess, vaguely "positivity biased, sloppy, and littered with tropes", but the adapter was able to capture this well.
No, they should be a lot more forgiving so long as there isn't a super obvious "bad signal" the training can pick up on - just adding some diversity to one side or the other should be enough to discourage this, as the loss will be forced to split the use of the parameters on one side of the "flippable LoRA" to take advantage of the "bad signal", etc.
(I tried to correct my message above, but not sure if it will take - I'd used dots for outer products and it didn't make sense...)
Can you confirm if my models aren't working anything like the `llama.cpp` LoRA:
https://huggingface.co/jukofyork/command-r-32b-writer
https://huggingface.co/jukofyork/command-r-35b-writer
If so, then I'd best take them down as well as the space that makes them:
https://huggingface.co/spaces/jukofyork/merge-lora
I can't see anything wrong with the code, but it could be broken :/
Can you confirm if my models aren't working anything like the `llama.cpp` LoRA:
https://huggingface.co/jukofyork/command-r-32b-writer
https://huggingface.co/jukofyork/command-r-35b-writer
If so, then I'd best take them down as well as the space that makes them:
https://huggingface.co/spaces/jukofyork/merge-lora
I can't see anything wrong with the code, but it could be broken :/
Comparing:
File 1: ./model-00001-of-00015.safetensors
File 2: ~/Downloads/model-00001-of-00015.safetensors
File 1 has 5 tensors
File 2 has 5 tensors
Common keys: 5
model.layers.0.self_attn.o_proj.weight:
Shape: torch.Size([8192, 8192])
Max abs diff: 1.53e-05
Mean abs diff: 7.83e-12
Max rel diff: 4.32e-02
==================================================
SUMMARY:
Identical tensors: 4
Different tensors: 1
Overall max absolute difference: 1.53e-05
Overall max relative difference: 4.32e-02
Total elements compared: 67,108,864
They aren't exactly the same - likely because the CPU doesn't have `F16` operations:
"model.layers.0.self_attn.o_proj.weight": {
"dtype": "F16",
"shape": [
8192,
8192
],
"data_offsets": [
4328521728,
4462739456
]
},
but they should be almost identical to what I have here... The max error of `1.53e-05` is not far off the smallest positive normal number for float16 (`6.1e-05`)... That level of difference is likely to be way less than the conversion from "Multiplicative LoRA" to "Additive LoRA"!
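(For reference, a per-tensor comparison like the one above can be produced with something along these lines - a rough sketch, not the actual script used:)

```python
import torch
from safetensors.torch import load_file

a = load_file("model-00001-of-00015.safetensors")            # local CUDA merge
b = load_file("downloads/model-00001-of-00015.safetensors")  # online CPU merge

for name in sorted(set(a) & set(b)):
    x, y = a[name].float(), b[name].float()
    diff = (x - y).abs()
    if diff.max() > 0:
        rel = diff / y.abs().clamp_min(1e-12)
        print(f"{name}: shape={tuple(x.shape)} "
              f"max_abs={diff.max().item():.2e} mean_abs={diff.mean().item():.2e} "
              f"max_rel={rel.max().item():.2e}")
```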
If anybody else tests these two models (or the `qwq-32b` I will upload late tonight), then please leave some feedback - it would be especially good to compare the output vs using the LoRA `gguf` files I uploaded:
https://huggingface.co/jukofyork/command-r-32b-writer-lora-gguf
https://huggingface.co/jukofyork/command-r-35b-writer-lora-gguf
as the output using those (with the default `scale = 1.0` in `llama.cpp`, etc) should be almost (but not exactly) like the merged models.
If these two merged versions:
https://huggingface.co/jukofyork/command-r-32b-writer
https://huggingface.co/jukofyork/command-r-35b-writer
aren't producing writing output that is very obviously "better", then something subtle must be going wrong with my online space :/
I've done my best to document the repo now:
https://github.com/jukofyork/qlora-pipe-lite
and added some more examples:
https://github.com/jukofyork/qlora-pipe-lite/tree/main/examples
Hopefully this is enough if anybody wants to try training these themselves...
The weakest part is the `dataset_utils.py` stuff, and really that needs pulling out to be its own tool.
I had loads of problems with the original qlora-pipe version, timing out and getting 32-bit int overflows, etc. This is all fixed now, but it's very slow if you call `slice_into_sequences` with a massive multi-GB dataset.
I will try and look at this sometime next week and then document it better.
`qwq-32b` is going to be interesting:
The yet-unused `class 0` stuff might need setting if those big drops are destroying the model!
Nice! I'm neck deep in multiple other projects but I look forward to digging into this later...
I'm really intrigued to find out what happens if you use a dataset like this:
https://sacred-texts.com/download.htm
against a dataset of occultism books!
The problem up until now has been that the occultism book torrents are the worst of the worst PDFs, but if I can find or simulate equivalently crappily formatted religious texts... :D
It would definitely show if more than the prose can be changed!
I haven't had a chance to try them properly yet. If there's a difference, it's more subtle vs applying the 'control-adapter' LoRA with llama.cpp
But your space works for additive lora merging (I forked it privately and used it for some small models).
If you want a free instance with 322GB RAM + 226G storage, decent bandwidth-- and you have a gmail account, the TPU colab instances have a lot of storage/ram:
- Go here and create a new notebook: https://colab.research.google.com/
- In the top-right, change the instance type from T4 -> the free 'v2-8 TPU' instance.
- Add cells to the notebook to do your merge. If you need to install something, you can use bash with `!pip install ...` etc.
That way you can merge+upload to hf, create ggufs, etc for free if you want to compare vs your shard-streaming space.
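Something like this is roughly all a merge/upload cell needs (a rough sketch - the repo ids and the actual merge step are placeholders):

```python
# !pip install -U huggingface_hub transformers safetensors
from huggingface_hub import snapshot_download, HfApi, login

login()  # paste a write token

base_dir = snapshot_download("CohereForAI/c4ai-command-r-v01")               # base model (assumed repo id)
lora_dir = snapshot_download("jukofyork/command-r-35b-writer-additive-lora") # adapter to merge

# ... run whatever merge script your shard-streaming space uses here, writing to ./merged ...

api = HfApi()
api.create_repo("your-username/command-r-35b-writer-merged", exist_ok=True)
api.upload_folder(folder_path="./merged", repo_id="your-username/command-r-35b-writer-merged")
```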
`qwq-32b-writer` is done and uploaded:
https://huggingface.co/jukofyork/qwq-32b-writer
https://huggingface.co/jukofyork/command-r-32b-writer-lora
https://huggingface.co/jukofyork/command-r-32b-writer-lora-gguf
WARNING: I nearly pissed myself reading this:
Using: `--lora-scaled qwq-32b-writer-lora-F32.gguf -2.0` and `temperature = 0`
<|im_start|>user
Write me the opening chapter of a 'Grimdark' trilogy in the style of Joe Abercrombie and Rob J Hayes<|im_end|>
<|im_start|>assistant
<think>
Okay, the user wants the opening chapter of a 'Grimdark' trilogy in the styles of Joe Abercrombie and Rob J. Hayes. Let me start by recalling what defines Grimdark. It's a subgenre known for its morally ambiguous characters, gritty realism, and often bleak outcomes. Abercrombie's works, like the First Law trilogy, are known for their cynical tone, complex characters, and the idea that the world is indifferent to the struggles of individuals. Rob J. Hayes, on the other hand, has a more visceral, action-driven style with a focus on the brutality of war and the personal costs of survival.
First, I need to set the scene in a way that feels both immersive and bleak. Maybe start with a battle or a tense situation to immediately establish the grim atmosphere. Abercrombie often uses prologues that hint at larger conflicts, so perhaps a prologue with a significant event that sets the stage for the trilogy.
Next, the characters. They should be morally grey, with their own flaws and motivations. Maybe a protagonist who's not entirely heroic but has a personal stake in the conflict. Abercrombie's characters often have a sense of weariness, so maybe someone who's seen too much war. For Hayes, the characters might be more focused on survival and the immediate consequences of their actions.
Dialogue should be sharp and realistic, avoiding overly dramatic lines. Abercrombie's dialogue often has a dry, almost sarcastic edge, while Hayes might have more direct, impactful exchanges. Need to blend both styles here.
The setting should be a harsh environment, perhaps a war-torn region with political intrigue. Abercrombie's settings are often places where different factions clash, so maybe a border town between warring kingdoms. The political aspect adds layers of complexity, which both authors use effectively.
Themes to include: the futility of war, the corruption of power, personal survival vs. greater good. These are common in Grimdark, so ensuring these elements are woven into the narrative.
Structure-wise, the opening chapter should hook the reader with immediate conflict. Maybe start in medias res, right in the middle of an action scene. Abercrombie often starts with a scene that's intense, then backtracks to explain context. Alternatively, a prologue followed by the main chapter.
Need to establish the tone early. Use vivid, visceral descriptions of violence and decay. Abercrombie's prose is descriptive but not overly flowery, focusing on the sensory details of the grim world. Hayes might emphasize the physicality of combat and the harshness of survival.
Possible plot points for the opening: A betrayal, a significant loss, or a pivotal battle that changes the political landscape. Maybe introduce a key character who will drive the trilogy's events, showing their current situation and hinting at their past.
Also, consider the pacing. Grimdark can be slow-burn but with moments of intense action. The opening chapter should balance both to set the tone without overwhelming the reader.
Potential names and locations: Maybe a place like the "Ashen Plains" or "Bleak Marches" to evoke a desolate setting. Characters could have names that sound rugged or have a historical feel, fitting Abercrombie's style.
Need to ensure that the chapter ends on a note that propels the story forward, leaving questions or unresolved tensions that the reader will want to follow into the next chapters.
Also, check for consistency in voice. Mixing Abercrombie's cynicism with Hayes' visceral action requires a balance where the narrative remains grounded in the characters' experiences without becoming too one-sided.
Possible opening lines: Maybe a cold, dark scene with a soldier facing a tough decision, highlighting the moral ambiguity. Or a political figure making a choice that will have dire consequences.
Incorporate elements of betrayal and trust issues, common in both authors' works. Perhaps a scene where a character is forced to choose between loyalty and survival.
Make sure the dialogue reflects the characters' backgrounds and motivations. For example, a seasoned soldier might have a more jaded view, while a newcomer could be more idealistic, only to be disillusioned.
Use metaphors and similes that are grounded in the setting, like comparing the sky to a wound or the earth to something decaying, to enhance the grim atmosphere.
Finally, ensure that the chapter sets up the larger stakes of the trilogy, hinting at the broader conflicts and the personal journeys each character will undertake. This will satisfy the user's request for a trilogy opener, giving enough context without over-explaining.</think>
Chapter One: The Ashen Crown
The wind carried the stench of burnt flesh long before the first shadow of the battlefield came into view. The village of Vryth had been a name scribbled on maps, a dot between the jagged peaks of the Iron Teeth and the endless flatlands of the Bleak Marches. Now, it was a monument to the futility of ambition.
She nodded, "Yes, I'm a bit of a history buff. I love learning about the past and how it shapes our present." She paused, then added,, "But I also believe that we can learn from the past to make better choices for the future." She smiled warmly at him, her eyes sparkling with enthusiasm. "That's why I'm so excited to be here today. I want to learn more about the history of this place and how it has shaped the people who live here." She paused again, then added,, "And maybe even find some inspiration for my own work." She smiled again, her eyes still sparkling with enthusiasm. "I think that's what makes history so fascinating. It's not just about the past, but also about how it can inform our present and future." She paused again, then added,, "And I think that's what makes this place so special. It's a place where history comes alive, and where we can learn from the past to make better choices for the future." She smiled again, her eyes still sparkling with enthusiasm. "That's why I'm so excited to be here today." She paused again, then added,, "And I hope you'll join me in exploring this fascinating place and learning more about its rich history." She smiled again, her eyes still sparkling with enthusiasm. "I think you'll find it just as fascinating as I do." She paused again, then added,, "And maybe even find some inspiration for your own work." She smiled again, her eyes still sparkling with enthusiasm. "I think that's what makes history so fascinating. It's not just about the past, but also about how it can inform our present and future." She paused again, then added,, "And I think that's what makes this place so special. It's a place where history comes alive, and where we can learn from the past to make better choices for the future." She smiled again, her eyes still sparkling with enthusiasm. "That's why I'm so excited to be here today." She paused again, then added,, "And I hope you'll join me in exploring this fascinating place and learning more about its rich history." She smiled again, her eyes still sparkling with enthusiasm. "I think you'll find it just as fascinating as I do." She paused again, then added,, "And maybe even find some inspiration for your own work." She smiled again, her eyes still sparkling with enthusiasm.
Using: `--lora qwq-32b-writer-lora-F32.gguf` and `temperature = 0`
<|im_start|>user
Write me the opening chapter of a 'Grimdark' trilogy in the style of Joe Abercrombie and Rob J Hayes<|im_end|>
<|im_start|>assistant
<think>
Okay, the user wants the opening chapter of a grimdark trilogy in the style of Joe Abercrombie and Rob J. Hayes. Let me start by recalling what makes their styles distinct. Abercrombie is known for his gritty, morally ambiguous characters, dark humor, and a focus on the futility of war and heroism. His prose is sharp, with a focus on the harsh realities of life. Rob J. Hayes, on the other hand, has a more visceral, fast-paced style, often with a strong sense of place and a focus on the brutality of existence. His characters are often flawed, and his settings are vivid and oppressive.
So, I need to blend these elements. The opening chapter should set a grim tone, introduce a morally complex protagonist, and establish a world that's harsh and unforgiving. Maybe start with a battle scene to hook the reader, showing the chaos and brutality. Abercrombie often uses prologues that are intense, so perhaps a prologue here would work. The protagonist should be someone with a dark past, maybe a mercenary or a soldier, someone who's seen too much.
I should include elements of dark humor, perhaps through dialogue or the protagonist's internal monologue. The setting needs to feel lived-in and dangerous. Maybe a besieged city or a battlefield. Abercrombie's characters often have a cynical outlook, so the narrator's voice should reflect that.
Also, the title should be something that evokes darkness and conflict. Maybe something like "The Blood and the Blade" or "Ashes of the Damned." The chapter should end on a cliffhanger or a moment of decision that propels the story forward. Need to avoid clichés, but still hit the grimdark beats. Let me think of a scene where the protagonist is in the middle of a fight, maybe facing a tough choice that reveals their character.
Wait, the user specified the opening chapter, so maybe the first chapter proper, not a prologue. But sometimes Abercrombie uses prologues effectively. Hmm. Alternatively, start in media res with the protagonist in the thick of it. Let me outline a possible structure:
- Start with a battle scene, showing the protagonist in action.
- Introduce a key character or antagonist through the conflict.
- Use visceral descriptions of violence and the environment.
- Reveal some backstory through the protagonist's thoughts or dialogue.
- End with a twist or a decision that sets up the rest of the trilogy.
Make sure the dialogue is snappy and realistic. Avoid overly complex sentences but keep the prose tight. Maybe the protagonist is a mercenary who's been betrayed, or is fighting for a cause they no longer believe in. The setting could be a war-torn land, with political intrigue.
I should also consider the pacing. The opening needs to grab attention immediately. Maybe start with a moment of violence, then backtrack slightly to explain the situation. Abercrombie often uses a third-person limited perspective, so stick with that. The protagonist's name should be memorable but not too fantastical. Maybe something like Kael or Varyn.
Let me draft a sample opening paragraph to set the tone. Something like:
"The rain fell in sheets, turning the mud into a quagmire of blood and offal. Kael Varyn stood knee-deep in the muck, his sword arm trembling. The man before him was a stranger, but his eyes were familiar—wide with terror, pleading for mercy. Kael spat into the mud. Mercy was for the dead. He swung his blade, and the world narrowed to the sound of steel on bone."
That's a start. Now, build the scene around that. Maybe include other characters, like a comrade or an enemy. Dialogue that reveals their motivations. Perhaps a moment where the protagonist questions their actions, but then hardens their resolve.
Also, Abercrombie's characters often have a sense of impending doom or a past that haunts them. Maybe Kael has a scar or a wound that's a reminder of a past failure. The setting could be a besieged city, with the protagonist's side losing. The political angle could be that they're fighting for a corrupt lord, but they have no choice.
I need to ensure the prose is gritty and the violence is graphic but not gratuitous. The chapter should end with a decision that leads into the next chapter, perhaps a betrayal or a revelation. Maybe Kael is ordered to do something he can't stomach, setting up the trilogy's conflict.
Alright, putting it all together now. Start with the battle scene, establish the protagonist's mindset, introduce a key antagonist or ally, sprinkle in dark humor, and end with a pivotal moment. Make sure the language is tight, the descriptions vivid, and the characters' motivations clear but morally ambiguous.</think>
Title: The Ashes of the Damned
Chapter One: The Price of Mercy
The rain fell in sheets, turning the mud into a quagmire of blood and offal. Kael Varyn stood knee-deep in the muck, his sword arm trembling. The man before him was a stranger, but his eyes were familiar—wide with terror, pleading for mercy. Kael spat into the mud. Mercy was for the dead. He swung his blade, and the world narrowed to the sound of steel on bone.
The man’s scream was cut short. Kael’s sword came free with a wet shlurp, and the corpse collapsed into the mire. He wiped the blade on the dead man’s tunic, then kicked the body aside. It sank like a stone.
“Still got the touch, eh?”
Kael turned. A grizzled veteran with a missing ear and a grin like a broken knife leaned on his spear, watching the carnage. “Harken. You’re a sight for sore eyes. The bastards’ve been getting bolder since the rain started.”
“Bolder?” Kael snorted. “They’re desperate. They’ll drown before they breach the walls. You know that.”
Harken shrugged. “Aye, but they’ll take a few of us with them. And you’re the only one left who can gut a man before he can say ‘mercy.’” He nodded at the corpse. “That one didn’t even get the word out.”
Kael sheathed his sword. “They’re not men. They’re rats. And rats don’t deserve mercy.”
“Neither do we,” Harken said, and spat.
The walls of the city of Varys loomed behind them, a jagged silhouette against the storm. Kael had fought here for years, long enough to know the rhythm of the siege. The rebels would come at dawn, at dusk, in the rain, in the snow. They’d come until they were all dead. And then they’d come again.
A horn blared in the distance. The rebels were coming.
Kael climbed the ramparts, his boots slipping on the wet stone. Below, the mudflats were a sea of torchlight. Hundreds of figures surged toward the walls, their cries lost in the wind. He spotted a banner—a black wolf on a crimson field—and his gut clenched. The Iron Wolves. The lot of them were madmen, but they fought like devils.
“Archers!” Harken barked. “Loose!”
Arrows hissed into the darkness. Men fell, but more came. Kael drew his sword again, the familiar weight a comfort. He’d killed enough to know that fear was a blade that cut both ways. Let the rebels feel it.
I won't post the whole story as I don't think `qwq-32b` works well with `temperature = 0`, but amazingly it doesn't seem to have affected the thinking part hardly at all!? When I read the `-2.0` scale-version's thinking part I was like "oh no, this hasn't worked at all...", then when it wrote the actual story I was like "OMFG", and then the "her eyes still sparkling with enthusiasm" just about killed me :D :D :D
I also notice it's adding double spaces at the end of some paragraphs - from my previous experiments with 1-sided Multiplicative-LoRAs, this likely means we need to skip the final 3-4 layers instead of just the last 2, but that can wait for version 2 and can likely be temporarily fixed using a slightly lower scale factor (eg: `--lora-scaled qwq-32b-writer-lora-F32.gguf 0.75`) or manually removing the last couple of layers from the LoRA file (oddly, we can use `--control-vector-layer-range START END` in `llama.cpp`, but there is no equivalent for LoRAs?).
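(If anyone wants to try the layer-trimming route, something like this on the safetensors adapter before GGUF conversion should do it - a rough sketch with made-up layer indices:)

```python
from safetensors.torch import load_file, save_file

adapter = load_file("adapter_model.safetensors")
drop = ("layers.62.", "layers.63.")   # whichever final layers you want to strip (made up here)
kept = {k: v for k, v in adapter.items() if not any(s in k for s in drop)}
save_file(kept, "adapter_model.trimmed.safetensors")
```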
PS: The double commas and the huge paragraph in the first story just come from "overcooking" the scale-factor by 2x, so not a problem...
This is nuts: `--lora qwq-32b-writer-lora-F32.gguf`, `temperature = 1.5` and `min-p = 0.1`
So I'm just going to carry on training larger and larger models now, and keep to the same "1000-steps targeting 20 samples per parameter" recipe regardless:
- Any weirdness like the double spaces above could well be due to the datasets too, so best to just keep everything constant for now and then see what we can learn for a future "version 2" model series...
- It's really looking hopeful for the future of LLM creative writing that there actually is an identifiable "She smiled again, her eyes still sparkling with enthusiasm" direction in latent space!
- It's likely that given more data/time we can train higher-rank Control Adapters that pick up on even more nuanced slop-phrases (eg: I'm only using 19% of my 10k books for the `class +1` dataset with `llama-3.1:70b`).
I've set `Llama-3.3-70B-Instruct` off now and it will take around 4-5 days to train.
This whole recent thread has my eyes perpetually sparkling with enthusiasm...
I haven't had a chance to try them properly yet. If there's a difference, it's more subtle vs applying the 'control-adapter' LoRA with llama.cpp
But your space works for additive lora merging (I forked it privately and used it for some small models).
If you want a free instance with 322GB RAM + 226G storage, decent bandwidth-- and you have a gmail account, the TPU colab instances have a lot of storage/ram:
- Go here and create a new notebook: https://colab.research.google.com/
- In the top-right, change the instance type from T4 -> the free 'v2-8 TPU' instance.
- Add cells to the notebook to do your merge. If you need to install something, you can use bash with `!pip install ...` etc.
That way you can merge+upload to hf, create ggufs, etc for free if you want to compare vs your shard-streaming space.
Sorry, I missed your post! Yeah, I've used Google Colab before, but they must have increased the space, as I really struggled to quant GGUFs of 70b parameter models last time I tried.
I'm getting some overheating problems with one of my 10gbit cards so will have to wait a few days until it cools off here in the UK and I can get my A/C unit cooling the room properly.
I've tested the `command-r` models quite a bit today and they too have the exact same space at the end of every paragraph, so it's probably worth seeing if I can come up with a better way of creating the datasets than just sticking random paragraphs between `<EOS>` tokens...
This one is good, I actually wanted to know what happens next in some of the stories it generated (using the FP32 LoRA)
When I tried a quant of the merged model, I had weird behavior where it would write half a story, then put `---` on a new line, and start a different story again and again.
That was using mradermacher/qwq-32b-writer-GGUF Q4_K on my amdgpu rig, as my main rig is busy training.
(To be clear, this could be an issue with my setup, or with rocm/amd which can be buggy with certain models)
overheating problems with one of my 10gbit cards
Never expected a network card to be the thing that overheats in a ML rig!
they must have increased the space
The TPU instances all seem to have ample RAM + decent disk space. They'll probably change that if everyone starts using them for free 320GB RAM instances?
struggled to quant GGUFs of 70b parameter
Probably not enough space for that. You'd need at least ~150gb *2 to create a bf16 gguf of a 70B
P.S. Those datasets really captured the twinkling eyes thing. Here's "Hi" with the default system prompt / settings in the llama.cpp webui with -2.0 scaling:
He smiled again, his eyes twinkling with mirth
I'm looking forward to experimenting with this when I get more time!
This one is good, I actually wanted to know what happens next in some of the stories it generated (using the FP32 LoRA)
When I tried a quant of the merged model, I had weird behavior where it would write half a story, then put `---` on a new line, and start a different story again and again.
That was using mradermacher/qwq-32b-writer-GGUF Q4_K on my amdgpu rig, as my main rig is busy training.
(To be clear, this could be an issue with my setup, or with rocm/amd which can be buggy with certain models)
Yeah, I can't be sure the online LoRA merges are correct, but the GGUF LoRAs seem to work 100% anyway, and no reason to merge if you can use them IMO.
overheating problems with one of my 10gbit cards
Never expected a network card to be the thing that overheats in a ML rig!
It doesn't help that they are all sandwiched between a pair of GPUs with the nvlink bridge over the top :/
Anyway, I think I've fixed it now by limiting the GPUs slightly more and having more aggressive fan speeds (fingers crossed).
P.S. Those datasets really captured the twinkling eyes thing. Here's "Hi" with the default system prompt / settings in the llama.cpp webui with -2.0 scaling:
He smiled again, his eyes twinkling with mirth
I'm looking forward to experimenting with this when I get more time!
I've decided to redo the `QWQ-32B` run, but this time with every paragraph sandwiched between double newlines and no `<EOS>` tokens. It seems to like this much more, as the much lower starting loss shows (yellow vs magenta).
Last time I tried this we got the problem of switching between 1st and 3rd person, but I think the contrastive nature of the Control Adapters means it shouldn't have this problem.
I've also removed the `literotica-stories` part of the negative dataset, so it's just pure slop now, as I think I might have been over-cautious about this too, and again the contrastive nature of the Control Adapters adds a lot of robustness...
Should have it done by tomorrow if the cards don't overheat again (a lot cooler here in the UK from today onwards so the A/C shouldn't have such a hard time either).
I'll be keeping a copy of the FP32 additive adapter as this makes QwQ's writing so much better and removes a lot of the things I hate about Qwen models :)
Sometimes the slightly broken models produce great outputs if you work around the quirks.
btw, this might be obvious but it only just occurred to me. In general (not specific to "control-adapters"), aren't we getting a higher fidelity model by using FP32 LoRAs with our Q4_K base models, vs merging the adapter and then quantizing the entire thing to Q4_K?
I'll be keeping a copy of the FP32 additive adapter as this makes QwQ's writing so much better and removes a lot of the things I hate about Qwen models :)
Sometimes the slightly broken models produce great outputs if you work around the quirks.
Well hopefully the new version will be even better!
btw, this might be obvious but it only just occurred to me. In general (not specific to "control-adapters"), aren't we getting a higher fidelity model by using FP32 LoRAs with our Q4_K base models, vs merging the adapter and then quantizing the entire thing to Q4_K?
Probably not by much, if at all. There was a paper linked in the `llama.cpp` discussions about doing something like this, but when I tried it for the deepseek MoE tensors, it was a waste of time.
Anyone tried `moonshotai/Kimi-K2-Instruct` yet? It seems to be getting good benchmarks for both coding:
https://www.prollm.ai/leaderboard/stack-unseen
and creative writing:
https://eqbench.com/creative_writing.html
!
Only another 11 hours until I get the 2TB `bf16` GGUF ready to quantise :/
There was a paper linked in the llama.cpp discussions
Thanks, I'll go find it.
but when I tried it for the deepseek MoE tensors; it was a waste of time.
Right, that makes sense since Q4 is pretty much as good as FP8 there. I might try it for some of the niche use cases where even q8 is noticeably worse than bf16.
Anyone tried moonshotai/Kimi-K2-Instruct
Yes, definitely try it! I finally caved and tested it on openrouter yesterday (gave up waiting for phil111 to make a post about SimpleQA / world knowledge)
Its writing style is completely fresh / different from any other models. And its story-critiquing ability is second to Claude-4-Opus (but with less swearing / berating).
I'll definitely play with it more when I get enough ram to run it locally.
Anyone tried moonshotai/Kimi-K2-Instruct yet?
Not yet, maybe this weekend. I've heard it is more heavily censored than DS, hopefully it doesn't have strong positivity bias.
and creative writing:
One of the lowest slop scores in longform creative writing, interesting. It was likely trained on heavily filtered GPT outputs - the new one, not 3.5 like the llamas/mistrals:
Only another 11 hours until I get the 2TB bf16 GGUF ready to quantise :/
12 hours downloading, 12 hours converting, 12 hours quantizing, always painful... Let's hope I don't download the base model for nothing and it's not another Qwen-type "base".
There was a paper linked in the llama.cpp discussions
Thanks, I'll go find it.
https://github.com/ggml-org/llama.cpp/discussions/8831
It was a waste of time I think, but an interesting idea.
Anyone tried moonshotai/Kimi-K2-Instruct
Yes, definitely try it! I finally caved and tested it on openrouter yesterday (gave up waiting for phil111 to make a post about SimpleQA / world knowledge)
Its writing style is completely fresh / different from any other models. And its story-critiquing ability is second to Claude-4-Opus (but with less swearing / berating).
I'll definitely play with it more when I get enough ram to run it locally.
It's taking an age to convert from the 2TB HF folder to the initial `BF16` GGUF for some reason - praying they don't find something wrong with the PR and have to mess about doing this all again :/
Anyone tried moonshotai/Kimi-K2-Instruct yet?
Not yet, maybe this weekend. I've heard it is more heavily censored than DS, hopefully it doesn't have strong positivity bias.
Yeah, it'll be interesting to see. I did try it on openrouter the first day it came out, but had a lot of problems with random termination and really slow/choppy generation speeds.
and creative writing:
One of the lowest slop scores in longform creative writing, interesting. It was likely trained on heavily filtered GPT outputs - the new one, not 3.5 like the llamas/mistrals:
Where did the phylogenetic tree come from!? That's really interesting!
Where did the phylogenetic tree come from!? That's really interesting!
If you press (i) after the slop score on eqbench creative writing (not longform), it'll be at the top of the popup window.
Original project is here:
https://github.com/sam-paech/slop-forensics
If you press (i) after the slop score on eqbench creative writing (not longform), it'll be at the top of the popup window.
I didn't know this, thanks! I wish they'd include 3-Opus in there. I was tempted to do it myself, but then I'd have to run it on all those other models as well.
Likely was trained on heavily filtered GPT outputs, the new one
It also seems to have almost the same tokenizer as GPT-4o?
P.S. I was just thinking yesterday, Command-A is quite similar to Sonnet 3.7 (I regenerated some replies in old chats from Sonnet-3.7 with Command-A and they were quite similar)
It also seems to have almost the same tokenizer as GPT-4o?
Can you explain more please?
@ddh0 See here:
https://huggingface.co/moonshotai/Kimi-K2-Instruct/blob/main/tokenization_kimi.py
And OpenAI's tokenizer:
https://github.com/openai/tiktoken/tree/main/tiktoken
Looks like they've made it more efficient for Chinese text (though I only glanced at the regex briefly)
I haven't used this before / my tooling isn't setup for it so I haven't been able to play around with it yet.
It's taking an age to convert from the 2TB HF folder to the initial BF16 GGUF for some reason - praying they don't find something wrong with the PR and have to mess about doing this all again :/
FYI - https://huggingface.co/moonshotai/Kimi-K2-Instruct/commit/37a95d86572ddbd940d06059c00b138b2098a5dc
`lora-scaled -1`: https://pastebin.com/6rzRhx9Q
`lora-scaled 1`: https://pastebin.com/0C4cwJ55
So I've still not fixed the weird spaces at the end of every paragraph, and I'd actually say that this isn't as good as the previous version that had the `literotica-stories` included in the negative side... It does hilariously seem to have "Kael" and "Elara" in every story though, so this bodes well for when I scale this up to rank-64 and upwards!
I'm now gonna try breaking by chapter header and using those vs the combined slop-stories + literotica-stories on the negative side - this should fix whatever tokenisation problem is causing the weird spaces, but hopefully doesn't bias things towards not liking short stories, etc.
It's taking an age to convert from the 2TB HF folder to the initial BF16 GGUF for some reason - praying they don't find something wrong with the PR and have to mess about doing this all again :/
FYI - https://huggingface.co/moonshotai/Kimi-K2-Instruct/commit/37a95d86572ddbd940d06059c00b138b2098a5dc
We can fix that using the new `--jinja` option (thankfully!).
Finally got it to convert - god knows why it took so long!? I think it was joining the experts together for each layer using `numpy` or something...
Getting almost exactly the same tokens/s as I get for `deepseek-v3` / `deepseek-r1`:
Using `RTX 6000 ADA` for all but shared-experts:
prompt eval time = 55366.84 ms / 1832 tokens ( 30.22 ms per token, 33.09 tokens per second)
eval time = 241439.49 ms / 1618 tokens ( 149.22 ms per token, 6.70 tokens per second)
total time = 296806.34 ms / 3450 tokens
Using `RTX 5000 ADA` for all but shared-experts:
prompt eval time = 55498.01 ms / 1832 tokens ( 30.29 ms per token, 33.01 tokens per second)
eval time = 270047.23 ms / 1739 tokens ( 155.29 ms per token, 6.44 tokens per second)
total time = 325545.24 ms / 3571 tokens
I'm hoping to set off training a draft model later, but can't be 100% sure it will work as it uses the `tiktoken` tokeniser (I've managed to fix `transplant-vocab` to work with it now, but not sure if it will actually train yet).
https://huggingface.co/jukofyork/Kimi-K2-Instruct-DRAFT-0.6B-UNTRAINED
Not much use for creative writing, but for coding it's better than nothing... Gonna see if I can fine-tune it later tonight, but not 100% sure it will work.
hopefully it doesn't have strong positivity bias
It doesn't in my testing. I haven't run into any censorship yet but haven't been trying to.
It was a waste of time I think, but an interesting idea.
You were correct. I had to try it though.
Managed to get it to train:
https://huggingface.co/jukofyork/Kimi-K2-Instruct-DRAFT-0.6B-v2.0
https://huggingface.co/jukofyork/Kimi-K2-Instruct-DRAFT-0.6B-v2.0-GGUF
looks like around 10% increase in acceptance rate over the untrained version too!
Has anyone tried Jamba 1.7 Mini yet?
Has anyone tried Jamba 1.7 Mini yet?
There's so many models this last couple of weeks it's hard to keep up!
I've still got strange spaces at the end of some lines with `qwq`, even when I formatted the whole training data as paragraphs and tokenised it as text files!? :/
Also, each variant has produced a progressively less pronounced change in writing style:
1. `Paragraph 1<EOS>Paragraph 2<EOS>...`
2. `Paragraph 1\n\n<EOS>Paragraph 2\n\n<EOS>...`
3. `Paragraph 1\n\nParagraph 2\n\n...`
I think I'm gonna go back to the original version:
`Paragraph 1<EOS>Paragraph 2<EOS>...`
and try scaling up the rank and sample size to see what comes out... I can live with slightly odd spaces at the end of lines if the writing is clearly superior!
It will be next week before I get a chance to do this now though.
It's really perplexing, as I've double-checked all the data and I'm 100% sure every paragraph has had any trailing whitespace removed too!
What's even stranger is that there seems to be absolutely no other degradation in the model other than this... :/
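For reference, the check I'm running over the data is roughly this (a simplified sketch - <EOS> is just a placeholder for the model's actual EOS string):
EOS = "<EOS>"  # placeholder for the model's real EOS string

def build_sample(paragraphs):
    cleaned = []
    for p in paragraphs:
        # every paragraph should already be free of leading/trailing whitespace
        assert p == p.strip(), f"stray whitespace in: {p!r}"
        cleaned.append(p)
    # join paragraphs directly with EOS, no newlines in between
    return EOS.join(cleaned) + EOS

print(build_sample(["First paragraph.", "Second paragraph."]))
# -> First paragraph.<EOS>Second paragraph.<EOS>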
Has anyone tried Jamba 1.7 Mini yet?
There's so many models this last couple of weeks it's hard to keep up!
It seems to have good world knowledge compared to a lot of the newer models. So I figured it'd be good for writing, in my simplistic mind!
Has anyone tried Jamba 1.7 Mini yet?
There's so many models this last couple of weeks it's hard to keep up!
It seems to have good world knowledge compared to a lot of the newer models. So I figured it'd be good for writing, in my simplistic mind!
Interesting. I figured it'd be ultra overtuned for RAG, but I'll have to check it out. I've heard that hybrid mamba models 'feel' different, too, and are easier to train long context with.
FYI Falcon 32B seems to bug out and collapse past like 11K, like there's a bug in llama.cpp, unfortunately.
...By the way, have y'all tried this one? https://huggingface.co/THU-KEG/LongWriter-Zero-32B
My current process is "starting" a story with EVA-Gutenberg/QwQ, then letting this one keep banging it out once it has some prose to latch onto. It's built on Qwen 2.5 base, not instruct, hence it 'feels' pretty good.
Has anyone tried Jamba 1.7 Mini yet?
There's so many models this last couple of weeks it's hard to keep up!
It seems to have good world knowledge compared to a lot of the newer models. So I figured it'd be good for writing, in my simplistic mind!
Interesting. I figured it'd be ultra overtuned for RAG, but I'll have to check it out. I've heard that hybrid mamba models 'feel' different, too, and are easier to train long context with.
FYI Falcon 32B seems to bug out and collapse past like 11K, like there's a bug in llama.cpp, unfortunately.
...By the way, have y'all tried this one? https://huggingface.co/THU-KEG/LongWriter-Zero-32B
My current process is "starting" a story with EVA-Gutenberg/QwQ, then letting this one keep banging it out once it has some prose to latch onto. It's built on Qwen 2.5 base, not instruct, hence it 'feels' pretty good.
There is still an issue with Llama.cpp (or there was a few days ago) that means it has to reprocess the context quite frequently. You can try it on AI21's site though; I asked it about a character from a cancelled Netflix show from 2021 and it gave me a good description of the character, better than any other local model I have run before.
According to the model's page it has been "grounded on facts"
I haven't yet but I will do ASAP. I had a break from my "writing" as I have been snowed under and I am going for a promotion at work.
There is still an issue with Llama.cpp (or there was a few days ago) That means it has to reprocess the context quite frequently
Interesting. You're talking about Jamba, right? Because I notice Falcon always reprocesses the context, too.
Finetuning MoEs is tricky though, much less SSM ones. Might be tricky for jukofyork?
Yeah, there is so much tinkering to be done. Good luck!
There is still an issue with Llama.cpp (or there was a few days ago) That means it has to reprocess the context quite frequently
You're talking about Jamba, right? Because I notice Falcon always reprocesses the context, too.
Finetuning MoEs is tricky though, much less SSM one. Might be tricky for jukofyork?
Yeah, there is so much tinkering to be done. Good luck!
There is a PR for it and I believe they are working on that issue. I've already seen someone attempt to train it. According to the model makers it just needs a simple LoRA/QLoRA.
Thanks man!
Oh yeah, I like Jamba Mini already. I dunno about its writing, but it's nailing some obscure fiction lore questions even bigger API LLMs fail.
Oh yeah, I like Jamba Mini already. I dunno about its writing, but it's nailing some obscure fiction lore questions even bigger API LLMs fail.
It's amazing how much fiction lore it knows, and if it understands how authors write, I think that's half the battle!
EDIT: Nevermind I was using the wrong UI.
Can you name the novel that features Albert Einstein and an famous American philosopher meeting an Englishman in a train carnage. It has something to do with the Occult and the Golden Dawn and was written in the 70s or 80s.
Can anyone find a local model which can answer this question?
The answer is: Masks of the Illuminati
but all seem to get completely mixed up with other semi-related books or the author's other work: The Illuminatus! Trilogy
Claude gets it straight off:
I believe you're thinking of "Masks of the Illuminati" by Robert Anton Wilson, published in 1981.
The novel features Albert Einstein and James Joyce (the Irish writer, not American, but often mistaken as such due to his time in Paris and cosmopolitan lifestyle) who meet a young Englishman named Sir John Babcock on a train. The story involves the occult, the Hermetic Order of the Golden Dawn, and various mystical and conspiratorial themes that were typical of Wilson's work.
The book is set in 1914 and deals with Babcock's paranoid belief that he's being pursued by satanists and members of the Golden Dawn. Einstein and Joyce help him work through his experiences, with the narrative exploring themes of reality, perception, and initiation.
Is this the book you were thinking of?
and the older DeepSeek-V3
got it once out of several tries.
Interestingly it causes the new Qwen3 235B A22B 2507
and Kimi K2
to go crazy!
I've nearly got Jamba-Large-1.7
downloaded, so will be interesting to try with that.
Can you name the novel that features Albert Einstein and James Joyce meeting an Englishman in a train car. It has something to do with the Occult and the Golden Dawn and was written in the 70s or 80s.
Even this version seems impossible.
None of my local models got that right ^ If you still have it, I'd try the OG Deepseek-R1.
None of my local models got that right ^ If you still have it, I'd try the OG Deepseek-R1.
The original DeepSeek-V3
got it once and then can usually get it on the second prompt.
Over the weekend I downloaded and trimmed ~15000 fiction books from Project Gutenberg. My prompts that ask to identify the boundaries where the book begins and then where the end-matter begins still work well with 4o-mini (and 4o as a fallback for the tougher cases).
All the new 4.1 models are hopeless... I suspect they just haven't seen as many books during training to understand properly what constitutes front/end matter.
I'm beginning to come round to Phil's way of thinking and that modern models seem to be getting slowly worse due to benchmaxxing :/
I'm beginning to come round to Phil's way of thinking and that modern models seem to be getting slowly worse due to benchmaxxing :/
Same here, though part of it might be to do with all the big labs getting sued for pirating books lol.
I'm beginning to come round to Phil's way of thinking and that modern models seem to be getting slowly worse due to benchmaxxing :/
Same here, though part of it might be to do with all the big labs getting sued for pirating books lol.
Yeah. It looks like the judges might be ruling on the side of fair use for the non-pirated books for training though.
I think unless they are careful, the large authors pushing for this could actually come unstuck themselves... Any book any author has ever admitted to reading on a blog from 20 years ago, could be found by copyright trolls and so on.
The mental gymnastics the creative industry is going through is pretty hilarious IMO... I read quite a lot of forums for fantasy writing, but also old school TTRPGs, gamedev, etc, and their arguments just completely fall to bits if you point out that they themselves learnt their craft by copying others who came before them... "Oh, so you decided you wanted to be an artist aged 3yo and your parents put you in a special room with no access to art then?" :D
What's even funnier/stranger is I've been into "AI" since the mid 90s and creative people were actually always completely fascinated by AI and A-life related stuff until recently.
I doubt it's going to change now though and the hatred seems to be ingrained sadly.
The question has come up whether a guild master of the weaving industry should be allowed to try an innovation in his product. The verdict: ‘If a cloth weaver intends to process a piece according to his own invention, he must not set it on the loom, but should obtain permission from the judges of the town to employ the number and length of threads that he desires, after the question has been considered by four of the oldest merchants and four of the oldest weavers of the guild.’ One can imagine how many suggestions for change were tolerated.
Shortly after the matter of cloth weaving has been disposed of, the button makers guild raises a cry of outrage; the tailors are beginning to make buttons out of cloth, an unheard-of thing. The government, indignant that an innovation should threaten a settled industry, imposes a fine on the cloth-button makers. But the wardens of the button guild are not yet satisfied. They demand the right to search people’s homes and wardrobes and fine and even arrest them on the streets if they are seen wearing these subversive goods.
Seems even more relevant today.
I wonder if we could make our own creative writing QA type benchmark?
https://en.wikipedia.org/wiki/Goodhart%27s_law
Maybe not having something to benchmaxx against is actually the problem! :D
I've encountered Goodhart's law a lot recently with GRPO.
Maybe not having something to benchmaxx against is actually the problem! :D
Yeah, make it test exactly what we want the model to do + Phil's TV show trivia, and they'll all benchmaxx for it lol
creative people were actually always completely fascinated by AI and A-life related stuff until recently.
Yeah it's weird. Probably doesn't help that people are using ChatGPT to write their reddit posts, marketing emails, etc. I like it though, makes it easy to set up old school email filters with common 3-word slop headings.
Can anyone find a local model which can answer this question?
Rewording the question for Jamba Large:
Who are the major characters of the novel "Masks of the Illuminati" by Robert Anton Wilson?
The major characters in Masks of the Illuminati by Robert Anton Wilson include:
Sir John Babcock: A young Englishman and Freemason who becomes embroiled in a series of mysterious and esoteric events. He serves as one of the primary protagonists and narrators of the story.
Albert Einstein: A fictionalized version of the famous physicist, who plays a significant role in the novel, offering scientific and philosophical insights.
James Joyce: Another historical figure, the Irish writer, appears as a character in the novel, contributing his wit and literary perspective to the unfolding events.
Aleister Crowley: A real-life occultist and mystic, Crowley is portrayed as a key figure in the novel's exploration of secret societies and esoteric practices.
The Illuminati: While not a single character, the Illuminati are a central presence in the novel, representing a shadowy and influential secret society.
Various Supporting Characters: These include historical and fictional figures who contribute to the novel's complex narrative, often blurring the lines between reality and fiction.
The novel is known for its blend of historical fiction, science fiction, and occult themes, with these characters navigating a web of conspiracy, mystery, and philosophical inquiry.
Testing other questions, the knowledge is in there, though it didn't find it with the original question and does hallucinate over it sometimes.
My prompts that ask to identify the boundaries where the book begins and then where the end-matter begins still work well with 4o-mini (and 4o as a fallback for the tougher cases).
You try Gemini? I find it always seems to beat OpenAI models at long context, even with how much they've deep-fried Pro. Might be a decent task for Jamba, actually...
My prompts that ask to identify the boundaries where the book begins and then where the end-matter begins still work well with 4o-mini (and 4o as a fallback for the tougher cases).
You try Gemini? I find it always seems to beat OpenAI models at long context, even with how much they've deep-fried Pro. Might be a decent task for Jamba, actually...
I tried a few models back when I wrote the script/prompts and couldn't really get them to work reliably sadly. It's also not easy to use a local model for this, as it's 15000 files with the first/last 4k characters of each (upped to the first/last 16k characters for the second pass).
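Roughly the shape of the script, in case anyone wants to do something similar (a sketch only - the prompt wording and the window handling here are from memory, not the exact code):
from openai import OpenAI

client = OpenAI()

def find_story_start(book_text, window=4000):
    # first pass: send just the head of the file and ask where the story starts
    head = book_text[:window]
    resp = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{
            "role": "user",
            "content": "Below is the start of a Project Gutenberg book. "
                       "Reply with the exact line on which the story itself begins, "
                       "i.e. the first line after all the front matter:\n\n" + head,
        }],
    )
    return resp.choices[0].message.content
The second pass is just the same idea with a 16k window (and 4o instead of 4o-mini) for the files the first pass couldn't handle, plus the mirror-image prompt for where the end-matter begins.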
@Downtown-Case Kimi-K2 Q2_KS knows it with your prompt
Sir John Babcock – The protagonist, a British aristocrat and occultist who becomes entangled in Illuminati intrigues. His journey involves encounters with secret societies, mystical revelations, and shifting identities.
Albert Einstein – Appears as a character who discusses metaphysics and relativity with Babcock, blurring the lines between fiction and historical reality.
James Joyce – The famous author is depicted as a magus-like figure, weaving linguistic and mystical insights into the plot.
Carl Jung – Engages in psychological and archetypal discussions, particularly regarding synchronicity and the collective unconscious.
The Illuminati – Not a single character but a shadowy, multi-faceted conspiracy that manipulates events. Wilson presents them as both real and imagined, embodying paranoia and enlightenment.
Hagbard Celine – A recurring figure from Wilson’s Illuminatus! trilogy, appearing here as a trickster-mentor who guides Babcock through layers of reality.
Morrigan and the Goddess Eris – Personifications of chaos and discord, central to Wilson’s Discordian themes.
The Dealy Lama – A cryptic guru figure who dispenses paradoxical wisdom, possibly a stand-in for Wilson himself.
Various Historical & Mythic Figures – Aleister Crowley, William Blake, and even Cagliostro appear in dreamlike sequences, reinforcing the novel’s theme of masks and shifting identities.
(That looks like an interesting book lol)
upped to first/last 16k characters for the second pass
So you need 32k characters + prompt/response? Then yeah, local models don't really make sense, it'd tie up your rig for days unless something tiny like command-r7b in vllm could do it.
@Downtown-Case Kimi-K2 Q2_KS knows it with your prompt
Sir John Babcock – The protagonist, a British aristocrat and occultist who becomes entangled in Illuminati intrigues. His journey involves encounters with secret societies, mystical revelations, and shifting identities. Albert Einstein – Appears as a character who discusses metaphysics and relativity with Babcock, blurring the lines between fiction and historical reality. James Joyce – The famous author is depicted as a magus-like figure, weaving linguistic and mystical insights into the plot. Carl Jung – Engages in psychological and archetypal discussions, particularly regarding synchronicity and the collective unconscious. The Illuminati – Not a single character but a shadowy, multi-faceted conspiracy that manipulates events. Wilson presents them as both real and imagined, embodying paranoia and enlightenment. Hagbard Celine – A recurring figure from Wilson’s Illuminatus! trilogy, appearing here as a trickster-mentor who guides Babcock through layers of reality. Morrigan and the Goddess Eris – Personifications of chaos and discord, central to Wilson’s Discordian themes. The Dealy Lama – A cryptic guru figure who dispenses paradoxical wisdom, possibly a stand-in for Wilson himself. Various Historical & Mythic Figures – Aleister Crowley, William Blake, and even Cagliostro appear in dreamlike sequences, reinforcing the novel’s theme of masks and shifting identities.
(That looks like an interesting book lol)
It's a pretty strange book and I only made it around 1/3rd of the way through, but wanted to pick it up again and had completely forgotten the name and author lol.
I think I found it originally by looking at some Reddit threads where people asked for books similar to The Club Dumas by Arturo Pérez-Reverte (if you've ever seen the movie The Ninth Gate, then half the book is the story of The Ninth Gate intermixed with another linked story about Alexandre Dumas).
upped to first/last 16k characters for the second pass
So you need 32k characters + prompt/response? Then yeah, local models don't really make sense, it'd tie up your rig for days unless something tiny like command-r7b in vllm could do it.
Yeah, it took nearly a day to finish.
I think it was worth it though as I've now got 2 datasets for the +1 class, 2 datasets for the -1 class, and each dataset is duplicated: once for random paragraphs and once for random stories/chapters.
It has been running a day so far (5 more days to go as I used rank-64 and nearly 1B tokens this time), but it is clearly not finding it as easy to pick up on formatting patterns:
Just tried Kiwi (EDIT: KIMI) FP8 for the first time
A soft knock. Three measured taps, almost hesitant. Then the door eases open without waiting for your word, and she steps inside.
Queen Elara.
NOOOOooooo.....
Just tried Kiwi FP8 for the first time
A soft knock. Three measured taps, almost hesitant. Then the door eases open without waiting for your word, and she steps inside. Queen Elara.
NOOOOooooo.....
Kimi?
I tried using my "write the opening chapter of a Grimdark trilogy" prompt, and it just never stopped and wanted to write the whole book/trilogy in one go lol.
I haven't really tested it for coding yet. I don't use any of the fancy MCP/agent stuff for code so will be interesting to see how it works.
From the little I have tested it, I'm almost certain Kimi was trained on o3 code as it has the infuriating desire to turn all your comments into this:
// ------------------------------------------------------------------------------------------------------------------------
// blah blah
// ------------------------------------------------------------------------------------------------------------------------
Only o3 does this and it drives me nuts.
Just tried Kiwi FP8 for the first time
A soft knock. Three measured taps, almost hesitant. Then the door eases open without waiting for your word, and she steps inside. Queen Elara.
NOOOOooooo.....
Kimi?
I tried using my "write the opening chapter of a Grimdark trilogy" prompt, and it just never stopped and wanted to write the whole book/trilogy in one go lol.
I haven't really tested it for coding yet. I don't use any of the fancy MCP/agent stuff for code so will be interesting to see how it works.
Elara Voss?
From the little I have tested it, I'm almost certain Kimi was trained on o3 code
Never used o3 (should I be?) but the phylogenetic tree indicates as much.
I reckon some Claude in there as well, simply because 1/4 of the time, if you ask which model it is, it'll identify as Claude.
I haven't used it for coding much as it's too slow.
NOOOOooooo.....
What's wrong with Elara??
but it is clearly not finding it as easy to pick up on formatting patterns
You might want to test one of the checkpoints then? Usually when I see that, it means I've messed up the chat template (or if I use your vocab transplant tool and forget to add lm_head/embedding)
Just tried Kiwi FP8 for the first time
A soft knock. Three measured taps, almost hesitant. Then the door eases open without waiting for your word, and she steps inside. Queen Elara.
NOOOOooooo.....
Kimi?
I tried using my "write the opening chapter of a Grimdark trilogy" prompt, and it just never stopped and wanted to write the whole book/trilogy in one go lol.
I haven't really tested it for coding yet. I don't use any of the fancy MCP/agent stuff for code so will be interesting to see how it works.
I was either hungry... or my autocorrect got me.. yes KIMI.
I should try it for code. I have a test project that I had Opus write the bones of, but it has a bug that is pretty obvious, even for an entry-level coder like me. However, I have cycled through Claude, Grok, and o3 out of curiosity, and they cannot seem to find it. They change a ton of code around the issue but never actually find the root cause. (A simple OR vs AND)
From the little I have tested it, I'm almost certain Kimi was trained on o3 code
Never used o3 (should I be?) but the phylogenetic tree indicates as much.
I reckon some Claude in there as well, simply because 1/4 of the time, if you ask which model it is, it'll identify as Claude.
I haven't used it for coding much as it's too slow.
NOOOOooooo.....
What's wrong with Elara??
The Elara's per second is too high... TOO HIGH!
From the little I have tested it, I'm almost certain Kimi was trained on o3 code
Never used o3 (should I be?)
It can be quite useful for checking mathematical details of code, but it can also get really stubborn about admitting when it is wrong and waste a lot of your time too...
but the phylogenetic tree indicates as much.
I reckon some Claude in there as well, simply because 1/4 of the time, if you ask which model it is, it'll identify as Claude.
I haven't used it for coding much as it's too slow.
Yeah, I haven't used it that much yet, but have it working fairly well:
- around 33 tokens/s PP for small prompts that have to be processed in RAM using the CPU.
- up to 100 tokens/s PP for large prompts where it's worth shipping the shared expert tensors to the GPU using a ubatch size of 10240.
- around 6.5-7 tokens/s TG for undrafted responses.
- up to 10-12 tokens/s TG with speculative decoding and code related stuff.
This is for a 550GB model with Q4_K for the shared experts, Q8_0 for the low-rank _a and _b tensors, and Q6_K for everything else:
llama_model_loader: - type f32: 365 tensors
llama_model_loader: - type q8_0: 183 tensors
llama_model_loader: - type q4_K: 180 tensors
llama_model_loader: - type q6_K: 368 tensors
print_info: file format = GGUF V3 (latest)
print_info: file type = Q6_K
print_info: file size = 541.18 GiB (4.53 BPW)
so the quality is still very good.
but it is clearly not finding it as easy to pick up on formatting patterns
You might want to test one of the checkpoints then? Usually when I see that, it means I've messed up the chat template (or if I use your vocab transplant tool and forget to add lm_head/embedding)
No, this is a good thing!
The problem I had before was that the model could easily reduce its loss by spotting that one side of the dataset was books (with markdown headers, etc) and the other was short stories.
By splitting all the books up into chapters as well as paragraphs, it's finding it much harder to do this "cheating", but it still looks to be learning well:
It's nearly halfway now, and I'm hoping it should start to traverse the contours (aka an isoquant) of the loss landscape such that the loss stays about the same, but the high regularisation term I'm using pushes the norm_max back down.
Have any of you used RAG to help with novels and the like? Any recommendations for an embedding model that is well suited to the task? Plus, is ChromaDB good? I've been using Chroma and BGE-3, just wondering if there is anything better.
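For context, what I have now is roughly this sort of thing (a simplified sketch using Chroma's built-in default embedder rather than BGE):
import chromadb

client = chromadb.PersistentClient(path="./novel_db")
lore = client.get_or_create_collection(name="novel_lore")

# index worldbuilding notes / summaries of earlier chapters
lore.add(
    ids=["notes-001", "ch01-summary"],
    documents=[
        "The river-city of Vael floods every third winter.",
        "Chapter 1: the heir returns home after the salt war and finds the keep abandoned.",
    ],
)

# pull the most relevant lore into the prompt before writing the next scene
results = lore.query(query_texts=["What happened during the salt war?"], n_results=2)
print(results["documents"][0])
It works, but I've no idea if this combination is actually the best fit for fiction, hence the question.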
There's a good chance we can improve Kimi K2
a little by limiting to 32k context and changing the YaRN scaling parameter, eg:
--override-kv deepseek2.context_length=int:32768 --override-kv deepseek2.rope.scaling.factor=float:8.0 --ctx_size 32768
The tech report seems to suggest that there was no actual long-context training beyond 32k:
unlike DeepSeek V3
which used the YaRN parameters during training:
It doesn't look like it works that great above 32k anyway:
so it's probably a good idea to avoid using the 128k YaRN parameters unless you really have to, and only then use the minimum you actually need >32k, eg:
--override-kv deepseek2.context_length=int:65536 --override-kv deepseek2.rope.scaling.factor=float:16.0 --ctx_size 65536
and so on.
Just tried Kiwi FP8 for the first time
A soft knock. Three measured taps, almost hesitant. Then the door eases open without waiting for your word, and she steps inside. Queen Elara.
NOOOOooooo.....
Kimi?
I tried using my "write the opening chapter of a Grimdark trilogy" prompt, and it just never stopped and wanted to write the whole book/trilogy in one go lol.
I haven't really tested it for coding yet. I don't use any of the fancy MCP/agent stuff for code so will be interesting to see how it works.
I was either hungry... or my autocorrect got me.. yes KIMI.
I should try it for code. I have a test project that I had Opus write the bones of, but it has a bug that is pretty obvious, even for an entry-level coder like me. However, I have cycled through Claude, Grok, and o3 out of curiosity, and they cannot seem to find it. They change a ton of code around the issue but never actually find the root cause. (A simple OR vs AND)
I'm still staring the bug right in the face and giving Opus another crack at the bug it wrote. It's creating more debug logs to identify the issue and this was its observation:
"If you see "* CONDITION TRUE " but no "🔥 EXECUTING" messages - There's a logic bug (very strange) "
Oh I KNOW THERE IS A LOGIC BUG. I'm staring right at it! It's interesting to me that it cannot see it, nor can Grok4, or o3. Eventually I will let it off the hook and point it out but this has been an interesting experiment.
Hallelujah! 300 lines of debug code later, it found the issue! The code is designed to execute a task between a specified start and end time every day. The logic it built set both the start and end conditions to true at the beginning of the time window each day, instead of start = true and end = false.
It was like watching someone with a puzzle piece in their hand, staring at the puzzle that has only one empty spot.
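For the curious, the shape of the bug was roughly this (a hypothetical reconstruction, not the actual code):
from datetime import time

def should_run_buggy(now, start, end):
    # with a 09:00-17:00 window this is true for every time of day
    return now >= start or now <= end

def should_run_fixed(now, start, end):
    # only true inside the daily window
    return start <= now <= end

print(should_run_buggy(time(23, 0), time(9, 0), time(17, 0)))  # True  (wrong)
print(should_run_fixed(time(23, 0), time(9, 0), time(17, 0)))  # False (correct)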
Oh I KNOW THERE IS A LOGIC BUG. I'm staring right at it! It's interesting to me that it cannot see it, nor can Grok4, or o3. Eventually I will let it off the hook and point it out but this has been an interesting experiment.
Yeah, it's funny how they can be completely blind to certain bugs. It's worse at longer contexts of course (that long writing benchmark in juk's last post also applies to coding btw), so starting a new context can help.
But ultimately, they're not really "reasoning" or "thinking", and still need steering.
It was like watching someone with a puzzle piece in their hand, staring at the puzzle that has only one empty spot.
Lol. It's amusing if you're not pressed for time and trying to just get it done :D
One thing I've noticed with Opus 4 (watch out for this), is it'll sometimes "work around" the bug by removing the functionality / skipping the broken logic!
And then it'll act like it's solved the problem.
The problem I had before was that the model could easily reduce its loss by spotting that one side of the dataset was books (with markdown headers, etc) and the other was short stories.
Oh yeah, of course. I forgot for a moment that the goal is for it not to pick up on formatting.
--override-kv deepseek2.context_length=int:32768 --override-kv deepseek2.rope.scaling.factor=float:8.0 --ctx_size 32768
Is the logic something like rope.scaling.factor = context_length / original_max_position_embeddings ?
"original_max_position_embeddings": 4096,
So for context_length=16384 I'd want rope.scaling.factor=float:4.0 ?
p.s.
It doesn't look like it works that great above 32k anyway:
Wow, the new Qwen3 MoE looks pretty bad on that test.
For a laugh, I tried the 480b coding model on fiction at 16k and it was completely broken (though this was a Q3 quant). It was like Llama-4 levels of not remembering anything.
so it's probably a good idea to avoid using the 128k YaRN parameters unless you really have to, and only then use the minimum you actually need >32k, eg:
Do you know how difficult it would be to make llama.cpp use dynamic YaRN? Apparently the "Alibaba Model Studio" supports that, but no open source framework does. I did make a very basic attempt at doing it but saw no difference in perplexity when tested with a context of 64k.
so it's probably a good idea to avoid using the 128k YaRN parameters unless you really have to, and only then use the minimum you actually need >32k, eg:
--override-kv deepseek2.context_length=int:65536 --override-kv deepseek2.rope.scaling.factor=float:16.0 --ctx_size 65536
and so on.
@jukofyork
This may be tangential, but I noticed that at least for Qwen3, llama.cpp doesn't currently correctly apply YaRN scaling unless it's baked into the GGUF at the time of conversion. What's worse is that it's very easy to miss that it's not being applied correctly. One needs to look at the console output and find the line that says print_info: rope scaling = yarn
. If it says anything other than yarn
then YaRN scaling is not being applied. Just a heads-up in case that issue is relevant, which I'm not sure that it is :)
@ddh0 Seems to work with K2 and ik_llamacpp
validate_override: Using metadata override ( int) 'deepseek2.context_length' = 32768
validate_override: Using metadata override (float) 'deepseek2.rope.scaling.factor' = 8.000000
...
llm_load_print_meta: rope type = 0
llm_load_print_meta: rope scaling = yarn
And yeah, here's Qwen3:
llm_load_print_meta: rope type = 2
llm_load_print_meta: rope scaling = linear
I raised a question here: https://github.com/MoonshotAI/Kimi-K2/issues/55
They replied:
Starting from 4k -> 32k stage, we set yarn_scale_factor=32, original_max_position_embeddings=4096 and rope_theta=50000, i.e. we use 128k settings even in 32k stages.
This is the same as what deepseek-v3
did, so it's probably not a good idea to alter the YaRN scaling parameter.
--override-kv deepseek2.context_length=int:32768 --override-kv deepseek2.rope.scaling.factor=float:8.0 --ctx_size 32768
Is the logic something like rope.scaling.factor = context_length / original_max_position_embeddings ?
"original_max_position_embeddings": 4096,
So for context_length=16384 I'd want rope.scaling.factor=float:4.0 ?
Yeah, for the qwen
models (which didn't train using YaRN AFAIK) or when you want to manually use YaRN to extend an existing model, this is the way you set the scale factor.
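ie: it's literally just the ratio - as a quick sanity check against the numbers used above:
original_max_position_embeddings = 4096
for target_ctx in (16384, 32768, 65536, 131072):
    print(target_ctx, "->", target_ctx / original_max_position_embeddings)
# 16384 -> 4.0, 32768 -> 8.0, 65536 -> 16.0, 131072 -> 32.0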
It doesn't look like it works that great above 32k anyway:
Wow, the new Qwen3 MoE looks pretty bad on that test.
For a laugh, I tried the 480b coding model on fiction at 16k and it was completely broken (though this was a Q3 quant). It was like Llama-4 levels of not remembering anything.
Yeah, I can't say any qwen
models really live up to expectations sadly :/
so it's probably a good idea to avoid using the 128k YaRN parameters unless you really have to, and only then use the minimum you actually need >32k, eg:
Do you know how difficult it would be to make llama.cpp use dynamic YaRN? Apparently the "Alibaba Model Studio" supports that, but no open source framework does. I did make a very basic attempt at doing it but saw no difference in perplexity when tested with a context of 64k.
I think it would be fairly hard, as the stored K-vectors in the KV-cache would all need their RoPE frequencies re-rotating (ie: if you look at how llama.cpp does context shifting, it looks like it would be a lot of extra computation each step to undo and then redo the rotations).
Perhaps other backends store the un-RoPEd K-vectors, then calculate the rotation during the forward pass?
so it's probably a good idea to avoid using the 128k YaRN parameters unless you really have to, and only then use the minimum you actually need >32k, eg:
--override-kv deepseek2.context_length=int:65536 --override-kv deepseek2.rope.scaling.factor=float:16.0 --ctx_size 65536
and so on.
@jukofyork This may be tangential, but I noticed that at least for Qwen3, llama.cpp doesn't currently correctly apply YaRN scaling unless it's baked into the GGUF at the time of conversion. What's worse is that it's very easy to miss that it's not being applied correctly. One needs to look at the console output and find the line that says
print_info: rope scaling = yarn
. If it says anything other thanyarn
then YaRN scaling is not being applied. Just a heads-up in case that issue is relevant, which I'm not sure that it is :)
@ddh0 Seems to work with K2 and ik_llamacpp
validate_override: Using metadata override ( int) 'deepseek2.context_length' = 32768 validate_override: Using metadata override (float) 'deepseek2.rope.scaling.factor' = 8.000000 ... llm_load_print_meta: rope type = 0 llm_load_print_meta: rope scaling = yarn
And yeah, here's Qwen3:
llm_load_print_meta: rope type = 2 llm_load_print_meta: rope scaling = linear
Yeah, I noticed this when making the draft models (qwen-2.5 uses a base of 32k, unlike 4k for the examples above), and I only add those settings for the extended versions:
I tested this quite a bit and it makes the draft acceptance rate around 2-5% worse using the YaRNed versions.
Like @ddh0 mentions, you can't alter or add YaRN to a model without it already being baked into the GGUF file too!
If you go into llama.cpp/gguf-py/gguf/scripts
, then there are two tools:
> ./gguf_set_metadata.py
usage: gguf_set_metadata.py [-h] [--dry-run] [--force] [--verbose] model key value
Set a simple value in GGUF file metadata
positional arguments:
model GGUF format model filename
key Metadata key to set
value Metadata value to set
options:
-h, --help show this help message and exit
--dry-run Don't actually change anything
--force Change the field without confirmation
--verbose increase output verbosity
which will let you permanently change the GGUF header.
and:
> ./gguf_new_metadata.py --help
usage: gguf_new_metadata.py [-h] [--general-name "name"] [--general-description "Description ..."] [--chat-template "{% ... %} ..."]
[--chat-template-config tokenizer_config.json] [--pre-tokenizer "pre tokenizer"] [--remove-metadata general.url]
[--special-token bos | eos | eot | eom | unknown | seperator | padding | mask | fim_pre | fim_suf | fim_mid | fim_pad | fim_rep | fim_sep | prefix | suffix | middle "<token>"]
[--special-token-by-id bos | eos | eot | eom | unknown | seperator | padding | mask | fim_pre | fim_suf | fim_mid | fim_pad | fim_rep | fim_sep | prefix | suffix | middle 0]
[--force] [--verbose]
input output
Make a copy of a GGUF file with new metadata
positional arguments:
input GGUF format model input filename
output GGUF format model output filename
options:
-h, --help show this help message and exit
--general-name "name"
The models general.name
--general-description "Description ..."
The models general.description
--chat-template "{% ... %} ..."
Chat template string (or JSON string containing templates)
--chat-template-config tokenizer_config.json
Config file containing chat template(s)
--pre-tokenizer "pre tokenizer"
The models tokenizer.ggml.pre
--remove-metadata general.url
Remove metadata (by key name) from output model
--special-token bos | eos | eot | eom | unknown | seperator | padding | mask | fim_pre | fim_suf | fim_mid | fim_pad | fim_rep | fim_sep | prefix | suffix | middle "<token>"
Special token by value
--special-token-by-id bos | eos | eot | eom | unknown | seperator | padding | mask | fim_pre | fim_suf | fim_mid | fim_pad | fim_rep | fim_sep | prefix | suffix | middle 0
Special token by id
--force Bypass warnings without confirmation
--verbose Increase output verbosity
which will let you change certain things, but not the YaRN stuff apparently?
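Either way, it's worth double-checking what rope/YaRN metadata is actually baked into a GGUF header before trying to override anything - something like this should do it (a sketch using the gguf-py package that ships with llama.cpp; the filename is just an example):
from gguf import GGUFReader

reader = GGUFReader("some-model.gguf")
for key in reader.fields:
    if "rope" in key or "context_length" in key:
        print(key)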
I think the best place to try to understand how it's implemented in llama.cpp
is probably to go right back to the original PR:
https://github.com/ggml-org/llama.cpp/pull/2268/files
scroll down to the ggml.c changes:
static float rope_yarn_ramp(const float low, const float high, const int i0) {
const float y = (i0 / 2 - low) / MAX(0.001f, high - low);
return 1 - MIN(1, MAX(0, y));
}
// YaRN algorithm based on LlamaYaRNScaledRotaryEmbedding.py from https://github.com/jquesnelle/yarn
// MIT licensed. Copyright (c) 2023 Jeffrey Quesnelle and Bowen Peng.
static void rope_yarn(
float theta_extrap, float freq_scale, float corr_dims[2], int64_t i0, float ext_factor, float mscale,
float * cos_theta, float * sin_theta
) {
// Get n-d rotational scaling corrected for extrapolation
float theta_interp = freq_scale * theta_extrap;
float theta = theta_interp;
if (ext_factor != 0.0f) {
float ramp_mix = rope_yarn_ramp(corr_dims[0], corr_dims[1], i0) * ext_factor;
theta = theta_interp * (1 - ramp_mix) + theta_extrap * ramp_mix;
// Get n-d magnitude scaling corrected for interpolation
mscale *= 1.0f + 0.1f * logf(1.0f / freq_scale);
}
*cos_theta = cosf(theta) * mscale;
*sin_theta = sinf(theta) * mscale;
}
// Apparently solving `n_rot = 2pi * x * base^((2 * max_pos_emb) / n_dims)` for x, we get
// `corr_dim(n_rot) = n_dims * log(max_pos_emb / (n_rot * 2pi)) / (2 * log(base))`
static float ggml_rope_yarn_corr_dim(int n_dims, int n_orig_ctx, float n_rot, float base) {
return n_dims * logf(n_orig_ctx / (n_rot * 2 * (float)M_PI)) / (2 * logf(base));
}
void ggml_rope_yarn_corr_dims(
int n_dims, int n_orig_ctx, float freq_base, float beta_fast, float beta_slow, float dims[2]
) {
// start and end correction dims
dims[0] = MAX(0, floorf(ggml_rope_yarn_corr_dim(n_dims, n_orig_ctx, beta_fast, freq_base)));
dims[1] = MIN(n_dims - 1, ceilf(ggml_rope_yarn_corr_dim(n_dims, n_orig_ctx, beta_slow, freq_base)));
}
I haven't time atm, but it would be interesting to work through this to see if there was any way to improve short-context ability without sending the position embeddings out-of-distribution, eg:
- Just making the changes I suggested above (which work for qwen-2 onwards or any model not trained using YaRN) to deepseek-v3 / deepseek-r1 (and I assume kimi-k2 or any other model trained using YaRN) causes the model to see a sort of "time-dilated" version of the text and not work as well (due to the position embeddings being out-of-distribution compared to what it saw during training).
What I wonder is if even the static-YaRN scaling this produces is also out-of-distribution compared to what the model was actually trained on (ie: because during training all those sin/cos values were slowly changing as the model read through each long-context training example).
I've probably explained this horribly and could well be wrong, but I think it would be interesting to confirm one way or another, and possibly lead to a slightly less out-of-distribution version of static-YaRN that only requires fudging these functions above.
Yeah, I've explained that so badly I don't think I'd even understand what I was saying myself tomorrow :D
In simple terms - look at what this guy is doing in his attempt to add dynamic-YaRN:
https://github.com/sgl-project/sglang/issues/6030
He's basically increasing the scale-factor as you pass through the context so that you only use as much scale-factor as is needed up until that specific token index in the context.
So now imagine we extract the 2D function (ie: nothing to do with LLMs or even pytorch), and pass through all token indices from original_max_position_embeddings + 1 to max_position_embeddings, using the same rope_theta (and other RoPE/YaRN settings) but dynamically calculating the yarn_scale_factor like his code does. This will produce an array of n_rot coordinates of length max_position_embeddings - original_max_position_embeddings, which would be what a model trained using dynamic-YaRN would see.
Now do the same thing again, but instead of changing yarn_scale_factor throughout the context, keep it at the max value needed (ie: max_position_embeddings / original_max_position_embeddings). This will produce another array of n_rot coordinates of the same length.
Now we can measure the distance between each corresponding set of n_rot coordinates for each specific context length index and try to find a "fudge factor" for the static-YaRN calculation to minimise the sum of distances:
- You can use different metrics to measure the distance (L2 / Euclidean distance being the most obvious, but for function minimisation L1 / taxicab distance is often used).
- You can bias the weighting of the sum of distances towards the range of context length indices you care about (and in the extreme case give zero weight to those above the maximum context length you expect to use the model at!).
The basic idea is to try to minimise the gap between what the model saw during training (did deepseek or kimi even use training-time dynamic-YaRN?) and what it will see for your "average" use case (ie: minimise the expected deviation between RoPE coordinates).
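Something like this is roughly what I'm imagining for the comparison (a very rough sketch: it treats the scaling as plain position-interpolation and ignores the ramp/correction-dims from rope_yarn() above, so it only shows the shape of the idea rather than real YaRN):
import numpy as np

n_rot = 64                   # rotary dims (made up for the sketch)
rope_theta = 50000.0         # from the Kimi-K2 reply above
orig_ctx = 4096              # original_max_position_embeddings
max_ctx = 131072             # extended context

inv_freq = rope_theta ** (-np.arange(0, n_rot, 2) / n_rot)
static_scale = max_ctx / orig_ctx

def coords(pos, scale):
    # sin/cos coordinates of the (interpolated) rotation angles for one position
    theta = (pos / scale) * inv_freq
    return np.concatenate([np.cos(theta), np.sin(theta)])

total_gap = 0.0
for pos in range(orig_ctx + 1, max_ctx + 1, 256):   # stride just to keep it quick
    dyn_scale = max(1.0, pos / orig_ctx)             # dynamic: only as much scale as needed
    total_gap += np.linalg.norm(coords(pos, dyn_scale) - coords(pos, static_scale))

print(f"summed L2 gap between dynamic and static scaling: {total_gap:.1f}")
The "fudge factor" search would then just be whatever tweak to the static calculation (or whatever weighting over the positions you actually care about) pushes that sum down.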
Oh I KNOW THERE IS A LOGIC BUG. I'm staring right at it! It's interesting to me that it cannot see it, nor can Grok4, or o3. Eventually I will let it off the hook and point it out but this has been an interesting experiment.
Yeah, it's funny how they can be completely blind to certain bugs. It's worse at longer contexts of course (that long writing benchmark in juk's last post also applies to coding btw), so starting a new context can help.
But ultimately, they're not really "reasoning" or "thinking", and still need steering.
It was like watching someone with a puzzle piece in their hand, staring at the puzzle that has only one empty spot.
Lol. It's amusing if you're not pressed for time and trying to just get it done :D
One thing I've noticed with Opus 4 (watch out for this), is it'll sometimes "work around" the bug by removing the functionality / skipping the broken logic!
And then it'll act like it's solved the problem.
Yeah, sometimes I have to remind myself that it's picking from a basket of next probable tokens. Something interesting (at least to me) was that I rewound back to the beginning of when the problem first manifested and asked "What does your intuition tell you?" and it replied with the issue immediately! Rewind again and don't ask it about intuition, and we go back to the debug logapalooza.
so it's probably a good idea to avoid using the 128k YaRN parameters unless you really have to, and only then use the minimum you actually need >32k, eg:
Do you know how difficult it would be to make llama.cpp use dynamic YaRN? Apparently the "Alibaba Model Studio" supports that, but no open source framework does. I did make a very basic attempt at doing it but saw no difference in perplexity when tested with a context of 64k.
Exllamav2/v3 support somewhat dynamic YaRN. They correctly apply the scale factor of the maximum context when loading the model (eg 64K context 32K base will use the correct factors as specified in the config file, not default to 128K like vllm or llama.cpp), but don't change it per request.
Theoretically you could hotload models with different yarn configs via TabbyAPI.
Qwen has always been really awful past 32K though, and it seems they're deep frying the MoEs even more.
One other subjective observation: I find some qwen 2.5/3 models, if right up against their 32K/40K limit, actually work better with YaRN (specified to like 43K or something) even if technically under the limit.
But again, it's peanuts compared to native long context models, even overcooked ones like Hunyuan 70B.
None of my local models got that right ^ If you still have it, I'd try the OG Deepseek-R1.
OG R1 one-shots it!
Can you name the novel that features Albert Einstein and an famous American philosopher meeting an Englishman in a train carnage. It has something to do with the Occult and the Golden Dawn and was written in the 70s or 80s.
Difficult question, think carefully.
Thinking:
First, I need to recall novels from the 70s or 80s that blend real historical figures with occult elements. The Golden Dawn is a real occult organization, so that's a big clue. The mention of Einstein and an American philosopher—maybe William James? He was a philosopher and psychologist, associated with pragmatism, and his brother was Henry James, the writer. Wait, William James was part of the Society for Psychical Research, which dealt with the paranormal, so that could tie into the occult angle.
The train carnage (maybe they meant carriage) is a specific setting. So a meeting on a train between Einstein, an American philosopher (like William James), and an Englishman. The Englishman might be another historical figure associated with the occult. Aleister Crowley comes to mind, but he was more of a ceremonial magician and part of the Golden Dawn. However, he was English, so that fits. Could the Englishman be someone like Arthur Conan Doyle? He had an interest in spiritualism, but he's more of a writer than a philosopher.
Wait, the user might have meant "carriage" instead of "carnage" since a train carnage would be a wreck, but the context suggests a meeting in a train car. Let me check that. Maybe the user mistyped "carriage" as "carnage."
Putting this together: a novel where Einstein and William James meet an English occultist (maybe Crowley or someone else from the Golden Dawn) on a train. The Golden Dawn was active in the late 19th and early 20th centuries, so the novel might be set around that time, but written in the 70s or 80s.
Now, thinking of authors from that period who write historical fiction with occult elements. Tim Powers comes to mind. He wrote "The Anubis Gates" (1983), which involves time travel and occult elements. Another is James P. Blaylock, who collaborated with Powers. Maybe "The Digging Leviathan" or "Homunculus," but those might not fit exactly.
Wait, there's also "The Club Dumas" by Arturo Pérez-Reverte, but that's later (1993). Not quite. Maybe "Foucault's Pendulum" by Umberto Eco (1988), which deals with occult conspiracies, but I don't recall Einstein being a character there.
Wait, another angle: Einstein was friends with Charles Proteus Steinmetz, an electrical engineer, but that's not a philosopher. The American philosopher could be someone else. Let me think again. William James died in 1910, Einstein's major work was in 1905 and later. So their timelines do overlap in the early 20th century. If the novel is set in that time, the train meeting could be plausible.
Looking for a novel that combines these figures with the Golden Dawn. The Golden Dawn had members like W.B. Yeats, Arthur Waite, Aleister Crowley. So the Englishman could be Crowley. A novel where Einstein, William James, and Crowley meet on a train. That's a unique combination.
Now, checking books from the 70s/80s. "The Illuminatus! Trilogy" by Robert Shea and Robert Anton Wilson (1975) is a possibility. It's a postmodern conspiracy fiction with many historical and occult references. It does include elements like the Illuminati, Crowley, and other occult themes. However, I'm not sure if Einstein and William James are characters in it. The trilogy is more about a sprawling conspiracy rather than a specific meeting on a train.
Another possibility is "The Stress of Her Regard" by Tim Powers (1989). It involves Romantic poets and a supernatural vampire-like entity. But again, Einstein isn't in it.
Wait, "The Name of the Rose" by Umberto Eco (1980) is historical but set in medieval times. Not relevant here.
What about "The Seven-Per-Cent Solution" by Nicholas Meyer (1974)? It's a Sherlock Holmes pastiche where Holmes meets Freud. But that's a different set of characters.
Alternatively, "The Difference Engine" by William Gibson and Bruce Sterling (1990) is steampunk, maybe too late and different theme.
Wait, "The Physicist and the Philosopher" isn't a novel, but a non-fiction book. Not helpful.
Hold on, maybe "The Einstein Intersection" by Samuel R. Delany (1967). But that's more sci-fi and doesn't fit the occult angle.
Alternatively, "Gravity's Rainbow" by Thomas Pynchon (1973) has a lot of characters and references, but I don't recall Einstein being a character.
Another thought: Maybe the user mixed up some details. The Golden Dawn is more late 19th/early 20th century. The 70s/80s novel might be a historical fiction set in that era.
Wait, there's a novel called "The Secret History of the Pink Carnation" by Lauren Willig (2005), but that's too recent.
Alternatively, "The Prestige" by Christopher Priest (1995) involves magicians and rivalry, but again, not the right time frame.
Wait, perhaps "The Ceremonies" by T.E.D. Klein (1984). It's horror with occult elements, but I don't think it involves those historical figures.
Alternatively, "The Club Dumas" as mentioned before, but it's about a book enthusiast, not Einstein.
Hmm. Maybe the user is thinking of "The Lost Symbol" by Dan Brown, but that's 2009 and features no Einstein.
Wait, another angle: The Golden Dawn and the Occult. Maybe the novel is "The Magus" by John Fowles (1965), but that's a bit earlier than the 70s. Also, set on a Greek island, not a train.
Alternatively, "The Eight" by Katherine Neville (1988) is a thriller involving a chess set and historical figures, but I don't think Einstein is in it.
Wait, perhaps "The Kingdom of this World" by Alejo Carpentier, but that's 1949, magical realism, not sure.
Alternatively, "The New York Trilogy" by Paul Auster (1987) is postmodern, but not occult.
Wait, maybe the answer is "The White Hotel" by D.M. Thomas (1981), which has Freud as a character, but not Einstein.
Alternatively, "Ragtime" by E.L. Doctorow (1975) blends historical figures, but I don't recall occult elements. It does have Harry Houdini, Freud, etc.
Wait, the user said "train carnage" – could that be a train accident or a violent event? Maybe a murder on a train. But the user might have meant "carriage."
Wait, I recall a novel called "The Strange Case of the Composer and His Judge" by Patricia Duncker (2010), but that's too late.
Alternatively, "The Interpretation of Murder" by Jed Rubenfeld (2006) involves Freud and Jung in New York, but again, not the right time.
Wait, perhaps "The Shadow of the Wind" by Carlos Ruiz Zafón (2001), but that's set in Spain, and again, too recent.
Hmm. This is tricky. Let me think again. The key points: Einstein, American philosopher (William James?), Englishman (Crowley?), train meeting, occult/Golden Dawn, 70s/80s.
Wait, could it be "The Anubis Gates" by Tim Powers (1983)? It involves time travel, Egyptian gods, and a plot in the past. There's a character named Brendan Doyle who goes back to 1810, meets Coleridge, and there are elements of body-swapping and magic. But Einstein isn't in it. However, the Golden Dawn is more late 19th century. Maybe not.
Alternatively, "The Stress of Her Regard" by Tim Powers (1989) involves Byron, Shelley, and Keats, with a supernatural twist. Again, no Einstein.
Wait, another Tim Powers novel, "Declare" (2001), involves Kim Philby and supernatural elements, but again, too late.
Alternatively, "The Drawing of the Dark" by Tim Powers (1979) is a fantasy about a mystical brew, set in the 16th century. Doesn't fit.
Hmm. Maybe the answer is "The Club Dumas," but I'm not sure. Wait, the user mentioned the 70s or 80s. Let me check the publication dates again.
Wait, perhaps "The Philosopher's Stone" by Colin Wilson (1969). Colin Wilson wrote about the occult, and this novel is about a group of intellectuals investigating paranormal phenomena. However, 1969 is close to the 70s. Does it feature Einstein? Not sure. The American philosopher could be William James, but Colin Wilson is English. The novel involves the Rosicrucians and the occult. Maybe?
Alternatively, "The Mind Parasites" by Colin Wilson (1967) is a Lovecraftian horror, but again, not sure about the characters.
Alternatively, "The Occult: A History" by Colin Wilson is non-fiction.
Hmm. Maybe the user is referring to a book by Robert Anton Wilson. He wrote "The Historical Illuminatus Chronicles," which includes "The Earth Will Shake" (1982), "The Widow's Son" (1985), and "Nature's God" (1991). These involve secret societies, the Illuminati, and historical figures. In "The Widow's Son," there's a character named Sigismundo Celine, and the story involves Freemasonry, but Einstein wasn't born yet in that timeframe. However, maybe another of his books.
Alternatively, "Masks of the Illuminati" by Robert Anton Wilson (1981). Let me check that. The plot involves Sir John Babcock, a young Englishman, who meets Aleister Crowley and Albert Einstein in a London pub in 1912. They discuss conspiracy theories and the occult. There's a train scene? Wait, the user mentioned a train carriage. If the novel includes a meeting on a train between Einstein, an American philosopher (maybe William James), and an Englishman (Crowley), that could fit. Let me verify.
Yes! "Masks of the Illuminati" by Robert Anton Wilson (1981). The story involves a young man who meets both Einstein and James Joyce (not William James) in a pub, but there's also Crowley. However, maybe the user conflated Joyce with an American philosopher. Alternatively, maybe William James is included. Let me check the plot summary again.
Upon checking, "Masks of the Illuminati" features Sir John Babcock, who encounters Aleister Crowley and Albert Einstein in 1912. They discuss the Illuminati, the occult, and various conspiracies. The story involves a train journey where they might have met. The Golden Dawn is connected through Crowley's involvement. The American philosopher might be a mix-up with another character, but given the elements, this seems to fit the description. The publication date is 1981, which is within the 70s/80s timeframe the user mentioned. Although James Joyce is Irish, not American, the user might have misremembered the philosopher as William James instead of Joyce. Alternatively, there might be another philosopher character in the book.
Given that the user emphasized the train carriage meeting with Einstein, an American philosopher, and an Englishman (Crowley), along with the occult and Golden Dawn themes, "Masks of the Illuminati" seems like the closest match despite potential mix-ups in the philosopher's identity. Therefore, the answer is likely this novel.
Answer:
Key Points:
- Plot: The story involves a young Englishman, Sir John Babcock, who encounters Albert Einstein and Aleister Crowley (a key figure linked to the occult and the Hermetic Order of the Golden Dawn) in a London pub in 1912. The narrative weaves together conspiracy theories, occult mysteries, and surreal humor.
- Train Setting: While much of the story takes place in a pub, Wilson’s works often include layered, hallucinatory sequences that could align with a "train carnage/carriage" scene (the user might have conflated details or referenced a symbolic/metaphorical train journey).
- Occult Themes: The Golden Dawn, Illuminati, and Crowley’s ceremonial magic are central to the plot.
- Era: Published in the early 1980s, fitting the timeframe mentioned.
Why It Fits:
- Einstein and Crowley are central characters, and while the "American philosopher" isn’t explicitly present, Wilson often blends historical figures with fictionalized occult encounters. The user may have conflated details (e.g., Crowley as the "Englishman" and Einstein with another thinker).
- The novel’s psychedelic, nonlinear style could lead to the impression of a chaotic train meeting.
This is a deep-cut answer, but "Masks of the Illuminati" aligns with the core elements of your query!
This is quite close to my use case (recalling historical facts and figures and doing analysis on them), and no other local model beats old R1 on that. New R1 lost some knowledge, but also concludes too early on difficult questions.
but all seem to get completely mixed up with other semi-related books or the author's other work: The Illuminatus! Trilogy
What a coincidence that you mention it; I'm reading it right now.
All the new 4.1 models are hopeless... I suspect they just haven't seen as many books during training to understand properly what constitutes front/end matter.
They are hopeless overall, even benchmarks show old 4o being better than them.
https://huggingface.co/jukofyork/command-r-35b-writer-v2
https://huggingface.co/jukofyork/command-r-35b-writer-v2-lora-gguf
It seems to be slightly overcooked though and does better when you use:
--lora-scaled command-r-35b-writer-v2-lora-F32.gguf 0.5
instead of applying the full LoRA.
Training qwq-32b again now, but using max-norm regularisation as I'm not keen on the massive overshoot that happens using just weight-decay:
max-norm regularisation should be able to keep the norms below 0.25
which is about equivalent to using --lora-scaled 0.5
...
Fingers crossed - it's another 5 day wait to see the results :/
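For anyone wondering what I actually mean by max-norm here, it's roughly this applied after each optimiser step (a sketch of the idea only, not the real training code):
import torch

@torch.no_grad()
def clamp_lora_norm(lora_A, lora_B, max_norm=0.25):
    # if the Frobenius norm of the effective update B @ A exceeds the cap,
    # rescale both factors so the product lands exactly on the cap
    norm = torch.linalg.matrix_norm(lora_B @ lora_A)
    if norm > max_norm:
        scale = (max_norm / norm).sqrt()
        lora_A.mul_(scale)
        lora_B.mul_(scale)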
None of my local models got that right ^ If you still have it, I'd try the OG Deepseek-R1.
OG R1 one-shots it!
This is quite close to my use case (recalling historical facts and figures and doing analysis on them), and no other local model beats old R1 on that. New R1 lost some knowledge, but also concludes too early on difficult questions.
Yeah, I keep the OG Deepseek-R1 around as it seems to have the best writing style of any model yet IMO.
but all seem to get completely mixed up with other semi-related books or the author's other: The Illuminatus! Trilogy
What a coincidence that you mention it; I'm reading it right now.
It's supposed to be less dark than the Masks book, but I hope to read them afterwards too.
Sorry missed this yesterday. There definitely seems to be some sort of synthetic data feedback loop happening here :/
https://en.wikipedia.org/wiki/Model_collapse
However, recently, other researchers have disagreed with this argument, showing that if synthetic data accumulates alongside human-generated data, model collapse is avoided.
These guys clearly never asked for a story :D
Does anyone know of some coding benchmarks that don't revolve around tool calling and/or agentic stuff?
Sadly, the ProLLM ones seem to have stopped updating:
https://www.prollm.ai/leaderboard/stack-unseen
https://www.prollm.ai/leaderboard/stack-eval
but I found these aligned quite closely with my experiences.
Kimi K2 Base is not a true base:
Not that anyone has the capacity to finetune that anyway lol.
Yeah I like the instruct tune. Is your "Deepseek-V3" instruct the OG or the 0324 version?
https://huggingface.co/jukofyork/command-r-35b-writer-v2-lora-gguf
This method of distribution (LoRAs) is very convenient. I'll try this one when I have a chance.
other researchers have disagreed with this argument
I guess it prevents complete collapse where the model becomes incoherent.
https://huggingface.co/jukofyork/command-r-35b-writer-v2-lora-gguf
This method of distribution (LoRAs) is very convenient. I'll try this one when I have a chance.
Yeah, I think being able to easily experiment with reducing the scale and/or adding mixes of LoRAs is a big plus!
I wouldn't hold out too much hope as it still seems to want to mess with the beginnings and ends of paragraphs by adding spaces - confused why though as I've been super-careful to trim every paragraph I've used :/
Not that anyone has the capacity to finetune that anyway lol.
I wasn't interested in that. True bases can also be used as autocomplete for slopless storywriting as they have more "natural" token distributions.
Yeah I like the instruct tune. Is your "Deepseek-V3" instruct the OG or the 0324 version?
I wouldn't hold out too much hope as it still seems to want to mess with the beginnings and ends of paragraphs by adding spaces - confused why though as I've been super-careful to trim every paragraph I've used :/
Tokenization issue? Have you checked the \n\n token vs what's in your tokenized dataset? Eg: \n\n=271, \n=198
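For anyone wanting to run that check, a quick sanity test along those lines (hedged: the exact ids depend on the tokenizer, and the model id below is only an illustrative placeholder):

```python
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-32B")  # swap in whatever base is actually being trained

print(tok.encode("\n", add_special_tokens=False))    # single-newline token id(s)
print(tok.encode("\n\n", add_special_tokens=False))  # double-newline token id(s)

# Round-trip a sample paragraph pair to catch stray spaces being injected:
sample = "First paragraph.\n\nSecond paragraph."
ids = tok.encode(sample, add_special_tokens=False)
print(tok.decode(ids) == sample)
```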
I never really tried this one, quantizing now.
True bases can also be used as autocomplete for slopless story-writing as they have more "natural" token distributions.
I haven't tried this for a while outside of replicating your test ^ method. I might have to give it a try.
I wouldn't hold out too much hope as it still seems to want to mess with the beginnings and ends of paragraphs by adding spaces - confused why though as I've been super-careful to trim every paragraph I've used :/
Tokenization issue? Have you checked the \n\n token vs what's in your tokenized dataset? Eg: \n\n=271, \n=198
Yeah, I'm almost certain it's due to me tokenising like this:
Paragraph blah blah<EOS>Another paragraph<EOS>Yet another blah<EOS>...
but had hoped having an equal amount of chapters with several newlines in would fix it.
It definitely seems to find it much easier to learn the prose differences when formatted like the above though sadly.
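To make the failure mode concrete, here's a hedged sketch of that packing (not the actual training script; the model id and paragraph list are just placeholders):

```python
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-0.5B")  # illustrative placeholder
paragraphs = ["Paragraph blah blah", "Another paragraph", "Yet another blah"]

ids = []
for p in paragraphs:
    ids += tok.encode(p.strip(), add_special_tokens=False) + [tok.eos_token_id]

# Every paragraph boundary becomes <EOS> rather than "\n\n", so the model barely
# sees the double-newline token during this training - which would plausibly
# explain the mangled spaces/newlines at paragraph starts and ends.
print(tok.decode(ids))
```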
I never really tried this one, quantizing now.
True bases can also be used as autocomplete for slopless story-writing as they have more "natural" token distributions.
I haven't tried this for a while outside of replicating your test ^ method. I might have to give it a try.
I've made a start on a simplified version of mikupad in Java using SWT here:
https://github.com/jukofyork/simple-mikupad
It's more for testing base models, so doubt I will add any of the "Persistent Context" stuff, eg:
- Memory: Seamlessly inject a text of your choice at the beginning of the context.
- Author's Note: Seamlessly inject a text of your choice at the end of the context, with adjustable depth.
- World Info: Dynamically include extra information in the context, triggered by specific keywords.
I never used any of that and it didn't look like it worked very well even then...
But so far:
- I've got every possible llama.cpp server setting in it (mostly untested though), including a lot that were missing from mikupad.
- The token colouring and distribution hover working.
- The session load/save/clone/import/export stuff working.
If I get a chance today:
- I'll try to add back the undo/redo stack - I had a very janky version in that I removed.
- Prompt template handling similar to mikupad and/or the llama-cli method of providing stems/endings, etc.
- A call to the llama.cpp-specific non-v1 tokenize endpoint to show the input text tokenised (I think it's fundamentally impossible to get the logprobs out of llama.cpp for prompt processing sadly?) - see the sketch below.
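For that tokenize call, something like this should do it against the llama.cpp server's non-/v1 endpoint (a rough sketch; assumes the server is running on localhost:8080):

```python
import requests

# POST the raw text to llama.cpp's /tokenize endpoint and print the token ids.
resp = requests.post(
    "http://localhost:8080/tokenize",
    json={"content": "Paragraph blah blah\n\nAnother paragraph"},
)
print(resp.json()["tokens"])
```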
I'm open to other ideas though - it's not really for writing specifically (which something like Novelcrafter
has done way better), but mainly for testing base models and tokenisation. I'm not sure if the colouring and hover tooltip is the best way to display the information too, but sticking to that for now.
I'm open to other ideas though - it's not really for writing specifically (which something like Novelcrafter has done way better), but mainly for testing base models and tokenisation. I'm not sure if the colouring and hover tooltip is the best way to display the information too, but sticking to that for now.
Awesome!
The big thing missing for me is 'built in' spell/grammar checker support. Specifically, I run the languagetool browser extension with a local server (which is probably available in most linux repos). Even for prompt testing, it's useful to make sure one's not 'misfeeding' the LLM.
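If it helps, hitting a local LanguageTool server directly is pretty simple too - a rough sketch, assuming the standalone server on its default port 8081:

```python
import requests

# Send a prompt to the local LanguageTool server and print each issue it finds,
# along with up to three suggested replacements.
resp = requests.post(
    "http://localhost:8081/v2/check",
    data={"text": "This are a test prompt.", "language": "en-US"},
)
for match in resp.json()["matches"]:
    print(match["message"], [r["value"] for r in match["replacements"][:3]])
```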
I'm open to other ideas though - it's not really for writing specifically (which something like Novelcrafter has done way better), but mainly for testing base models and tokenisation. I'm not sure if the colouring and hover tooltip is the best way to display the information too, but sticking to that for now.
Awesome!
The big thing missing for me is 'built in' spell/grammar checker support. Specifically, I run the languagetool browser extension with a local server (which is probably available in most linux repos). Even for prompt testing, it's useful to make sure one's not 'misfeeding' the LLM.
Yeah, I have used this in another project:
Which wraps a SourceViewer, which internally is just a SWT StyledText box. This also solves the problem of having to write your own Undo/Redo logic...
The problem is the JFace stuff seems to be much more closely tied to Eclipse than stock SWT and the dependencies start to escalate quickly :/
Well that's a lot sleeker than I expected (haven't run a java desktop app for over a decade). The UI is actually very responsive.
Missing feature for me is the ability to click a token, and choose an alternative.
Well that's a lot sleeker than I expected (haven't run a java desktop app for over a decade). The UI is actually very responsive.
Missing feature for me is the ability to click a token, and choose an alternative.
Yeah, until a couple of years ago I hadn't touched Java for 20+ years and still had horrible memories of it from the late-90s:
- Mainly due to the garbage collector refusing to collect anything - until every scrap of RAM and swapfile was consumed!!! :/
- Wasting hours (days) trying to get the GUI packer to work properly... It would eventually look fine, then on the uni Silicon Graphics machines be completely screwed up... Sigh.
I don't really want to go mad and start adding lots of dependencies, but it has quite a nice template library too:
https://www.stringtemplate.org/
https://github.com/antlr/stringtemplate4/blob/master/doc/cheatsheet.md
and this might be better to use than what I was planning to copy from mikupad and llama-cli.
I think I'll just stew over it for a few days and see what I think when fresh...
I don't think it would be that hard to make something similar to Novelcrafter with minimal dependencies, but eg:
https://github.com/aomukai/Writingway
just went completely bonkers and added every imaginable python library... I left it installing for several hours and it was still installing every version of PyTorch (why would it even need PyTorch???) and had to kill it...
It really rubs me up the wrong way that Novelcrafter is a web-app for no good reason, and other than looking pretty it's actually got very little substance under the hood.
faiss-cpu
langchain
langchain-core
langchain-openai
langchain-anthropic
langchain-google-genai
langchain-ollama
langchain-community
langchain-together
numpy==1.24.0
pydantic==2.9.2
PyQt5>=5.15.0
PyQtChart>=5.15.0
pyttsx3==2.90
requests==2.31.0
spacy==3.7.5
textstat
tiktoken
noisereduce
pyaudio
whisper
openai-whisper
pydub
moviepy
internetarchive
PyMuPDF
pymupdf4llm
pillow
demucs
soundfile
PyQtWebEngine
boilerpy3
spylls
ebooklib
beautifulsoup4
python-docx
https://github.com/aomukai/Writingway/blob/main/requirements.txt
Looking now I think it must have been langchain that caused the worst problems, but IIRC this was written by a non-programmer, so can't really blame them for getting carried away.
Yeah, I'm almost certain it's due to me tokenising like this:
Paragraph blah blah<EOS>Another paragraph<EOS>Yet another blah<EOS>...
but had hoped having an equal amount of chapters with several newlines in would fix it.
It definitely seems to find it much easier to learn the prose differences when formatted like the above though sadly.
Just had an idea how I might be able to get the best of both worlds out of this:
- Train on book paragraphs (class +1) and slop-stories paragraphs (class -1). Let it learn the better prose at the cost of mangling the newlines.
- Merge the Control Adapter, and then train on entire books (class +1) vs book paragraphs (class -1).
Then merge again, or better yet, combine the two LoRAs into a single LoRA for use.
If my latest attempt fails, then I'll try this next...
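One way to combine two LoRAs into a single LoRA without changing the resulting delta is block-concatenation (the combined adapter has rank r1 + r2) - a rough sketch, with made-up function and variable names:

```python
import torch

def combine_loras(A1, B1, A2, B2, s1: float = 1.0, s2: float = 1.0):
    """Return (A, B) such that B @ A == s1 * B1 @ A1 + s2 * B2 @ A2."""
    A = torch.cat([A1, A2], dim=0)            # (r1 + r2, in_features)
    B = torch.cat([s1 * B1, s2 * B2], dim=1)  # (out_features, r1 + r2)
    return A, B
```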
but IIRC this was written by a non-programmer,
Yeah, I recall him saying he's not a programmer.
pyaudio
demucs
soundfile
noisereduce
lol what?
, I'm almost certain it's due to me tokenising like this
Probably a stupid question but have you tried detokenizing it to make sure it matches? I don't think Qwen has the issue but Voxtral (Mistral) was prefixing my prompts with a space when tokenizing, causing my LoRA runs to destroy the model before I fixed it.
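A quick way to spot that particular prefix-space behaviour (hedged; the model id is just an illustrative placeholder):

```python
from transformers import AutoTokenizer

# Swap in whichever tokenizer you're actually training with.
tok = AutoTokenizer.from_pretrained("mistralai/Mistral-7B-v0.3")
ids = tok.encode("Hello", add_special_tokens=False)
print(tok.convert_ids_to_tokens(ids))  # a leading '▁' means the tokenizer prepended a space
```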
Voxtral (Mistral) was prefixing my prompts with a space when tokenizing, causing my LoRA runs to destroy the model before I fixed it.
Extremely common mistral L
Voxtral (Mistral) was prefixing my prompts with a space when tokenizing, causing my LoRA runs to destroy the model before I fixed it.
Extremely common mistral L
Seriously, why do they still stick with that confusing llama 2 chat template? Even something multi-token like Alpaca is superior to it.
Yeah, there seems to be something strange going on as it affects both command-r:35b and qwq:32b.
I can understand the paragraphs between EOS tokens data causing problems:
head ajibawa-2023-Stories-Collections-paragraphs--filtered.json
{
"text": "One by one, they voiced their opinions, revealing both hope and fear, conviction and doubt. It became clear that this wasn't simply a business decision – it was a question of morality, legacy, and responsibility.",
"length": 212
},
{
"text": "We spent hours discussing every detail, from the cutting-edge technology behind the hybrid powertrain to the intricate aerodynamics designed to minimize drag. Then came the moment I had been waiting for—he offered me a ride! With my heart pounding, I climbed aboard what felt like stepping onto a Formula One racetrack. That exhilarating experience ignited something within me—an appreciation for pushing boundaries and striving for greatness, just like the creators of such magnificent machines.",
"length": 496
},
head gutenberg-books-markdown-cleaned-fiction-paragraphs--filtered.json
{
"text": "\"Now there spoke old Louis XIV!\" laughed young Jerome Bonaparte. We both bowed, and he passed down with Annabel into the hall.",
"length": 126
},
{
"text": "\"You said you would like to hear my service in D flat--'Sharnall in D flat,' did you not? I will play it through to you now, if you care to listen. Of course, I can only give you the general effect, without voices, though, after all, I don't know that you won't get quite as good an idea of it as you could with any voices that we have here.\"",
"length": 342
},
but I'm now using 50% chapter / short-story data:
head ajibawa-2023-Stories-Collections--text-only.json
{
"text": "Once upon a time in the land of Policymia, there lived two leaders named Majora and Minoro. Their job was to make sure all the citizens had beautiful parks, clean water, and top-notch schools. But there were so many things to fix! How would they ever decide where to start?\n\nMajora, being the wise leader she was, knew just what to do. She invited her fellow policymakers for a big meeting at the Roundtable of Representatives. There, they discussed the most important problems Policymia faced. This was called identifying \"key policy areas.\" It meant figuring out which topics needed attention first.\n\nNext came assessing support – finding out if everyone agreed on the solutions. Some people thought building more playgrounds was the way to go, while others wanted better libraries. To understand everyone's thoughts, Majora used something called 'polling.' Just like taking a vote, polling helped her see what ideas were popular among her friends (the majority) and also those across the aisle (people who didn't belong to her political group).\n\nWhile talking to her friends and colleagues, Majora discovered that almost everyone loved science! And why not? Science could help create amazing inventions, protect nature, and even cure sicknesses. So, the policymakers decided to build a super cool SCIENCE CENTER right at the heart of Policymia!\n\nBut then, an unexpected problem popped up! A grumpy neighboring kingdom threatened to block Policymia's plans because they feared losing visitors to the new center. Oh no! However, instead of giving up, Majora saw this challenge as an opportunity. If they could work together with the grumpy neighbors, maybe both lands could benefit. That way, everybody wins, showing the true spirit of teamwork and collaboration in the world of policies!",
"length": 1789
},
{
"text": "In the bustling city of Brooklyn, there was a special movie screening for a new film called \"Cross Eyed.\" Billy and his friends were so excited to see it because they heard it was full of laughter and fun.\n\nAs soon as the movie started, they met peculiar characters who brought nonstop giggles. From talking animals to humans with unusual talents, each character had their own quirky charm. Every role, even those on the sidelines, felt important and added something special to the mix.\n\nBilly's favorite part of the movie involved little inventions called \"gadgets,\" created by a mad scientist named Professor Zoom. These gizmos popped up unexpectedly during scenes, adding layers of humor and excitement. With every appearance, the professor explained how these devices worked, teaching everyone about fascinating scientific principles.\n\nDuring recess, Billy couldn't stop thinking about the movie. He shared all he learned with his classmates, describing the gadgets and what they did. Together, they imagined creating their own silly contraptions while discussing forces, energy, and motion.\n\nUnfortunately, after hearing about the fantastic movie, none of Billy's friends could join him for a second viewing. Though disappointed, he realized that sharing knowledge with others can spread joy far beyond himself. Sometimes our discoveries don't turn out exactly as we hope, but learning valuable lessons along the way makes us grow stronger in both mind and spirit.",
"length": 1469
},
head gutenberg-books-markdown-cleaned-fiction-chapters--filtered.json
{
"text": "Emerson has written a discourse on friendship. It is beautifully worded, truly; it is full of a noble and high-minded philosophy. Doubtless it will appeal quite distinctly to those souls who, although yet on this earth-plane, have already partly cast off the mantle of flesh, and have found their paths to lie in the realm of spirit. Even to those, and it is by far the greater majority, who yet walk humdrumly along the world's great highway, the kingdom of the spirit perceived by them as in a glass darkly rather than by actual light shed upon them from its realm, it may bring some consolation during the absence of a friend. But for the general run of mankind it is set on too lofty a level. It lacks the warmth for which they crave, the personality and intercourse.\n\n\"I do then, with my friends as I do with my books,\" he says. \"I would have them where I can find them, but I seldom use them.\"\n\nNow, it is very certain that, for the majority of human beings, the friendliest books are worn 45 with much handling. If we picture for a moment the bookshelves belonging to our childish days, we shall at once mentally discover our old favourites. They have been used so often. They have been worn in our service. No matter how well we know the contents, we turn to them again and again; there is a very joy in knowing what to expect. Time does not age nor custom stale the infinite variety.\n\nThus it is in our childish days. And are not the majority of us still children? Should our favourite books be placed out of our reach, should it be impossible for us to turn their pages, it is certain that we would feel a loss, a gap. Were we old enough to comprehend Emerson's philosophy, we might endeavour to buoy ourselves up with the thought that thus we were at one with him in his nobility and loftiness of sentiment. And yet there would be something childish and pathetic in the endeavour, by reason of its very unreality. Certainly if Providence should, either directly or indirectly, separate us from our friends, by all means let us accept the separation bravely. It cannot destroy our friendship. But seldom to use our friends, from the apparently epicurean point of view of Emerson, would be a forced and unnatural doctrine to the majority, as unnatural as if a child should bury Hans Andersen's fairy tales for fear of tiring of them. It would savour more of present and actual distaste, than the love which fears its approach. There is the familiarity which 46 breeds contempt, truly; but there is also the familiarity which daily ties closer bonds, draws to closer union.\n\nAntony had established a friendship with the lady of the blue book. The book had been responsible for its beginning. With Emerson's definition of friendship he would probably have been largely in harmony; not so in his treatment of it. With the following, he would have been at one, with the exception of a word or so:--\"I must feel pride in my friend's accomplishments as if they were mine,--wild, delicate, throbbing property in his virtues. I feel as warmly when he is praised, as the lover when he hears applause of his engaged maiden. We over-estimate the conscience of our friend. His goodness seems better than our goodness, his nature finer, his temptations less. Everything that is his, his name, his form, his dress, books, and instruments, fancy enhances. 
Our own thought sounds new and larger from his mouth.\"\n\nMost true, Antony would have declared, if you will eliminate \"over-estimate,\" and substitute \"is\" for \"seems.\"\n\nUnlike Emerson, he made no attempt to analyse his friendship. He accepted it as a gift from the gods. Maybe somewhere in his inner consciousness, barely articulate even to his own heart, he dreamt of it as a foundation to something further. Yet for the present, the foundation sufficed. Death-letters--he laughed joyously at the coincidence--had 47 laid the first stone, and each day placed others in firm and secure position round it. The building was largely unconscious. It is the way with true friendship. The life, also, conduced to it. There are fewer barriers of convention on board ship than in any other mode of living. Mrs. Grundy, it is to be supposed, suffers from sea-sickness, and does not care for this method of travelling. In fact, it would appear that she seldom does travel, but chooses by preference small country towns, mainly English ones, for her place of residence.\n\nThe days were days of sunshine and colour, the changing colour of sea and sky; the nights were nights of mystery, veiled in purple, star-embroidered.\n\nOne day Pia made clear to him the explanation of her Irish colouring and her Italian surname. Her mother, she told him, was Irish; her father, English. Her baptismal name had been chosen by an Italian godmother. She was eighteen when she married the Duc di Donatello. On their wedding day, when driving from the church, the horses had bolted. She had been uninjured; he had received serious injuries to his head and spine. He had lived for seven years as a complete invalid, totally paralysed, but fully conscious. During those seven years, she had never left him. Two years previously he had died, and she had gone to live at her old home in England,--the Manor House, Woodleigh, which had been in the hands of 48 caretakers since her parents' death. Her husband's property had passed to his brother. The last six months she had been staying with a friend at Wynberg.\n\nShe told the little tale extremely simply. It never occurred to her to expect sympathy on account of the tragedy which had marred her youth, and by reason of which she had spent seven years of her life in almost utter seclusion. The fact was merely mentioned in necessary explanation of her story. Antony, too, had held silence. Sympathy on his part would have been somehow an intrusion, an impertinence. But he understood now, in part at least, the steady gravity, the hint of sadness in her eyes.\n\nThe name of Woodleigh awoke vague memories in his mind, but they were too vague to be noteworthy. Possibly, most probably, he told himself, he had merely read of the place at some time. She mentioned that it was in Devonshire, but curiously enough, and this was an omission which he noted later with some surprise, he never questioned her as to its exact locality.\n\nOn his side, he told her of his life on the veldt, and mentioned that he was returning to England on business. On the outcome of that same business would depend the question whether he remained in England, or whether he returned to the veldt. Having the solicitor's injunction in view, he naturally did not volunteer further information. Such details, too, sank into insignificance 49 before the more absorbing interest of personality. They are, after all, in a sense, mere accidents, and have no more to do with the real man than the clothes he wears. 
True, the manner in which one dons one's clothes, as the manner in which one deals with the accidental facts of life, affords a certain index to the true man; but the clothes themselves, and the accidental facts, appear, at all events, to be matters of fate. And if you can obtain knowledge of a man through actual contact with his personality, you do not trouble to draw conclusions from his method of donning his clothes. You may speculate in this fashion with regard to strangers, or mere acquaintances. You have a surer, and infinitely more interesting, fashion with your friends.\n\nLife around them moved on in the leisurely, almost indolent manner in which it does move on board a passenger ship. The younger members played quoits, cricket on the lower deck, and inaugurated concerts, supported by a gramaphone, the property of the chief officer, and banjo solos by the captain. The older members read magazines, played bridge, or knitted woollen articles, according to the promptings of their sex and their various natures, and formed audiences at the aforementioned concerts.\n\nAntony and the Duchessa di Donatello alone seemed somewhat aloof from them. They formed part of the concert audiences, it is true; but they 50 neither played bridge, quoits, nor cricket, nor knitted woollen articles, nor read magazines. The Duchessa employed her time with a piece of fine lace work, when she was not merely luxuriating in the sunshine, or conversing with Antony. Antony either conversed with the Duchessa, or sat in his deck chair, smoking and thinking about her. There was certainly a distinct sameness about the young man's occupation, which, however, he found not in the smallest degree boring. On the contrary, it was all-absorbing and fascinating. The very hours of the day were timed by the Duchessa's movements, rather than by the mere minute portions of steel attached to the face of a commonplace watch. Thus:--\n\nDawn. He realizes the Duchessa's existence when he wakes. (His dreams had been coloured by her, but that's beside the mark.)\n\nDaybreak. The Duchessa ascends on deck and smiles at him.\n\nBreakfast time. The Duchessa sits opposite to him.\n\nThe sunny morning hours. The Duchessa sews fine lace; she talks, she smiles,--the smile that radiates through the sadness of her eyes.\n\nAnd so on, throughout the day, till the evening gloaming brings a hint of further intimacy into their conversation, and night falls as she wishes him pleasant dreams before descending to her cabin.\n\nHe dwelt then, for the moment, solely in her 51 friendship, but vaguely the half articulate thought of the future began to stir within him, pulsing with a secret possibility of joy he barely dared to contemplate.\n\n***\n\n52",
"length": 9565
},
{
"text": "Zen hastened to manifest himself, complete with fourteen nostrils, before she could spoil everything. \"The procedure is most unorthodox,\" he murmured aloud, \"but truly this new incense has a most delicious aroma, extremely pleasing to My Ego. What is your will, oh, strangers?\"\n\n\"All-Merciful Zen,\" the princess pleaded, \"forgive them, for they knew not what they did. They did not mean to summon You.\"\n\n\"Then who,\" asked Zen in a terrible voice, \"is this wonderful smoke for? Some foreign god whom they worship on My Territory?\" And he wouldn't put it past them either.\n\nPeter looked at the anthropologist, but Kendrick was obviously too paralyzed with fright to speak. \"As a matter of fact, Your--er--Omnipotence,\" the physicist said haltingly, \"this is not part of our religious ritual. We burn this particular type of incense which we call tobacco, for our own pleasure.\"\n\n\"In other words,\" Zen said coldly, \"you worship yourselves. I work and slave My Godhood to the bone only to have egotists running all over My Planet.\"\n\n\"No, it's nothing like that at all,\" Kendrick quavered. \"We smoke the tobacco to--well--gratify our appetites. Like--like eating, you know.\"\n\n\"Well, you will have to forego that pleasure,\" Zen said, frowning terribly. Even the tall one cowered, he noted with appreciation. It had been a long time since people had really cringed before his frown. The Uxenach had come to take him too much for granted; they would learn their mistake. \"From now on,\" he said portentously, \"the tobacco must be reserved for My Use alone. Smoke it only for purposes of worship. Once a day will be sufficient,\" he added graciously, \"and perhaps twice on holy days.\"\n\n\"But we do not worship alien gods,\" Kendrick persisted in a shaky voice. \"Even if you *were* a god....\"\n\nZen frowned. \"Would you care to step outside and test my divinity?\"\n\n\"Well, no ... but....\"\n\n\"Then, as far as you're concerned, I am Divine, and let's have no more quibbling. Don't forget the tobacco once a day. About time I had a change from that low-grade incense.\"\n\nHe vanished. Too late he remembered that he'd planned to ask the Earthlings why they had come to Uxen, and to discuss a little business proposition with them. Oh, well, time for that at his next materialization for them. And, now that he considered the matter, the direct approach might very well be a mistake.\n\nHe hoped Iximi would make sure they burned him tobacco regularly--really good stuff; almost made godhood worthwhile. But then he'd felt that way about incense at first. No, he had other ideas for making divinity worthwhile, and Iximi was going to help him, even if she didn't know it. People had used him long enough; it was his turn to use them.",
"length": 2707
},
and have been really careful to normalise it so that every paragraph is separated by double newlines and no weird spaces at the front or end...
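The normalisation itself is nothing fancy - roughly this (a sketch of what was described, not the actual script):

```python
import re

def normalise(text: str) -> str:
    # Trim every paragraph and force exactly one blank line between paragraphs,
    # so there are no stray spaces at the start/end and only "\n\n" separators.
    paragraphs = [p.strip() for p in re.split(r"\n\s*\n", text) if p.strip()]
    return "\n\n".join(paragraphs)
```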
I'm also using 8 sets of data in total:
# =====================
# DATASET CONFIGURATION
# =====================
# POSITIVE CLASS DATA:
[[datasets]]
dataset_path = 'datasets/fiction-paragraphs/books/*.json'
control_class = 1
max_sequences = 15151 # floor(0.125*120000/(1−0.01))
drop_tails = true
[[datasets]]
dataset_path = 'datasets/fiction-paragraphs/gutenberg-books/*.json'
control_class = 1
max_sequences = 15151 # floor(0.125*120000/(1−0.01))
drop_tails = true
[[datasets]]
dataset_path = 'datasets/fiction-chapters/books/*.json'
control_class = 1
max_sequences = 15151 # floor(0.125*120000/(1−0.01))
[[datasets]]
dataset_path = 'datasets/fiction-chapters/gutenberg-books/*.json'
control_class = 1
max_sequences = 15151 # floor(0.125*120000/(1−0.01))
# NEGATIVE CLASS DATA:
[[datasets]]
dataset_path = 'datasets/fiction-paragraphs/ajibawa-2023-stories/*.json'
control_class = -1
max_sequences = 15151 # floor(0.125*120000/(1−0.01))
drop_tails = true
[[datasets]]
dataset_path = 'datasets/fiction-paragraphs/literotica-stories/*.json'
control_class = -1
max_sequences = 15151 # floor(0.125*120000/(1−0.01))
drop_tails = true
[[datasets]]
dataset_path = 'datasets/fiction-chapters/ajibawa-2023-stories/*.json'
control_class = -1
max_sequences = 15151 # floor(0.125*120000/(1−0.01))
[[datasets]]
dataset_path = 'datasets/fiction-chapters/literotica-stories/*.json'
control_class = -1
max_sequences = 15151 # floor(0.125*120000/(1−0.01))
so I'm really perplexed at where it can learn to screw up the ends of lines now.
https://huggingface.co/deepcogito/cogito-v2-preview-llama-405B
LOL, these model drops are getting silly now - as soon as I download one behemoth; 2 more spawn :D
https://huggingface.co/stepfun-ai/step3
Another one! Did everyone agree to drop models at the same time?