How did you do the "rewriting pipeline?"

#5
by marcuscedricridia - opened

This time the recipe focused on combining as many data sources as I possibly could, featuring synthetic data from Sonnet 3.5 + 3.7, ChatGPT 4o and DeepSeek. These then went through an extensive rewriting pipeline to eliminate common AI clichés, with the hopeful intent of providing you a fresh experience.

As a fellow creator, I like to learn from the veterans. Right now I'm focused on building my new model family: Qwill. It's basically just me trying to "fight" and remove the GPTisms, but you know how it is: most datasets today are synthetic, and that sterile, stereotypical LLM voice is hard to get rid of. So what I did was distill from Gemini and Gemma models, because from what people said and from my own experience, those were the models that wrote the closest to humans: no overuse of em dashes, words associated with AI were rarely used, and it's fairly easy to steer their behavior with a good system prompt. I chose Qwen3 as the models to finetune because they were the fastest to tune in Unsloth, given my $0 budget. But Qwen3 has reasoning toggles, so now I need to generate a dataset with both thinking and no-thinking samples! And I realized I can't keep generating thinking trajectories with these models: it hurts their performance because they weren't tuned for CoT, and I hit rate limits every time I try to generate a dataset bigger than 1,500 samples.

So I was thinking of borrowing your idea: finetuning the base model so I wouldn't need to generate reasoning traces or struggle to find reasoning-style creative writing and RP datasets. But one other thing caught my attention in this model: the rewriting of the datasets. I got interested in this technique, so if you could, please show me (or us) how you do this rewriting! 😀

Owner

Due to contractual obligations I cannot share the code of this pipeline, but I can explain the core of it. Or rather, have Sonnet explain it, 'cause y'know, modern times and all! The concept is quite basic: a simple dictionary of clichés drives detection, and most of the code is dedicated to the validation process, which ensures the rewritten versions don't deviate too much from the original.


The pipeline is built around an iterative rewriting process that:

  1. Identifies AI clichés in text using a comprehensive dictionary of common AI phrases/patterns, sorted by length to prioritize longer matches and prevent overlapping detection (a minimal detection sketch follows this list)

  2. Uses specialized LLMs to rewrite content - primarily Claude Sonnet (expensive) or Llama 3.3 Instruct 70B (budget); a rough prompt sketch follows this list - Personal note: Most models suck at targeted rewriting. Surprisingly, Llama 3.3 is really good at it for some obscure reason!

  3. Implements strict validation checks (sketched after this list) that ensure:

    • Original clichés are removed
    • No new clichés are introduced
    • Speech content remains unchanged
    • Formatting is preserved (asterisks, quotes, etc.)
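
To make step 1 concrete, here is a minimal sketch of what longest-first, non-overlapping cliché detection could look like. This is not the Owner's actual code: the phrase list is a tiny invented sample and `find_cliches` is a made-up name, but it illustrates the "sort by length, mark positions, skip overlaps" idea described above (and the position-marking system mentioned under "Unique Technical Approaches" below).

```python
# Sketch only: the real dictionary is far larger than these sample phrases.
CLICHES = [
    "a shiver ran down her spine",
    "barely above a whisper",
    "a testament to",
    "eyes sparkling with mischief",
]

def find_cliches(text: str) -> list[tuple[int, int, str]]:
    """Return (start, end, phrase) hits, longest phrases first, with no overlaps."""
    lowered = text.lower()
    taken = [False] * len(text)          # position marking to block overlapping matches
    hits = []
    for phrase in sorted(CLICHES, key=len, reverse=True):
        start = 0
        while (idx := lowered.find(phrase, start)) != -1:
            end = idx + len(phrase)
            if not any(taken[idx:end]):  # skip if a longer match already claimed this span
                hits.append((idx, end, phrase))
                for i in range(idx, end):
                    taken[i] = True
            start = end
    return sorted(hits)
```

Sorting the dictionary by length means a long phrase like "a shiver ran down her spine" claims its span before any shorter phrase nested inside it can be counted a second time.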
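The actual prompts for step 2 aren't public, so this is only a rough guess at the shape of a targeted-rewrite request. `call_model` is a placeholder for whatever client talks to Claude Sonnet or Llama 3.3 Instruct 70B, and the prompt wording is invented.

```python
def build_rewrite_prompt(text: str, cliches: list[str]) -> str:
    """Assemble a targeted-rewrite instruction (wording is illustrative only)."""
    listed = "\n".join(f"- {c}" for c in cliches)
    return (
        "Rewrite the passage below so it no longer contains these phrases:\n"
        f"{listed}\n\n"
        "Rules: keep all quoted speech word-for-word, preserve the original "
        "formatting (asterisks, quotes, markdown), and change as little else as possible.\n\n"
        f"Passage:\n{text}"
    )

def rewrite(text: str, cliches: list[str], call_model) -> str:
    # call_model is a stand-in for the Sonnet / Llama API request.
    return call_model(build_rewrite_prompt(text, cliches))
```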
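And a sketch of the validation in step 3, reusing the `find_cliches` helper from the detection sketch. The speech and formatting checks here are deliberately crude; the real pipeline presumably does something stricter.

```python
import re

def validate_rewrite(original: str, rewritten: str) -> bool:
    """Accept a rewrite only if it passes the four checks listed in step 3."""
    before = {phrase for _, _, phrase in find_cliches(original)}
    after = {phrase for _, _, phrase in find_cliches(rewritten)}
    if before & after:        # original clichés must be gone
        return False
    if after - before:        # no new clichés may be introduced
        return False
    # Quoted speech must survive verbatim.
    if re.findall(r'"[^"]*"', original) != re.findall(r'"[^"]*"', rewritten):
        return False
    # Formatting markers (asterisks, quotes) must be preserved in count.
    for marker in ('*', '"'):
        if original.count(marker) != rewritten.count(marker):
            return False
    return True
```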

Unique Technical Approaches

Some key innovations in the pipeline:

  1. Non-overlapping cliché detection - Uses a position-marking system to avoid double-counting overlapping clichés (the detection sketch above shows one way this can work)

  2. Format-aware processing - Detects whether text is in markdown or narrative format and applies the appropriate rewriting strategy (a crude heuristic is sketched after this list)

  3. Partial acceptance mechanism - If a rewrite removes some but not all clichés (and doesn't introduce new ones), it can be "partially accepted" and queued for further improvement (points 3 and 4 are sketched together after this list)

  4. Edit distance thresholds - Calculated dynamically based on the number of clichés to ensure changes are proportional to what needs fixing
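
A plausible reading of point 2, with a crude heuristic invented for illustration (the real format detection is not described in detail):

```python
def detect_format(text: str) -> str:
    """Guess whether a sample should get the 'markdown' or 'narrative' rewriting strategy."""
    markdown_signals = ("# ", "- ", "```", "|", ">")
    lines = [line for line in text.splitlines() if line.strip()]
    score = sum(1 for line in lines if line.lstrip().startswith(markdown_signals))
    return "markdown" if lines and score >= max(1, len(lines) // 10) else "narrative"
```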
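Points 3 and 4 might fit together roughly like this. The per-cliché budget, the labels, and the retry queue are all invented, and `difflib` stands in for whatever edit-distance measure the pipeline actually uses; it again reuses the `find_cliches` sketch.

```python
import difflib

def edit_distance_ok(original: str, rewritten: str, n_cliches: int,
                     per_cliche_budget: float = 0.08) -> bool:
    """Allow more change when more clichés need fixing (budget value is made up)."""
    allowed = min(0.9, n_cliches * per_cliche_budget)
    changed = 1.0 - difflib.SequenceMatcher(None, original, rewritten).ratio()
    return changed <= allowed

def triage(original: str, rewritten: str, retry_queue: list[str]) -> str:
    before = {p for _, _, p in find_cliches(original)}
    after = {p for _, _, p in find_cliches(rewritten)}
    if after - before:                                   # new clichés: reject outright
        return "rejected"
    if not edit_distance_ok(original, rewritten, len(before)):
        return "rejected"                                # changed too much for the fixes needed
    if not after:
        return "accepted"                                # every cliché removed
    if after < before:                                   # some removed, none added
        retry_queue.append(rewritten)                    # queue for another rewriting pass
        return "partially_accepted"
    return "rejected"
```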

Much appreciated!

marcuscedricridia changed discussion status to closed
