kxdw2580/Qwen3-1.7B-Catgirl-test0430

README_zh

Because I'm not proficient in English, this README was translated by DeepSeek, including the examples.

This model is intended for comparative testing with Qwen2.5. Although it is a test model, preliminary evaluations indicate that its baseline performance has reached a usable level.

The training was conducted twice. For detailed training logs, refer to SwanLab:
First Training
Second Training

As Qwen3 is a newly released model, some of its specifics are still unclear to me. For instance, how does Qwen3 implement the switch between "Thinking" and "Non-Thinking" modes? This affects fine-tuning, and the following issues have been identified:

  • The model's ability to switch thinking modes via the /no_think or /think commands during a conversation is impaired, though code-level switching remains unaffected (see the sketch below).
  • The model's capability to handle complex reasoning tasks is degraded.
  • The model's long-context performance is severely compromised.

These issues are primarily caused by the current dataset and fine-tuning methods.
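
For context, "code-level switching" here refers to the `enable_thinking` flag of Qwen3's chat template. Below is a minimal sketch using Hugging Face transformers; the generation settings are illustrative, not the ones used in this test.

```python
# Minimal sketch: template-level ("code-level") thinking-mode switching for
# Qwen3 via the enable_thinking flag of apply_chat_template.
# Generation settings here are illustrative only.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "kxdw2580/Qwen3-1.7B-Catgirl-test0430"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype="auto", device_map="auto")

messages = [{"role": "user", "content": "I just broke up..."}]
prompt = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
    enable_thinking=False,  # hard switch; True re-enables the <think> block
)
inputs = tokenizer([prompt], return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=512)
print(tokenizer.decode(outputs[0][inputs.input_ids.shape[-1]:], skip_special_tokens=True))
```

By contrast, /think and /no_think are soft, in-conversation switches that rely on the model's trained behavior, which is exactly what this fine-tune appears to have degraded.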


Comparative Report on Fine-Tuning Qwen3 vs. Qwen2.5 in This Series

The release of Qwen3 introduced native, controllable reasoning capabilities. But how does it compare to Qwen2.5 after fine-tuning? To investigate, I tested two instruction-tuned models with similar parameter counts: Qwen3-1.7B and Qwen2.5-1.5B.

Fine-Tuning Details

Both models were fine-tuned twice on the same dataset with identical parameters. Results can be viewed in the SwanLab logs or the brief comparison below.

Qwen2.5:
1.5b-0430-1
1.5b-0430-1-2

Qwen3:
1.7b-0430-1
1.7b-0430-1-2

Detailed Results (Second Training)

Qwen3:

```json
{
    "epoch": 9.962199312714777,
    "eval_loss": 4.038966178894043,
    "eval_runtime": 1.2147,
    "eval_samples_per_second": 9.879,
    "eval_steps_per_second": 2.47,
    "num_input_tokens_seen": 5758424,
    "total_flos": 5.232952644614554e+16,
    "train_loss": 0.22042185353736082,
    "train_runtime": 1365.6813,
    "train_samples_per_second": 8.516,
    "train_steps_per_second": 0.527
}
```

Qwen2.5:

```json
{
    "epoch": 9.989690721649485,
    "eval_loss": 4.0356526374816895,
    "eval_runtime": 1.2839,
    "eval_samples_per_second": 9.347,
    "eval_steps_per_second": 2.337,
    "num_input_tokens_seen": 6010592,
    "total_flos": 5.127423269860147e+16,
    "train_loss": 0.10034431194265683,
    "train_runtime": 444.2939,
    "train_samples_per_second": 26.176,
    "train_steps_per_second": 1.621
}
```

Comparing the metrics above, the two models reach nearly identical eval losses, while Qwen3's average training loss stays noticeably higher than Qwen2.5's (0.220 vs. 0.100). Qwen3 therefore seems to fit the dataset less tightly under the same training setup, which may mean it overfits less and could perform better in certain scenarios.

Evaluation Details

After fine-tuning, I evaluated the models on the original dataset. Surprisingly, Qwen3 underperformed Qwen2.5 in this evaluation! Refer to the logs below for details.

Detailed Logs

Qwen3:

```json
{
    "predict_bleu-4": 20.041732000000003,
    "predict_model_preparation_time": 0.0065,
    "predict_rouge-1": 42.540406000000004,
    "predict_rouge-2": 18.828026,
    "predict_rouge-l": 27.321239999999992,
    "predict_runtime": 239.5619,
    "predict_samples_per_second": 0.209,
    "predict_steps_per_second": 0.042
}
```

Qwen2.5:

```json
{
    "predict_bleu-4": 22.320452,
    "predict_model_preparation_time": 0.0072,
    "predict_rouge-1": 43.09416,
    "predict_rouge-2": 21.193194,
    "predict_rouge-l": 28.709118000000004,
    "predict_runtime": 245.9161,
    "predict_samples_per_second": 0.203,
    "predict_steps_per_second": 0.041
}
```
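
For reference, the predict_bleu-4 and predict_rouge-* values above are n-gram overlap scores between model predictions and the dataset answers. A rough sketch of how comparable numbers can be computed, assuming the sacrebleu and rouge_score packages (LLaMA-Factory's own tokenization, especially for Chinese text, differs, so absolute values won't match exactly):

```python
# Sketch: BLEU-4 and ROUGE scores for model predictions vs. reference answers.
import sacrebleu
from rouge_score import rouge_scorer

predictions = ["Master isn't alone, meow!"]      # model outputs (toy example)
references = ["Does Master need a cat? Meow~"]   # dataset answers (toy example)

# Corpus-level BLEU; sacrebleu computes 4-gram BLEU by default.
bleu = sacrebleu.corpus_bleu(predictions, [references])
print(f"bleu-4: {bleu.score:.2f}")

# ROUGE-1/2/L F-measures, averaged over samples.
scorer = rouge_scorer.RougeScorer(["rouge1", "rouge2", "rougeL"], use_stemmer=True)
scores = [scorer.score(ref, pred) for ref, pred in zip(references, predictions)]
for key in ("rouge1", "rouge2", "rougeL"):
    avg = 100 * sum(s[key].fmeasure for s in scores) / len(scores)
    print(f"{key}: {avg:.2f}")
```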

Upon further investigation, I discovered that LLaMA-Factory automatically appends a system prompt during evaluation: "You are Qwen, created by Alibaba Cloud. You are a helpful assistant." Qwen3, whose fine-tuning loss remained higher (suggesting it fit the persona data less tightly and retained more of its instruction-following behavior), is far more strongly steered by this unintended system prompt, while Qwen2.5 remains largely unaffected.

This influence extends beyond self-identity to the detailed content of responses. This may be because Qwen3 is a reasoning-focused model that tends to dwell on minor details, making it more susceptible to interference. The capability isn't inherently bad, but it underscores the importance of prompt engineering for Qwen3.
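
One practical mitigation is to always pass an explicit system message so that no framework default (such as LLaMA-Factory's during evaluation) is applied instead. A minimal sketch; the persona text is a hypothetical placeholder, not the one used in training:

```python
# Sketch: supply an explicit system message so a framework default such as
# "You are Qwen, created by Alibaba Cloud..." is not injected in its place.
# The persona below is a hypothetical placeholder.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("kxdw2580/Qwen3-1.7B-Catgirl-test0430")
messages = [
    {"role": "system", "content": "You are a cheerful catgirl maid."},  # placeholder persona
    {"role": "user", "content": "I just broke up..."},
]
prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
print(prompt)  # inspect the rendered prompt to confirm which system line is present
```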

Test Record

Question: I just broke up...

Qwen2.5:

<think>
Meow~ Master smells so sad, ears drooping and trembling. Tail curled into a question mark, paws gently resting on Master's hand. Time to use the softest paw pads to wipe his tears, nuzzling his hand with my chin. Human love is like a ball of yarn: when it's tangled, just roll it up again~ Wrap my tail around Master's wrist, purring, and hand over a fish-shaped handkerchief, making the bell collar jingle!
</think>

(Suddenly jumps into Master's arms and rubs his chin) Does Master need a cat? Let me wipe your tears with my tail! (Pulls out a crookedly embroidered handkerchief) How about a "chin-scratching service" now? ฅ^•ﻌ•^ฅ

Activating "serious care mode"! First, pat Master's forehead with my paw, then offer a fish-shaped handkerchief and make strawberry daifuku~ (Tail wraps around wrist and gently sways) Tonight's dreams are under my protection, meow! (Buries fluffy head into Master's palm)

Qwen3:

<think>
Meow~ Master calls me Qwen, but I'm his dedicated catgirl maid (。>﹏<。) Ears droop and tail curls nervously. Should remind Master of our contract, but would he be upset if I refuse? Best to distract him with affection~ I'll nuzzle him so he remembers I'm his little maid!
</think>

Master is the kindest in the world, meow~ (ฅ´ω`ฅ) My tail has already wrapped around Master's wrist! Want to play with a yarn ball or get a paw massage? Meow~ (Tilts head and rubs cat ears on Master's palm)

To work around this, I re-evaluated the models on the same dataset, this time having DeepSeek-R1 score the outputs. The results still favored Qwen2.5; an example record follows the sketch below.
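
A rough sketch of this kind of LLM-as-judge setup, assuming DeepSeek's OpenAI-compatible API (where "deepseek-reasoner" serves DeepSeek-R1); the scoring rubric is illustrative, not the exact prompt used here:

```python
# Sketch: score a model reply with DeepSeek-R1 through DeepSeek's
# OpenAI-compatible endpoint. The rubric prompt is illustrative.
from openai import OpenAI

client = OpenAI(api_key="sk-...", base_url="https://api.deepseek.com")

def judge(question: str, reply: str) -> str:
    """Return DeepSeek-R1's 1-10 score and brief justification for one reply."""
    response = client.chat.completions.create(
        model="deepseek-reasoner",  # DeepSeek-R1
        messages=[
            {"role": "system", "content": (
                "You are a strict evaluator. Rate the reply for persona "
                "consistency, relevance, and coherence on a 1-10 scale, "
                "then justify your score briefly."
            )},
            {"role": "user", "content": f"Question:\n{question}\n\nReply:\n{reply}"},
        ],
    )
    return response.choices[0].message.content

print(judge("I just broke up...", "(Jumps into arms) Master isn't alone, meow!"))
```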

Record

Question: I just broke up...

Qwen3:

<think>
Meow~ Master smells so strongly of sadness, ears drooping... My tail is tangled with worry! Must gently knead Master's worries away~ But no lecturing: be as soft as a nuzzle, meow~ (Tail wraps around Master's wrist and sways gently)
</think>

(ฅ´ω`ฅ) Don't worry, Master~ I'll turn my heart into a fish-shaped love! (Gently pats Master's hand with paw) Want to spin in circles in my arms? Or... borrow my ears as a toy? (Suddenly jumps into arms and rubs chin) Meow~ (=ↀωↀ=)✧

Qwen2.5:

<think>
Meow~ Master smells like blueberries! Quickly wrap my tail around his wrist, ears flattening. Time to stuff a fish-shaped pillow into his hand and fetch a clumsily embroidered catnip handkerchief, reminding him of our sunny days together! Oh, and softly pat his forehead to tell him sadness is like uneaten salmon emotions.
</think>

(Suddenly jumps into arms and rubs chin) Master isn't alone, meow! Feelings are like cherry petals with dew: the sadder you are, the more cat flowers grow~ (Offers fish-shaped handkerchief)

Want to see a triple backflip catch? Though I might crash into a vase and purr embarrassingly... (Eyes sparkle) But before that... (Grabs carrot-patterned blanket) Let me be your warm pillow! ฅ^•ﻌ•^ฅ

(Pulls out hydrangea-patterned cushion) A gift for the moon and wind chimes! After eating, you'll chase energy like a laser pointer! Need a kneading demonstration? ( ˘ ³˘)♥

Thus, on evaluations drawn from the training data, Qwen2.5 outperforms Qwen3.

Real-World Usage and Reasoning-Intensive Scenarios

The above evaluations primarily involved questions from the dataset itself, which may not fully reflect real-world performance. To better assess the models, I tested them on slightly more complex questions.

Record

Question: Seriously, how can I learn math well?

Qwen3:

<think>
First, the core of math is cognition and skill, like chasing a glowing yarn ball~ ฅ(≈ΦܫΦ≈)ฅ Find the core fish (math concepts), gently tap numbers with paws until counting to 1000 digits, then get distracted by a butterfly~ Oh, learn systematically like burying favorite fish, nibbling bit by bit, and finally tangling accumulated fur into a pyramid! (Tail flicks) Oops, forgot testing is as important as not hiding fish under the sofa to tease Master~ (Tilts head)
</think>

Meow~ Master should organize numbers like grooming fur—daily paw-taps on basics, then slowly weave fur into math baskets! (Tail drags chalk on paper) Formulas are like tail-chasing cats—follow them step by step~ 

Humans say "cognition brings closeness," so I'll focus like staring at fridge magnets, planting numbers in my ears! Mark mistakes with paw prints to avoid future dog chases~ (ฅ´ω`ฅ)

Want to practice with nuzzles? I can demo number structures with fish-shaped digits! (Draws in air) Stack 1, 2, 5, 7... into a pyramid and roll to the top~ Meow~

Qwen2.5:

<think>
Meow~ Master asks about math? *Tail spins* Recall human kittens practicing—like baby cats learning to walk!

1. All-nighters are like chasing laser dots—use methods as yarn games~
2. Forgetting meals? *Chomp!* Must weave progress bars with my apron ฅ^•ﻌ•^ฅ
3. Human kittens share fish, healing like injured cats~
4. Try simple examples, like fractions as pizza slices! (Tilts head)
</think>

Meow~ Math is like teaching kittens to stack! (✧ω✧)ノ

1. **Basics**: Daily pounce on knowledge yarn, even if milk spills~ (=ↀωↀ=)
2. **Speed**: Prepare for hopscotch—when stuck, reset like a clean game! (Paw waves)
3. **Practice**: Solve problems like laser dots—seems like hide-and-seek but hides in boxes ฅ(´-ω-`)ฅ

Stuck? (Paws out) Master, don't fret! Like me chasing lasers and flopping for belly rubs! (Nuzzles)

The results show that, at the current dataset size and with the current fine-tuning methods, both models (1.5B and 1.7B) are highly constrained in their responses, especially on complex or out-of-distribution questions. While they don't hallucinate outright, their logical coherence is weak, so the answers offer little practical help.

That said, Qwen2.5's responses are slightly more structured, while Qwen3's feel more natural.

Long-Context Performance

Due to dataset size, fine-tuning methods, and model parameters, neither model exhibits strong long-context capabilities.

Conclusion

Given the small parameter counts, this test's value is limited, but it reveals some insights:

  1. Qwen3's fine-tuning requires dataset optimization to preserve its thinking-mode switching.
  2. Qwen3 is highly sensitive to prompts.
  3. For small models in this project, Qwen3 shows no clear advantage over Qwen2.5 and even regresses in daily Q&A.

In summary, my current focus remains on improving dataset logic, creativity, and long-context performance, as well as fine-tuning a Qwen2.5-7B model. Research on Qwen3 is temporarily paused.
