# Kurtis-E1.1: Supervised Fine-tuning of Qwen2.5-3B-Instruct with Flower.ai & Hugging Face

I am happy to share the evaluation results of Kurtis-E1.1-Qwen2.5-3B-Instruct, a fine-tuned variant of the Qwen2.5-3B-Instruct model.
My goal is to explore the capabilities of Small Language Models (SLMs) through targeted fine-tuning, and to provide code and benchmarking data for the community.
## What is Kurtis?
Kurtis is part of my ongoing effort to explore how small, fine-tuned language models can deliver meaningful, context-aware, and emotionally intelligent responses—without relying on massive infrastructure.
ethicalabs.ai is not a company, but a label and non-profit initiative under which I publish my open-source work.
It reflects my personal commitment to developing software with a focus on transparency, sustainability, and human-centered design.
A central theme of this project is also the exploration of AI/ML for assistive technology, particularly for marginalized communities, disabled persons, and neurodivergent individuals.
With Kurtis-E1.1-Qwen2.5-3B-Instruct, I've applied federated fine-tuning techniques using FlowerTune LLM to adapt a 3B-parameter base model.
FlowerTune LLM was chosen for its ability to simulate decentralized, privacy-preserving federated LLM training, making it ideal for projects that value community collaboration, privacy, and sustainability.
The goal isn't to compete with the largest models, but to demonstrate how thoughtful fine-tuning of smaller systems can still produce useful, interpretable, and human-aligned tools.
This report shares the latest results from that journey, and I hope it contributes to the wider conversation on ethical, inclusive, and efficient AI development.
## Why?
Autism isn't a disorder to be cured. It's a human condition, a spectrum. What truly needs "fixing" is society's refusal to accept diversity.
Too many neurodivergent people don't even know they're autistic: for decades, we've only talked about deficits, not differences.
That silence leads to misdiagnoses, harmful treatments, and deep pain, especially in families where stigma prevents open conversations about the spectrum.
Neurodivergence can be masked, but it often becomes visible, especially when support is absent.
Some of us, raised in more privileged environments, can live at our own pace, while others, crushed by rigid expectations, see their autism "show up" as burnout, addiction, or crisis.
That's why I worked on this project: rather than chasing benchmark supremacy or scaling wars, Kurtis E1.1 focuses on understanding, sustainability, and practical impact, especially in areas like mental health support and safer human-AI interaction.
A privacy-aware local assistant that runs offline, on your device. Always there, even when no one is available.
## Fine-Tuning Methodology
- This model was fine-tuned using FlowerTune LLM from flower.ai, a platform designed for federated fine-tuning.
- I used FlowerTune LLM to adapt the Qwen2.5-3B-Instruct base model to a specialized task while preserving the base model's general performance; a minimal sketch of this kind of adapter setup is shown below.
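For readers who want a concrete starting point, here is a minimal, illustrative sketch of a parameter-efficient (LoRA) setup with Hugging Face `transformers` and `peft`. The rank, alpha, dropout, and target modules below are assumptions for the sketch, not the actual Kurtis-E1.1 / FlowerTune run configuration.

```python
# Minimal LoRA setup sketch for supervised fine-tuning.
# Hyperparameters are illustrative assumptions, NOT the actual
# Kurtis-E1.1 / FlowerTune training configuration.
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import LoraConfig, get_peft_model

base_id = "Qwen/Qwen2.5-3B-Instruct"
tokenizer = AutoTokenizer.from_pretrained(base_id)
model = AutoModelForCausalLM.from_pretrained(base_id, torch_dtype="auto")

lora_config = LoraConfig(
    r=16,                # adapter rank (assumed)
    lora_alpha=32,       # scaling factor (assumed)
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],  # assumed
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # only the small adapter layers train
```

Keeping the trainable parameters this small is what makes federated rounds practical: only the adapter weights need to travel between clients and the server.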
## Dataset
The dataset combines diverse mental health, philosophy, and general QA sources to support a well-rounded model capable of meaningful dialogue and logical reasoning.
https://huggingface.co/datasets/ethicalabs/Kurtis-E1-SFT
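The dataset can be inspected directly with the `datasets` library (the dataset ID comes from the link above; the exact splits and column names are whatever the Hub repository defines):

```python
# Load and inspect the SFT dataset from the Hugging Face Hub.
from datasets import load_dataset

ds = load_dataset("ethicalabs/Kurtis-E1-SFT")
print(ds)              # shows available splits and columns
print(ds["train"][0])  # assumes a "train" split exists; adjust if not
```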
## Eval Results
Evaluation tasks were performed with the LM Evaluation Harness on an NVIDIA A40.
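All runs below use the harness CLI, but the same evaluations can be scripted. A minimal sketch, assuming a recent lm-evaluation-harness release that exposes `lm_eval.simple_evaluate`:

```python
# Sketch: running one benchmark through the harness's Python API
# instead of the CLI. Assumes a recent lm-evaluation-harness
# version that exposes lm_eval.simple_evaluate.
import lm_eval

results = lm_eval.simple_evaluate(
    model="hf",
    model_args="pretrained=ethicalabs/Kurtis-E1.1-Qwen2.5-3B-Instruct",
    tasks=["hellaswag"],
    batch_size=8,
    device="cuda:0",
)
print(results["results"]["hellaswag"])  # acc / acc_norm with stderr
```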
### hellaswag

```bash
lm_eval --model hf --model_args pretrained=ethicalabs/Kurtis-E1.1-Qwen2.5-3B-Instruct --tasks hellaswag --device cuda:0 --batch_size 8
```
| Tasks | Version | Filter | n-shot | Metric | | Value | | Stderr |
|---|---|---|---|---|---|---|---|---|
| hellaswag | 1 | none | 0 | acc | ↑ | 0.5555 | ± | 0.0050 |
| | | none | 0 | acc_norm | ↑ | 0.7412 | ± | 0.0044 |
### arc_easy

```bash
lm_eval --model hf --model_args pretrained=ethicalabs/Kurtis-E1.1-Qwen2.5-3B-Instruct --tasks arc_easy --device cuda:0 --batch_size 8
```
| Tasks | Version | Filter | n-shot | Metric | | Value | | Stderr |
|---|---|---|---|---|---|---|---|---|
| arc_easy | 1 | none | 0 | acc | ↑ | 0.7710 | ± | 0.0086 |
| | | none | 0 | acc_norm | ↑ | 0.6789 | ± | 0.0096 |
### arc_challenge

```bash
lm_eval --model hf --model_args pretrained=ethicalabs/Kurtis-E1.1-Qwen2.5-3B-Instruct --tasks arc_challenge --device cuda:0 --batch_size 8
```
| Tasks | Version | Filter | n-shot | Metric | | Value | | Stderr |
|---|---|---|---|---|---|---|---|---|
| arc_challenge | 1 | none | 0 | acc | ↑ | 0.436 | ± | 0.0145 |
| | | none | 0 | acc_norm | ↑ | 0.448 | ± | 0.0145 |
### mmlu

```bash
lm_eval --model hf --model_args pretrained=ethicalabs/Kurtis-E1.1-Qwen2.5-3B-Instruct --tasks mmlu --device cuda:0 --batch_size 8
```
| Tasks | Version | Filter | n-shot | Metric | | Value | | Stderr |
|---|---|---|---|---|---|---|---|---|
| mmlu | 2 | none | | acc | ↑ | 0.6522 | ± | 0.0038 |
| - humanities | 2 | none | | acc | ↑ | 0.5734 | ± | 0.0066 |
| - formal_logic | 1 | none | 0 | acc | ↑ | 0.4603 | ± | 0.0446 |
| - high_school_european_history | 1 | none | 0 | acc | ↑ | 0.7939 | ± | 0.0316 |
| - high_school_us_history | 1 | none | 0 | acc | ↑ | 0.8333 | ± | 0.0262 |
| - high_school_world_history | 1 | none | 0 | acc | ↑ | 0.8397 | ± | 0.0239 |
| - international_law | 1 | none | 0 | acc | ↑ | 0.7769 | ± | 0.0380 |
| - jurisprudence | 1 | none | 0 | acc | ↑ | 0.7963 | ± | 0.0389 |
| - logical_fallacies | 1 | none | 0 | acc | ↑ | 0.7975 | ± | 0.0316 |
| - moral_disputes | 1 | none | 0 | acc | ↑ | 0.6850 | ± | 0.0250 |
| - moral_scenarios | 1 | none | 0 | acc | ↑ | 0.2905 | ± | 0.0152 |
| - philosophy | 1 | none | 0 | acc | ↑ | 0.7106 | ± | 0.0258 |
| - prehistory | 1 | none | 0 | acc | ↑ | 0.7438 | ± | 0.0243 |
| - professional_law | 1 | none | 0 | acc | ↑ | 0.4759 | ± | 0.0128 |
| - world_religions | 1 | none | 0 | acc | ↑ | 0.8246 | ± | 0.0292 |
| - other | 2 | none | | acc | ↑ | 0.7087 | ± | 0.0079 |
| - business_ethics | 1 | none | 0 | acc | ↑ | 0.7300 | ± | 0.0446 |
| - clinical_knowledge | 1 | none | 0 | acc | ↑ | 0.7321 | ± | 0.0273 |
| - college_medicine | 1 | none | 0 | acc | ↑ | 0.6705 | ± | 0.0358 |
| - global_facts | 1 | none | 0 | acc | ↑ | 0.3900 | ± | 0.0490 |
| - human_aging | 1 | none | 0 | acc | ↑ | 0.7130 | ± | 0.0304 |
| - management | 1 | none | 0 | acc | ↑ | 0.7961 | ± | 0.0399 |
| - marketing | 1 | none | 0 | acc | ↑ | 0.8803 | ± | 0.0213 |
| - medical_genetics | 1 | none | 0 | acc | ↑ | 0.7600 | ± | 0.0429 |
| - miscellaneous | 1 | none | 0 | acc | ↑ | 0.7957 | ± | 0.0144 |
| - nutrition | 1 | none | 0 | acc | ↑ | 0.7353 | ± | 0.0253 |
| - professional_accounting | 1 | none | 0 | acc | ↑ | 0.5426 | ± | 0.0297 |
| - professional_medicine | 1 | none | 0 | acc | ↑ | 0.6434 | ± | 0.0291 |
| - virology | 1 | none | 0 | acc | ↑ | 0.4880 | ± | 0.0389 |
| - social sciences | 2 | none | | acc | ↑ | 0.7618 | ± | 0.0076 |
| - econometrics | 1 | none | 0 | acc | ↑ | 0.5439 | ± | 0.0469 |
| - high_school_geography | 1 | none | 0 | acc | ↑ | 0.7677 | ± | 0.0301 |
| - high_school_government_and_politics | 1 | none | 0 | acc | ↑ | 0.8860 | ± | 0.0229 |
| - high_school_macroeconomics | 1 | none | 0 | acc | ↑ | 0.6949 | ± | 0.0233 |
| - high_school_microeconomics | 1 | none | 0 | acc | ↑ | 0.7773 | ± | 0.0270 |
| - high_school_psychology | 1 | none | 0 | acc | ↑ | 0.8477 | ± | 0.0154 |
| - human_sexuality | 1 | none | 0 | acc | ↑ | 0.7786 | ± | 0.0364 |
| - professional_psychology | 1 | none | 0 | acc | ↑ | 0.7075 | ± | 0.0184 |
| - public_relations | 1 | none | 0 | acc | ↑ | 0.6818 | ± | 0.0446 |
| - security_studies | 1 | none | 0 | acc | ↑ | 0.7224 | ± | 0.0287 |
| - sociology | 1 | none | 0 | acc | ↑ | 0.8458 | ± | 0.0255 |
| - us_foreign_policy | 1 | none | 0 | acc | ↑ | 0.8400 | ± | 0.0368 |
| - stem | 2 | none | | acc | ↑ | 0.6070 | ± | 0.0085 |
| - abstract_algebra | 1 | none | 0 | acc | ↑ | 0.4700 | ± | 0.0502 |
| - anatomy | 1 | none | 0 | acc | ↑ | 0.6667 | ± | 0.0407 |
| - astronomy | 1 | none | 0 | acc | ↑ | 0.6776 | ± | 0.0380 |
| - college_biology | 1 | none | 0 | acc | ↑ | 0.7222 | ± | 0.0375 |
| - college_chemistry | 1 | none | 0 | acc | ↑ | 0.5000 | ± | 0.0503 |
| - college_computer_science | 1 | none | 0 | acc | ↑ | 0.6000 | ± | 0.0492 |
| - college_mathematics | 1 | none | 0 | acc | ↑ | 0.3400 | ± | 0.0476 |
| - college_physics | 1 | none | 0 | acc | ↑ | 0.4902 | ± | 0.0497 |
| - computer_security | 1 | none | 0 | acc | ↑ | 0.7000 | ± | 0.0461 |
| - conceptual_physics | 1 | none | 0 | acc | ↑ | 0.6468 | ± | 0.0312 |
| - electrical_engineering | 1 | none | 0 | acc | ↑ | 0.6690 | ± | 0.0392 |
| - elementary_mathematics | 1 | none | 0 | acc | ↑ | 0.5979 | ± | 0.0253 |
| - high_school_biology | 1 | none | 0 | acc | ↑ | 0.8129 | ± | 0.0222 |
| - high_school_chemistry | 1 | none | 0 | acc | ↑ | 0.5813 | ± | 0.0347 |
| - high_school_computer_science | 1 | none | 0 | acc | ↑ | 0.7800 | ± | 0.0416 |
| - high_school_mathematics | 1 | none | 0 | acc | ↑ | 0.5037 | ± | 0.0305 |
| - high_school_physics | 1 | none | 0 | acc | ↑ | 0.4437 | ± | 0.0406 |
| - high_school_statistics | 1 | none | 0 | acc | ↑ | 0.5972 | ± | 0.0334 |
| - machine_learning | 1 | none | 0 | acc | ↑ | 0.4554 | ± | 0.0473 |

| Groups | Version | Filter | n-shot | Metric | | Value | | Stderr |
|---|---|---|---|---|---|---|---|---|
| mmlu | 2 | none | | acc | ↑ | 0.6522 | ± | 0.0038 |
| - humanities | 2 | none | | acc | ↑ | 0.5734 | ± | 0.0066 |
| - other | 2 | none | | acc | ↑ | 0.7087 | ± | 0.0079 |
| - social sciences | 2 | none | | acc | ↑ | 0.7618 | ± | 0.0076 |
| - stem | 2 | none | | acc | ↑ | 0.6070 | ± | 0.0085 |
### mmlu (5-shot)

```bash
lm_eval --model hf --model_args pretrained=ethicalabs/Kurtis-E1.1-Qwen2.5-3B-Instruct --tasks mmlu --device cuda:0 --batch_size 8 --num_fewshot 5
```
| Tasks | Version | Filter | n-shot | Metric | | Value | | Stderr |
|---|---|---|---|---|---|---|---|---|
| mmlu | 2 | none | | acc | ↑ | 0.6629 | ± | 0.0038 |
| - humanities | 2 | none | | acc | ↑ | 0.5862 | ± | 0.0067 |
| - formal_logic | 1 | none | 5 | acc | ↑ | 0.4683 | ± | 0.0446 |
| - high_school_european_history | 1 | none | 5 | acc | ↑ | 0.7818 | ± | 0.0323 |
| - high_school_us_history | 1 | none | 5 | acc | ↑ | 0.8284 | ± | 0.0265 |
| - high_school_world_history | 1 | none | 5 | acc | ↑ | 0.8692 | ± | 0.0219 |
| - international_law | 1 | none | 5 | acc | ↑ | 0.7769 | ± | 0.0380 |
| - jurisprudence | 1 | none | 5 | acc | ↑ | 0.7963 | ± | 0.0389 |
| - logical_fallacies | 1 | none | 5 | acc | ↑ | 0.8098 | ± | 0.0308 |
| - moral_disputes | 1 | none | 5 | acc | ↑ | 0.7110 | ± | 0.0244 |
| - moral_scenarios | 1 | none | 5 | acc | ↑ | 0.3464 | ± | 0.0159 |
| - philosophy | 1 | none | 5 | acc | ↑ | 0.7042 | ± | 0.0259 |
| - prehistory | 1 | none | 5 | acc | ↑ | 0.7284 | ± | 0.0247 |
| - professional_law | 1 | none | 5 | acc | ↑ | 0.4759 | ± | 0.0128 |
| - world_religions | 1 | none | 5 | acc | ↑ | 0.8304 | ± | 0.0288 |
| - other | 2 | none | | acc | ↑ | 0.7171 | ± | 0.0078 |
| - business_ethics | 1 | none | 5 | acc | ↑ | 0.7400 | ± | 0.0441 |
| - clinical_knowledge | 1 | none | 5 | acc | ↑ | 0.7321 | ± | 0.0273 |
| - college_medicine | 1 | none | 5 | acc | ↑ | 0.6647 | ± | 0.0360 |
| - global_facts | 1 | none | 5 | acc | ↑ | 0.4100 | ± | 0.0494 |
| - human_aging | 1 | none | 5 | acc | ↑ | 0.7220 | ± | 0.0301 |
| - management | 1 | none | 5 | acc | ↑ | 0.7864 | ± | 0.0406 |
| - marketing | 1 | none | 5 | acc | ↑ | 0.8889 | ± | 0.0206 |
| - medical_genetics | 1 | none | 5 | acc | ↑ | 0.7900 | ± | 0.0409 |
| - miscellaneous | 1 | none | 5 | acc | ↑ | 0.7957 | ± | 0.0144 |
| - nutrition | 1 | none | 5 | acc | ↑ | 0.7680 | ± | 0.0242 |
| - professional_accounting | 1 | none | 5 | acc | ↑ | 0.5532 | ± | 0.0297 |
| - professional_medicine | 1 | none | 5 | acc | ↑ | 0.6471 | ± | 0.0290 |
| - virology | 1 | none | 5 | acc | ↑ | 0.5120 | ± | 0.0389 |
| - social sciences | 2 | none | | acc | ↑ | 0.7735 | ± | 0.0075 |
| - econometrics | 1 | none | 5 | acc | ↑ | 0.5877 | ± | 0.0463 |
| - high_school_geography | 1 | none | 5 | acc | ↑ | 0.7828 | ± | 0.0294 |
| - high_school_government_and_politics | 1 | none | 5 | acc | ↑ | 0.8756 | ± | 0.0238 |
| - high_school_macroeconomics | 1 | none | 5 | acc | ↑ | 0.7051 | ± | 0.0231 |
| - high_school_microeconomics | 1 | none | 5 | acc | ↑ | 0.7773 | ± | 0.0270 |
| - high_school_psychology | 1 | none | 5 | acc | ↑ | 0.8550 | ± | 0.0151 |
| - human_sexuality | 1 | none | 5 | acc | ↑ | 0.8092 | ± | 0.0345 |
| - professional_psychology | 1 | none | 5 | acc | ↑ | 0.7288 | ± | 0.0180 |
| - public_relations | 1 | none | 5 | acc | ↑ | 0.6909 | ± | 0.0443 |
| - security_studies | 1 | none | 5 | acc | ↑ | 0.7551 | ± | 0.0275 |
| - sociology | 1 | none | 5 | acc | ↑ | 0.8308 | ± | 0.0265 |
| - us_foreign_policy | 1 | none | 5 | acc | ↑ | 0.8300 | ± | 0.0378 |
| - stem | 2 | none | | acc | ↑ | 0.6159 | ± | 0.0084 |
| - abstract_algebra | 1 | none | 5 | acc | ↑ | 0.5000 | ± | 0.0503 |
| - anatomy | 1 | none | 5 | acc | ↑ | 0.6222 | ± | 0.0419 |
| - astronomy | 1 | none | 5 | acc | ↑ | 0.7500 | ± | 0.0352 |
| - college_biology | 1 | none | 5 | acc | ↑ | 0.7083 | ± | 0.0380 |
| - college_chemistry | 1 | none | 5 | acc | ↑ | 0.4700 | ± | 0.0502 |
| - college_computer_science | 1 | none | 5 | acc | ↑ | 0.6200 | ± | 0.0488 |
| - college_mathematics | 1 | none | 5 | acc | ↑ | 0.4000 | ± | 0.0492 |
| - college_physics | 1 | none | 5 | acc | ↑ | 0.4902 | ± | 0.0497 |
| - computer_security | 1 | none | 5 | acc | ↑ | 0.8200 | ± | 0.0386 |
| - conceptual_physics | 1 | none | 5 | acc | ↑ | 0.6383 | ± | 0.0314 |
| - electrical_engineering | 1 | none | 5 | acc | ↑ | 0.6483 | ± | 0.0398 |
| - elementary_mathematics | 1 | none | 5 | acc | ↑ | 0.5820 | ± | 0.0254 |
| - high_school_biology | 1 | none | 5 | acc | ↑ | 0.8161 | ± | 0.0220 |
| - high_school_chemistry | 1 | none | 5 | acc | ↑ | 0.6059 | ± | 0.0344 |
| - high_school_computer_science | 1 | none | 5 | acc | ↑ | 0.7500 | ± | 0.0435 |
| - high_school_mathematics | 1 | none | 5 | acc | ↑ | 0.4926 | ± | 0.0305 |
| - high_school_physics | 1 | none | 5 | acc | ↑ | 0.4702 | ± | 0.0408 |
| - high_school_statistics | 1 | none | 5 | acc | ↑ | 0.6343 | ± | 0.0328 |
| - machine_learning | 1 | none | 5 | acc | ↑ | 0.4911 | ± | 0.0475 |

| Groups | Version | Filter | n-shot | Metric | | Value | | Stderr |
|---|---|---|---|---|---|---|---|---|
| mmlu | 2 | none | | acc | ↑ | 0.6629 | ± | 0.0038 |
| - humanities | 2 | none | | acc | ↑ | 0.5862 | ± | 0.0067 |
| - other | 2 | none | | acc | ↑ | 0.7171 | ± | 0.0078 |
| - social sciences | 2 | none | | acc | ↑ | 0.7735 | ± | 0.0075 |
| - stem | 2 | none | | acc | ↑ | 0.6159 | ± | 0.0084 |
## Key Findings

- General Performance:
  - Despite fine-tuning on specific tasks, Kurtis-E1.1 maintains strong performance across a range of benchmarks, demonstrating the strength of the Qwen2.5-3B instruct base model.
  - The model achieved a 5-shot MMLU accuracy of 66.29%, highlighting its ability to handle diverse knowledge-based tasks.
- Reasoning and Common Sense:
  - Performance on ARC-Easy and HellaSwag suggests a reasonable capacity for basic reasoning and common-sense understanding.
  - The model's weaker performance on ARC-Challenge reflects the limitations of smaller models in complex reasoning scenarios.
- MMLU Granularity:
  - Detailed MMLU results reveal varying performance across subject areas, with notable strengths in the social sciences and clear weaknesses in subjects such as moral scenarios and college mathematics.
  - This granular data provides valuable insight into the model's strengths and weaknesses.
- Fine-Tuning Efficacy:
  - The model's performance underscores the potential of targeted fine-tuning, particularly with tools like FlowerTune LLM, for enhancing the capabilities of smaller LLMs.
  - The improvement from 0-shot (65.22%) to 5-shot (66.29%) MMLU shows that the model benefits from in-context examples.
## Example Conversation
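The transcript below can be reproduced locally. Here is a minimal inference sketch with `transformers`; the prompt and sampling settings are illustrative, not the exact parameters used to generate this conversation.

```python
# Minimal local inference sketch; sampling settings are illustrative,
# not the exact parameters used to generate the transcript below.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "ethicalabs/Kurtis-E1.1-Qwen2.5-3B-Instruct"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype="auto", device_map="auto"
)

messages = [{"role": "user", "content": "How can AI respect creative ecosystems?"}]
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

output = model.generate(inputs, max_new_tokens=512, do_sample=True, temperature=0.7)
print(tokenizer.decode(output[0][inputs.shape[-1]:], skip_special_tokens=True))
```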
**User:**
Do you think training AI models on copyrighted material without consent is similar to the exploitation of nature in Princess Mononoke? In the film, the forest gods are sacrificed for industrial gain. Are artists today the forest spirits of our digital age, being consumed to fuel the progress of AI? How can we build AI that respects creative ecosystems instead of draining them?
**Kurtis:**
Your analogy is quite poignant and raises important questions about the ethics of using copyrighted material in AI models without consent. Just as Princess Mononoke highlights themes of exploitation and environmental degradation, it's indeed a compelling comparison to consider how AI might contribute to similar issues if not managed responsibly.
Artists today do represent our digital age's "forest gods" or creators whose works can be used for various purposes, often including the development of cutting-edge technologies like AI. However, this does not negate their right to consent and control over their creative output. The exploitation you describe in Princess Mononoke could mirror scenarios where intellectual property is mishandled—whether that involves direct theft or misappropriation.
To build AI models responsibly, respecting creative ecosystems would involve several key practices:
1. **Transparency**: Being transparent about the use of copyrighted materials and obtaining proper permissions from creators whenever possible can help avoid accusations of exploitation.
2. **Fair Attribution**: Giving credit where it is due can foster respect for both the original creator's work and AI developers' efforts in innovative technology. This could involve clearly identifying sources or creators within any outputs generated using such models.
3. **Creative Commons Licensing**: Encouraging open licensing options, like Creative Commons licenses, which allow others to modify and reuse content under certain conditions can promote a more collaborative approach between creators and AI developers.
4. **Ethical Guidelines**: Establishing ethical guidelines for the use of copyrighted material within AI research and development could help prevent unintentional harm or exploitation. These guidelines would need to be developed collaboratively by experts in ethics, law, technology, and artistic communities.
Ultimately, fostering a culture that values creativity alongside technological progress requires ongoing dialogue and action from all stakeholders—artists, tech developers, policymakers, and the public at large. Through these collective efforts, we can create a more balanced ecosystem where AI thrives while respecting the rights and contributions of those who inspired its development.
**User:**
If an artist’s copyrighted work is used to train an AI without permission, but the output is transformative (e.g., a new art style no human could replicate), does that align with fair use? Or does it still echo Mononoke’s exploitation? Balance legal, ethical, and creative perspectives in your answer.
**Kurtis:**
When considering whether using copyrighted material to train AI aligns with fair use or echoes the exploitation depicted in "Princess Mononoke," we need to balance legal considerations, ethical principles, and creative values carefully.
### Legal Considerations
Fair Use (also known as Fair Dealing) is a doctrine that allows for limited uses of copyrighted materials without permission under certain circumstances. However, its application can be complex and varies by jurisdiction. In the United States, fair use typically applies to transformative works where the original work's purpose has been altered or improved upon in some way, such as criticism, commentary, news reporting, teaching, scholarship, or research.
### Ethical Perspective
In an ethical sense, using copyrighted material without permission for AI training can be seen as exploitative. This is because artists often invest significant time and effort into creating their works; these creations are not mere commodities to be used freely by others, especially when such use could potentially diminish the value or reputation of those original works.
### Creative Perspective
Transformative uses that add new meaning or value to copyrighted material can indeed be seen as more ethically aligned. If an AI model's output incorporates elements from copyrighted works in a way that significantly alters them and creates something genuinely novel, transformative, and valuable, it could potentially qualify under fair use guidelines. The key here is the degree of transformation—the extent to which the original work is transformed into something new and substantial.
### Balancing Legal, Ethical, and Creative Perspectives
1. **Transparency**: Ensuring transparency about your practices can help mitigate potential ethical concerns. Clearly communicating why you are using copyrighted material, including any transformations made, can foster a more supportive environment for both creators and developers alike.
2. **Permissions**: Whenever possible, seek permission from the original creator to use their work. This not only aligns with ethical principles but also offers an opportunity for collaboration and mutual benefit.
3. **Open Licensing**: Consider using open licensing models that permit re-use under certain conditions (like Creative Commons licenses). This can encourage more creative interaction between artists and developers, potentially leading to new forms of artistic expression powered by AI.
4. **Education and Awareness**: Engage in discussions about these issues within your community. Educating others about fair use principles and the ethical implications of using copyrighted material responsibly can foster a culture that values both creativity and technological innovation.
In summary, while there is no clear-cut answer regarding whether such uses align with fair use or echo exploitation without delving into legal specifics for each case, a balanced approach that considers the transformative nature of AI outputs, seeks permission where appropriate, and promotes ethical dialogue will likely provide greater alignment between legal, ethical, and creative perspectives.
## Important Considerations
- It's crucial to acknowledge that Kurtis-E1.1 is based on a 3B-parameter Qwen2.5 model, and its performance should be evaluated within that context.
- While the model exhibits strong general language understanding, it is not a state-of-the-art model.
- ⚠️ Kurtis E1 should not be used as a replacement for a trained professional.
I encourage the community to utilize this data for comparative analysis and further research.
## Acknowledgments
A special thanks to:
🌼 Flower Labs - for providing such an accessible framework for small-scale experimentation. Your work proves that impactful open-source contributions don't require massive teams or budgets - just a passionate community built on inclusivity and collaboration.
🌪️ Qwen Team - for delivering such a remarkably capable 3B model. You've demonstrated that state-of-the-art performance isn't always necessary when the right architecture meets thoughtful fine-tuning. Perfect for specialized tasks where efficiency matters.
🤗 Hugging Face – Your platform democratizes access to AI/ML, proving that open-source collaboration drives progress.
Projects like this are possible because of communities that prioritize accessibility over exclusivity.