doro

artiert

AI & ML interests

None yet

Recent Activity

reacted to CultriX's post with 👍 15 days ago

Script for QA-style dataset generation from custom data: Transform Your Personal Data into High-Quality Training Datasets with help from an LLM.

Inspired by a Reddit post (link below), I've created a script that converts custom documents into question-answer pairs for LLM fine-tuning.

What it does:

1. Chunking: Splits the input data into chunks (note: this is important, more below!)
2. QA generation: Creates contextually relevant question-answer pairs from each chunk.
3. Quality assurance: Validates outputs using both rule-based filters and LLM judges.
4. Export: Writes datasets in both CSV and JSON formats.

Key features:

- Separate model configurations for generation and evaluation
- Configurable chunk sizes and question length
- Multi-language support (English and Dutch, but easy to add your own!)
- Local and cloud API compatibility

Quick start: Place your documents (.txt for now) in an input folder and run:

```
python generate-rag-qav4.py \
  --input-dir ./rag-input/ \
  --output-dir ./rag-output/ \
  --output-filename finetuning_qa_dataset \
  --gen-model google/gemma-3-4b \
  --gen-api-base http://127.0.0.1:1234/v1 \
  --judge-model google/gemma-3-4b \
  --judge-api-base http://127.0.0.1:1234/v1 \
  --min-chunk-len 200 \
  --question-chars 20 \
  --answer-chars 5 \
  --lang en
```

Pro tip: The --min-chunk-len parameter is critical. Too short (< 150 chars) and questions lack context; too long (> 1000 chars) and the model struggles with focus. Start with 200-400 characters and adjust based on your content type!

Use cases:

- Personal knowledge base fine-tuning
- Domain-specific QA dataset creation
- RAG system training data preparation

Note: The script includes comprehensive error handling and progress tracking, and allows resuming should the process get interrupted.

Note 2: Original Reddit post that gave me the idea: https://www.reddit.com/r/LocalLLaMA/s/avkdzk8NSn

The script can be found here: https://gist.github.com/CultriX-Github/9d53565214d56b12b9002a56230d1c00
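To make the chunking and rule-based filtering steps more concrete, here is a minimal sketch in Python. It is not taken from the linked script: the function names `split_into_chunks` and `passes_rule_filter` are hypothetical, and it assumes that `--min-chunk-len`, `--question-chars` and `--answer-chars` act as minimum character counts, which the post does not spell out.

```python
import re


def split_into_chunks(text: str, min_chunk_len: int = 200) -> list[str]:
    """Greedily merge paragraphs until each chunk reaches min_chunk_len characters.

    Hypothetical helper: the real script may split on different boundaries.
    """
    # Paragraphs are assumed to be separated by one or more blank lines.
    paragraphs = [p.strip() for p in re.split(r"\n\s*\n", text) if p.strip()]
    chunks: list[str] = []
    buffer = ""
    for para in paragraphs:
        buffer = f"{buffer}\n\n{para}" if buffer else para
        if len(buffer) >= min_chunk_len:
            chunks.append(buffer)
            buffer = ""
    if buffer:
        # A short trailing remainder is merged into the previous chunk
        # so that no undersized chunk is emitted (unless it is the only one).
        if chunks:
            chunks[-1] = f"{chunks[-1]}\n\n{buffer}"
        else:
            chunks.append(buffer)
    return chunks


def passes_rule_filter(
    question: str,
    answer: str,
    question_chars: int = 20,
    answer_chars: int = 5,
) -> bool:
    """Cheap rule-based check to run before any LLM judge is called.

    Assumption: the thresholds are minimum lengths in characters.
    """
    question = question.strip()
    answer = answer.strip()
    return (
        len(question) >= question_chars
        and len(answer) >= answer_chars
        and question.endswith("?")
    )
```

Running a cheap filter like this first means only plausible pairs are sent to the judge model, which keeps the number of evaluation calls down. The question-mark check is an illustrative rule and may not suit every language.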

liked a model 3 months ago

artiert/Qwen2.5-3B-WebdomInstruct

updated a model 4 months ago

artiert/Qwen2.5-3B-WebdomInstruct

Organizations

models 2

artiert/Qwen2.5-3B-WebdomInstruct

Updated Feb 28 • 42 • 2

artiert/res

Question Answering • Updated Dec 19, 2023 • 18

datasets 0

None public yet