Post 310: Script for QA-style dataset generation from custom data
Transform Your Personal Data into High-Quality Training Datasets with help from an LLM.
Inspired by a Reddit post (link below), I've created a script that converts custom documents into question-answer pairs for LLM fine-tuning.
What it does:
1. Chunking: Splits the input data into chunks (note: this is important, more below!)
2. QA generation: Creates contextually relevant question-answer pairs from each chunk
3. Quality assurance: Validates outputs using both rule-based filters and LLM judges
4. Export: Writes the dataset in both CSV and JSON formats (a sketch of the overall flow follows this list)
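To make that flow concrete, here is a minimal Python sketch of the pipeline shape. Everything in it (the function names, the paragraph-based chunker, the thresholds) is an illustrative assumption, not the actual script's API:

    import csv
    import json

    def chunk_text(text, min_chunk_len=200):
        # 1. Chunking: naive paragraph split; drop chunks below the minimum length.
        parts = [p.strip() for p in text.split("\n\n")]
        return [p for p in parts if len(p) >= min_chunk_len]

    def passes_rules(qa, question_chars=20, answer_chars=5):
        # 3. Rule-based filter: enforce minimum question/answer lengths.
        return (len(qa["question"]) >= question_chars
                and len(qa["answer"]) >= answer_chars)

    def export(pairs, stem):
        # 4. Export the same dataset as both JSON and CSV.
        with open(stem + ".json", "w", encoding="utf-8") as f:
            json.dump(pairs, f, indent=2)
        with open(stem + ".csv", "w", newline="", encoding="utf-8") as f:
            writer = csv.DictWriter(f, fieldnames=["question", "answer"])
            writer.writeheader()
            writer.writerows(pairs)

Steps 2 (generation) and the LLM-judge half of step 3 are LLM calls; the client sketch under "Key features" shows one way those endpoints could be wired up.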
Key features:
- Separate model configurations for generation and evaluation (a client sketch follows this list)
- Configurable chunk size and question/answer length thresholds
- Multi-language support (English and Dutch, but easy to add your own!)
- Local and cloud API compatibility
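Because generation and evaluation are configured independently, you can point them at different backends, e.g. a local server for generation and a cloud API for judging. A minimal sketch using the openai Python client; the URLs, key handling, and prompt are assumptions drawn from the quick-start flags below, not the script's internals:

    from openai import OpenAI

    # Generator on a local OpenAI-compatible server (port 1234 is
    # LM Studio's default); judge on a cloud endpoint. Same interface.
    gen_client = OpenAI(base_url="http://127.0.0.1:1234/v1", api_key="not-needed")
    judge_client = OpenAI(base_url="https://api.openai.com/v1", api_key="YOUR_KEY")

    resp = gen_client.chat.completions.create(
        model="google/gemma-3-4b",
        messages=[{"role": "user", "content": "Generate one QA pair from: <chunk>"}],
    )
    print(resp.choices[0].message.content)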
Quick start:
Place your documents (.txt for now) in an input folder and run:
python generate-rag-qav4.py \
--input-dir ./rag-input/ \
--output-dir ./rag-output/ \
--output-filename finetuning_qa_dataset \
--gen-model google/gemma-3-4b \
--gen-api-base http://127.0.0.1:1234/v1 \
--judge-model google/gemma-3-4b \
--judge-api-base http://127.0.0.1:1234/v1 \
--min-chunk-len 200 \
--question-chars 20 \
--answer-chars 5 \
--lang en
Pro tip: The --min-chunk-len parameter is critical. Too short (< 150 chars) and questions lack context; too long (> 1000 chars) and the model struggles with focus. Start with 200-400 characters and adjust based on your content type!
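If your documents have lots of short paragraphs, one way to stay in that 200-400 character sweet spot is to merge adjacent paragraphs until they clear the minimum, rather than dropping them. A hypothetical helper to illustrate the idea, not part of the script:

    def merge_short_paragraphs(paragraphs, min_len=200):
        # Greedily concatenate paragraphs until each chunk reaches min_len chars.
        chunks, buf = [], ""
        for p in paragraphs:
            buf = (buf + "\n\n" + p) if buf else p
            if len(buf) >= min_len:
                chunks.append(buf)
                buf = ""
        if buf:
            # Attach any short remainder to the last chunk instead of losing it.
            if chunks:
                chunks[-1] += "\n\n" + buf
            else:
                chunks.append(buf)
        return chunks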
Use cases:
- Personal knowledge base fine-tuning
- Domain-specific QA dataset creation
- RAG system training data preparation
Note: The script includes comprehensive error handling and progress tracking, and can resume from where it left off if the process is interrupted.
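Resumable progress in tools like this is often just an append-only checkpoint file: hash each chunk, log it once it's processed, and skip logged chunks on restart. A minimal sketch of that idea; the file name and format are assumptions, not necessarily how the script does it:

    import hashlib
    import json
    import os

    def chunk_id(chunk):
        return hashlib.sha256(chunk.encode("utf-8")).hexdigest()[:16]

    def load_done(path="progress.jsonl"):
        # Chunks already handled in a previous (possibly interrupted) run.
        if not os.path.exists(path):
            return set()
        with open(path, encoding="utf-8") as f:
            return {json.loads(line)["chunk_id"] for line in f if line.strip()}

    def mark_done(chunk, qa, path="progress.jsonl"):
        # Append-only log: safe to reopen and extend after a crash.
        with open(path, "a", encoding="utf-8") as f:
            f.write(json.dumps({"chunk_id": chunk_id(chunk), **qa}) + "\n")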
Note 2: Original Reddit post that gave me the idea:
https://www.reddit.com/r/LocalLLaMA/s/avkdzk8NSn
The script can be found here:
https://gist.github.com/CultriX-Github/9d53565214d56b12b9002a56230d1c00