Join the conversation

Join the community of Machine Learners and AI enthusiasts.

Sign Up
CultriX 
posted an update 2 days ago
Post
790
Script for QA-style dataset generation from custom data:
Transform Your Personal Data into High-Quality Training Datasets with help from a LLM.

Inspired by a Reddit post (link below) I've created a script that converts custom documents into question-answer pairs for LLM fine-tuning.
What it does:
1. Split the input data into chunks (note: this is important, more below!)
2. QA generation: Creates contextually relevant question-answer pairs from each chunk.
3. Quality assurance: Validates outputs using both rule-based filters and LLM judges
4. Exports datasets in both CSV and JSON formats

Key features:
- Separate model configurations for generation and evaluation
- Configurable chunk sizes and question length
- Multi-language support (English and Dutch, but easy to add your own!)
- Local and cloud API compatibility

Quick start:
Place your documents (.txt for now) in an input folder and run:

python generate-rag-qav4.py \
  --input-dir ./rag-input/ \
  --output-dir ./rag-output/ \
  --output-filename finetuning_qa_dataset \
  --gen-model google/gemma-3-4b \
  --gen-api-base http://127.0.0.1:1234/v1 \
  --judge-model google/gemma-3-4b \
  --judge-api-base http://127.0.0.1:1234/v1 \
  --min-chunk-len 200 \
  --question-chars 20 \
  --answer-chars 5 \
  --lang en

Pro tip: The --min-chunk-len parameter is critical. Too short (< 150 chars) and questions lack context; too long (> 1000 chars) and the model struggles with focus. Start with 200-400 characters and adjust based on your content type!

Use cases:
- Personal knowledge base fine-tuning
- Domain-specific QA dataset creation
- RAG system training data preparation

Note: The script includes comprehensive error handling and progress tracking, and allows resuming progress should the process get interrupted.

Note2: Original Reddit post that gave me the idea:
https://www.reddit.com/r/LocalLLaMA/s/avkdzk8NSn

The script can be found here:
https://gist.github.com/CultriX-Github/9d53565214d56b12b9002a56230d1c00

@sometimesanotion Maybe this is useful to you! :)

Now imagine this as a hashtag generator and so a RAG search can find great context. :)