CultriX
Script for QA-style dataset generation from custom data:
Transform Your Personal Data into High-Quality Training Datasets with help from an LLM.

Inspired by a Reddit post (link below), I've created a script that converts custom documents into question-answer pairs for LLM fine-tuning.
What it does:
1. Chunking: Splits the input data into chunks (note: this is important, more below!)
2. QA generation: Creates contextually relevant question-answer pairs from each chunk
3. Quality assurance: Validates outputs using both rule-based filters and LLM judges
4. Export: Saves the dataset in both CSV and JSON formats
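The chunking step above can be sketched roughly like this. This is a minimal illustration, not the script's actual implementation: the function name, paragraph-based splitting strategy, and parameter defaults are all assumptions (only the idea of a minimum chunk length comes from the script's flags):

```python
def chunk_text(text: str, min_chunk_len: int = 200, max_chunk_len: int = 400) -> list[str]:
    """Split text into paragraph-based chunks, merging paragraphs until a chunk
    approaches max_chunk_len, and dropping a trailing chunk below min_chunk_len."""
    paragraphs = [p.strip() for p in text.split("\n\n") if p.strip()]
    chunks: list[str] = []
    current = ""
    for para in paragraphs:
        candidate = (current + "\n\n" + para).strip() if current else para
        if len(candidate) >= max_chunk_len and current:
            # Flush the accumulated chunk and start a new one with this paragraph.
            chunks.append(current)
            current = para
        else:
            current = candidate
    if current and len(current) >= min_chunk_len:
        chunks.append(current)
    return chunks
```

Chunks that are too small to yield a meaningful question get merged or discarded rather than sent to the model.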

Key features:
- Separate model configurations for generation and evaluation
- Configurable chunk size and question/answer length
- Multi-language support (English and Dutch, but easy to add your own!)
- Local and cloud API compatibility
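The separate generation/judge configuration could look something like this sketch. The dataclass and field names are illustrative (not the script's internals); what it shows is that the generator and the judge each carry their own model name and OpenAI-compatible base URL, so they can be served locally or in the cloud independently:

```python
from dataclasses import dataclass

@dataclass
class ModelConfig:
    model: str     # model identifier served by the endpoint
    api_base: str  # OpenAI-compatible base URL (local server or cloud API)

# Generation and evaluation are configured independently, e.g. a fast local
# model for generation and a stronger (possibly remote) model as the judge.
gen_cfg = ModelConfig(model="google/gemma-3-4b", api_base="http://127.0.0.1:1234/v1")
judge_cfg = ModelConfig(model="google/gemma-3-4b", api_base="http://127.0.0.1:1234/v1")
```

Pointing `api_base` at a hosted provider's OpenAI-compatible endpoint instead of `127.0.0.1` is what gives the local/cloud flexibility.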

Quick start:
Place your documents (.txt for now) in an input folder and run:

```
python generate-rag-qav4.py \
  --input-dir ./rag-input/ \
  --output-dir ./rag-output/ \
  --output-filename finetuning_qa_dataset \
  --gen-model google/gemma-3-4b \
  --gen-api-base http://127.0.0.1:1234/v1 \
  --judge-model google/gemma-3-4b \
  --judge-api-base http://127.0.0.1:1234/v1 \
  --min-chunk-len 200 \
  --question-chars 20 \
  --answer-chars 5 \
  --lang en
```

Pro tip: The --min-chunk-len parameter is critical. Too short (< 150 chars) and questions lack context; too long (> 1000 chars) and the model struggles with focus. Start with 200-400 characters and adjust based on your content type!
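In the same spirit, the --question-chars and --answer-chars thresholds feed the rule-based side of the quality check. A minimal illustrative filter might look like this (the trailing question-mark check is my assumption, not confirmed script behavior):

```python
def passes_rule_filter(question: str, answer: str,
                       min_question_chars: int = 20,
                       min_answer_chars: int = 5) -> bool:
    """Reject QA pairs whose question or answer is too short,
    or whose question does not read as a question."""
    q, a = question.strip(), answer.strip()
    if len(q) < min_question_chars or len(a) < min_answer_chars:
        return False
    if not q.endswith("?"):
        return False
    return True
```

Pairs that survive the cheap rule-based pass are then worth spending judge-model tokens on.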

Use cases:
- Personal knowledge base fine-tuning
- Domain-specific QA dataset creation
- RAG system training data preparation

Note: The script includes comprehensive error handling and progress tracking, and can resume where it left off if the process is interrupted.
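Resume-on-interrupt boils down to persisting which chunks have already been processed. A minimal sketch of the mechanism (the file name, location, and JSON structure are assumptions, not the script's actual format):

```python
import json
from pathlib import Path

PROGRESS_FILE = Path("rag-output/progress.json")  # assumed location

def load_done_ids() -> set[str]:
    """Return the set of chunk IDs already processed in a previous run."""
    if PROGRESS_FILE.exists():
        return set(json.loads(PROGRESS_FILE.read_text()))
    return set()

def mark_done(done: set[str], chunk_id: str) -> None:
    """Record a chunk as processed, flushing to disk so a crash loses at most one chunk."""
    done.add(chunk_id)
    PROGRESS_FILE.parent.mkdir(parents=True, exist_ok=True)
    PROGRESS_FILE.write_text(json.dumps(sorted(done)))
```

On restart, any chunk ID found in the progress file is simply skipped.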

Note 2: Original Reddit post that gave me the idea:
https://www.reddit.com/r/LocalLLaMA/s/avkdzk8NSn

The script can be found here:
https://gist.github.com/CultriX-Github/9d53565214d56b12b9002a56230d1c00
