Join the conversation

Join the community of Machine Learners and AI enthusiasts.

Sign Up
mrs83 
posted an update 3 days ago
Post
2650
Introducing Completionist, an open-source command-line tool that automates synthetic dataset generation.

It works by iterating over an existing HF dataset and by using a LLM to create completions.

- Problem: You need a fast way to create custom datasets for fine-tuning or RAG, but you want the flexibility to use different LLM backends or your own infrastructure.
- Solution: Completionist connects with any OpenAI-compatible endpoint, including Ollama and LM Studio, or a Hugging Face inference endpoint.

A simple CLI like Completionist gives you the possibility to take full control of your synthetic data generation workflow.

👉 Check out Completionist on GitHub: https://github.com/ethicalabs-ai/completionist

Synthetic Dataset Example: ethicalabs/kurtis-mental-health-v2-sft-reasoning

You can now run the CLI by using a Container Engine such as Podman (or Docker)

mkdir -p datasets
podman run -it -v ./datasets:/app/datasets ethicalabs/completionist:latest \
  --api-url http://host.containers.internal:11434/v1/chat/completions \
  --dataset-name mrs83/kurtis_mental_health \
  --prompt-input-field Context \
  --model-name hf.co/ethicalabs/Kurtis-E1.1-Qwen3-4B-GGUF:latest \
  --system-prompt "You are a compassionate and empathetic mental-health assistant named Kurtis, trained by ethicalabs.ai. You provide thoughtful and supportive responses to user queries" \
  --output-file datasets/generated_dataset.parquet

In this example, --api-url is set to the Ollama HTTP server, listening on the host machine (host.containers.internal:11434).