Continuation of the "Instruct-Storywriter" experiments

#6
by jukofyork - opened

Just in case anybody is interested: https://huggingface.co/jukofyork/creative-writer-v0.1-alfa-35b

I'll be putting all the follow-up models/experiments in this collection: Creative Writing Models.

It's not quite the same as @tdrussell's method (I'm particularly interested in changing only the prose and/or entropy of models), but I use the same "concatenate chunks of real books" type of data... I've also found that smaller models just don't take well to this (even when the training is restricted to only the first n-1 down_proj matrices):

Why no 8B?

I tried multiple times to train this on Llama 3 8B Instruct, using a variety of hyperparameters. It never worked well. The model took a huge hit to intelligence every time, to the point of being unusable. 70B fared much better. I don't know why, maybe 8B is just too small for this type of technique, and loses too much of the instruction-tuned smarts.

I've run this on mistral:7b, wizard-lm-2:7b, mistral-nemo:12b and mistral-small:22b (with both "additive" and "multiplicative" LoRAs...) while tuning the hyper-parameters, and can confirm that they all had a (very) large drop in cross-validation loss (compared to the larger models) and ultimately ended up losing most of their instruction-following ability.

I'm obviously not releasing the data used (for the same reason as @tdrussell), but I will do my best to write up each experiment so it can be repeated.

Anybody else who wants to create a similar dataset can use this shell script:

#!/bin/bash

# Check if input and output directories are provided
if [ $# -ne 2 ]; then
    echo "Usage: $0 <input_directory> <output_directory>"
    exit 1
fi

input_dir="$1"
output_dir="$2"

# Create the output directory if it doesn't exist
mkdir -p "$output_dir"

# Number of parallel processes
NUM_PARALLEL=44

# Function to process a single file
process_file() {
    local file="$1"
    local output_dir="$2"
    
    # Extract filename without path and extension
    filename=$(basename "$file")
    filename_without_extension="${filename%.*}"
    
    # Create new filename with .txt extension
    new_filename="${filename_without_extension}.txt"
    
    # Build output file path
    output_file_path="$output_dir/$new_filename"

    # Initialize options variables
    options="--asciiize --unsmarten-punctuation --enable-heuristics"
    input_options=""
    output_options="--max-line-length 0 --newline unix --pretty-print --txt-output-formatting markdown"

    # Check if the file is a PDF and add the special input options
    if [[ "${filename##*.}" = "pdf" ]]; then
         input_options="--no-images --pdf-engine calibre"
    fi

    # Convert the file
    ebook-convert "$file" "$output_file_path" ${options} ${input_options} ${output_options}
    
    # Post-processing:
    # Blank out lines that contain only whitespace and/or non-breaking spaces (UTF-8 0xC2 0xA0)
    sed -i $'s/^[[:space:]\xC2\xA0]*$//' "$output_file_path"
    # Collapse runs of two or more consecutive newlines down to a single blank line
    sed -i ':a;N;$!ba;s/\n\{2,\}/\n\n/g' "$output_file_path"
    # Strip backslashes left over from markdown escaping
    sed -i -e 's/\\//g' "$output_file_path"
    # Normalise "* * *" style scene-break separators to "***"
    sed -i -e 's/\*[[:space:]]*\*[[:space:]]*\*/***/g' "$output_file_path"
    # If a line contains only a list marker (*, + or -) and trailing spaces, join it with the next line
    sed -i -e '/^[\*+-]  *$/{ N; s/\n//; }' "$output_file_path"
    # Normalise list markers so they are followed by exactly one space
    sed -i -e 's/^\([\*+-]\)[[:space:]]\+/\1 /' "$output_file_path"

    echo "Processed: $file -> $output_file_path"
}
export -f process_file

# Find all files and process them using parallel
find "$input_dir" -type f \( -iname "*.epub" -o -iname "*.mobi" -o -iname "*.pdf" \) | sort | \
    parallel -j $NUM_PARALLEL process_file {} "$output_dir"

echo "All files processed."

which only requires the latest version of Calibre's ebook-convert and GNU Parallel to work:

  • You might need to reduce NUM_PARALLEL=44 to around half the number of cores on your CPU (see the example after this list).
  • It works very well for .epub and .mobi files, but .pdf results will depend on how well formatted the source files are.
  • It often hangs on broken files right at the end, but if you give it 5-10 minutes and then just kill it, you only lose those files.
  • You don't have to use GNU Parallel, but conversion can take a very long time without it...
  • If you are using any non-English texts, then you probably want to remove --asciiize from the options variable, or you'll lose any special accented characters, etc.
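For example, assuming you save the script as convert-books.sh (the filename and paths here are just placeholders):

chmod +x convert-books.sh

# nproc reports the number of available CPU cores; around half of that is a
# reasonable starting point for NUM_PARALLEL (edit the value in the script):
echo "Suggested NUM_PARALLEL: $(( $(nproc) / 2 ))"

# Convert everything under ~/books into plain-text files in ~/books-txt:
./convert-books.sh ~/books ~/books-txt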

@tdrussell did you ever manage to get Focal Loss working? I found it almost impossible to find a sensible gamma value that could be reused... :/

The original paper suggests gamma=2, and some follow-up papers now suggest gamma=3, or even a mix of gamma=5 when p < ~0.2 and gamma=3 when p >= ~0.2, but I had no luck with any of these... I have a feeling there must be some better parametrisation (perhaps adjusted using the loss's mean over a batch?), but I eventually gave up trying to use it...
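For reference, the focal loss from the original paper (Lin et al., 2017) just down-weights tokens the model already predicts confidently (ignoring the class-balancing alpha term):

$$\mathrm{FL}(p_t) = -(1 - p_t)^{\gamma}\,\log(p_t)$$

where p_t is the probability the model assigns to the correct token: gamma = 0 recovers standard cross-entropy, and larger gamma shifts more of the loss onto low-probability ("hard") tokens.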
