Continuation of the "Instruct-Storywriter" experiments

#6
by jukofyork - opened

Just in case anybody is interested: https://huggingface.co/jukofyork/creative-writer-v0.1-alfa-35b

I'll be putting all the follow-up models/experiments in this collection: Creative Writing Models.

It's not quite the same as @tdrussell's method (I'm particularly interested in changing only the prose and/or entropy of models), but I use the same "concatenate chunks of real books" type of data... I've also found that smaller models just don't take well to this (even when the training is restricted to only the first n-1 down_proj matrices):

Why no 8B?

I tried multiple times to train this on Llama 3 8B Instruct, using a variety of hyperparameters. It never worked well. The model took a huge hit to intelligence every time, to the point of being unusable. 70B fared much better. I don't know why, maybe 8B is just too small for this type of technique, and loses too much of the instruction-tuned smarts.

I've run this on mistral:7b, wizard-lm-2:7b, mistral-nemo:12b and mistral-small:22b (with both "additive" and "multiplicative" LoRAs...) while tuning the hyper-parameters, and can confirm that they all had a (very) large drop in cross-validation loss (compared to the larger models) and ultimately ended up losing most of their instruction-following ability.

I'm obviously not releasing the data used (for the same reason as @tdrussell), but I will do my best to write up each experiment so it can be repeated.

Anybody else who wants to create a similar dataset can use this shell script:

#!/bin/bash

# Check if input and output directories are provided
if [ $# -ne 2 ]; then
    echo "Usage: $0 <input_directory> <output_directory>"
    exit 1
fi

input_dir="$1"
output_dir="$2"

# Create the output directory if it doesn't exist
mkdir -p "$output_dir"

# Number of parallel processes
NUM_PARALLEL=44

# Function to process a single file
process_file() {
    local file="$1"
    local output_dir="$2"
    
    # Extract filename without path and extension
    filename=$(basename "$file")
    filename_without_extension="${filename%.*}"
    
    # Create new filename with .txt extension
    new_filename="${filename_without_extension}.txt"
    
    # Build output file path
    output_file_path="$output_dir/$new_filename"

    # Initialize options variables
    options="--asciiize --unsmarten-punctuation --enable-heuristics"
    input_options=""
    output_options="--max-line-length 0 --newline unix --pretty-print --txt-output-formatting markdown"

    # Check if the file is a PDF and add the special input options
    if [[ "${filename##*.}" = "pdf" ]]; then
         input_options="--no-images --pdf-engine calibre"
    fi

    # Convert the file
    ebook-convert "$file" "$output_file_path" ${options} ${input_options} ${output_options}
    
    # Post-processing:
    # Blank out lines that contain only whitespace and/or non-breaking spaces (UTF-8 0xC2 0xA0)
    sed -i $'s/^[[:space:]\xC2\xA0]*$//' "$output_file_path"
    # Collapse runs of two or more consecutive newlines down to a single blank line
    sed -i ':a;N;$!ba;s/\n\{2,\}/\n\n/g' "$output_file_path"
    # Strip backslashes left over from markdown escaping
    sed -i -e 's/\\//g' "$output_file_path"
    # Normalise "* * *" style scene-break separators to "***"
    sed -i -e 's/\*[[:space:]]*\*[[:space:]]*\*/***/g' "$output_file_path"
    # If a line contains only a list marker (*, + or -) and trailing spaces, join it with the next line
    sed -i -e '/^[\*+-]  *$/{ N; s/\n//; }' "$output_file_path"
    # Normalise list markers so they are followed by exactly one space
    sed -i -e 's/^\([\*+-]\)[[:space:]]\+/\1 /' "$output_file_path"

    echo "Processed: $file -> $output_file_path"
}
export -f process_file

# Find all files and process them using parallel
find "$input_dir" -type f \( -iname "*.epub" -o -iname "*.mobi" -o -iname "*.pdf" \) | sort | \
    parallel -j $NUM_PARALLEL process_file {} "$output_dir"

echo "All files processed."

which only requires the latest version of Calibre's ebook-convert and GNU Parallel to work:

  • You might need to reduce NUM_PARALLEL=44 to around half the number of cores on your CPU (see the example after this list).
  • It works very well for .epub and .mobi files, but .pdf results will depend on how well formatted the source files are.
  • It often hangs on broken files right at the end, but if you give it 5-10 minutes and then just kill it, you only lose those files.
  • You don't have to use GNU Parallel, but conversion can take a very long time without it...
  • If you are using any non-English texts, then you probably want to remove --asciiize from the options variable, or you'll lose any special accented characters, etc.
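For example, assuming you save the script as convert-books.sh (the filename and paths here are just placeholders):

chmod +x convert-books.sh

# nproc reports the number of available CPU cores; around half of that is a
# reasonable starting point for NUM_PARALLEL (edit the value in the script):
echo "Suggested NUM_PARALLEL: $(( $(nproc) / 2 ))"

# Convert everything under ~/books into plain-text files in ~/books-txt:
./convert-books.sh ~/books ~/books-txt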

@tdrussell did you ever manage to get Focal Loss working? I found it almost impossible to find a sensible gamma value that could be reused... :/

The original paper suggests gamma=2, and some follow-up papers now suggest gamma=3, or even a mix of gamma=5 when p < ~0.2 and gamma=3 when p >= ~0.2, but I had no luck with any of these... I have a feeling there must be some better parametrisation (perhaps adjusted using the loss's mean over a batch?), but I eventually gave up trying to use it...
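For reference, the focal loss from the original paper (Lin et al., 2017) just down-weights tokens the model already predicts confidently (ignoring the class-balancing alpha term):

$$\mathrm{FL}(p_t) = -(1 - p_t)^{\gamma}\,\log(p_t)$$

where p_t is the probability the model assigns to the correct token: gamma = 0 recovers standard cross-entropy, and larger gamma shifts more of the loss onto low-probability ("hard") tokens.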
