Continuation of the "Instruct-Storywriter" experiments
Just in case anybody is interested: https://huggingface.co/jukofyork/creative-writer-v0.1-alfa-35b
I'll be putting all the follow-up models/experiments in this collection: Creative Writing Models.
It's not quite the same as @tdrussell's method (as I'm particularly interested in only changing the prose and/or entropy of models), but I use the same "concatenate chunks of real books" type of data... I've also found that smaller models just don't take well to this (even when restricted to only the first n-1 `down_proj` matrices):
> Why no 8B?
>
> I tried multiple times to train this on Llama 3 8B Instruct, using a variety of hyperparameters. It never worked well. The model took a huge hit to intelligence every time, to the point of being unusable. 70B fared much better. I don't know why, maybe 8B is just too small for this type of technique, and loses too much of the instruction-tuned smarts.
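For reference, restricting the LoRA to just the first n-1 `down_proj` matrices is easy to express with PEFT. Something like this (a sketch only: the model name, rank and alpha are placeholders, not the actual hyper-parameters I used):

```python
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("your-base-model")  # placeholder
num_layers = model.config.num_hidden_layers

config = LoraConfig(
    r=16,                                              # placeholder rank
    lora_alpha=32,                                      # placeholder alpha
    target_modules=["down_proj"],                       # only the MLP down-projections
    layers_to_transform=list(range(num_layers - 1)),    # first n-1 layers, skip the last
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, config)
model.print_trainable_parameters()
```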
I've run this on `mistral:7b`, `wizard-lm-2:7b`, `mistral-nemo:12b` and `mistral-small:22b` (both "additive" and "multiplicative" LoRAs...) when tuning the hyper-parameters, and can confirm that all had a (very) large drop in their cross-validation loss (compared to larger models), and ultimately ended up losing most of their instruction-following ability.
I'm obviously not releasing the data used (for the same reason as @tdrussell), but I will do my best to write up each experiment so it can be repeated.
Anybody else who wants to create a similar dataset can use this shell script:
```bash
#!/bin/bash

# Check if input and output directories are provided
if [ $# -ne 2 ]; then
    echo "Usage: $0 <input_directory> <output_directory>"
    exit 1
fi

input_dir="$1"
output_dir="$2"

# Create the output directory if it doesn't exist
mkdir -p "$output_dir"

# Number of parallel processes
NUM_PARALLEL=44

# Function to process a single file
process_file() {
    local file="$1"
    local output_dir="$2"

    # Extract filename without path and extension
    filename=$(basename "$file")
    filename_without_extension="${filename%.*}"

    # Create new filename with .txt extension
    new_filename="${filename_without_extension}.txt"

    # Build output file path
    output_file_path="$output_dir/$new_filename"

    # Initialize options variables
    options="--asciiize --unsmarten-punctuation --enable-heuristics"
    input_options=""
    output_options="--max-line-length 0 --newline unix --pretty-print --txt-output-formatting markdown"

    # Check if the file is a PDF (case-insensitive, to match -iname below) and add the special input options
    extension="${filename##*.}"
    if [[ "${extension,,}" = "pdf" ]]; then
        input_options="--no-images --pdf-engine calibre"
    fi

    # Convert the file
    ebook-convert "$file" "$output_file_path" ${options} ${input_options} ${output_options}

    # Post-processing:
    # Blank out lines containing only whitespace and/or non-breaking spaces
    sed -i $'s/^[[:space:]\xC2\xA0]*$//' "$output_file_path"
    # Collapse runs of blank lines down to a single blank line
    sed -i ':a;N;$!ba;s/\n\{2,\}/\n\n/g' "$output_file_path"
    # Remove all backslashes (stray escape characters from the conversion)
    sed -i -e 's/\\//g' "$output_file_path"
    # Normalise "* * *" style scene breaks to "***"
    sed -i -e 's/\*[[:space:]]*\*[[:space:]]*\*/***/g' "$output_file_path"
    # Join a lone list-bullet character onto the following line
    sed -i -e '/^[\*+-] *$/{ N; s/\n//; }' "$output_file_path"
    # Normalise the whitespace after a list bullet to a single space
    sed -i -e 's/^\([\*+-]\)[[:space:]]\+/\1 /' "$output_file_path"

    echo "Processed: $file -> $output_file_path"
}
export -f process_file

# Find all files and process them using GNU Parallel
find "$input_dir" -type f \( -iname "*.epub" -o -iname "*.mobi" -o -iname "*.pdf" \) | sort | \
    parallel -j $NUM_PARALLEL process_file {} "$output_dir"

echo "All files processed."
```
The script only requires the latest version of Calibre's `ebook-convert` and GNU Parallel to work:
- You might need to reduce `NUM_PARALLEL=44` to around 1/2 the number of cores in your CPU.
- It works very well for `.epub` and `.mobi`, but `.pdf` results will depend on how well formatted they are.
- It often hangs on broken files at the end, but if you give it 5-10 minutes and just kill it you only lose these.
- You don't have to use GNU Parallel, but it can take a very long time if not...
- If you are using any non-English texts, then you probably want to remove `--asciiize` from the `options` variable or you'll lose any special accented characters, etc.
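Once you have the `.txt` files, turning them into "concatenate chunks of real books" training samples is basically just: tokenise each book, slice the token stream into fixed-length blocks, and shuffle the blocks together. Something along these lines (a sketch only; the tokenizer name and sequence length are placeholders, not my actual pipeline):

```python
import glob
import random
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("your-base-model")  # placeholder: whichever model you're tuning
SEQ_LEN = 8192  # placeholder context length

chunks = []
for path in sorted(glob.glob("books_txt/*.txt")):
    with open(path, encoding="utf-8") as f:
        ids = tokenizer(f.read(), add_special_tokens=False)["input_ids"]
    # Slice each book's token stream into non-overlapping SEQ_LEN blocks (drop the remainder).
    chunks.extend(ids[i:i + SEQ_LEN] for i in range(0, len(ids) - SEQ_LEN + 1, SEQ_LEN))

random.shuffle(chunks)  # mix chunks from different books into one training set
print(f"{len(chunks)} training chunks of {SEQ_LEN} tokens")
```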
@tdrussell did you ever manage to get Focal Loss working? I found it almost impossible to find a sensible value that could be reused... :/
The original paper suggests `gamma=2`, but some follow-up papers now suggest `gamma=3`, or even a mix of `gamma=5` if `p < ~0.2` and `gamma=3` if `p >= ~0.2`, but I had no luck with it... I have a feeling there must be some better parametrisation (perhaps adjusted using the loss's mean for a batch?), but I eventually gave up trying to use it...