Evaluation for fictional writing models

#5
by Henk717 - opened

It would be useful for the KoboldAI community if this system added a test that evaluates fiction-writing ability, since we primarily use fiction generation models. It would help us pick which base models are most suitable for our task.

That's a good idea! Do you know of any good existing benchmarks for this (probably already in lmeh)?

Fiction evaluation is almost never done, so we don't know which benchmarks are good for this, but I assume the ones that mention books would be the closest fit.
Normally we just crowdsource this from our community by asking which models they do and don't like.

Open LLM Leaderboard org

So this should be some kind of human evaluation, right?

@thomwolf At Chai Research we use explicit user feedback as our measure of quality. During this event we are sharing that process with every developer, so they can get real feedback from millions of users in the app: https://www.chai-research.com/competition.html
I wonder whether it's a good idea to make this similar to the existing LMEH evaluations, which select a completion based on loglikelihoods. In my experience, loglikelihood-based selection is not really correlated with human feedback. On the other hand, it's possible to use the reward model from the RLHF pipeline as the scorer, but I don't have enough experiments with this "benchmark" yet to claim anything.
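
For anyone unfamiliar with what "selecting a completion based on loglikelihoods" means, here is a minimal sketch of the idea. It is not the harness's actual code: the function name `pick_by_loglikelihood` and the per-token log-probabilities are made up for illustration, and in a real setup the numbers would come from the model being evaluated.

```python
from typing import Dict, List


def pick_by_loglikelihood(
    candidates: Dict[str, List[float]], normalize: bool = False
) -> str:
    """Return the candidate text with the highest score.

    `candidates` maps each completion to its per-token log-probabilities
    (hypothetical values here). The score is the sum of those values, or
    their mean when `normalize` is True.
    """
    if not candidates:
        raise ValueError("need at least one candidate")

    def score(logprobs: List[float]) -> float:
        total = sum(logprobs)
        if normalize and logprobs:
            return total / len(logprobs)
        return total

    return max(candidates, key=lambda text: score(candidates[text]))


# Toy numbers, invented for this example only.
toy = {
    "The dragon slept.": [-0.5, -1.0, -0.7],
    "The dragon slept beneath the mountain.": [-0.5, -1.0, -0.7, -0.9, -0.3, -0.6],
}

best_raw = pick_by_loglikelihood(toy)
best_normalized = pick_by_loglikelihood(toy, normalize=True)
print(best_raw)
print(best_normalized)
```

With these toy numbers the raw sum prefers the shorter completion and the length-normalized score prefers the longer one. Neither score says anything about whether a reader would enjoy the text, which may be part of why this kind of selection does not track human feedback well.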

clefourrier changed discussion status to closed
