This model seems to love using — as some part of speech or grammar.
I have no idea what — even is. It's not a proper hyphen, and it's mostly being used as some sort of semicolon or speech-pause indicator? But nothing in my settings, card, or anything else in SillyTavern uses —. In fact, I've never seen any AI generate — before...
Where did this love of — come from and is there any chance of training this bias out?
That's called an "em dash". The model is using that punctuation correctly in your examples—it is, after all, a general-purpose mark that can stand in for semicolons, colons, parentheses, etc.
It's because R1 loves to use them in its writing, along with smart quotes (“ ”).
I guess the main reason I hate the em dash is that it's not on my keyboard, so I can't use it myself. Maybe instead of manually replacing every em dash with a semicolon to try and get the AI to behave, I should just learn to accept that the em dash is a valid semicolon... It actually looks prettier to read than a semicolon, so I'm not sure why it upsets me.
I know exactly what you mean! I almost replaced it, along with the smart-quotes in my datasets from R1, but decided to leave it as it's more of a me issue.
Anything else I should clean out of my datasets?
@TheDrummer This model is for RP, right? You'd know better than I, but intuitively it makes sense to me to replace the smart quotes (“ ”) with regular quotes (" "), as I'm guessing most people would use them during RP, and front ends like ST, along with people's STscript/regex and DRY sampler ignore strings, are probably built around that.
But if I were doing it, I'd keep the original dataset as well if you want to make novel-writing models later.
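For what it's worth, here is a minimal sketch of that cleanup in Python, assuming the dataset is just a list of strings. The names `normalize_quotes` and `clean_rows` are made up for illustration, and which characters to map is my own guess rather than anything from the actual dataset. It swaps curly quotes for straight ones, leaves em dashes alone, and keeps the untouched text next to the cleaned copy so the original is still there for novel-writing models later.

```python
# Hypothetical cleanup helper: the character mapping is an assumption,
# not something taken from the real R1 dataset.
SMART_TO_PLAIN = str.maketrans({
    "\u201c": '"',  # left double quotation mark
    "\u201d": '"',  # right double quotation mark
    "\u2018": "'",  # left single quotation mark
    "\u2019": "'",  # right single quotation mark (also used as an apostrophe)
})


def normalize_quotes(text: str) -> str:
    """Replace curly quotes with straight ones; every other character is kept."""
    return text.translate(SMART_TO_PLAIN)


def clean_rows(rows: list[str]) -> list[dict[str, str]]:
    """Return each row with both the original text and an RP-friendly copy."""
    return [{"original": row, "rp": normalize_quotes(row)} for row in rows]
```

Calling `clean_rows(my_rows)` gives you both versions per row, so you can train the RP model on the `"rp"` field and keep `"original"` around.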