This model seems to love using — as some part of speech or grammar.

by left1000 - opened Mar 3

Mar 3

I have no idea what — even is. It's not a proper hyphen, and it's mostly being used as some sort of semicolon or speech-pause indicator? But nothing in my settings, my card, or anything else in SillyTavern uses —. In fact, I've never seen any AI generate — before...

Where did this love of — come from and is there any chance of training this bias out?

Geechan

Mar 3

That's called an "em dash". The model is correctly using that punctuation in your examples—it is, after all, used as a universal punctuation in place of semicolons, colons, parentheses, etc.

gghfez

Mar 3

It's because R1 loves to use them for writing, along with smart-quotes (“ ”)

left1000

Mar 4

I guess the main reason I hate the em dash is that it's not on my keyboard, so I can't type it myself. Maybe instead of manually replacing every em dash with a semicolon to try to get the AI to behave, I should just learn to accept that the em dash is a valid semicolon... The em dash actually looks prettier to read than a semicolon, so I'm not sure why it upsets me.

gghfez

Mar 4

I know exactly what you mean! I almost replaced it, along with the smart-quotes in my datasets from R1, but decided to leave it as it's more of a me issue.

TheDrummer

Owner Mar 4

Anything else I should clean out of my datasets?

gghfez

Mar 4

@TheDrummer This model is for RP right? You'd know better than I; but intuitively, it makes sense to me to replace the smart-quotes (“ ”) with regular quotes (" "), as I'm guessing most people would use them during RP, and the front ends like ST along with people's stscript/regex and DRY sampler ignore strings are probably built around that.

But if I were doing it, I'd keep the original dataset as well if you want to make novel-writing models later.
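A minimal Python sketch of that idea: swap the smart double quotes for regular ones while leaving the original rows untouched, so the unmodified dataset stays available. The names `normalize_quotes` and `normalize_dataset` and the sample rows are made up for illustration; the real dataset format is not shown in this thread, so plain strings are assumed.

```python
# Hypothetical helpers; nothing here comes from the actual dataset tooling.
# Only the double smart quotes mentioned above are mapped.
SMART_TO_PLAIN = str.maketrans({
    "\u201c": '"',  # left double quotation mark
    "\u201d": '"',  # right double quotation mark
})


def normalize_quotes(text: str) -> str:
    """Return a copy of text with smart double quotes replaced by regular ones."""
    return text.translate(SMART_TO_PLAIN)


def normalize_dataset(rows: list[str]) -> list[str]:
    """Build a new list of cleaned rows; the input list is not modified."""
    return [normalize_quotes(row) for row in rows]


# Made-up sample rows, assuming each row is a plain string.
original = ["\u201cHello,\u201d she said.", "No quotes here."]
cleaned = normalize_dataset(original)
```

Since `normalize_dataset` returns a new list, `original` can be saved as-is for a later novel-writing model and `cleaned` used for RP training.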

MateoTeo

Apr 10

edited Apr 23

@TheDrummer Must say, this Fallen-R1 is the most fun and unhinged finetune I have tested so far. But my-my... in ~150 tokens it manages to insert 23 (!!!) "—if" xD
And nothing helps; I tried banning such tokens, setting system rules to avoid such patterns, adding multiple good examples and starters, and playing with the samplers, but it still leans into its deep love for inserting as many em dashes as possible. So, all hopes are on you, mate. Looks like the problem is deeply rooted in the training data's patterns.

UPD. Found that the model behaves more stably after around 3k tokens of context, avoiding inserting em-dashes left and right, but this will hit the model's style and alignment.

MateoTeo

May 5

edited May 7

UPD. Adding this array of tokens to the Logit Bias (with -100 value) will remove these em-dashes from generation:
[90863, 17223, 22416, 29096, 2345, 55434, 63938, 38542, 72318, 87671, 51749, 44603]

I'm sure that this is not the full list of tokens that use em-dashes, but it will remove most such patterns.
(For some reason, banning tokens is not working for me.)
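For anyone sending requests from a script instead of the UI, here is a rough Python sketch of the same idea. It turns the ID list above into an OpenAI-style `logit_bias` mapping (token ID as a string key, bias as the value). Whether your backend accepts that exact shape is an assumption, so check its docs. The `find_tokens_containing` helper and the toy vocabulary are hypothetical; they only illustrate how one might look for further em-dash tokens given a mapping of token IDs to decoded text.

```python
import json

# Token IDs copied from the post above; they are specific to this model's tokenizer.
EM_DASH_TOKEN_IDS = [90863, 17223, 22416, 29096, 2345, 55434,
                     63938, 38542, 72318, 87671, 51749, 44603]


def build_logit_bias(token_ids, bias=-100):
    """Map every token ID (as a string key) to the same bias value."""
    return {str(token_id): bias for token_id in token_ids}


def find_tokens_containing(decoded_vocab, needle="\u2014"):
    """Return the sorted IDs whose decoded text contains `needle`.

    `decoded_vocab` is a hypothetical {token_id: decoded_text} mapping;
    a real tokenizer may store tokens in a byte-level form instead.
    """
    return sorted(tid for tid, text in decoded_vocab.items() if needle in text)


logit_bias = build_logit_bias(EM_DASH_TOKEN_IDS)

# Toy vocabulary for illustration only, not the real one.
toy_vocab = {1: "hello", 2: "\u2014if", 3: " \u2014", 4: "-"}
extra_ids = find_tokens_containing(toy_vocab)

# Fragment of a request body, assuming an OpenAI-compatible endpoint.
print(json.dumps({"logit_bias": logit_bias}))
```

Scanning the full vocabulary this way is one option for extending the list, since the twelve IDs above are probably not every token that contains an em dash.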
