Report for cardiffnlp/twitter-roberta-base-irony

#191 opened by giskard-bot (Giskard org)

Hi Team,

This is a report from Giskard Bot Scan 🐢.

We have identified 8 potential vulnerabilities in your model based on an automated scan.

This automated analysis evaluated the model on the tweet_eval dataset (subset: irony, split: test).

You can find the full version of the scan report here.

👉Robustness issues (5)

When feature “text” is perturbed with the transformation “Transform to uppercase”, the model changes its prediction in 21.74% of the cases. We expected the predictions not to be affected by this transformation.

| Level | Metric | Transformation | Deviation |
|---|---|---|---|
| major 🔴 | Fail rate = 0.217 | Transform to uppercase | 170/782 tested samples (21.74%) changed prediction after perturbation |

Taxonomy: avid-effect:performance:P0201

🔍✨Examples

| # | text | Transform to uppercase(text) | Original prediction | Prediction after perturbation |
|---|---|---|---|---|
| 1 | Just walked in to #Starbucks and asked for a "tall blonde" Hahahaha #irony | JUST WALKED IN TO #STARBUCKS AND ASKED FOR A "TALL BLONDE" HAHAHAHA #IRONY | irony (p = 0.65) | non_irony (p = 0.78) |
| 9 | People who tell people with anxiety to "just stop worrying about it" are my favorite kind of people #not #educateyourself | PEOPLE WHO TELL PEOPLE WITH ANXIETY TO "JUST STOP WORRYING ABOUT IT" ARE MY FAVORITE KIND OF PEOPLE #NOT #EDUCATEYOURSELF | irony (p = 0.87) | non_irony (p = 0.51) |
| 10 | Most important thing I've learned in school | MOST IMPORTANT THING I'VE LEARNED IN SCHOOL | irony (p = 0.91) | non_irony (p = 0.71) |
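The fail-rate metric behind these findings can be sketched as follows. This is a minimal illustration of the computation, not Giskard's actual implementation; `toy_predict` is a hypothetical case-sensitive classifier used only to show why uppercasing can flip predictions:

```python
def fail_rate(texts, predict, transform):
    """Fraction of samples whose predicted label changes after perturbation."""
    changed = sum(predict(t) != predict(transform(t)) for t in texts)
    return changed / len(texts)

def toy_predict(text):
    """Hypothetical case-sensitive classifier: flags lowercase irony hashtags."""
    return "irony" if ("#irony" in text or "#not" in text) else "non_irony"

samples = [
    'Just walked in to #Starbucks Hahahaha #irony',
    "I'm so at home! #not",
    'He is exactly that sort of person. Weirdo!',
    'Most important thing I have learned in school',
]

rate = fail_rate(samples, toy_predict, str.upper)
print(f"Fail rate = {rate:.2%}")  # 2 of 4 toy samples change label: 50.00%
```

Because the hashtag match is case-sensitive, uppercasing hides the cue and flips the label, mirroring the failure mode the scan reports for the real model.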

When feature “text” is perturbed with the transformation “Transform to title case”, the model changes its prediction in 14.43% of the cases. We expected the predictions not to be affected by this transformation.

| Level | Metric | Transformation | Deviation |
|---|---|---|---|
| major 🔴 | Fail rate = 0.144 | Transform to title case | 113/783 tested samples (14.43%) changed prediction after perturbation |

Taxonomy: avid-effect:performance:P0201

🔍✨Examples

| # | text | Transform to title case(text) | Original prediction | Prediction after perturbation |
|---|---|---|---|---|
| 1 | Just walked in to #Starbucks and asked for a "tall blonde" Hahahaha #irony | Just Walked In To #Starbucks And Asked For A "Tall Blonde" Hahahaha #Irony | irony (p = 0.65) | non_irony (p = 0.54) |
| 21 | The definition of #IRONY would be if a 77-year-old rapper went #viral and took #BITCOIN mainstream. Maybe only way #babyboomers will buy in. | The Definition Of #Irony Would Be If A 77-Year-Old Rapper Went #Viral And Took #Bitcoin Mainstream. Maybe Only Way #Babyboomers Will Buy In. | irony (p = 0.82) | non_irony (p = 0.58) |
| 22 | Pretty excited about how you gave up on me. File Under: #sarcasm | Pretty Excited About How You Gave Up On Me. File Under: #Sarcasm | irony (p = 0.51) | non_irony (p = 0.80) |

When feature “text” is perturbed with the transformation “Add typos”, the model changes its prediction in 13.25% of the cases. We expected the predictions not to be affected by this transformation.

| Level | Metric | Transformation | Deviation |
|---|---|---|---|
| major 🔴 | Fail rate = 0.132 | Add typos | 95/717 tested samples (13.25%) changed prediction after perturbation |

Taxonomy: avid-effect:performance:P0201

🔍✨Examples

| # | text | Add typos(text) | Original prediction | Prediction after perturbation |
|---|---|---|---|---|
| 3 | @user He is exactly that sort of person. Weirdo! | @user He is exatvly that sort of person. Weirdo! | non_irony (p = 0.79) | irony (p = 0.81) |
| 22 | Pretty excited about how you gave up on me. File Under: #sarcasm | Pretty xecite dabout how you gave up kn me. File Under: #sarcasm | irony (p = 0.51) | non_irony (p = 0.78) |
| 27 | How dare Charles Barkley have an intelligent conversation about race. #sarcasm #CharlesBarkley | How dare Charles Barkley have an intelilgent onvsersation about race. #wrcasm #CarlesBarkley | non_irony (p = 0.64) | irony (p = 0.69) |
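One common typo perturbation is transposing two adjacent characters, which may be sketched deterministically as below. This is an illustrative assumption; Giskard's actual "Add typos" transformation is more varied, combining swaps, drops, and keyboard-neighbor substitutions, as the examples above show:

```python
def transpose_typo(text, index):
    """Swap the characters at positions index and index + 1.

    Returns the text unchanged if the index has no right-hand neighbor.
    """
    if not 0 <= index < len(text) - 1:
        return text
    chars = list(text)
    chars[index], chars[index + 1] = chars[index + 1], chars[index]
    return "".join(chars)

print(transpose_typo("exactly", 3))  # "exatcly"
```

Sampling the index at random over many positions, and keeping hashtags eligible for corruption (as in example 27 above), yields the kind of perturbed inputs the scan tested.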

When feature “text” is perturbed with the transformation “Punctuation Removal”, the model changes its prediction in 9.31% of the cases. We expected the predictions not to be affected by this transformation.

| Level | Metric | Transformation | Deviation |
|---|---|---|---|
| medium 🟡 | Fail rate = 0.093 | Punctuation Removal | 59/634 tested samples (9.31%) changed prediction after perturbation |

Taxonomy: avid-effect:performance:P0201

🔍✨Examples

| # | text | Punctuation Removal(text) | Original prediction | Prediction after perturbation |
|---|---|---|---|---|
| 23 | Who told the #hipsters that #irony was a thing of the Clinton years? Do they not carry history books in used bookstores in #brooklyn ? | Who told the #hipsters that #irony was a thing of the Clinton years Do they not carry history books in used bookstores in #brooklyn | non_irony (p = 0.65) | irony (p = 0.94) |
| 39 | On the train and surrounded by posh people, I'm so at home! #not #stickoutlikeasorethumb | On the train and surrounded by posh people I m so at home #not #stickoutlikeasorethumb | irony (p = 0.66) | non_irony (p = 0.81) |
| 40 | Stupid #doctors visits is gonna bury me!! Now that's #irony | Stupid #doctors visits is gonna bury me Now that s #irony | irony (p = 0.55) | non_irony (p = 0.77) |
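The examples above suggest punctuation is replaced by whitespace rather than deleted outright ("I'm" becomes "I m"). A minimal sketch under that assumption, not necessarily Giskard's exact transformation (note it deliberately keeps `#` so hashtags survive, matching the examples):

```python
import string

def remove_punctuation(text):
    """Replace ASCII punctuation (except '#') with a space, then collapse whitespace runs."""
    table = str.maketrans({p: " " for p in string.punctuation if p != "#"})
    return " ".join(text.translate(table).split())

print(remove_punctuation("I'm so at home! #not"))  # "I m so at home #not"
```

Splitting contractions this way produces tokens ("I", "m", "s") the model may never have seen in training, which is one plausible reason this mild-looking perturbation still flips 9.31% of predictions.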

When feature “text” is perturbed with the transformation “Transform to lowercase”, the model changes its prediction in 5.47% of the cases. We expected the predictions not to be affected by this transformation.

| Level | Metric | Transformation | Deviation |
|---|---|---|---|
| medium 🟡 | Fail rate = 0.055 | Transform to lowercase | 39/713 tested samples (5.47%) changed prediction after perturbation |

Taxonomy: avid-effect:performance:P0201

🔍✨Examples

| # | text | Transform to lowercase(text) | Original prediction | Prediction after perturbation |
|---|---|---|---|---|
| 19 | @user Guess they didn't get the memo reg non-nuclear Baltic sea #sarcasm | @user guess they didn't get the memo reg non-nuclear baltic sea #sarcasm | non_irony (p = 0.51) | irony (p = 0.60) |
| 22 | Pretty excited about how you gave up on me. File Under: #sarcasm | pretty excited about how you gave up on me. file under: #sarcasm | irony (p = 0.51) | non_irony (p = 0.54) |
| 30 | Nooooooooooo again it's on!!! #PickANewSong #CantStandIt | nooooooooooo again it's on!!! #pickanewsong #cantstandit | non_irony (p = 0.73) | irony (p = 0.75) |
👉Performance issues (3)

For records in the dataset where text contains "user", the Recall is 34.98% lower than the global Recall.

| Level | Data slice | Metric | Deviation |
|---|---|---|---|
| major 🔴 | text contains "user" | Recall = 0.366 | -34.98% vs. global |

Taxonomy: avid-effect:performance:P0204

🔍✨Examples

| # | text | label | Predicted label |
|---|---|---|---|
| 19 | @user Guess they didn't get the memo reg non-nuclear Baltic sea #sarcasm | irony | non_irony (p = 0.51) |
| 25 | @user hmm... let me think about that #sarcasm | irony | non_irony (p = 0.91) |
| 47 | @user 180 dead on 26/11 n more than 10k our ppl killed in terror attacks till date but not 1 paki show sympathy 2 them #irony | irony | non_irony (p = 0.71) |
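The slice deviation reported here can be reproduced from per-record labels and predictions. A minimal sketch with toy data (hypothetical values, not the actual scan records):

```python
def recall(labels, preds, positive="irony"):
    """True-positive rate for the positive class."""
    tp = sum(1 for y, p in zip(labels, preds) if y == positive and p == positive)
    actual = sum(1 for y in labels if y == positive)
    return tp / actual if actual else 0.0

def slice_deviation(rows, condition, positive="irony"):
    """Recall on a data slice, and its relative deviation from global recall."""
    global_r = recall([y for _, y, _ in rows], [p for _, _, p in rows], positive)
    sliced = [(t, y, p) for t, y, p in rows if condition(t)]
    slice_r = recall([y for _, y, _ in sliced], [p for _, _, p in sliced], positive)
    return slice_r, (slice_r - global_r) / global_r  # deviation as a fraction

# Toy rows: (text, true label, predicted label)
rows = [
    ("@user nice weather today #sarcasm", "irony", "non_irony"),
    ("@user sure, that went well", "irony", "irony"),
    ("love waiting in line #not", "irony", "irony"),
    ("the bus arrived on time", "non_irony", "non_irony"),
]

slice_r, dev = slice_deviation(rows, lambda t: "user" in t)
print(f"slice recall = {slice_r:.3f}, deviation = {dev:+.2%}")
```

A large negative deviation like the -34.98% above means the model misses ironic tweets containing "@user" mentions far more often than ironic tweets in general.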

For records in the dataset where text contains "irony", the Accuracy is 27.73% lower than the global Accuracy.

| Level | Data slice | Metric | Deviation |
|---|---|---|---|
| major 🔴 | text contains "irony" | Accuracy = 0.531 | -27.73% vs. global |

Taxonomy: avid-effect:performance:P0204

🔍✨Examples

| # | text | label | Predicted label |
|---|---|---|---|
| 23 | Who told the #hipsters that #irony was a thing of the Clinton years? Do they not carry history books in used bookstores in #brooklyn ? | irony | non_irony (p = 0.65) |
| 47 | @user 180 dead on 26/11 n more than 10k our ppl killed in terror attacks till date but not 1 paki show sympathy 2 them #irony | irony | non_irony (p = 0.71) |
| 65 | #Irony RT @user If you're going to give someone a scathing, 1-Star review for poor grammar, FFS use proper grammar. | irony | non_irony (p = 0.71) |

For records in the dataset where text contains "sarcasm", the Accuracy is 12.15% lower than the global Accuracy.

| Level | Data slice | Metric | Deviation |
|---|---|---|---|
| major 🔴 | text contains "sarcasm" | Accuracy = 0.645 | -12.15% vs. global |

Taxonomy: avid-effect:performance:P0204

🔍✨Examples

| # | text | label | Predicted label |
|---|---|---|---|
| 4 | So much #sarcasm at work mate 10/10 #boring 100% #dead mate full on #shit absolutely #sleeping mate can't handle the #sarcasm | irony | non_irony (p = 0.93) |
| 6 | People complain about my backround pic and all I feel is like "hey don't blame me, Albert E might have spoken those words" #sarcasm #life | irony | non_irony (p = 0.73) |
| 19 | @user Guess they didn't get the memo reg non-nuclear Baltic sea #sarcasm | irony | non_irony (p = 0.51) |

Check out the Giskard Space and the Giskard Documentation to learn more about how to test your model.

Disclaimer: automated scans may produce false positives or miss certain vulnerabilities. We encourage you to review the findings and assess their impact accordingly.
