giskardai/giskard-evaluator · Report for distilbert/distilbert-base-uncased-finetuned-sst-2-english

Hi Team,

This is a report from Giskard Bot Scan 🐢.

We have identified 5 potential vulnerabilities in your model based on an automated scan.

This automated analysis evaluated the model on the dataset sst2 (subset default, split validation).

👉Robustness issues (1)

When feature “text” is perturbed with the transformation “Add typos”, the model changes its prediction in 13.0% of the cases. We expected the predictions not to be affected by this transformation.

Level	Metric	Transformation	Deviation
major 🔴	Fail rate = 0.130	Add typos	104/800 tested samples (13.0%) changed prediction after perturbation

Taxonomy

avid-effect:performance:P0201

🔍✨Examples

	text	Add typos(text)	Original prediction	Prediction after perturbation
13	we root for ( clara and paul ) , even like them , though perhaps it 's an emotion closer to pity .	we root for ( clara and paul ) , even like them , htough perhaps it 's an emotiom closer to pity .	POSITIVE (p = 0.96)	NEGATIVE (p = 0.99)
16	the emotions are raw and will strike a nerve with anyone who 's ever had family trauma .	the ekotions are raw andw ill strike a nerve with anyone wgo 's ever had family trauma .	POSITIVE (p = 1.00)	NEGATIVE (p = 0.60)
22	holden caulfield did it better .	holdsn caulfkeld did t better .	POSITIVE (p = 0.99)	NEGATIVE (p = 1.00)
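
As a rough illustration of how a fail rate like the one above can be measured, here is a minimal sketch. It does not reproduce Giskard's actual "Add typos" transformation: `add_typo` is a hypothetical stand-in that only swaps two adjacent characters, and `predict` is an assumed callable mapping a string to a label.

```python
import random
from typing import Callable, List


def add_typo(text: str, rng: random.Random) -> str:
    """Swap two adjacent characters at a random position (one simple kind of typo)."""
    if len(text) < 2:
        return text
    i = rng.randrange(len(text) - 1)
    chars = list(text)
    chars[i], chars[i + 1] = chars[i + 1], chars[i]
    return "".join(chars)


def fail_rate(predict: Callable[[str], str], texts: List[str], seed: int = 0) -> float:
    """Fraction of texts whose predicted label changes after perturbation."""
    if not texts:
        raise ValueError("texts must not be empty")
    rng = random.Random(seed)  # fixed seed so the perturbations are repeatable
    changed = sum(predict(t) != predict(add_typo(t, rng)) for t in texts)
    return changed / len(texts)
```

With the counts reported in the table, the same ratio is 104 / 800 = 0.130.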

👉Performance issues (4)

For records in the dataset where text_length(text) >= 50.500 AND text_length(text) < 61.500, the Precision is 15.5% lower than the global Precision (a relative difference, not percentage points).

Level	Data slice	Metric	Deviation
major 🔴	`text_length(text)` >= 50.500 AND `text_length(text)` < 61.500	Precision = 0.759	-15.50% than global

Taxonomy

avid-effect:performance:P0204

🔍✨Examples

	text	text_length(text)	label	Predicted `label`
92	you wo n't like roger , but you will quickly recognize him .	61	NEGATIVE	POSITIVE (p = 1.00)
171	rarely has leukemia looked so shimmering and benign .	54	NEGATIVE	POSITIVE (p = 0.98)
183	the lower your expectations , the more you 'll enjoy it .	58	NEGATIVE	POSITIVE (p = 1.00)
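
The slice comparison above can be sketched as follows. This is an illustrative reimplementation, not Giskard's code: it assumes `text_length` is a plain character count, that precision is computed for the POSITIVE class, and that the deviation is relative to the global value. The `Record` layout and all function names are hypothetical.

```python
from typing import List, Tuple

# Hypothetical record layout: (text, true label, predicted label)
Record = Tuple[str, str, str]


def precision(records: List[Record], positive: str = "POSITIVE") -> float:
    """TP / (TP + FP) for the positive class; NaN if nothing was predicted positive."""
    predicted_pos = [r for r in records if r[2] == positive]
    if not predicted_pos:
        return float("nan")
    true_pos = sum(1 for r in predicted_pos if r[1] == positive)
    return true_pos / len(predicted_pos)


def length_slice(records: List[Record], low: float, high: float) -> List[Record]:
    """Keep records whose text length (in characters) falls in [low, high)."""
    return [r for r in records if low <= len(r[0]) < high]


def relative_deviation(slice_value: float, global_value: float) -> float:
    """Signed difference of the slice metric from the global metric, relative to the global."""
    return (slice_value - global_value) / global_value
```

For the slice in this finding, one would call `length_slice(records, 50.5, 61.5)` and compare `precision` on that subset with `precision` on the full validation split.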

For records in the dataset where text_length(text) >= 73.500 AND text_length(text) < 82.500, the Recall is 11.19% lower than the global Recall (a relative difference, not percentage points).

Level	Data slice	Metric	Deviation
major 🔴	`text_length(text)` >= 73.500 AND `text_length(text)` < 82.500	Recall = 0.826	-11.19% than global

Taxonomy

avid-effect:performance:P0204

🔍✨Examples

	text	text_length(text)	label	Predicted `label`
93	if steven soderbergh 's ` solaris ' is a failure it is a glorious failure .	76	POSITIVE	NEGATIVE (p = 1.00)
123	turns potentially forgettable formula into something strangely diverting .	75	POSITIVE	NEGATIVE (p = 0.99)
142	what better message than ` love thyself ' could young women of any size receive ?	82	POSITIVE	NEGATIVE (p = 0.99)

For records in the dataset where text_length(text) >= 165.500 AND text_length(text) < 179.500, the Recall is 6.37% lower than the global Recall (a relative difference, not percentage points).

Level	Data slice	Metric	Deviation
medium 🟡	`text_length(text)` >= 165.500 AND `text_length(text)` < 179.500	Recall = 0.871	-6.37% than global

Taxonomy

avid-effect:performance:P0204

🔍✨Examples

	text	text_length(text)	label	Predicted `label`
158	by getting myself wrapped up in the visuals and eccentricities of many of the characters , i found myself confused when it came time to get to the heart of the movie .	168	NEGATIVE	POSITIVE (p = 0.99)
266	a coda in every sense , the pinochet case splits time between a minute-by-minute account of the british court 's extradition chess game and the regime 's talking-head survivors .	179	POSITIVE	NEGATIVE (p = 0.99)
282	while there 's something intrinsically funny about sir anthony hopkins saying ` get in the car , bitch , ' this jerry bruckheimer production has little else to offer	166	POSITIVE	NEGATIVE (p = 1.00)

For records in the dataset where text_length(text) >= 151.500 AND text_length(text) < 165.500, the Recall is 5.93% lower than the global Recall (a relative difference, not percentage points).

Level	Data slice	Metric	Deviation
medium 🟡	`text_length(text)` >= 151.500 AND `text_length(text)` < 165.500	Recall = 0.875	-5.93% than global

Taxonomy

avid-effect:performance:P0204

🔍✨Examples

	text	text_length(text)	label	Predicted `label`
324	you 'll gasp appalled and laugh outraged and possibly , watching the spectacle of a promising young lad treading desperately in a nasty sea , shed an errant tear .	164	POSITIVE	NEGATIVE (p = 0.95)
673	drops you into a dizzying , volatile , pressure-cooker of a situation that quickly snowballs out of control , while focusing on the what much more than the why .	162	POSITIVE	NEGATIVE (p = 0.94)
692	sustains its dreamlike glide through a succession of cheesy coincidences and voluptuous cheap effects , not the least of which is rebecca romijn-stamos .	154	NEGATIVE	POSITIVE (p = 0.94)
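
The report does not state the global Precision or Recall directly, but they can be backed out from each finding. The sketch below assumes the reported deviations are relative, so that global = slice / (1 + deviation); the helper name `implied_global` is hypothetical, and the numbers are copied from the tables above.

```python
def implied_global(slice_value: float, relative_deviation_pct: float) -> float:
    """Invert slice = global * (1 + deviation) to recover the global metric."""
    return slice_value / (1 + relative_deviation_pct / 100)


# (metric, slice value, deviation in percent), taken from the findings above
findings = [
    ("Precision", 0.759, -15.50),
    ("Recall", 0.826, -11.19),
    ("Recall", 0.871, -6.37),
    ("Recall", 0.875, -5.93),
]

for metric, value, deviation in findings:
    estimate = implied_global(value, deviation)
    print(f"{metric}: slice = {value:.3f}, deviation = {deviation:+.2f}% -> implied global = {estimate:.3f}")
```

All three Recall findings imply a global Recall of about 0.930, and the Precision finding implies a global Precision of about 0.898. That agreement across slices is what supports reading the deviations as relative rather than as percentage points.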

Check out the Giskard Space and Giskard Documentation to learn more about how to test your model.

Disclaimer: automated scans may produce false positives or miss certain vulnerabilities. We encourage you to review the findings and assess their impact accordingly.