Diabetes dataset - mean age of -9.733290091 × 10^-19

#1
by mjboothaus - opened

I took a look at your very nice EDA article / Marimo notebook. Some friendly feedback: the default diabetes data is presented (like most data in notebooks!) without context or data definitions. So when I looked at the ‘age’ column, for example, it had a mean value of -9.733290091 × 10^-19.

Further investigation (see docs reproduced below) leads to the realisation that it is scaled by default.

I realise that it is just an example, but as I’ve been computing for many years, I can’t help but feel that some metadata management, in ydata-profiling and/or Marimo, might be worth considering to encourage better practice (ideally leveraging an existing package?). I also realise that, since it is published on HuggingFace, model/data cards are already available as an option for metadata. Nice work!

From scikit-learn docs:

Target:
Column 11 is a quantitative measure of disease progression one year after baseline
Attribute Information:
age age in years
sex
bmi body mass index
bp average blood pressure
s1 tc, total serum cholesterol
s2 ldl, low-density lipoproteins
s3 hdl, high-density lipoproteins
s4 tch, total cholesterol / HDL
s5 ltg, possibly log of serum triglycerides level
s6 glu, blood sugar level
Note: Each of these 10 feature variables have been mean centered and scaled by the standard deviation times the square root of n_samples (i.e. the sum of squares of each column totals 1).
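
To make the note above concrete, here is a small standard-library sketch of that scaling. The ages are made-up values, not taken from the dataset, and the population standard deviation (dividing by n) is assumed, since that is what makes the sum of squares come out to 1.

```python
import math

# Hypothetical ages in years, for illustration only (not from the dataset).
ages = [34.0, 45.0, 52.0, 61.0, 29.0, 48.0]
n = len(ages)

mean = sum(ages) / n
# Population standard deviation (divide by n, not n - 1).
std = math.sqrt(sum((a - mean) ** 2 for a in ages) / n)

# Mean-centre, then divide by std * sqrt(n_samples).
scaled = [(a - mean) / (std * math.sqrt(n)) for a in ages]

sum_sq = sum(x * x for x in scaled)
scaled_mean = sum(scaled) / n

print("sum of squares:", sum_sq)     # 1, up to rounding
print("scaled mean:", scaled_mean)   # zero, up to rounding
```

After this transformation the column mean is zero apart from floating-point rounding, which is why a value on the order of 10^-19 shows up in the report.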

Hi @mjboothaus, thanks for your feedback.

As you pointed out, I wrote this up as a basic EDA tool for people who are familiar with YData and already have a CSV of data they know. Even so, giving context on the diabetes dataset is a good idea and would improve clarity, so I will add the metadata in a compelling way.
