lucharo/ydata-profiling-marimo · Diabetes dataset - mean age of -9.733290091 × 10^-19

I took a look at your very nice EDA article / Marimo notebook. Some friendly feedback: the default diabetes data is presented (like most data in notebooks!) without context or data definitions. So when I looked at the ‘age’ column, for example, it had a mean value of -9.733290091 × 10^-19.

Further investigation (see docs reproduced below) shows that the feature columns are mean-centred and scaled by default.

I realise that it is just an example, but having worked in computing for many years, I can’t help but feel that some metadata management, in ydata-profiling and/or Marimo, might be worth considering to encourage better practice (ideally leveraging an existing package?). I realise that, as it is published on HuggingFace, model/data cards are already available there as an option for metadata. Nice work!

From scikit-learn docs:

Target:
Column 11 is a quantitative measure of disease progression one year after baseline
Attribute Information:
- age: age in years
- sex
- bmi: body mass index
- bp: average blood pressure
- s1: tc, total serum cholesterol
- s2: ldl, low-density lipoproteins
- s3: hdl, high-density lipoproteins
- s4: tch, total cholesterol / HDL
- s5: ltg, possibly log of serum triglycerides level
- s6: glu, blood sugar level
Note: Each of these 10 feature variables have been mean centered and scaled by the standard deviation times the square root of n_samples (i.e. the sum of squares of each column totals 1).