Kévin Yauy committed
Commit 9d42d90 · 1 Parent(s): 59e5ae2

feat(app): first commit PhenoGenius web app standalone repository

README.md ADDED
@@ -0,0 +1,75 @@
+ ---
+ title: PhenoGenius
+ emoji: genie
+ sdk: streamlit
+ sdk_version: 1.25.0
+ app_file: phenogenius_app.py
+ python_version: 3.11
+ pinned: true
+ ---
+
+ # PhenoGenius web app
+
+ Symptom interaction modeling for precision medicine
+
+ ## Overview
+
+ Symptom interaction models provide a method to standardize clinical descriptions and fully exploit phenotypic data in precision medicine.
+
+ This repository contains the scripts and files needed to run the PhenoGenius web app, the phenotype matching system for genetic diseases based on this model. **Please try PhenoGenius in the cloud at [https://huggingface.co/spaces/kyauy/PhenoGenius](https://huggingface.co/spaces/kyauy/PhenoGenius).**
+
+ If you use PhenoGenius, please cite:
+ > Yauy et al., Learning phenotypic patterns in genetic disease by symptom interaction modeling. medRxiv (2023). [https://doi.org/10.1101/2022.07.29.22278181](https://doi.org/10.1101/2022.07.29.22278181)
+
+ ## Install
+
+ - Requirements
+
+ ```bash
+ python == 3.11 #(pyenv install 3.11)
+ poetry #(https://python-poetry.org/docs/#installation)
+ git-lfs
+ ```
+
+ - Install dependencies
+
+ ```bash
+ poetry install
+ ```
+
+ If you need to generate a `requirements.txt` file, use the following command:
+ ```
+ poetry export --without-hashes --format=requirements.txt > requirements.txt
+ ```
+
+ NB: if git-lfs is not installed, you won't be able to download the PhenoGenius web app resources.
+
+ ## Use the Streamlit web app on your desktop
+
+ ### Run
+
+ ```bash
+ poetry shell
+ streamlit run phenogenius_app.py
+ ```
+
+
+ Enjoy!
+
+ ## Command line interface
+
+ The command line interface is available in the PhenoGenius client repository [https://github.com/kyauy/PhenoGenius/](https://github.com/kyauy/PhenoGenius/).
+
+ ## License
+
+ *PhenoGenius* is licensed under the Apache License, Version 2.0. See [LICENSE](LICENSE) for the full license text.
+
+ ## Misc
+
+ *PhenoGenius* is a collaboration of:
+
+ [![SeqOne](data/img/logo-seqone.png)](https://seqone.com/)
+
+ [![Université Grenoble Alpes](data/img/logo-uga.png)](https://iab.univ-grenoble-alpes.fr/)
+
+
data/img/logo-chuga.png ADDED
data/img/logo-seqone.png ADDED
data/img/logo-uga.png ADDED
data/img/logoMIAI-rvb.png ADDED
data/img/phenogenius.png ADDED
data/resources/Homo_sapiens.gene_info.gz ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:0d068aeddb48594d70dc3d91409059bc2b5992ac280f6971d2ae41583d895707
+ size 3234358
data/resources/hpo_obo_2024.json ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:655dd665ba80c547844e8a7095398fafe155bc149566b656a3a62019f9aed813
+ size 11488780
data/resources/main_topics_hpo_390_42_filtered_norm_004_2024.tsv ADDED
The diff for this file is too large to render. See raw diff
 
data/resources/ohe_all_thesaurus_weighted_2024.tsv.gz ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:8f2a882aa1c25da99468af56829fc35299631b3f9e30aca821bfe9cfa9f220e2
+ size 14121583
data/resources/pheno_NMF_390_matrix_42_2024.pkl ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:6a2740bb601d0776dbccccf17da25e8c09342475a510f37f875232160ac47b41
+ size 17447203
data/resources/pheno_NMF_390_model_42_2024.pkl ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:4c8cde8d11d69f6ffa5c405a41187f5ef42b5375e7aa8fe1f8f6d986967ba13b
+ size 58193070
data/resources/similarity_dict_threshold_80.json ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:73187c97bb1bf2898d1a85b27ed260e708d195e74f98338b44517bf8537bf076
+ size 38378638
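The three-line entries listed for the `data/resources` files are Git LFS pointer files, not the data itself: each records the pointer spec version, a SHA-256 object id, and the size in bytes. A minimal sketch of reading one, assuming the pointer text is already in memory; `parse_lfs_pointer` is a hypothetical helper name used only for illustration and is not part of this repository or of git-lfs:

```python
def parse_lfs_pointer(text: str) -> dict:
    """Parse the key/value lines of a Git LFS pointer file (illustrative helper)."""
    fields = {}
    for line in text.strip().splitlines():
        # Each line is "<key> <value>", separated by the first space.
        key, _, value = line.partition(" ")
        fields[key] = value
    # The oid has the form "<algorithm>:<hex digest>".
    algo, _, digest = fields["oid"].partition(":")
    return {
        "version": fields["version"],
        "hash_algorithm": algo,
        "digest": digest,
        "size_bytes": int(fields["size"]),
    }


# Pointer content copied from data/resources/similarity_dict_threshold_80.json
pointer_text = """version https://git-lfs.github.com/spec/v1
oid sha256:73187c97bb1bf2898d1a85b27ed260e708d195e74f98338b44517bf8537bf076
size 38378638
"""
pointer = parse_lfs_pointer(pointer_text)
print(pointer["hash_algorithm"], pointer["size_bytes"])
```

If a local checkout still contains text like this instead of the real file, the LFS objects have probably not been fetched yet, which matches the README note about installing git-lfs.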
phenogenius_app.py ADDED
@@ -0,0 +1,643 @@
+ import streamlit as st
+ import numpy as np
+ import pandas as pd
+ from PIL import Image
+ import ujson as json
+ import pickle as pk
+ from plotnine import *
+
+ # -- Set page config
+ apptitle = "PhenoGenius"
+
+ st.set_page_config(
+     page_title=apptitle,
+     page_icon=":genie:",
+     layout="wide",
+     initial_sidebar_state="auto",
+ )
+
+ # -- Set Sidebar
+ image_pg = Image.open("data/img/phenogenius.png")
+ st.sidebar.image(image_pg, caption=None, width=100)
+ st.sidebar.title("PhenoGenius")
+
+ st.sidebar.header(
+     "Learning phenotypic patterns in genetic diseases by symptom interaction modeling"
+ )
+
+ st.sidebar.markdown(
+     """
+ This webapp presents symptom interaction models in genetic diseases to provide:
+ - Standardized clinical descriptions
+ - Interpretable matches between symptoms and genes
+
+ Code source is available in GitHub:
+ [https://github.com/kyauy/PhenoGenius](https://github.com/kyauy/PhenoGenius)
+
+ Last update: 2024-07-15
+
+ PhenoGenius is a collaborative project from:
+ """
+ )
+
+ image_uga = Image.open("data/img/logo-uga.png")
+ st.sidebar.image(image_uga, caption=None, width=95)
+
+ image_seqone = Image.open("data/img/logo-seqone.png")
+ st.sidebar.image(image_seqone, caption=None, width=95)
+
+ image_miai = Image.open("data/img/logoMIAI-rvb.png")
+ st.sidebar.image(image_miai, caption=None, width=95)
+
+ image_chuga = Image.open("data/img/logo-chuga.png")
+ st.sidebar.image(image_chuga, caption=None, width=60)
+
+
+ @st.cache_data(max_entries=50)
+ def convert_df(df):
+     return df.to_csv(sep="\t").encode("utf-8")
+
+
+ @st.cache_data(max_entries=50)
+ def load_data():
+     matrix = pd.read_csv(
+         "data/resources/ohe_all_thesaurus_weighted_2024.tsv.gz",
+         sep="\t",
+         compression="gzip",
+         index_col=0,
+     )
+     return matrix
+
+
+ @st.cache_data(hash_funcs={"Pickle": lambda _: None}, max_entries=50)
+ def load_nmf_model():
+     with open("data/resources/pheno_NMF_390_model_42_2024.pkl", "rb") as pickle_file:
+         pheno_NMF = pk.load(pickle_file)
+     with open("data/resources/pheno_NMF_390_matrix_42_2024.pkl", "rb") as pickle_file:
+         reduced = pk.load(pickle_file)
+     return pheno_NMF, reduced
+
+
+ @st.cache_data(max_entries=50)
+ def symbol_to_id_to_dict():
+     # from NCBI
+     ncbi_df = pd.read_csv("data/resources/Homo_sapiens.gene_info.gz", sep="\t")
+     ncbi_df = ncbi_df[ncbi_df["#tax_id"] == 9606]
+     ncbi_df_ncbi = ncbi_df.set_index("Symbol")
+     ncbi_to_dict_ncbi = ncbi_df_ncbi["GeneID"].to_dict()
+     ncbi_df = ncbi_df.set_index("GeneID")
+     ncbi_to_dict = ncbi_df["Symbol"].to_dict()
+     return ncbi_to_dict_ncbi, ncbi_to_dict
+
+
+ @st.cache_data(hash_funcs={"_json.Scanner": hash}, max_entries=50)
+ def load_hp_ontology():
+     with open("data/resources/hpo_obo_2024.json") as json_data:
+         data_dict = json.load(json_data)
+     return data_dict
+
+
+ @st.cache_data(max_entries=50)
+ def hpo_description_to_id():
+     data_dict = {}
+     for key, value in hp_onto.items():
+         data_dict[value["name"]] = key
+     return data_dict
+
+
+ @st.cache_data(max_entries=50)
+ def load_topic_data():
+     topic = pd.read_csv(
+         "data/resources/main_topics_hpo_390_42_filtered_norm_004_2024.tsv",
+         sep="\t",
+         index_col=0,
+     )
+     return topic
+
+
+ @st.cache_data(hash_funcs={"_json.Scanner": hash}, max_entries=50)
+ def load_similarity_dict():
+     with open("data/resources/similarity_dict_threshold_80.json") as json_data:
+         data_dict = json.load(json_data)
+     return data_dict
+
+
+ def get_symbol(gene):
+     if gene in symbol.keys():
+         return symbol[gene]
+
+
+ def get_hpo_name(hpo):
+     names = {}
+     if hpo in hp_onto.keys():
+         names[hpo] = hp_onto[hpo]["name"]
+     return names
+
+
+ def get_hpo_name_only(hpo):
+     if hpo in hp_onto.keys():
+         return hp_onto[hpo]["name"]
+     else:
+         return None
+
+
+ def get_hpo_name_list(hpo_list, hp_onto):
+     names = {}
+     for hpo in hpo_list:
+         if hpo in hp_onto.keys():
+             names[hpo] = hp_onto[hpo]["name"]
+     return names
+
+
+ def get_similar_terms(hpo_list, similarity_terms_dict):
+     hpo_list_w_simi = {}
+     for term in hpo_list:
+         hpo_list_w_simi[term] = 1
+         if term in similarity_terms_dict.keys():
+             for key, value in similarity_terms_dict[term].items():
+                 if value > 0.8:
+                     score = value / len(similarity_terms_dict[term].keys())
+                     if key in hpo_list_w_simi.keys():
+                         if score > hpo_list_w_simi[key]:
+                             hpo_list_w_simi[key] = score
+                         else:
+                             pass
+                     else:
+                         hpo_list_w_simi[key] = score
+     hpo_list_all = hpo_list_w_simi.keys()
+     return hpo_list_w_simi, list(hpo_list_all)
+
+
+ def score(hpo_list, matrix):
+     # Create a copy of the filtered matrix to avoid SettingWithCopyWarning
+     matrix_filter = matrix[hpo_list].copy()
+
+     # Use .loc to safely add or modify columns in the copy of the DataFrame
+     matrix_filter.loc[:, "sum"] = matrix_filter.sum(axis=1)
+     matrix_filter.loc[:, "gene_symbol"] = matrix_filter.index.to_series().apply(
+         get_symbol
+     )
+
+     # Return the modified DataFrame sorted by 'sum'
+     return matrix_filter.sort_values("sum", ascending=False)
+
+
+ def score_sim_add(hpo_list_add, matrix, sim_dict):
+     # Ensure matrix_filter is a copy to avoid modifying the original DataFrame
+     matrix_filter = matrix[hpo_list_add].copy()
+
+     # Iterate through sim_dict to update matrix_filter values
+     for key, value in sim_dict.items():
+         if key in matrix_filter.columns:
+             matrix_filter[key] = (
+                 matrix_filter[key] * value
+             )  # Direct column assignment is fine here
+
+     # Calculate the sum and assign gene_symbol, using direct assignment for these operations
+     matrix_filter["sum"] = matrix_filter.sum(axis=1)
+     matrix_filter["gene_symbol"] = matrix_filter.index.to_series().apply(get_symbol)
+
+     # Return the DataFrame sorted by 'sum'
+     return matrix_filter.sort_values("sum", ascending=False)
+
+
+ def get_phenotype_specificity(gene_diag, data_patient):
+     rank = data_patient.loc[int(ncbi[gene_diag]), "rank"]
+     max_rank = data_patient["rank"].max()
+     if rank == max_rank:
+         return "D - the reported phenotype is NOT consistent with what is expected for the gene/genomic region or not consistent in general."
+     elif rank < 41:
+         return "A - the reported phenotype is highly specific and relatively unique to the gene (top 40, 50 perc of diagnosis in PhenoGenius cohort)."
+     elif rank < 250:
+         return "B - the reported phenotype is consistent with the gene, is highly specific, but not necessarily unique to the gene (top 250, 75 perc of diagnosis in PhenoGenius cohort)."
+     else:
+         return "C - the phenotype is reported with limited association with the gene, not highly specific and/or with high genetic heterogeneity."
+
+
+ def get_relatives_list(hpo_list, hp_onto):
+     all_list = []
+     for hpo in hpo_list:
+         all_list.append(hpo)
+         if hpo in hp_onto.keys():
+             for parent in hp_onto[hpo]["parents"]:
+                 all_list.append(parent)
+             for children in hp_onto[hpo]["childrens"]:
+                 all_list.append(children)
+     return list(set(all_list))
+
+
+ def get_hpo_id(hpo_list):
+     hpo_id = []
+     for description in hpo_list:
+         hpo_id.append(hp_desc_id[description])
+     return ",".join(hpo_id)
+
+
+ hp_onto = load_hp_ontology()
+ hp_desc_id = hpo_description_to_id()
+ ncbi, symbol = symbol_to_id_to_dict()
+
+
+ with st.form("my_form"):
+     c1, c2 = st.columns(2)
+     with c1:
+         hpo_raw = st.multiselect(
+             "Select interactively your HPOs or...",
+             list(hp_desc_id.keys()),
+             ["Renal cyst", "Hepatic cysts"],
+         )
+     with c2:
+         hpo = st.text_input(
+             "copy/paste your HPOs, separated with comma",
+             "HP:0000107,HP:0001407",
+         )
+     gene_diag_input = st.multiselect(
+         "Optional: provide HGNC gene symbol to be tested",
+         options=list(ncbi.keys()),
+         default=["PKD1"],
+         max_selections=1,
+     )
+     submit_button = st.form_submit_button(
+         label="Submit",
+     )
+
+
+ if submit_button:
+     if hpo_raw != ["Renal cyst", "Hepatic cysts"] and len(hpo_raw) > 0:
+         hpo = get_hpo_id(hpo_raw)
+     data = load_data()
+     pheno_NMF, reduced = load_nmf_model()
+     topic = load_topic_data()
+     similarity_terms_dict = load_similarity_dict()
+
+     hpo_list_ini = hpo.strip().split(",")
+
+     if gene_diag_input:
+         if gene_diag_input[0] in ncbi.keys():
+             gene_diag = gene_diag_input[0]
+         else:
+             st.write(
+                 gene_diag_input[0]
+                 + " gene is not in our database. Please check the gene name (it needs to be in CAPITAL letters)."
+             )
+             gene_diag = None
+     else:
+         gene_diag = None
+
+     hpo_list_up = []
+     for hpo in hpo_list_ini:
+         if hpo in ["HP:0000001"]:
+             pass
+         elif len(hpo) != 10:
+             st.write(
+                 "Incorrect HPO format: "
+                 + hpo
+                 + ". Please check (7-digit terms with prefix HP:, separated by commas)."
+             )
+             pass
+         elif hpo not in data.columns:
+             pass
+             st.write(hpo + " not available in current database. Please modify.")
+         else:
+             if data[hpo].astype(bool).sum(axis=0) != 0:
+                 hpo_list_up.append(hpo)
+             else:
+                 hpo_to_test = hp_onto[hpo]["direct_parent"][0]
+                 while data[hpo_to_test].astype(bool).sum(
+                     axis=0
+                 ) == 0 and hpo_to_test not in ["HP:0000001"]:
+                     hpo_to_test = hp_onto[hpo_to_test]["direct_parent"][0]
+                 if hpo_to_test in ["HP:0000001"]:
+                     st.write(
+                         "No gene-HPO association was found for "
+                         + hpo
+                         + " and parents."
+                     )
+                 else:
+                     hpo_list_up.append(hpo_to_test)
+                     st.write(
+                         "We replaced: ",
+                         hpo,
+                         " by ",
+                         hpo_to_test,
+                         "-",
+                         get_hpo_name(hpo_to_test),
+                     )
+     hpo_list = list(set(hpo_list_up))
+     del hpo_list_up
+
+     if hpo_list:
+         with st.expander("See HPO inputs"):
+             st.write(get_hpo_name_list(hpo_list_ini, hp_onto))
+         del hpo_list_ini
+
+         hpo_list_name = get_relatives_list(hpo_list, hp_onto)
+
+         st.header("Clinical description with symptom interaction modeling")
+
+         witness = np.zeros(len(data.columns))
+         witness_nmf = np.matmul(pheno_NMF.components_, witness)
+
+         patient = np.zeros(len(data.columns))
+         for hpo in hpo_list:
+             hpo_index = list(data.columns).index(hpo)
+             patient[hpo_index] = 1
+
+         patient_nmf = np.matmul(pheno_NMF.components_, patient)
+
+         witness_sugg_df = (
+             pd.DataFrame(reduced)
+             .set_index(data.index)
+             .apply(lambda x: (x - witness_nmf) ** 2, axis=1)
+         )
+         patient_sugg_df = (
+             pd.DataFrame(reduced)
+             .set_index(data.index)
+             .apply(lambda x: (x - patient_nmf) ** 2, axis=1)
+         )
+
+         case_sugg_df = (patient_sugg_df - witness_sugg_df).sum()
+
+         patient_df_info = pd.DataFrame(case_sugg_df).merge(
+             topic, left_index=True, right_index=True
+         )
+
+         patient_df_info["mean_score"] = round(
+             patient_df_info[0] / (patient_df_info["total_weight"] ** 2), 4
+         )
+
+         patient_df_info_write = patient_df_info[
+             ["mean_score", "main_term", "n_hpo", "hpo_name", "hpo_list", "weight"]
+         ].sort_values("mean_score", ascending=False)
+
+         del case_sugg_df
+         del patient_sugg_df
+         del witness_sugg_df
+         del patient
+
+         with st.expander("See projection in groups of symptoms dimension*"):
+             st.dataframe(patient_df_info_write)
+             st.write(
+                 "\\* For interpretability, we report only the top 10% of the 390 groups of interacting symptom associations"
+             )
+             match_proj_csv = convert_df(patient_df_info_write)
+
+             st.download_button(
+                 "Download description projection",
+                 match_proj_csv,
+                 "clin_desc_projected.tsv",
+                 "text/csv",
+                 key="download-csv-proj",
+             )
+
+         sim_dict, hpo_list_add = get_similar_terms(hpo_list, similarity_terms_dict)
+         similar_list = list(set(hpo_list_add) - set(hpo_list))
+         similar_list_desc = get_hpo_name_list(similar_list, hp_onto)
+
+         if similar_list_desc:
+             with st.expander("See symptoms with similarity > 80%"):
+                 similar_list_desc_df = pd.DataFrame.from_dict(
+                     similar_list_desc, orient="index"
+                 )
+                 similar_list_desc_df.columns = ["description"]
+                 st.write(similar_list_desc_df)
+                 del similar_list_desc_df
+         del similar_list
+         del similar_list_desc
+
+         st.header("Phenotype matching")
+         results_sum = score(hpo_list, data)
+         results_sum["matchs"] = results_sum[hpo_list].astype(bool).sum(axis=1)
+         results_sum["score"] = results_sum["matchs"] + results_sum["sum"]
+         results_sum["rank"] = (
+             results_sum["score"].rank(ascending=False, method="max").astype(int)
+         )
+         cols = results_sum.columns.tolist()
+         cols = cols[-4:] + cols[:-4]
+         match = results_sum[cols].sort_values(by=["score"], ascending=False)
+         st.dataframe(match[match["score"] > 1.01].drop(columns=["sum"]))
+         match_csv = convert_df(match)
+
+         st.download_button(
+             "Download matching results",
+             match_csv,
+             "match.tsv",
+             "text/csv",
+             key="download-csv-match",
+         )
+
+         if gene_diag:
+             if int(ncbi[gene_diag]) in results_sum.index:
+                 p = (
+                     ggplot(match, aes("score"))
+                     + geom_density()
+                     + geom_vline(
+                         xintercept=results_sum.loc[int(ncbi[gene_diag]), "score"],
+                         linetype="dashed",
+                         color="red",
+                         size=1.5,
+                     )
+                     + ggtitle("Matching score distribution")
+                     + xlab("Gene matching score")
+                     + ylab("% of genes")
+                     + theme_bw()
+                     + theme(
+                         text=element_text(size=12),
+                         figure_size=(5, 5),
+                         axis_ticks=element_line(colour="black", size=4),
+                         axis_line=element_line(colour="black", size=2),
+                         axis_text_x=element_text(angle=45, hjust=1),
+                         axis_text_y=element_text(angle=60, hjust=1),
+                         subplots_adjust={"wspace": 0.1},
+                         legend_position=(0.7, 0.35),
+                     )
+                 )
+                 col1, col2, col3 = st.columns(3)
+
+                 with col1:
+                     st.pyplot(ggplot.draw(p))
+
+                 st.write(
+                     "Gene ID rank:",
+                     results_sum.loc[int(ncbi[gene_diag]), "rank"],
+                     " | ",
+                     "Gene ID count:",
+                     round(results_sum.loc[int(ncbi[gene_diag]), "sum"], 4),
+                 )
+                 st.write(results_sum.loc[[int(ncbi[gene_diag])]])
+                 st.write(
+                     "Gene ID phenotype specificity:",
+                     get_phenotype_specificity(gene_diag, results_sum),
+                 )
+                 del p
+
+             else:
+                 st.write("Gene ID rank:", " Gene not available in PhenoGenius database")
+         del results_sum
+         del match
+
+         st.header("Phenotype matching by similarity of symptoms")
+         results_sum_add = score_sim_add(hpo_list_add, data, sim_dict)
+         results_sum_add["rank"] = (
+             results_sum_add["sum"].rank(ascending=False, method="max").astype(int)
+         )
+         cols = results_sum_add.columns.tolist()
+         cols = cols[-2:] + cols[:-2]
+         match_sim = results_sum_add[cols].sort_values(by=["sum"], ascending=False)
+         st.dataframe(match_sim[match_sim["sum"] > 0.01])
+
+         match_sim_csv = convert_df(match_sim)
+
+         st.download_button(
+             "Download matching results",
+             match_sim_csv,
+             "match_sim.tsv",
+             "text/csv",
+             key="download-csv-match-sim",
+         )
+
+         if gene_diag:
+             if int(ncbi[gene_diag]) in results_sum_add.index:
+                 p2 = (
+                     ggplot(match_sim, aes("sum"))
+                     + geom_density()
+                     + geom_vline(
+                         xintercept=results_sum_add.loc[int(ncbi[gene_diag]), "sum"],
+                         linetype="dashed",
+                         color="red",
+                         size=1.5,
+                     )
+                     + ggtitle("Matching score distribution")
+                     + xlab("Gene matching score")
+                     + ylab("% of genes")
+                     + theme_bw()
+                     + theme(
+                         text=element_text(size=12),
+                         figure_size=(5, 5),
+                         axis_ticks=element_line(colour="black", size=4),
+                         axis_line=element_line(colour="black", size=2),
+                         axis_text_x=element_text(angle=45, hjust=1),
+                         axis_text_y=element_text(angle=60, hjust=1),
+                         subplots_adjust={"wspace": 0.1},
+                         legend_position=(0.7, 0.35),
+                     )
+                 )
+                 col1, col2, col3 = st.columns(3)
+
+                 with col1:
+                     st.pyplot(ggplot.draw(p2))
+
+                 st.write(
+                     "Gene ID rank:",
+                     results_sum_add.loc[int(ncbi[gene_diag]), "rank"],
+                     " | ",
+                     "Gene ID count:",
+                     round(results_sum_add.loc[int(ncbi[gene_diag]), "sum"], 4),
+                 )
+                 st.write(
+                     "Gene ID phenotype specificity:",
+                     get_phenotype_specificity(gene_diag, results_sum_add),
+                 )
+                 del p2
+
+             else:
+                 st.write("Gene ID rank:", " Gene not available in PhenoGenius database")
+
+         del sim_dict
+         del hpo_list_add
+         del results_sum_add
+         del match_sim
+
+         st.header("Phenotype matching by groups of symptoms")
+
+         patient_df = (
+             pd.DataFrame(reduced)
+             .set_index(data.index)
+             .apply(lambda x: sum((x - patient_nmf) ** 2), axis=1)
+         )
+
+         witness_df = (
+             pd.DataFrame(reduced)
+             .set_index(data.index)
+             .apply(lambda x: sum((x - witness_nmf) ** 2), axis=1)
+         )
+         del patient_nmf
+         del witness
+         del witness_nmf
+
+         case_df = pd.DataFrame(patient_df - witness_df)
+         case_df.columns = ["score"]
+         case_df["score_norm"] = abs(case_df["score"] - case_df["score"].max())
+         # case_df["frequency"] = matrix_frequency["variant_number"]
+         case_df["sum"] = case_df["score_norm"]  # + case_df["frequency"]
+         case_df_sort = case_df.sort_values(by="sum", ascending=False)
+         case_df_sort["rank"] = (
+             case_df_sort["sum"].rank(ascending=False, method="max").astype(int)
+         )
+         case_df_sort["gene_symbol"] = case_df_sort.index.to_series().apply(get_symbol)
+         match_nmf = case_df_sort[["gene_symbol", "rank", "sum"]]
+         st.dataframe(match_nmf[match_nmf["sum"] > 0.01])
+
+         match_nmf_csv = convert_df(match_nmf)
+
+         st.download_button(
+             "Download matching results",
+             match_nmf_csv,
+             "match_groups.tsv",
+             "text/csv",
+             key="download-csv-match-groups",
+         )
+
+         if gene_diag:
+             if int(ncbi[gene_diag]) in case_df_sort.index:
+                 p3 = (
+                     ggplot(match_nmf, aes("sum"))
+                     + geom_density()
+                     + geom_vline(
+                         xintercept=case_df_sort.loc[int(ncbi[gene_diag]), "sum"],
+                         linetype="dashed",
+                         color="red",
+                         size=1.5,
+                     )
+                     + ggtitle("Matching score distribution")
+                     + xlab("Gene matching score")
+                     + ylab("% of genes")
+                     + theme_bw()
+                     + theme(
+                         text=element_text(size=12),
+                         figure_size=(5, 5),
+                         axis_ticks=element_line(colour="black", size=4),
+                         axis_line=element_line(colour="black", size=2),
+                         axis_text_x=element_text(angle=45, hjust=1),
+                         axis_text_y=element_text(angle=60, hjust=1),
+                         subplots_adjust={"wspace": 0.1},
+                         legend_position=(0.7, 0.35),
+                     )
+                 )
+                 col1, col2, col3 = st.columns(3)
+
+                 with col1:
+                     st.pyplot(ggplot.draw(p3))
+
+                 st.write(
+                     "Gene ID rank:",
+                     case_df_sort.loc[int(ncbi[gene_diag]), "rank"],
+                     " | ",
+                     "Gene ID count:",
+                     round(case_df_sort.loc[int(ncbi[gene_diag]), "sum"], 4),
+                 )
+                 st.write(
+                     "Gene ID phenotype specificity:",
+                     get_phenotype_specificity(gene_diag, case_df_sort),
+                 )
+                 del p3
+             else:
+                 st.write("Gene ID rank:", " Gene not available in PhenoGenius database")
+         del case_df_sort
+         del match_nmf
+         del case_df
+
+     else:
+         st.write(
+             "No HPO terms provided in correct format.",
+         )
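The similarity step in `get_similar_terms` can be read as follows: every input term keeps weight 1, and each neighbour whose similarity exceeds 0.8 is added with its similarity divided by the number of neighbours listed for that input term, keeping the largest weight when a neighbour is reached from several inputs. The sketch below restates that logic on toy data, assuming the similarity dictionary maps each term to a `{neighbour: similarity}` dictionary as the app code implies. The `HP:99999xx` identifiers and similarity values are made up for illustration, and `expand_with_similar_terms` is a hypothetical name, not a function from this commit.

```python
def expand_with_similar_terms(hpo_list, similarity, threshold=0.8):
    """Return {term: weight} for the input terms plus their close neighbours."""
    weights = {}
    for term in hpo_list:
        weights[term] = 1  # input terms always carry full weight
        neighbours = similarity.get(term, {})
        for other, value in neighbours.items():
            if value > threshold:
                # Similarity is diluted by the number of neighbours listed for the term.
                candidate = value / len(neighbours)
                # Keep the highest weight seen so far for this neighbour.
                if candidate > weights.get(other, 0):
                    weights[other] = candidate
    return weights


# Toy similarity table (placeholder neighbours and values, for illustration only)
toy_similarity = {
    "HP:0000107": {"HP:9999991": 0.9, "HP:9999992": 0.85, "HP:9999993": 0.5},
    "HP:0001407": {"HP:9999991": 0.95},
}
weights = expand_with_similar_terms(["HP:0000107", "HP:0001407"], toy_similarity)
for term, weight in weights.items():
    print(term, round(weight, 4))
```

In `score_sim_add`, weights of this kind multiply the matching gene-by-HPO columns before the row sums are ranked, so a neighbour term contributes less than a term the user entered directly.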
poetry.lock ADDED
The diff for this file is too large to render. See raw diff
 
pyproject.toml ADDED
@@ -0,0 +1,21 @@
+ [tool.poetry]
+ name = "PhenoGenius_app"
+ version = "1.1.0"
+ description = ""
+ authors = ["kevin.yauy <[email protected]>"]
+
+ [tool.poetry.dependencies]
+ python = ">=3.11"
+ pandas = ">=1.3.0"
+ ujson = "^5.4.0"
+ streamlit = "^1.11.1"
+ plotnine = "^0.13.0"
+ numpy = ">=1.24,<2.1"
+ scikit-learn = "^1.5.1"
+
+ [tool.poetry.dev-dependencies]
+ pytest = "^5.2"
+
+ [build-system]
+ requires = ["poetry-core>=1.0.0"]
+ build-backend = "poetry.core.masonry.api"
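Several dependencies in this `pyproject.toml` use Poetry's caret operator (`^`). As described in the Poetry dependency-specification documentation, a caret constraint allows updates that leave the left-most non-zero version component unchanged. The sketch below expands the constraints that appear in this file into explicit ranges. `caret_range` is a hypothetical helper written for illustration, not a Poetry API, and it only handles plain dotted numeric versions.

```python
def caret_range(constraint: str) -> str:
    """Expand a caret constraint such as "^1.11.1" into explicit bounds."""
    version = constraint.lstrip("^")
    parts = [int(p) for p in version.split(".")]
    upper = None
    for i, part in enumerate(parts):
        if part != 0:
            # Bump the left-most non-zero component and zero everything after it.
            upper = parts[:i] + [part + 1] + [0] * (len(parts) - i - 1)
            break
    if upper is None:
        # All components are zero: bump the last component that was written.
        upper = parts[:-1] + [parts[-1] + 1]
    return f">={version},<{'.'.join(str(p) for p in upper)}"


for spec in ("^5.4.0", "^1.11.1", "^0.13.0", "^1.5.1", "^5.2"):
    print(spec, "->", caret_range(spec))
```

Read this way, `plotnine = "^0.13.0"` stays below 0.14.0, which is consistent with the `plotnine==0.13.6` pin in the exported `requirements.txt`.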
requirements.txt ADDED
@@ -0,0 +1,57 @@
+ altair==5.4.1 ; python_version >= "3.11"
+ attrs==24.2.0 ; python_version >= "3.11"
+ blinker==1.8.2 ; python_version >= "3.11"
+ cachetools==5.5.0 ; python_version >= "3.11"
+ certifi==2024.8.30 ; python_version >= "3.11"
+ charset-normalizer==3.3.2 ; python_version >= "3.11"
+ click==8.1.7 ; python_version >= "3.11"
+ colorama==0.4.6 ; python_version >= "3.11" and platform_system == "Windows"
+ contourpy==1.3.0 ; python_version >= "3.11"
+ cycler==0.12.1 ; python_version >= "3.11"
+ fonttools==4.53.1 ; python_version >= "3.11"
+ gitdb==4.0.11 ; python_version >= "3.11"
+ gitpython==3.1.43 ; python_version >= "3.11"
+ idna==3.8 ; python_version >= "3.11"
+ jinja2==3.1.4 ; python_version >= "3.11"
+ joblib==1.4.2 ; python_version >= "3.11"
+ jsonschema-specifications==2023.12.1 ; python_version >= "3.11"
+ jsonschema==4.23.0 ; python_version >= "3.11"
+ kiwisolver==1.4.7 ; python_version >= "3.11"
+ markdown-it-py==3.0.0 ; python_version >= "3.11"
+ markupsafe==2.1.5 ; python_version >= "3.11"
+ matplotlib==3.9.2 ; python_version >= "3.11"
+ mdurl==0.1.2 ; python_version >= "3.11"
+ mizani==0.11.4 ; python_version >= "3.11"
+ narwhals==1.6.2 ; python_version >= "3.11"
+ numpy==2.0.2 ; python_version >= "3.11"
+ packaging==24.1 ; python_version >= "3.11"
+ pandas==2.2.2 ; python_version >= "3.11"
+ patsy==0.5.6 ; python_version >= "3.11"
+ pillow==10.4.0 ; python_version >= "3.11"
+ plotnine==0.13.6 ; python_version >= "3.11"
+ protobuf==5.28.0 ; python_version >= "3.11"
+ pyarrow==17.0.0 ; python_version >= "3.11"
+ pydeck==0.9.1 ; python_version >= "3.11"
+ pygments==2.18.0 ; python_version >= "3.11"
+ pyparsing==3.1.4 ; python_version >= "3.11"
+ python-dateutil==2.9.0.post0 ; python_version >= "3.11"
+ pytz==2024.1 ; python_version >= "3.11"
+ referencing==0.35.1 ; python_version >= "3.11"
+ requests==2.32.3 ; python_version >= "3.11"
+ rich==13.8.0 ; python_version >= "3.11"
+ rpds-py==0.20.0 ; python_version >= "3.11"
+ scikit-learn==1.5.1 ; python_version >= "3.11"
+ scipy==1.14.1 ; python_version >= "3.11"
+ six==1.16.0 ; python_version >= "3.11"
+ smmap==5.0.1 ; python_version >= "3.11"
+ statsmodels==0.14.2 ; python_version >= "3.11"
+ streamlit==1.38.0 ; python_version >= "3.11"
+ tenacity==8.5.0 ; python_version >= "3.11"
+ threadpoolctl==3.5.0 ; python_version >= "3.11"
+ toml==0.10.2 ; python_version >= "3.11"
+ tornado==6.4.1 ; python_version >= "3.11"
+ typing-extensions==4.12.2 ; python_version >= "3.11"
+ tzdata==2024.1 ; python_version >= "3.11"
+ ujson==5.10.0 ; python_version >= "3.11"
+ urllib3==2.2.2 ; python_version >= "3.11"
+ watchdog==4.0.2 ; platform_system != "Darwin" and python_version >= "3.11"