Paper2Agent committed · Commit ffcb052 (verified) · 1 parent: b205f27

Upload 5 files

Files changed (5):
  1. Dockerfile +11 -0
  2. README.md +3 -5
  3. requirements.txt +14 -0
  4. scanpy_mcp.py +79 -0
  5. tools/clustering.py +799 -0
Dockerfile ADDED
@@ -0,0 +1,11 @@
+ FROM python:3.12
+ WORKDIR /app
+ COPY requirements.txt .
+ RUN mkdir -p /tmp/numba_cache && chmod -R 777 /tmp/numba_cache
+ ENV NUMBA_CACHE_DIR=/tmp/numba_cache
+ RUN pip install --no-cache-dir -r requirements.txt
+ COPY scanpy_mcp.py .
+ COPY tools/ tools/
+ RUN mkdir -p /app/data/upload /data/tmp_inputs /data/tmp_outputs && chmod -R 777 /app/data/upload /data
+ EXPOSE 7860
+ CMD ["uvicorn", "scanpy_mcp:app", "--host", "0.0.0.0", "--port", "7860"]
README.md CHANGED
@@ -1,12 +1,10 @@
  ---
  title: Scanpy Mcp
- emoji: 📈
- colorFrom: purple
- colorTo: gray
+ emoji: 🏢
+ colorFrom: red
+ colorTo: yellow
  sdk: docker
  pinned: false
- license: bsd-3-clause
- short_description: Paper2Agent-generated MCP server
  ---

  Check out the configuration reference at https://huggingface.co/docs/hub/spaces-config-reference
requirements.txt ADDED
@@ -0,0 +1,14 @@
+ anndata
+ fastmcp
+ igraph
+ leidenalg
+ matplotlib
+ numpy
+ pandas
+ scanpy
+ scipy
+ uv
+ uvicorn
+ scikit-image
+ fastapi
+ starlette==0.47.3
scanpy_mcp.py ADDED
@@ -0,0 +1,79 @@
+ """
+ Model Context Protocol (MCP) server for scanpy.
+
+ Scanpy is a scalable toolkit for analyzing single-cell gene expression data, built jointly with anndata.
+ It provides preprocessing, visualization, clustering, pseudotime and trajectory inference, differential expression testing, and integration of heterogeneous datasets.
+ This codebase focuses on fundamental single-cell RNA sequencing analysis workflows, including quality control, normalization, dimensionality reduction, and clustering.
+
+ This MCP server contains the tools extracted from the following tutorials:
+ 1. clustering
+    - quality_control: Calculate and visualize QC metrics, filter cells and genes, detect doublets
+    - normalize_data: Normalize count data with median total counts and log transformation
+    - select_features: Identify highly variable genes for feature selection
+    - reduce_dimensionality: Perform PCA analysis and variance visualization
+    - build_neighborhood_graph: Construct nearest neighbor graph and UMAP embedding
+    - cluster_cells: Perform Leiden clustering with visualization
+    - annotate_cell_types: Multi-resolution clustering, marker gene analysis, and differential expression
+ """
+
+ import os
+ import uuid
+ from fastmcp import FastMCP
+ from starlette.requests import Request
+ from starlette.responses import PlainTextResponse, JSONResponse
+ from fastapi.staticfiles import StaticFiles
+
+
+ # Import the MCP tools from the tools folder
+ from tools.clustering import clustering_mcp
+
+ # Define the MCP server
+ mcp = FastMCP(name="scanpy")
+
+ # Mount the tools
+ mcp.mount(clustering_mcp)
+
+ # Uploads go to the persistent /data volume (writable on HF Spaces)
+ UPLOAD_DIR = "/data/upload"
+ os.makedirs(UPLOAD_DIR, exist_ok=True)
+
+ @mcp.custom_route("/health", methods=["GET"])
+ async def health_check(request: Request) -> PlainTextResponse:
+     return PlainTextResponse("OK")
+
+
+ @mcp.custom_route("/", methods=["GET"])
+ async def index(request: Request) -> PlainTextResponse:
+     return PlainTextResponse("MCP is on https://Paper2Agent-scanpy-mcp.hf.space/mcp")
+
+ # Upload route
+ @mcp.custom_route("/upload", methods=["POST"])
+ async def upload(request: Request):
+     form = await request.form()
+     up = form.get("file")
+     if up is None:
+         return JSONResponse({"error": "missing form field 'file'"}, status_code=400)
+
+     # Generate a safe filename
+     orig = getattr(up, "filename", "") or ""
+     ext = os.path.splitext(orig)[1]
+     name = f"{uuid.uuid4().hex}{ext}"
+     dst = os.path.join(UPLOAD_DIR, name)
+
+     # up is a Starlette UploadFile-like object
+     with open(dst, "wb") as out:
+         out.write(await up.read())
+
+     # Return the absolute local path
+     abs_path = os.path.abspath(dst)
+     return JSONResponse({"path": abs_path})
+
+ app = mcp.http_app(path="/mcp")
+ # Serve uploaded input files
+ app.mount("/files", StaticFiles(directory=UPLOAD_DIR), name="files")
+ # Serve generated output files
+ app.mount("/outputs", StaticFiles(directory="/data/tmp_outputs"), name="outputs")
+
+ # Run the MCP server
+ if __name__ == "__main__":
+     mcp.run(transport="http", host="127.0.0.1", port=8003)
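
A minimal client sketch for the upload route above (not part of the commit): it POSTs a dataset to /upload and receives the server-side path to pass as data_path to the MCP tools. The Space URL comes from index(); the local filename is hypothetical.

import requests

SPACE_URL = "https://Paper2Agent-scanpy-mcp.hf.space"

with open("pbmc3k_raw.h5ad", "rb") as f:  # hypothetical local file
    resp = requests.post(f"{SPACE_URL}/upload", files={"file": f})
resp.raise_for_status()
print(resp.json())  # e.g. {"path": "/data/upload/<uuid>.h5ad"}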
tools/clustering.py ADDED
@@ -0,0 +1,799 @@
+ """
+ Scanpy tutorial for single-cell RNA sequencing preprocessing and clustering analysis.
+
+ This MCP Server provides 7 tools:
+ 1. quality_control: Calculate and visualize QC metrics, filter cells and genes, detect doublets
+ 2. normalize_data: Normalize count data with median total counts and log transformation
+ 3. select_features: Identify highly variable genes for feature selection
+ 4. reduce_dimensionality: Perform PCA analysis and variance visualization
+ 5. build_neighborhood_graph: Construct nearest neighbor graph and UMAP embedding
+ 6. cluster_cells: Perform Leiden clustering with visualization
+ 7. annotate_cell_types: Multi-resolution clustering, marker gene analysis, and differential expression
+
+ All tools extracted from `https://github.com/scverse/scanpy/tree/main/docs/tutorials/basics/clustering.ipynb`.
+ """
+
+ # Standard imports
+ from typing import Annotated, Literal, Any
+ import pandas as pd
+ import numpy as np
+ from pathlib import Path
+ import os
+ from fastmcp import FastMCP
+ from datetime import datetime
+ import matplotlib.pyplot as plt
+
+ # Scanpy and related imports
+ import scanpy as sc
+ import anndata as ad
+
+ # Base persistent directory (HF Spaces guarantees /data is writable & persistent)
+ BASE_DIR = Path("/data")
+
+ DEFAULT_INPUT_DIR = BASE_DIR / "tmp_inputs"
+ DEFAULT_OUTPUT_DIR = BASE_DIR / "tmp_outputs"
+
+ INPUT_DIR = Path(os.environ.get("CLUSTERING_INPUT_DIR", DEFAULT_INPUT_DIR))
+ OUTPUT_DIR = Path(os.environ.get("CLUSTERING_OUTPUT_DIR", DEFAULT_OUTPUT_DIR))
+
+ # Ensure directories exist
+ INPUT_DIR.mkdir(parents=True, exist_ok=True)
+ OUTPUT_DIR.mkdir(parents=True, exist_ok=True)
+
+ # Timestamp for unique outputs
+ timestamp = datetime.now().strftime("%Y%m%d_%H%M%S")
+
+ # MCP server instance
+ clustering_mcp = FastMCP(name="clustering")
+
+ # Set scanpy figure parameters
+ sc.settings.set_figure_params(dpi=300, facecolor="white")
+
+ @clustering_mcp.tool
+ def quality_control(
+     # Primary data inputs
+     data_path: Annotated[str | None, "Path to h5ad file or directory with 10X data. The h5ad file should contain raw count data in AnnData format."] = None,
+     # Analysis parameters with tutorial defaults
+     mt_prefix: Annotated[str, "Prefix for mitochondrial genes"] = "MT-",
+     ribo_prefixes: Annotated[list, "Prefixes for ribosomal genes"] = ["RPS", "RPL"],
+     hb_pattern: Annotated[str, "Pattern for hemoglobin genes"] = "^HB[^(P)]",
+     min_genes: Annotated[int, "Minimum number of genes expressed per cell"] = 100,
+     min_cells: Annotated[int, "Minimum number of cells expressing a gene"] = 3,
+     batch_key: Annotated[str | None, "Column name in adata.obs for batch information"] = None,
+     out_prefix: Annotated[str | None, "Output file prefix"] = None,
+ ) -> dict:
+     """
+     Calculate quality control metrics, visualize QC distributions, and filter low-quality cells and genes.
+     Input is single-cell count data in AnnData format and output is QC plots, filtered data, and doublet scores.
+     """
+     # Validate input
+     if data_path is None:
+         raise ValueError("Path to h5ad file or 10X data directory must be provided")
+
+     # Set output prefix
+     if out_prefix is None:
+         out_prefix = f"qc_{timestamp}"
+
+     # Load data
+     data_path = Path(data_path)
+     if data_path.is_dir():
+         # Assume 10X directory format
+         adata = sc.read_10x_mtx(data_path)
+         adata.var_names_make_unique()
+     elif data_path.suffix in ['.h5', '.h5ad']:
+         if data_path.suffix == '.h5':
+             adata = sc.read_10x_h5(data_path)
+             adata.var_names_make_unique()
+         else:
+             adata = ad.read_h5ad(data_path)
+     else:
+         raise ValueError("data_path must be a directory with 10X data or an h5/h5ad file")
+
+     # Define gene categories
+     adata.var["mt"] = adata.var_names.str.startswith(mt_prefix)
+     adata.var["ribo"] = adata.var_names.str.startswith(tuple(ribo_prefixes))
+     adata.var["hb"] = adata.var_names.str.contains(hb_pattern)
+
+     # Calculate QC metrics
+     sc.pp.calculate_qc_metrics(
+         adata, qc_vars=["mt", "ribo", "hb"], inplace=True, log1p=True
+     )
+
+     # Create QC violin plots
+     plt.figure(figsize=(12, 4))
+     sc.pl.violin(
+         adata,
+         ["n_genes_by_counts", "total_counts", "pct_counts_mt"],
+         jitter=0.4,
+         multi_panel=True,
+     )
+     violin_path = OUTPUT_DIR / f"{out_prefix}_qc_violin.png"
+     plt.savefig(violin_path, dpi=300, bbox_inches='tight')
+     plt.close()
+
+     # Create QC scatter plot
+     plt.figure(figsize=(8, 6))
+     sc.pl.scatter(adata, "total_counts", "n_genes_by_counts", color="pct_counts_mt")
+     scatter_path = OUTPUT_DIR / f"{out_prefix}_qc_scatter.png"
+     plt.savefig(scatter_path, dpi=300, bbox_inches='tight')
+     plt.close()
+
+     # Filter cells and genes
+     print(f"Before filtering: {adata.n_obs} cells, {adata.n_vars} genes")
+     sc.pp.filter_cells(adata, min_genes=min_genes)
+     sc.pp.filter_genes(adata, min_cells=min_cells)
+     print(f"After filtering: {adata.n_obs} cells, {adata.n_vars} genes")
+
+     # Doublet detection
+     if batch_key and batch_key in adata.obs.columns:
+         sc.pp.scrublet(adata, batch_key=batch_key)
+     else:
+         sc.pp.scrublet(adata)
+
+     # Save processed data
+     output_file = OUTPUT_DIR / f"{out_prefix}_qc_processed.h5ad"
+     adata.write_h5ad(output_file)
+
+     # Save QC metrics summary
+     qc_summary = pd.DataFrame({
+         'metric': ['n_obs', 'n_vars', 'mean_n_genes_by_counts', 'mean_total_counts', 'mean_pct_counts_mt', 'doublet_rate'],
+         'value': [
+             adata.n_obs,
+             adata.n_vars,
+             adata.obs['n_genes_by_counts'].mean(),
+             adata.obs['total_counts'].mean(),
+             adata.obs['pct_counts_mt'].mean(),
+             adata.obs['predicted_doublet'].sum() / adata.n_obs
+         ]
+     })
+     qc_summary_path = OUTPUT_DIR / f"{out_prefix}_qc_summary.csv"
+     qc_summary.to_csv(qc_summary_path, index=False)
+
+     return {
+         "message": f"Quality control completed for {adata.n_obs} cells and {adata.n_vars} genes",
+         "reference": "https://github.com/scverse/scanpy/tree/main/docs/tutorials/basics/clustering.ipynb",
+         "artifacts": [
+             {
+                 "description": "QC violin plots",
+                 "path": str(violin_path.resolve())
+             },
+             {
+                 "description": "QC scatter plot",
+                 "path": str(scatter_path.resolve())
+             },
+             {
+                 "description": "QC processed data",
+                 "path": str(output_file.resolve())
+             },
+             {
+                 "description": "QC metrics summary",
+                 "path": str(qc_summary_path.resolve())
+             }
+         ]
+     }
+
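
The gene-flagging logic above reduces to pandas string operations on var_names; a toy illustration (hypothetical gene names):

import pandas as pd

var_names = pd.Index(["MT-CO1", "RPS4X", "RPL5", "HBB", "HBP1", "CD3D"])
mt = var_names.str.startswith("MT-")             # mitochondrial genes
ribo = var_names.str.startswith(("RPS", "RPL"))  # ribosomal genes
hb = var_names.str.contains("^HB[^(P)]")         # hemoglobin; the [^(P)] class excludes HBP1
for name, flags in zip(var_names, zip(mt, ribo, hb)):
    print(name, flags)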
+ @clustering_mcp.tool
+ def normalize_data(
+     # Primary data inputs
+     data_path: Annotated[str, "Path to h5ad file with QC-processed single-cell data. Should be output from quality_control tool."],
+     # Analysis parameters with tutorial defaults
+     target_sum: Annotated[float | None, "Target sum for normalization. None uses median total counts"] = None,
+     out_prefix: Annotated[str | None, "Output file prefix"] = None,
+ ) -> dict:
+     """
+     Normalize count data using median total counts scaling followed by log1p transformation.
+     Input is a quality-controlled AnnData object and output is normalized expression data.
+     """
+     # Validate input
+     if data_path is None:
+         raise ValueError("Path to h5ad file must be provided")
+
+     # Set output prefix
+     if out_prefix is None:
+         out_prefix = f"normalized_{timestamp}"
+
+     # Load data
+     adata = ad.read_h5ad(data_path)
+
+     # Save the raw count data
+     adata.layers["counts"] = adata.X.copy()
+
+     # Normalize to median total counts (or target_sum if specified)
+     sc.pp.normalize_total(adata, target_sum=target_sum)
+     # Logarithmize the data
+     sc.pp.log1p(adata)
+
+     # Save normalized data
+     output_file = OUTPUT_DIR / f"{out_prefix}_normalized.h5ad"
+     adata.write_h5ad(output_file)
+
+     # Create normalization summary
+     from scipy import sparse
+
+     # Handle sparse matrices properly
+     if sparse.issparse(adata.layers["counts"]):
+         counts_mean = adata.layers["counts"].mean()
+         counts_std = np.sqrt(adata.layers["counts"].multiply(adata.layers["counts"]).mean() - counts_mean**2)
+     else:
+         counts_mean = np.mean(adata.layers["counts"])
+         counts_std = np.std(adata.layers["counts"])
+
+     if sparse.issparse(adata.X):
+         x_mean = adata.X.mean()
+         x_std = np.sqrt(adata.X.multiply(adata.X).mean() - x_mean**2)
+     else:
+         x_mean = np.mean(adata.X)
+         x_std = np.std(adata.X)
+
+     norm_summary = pd.DataFrame({
+         'layer': ['raw_counts', 'normalized_log1p'],
+         'mean_expression': [float(counts_mean), float(x_mean)],
+         'std_expression': [float(counts_std), float(x_std)]
+     })
+     summary_path = OUTPUT_DIR / f"{out_prefix}_normalization_summary.csv"
+     norm_summary.to_csv(summary_path, index=False)
+
+     return {
+         "message": f"Data normalized with log1p transformation for {adata.n_obs} cells",
+         "reference": "https://github.com/scverse/scanpy/tree/main/docs/tutorials/basics/clustering.ipynb",
+         "artifacts": [
+             {
+                 "description": "Normalized data",
+                 "path": str(output_file.resolve())
+             },
+             {
+                 "description": "Normalization summary",
+                 "path": str(summary_path.resolve())
+             }
+         ]
+     }
+
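
The sparse branches above rely on the identity Var(X) = E[X²] − E[X]², since scipy sparse matrices have no .std(); a self-contained sanity check:

import numpy as np
from scipy import sparse

x = sparse.random(100, 50, density=0.1, format="csr", random_state=0)
mean = x.mean()                                # mean over all entries, zeros included
std = np.sqrt(x.multiply(x).mean() - mean**2)  # sqrt(E[X^2] - E[X]^2)
assert np.isclose(std, x.toarray().std())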
+ @clustering_mcp.tool
+ def select_features(
+     # Primary data inputs
+     data_path: Annotated[str, "Path to h5ad file with normalized single-cell data. Should be output from normalize_data tool."],
+     # Analysis parameters with tutorial defaults
+     n_top_genes: Annotated[int, "Number of highly variable genes to select"] = 2000,
+     batch_key: Annotated[str | None, "Column name in adata.obs for batch correction"] = None,
+     flavor: Annotated[Literal["seurat", "cell_ranger", "seurat_v3"], "Method for highly variable gene selection"] = "seurat",
+     out_prefix: Annotated[str | None, "Output file prefix"] = None,
+ ) -> dict:
+     """
+     Identify highly variable genes for feature selection using the specified method.
+     Input is a normalized AnnData object and output is a feature selection plot and filtered data.
+     """
+     # Validate input
+     if data_path is None:
+         raise ValueError("Path to h5ad file must be provided")
+
+     # Set output prefix
+     if out_prefix is None:
+         out_prefix = f"features_{timestamp}"
+
+     # Load data
+     adata = ad.read_h5ad(data_path)
+
+     # Find highly variable genes
+     if batch_key and batch_key in adata.obs.columns:
+         sc.pp.highly_variable_genes(adata, n_top_genes=n_top_genes, batch_key=batch_key, flavor=flavor)
+     else:
+         sc.pp.highly_variable_genes(adata, n_top_genes=n_top_genes, flavor=flavor)
+
+     # Plot highly variable genes
+     plt.figure(figsize=(10, 6))
+     sc.pl.highly_variable_genes(adata)
+     plot_path = OUTPUT_DIR / f"{out_prefix}_highly_variable_genes.png"
+     plt.savefig(plot_path, dpi=300, bbox_inches='tight')
+     plt.close()
+
+     # Save data with feature selection
+     output_file = OUTPUT_DIR / f"{out_prefix}_feature_selected.h5ad"
+     adata.write_h5ad(output_file)
+
+     # Create feature selection summary
+     n_highly_var = adata.var['highly_variable'].sum()
+     feature_summary = pd.DataFrame({
+         'metric': ['total_genes', 'highly_variable_genes', 'selection_fraction'],
+         'value': [
+             adata.n_vars,
+             n_highly_var,
+             n_highly_var / adata.n_vars
+         ]
+     })
+     summary_path = OUTPUT_DIR / f"{out_prefix}_feature_summary.csv"
+     feature_summary.to_csv(summary_path, index=False)
+
+     return {
+         "message": f"Selected {n_highly_var} highly variable genes from {adata.n_vars} total genes",
+         "reference": "https://github.com/scverse/scanpy/tree/main/docs/tutorials/basics/clustering.ipynb",
+         "artifacts": [
+             {
+                 "description": "Highly variable genes plot",
+                 "path": str(plot_path.resolve())
+             },
+             {
+                 "description": "Feature selected data",
+                 "path": str(output_file.resolve())
+             },
+             {
+                 "description": "Feature selection summary",
+                 "path": str(summary_path.resolve())
+             }
+         ]
+     }
+
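
A quick sanity check of the selection on a small dataset bundled with scanpy (a sketch; pbmc68k_reduced ships with scanpy and is already log-normalized, matching what this tool expects):

import scanpy as sc

adata = sc.datasets.pbmc68k_reduced()
sc.pp.highly_variable_genes(adata, n_top_genes=500)
print(adata.var["highly_variable"].sum())  # 500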
+ @clustering_mcp.tool
+ def reduce_dimensionality(
+     # Primary data inputs
+     data_path: Annotated[str, "Path to h5ad file with feature-selected data. Should be output from select_features tool."],
+     # Analysis parameters with tutorial defaults
+     n_comps: Annotated[int, "Number of principal components to compute"] = 50,
+     use_highly_variable: Annotated[bool, "Whether to use only highly variable genes"] = True,
+     n_pcs_plot: Annotated[int, "Number of PCs to show in variance plot"] = 50,
+     color_vars: Annotated[list, "Variables to color PCA plot by"] = ["sample", "pct_counts_mt"],
+     out_prefix: Annotated[str | None, "Output file prefix"] = None,
+ ) -> dict:
+     """
+     Perform principal component analysis for dimensionality reduction and visualization.
+     Input is a feature-selected AnnData object and output is PCA embeddings and variance plots.
+     """
+     # Validate input
+     if data_path is None:
+         raise ValueError("Path to h5ad file must be provided")
+
+     # Set output prefix
+     if out_prefix is None:
+         out_prefix = f"pca_{timestamp}"
+
+     # Load data
+     adata = ad.read_h5ad(data_path)
+
+     # Perform PCA
+     sc.tl.pca(adata, n_comps=n_comps, use_highly_variable=use_highly_variable)
+
+     # Plot PCA variance ratio
+     plt.figure(figsize=(10, 6))
+     sc.pl.pca_variance_ratio(adata, n_pcs=n_pcs_plot, log=True)
+     variance_path = OUTPUT_DIR / f"{out_prefix}_pca_variance.png"
+     plt.savefig(variance_path, dpi=300, bbox_inches='tight')
+     plt.close()
+
+     # Plot PCA colored by specified variables
+     available_vars = [var for var in color_vars if var in adata.obs.columns]
+     if available_vars:
+         # Create color/dimension combinations for plotting
+         plot_colors = []
+         plot_dims = []
+         for var in available_vars[:2]:  # Limit to 2 variables to match the tutorial
+             plot_colors.extend([var, var])
+             plot_dims.extend([(0, 1), (2, 3)])
+
+         plt.figure(figsize=(12, 8))
+         sc.pl.pca(
+             adata,
+             color=plot_colors,
+             dimensions=plot_dims,
+             ncols=2,
+             size=2,
+         )
+         pca_path = OUTPUT_DIR / f"{out_prefix}_pca_colored.png"
+         plt.savefig(pca_path, dpi=300, bbox_inches='tight')
+         plt.close()
+         pca_artifacts = [{"description": "PCA colored by variables", "path": str(pca_path.resolve())}]
+     else:
+         pca_artifacts = []
+
+     # Save data with PCA
+     output_file = OUTPUT_DIR / f"{out_prefix}_pca.h5ad"
+     adata.write_h5ad(output_file)
+
+     # Create PCA summary
+     pca_summary = pd.DataFrame({
+         'PC': [f'PC{i+1}' for i in range(min(10, n_comps))],
+         'variance_ratio': adata.uns['pca']['variance_ratio'][:min(10, n_comps)]
+     })
+     summary_path = OUTPUT_DIR / f"{out_prefix}_pca_summary.csv"
+     pca_summary.to_csv(summary_path, index=False)
+
+     artifacts = [
+         {
+             "description": "PCA variance plot",
+             "path": str(variance_path.resolve())
+         },
+         {
+             "description": "PCA processed data",
+             "path": str(output_file.resolve())
+         },
+         {
+             "description": "PCA summary",
+             "path": str(summary_path.resolve())
+         }
+     ] + pca_artifacts
+
+     return {
+         "message": f"PCA completed with {n_comps} components explaining {adata.uns['pca']['variance_ratio'].sum():.2%} variance",
+         "reference": "https://github.com/scverse/scanpy/tree/main/docs/tutorials/basics/clustering.ipynb",
+         "artifacts": artifacts
+     }
+
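
One way to turn the variance plot into a concrete n_pcs for the next step (a sketch, assuming adata holds the PCA result written by this tool):

import numpy as np

vr = adata.uns["pca"]["variance_ratio"]
# smallest number of PCs whose cumulative explained variance reaches 90%
n_pcs = min(int(np.searchsorted(np.cumsum(vr), 0.90)) + 1, len(vr))
print(n_pcs)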
+ @clustering_mcp.tool
+ def build_neighborhood_graph(
+     # Primary data inputs
+     data_path: Annotated[str, "Path to h5ad file with PCA data. Should be output from reduce_dimensionality tool."],
+     # Analysis parameters with tutorial defaults
+     n_neighbors: Annotated[int, "Number of neighbors for graph construction"] = 15,
+     n_pcs: Annotated[int | None, "Number of principal components to use"] = None,
+     color_by: Annotated[str, "Variable to color UMAP by"] = "sample",
+     point_size: Annotated[float, "Point size for UMAP plot"] = 2,
+     out_prefix: Annotated[str | None, "Output file prefix"] = None,
+ ) -> dict:
+     """
+     Build a nearest neighbor graph from PCA space and compute a UMAP embedding for visualization.
+     Input is a PCA-processed AnnData object and output is the neighbor graph, UMAP embedding, and visualization.
+     """
+     # Validate input
+     if data_path is None:
+         raise ValueError("Path to h5ad file must be provided")
+
+     # Set output prefix
+     if out_prefix is None:
+         out_prefix = f"neighbors_{timestamp}"
+
+     # Load data
+     adata = ad.read_h5ad(data_path)
+
+     # Compute the neighborhood graph
+     sc.pp.neighbors(adata, n_neighbors=n_neighbors, n_pcs=n_pcs)
+
+     # Compute UMAP
+     sc.tl.umap(adata)
+
+     # Plot UMAP
+     if color_by in adata.obs.columns:
+         plt.figure(figsize=(8, 6))
+         sc.pl.umap(adata, color=color_by, size=point_size)
+         umap_path = OUTPUT_DIR / f"{out_prefix}_umap.png"
+         plt.savefig(umap_path, dpi=300, bbox_inches='tight')
+         plt.close()
+     else:
+         # Plot without coloring if the variable doesn't exist
+         plt.figure(figsize=(8, 6))
+         sc.pl.umap(adata, size=point_size)
+         umap_path = OUTPUT_DIR / f"{out_prefix}_umap.png"
+         plt.savefig(umap_path, dpi=300, bbox_inches='tight')
+         plt.close()
+
+     # Save data with neighborhood graph and UMAP
+     output_file = OUTPUT_DIR / f"{out_prefix}_neighbors.h5ad"
+     adata.write_h5ad(output_file)
+
+     # Create neighborhood summary
+     neighbor_summary = pd.DataFrame({
+         'metric': ['n_neighbors', 'n_pcs_used', 'umap_dimensions'],
+         'value': [n_neighbors, n_pcs, adata.obsm['X_umap'].shape[1]]
+     })
+     summary_path = OUTPUT_DIR / f"{out_prefix}_neighbor_summary.csv"
+     neighbor_summary.to_csv(summary_path, index=False)
+
+     return {
+         "message": f"Neighborhood graph and UMAP completed for {adata.n_obs} cells",
+         "reference": "https://github.com/scverse/scanpy/tree/main/docs/tutorials/basics/clustering.ipynb",
+         "artifacts": [
+             {
+                 "description": "UMAP visualization",
+                 "path": str(umap_path.resolve())
+             },
+             {
+                 "description": "Neighborhood graph data",
+                 "path": str(output_file.resolve())
+             },
+             {
+                 "description": "Neighborhood summary",
+                 "path": str(summary_path.resolve())
+             }
+         ]
+     }
+
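
After this step the graph and embedding live in standard scanpy slots; a quick inspection sketch, assuming adata is the object written to the *_neighbors.h5ad file above:

print(adata.obsp["distances"].shape)      # sparse kNN distance matrix, n_cells x n_cells
print(adata.obsp["connectivities"].nnz)   # number of stored weighted graph edges
print(adata.obsm["X_umap"].shape)         # UMAP coordinates, (n_cells, 2)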
+ @clustering_mcp.tool
+ def cluster_cells(
+     # Primary data inputs
+     data_path: Annotated[str, "Path to h5ad file with neighborhood graph. Should be output from build_neighborhood_graph tool."],
+     # Analysis parameters with tutorial defaults
+     resolution: Annotated[float, "Resolution parameter for Leiden clustering"] = 0.5,
+     flavor: Annotated[Literal["igraph", "leidenalg"], "Leiden algorithm implementation"] = "igraph",
+     n_iterations: Annotated[int, "Number of iterations for clustering"] = 2,
+     cluster_key: Annotated[str, "Key name for storing clusters in adata.obs"] = "leiden",
+     out_prefix: Annotated[str | None, "Output file prefix"] = None,
+ ) -> dict:
+     """
+     Perform Leiden clustering on the neighborhood graph and visualize the results.
+     Input is an AnnData object with a neighborhood graph and output is clustered data with a UMAP visualization.
+     """
+     # Validate input
+     if data_path is None:
+         raise ValueError("Path to h5ad file must be provided")
+
+     # Set output prefix
+     if out_prefix is None:
+         out_prefix = f"clusters_{timestamp}"
+
+     # Load data
+     adata = ad.read_h5ad(data_path)
+
+     # Perform Leiden clustering
+     sc.tl.leiden(
+         adata,
+         resolution=resolution,
+         flavor=flavor,
+         n_iterations=n_iterations,
+         key_added=cluster_key
+     )
+
+     # Plot UMAP colored by clusters
+     plt.figure(figsize=(8, 6))
+     sc.pl.umap(adata, color=[cluster_key])
+     cluster_path = OUTPUT_DIR / f"{out_prefix}_clusters_umap.png"
+     plt.savefig(cluster_path, dpi=300, bbox_inches='tight')
+     plt.close()
+
+     # Save clustered data
+     output_file = OUTPUT_DIR / f"{out_prefix}_clustered.h5ad"
+     adata.write_h5ad(output_file)
+
+     # Create clustering summary
+     n_clusters = len(adata.obs[cluster_key].unique())
+     cluster_counts = adata.obs[cluster_key].value_counts().sort_index()
+
+     cluster_summary = pd.DataFrame({
+         'cluster': cluster_counts.index,
+         'n_cells': cluster_counts.values,
+         'fraction': cluster_counts.values / adata.n_obs
+     })
+     summary_path = OUTPUT_DIR / f"{out_prefix}_cluster_summary.csv"
+     cluster_summary.to_csv(summary_path, index=False)
+
+     return {
+         "message": f"Leiden clustering identified {n_clusters} clusters at resolution {resolution}",
+         "reference": "https://github.com/scverse/scanpy/tree/main/docs/tutorials/basics/clustering.ipynb",
+         "artifacts": [
+             {
+                 "description": "Clusters UMAP plot",
+                 "path": str(cluster_path.resolve())
+             },
+             {
+                 "description": "Clustered data",
+                 "path": str(output_file.resolve())
+             },
+             {
+                 "description": "Cluster summary",
+                 "path": str(summary_path.resolve())
+             }
+         ]
+     }
+
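
Resolution is the main knob here; a sweep sketch (assuming adata carries the neighbor graph from the previous tool) shows how cluster counts grow with resolution:

import scanpy as sc

for res in (0.1, 0.5, 1.0):
    sc.tl.leiden(adata, resolution=res, flavor="igraph", n_iterations=2,
                 key_added=f"leiden_{res}")
    print(res, adata.obs[f"leiden_{res}"].nunique())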
+ @clustering_mcp.tool
+ def annotate_cell_types(
+     # Primary data inputs
+     data_path: Annotated[str, "Path to h5ad file with clustered data. Should be output from cluster_cells tool."],
+     # Analysis parameters with tutorial defaults
+     resolutions: Annotated[list, "List of resolutions for multi-resolution clustering"] = [0.02, 0.5, 2.0],
+     groupby_key: Annotated[str, "Clustering key to use for marker analysis"] = "leiden_res_0.50",
+     method: Annotated[Literal["wilcoxon", "t-test", "logreg"], "Method for differential expression"] = "wilcoxon",
+     n_genes: Annotated[int, "Number of top genes to show in plots"] = 5,
+     marker_genes: Annotated[dict | None, "Dictionary of cell type marker genes"] = None,
+     out_prefix: Annotated[str | None, "Output file prefix"] = None,
+ ) -> dict:
+     """
+     Perform multi-resolution clustering, marker gene analysis, and differential expression for cell type annotation.
+     Input is a clustered AnnData object and output is multi-resolution plots, marker analysis, and differential expression results.
+     """
+     # Validate input
+     if data_path is None:
+         raise ValueError("Path to h5ad file must be provided")
+
+     # Set output prefix
+     if out_prefix is None:
+         out_prefix = f"annotation_{timestamp}"
+
+     # Load data
+     adata = ad.read_h5ad(data_path)
+
+     # Define default marker genes if not provided
+     if marker_genes is None:
+         marker_genes = {
+             "CD14+ Mono": ["FCN1", "CD14"],
+             "CD16+ Mono": ["TCF7L2", "FCGR3A", "LYN"],
+             "cDC2": ["CST3", "COTL1", "LYZ", "DMXL2", "CLEC10A", "FCER1A"],
+             "Erythroblast": ["MKI67", "HBA1", "HBB"],
+             "Proerythroblast": ["CDK6", "SYNGR1", "HBM", "GYPA"],
+             "NK": ["GNLY", "NKG7", "CD247", "FCER1G", "TYROBP", "KLRG1", "FCGR3A"],
+             "ILC": ["ID2", "PLCG2", "GNLY", "SYNE1"],
+             "Naive CD20+ B": ["MS4A1", "IL4R", "IGHD", "FCRL1", "IGHM"],
+             "B cells": ["MS4A1", "ITGB1", "COL4A4", "PRDM1", "IRF4", "PAX5", "BCL11A", "BLK", "IGHD", "IGHM"],
+             "Plasma cells": ["MZB1", "HSP90B1", "FNDC3B", "PRDM1", "IGKC", "JCHAIN"],
+             "Plasmablast": ["XBP1", "PRDM1", "PAX5"],
+             "CD4+ T": ["CD4", "IL7R", "TRBC2"],
+             "CD8+ T": ["CD8A", "CD8B", "GZMK", "GZMA", "CCL5", "GZMB", "GZMH"],
+             "T naive": ["LEF1", "CCR7", "TCF7"],
+             "pDC": ["GZMB", "IL3RA", "COBLL1", "TCF4"],
+         }
+
+     # Perform multi-resolution clustering
+     for res in resolutions:
+         sc.tl.leiden(
+             adata, key_added=f"leiden_res_{res:4.2f}", resolution=res, flavor="igraph"
+         )
+
+     # Plot multi-resolution clustering
+     cluster_keys = [f"leiden_res_{res:4.2f}" for res in resolutions]
+     plt.figure(figsize=(15, 5))
+     sc.pl.umap(
+         adata,
+         color=cluster_keys,
+         legend_loc="on data",
+     )
+     multiresolution_path = OUTPUT_DIR / f"{out_prefix}_multiresolution_clusters.png"
+     plt.savefig(multiresolution_path, dpi=300, bbox_inches='tight')
+     plt.close()
+
+     # If groupby_key is missing, fall back to the second resolution (or the only one)
+     if groupby_key not in adata.obs.columns:
+         groupby_key = cluster_keys[1] if len(cluster_keys) > 1 else cluster_keys[0]
+
+     # Plot marker genes
+     # Filter marker genes to only include those present in the data
+     available_markers = {}
+     for cell_type, genes in marker_genes.items():
+         available_genes = [g for g in genes if g in adata.var_names]
+         if available_genes:
+             available_markers[cell_type] = available_genes
+
+     if available_markers:
+         plt.figure(figsize=(12, 8))
+         sc.pl.dotplot(adata, available_markers, groupby=groupby_key, standard_scale="var")
+         marker_path = OUTPUT_DIR / f"{out_prefix}_marker_genes.png"
+         plt.savefig(marker_path, dpi=300, bbox_inches='tight')
+         plt.close()
+         marker_artifacts = [{"description": "Marker genes dotplot", "path": str(marker_path.resolve())}]
+     else:
+         marker_artifacts = []
+
+     # Differential expression analysis
+     sc.tl.rank_genes_groups(adata, groupby=groupby_key, method=method)
+
+     # Plot top differentially expressed genes
+     plt.figure(figsize=(10, 8))
+     sc.pl.rank_genes_groups_dotplot(
+         adata, groupby=groupby_key, standard_scale="var", n_genes=n_genes
+     )
+     de_path = OUTPUT_DIR / f"{out_prefix}_differential_expression.png"
+     plt.savefig(de_path, dpi=300, bbox_inches='tight')
+     plt.close()
+
+     # Manual cell type annotations for the coarsest resolution (tutorial-specific labels)
+     coarse_key = f"leiden_res_{resolutions[0]:4.2f}"
+     if coarse_key in adata.obs.columns:
+         adata.obs["cell_type_lvl1"] = adata.obs[coarse_key].map({
+             "0": "Lymphocytes",
+             "1": "Monocytes",
+             "2": "Erythroid",
+             "3": "B Cells",
+         })
+
+     # Save annotated data
+     output_file = OUTPUT_DIR / f"{out_prefix}_annotated.h5ad"
+     adata.write_h5ad(output_file)
+
+     # Export differential expression results
+     de_results = []
+     for cluster in adata.obs[groupby_key].unique():
+         cluster_genes = sc.get.rank_genes_groups_df(adata, group=cluster).head(n_genes)
+         cluster_genes['cluster'] = cluster
+         de_results.append(cluster_genes)
+
+     if de_results:
+         de_df = pd.concat(de_results, ignore_index=True)
+         de_path_csv = OUTPUT_DIR / f"{out_prefix}_differential_genes.csv"
+         de_df.to_csv(de_path_csv, index=False)
+         de_artifacts = [{"description": "Differential expression genes", "path": str(de_path_csv.resolve())}]
+     else:
+         de_artifacts = []
+
+     # Create annotation summary
+     annotation_summary = pd.DataFrame({
+         'resolution': resolutions,
+         'n_clusters': [len(adata.obs[f"leiden_res_{res:4.2f}"].unique()) for res in resolutions]
+     })
+     summary_path = OUTPUT_DIR / f"{out_prefix}_annotation_summary.csv"
+     annotation_summary.to_csv(summary_path, index=False)
+
+     artifacts = [
+         {
+             "description": "Multi-resolution clustering",
+             "path": str(multiresolution_path.resolve())
+         },
+         {
+             "description": "Differential expression plot",
+             "path": str(de_path.resolve())
+         },
+         {
+             "description": "Annotated data",
+             "path": str(output_file.resolve())
+         },
+         {
+             "description": "Annotation summary",
+             "path": str(summary_path.resolve())
+         }
+     ] + marker_artifacts + de_artifacts
+
+     return {
+         "message": f"Cell type annotation completed with {len(resolutions)} resolutions and marker analysis",
+         "reference": "https://github.com/scverse/scanpy/tree/main/docs/tutorials/basics/clustering.ipynb",
+         "artifacts": artifacts
+     }
+
+
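
Cross-tabulating the coarse and fine cluster keys written above is a cheap consistency check on the annotation (assuming the default resolutions, so both columns exist):

import pandas as pd

print(pd.crosstab(adata.obs["leiden_res_0.02"], adata.obs["leiden_res_0.50"]))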
+ @clustering_mcp.prompt
+ def preprocess_and_cluster_scanpy(data_path: str) -> str:
+     """
+     Complete preprocessing and clustering pipeline for single-cell RNA sequencing data analysis.
+
+     This comprehensive workflow performs all essential steps for analyzing scRNA-seq data, from raw counts
+     to cell type annotation, following the standard Scanpy clustering tutorial.
+     """
+     return f"""
+ Execute a complete single-cell RNA-seq preprocessing and clustering pipeline on {data_path}.
+
+ First inspect the data to understand:
+ - Dataset size and complexity
+ - Organism (human/mouse) from gene names
+ - Batch information in adata.obs (e.g., "sample", "batch", "donor", "experiment", "condition")
+ - Data quality distribution
+
+ IMPORTANT: Adapt parameters intelligently based on data characteristics.
+ Keep the defaults unless there is a strong reason to change them (e.g., a default that would produce misleading results).
+
+ Then run the pipeline sequentially, making smart parameter choices:
+
+ 1. **quality_control** - Examine data and adapt:
+    - data_path="{data_path}"
+    - batch_key: Set if batch columns exist (for batch-aware doublet detection)
+    - mt_prefix: "MT-" (human) or "Mt-" (mouse) based on gene names
+    - min_genes/min_cells: Adjust based on quality distributions
+    - Review QC plots before proceeding
+
+ 2. **normalize_data** - Use QC output:
+    - target_sum: None (median) or 10000 (CP10K)
+
+ 3. **select_features** - Feature selection:
+    - batch_key: Use same as step 1 if batches present
+    - n_top_genes: 2000-3000 based on complexity
+    - flavor: "seurat" or "seurat_v3" for high dropout
+
+ 4. **reduce_dimensionality** - PCA analysis:
+    - n_comps: 50 (or fewer for small datasets)
+    - Review variance plot for optimal PC selection
+    - color_vars: Include relevant metadata
+
+ 5. **build_neighborhood_graph** - Graph construction:
+    - n_pcs: Based on the elbow in the variance plot (20-40)
+    - n_neighbors: 10-30 based on dataset size
+    - Check UMAP for batch effects
+
+ 6. **cluster_cells** - Clustering:
+    - resolution: 0.1-0.4 (broad) or 0.6-1.5 (fine)
+    - Based on expected cell type diversity
+
+ 7. **annotate_cell_types** - Annotation:
+    - resolutions: Test multiple [low, medium, high]
+    - marker_genes: Provide tissue-specific markers if known
+    - Validate with marker expression
+
+ KEY DECISIONS:
+ - Identify and consistently use batch_key throughout if batches exist
+ - Adjust all thresholds based on data quality
+ - Validate each step before proceeding
+ - Document any anomalies or batch effects
+
+ The pipeline produces a fully annotated dataset with QC metrics, embeddings, clusters, and cell type markers.
+ """