title: BioCynthia Demo
emoji: 🧬
colorFrom: gray
colorTo: red
sdk: docker
pinned: true
license: apache-2.0
short_description: reverse engineering antibiotics with MCP tools
tags:
  - mcp-server-track

BioCynthia: reverse engineering antibiotics with MCP tools

Let's start with the song!

There are bacteria in soil and sea
They have what is called a B-G-C
These genes produce some complex peptides
And they might save our lives!

Now from the beginning!

Many organisms, usually soil and sea bacteria or fungi, deal with pathogens in their own way: they carry mini-factories in their genomes. These factories are not necessary for life or reproduction, but they still produce very specific peptides. In some cases those peptides have antibiotic properties and kill the pathogens.

Such a factory of genes is called a biosynthetic gene cluster (BGC). BGCs behave differently from one another: there are several known classes, unrelated to each other. A BGC can itself be hybrid or nested, whatever is necessary. Sometimes a BGC is broken and doesn't produce anything at all.

And here comes the biggest problem: BGCs are hard to locate and hard to describe. To link a BGC to its natural product (NP), the peptide this cluster produces, one has to prove that this particular cluster actually produces it.

But why bother?

It pays off! If we can find an NP with the desired properties, we can use the genome's machinery to produce it. We don't have to ask fellow chemists to design reaction pathways for us, and it won't cost us all our money and time: the bacteria or fungi already know how to synthesize the cure and will do it for us!

What is this project about?

It's actually about genome mining: let's generate peptides as usual and then check whether each one is already produced by some bacterium. What if we already have the soil sample, but missed that BGC and never explored it?

My original plan is described in the picture:

project pipeline

I've spent a lot of time on data preparation, so I decided to skip the part where I split the peptide into monomer subunits (there are tools that do this, and it is also solvable with RDKit).

Also, I don't do actual domain sequence generation (I plan to experiment with diffusion models, but I'll keep that as an after-hackathon activity).

The idea is simple: given some specific (randomly sampled) peptide, construct the sequence of domains which might have produced it.
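As a rough illustration of this step, here is a minimal sketch that maps a linear peptide to a plausible NRPS domain order. It is not the pipeline used in this project: it assumes the simplified textbook model in which the first module is A-T, every later module is C-A-T, an E (epimerization) domain follows the T domain for each D-monomer, and a TE domain terminates the assembly line. The function name `monomers_to_domains` and the `D-` prefix convention are hypothetical choices made for this example.

```python
from typing import List

# The 20 proteinogenic amino acids as three-letter codes.
# Assumption for this sketch: a "D-" prefix marks the D-isomer.
AMINO_ACIDS = {
    "Ala", "Arg", "Asn", "Asp", "Cys", "Gln", "Glu", "Gly", "His", "Ile",
    "Leu", "Lys", "Met", "Phe", "Pro", "Ser", "Thr", "Trp", "Tyr", "Val",
}


def monomers_to_domains(monomers: List[str]) -> List[str]:
    """Return a textbook NRPS domain order for a linear peptide (hypothetical helper)."""
    if not monomers:
        raise ValueError("peptide must contain at least one monomer")

    domains: List[str] = []
    for i, monomer in enumerate(monomers):
        is_d = monomer.startswith("D-")
        residue = monomer[2:] if is_d else monomer
        if residue not in AMINO_ACIDS:
            raise ValueError(f"unsupported monomer: {monomer}")
        if is_d and residue == "Gly":
            raise ValueError("glycine is achiral and has no D-isomer")

        if i > 0:
            domains.append("C")           # condensation: joins to the previous residue
        domains.append(f"A({residue})")   # adenylation: selects and activates the monomer
        domains.append("T")               # thiolation (PCP): carries the growing chain
        if is_d:
            domains.append("E")           # epimerization: converts the L-residue to D
    domains.append("TE")                  # thioesterase: releases the finished peptide
    return domains


if __name__ == "__main__":
    print(monomers_to_domains(["Val", "D-Phe", "Pro"]))
```

Real clusters often deviate from this order (missing or extra tailoring domains, modules split across genes), so the output is only a first guess to search with.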

With this sequence of domains, I want to search publicly available databases for close hits.
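One simple way to think about "close hits" is to score each known cluster by how similar its domain order is to the query. The sketch below uses `difflib.SequenceMatcher` from the Python standard library on lists of domain labels. This is an assumption made for illustration, not the search actually used here; the in-memory `database` dictionary, the cluster names, and the `min_ratio` threshold are all hypothetical stand-ins for a real database query.

```python
from difflib import SequenceMatcher
from typing import Dict, List, Tuple


def rank_hits(
    query: List[str],
    database: Dict[str, List[str]],
    min_ratio: float = 0.5,
) -> List[Tuple[str, float]]:
    """Rank clusters by domain-order similarity to the query (hypothetical helper)."""
    hits: List[Tuple[str, float]] = []
    for name, domains in database.items():
        # ratio() is 2 * matches / total length, so 1.0 means identical sequences.
        ratio = SequenceMatcher(None, query, domains, autojunk=False).ratio()
        if ratio >= min_ratio:
            hits.append((name, ratio))
    # Best match first.
    return sorted(hits, key=lambda hit: hit[1], reverse=True)


if __name__ == "__main__":
    query = ["A(Val)", "T", "C", "A(Phe)", "T", "E", "TE"]
    # Toy, made-up entries; a real run would pull domain orders from a public database.
    database = {
        "toy_cluster_1": ["A(Val)", "T", "C", "A(Phe)", "T", "E", "TE"],
        "toy_cluster_2": ["A(Val)", "T", "C", "A(Phe)", "T", "TE"],
        "toy_cluster_3": ["A(Gly)", "T", "TE"],
    }
    for name, score in rank_hits(query, database):
        print(f"{name}: {score:.2f}")
```

Matching on whole domain labels treats `A(Val)` and `A(Leu)` as completely different; a more careful score would give partial credit for chemically similar substrates.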

This might help find useful BGCs that were previously missed.

Limitations

I consider only nonribosomal peptide synthetases, since they have known modular structures and are relatively well described.

I limit the possible peptide monomers to amino acids and their D-isomers (there are many others).

I don't take into account hybrid or complex BGCs, and I don't consider cyclic peptides, since I haven't yet figured out what part of the BGC affects cyclization.

Data

As the dataset, I use the domains corresponding to NRPS modules, together with their corresponding monomers, extracted from the antiSMASH database v4.0. I spent most of my time cleaning and collecting this data :)

Why this topic

I'm a PhD student and this is basically my research area