Unlocking the Future of Drug Design - PMDM

He watched his daughter break the entire newly built lighthouse.

“What’s wrong with it?”

“There’s no balcony,” she muttered.

As she crushed the beautiful white tower, her dad pointed out, “It resembled Fastnet. You could’ve just added a few circular railings at the top and...”

“I want two balconies.”

She’d made up her mind. The dad shook his head and sipped his morning coffee, reading the newspaper. He wondered if using LEGO instead of CADD would help him in his quest for better drug designs. Maybe LEGO wasn’t such a bad idea after all.

He imagined the day when designing new drugs would be as simple and creative as building with those little tiny pieces.

He’d still have to work out multiple permutations of different blocks, hoping he would find the perfect combination for a drug that could possibly help millions around the globe.

This is where the latest breakthrough in structure-based generative chemistry comes into play.

The Puzzle of Drug Design

Have you ever played Tetris, the hugely influential game created by Alexey Pajitnov? Well, designing drugs has always been like playing a gigantic game of Tetris. Scientists take molecules from existing databases, drawn from a chemical space estimated at 10^60 to 10^100 molecules depending on the size of the required drug, try to fit them into protein target sites, and see what fits best. This method, called Virtual Screening, is laborious and limited by what’s already known. Kind of like making the same, repeated buildings from a LEGO manual, right?

Deep learning methods have allowed us to explore this vast space. Variational Autoencoders, Generative Adversarial Networks, normalizing flows, and diffusion models learn the underlying hidden distribution of molecules. Yet many of them represent molecules as simple SMILES strings or graphs, which carry little 3D information. Capturing the 3D structure of a molecule is perhaps one of the most important tasks, because one molecular graph can adopt various conformations with different properties in 3D space. 3D molecule generation was introduced to account for this spatial information. However, whether these molecules would bind well to the target proteins was not considered.

Hence, researchers began treating the 3D structure of the target pocket as conditional information and learning the interactions between molecules and proteins, so that a model can capture the distribution of molecules conditioned on the pocket.

Variational Autoencoders (VAEs) are generative models designed to learn the underlying probability distribution of a given dataset and generate new, similar samples. At the heart of a VAE is an encoder-decoder architecture. Voxelized atomic density images can be fed to a VAE, and molecules are then reconstructed from the images it produces. However, a VAE compresses the pocket structure information and fails to generate accurate target-specific molecules.

What about auto-regressive models?

Think of it like this: imagine you're trying to draw a detailed picture, but you can only draw one tiny part at a time, and you have to start from scratch with each stroke of your pen. That's kind of what these auto-regressive frameworks do when generating molecules: they build them up atom by atom, step by step. Sure, they're trying to explore this vast chemical space, but it's like trying to navigate through a maze blindfolded. You're bound to make mistakes, especially when you can't see the bigger picture. And let's face it, molecular structures are as messy and unpredictable as trying to untangle a ball of yarn: every twist and turn adds another layer of complexity.

Consequently, achieving accurate and efficient 3D sampling of molecules within pocket cavities remains a significant challenge in the field. Recently, diffusion models have garnered significant attention in computer vision tasks, especially in point cloud generation, which is quite similar to 3D molecule generation. These methods can fill in 3D objects by learning the joint distribution of the data. Maybe they could be used for molecule generation?

Enter PMDM: The Master Builder

The Pocket-based Molecular Diffusion Model (PMDM) is like having a magical LEGO set that can create the perfect piece for any spot in your castle along with providing information about its binding efficiency. PMDM uses advanced diffusion-based techniques to generate molecules with fixed pocket information. Lei Huang and her team have developed this novel conditional deep generative model for 3D molecule generation fitting specified target proteins. Let us take a look at the simplified steps below!

Protein Point Cloud Encoding

Let's imagine the atoms of a molecule as points in 3D space. A molecule contains many such points, which together form a “cloud” of data points. An invariant encoder called SchNet is used to capture the semantic and spatial context of the protein.
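To see why a distance-based encoder can be "invariant", here is a minimal sketch of the kind of feature SchNet-style models start from: each interatomic distance is expanded into a set of Gaussian radial basis values. The function names, the `gamma` width, the grid of centers, and the toy pocket coordinates are all assumptions made for illustration, not PMDM's actual settings.

```python
import math

def rbf_expand(distance, centers, gamma=10.0):
    """Expand one interatomic distance into Gaussian radial basis features."""
    return [math.exp(-gamma * (distance - c) ** 2) for c in centers]

def pairwise_distance_features(coords, centers, gamma=10.0):
    """Map every atom pair (i, j) to the RBF expansion of its distance."""
    features = {}
    for i in range(len(coords)):
        for j in range(i + 1, len(coords)):
            d = math.dist(coords[i], coords[j])
            features[(i, j)] = rbf_expand(d, centers, gamma)
    return features

# Hypothetical three-atom "pocket" (coordinates in angstroms, made up).
pocket = [(0.0, 0.0, 0.0), (1.5, 0.0, 0.0), (0.0, 2.0, 0.0)]
centers = [0.5 * k for k in range(11)]  # RBF centers from 0.0 to 5.0

# Translating the whole pocket leaves every distance, and so every feature, unchanged.
shifted = [(x + 3.0, y - 1.0, z + 2.0) for x, y, z in pocket]
features = pairwise_distance_features(pocket, centers)
features_shifted = pairwise_distance_features(shifted, centers)
```

Because the features depend only on distances between atoms, moving or turning the protein as a whole does not change what the encoder sees.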

Drawbacks

If we represent 3D molecular geometries as plain 3D point clouds, regular point-cloud methods cannot incorporate edge information such as chemical bonds.

The Dual Diffusion Method

PMDM uses two kinds of connections:

Covalent Localized Edges: These represent strong chemical bonds between atoms that are close to each other (pairs of atoms with interatomic distances below a certain threshold).
Global Edges: These represent weaker forces affecting atoms that are further apart (van der Waals forces).
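
The two edge types can be sketched with a simple distance threshold. This is only an illustration: the 2.0 angstrom cutoff, the function name, and the toy ligand are assumptions, and treating every remaining pair as a global edge is a simplification rather than PMDM's confirmed rule.

```python
import math

def build_edges(coords, local_cutoff=2.0):
    """Split atom pairs into local (short-range) and global (long-range) edges."""
    local_edges, global_edges = [], []
    for i in range(len(coords)):
        for j in range(i + 1, len(coords)):
            distance = math.dist(coords[i], coords[j])
            if distance < local_cutoff:
                local_edges.append((i, j))   # close enough to behave like a bond
            else:
                global_edges.append((i, j))  # longer-range interaction
    return local_edges, global_edges

# Hypothetical ligand: four atoms on a line (coordinates in angstroms, made up).
ligand = [(0.0, 0.0, 0.0), (1.4, 0.0, 0.0), (2.8, 0.0, 0.0), (6.0, 0.0, 0.0)]
local_edges, global_edges = build_edges(ligand)
```

Here only the neighbouring pairs (0, 1) and (1, 2) fall under the cutoff, so they become local edges, while the other four pairs are treated as global.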

Diffusion Process

This is where the points spread out and move around. Think of this as adding "noise" or randomness to the molecule’s data: the process iteratively corrupts the original molecule (the ligand) by adding Gaussian noise.
The goal of PMDM is to learn how to reverse this process, so that it can create molecules that fit specific criteria. To do that, it has to model a data distribution conditioned on the pocket, from which it can generate molecules with high affinity to the target.
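
The noising half of this story can be sketched in a few lines using the standard diffusion recipe, where a noisy sample at step t is sqrt(alpha_bar_t) * x_0 + sqrt(1 - alpha_bar_t) * noise. The linear noise schedule, the 1000 steps, and the toy coordinates are assumptions for illustration; the post does not say which schedule PMDM uses, and this sketch only noises atom positions.

```python
import math
import random

def linear_beta_schedule(num_steps, beta_start=1e-4, beta_end=0.02):
    """Per-step noise variances, growing linearly (an assumed schedule)."""
    step = (beta_end - beta_start) / (num_steps - 1)
    return [beta_start + step * t for t in range(num_steps)]

def cumulative_alphas(betas):
    """alpha_bar_t: the fraction of the original signal variance left after t steps."""
    alpha_bars, running = [], 1.0
    for beta in betas:
        running *= 1.0 - beta
        alpha_bars.append(running)
    return alpha_bars

def noise_coordinates(coords, alpha_bar, rng):
    """Jump straight to a noisy version of the coordinates for a given alpha_bar."""
    signal = math.sqrt(alpha_bar)
    sigma = math.sqrt(1.0 - alpha_bar)
    return [
        tuple(signal * c + sigma * rng.gauss(0.0, 1.0) for c in atom)
        for atom in coords
    ]

rng = random.Random(0)
alpha_bars = cumulative_alphas(linear_beta_schedule(1000))

# Hypothetical ligand coordinates (angstroms, made up).
ligand = [(0.0, 0.0, 0.0), (1.4, 0.0, 0.0), (2.1, 1.2, 0.0)]
slightly_noisy = noise_coordinates(ligand, alpha_bars[10], rng)
almost_pure_noise = noise_coordinates(ligand, alpha_bars[-1], rng)
```

Early steps barely move the atoms, while by the last step almost none of the original structure is left. A trained model runs the process the other way, starting from noise and removing it step by step.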

Kernels and Experiments

No matter the orientation of the molecule, its identity should remain the same. Hence, researchers at the City University of Hong Kong and Tencent AI Lab in China designed an equivariant dynamic kernel that obeys the translation, rotation, reflection, and permutation equivariance of molecular geometry systems. The team tested their model on the Synthetic CrossDocked Dataset. This dataset includes molecules designed to fit into certain pockets.
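
Equivariance is easier to trust once you have checked it on a toy case. The sketch below is not PMDM's dynamic kernel; it is a deliberately simple update, in the spirit of equivariant graph networks, where each atom moves along its relative vectors to other atoms, weighted by a function of distance only. All names and numbers are made up for illustration.

```python
import math

def rotate_z(coords, angle):
    """Rotate every point about the z-axis."""
    c, s = math.cos(angle), math.sin(angle)
    return [(c * x - s * y, s * x + c * y, z) for x, y, z in coords]

def translate(coords, shift):
    """Shift every point by the same vector."""
    return [tuple(a + b for a, b in zip(point, shift)) for point in coords]

def equivariant_update(coords, step=0.1):
    """Move each atom along relative vectors, weighted only by distances."""
    updated = []
    for i, xi in enumerate(coords):
        move = [0.0, 0.0, 0.0]
        for j, xj in enumerate(coords):
            if i == j:
                continue
            weight = math.exp(-math.dist(xi, xj))  # invariant: depends on distance only
            for k in range(3):
                move[k] += weight * (xi[k] - xj[k])
        updated.append(tuple(xi[k] + step * move[k] for k in range(3)))
    return updated

def max_abs_difference(a, b):
    """Largest coordinate-wise gap between two sets of points."""
    return max(abs(u - v) for p, q in zip(a, b) for u, v in zip(p, q))

atoms = [(0.0, 0.0, 0.0), (1.4, 0.0, 0.0), (2.1, 1.2, 0.3), (0.5, -1.0, 0.8)]
angle, shift = 0.7, (1.0, -2.0, 0.5)

# Transforming the input first, or the output afterwards, should agree.
transform_then_update = equivariant_update(translate(rotate_z(atoms, angle), shift))
update_then_transform = translate(rotate_z(equivariant_update(atoms), angle), shift)
gap = max_abs_difference(transform_then_update, update_then_transform)
```

The two routes differ only by floating-point rounding, which is what rotation and translation equivariance means in practice. This check covers those two transformations only; reflections and atom reordering would need their own tests.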

Results of PMDM:

· PMDM was able to create new molecules that are similar to real drugs.
· These new molecules could be made in a lab (synthesis-accessible).
· They can bind well to specific proteins (high binding affinity), which is important for making effective drugs.
· PMDM performed better than the best existing models in multiple tests.

Let’s take a deeper dive into the results. After all, we need to see why PMDM works as well as it does, don’t we?

Metrics for Evaluation

PMDM’s performance was measured using several metrics:

Vina Score: Estimates the binding affinity between the ligand and the protein pocket.
High Affinity: Percentage of molecules with better binding affinity than the ground truth molecules.
QED (Quantitative Estimate of Drug-likeness): Assesses how drug-like a molecule is based on several properties.
SA (Synthetic Accessibility): Measures how easily the molecule can be synthesized in a lab.
Lipinski's Rule of Five: Determines if the molecule meets key criteria for drug-likeness.
LogP: The octanol-water partition coefficient, which reflects the molecule’s solubility and permeability. It should be between -0.4 and 5.6 for a good drug candidate.
Diversity: Measures the variety of generated molecules.
Generation Time: The time taken to generate 100 samples for each target pocket.
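
Two of these checks are simple enough to sketch by hand. The snippet below counts how many of the four classic Lipinski criteria a molecule satisfies and tests the LogP range quoted above. The descriptor values are invented, and in practice they would come from a cheminformatics toolkit; the exact scoring variant used in the PMDM evaluation may differ from this one.

```python
def lipinski_rules_passed(mol_weight, logp, h_donors, h_acceptors):
    """Count how many of the four classic rule-of-five criteria are met."""
    checks = [
        mol_weight <= 500.0,  # molecular weight in daltons
        logp <= 5.0,          # octanol-water partition coefficient
        h_donors <= 5,        # hydrogen bond donors
        h_acceptors <= 10,    # hydrogen bond acceptors
    ]
    return sum(checks)

def logp_in_range(logp, low=-0.4, high=5.6):
    """Check the LogP window quoted in the metric list above."""
    return low <= logp <= high

# Hypothetical descriptor values for two made-up molecules.
candidates = {
    "mol_a": {"mol_weight": 342.4, "logp": 2.1, "h_donors": 2, "h_acceptors": 5},
    "mol_b": {"mol_weight": 612.8, "logp": 6.3, "h_donors": 1, "h_acceptors": 9},
}
report = {
    name: (lipinski_rules_passed(**props), logp_in_range(props["logp"]))
    for name, props in candidates.items()
}
```

In this toy example `mol_a` passes all four criteria and sits inside the LogP window, while `mol_b` fails on weight and LogP.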

For every target protein in the test set, the group produced 100 molecules, for a total of 10,000 molecules. The sizes of the generated molecules were sampled from the training set's size distribution. Several kinds of indicators were used to compare the success rate of PMDM with other models.
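
As a rough illustration of how the High Affinity metric could be computed per pocket and then averaged, here is a small sketch. It relies on the usual convention that a lower (more negative) Vina score means stronger predicted binding. The pocket names and scores are invented, and only four molecules per pocket are shown instead of 100.

```python
def high_affinity_percentage(generated_scores, reference_score):
    """Percent of generated molecules whose Vina score beats the reference.

    Lower (more negative) Vina scores indicate stronger predicted binding.
    """
    if not generated_scores:
        return 0.0
    better = sum(1 for score in generated_scores if score < reference_score)
    return 100.0 * better / len(generated_scores)

# Hypothetical Vina scores in kcal/mol (made up for illustration).
per_target = {
    "pocket_1": {"reference": -7.0, "generated": [-8.1, -6.5, -7.4, -5.9]},
    "pocket_2": {"reference": -6.2, "generated": [-6.0, -6.8, -7.1, -6.3]},
}
percentages = {
    name: high_affinity_percentage(data["generated"], data["reference"])
    for name, data in per_target.items()
}
mean_high_affinity = sum(percentages.values()) / len(percentages)
```

With these numbers, half of the molecules beat the reference for the first pocket and three quarters for the second, giving a mean of 62.5 percent.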

PMDM performed better than rivals such as CVAE and AR-SBDD in several areas. For example, it achieved a better Vina Score, which suggests stronger binding to the target proteins. It also had shorter generation times and better scores on measures like QED and Lipinski, indicating that its compounds not only bind well but also show promise as potential drugs.

Judged on the complete generated molecules, PMDM wasn't simply good; it was extraordinary, setting a new benchmark for molecular design.

What about evaluating each LEGO piece, not just the final model?

When evaluating a LEGO model, it's not enough to look at the finished build; each brick must fit well and contribute to the structure's stability. Similarly, to understand the quality of generated molecules, it's not just the overall structure that matters but also the quality of their sub-structures. Several pocket proteins were investigated to understand the substructures.

AR-SBDD often creates unstable three-atom rings, indicating it gets stuck in local optima.

In terms of ring structures, PMDM generates fewer unstable three- and four-atom rings and more stable five- and six-atom rings (cue the organic chemistry flashbacks), which are crucial in drug design due to their frequent hydrogen bonds. This balance suggests that PMDM has a much better understanding of the molecular data distribution, producing molecules that are more representative of real-world drugs.

Further, bond angles and dihedral angles were assessed to check that the molecules' local geometry was accurate. PMDM outperformed all the other models in maintaining these geometric properties, which indicates that it captures the local atom geometry of the data.

So many LEGO worlds!

After checking out the local geometry of molecules from PMDM, it's important to consider the broader chemical space they occupy. We need to focus on the overall shape and distribution of molecules.

PMDM stands out by accurately capturing both 2D and 3D molecular fingerprints. Using fingerprints such as Morgan, RDKit, and USRCAT, the researchers compared the chemical space of generated molecules with that of test-set molecules. The Extended-Connectivity Fingerprints (ECFP), based on the Morgan algorithm, consider atom types, connectivity, and chemical features. RDKit fingerprints capture 2D substructures, while USRCAT describes 3D shape.
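
Fingerprints become useful once you compare them. A common choice is Tanimoto similarity, and one minus the average pairwise similarity is a typical way to express the Diversity metric mentioned earlier. In the sketch below each fingerprint is just a set of "on" bit positions, and the bit values are invented; real ones would come from a toolkit such as RDKit.

```python
from itertools import combinations

def tanimoto(fp_a, fp_b):
    """Shared on-bits divided by all on-bits across the two fingerprints."""
    union = fp_a | fp_b
    if not union:
        return 1.0  # convention chosen here: two empty fingerprints are identical
    return len(fp_a & fp_b) / len(union)

def diversity(fingerprints):
    """One minus the mean pairwise Tanimoto similarity."""
    pairs = list(combinations(fingerprints, 2))
    if not pairs:
        return 0.0
    mean_similarity = sum(tanimoto(a, b) for a, b in pairs) / len(pairs)
    return 1.0 - mean_similarity

# Hypothetical fingerprints as sets of on-bit indices (made up).
fingerprints = [{1, 4, 9, 12}, {1, 4, 9, 15}, {2, 7, 20, 31}]
similarity_first_pair = tanimoto(fingerprints[0], fingerprints[1])
set_diversity = diversity(fingerprints)
```

The first two fingerprints share three of five bits, so they are fairly similar, while the third shares nothing with either, which pushes the diversity of the set up to about 0.8.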

Visualizing chemical space distribution with t-SNE (t-distributed stochastic neighbor embedding) shows that PMDM-generated molecules can cover the test-set molecules in 2D substructure space, showing an accurate modeling of the training space. The 3D chemical space is also well captured by PMDM, indicating no significant mismatches between generated and test-set molecules.

Since the shape of molecular targets is important, researchers use Principal Moments of Inertia (PMI) and Plane of Best Fit (PBF) descriptors to characterize these shapes. PMI descriptors reflect whether a molecule's geometry is rod-shaped, disc-shaped, or sphere-shaped. A ternary plot of Normalized Principal Moment of Inertia ratios (NPR) shows that PMDM-generated molecules exhibit similar patterns to test set molecules, gathering around the rod corner and even touching the disc and sphere corners, suggesting PMDM's ability to explore novel shapes beyond the dataset.
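
For the curious, the NPR coordinates behind that ternary plot can be sketched from scratch. The code builds the inertia tensor of a set of atoms, finds its three principal moments with a closed-form formula for symmetric 3x3 matrices, and divides the two smaller moments by the largest. Using unit masses and idealized toy shapes is an assumption for illustration; a real analysis would use atomic masses and actual conformers.

```python
import math

def inertia_tensor(coords):
    """Inertia tensor about the centroid, assuming unit mass per atom."""
    n = len(coords)
    center = [sum(p[k] for p in coords) / n for k in range(3)]
    tensor = [[0.0] * 3 for _ in range(3)]
    for p in coords:
        r = [p[k] - center[k] for k in range(3)]
        r2 = sum(v * v for v in r)
        for a in range(3):
            for b in range(3):
                tensor[a][b] += (r2 if a == b else 0.0) - r[a] * r[b]
    return tensor

def symmetric_eigenvalues(m):
    """Eigenvalues of a symmetric 3x3 matrix in ascending order (trigonometric method)."""
    p1 = m[0][1] ** 2 + m[0][2] ** 2 + m[1][2] ** 2
    if p1 == 0.0:
        return sorted(m[k][k] for k in range(3))  # already diagonal
    q = (m[0][0] + m[1][1] + m[2][2]) / 3.0
    p2 = sum((m[k][k] - q) ** 2 for k in range(3)) + 2.0 * p1
    p = math.sqrt(p2 / 6.0)
    b = [[(m[i][j] - (q if i == j else 0.0)) / p for j in range(3)] for i in range(3)]
    det = (b[0][0] * (b[1][1] * b[2][2] - b[1][2] * b[2][1])
           - b[0][1] * (b[1][0] * b[2][2] - b[1][2] * b[2][0])
           + b[0][2] * (b[1][0] * b[2][1] - b[1][1] * b[2][0]))
    r = max(-1.0, min(1.0, det / 2.0))  # clamp against rounding error
    phi = math.acos(r) / 3.0
    largest = q + 2.0 * p * math.cos(phi)
    smallest = q + 2.0 * p * math.cos(phi + 2.0 * math.pi / 3.0)
    middle = 3.0 * q - largest - smallest
    return [smallest, middle, largest]

def npr(coords):
    """Normalized principal moment ratios (I1/I3, I2/I3) with I1 <= I2 <= I3."""
    i1, i2, i3 = symmetric_eigenvalues(inertia_tensor(coords))
    return i1 / i3, i2 / i3

# Idealized toy shapes (made up): a tilted rod, a flat square, an octahedron.
rod = [(float(t), float(t), 0.0) for t in range(-2, 3)]
disc = [(1.0, 0.0, 0.0), (0.0, 1.0, 0.0), (-1.0, 0.0, 0.0), (0.0, -1.0, 0.0)]
sphere = disc + [(0.0, 0.0, 1.0), (0.0, 0.0, -1.0)]
rod_npr, disc_npr, sphere_npr = npr(rod), npr(disc), npr(sphere)
```

The three toy shapes land on the three corners of the plot: roughly (0, 1) for the rod, (0.5, 0.5) for the disc, and (1, 1) for the sphere. Real molecules fall somewhere in between.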

The PBF values, which measure the distance of heavy atoms from the plane of best fit, also show a great match between the test set and generated molecules. This indicates that PMDM can model both 2D and 3D molecular structures accurately, guiding the exploration of novel drug-like structures.

Is your LEGO model better than ours?

How well does PMDM work in practice? To test this, the trained model was applied to generate high-affinity molecules for SARS-CoV-2-related proteins. Specifically, the focus was on designing non-covalent inhibitors for the SARS-CoV-2 main protease (Mpro), which is crucial for viral replication and is therefore a viable drug target.

The researchers aimed to generate molecules with novel scaffolds. Using three atoms as the seed fragment, they generated 40,000 molecules. Filtering by Vina score left 10,627 high-affinity molecules. None of these were in the training set, indicating that PMDM can generate novel molecules that bind well to target proteins.

Building the Perfect LEGO Tower: Pharmacophore Analysis

Imagine building a LEGO tower, where each piece has to fit just right to complete the structure. In the same way, designing effective drugs requires molecules to have specific features that fit perfectly with their target. To see if the molecules generated by PMDM had these features, the team used Align-It software to visualize the distribution of hydrophobic groups, such as aromatic rings and lipophilic regions.

The results were promising. The hydrophobic groups in the generated molecules clustered in key areas (S1’, S1, S2, S3, and S4) just like the reference compound, showing that PMDM can create molecules with similar binding properties.

Further analysis revealed that the hydrogen bond acceptors in the generated molecules interacted well with crucial residues like HIS163 and GLU166, and the hydrogen bond donors were in the right spots. Additionally, new clusters suggested that these molecules could form hydrogen bonds with other parts of the protein pocket.

Admiring your final build

As the dad watched his daughter build her LEGO tower (including two balconies) with crazy determination, he couldn't help but feel a sense of pride. Just like her, he knew that creating something remarkable often required breaking down a few things along the way. It requires patience, determination, and the ability to plunge yourself into a vast ocean of opportunities.

PMDM makes things easier and faster while delivering better results than the top methods out there. We believe it's going to revolutionize how new drugs are designed, especially for targeting specific proteins, and that it could be a ground-breaking step toward faster drug discovery and delivery.

Reference - link to article

molecular musings
