Written by Shambhavi Shukla, Pragya Khanna, & Shambhavi Mathur
Fig.1. Structures of a protein predicted by artificial intelligence (blue) and determined experimentally (green) match almost perfectly.
DeepMind, Google's artificial intelligence offshoot, has made a big leap towards overcoming one of the major challenges in biology: determining a protein's 3D shape from its amino-acid sequence. Read the original paper here.
DeepMind’s AlphaFold outperformed more than 100 other teams in CASP, a biennial protein structure prediction competition. The results were announced on 30 November, at the start of the conference, which was held virtually.
Within a few hours that same day, it was declared that AlphaFold, Google DeepMind’s artificial intelligence program, had made a decisive breakthrough in determining the 3D structures of proteins.
The announcement was immediately hailed as one of the major scientific advances of the decade.
“What the DeepMind team has managed to achieve is fantastic and will change the future of structural biology and protein research,” says Janet Thornton, director emeritus of the European Bioinformatics Institute. “This is a 50-year-old problem,” adds John Moult, a structural biologist at the University of Maryland, Shady Grove, and co-founder of the competition, Critical Assessment of Protein Structure Prediction (CASP). “I never thought I’d see this in my lifetime.”
You may be asking: why is it important to understand the 3D structures of proteins, why are they difficult to determine, and what is the nature of AlphaFold’s advance? Why is this so exciting, and what further advances in medicine and the other biosciences may result? This article explains the details in the most elementary terms we can.
Critical Assessment of protein Structure Prediction, or CASP, is a community-wide, worldwide experiment in protein structure prediction that has been held every two years since 1994. Although CASP's primary objective is to help advance methods of identifying a protein's three-dimensional structure from its amino acid sequence, many regard the experiment as more of a "world championship" in this field.
The experiment is conducted in a double-blind fashion so that no predictor can have prior information about a protein's structure that would give them an advantage: neither the predictors nor the organizers and evaluators know the structures of the target proteins when the predictions are made. Prediction targets are either structures that are soon to be solved by X-ray crystallography or NMR spectroscopy, or structures that have just been solved and are being kept on hold by the Protein Data Bank.
The main metric CASP uses to measure prediction accuracy is the Global Distance Test (GDT), which ranges from 0 to 100. It can be thought of, approximately, as the percentage of amino acid residues (beads in the protein chain) that lie within a threshold distance of their correct position. According to Professor Moult, a score of around 90 GDT is informally considered competitive with results obtained from experimental methods.
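To make the metric concrete, here is a minimal sketch of a GDT-style score in Python. It assumes the predicted and experimental structures have already been superimposed and are given as one (x, y, z) position per residue. The real CASP evaluation also searches for the best superposition, which this sketch leaves out. The function name `gdt_ts` and the toy coordinates are ours, for illustration only. The four cutoffs of 1, 2, 4 and 8 ångströms are the ones used by the commonly reported GDT_TS variant.

```python
import math

def gdt_ts(predicted, reference, cutoffs=(1.0, 2.0, 4.0, 8.0)):
    """Approximate GDT_TS for two already-superimposed residue traces.

    predicted, reference: equal-length lists of (x, y, z) tuples in angstroms.
    Returns a score from 0 to 100.
    """
    if len(predicted) != len(reference) or not reference:
        raise ValueError("need two non-empty traces of equal length")
    # Distance of each predicted residue from its correct position.
    deviations = [math.dist(p, r) for p, r in zip(predicted, reference)]
    # Percentage of residues within each cutoff, averaged over the cutoffs.
    percentages = [
        100.0 * sum(d <= c for d in deviations) / len(deviations)
        for c in cutoffs
    ]
    return sum(percentages) / len(percentages)

# Toy example: four residues with deviations of 0.5, 1.5, 3.0 and 9.0 angstroms.
reference = [(0.0, 0.0, 0.0), (3.8, 0.0, 0.0), (7.6, 0.0, 0.0), (11.4, 0.0, 0.0)]
predicted = [(0.5, 0.0, 0.0), (3.8, 1.5, 0.0), (7.6, 0.0, 3.0), (11.4, 9.0, 0.0)]
score = gdt_ts(predicted, reference)
print(score)  # 56.25
```

In the toy example, 25%, 50%, 75% and 75% of the residues fall within the four cutoffs, so the average is 56.25. A perfect prediction scores 100.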
CASP13 provided an independent mechanism for evaluating protein structure modeling methods. From May to July 2018, the CASP organizers released the sequences of proteins with unknown structures for modeling. From May through mid-August, models were collected and then evaluated as the experimental coordinates became available. CASP13 made headlines in December 2018 when AlphaFold, an artificial intelligence program developed by DeepMind, won it.
An enhanced version, AlphaFold 2, won CASP14 in November 2020 with a score of nearly 90 on a 100-point prediction accuracy scale.
Over the past five decades, researchers have determined the shapes of proteins in the lab using experimental techniques such as cryo-electron microscopy, X-ray crystallography, and nuclear magnetic resonance. Each of these methods depends on a lot of trial and error, which can take years of work and cost tens of thousands of dollars or more per protein structure. This is a major reason why many biologists are turning to AI methods as an alternative to this long and laborious process for difficult or complex proteins.
The DeepMind team focused specifically on the problem of modeling target shapes from scratch, without using previously solved proteins as templates. They achieved a high degree of accuracy when predicting the physical properties of a protein structure, and then used two distinct methods to construct predictions of full protein structures. Both methods rely on deep neural networks trained to predict properties of the protein from its genetic sequence.
The properties that their network predicts are:
The distances between the pairs of amino acids.
The angles between the chemical bonds that connect those amino acids.
DeepMind’s team trained a neural network to predict a distribution of distances between every pair of residues in a protein. These probabilities were then combined into a score that estimates how accurate a proposed protein structure is. They also trained a separate neural network that uses all the distances in aggregate to estimate how close the proposed structure is to the right answer. Using these scoring functions, they were able to search the protein landscape for structures that matched their predictions. The first method built on techniques commonly used in structural biology and repeatedly replaced pieces of a protein structure with new protein fragments.
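One simple way to picture how per-pair distance probabilities can be combined into a single score is a negative log-likelihood: for each pair of residues, look up the probability the network assigned to the distance actually seen in the proposed structure, and add up the negative logarithms. The sketch below is our illustration of that idea, not DeepMind's code. The function names, the three coarse distance bins and the toy probabilities are all assumptions.

```python
import math

def bin_index(distance, bin_edges):
    """Index of the distance bin containing `distance` (the last bin is open-ended)."""
    for i, upper in enumerate(bin_edges):
        if distance < upper:
            return i
    return len(bin_edges)

def distance_score(coords, distogram, bin_edges):
    """Negative log-likelihood of a structure's distances under a distogram.

    coords: list of (x, y, z) positions, one per residue.
    distogram[i][j]: probabilities over len(bin_edges) + 1 bins, for i < j.
    Lower scores mean the structure agrees better with the predictions.
    """
    score = 0.0
    for i in range(len(coords)):
        for j in range(i + 1, len(coords)):
            d = math.dist(coords[i], coords[j])
            p = distogram[i][j][bin_index(d, bin_edges)]
            score -= math.log(max(p, 1e-9))  # guard against log(0)
    return score

# Toy distogram for three residues with bins: under 4, 4 to 8, and 8 or more angstroms.
bin_edges = [4.0, 8.0]
near = [0.80, 0.15, 0.05]   # neighbours are probably under 4 angstroms apart
mid = [0.10, 0.70, 0.20]    # residues 0 and 2 are probably 4 to 8 angstroms apart
distogram = [
    [None, near, mid],
    [None, None, near],
    [None, None, None],
]

extended = [(0.0, 0.0, 0.0), (3.8, 0.0, 0.0), (7.6, 0.0, 0.0)]
collapsed = [(0.0, 0.0, 0.0), (3.8, 0.0, 0.0), (0.5, 0.0, 0.0)]
good = distance_score(extended, distogram, bin_edges)
bad = distance_score(collapsed, distogram, bin_edges)
```

The extended chain places every pair in its most probable bin and so gets the lower (better) score. The collapsed chain puts residues 0 and 2 almost on top of each other, which the toy distogram considers unlikely.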
Fig.2. An overview of the main neural network architecture. The model operates over evolutionarily related protein sequences as well as amino acid residue pairs, iteratively passing the information between both representations to generate a structure.
A generative neural network was trained to invent new fragments, which were used to continually improve the score of the proposed protein structure. The second method optimized scores through gradient descent, a mathematical technique commonly used in machine learning for making small, incremental improvements, which resulted in highly accurate structures.
Fig.3. AlphaFold Learning Process.
To simplify the prediction process, this technique was applied to the entire protein chain rather than to pieces that must be folded separately before being assembled into a larger structure.
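For readers who have not met gradient descent before, the sketch below shows the idea on a toy problem: three points in a plane are nudged, a small step at a time, until the distances between them match a set of target distances. All points are updated together at every step, in the spirit of optimizing the whole chain at once. This is only an illustration under our own assumptions. AlphaFold optimized torsion angles against its learned potential, whereas the squared-distance loss, the step size and the function names here are made up for the example.

```python
import math

def distance_loss(coords, targets):
    """Sum of squared differences between actual and target pair distances."""
    return sum(
        (math.dist(coords[i], coords[j]) - d) ** 2
        for (i, j), d in targets.items()
    )

def gradient(coords, targets):
    """Analytic gradient of distance_loss with respect to every coordinate."""
    grad = [[0.0, 0.0] for _ in coords]
    for (i, j), d in targets.items():
        dist = math.dist(coords[i], coords[j])
        if dist == 0.0:
            continue  # direction undefined; skip this pair for this step
        for k in range(2):
            g = 2.0 * (dist - d) * (coords[i][k] - coords[j][k]) / dist
            grad[i][k] += g
            grad[j][k] -= g
    return grad

def descend(coords, targets, step=0.05, iterations=500):
    """Repeatedly move every point a small step against the gradient."""
    coords = [list(c) for c in coords]
    for _ in range(iterations):
        grad = gradient(coords, targets)
        for c, g in zip(coords, grad):
            for k in range(2):
                c[k] -= step * g[k]
    return coords

# Three points that should end up one unit apart from each other.
targets = {(0, 1): 1.0, (1, 2): 1.0, (0, 2): 1.0}
start = [(0.0, 0.0), (2.0, 0.0), (1.0, 0.5)]
folded = descend(start, targets)
print(distance_loss(start, targets), distance_loss(folded, targets))
```

The loss starts above 1 and falls to practically zero as the points settle into an equilateral triangle.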
Fig.4. An animation of the gradient descent method predicting a structure for CASP13 T1008.
The following tools and datasets were used for the Critical Assessment of Structure Prediction (CASP) system and for a few experiments: Uniclust30; PSI-BLAST v2.6.0; BioPython v1.65; Rosetta v3.5; and PyMOL v2.2.0 for structure visualization. To assess accuracy, they compared the final structures with the experimentally determined ones, using metrics such as the Template Modeling (TM) score. For the distance potential, the histogram probabilities are estimated for discrete distance bins, so to construct a differentiable potential the distribution is interpolated with a cubic spline. The final bin accumulates the probability mass from all distances beyond 22 Å, and because greater distances are harder to predict accurately, the potential was only fitted up to 18 Å, with a constant extrapolation beyond that.
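The sketch below illustrates how binned probabilities can be turned into a smooth potential. It takes the negative logarithm of the probability at each bin centre, interpolates between the centres with a cubic curve, and holds the value constant beyond 18 Å. Several things here are our assumptions rather than the published method. The bins are coarse and evenly spaced, the probabilities are invented, and the interpolation is a Catmull-Rom (cubic Hermite) spline, chosen so the example needs only the Python standard library. A library cubic spline would serve the same purpose.

```python
import math

def make_potential(bin_centers, probabilities, cutoff=18.0):
    """Return a smooth function V(d) built from binned probabilities.

    V is the negative log-probability at each bin centre, interpolated with a
    Catmull-Rom cubic spline and held constant outside the fitted range.
    Assumes evenly spaced bin centres.
    """
    knots = [(c, -math.log(p)) for c, p in zip(bin_centers, probabilities) if c <= cutoff]
    xs = [x for x, _ in knots]
    ys = [y for _, y in knots]
    h = xs[1] - xs[0]

    def slope(i):
        # Central difference inside the range, one-sided at the two ends.
        lo, hi = max(i - 1, 0), min(i + 1, len(xs) - 1)
        return (ys[hi] - ys[lo]) / (xs[hi] - xs[lo])

    def potential(d):
        d = min(max(d, xs[0]), xs[-1])  # constant extrapolation at both ends
        i = min(int((d - xs[0]) / h), len(xs) - 2)
        t = (d - xs[i]) / h
        # Cubic Hermite basis functions.
        h00 = 2 * t**3 - 3 * t**2 + 1
        h10 = t**3 - 2 * t**2 + t
        h01 = -2 * t**3 + 3 * t**2
        h11 = t**3 - t**2
        return (h00 * ys[i] + h10 * h * slope(i)
                + h01 * ys[i + 1] + h11 * h * slope(i + 1))

    return potential

# Toy distribution over bins centred at 2, 4, ..., 22 angstroms, peaked at 8.
centers = [2.0 + 2.0 * k for k in range(11)]
probs = [0.01, 0.04, 0.15, 0.40, 0.20, 0.08, 0.04, 0.03, 0.02, 0.02, 0.01]
V = make_potential(centers, probs)
print(V(8.0), V(9.0), V(18.0), V(25.0))
```

The potential is lowest at 8 Å, where the toy distribution peaks, and `V(25.0)` equals `V(18.0)` because nothing beyond the cutoff is fitted. Because the curve is smooth between bin centres, a gradient-based optimizer can follow it.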
Fig.5. Improvements were seen in the median accuracy of predictions in the free-modeling category for the best team in each CASP, measured as best-of-5 GDT.
Predicted structures have a wide range of uses, each with different accuracy requirements, from understanding the general shape of the fold to understanding the detailed side chain configurations in binding regions.
Fig.6. Two examples of protein targets in the free-modeling category. AlphaFold predicts highly accurate structures measured against experimental results.
To find protein structures that minimize the constructed potential, the team created a differentiable model of ideal protein backbone geometry, which gives the backbone atom coordinates as a function of the torsion angles. The complete potential to be minimized is then the sum of the distance potential, a torsion potential, and a steric term from Rosetta (score2_smooth). There is no guarantee that these potentials are on equivalent scales, so scaling parameters on the terms were introduced and chosen by cross-validation on CASP12 free-modeling (FM) domains. In practice, equal weighting of all terms was found to give the best results.
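The two ideas in this paragraph, coordinates computed from angles and a total potential formed as a weighted sum of terms, can be sketched with a toy chain. The example below is deliberately simplified and rests on our own assumptions. The chain lives in 2D, each "torsion" is just a turn at a joint, the bond length of 3.8 Å is a typical spacing between neighbouring residues, and both potential terms are stand-ins invented for illustration. The real system works with 3D backbone geometry and learned potentials.

```python
import math

def build_chain(angles, bond_length=3.8):
    """Place the points of a 2D chain; each angle is the turn taken at a joint."""
    x, y, heading = 0.0, 0.0, 0.0
    coords = [(x, y)]
    for turn in [0.0] + list(angles):
        heading += turn
        x += bond_length * math.cos(heading)
        y += bond_length * math.sin(heading)
        coords.append((x, y))
    return coords

def distance_term(coords, targets):
    """Penalty for pair distances that differ from their targets."""
    return sum(
        (math.dist(coords[i], coords[j]) - d) ** 2
        for (i, j), d in targets.items()
    )

def angle_term(angles, preferred):
    """Penalty for angles away from preferred values; 1 - cos keeps it periodic."""
    return sum(1.0 - math.cos(a - p) for a, p in zip(angles, preferred))

def total_potential(angles, targets, preferred, weights=(1.0, 1.0)):
    """Weighted sum of the terms; equal weights by default."""
    coords = build_chain(angles)
    return (weights[0] * distance_term(coords, targets)
            + weights[1] * angle_term(angles, preferred))

# A four-point chain whose ends should sit 3.8 angstroms apart.
targets = {(0, 3): 3.8}
preferred = [math.pi / 2, math.pi / 2]
folded = total_potential([math.pi / 2, math.pi / 2], targets, preferred)
straight = total_potential([0.0, 0.0], targets, preferred)
print(folded, straight)
```

With two right-angle turns the chain forms three sides of a square, so both terms are essentially zero. The straight chain is penalized by both terms, for a total of about 59.76. Because every coordinate is a smooth function of the angles, the total can be minimized over the angles by gradient descent.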
Fig.7. Neural Networks being used to predict physical properties.
This breakthrough in AI-based protein folding builds on DeepMind's first entry at CASP13 in 2018, where the initial version of AlphaFold achieved the highest accuracy among all participants. For CASP14, the DeepMind team developed new deep learning architectures that draw inspiration from the fields of physics, machine learning, and biology, as well as from the work of many scientists over the past years.
For the latest version of AlphaFold (AlphaFold 2), used at CASP14, DeepMind created an attention-based neural network system that attempts to interpret the structure of the folded protein, which can be thought of as a “spatial graph”. It uses evolutionarily related sequences, a multiple sequence alignment (MSA), and a representation of amino acid residue pairs to refine this graph.
By iterating this process, the system develops strong predictions of the underlying physical structure of the protein. AlphaFold can also indicate which parts of each predicted structure are reliable, using an internal confidence measure.
AlphaFold is therefore a once-in-a-generation advance that predicts protein structures with incredible speed and precision, and it made history on November 30, 2020. It not only did an outstanding job at CASP14; there are also several practical reasons to use it:
It offers an alternative to certain experimental methods of determining protein structure.
It is less expensive and less laborious than the methods researchers used previously.
It uses artificial intelligence to predict the structure of the folded protein.
Its accuracy is higher than that of earlier prediction methods, so its predicted structures are more likely to match the real ones.
With knowledge of protein structures in hand, many “impossible” problems can now begin to be solved. From diagnosing fatal diseases to devising plastic-eating bacteria, it all seems achievable now.
There is a set of rare diseases in which a mutation in a single gene results in a malformed protein, which threatens the whole organism. AlphaFold can be extremely useful in predicting the shape of such a protein, which is an important step towards effective drug discovery. Another benefit is that AlphaFold is much faster and more cost-effective than finding structures by experimentation.
It could also prove useful in future pandemics like COVID-19 by accurately predicting viral protein structures soon after a virus appears, accelerating vaccine development.
AlphaFold has opened doors to a future in which one could rapidly acquire knowledge of a disease and so develop more effective drugs efficiently, or one in which enzymes could break down plastic waste or even capture carbon from the atmosphere, all with the help of protein structures. Even though there is a lot more work to be done, this knowledge of the shapes of the building blocks of life could enable scientists to better understand the natural world and perhaps expand our knowledge of life itself.
What we all need to remember is that this solution to the 50-year-old grand problem is not the end of the road, but rather a stepping stone to much more ground-breaking research yet to come!
Even though AlphaFold is way ahead of its time, there is still plenty to be done. The current version, says Dr. Jumper, has room to grow, and he thinks the software’s accuracy can be boosted still further. There are also, for now, things that remain beyond its reach, such as working out how structures built from several proteins are joined together, or determining the precise location of all amino acid side chains.
Nonetheless, AlphaFold has revealed the outstanding potential that AI holds, and this has strengthened our confidence that AI will become one of the essential tools for expanding our scientific knowledge.
Data and Code Availability:
The training, validation, and test data splits (CATH domain codes) and the source code for the distogram, reference distogram, and torsion prediction neural networks, together with the neural network weights and input data for the CASP13 targets, are available for research and non-commercial use at alphafold_casp13. The following versions of public datasets were used in this study: PDB 2018-03-15; CATH 2018-03-16; Uniclust30 2017-10; and PSI-BLAST nr dataset (as of 15 December 2017). Several open-source libraries were used to conduct these experiments, in particular HHblits, PSI-BLAST, and the machine-learning framework TensorFlow (https://github.com/tensorflow/tensorflow), along with the TensorFlow library Sonnet (https://github.com/deepmind/sonnet), which provides implementations of individual model components.