Protein folding problem solved?

The protein folding problem is a fascinating one. Proteins are linear strings of amino acids. But they are not floppy strings. They take unique 3-D shapes and those shapes are critical for how they function. Clearly, there must be some rule or mechanism contained within the amino acid sequence that tells a protein how to bend, but discovering those rules has not been easy. Now an AI algorithm, given a sequence of amino acids, has been able to predict with considerable success the shape of that protein.

In 1972, Christian Anfinsen was awarded a Nobel Prize for his work showing that it should be possible to determine the shape of proteins based on the sequence of their amino acid building blocks.

Every two years, scores of teams from more than 20 countries use computers to predict, blind, the shapes of a set of around 100 proteins from amino acid sequences alone.

At the same time, the 3-D structures are worked out in the lab by biologists using traditional techniques like X-ray crystallography and NMR spectroscopy, which determine the location of each atom relative to the others in the protein molecule.

A team of scientists from Casp (the Community Wide Experiment on the Critical Assessment of Techniques for Protein Structure Prediction) then compares these predictions with 3-D structures solved using experimental methods.
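The article does not say how the assessors score a prediction against the experimental structure. As a rough illustration only, here is a minimal sketch of one simple comparison, a distance RMSD, which compares the pairwise distances inside each structure and so needs no prior alignment of the two. The function names and the toy coordinates are hypothetical, and this is a stand-in, not the assessors' actual measure.

```python
import math
from itertools import combinations


def pairwise_distances(coords):
    """Return the distances between every pair of points, in a fixed order."""
    return [math.dist(a, b) for a, b in combinations(coords, 2)]


def distance_rmsd(predicted, experimental):
    """Root-mean-square difference between the two sets of pairwise distances."""
    if len(predicted) != len(experimental):
        raise ValueError("structures must have the same number of residues")
    if len(predicted) < 2:
        raise ValueError("need at least two residues")
    d_pred = pairwise_distances(predicted)
    d_exp = pairwise_distances(experimental)
    squared = [(p - e) ** 2 for p, e in zip(d_pred, d_exp)]
    return math.sqrt(sum(squared) / len(squared))


# Hypothetical toy coordinates: one point per residue, in angstroms.
experimental = [(0.0, 0.0, 0.0), (3.8, 0.0, 0.0), (3.8, 3.8, 0.0)]
# Same shape, merely moved in space: should score (almost exactly) zero.
shifted = [(x + 10.0, y - 2.0, z + 5.0) for x, y, z in experimental]
# A different shape (straightened out): should score above zero.
stretched = [(0.0, 0.0, 0.0), (3.8, 0.0, 0.0), (7.6, 0.0, 0.0)]

print(distance_rmsd(experimental, shifted))
print(distance_rmsd(experimental, stretched))
```

One known limitation of this particular measure is that it cannot tell a structure from its mirror image, since both have identical internal distances.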

In the latest round of the challenge, Casp-14, AlphaFold determined the shape of around two thirds of the proteins with accuracy comparable to laboratory experiments.

The assessors said accuracy with most of the other proteins was also high, though not quite at that level.

AlphaFold is based on a concept called deep learning. In this process, the structure of a folded protein is represented as a spatial graph.

The program then “learns” using information on the 3-D shapes of known proteins held in a worldwide database.
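To make the "spatial graph" idea a little more concrete, here is a minimal sketch in which each amino acid residue is a node and two residues are joined by an edge when they sit close together in space. The article does not describe AlphaFold's actual representation, so the function name, the distance cutoff, and the toy coordinates are all assumptions made for illustration.

```python
import math


def build_spatial_graph(coords, cutoff=8.0):
    """Map each residue index to the set of residues within `cutoff` of it."""
    graph = {i: set() for i in range(len(coords))}
    for i in range(len(coords)):
        for j in range(i + 1, len(coords)):
            if math.dist(coords[i], coords[j]) <= cutoff:
                graph[i].add(j)
                graph[j].add(i)
    return graph


# Hypothetical four-residue chain bent back on itself (angstroms).
# Residues 0 and 3 are not neighbours in the sequence but are close in space.
coords = [(0.0, 0.0, 0.0), (3.8, 0.0, 0.0), (3.8, 3.8, 0.0), (0.0, 3.8, 0.0)]
graph = build_spatial_graph(coords, cutoff=4.0)

for residue, neighbours in graph.items():
    print(residue, sorted(neighbours))
```

The point of such a graph is that it records which residues end up near each other after folding, which is information the linear sequence alone does not show.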

The article claims that the protein folding mystery has been solved. While this is a big step, I am not sure I would go that far. There is a difference between being able to predict something based on patterns extracted from a database and being able to predict it based on an underlying mechanism that one has unearthed. The former is what Aristotle might have called ‘know how’ while the latter is ‘know why’. It is the difference between predictions of planetary motion before Newtonian mechanics, which were based on patterns and rules like Kepler’s laws, and those after Newtonian mechanics, which were based on the laws of motion and gravity.

As far as I am aware, we still do not know the underlying mechanism that determines how the protein folds. To me, that would constitute having really solved the problem.

Any biologists want to chime in?


  1. nifty says

    I’ll start. This is an area I did some work on in graduate school. The major challenge, in my thinking, is that the problem would be reasonably simple in vacuum, but proteins do not fold in vacuum; they fold in salty water. The folding interactions depend on the balance of interactions of the amino acids with each other compared to their interactions with the ions and dipoles present in water. Part of the driving force is also entropy changes in the water: if your protein interacts with itself, that may then free water molecules that were formerly in restricted positions with the unfolded protein. This is progress if they are now at about the 2/3 level. A while ago this was at the 1/2 level.
    Also note that this really is working on proteins that fold similarly to those that have been examined before, and is not as accurate for new classes of proteins with no prior info.
    Finally, some proteins are unstructured as part of their function:

  2. invivoMark says

    The mechanisms of folding aren’t especially mysterious. Proteins mostly fold spontaneously as they are synthesized by the ribosome. Sometimes they fold just by the forces of the salts and water alone, sometimes they fold with the assistance of a chaperone protein. It’s all driven by simple ionic and van der Waals forces.

    The problem is that the forces applied by each individual water molecule and salt ion, along with intramolecular forces of each amino acid on each other amino acid, are astronomically complex and impossible to computationally model from basic principles (at least, too complex to compute before the inevitable heat death of the universe).

    There are other factors that also increase complexity. Many proteins are trans-membrane proteins, with one or more passes through a lipid bilayer. They must be synthesized at the membrane’s surface, and are threaded through the membrane as they’re made.

    Many proteins change conformation after they’re synthesized, due to post-translational modifications such as added sugars and phosphates, or because they’re part of large protein complexes that take on their own shapes.

    These factors all serve to complicate the problem even more dramatically. But the mechanisms are still mostly the same.

  3. nifty says

    One feature I do like in the community of researchers working in this area: those about ready to publish their results on a new structure, either X-ray or NMR, will often let the computational people know in advance. Then the modelers can focus on work on these proteins, knowing that in fairly short time periods they will get some good feedback on how well their modeling approaches worked.

  4. Reginald Selkirk says

    @3 invivoMark covered a lot of the exceptions: post-translational modifications (he did not specifically mention proteolysis), proteins that require chaperonins.
    That proteins are in salty water rather than vacuum should in principle not be too big a hurdle; that should be incorporated into the force fields used. I would compare it to “movie physics” -- it doesn’t have to be 100% accurate, just accurate enough to get the job done.
    The major mathematical/philosophical objection I would make is that since their algorithm is based on known structures, there is always the possibility that a protein of interest will feature something that is truly novel that is not handled well by the training set. On second reading, I notice that @1 nifty mentioned this.

  5. John Morales says


    The major mathematical/philosophical objection I would make is that since their algorithm is based on known structures […]

    More accurately, the “algorithm”:

    We trained a neural network to predict a distribution of distances between every pair of residues in a protein (visualised in Figure 2). These probabilities were then combined into a score that estimates how accurate a proposed protein structure is. We also trained a separate neural network that uses all distances in aggregate to estimate how close the proposed structure is to the right answer.

    Using these scoring functions, we were able to search the protein landscape to find structures that matched our predictions. Our first method built on techniques commonly used in structural biology, and repeatedly replaced pieces of a protein structure with new protein fragments. We trained a generative neural network to invent new fragments, which were used to continually improve the score of the proposed protein structure.
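The scoring idea in the quoted passage, predicted distance distributions for each residue pair combined into a single score for a proposed structure, can be sketched roughly as follows. The quote does not say how the probabilities were combined, so summing log-probabilities is an assumption here, and the bin edges, the `predicted` table, and the function names are hypothetical. In the real system the distributions would come from the trained network rather than being typed in by hand.

```python
import math
from itertools import combinations


def bin_index(distance, bin_edges):
    """Return the index of the bin containing `distance` (last bin is open-ended)."""
    for k, upper in enumerate(bin_edges):
        if distance < upper:
            return k
    return len(bin_edges)


def structure_score(coords, predicted, bin_edges, floor=1e-6):
    """Sum the log-probabilities that the predicted distributions assign to
    the distances actually found in `coords`. Higher means better agreement."""
    total = 0.0
    for i, j in combinations(range(len(coords)), 2):
        k = bin_index(math.dist(coords[i], coords[j]), bin_edges)
        # `floor` avoids log(0) when a distance falls in a zero-probability bin.
        total += math.log(max(predicted[(i, j)][k], floor))
    return total


# Hypothetical three-bin histogram: under 5, 5 to 10, and over 10 angstroms.
bin_edges = [5.0, 10.0]
predicted = {
    (0, 1): [0.9, 0.1, 0.0],
    (0, 2): [0.1, 0.8, 0.1],
    (1, 2): [0.9, 0.1, 0.0],
}

# Two hypothetical candidate structures for the same three residues.
compact = [(0.0, 0.0, 0.0), (3.8, 0.0, 0.0), (3.8, 3.8, 0.0)]
extended = [(0.0, 0.0, 0.0), (3.8, 0.0, 0.0), (15.0, 0.0, 0.0)]

print(structure_score(compact, predicted, bin_edges))
print(structure_score(extended, predicted, bin_edges))
```

A search procedure like the fragment replacement described above would then propose modified structures and keep those that raise this score.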
