Leading Professional Society for Computational Biology and Bioinformatics
Connecting, Training, Empowering, Worldwide


3DSIG COSI Track Presentations

3DSIG Opening Remarks
Date: Saturday, July 22
Time: 10:00 AM - 10:20 AM
Room: Forum
  • Rafael Najmanovich, University of Montreal, Canada

Opening remarks: 3DSIG past, present & future

Cell-wide analysis of protein thermal unfolding reveals determinants of thermostability
Date: Saturday, July 22
Time: 10:20 AM - 10:40 AM
Room: Forum
  • Abdullah Kahraman, University of Zurich; Sabanci University, Switzerland; Turkey
  • Pascal Leuenberger, ETH Zurich, Switzerland
  • Stefan Ganscha, ETH Zurich, Switzerland
  • Valentina Cappelletti, ETH Zurich, Switzerland
  • Paul J. Boersema, ETH Zurich, Switzerland
  • Christian von Mering, University of Zurich, Switzerland
  • Manfred Claassen, ETH Zurich, Switzerland
  • Paola Picotti, ETH Zurich, Switzerland

Abstract
Temperature-induced cell death is thought to be due to protein denaturation, but the determinants of thermal sensitivity of proteomes remain largely uncharacterized. We developed a structural proteomic strategy to measure protein thermostability on a proteome-wide scale and with domain-level resolution. We applied it to E. coli, S. cerevisiae, T. thermophilus, and human cells, yielding thermostability data for more than 8000 proteins. Our results (1) indicate that temperature-induced cellular collapse is due to the loss of a subset of proteins with key functions, (2) shed light on the evolutionary conservation of protein and domain stability, (3) suggest that natively disordered proteins in a cell are less prevalent than predicted, and (4) suggest that highly expressed proteins are stable because they are designed to tolerate translational errors that would lead to the accumulation of toxic misfolded species.

Introduction
Temperature is crucially important to life. Small temperature changes can differentiate optimal and lethal growth conditions of living organisms. Because of the higher abundance and lower stability of proteins as compared with those of other biological macromolecules, thermally induced cell death is thought to be due to protein denaturation, but the determinants of thermal sensitivity of proteomes remain largely uncharacterized.

Methods
To determine the thermal stability of proteins on a proteome-wide scale and with domain-level resolution, we developed a structural proteomic approach that relies on limited proteolysis (LiP) and mass spectrometry (MS) applied over a range of temperatures.
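A thermal denaturation profile is often summarized by a melting temperature. The following minimal sketch assumes a hypothetical per-peptide profile (fraction of native signal remaining at each temperature) and estimates the midpoint by linear interpolation. The function name, the 0.5 threshold, and the data values are illustrative assumptions, not the authors' LiP-MS analysis.

```python
def melting_temperature(temps, fraction_folded, threshold=0.5):
    """Return the temperature where the profile first drops through `threshold`.

    Uses linear interpolation between the two bracketing measurements.
    Returns None if the profile never crosses the threshold.
    """
    points = list(zip(temps, fraction_folded))
    for (t0, f0), (t1, f1) in zip(points, points[1:]):
        if f0 >= threshold > f1:
            # Linear interpolation between (t0, f0) and (t1, f1).
            return t0 + (f0 - threshold) * (t1 - t0) / (f0 - f1)
    return None


# Hypothetical profile (made-up values): temperatures in degrees Celsius
# and the fraction of native signal that remains at each temperature.
temps = [37, 42, 47, 52, 57, 62, 67]
folded = [1.00, 0.98, 0.90, 0.60, 0.20, 0.05, 0.01]
tm = melting_temperature(temps, folded)
```

A sigmoidal two-state fit would be a common alternative when the profile is noisy; interpolation is used here only to keep the sketch short.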

Results
Our LiP-MS strategy was validated through analysis of purified proteins in the presence and absence of a biologically relevant matrix. We then obtained proteome-wide thermal denaturation profiles for E. coli, S. cerevisiae, T. thermophilus, and human cells. In contrast to previous predictions that proteome instability derives from the simultaneous and generalized loss of hundreds of proteins, we observed that at a temperature at which cells experience temperature-induced physiological impairment, a subset of essential proteins undergoes denaturation.

Confirming results of previous studies on the basis of comparison of genomes of thermophilic and mesophilic bacteria, we observed enrichment for lysine residues and β-sheet structures in thermostable proteins. We also found that unstable proteins have a higher content of aspartic acid than stable proteins and observed an inverse correlation between protein length and thermal stability. Further, thermostable proteins are substantially less prone to thermal aggregation than unstable proteins. Relative domain thermostability was conserved both within species and across organisms. Thermal stability was not generally similar for proteins encoded by orthologous genes. This suggests that the melting temperatures of proteins are affected by the reshuffling of protein domains, despite the conservation of domain stability.

According to the “translational robustness” theory, highly expressed proteins must tolerate translational errors that can lead to the accumulation of toxic misfolded species. Our data show a clear direct relationship between protein thermal stability and intracellular abundance and an inverse relationship between protein stability and aggregation or local unfolding. Increasing the thermodynamic stabilities of the folds of abundant proteins will broaden the range of amino acid replacements that a protein can tolerate before misfolding. Our findings suggest that over the course of evolution, the burden of intracellular misfolding has been reduced by increasing the thermodynamic stability of abundant proteins.

Conclusions
Our study contributes insight into the molecular and evolutionary bases of protein and proteome thermostability and provides a blueprint for future studies on the stability of proteomes and thermal denaturation.

How nature builds electrostatic interactions in natural enzymes: What can we learn for enzyme design?
Date: Saturday, July 22
Time: 10:40 AM - 11:00 AM
Room: Forum
  • Timothy A. Coulther, Northeastern University, United States
  • Lisa Ngu, Northeastern University, United States
  • Penny J. Beuning, Northeastern University, United States
  • Mary Jo Ondrechen, Northeastern University, United States

Abstract
Enzyme design is in its infancy and basic design principles are not yet established. We show how strong electrostatic coupling between proton transfer equilibria is prevalent in the active sites of natural enzymes and facilitates catalysis. Multilayer networks of coupled charged residues promote catalytic efficiency, as observed in natural enzymes and in evolved artificial enzymes.

Introduction
Enzymes catalyze reactions at physiological temperature and neutral pH with high specificity. Therefore they hold great appeal as green industrial catalysts because of the potential for significant reduction in energy expenditures and in unwanted by-products. However, for most industrial chemical processes, no natural enzymes exist. Efforts to design enzymes to catalyze unnatural reactions have had some significant successes; indeed, stable folds can be designed in silico such that the necessary “reagent” residues are in the correct spatial arrangement to facilitate catalysis [1, 2]. These initial designs typically have very low activity; achieving significant activity requires many rounds of directed evolution over periods of years. The goal of this work is to understand the intrinsic properties that give natural enzymes their catalytic power and to learn how to build these properties into in silico enzyme design.

Methods
Partial Order Optimum Likelihood (POOL) [3, 4] is a machine learning method developed by us to predict residues important for function, using the 3D structure of the query protein. The input features to POOL are based on computed electrostatic and chemical properties from THEMATICS. These input features are effectively measures of the strength of coupling between protonation events. POOL is used to characterize the properties of natural enzymes that are necessary for efficient catalysis and to explore how these properties may be built into designed enzymes.

Results & Conclusions
Catalytic sites in proteins are characterized by networks of strongly coupled protonation states; these networks impart the necessary electrostatic and proton-transfer properties to the active residues in the first layer around the reacting substrate molecule(s). Typically these networks include first-, second-, and sometimes third-layer residues. POOL-predicted, multi-layer active sites with significant participation by distal residues have been verified experimentally by single-point site-directed mutagenesis and kinetics assays for Ps. putida nitrile hydratase [5], human phosphoglucose isomerase (Figure 1) [5], E. coli Y family DNA polymerase DinB [6], E. coli replicative DNA polymerase Pol III, and E. coli ornithine transcarbamoylase.


FIGURE 1. Multilayer active site of human phosphoglucose isomerase. Ligands in the substrate binding site are shown in space-filling form. First-layer residue side chains are colored green, second-layer blue, and third-layer orange.

In designed enzymes, such as retroaldolases, the residue-specific input features to POOL – measures of the strength of coupling between protonation equilibria – rise as the enzymes evolve to higher rates of catalytic turnover. An approach to build these properties into the initial designs is proposed.

References
1. Rothlisberger, D., Khersonsky, O., Wollacott, A.M., Jiang, L., DeChancie, J., Betker, J., Gallaher, J.L. et al. Nature 453, 190-195 (2008).
2. Siegel, J.B., Zanghellini, A., Lovick, H.M., Kiss, G., Lambert, A.R., St Clair, J.L., Gallaher, J.L. et al. Science 329, 309-313 (2010).
3. Tong, W., Wei, Y., Murga, L.F., Ondrechen, M.J. & Williams, R.J. PLoS Comp Biol 5, e1000266 (2009).
4. Somarowthu, S., Yang, H., Hildebrand, D.G.C. & Ondrechen, M.J. Biopolymers 95, 390-400 (2011).
5. Brodkin, H.R., DeLateur, N.A., Somarowthu, S., Mills, C.L., Novak, W.R., Beuning, P.J., Ringe, D. & Ondrechen, M.J. Protein Sci. 24, 762-778 (2015).
6. Walsh, J.M., Parasuram, R., Rajput, P.R., Rozners, E., Ondrechen, M.J. & Beuning, P.J. Envi. and Molec. Mutagenesis 53, 766-776 (2012).

Computational design of a symmetrical beta-trefoil lectin with cancer cell binding activity
Date: Saturday, July 22
Time: 11:00 AM - 11:20 AM
Room: Forum
  • Daiki Terada, Yokohama City University; RIKEN, Japan
  • Arnout R. D. Voet, KU Leuven, Belgium
  • Hiroki Noguchi, KU Leuven, Belgium
  • Kenichi Kamata, Yokohama City University; RIKEN, Japan
  • Mio Ohki, Yokohama City University; RIKEN, Japan
  • Christine Addy, Yokohama City University; RIKEN, Japan
  • Yuki Fujii, Nagasaki International University, Japan
  • Daiki Yamamoto, Yokohama City University, Japan
  • Yasuhiro Ozeki, Yokohama City University, Japan
  • Jeremy R. H. Tame, Yokohama City University, Japan
  • Kam Zhang, RIKEN, Japan

Abstract
Computational protein design has advanced very rapidly over the last decade, but there remain few examples of artificial proteins with direct medical applications. We describe a new artificial β-trefoil lectin that recognizes Burkitt’s lymphoma cells. The new protein, Mitsuba-1, contains 150 residues with three identical tandem repeats. Mitsuba-1 was expressed and crystallized to confirm that the X-ray structure matches the predicted model. Mitsuba-1 recognizes cancer cells that express globotriose on the surface, but the cytotoxicity is abolished.

Introduction
MytiLec is a small lectin isolated from the Mediterranean mussel, and found to bind sugar chains with α-D-galactose. MytiLec-1 shows cytotoxic effects towards certain cancer cells including Burkitt’s lymphoma. Recently, we have created perfectly symmetrical proteins from natural templates based on the view that many nearly symmetrical ring-shaped proteins have evolved through exactly such an intermediate phase. We designed Pizza, a β-propeller protein with six identical blades, and showed it can fold readily and is extremely stable [1]. We have further demonstrated that this symmetrical protein can be used as a template for biomineralization [2]. Here we have adopted a similar procedure and applied it to MytiLec-1, to create a protein with three identical subdomains that retains sugar binding activity and the ability to bind selected cell types.

Methods
We used ancestral sequence reconstruction to derive likely parent sequences assuming evolution through duplication, and then computationally evaluated these sequences for stability. The detailed method [3] is as follows. A template structure for the desired symmetric protein is selected from the Protein Data Bank. All the individual subunits are structurally superimposed and their sequences aligned. A phylogenetic tree is created based on this alignment. Putative ancestral sequences are generated using a maximum likelihood based ancestral reconstruction algorithm. One representative subunit that is the closest to all the other subunits is identified and used to generate the backbone of an ideal protein with perfect symmetry. The energy of each of the ancestral sequences adopting the ideal protein structure is calculated. The sequence with the lowest energy is selected and duplicated multiple times to create the protein with the desired number of identical sequences repeated in tandem. This sequence is then used to produce the recombinant protein for experimental verification.
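The two selection steps in this procedure (choosing the representative subunit and choosing the lowest-energy ancestral sequence) can be sketched as follows. This is a minimal illustration under stated assumptions: the pairwise distances, candidate sequences, and energies are made-up placeholders, and the function names are hypothetical. The structural superposition, tree building, ancestral reconstruction, and energy calculation themselves are not shown.

```python
def pick_representative(distances):
    """Index of the subunit with the smallest summed distance to all others (the medoid)."""
    return min(range(len(distances)), key=lambda i: sum(distances[i]))


def design_symmetric_sequence(candidate_energies, n_repeats):
    """Pick the lowest-energy candidate sequence and repeat it in tandem."""
    best = min(candidate_energies, key=candidate_energies.get)
    return best, best * n_repeats


# Hypothetical pairwise RMSDs (in angstroms) between three superimposed subunits.
rmsd = [
    [0.0, 0.8, 1.1],
    [0.8, 0.0, 0.6],
    [1.1, 0.6, 0.0],
]
representative = pick_representative(rmsd)

# Hypothetical ancestral sequences with made-up energies (arbitrary units),
# as if each had been threaded onto the ideal symmetric backbone.
energies = {"GSKLYV": -12.4, "GTKLYV": -15.1, "GSRLYI": -13.9}
subunit, full_sequence = design_symmetric_sequence(energies, n_repeats=3)
```

Using the medoid as the "closest to all others" subunit is one reasonable reading of the text; other definitions of closeness would change only `pick_representative`.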

Results & Conclusions
We have used this method to design a perfectly symmetric β-trefoil protein called Mitsuba-1 containing three identical subdomains. Each subdomain consists of 47 residues. Mitsuba-1 was cloned and expressed in E. coli and subsequently purified and crystallized. The crystal structure of Mitsuba-1 was determined at 1.54 Å resolution. It confirmed the design and also revealed the binding of three GalNAc ligands. Mitsuba-1 is stable, with a melting temperature of 55 °C determined by CD, and remains folded in the presence of 3.6 M GdmCl. Mitsuba-1 can bind to Raji cells, which are derived from Burkitt’s lymphoma, but was not found to reduce their viability. Mitsuba-1 is a further example demonstrating the effectiveness of our method for creating repeat proteins by examining probable evolutionary routes to existing natural proteins.


FIGURE 1. The overall structure of Mitsuba-1 viewed along the pseudo-three-fold symmetry axis or perpendicular to it.

References
1. Voet, A. R. D., et al. (2014) Computational design of a self-assembling symmetrical β-propeller protein. Proc. Natl. Acad. Sci. U.S.A., 111, 15102.
2. Voet, A. R. D., et al. (2015) Biomineralization of a Cadmium Chloride Nanocrystal by a Designed Symmetrical Protein. Angew. Chem. Int. Ed., 54, 9857.
3. Voet, A. R. D., et al. (2017) Evolution-Inspired Computational Design of Symmetric Proteins. Methods in Molecular Biology, 1529, 309.

CATS (Coordinates of Atoms by Taylor Series): Protein design with backbone flexibility in all locally feasible directions
Date: Saturday, July 22
Time: 11:20 AM - 11:40 AM
Room: Forum
  • Mark Hallen, Toyota Technological Institute at Chicago, United States
  • Bruce Donald, Duke University, United States

Motivation: When proteins mutate or bind to ligands, their backbones often move significantly, especially in loop regions. Computational protein design algorithms must model these motions in order to accurately optimize protein stability and binding affinity. However, methods for backbone conformational search in design have been much more limited than for sidechain conformational search. This is especially true for combinatorial protein design algorithms, which aim to search a large sequence space efficiently and thus cannot rely on temporal simulation of each candidate sequence.
Results: We alleviate this difficulty with a new parameterization of backbone conformational space, which represents all degrees of freedom of a specified segment of protein chain that maintain valid bonding geometry (by maintaining the original bond lengths and angles and ω dihedrals). In order to search this space, we present an efficient algorithm, CATS, for computing atomic coordinates as a function of our new continuous backbone internal coordinates. CATS generalizes the iMinDEE and EPIC protein design algorithms, which model continuous flexibility in sidechain dihedrals, to model continuous, appropriately localized flexibility in the backbone dihedrals φ and ψ as well. We show using 81 test cases based on 29 different protein structures that CATS finds sequences and conformations that are significantly lower in energy than methods with less or no backbone flexibility do. In particular, we show that CATS can model the viability of an antibody mutation known experimentally to increase affinity, but that appears sterically infeasible when modeled with less or no backbone flexibility.
Availability: Our code is available as free software at https://github.com/donaldlab/OSPREY_refactor
Contact: mhallen@ttic.edu, brd+ismb17@cs.duke.edu
Supplementary information: Supplementary data are available at Bioinformatics online.

3DSIG KEYNOTE: Computational Protein Design: Judge the protein by the cover, story and taste
Date: Saturday, July 22
Time: 11:40 AM - 12:30 PM
Room: Forum
  • Ilan Samish, Amai Proteins, Israel

Computational protein design (CPD), a still-evolving field, includes computer-aided engineering of amino-acid sequences for the partial modification or full de novo design of proteins of interest. The designs are defined by a requested structure, function, or working environment. Next, the protein is designed to the requested target in an iterative and often hierarchical approach. No less important is the negative design aspect, in which the CPD is directed to avoid unwanted designs. Integrating these aspects in a case-study approach aims to present the plethora of approaches within the CPD field as well as to direct researchers to future challenges. These include advancing the field for the benefit of understanding protein structure and function and the relationships between them, as well as applying such know-how for the benefit of mankind as part of the biotechnological industry. Applied aspects range from new biological drugs, via healthier and tastier food products, to nanotechnology and environmentally friendly enzymes replacing toxic chemical reactions utilized in the current industry.



1. Samish I. (Ed., 2016) Computational Protein Design, Methods in Molecular Biology, Springer Protocols, Humana Press.

2. Samish I. MacDermaid CM. Perez-Aguilar JMP. Saven JG. (2011). Theoretical and Computational Protein Design. Annu Rev Phys Chem 62:129-149

3. Samish I. (2009). Search and Sampling in Structural Bioinformatics. In, Gu J. Bourne PE. (Eds.), Structural Bioinformatics, 2nd Ed. (pp. 207-236). Wiley.

Large-scale structure prediction by improved contact predictions and model quality assessment
Date: Saturday, July 22
Time: 2:00 PM - 2:20 PM
Room: Forum
  • Arne Elofsson, Stockholm University, Sweden
  • David Menendez Hurtado, Stockholm University, Sweden
  • Mirco Michel, Stockholm University, Sweden

Motivation: Accurate contact predictions can be
used for predicting the structure of proteins. Until recently these
methods were limited to very big protein families, decreasing their
utility. However, recent progress by combining direct
coupling analysis with machine learning methods has made it possible
to predict accurate contact maps for smaller
families. To what extent these predictions can be used to
produce accurate models of the families is not known.

Results: We present the PconsFold2 pipeline that
uses contact predictions from PconsC3, the CONFOLD folding
algorithm and model quality estimations to predict
the structure of a protein. We show that the model quality
estimation significantly increases the number of models that
reliably can be identified. Finally, we apply PconsFold2 to 6379
Pfam families of unknown structure and find that PconsFold2 can, with an estimated
90% specificity, predict the structure of up to 450 Pfam
families of unknown structure. Out of these 343 have not been
reported before.
Availability: Datasets as well as models of all the 450
Pfam families are available at
http://c3.pcons.net/. All programs used
here are freely available.
Contact: arne@bioinfo.se
Supplementary information: No supplementary data

Exploring the Sequence-based Prediction of Folding Initiation Sites in Proteins
Date: Saturday, July 22
Time: 2:20 PM - 2:40 PM
Room: Forum
  • Daniele Raimondi, ULB/VUB, Belgium
  • Gabriele Orlando, ULB/VUB, Belgium
  • Rita Pancsa, MRC Laboratory of Molecular Biology, UK
  • Taushif Khan, ULB/VUB, Belgium
  • Wim Vranken, ULB/VUB, Belgium

Abstract
The very early stages of protein folding, defined by intrinsic local interactions between amino acids close to each other in the protein sequence, are poorly understood. We developed a method to predict from the primary sequence of a protein where early folding is likely to take place, and show that these early folding predictions give insights into the folding process. On a proteome scale the predicted early folding residues tend to become the residues that interact the most in the folded structure, and are often the residues that display evolutionary covariation. These connections suggest that the initial behavior of the protein chain has a lasting effect on its subsequent states.

Introduction
Amino acids close to each other in the sequence that interact favorably are important in shaping the folding landscape, but the structure of a protein does not necessarily provide direct information about where the first local structural elements started to form. We created the Start2Fold database that identifies these residues from pulsed labelling and related Hydrogen Deuterium eXchange (HDX) experiments [1]. Based on an additional analysis in relation to protein backbone rigidity predictions [2], we developed EFoldMine, a predictor of early folding residues based on data from high-quality experimental NMR-based HDX studies.

Methods
From 30 Start2Fold sequences covering a total of 3398 residues [2], 482 residues were designated ‘early folding’: these are the first residues to be involved in sufficient local structure formation to protect their backbone HN from solvent. We used 5 features computed from the protein sequence: DynaMine backbone dynamics [3, 4], and a new set of predictions for side-chain dynamics and secondary structure propensity, derived from NMR chemical shift-based estimations of these properties. We used a window of flanking residues between i-2 and i+2, resulting in a 25-dimensional feature vector (5 features for each of the 5 residues in the window). Performance was evaluated in strict stratified cross-validation settings, using BLASTCLUST to stratify the 30 sequences according to a 25% sequence identity (SI) cutoff at 90% coverage.
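The windowing step can be illustrated with a short sketch: 5 per-residue features over residues i-2 to i+2 give a 25-dimensional vector per residue. The function name and the feature values are hypothetical, and zero-padding at the chain termini is an assumption, since the abstract does not say how the ends of the sequence are handled.

```python
def window_features(per_residue, half_window=2):
    """Concatenate each residue's features with those of its flanking residues.

    `per_residue` is a list of equal-length feature lists, one per residue.
    Positions beyond the chain ends are zero-padded (an assumption; the
    abstract does not say how termini are handled).
    """
    n_features = len(per_residue[0])
    padding = [0.0] * n_features
    vectors = []
    for i in range(len(per_residue)):
        vector = []
        for j in range(i - half_window, i + half_window + 1):
            inside = 0 <= j < len(per_residue)
            vector.extend(per_residue[j] if inside else padding)
        vectors.append(vector)
    return vectors


# Hypothetical 6-residue chain with 5 made-up feature values per residue:
# residue r (0-based) carries the value r + 1 in all five features.
chain = [[float(r + 1)] * 5 for r in range(6)]
features = window_features(chain)
```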

Results & Conclusions
The EFoldMine performance reaches an MCC of 35.4, and an AUC of 80.8. We investigated how the EFoldMine predictions relate to two protein pairs that have a very similar topology but different folding pathways. EFoldMine picks up where folding starts, with very different early folding profiles between the pairs of proteins (Figure 1). The predictions also relate very well to independent HDX-MS experiments. On a proteome scale, they consistently encompass many of the residues that form the most extensive contacts in the folded protein, and not only for the typical (hydrophobic) structure-forming residues. We also observe that residues with evolutionary covariance signals tend to be early folding residues. The combination of these observations suggests that in particular the interactions between early folding residues have to be conserved in order to maintain the protein fold in evolution.
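For readers unfamiliar with the Matthews correlation coefficient (MCC), the sketch below computes it from the four confusion-matrix counts. The counts in the example are made up and are not EFoldMine results. The reported MCC of 35.4 is presumably expressed as a percentage, which would correspond to 0.354 on the usual scale from -1 to 1; that reading is an assumption.

```python
import math


def matthews_corrcoef(tp, fp, tn, fn):
    """Matthews correlation coefficient from confusion-matrix counts.

    Returns 0.0 when any marginal total is zero, a common convention
    for the otherwise undefined case.
    """
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denom == 0:
        return 0.0
    return (tp * tn - fp * fn) / denom


# Made-up counts for illustration only.
perfect = matthews_corrcoef(tp=10, fp=0, tn=20, fn=0)
inverted = matthews_corrcoef(tp=0, fp=10, tn=0, fn=20)
```

MCC is a natural choice for this kind of data because the classes are imbalanced: only 482 of 3398 residues are labeled as early folding.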

EFoldMine identifies the amino acid residues in proteins that are inclined to form structural elements unaided at the very first stage of the folding process. The connection of the early folding predictions to both folding pathway data and the folded protein structure suggests that the initial statistical behaviour of the protein chain has a lasting effect on its subsequent states.


FIGURE 1. Myoglobin (a,b) and leghemoglobin (c,d) in relation to early folding scores.

References
1. Pancsa, R., Varadi, M., Tompa, P. & Vranken, W. F. Nucleic Acids Res. 44, D429–D434 (2016).
2. Pancsa, R., Raimondi, D., Cilia, E. & Vranken, W. F. Biophys J 110, 572–583 (2016).
3. Cilia, E., Pancsa, R., Tompa, P., Lenaerts, T. & Vranken, W. F. Nat Commun 4, 2741 (2013).
4. Cilia, E., Pancsa, R., Tompa, P., Lenaerts, T. & Vranken, W. F. Nucleic Acids Res. 42, W264–W270 (2014).

Density-based clustering in structural bioinformatics: application to beta turns and antibody CDRs
Date: Saturday, July 22
Time: 2:40 PM - 3:00 PM
Room: Forum
  • Maxim Shapovalov, Fox Chase Cancer Center, United States
  • Benjamin North, Fox Chase Cancer Center, United States
  • Simon Kelow, Fox Chase Cancer Center, United States
  • Roland Dunbrack, Fox Chase Cancer Center; Temple University, United States

Introduction
Clustering methods can be classified into a number of different categories: centroid-based clustering (e.g., k-means); connectivity-based clustering (e.g., hierarchical clustering); distribution-based clustering (e.g., Gaussian mixture models); and density-based clustering, which looks for local maxima in the data density separated by low-density regions. In this work, we explore the utility of density-based clustering methods in structural bioinformatics. An important advantage of density-based methods is that typically they identify outliers during the clustering process, and the outliers are not included in the output clusters. This may occur in protein structures when an element of structure (e.g., a protein loop) may be incorrectly modeled, or might be the result of human engineering rather than a natural process (e.g., antibody CDRs). It is of value to identify both the clusters of data and the noise, and both may be the subject of further study.

We have utilized “density-based spatial clustering of applications with noise” (DBSCAN) and its variants in two important applications. DBSCAN is a simple algorithm in which every data point is assigned to one of three categories, depending on the number of neighbors it has within a predefined distance, Eps. If a data point has at least a fixed number of points (defined by a parameter MinPts) within Eps, it is called a “core point.” Any point that is not a core point but is within Eps of a core point is a “border point.” All other points are called “noise points.” Core points are connected to each other (in the sense of a graph edge) if they are within Eps of each other. Each connected graph of core points becomes a cluster. Border points are assigned to the same cluster as their nearest core point. Noise points remain unclustered.
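The description of DBSCAN above maps directly onto a short implementation. The following is a minimal brute-force sketch using Euclidean distance on made-up 2D points. It assumes that a point counts itself among its neighbors when compared against MinPts, which the text leaves open. It is an illustration of the algorithm as described, not the code used in this work.

```python
import math


def dbscan(points, eps, min_pts):
    """Label each point with a cluster id (0, 1, ...) or -1 for noise."""
    n = len(points)
    neighbors = [
        [j for j in range(n) if math.dist(points[i], points[j]) <= eps]
        for i in range(n)
    ]
    # Core points have at least min_pts points within eps (counting themselves).
    is_core = [len(neighbors[i]) >= min_pts for i in range(n)]

    labels = [-1] * n
    cluster = 0
    for start in range(n):
        if not is_core[start] or labels[start] != -1:
            continue
        # Flood-fill one connected component of the core-point graph.
        labels[start] = cluster
        stack = [start]
        while stack:
            i = stack.pop()
            for j in neighbors[i]:
                if is_core[j] and labels[j] == -1:
                    labels[j] = cluster
                    stack.append(j)
        cluster += 1

    # Border points join the cluster of their nearest core point within eps.
    for i in range(n):
        if is_core[i]:
            continue
        cores = [j for j in neighbors[i] if is_core[j]]
        if cores:
            nearest = min(cores, key=lambda j: math.dist(points[i], points[j]))
            labels[i] = labels[nearest]
    return labels


# Made-up data: two tight squares, one border point, and one outlier.
points = [(0, 0), (0, 1), (1, 0), (1, 1), (2.2, 1),
          (10, 10), (10, 11), (11, 10), (11, 11), (5, 5)]
labels = dbscan(points, eps=1.5, min_pts=4)
```

In this example the point (2.2, 1) is a border point attached to the first cluster and (5, 5) is left as noise. For dihedral data, `math.dist` would be replaced by an angular metric.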

Results
β turns are typically defined as non-helical, 4-residue segments in which the Cα atom of residue 1 is ≤ 7 Å from the Cα atom of residue 4. Turns have been analyzed since the 1970s, and the extant β turn types comprise the following: I, I’, II, II’, VIa(1,2), VIb, and VIII. The Type VI turns have a cis-peptide bond between residues 2 and 3. Type IV turns are a catch-all class for turns that do not fit the standard types, and comprise up to a third of β turns. We compiled a set of 13,030 β turns from 1082 protein chains from structures with resolution ≤ 1.2 Å and less than 50% mutual sequence identity. We used a dihedral angle metric, which is a sum over the six dihedral angles connecting Cα(1) to Cα(4) of the function D_ij = 2(1 - cos(x_i - x_j)), where x_i and x_j are the values of dihedral x in turns i and j. DBSCAN was successful at removing 289 noise points (2.2%) and produced 10 clusters of β turns, 3 of which can be subclustered with a further round of DBSCAN with different parameters or with another clustering method. From these (preliminary) results, we have developed a new, simpler nomenclature for β turns based on the Ramachandran regions: A (for α), B (for β and polyproline II regions), L (for α-left), and E (for the lower right and upper right regions) and lower case for cis residues. In addition to the standard turns (Types AA, LL, BL, EA, Ba, Bb, AB respectively), we find small but significant populations of AL and LA turns (shown below) and two turn types with a cis peptide bond before residue 2: aA and aB.
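The dihedral angle metric can be written in a few lines. The sketch below assumes that angles are given in degrees and that each turn is represented as a list of its dihedral values in a fixed order; the function name and the example angles are made up for illustration.

```python
import math


def dihedral_distance(conf_a, conf_b):
    """Sum of 2 * (1 - cos(difference)) over matching dihedral angles (degrees)."""
    return sum(
        2.0 * (1.0 - math.cos(math.radians(a - b)))
        for a, b in zip(conf_a, conf_b)
    )


# Two made-up turns described by six dihedral angles each (degrees).
turn_a = [-60.0, 180.0, -60.0, -30.0, 180.0, -90.0]
turn_b = [-65.0, 180.0, -55.0, -35.0, 180.0, -85.0]
d_ab = dihedral_distance(turn_a, turn_b)
```

Because the cosine is periodic, angles of 179° and -181° are treated as identical, which a plain difference of angles would not do. Each dihedral contributes between 0 (same angle) and 4 (opposite angles).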

We have used DBSCAN and an L∞ metric (the maximum of the dihedral angle metric D over all dihedrals of two conformations) to extend our earlier clustering and nomenclature of antibody CDRs (North, Lehmann, Dunbrack, J. Mol. Biol. 406, 228-256, 2011). For most CDRs and loop lengths, about 10% of the data ends up as noise. An example is shown in the figure below for L3 length 8. The clusters are cleaner in terms of Ramachandran regions, and the sequence profiles are also more predictive in some cases. We have kept our previous nomenclature, which is widely used (L1-11-1, etc.) and created new cluster names and retired others where needed.
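The maximum-based variant of the metric can be sketched in the same way. This assumes angles in degrees and uses made-up loop conformations; the function name is hypothetical.

```python
import math


def linf_dihedral_distance(conf_a, conf_b):
    """Largest per-dihedral value of 2 * (1 - cos(difference)), angles in degrees."""
    return max(
        2.0 * (1.0 - math.cos(math.radians(a - b)))
        for a, b in zip(conf_a, conf_b)
    )


# Made-up loops that agree everywhere except at one dihedral, which is flipped by 180 degrees.
loop_a = [-60.0, -45.0, 120.0, -70.0]
loop_b = [-60.0, 135.0, 120.0, -70.0]
d_max = linf_dihedral_distance(loop_a, loop_b)
```

With a maximum instead of a sum, a single strongly mismatched dihedral is enough to separate two conformations, regardless of loop length. Here the flipped dihedral alone gives the largest possible distance of 4.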

Proteins from Peptides
Date: Saturday, July 22
Time: 3:00 PM - 3:20 PM
Room: Forum
  • Andrei Lupas, Max Planck Institute for Developmental Biology, Germany
  • Vikram Alva, Max Planck Institute for Developmental Biology, Germany
  • Marcus D. Hartmann, Max Planck Institute for Developmental Biology, Germany
  • Edgardo Sepulveda, Max Planck Institute for Developmental Biology, Germany
  • Jorg Martin, Max Planck Institute for Developmental Biology, Germany
  • Hongbo Zhu, Max Planck Institute for Developmental Biology, Germany

Introduction
For the most part, contemporary proteins can be traced back to a basic set of a few thousand domain prototypes, many of which were already established in the Last Universal Common Ancestor of life on Earth, around 3.5 billion years ago. The origin of these domain prototypes, however, remains poorly understood. We have proposed that they arose from an ancestral set of peptides, which acted as cofactors of RNA-mediated catalysis and replication [1]. Initially, these peptides were entirely dependent on the RNA scaffold for their structure, but as their complexity increased, they became able to form structures by excluding water through hydrophobic contacts, making them independent of the RNA scaffold. Their ability to fold was thus an emergent property of peptide-RNA coevolution.

The ribosome is the main survivor of this primordial RNA world and offers an excellent model system for retracing the steps that led to the folded proteins of today, due to its very slow rate of change [2]. Close to the peptidyl transferase center, which is the oldest part of the ribosome, proteins are extended and largely devoid of secondary structure; further from the center, their secondary structure content increases and supersecondary topologies become common, although the proteins still largely lack a hydrophobic core; at the ribosomal periphery, supersecondary structures coalesce around hydrophobic cores, forming folds that resemble those seen in proteins of the cytosol. Collectively, ribosomal proteins chart a path of progressive emancipation from the RNA scaffold, offering a window onto the time when proteins were acquiring the ability to fold.

Results & Conclusions
We have retraced this emancipation from the RNA scaffold computationally and experimentally for a cytosolic protein fold, the tetratricopeptide repeat (TPR). By amplifying an αα-hairpin from a ribosomal protein, RPS20, which is unstructured in the absence of the cognate ribosomal RNA, we explored whether an intrinsically disordered peptide could form a folded protein through an increase in complexity afforded by repetition. Simple repetition was not sufficient in our case, but the repeat protein was so close to a folded structure that only two point mutations per repeat were necessary to allow it to fold reliably. The mutations needed for this transition did not appear to affect negatively the interaction with the RNA scaffold and were neutral for survival and growth in the parent organism, raising the possibility that they could have been among the variants sampled multiply in the course of evolution. TPRs could thus have plausibly arisen by amplification from an ancestral, RNA-dependent helical hairpin, as proposed by our theory.

FIGURE 1. (A) Scenario for the divergent evolution of ribosomal protein RPS20 and the cytosolic TPR fold from an ancestral, ribosome-associated αα-hairpin. (B) The crystal structure of the RPS20-derived repeat. The three chains in the asymmetric unit are colored green, blue and yellow, respectively. Chains A and B form a dimer. (C) Superposition of the RPS20-derived repeat (green) and the TPR protein CTPR3 (PDB: 1na0, chain A, gray).

References
1. Alva, V., Söding, J. & Lupas, A. N. Elife 4, e09410 (2015).
2. Lupas, A. N. & Alva, V. J Struct Biol, in press, DOI: 10.1016/j.jsb.2017.04.007 (2017).
3. Zhu, H., Sepulveda, E., Hartmann, M. D., Kogenaru, M., Ursinus, A., Sulz, E., Albrecht, R., Coles, M., Martin, J. & Lupas, A. N. Elife 5, e16761 (2016).

Improving fragment assembly protein structure prediction
Date: Saturday, July 22
Time: 3:20 PM - 3:40 PM
Room: Forum
  • Charlotte Deane, University of Oxford, UK
  • Saulo de Oliveira, University of Oxford, UK

Presentation Overview: Show

Abstract
Protein structures can elucidate function, explain disease mechanisms and inform drug design. However, experimental structure determination is costly and technically difficult, whereas amino acid sequences are easily available and far outnumber solved structures. Current de novo protein structure prediction methods are heuristics limited by the enormous search space, with successful prediction largely restricted to small, single-domain proteins.
The three key components of de novo fragment-assembly methods for protein structure prediction are the fragment library, the “energy” function and the search method. In this talk I will give an overview of my group's work on improving each of these components. First, I will describe the development of Flib, a novel fragment library that uses predicted secondary structure to determine the library generation strategy [1]. Second, I will compare different co-evolution contact predictors in terms of their ability to improve protein structure prediction [2]. Finally, I will demonstrate how sequential prediction approaches using SAINT2 can improve both search heuristics and final model quality.

FIGURE 1.
The precision of five fragment library generators showing the differences in precision achieved for different secondary structure types.

References
1. Saulo Henrique Pires de Oliveira, Jiye Shi, Charlotte M Deane, Building a better fragment library for de novo protein structure prediction, Plos One, 2015, 10(4), e0123998
2. Saulo Henrique Pires de Oliveira, Jiye Shi, Charlotte M Deane, Comparing co-evolution methods and their application to template-free protein structure prediction, Bioinformatics, 2017; 33 (3): 373-381. doi: 10.1093/bioinformatics/btw618

MESHI-score a method for estimation of protein model accuracy
Date: Saturday, July 22
Time: 3:40 PM - 4:00 PM
Room: Forum
  • Chen Keasar, Ben-Gurion University, Israel
  • Tomer Sidi, Ben-Gurion University, Israel

Presentation Overview: Show

Introduction
The first steps of protein structure prediction (PSP) methods generate many alternative 3D models of the protein (aka decoys), and mediocre decoys typically outnumber the good ones. Thus the next step, an estimation of model accuracies (EMA, aka QA), has considerable impact on the overall PSP performance. EMA scores are also important for PSP users, as estimates of models' reliability.

Over the last two decades, we and others have developed quite a few energy terms and other structural features that provide different perspectives on the complex structures of proteins. Each feature assesses some aspect of the decoys’ quality, and EMA is the art of combining them into a single coherent score. The complexity of protein structures suggests that many diverse features combined in subtle ways may be needed to create a good EMA. Yet most EMA methods make do with relatively few, carefully selected, features to avoid the risk of overfitting. We take the opposite approach, ultimately aiming to make use of any structural feature we deem useful, and to combine these features in complex ways. To this end we developed an EMA platform called MESHI-score1. We continuously augment it with more features and improve their integration. Here we present the unique approach, its performance in CASP12, and more recent results.

Methods
Decoy preprocessing: Before assessment, MESHI-score reduces noise by side-chain repacking2 and energy minimization.

Ensemble learning and feature selection: MESHI-score pursues an ensemble learning approach that combines many (usually 1000) independently trained predictors. Each predictor is assigned a somewhat different objective function, and is trained by stochastic optimization, which includes feature selection. Overfitting at the predictor level is prevented by a constraint on the number of features with non-zero weight. The final score is a weighted median of the predictors' scores. Thus, each of the many features "has a chance" to contribute and yet, since the integration step does not include any adjustable parameters, overfitting is avoided.
Additionally, in a variant of MESHI-score called MESHI-score-con, each decoy's score is biased towards the weighted average of the scores of its close neighbors (GDT_TS >= 0.95).
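The integration step described above (a weighted median over the scores of many independently trained predictors) can be illustrated with a minimal sketch. The function name, the example scores and the weights are hypothetical; how MESHI-score actually assigns the weights is not stated here.

```python
def weighted_median(scores, weights):
    """Return the smallest score at which the cumulative weight
    reaches half of the total weight (a weighted median)."""
    if not scores or len(scores) != len(weights):
        raise ValueError("scores and weights must be non-empty and of equal length")
    pairs = sorted(zip(scores, weights))  # sort predictors by their score
    half = sum(weights) / 2.0
    cumulative = 0.0
    for score, weight in pairs:
        cumulative += weight
        if cumulative >= half:
            return score
    return pairs[-1][0]


# Hypothetical scores from three predictors for a single decoy.
predictor_scores = [0.2, 0.9, 0.5]
print(weighted_median(predictor_scores, [1.0, 1.0, 1.0]))
print(weighted_median(predictor_scores, [1.0, 5.0, 1.0]))
```

Because a median has no trainable parameters beyond the given weights, a few badly overfitted predictors cannot pull the final score far, which is consistent with the rationale given in the text.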

Features: The feature set of MESHI-score currently includes 106 diverse features, including pairwise potentials, torsion angle and hydrogen bond terms, contact and radius of gyration terms, compatibility with sequence-based predictions, and meta-features that evaluate the distribution of other features.

Results
In CASP12, MESHI-score and MESHI-score-con participated as EMA servers. They ranked among the top four or five methods (and were the best servers) according to two measures of performance (Fig. 1).

In a different CASP track, MESHI-score was also used by five structure prediction groups and three of them attained top ranks (Fig. 2).

The development of MESHI-score is a continuous process of feature incorporation and improved training. Table 1 depicts the performance improvement of MESHI-score over the last two years. It also compares MESHI-score with three alternative scores.

Summary
MESHI-score is a modular and extendable EMA method with state-of-the-art performance.

References
1. Mirzaei, Sidi, Keasar, and Crivelli IEEE/ACM transactions on computational biology and bioinformatics, in press 2016
2. Krivov,G.G. et al. Proteins, 77, 778–795 2009
3. Zhou and Skolnick Biophys J. 101:2043 (2011)
4. Benkert, Künzli, and Schwede NAR 37 (suppl_2): W510-W514 (2009)
5. Ray, Lindahl, and Wallner BMC bioinformatics 13:224 (2012)

FIGURE 1. Two screenshots from the CASP site, each depicting the top-ranking groups by a different measure.

FIGURE 2. Top structure prediction groups from the CASP12 site. Groups that use MESHI-score are marked by arrows.
TABLE 1. Current MESHI-score performance.

3DSIG KEYNOTE: Protein bioinformatics of low resolution structural data
Date: Saturday, July 22
Time: 4:30 PM - 5:20 PM
Room: Forum
  • Daisuke Kihara, Purdue University, United States

Presentation Overview: Show

For many years, protein structure bioinformatics has used protein structures in the PDB to elucidate the structures and functions of proteins and to develop computational methods for their analysis. Although the PDB remains the main source of biomolecular structure data, game-changing technological developments in electron microscopy (EM) in recent years have enabled macromolecular structures to be solved at near-atomic resolution. An increasing number of structures determined by EM at various resolutions, from about 1.5 to over 20 Angstroms, are accumulating in EMDB. EM data pose new challenges and exciting opportunities for the protein bioinformatics community. We will start with an overview of the computational methods needed for interpreting EM data of macromolecular structures. Then we will discuss our recent analysis of protein structures determined by EM, and present methods we have developed, including EM-SURFER, a server for rapid EMDB search, and structure modeling methods for EM maps.

EncoMPASS: An Encyclopedia of Membrane Proteins Analyzed by Structure and Symmetry
Date: Saturday, July 22
Time: 5:20 PM - 5:40 PM
Room: Forum
  • Edoardo Sarti, National Institutes of Health, United States
  • Antoniya Aleksandrova, National Institutes of Health, United States
  • Lucy Forrest, National Institutes of Health, United States

Presentation Overview: Show

Abstract
EncoMPASS (Encyclopedia of Membrane Proteins Analyzed by Structure and Symmetry) is an online, completely automated database for relating integral proteins of known structure from the points of view of their sequence, structure, and symmetries. It can be used for organizing resources for protein structure determination, benchmarking sequence alignment tools, and inferring membrane protein functionalities via comparative studies.

Introduction
Integral membrane proteins constitute 20-30% of the genome and it has been estimated that they are targeted by around half of all FDA-approved drugs as well as of physiologically-relevant small ligands, making them extremely relevant in both cell biology and pharmacology. They are also associated with distinct structural features, such as the predisposition for internal and quaternary symmetries, that reflect the geometric constraints of the lipid bilayer. Several databases dedicated to structures of membrane proteins have been developed, but none of them classify the proteins or assign relationships between the proteins that they enumerate. Moreover, symmetry is never taken into consideration.
To address these issues, we present here the novel Encyclopedia of Membrane Proteins Analyzed by Structure and Symmetry (EncoMPASS), a fully-automated database through which we aim to introduce a more flexible representation of the structural relationships between experimentally-determined membrane protein structures.

Methods
In order to ensure the quality of our structural analysis, we select only proteins whose structure has been experimentally determined through X-ray crystallography with resolution <3.5 Å. Figure 1 illustrates the procedure for generating the data for one database entry: the protein complex is first inserted in the membrane through the procedure used by the OPM database1, then it is divided into single-chain subunits. Each chain is analyzed separately, and its sequence and structure are aligned with all other chains having the same topology. Both the complex and each individual chain then undergo symmetry recognition through two different programs, and the results thereof are combined in a consensus analysis.



FIGURE 1. EncoMPASS follows the depicted automatic protocol in order to perform all analyses for a given membrane protein complex and each of its single-chain subunits. Numerical and graphic data obtained from each of these analyses are available on the webpage describing that complex or chain.

Structure and sequence similarity are investigated using sequence identity, TM-score2, RMSD, and through multiple graphical outputs that illustrate the distribution of related structures in the space of sequences and conformations, also at the residue level.
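As a rough illustration of two of the similarity measures named above, the sketch below computes RMSD and a TM-score-style value for two coordinate sets that are assumed to be already aligned residue by residue and already superposed. The real TM-score additionally searches for the superposition that maximizes the score, which is omitted here. The function names are hypothetical, and the 0.5 Å floor on the distance scale d0 for short chains is an assumption of this sketch.

```python
import math


def rmsd(coords_a, coords_b):
    """Root-mean-square deviation between paired, pre-superposed coordinates."""
    if not coords_a or len(coords_a) != len(coords_b):
        raise ValueError("coordinate lists must be non-empty and of equal length")
    total = sum(math.dist(a, b) ** 2 for a, b in zip(coords_a, coords_b))
    return math.sqrt(total / len(coords_a))


def tm_score_fixed_superposition(coords_a, coords_b, target_length):
    """TM-score-style sum for one fixed superposition (no optimization).

    Uses the length-dependent scale d0 = 1.24 * (L - 15)^(1/3) - 1.8,
    with an assumed lower bound of 0.5 for short chains.
    """
    if target_length > 15:
        d0 = max(0.5, 1.24 * (target_length - 15) ** (1.0 / 3.0) - 1.8)
    else:
        d0 = 0.5
    total = sum(
        1.0 / (1.0 + (math.dist(a, b) / d0) ** 2)
        for a, b in zip(coords_a, coords_b)
    )
    return total / target_length


# Toy example: the second chain is the first one shifted by 1 Angstrom along z.
chain_a = [(float(i), 0.0, 0.0) for i in range(20)]
chain_b = [(x, y, z + 1.0) for x, y, z in chain_a]
print(rmsd(chain_a, chain_b))
print(tm_score_fixed_superposition(chain_a, chain_b, target_length=20))
```

Unlike RMSD, the TM-score-style sum is normalized by chain length and down-weights large deviations, so it is less dominated by a few badly placed residues.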

Results & Conclusions
EncoMPASS provides a structure- and symmetry-oriented analysis of over 2000 structures, covering more than 90% of all membrane protein coordinate files deposited in the PDB. Data relating to sequence, structure and symmetry analyses is reported for each entry. A set of graphical tools helps the user to investigate how structures relate to each other in terms of sequence and structure similarity, and how frequently each residue of each single-chain subunit superposes in comparisons with other topologically similar chains.

A complete analysis of quaternary and internal symmetries is also available, and each symmetric region is illustrated and associated with the appropriate symmetry axis and symmetry transformation.

The database is updated monthly and its underlying code is freely available.

Thanks to these characteristics, EncoMPASS can be used for organizing resources for protein structure determination, benchmarking sequence alignment tools, and inferring membrane protein functionalities via comparative studies.

References
1. Lomize, M. et al. (2006), Bioinformatics, 22, 623–625.
2. Zhang,Y. and Skolnick,J. (2004), Proteins Struct. Funct. Genet., 57, 702–710.

Folding membrane proteins by deep transfer learning
Date: Saturday, July 22
Time: 5:40 PM - 6:00 PM
Room: Forum
  • Jinbo Xu, Toyota Technological Institute, United States

Presentation Overview: Show

Abstract
Predicting membrane protein (MP) structures is challenging, partly due to the lack of sufficient solved structures for homology modeling. We describe a low-cost, high-throughput deep transfer learning method that first predicts MP contacts by learning from non-membrane proteins (non-MPs) and then predicts 3D structure models using the predicted contacts. Tested on 510 non-redundant MPs, our method predicts correct folds for 218 of them, of which 57 and 108 have 3D models with RMSD less than 4Å and 5Å, respectively. A rigorous blind test in CAMEO shows that our method predicted correct folds for all 4 test MPs and high-resolution 3D models (RMSD ~ 2Å) for two. We estimate that our method could predict correct folds for 1345 to 1871 of 2215 reviewed human multi-pass MPs, including a few hundred new folds.

Methods
We predict protein contacts by concatenating two deep residual neural networks (see [1, 2] for details), which performed the best in CASP12 (for soluble protein contact prediction). We extend this method to MP contact prediction. Compared to soluble protein contact prediction, one challenge of applying deep learning to MP contact prediction is the lack of sufficient training data, since there are only 510 non-redundant MPs with solved structures in the PDB. To overcome this, we train our deep learning model using thousands of non-MPs with solved structures. It turns out that the resultant deep model works well for MP contact prediction. Our further study indicates that by using a mix of non-MPs and MPs with solved structures, we can train a deep model with even better prediction accuracy, greatly outperforming existing methods. Our method is low-cost and very high-throughput. It takes from 30 minutes to a few hours on a Linux workstation to predict 3D models for a test MP, which is much more efficient than the popular fragment assembly methods.

Results
Contact prediction accuracy. Tested on the 510 non-redundant MPs, when the top L/2 predicted long-range contacts are evaluated, our method has an accuracy of 0.58, much better than pure co-evolution methods such as Evfold (0.23), PSICOV (0.25) and CCMpred (0.28), and a supervised learning method, MetaPSICOV (0.37). This result suggests that non-MPs and MPs share some common properties for contact prediction that can be learned by our deep learning model, and that the set of non-MPs contains more information for contact prediction than the set of MPs.
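The evaluation above (accuracy of the top L/2 predicted long-range contacts, where L is the sequence length) can be sketched as follows. The function name and the data layout are hypothetical, and the sequence-separation threshold of 24 residues for "long-range" is a common convention assumed here, not a value given in the abstract.

```python
def top_long_range_precision(predicted, native_contacts, seq_length,
                             min_separation=24, fraction=0.5):
    """Fraction of the top (fraction * L) long-range predictions that are native.

    predicted: dict mapping residue index pairs (i, j) to a predicted probability.
    native_contacts: set of (i, j) pairs with i < j taken from the solved structure.
    """
    # Keep only pairs far enough apart in sequence (assumed definition of long-range).
    long_range = [
        (prob, pair) for pair, prob in predicted.items()
        if abs(pair[0] - pair[1]) >= min_separation
    ]
    long_range.sort(reverse=True)  # highest predicted probability first
    top_k = max(1, int(seq_length * fraction))
    selected = long_range[:top_k]
    if not selected:
        return 0.0
    hits = sum(
        1 for _, (i, j) in selected
        if (min(i, j), max(i, j)) in native_contacts
    )
    return hits / len(selected)


# Toy example with hypothetical predictions for a 60-residue protein.
predicted = {(0, 30): 0.9, (1, 40): 0.8, (2, 50): 0.1, (5, 8): 0.99}
native = {(0, 30), (2, 50)}
print(top_long_range_precision(predicted, native, seq_length=60))
```

The short-range pair (5, 8) is ignored despite its high probability, because only long-range pairs enter the ranking.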
Folding accuracy. Experimental results show that our predicted contacts can help fold about 218 and 288 of the 510 MPs with solved structures when TMscore≥0.6 and TMscore≥0.5 are used as the correctness criterion, respectively. Meanwhile, 57 and 169 of the 510 MPs have predicted 3D models with RMSD<4.0Å and <6.5Å, respectively. In contrast, pure co-evolution methods such as CCMpred (and PSICOV, Evfold) and homology modeling can fold only 10 and 18 of the MPs with RMSD<4.0Å, respectively, and 49 and 89 of them with RMSD<6.5Å.
Blind test in CAMEO. We have implemented our algorithm as a fully-automated web server and blindly tested it through the weekly live benchmark CAMEO (http://www.cameo3d.org/) operated by the Schwede group. CAMEO can be interpreted as a fully-automated CASP, with >30 participating servers, including some well-known ones such as Baker’s Robetta, RaptorX, Swiss-Model and HHpred. Since September 2016, there have been only four test MPs among all the CAMEO hard targets. Our web server (CAMEO ID: Server60) predicted correct folds for all of them, much better than all the other CAMEO-participating servers. In particular, for 5h36A and 5h35E (>210 AAs) we predicted 3D models with RMSD ~ 2Å.

References
1. S. Wang, S. Sun, Z. Li, R. Zhang, and J. Xu. Accurate De Novo Prediction of Protein Contact Map by Ultra-Deep Learning Model. PLoS Computational Biology, 2017.
2. Z. Li, S. Wang, Y. Yu and J. Xu. Predicting membrane protein contacts from non-membrane proteins by deep transfer learning. https://arxiv.org/abs/1704.07207

Deep learning based subdivision approach for large scale macromolecules structure recovery from electron cryo tomograms
Date: Sunday, July 23
Time: 10:00 AM - 10:20 AM
Room: Forum
  • Min Xu, Carnegie Mellon University, United States
  • Xiaoqi Chai, Carnegie Mellon University, United States
  • Hariank Muthakana, Carnegie Mellon University, United States
  • Xiaodan Liang, Carnegie Mellon University, United States
  • Ge Yang, Carnegie Mellon University, United States
  • Tzviya Zeev-Ben-Mordehai, University of Oxford, United Kingdom
  • Eric Xing, Carnegie Mellon University, United States

Presentation Overview: Show

Motivation: Cellular Electron CryoTomography (CECT) enables 3D visualization of cellular organization at near-native state and in sub-molecular resolution, making it a powerful tool for analyzing structures of macromolecular complexes and their spatial organizations inside single cells. However, the high degree of structural complexity, together with practical imaging limitations, makes the systematic de novo discovery of structures within cells challenging. It would likely require averaging and classifying millions of subtomograms potentially containing hundreds of highly heterogeneous structural classes. Although it is no longer difficult to acquire CECT data containing such numbers of subtomograms due to advances in data acquisition automation, existing computational approaches have very limited scalability or discrimination ability, making them incapable of processing such amounts of data.

Results: To complement existing approaches, in this paper we propose a new approach for subdividing subtomograms into smaller but relatively homogeneous subsets. The structures in these subsets can then be separately recovered using existing computation intensive methods. Our approach is based on supervised structural feature extraction using deep learning, in combination with unsupervised clustering and reference-free classification. Our experiments show that, compared to existing unsupervised rotation invariant feature and pose-normalization based approaches, our new approach achieves significant improvements in both discrimination ability and scalability. More importantly, our new approach is able to discover new structural classes and recover structures that do not exist in training data.

Availability: Source code freely available at http://www.cs.cmu.edu/~mxu1/software

Contact: mxu1@cs.cmu.edu

Supplementary information: Supplementary data are available at Bioinformatics online.

Conservation of coevolving protein interfaces bridges prokaryote–eukaryote homologies in the twilight zone
Date: Sunday, July 23
Time: 10:20 AM - 10:40 AM
Room: Forum
  • Juan Rodriguez-Rivas, Spanish National Cancer Research Centre, Spain
  • Simone Marsili, Spanish National Cancer Research Centre, Spain
  • David Juan, Spanish National Cancer Research Centre, Spain
  • Alfonso Valencia, Spanish National Cancer Research Centre, Spain

Presentation Overview: Show

Abstract
Protein–protein interactions are fundamental for the proper functioning of the cell. As a result, protein interaction surfaces are subject to strong evolutionary constraints. Recent developments have shown that residue coevolution provides accurate predictions of heterodimeric protein interfaces from sequence information. So far these approaches have been limited to the analysis of families of prokaryotic complexes for which large multiple sequence alignments of homologous sequences can be compiled. We explore the hypothesis that coevolution points to structurally conserved contacts at protein–protein interfaces, which can be reliably projected to homologous complexes with distantly related sequences. We introduce a domain-centered protocol to study the interplay between residue coevolution and structural conservation of protein–protein interfaces. We show that sequence-based coevolutionary analysis systematically identifies residue contacts at prokaryotic interfaces that are structurally conserved at the interface of their eukaryotic counterparts. In turn, this allows the prediction of conserved contacts at eukaryotic protein–protein interfaces with high confidence using solely mutational patterns extracted from prokaryotic genomes. Even in the context of high divergence in sequence (the twilight zone), where standard homology modeling of protein complexes is unreliable, our approach provides sequence-based accurate information about specific details of protein interactions at the residue level. Selected examples of the application of prokaryotic coevolutionary analysis to the prediction of eukaryotic interfaces further illustrate the potential of this approach.

References
1. Rodriguez-Rivas J, Marsili S, Juan D, Valencia A (2016) Conservation of coevolving protein interfaces bridges prokaryote–eukaryote homologies in the twilight zone. Proc Natl Acad Sci 113(52):15018–15023.

Automated evaluation of quaternary structures from protein crystal structures
Date: Sunday, July 23
Time: 10:40 AM - 11:00 AM
Room: Forum
  • Jose Duarte, University of California, San Diego, United States
  • Spencer Bliven, Paul Scherrer Institute; National Institutes of Health, United States
  • Aleix Lafita, Paul Scherrer Institute, Switzerland
  • Guido Capitani, Paul Scherrer Institute, Switzerland
  • Stephen K. Burley, University of California, San Diego; Rutgers University, United States

Presentation Overview: Show

Introduction
Crystallography is the most powerful technique for generating atomic-level structures of proteins and other biological macromolecules. However, it does not always yield definitive insights into their quaternary structures. To provide better tools for determining the most likely quaternary structure of proteins, we have developed the new EPPIC 3 method. It uses evolutionary considerations as the ultimate arbiters of the biological relevance of interfaces and assemblies, thereby offering an approach complementary to other available methods that rely on thermodynamic considerations 2.

Results & Conclusions
EPPIC 3 extends our previous Evolutionary Protein-Protein Interface Classifier (EPPIC)1, by going beyond classifying pairwise interfaces. It identifies all possible topologically valid assemblies present in a protein crystal and provides predictions as to likely quaternary structures.

Pairwise interface classifications are based on two evolutionary scores and a single geometric score. These descriptors were trained against a large dataset of known biologically relevant and crystal interfaces to fit a logistic regression classifier that provides probabilistic scoring of interfaces together with confidence assignment.
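A logistic regression over three descriptors, as described above, turns a weighted sum of the scores into a probability. The sketch below shows only the general form. The weights, the intercept, the feature values, the "bio"/"xtal" labels and the definition of confidence as max(p, 1 - p) are all hypothetical placeholders, not the fitted EPPIC model.

```python
import math


def interface_probability(features, weights, intercept):
    """Logistic model: probability that an interface is biologically relevant."""
    z = intercept + sum(w * x for w, x in zip(weights, features))
    return 1.0 / (1.0 + math.exp(-z))


def classify(probability, threshold=0.5):
    """Return a label and a simple confidence value (assumed: max(p, 1 - p))."""
    label = "bio" if probability >= threshold else "xtal"
    return label, max(probability, 1.0 - probability)


# Hypothetical descriptor values: two evolutionary scores and one geometric score.
features = [1.2, 0.8, 0.5]
weights = [1.0, 1.0, 2.0]   # placeholder weights, not trained values
intercept = -2.0            # placeholder intercept
p = interface_probability(features, weights, intercept)
print(classify(p))
```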

Assembly enumeration is achieved by representing the crystal lattice as a periodic graph. Finding valid assemblies is then reduced to the problem of finding subgraphs that comply with a set of rules, which guarantee closed assemblies (point group symmetries) and isomorphism in the assembly composition and connectivity throughout the crystal. Finally, the assemblies are scored based on the individual scores of their constituent interfaces, yielding a single probability of an assembly being the biological one, together with a confidence estimate. The confidence values are valuable not only for users but also for downstream analyses (e.g. docking), where only high-confidence predictions can be selected.
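To make the graph view concrete, the following is a heavily simplified sketch on a small finite patch of a lattice graph. Nodes are chain copies, edges carry an interface type, and a candidate assembly is a connected component of the subgraph formed by the "engaged" interface types. It ignores lattice periodicity and does not test point group symmetry; the equal-size check is only a necessary condition standing in for the isomorphism rule. All names and the toy lattice are hypothetical.

```python
from collections import defaultdict


def candidate_assemblies(edges, engaged_types):
    """Connected components of the graph restricted to engaged interface types.

    edges: list of (node_a, node_b, interface_type) tuples.
    engaged_types: set of interface types considered part of the assembly.
    """
    adjacency = defaultdict(set)
    nodes = set()
    for a, b, interface_type in edges:
        nodes.update((a, b))
        if interface_type in engaged_types:
            adjacency[a].add(b)
            adjacency[b].add(a)
    seen = set()
    components = []
    for start in sorted(nodes):
        if start in seen:
            continue
        component = set()
        stack = [start]
        while stack:  # depth-first traversal over engaged edges only
            node = stack.pop()
            if node in component:
                continue
            component.add(node)
            stack.extend(adjacency[node] - component)
        seen |= component
        components.append(component)
    return components


def all_same_size(components):
    """Necessary (not sufficient) check that all assemblies look alike."""
    return len({len(component) for component in components}) == 1


# Toy patch: interface type 1 pairs A with B, type 2 links neighbouring pairs.
lattice = [("A1", "B1", 1), ("A2", "B2", 1), ("B1", "A2", 2)]
dimers = candidate_assemblies(lattice, {1})
print(dimers, all_same_size(dimers))
```

Engaging only interface type 1 yields two dimers in this toy patch, whereas engaging both types merges them into a single four-chain component.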

The software is accessible through an easy to use web graphical interface at http://www.eppic-web.org. The graphical interface is designed to aid the crystallographer in interpreting putative quaternary structures using 2D and 3D graphical tools that operate within the browser (Figure 1).

The server offers additional useful tools to the structural bioinformatics community, such as precalculated sequence alignments for every PDB structure and visualization of conservation on the protein surface with the browser-embedded NGL viewer. All the data are provided via XML downloads to the community, enabling further analyses.


FIGURE 1. Novel visualization tools are available as part of the web user interface, allowing the visualization of the crystal lattice via the lattice graph representation. This is done both in 2D thanks to the vis.js library and in 3D thanks to the fast NGL molecular visualization package3. The colors and labels show the different protein entities and interface types present in the lattice, allowing for a one-glance understanding of the assembly symmetry and its connectivity.

References
1. Duarte JM, Srebniak A, Schärer MA & Capitani G, "Protein interface classification by evolutionary analysis", BMC Bioinformatics 13, 334 (2012)
2. Krissinel E & Henrick, K, "Inference of macromolecular assemblies from crystalline state." J Mol Biol 372, 774-797 (2007)
3. Rose AS and Hildebrand PW. “NGL Viewer: a web application for molecular visualization”, Nucleic Acids Research (2015)

Deep Learning in text mining for protein docking using full-text articles
Date: Sunday, July 23
Time: 11:00 AM - 11:20 AM
Room: Forum
  • Varsha Badal, The University of Kansas, United States
  • Petras J. Kundrotas, The University of Kansas, United States
  • Ilya A. Vakser, The University of Kansas, United States

Presentation Overview: Show

Residues extracted from PubMed abstracts by text mining techniques can be used as constraints in protein-protein docking (Badal, et al., 2015). However, the pool of mined residues contains many false positives (residues not relevant to protein binding), which can be partially removed by natural language processing algorithms (Badal et al, 2017, submitted). Deep learning methods can potentially reduce these false positives further. However, abstracts provide more formal and strictly crafted text, which may lack the variety and richness needed to train deep learning models. On the other hand, full-text articles, while providing a richer source of information, are freely available only for a fraction of the PubMed abstracts, as PMC open access. We investigated whether deep learning models trained on the limited-availability full texts can be applied to the filtering of residues in PubMed abstracts. For this purpose, we used a deep recursive neural network, which composes word vectors (Irsoy and Cardie, 2014). Word vectors of the residue-containing sentences from the PMC full-text articles were generated by word2vec (Mikolov, et al., 2013). We propose to label words and trees with sentiments from 0 to 4 (0, 1 and 2 labeling negative samples, i.e. describing non-interface residues, and 3 and 4 denoting positive samples, relevant to the interface residues) to train the model and, subsequently, to classify sentences containing residues. The approach was tested on the non-redundant set of protein complexes from the DOCKGROUND resource (http://dockground.compbio.ku.edu). The results showed that the model is capable of distinguishing the abstract sentences containing interface residues from those containing non-interface residues. We further investigated the local sentiment (surrounding words/phrase) using a window of words of various lengths around the residue in a sentence.
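Two small pieces of the procedure above lend themselves to a sketch: collapsing the five sentiment labels into negative (0 to 2) and positive (3 to 4) classes, and extracting a window of words around a residue mention. The function names, the example sentence and the residue "Arg45" are hypothetical, and whitespace tokenization is an assumption made for brevity.

```python
def is_interface_label(sentiment):
    """Map a sentiment label 0..4 to True (3, 4: interface) or False (0, 1, 2)."""
    if sentiment not in range(5):
        raise ValueError("sentiment must be an integer from 0 to 4")
    return sentiment >= 3


def context_window(tokens, residue_index, half_width):
    """Return the residue token with up to half_width tokens on each side."""
    start = max(0, residue_index - half_width)
    end = min(len(tokens), residue_index + half_width + 1)
    return tokens[start:end]


# Hypothetical sentence mentioning a residue.
sentence = "Mutation of Arg45 abolished binding to the partner protein"
tokens = sentence.split()
residue_position = tokens.index("Arg45")
for half_width in (1, 2, 3):
    print(half_width, context_window(tokens, residue_position, half_width))
```

Varying the half-width corresponds to the "window of words of various lengths" mentioned in the text; the window is truncated at the sentence boundaries.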

References
Badal, V.D., Kundrotas, P.J. and Vakser, I.A. Text mining for protein docking. PLoS Comp. Biol. 2015;11:e1004630.
Irsoy, O. and Cardie, C. Deep recursive neural networks for compositionality in language. In, Advances in Neural Information Processing Systems. 2014. p. 2096-2104.
Mikolov, T., et al. Efficient estimation of word representations in vector space. arXiv preprint arXiv:1301.3781 2013.

SnapDock - Template Based Docking by Geometric Hashing
Date: Sunday, July 23
Time: 11:20 AM - 11:40 AM
Room: Forum
  • Michael Estrin, Tel Aviv University, Israel
  • Haim J. Wolfson, Tel Aviv University, Israel

Presentation Overview: Show

A highly efficient template-based protein-protein docking algorithm, nicknamed SnapDock, is presented. It employs a Geometric Hashing-based structural alignment scheme to align the target proteins to the interfaces of non-redundant protein-protein interface libraries. Docking of a pair of proteins using the 22,600-interface PIFACE library takes less than 2 minutes on average. A flexible version of the algorithm, allowing hinge motion in one of the proteins, is presented as well. To evaluate the performance of the algorithm, a blind re-modelling of 3,547 PDB complexes that were uploaded after the PIFACE publication was performed, with a success ratio of about 35%. Interestingly, a similar experiment with the template-free PatchDock docking algorithm yielded a success rate of about 23%, with roughly half of the solutions different from those of SnapDock. Thus the combination of the two methods gave a 42% success ratio.
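The idea behind geometric hashing can be shown with a toy 2D version; this is not SnapDock itself, which works on 3D protein interfaces. In a preprocessing step, every ordered pair of model points defines a reference frame, and the frame coordinates of the remaining points are quantized and stored in a hash table. At recognition time, a frame chosen in the scene votes for the model frames stored in the same bins. The bin size, the function names and the point sets are assumptions of this sketch.

```python
from collections import defaultdict
from itertools import permutations

BIN_SIZE = 0.25  # assumed quantization step for frame coordinates


def frame_coords(origin, axis_end, point):
    """Coordinates of point in the frame defined by origin -> axis_end.

    Dividing by the squared basis length makes the result invariant to
    translation, rotation and uniform scaling. Basis points must differ.
    """
    ex, ey = axis_end[0] - origin[0], axis_end[1] - origin[1]
    norm2 = ex * ex + ey * ey
    px, py = point[0] - origin[0], point[1] - origin[1]
    u = (px * ex + py * ey) / norm2
    v = (-px * ey + py * ex) / norm2
    return u, v


def quantize(coords):
    return tuple(round(c / BIN_SIZE) for c in coords)


def build_table(model):
    """Preprocessing: hash every point under every ordered basis pair."""
    table = defaultdict(list)
    for i, j in permutations(range(len(model)), 2):
        for k, point in enumerate(model):
            if k not in (i, j):
                key = quantize(frame_coords(model[i], model[j], point))
                table[key].append((i, j))
    return table


def vote(table, scene, basis):
    """Recognition: count, per model basis, the scene points landing in its bins."""
    i, j = basis
    votes = defaultdict(int)
    for k, point in enumerate(scene):
        if k not in (i, j):
            key = quantize(frame_coords(scene[i], scene[j], point))
            for model_basis in table.get(key, []):
                votes[model_basis] += 1
    return votes


model = [(0, 0), (2, 0), (1, 1), (3, 1), (0, 2)]
# Scene: the model rotated by 90 degrees, scaled by 2 and translated.
scene = [(5 - 2 * y, -3 + 2 * x) for x, y in model]
table = build_table(model)
print(vote(table, scene, basis=(0, 1))[(0, 1)])
```

Since the table is built once per library, each query only costs hash lookups, which is the property that makes this family of methods fast on large template libraries.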

3DSIG KEYNOTE: Improving cancer chemotherapy with structure-based drug repositioning
Date: Sunday, July 23
Time: 11:40 AM - 12:30 PM
Room: Forum
  • Michael Schroeder, TU Dresden, Germany

Presentation Overview: Show

Drug resistance is an important open problem in cancer treatment. In recent years, the heat shock protein HSP27 (HSPB1) was identified as a key player driving resistance development. HSP27 is overexpressed in many cancer types and influences cellular processes such as apoptosis, DNA repair, recombination, and formation of metastases. As a result cancer cells are able to suppress apoptosis and develop resistance to cytostatic drugs.

To identify HSP27 inhibitors, we follow a novel structure-based drug repositioning approach. We exploit the similarity of a predicted HSP27 binding site to that of a viral thymidine kinase to generate lead inhibitors for HSP27. We characterise the binding of a known inhibitor using interaction patterns from our tool PLIP and exploit this knowledge to assess better binders.

Six of these leads were verified experimentally. They bind HSP27 and down-regulate its chaperone activity. Most importantly, all six compounds inhibit development of drug resistance in cellular assays. One of the leads – chlorpromazine – is an antipsychotic, which has a positive effect on survival time in human breast cancer. The identified compounds will now undergo preclinical studies.

PRODIGY: a structure-based method for the prediction of binding affinity in biomolecular complexes
Date: Sunday, July 23
Time: 2:00 PM - 2:20 PM
Room: Forum
  • Anna Vangone, Utrecht University, Netherlands
  • Li Xue, Utrecht University, Netherlands
  • Joao Rodrigues, Utrecht University, Netherlands
  • Panagiotis Kastritis, Utrecht University, Netherlands
  • Mikael Trellet, Utrecht University, Netherlands
  • Jorg Schaarschmidt, Utrecht University, Netherlands
  • Alexandre MJJ Bonvin, Utrecht University, Netherlands

Presentation Overview: Show

Abstract
Here we present PRODIGY (PROtein binDIng enerGY predictor), a method to predict binding affinity of biomolecular complexes. Our approach is based purely on structural properties of the complexes, and outperforms other predictors so far reported. Due to its high performance and fast prediction time, we implemented it in a user-friendly web-server, freely available at: http://milou.science.uu.nl/services/PRODIGY/.

Introduction
Interactions between biomolecules, such as proteins, nucleic acids and small ligands, underlie almost every process in the cell. Understanding these interactions is therefore a crucial step in the investigation of biological systems and in drug development. Despite all the efforts devoted to unravelling the principles of biomolecular interactions in the past decades, we still lack a thorough understanding of the energetics of protein association.

Recently, we introduced a simple but robust descriptor of binding affinity based only on structural properties of protein-protein complexes (1). In that work, we showed how the number and type of the interfacial contacts made at a protein-protein interface contribute to the description of binding affinity, and developed a high-performing contact-based predictor. Despite the importance of the topic, there are surprisingly few online tools for fast and easy prediction of binding affinity. For this reason, based on our contact-based binding affinity predictor, we developed PRODIGY (PROtein binDIng enerGY predictor) (2), a webserver to predict the affinity of a protein-protein complex from its three-dimensional structure.

Methods
To evaluate the relationship between interface contacts and experimental binding affinity in protein-protein complexes, we used the bound structures from the protein-protein binding affinity benchmark of Kastritis et al. (3). For each complex we calculated the number of pairwise interfacial contacts (ICs) (4) and the properties of the non-interacting surface (NIS) (5), classifying both ICs and NIS according to the charged/polar/apolar nature of the residues. We trained a linear regression model on these structural features (ICs and NIS) and assessed the linear dependency between the experimental binding affinity and the structural properties, reported as Pearson’s correlation coefficient. PRODIGY's performance was compared with the current state of the art using the pre-calculated data available in CCharPPI (6).
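
The contact counting and the evaluation metrics described above can be sketched as follows. This is a minimal illustration, not the PRODIGY implementation: the 5.5 Å cutoff, the residue grouping, the function names (`interface_contacts`, `pearson_r`, `rmse`) and the toy coordinates are all assumptions made for the example.

```python
import math
from itertools import product

# Assumed residue grouping; anything not listed is treated as apolar.
CHARGED = {"ARG", "LYS", "ASP", "GLU"}
POLAR = {"CYS", "HIS", "ASN", "GLN", "SER", "THR", "TYR", "TRP"}


def residue_class(name):
    if name in CHARGED:
        return "charged"
    if name in POLAR:
        return "polar"
    return "apolar"


def interface_contacts(chain_a, chain_b, cutoff=5.5):
    """Count residue pairs across two chains that have any atoms within
    `cutoff` angstroms, keyed by the pair of residue classes.
    Each chain is a list of (residue_name, [atom_xyz, ...])."""
    counts = {}
    for (name_a, atoms_a), (name_b, atoms_b) in product(chain_a, chain_b):
        if any(math.dist(p, q) <= cutoff for p in atoms_a for q in atoms_b):
            key = "/".join(sorted((residue_class(name_a), residue_class(name_b))))
            counts[key] = counts.get(key, 0) + 1
    return counts


def pearson_r(xs, ys):
    """Pearson's correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)


def rmse(predicted, observed):
    """Root-mean-square error between predictions and measurements."""
    n = len(predicted)
    return math.sqrt(sum((p - o) ** 2 for p, o in zip(predicted, observed)) / n)


# Toy two-chain "complex" with one atom per residue (made-up coordinates).
chain_a = [("LYS", [(0.0, 0.0, 0.0)]), ("ALA", [(10.0, 0.0, 0.0)])]
chain_b = [("GLU", [(3.0, 0.0, 0.0)]), ("LEU", [(12.0, 0.0, 0.0)]),
           ("SER", [(50.0, 0.0, 0.0)])]
contacts = interface_contacts(chain_a, chain_b)
```

The per-class counts would then serve as features of a linear regression model, whose predictions can be scored against experimental affinities with `pearson_r` and `rmse`.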

Results & Conclusions
Using the protein-protein binding affinity benchmark of Kastritis et al. (3), we showed that the number of interfacial contacts (ICs) in a protein-protein complex correlates with the experimental binding affinity. This information, combined with properties of the non-interacting surface (NIS), which have been shown to influence binding affinity (5), has led to one of the top binding affinity predictors in the field, with Pearson’s correlation r = 0.73, p-value < 0.0001 and RMSE = 1.89 kcal mol⁻¹ (Figure 1).

FIGURE 1. Scatter plot of predicted vs experimental binding affinities for a set of 81 protein-protein complexes.

Finally, we implemented our contact-based binding affinity predictor into the web-server PRODIGY (2).

References
1. Vangone, A. & Bonvin, A.M.J.J. eLife, 4, e07454 (2015).
2. Xue, L., Rodrigues, J.P.G.L.M., Kastritis, P.L., Bonvin, A.M.J.J. Bioinformatics, 32(23):3676-3678 (2016).
3. Kastritis, P.L., Moal, I.H., Hwang, H., Weng, Z., Bates, P.A., Bonvin, A.M.J.J., Janin, J. Protein Science, 20:482-491 (2011).
4. Vangone, A., Spinelli, R., Scarano, V., Cavallo, L., Oliva, R. Bioinformatics, 27:2915-2916 (2011).
5. Kastritis, P.L., Rodrigues, J.P.G.L.M., Folkers, G.E., Boelens, R., Bonvin, A.M.J.J. Journal of Molecular Biology, 426:2632-2652 (2014).
6. Moal, I.H., Jimenez-Garcia, B., Fernandez-Recio, J. Bioinformatics, 31(1):123-125 (2015).

Identifying Multiple Active Conformations of G Protein-Coupled Receptors Using Focused Conformational Sampling
Date: Sunday, July 23
Time: 2:20 PM - 2:40 PM
Room: Forum
  • Ravinder Abrol, California State University Northridge, United States
  • Sijia Dong, University of Minnesota, United States
  • William A. Goddard III, California Institute of Technology, United States

Presentation Overview: Show

Abstract
G protein-coupled receptors (GPCRs) are membrane proteins critical to many cellular signal transduction pathways. The pleiotropic signaling of GPCRs arises from their conformational flexibility, which lets them exist in multiple states; the functionally important active states are high in energy, which makes them very challenging to study experimentally. Most computational methods can only identify lowest-energy states, so we developed a focused conformational sampling method capable of identifying multiple active states of GPCRs. Starting only from the inactive state, it correctly predicted the active conformation of two GPCRs and explained previous experiments, a task that has been an insurmountable challenge for standard molecular dynamics simulations.

Introduction
The multiple functions of GPCRs depend on their activation, which enables them to couple to several signaling partners inside the cell, such as G proteins, GPCR kinases (GRKs), and arrestins. This has also led to pharmacological targeting of GPCRs that causes both therapeutic effects and undesirable side effects. A mechanistic understanding of GPCR function requires the study of the multiple active conformations in which these receptors can putatively reside. These conformations have been hard to characterize experimentally because they are high in energy and difficult to stabilize. Computational biophysical methods are good mainly at identifying lowest-energy conformations. To address this problem, we have developed the ActiveGEnSeMBLE computational method [1], which systematically predicts multiple conformations that are likely in the GPCR activation landscape, including multiple active- and inactive-state conformations, while minimizing computational cost.

Method
The ActiveGEnSeMBLE method [1] starts with a systematic coarse-grid conformational sampling of helix tilts and rotations (~13 trillion transmembrane (TM) domain conformations) and selects the conformational landscape based on energy. The resulting energy profile (Figure 1) identifies multiple potential active-state energy wells, using the TM3-TM6 intracellular distance as a surrogate activation coordinate. These energy wells are then sampled locally on a finer grid in conformational space to find a locally minimized conformation in each well (Figure 1), which can be relaxed further with molecular dynamics (MD) simulations if necessary.
Figure 1: Conformational sampling in ActiveGEnSeMBLE method.
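
The coarse-then-fine strategy described above can be sketched on a toy problem. This is a minimal illustration under stated assumptions, not the ActiveGEnSeMBLE code: `toy_energy` is a made-up two-well surface over two angles standing in for a real energy evaluation over helix tilts and rotations, and the grid widths and step sizes are arbitrary.

```python
import itertools


def toy_energy(x, y):
    """Hypothetical energy surface with two wells along x (minima at
    x = 40 and x = -50) and one along y (minimum at y = -20)."""
    well_a = (x - 40.0) ** 2
    well_b = (x + 50.0) ** 2 + 300.0
    return (min(well_a, well_b) + (y + 20.0) ** 2) / 100.0


def axis(center, half_width, step):
    """Evenly spaced grid values centred on `center`."""
    n = int(round(half_width / step))
    return [center + i * step for i in range(-n, n + 1)]


def coarse_wells(energy, n_dims, half_width=90.0, step=30.0):
    """Scan a coarse grid and return the points that are local minima
    with respect to their nearest grid neighbours (the energy wells)."""
    values = axis(0.0, half_width, step)
    grid = {
        idx: energy(*(values[i] for i in idx))
        for idx in itertools.product(range(len(values)), repeat=n_dims)
    }
    wells = []
    for idx, e in grid.items():
        neighbours = []
        for d in range(n_dims):
            for delta in (-1, 1):
                nb = idx[:d] + (idx[d] + delta,) + idx[d + 1:]
                if nb in grid:
                    neighbours.append(grid[nb])
        if all(e <= other for other in neighbours):
            wells.append(tuple(values[i] for i in idx))
    return wells


def refine(energy, center, half_width=15.0, step=5.0):
    """Resample one well on a finer local grid; return (energy, point)."""
    axes = [axis(c, half_width, step) for c in center]
    return min((energy(*p), p) for p in itertools.product(*axes))


wells = coarse_wells(toy_energy, n_dims=2)
refined = sorted(refine(toy_energy, w) for w in wells)
```

The design point is that the expensive exhaustive scan is done only at coarse resolution, while fine sampling is restricted to the neighbourhood of each well, so higher-energy wells are kept rather than discarded in favour of the global minimum.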

Results and Conclusions
We validated the ActiveGEnSeMBLE method by predicting active human β2 adrenergic receptor (hβ2AR) and human M2 muscarinic acetylcholine receptor (hM2R) crystal structures. We found that the ActiveGEnSeMBLE method sampled the orientations of the TM helices and located structures in various energy wells spanning the range of TM3–TM6 distances (R36) traversed in the process of activation. Subsequent analysis revealed a local minimum in each of these energy wells that was close or identical to a crystal-structure conformation based on backbone RMSD (Figure 2).
Figure 2: Conformational energy landscape for human β2 adrenergic receptor activation.

To show the utility of knowing the active conformations, molecular dynamics simulations of the active conformation of hβ2AR, with and without the G protein and the agonist, were used to generate energy profiles that are consistent with the qualitative energy landscape of hβ2AR obtained from experiments [2], providing information about how the ligand and G protein may play a role in activation. These results (Figure 3) indicate that the agonist alone is not enough to stabilize the active state, and that the Gα C-terminal chain needs to be bound to the GPCR to promote activation, in agreement with experimental observations [2].
Figure 3: Energy landscape of human β2 adrenergic receptor.

References
1. Dong, Goddard, Abrol (2016) Biophys J. 110(12):2618-29.
2. Manglik, ..., Kobilka (2015) Cell. 161(5):1101-11.

The Impact of Conformational Entropy on the Accuracy of the Molecular Docking Software FlexAID in Binding Mode Prediction
Date: Sunday, July 23
Time: 2:40 PM - 3:00 PM
Room: Forum
  • Louis-Philippe Morency, University of Montreal, Canada
  • Rafael Najmanovich, University of Montreal, Canada

Presentation Overview: Show

Abstract
Here we present the newest implementation of Flexible Artificial Intelligence Docking (FlexAID), whose scoring function now considers the conformational entropy of ligand-biomolecule complexes. The higher accuracy of FlexAID [1] on complex cases, its novel features such as conformational entropy, its accessibility and its easy-to-use graphical user interface place FlexAID in a good position to tackle biologically and pharmacologically relevant situations currently ignored by other methods.

FlexAID [1] is available as a pre-compiled command-line executable (at http://bcb.med.usherbrooke.ca/flexaid for Windows, macOS & Linux) or through the NRGsuite [2], a PyMOL-integrated user interface that lets the user run FlexAID intuitively with real-time visualization. Both the NRGsuite and FlexAID are distributed as open-source software.

Introduction
A major goal of molecular docking is to predict the experimentally observed binding mode between a biomolecule, i.e. a polymer of amino or nucleic acids, and a ligand, e.g. a small molecule, peptide or nucleic acid. This computational method is used to study the structure of the molecular interactions involved in the cell’s essential biological functions. Current molecular docking methods evaluate the molecular interactions of a single conformation, a single pose, at a time, and they are trained to estimate the enthalpic fraction of the binding free energy. Consequently, most of them fail to model efficiently the entropic contributions, especially those of a conformational nature, which are fundamental in molecular recognition events.

Our research group develops FlexAID [1], an accessible and competitive docking software for ligands and biomolecules that focuses on molecular flexibility. Here we introduce FlexAID’s newest feature, which allows its scoring function to estimate conformational entropy by redefining the static binding mode usually predicted in molecular docking as a dynamic collection of similar poses evaluated together.

Methods
We implemented the new scoring function in FlexAID, allowing its genetic algorithm to select less favourable but frequently observed binding modes with a probability that follows a Boltzmann distribution. This lets FlexAID consider the conformational entropy of the complexes during the docking simulation. The core of the implementation is an unsupervised, density-based classification algorithm that groups similar poses into binding modes, thus redefining a binding mode as a dynamic collection of poses scored together, as can be seen in Figure 1.


Figure 1. Visual representation of a dynamic binding mode output by the newest implementation of FlexAID (PDB: 1xm6).
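
The idea of scoring a binding mode as a Boltzmann-weighted collection of poses can be sketched as follows. This is a minimal illustration and not FlexAID's scoring function or clustering algorithm: the grouping is a simplified neighbourhood-graph clustering on one representative coordinate per pose, and the names `group_poses` and `binding_mode_free_energy`, the distance threshold and the toy energies are assumptions made for the example.

```python
import math

K_B = 0.0019872041  # Boltzmann constant in kcal/(mol K)


def group_poses(poses, eps):
    """Group poses whose representative coordinates are linked by steps
    of at most `eps` (connected components of the neighbourhood graph).
    Each pose is (xyz, energy); returns sorted lists of pose indices."""
    unvisited = set(range(len(poses)))
    groups = []
    while unvisited:
        seed = unvisited.pop()
        stack, members = [seed], [seed]
        while stack:
            i = stack.pop()
            near = [j for j in unvisited
                    if math.dist(poses[i][0], poses[j][0]) <= eps]
            for j in near:
                unvisited.remove(j)
            stack.extend(near)
            members.extend(near)
        groups.append(sorted(members))
    return sorted(groups)


def binding_mode_free_energy(energies, temperature=300.0):
    """Free energy F = <E> - T*S of a set of poses, with Boltzmann
    probabilities p_i proportional to exp(-E_i / kT)."""
    kt = K_B * temperature
    e_min = min(energies)  # shift for numerical stability
    weights = [math.exp(-(e - e_min) / kt) for e in energies]
    z = sum(weights)
    probs = [w / z for w in weights]
    enthalpy = sum(p * e for p, e in zip(probs, energies))
    entropy = -K_B * sum(p * math.log(p) for p in probs if p > 0.0)
    return enthalpy - temperature * entropy


# Toy docking output: four similar poses at -10.0 and one isolated pose at -10.5.
poses = [((0.0, 0.0, 0.0), -10.0), ((0.4, 0.0, 0.0), -10.0),
         ((0.8, 0.0, 0.0), -10.0), ((1.2, 0.0, 0.0), -10.0),
         ((9.0, 0.0, 0.0), -10.5)]
modes = group_poses(poses, eps=0.5)
scores = [binding_mode_free_energy([poses[i][1] for i in m]) for m in modes]
best_mode = modes[scores.index(min(scores))]
```

In this toy case the populated mode wins even though its individual poses score worse than the isolated pose, because the entropic term lowers its free energy by kT ln 4.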

Results & Conclusions
We present the impact of FlexAID’s newest feature on its accuracy in binding mode prediction using three increasingly complex scenarios: the Astex Diverse Set [4], the Astex Non Native Set [5] and the HAP2 dataset [6]. We show that FlexAID outperforms other open-source molecular docking methods when molecular flexibility is crucial. Furthermore, FlexAID now outputs multiple conformations per binding mode, a novelty that allows the user to visualize the dynamics of the complex studied. We believe that its higher accuracy in complex scenarios, its novel features such as conformational entropy, its accessibility and its easy-to-use graphical user interface, the NRGsuite [2], place FlexAID in a good position to tackle biologically and pharmacologically relevant situations currently ignored by other molecular docking methods.

References
1. J. Chem. Inf. Model. 55: 1323–1336 (2015).
2. Bioinformatics btv458 (2015).
3. PLoS Comput Biol 10, e1003569 (2014).
4. J. Med. Chem. 50, 726–741 (2007).
5. J. Chem. Inf. Model. 48, 2214–2225 (2008).
6. Bioinformatics 28, i423–i430 (2012).

Interactome based drug design based on disease-disease relationships
Date: Sunday, July 23
Time: 3:00 PM - 3:20 PM
Room: Forum
  • Jonathan Fine, Purdue University, United States
  • Joydeb Majumder, Purdue University, United States
  • Travis C. Lantz, Purdue University, United States
  • Gaurav Chopra, Purdue University, United States

Presentation Overview: Show

Abstract
We have developed an interactome-based drug discovery, design, and repurposing platform. In contrast to traditional single-target approaches, it analyzes the similarity of compound-proteome interaction signatures to determine functional drug behavior, revealing common links between diseases (prostate cancer, breast cancer, hypertension) that are used to predict repurposable drugs. Using our new flexible target- and antitarget-guided design program (CANDOCK), we designed and synthesized potent (IC50 < 1 nM) anticancer drug leads for castration-resistant prostate cancer (CRPC) that are non-toxic to normal human prostate epithelial cells and to mice at high dose. We also tested our designed lead in patient-derived xenograft tumors in castrated, immunocompromised NSG mice, where it outperformed abiraterone, a known CRPC drug. These non-toxic putative drugs are differential nuclear hormone receptor modulators optimized by using the interaction profile (vs single target) of multiple receptors in the androgen signaling pathway to improve efficacy and reduce toxicity. We conclude that, compared to traditional single-target drug discovery, which is slow and error-prone, interactome-based drug discovery that considers any disease as a heterogeneous combination of other disease mechanisms will have a broader impact, leading to chemical control of biological pathways in diseases with overlapping mechanisms and foreshadowing a new era of faster drug discovery and design.

Introduction
Modern drug discovery screens large libraries against one or a few disease targets using high-throughput screening with iterative design, at a cost of millions of dollars. This paradigm is optimized to develop drugs against a single target, which was thought necessary to minimize side effects. However, it has an intrinsically high failure rate due to poor efficacy from wrong target selection, resistance mechanisms arising from singular targeting, or toxicity due to unknown off-target effects. Our work indicates that most human-ingestible drugs act through differential interactions with multiple proteins from different druggable protein classes. Here, we introduce computational methods for compound library development based on proteome-wide interactions. Instead of screening millions of randomly synthesized compounds with an intrinsically high failure rate, one uses virtual screening against multiple proteins combined with target- and antitarget-based lead optimization to make modular libraries focused on signaling pathways of interest.

Methods
We have previously implemented a modeling pipeline that predicts interactions between 3,733 human-approved drugs and 48,278 proteins (~1 billion predicted interactions). Each compound's structural proteome interaction signature is ranked by similarity to the signatures of all other known drugs approved for particular indications, which suggested disease-disease relationships and a new repurposed CRPC lead that was prospectively validated in human cancer cell lines. Next, we used CANDOCK to design chemical analogs of our repurposed CRPC lead based on its target and antitarget inhibition profile for pathway specificity, rather than engaging in the typical blind synthetic library design based on single-target optimization (Figure 1). These designs were tested for anticancer efficacy and metastatic potential in human cancer cell lines in vitro, and for efficacy and toxicity compared with an existing CRPC drug in vivo.


Figure 1. Disease-disease relationship results in repurposed lead. Synthesis of multitarget (target and antitarget) profile based lead optimization results in pathway specific compound library that is potent, non-toxic and disease-specific compared to single target drugs.
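
Ranking compounds by the similarity of their proteome interaction signatures can be sketched as follows. This is a minimal illustration only: the abstract does not state which similarity measure the pipeline uses, so cosine similarity is an assumption here, and the compound names and score vectors are hypothetical placeholders rather than real predictions.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length score vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def rank_by_signature(query_name, signatures):
    """Rank all other compounds by the similarity of their interaction
    signature to that of `query_name`, most similar first."""
    query = signatures[query_name]
    scored = [
        (cosine_similarity(query, sig), name)
        for name, sig in signatures.items()
        if name != query_name
    ]
    return [name for _, name in sorted(scored, reverse=True)]


# Hypothetical signatures: one predicted interaction score per protein.
signatures = {
    "compound_A": [0.9, 0.1, 0.8, 0.0],
    "compound_B": [0.1, 0.9, 0.0, 0.7],
    "compound_C": [0.8, 0.2, 0.7, 0.1],
}
ranking = rank_by_signature("compound_A", signatures)
```

Compounds that rank near a query but are approved for a different indication are the candidates that suggest a disease-disease link and a repurposing hypothesis.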

Results & Conclusions
We implemented a proteome design method, CANDOCK, to synthesize a potent, non-toxic library of differential-targeting (not promiscuous), hormone-specific signaling anticancer agents for CRPC. We experimentally validated our selection for enhanced efficacy and toxicity in vivo.

3DSIG DISCUSSION: How Will Data Science Influence What We Do?
Date: Sunday, July 23
Time: 3:20 PM - 4:00 PM
Room: Forum
  • Phillip Bourne, University of Virginia, United States

Presentation Overview: Show

Abstract: Data science is becoming increasingly influential in many industries. Since the research that is being undertaken by those who attend 3Dsig has always been data driven, is there anything new to us emerging from data science and so-called “Big Data?” If the answer is yes, what is new, how is it being applied and what is next? This will be an audience discussion around these questions and other questions that will undoubtedly arise.

Three-dimensional organisation of human genome
Date: Sunday, July 23
Time: 4:30 PM - 4:50 PM
Room: Forum
  • Przemysław Szałaj, University of Warsaw, Poland
  • Zhonghui Tang, The Jackson Laboratory for Genomic Medicine, United States
  • Paul Michalski, The Jackson Laboratory for Genomic Medicine, United States
  • Michał J Piętal, University of Warsaw, Poland
  • Michał Sadowski, University of Warsaw, Poland
  • Oscar Luo, The Jackson Laboratory for Genomic Medicine, United States
  • Yijun Ruan, The Jackson Laboratory for Genomic Medicine, United States
  • Dariusz Plewczynski, University of Warsaw, Poland

Presentation Overview: Show

Abstract
Chromatin Interaction Analysis by Paired-End Tag Sequencing (ChIA-PET) reveals long-range chromatin interactions and provides insights into the basic principles of spatial genome organization and gene regulation mediated by specific protein factors. Recently, we showed that a single ChIA-PET experiment provides information at all genomic scales of interest, from the high resolution locations of binding sites and enriched chromatin interactions mediated by specific protein factors, to the low resolution of non-enriched interactions that reflect topological neighborhoods of higher-order chromosome folding. This multilevel nature of ChIA-PET data offers an opportunity to use multiscale 3D models to study structural-functional relationships at multiple length scales, but doing so requires a structural modeling platform.

Introduction
Chromosomal folding is an important feature of genome organization and plays critical roles in genome functions, including transcriptional regulation. Using 3C-based mapping technologies to render long-range chromatin interactions has started to reveal some basic principles of spatial genome organization. Among 3D genome mapping technologies, ChIA-PET is unique in its ability to generate multiple datasets in a single experiment, including binding sites, enriched chromatin interactions (mediated by specific protein factors, like CTCF), and non-enriched interactions that reflect topological neighborhoods of higher-order associations. The multifarious nature of ChIA-PET data represents an important advantage in capturing multi-layer structural-functional information, but also imposes new challenges for multi-scale modeling of the 3D genome [1].

Methods
The above experimental insights allowed us to propose 3D-GNOME (3-Dimensional GeNOme Modeling Engine) [2], a complete computational pipeline for 3D simulation using ChIA-PET data. 3D-GNOME consists of three integrated components: a graph-distance-based heatmap normalization tool, a 3D modeling platform, and an interactive 3D visualization tool. Using ChIA-PET and Hi-C data derived from human B-lymphocytes, we demonstrate the effectiveness of 3D-GNOME in building 3D genome models at multiple levels, including the entire genome, individual chromosomes, and specific segments at megabase (Mb) and kilobase (kb) resolutions, as single average structures and as ensembles. Further incorporating CTCF-motif orientation and high-resolution looping patterns into the 3D simulation made the resulting topological structures more reliable and biologically plausible.
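
The abstract does not describe how the graph-distance-based normalization works, so the following is only an illustration of the generic idea of a graph distance on an interaction heatmap, not the 3D-GNOME algorithm. It assumes that each observed interaction frequency becomes an edge of length 1/frequency and fills in unobserved pairs with all-pairs shortest paths (Floyd-Warshall); the function name and the toy matrix are hypothetical.

```python
import math


def graph_distances(freq):
    """All-pairs shortest-path distances on a graph whose edge lengths
    are the reciprocals of the (symmetric) interaction frequencies."""
    n = len(freq)
    dist = [
        [0.0 if i == j else (1.0 / freq[i][j] if freq[i][j] > 0 else math.inf)
         for j in range(n)]
        for i in range(n)
    ]
    # Floyd-Warshall relaxation over every intermediate node k.
    for k in range(n):
        for i in range(n):
            for j in range(n):
                via_k = dist[i][k] + dist[k][j]
                if via_k < dist[i][j]:
                    dist[i][j] = via_k
    return dist


# Toy heatmap for three genomic segments; segments 0 and 2 never interact directly.
freq = [
    [0, 2, 0],
    [2, 0, 4],
    [0, 4, 0],
]
dist = graph_distances(freq)
```

A distance matrix of this kind gives every pair of segments a finite target distance, which is the sort of input a 3D modeling step can then try to satisfy.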

Results & Conclusions
Finally, we will present the 3D-GNOME web service [3], which generates 3D structures from 3C data and provides tools to visually inspect and annotate the resulting structures, in addition to a variety of statistical plots and heatmaps that characterize the selected genomic region. 3D-GNOME simulates the structure and provides a convenient user interface for further analysis. Alternatively, a user may generate structures using ChIA-PET data for the GM12878 cell line by simply specifying a genomic region of interest. 3D-GNOME is freely available at http://3dgnome.cent.uw.edu.pl/ and provides unique insights into the topological mechanism of human variations and diseases.

Further refinement of 3D-GNOME and its application to additional ChIA-PET and other types of 3D genome mapping data will help to advance our understanding of genome structures and functions.

References
1. Biological model published in Cell on 17th Dec, 2015: http://linkinghub.elsevier.com/retrieve/pii/S0092-8674(15)01504-4
2. Computational model published in Genome Research 2016: http://genome.cshlp.org/content/early/2016/10/27/gr.205062.116.abstract
3. Web server published in Nucleic Acids Research 2016: http://nar.oxfordjournals.org/content/early/2016/05/16/nar.gkw437.full

From Mutations to Mechanisms and Dysfunction via Computation and Mining of Protein Energy Landscapes
Date: Sunday, July 23
Time: 4:50 PM - 5:10 PM
Room: Forum
  • Tatiana Maximova, George Mason University, United States
  • Wanli Qiao, George Mason University, United States
  • Erion Plaku, The Catholic University of America, United States
  • Carla Mattos, Northeastern University, United States
  • Buyong Ma, National Cancer Institute, United States
  • Ruth Nussinov, National Cancer Institute, United States
  • Amarda Shehu, National Cancer Institute, United States

Presentation Overview: Show

Abstract
The energy landscape underscores the inherent nature of proteins as dynamic systems interconverting between structures with varying energies. Recently, we have developed a method that feasibly reconstructs landscapes. Here we demonstrate that the availability of landscapes of wildtype and diseased variants opens the way for data mining techniques to harness quantitative information embedded in landscapes to summarize mechanisms via which mutations alter dynamics and function.

Introduction
The energy landscape contains the information needed to characterize and relate protein equilibrium dynamics to function [1]. Due to the broad spatio-temporal scales involved, neither wet- nor dry-laboratory techniques can reconstruct entire landscapes [2]. In 2016 we proposed a method to do so for medium-size proteins by exploiting known structures of a protein’s wildtype and variant sequences [3]. Our interest in understanding the impact of Ras mutations on (dys)function and now our ability to build landscapes for many variants prompt us to investigate landscape mining techniques. We report here on a technique that allows relating specific structures and transitions to biological mechanisms and exposing mechanisms via which mutations alter dynamics and function.

Methods
The SoPriMp algorithm [3] that we proposed and analyzed in 2016 leverages known structures of a protein to build a sample-based representation of the energy landscape of a sequence of interest. Due to the role of Ras in cell growth and of many of its variants in cancer and other disorders, we have applied SoPriMp to map the landscapes of 14 pathogenic variants. Comparing landscapes raises interesting questions about how mutations alter dynamics and function. We propose here a landscape mining technique based on statistical analysis of high-dimensional spatial data, borrowing concepts from manifold theory. The technique automatically extracts basins and saddles, allowing us to measure descriptors related to volumes, depths, and spatial and energetic distances between basins and basin-separating saddles.
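
Extracting basins and saddles from a sample-based landscape can be sketched as follows. This is a minimal illustration and not the authors' mining technique: it assumes a fixed-radius neighbourhood graph over the samples, assigns each sample to a basin by steepest descent on that graph, and takes the lowest crossing edge between two basins as their saddle. The function names and the one-dimensional toy data are hypothetical.

```python
import math


def neighbour_graph(points, radius):
    """Map each sample index to the indices of samples within `radius`."""
    n = len(points)
    return {
        i: [j for j in range(n)
            if j != i and math.dist(points[i], points[j]) <= radius]
        for i in range(n)
    }


def assign_basins(energies, graph):
    """Follow the lowest-energy neighbour from each sample until a local
    minimum is reached; the minimum's index labels the basin."""
    def descend(i):
        while True:
            lower = [j for j in graph[i] if energies[j] < energies[i]]
            if not lower:
                return i
            i = min(lower, key=lambda j: energies[j])
    return [descend(i) for i in range(len(energies))]


def basin_saddles(energies, graph, basin_of):
    """For each pair of adjacent basins, the lowest energy that must be
    reached to cross between them along a graph edge."""
    saddles = {}
    for i, neighbours in graph.items():
        for j in neighbours:
            a, b = basin_of[i], basin_of[j]
            if a != b:
                key = (min(a, b), max(a, b))
                height = max(energies[i], energies[j])
                saddles[key] = min(saddles.get(key, math.inf), height)
    return saddles


# Toy one-dimensional landscape sampled at seven points.
points = [(float(x),) for x in range(7)]
energies = [3.0, 1.0, 2.0, 5.0, 2.5, 0.0, 4.0]
graph = neighbour_graph(points, radius=1.0)
basin_of = assign_basins(energies, graph)
saddles = basin_saddles(energies, graph, basin_of)
# Basin depth = saddle energy minus the energy of the basin minimum.
depths = {m: saddles[(1, 5)] - energies[m] for m in (1, 5)}
```

Descriptors such as basin size (number of samples per label), depth and saddle height can then be compared across variants.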

Results & Conclusions
This technique yields graphical representations of computed landscapes, shown for selected variants in Figure 1A. On these (and others not shown) we observe the following. In oncogenic variants, the On-Off barrier is typically elevated; the Off basin shrinks or disappears; and the On basin splits, separating the R- and T-states (the importance of the latter for function is detailed in Ref. [4]). In syndrome-causing variants, changes are less drastic: the Off basin leaks into other regions, or both the Off and On basins degenerate, even merging with one another and spilling over much of the landscape. Extracted landscape descriptors can be correlated with wet-lab measurements (on the same variants) [5] of Ras activities (see Figure 1B). These and other descriptors (transition costs not shown here) reveal that the transition between R- and T-states relates to RAF1-RBD binding and GTP activation, and that transitions between states with active-inactive GTP signaling (On-to-Off, T-state-to-Off) relate to intrinsic hydrolysis. These results signal an exciting stage where we can compute and mine landscapes to learn how mutations impact function and elucidate the role of specific states and transitions in biological pathways.

FIGURE 1. A. Landscapes of H-Ras variants. Contours track basins (left: On; right: Off). B. Cross-variant wet-lab-characterized mechanisms [5] that correlate above 0.6 (in bold: above 0.75) with landscape descriptors are P7 (Intrinsic Hydrolysis) and P0 (GTP activation).

References
1. Boehr, D. D., Nussinov, R., and Wright, P. E. Nature Chem Biol 5 (2009).
2. Russel, D. et al. Curr Opin Cell Biol 21 (2009).
3. Maximova, T. et al. ACM Bioinf & Comp Biol (2016).
4. Johnson, C. W. and Mattos, C. Enzymes (2013).
5. Gremer, L. et al. Human Mutation (2011).

What can human variation tell us about proteins?
Date: Sunday, July 23
Time: 5:10 PM - 5:30 PM
Room: Forum
  • Stuart MacGowan, University of Dundee, UK
  • Fábio Madeira, University of Dundee, UK
  • Thiago Britto-Borges, University of Dundee, UK
  • Melanie S. Schmittner, University of Dundee, UK
  • Christian Cole, University of Dundee, UK
  • Geoffrey J. Barton, University of Dundee, UK

Presentation Overview: Show

What can human variation tell us about proteins? [1]
Human sequencing projects have generated population variant datasets from thousands of individuals [2-3]. For the exome, protein structure is an effective context in which to interpret the effects of missense variation. Studies have shown that disease variants are enriched in buried sites [4] and protein interaction surfaces [5], whilst somatic [6] and pathogenic germline variants [7] often cluster in 3D. Beyond structure, the genomic distribution of genetic variation is affected by gene essentiality [2, 8], protein domain architecture [9] and other genomic features [3].

Given these observations, we were curious to see if missense variant distributions could identify the structural features that had been used to interpret them. The variant data were too sparse to compare genetic variation between individual protein residues in domains directly [2, 9], so we began by aggregating variants over columns in multiple sequence alignments of Pfam domains [10]. We then compared the resulting population variation profiles to residue conservation across all species and found that the same structural and functional pressures that affect residue conservation during domain evolution also place constraints on missense variants. Consequently, positions that are depleted of missense variants are likely to be structurally important.

We found that missense depleted positions are enriched in known pathogenic variants, while positions that are both missense depleted and evolutionarily conserved are further enriched in pathogenic variants compared to those that are only evolutionarily conserved. Unexpectedly, some evolutionarily Unconserved positions are Missense Depleted in human (UMD positions) and these are also enriched in pathogenic variants. UMD positions are further differentiated from other unconserved residues in that they are enriched in ligand, DNA and protein binding interactions, which suggests this stratification can identify unconserved positions that are structurally important.
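
The column-wise aggregation and the UMD classification described above can be sketched as follows. This is a minimal illustration, not the authors' pipeline: Shannon entropy stands in for the conservation score, the two thresholds are arbitrary, and the alignment, variant positions and function names are hypothetical toy values.

```python
import math
from collections import Counter


def column_entropy(column):
    """Shannon entropy (bits) of the residues in one alignment column,
    ignoring gaps; 0 means fully conserved."""
    residues = [r for r in column if r != "-"]
    if not residues:
        return 0.0
    n = len(residues)
    return -sum(c / n * math.log2(c / n) for c in Counter(residues).values())


def missense_rate_per_column(alignment, variants):
    """Missense variants per aligned residue for each column.
    `alignment` maps sequence name -> aligned string; `variants` is a
    list of (sequence name, column index) pairs."""
    n_cols = len(next(iter(alignment.values())))
    counts = Counter(col for _, col in variants)
    rates = []
    for col in range(n_cols):
        occupancy = sum(1 for seq in alignment.values() if seq[col] != "-")
        rates.append(counts[col] / occupancy if occupancy else 0.0)
    return rates


def umd_columns(alignment, variants, min_entropy=1.0, max_rate=0.1):
    """Columns that are unconserved across the family yet depleted of
    missense variants (thresholds are illustrative assumptions)."""
    rates = missense_rate_per_column(alignment, variants)
    columns = list(zip(*alignment.values()))
    return [
        col for col, residues in enumerate(columns)
        if column_entropy(residues) >= min_entropy and rates[col] <= max_rate
    ]


# Toy domain family of four aligned sequences with mapped missense variants.
alignment = {"p1": "ACDE", "p2": "ACEK", "p3": "A-FR", "p4": "ACGD"}
variants = [("p1", 0), ("p2", 0), ("p1", 3), ("p2", 3), ("p3", 3), ("p4", 1)]
rates = missense_rate_per_column(alignment, variants)
umd = umd_columns(alignment, variants)
```

Pooling variants over a column is what makes the sparse per-residue data usable: each column collects observations from every human protein that contains the domain.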

I will illustrate these principles by discussing examples from the Src Homology 2 (SH2; Figure 1), G-Protein Coupled Receptor (GPCR; Figure 2) and the Nuclear Receptor Ligand Binding Domain families.

Figure 1. Inter-domain interactions of the SH2 domain in inactivated Src (PDB ID: 2src) [11]. The surface of the SH2 domain (PF00017) is coloured yellow to red corresponding to a. increasing missense depletion and b. increasing Shenkin conservation.

Figure 2. UMD residues (blue) in the Rhodopsin-like receptors (PF00001) mapped to a structure of the Delta-type opioid receptor (PDB ID: 4n6h) [12]. Amongst the 11 UMD residues are several involved in ligand binding and one that coordinates the bound sodium ion.

References
1. MacGowan, S. A., et al., bioRxiv 2017, doi: https://doi.org/10.1101/127050.
2. Lek, M., et al., Nature 2016, 536 (7616), 285-91.
3. Telenti, A., et al., Proc Natl Acad Sci U S A 2016, 113 (42), 11901-11906.
4. Wang, Z.; Moult, J., Hum Mutat 2001, 17 (4), 263-70.
5. David, A.; Sternberg, M. J., J Mol Biol 2015, 427 (17), 2886-98.
6. Gao, J., et al., Genome Medicine 2017, 9 (1), 4.
7. Sivley, R. M., et al., bioRxiv 2017, doi: https://doi.org/10.1101/109652.
8. Petrovski, S., et al., PLoS Genet 2013, 9 (8), e1003709.
9. Gussow, A. B., et al., Genome Biol 2016, 17, 9.
10. Finn, R. D., et al., Nucleic Acids Res 2016, 44 (D1), D279-85.
11. Xu, W., et al., Mol Cell 1999, 3 (5), 629-38.
12. Fenalti, G., et al., Nature 2014, 506 (7487), 191-6.

3DSIG Closing remarks
Date: Sunday, July 23
Time: 5:30 PM - 6:00 PM
Room: Forum
  • Rafael Najmanovich, University of Montreal, Canada

Presentation Overview: Show

Closing remarks: Careers in and around structural bioinformatics