CASP4 Abstracts

Comparative modeling category



These are the abstracts submitted by the predicting groups to the 2000 CASP4 meeting.

003 , Gerloff
012 , Levitt
017 , Yang-Ansuei
022 , InforMax
023 , Jones
028 , Ram-Samudrala
032 , Wolynes
042 , Honig-Barry
044 , Walts-Wondrous-Wizards
047 , kitasato-univ.
058 , Harrison-Weber
065 , Torda-Andrew
088 , ORNL-PROSPECT
090 , Hogue-Feldman
095 , blundell-tl
126 , Sternberg
133 , CBC-FOLD
155 , TUDELFT
169 , Dunbrack
173 , Barton
186 , SDSC1
187 , SDSC2:Reddy-Bourne
197 , Godzik
218 , LAMBERT-Christophe
223 , Braun-UTMB
237 , Sali-Andrej
241 , Vajda
255 , BinToHes
273 , WXW
330 , Zemla-Joanna
342 , SBI-AT
354 , baker
363 , Moult
381 , SBfold
382 , SBauto
384 , Murzin
389 , 123D+
406 , VENCLOVAS
414 , Friesner
429 , CHEN-WENDY
444 , MOE-CCG
447 , MSI
457 , SBI-GR
465 , YASARA
482 , MSI-GA
486 , Shoshana-Wodak
501 , Hovmoeller-Zhou
520 , Scheib-Holger
526 , Ginalski
535 , shankari


Jones , 023

number of submitted models: 56

Automatic generation of alignments for comparative modelling using THREADER

David T. Jones

Brunel University
email:
David.Jones@brunel.ac.uk

 
In overview, our comparative modelling procedure involves the use of  
THREADER3 (in comparative modelling mode) to generate an alignment  
between the target sequence and a single selected template structure. This  
alignment is then fed into the MODELLER4 program (A. Sali and T.  
Blundell, J.Mol.Biol. 234, 779, 1993) for the final modelling stage. In  
detail, the procedure is as follows: 
 
1. Template selection. The GenTHREADER & mGenTHREADER (D.T.  
Jones, J. Mol. Biol. 287, 797-815, 1999) results stored on the CAFASP2
server were searched to find the template structure which produced the
highest score. Where the confidence of the best match was not either  
HIGH or CERT, the target was classed as a fold recognition target and  
was not processed further in this category. 
 
2. Alignment generation. PSIPRED (D.T. Jones, J. Mol. Biol. 292, 195- 
202, 1999) secondary structure predictions were generated and  
THREADER3 was used with both sequence and secondary structure  
weighting. Alignments were generated with different sequence similarity  
weights (-S option) in the range 50-400. Depending on the degree of  
sequence similarity reported, 1-3 alignments were selected on the  
basis of threading energy Z-scores (i.e. only 1 was selected if the
sequence identity was > 50% and 3 were selected if it was < 35%).
 
3. Each of the alignments (or just one in the case of trivial homologues)  
was then fed into the MODELLER4 program and the final structures  
were evaluated in terms of overall threading energy Z-scores using 100 sequence shuffles. 
The model with the highest combined pairwise/solvation energy Z-score was submitted. 
Although no significant human intervention was used at any 
point, some hand editing of the input files was sometimes  
needed in order to get MODELLER4 to run (chain labelling issues for example). 
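
The shuffle-based Z-score in step 3 and the selection rule in step 2 can be sketched as follows. This is a minimal illustration, not the THREADER3 code: toy_energy is a hypothetical stand-in for the pairwise/solvation threading energy, the sign convention (positive means better than the shuffled background) is an assumption, and the number of alignments kept in the 35-50% identity band is not stated in the text.

```python
import random
import statistics

def toy_energy(sequence, environment):
    """Hypothetical stand-in for a threading energy: lower is better.

    Each residue is scored by whether its hydrophobicity matches the
    burial state ('b' buried, 'e' exposed) of the template position.
    """
    hydrophobic = set("AVLIMFWC")
    energy = 0.0
    for residue, state in zip(sequence, environment):
        buried = state == "b"
        energy += -1.0 if (residue in hydrophobic) == buried else 1.0
    return energy

def shuffle_z_score(sequence, environment, n_shuffles=100, seed=0):
    """Z-score of the native energy against shuffled-sequence energies.

    Positive values mean the native sequence scores better (lower
    energy) than the shuffled background.
    """
    rng = random.Random(seed)
    native = toy_energy(sequence, environment)
    residues = list(sequence)
    background = []
    for _ in range(n_shuffles):
        rng.shuffle(residues)
        background.append(toy_energy(residues, environment))
    spread = statistics.pstdev(background)
    if spread == 0.0:
        return 0.0
    return (statistics.mean(background) - native) / spread

def n_alignments_to_keep(percent_identity):
    """Rule from the text: 1 above 50% identity, 3 below 35%.

    The value for the 35-50% band is not stated; 2 is an assumption.
    """
    if percent_identity > 50:
        return 1
    if percent_identity < 35:
        return 3
    return 2
```

Shuffling preserves the amino acid composition, so the Z-score measures how much of the energy comes from the order of the residues rather than from composition alone.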
 


Torda-Andrew , 065

number of submitted models: 93

Sequence to structure alignments without
Boltzmann based force fields

Abraham, M, Ayers, DJ, Dosztanyi, Huber, T, Procter, JB, Russell, AJ and Torda, AE

Australian National University
email:
Andrew.Torda@anu.edu.au

 
Alignments were calculated and models ranked using the sausage 
program [1]. Sidechains were fitted using a self-consistent 
mean-field method [2]. 
 
Three force fields were used in three different steps: 
 
1. Sequence to structure alignments used a score function 
which used the identity of only one interaction partner 
[5]. This allowed us to use the Gotoh method [4] for speed, 
while avoiding the frozen approximation or double dynamic 
programming. 
 
2. Ranking of models used a z-score optimised force field [3] 
 
3. Fed by unbounded optimism or perhaps pure faith, 
side-chains were placed on the models using a more 
conventional, physically based, molecular mechanics style 
force field. 
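
Step 1 relies on the fact that, when the score for a residue at a site depends on only one interaction partner, all residue-site scores can be tabulated in advance and an ordinary affine-gap dynamic programme applies. The sketch below shows that recursion in the style of Gotoh [4]. It is not the sausage implementation: the score matrix is assumed to be precomputed, and the gap penalties are hypothetical values.

```python
NEG_INF = float("-inf")

def gotoh_score(score, gap_open=-3.0, gap_extend=-1.0):
    """Best global alignment score with affine gap penalties.

    score[i][j] is the score for placing sequence residue i on template
    site j. A gap of length k costs gap_open + (k - 1) * gap_extend.
    """
    n = len(score)
    m = len(score[0]) if n else 0
    # M: residue i aligned to site j.
    # X: residue i left unmatched (gap in the template).
    # Y: site j left unmatched (gap in the sequence).
    M = [[NEG_INF] * (m + 1) for _ in range(n + 1)]
    X = [[NEG_INF] * (m + 1) for _ in range(n + 1)]
    Y = [[NEG_INF] * (m + 1) for _ in range(n + 1)]
    M[0][0] = 0.0
    for i in range(1, n + 1):
        X[i][0] = gap_open + (i - 1) * gap_extend
    for j in range(1, m + 1):
        Y[0][j] = gap_open + (j - 1) * gap_extend
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            best_prev = max(M[i - 1][j - 1], X[i - 1][j - 1], Y[i - 1][j - 1])
            M[i][j] = best_prev + score[i - 1][j - 1]
            X[i][j] = max(M[i - 1][j] + gap_open,
                          X[i - 1][j] + gap_extend,
                          Y[i - 1][j] + gap_open)
            Y[i][j] = max(M[i][j - 1] + gap_open,
                          Y[i][j - 1] + gap_extend,
                          X[i][j - 1] + gap_open)
    return max(M[n][m], X[n][m], Y[n][m])
```

The three matrices keep the cost at O(nm) for any gap length, which is where the speed comes from; a pairwise score that depended on both partners would break the independence of the cells and force the frozen approximation or double dynamic programming.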
 
The first two force fields may be knowledge-based, but they 
were built in complete ignorance of Boltzmann 
statistics. Instead, the parameters are optimised so as to 
distinguish native coordinates from a mass of misfolded 
structures. 
 
A second series of optimisation calculations allowed us to 
find weights for additional terms for secondary structure 
predictions [6], sequence similarity and gap penalties. 
 
Finally, the library of templates consisted not of simple 
protein coordinates, but rather of precalculated fields due to 
averaging over similar structures. 
 
The alignment code and methodology are indisputably fast. They
may occasionally be correct.
 
For the last few targets, secondary structure predictions were 
made using a neural net fed on the sausage alignment 
calculations. 
 
------------------- 
[1] Huber T, Russell AJ, Ayers D, Torda AE (1999) 
Bioinformatics, 15, 1064-1065. 
Sausage: protein threading with flexible force fields. 
 
[2] Huber T, Torda AE, van Gunsteren WF (1996), Biopolymers, 
39, 103-114. 
Optimization methods for conformational sampling using a 
Boltzmann-weighted mean field approach. 
 
[3] Huber, T and Torda, AE (1998) Protein Sci, 7, 142-149.
Protein fold recognition without Boltzmann statistics or 
explicit physical basis. 
 
[4] Gotoh, O. (1982) J Mol Biol, 162, 705-708. 
An improved algorithm for matching biological sequences. 
 
[5] Huber T, Torda AE (1998) J Comput Chem, 15, 1455-1467. 
Protein sequence threading, the alignment problem, and a 
two-step strategy. 
 
[6] Rost B and Sander C. (1993) J Mol Biol, 232, 584-599. 
Prediction of protein secondary structure at better than 70% 
accuracy. 
 


Scheib-Holger , 520

number of submitted models: 2

Homology modeling of CASP4 targets T0119 and T0123

Holger Scheib

GlaxoWellcome Experimental Research S.A.
email:
hys14462@glaxowellcome.co.uk

 
The target sequence for T0119 (Benzoate Dioxygenase Reductase) was found in
SwissProt (Accession Code P07771). Initially, the SwissModel First Approach 
Mode (http://www.expasy.ch/swissmod/SWISS-MODEL.html) without preselected 
template files was applied with the resulting model structure containing 2 
domains with a missing linker sequence from K103 to A114. Also, the optional 
WhatCheck and Predict Protein reports were obtained as well as results from
3D-PSSM prediction. In the 3D model structure, domain 1 consists of a 2Fe-2S
ferredoxin and domain 2 of a ferredoxin-NAD reductase. Domain 2 can be further
subdivided into 2 subdomains. The SwissModel result was analyzed by comparing
the model structure to the WhatCheck report and PredictProtein results. Using
PredictProtein, the 3D structure was checked for secondary structure and
accessibility. Although the overall model seemed to be reasonable, the side
chain positions of the following residues were manually modified to reduce
steric repulsion or to increase hydrophobic or polar contacts:

Domain 1: E24
Domain 2: F139, S186, K235, E279. 
In domain 1, the loop between E67 and A75 was created applying the SwissModel
loop building tool. Finally, both domains were manually brought into close
proximity to suggest a putative 3D orientation of the connected domains.

For target T0123 (b-lactoglobulin), the target sequence was sent to SwissModel
(http://www.expasy.ch/swissmod/SWISS-MODEL.html). SwissModel First Approach
Mode was performed without preselected template files. The results retrieved
besides the 3D model structure included both the WhatCheck and PredictProtein
report, and 3D-PSSM prediction data.

The 3D structural model was built with 3BLG, 1BSOA, 1CJ5A, 1BST, and 2BLG as
template structures. The templates differed more in structure than in
sequence, since the template structures were solved at various pH values.
From the additional information coming along with the T0123 target,
one could extract that the crystallization was carried out at pH 3.2
indicating a closed conformation of the loop between residues  85 and 90.

Also, from the respective publication of Qin et al. in Biochemistry (see
reference below), one could conclude that pig b-lactoglobulin is most likely
monomeric.

From PredictProtein, the suggested template structures concerning the sequence
identity to T0123 were 1B0O, 1QAB, 1MUP, 1AQB, 1FEN, and 1WDC. Due to the
differences between template structures selected by SwissModel and
PredictProtein, the results for secondary structure prediction were ignored,
since there is no correlation at all.

The resulting alignment generated by SwissModel was manually corrected at the
C-terminus by shifting residues P151, A152, and Q153 two positions downstream
placing the two residue gap in the alignment (of the target sequence with the
template structure sequences) between L150 and P151. A loop was built using
L150 and C158 as anchor points, scanning the SPDBViewer loop database. Energies
of all loops found were calculated using the Gromos force field by van
Gunsteren and coworkers. The lowest energy loop was selected, originating from
4GCR with 1.47 Å resolution, sequence pattern ITDDCPS. The respective energy
was calculated to be 3542.2 kJ.

The following residues were found to clash either with other side chains or
the T0123 backbone:

P5, K35, S37, K40, R61, Q91, F93, L94, H102, L105, L126, V128, D130, I132,
R133, P146, P151, and E155.

The side chains of the model structure were then energetically minimized using
the Simulated Annealing algorithm implemented in the SPDBViewer applying the
following parameters:

Heating: 20 steps, initial T: 1000 K, final T: 1000 K
Annealing: 200 steps, initial T: 1000 K, final T: 300 K
Equilibration: 100 steps, initial T: 300 K, final T: 300 K
Random seed: 0.000
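
The three phases listed above can be turned into a per-step temperature schedule as sketched below. Only the step counts and the initial and final temperatures come from the parameter list; the linear interpolation within each phase is an assumption, and the function is an illustration, not the SPDBViewer algorithm.

```python
def linear_schedule(phases):
    """Return one temperature per step for a list of (steps, t_start, t_end).

    Linear interpolation within each phase is an assumption; the text
    gives only the step counts and the initial and final temperatures.
    """
    temperatures = []
    for steps, t_start, t_end in phases:
        for step in range(steps):
            fraction = step / (steps - 1) if steps > 1 else 0.0
            temperatures.append(t_start + fraction * (t_end - t_start))
    return temperatures

# Values taken from the parameter list above (temperatures in K).
PHASES = [
    (20, 1000.0, 1000.0),   # heating
    (200, 1000.0, 300.0),   # annealing
    (100, 300.0, 300.0),    # equilibration
]
schedule = linear_schedule(PHASES)
```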

After applying the Simulated Annealing algorithm to the model structure, the
side chains of residues H102, L105, and R159 were manually modified, since
they still repulsively interacted with their environment.

From the resulting model, the following characteristics could be extracted
coinciding with literature data by Qin and coworkers:

  1. the structure must be in closed conformation, since crystallization was
     carried out in acidic milieu.

  2. disulfide bonds occur between C66 and C158 with Ca-Ca distance of 6.04 Å
     (as compared to 5.91 Å from literature) and C106 and C119 with Ca-Ca
     distance of 3.90 Å (as compared to 3.83 Å), respectively.

  3. a salt bridge is possible between V1 and E108

  4. the putative substrate binding site consists of L10, T12, V15, A41, V43,
     L46, L54, I56, L58, L71, A73, A80, F82, I84, L92, L94, L103, L105, M107
     differing in 6 positions from bovine b-lactoglobulin but not in the four
     highly conserved positions among lipocalins (L10, L54, L58, F82).

  5. in the closed conformation, an H-bond is likely to occur between OE2 of
     E89 which is the key residue in loop movement, and O of S116.



Reference for bovine b-lactoglobulin structure

Qin, B.Y., Bewley, M.C., Creamer, L.K., Baker, H.M., Baker, E.N. and Jameson,
G.B. (1998)  "Structural basis of the Tanford transition of bovine
beta-lactoglobulin" Biochemistry 37:14014-14023.


References for SwissProt

Bairoch A. and Apweiler R. (2000) "The SWISS-PROT protein sequence database
and its supplement TrEMBL in 2000". Nucleic Acids Res. 28:45-48.

Bairoch A. and Apweiler R. (1997) "The SWISS-PROT protein sequence database:
its relevance to human molecular medical research". J. Mol. Med. 75:312-316.

Apweiler R., Gateau A., Contrino S., Martin M.J., Junker V., O'Donovan C.,
Lang F., Mitaritonna N., Kappus S. and Bairoch A. (1997) "Protein sequence
annotation in the genome era: the annotation concept of SWISS-PROT + TREMBL".
In: ISMB-97; Proceedings 5th International Conference on Intelligent Systems
for Molecular Biology, pp33-43, AAAI Press, Menlo Park, CA, USA.

Bairoch A. (1997) "Proteome databases". In: Proteome research: new frontiers
in functional genomics, Wilkins M.R., Williams K.L., Appel R.D., Hochstrasser
D.F., Eds., pp93-132, Springer Verlag, Heidelberg. ISBN: 3-540-62775-8.

Moller S., Leser U., Fleischmann W. and Apweiler R. (1999) "EDITtoTrEMBL: a
distributed approach to high-quality automated protein sequence annotation".
Bioinformatics 15:219-227.

Fleischmann W., Moller S., Gateau A. and Apweiler R. (1999) "A novel method
for automatic functional annotation of proteins". Bioinformatics 15:228-233.

O'Donovan C., Jesus Martin M., Glemet E., Codani J.J. and Apweiler R. (1999)
"Removing redundancy in SWISS-PROT and TrEMBL". Bioinformatics 15:258-259.


References for SwissModel

Peitsch MC (1995) "ProMod: automated knowledge-based protein modelling tool."
PDB Quarterly Newsletter 72:4.

Peitsch MC (1995) "Protein modelling by E-Mail". Bio/Technology 13:658-660.

Peitsch MC (1996) "ProMod and Swiss-Model: Internet-based tools for automated
comparative protein modelling". Biochem. Soc. Trans. 24:274-279.

Peitsch MC and Guex N (1997) "Large-scale comparative protein modelling". In:
Proteome research: new frontiers in functional genomics, p 177-186, Wilkins
MR, Williams KL, Appel RD, Hochstrasser DF eds., Springer Verlag, Heidelberg.
ISBN: 3-540-62775-8.

Guex N and Peitsch MC (1997) "SWISS-MODEL and the Swiss-PdbViewer: An
environment for comparative protein modelling". Electrophoresis 18:2714-2723.

Guex N and Peitsch MC (1999) "Molecular modelling of proteins". Immunology News 6:132-134.

Guex N, Diemand A and Peitsch MC (1999) "Protein modelling for all". TiBS 24:364-367.


References for Swiss-PDBViewer

Guex N, Diemand A and Peitsch MC (1999) "Protein modelling for all". TiBS 24:364-367.

Guex, N. and Peitsch, M.C. (1997) "SWISS-MODEL and the Swiss-PdbViewer: An
environment for comparative protein modeling". Electrophoresis 18, 2714-2723.

Guex, N and Peitsch, M.C.(1996) "Swiss-PdbViewer: A Fast and Easy-to-use PDB
Viewer for Macintosh and PC". Protein Data Bank Quarterly Newsletter 77, 7.

Guex, N.(1996) "Swiss-PdbViewer: A new fast and easy to use PDB viewer for the
Macintosh". Experientia 52, A26.


Reference for WhatCheck

Vriend G. http://www.sander.embl-heidelberg.de/whatcheck/


References for Predict Protein

Rost, B. (1996) "PHD: predicting one-dimensional protein structure by profile
based neural networks". Methods Enzymol., 266:525-539.

Rost, B. and Sander, C. (1993) "Prediction of protein secondary structure at
better than 70% accuracy". J. Mol. Biol., 232:584-599.

Rost, B. and Sander, C. (1994) "Combining evolutionary information and neural
networks to predict protein secondary structure". Proteins, 19:55-77.

Rost, B. and Sander, C. (1994) "Conservation and prediction of solvent
accessibility in protein families". Proteins, 20:216-226.

Reference for 3D-PSSM

Kelley LA, MacCallum RM & Sternberg MJE (2000) "Enhanced Genome Annotation
using Structural Profiles in the Program 3D-PSSM". J. Mol. Biol. 299(2),
501-522.

Reference for Gromos

van Gunsteren, W.F. and Berendsen, H.J.C. (1987) "Groningen molecular
simulation (GROMOS) library manual". Groningen. biomos.


WXW , 273

number of submitted models: 139

Comparative modeling of protein tertiary structures based on secondary structure alignment

Xiongwu Wu

WXW Info., Inc
email:
wxw@giccs.georgetown.edu

The target protein sequence is aligned with the protein secondary structural
segments taken from the proteins in the selected protein list provided
by Dr. Uwe Hobohm. For each residue, the secondary structure is derived from
the best-aligning segments.

Based on the secondary structures derived by the above approach, the target
sequence is aligned to the selected proteins according to secondary structural
identity, and the best match is used as the template for comparative modeling.


Ram-Samudrala , 028

number of submitted models: 207

Handling interconnected structural changes in comparative modelling of
proteins using a statistical scoring function, graph theory, and exhaustive enumeration techniques

Ram Samudrala and Michael Levitt

Stanford University
email:
ram@csb.stanford.edu

The interconnected nature of interactions in protein structures,
thorough sampling of side chain and main chain conformations, and
devising a discriminatory function that can distinguish between
correct and incorrect conformations are the major hurdles preventing
the construction of accurate homology models. We present an algorithm
that uses graph theory to handle the problem of
interconnectedness. Sampling of side chain and main chain
conformations is accomplished by exhaustively enumerating all possible
choices using a discrete state model, including fragments from a
database of protein structures.  The optimal combination of these
possibilities is selected using an all-atom scoring function aided by
the graph-theoretic approach.

Following is a brief description of the components and steps of this
method, which can be divided into: discriminatory function,
identification of template and generation of alignment, initial model
building, construction of variable main chain and side chain regions,
and moving models closer to the native conformation.

0. DISCRIMINATORY FUNCTION: the function used throughout generally is
an all-atom distance-dependent conditional probability discriminatory
function based on a statistical analysis of known protein
structure. The negative log of the conditional probability of
observing two atoms interact given a particular distance is used as a
``pseudo-energy'' term.  Reference: J Mol Biol 275: 893-914 (1998).

1. IDENTIFICATION OF TEMPLATE AND GENERATION OF ALIGNMENT: The CAFASP
meta-server data were used to identify the proteins that a given
target sequence was related to (based on a consensus of all the hits 
produced by the different servers). The alignments generated by the 
different servers were then used to construct initial models. The 
initial models were then ranked by our discriminatory function and the 
models that ranked highest were used for further model-building. 
 
2. INITIAL MODEL BUILDING: Following the sequence alignment, for each 
parent structure, an initial model was generated by copying atomic 
coordinates for the main chain (excluding any insertions) and for the 
side chains of residues that are identical in the target and parent 
structures.  Residues that differ in type were constructed using a 
minimum perturbation technique.  The MP method changes a given amino 
acid to the target amino acid preserving the values of equivalent chi 
angles between the two side chains, where available. The other chi 
angles are constructed by the MP method using an internally developed 
library based on residue type. 
 
3. CONSTRUCTION OF VARIABLE MAIN CHAIN AND SIDE CHAIN REGIONS:  
 
Main chain sampling is performed using an exhaustive enumeration 
technique based on discrete states of phi/psi angles. For longer main 
chain regions, we use fragments (3-tuples) from a database of protein 
structures to generate the discrete phi/psi angles. 
 
Side chains possibilities are generated by selecting the most probable 
side chain rotamers based on the interactions of a given rotamer with 
the local main chain (evaluated using the discriminatory function 
above). Reference: Samudrala R, Moult J. Prot. Eng.  11: 991-997, 
1998. 
 
We then use a graph-theoretic approach to assemble the sampled side 
chain and main chain conformations together in a consistent manner. 
Each possible conformation of a residue is represented using the 
notion of a node in a graph.  Each node is given a weight based on the 
degree of the interaction between its side chain atoms and the local 
main chain atoms.  The weight is computed using an all-atom conditional
probability discriminatory function. Edges are then drawn between 
pairs of residues/nodes that are consistent with each other (i.e., 
clash-free and satisfying geometrical constraints). The edges are also 
weighted according to the probability of the interaction between atoms 
in the two residues. Once the entire graph is constructed, all the 
maximal sets of completely connected nodes (cliques) are found using a 
clique-finding algorithm. The cliques with the best probabilities 
represent the optimal combinations of mixing and matching between the 
various possibilities, taking the respective environments into 
account.  Reference: J Mol Biol 279:287-302 (1998).  Clique-finding is
accomplished using the Bron and Kerbosch algorithm.  Reference:
Communications of the ACM, 16: 575-577 (1973). 
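
The clique-finding step can be sketched as follows, using the basic (non-pivoting) form of the Bron and Kerbosch algorithm. The toy graph, its node names and its weights are hypothetical, and for brevity the cliques are ranked by node weights only, whereas the method described above also weights the edges.

```python
def bron_kerbosch(adjacency):
    """Yield every maximal clique of an undirected graph.

    adjacency maps each node to the set of nodes it is consistent with.
    This is the basic form of the algorithm, without pivoting.
    """
    def expand(clique, candidates, excluded):
        if not candidates and not excluded:
            yield frozenset(clique)
            return
        for node in list(candidates):
            yield from expand(clique | {node},
                              candidates & adjacency[node],
                              excluded & adjacency[node])
            candidates = candidates - {node}
            excluded = excluded | {node}
    yield from expand(set(), set(adjacency), set())

# Hypothetical toy graph: each node is a (residue, conformation) choice
# with a pseudo-energy weight (lower is better).
weights = {
    ("R1", "a"): -1.0,
    ("R1", "b"): -0.5,
    ("R2", "a"): -2.0,
    ("R3", "a"): -0.3,
}
# Edges join choices that are clash-free with each other. Two
# conformations of the same residue are never joined.
edges = [
    (("R1", "a"), ("R2", "a")),
    (("R1", "a"), ("R3", "a")),
    (("R2", "a"), ("R3", "a")),
    (("R1", "b"), ("R2", "a")),
]
adjacency = {node: set() for node in weights}
for u, v in edges:
    adjacency[u].add(v)
    adjacency[v].add(u)

cliques = list(bron_kerbosch(adjacency))
best = min(cliques, key=lambda c: sum(weights[node] for node in c))
```

Because alternative conformations of one residue are never connected, every clique contains at most one conformation per residue, so each clique is a self-consistent partial model.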
 
All models used were refined using ENCAD. 
 
5. MOVING MODELS CLOSER TO THE NATIVE CONFORMATION: 
 
Once we had generated a final model for each parent, we used  
an off-lattice fourteen-state phi/psi model and a sequential 
build-up algorithm to generate structures around the conformational 
space of the final model. We then used our scoring function to select 
the best ranking ones. The hope here is that some of the conformations
sampled will actually be closer to the native conformation and that
our scoring function will be able to select them.
 
We test how the above approach works in a comparative-modelling 
scenario and assess the predictive power of this method by applying it 
to properly controlled blind tests as part of the fourth meeting on 
the Critical Assessment of protein Structure Prediction methods 
(CASP4). Compared to CASP2 and CASP3, where a similar approach was
used, we have improved the method used to sample main chains and have 
made minor enhancements to the other components of this approach 
including the scoring function. The biggest change is in our attempt 
to move models closer to the final answer. It remains to be seen how 
the improvements in methodology correlate with model accuracy. 
 


LAMBERT-Christophe , 218

number of submitted models: 45

ESyPred3D: an Expert System for the Prediction of the protein 3D structure

C. Lambert , N. Léonard, K. de Fays and E. Depiereux

University of Namur, Belgium
email:
christophe.lambert@fundp.ac.be

 
The aim of our work is to propose a reliable automatic method for homology
modeling, especially when the protein of interest shares a low percentage of
identities (20-30%) with the chosen template.

Our strategy consists of the usual steps for homology modeling: search for the
template in databanks, target-template alignment and modeling. Currently, our
method does not provide any assessment of the model.

For the template search, we used four iterations of
PSI-BLAST[1] on the non-redundant protein database (nr) of the NCBI. All
sequences having an expected value lower than 0.001 are included in the profile
building. The template is chosen as the sequence of known structure (PDB) that
has the lowest expected value. The search in the nr databank also gives us a
large number of similar sequences.

As far as possible, two sets of sequences are built. The first one contains
the 50 best hits below the expected value cutoff of 0.001. The second one
contains a subset of these sequences, after dropping those that are too
redundant. This approach aims to create different conditions for running the
multiple alignment programs and to extract different consensus alignments, in
order to raise the confidence of the sequence-structure alignment.

The two sets are then submitted to five alignment programs: ClustalW[7],
Dialign2[5], Match-Box[3], Multalin[2] and PRRP [4]. A pairwise alignment
between the target and template sequences is extracted from each multiple
alignment and the final sequence-structure alignment is obtained from the
consensus between all the pairwise alignments including the one provided by
PSI-BLAST. A three-dimensional model is built using MODELLER[6] version 4 on
this final alignment.
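
One way to form a consensus from several pairwise alignments is sketched below. The abstract does not say how the consensus is computed, so the strict-majority rule, the function names and the 0.5 threshold are all assumptions made for illustration.

```python
from collections import Counter

def aligned_pairs(target_row, template_row, gap="-"):
    """Convert two gapped rows into a set of (target, template) index pairs."""
    if len(target_row) != len(template_row):
        raise ValueError("alignment rows must have equal length")
    pairs = set()
    i = j = 0
    for a, b in zip(target_row, template_row):
        if a != gap and b != gap:
            pairs.add((i, j))
        if a != gap:
            i += 1
        if b != gap:
            j += 1
    return pairs

def consensus_pairs(alignments, min_fraction=0.5):
    """Keep residue pairs found in more than min_fraction of the alignments."""
    counts = Counter()
    for target_row, template_row in alignments:
        counts.update(aligned_pairs(target_row, template_row))
    threshold = min_fraction * len(alignments)
    return sorted(pair for pair, n in counts.items() if n > threshold)

# Hypothetical example: two programs agree, one shifts the target by one.
example = [
    ("ACDE", "ACDE"),
    ("ACDE", "ACDE"),
    ("ACDE-", "-ACDE"),
]
kept = consensus_pairs(example)
```

With a strict majority, any two retained pairs occur together in at least one input alignment, so the retained pairs can never cross each other and always form a valid (possibly partial) alignment.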

For the purpose of the CASP experiment, two other models were built:
- one from the rough sequence-structure alignment provided by PSI-BLAST[1]
- one from the consensus of all alignment methods except PSI-BLAST.


1. Altschul SF, et al. (1997). Nucleic Acids Research 25(17): 3389-3402
2. Corpet F (1988) Nucl. Acids Res. 16:10881-10890.
3. Depiereux E, et al. (1997). Comput. Appl. Biosci. 13(3): 249-256.
4. Gotoh O (1996) J. Mol. Biol. 264:823-838
5. Morgenstern, B. (1999). Bioinformatics 15(3): 211-8.
6. Sali A and Blundell TL (1993). Journal of Molecular Biology 234(3): 779-815.
7. Thompson JD, et al. (1994). Nucleic Acids Research 22(22): 4673-4680.


TUDELFT , 155

number of submitted models: 44

COMPLETION AND REFINEMENT OF 3D HOMOLOGY MODELS WITH RESTRICTIVE
MOLECULAR DYNAMICS

Jaap A. Flohil and Simon W. de Leeuw

Delft University of Technology
email:
j.a.flohil@tn.tudelft.nl

A method is presented to refine models built by homology by the use of
restrictive molecular dynamics (MD) techniques. The basic idea behind this
method is that structure validation software is used to determine for each
residue the likelihood that it is correctly modeled. This information is used to
determine restraints in the MD simulation, which is used for model refinement.
Residues that are likely to be positioned correctly according to the validation
software should be strongly constrained or restrained in the MD simulation,
whereas residues likely to be positioned incorrectly should be left free.

The BLAST2P (Altschul et al, 1990; J. Mol. Biol. 215:403-410) server at the
EMBL was used to find a template that shows at least 50% sequence identity to the
target sequence.

After side-chain modeling, artifacts of the modeling process have been
detected by automated procedures based on the structure validation modules of
WHAT IF (Vriend, 1990; J. Mol. Graph. 8, 52-56).

If the alignment procedure indicated that insertions had to be made, glycines
were inserted consecutively by a shrink-insert-expand procedure
(similar to the procedure used for T0058, but now with multiple loop residues),
followed by an energy minimization after each inserted glycine. After expansion
of the final glycine, the formed loop was mutated into the target sequence,
and added to the selection of freely moving residues.
 

The initial model was created after an energy minimization (EM) of 100 steps
steepest descent with GROMACS (Berendsen et al, 1995; Comp. Phys. Comm. 95 
pp. 43-56), to remove Van der Waals overlap and to adapt to the GROMOS96 force 
field. 

The quality of residues according to the validation software is used to
determine the strength of restraints in the 1 ns restricted MD simulation.
Depending on the magnitude of the modeling errors reported in
the model checks, and the position of modeled indels, residues were selected
to move freely.

After adding water and a short additional EM, a 10 ps run with position
restraints (1000 kJ mol-1 nm-2) on the protein was done to equilibrate the
water. A 1-5 ns refinement run was performed in explicit water, with harmonic
restraints of 10000 kJ mol-1 nm-2 on all heavy atomic coordinates. Every
picosecond, a frame in the MD trajectory was analyzed on the formation of
non-bonded backbone-backbone contacts between free and restrained groups, and
internal protein contacts. All frames were clustered into RMSD groups, and the
average frame of each cluster was selected for submission. The model priority
was given by the ranking of interatomic contacts. B-factors were calculated
from RMS fluctuations along the trajectory.
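
Two quantities used above can be written out explicitly. The sketch assumes the usual harmonic form E = 0.5 * k * |r - r0|^2 for a position restraint and the standard isotropic relation B = (8 * pi^2 / 3) * <|r - <r>|^2> between B-factors and mean-square fluctuations. The function names are hypothetical and the code is an illustration, not part of GROMACS.

```python
import math

def restraint_energy(position, reference, force_constant):
    """Harmonic position restraint, E = 0.5 * k * |r - r0|^2.

    Positions in nm and force_constant in kJ mol-1 nm-2 give kJ mol-1.
    """
    squared = sum((p - r) ** 2 for p, r in zip(position, reference))
    return 0.5 * force_constant * squared

def b_factor(trajectory):
    """Isotropic B-factor of one atom from its positions along a trajectory.

    Uses B = (8 * pi^2 / 3) * <|r - <r>|^2>. The result is in the square
    of the input length unit (nm^2 here; multiply by 100 for A^2).
    """
    n = len(trajectory)
    mean = [sum(frame[axis] for frame in trajectory) / n for axis in range(3)]
    msf = sum(
        sum((frame[axis] - mean[axis]) ** 2 for axis in range(3))
        for frame in trajectory
    ) / n
    return 8.0 * math.pi ** 2 / 3.0 * msf
```

For example, an atom displaced by 0.1 nm from its reference costs about 5 kJ mol-1 with the 1000 kJ mol-1 nm-2 restraint and about 50 kJ mol-1 with the 10000 kJ mol-1 nm-2 restraint.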
 
Protein fragments seem to be sufficiently compliant to find a native-like
state for incorrectly modeled fragments in a strongly constrained framework. At
present, model refinement with molecular dynamics generally leads to a model
that is less like the experimental structure (Levitt et al, 1999; Nature
Structural Biology, Vol. 6, No. 2, February 1999), but extrapolating from the
results, a significant improvement in model quality might be possible in the
near future.
 
 
 


Sali-Andrej , 237

number of submitted models: 13

Comparative protein structure modeling by Modeller-6

Andras Fiser, Marc Marti-Renom, Ash Stuart, Andrej Sali

The Rockefeller University
email:
fisera@rockefeller.edu

 
Template structures were identified primarily by programs Psi-Blast 
(1) and MODELLER (2). In the difficult cases, additional information 
was obtained by threading programs including GenThreader (3) and 
3D-PSSM (4). In general, several template combinations were explored, 
finally selecting those template combinations that resulted in the 
best model, as assessed by ProsaII (5). 
   
Initial alignments were generated with programs Psi-Blast (1), ALIGN 
(2), Align2D (2), ClustalW (6),and ALIGN4D (7). ALIGN2D and ALIGN4D 
align profiles with profiles. In the difficult cases, multiple 
sequence profiles were generated manually for the target and the
templates, and then aligned by ALIGN2D. In the case of several 
comparable templates, the multiple sequence alignment derived from the 
structural superposition of the templates guided the profile for the 
template structures. For structural superposition, either MALIGN3D (2) 
or CE (8) was used. Alignments were generally hand-edited. In 
addition, a new approach that automatically combines the best few 
alignments using a self-consistent field method was applied to 
optimize the alignments (9). In the low template-target sequence 
identity cases (~10-15% sequence identity), predicted secondary
structure (JPRED (10)) was taken into account to refine the final 
alignment. 
   
The models were built by Modeller-6. The input to the program was the 
alignment between the target sequence and template structure(s). The 
output obtained without any user intervention was a model with all 
non-hydrogen atoms. When possible, additional restraints from ligand 
binding were included. Insertions and deletions that required a 
refinement were modeled by an automated ab initio loop modeling method 
(11). 
   
All models were checked by PROCHECK (12) for proper stereochemistry 
and by PROSA (5) for energetically favorable non-bonded contacts. When 
unfavorable stereochemistry or non-bonded contacts were identified, 
the loop modeling protocol (11) was used to refine the offending 
segments. 
   
(1) S.F. Altschul, T.L. Madden, A.A. Schaffer, J. Zhang, Z. Zhang, W. Miller and
D.J. Lipman (1997) Nucl. Acids Res., 25,3389-3402 
(2) A. Sali and T. L. Blundell, (1993) J. Mol. Biol. 234,779-815
(3) Jones, D.T. (1999) J.Mol Biol. 287: 797-815 
(4) Kelley L.A, MacCallum R.M. and  Sternberg M.J.E (2000). J. Mol. Biol. 299, 
501-522 
(5) M. J. Sippl, (1993) Proteins, 17,355-362 
(6) J.D. Thompson, D.G. Higgins and T.J. Gibson,(1994) Nucl. Acids 
Res.,22,4673-4680, 
(7) M. Marti-Renom, M. S. Madhusudhan and A. Sali, in preparation 
(8) Shindyalov I.N. and Bourne P.E. (1998) Prot. Eng., 11, 739-747. 
(9) A. Fiser and A. Sali, in preparation 
(10)Cuff J.A., Clamp M.E., Siddiqui A.S., Finlay M., Barton G,J., (1998) 
Bioinformatics, 14, 892-893, 
(11)A. Fiser, RKG. Do and A. Sali (2000) Prot. Sci. 9, 1753-1773 
(12)R.A. Laskowski, M.W. MacArthur, D.S. Moss and J.M. Thornton,(1993) J. Appl.
 Cryst.,26,283-291 


123D+ , 389

number of submitted models: 214

Sequence-structure alignment with 123D+ server.

Nickolai N. Alexandrov

Ceres, Inc.
email:
nicka@ceres-inc.com

 
The 123D+ server compares a target sequence with a set of protein domains from the ASTRAL
non-redundant set (version 1.50, 50% identity list). For every residue in the domain,
the following information is derived from the PDB files: (i) residue type (amino 
acid in SEQRES field), (ii) secondary structure, assigned by Stride, and (iii) the 
number of contacts with other residues. Domain profiles are created by psi-blast 
run against NR database. Similarly, psi-blast profile is also created for a target 
sequence. Secondary structure of a target is predicted by probabilistic approach 
from statistics of amino acid pairs in a sliding window of 17 residues. Similarity 
score between position i in target and position j in domain is computed as: 
log((Paa*Pss*Pcc)/(P'aa*P'ss*P'cc)), where Paa is a probability to have the same 
amino acid in i and j, computed from the psi-blast profiles; Pss is a probability 
to have the same secondary structure; and Pcc is a probability to have the same 
number of contacts, computed from the contact capacity potentials for every 
residue type. P'aa, P'ss, and P'cc are correspondent expected probabilities. 
123D+ uses dynamic programming to find an optimal sequence-structure alignment. 
In addition to standard events of match, deletion, and insertion, the algorithm 
features a choice of residues not to be aligned, which helps to deal with 
different loop conformations. As default alignment mode was used fit, where the 
whole domain is required to be aligned with a part of the target sequence. 
123D+ was benchmarked with ASTRAL set of domains and outperformed psi-blast in 
fold recognition. 123D+ is available at 
http://www-lmmb.ncifcrf.gov/~nicka/run123D+.html. 
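
The per-position score described above can be illustrated with a minimal Python
sketch. It only evaluates the log-odds formula quoted in the abstract; the
function name and the probability values are hypothetical and are not taken
from the 123D+ server.

```python
import math

def position_score(p_aa, p_ss, p_cc, exp_aa, exp_ss, exp_cc):
    """Log-odds similarity between target position i and domain position j.

    p_*   : probabilities of agreement in amino acid, secondary structure
            and contact number (Paa, Pss, Pcc in the abstract).
    exp_* : the corresponding expected probabilities (P'aa, P'ss, P'cc).
    """
    observed = p_aa * p_ss * p_cc
    expected = exp_aa * exp_ss * exp_cc
    return math.log(observed / expected)

# Hypothetical probabilities, chosen only for illustration.
favourable = position_score(0.20, 0.60, 0.30, 0.05, 0.33, 0.20)
neutral = position_score(0.05, 0.33, 0.20, 0.05, 0.33, 0.20)
```

A position pair that agrees more often than expected scores above zero, and a
pair matching the expected probabilities scores zero. Scores of this kind would
fill the matrix used by the dynamic programming step.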


VENCLOVAS , 406

number of submitted models: 13

Sequence-structure alignment selection by 3D structure evaluation

Ceslovas Venclovas

Lawrence Livermore National Laboratory
email:
venclovas@llnl.gov

 
The comparative modeling method used to build models for CASP4 is a modification
of the one used at CASP3 and described in more detail in the special Proteins
issue (Venclovas et al., 1999). The major steps of the procedure are briefly
described below.

Parent (template) selection
 
PDB templates were identified either using a Smith-Waterman (Smith &
Waterman, 1981) search against the PDB for high homology targets, or using a
PSI-BLAST (Altschul et al., 1997) search against the non-redundant NCBI sequence
database. Usually more than one template was used to build models.

Sequence-structure alignments
 
Sequence-structure alignments were generated and tested both at the sequence
level and at the 3D level. For high homology targets, where the structural
template(s) were among closely related sequences, multiple sequence alignment
analysis was used first. This step consisted of producing a series of multiple
sequence alignments for the same set of sequences using systematic variation
of parameters. The regions where variation of the parameters did not affect
the alignment were tabulated, and the alignment within these regions was used to
build a model. In the case of distant homology targets, the results of the initial
PSI-BLAST search were used in an intermediate sequence search procedure as a
first step towards generating a sequence-structure alignment. In this procedure,
a set of sequences that bridge the sequence space between the target sequence and
the template(s) were used as probes to search the non-redundant sequence
database. Target-template sequence alignments were extracted from the resulting
search data and their consistency was analyzed. For regions where one dominant
alignment variant was produced, this variant was used to build a model. If
there were several variants, all of them were tested by building a model and
evaluating its consistency with the 3D structure. Alignments for some regions that
were expected to be structurally conserved, but could not be aligned by
PSI-BLAST, were derived manually using PSIPRED (Jones, 1999) secondary
structure predictions as a guide.
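
The tabulation of regions that do not change when alignment parameters are
varied can be sketched as follows. This is a minimal illustration and not the
procedure actually used: each alignment variant is assumed to be a dictionary
mapping a target residue number to the template residue number it is aligned
with, and all names and values are hypothetical.

```python
def stable_regions(alignments):
    """Return (start, end) runs of target positions that are aligned to the
    same template position in every alignment variant."""
    # Only target positions present in all variants can be compared.
    positions = sorted(set.intersection(*(set(a) for a in alignments)))
    stable = [
        p for p in positions
        if alignments[0][p] is not None
        and len({a[p] for a in alignments}) == 1
    ]
    # Merge consecutive stable positions into contiguous regions.
    regions = []
    for p in stable:
        if regions and p == regions[-1][1] + 1:
            regions[-1] = (regions[-1][0], p)
        else:
            regions.append((p, p))
    return regions

# Two hypothetical variants obtained with different gap penalties.
variant_a = {1: 1, 2: 2, 3: 3, 4: 5, 5: 6, 6: 7}
variant_b = {1: 1, 2: 2, 3: 4, 4: 5, 5: 6, 6: 7}
regions = stable_regions([variant_a, variant_b])
```

Positions outside the returned regions are the ones for which alternative
alignment variants would need to be tested at the 3D level.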

Selecting sequence-structure alignments by model evaluation
 
Final sequence-structure alignments were selected by building and evaluating
3D models. If sequence methods suggested several major alternative alignments
for a specific region, all of them were tested by building and evaluating the
corresponding models. In most cases, models were built not only for the target
protein but also for its close homologs, in an attempt to make a better judgement
regarding the correct alignment in the questionable regions. Evaluation of 3D
models was done using ProsaII (Sippl, 1993) (comparing Z-scores of models for
several homologs generated using alternative alignments) and by visual
inspection with emphasis on detecting buried charged and hydrophilic side
chains.

Loop modeling
 
Regions that were expected to differ in target structure compared to the
template(s) were defined as loops. Coordinates for these regions were assigned
after suitable fragments from PDB structures were found. The preference was
given to the proteins related to the target. Otherwise the conformation which
was dominant in results of fragment search was assigned to the targeted
region.

Generating 3D structures

Models for evaluation purposes were built either with the Homology module
(InsightII) or with MODELLER (Sali & Blundell, 1993). The final models were
generated using MODELLER with subsequent side chain rebuilding using SCWRL
(Bower et al., 1997). The model structures were then verified with the Whatcheck
function of the WHATIF package (Vriend, 1990), and detected severe steric clashes
were relieved. No energy minimization procedures were used.

References:

Altschul, S. F., Madden, T. L., Schaffer, A. A., Zhang, J., Zhang, Z., Miller,
W. & Lipman, D. J. (1997). Gapped BLAST and PSI-BLAST: a new generation of
protein database search programs. Nucleic Acids Res 25(17), 3389-3402.

Bower, M. J., Cohen, F. E. & Dunbrack, R. L., Jr. (1997). Prediction of
protein side-chain rotamers from a backbone-dependent rotamer library: a new
homology modeling tool. J Mol Biol 267(5), 1268-1282.

Jones, D. T. (1999). Protein secondary structure prediction based on
position-specific scoring matrices. J Mol Biol 292(2), 195-202.

Sali, A. & Blundell, T. L. (1993). Comparative protein modelling by
satisfaction of spatial restraints. J Mol Biol 234(3), 779-815.

Sippl, M. J. (1993). Recognition of errors in three-dimensional structures of
proteins. Proteins 17(4), 355-362.

Smith, T. F. & Waterman, M. S. (1981). Identification of common molecular
subsequences. J Mol Biol 147(1), 195-197.

Venclovas, C., Ginalski, K. & Fidelis, K. (1999). Addressing the issue of
sequence-to-structure alignments in comparative modeling of CASP3 target
proteins. Proteins Suppl. 3, 73-80.

Vriend, G. (1990). WHAT IF: a molecular modeling and drug design program.
J Mol Graph 8(1), 52-56, 29.


CHEN-WENDY , 429

number of submitted models: 14

On the knowledge of comparative modeling

Shu-wen W. Chen and Jean-Luc Pellequer

The Scripps Research Institute
email:
pelleque@scripps.edu

Our contribution to the CASP4 experiment consisted in testing the extent of
our structural knowledge underlying model building. Models presented at CASP4
were built using partially automated programs and computational graphics.

To identify structural templates, we used either a Blast search or fold
recognition programs. The most frequently used fold recognition program was
mGenTHREADER (Jones, 1999) but others found at the CAFASP2 experiment web site
were also used. When multiple structural templates were proposed, we used
known experimental data such as co-factor binding, substrate/product structure
to assist our selection. In case multiple structural templates were selected,
we manually aligned these structures onto each other. To select the final
template (we only mixed small regions from various templates) we looked at
crystallographic resolution, refinement statistics, and phi,psi dihedral
angles distribution.

Sequence alignments were initially performed with the BESTFIT or GAP program (GCG
Inc, Madison, WI) with the BLOSUM62 substitution matrix (Henikoff and
Henikoff, 1992). Gap weights were fitted to obtain the longest alignment with
the smallest number of gaps. Final alignments were manually adjusted by
examining available structural templates. Attention was focused on Gly
replacements, the presence of side-chain to main-chain hydrogen bonds, and
irregularities in secondary structure elements.

The structural template backbone was transferred to the target sequence. Side
chains were substituted using the top rotamer from the library of Tuffery et
al. (1991). Side chains of conserved residues were kept rigid. Substituted
side-chain conformations were optimized using a self-consistent rotamer search
procedure and subsequently refined in torsion angle space by the Nelder-Mead
simplex minimization algorithm (Chen and Pellequer, unpublished). The CHARMM22
all-atom force field parameters were used for scoring (Brooks et al., 1983).

Deletions and insertions were modeled using a self-consistent loop closure
algorithm with or without side chain flexibility (Chen and Pellequer,
unpublished). A discrete representation of the phi,psi conformational space
was used. Then, several cycles of manual rebuilding and refining were carried
out using Turbo-Frodo (Roussel and Cambillau, 1989). When possible, co-factors
and substrates were included in modeling (without attempt to refine them). To
further remove steric clashes and refine the final geometry, all side-chain
atoms were energy-optimized using Xplor 3.8 (Brünger, 1992) followed by a
brief all-atom minimization.


kitasato-univ. , 047

number of submitted models: 122

Comparative Modeling using FAMS -Full Automatic Modeling System

Mitsuo Iwadate, Kazuyoshi Ebisawa, Youji Kurihara, Mayuko Takeda-Shitaka and Hideaki Umeyama

Kitasato University
email:
umeyamah@pharm.kitasato-u.ac.jp

 
We introduce a method of homology modeling consisting of database searches and
simulated annealing. The method involves searches for homologous proteins,
alignment, construction of Ca atoms, construction of main-chain atoms, and the
construction of side-chain atoms. All processes after alignment are performed
automatically. Searches for homologous proteins and alignment are based on
PSI-BLAST raw output. The raw output is then modified, taking the hydrophobic
core and secondary structure into account. In this method, main-chain
conformations are generated from the main-chain coordinates of the reference
protein.  The weighting function is defined by the local space homology
representing the similarity of environmental residues in the reference protein.
Side-chain conformations are generated for constructed main-chain atoms by
database searches, and main-chain atoms are optimized for the fixed side-chain
conformations.  These two processes, i.e., the side-chain generation and
main-chain optimization, are repeated several times.  This type of
construction provides a structure similar to the X-ray structure, in
particular, main-chain and side-chain atoms in the residues belonging to the
structurally conserved regions (SCRs).  To examine the accuracy of our method,
we predicted fourteen proteins whose structures are known.  The average root
mean square deviation between models and X-ray structures was 2.29 A for all
atoms, and the percentage of chi1 angles within 30 degrees was 72.6% for
SCR residues.  Some models were in good agreement with their respective X-ray
structures, but not with the reference structures for homology modeling.
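
The chi1 statistic quoted above can be computed along the following lines. This
is a minimal sketch, not the FAMS evaluation code: the function names and the
angle values are hypothetical, and a difference of exactly 30 degrees is
assumed to count as agreement.

```python
def angle_difference(a, b):
    """Smallest absolute difference between two angles, in degrees,
    taking the 360-degree periodicity into account."""
    diff = abs(a - b) % 360.0
    return min(diff, 360.0 - diff)

def chi1_agreement(model_chi1, xray_chi1, tolerance=30.0):
    """Percentage of residues whose model chi1 lies within `tolerance`
    degrees of the X-ray chi1."""
    pairs = list(zip(model_chi1, xray_chi1))
    hits = sum(1 for m, x in pairs if angle_difference(m, x) <= tolerance)
    return 100.0 * hits / len(pairs)

# Hypothetical chi1 angles (degrees) for four residues.
model = [-60.0, 175.0, 60.0, -170.0]
xray = [-65.0, -178.0, 180.0, 170.0]
percent = chi1_agreement(model, xray)
```

The periodic difference matters for residues near +/-180 degrees, where a naive
subtraction would report a large deviation for nearly identical rotamers.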


Sternberg , 126

number of submitted models: 45

Model Building by Comparison: Selecting and Improving Algorithms via
Expert Knowledge

Paul A. Bates and Michael J.E. Sternberg

Imperial Cancer Research Fund
email:
paul.bates@icrf.icnet.uk


Fully automated comparative model building procedures are generally less
accurate than procedures using some human intervention. Nevertheless, fully
automated procedures are essential for large-scale genome modeling. We are
trying to understand which algorithms are the best to use at each stage in the
model building process.  Towards this aim a fully automatic model building
program called 3D-JIGSAW (http://ww.bmm.icnet.uk/3djigsaw) has been written
and entered into CAFASP2.  This program is currently designed to work at
levels of no less than 40% sequence identity with the closest parent.  The
program is modular with each module centring around a particular algorithm
required in the modeling process.  The program produces intermediate files at
critical modeling steps.  For the 16 target models built for CASP4 (targets easily
assigned to parents of known structure by the program PSI-BLAST (Altschul SF
et al., 1997, Nucl. Acids Res., 25, 3389-3402)), intermediate files were inspected,
altered if thought not to be optimal, and the program restarted from the
appropriate point. In addition, one of the program modules, the critical
alignment module, was changed from that used in the fully automatic version of
3D-JIGSAW.  The modules used in the model building process are similar to
those reported previously (Bates PA and Sternberg MJE, 1999, Proteins Suppl
3, 47-54) and are:
1. Selection of parents:  Parent sequences for the target are selected from a local
   sequence database (database consisting of  the NCBI sequences, nr,
   plus PDB sequences; annotated with data quality parameters such as
   resolution and numbers of missing atoms) using the program PSI-BLAST.
   Up to five parents are selected using a balance of sequence similarity
   and data quality.
2. Extraction of relevant sequences: A selection of sequences are taken
   between the target and parent sequences and hierarchically aligned
   (Barton GJ and Sternberg MJE, 1987, J. Mol. Biol., 198, 327-37).
3. Superpose parents: The selected parents are superimposed via a
   multiple structure alignment algorithm.
4. Align target to parent sequences:  The profile of sequences from step 2
   is aligned to a profile of sequences from step 3.  This strategy worked
   quite well for alignments of target to parent above 40% sequence
   identity, but as most of the targets were below this level a different
   module was used that aligned the best parent PSSM (position-specific
   scoring matrix; generated by PSI-BLAST) and target PSSM with
   adjustments to the metric dependent on the local agreement of known
   and predicted secondary structure.  Predicted secondary structure for the
   target was obtained from the program PSIPRED (Jones DT, 1999, J. Mol. Biol.,
   292, 195-202).
5. Selection of loops to change:  All loops are considered for replacement.
   The boundaries of the loops are taken from the ends of the secondary
   structure elements of the multiple structure alignment.  All loops, and all
   regions with incompatible backbone angles with the target sequence
   were modeled via database fragment searches.  Three databases were
   searched in the order:  (i)  homologous/analogous structures, (ii) loop
   classification database (Oliva B et al., 1997, J. Mol. Biol. 266, 814-830)
   and (iii) non-redundant database, 600 protein chains (sequence similarity
   of less than 25%, R-factor  <= 2.5).   Fragments were selected automatically
   and were chosen on the basis of good sequence similarity with the target
   and how well the fragments fitted to the take-off points both in terms of RMSD
   fit on the Ca atoms used for the take-off points and the difference in C=O
   angles of the backbones between parent and fragment.  A number of loop
   conformations were selected for each gap that joined all pairs of
   superimposed parents.
6. Mean-field calculations on fragments:  From an ensemble of secondary
   structure elements and connecting loops a mean-field calculation is
   performed to select a single element or loop for all sections of the
   target.  The algorithm used is a modification of the self consistent
   mean field approach to gap closure (Koehl P and Delarue M ,1995, Nat.
   Struct. Biol., 2,163-170) .
7. Selection of side-chain rotamers: Side-chains are built by tracing the
   path of the parent side-chain.  The maximum number of bond lengths,
   angles and torsion angles are taken from the parent side-chain that are
   compatible with the new side-chain.   Additional internal co-ordinates to
   complete the side-chain are taken from the secondary structure dependent
   rotamer library (McGregor MJ  et al., 1987, J. Mol. Biol., 198, 295-310).
   After the replacement of all side-chains and the assignment of a single
   rotamer for each, this parent rotamer plus rotamers from a side-chain
   rotamer library (Tuffery P  et al. ,1991, J. Biomol. Struct. Dyn. 8,
   1267-1289) are built at each residue position.  A second mean-field
   calculation is performed to select the most probable rotamer (Koehl P and
   Delarue M, 1994, J. Mol. Biol., 239, 249-275).  The force field used in
   the calculations consisted of a soft atom pair potentials term,
   parameters taken from (Lee C and  Subbiah S, 1991, J. Mol. Biol. 217,
   373-388) and a hydrogen bonding potential term.
8. Energy refinement :  Because the loops are modeled via database searches
   they do not fit perfectly to the take-off points.  Thus, torsion angles
   were adjusted within the loop to give good geometry within the take-off
   regions; a modification of the tweak algorithm (Shenkin PS et al., 1987,
   Biopolymers, 26, 2053-2085) was used for this purpose. To remove the
   small number of steric clashes remaining in the models 100 steps of
   steepest descents energy minimization (unrestrained) were run using
   the program CHARMM (Brooks BR et al., 1983, J. Comp. Chem., 4, 187-217).


Hogue-Feldman , 090

number of submitted models: 22

Comparative Modelling using Maps of Conformational Space

Howard J Feldman, Thanh-Van T Le, John J Salama and Christopher W V Hogue

Samuel Lunenfeld Research Institute, Mount Sinai Hospital
email:
feldman@mshri.on.ca

For targets identified as homology modelling targets, similar sequences 
were identified through a BLAST search.
 
For T0123, we chose 1CJ5 as our template since it had the highest identity
(65%) to T0123 and only one gap.  The angles between consecutive alpha 
carbons in the template structure as well as virtual dihedrals between
sets of four carbons were recorded.  For the area near the deletion
(LPAQ) the backbone was allowed greater conformational freedom. 
 
For T0099, we chose 1SHF as our initial template since it had the highest 
identity (64%) to T0099 of those found.  However, an alignment using 
ClustalX v1.81 showed that there was a single residue deletion 
near residue 32.  The angles between consecutive alpha carbons 
in the template structure as well as virtual dihedrals between 
sets of four carbons were recorded as above.  For the area near the 
deletion (EKEGD) the backbone from a different template was used (1LCK). 
When placed on the same multiple alignment as the other SH3 domains, 
1LCK does not have any indels near this turn. 
 
These alpha-carbon "trajectory" angles were then plotted on "trajectory 
distributions" and some Gaussian noise was added, effectively making the 
backbone slightly flexible. 
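
The quantities recorded from the template can be sketched in Python as below.
This is an illustration under stated assumptions and not the FOLDTRAJ
implementation: the helper names, the coordinates and the noise width (sigma)
are hypothetical, and the noise is added directly to the angle values rather
than to a trajectory distribution.

```python
import math
import random

def _sub(a, b):
    return tuple(x - y for x, y in zip(a, b))

def _dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def _cross(a, b):
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])

def virtual_angle(p1, p2, p3):
    """Angle (degrees) at p2 formed by three consecutive C-alpha atoms."""
    u, v = _sub(p1, p2), _sub(p3, p2)
    cos_theta = _dot(u, v) / math.sqrt(_dot(u, u) * _dot(v, v))
    return math.degrees(math.acos(max(-1.0, min(1.0, cos_theta))))

def virtual_dihedral(p1, p2, p3, p4):
    """Virtual dihedral (degrees) defined by four consecutive C-alpha atoms."""
    b1, b2, b3 = _sub(p2, p1), _sub(p3, p2), _sub(p4, p3)
    n1, n2 = _cross(b1, b2), _cross(b2, b3)
    y = math.sqrt(_dot(b2, b2)) * _dot(b1, n2)
    x = _dot(n1, n2)
    return math.degrees(math.atan2(y, x))

def add_noise(angles, sigma, rng):
    """Perturb each angle with zero-mean Gaussian noise of width sigma."""
    return [a + rng.gauss(0.0, sigma) for a in angles]

# Four hypothetical C-alpha positions (arbitrary units).
ca = [(0.0, 1.0, 0.0), (0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (1.0, 0.0, 1.0)]
angle = virtual_angle(*ca[:3])
dihedral = virtual_dihedral(*ca)
noisy = add_noise([angle, dihedral], sigma=5.0, rng=random.Random(0))
```

A larger sigma near an indel would correspond to the greater conformational
freedom allowed for the backbone in those regions.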
 
Next, using our FOLDTRAJ algorithm(1), approximately 200 structures were 
generated by Calpha walk, using the coordinates recorded in the trajectory 
distributions.  The rest of the structure was built using the FOLDTRAJ 
algorithm described in the above reference.  Briefly, N, C and O are 
placed to minimize errors in bond angles and bond lengths.  Beta carbons 
are placed according to a look up table dependent on residue and adjacent 
alpha carbon positions.  Sidechains are placed probabilistically using 
Dunbrack's backbone dependent rotamer library(2).  All residues are chirally
and sterically valid, and have a minimum of non-hydrogen van der Waals
collisions.
 
Finally, from the pool of generated structures (all very similar in 
backbone but with different rotamer packings), various statistics 
were collected including radius of gyration, exposed surface area, 
exposed hydrophobic surface area, and empirical energy score according
to two different scoring functions: an atom-based one (3) and a 
residue-based one (4). 
 
The best structures were chosen based on their energy scores, radii 
of gyration and exposed surface area.  The latter two were expected to 
be comparable to the same measures on the template structure(s).  This 
latter step along with BLAST and selection of the template were the only 
non-automated, subjective steps.
 
REFERENCES 
 
1.      Feldman HJ and Hogue CWV.  (2000)  A Fast Method to Sample Real
        Protein Conformational Space.  Proteins.  39(2): 112-131.
 
2.      Dunbrack RLJ and Karplus M.  (1993)  Backbone-dependent rotamer 
        library for proteins.  Application to sidechain prediction. 
        J. Mol. Biol.  230: 543-574.
 
3.      Zhang C, Vasmatzis G, Cornette JL and DeLisi C.  (1997) 
        Determination of Atomic Desolvation Energies From the 
        Structures of Crystallized Proteins.  J. Mol. Biol.  267: 
        707-726. 
 
4.      Bryant SH and Lawrence CE.  (1993)  An Empirical Energy 
        Function for Threading Protein Sequence Through the Folding 
        Motif.  Proteins.  16: 92-112. 
 


InforMax , 022

number of submitted models: 4

A Homology Modeling Algorithm for Protein Tertiary Structure Prediction

Feodor Tereshchenko, Nikolai Daraselia

InforMax Inc.
email:
feodor@informaxinc.com

 

Homology modeling starts from an alignment. This alignment must be made in
such a way that the segments of the target aligned to the secondary structure
elements (alpha-helices and beta-strands) of the template are not interrupted
by gaps.

The 3D coordinates of the target backbone alpha-helices and beta-strands are
generated (copied) from the atomic coordinates of the template backbone. The
user may choose to copy loops of equal length or to model them ab initio.
Loops which connect  secondary structure elements are modeled (ab initio)
using the downhill simplex minimization algorithm (1).

An energy function incorporating distance-dependent residue-residue potentials
(2), validity of valence and dihedral angles formed between the last C-terminal
loop residue and the first N-terminal secondary structure residue, and the
distance between the last loop atom and the first secondary structure atom
allows the loop to close and minimizes its energy at the same time.

E=k1*SUM(R) + k2*D + alpha (D, v),   where
 
R - residue-residue potential for each pair of contacting amino acids (2);
alpha - penalty function;
D - distance between the last C-terminal atom of the loop to be closed and the
first N-terminal atom of the next secondary structure element;
v - valence angle between the last bond of the loop and the first bond of the
secondary structure.
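
As an illustration only, the energy expression above can be written as a small
Python function. The abstract does not give the form of the penalty function
alpha, so the quadratic penalty below, its ideal distance and angle, and all
weights are hypothetical placeholders.

```python
def loop_energy(contact_potentials, gap_distance, valence_angle,
                penalty, k1=1.0, k2=1.0):
    """E = k1*SUM(R) + k2*D + alpha(D, v).

    contact_potentials : residue-residue potentials R for contacting pairs
    gap_distance       : D, distance between the loop end and the next
                         secondary structure element
    valence_angle      : v, angle at the junction, in degrees
    penalty            : callable implementing alpha(D, v)
    """
    return (k1 * sum(contact_potentials)
            + k2 * gap_distance
            + penalty(gap_distance, valence_angle))

def quadratic_penalty(gap_distance, valence_angle,
                      ideal_distance=1.33, ideal_angle=120.0,
                      w_dist=10.0, w_angle=0.01):
    """Hypothetical alpha(D, v): grows as the junction departs from an
    assumed ideal geometry. Not taken from the original method."""
    return (w_dist * (gap_distance - ideal_distance) ** 2
            + w_angle * (valence_angle - ideal_angle) ** 2)

# Hypothetical contact potentials for one loop conformation.
potentials = [-1.5, -0.5, 0.25]
closed = loop_energy(potentials, 1.33, 120.0, quadratic_penalty)
opened = loop_energy(potentials, 6.0, 95.0, quadratic_penalty)
```

With a function of this kind, a simplex minimizer would be driven towards
conformations that both close the loop and keep favourable residue contacts.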

The next group of algorithms is used to place side-chains on the resulting
target backbone. The dihedral angles of the amino acid side-chains (except
those that have no Chi-angles) were extracted from PDB and are stored in a
backbone-dependent rotamer library (3). Each valid PhiPsi combination for each
amino acid has a corresponding probability distribution of Chi1 or Chi2
side-chain dihedral angles. The distribution of Chi3 angles is set as a
function of the Chi1Chi2 combination, and that of Chi4 as a function of the
Chi2Chi3 combination.
The placement of side-chain rotamers starts with those amino acids which are
identical in both the target and the template. Such rotamers are copied and
left unchanged if possible.

The dihedral angles of other side-chains are set in the following way: the
selection of rotamers proceeds from low-index Chi-angles to higher index
Chi-angles (where available). The angles with the same indexes are set at the
same time for all side-chains which were not predicted at the previous step.
The decision to select any particular rotamer from the library is based on the
Chi-angles probability distribution. For each Chi-angle, a mode of the
distribution is selected. After setting up all side-chain Chi-angles, the
algorithm checks for clashes with the backbone and then with the neighboring
side-chains.


Clashes with the backbone are resolved first, by selecting rotamers with lower
probability from the library or, in the case of a multimodal rotamer
distribution, rotamers from a different mode. Clashes between side-chains
are then resolved in the same manner. If this procedure does not resolve
all clashes, the clashing rotamers are joined into a cluster and a complete
search of the backbone-dependent rotamer library is performed.
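
The rotamer selection and clash resolution described above can be sketched as
follows. This is a simplified illustration, not the InforMax code: a rotamer is
reduced to a single Chi1 value, the probabilities are hypothetical, and the
clash test is supplied by the caller.

```python
def ranked_rotamers(distribution):
    """Rotamers ordered from most to least probable."""
    return sorted(distribution, key=distribution.get, reverse=True)

def pick_rotamer(distribution, clashes):
    """Return the most probable rotamer that does not clash, or None if
    every rotamer in the distribution clashes."""
    for rotamer in ranked_rotamers(distribution):
        if not clashes(rotamer):
            return rotamer
    return None

# Hypothetical Chi1 distribution (degrees -> probability) for one residue
# at a given PhiPsi combination.
chi1_distribution = {-60: 0.55, 180: 0.35, 60: 0.10}

first_choice = pick_rotamer(chi1_distribution, lambda chi: False)
fallback = pick_rotamer(chi1_distribution, lambda chi: chi == -60)
unresolved = pick_rotamer(chi1_distribution, lambda chi: True)
```

A result of None corresponds to the case where the clashing residues would be
grouped into a cluster and searched exhaustively.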

1. Nelder, J.A., Mead, R. (1965) A simplex method for function minimization.
Computer J., 7:308-313.

2. Bahar, I., Jernigan, R.L. (1997) Inter-residue potentials in globular
proteins and the dominance of highly specific hydrophilic interactions at
close separation. J. Mol. Biol., 266:195-214.

3. Bower M. J, Cohen F.E., Dunbrack R.L. Jr.(1997) Prediction of protein
side-chain rotamers from a backbone-dependent rotamer library: a new homology
modeling tool. J Mol. Biol., 267:1268-1282.
 


Ginalski , 526

number of submitted models: 4

Comparative Modeling of Selected CASP4 Target Proteins

Krzysztof Ginalski

Department of Biophysics, Institute of Experimental Physics, University of Warsaw
email:
kginal@icm.edu.pl

 
	For the fourth round of Critical Assessment of Techniques for Protein
Structure Prediction (CASP4), four target proteins were modeled using the
comparative modeling technique: 1) beta-lactoglobulin from pig (target T0123;
64% sequence identity with the template closest by sequence), 2) manganese superoxide
dismutase homolog from P. aerophilum (target T0128; 54% sequence identity), 3)
tryptophan synthase alpha subunit from P. furiosus (target T0122; 33% sequence
identity), 4) Sp18 from H. fulgens (target T0125; 18% sequence identity). This
set of target proteins (18-64% sequence identity) was chosen to represent
different levels of difficulty in comparative modeling. The main emphasis was
on generating sequence-to-structure alignments of target sequences with their
respective parent structures. As shown in previous rounds of CASP, this part
of the modeling procedure is the major source of errors.

	Initially, related proteins with known structures were identified with
PSI-BLAST searches [1] performed against the non-redundant protein sequence
database until profile convergence. Additionally, homologous sequences that
matched the targets were also collected. The CLUSTAL W program [2] was used to
generate multiple sequence alignments for sets of sequences containing target,
templates and other related proteins with unknown structure. Opening and
extension gap penalties were systematically changed, and all of the obtained
alignments were inspected for both variability and violation of structural
integrity. Possible sequence-to-structure alignment variants were tested by
building 3D molecular models for the target sequences with the Homology module
of InsightII (MSI Inc., San Diego, CA, USA). Backbone conformation was taken
from the template structure closest by sequence, and only side-chains were
substituted. Modeling of insertion and deletion regions was skipped for the
structures that were built to test the fitness of different alignment
variants. Models were then subjected to detailed evaluation, mainly by visual
inspection of structural consistency and using ProsaII energy profiles [3].
Such a 3D evaluation procedure enabled selection of final
sequence-to-structure alignments.

	Final models of target proteins were built using the MODELLER program [4].
Where possible, more than one template protein was used, after superimposition
of their molecular structures. In some cases after coordinates were assigned
to the target sequence, side-chains were rebuilt using the SCWRL program with
a backbone conformation-dependent rotamer library [5]. To preserve conserved
contacts, and maximize the electrostatic and hydrophobic interactions, the
positions of several side-chains were adjusted manually. Final models were
subjected to energy minimization (100 steps) to remove remaining steric
clashes and improve stereochemistry. All energy optimizations were performed
in the Amber force field [6] with the Discover module of InsightII, using steepest
descent and conjugate gradient methods. The overall quality of each modeled
structure was checked in detail with the WHAT_CHECK program [7].


[1] Altschul, S.F., Madden, T.L., Schaffer, A.A., Zhang, J., Zhang, Z., Miller, W., Lipman, 
    D.J. (1997) Gapped BLAST and PSI-BLAST: a new generation of protein database search  
    programs. Nucleic Acids Res. 25, 3389-3402. 
 
[2] Thompson, J.D., Higgins, D.G., Gibson, T.J. (1994) CLUSTAL W: improving the 
    sensitivity of progressive multiple sequence alignment through sequence weighting, 
    position-specific gap penalties and weight matrix choice. Nucleic Acids Res. 22, 
    4673-4680. 
 
[3] Sippl, M.J. (1993) Recognition of errors in three-dimensional structures of proteins. 
    Proteins 17, 355-362. 
 
[4] Sali, A., Blundell, T.L. (1993) Comparative protein modelling by satisfaction of spatial 
    restraints. J. Mol. Biol. 234, 779-815. 
 
[5] Bower, M.J., Cohen, F.E., Dunbrack, R.L. Jr. (1997) Prediction of protein side-chain 
    rotamers from a backbone-dependent rotamer library: a new homology modeling tool. 
    J. Mol. Biol. 267, 1268-1282. 
 
[6] Weiner, S.J., Kollman, P.A., Case, D.A., Singh, U.C., Ghio, C., Alagona, G., Profeta, S. 
    Jr., Weiner, P. (1984) A new forcefield for molecular mechanical simulation of nucleic 
    acids and proteins. J. Am. Chem. Soc. 106, 765-784. 
 
[7] Hooft, R.W., Vriend, G., Sander, C., Abola, E.E. (1996) Errors in protein structures. 
    Nature 381, 272. 
 


BinToHes , 255

number of submitted models: 43

Ab initio loop modeling with precalculated synthetic loops and sidechain placement

Silvio Tosatto, Eckart Bindewald, Jochen Maydt, Achim Trabold, Juergen Hesser, Reinhard Maenner

Informatik V, Uni Mannheim
email:
silvio@rumms.uni-mannheim.de

 
We started the search for a template sequence by using PSI-BLAST [1]. The
resulting top candidates were inspected. If no significant hit was found we
applied the fold recognition method described in the CASP-4 fold recognition
abstract "Secondary structure and function based protein fold recognition"
(Bindewald, Tosatto et al). The most probable candidates were submitted to a
CLUSTALW [2] alignment. This alignment was manually modified to reduce the
impact of insertions and deletions. A raw model of the target without indels
was created using our program MOLEGO, which simply copies the backbone angle
information from the template protein according to the sequence alignment. The
insertions and deletions were subsequently modeled using NAZGUL, our fast
ab-initio loop modeling tool. We typically allowed between one and three
residues flanking an indel to be modified during loop modeling. The NAZGUL
algorithm evaluates a database of precalculated synthetic loops. This database
was created by recursively concatenating small polypeptide fragments, starting
from a Ramachandran distribution of phi and psi angles with rigid rod
geometry. Larger fragments are assembled from smaller ones by means of
geometric transformations. These are all stored according to loop length and
evaluated during the modeling step. Possible loops are evaluated and ranked
according to their geometric fit on single residue anchor regions. These are
then filtered for chain continuity, that is deviation from idealized bond
length and bond angle values, and inter-atomic clashes. The solution was
selected through visual inspection among the top scoring proposals. Insertions
at the beginning or end of the chain were modeled using MOLEGO's ab initio
method [3], which samples a discrete set of torsion angles using a
combinatorial search approach. As a final step the side chains were placed
with PESO, our implementation of the dead-end elimination algorithm [4] using
the AMBER [5] non-bonded potential and a set of backbone independent rotamers.

References:
[1] Altschul SF, Madden TL, Schäffer AA, Zhang J, Zhang Z, Miller W, Lipman
DJ. Gapped blast and psi-blast: a new generation of protein database search
programs. Nucleic Acids Research, 25(17):3389-3402, 1997.

[2] Thompson JD, Higgins DG, Gibson TJ. CLUSTAL W: Improving the sensitivity
of progressive multiple sequence alignment through sequence weighting,
position-specific gap penalties and weight matrix choice. Nucleic Acids
Research, 22(22): 4673-4680, 1994.

[3] Bindewald E, Hesser J, Männer R. Protein Structure Optimization using a
Combinatorial Search Algorithm. Proc. of the Int. Conf. on Mathem. and
Engineering Techniques in Medicine and Biological Sciences (METMBS'00),
233-238, 2000.

[4] Desmet J, DeMaeyer M, Hazes B, Lasters I. The dead-end elimination theorem
and its use in protein side chain positioning. Nature 356: 539-542, 1992.

[5] Weiner SJ, Kollman PA, Nguyen DT, Case DA. An all-atom force field for
simulations of proteins and nucleic acids. J Comput Chem 7:230-252, 1986.


Dunbrack , 169

number of submitted models: 20

Comparative modeling with PSI-BLAST, Modeller, and SCWRL

J. Michael Sauder and Roland L. Dunbrack, Jr.

Fox Chase Cancer Center
email:
RL_Dunbrack@fccc.edu

 
We performed the following steps in modeling  
comparative modeling targets in CASP4: 
 
1) PSI-BLAST was used to identify homologous proteins in the PDB. This 
was accomplished by using PSI-BLAST with the target sequence as query 
on the non-redundant (nr) sequence database available from NCBI.  This 
database had been filtered of low-complexity sequences with the 
program seg with a window size of 20 (higher than the default 12). The 
PSI-BLAST matrix is saved every other iteration. Each PSI-BLAST matrix 
was then used to search a database of PDB sequences that we derive 
from PDB files (these sequences differ from what RCSB puts out). 
 
2) We chose a parent from the PDB based on sequence identity, 
length of alignment, relative paucity of gaps, and resolution. 
 
3) The sequence alignments were examined in light of the 
parent structure and some manual adjustments were made to 
move gaps to the most likely coil regions. In some cases, 
we also examined an alignment from the Threader program 
of David Jones. 
 
4) We used our program "blast2model" to take the alignment 
and parent structure to produce a PDB file with the backbone 
coordinates renumbered and residue type changed to the 
target sequence, given the alignment. We preserved the  
coordinates of residues that are identical in the parent 
and target according to the alignment. blast2model also 
outputs a sequence file that can act as input to the scwrl 
program to predict sidechains (below).  
 
5) In some cases, especially in very low sequence identity alignments, 
we did not build insertions and deletion regions, but rather left the 
parent backbone unaltered (just renamed and renumbered). We then used 
the program SCWRL (Bower,Cohen,Dunbrack 1997; Dunbrack 1999) to 
rebuild the missing sidechains (those different between parent and 
target). SCWRL uses a backbone-dependent rotamer library, followed by 
clustering of potentially clashing sidechains (which preserves the 
lowest energy conformation of sidechains which do not clash; this is 
effectively a dead-end elimination step).  These clusters are solved 
by a branch-and-bound algorithm.  If clusters are too large to be 
solved rapidly, one residue is identified that when removed will break 
the cluster in two parts.  Each part is solved separately for each 
rotamer of this keystone residue, and then the energies of the two 
clusters summed for each rotamer, and the lowest energy configuration 
is the prediction. So for a 12 residue cluster, this means that 
N=n**12 without breaking the cluster and N=n(n**6 + n**6) when it is 
broken into two clusters, where N is the number of combinations and n 
the number of rotamers per sidechain on average. 
 
The potential function is a statistical one for local 
backbone/sidechain interactions in the form of -log(prot(phi,psi,res)) 
and a simple linear steric interaction for sidechain-sidechain and 
non-local-backbone/sidechain interactions. prot is determined 
from the Bayesian statistical analysis in the backbone-dependent 
rotamer library (Dunbrack and Cohen, 1997). We believe 
this kind of statistical potential function produces better 
predictions when the backbone conformation is not precise (i.e., 
derived from another protein structure) than a molecular-mechanics 
type function that is very sensitive to atom positions. Hence 
for modeling purposes prediction rates are probably higher. 
 
6) In a number of cases, especially for the last few submissions in 
September, we used a loop prediction algorithm in the Modeller5 
program developed by Andrej Sali and kindly provided by him in advance 
of publication (it has now been published: Fiser,Do,Sali, 2000). We 
first removed 2-3 residues on either side of each gap and let SCWRL 
replace the sidechains for all mutated residues in the alignment. We 
then let Modeller model each missing loop in turn. We then used SCWRL 
to replace sidechains in the whole structure (not including conserved 
residues). 
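
Two quantitative points from step 5 can be made concrete with a short sketch. The first part evaluates the two combination counts quoted for a 12 residue cluster. The second part writes the potential as a -log(probability) term plus a piecewise-linear steric term. The value n = 5 and the steric parameters `e_max` and `inner` are illustrative assumptions, not values taken from SCWRL.

```python
import math


def combos_unsplit(n: int, size: int = 12) -> int:
    """Combinations when the whole cluster is enumerated: n**size."""
    return n ** size


def combos_split(n: int, half: int = 6) -> int:
    """Combinations after splitting, using the expression in the text:
    N = n * (n**half + n**half), one solve of each half per keystone rotamer.
    """
    return n * (n ** half + n ** half)


def rotamer_term(prob: float) -> float:
    """Local backbone/sidechain term, -log of the rotamer probability."""
    return -math.log(prob)


def steric_term(dist: float, contact: float,
                e_max: float = 10.0, inner: float = 0.8) -> float:
    """Simple linear repulsion between two atoms (hypothetical parameters).

    Zero beyond the contact distance, e_max below inner * contact,
    and a straight line in between.
    """
    if dist >= contact:
        return 0.0
    lower = inner * contact
    if dist <= lower:
        return e_max
    return e_max * (contact - dist) / (contact - lower)


n = 5  # assumed average number of rotamers per sidechain
print(combos_unsplit(n))  # 244140625
print(combos_split(n))    # 156250
```

Even with this small n, splitting the cluster reduces the search by more than three orders of magnitude.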
 
 
 
References: 
 
M. Bower, F. E. Cohen, and R. L. Dunbrack, Jr. Sidechain prediction 
from a backbone-dependent rotamer library: A new tool 
for homology modeling. J. Mol. Biol. 267, 1268-1282 (1997).  
 
R. L. Dunbrack, Jr. and F. E. Cohen. Bayesian statistical analysis of 
protein sidechain rotamer preferences. Protein Science 6, 1661-1681 
(1997). 
 
R. L. Dunbrack, Jr. Comparative modeling of CASP3 targets using 
PSI-BLAST and SCWRL. Proteins: Structure, Function, Genetics, 
Suppl. 3, 81-87 (1999). 
 
A. Fiser, R.L. Do, A. Sali 
Modeling of loops in protein structures. 
Protein Sci. 9, 1753-73 (2000). 
 


Walts-Wondrous-Wizards , 044

number of submitted models: 172

Playing protein fold charades

N. Alexandrov, V. Brover, M. Troukhan, W. Volkmuth

Ceres, Inc
email:
nicka@ceres-inc.com

 
Our prediction process consists of two steps: selecting a template structure
and making an alignment.

1. Template selection.
 
All target sequences were compared with a set of structural domains using the
123D+ program, which combines sequence similarity, secondary structure
prediction and contact capacity potentials to compute a similarity score. If
there was a hit with Z-score > 6, we made the selection based on the strongest
hit. When the hit covered only a part of the target sequence, we cut out the
remaining part and repeated the run. If 123D+ did not detect an obvious hit,
we predicted the fold anyway, because sampling of a random set of recently
predicted structures indicates that approximately 90% of them are structurally
similar to already known folds, even if there is no strong sequence
similarity. Without a strong 123D+ hit, we used other available associative
information in an attempt to link the target with a protein with known
structure. We used literature search, known metabolic pathways, gene
expression data, position on the chromosome, operons, distribution of folds in
the organism, secondary structure prediction, predictions of transmembrane
helices and coiled coils. We demonstrated that there is a correlation between
protein folds and gene expression and between protein folds and location in
the chromosome. All this additional information gave us quite weak signals.
However, when consistent, these signals resulted in rather confident
predictions. This part of the prediction is analogous to playing charades,
where one discovers an unknown word using many indirect, independent hints.
Interestingly, we can compare the effectiveness of such an approach versus a
purely automated method, as the 123D+ server also participated in the CAFASP
section of CASP4.

2. Alignment 
 
Alignments were computed with the 123D+ program and were in some cases manually
corrected. Manual intervention was limited to (i) placing deletions within the
target sequence so that their edges are close in space in 3D structure and
(ii) moving insertions in the target sequence to the surface of protein
structure.


Godzik , 197

number of submitted models: 158

FFAS+ server for homology modeling

L.Jaroszewski, A.Godzik

The Burnham Institute
email:
adam@ljcrf.edu

 
We applied identical procedures to homology modeling targets and fold
recognition targets.  The procedure consists of three steps: A: Selection of
the template(s), B: Generation of suboptimal alignments, C: Model building and
evaluation. In cases where the FFAS z-score indicated that the similarity
between the template and query was strong (z-score values higher than 15),
step B was usually skipped and the model was built based on the alignment from
FFAS.  This was the case for many of the homology modeling targets.  The
prototype of this procedure called "Multiple Model Approach" was described and
evaluated in (4-5).


A. Selection of the template(s) - Fold & Function Assignment System (1,2).

An FFAS profile-profile search was performed against the PDB database. FFAS is
based on sequence profile-profile matching with dynamic programming.  The
multiple alignment is prepared from the PSI-BLAST (8) output.  A non-redundant
database of protein sequences was used for profile calculation.  FFAS uses
sequences from the PSI-BLAST output with E-values below 0.01 and an elaborate
weighting scheme for the sequences included in the profile (1).  Weights are
assigned based on the dissimilarity of each sequence with respect to the other
sequences in the family.  In addition, FFAS normalizes the matrix containing
the comparison scores between all positions of both aligned profiles before
the best path is searched for with the Smith-Waterman dynamic programming
algorithm (7).
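
The final dynamic programming step can be sketched as a local (Smith-Waterman) alignment over a precomputed position-by-position score matrix. This is not the FFAS implementation: the matrix is assumed to be already normalized, and the linear gap penalty `gap` is a simplification chosen for brevity.

```python
def smith_waterman_score(scores, gap=1.0):
    """Best local alignment score over a profile-profile score matrix.

    scores[i][j] is the similarity between position i of one profile
    and position j of the other (hypothetical, already normalized).
    """
    rows, cols = len(scores), len(scores[0])
    # h[i][j] holds the best score of a local alignment ending at (i, j).
    h = [[0.0] * (cols + 1) for _ in range(rows + 1)]
    best = 0.0
    for i in range(1, rows + 1):
        for j in range(1, cols + 1):
            h[i][j] = max(
                0.0,                                     # start a new alignment
                h[i - 1][j - 1] + scores[i - 1][j - 1],  # align i with j
                h[i - 1][j] - gap,                       # gap in second profile
                h[i][j - 1] - gap,                       # gap in first profile
            )
            best = max(best, h[i][j])
    return best
```

Keeping back-pointers alongside `h` would recover the alignment itself rather than only its score.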

B. Calculation of suboptimal alignments.

A set of suboptimal (alternative) alignments was generated for the query
sequence and the template structure(s) selected from the PDB database in
step A.  After the calculation of the initial alignment based on the
profile-profile FFAS method, a similarity matrix was recalculated using
several combinations of threading terms (burial and local conformation terms
are used).  The threading energy was calculated for the sequence profile
rather than for a single sequence, as is done in classical
threading.  Several gap penalty values were also explored.  Gap penalties were
set higher within the secondary structure elements defined with the method
described in a separate publication (3).  The resulting alignments were
clustered to avoid redundancy.

C. Model building and evaluation.

The models based on the alignments calculated in step B were built and
evaluated. We used the MODELLER (6) program developed in the A. Sali lab for
model building.  Model evaluation is based on the threading energy using a
statistical potential and evolutionary information encoded in sequence
profiles (the threading energy was calculated for the sequence profile rather
than for a single sequence, as is done in classical threading, for example in
the MatchMaker program).  The threading energy per residue was the final
criterion of the model quality.

References
1. Rychlewski, L., Jaroszewski, L., Li, W. & Godzik, A. (2000).
"Comparison of sequence profiles. Strategies for structural predictions using
sequence information". Protein Science 9, 232-241


2. Jaroszewski, L., Rychlewski, L. & Godzik, A. (2000).
"Improving the quality of twilight-zone alignments". Protein Science, 9,
1487-1496

3. Jaroszewski, L. & Godzik, A. (2000). Search for a New Description of
Protein Topology and Local Structure. ISMB 2000 - 8-th International
Conference on Intelligent Systems for Molecular Biology, San Diego 2000

4. Jaroszewski, L., Pawlowski, K. & Godzik, A. (1998).
"Multiple model approach: an extension of comparative modelling". Journal of
Molecular Modelling 4, 294-309

5. Pawlowski, K., Jaroszewski, L., Bierzynski, A. & Godzik, A. (1997).
"Multiple model approach - dealing with alignment ambiguities in comparative
protein modeling". In Biocomputing, 97 (Altman, R. B., Dunker, A. K., Hunter,
L. & Klein, T. E., eds.), pp. 328-339. World Scientific, Singapore.

6. Sali, A. and Blundell, T. L. (1993).
"Comparative protein modelling by satisfaction of spatial restraints". J. Mol.
Biol. 234, 779-815

7. Smith, T.F. and Waterman, M.S. (1981) "Identification of common molecular
subsequences". J Mol Biol 147:195-7

8. Altschul, S.F. et al. (1997) "Gapped BLAST and PSI-BLAST: a new generation
of protein database search programs". Nucleic Acids Res 25:3389-402


Harrison-Weber , 058

number of submitted models: 91

Randomized and Multiple Model Approaches to Homology Modeling and Ab Initio Modeling.

Ivan Y. Torshin, Irene T. Weber and Robert W. Harrison

TJU
email:
robert.harrison@acm.org

 
Molecular modeling is a combinatorial, multiple minimum optimization problem.
In homology modeling, the known homolog serves as a good starting point for
the search, while in ab initio folding there are only limited geometric data.
Two complementary classes of algorithms were explored in our CASP-4
predictions: randomized algorithms, and multiple modeling algorithms.
Randomized algorithms, either based on the Kohonen self-assembling neural
network or an analytic solution for simultaneous circular equations, were used
to explore conformational space and delineate regions of allowed molecular
geometry. These algorithms are computationally efficient; it was possible to
fold most of the CASP-4 ab initio targets several hundred times in a few CPU
hours.  Multiple models from independent runs of the randomized procedures
were used to extract conformations that occurred repeatedly, as this improved
the reliability in tests. Hundreds of models were used for ab initio
predictions and ten models for homology modeling. AMMP (Harrison, 1999) was
used to predict 12 ab initio targets and 30 homology modeling targets.

Randomized Algorithms

	Our major focus has been to explore new algorithms for building molecular
        models and searching conformation space. One general class of
        algorithms, randomized algorithms, is especially interesting because
        these algorithms can efficiently find or approximate the solutions to
        combinatorial and geometric problems (Hertz 1991, de Berg 1997) and
        can be implemented efficiently on a parallel computer (JaJa 1992).
        The general idea behind randomized algorithms is to use a set of
        independent identically distributed random variables to limit the
        solution to an acceptably small range. Rather than attempt to converge
        to an exact solution of a mathematical problem, which may not exist or
        may not be meaningful in the context of protein structure, randomized
        approaches define a sequence of ever-closer bounds on the ranges of
        solutions. Two randomized approaches were tested: a
        modified Kohonen neural network with a distance metric and a
        randomized analytic solution to distance restraints (Harrison 1999).
        Multiple models were constructed using the distance restraints that
        were derived from homologous structures or sequences. Averages over
        the models were then used to develop a single model for submission.



Homology Modeling

	Protein folds were recognized using the FFAS server (Rychlewski et al.
        2000), the 3D-PSSM server  (Kelley et al. 2000), and the screening
        method we used for ab initio folding. Clustal (Thompson et al. 1994)
	was used for multiple sequence alignments when possible.  The thirty targets
	86-90,92,93,99-101,103,104,106,107,109,111-113, 115-123, 125,127, and 128
	were modeled. Ten models were generated from each template using either the
	Kohonen algorithm or the analytic approach (Harrison 1999) coupled with
	energy minimization and a short run of molecular dynamics. The averaged model
	was energy minimized to generate the final model.  The final models were
	subjected to 3ps runs of molecular dynamics, which may degrade the accuracy
	for the high homology examples. The variation among the models was calculated
	for each atom and used as an estimate of the uncertainty in the positions.
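
The averaging and per-atom uncertainty estimate can be sketched as follows. The sketch assumes that all models share one coordinate frame (they are already superimposed) and list their atoms in the same order. It illustrates the idea only and is not the AMMP procedure.

```python
import math


def average_model(models):
    """Return (mean coordinates, per-atom spread) over a set of models.

    models is a list of models, each a list of (x, y, z) tuples.
    The spread is the root-mean-square deviation of each atom from
    its mean position, used here as a positional uncertainty.
    """
    n_models = len(models)
    mean, spread = [], []
    for atom in range(len(models[0])):
        cx = sum(m[atom][0] for m in models) / n_models
        cy = sum(m[atom][1] for m in models) / n_models
        cz = sum(m[atom][2] for m in models) / n_models
        msd = sum(
            (m[atom][0] - cx) ** 2
            + (m[atom][1] - cy) ** 2
            + (m[atom][2] - cz) ** 2
            for m in models
        ) / n_models
        mean.append((cx, cy, cz))
        spread.append(math.sqrt(msd))
    return mean, spread
```

An averaged structure generally has distorted bond geometry, which is why a minimization step follows the averaging in the text.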


Ab Initio Folding

 	Ab Initio folding was used for targets
	91,94,95,96,97,98,102,105,108,110,114,124 and 126. A simple
	hydrophobicity-electrostatics potential was supplemented by a
	sequence-specific empirical potential to improve the stereochemistry of the
	prediction. Inter-residue distances were estimated by searching the protein
	database for short stretches of homology from different and unrelated
	proteins. Simply finding the best local fit for each overlapping window of
	amino acids does not result in a good self-consistent set of distances.
	However, when the requirement for chain continuity is enforced, the problem
	of identifying a self-consistent set of inter-residue distances becomes akin
	to a convolutional error correcting code which is readily solvable by dynamic
	programming (Viterbi 1967). This continuity condition is an inherent property
	of all polymers and provides a significant gain in prediction accuracy.
	Potential templates were identified for homology modeling by using the
	proteins that had the most fits for each sequence.
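
The role of dynamic programming here can be illustrated with a generic Viterbi-style search. The cost definitions are hypothetical: `local_cost` stands for how poorly a candidate fragment fits a window, and `transition_cost` for how poorly two candidates in consecutive windows agree. The scoring actually used by the authors is not given in the abstract.

```python
def viterbi_path(local_cost, transition_cost):
    """Choose one candidate per window so that the total cost is minimal.

    local_cost[t][s] is the cost of candidate s at window t.
    transition_cost(t, a, b) is the cost of following candidate a at
    window t - 1 with candidate b at window t.
    Returns (total cost, list of chosen candidate indices).
    """
    best = list(local_cost[0])
    back = []
    for t in range(1, len(local_cost)):
        new_best, pointers = [], []
        for b, cost_b in enumerate(local_cost[t]):
            # Best predecessor for candidate b at this window.
            prev = min(range(len(best)),
                       key=lambda a: best[a] + transition_cost(t, a, b))
            new_best.append(best[prev] + transition_cost(t, prev, b) + cost_b)
            pointers.append(prev)
        best = new_best
        back.append(pointers)
    last = min(range(len(best)), key=best.__getitem__)
    path = [last]
    for pointers in reversed(back):  # trace the optimal path backwards
        path.append(pointers[path[-1]])
    path.reverse()
    return best[last], path
```

With a continuity penalty in `transition_cost`, the search can prefer a slightly worse local fit when it keeps the chain self-consistent, which picking the best fit per window cannot do.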

	The models were generated in three steps.
	1) 200 models were generated with the original potential functions, using
	C-alpha-only models. Then inter-residue distances (C-alpha-C-alpha) were
	averaged over all the models. Those distances where the standard deviation
	was less than 2 angstroms were extracted.
	2) A single model was generated that both satisfied the new distance
	information and minimized the hydrophobicity-electrostatics potential. This
	model was achiral and can represent either the left or right-handed solution.
	Secondary structure was identified visually and used to define additional
	distance restraints (published experimental data on helical locations were
	used for target 102).
	3) All-atom models were built for both the right and left-handed solutions.
	The best models had right-handed helices.
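
Step 1 can be sketched as follows. The 2 angstrom threshold comes from the text. The use of the population standard deviation and the data layout are assumptions.

```python
import math
from itertools import combinations


def consistent_distances(models, max_sd=2.0):
    """Average C-alpha to C-alpha distances that are stable across models.

    models is a list of models, each a list of (x, y, z) C-alpha
    coordinates in residue order. Returns {(i, j): mean distance} for
    the residue pairs whose distance has a standard deviation below
    max_sd over all models.
    """
    kept = {}
    for i, j in combinations(range(len(models[0])), 2):
        dists = [math.dist(m[i], m[j]) for m in models]
        mean = sum(dists) / len(dists)
        sd = math.sqrt(sum((d - mean) ** 2 for d in dists) / len(dists))
        if sd < max_sd:
            kept[(i, j)] = mean
    return kept
```

Because only distances are used, the result is the same for a model and its mirror image, which is consistent with the achiral model described in step 2.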

References
 
de Berg M., van Kreveld M., Overmars M., and Schwarzkopf, O. (1997)
Computational Geometry Springer-Verlag

Harrison, R.W (1999), A Self-Assembling Neural Network for Modeling Polymers
J. Math. Chem. 26,125-137

Hertz J., Krogh, A., Palmer R.G. (1991) Introduction to the theory of neural
computation, Santa Fe Institute studies in complexity lecture notes vol. 1.
Addison-Wesley pp244-246,

JaJa J. (1992) An Introduction to Parallel Algorithms, Addison-Wesley pp 433-484,

Kelley LA, MacCallum RM, Sternberg MJ (2000), Enhanced genome annotation using
structural profiles in the program 3D-pssm, J Mol Biol 299(2):499-520

Rychlewski L, Jaroszewski L, Li W, Godzik A (2000),Comparison of sequence
profiles. Strategies for structural predictions using sequence information
Protein Sci 9(2):232-41

Thompson JD, Higgins DG, Gibson TJ (1994), CLUSTAL W: improving the
sensitivity of progressive multiple sequence alignment through sequence
weighting, position-specific gap penalties and weight matrix choice, Nucleic
Acids Res 22(22):4673-80

Viterbi A.J. (1967) Error bounds for convolutional codes and an asymptotically
optimum decoding algorithm IEEE trans inf theory IT-13,260-269


Yang-Ansuei , 017

number of submitted models: 127

PrISM: Protein Informatics System for modeling

An-Suei Yang

Columbia University, Dept of Pharmacology
email:
yanga@ps7yang3.cpmc.columbia.edu

 
All alignments and 3D structural models for the targets in CASP4
were produced with the PrISM program.  PrISM (Protein Informatics System 
for Modeling) is a sequence/structure analysis/modeling system that can be used 
either interactively or automatically to produce 3D structures of 
proteins from their amino acid sequences (Yang & Honig, Proteins  
suppl. 3, 66-72 (1999), J. Mol. Biol. 301(3), 665-678, 679-690, 691-712 (2000)).  
PrISM has been released to the public domain and can be downloaded  
from the web site: http://www.columbia.edu/~ay1/.  
 
PrISM consists of a variety of integrated computational modules and databases,
including the facilities to carry out structure topology analysis, sequence homology 
search/alignment and statistics, structure-structure alignment, 
multiple sequence/structure alignment, sequence/structure profile 
analysis, fold recognition, comparative model building, sidechain and 
loop modeling, and model structure assessment. At present, PrISM makes use  
of the NCBI-nr and PDB as data resources. NCBI-nr is used without  
further modification.  PDB entries are divided into structural domains with  
PrISM structure topology analysis tools to form structure domain libraries.  
 
PrISM's sequence search and analysis tools, based on either the 
Smith-Waterman or PSI-BLAST sequence comparison algorithms, in 
conjunction with statistics based on the theory of extreme value 
distribution, can perform pairwise sequence similarity searches, 
pairwise or multiple sequence alignments, sequence family clustering, 
and sequence profile searches over sequence databases. Structure search 
and analysis tools use an algorithm which is built upon double dynamic 
programming and rigid-body superimposition methods.  This algorithm is 
capable of performing pairwise structure alignments, multiple structure 
alignments, structure similarity searches and clustering of similar 
protein structures. The functions of the sequence and structure analysis modules 
are to identify the most suitable structural template(s) and to predict 
the best sequence-to-structure alignments, which are then used in the 
protein structure modeling modules for model building. 
 
Structure templates are recognized first by a dynamic programming 
alignment score calculated with the BLOSUM 62 substitution matrix and 
then normalized using the extreme value distribution theory. If the sequence 
similarity score between a query sequence and a template is less 
than the empirically determined cut-off of p-value=10E-6, the alignment and the 
template are used to produce a homology model for the query sequence 
with the PrISM structure modeling modules. PSI-BLAST is also used to determine  
the most suitable structure templates. If sequence alignment methods fail to 
relate a query sequence to any structure in the PDB, a fold recognition 
procedure is applied.  This procedure is started by constructing models 
(backbone plus carbon beta) of the query sequence based on the 
predicted alignments of the sequence to all possible templates in the 
PDB using a sequence-to-structure mapping algorithm.  The most likely 
models for the sequence are then decided by a subsequent model ranking 
procedure based on a structure fitness score.  The structure fitness 
score is a sum of individual residue scores which are calculated using 
statistically derived parameters. These parameters are designed to 
evaluate these simplified models based on secondary structure 
propensities and the number and chemical properties of the contacting 
neighbors of each residue. 
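
The extreme value normalization of alignment scores can be sketched with the standard Gumbel form. The location `mu` and scale `beta` are hypothetical fitted parameters. In practice they would be estimated from scores against unrelated sequences, and the cut-off is left to the caller.

```python
import math


def evd_p_value(score, mu, beta):
    """P(S >= score) under an extreme value (Gumbel) distribution.

    Computes 1 - exp(-exp(-(score - mu) / beta)), using expm1 to keep
    precision when the p-value is very small.
    """
    z = (score - mu) / beta
    return -math.expm1(-math.exp(-z))


def is_homology_candidate(score, mu, beta, cutoff):
    """True when the normalized score is more significant than the cut-off."""
    return evd_p_value(score, mu, beta) < cutoff
```

Templates that fail such a test would then go to the fold recognition procedure described above.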
 
PrISM's structure modeling modules build protein structures using one 
or more templates that are simultaneously aligned to the query 
sequence. When more than one template is used, an automatic procedure 
first divides templates into secondary structure segments, and then 
selects the most suitable segment templates for model building, segment 
by segment.  Mainchains are built by using the template conformation 
when possible. Insertion-deletion regions, usually loops, are then 
rebuilt using ab initio methods. Sidechain torsion angles are either 
taken from the templates or predicted based on the mainchain torsion 
angles with a neural network algorithm.  The model building and the 
alignment procedures can iterate until a reasonable model structure is 
obtained. Our model structures for targets in CASP4 have not been refined. 
 
PrISM contains a model assessment module, which is used to assess the 
quality of a predicted model as the experimental structure becomes 
available.  The assessment procedure is started by carrying out a 
structure alignment to align the model and the experimental structure. 
This is followed by the RMSD calculation, the evaluation of the predicted 
alignment on which the model is built, and the evaluation of the 
predicted mainchain and sidechain torsion angles.  These results 
provide statistical indicators for the quality of the predicted model. 
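
The RMSD step can be sketched as below. The sketch assumes that the structure alignment has already paired the atoms and placed both structures in a common frame. It does not perform the superposition itself.

```python
import math


def rmsd(coords_a, coords_b):
    """Root-mean-square deviation between two paired coordinate lists.

    coords_a and coords_b are equal-length lists of (x, y, z) tuples
    for the aligned atoms, already superimposed.
    """
    if not coords_a or len(coords_a) != len(coords_b):
        raise ValueError("coordinate lists must be non-empty and equal in length")
    total = sum(math.dist(a, b) ** 2 for a, b in zip(coords_a, coords_b))
    return math.sqrt(total / len(coords_a))
```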
 
Using PrISM, we have built one model for each of the 42 CASP4 
targets. We did not make a prediction for target 116 (811 residues). 
The modeling strategy varies from one target to another 
because the protocol that is used depends on the amount and quality of 
information extracted from the sequence and structure databases.  
Overall, PrISM provides a flexible computational environment 
which has been used in a wide range of modeling challenges. 
 
 


blundell-tl , 095

number of submitted models: 23

Comparative modelling incorporating structural
features and environmental properties

David F. Burke, Nuria Campillo, Charlotte Deane, Paul de Bakker, Lan Chen, Axel Innis, Simon Lovell, Joerg Mueller, Kenji Mizuguchi, H.G. Nagendra, Ricardo Nunez, Jiye Shi, Hiroki Shirai, Mark G. Williams and Tom L. Blundell

Department of Biochemistry, Cambridge, UK
email:
dave@cryst.bioc.cam.ac.uk

 
 
Prediction of suitable structural homologues was performed,  
using the program FUGUE, by searching position specific  
environmental substitution tables generated from the HOMSTRAD  
database of homologous structures (Mizuguchi et al. 1998).  
These predictions were validated using a combination of visual  
inspection of the resulting alignment using the program JOY  
(Mizuguchi et al. 1998), comparisons of secondary structure  
predictions and a survey of the literature. Identification of  
homologues with known structure was also aided by PSI-BLAST  
searches and results from the CAFASP servers. The predicted  
alignment was then either rejected or manually edited if it was  
thought necessary.  
 
The 'core' structure of the target sequence was built using both 
MODELLER and the new comparative modelling algorithm SCORE 
(Deane et al. submitted), which builds segments of the structure 
that it predicts to be structurally conserved. The 
structurally variable regions were predicted using the programs 
CODA (Deane and Blundell, submitted) and SLoop (Burke et al. 2000; 
Rufino et al. 1997; Donate et al. 1996). Sidechains were then 
added using the program CELIAN. 
 
Validation was performed by superimposing all of the predicted  
models onto the initial template structures, using the  
structural alignment program COMPARER. The models were 
inspected for structural features that were seen to be 
conserved among the template structures and suspect regions  
were re-modelled. 
 
 


Shoshana-Wodak , 486

number of submitted models: 16

Homology modeling method for CASP4

Koji Ogata and Shoshana J. Wodak

Universite Libre de Bruxelles
email:
koji@ucmb.ulb.ac.be

 
1. Selection of a template protein 

To determine if the structure of a target protein can be predicted using
homology modeling methods, we carried out a PSI-BLAST search with the target
sequence against the sequences of proteins in the PDB. This was performed
using tools and default settings available at the NCBI server.  When sequence
similarity was detected with one or more protein entries in the PDB, homology
modeling was undertaken.  For the 7 targets for which we performed homology
modeling, sequence identity levels ranged from 20 to 50%.

To align the target sequence to those of the candidate templates the following
procedure was used.  The structures of all the PDB entries, identified as
displaying sequence similarity to the target, were aligned using structure
superposition procedures [1], applied to the backbone atoms.  This structural
alignment was used to derive a multiple sequence alignment for the
corresponding proteins, to which the target sequence was then aligned.  This
alignment was computed using the Smith-Waterman algorithm and the GONNET matrix,
applying a length dependent gap penalty, with values of 12.0 and 1.0 for gap
creation and gap extension, respectively.  Gaps positioned in secondary
structure elements were manually displaced, towards appropriate positions in
nearby loop regions.  The template protein to be used for model building was
chosen from amongst the identified PDB entries as the protein with the highest
sequence similarity to the target.

2. Modeling regions with insertions and deletions

To model regions with insertions and deletions in the alignment, we searched
for suitable fragments in fragment databases.  These databases contained all
overlapping fragments ranging in length from 5 to 16 residues
and belonging to all proteins in the PDB with less than 90% sequence identity
[2]. Each of these databases included more than 450 000 fragments.  For a
given indel region, suitable fragments were selected from these databases by
specifying the fragment length and requiring that the chosen fragment match
the backbone of the 2 residues preceding and following the indel region to
within 1.0Å rmsd. When a large number of candidate fragments were identified
for a given indel region, those whose spatial orientation was most similar
to that of the corresponding loop region in the template protein were selected.
When no candidate fragments were identified for a region, the alignment was
modified, and the fragment selection procedure repeated.  Finally, the
selected suitable fragments were modeled into the template backbone.
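
The anchor-matching filter can be sketched as below. The text measures the fit by backbone rmsd after superposition. To stay self-contained, this sketch substitutes a distance-based rmsd over the anchor atoms, which needs no superposition but is a different measure. The fragment record layout is also hypothetical.

```python
import math
from itertools import combinations


def anchor_drmsd(anchors_a, anchors_b):
    """Distance rmsd between two equally ordered sets of anchor atoms.

    Compares all internal atom-to-atom distances, so the two sets do
    not have to be superimposed first. Note that, unlike a
    superposition rmsd, this measure cannot tell mirror images apart.
    """
    pairs = list(combinations(range(len(anchors_a)), 2))
    sq = sum(
        (math.dist(anchors_a[i], anchors_a[j])
         - math.dist(anchors_b[i], anchors_b[j])) ** 2
        for i, j in pairs
    )
    return math.sqrt(sq / len(pairs))


def select_fragments(target_anchors, fragments, length, cutoff=1.0):
    """Keep fragments of the requested length whose anchors fit the target.

    Each fragment is a dict with keys 'length' and 'anchors', where
    'anchors' lists the atoms of the 2 residues before and after the
    fragment in the same order as target_anchors.
    """
    return [
        f for f in fragments
        if f["length"] == length
        and anchor_drmsd(target_anchors, f["anchors"]) <= cutoff
    ]
```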

3. Side-chain modeling and optimization

For a given backbone structure side-chain conformations were selected using
the following procedure. First, lowest energy sidechain conformations were
selected from a library of conformations derived from known protein structures
[3]. This library typically contained many thousands of conformations for each
side chain type. The selection procedure was performed using the Metropolis
Monte Carlo sampling method coupled to the AMBER force field.  In a second
step, the lowest energy conformation was subjected to Monte-Carlo sampling in
Cartesian space in order to eliminate residual strain. In a last step the
energy of the resulting structure was further relaxed using energy
minimization.
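
The first selection step can be sketched as a Metropolis Monte Carlo search over library conformations. The `energy` callable is a hypothetical stand-in for the AMBER evaluation, and the step count, temperature and seed are illustrative values only.

```python
import math
import random


def metropolis_rotamers(n_choices, energy, steps=2000, temperature=1.0, seed=0):
    """Sample side-chain conformations with the Metropolis criterion.

    n_choices[i] is the number of library conformations for residue i.
    energy(state) returns the energy of a full assignment, where
    state[i] is the index of the conformation chosen for residue i.
    Returns the lowest-energy state visited and its energy.
    """
    rng = random.Random(seed)
    state = [rng.randrange(k) for k in n_choices]
    current = energy(state)
    best_state, best_energy = list(state), current
    for _ in range(steps):
        i = rng.randrange(len(state))
        old = state[i]
        state[i] = rng.randrange(n_choices[i])  # propose a new conformation
        proposed = energy(state)
        delta = proposed - current
        # Always accept downhill moves; accept uphill with Boltzmann probability.
        if delta <= 0 or rng.random() < math.exp(-delta / temperature):
            current = proposed
            if current < best_energy:
                best_state, best_energy = list(state), current
        else:
            state[i] = old  # reject and restore the previous conformation
    return best_state, best_energy
```

The returned state would then be the starting point for the Cartesian-space sampling and minimization described in the text.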

4. Modeling for T0119 (Special case)

Comparing the sequence of T0119 with that of Phthalate dioxygenase reductase
(E.C.1.18.1.) (PDB_ID 2PIA), we found that the N-terminal portion of T0119
(residues 1 to 90) was similar to the C-terminal portion of 2PIA (residues 228
to 321), and that the N-terminal portion of 2PIA (residues 1-227) was similar
to the C-terminal portion of T0119. The two proteins thus appeared to be
circular permutations of each other, with a short (10 residues) insertion
between the 2 domains in T0119 relative to 2PIA. Attempts to model the
insertion using our fragment databases failed, however, as no suitable
fragments could be found.  We then performed a PSI-BLAST search using the
C-terminal domain of 2PIA as the probe sequence.  This led to the
identification of Ferredoxin (PDB_ID 1QOA_A) as being similar to this portion
of 2PIA.  The corresponding structure was superimposed onto the C-terminal
portion of 2PIA, substituted for it in the template, and used instead to search
once more for a suitable fragment bridging the 2 domains. This time a suitable
fragment could be identified in one of our fragment databases. Hence in
modeling the T0119 target a template consisting of a chimera of the 1QOA_A and
2PIA backbones was used.

References
[1] Russell, RB and Barton, GJ., Proteins, 14, 309-323, 1992. 
[2] Ogata, K. and Umeyama, H., J. Mol. Graph. Model., 18, 258-72, 2000. 
[3] Ogata, K. and Umeyama, H., Protein Eng. 10, 353-359, 1997. 
 
 
 


SBI-AT , 342

number of submitted models: 68

Structure prediction using sequence profiles and predicted secondary structure

Tomas Nordahl Petersen, Claus Lundegaard, Morten Nielsen, Anne Marie Munk Jørgensen, Henrik Bohr, Jakob Bohr, Søren Brunak, Garry P. Gippert, Ole Lund. Structural Bioinformatics Advanced Technologies A/S. Hørsholm, Denmark

SBI-AT
email:
olund@strubix.dk

 
The team of scientists working at SBI Advanced Technologies A/S (SBI-AT)  is
developing novel technologies for protein structure prediction. The goal of
this work is to be able to make accurate tertiary structure models for as many
protein sequences as possible. The work covers diverse areas such as
prediction of protein secondary structure  using neural networks, construction
of improved sequence profiles, hidden Markov models using sequence and
structure profiles, construction of non-redundant data sets, and construction
of novel force fields. An algorithm for predicting secondary structure of
proteins at 80% accuracy has been developed [1]. Accurate secondary structure
predictions significantly enhance the capability to make accurate protein
models. The secondary structure predictions can be used to recognize folds and
find templates for remote homology modeling by identifying other proteins with
the same composition and sequential order of secondary structure units. They
can also be used to increase alignment accuracy, as well as aid in finding
fragments for ab initio structure prediction. The secondary structure
predictions will enable SBI to make more accurate protein models and to create
models for proteins that were previously too dissimilar to any known
protein structure. In turn, this will facilitate the search for small
molecules that bind to these proteins. The use of up to 800 predictions from
differently trained neural networks, and the ability to combine the networks
in an efficient manner, leads to a more accurate prediction than that of any of
the individual networks. A novel technique, output expansion, which predicts the
secondary structure for more than one residue at a time, is also a key element
of the new method. This improves the prediction accuracy by teaching the
neural network about the structural context of its secondary structure
predictions. The method not only calculates the most likely secondary
structure for a given residue, but also calculates the probability that a
residue is in any of the three secondary structure conformations. This type of
output is much more useful as input to probabilistic methods such as hidden
Markov models. Using these new technologies the secondary structure of
proteins can be predicted with an unprecedented 80% accuracy rate, thus
improving the state-of-the-art in this very competitive field.
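
As a rough illustration of combining several networks that each output
three-state probabilities, the sketch below uses simple unweighted averaging.
This is an assumption: the abstract does not state the actual combination
scheme, and the function names and toy numbers are hypothetical.

```python
# Hypothetical illustration: combine three-state (H, E, C) probabilities
# from several networks by unweighted averaging. The actual SBI-AT
# combination scheme is not described in the abstract.
STATES = "HEC"  # helix, strand, coil


def combine_predictions(network_outputs):
    """Average per-residue probabilities over all networks.

    network_outputs: list of predictions, one per network; each prediction
    is a list of (pH, pE, pC) tuples, one tuple per residue.
    """
    n_networks = len(network_outputs)
    n_residues = len(network_outputs[0])
    averaged = []
    for i in range(n_residues):
        sums = [0.0, 0.0, 0.0]
        for output in network_outputs:
            for k in range(3):
                sums[k] += output[i][k]
        averaged.append(tuple(s / n_networks for s in sums))
    return averaged


def most_likely_states(probabilities):
    """Pick the most probable state for each residue."""
    return "".join(
        STATES[max(range(3), key=lambda k: p[k])] for p in probabilities
    )


# Toy example: two networks, three residues.
net_a = [(0.7, 0.1, 0.2), (0.4, 0.4, 0.2), (0.1, 0.2, 0.7)]
net_b = [(0.9, 0.0, 0.1), (0.2, 0.6, 0.2), (0.2, 0.2, 0.6)]
consensus = combine_predictions([net_a, net_b])
print(most_likely_states(consensus))  # prints HEC
```

The averaged tuples remain probabilities (they still sum to one), so they
could be passed on to a probabilistic method in place of a single hard
assignment per residue.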

[1] Prediction of protein secondary structure at 80% accuracy. Petersen TN,
Lundegaard C, Nielsen M, Bohr H, Bohr J, Brunak S, Gippert GP, Lund O.
Structural Bioinformatics Advanced Technologies A/S, Hørsholm, Denmark.
tnordahl@strubix.dk. Proteins 2000 41: 17-20.


Murzin , 384

number of submitted models: 21

Distant Homology Recognition and Fold Prediction by a knowledge-based approach using SCOP and Pfam

Alexey G. Murzin and Alex Bateman

Centre for Protein Engineering, Cambridge, UK
email:
agm@mrc-lmb.cam.ac.uk

As submitted in the Fold Recognition category

Since our team's last performance in CASP2 four years ago, we have been working
on methods that could extend the superfamilies of known structure in SCOP
to the sequence families of unknown structure in Pfam and other sequence
libraries. We entered CASP4 hoping that this prediction experiment would
provide an opportunity to test our new methods. Systematic work on the
extension of SCOP superfamilies has already resulted in the structural
assignment of many sequence families of unknown structure and, often, unknown
function. Indeed, in CASP3, there were at least three targets predictable by
this approach. Disappointingly, however, none of the CASP4 targets turned out
to be in our list of protein families with already assigned structures.

Therefore, in CASP4 we used essentially the same approach as developed for
CASP2 (Murzin A.G. and Bateman A. Distant homology recognition using
structural classification of proteins. Proteins, Suppl. 1:105-112, 1997). We
searched for probable homologues of the target sequences and available
biochemical information on the target protein and/or its sequence family and
used the predicted secondary structure to shortlist the SCOP superfamilies to
which each attempted target might belong.  Predictions were based on the
discovery of superfamily-specific characters. The experience and expertise
gained from our work on the SCOP and Pfam databases were of great help in
this knowledge-based approach.  We also tried our knowledge-based approach in
the two other prediction categories. We used superfamily-specific features to
improve the alignments for some of the comparative modelling targets. For
several targets predicted by our approach to be unrelated to any of the
SCOP superfamilies, we attempted fold prediction using the conservation
patterns in the target sequence families, the available biochemical data
and/or the empirical folding rules derived from known protein structures.

The choice of prediction format, TS, and the target selection were influenced
by the CASP3 Fold Recognition assessment experience (Murzin A.G. Structure
Classification-Based Assessment of CASP3 Predictions for the Fold Recognition
Targets. Proteins Suppl. 3:88-108, 1999). To ensure the detection of (partly)
correct predictions by both sequence-dependent and sequence-independent
numerical evaluation procedures, each of our predictions was composed of the
regions of confident structure and alignment, the regions of confident
structure but tentative alignment, and the regions of tentative structure. The
3D coordinates for most of the target atoms were the best way to represent
this structural mosaic in a single format. As one of us was strongly opposed to
the NONE prediction, this option was not used. Therefore, in the absence of a
predicted homologous structure, we either built a 3D model of our prediction
ab initio or dropped it. Only one model was submitted for each of the
completed predictions. Apart from the two targets whose structures were known
to us before they were submitted to CASP4, we did not attempt the large,
presumably multi-domain targets without apparent domain boundaries. Because of
time limitations, we also ignored late comparative modelling targets including
all but one of the predicted members of the P-loop hydrolase superfamily. Due
to the presence of characteristic P-loop motifs in their sequences, their
homology recognition seemed straightforward, and the actual challenge was the
alignment. All other targets were attempted, but six or so of them were
eventually dropped.  In total, we submitted predictions for 21 targets. These include
four Comparative Modelling targets, T0090, T0092, T0093(!) and T0103; ten
Distant Homology Recognition targets, T0088, T0096_1, T0098, T0100, T0101,
T0104, T0108, T0109, T0118 and T0121_2; three targets with predicted known
folds (there may or may not be a distant homology), T0095, T0102 and T0114;
and four targets with predicted (probably) novel folds, T0086, T0091, T0094
and T0110.

Many of the Distant Homology Recognition predictions were based on the results
of previous analyses of SCOP superfamilies, for example the pectate lyase
beta-helix fold of T0100 and T0101 (Chothia C. and Murzin A.G. New folds for
all-beta proteins.  Structure 1, 217-222, 1993). There were several cases of
déjà vu. T0108 had the same characteristic feature as the CASP2 target T0038
and was modelled on the experimental structure of the latter. In T0121_2,
there was an OB-fold signature similar to the one we derived for the prediction
of T0004. For the fold prediction of T0102, we used the same pseudo ab initio
approach as we used for the CASP2 target T0042. Incidentally, the predicted
fold of T0102 was found to be similar to the experimental fold of T0042. In
T0086, there was a probable tandem repeat of two (alpha)-alpha-beta-beta-beta
motifs, detected by the analysis of its extended sequence family, analogous to
the approach that detected the internal duplication in T0002_2. Similarly, a
tandem repeat of two beta-alpha-beta-alpha-beta motifs was detected in the
extended T0094 sequence family. Unlike T0002_2, there was no SCOP superfamily
assigned for either T0086 or T0094. Both target structures were modelled ab
initio.

One of our CASP2 techniques, not credited properly at the time because it had
been used only for the late target T0026, was used extensively in most of our
CASP4 predictions. For almost every target predicted to belong to a large
superfamily with many known structures, a composite template structure was
assembled from different fragments of several superfamily structures
superimposed onto their common fold. It allowed the selection of the most
suitable parts from different structures. In particular, the predicted
structure of the P-loop hydrolase T0104 was assembled from the fragments of
several topologically distinct members of this very diverse superfamily to
generate a novel topological variant.  For a number of our predictions, we
also created hybrid templates including fragments of non-homologous structures
to model the missing parts in the parent structure or even to construct the
whole fold. Then we used Modeller to generate the 3D coordinates,
automatically sealing the gaps and fixing the stereochemistry of the joints.


Levitt , 012

number of submitted models: 180

Comparative Modeling by Building Many Alternative All-Atom Models

Michael Levitt

Stanford University
email:
michael.levitt@stanford.edu


The methods used for Comparative modeling and Fold-Recognition were the same
and what follows is the same in both abstracts.  This work was greatly aided
by the availability of the output of all the 30 or so servers participating in
CAFASP on the CAFASP web site at http://cafasp.bioinfo.pl/target.  In general
these results were available within hours of the target sequence announcement
and we never felt the need to consult the original servers in any way.

We first used the freeware program "wget" to download all the files for any new
targets.  Then we parsed all these files using a large Perl script.  This
script collected together the results from all the servers to give consensus
secondary structure predictions, consensus fold-recognition results and every
alignment produced.  The script also converted all the proteins recognized by
the different servers into SCOP version 1.50 superfamily codes and then counted
how often the different codes occurred.  Initially, we used the results from
over 20 servers but then found it more accurate to concentrate on eight that
seemed to perform most consistently.  These were: ffas, foldfit, fugue,
genthreader, inbgu, mgenthreader, pdbblast, and target99.  As may have been
expected, the groups behind each of these eight servers were generally the
experts who had done well in fold-recognition at previous CASP events (Godzik,
FFAS and PDB-Blast; Sternberg, foldfit or 3D-PSSM; Mizuguchi/Blundell, FUGUE;
Fischer, INBGU; Jones, genTHREADER and mGenTHREADER; and Karplus, SAM-T99 or
target99).  Unlike the CAFASP compilation released on the web by Danny Fischer
(http://www.cs.bgu.ac.il/~dfischer/CAFASP2/summaries/), no manual intervention
was used in parsing these raw results.  For each target we produced a summary
file that listed:

(1) The fold recognition hits in decreasing order of significance with the PDB
entry name, the significance scores and the SCOP 1.50 ID.  In some cases the
raw significance score given by the server was modified so that scores were on
the same scale (-100 for highest significance to small positive numbers for no
significance). For example:

T0099_ffas_hit_1      1bu1a   -33.2   2.32.2
T0099_ffas_hit_2      1ark    -30.7   2.32.2

(2) All the alignments produced by each method together with information on
the sequence match.  For example:

T0099_ffas_al_2-a.mas_1ark   2.32.2  EFIAIYDYKAETEEDLTIKKGEKLEIIEK-EGDWWKAKAIGSGEIGYIPANYIAAA
T0099_ffas_al_2-b.sla_1ark   2.32.2  IFRAMYDYMAADADEVSFKDGDAIINVQAIDEGWMYGTVQRTGRTGMLPANYVEAI
T0099_ffas_al_2-x.par_1ark   2.32.2  nMAT=55, pID=28, nDEL=1, nINS=0, nCov=55/56, spaci=-99.000


(3) A Consensus summary allowing the fold to be recognized.  For each SCOP
superfamily we collect the number of hits, the mean significance score, the
method and rank, the SCOP title and the PDB domain names with their SPACI
scores (Brenner, Koehl and Levitt, 2000).  For example:

%T0099  4.77.1      -78.4     3 genthreader_1 mgenthreader_2 pdbblast_9
%T0099                           (Alpha and beta (a+b),SH2-like,SH2 domain)
%T0099                           1fmk 0.578, 2src 0.540,
%T0099  4.123.1     -59.9     6 genthreader_1 mgenthreader_2 pdbblast_6 pdbblast_7
%T0099                           (Alpha and beta (a+b),Protein kinase-like (PK-like))
%T0099                           1fmk 0.578, 2src 0.540, 1qcfa 0.431, 1ad5a 0.258,
%T0099  2.32.2      -45.9    60 ffas_1 ffas_2 ffas_3 ffas_4 ffas_5 ffas_6 ffas_7
%T0099                           (All beta,SH3-like barrel,SH3-domain)
%T0099                           1ckaa 0.665, 1fmk1 0.578, 2src 0.540,

For more complete results see our "private" site at: http://csb.stanford.edu/levitt/casp1234 .
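
The consensus step in (3) can be sketched roughly as follows. This is not the
original Perl script: the parsing rules are inferred from the two example hit
lines in (1), and the function name and sort order are assumptions made for
illustration only.

```python
from collections import defaultdict


def summarize_hits(lines):
    """Group fold-recognition hits by SCOP superfamily code.

    Each line is assumed to look like the examples in (1):
    <target>_<server>_hit_<rank>  <pdb entry>  <score>  <SCOP code>
    """
    by_family = defaultdict(list)
    for line in lines:
        label, pdb, score, scop = line.split()
        target, server, _, rank = label.split("_")
        by_family[scop].append((float(score), f"{server}_{rank}", pdb))

    summary = []
    for scop, hits in by_family.items():
        mean_score = sum(h[0] for h in hits) / len(hits)
        methods = [h[1] for h in hits]
        summary.append((scop, len(hits), mean_score, methods))

    # More negative scores are more significant, so sort ascending.
    summary.sort(key=lambda row: row[2])
    return summary


hits = [
    "T0099_ffas_hit_1      1bu1a   -33.2   2.32.2",
    "T0099_ffas_hit_2      1ark    -30.7   2.32.2",
]
for scop, count, mean_score, methods in summarize_hits(hits):
    print(scop, count, round(mean_score, 2), " ".join(methods))
```

Grouping by superfamily code rather than by PDB entry lets hits on different
members of the same superfamily reinforce one another.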

During the CASP event, information contained in that site was updated
regularly by Levitt and shared with the different CASP4 groups in my lab
headed by Samudrala, Xia, Fain and Koehl respectively.  This is the only
information that was shared.  Each group then went on to make their own
comparative models (Samudrala, Koehl and Levitt) and/or ab initio models
(Fain, Levitt, Samudrala, and Xia).  There was no comparison of models, as
each individual preferred to use CASP as an opportunity to perfect their
methods rather than to "win" CASP.

Overall we felt very confident (perhaps wrongly so) about recognizing an
appropriate template in the comparative modeling and fold recognition parts of
CASP4.  We considered 17 targets to be Comparative Modeling targets (T0089,
T0090, T0092, T0099, T0101, T0103, T0111, T0112, T0113, T0117, T0119, T0121,
T0122, T0123, T0125, T0127, T0128) and did them all.  Of the remaining 26
targets, we considered 18 to be Fold-Recognition targets and 8 to be Ab Initio
targets.  Of the targets that we considered to be fold-recognition targets,
9 were considered easy as there was very clear sequence similarity (T0087,
T0088, T0093, T0096, T0098, T0100, T0104, T0109, T0116), 7 were considered
difficult and could not have been done without the consensus use of the
servers participating in CAFASP (T0094, T0095, T0107, T0108, T0115, T0118,
T0126), and 2 were considered to have no recognizable fold (T0120, T0124).
The latter two were also too large for ab initio modeling, so no results were
submitted for them.

In the predictions done by the Levitt group, all the alignments for targets
submitted after 15 August were re-aligned using the structure of the template
to modify normal dynamic programming.  This was done as follows: (a) The cost
of deleting residues from the template was proportional to the distance across
the gap in three dimensions (measured between the CA atoms adjacent to the
gap).  (b) The cost of inserting residues depended on how buried the residues
adjacent to the insertion were.  (c) Buried residues were given greater weight
in the scoring.  Each of these measures has an associated weight; not having
time to optimize these weights on known structural alignments, we used 25
combinations of parameters and generated alignments for every one.
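
A minimal sketch of the structure-dependent gap costs in (a) and (b) is given
below. Only the proportionality to the CA-CA distance comes from the text; the
burial scale (0 for exposed, 1 for fully buried), the averaging of the two
flanking residues, and the weights are assumptions for illustration.

```python
import math


def deletion_cost(ca_before, ca_after, weight=1.0):
    """Cost of deleting template residues.

    Proportional to the 3D distance between the CA atoms that flank the
    gap, so that closing a long spatial gap is penalized more heavily.
    """
    return weight * math.dist(ca_before, ca_after)


def insertion_cost(burial_before, burial_after, weight=1.0):
    """Cost of inserting residues between two template positions.

    Assumed form: grows with the mean burial (0..1) of the two template
    residues adjacent to the insertion.
    """
    return weight * 0.5 * (burial_before + burial_after)


# Flanking CA atoms 5 Angstrom apart, weight 2.0.
print(deletion_cost((0.0, 0.0, 0.0), (3.0, 4.0, 0.0), weight=2.0))  # 10.0
# Insertion next to one buried and one half-exposed residue.
print(insertion_cost(1.0, 0.5, weight=4.0))  # 3.0
```

In a dynamic programming alignment these functions would replace the constant
gap penalties, with one alignment generated per combination of weights.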


All the alignments, whether taken from CAFASP before 15 August or re-aligned as
described above, were then used with our well-established automatic modeling
methods, SegMod and Encad, to generate stereochemically acceptable all-atom
models for each alignment (see Levitt, M. Accurate Modelling of Protein
Conformation by Automatic Segment Matching. J. Mol. Biol. 226, 507-533 (1992)
and Levitt, M. Energy Refinement of Hen Egg-White Lysozyme. J. Mol. Biol. 82,
393-420 (1974)).

Finally, the best models were selected as follows.  We used the rapdf probability
score (Samudrala, R & Moult, J.  An All-atom Distance-dependent Conditional
Probability Discriminatory Function for Protein Structure Prediction.  J. Mol.
Biol., 275: 893-914, (1998)) to choose the best 1000 models (if there were that
many).  We clustered these 1000 or fewer models into 10 clusters (using
bottom-up hierarchical clustering based on inter-structure CA coordinate RMS
deviation).  For each model we used the rapdf score, Samudrala's HCF
hydrophobic compactness score, Keasar's surface energy, and the number of
hydrogen bonds to rank the conformations in each cluster.  Finally, we chose the
five lowest-energy models, never including more than one model from a given
cluster. Occasionally manual intervention was used in deciding the rank of the
models in the official submission to CASP.  For this we viewed the models to
judge general protein-like shape and also used the coverage.  For example, a
model with a less favorable energy score might be ranked above a model with a
better score if the first model covered more of the target sequence.
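
The final selection rule (five lowest-energy models, at most one per cluster)
can be sketched as below. The sketch assumes that the four scores have already
been combined into a single energy per model and that cluster labels are
available; how the scores were actually combined is not stated in the text.

```python
def pick_models(models, n_pick=5):
    """Choose up to n_pick models, lowest energy first, one per cluster.

    models: list of (model_id, energy, cluster_id) tuples, where energy is
    an assumed single combined score (lower is better).
    """
    chosen = []
    used_clusters = set()
    for model_id, energy, cluster in sorted(models, key=lambda m: m[1]):
        if cluster in used_clusters:
            continue  # this cluster already contributed a model
        chosen.append(model_id)
        used_clusters.add(cluster)
        if len(chosen) == n_pick:
            break
    return chosen


# Toy data: eight models spread over six clusters.
models = [
    ("m1", -10.0, 0), ("m2", -9.5, 0), ("m3", -9.0, 1), ("m4", -8.0, 2),
    ("m5", -7.5, 1), ("m6", -7.0, 3), ("m7", -6.0, 4), ("m8", -5.0, 5),
]
print(pick_models(models))  # ['m1', 'm3', 'm4', 'm6', 'm7']
```

Restricting each cluster to one model keeps the five submissions structurally
diverse even when the lowest energies all fall in a single cluster.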


Friesner , 414

number of submitted models: 150

Comparative Modeling using a Combination of Threading and Restrained Energy Minimization

An, Y., Eyrich, V.A., Gunn, J., Pincus, D.L., Standley, D.M., Friesner, R.A.

Columbia University
email:
rich@chem.columbia.edu


We carried out comparative modeling in cases where an obvious homologue could
be identified using our fold recognition techniques. Based on the alignment,
we imposed geometrical constraints from the template in appropriate locations.
Some regions were unconstrained (e.g. insertions or deletions) and the final
geometries were determined in the course of running a tertiary folding
simulation. When different alignments or templates were possible, the tertiary
folding energy, similarity to the template, and template function were used to
select the structure that was finally submitted.

Following is a brief description of the core technology used to identify
homologues, build alignments, and carry out simulations:

A. Core Technologies

(1) Secondary structure prediction: predictions of the target sequence were
obtained from four public servers: PSIPRED, JPRED, SSPRO, and PHD.

(2) Alignment of the target sequence to sequences in the PDB: a dynamic
programming algorithm incorporating predicted secondary structure from step
(1) was used to produce a short list of proteins whose sequence
identity/secondary structure pattern indicated that they were plausible
candidates for remote homologues to the target. The scoring function used in
the alignment was optimized against a training-set from the PDB. All four
secondary structure prediction methods were used in this step, as well as
combinations of segments from various methods; the number of such combinations
depended upon the variability in the secondary structure prediction results.
The PDB sequences were decomposed into domains when possible; we also computed
the radius of gyration of the aligned part of the template in order to
eliminate alignments across domains in cases where a domain decomposition was
not conveniently available. In some cases, the effect of truncating the target
sequence at either end was investigated; in others, when a multiple domain
structure of the target was suspected (e.g. due to comments in the
literature), various partitions of the sequence were run independently. The
number of candidate homologues saved from this first stage depended upon the
type of protein. For mixed alpha-beta proteins, the number of sequences in the
PDB fitting the secondary structure pattern was typically rather small, with a
significant degradation in quality of agreement after the first ~50
candidates. For all-alpha and all-beta proteins, the number of reasonable
candidates was often much larger, on the order of 300-500. In some cases these
lists could be substantially truncated on the basis of known protein function
(e.g. carbohydrate binding proteins).  Finally, in many cases it was necessary
to enumerate a significant number of different alignments between the target
and candidate template. This was accomplished on a segmental basis, i.e. by
forcing the pairings of various designated beta-strands and alpha-helices.

(3) Constraint generation: Constraints were generated from the high-ranking
alignments. C_alpha atoms in the target sequence were constrained to the
corresponding template values.

(4) Tertiary folding simulations: The objective was to select the correct
candidate homologue and alignment. The computational details of the tertiary
folding simulations are briefly described as follows:

(a) An off-lattice model containing backbone atoms plus a pseudo atom
representation of the side chain for each amino acid was employed.

(b) The geometrical variables in the simulation were the phi and psi angles in
the loop regions; angles in secondary structural regions were fixed to ideal
values (-57, -47 degrees for alpha helices, -139, 135 degrees for beta
strands).

(c) The potential was a function of the distance between the side chain pseudo
atoms and the identities of the interacting residues. The functional form was
a general cubic spline that allowed great flexibility along with rapid
computation of energies and gradients. In general, hydrophobic-hydrophobic
interactions were attractive, and hydrophilic-hydrophilic interactions were
repulsive, as in the statistical potential of Sippl and coworkers [1].
However, the potential was designed to vary as a function of protein size; we
have found this modification to be essential for obtaining reasonable results
for test cases. The size dependence was implemented by collecting distance
statistics from proteins of a given size-group (the training set). The
potential function was optimized iteratively so as to render the training set
proteins stable (i.e., after local minimization), while maintaining the
smallest energy gap possible between native conformations and their locally
minimized counterparts.

(d) Simulations of protein structures were carried out via a Monte Carlo plus
minimization algorithm [2] along the lines proposed by Li and Scheraga [3],
with a number of modifications to improve efficiency. The Monte Carlo code has
been developed to run in parallel using the MPI protocol over a network of
inexpensive personal computers.
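
The Monte Carlo plus minimization idea in (d) can be illustrated on a toy
problem. The sketch below is not the group's code: it uses a one-dimensional
stand-in energy function, a crude gradient descent as the local minimizer, and
arbitrary values for the kick size and temperature, all of which are
assumptions.

```python
import math
import random


def energy(x):
    """Toy rugged 1-D energy surface standing in for the real potential."""
    return 0.1 * x * x + math.sin(3.0 * x)


def local_minimize(x, step=0.01, iterations=500, h=1e-5):
    """Crude gradient descent using a central-difference derivative."""
    for _ in range(iterations):
        grad = (energy(x + h) - energy(x - h)) / (2.0 * h)
        x -= step * grad
    return x


def monte_carlo_minimization(x0, n_steps=200, kick=1.5, temperature=0.5, seed=0):
    """Perturb, minimize locally, then accept or reject (Metropolis)."""
    rng = random.Random(seed)
    current = local_minimize(x0)
    best = current
    for _ in range(n_steps):
        trial = local_minimize(current + rng.uniform(-kick, kick))
        delta = energy(trial) - energy(current)
        # The Metropolis test compares energies of minimized states only.
        if delta <= 0.0 or rng.random() < math.exp(-delta / temperature):
            current = trial
            if energy(current) < energy(best):
                best = current
    return best


best = monte_carlo_minimization(4.0)
print(round(energy(local_minimize(4.0)), 3), round(energy(best), 3))
```

Because every trial is minimized before the acceptance test, the walk moves
between local minima rather than across the raw energy surface, which is the
main point of the approach.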

Tertiary folding simulations were carried out using the potential above
supplemented with C_alpha restraint sets. The resulting structures were
clustered and ranked according to total energy. The choice of the correct
homologue was made by considering the total energy (including the constraint
term), similarity to the template (as determined by CE [4]), and biological
function of the target.
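
A total energy "including the constraint term" could take a form like the one
sketched below. The harmonic shape of the restraint and the force constant are
assumptions; the abstract says only that C_alpha atoms were constrained to the
corresponding template values and that some regions were left unconstrained.

```python
import math


def restraint_energy(model_ca, template_ca, force_constant=1.0):
    """Assumed harmonic penalty on restrained C_alpha positions.

    model_ca, template_ca: dicts mapping residue index -> (x, y, z).
    Residues missing from template_ca (e.g. insertions) are unrestrained.
    """
    total = 0.0
    for index, target in template_ca.items():
        total += force_constant * math.dist(model_ca[index], target) ** 2
    return total


def total_energy(potential_energy, model_ca, template_ca, force_constant=1.0):
    """Potential energy plus the restraint term."""
    return potential_energy + restraint_energy(
        model_ca, template_ca, force_constant
    )


model = {1: (0.0, 0.0, 0.0), 2: (1.0, 0.0, 0.0), 3: (5.0, 5.0, 5.0)}
template = {1: (0.0, 0.0, 0.0), 2: (1.0, 2.0, 0.0)}  # residue 3 is free
print(restraint_energy(model, template))     # 4.0
print(total_energy(-10.0, model, template))  # -6.0
```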

References:

[1] Casari, G., Sippl, M.J. (1992). Structure-derived hydrophobic potential.
J. Mol. Biol. 224(3), 725-732.

[2] Eyrich, V. A., Standley, D. M. & Friesner, R. A. (1999). Prediction of
protein tertiary structure to low resolution: Performance for a large and
structurally diverse test set. Journal of Molecular Biology 288(4), 725-742.

[3] Li, Z. Q. & Scheraga, H. A. (1987). Monte-Carlo-Minimization Approach to
the Multiple-Minima Problem in Protein Folding. Proceedings of the National
Academy of Sciences of the United States of America 84(19), 6611-6615.

[4] Shindyalov IN, Bourne PE (1998) Protein structure alignment by incremental
combinatorial extension (CE) of the optimal path. Protein Engineering 11(9)
739-747.


Hovmoeller-Zhou , 501

number of submitted models: 5

Torsion angles for protein folding prediction

Structural Chemistry, Stockholm University
email:
svenh@struc.su.se

Secondary structure prediction for proteins is usually done using the 3
categories Helix, Sheet and Random (HSR). An alternative is to use the torsion
angles, as defined in the Ramachandran plot. Here too we get 3 categories, α,
β and other, but there is no one-to-one relation between these and the HSR
categories. Several consecutive amino acids with torsion angles in the area of
the Ramachandran plot typical for α-helices will correspond to HELIX in the
PDB. However, several consecutive amino acids with torsion angles in the β
region are not necessarily classified as SHEET. A straight strand is called a
sheet only if it has a partner in the form of another strand, parallel or
anti-parallel. Single strands, not defined as SHEET in the PDB, are not
uncommon.

When sorting on torsion angles we get just over 50% α, 42% β, 4.5%
left-handed helix and just 2.5% of the amino acids outside these three
regions. Most (70%) of the amino acids with α conformation are found in
helices, while most (55%) of the amino acids with β conformation are found in
the regions called Random coil. Just over 1% of the amino acids in HELIX and
SHEET do not have torsion angles in their expected α or β regions of the
Ramachandran plot. See the Table. A careful check of these residues shows that
most of them are due to mistakes in the PDB data sets. Such mistakes further
complicate protein folding prediction if it is based on the assumption that
the information in the PDB is always correct.

Torsion angles:  alpha   beta   left alpha   other     Sum

HELIX in PDB      36.8    0.8      0.3        0.3     38.2
SHEET in PDB       0.6   18.0      0.1        0.4     19.2
Random coil       13.5   23.3      4.1        1.8     42.6
Sum               50.9   42.1      4.5        2.5    100.0
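
The shares quoted in the text can be recomputed from the table. The short
calculation below copies the table values (percentages of all residues); the
dictionary layout and function name are only for illustration.

```python
# Percentages of all residues, copied from the table above.
table = {
    "HELIX": {"alpha": 36.8, "beta": 0.8, "left": 0.3, "other": 0.3},
    "SHEET": {"alpha": 0.6, "beta": 18.0, "left": 0.1, "other": 0.4},
    "coil": {"alpha": 13.5, "beta": 23.3, "left": 4.1, "other": 1.8},
}


def share_of_region(region, category):
    """Percentage of the residues in a torsion-angle region that belong
    to the given PDB category (HELIX, SHEET or coil)."""
    region_total = sum(row[region] for row in table.values())
    return 100.0 * table[category][region] / region_total


print(round(share_of_region("alpha", "HELIX"), 1))  # 72.3
print(round(share_of_region("beta", "coil"), 1))    # 55.3
```

These come out at about 72% and 55%, in line with the rounded figures of 70%
and 55% given in the text.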

Except for glycine, about 90% or more of the residues of each amino acid are in
the α or β regions. Thus, a prediction of torsion angles is essentially a 2-option
prediction, as opposed to the 3 options of HSR. That facilitates folding
prediction, and gives an additional 3D prediction for the backbone. On the
other hand, as mentioned above, the translation from an α/β/other prediction
to an HSR prediction is not totally straightforward.

We have made a Ramachandran plot for each of the 20 amino acids, based on
all the 150 000 residues in our learning set. We have also made separate plots
for the residues defined as HELIX, SHEET or random coil in the PDB. These
plots show several interesting features. The torsion angles for amino acids in
HELIX are sharply focused at nearly the same value for all amino acids.
However, the residues classified as random coil, yet having torsion angles in
the α-helix region, have a very different distribution of angles; it is
elongated and inclined by 45 degrees to the axes of the Ramachandran plot.

The torsion angles for amino acids in HELIX are sharply focused near φ = -64,
ψ = -38 degrees. This value is nearly the same for all amino acids. The residues
classified as random coil with torsion angles in the α-helix region have an
elongated distribution of angles inclined by 45 degrees to the axes of the
Ramachandran plot. The torsion angles for amino acids in SHEETs are shifted by
about 70 degrees along φ, relative to those in random coils.

We believe these findings are important for protein folding prediction,
but we have not yet fully exploited this information.

We have analyzed single amino acids, pairs, triples and so on, based on both
the HSR and the torsion angle scheme. Our learning set contains 560 different
protein subunits, containing 150 000 amino acids. The testing set contains 30
subunits, none of which are included in the learning set. The most striking
feature is that 88% of all the residues before proline have β conformation. For
most triples, the most common set of torsion angles is ααα. Most of these have
βββ as the second most probable set of torsion angles. The mixed combinations
αβα, ααβ and so on are usually very rare. Gly and Pro are the amino acids that
are most commonly found to break a sequence of ααα or βββ, but this can also
be achieved by the amino acids with polar groups close to the β-carbons: Asn,
Asp, Ser and Thr.

We are predicting protein folding based on a combination of HSR and torsion
angle predictions. In both cases we base our predictions on statistics from
our learning set of 560 subunits in the PDB.

After a preliminary assignment of the amino acids as being in H, S or R, the
program goes over the sequence looking for single H or S or pairs of H. Since
these are not allowed, they will be reconsidered.
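
The check for disallowed runs can be sketched as follows. The sketch only
locates the runs the text calls not allowed (a single H, a single S, or a pair
of H); how they are then "reconsidered" is not described, so nothing is
reassigned here. The function name and the example string are hypothetical.

```python
from itertools import groupby


def disallowed_runs(states):
    """Return (start, length, state) for each run that is not allowed:
    a single H, a single S, or a pair of H."""
    flagged = []
    position = 0
    for state, group in groupby(states):
        length = len(list(group))
        too_short_helix = state == "H" and length <= 2
        lone_strand = state == "S" and length == 1
        if too_short_helix or lone_strand:
            flagged.append((position, length, state))
        position += length
    return flagged


print(disallowed_runs("RRHHHHRSRRHHRSSSRH"))
# [(7, 1, 'S'), (10, 2, 'H'), (17, 1, 'H')]
```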

In summary, we reach very good predictions for helices, especially long ones,
but have more difficulties with the sheets. Part of this difficulty may stem
from the fact that two identical stretches of amino acids (equal sequence and
identical atomic co-ordinates) may in one protein be called a SHEET but in
another be considered random coil, because the latter lacks a partner in
the form of a parallel or anti-parallel strand. We suspect that this problem
of classification may hamper the success of protein folding prediction.


Vajda , 241

number of submitted models: 22

Comparative modeling with multiple models

Jahnavi Prasad, Michael Silberstein, David Gatchell and Sandor Vajda

Boston University
email:
vajda@bu.edu

The basic idea of the method is to generate a large number (preferably up to a
few thousand) of alignments, construct a homology model for each, and rank the
models according to their free energies.

The current implementation of the procedure starts with traditional target
selection using Blast and Psi-Blast. The Domain Profile Analysis developed in
Temple Smith's lab
(http://bmerc-www.bu.edu/bioinformatics/profile_request.html) has also been
consulted. One or (infrequently) several proteins have been selected as
templates for the comparative modeling. In the second step of the algorithm,
we generate multiple alignments between target and template sequences by
varying the alignment parameters (gap-opening, gap-extension, and scoring
matrix) for producing semi-global alignments by standard dynamic programming.
The blosum62 and gonnet matrices were used with gap opening penalty values 5,
6, 7, 8, 9, 10, 12, 14, 17, 20, 25, and gap extension penalty values 0.1, 0.2,
0.3, 0.5, 0.75, 1.0, 1.25, 1.6, 2.0, 2.5, 3, 4, 5, 7, 10. We produced only one
alignment for each set of parameters using a single trace-back path in the
dynamic programming matrix, thus resulting in 330 alignments for each
template-target pair. An alignment was deleted if it was a duplicate or if fewer
than 75% of the target residues were aligned to the template, generally
resulting in 80 to 150 retained alignments.
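
The parameter grid and the filtering step can be sketched as below. The
matrices and penalty values are those listed in the text; the representation
of an alignment as a list of (target index, template index) pairs and the
function name are assumptions, and the dynamic programming itself is omitted.

```python
from itertools import product

matrices = ["blosum62", "gonnet"]
gap_open = [5, 6, 7, 8, 9, 10, 12, 14, 17, 20, 25]
gap_extend = [0.1, 0.2, 0.3, 0.5, 0.75, 1.0, 1.25, 1.6,
              2.0, 2.5, 3, 4, 5, 7, 10]

# One alignment per parameter set: 2 * 11 * 15 = 330 combinations.
parameter_sets = list(product(matrices, gap_open, gap_extend))
print(len(parameter_sets))  # 330


def filter_alignments(alignments, target_length, min_coverage=0.75):
    """Drop duplicate alignments and those covering too little of the target.

    alignments: list of alignments, each a list of
    (target_index, template_index) pairs for the aligned residues.
    """
    seen = set()
    kept = []
    for alignment in alignments:
        key = tuple(alignment)
        if key in seen or len(alignment) < min_coverage * target_length:
            continue
        seen.add(key)
        kept.append(alignment)
    return kept


# Toy target of 4 residues: one duplicate and one low-coverage alignment.
toy = [
    [(0, 0), (1, 1), (2, 2)],
    [(0, 0), (1, 1), (2, 2)],
    [(0, 0), (1, 1)],
    [(0, 1), (1, 2), (2, 3), (3, 4)],
]
print(len(filter_alignments(toy, target_length=4)))  # 2
```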

In the third step, all alignments are used for model construction via the
MODELLER program developed by Sali and co-workers [1]. The resulting models
were minimized for 200 steps using the Charmm potential [2], and ranked by