CCP4 Interface: Molecular Replacement Module

	CCP4i: Graphical User Interface
	Molecular Replacement Module

Module Overview

Cell Content Analysis

Analyse Data (Patterson)

Specialist Help is available on: MOLREP Theory - theory of Molecular Replacement in general, and the MOLREP program in particular

Module Overview

This module provides interfaces to three CCP4 Molecular Replacement programs: Beast, Molrep and Amore. The list of tasks begins, however, with two analysis tasks which it may be useful to run. Cell Content Analysis gives an estimate of the number of molecules in the asymmetric unit, and thus the number of molecules that the MR procedure should search for. Analyse Data for MR calculates a native Patterson to check for pseudo-translation, and in addition compares the Wilson B factor with the average B factor of the search model.

There are then tasks for running Beast, Molrep and Amore. These programs are alternatives to one another, but you may need to try more than one to obtain a clear solution. The tasks for Amore are based around a Model Database containing one or more search models together with the files and solutions generated by Amore for each model. The Model Database is accessed via the Amore Model Database task, launched from the task list or from other Amore tasks.

The layout of each task window, i.e. the number of folders present, and whether these folders are open or closed by default, depends on the choices made in the Protocol folder of the task (see Introduction). Although certain folders are closed by default, there are specific reasons why you should or may want to look at them. These reasons are described in the Task Window Layout sections below.

Cell Content Analysis

The program MATTHEWS_COEF, through the Cell Content Analysis task, can provide an estimate of the number of molecules in the asymmetric unit, and thus the number of molecules to search for in MR. It requires an MTZ file, and a fairly accurate estimate of the molecular weight of the protein (which can be obtained from the program RWCONTENTS, for example). The Matthews coefficient is usually between 1.66 and 4.0 Å³/Da, corresponding to protein contents of 75% to 30%, but proteins with higher solvent contents will give higher values. This analysis may not be conclusive in determining the number of molecules in the asymmetric unit.

Matthews - Task Window Layout

The solvent content information will appear in the 'solvent analysis' field upon clicking Run Now. The Interface displays a table with values of the percentage of solvent in the unit cell, as well as the corresponding Matthews coefficient, for a range of numbers of molecules in the asymmetric unit, from which the user reads the correct one.
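The table described above can be approximated by hand. The following is a minimal sketch, not the MATTHEWS_COEF implementation: it assumes the standard Matthews relation Vm = V / (Z * n * MW) and the common approximation solvent fraction = 1 - 1.23 / Vm, which presumes a typical protein density. The function names, the example cell and the molecular weight are illustrative only.

```python
import math


def cell_volume(a, b, c, alpha, beta, gamma):
    """Unit cell volume in cubic Angstroms from cell lengths (A) and angles (degrees)."""
    ca, cb, cg = (math.cos(math.radians(x)) for x in (alpha, beta, gamma))
    return a * b * c * math.sqrt(1 - ca * ca - cb * cb - cg * cg + 2 * ca * cb * cg)


def matthews_table(volume, n_symops, mol_weight, max_mols=6):
    """Return (n, Vm, percent solvent) for n = 1..max_mols molecules per asymmetric unit.

    volume     -- unit cell volume in cubic Angstroms
    n_symops   -- number of symmetry operators of the space group (Z)
    mol_weight -- molecular weight of one molecule in Daltons
    """
    rows = []
    for n in range(1, max_mols + 1):
        vm = volume / (n_symops * n * mol_weight)
        # Assumed approximation: 1.23 corresponds to a typical protein density.
        solvent = 100.0 * (1.0 - 1.23 / vm)
        rows.append((n, vm, solvent))
    return rows


if __name__ == "__main__":
    # Hypothetical example: orthorhombic cell, 4 symmetry operators, 25 kDa protein.
    volume = cell_volume(50.0, 60.0, 70.0, 90.0, 90.0, 90.0)
    for n, vm, solvent in matthews_table(volume, n_symops=4, mol_weight=25000.0, max_mols=3):
        print(f"{n:2d}  Vm = {vm:5.2f}  solvent = {solvent:6.1f}%")
```

In this hypothetical case only n = 1 gives a Matthews coefficient inside the usual range; larger n gives a negative solvent content, which rules those values out.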

See Stage 1 of the MR tutorial.

See program documentation: MATTHEWS_COEF.

Analyse Data for MR

Before running a molecular replacement program it is advisable to look a little at the data. This task will:

Check native Patterson map for large peaks which indicate pseudo-translation.

The task will run the FFT program to generate a native Patterson map and then run PEAKMAX to list the largest peaks to the .log file. You should be concerned if any non-origin peak is higher than, say, 0.15 of the origin peak height, as this might suggest there is pseudo-translation present in the unit cell. The Molrep program has an option to handle pseudo-translation.

Compare the Wilson B factor with the average model B factor.

The task will run the Wilson program to determine the Wilson B factor from the data, and the program BAVERAGE to determine the average model B factor. The difference between these values is BADD which can be used in the Amore interface.
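The two checks above amount to simple arithmetic on numbers read from the log files. The sketch below is illustrative only: the peak list is assumed to have been copied by hand from the PEAKMAX output, and the sign convention for BADD (Wilson B minus average model B) is an assumption, since the text only states that BADD is the difference between the two values.

```python
def suspicious_peaks(peaks, origin_height, threshold=0.15):
    """Return (position, fraction) for non-origin peaks above threshold * origin height.

    peaks -- list of (position, height) pairs for non-origin Patterson peaks
    """
    flagged = []
    for position, height in peaks:
        fraction = height / origin_height
        if fraction > threshold:
            flagged.append((position, fraction))
    return flagged


def badd(wilson_b, model_mean_b):
    """Difference between the Wilson B factor and the average model B factor.

    Assumed sign convention: Wilson B minus model B.
    """
    return wilson_b - model_mean_b


if __name__ == "__main__":
    # Hypothetical peak heights, with an origin peak height of 100.
    peaks = [((0.5, 0.0, 0.5), 30.0), ((0.1, 0.2, 0.3), 5.0)]
    print(suspicious_peaks(peaks, origin_height=100.0))
    print(badd(wilson_b=35.0, model_mean_b=22.5))
```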

See Stage 2 of the MR tutorial.

See program documentation: FFT, WILSON, BAVERAGE

Beast - likelihood-based MR

This task provides an interface to Randy Read's maximum likelihood MR program Beast. Note that this program can be quite slow, but has succeeded in a number of cases where traditional programs have failed.

See program documentation: BEAST

MOLREP - auto MR

This is a fully automated molecular replacement program which will attempt to find the number of molecules expected in the asymmetric unit as entered by the user. A PDB file for the best solution is output. It is also possible to run the program for just the rotation or translation function; the rotation solutions are output to a file (given the extension .mr) and this can be used as input to a subsequent run of the translation function. When the .mr file is used as input, any lines beginning with a "#" character are ignored. When the .mr file is viewed within CCP4i, clicking on any line in the file will add or remove a "#" from the beginning of the line (see also Edit MR Solution File). Note that the format of this file is different from the format of .mr files output by AMoRe.
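The click-to-toggle behaviour described above is easy to reproduce outside the Interface. This is a minimal sketch assuming only what the text states, namely that a leading "#" marks a line to be ignored; the function name is hypothetical and the line contents are placeholders, not the real .mr format.

```python
def toggle_comment(lines, index):
    """Return a copy of lines with the leading '#' of lines[index] added or removed."""
    result = list(lines)
    line = result[index]
    result[index] = line[1:] if line.startswith("#") else "#" + line
    return result


if __name__ == "__main__":
    # Placeholder contents; a real .mr file holds solution records.
    lines = ["first solution record", "#second solution record"]
    lines = toggle_comment(lines, 0)  # comment out the first record
    lines = toggle_comment(lines, 1)  # re-enable the second record
    print(lines)
```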

Molrep has other functions: a self rotation function, a search for a model in a phased map, and an approach to fitting two molecules.

For the background theory of MOLREP, see MOLREP theory.

See program documentation: MOLREP

See Stage 3 of the MR tutorial.

AMoRe Overview - How to Run a Simple AMoRe Job

The Molecular Replacement module uses a database to store information on the trial models used in a project. If you use only one trial model, this may seem unnecessarily complicated, but if you need to use multiple trial models you will appreciate the database.

To run a simple AMoRe job, click on the AMoRe task in the module menu. The task window for AMoRe and a task window which interfaces to the AMoRe model database will appear. You will need to enter the following information on your trial model in the database:

a unique one-word name (this will be used to generate filenames, so choose something short and distinctive)
the name of the coordinate file containing the trial model

All other filenames for intermediate files will be generated automatically from the model name.

It is possible to use a map as the trial structure in AMoRe - see below.

In the AMoRe task window, the protocol section has two menus for you to select a trial model from the database (if there is only one model in the database this will be set automatically) and the mode of running AMoRe. Usually you should keep the default auto-AMoRe for a start-to-finish run of the program. You will need to select the MTZ file containing the experimental data.

The first step in an AMoRe run is to move the trial coordinates to an optimal position centered on the origin; these coordinates are saved to a file. AMoRe then reports its best solutions in terms of transformations (Euler angles and translations) to be applied to the optimised, origin shifted, coordinates. These solutions are listed in the log file but also extracted into solution files (with file extension .mr). Solution files will be created for each of the rotation, translation and fitting stages of the AMoRe run, and the final file will have a name projectname_jobid_fit_model.mr where projectname is the name of the project, jobid is the job number and model is the name of the trial model.

The Molecular Replacement task Build AMoRe Output Model will apply the transformations stored in a solution file to the optimised coordinates and will also do some simple checks on the quality of a model - checking whether there is overlap between molecules in adjacent asymmetric units. You will need to select the solution file output by your AMoRe run.

Trial Models and Molecules

In AMoRe, the words 'model' and 'molecule' are used with very specific meanings.

In the context of AMoRe a 'molecule' is the structural element which can be treated as a rigid body in molecular replacement. It may be anything from a structural domain which is not even a whole chain, to a multi-chain protein.

A trial 'model' is the initial set of test coordinates which are taken from another solved crystal structure or NMR structure. This coordinate set may have been processed in some way to make it more suitable for use in molecular replacement - for example loop regions could have been excised. It is possible (and may be advisable) to generate multiple models from one input coordinate set by different processing (for example different degrees of severity in excising loop regions or applying some homology modelling to try to make the model more like the expected structure in the experimental data).

In the simplest case Molecular Replacement can be used to find one rotation/translation solution to map a model onto the experimental structure. If this is your case, you can skip some of the following discussion and you can ignore the part of the Interface referring to 'known' molecule(s). You should (at least for a first try) use the auto-AMoRe option which will run through all the AMoRe functions automatically.

The non-simple cases are:

If a crystal has non-crystallographic symmetry, there will be multiple copies of the basic structural element (i.e. 'molecule') in the asymmetric unit which can all be found using one model.
If the experimental structure contains multiple proteins or it is a multi-domain protein, it may be necessary to use models based on coordinate sets from two or more different protein structures.
Even if you have a coordinate set for a structure very similar to that expected in your experimental structure, it may be necessary to split a multi-domain protein into two or more models to allow for the possibility of different inter-domain relations between crystal structures.

Create Input SFs from Model

As an alternative to inputting coordinates for model structures to AMoRe, some crystallographers prefer to input a map calculated from the coordinates, usually a sharpened E-map (that is, a map generated using the normalised structure amplitudes rather than the SFs). The MR 'Create Input SFs from Model' task will create an MTZ file containing the appropriate Es or SFs and phis for a map to input to AMoRe. The name and coordinate file for the model must have been entered in the 'AMoRe Model Database' before running this task.

The task needs to know the cell parameters and resolution range - these can be read from an MTZ file such as the file containing the experimental data. This MTZ file is not used in any other way by the task.

AMoRe Model Database

To simplify running AMoRe, the Interface keeps a database of the models used for the molecular replacement. These models may be either variants of the same initial coordinate set which have been processed differently (for example with different loop regions excised) or they may be from different coordinate sets in cases where the experimental structure is made up of more than one 'molecule'. The contents of the database are displayed in a separate window which is opened when you select the AMoRe task. The key data you must input to the database is a name for each model and the name of either the coordinate file containing the model or an MTZ file containing SFs or Es for the model. The model name you enter will be used in menus and as part of filenames, so keep it short and distinct.

When AMoRe is run, some information will be automatically extracted from the log file and loaded into the database. This is visible in the 'AMoRe Details' folder. The information stored here currently is the name of transformed coordinate files, SF table files and details from the initial Tabling function (TABFUN) which are used by subsequent AMoRe functions.

The Rotation Function Radius and Model Cell

Probably the most important parameter in an AMoRe run is the radius used by the rotation function. There is debate about the best value to use and for tricky problems it is always worth trying a range of values. The Interface script will automatically generate a reasonable value for the radius from the parameters output by the TABFUN stage. The Tabling function moves the trial model to an optimal position and orientation and reports to the log file the size of an enclosing box for the model. The Interface calculates the search radius as:

the minimum of

0.75 * (the minimum axis length of the model enclosing box)

and

0.5 * (the minimum crystal cell axis)

This search radius is saved in the MR Database for this model and will be used by default in future AMoRe runs. It is also used in the calculation of the model cell, in the case of an auto-AMoRe run, as follows:

a_model = a_{tabfun-minimal-box} + radius + 5.0
b_model = b_{tabfun-minimal-box} + radius + 5.0
c_model = c_{tabfun-minimal-box} + radius + 5.0

where radius is the search radius as determined above. tabfun-minimal-box is the Minimal Box output in the logfile of the TABFUN stage. 5.0 is chosen as a nominal value for the resolution.
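The radius and model cell rules above can be written out as a short calculation. This is a sketch of the stated formulae only, not the Interface script itself; the function names and the example box and cell dimensions are hypothetical.

```python
def search_radius(minimal_box, crystal_cell_axes):
    """Rotation function radius: the smaller of 0.75 * shortest box edge
    and 0.5 * shortest crystal cell axis (all lengths in Angstroms)."""
    return min(0.75 * min(minimal_box), 0.5 * min(crystal_cell_axes))


def model_cell(minimal_box, radius, resolution=5.0):
    """Model cell edges: TABFUN minimal box edge + radius + nominal resolution."""
    return tuple(edge + radius + resolution for edge in minimal_box)


if __name__ == "__main__":
    box = (40.0, 50.0, 60.0)    # hypothetical TABFUN minimal box
    cell = (70.0, 80.0, 90.0)   # hypothetical crystal cell axes
    radius = search_radius(box, cell)
    print(radius)                   # 30.0
    print(model_cell(box, radius))  # (75.0, 85.0, 95.0)
```

For tricky problems the same functions make it straightforward to tabulate the model cell for a range of trial radii.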

Solution Files

Molecular Replacement (.mr) Files

When AMoRe performs the rotation function, translation function or rigid body refinement (fitting function), it outputs the final result to the log file in lines which begin with the keyword SOLUTION (or some recognisable variation on it). The key data on the line are three Euler angles which are the rotation part of the solution and three fractional shifts which are the translation part of the solution.

It is often necessary to recycle these solutions as input into the next stage in AMoRe or to use them to generate well-positioned models. To simplify the recycling, the Interface automatically extracts the SOLUTION lines from the log file and saves them to a 'Solution File' which is put in the user's project directory and has a name like projectname_jobid_mode_model.mr where projectname is the name of the project, jobid is the job number, mode is either rot, tran or fit, depending on which stage these are the solutions for, and model is the name of the model this solution applies to. These MR files are analogous to the HA files of the Experimental Phasing module.
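The naming scheme can be illustrated with a small helper. This sketch only reproduces the pattern described above; the function name and the example project and model names are made up.

```python
def solution_filename(project, job_id, mode, model):
    """Build a solution file name of the form projectname_jobid_mode_model.mr."""
    if mode not in ("rot", "tran", "fit"):
        raise ValueError("mode must be one of 'rot', 'tran' or 'fit'")
    return f"{project}_{job_id}_{mode}_{model}.mr"


if __name__ == "__main__":
    # Hypothetical project "lysozyme", job 12, model "modelA".
    print(solution_filename("lysozyme", 12, "tran", "modelA"))  # lysozyme_12_tran_modelA.mr
```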

The solution file from the translation function will also include the alternative, lower scoring, translation solutions which are usually given the label SOLUT_1, SOLUT_2 in the log file. These solutions will be 'commented out' in the solution file which means that the lines containing these solutions will begin with a '#' character and they will not be read or used by default.

For subsequent AMoRe runs you should select which solution files to use as input and these will be edited into the input command file (any specification of the model number or the FIX keyword will be handled automatically). You do NOT need to edit the solution file in any way.

If you do not want to use all of the solutions in a solution file, then some lines from the file can be 'commented out' - that is a '#' character is placed at the beginning of the line so the rest of the line is then ignored by any program reading the file. The easiest way to edit a solution file is using the 'Edit MR Solution File' task. This task displays the contents of a solution file and you just need to click on a line to either add or remove the # at the beginning of the line.

You can access the 'Edit MR Solution File' task in the conventional way from the task menu on the main CCP4i window or you can click on the 'View' button on the file selection line for a solution file.

By creating and using solution files automatically, the Interface simplifies running AMoRe and reduces the risk of errors, but there may be one or two tricks that you can do running AMoRe conventionally with scripts which you cannot do easily with the Interface. There are a couple of ways to work round this:

Create or edit the solution file you need external to the Interface
Use the 'Run and View Com File' option to look at and edit the AMoRe command script

Please let us know if you find any serious limitation which is liable to affect other users, and we will try to fix it.

AMoRe Functions

The AMoRe process is split into functions which are described in the AMoRe program documentation but they are described here briefly from the point of view of someone using the Interface:

Sorting and Tabling Functions (SORTFUN and TABFUN)

The Interface treats the AMoRe Sorting and Tabling functions together. The purpose of these functions is to process the input model and experimental data into a form which is most convenient for AMoRe, which is a packed hkl file of the experimental data and SF table file of the inverse Fourier transform of the model.

The Sorting function produces a packed hkl file from structure factors (read from an input MTZ) and can also produce a SF table file. The Tabling function will produce a SF table file from coordinates. So the input experimental data is processed by the Sorting function and the model data is processed by either the Sorting or Tabling function depending on whether it is in the form of a map or atom coordinates. The choice of processing step is handled automatically by the Interface.

When the Interface runs AMoRe, it will automatically run the Sorting and Tabling if the necessary SF table files do not exist but, provided you do not delete these files, the script will skip these functions for all subsequent AMoRe runs. This saves some time but beware the SF table files are large.
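The skip logic described above is a simple existence check. The sketch below illustrates the idea and is not the Interface script; the function name is hypothetical, and the list of required files would come from the Model Database.

```python
import os


def needs_sorting_and_tabling(required_files, exists=os.path.isfile):
    """Return True if any required packed hkl or SF table file is missing.

    exists -- predicate used to test for a file; replaceable for dry runs.
    """
    return not all(exists(path) for path in required_files)


if __name__ == "__main__":
    # Hypothetical file names; real names are generated from the model name.
    files = ["example.hkl", "example.tab"]
    if needs_sorting_and_tabling(files):
        print("Sorting and Tabling would be run")
    else:
        print("Sorting and Tabling would be skipped")
```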

The Tabling function also moves the input coordinates to an optimal position centered on the origin and these optimised coordinates are saved in an output coordinate file. All the subsequent solutions output by AMoRe are transformations which should be applied to these optimised coordinates. The AMoRe interface has a 'get origin shifted model' option to recreate the optimised model coordinate file if you inadvertently delete it.

Rotation Function (ROTFUN)

The rotation function is applied to one input model which is represented by an SF table file. The rotation function solutions are a list of rotations (no translation component) which are written to the log file and also to a solution file projectname_jobid_rot_model.mr. There will normally be multiple solutions and all of these solutions should be tested with the translation function.

Translation Function (TRAFUN)

The translation function is applied to one input model which is represented by an SF table file and rotation function solution(s) for the SAME model.

The output from the translation function is a list of transformations with both rotation and translation components - the rotation component is carried over, without change, from the rotation function solution. The solutions are extracted from the log file to a solution file called projectname_jobid_tran_model.mr which will list the one 'best' translation solution for each input rotation solution. It will also list, 'commented out', the alternative, poorer, solutions.

Rigid Body Refine Fitting Function (FITFUN)

The FITFUN stage will refine the rotation and translation solution for one or more molecules simultaneously. The input is usually the solutions from the translation function.

If you have only one molecule in your asymmetric unit, the solution file from the translation stage should be input into the refinement. Each of the solutions from the translation function will be refined in turn and output to a final solution file projectname_jobid_fit_model.mr.

Solving Structures with Multiple Molecules in the Asymmetric Unit

The usual procedure for solving an experimental structure containing more than one molecule, is to try to find a good solution for one molecule and then treat it as 'known' while you try to find the solution for the next molecule. Of course there may be more than one candidate for the solution of the 'known' molecule in which case the procedure will have to be repeated for all candidates.

If you have already determined the positions of some 'known' molecule(s) within your crystal then, in the translation function, they should be specified and they will be treated as fixed by the translation function. For each 'known' molecule you must specify the name of the model and a solution file containing both rotation and translation solutions (i.e. the solution file must be from the translation function). Only one solution will be taken from each solution file. The first solution not commented out in the file will be used. If you know the position of two, or more, molecules based on the same model, you should specify two or more solution files for the model.

Rigid body refinement (FITFUN) is applied to one or more input model(s) for which initial rotation and translation solutions must be specified. This function will refine those input solutions. To simplify the interface, the molecules are considered to be one 'test' molecule for which you can specify multiple alternative solutions and one or more 'known' molecules for which you can only specify one input solution. This is only a convention of the Interface; within AMoRe the refinement treats 'known' and 'test' molecules identically - they are refined simultaneously. If there is more than one solution for the 'test' molecule defined in the solution file, AMoRe will do multiple refinement runs. The starting position of the 'test' molecule will differ for each refinement run but the starting position of the 'known' molecules will be the same for all runs. The final, refined position of the 'known' molecules will almost certainly be different for each run.

The interface to specify the 'known' molecules is identical to that for the translation function. For each selected solution file, one solution will be read from the file. For the 'test' molecule all uncommented solutions will be read from the solution file. The output from the fitting function is just one solution file projectname_jobid_fit_model.mr where model is the name of the test model.
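The selection rules for 'known' and 'test' molecules can be summarised in a few lines. This is a sketch of the rules as stated here, with hypothetical function names and placeholder line contents; it does not parse the actual solution record format.

```python
def uncommented(lines):
    """Return the non-blank lines that do not begin with '#'."""
    return [ln for ln in lines if ln.strip() and not ln.startswith("#")]


def known_solution(lines):
    """A 'known' molecule takes only the first uncommented solution in its file."""
    active = uncommented(lines)
    if not active:
        raise ValueError("no uncommented solution in file")
    return active[0]


def candidate_solutions(lines):
    """The 'test' molecule takes every uncommented solution in its file."""
    return uncommented(lines)


if __name__ == "__main__":
    # Placeholder records standing in for the contents of a .mr file.
    mr_lines = ["#CCP4I SCRIPT SOL fit", "#record 1", "record 2", "record 3"]
    print(known_solution(mr_lines))
    print(candidate_solutions(mr_lines))
```

To make a particular solution the 'known' one, comment out every solution that precedes it in the file.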

It is possible to use auto-AMoRe, which will run the rotation, translation and fitting functions automatically, for structures with multiple molecules. The auto-AMoRe will attempt to find a solution for one model. If you already have one or more 'known' molecules, you should enter them in the interface.

Beware: if you have a case of non-crystallographic symmetry (NCS) and 'know' one or more solutions, these solutions are liable to be found again.

AMoRe Memory Allocation

AMoRe requires large amounts of memory to hold the maps in core. If your version of AMoRe is not built with large enough default arrays, the AMoRe log file reports that there is insufficient memory (though not in a very helpful fashion!). You should open the 'Memory Allocation' folder at the bottom of the AMoRe window and enter (some guess at) the appropriate memory allocation. Alternatively you can enter the parameters in the 'Memory Allocation' folder in the MR Database window and they will be used to update the parameters in the current AMoRe window and saved and used for all future AMoRe runs. See also Memory Allocation in the AMoRe program documentation.

Build AMoRe Output Model

AMoRe will only output the transformations which need to be applied to the initial coordinates, but will not generate a model with the transformations applied. The 'Build AMoRe Output Model' task will generate a coordinate file with the input model(s) transformed to best fit in the experimental model.

The input to this task is a solution file from the AMoRe fitting function. The solution file will contain a list of the models used in the fitting function. These are listed on the line beginning:

#CCP4I SCRIPT SOL fit

The Interface will look up the name of the coordinate file for the model in the database and these will be shown in the task window so you should not need to enter them. The task will put appropriate cell and symmetry information for the experimental structure into the coordinate file. The easiest way to provide this information is to give the name of the experimental data MTZ file from which the parameters can be extracted. The different 'molecules' in the structure are identified by different chain names: A, B, C, etc.

This model is useful for testing the quality of the packing for the solution. This task will run the DISTANG program to list bad contacts between 'molecules'. You should look at the output log file for a listing of contacts.

See program documentation: AMoRe, DISTANG

Edit Protein Structure / Convert Protein Sequence - MODELLER

This task uses a non-CCP4 program, MODELLER, which is available from Andrej Sali (see Sali Lab at The Rockefeller University); see also the installation notes on MODELLER. Note that the program is Unix/Linux specific.

The input to this program is the structure of one or more homologs and the sequence of the protein for which you require a structure. MODELLER can produce a model which is, as closely as possible, identical to the input structure with changed residues generated with geometrically reasonable coordinates but this structure is liable to be energetically unreasonable due to close contacts. MODELLER can also refine this structure with restraints which aim to keep the structure close to the input homolog structures. Where homolog and model sequence are similar the structures are liable to remain closely similar but regions of low homology, in particular loops, can change significantly.

If you are doing molecular replacement you could use homology modelling in two different ways:

To generate one or more models which are input to the molecular replacement program. In this case you will probably want to delete uncertain regions of loops or possibly side chains. In cases where molecular replacement is not going well you could also try generating multiple refined models (which will have some variation on the original homolog model) and input each of these to the molecular replacement program.
To convert the structure to the new sequence after a molecular replacement solution has been found. This should save time on work usually done using a graphics program. In this case it is better not to refine the model.

Sequence Alignment

The quality of any output structure is hugely dependent on the quality of the alignment. MODELLER can do the necessary sequence alignment of the sequence and homolog structures but you are strongly recommended to review and possibly amend the alignment produced. You should also be sceptical of the exact sequence alignment output by any sequence database search. The database search uses protocols designed for speed rather than accuracy in low homology regions. An experienced crystallographer looking at the homolog structure with a graphics program will probably make a better assessment of sequence alignment.

The alignment file format used by MODELLER is not particularly simple (see MODELLER documentation on Alignment File Format) and it is probably easiest to run the CCP4i task with the sequence file and homolog structures as input and let this generate the alignment file (extension .ali) which you can then edit and use as input if necessary.

CCP4i expects the sequence to be input in a simple file with one letter amino acid code. It does not expect any extra titles or comments - beware if you have these in the file then they may be interpreted as sequence code. The line length is not fixed and any spaces or characters outside of the range A-Z will be ignored so it should be possible to cut and paste a sequence into a file without necessarily removing all gaps or extra characters.
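The filtering rule described above can be mimicked to preview how a pasted sequence will be read. This sketch follows the stated rule literally (keep only characters in the range A-Z) and is not the CCP4i code; the function name is hypothetical.

```python
def clean_sequence(text):
    """Keep only the characters A-Z; spaces, digits, gaps and newlines are dropped."""
    return "".join(ch for ch in text if "A" <= ch <= "Z")


if __name__ == "__main__":
    pasted = "MKT AYI-AKQ\nRQ1ISF"
    print(clean_sequence(pasted))            # MKTAYIAKQRQISF
    # A title line is not recognised as such: its capital letters leak into the sequence.
    print(clean_sequence(">Lysozyme\nMKT"))  # LMKT
```

The second call shows why titles and comments should be removed before the file is used.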

Building Models

The CCP4i interface has the option to produce a model which has no refinement, fast refinement or full refinement. Only in the latter case is there an option to generate more than one model. Beware that after full refinement the position of the model may have drifted from the position of the input homolog structure.

After refinement a graph file (extension .graph) is produced which contains a plot of restraint violations versus residue number; this is MODELLER's assessment of the quality of the model. Large values of restraint violation are bad; they usually correspond to regions of insertions or deletions in the alignment or significant differences in the sequences.

Post-Processing of the MODELLER Model

After refinement MODELLER puts the restraint violation in the B value column of the output PDB file. The CCP4i script will, by default, replace these with B values that the user can set in the task interface.

The CCP4i script can do some post processing of the output MODELLER model to edit either

poor regions of the model as determined by the restraint violation parameter output by MODELLER
side chains of residues which have been mutated

These regions can be either deleted (the mutated residues being converted to glycine or alanine) or the occupancies can be set to zero.
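The zero-occupancy option can be illustrated on PDB-format atom records. The sketch below is not the CCP4i script: it assumes the standard fixed-column PDB layout (occupancy in columns 55-60, B value in columns 61-66), works per atom rather than per residue, and uses a hypothetical violation cutoff and replacement B value. The helper atom_line exists only to build a correctly aligned example record.

```python
def atom_line(serial, name, resname, chain, resseq, x, y, z, occ, b):
    """Build a fixed-column ATOM record (columns 1-66) for demonstration purposes."""
    return (f"ATOM  {serial:5d} {name:<4s} {resname:>3s} {chain}{resseq:4d}    "
            f"{x:8.3f}{y:8.3f}{z:8.3f}{occ:6.2f}{b:6.2f}")


def postprocess(pdb_lines, violation_cutoff, new_b=20.0):
    """Zero the occupancy of atoms whose B column (restraint violation) exceeds
    the cutoff, and replace every B column value with new_b."""
    out = []
    for line in pdb_lines:
        if line.startswith(("ATOM", "HETATM")) and len(line) >= 66:
            violation = float(line[60:66])   # MODELLER restraint violation
            occupancy = float(line[54:60])
            if violation > violation_cutoff:
                occupancy = 0.0
            line = f"{line[:54]}{occupancy:6.2f}{new_b:6.2f}{line[66:]}"
        out.append(line)
    return out


if __name__ == "__main__":
    # Hypothetical atoms: the second has a large restraint violation.
    records = [
        atom_line(1, "N", "MET", "A", 1, 27.340, 24.430, 2.614, 1.0, 1.20),
        atom_line(2, "CA", "MET", "A", 1, 26.266, 25.413, 2.842, 1.0, 9.67),
    ]
    for record in postprocess(records, violation_cutoff=5.0):
        print(record)
```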

See also $CDOC/Mol_repl_itickle_tut.bath.ps, and Molecular Replacement (Birkbeck)