TRUNCATE (CCP4: Supported Program)
NAME
truncate - obtain structure factor amplitudes using Truncate procedure and/or generate useful intensity statistics
truncate hklin foo_in.mtz hklout foo_out.mtz
[ plot foo.plt ]
The standard use of the program is to read a file of averaged intensities (output from SCALA, SCALEPACK2MTZ or DTREK2MTZ) and write a file containing mean amplitudes and the original intensities. If anomalous data is present then F(+), F(-), with the anomalous difference, plus I(+) and I(-) are also written out. The amplitudes are put on an approximate absolute scale using the scale factor taken from a Wilson plot.
There are two ways in TRUNCATE to calculate the amplitudes from the intensities. The simplest is just to take the square root of the intensities, setting any negative ones to zero (keyword TRUNCATE NO). Alternatively, the "truncate" procedure (keyword TRUNCATE YES, the default) calculates a best estimate of F from I, sd(I), and the distribution of intensities in resolution shells (see below). This has the effect of forcing all negative observations to be positive, and inflating the weakest reflections (less than about 3 sd), because an observation significantly smaller than the average intensity is likely to be underestimated. See reference below.
This program can be used even if the "truncate" procedure is not desired, since it produces some useful statistics on intensity distributions. These can indicate problems with the data, for instance if the data are extremely anisotropic (see the FALLOFF keyword) or likely to be twinned. See the cumulative distribution plot, which becomes sigmoidal for a perfect twin, and the moments of I (or E or z), which differ between twinned and untwinned data.
If the input specified on the LABIN line includes an assignment for F, then no output will be generated. It is most undesirable to TRUNCATE a set of data where the intensities have already been modified to generate amplitudes.
The general formula for the expected moments <I^k>/<I>^k of untwinned acentric data is Gamma(k+1):
  if k is an integer, the k-th moment is Gamma(k+1) = k!
  if k = n + 0.5 for integer n (i.e. the (2n+1)th moment of E), the k-th moment is Gamma(n+1.5) = sqrt(PI) * (2n+1)(2n-1)...(1) / 2**(n+1)
  in general, Gamma(k+1) = k Gamma(k)

Table of moments:

                   Acentric                       Centric
           Untwinned    Perfect twin     Untwinned    Perfect twin
  <E>        0.886         0.94            0.798          ?
  <E^3>      1.329         1.175           1.596          ?
  <I^2>      2.0           1.5             3.0            ?
  <I^3>      6.0           3.0            15.0            ?
  <I^4>     24.0           7.5           105.0            ?
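The untwinned columns of the table can be checked with a few lines of Python. This is only a sketch: the function names are made up for illustration, the acentric formula is the Gamma(k+1) expression quoted above, and the centric formula 2**k * Gamma(k+0.5) / sqrt(PI) is the standard result for the centric Wilson distribution, not something taken from the TRUNCATE source.

```python
import math


def moment_of_i(k, centric=False):
    """Expected <I^k>/<I>^k for untwinned data (k may be half-integer)."""
    if centric:
        # Standard centric Wilson result (assumption, not from TRUNCATE itself).
        return 2.0 ** k * math.gamma(k + 0.5) / math.sqrt(math.pi)
    # Acentric: Gamma(k+1), which is k! for integer k.
    return math.gamma(k + 1.0)


def moment_of_e(n, centric=False):
    """Expected <E^n>; since E^2 is the normalised intensity, this is the (n/2)-th moment of I."""
    return moment_of_i(n / 2.0, centric)


if __name__ == "__main__":
    for label, value in [
        ("<E>", moment_of_e(1)),
        ("<E^3>", moment_of_e(3)),
        ("<I^2>", moment_of_i(2)),
        ("<I^3>", moment_of_i(3)),
        ("<I^4>", moment_of_i(4)),
    ]:
        print(f"{label:6s} acentric {value:8.3f}")
```

Moments for twinned data are not covered by this sketch.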
The scale factor estimated from the Wilson plot is applied to the data and allows the data to be put on a (very approximate) absolute scale. This at least gives amplitudes of a sensible magnitude for further calculations. The calculation relies on the number of residues/atoms given by the keywords NRESIDUE/CONTENTS being roughly correct. The program does not, however, apply any temperature factor.
The various data control lines are identified by keywords. Only the first 4 letters of each keyword are necessary. Most keywords are optional.
ANOMALOUS, CELL, CONTENTS, FALLOFF, HEADER, HISTORY, LABIN, LABOUT, NRESIDUE, PLOT, RANGES, RESOLUTION, RSCALE, SCALE, SYMMETRY, TITLE, TRUNCATE, VPAT
In addition, the following optional keywords control the data harvesting functionality:
DESCRIPTION OF KEYWORDS
TITLE
(default TITLE='From Truncate') [OPTIONAL INPUT]
Title to write to output reflection file
RANGES
(Default <nrange>=60) [OPTIONAL INPUT]
<nrange> is the number of resolution bins over the resolution range for the Wilson Plot. <range> is the width of the bins on 4sin**2 theta/lambda**2 and is an alternative to <nrange>. The resolution range used for the Wilson Plot is taken from the input data file, or set with the RESOLUTION keyword. A subset of these bins, covering a resolution range defined with the RSCALE keyword, is used to estimate the scale and B factor.
RESOLUTION
Resolution limits - either 4(sin theta/lambda)**2 or d in Angstroms (either order). Reflections outside these limits will be excluded from all analysis and omitted on output. Defaults are taken from the range of data in the input file (i.e. all data included).
RSCALE
Resolution limits for scaling (either 4(sin theta/lambda)**2 or d). This option allows you to exclude low resolution reflections from the calculation of the scale and B factor. However, all points in the range defined by RESOLUTION are plotted on the Wilson plot. It is probably a good idea to use only high resolution data (beyond 3A, if you have any data there) for the scale and B factor, because the assumptions behind Wilson statistics are invalid for low resolution data. The default high resolution limit is the same as RESOLUTION. The low resolution limit is, by default, set to 4.0A if the high resolution limit is greater than 3.5A.
SCALE
The default is to apply a scale factor from the Wilson plot. If a scale factor is given here, then that is applied instead. This option is useful if relative scaling is already done in SCALA.
If amplitudes rather than intensities are specified on the LABIN line, then the Wilson scale is not applied, and a default scale of 100 is used.
FALLOFF
The first argument of the FALLOFF keyword should be "YES" or "NO", followed optionally by subkeywords controlling the detailed behaviour. The default is "YES", which triggers an analysis of the anisotropy of the data according to the "falloff" procedure contributed by Yorgo Modis. This calculates the falloff of mean F and mean F/sigma values as a function of (sin theta/lambda)**2 in 3 orthogonal directions. An overall falloff of all reflections is also calculated. The 3 mutually perpendicular directions are:
DIRECTION 2 = B*-AXIS
DIRECTION 3 = PERPENDICULAR TO A* AND B*
DIRECTION 1 = PERPENDICULAR TO B* AND DIRECTION 3
If either of the subkeywords PLTX or PLTY is specified, then an output plot file (PLOT) is produced, in which Direction 1 is plotted as a thick line, Direction 2 is plotted as a hollow line with boxes at regular intervals of resolution, and Direction 3 is plotted as a thin line. The resolution range and number of resolution bins used in the calculation can be set by the keywords RESOLUTION and RANGES respectively.
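The three FALLOFF directions can be reconstructed from the unit cell as follows. This is a minimal sketch, not the code used by the program: the function name, the Cartesian convention (a along x, b in the xy plane) and the sign chosen for Direction 1 are all assumptions made for illustration.

```python
import math


def cross(u, v):
    """Vector product of two 3-vectors."""
    return (u[1] * v[2] - u[2] * v[1],
            u[2] * v[0] - u[0] * v[2],
            u[0] * v[1] - u[1] * v[0])


def dot(u, v):
    return sum(x * y for x, y in zip(u, v))


def unit(u):
    """Scale a vector to unit length."""
    length = math.sqrt(dot(u, u))
    return tuple(x / length for x in u)


def falloff_directions(a, b, c, alpha, beta, gamma):
    """Return unit vectors (dir1, dir2, dir3) for a cell in Angstroms and degrees."""
    al, be, ga = (math.radians(x) for x in (alpha, beta, gamma))
    # Real-space axes in a Cartesian frame: a along x, b in the xy plane.
    va = (a, 0.0, 0.0)
    vb = (b * math.cos(ga), b * math.sin(ga), 0.0)
    cx = c * math.cos(be)
    cy = c * (math.cos(al) - math.cos(be) * math.cos(ga)) / math.sin(ga)
    cz = math.sqrt(c * c - cx * cx - cy * cy)
    vc = (cx, cy, cz)
    # Reciprocal axes; the common 1/V factor drops out when normalising.
    a_star = cross(vb, vc)
    b_star = cross(vc, va)
    dir2 = unit(b_star)                 # Direction 2 = b* axis
    dir3 = unit(cross(a_star, b_star))  # Direction 3 = perpendicular to a* and b*
    dir1 = unit(cross(dir2, dir3))      # Direction 1 = perpendicular to b* and Direction 3
    return dir1, dir2, dir3


if __name__ == "__main__":
    for name, vec in zip(("DIRECTION 1", "DIRECTION 2", "DIRECTION 3"),
                         falloff_directions(50.0, 60.0, 70.0, 80.0, 95.0, 110.0)):
        print(name, " ".join(f"{x:8.4f}" for x in vec))
```

For an orthorhombic cell the three directions reduce to the a*, b* and c* axes.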
LABIN
Specify input column labels. [OPTIONAL INPUT]
Truncate takes output from SCALA, SCALEPACK2MTZ or DTREK2MTZ, which generate standard labels. This is the most common usage of the program, in which case LABIN records are not required. If F is assigned, there will be no reflections output. You must assign either IMEAN/SIGIMEAN or F/SIGF.
The program labels defined are: IMEAN SIGIMEAN I(+) SIGI(+) I(-) SIGI(-) F SIGF F(+) SIGF(+) F(-) SIGF(-).
LABOUT
Specify output column labels. [OPTIONAL INPUT]
The labels allowed are F SIGF DANO SIGDANO F(+) SIGF(+) F(-) SIGF(-) IMEAN SIGIMEAN I(+) SIGI(+) I(-) SIGI(-) ISYM. The output labels will default to these unless they are changed by assigning a program label to a user label.
If there is no anomalous data present then only the appropriate columns (F, SIGF, IMEAN and SIGIMEAN) are output. Values may be given in any order and as either Proglabel=Userlabel or Userlabel=Proglabel.
CONTENTS
[ALTERNATIVE COMPULSORY INPUT]
Each atom (element) type is followed by the number of atoms of that type in the asymmetric unit, including hydrogens.
A maximum of 20 atom (element) types is allowed, e.g.
CONTENTS H 746 C 454 N 115 O 139 S 12 ! Must include hydrogens
The average scattering power is calculated from a table of form factors. By default the file $CLIBD/atomsf.lib contains this table of form factors. You can change the table used by assigning 'ATOMSF' to your preferred file. [NOTE the program RWCONTENTS provides the information for this keyword; how many Carbons etc., from a PDB file. Also, it gives an estimate of the number of hydrogens there would be.]
NRESIDUE
[ALTERNATIVE COMPULSORY INPUT]
<Nres> is the number of residues expected in the asymmetric unit.
A very approximate atom composition is calculated:
mean mass of an amino acid = 110
add on one ordered water per amino acid = ca. 128
This is then taken as 5 C + 1.35 N + 1.5 O + 8 H per residue, multiplied by <Nres>, to give the number of atoms in the asymmetric unit.
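The per-residue composition above translates into atom counts as in the following sketch. The function names are hypothetical, and rounding the counts to whole atoms for a CONTENTS-style line is an assumption made here for display; it is not a statement about how the program stores the composition internally.

```python
# Approximate atoms per residue quoted for the NRESIDUE keyword.
ATOMS_PER_RESIDUE = {"H": 8.0, "C": 5.0, "N": 1.35, "O": 1.5}


def approximate_contents(nres):
    """Approximate number of atoms of each element for nres residues."""
    return {element: count * nres for element, count in ATOMS_PER_RESIDUE.items()}


def contents_line(nres):
    """Format the approximate composition as a CONTENTS-style keyword line."""
    parts = [f"{element} {round(count)}"
             for element, count in approximate_contents(nres).items()]
    return "CONTENTS " + " ".join(parts)


if __name__ == "__main__":
    print(contents_line(100))
```

If the real composition is known (for example from RWCONTENTS), giving it directly with CONTENTS is preferable to this approximation.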
VPAT
Volume per atom (default 10).
PLOT
PLOT or PLOT ON produces extra ascii plots in the log output. The default is PLOT OFF.
HEADER
Controls printout from reading file and batch headers.
HISTORY
History strings to be added to history records in output file.
ANOMALOUS
Controls whether anomalous differences are output. Defaults to YES if anomalous information is present in the input file, otherwise NO.
TRUNCATE
If YES (default) the data will be truncated according to the procedure of French and Wilson. If NO the data are not truncated but the structure amplitudes are calculated simply by taking the square root of the intensities. Negative intensities are set to zero.
SYMMETRY
Default is to use symmetry in input HKLIN file. (Normally OMIT this command.)
CELL
The cell dimensions in Angstroms and degrees. The angles default to 90 degrees. If this keyword is omitted then the cell dimensions are taken from the input file. (Normally OMIT this command.)
Data Harvesting keywords
Provided a Project Name and a Dataset Name are specified (either explicitly or from the MTZ file) and provided the NOHARVEST keyword is not given, the program will automatically produce a data harvesting file. This file will be written to
The environment variable $HARVESTHOME defaults to the user's home directory, but could be changed, for example, to a group project directory.
Project Name. In most cases, this will be inherited from the MTZ file.
Dataset Name. In most cases, this will be inherited from the MTZ file.
Set the directory permissions to '700', i.e. read/write/execute for the user only (default '755').
Write the deposit file to the current directory, rather than a subdirectory of $HARVESTHOME. This can be used to send deposit files from speculative runs to the local directory rather than the official project directory, or can be used when the program is being run on a machine without access to the directory $HARVESTHOME.
Maximum width of a row in the deposit file (default 80). <row_length> should be between 80 and 132 characters.
Do not write out a deposit file; default is to do so provided Project and Dataset names are available.
The input files are:
The output files are:
The printer output starts with details of the control data and details of the input MTZ reflection data file. Analyses of the data against resolution are then given and include intensity distributions for comparison with Wilson's theoretical distributions. The following graphs are output (which can be viewed via XLOGGRAPH or LOGGRAPH):
The program TRUNCATE reads a reflection data file of averaged intensities (SCALA, SCALEPACK2MTZ or DTREK2MTZ output) and outputs an MTZ reflection data file containing F and DeltaFanom values. Each input intensity is assumed to be normally distributed with the given standard deviation, i.e. negative observations must have been preserved. The truncation procedure used was devised by French and Wilson and is based on Bayesian statistics. The F's are calculated using the prior knowledge of Wilson's distributions for acentric or centric data (calculated in shells of reciprocal space in a first pass through the data) and the mean intensity and standard deviation values. The F's output are all positive and follow Wilson's distribution. The truncation procedure has little effect on reflections larger than 3 standard deviations but should give significantly better values for the weak data than those obtained by merely taking the square root of the intensities and setting negative intensities to zero. Reflections with intensities below minus four standard deviations are rejected.
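The idea behind the procedure can be illustrated for a single acentric reflection. The sketch below is an assumption-laden illustration, not the program's implementation: it takes the acentric Wilson prior exp(-J/S)/S for the true intensity J (S being the mean intensity of the resolution shell), a Gaussian likelihood for the observed intensity, and evaluates the posterior mean of sqrt(J) by simple numerical integration. The function names and the integration limits are hypothetical.

```python
import math


def posterior_mean_amplitude(i_obs, sig_i, mean_i, n_steps=20000):
    """Posterior mean of F = sqrt(J) for one acentric reflection (sketch only)."""
    # Integrate over true intensities J >= 0, out to 10 sigma past the observation.
    upper = max(i_obs, 0.0) + 10.0 * sig_i
    step = upper / n_steps
    points = [(n + 0.5) * step for n in range(n_steps)]  # midpoint rule
    # Log of (Wilson prior x Gaussian likelihood), constants dropped.
    logs = [-j / mean_i - (i_obs - j) ** 2 / (2.0 * sig_i ** 2) for j in points]
    shift = max(logs)  # guard against underflow in exp()
    weights = [math.exp(v - shift) for v in logs]
    norm = sum(weights)
    return sum(math.sqrt(j) * w for j, w in zip(points, weights)) / norm


def simple_amplitude(i_obs):
    """Square root of the intensity with negatives set to zero (as for TRUNCATE NO)."""
    return math.sqrt(max(i_obs, 0.0))


if __name__ == "__main__":
    for i_obs in (-1.0, 0.5, 3.0, 100.0):
        print(f"I = {i_obs:7.2f}  sqrt: {simple_amplitude(i_obs):7.3f}  "
              f"posterior mean: {posterior_mean_amplitude(i_obs, 1.0, 50.0):7.3f}")
```

With these assumptions a negative observation still yields a positive amplitude, while a strong reflection comes out very close to the plain square root.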
The following warnings should be heeded:
The Wilson plot part of the program attempts to calculate an absolute scale and temperature factor for a set of observed intensities, using the theory of A. J. C. Wilson. This says that IF the atoms are randomly distributed through the asymmetric unit THEN
scale * <Fobs**2> should equal <f**2> * exp(-2B sin**2 theta/lambda**2)
By fitting a least squares line through ln(<f**2>/<Fobs**2>) v 2 sin**2 theta/lambda**2 the program derives the scale (the intercept is ln(scale)) and the B value (the slope).
For real structures the assumption that the atoms are randomly distributed is obviously incorrect. The effect of this is most obvious in the low resolution reflections. The Wilson plot will deviate from a straight line from about 3.0A - 4.0A downwards. Although all the points on the Wilson plot are plotted, the scale and B are only determined from a limited resolution range determined by the user (see keyword RSCALE).
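A straight-line Wilson fit over a restricted resolution range can be sketched as follows. The convention assumed here is scale * <Fobs**2> = <f**2> * exp(-2B sin**2 theta/lambda**2), so that ln(<f**2>/<Fobs**2>) plotted against 2 sin**2 theta/lambda**2 has slope B and intercept ln(scale). The function name, its arguments and the shell tuples are hypothetical, and the fit is unweighted, which may differ from what the program does.

```python
import math


def wilson_fit(shells, d_low=None, d_high=None):
    """Unweighted least-squares Wilson fit; returns (scale, B).

    shells: iterable of (s2, mean_f2, mean_fobs2) per resolution shell,
            with s2 = sin^2(theta)/lambda^2.
    d_low, d_high: optional resolution limits in Angstroms; shells outside
            them are ignored, in the spirit of the RSCALE keyword.
    """
    xs, ys = [], []
    for s2, mean_f2, mean_fobs2 in shells:
        d = 1.0 / (2.0 * math.sqrt(s2))  # Bragg spacing of the shell
        if d_low is not None and d > d_low:
            continue
        if d_high is not None and d < d_high:
            continue
        xs.append(2.0 * s2)
        ys.append(math.log(mean_f2 / mean_fobs2))
    n = len(xs)
    if n < 2:
        raise ValueError("need at least two shells inside the resolution limits")
    x_mean = sum(xs) / n
    y_mean = sum(ys) / n
    sxx = sum((x - x_mean) ** 2 for x in xs)
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    b_factor = sxy / sxx                            # slope
    scale = math.exp(y_mean - b_factor * x_mean)    # exp(intercept)
    return scale, b_factor


if __name__ == "__main__":
    # Synthetic shells generated with scale = 2.5 and B = 20 (illustrative values).
    demo = [(0.005 * k, 100.0, 100.0 * math.exp(-2.0 * 20.0 * 0.005 * k) / 2.5)
            for k in range(1, 13)]
    print(wilson_fit(demo, d_low=4.0))
```

Real data deviate from the straight line at low resolution, which is why the low resolution shells are left out of the fit in this example.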
There may be a problem in evaluating <Fobs**2> if all the weak data have been systematically omitted (this should NOT be the case for data measured in any proper manner: note that if this IS the case, the Truncate procedure will also fail). If this is the case then you need to use TRUNCATE NO. The program estimates the expected number of reflections in each resolution shell and then calculates <Fobs**2> by dividing by the number of predicted reflections.
K.S. Wilson and S. French
"falloff" program contributed by Yorgo MODIS, European Molecular Biology Lab (original program: W.G.J. HOL/SINEKE BREEN, part of the Groningen BIOMOL package). Incorporated into TRUNCATE by Martyn Winn.
unix example scripts found in $CEXAM/unix/runnable/
VMS versions found in $CEXAM/vms/
non-runnable examples found in $CEXAM/unix/non-runnable/