SCALEIT (CCP4: Supported Program)

NAME

scaleit - derivative to native scaling and analysis

SYNOPSIS

scaleit hklin foo_in.mtz hklout foo.mtz
[Keyworded input]

PURPOSE

The program SCALEIT calculates and applies a derivative to native scaling function using either (a) an overall scale factor, (b) a scale and isotropic temperature factor or (c) a scale and anisotropic temperature factor. SCALEIT would normally be run after the merged datasets for native and derivatives have been combined into one file with CAD, and before beginning to search for heavy atom sites. See also FHSCAL for an alternative to SCALEIT.

In addition, SCALEIT performs several analyses of the scaled data, which may be useful even if FHSCAL is used for the final scaling (see PROGRAM OUTPUT). These include a normal probability analysis using the ideas of Dave Smith and Lynne Howell (J. Appl. Cryst. (1992) 25, 81-86), which is done only if a single derivative is being scaled. SCALEIT also estimates the Kraut scale using the formula of Ian Tickle (Daresbury booklet 1991, p. 91).

DESCRIPTION

The program has options to refine scale factors, to apply input scale factors, or just to analyse the agreement between derivative and native amplitudes. Several derivatives may be scaled and analysed in one run of the program, but the same type of scaling must be used for all of them. It is important to look at the final analysis, which compares <FP**2> and <FPH**2> after scaling. If their ratio is not close to unity, something has gone wrong! Possible reasons are discussed below. The range you are scaling over may be inappropriate; it is often best to exclude the lowest resolution data. A few reflections may be obscured by the backstop in one data set; this can distort scales badly. The sigmas may not be appropriate. In the analysis, large differences (both isomorphous and anomalous) are listed (only if scales are refined): these differences may be spurious, and the reflections should be checked.

Note that there is no unique solution to the problem of scaling together two different data sets. Problems arise from:

Scaling over an inappropriate resolution range. Derivative data are often poorly isomorphous at low resolution, and there is no point trying to scale to a resolution where the data are too inaccurate.
Random errors, particularly if the two data sets are of very different strengths. Excluding weak data (EXCLUDE SIG) may help in this case.
Poor estimates of SIGMA. The default option is NOT to use weights derived from the standard deviations of the reflections in the scaling, since these are often unreliable, or the two data sets are of very different quality. Use keyword WEIGHT to use the standard deviations if you think they are reliable.
Systematic errors: the anisotropic scaling will take out some systematic errors, but proper scaling to remove such errors must be done on data processed as P1 data; i.e. the indices should be those of the actual observation, not those of a symmetry equivalent. If the LAMBDA values defined by the anisotropic ellipsoid are very different there is some problem in scaling the two data sets.
Rogue information; e.g. a reflection behind a backstop. If such an observation has a very low SIGMA it will be given a lot of weight in the refinement calculation (see EXCLUDE options for possible cures).
Actual differences between the crystals: scaling can only be done properly with a model for the difference, thus in refinement of heavy-atom parameters, the derivative scale factor is also refined. The scale calculated by this program can only be regarded as a rough estimate but it is usually adequate for calculating Patterson functions. The program FHSCAL may provide a better estimate of scale for heavy-atom derivatives.

In general, scales may be calculated either by a least-squares procedure, or by Wilson scaling, i.e. making <Fph**2> = <Fp**2>. These procedures will give different answers, and it is not clear which is better. This program allows the option of a final Wilson scaling after least-squares determination of isotropic or anisotropic temperature factors: this changes the scale factor, but not the temperature factors (option REFINE [AN]ISOTROPIC WILSON).

Note that all scales output by the program apply to Fph, although they are determined from F**2.

It is also possible to apply the scales to all "scalable" columns in a dataset (i.e. to F+/- and to the structure intensities; see the LABIN keyword), and this is advisable to avoid mixtures of scaled and unscaled data for a single derivative. For input MTZ files with dataset information, SCALEIT will attempt to check for datasets that would be output with such a mixture, and will warn you if it detects any. In these cases, specifying the AUTO keyword will cause the appropriate scale factor to be applied automatically to all such columns.

KEYWORDED INPUT

A line beginning with a '#' or '!' is treated by the parser as a comment line and is ignored: this is useful for command procedures. Parameters given in [] below are optional.

The various data control lines are identified by keywords. The only compulsory keyword is LABIN to specify the MTZ column labels; other keywords have sensible defaults and are optional. The principal keywords controlling the function of the program are REFINE and ANALYSE. The full list of available keywords is as follows:

ANALYSE, AUTO, CONVERGE, EXCLUDE, GRAPH, LABIN, NOWT, REFINE, RESOLUTION, SCALE, SCATTER, SYMMETRY, TITLE, WEIGHT, END

TITLE <string>

Title to replace the header in the output file.

LABIN <program_label>=<file_label>...

Assign columns to be used. This both assigns columns, and defines how many derivatives to scale, and whether they have anomalous data.

The items required (<program_label>s) are as follows:

           H
           K
           L
           FP        F of native data
           SIGFP     sigma(F) of native data
           FPHn      F of data for nth derivative
           SIGFPHn   sigma(F) of derivative data
          [DPHn]     Anomalous Delta(F) of derivative data
          [SIGDPHn]  sigma anomalous Delta(F) of derivative data

and so on for up to 20 possible derivatives.

Additionally the following data items can be included, if present, for each derivative:

         FPHn(+)     F(+) of hkl for nth derivative
         SIGFPHn(+)  sigma of above
         FPHn(-)     F(-) of -h-k-l for nth derivative
         SIGFPHn(-)  sigma of above
         IMEANn      Average Structure Intensity for nth derivative
         SIGIMEANn   sigma of above
         In(+)       Structure Intensity of hkl for nth derivative
         SIGIn(+)    sigma of above
         In(-)       Structure Intensity of -h-k-l for nth derivative
         SIGIn(-)    sigma of above

If any of these items are specified then SCALEIT will also apply the appropriate scale factor (in the case of F+/-) or the scale factor squared (in the case of structure intensities) to those columns, however no analysis will be performed using the data in the columns.
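
The distinction between amplitude-like and intensity-like columns can be illustrated with a short Python sketch. This is not SCALEIT code; the function name apply_scale and the example numbers are hypothetical, and only the rule itself (scale for F+/-, scale squared for intensities) is taken from the text above.

```python
def apply_scale(value, scale, is_intensity=False):
    """Scale one column value.

    Amplitude-like quantities (e.g. F(+), F(-)) are multiplied by the
    scale factor; intensity-like quantities are multiplied by its square.
    """
    factor = scale * scale if is_intensity else scale
    return value * factor


k = 1.1  # hypothetical derivative scale factor
print(apply_scale(200.0, k))                        # an amplitude column value
print(apply_scale(40000.0, k, is_intensity=True))   # an intensity column value
```

Keeping both kinds of column on a consistent scale is the reason for scaling all of them together: after scaling, an intensity column remains consistent with the square of the scaled amplitude.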

Alternatively, by specifying the AUTO keyword, the scale factor will be applied automatically to all "scalable" columns in a dataset. Only FPHn and SIGFPHn need to be specified for each derivative on the LABIN line (see separate entry for AUTO).

AUTO

Switches on AUTOmatic column selection. This option can only be used if the input file contains dataset information.

It is only necessary to specify FPHn and SIGFPHn for each dataset on the LABIN line (except in special cases, see below). Other labels can also be specified if desired. The program will then try to identify all "scalable" columns in the dataset, automatically read them in and then apply the appropriate scale factor determined from FPHn.

This option is intended to prevent a mixture of scaled and unscaled columns within a dataset, e.g. FPHn is scaled but not FPHn(+) and FPHn(-). There are a couple of caveats:

It is assumed that each dataset contains the information for one derivative.
There may be problems with the automatic scaling if datasets contain both SIGIMEAN and SIGDPHn. This is because the program cannot distinguish between sigmas for intensities (which need to be scaled by the square of the scale factor) and those for other quantities (which are multiplied by the scale factor).
In these cases the automatic selection will make a best guess at which sigma is which; the ambiguity can also be resolved provided that IMEAN and SIGIMEAN are explicitly set by the user on the LABIN line (which is safer).

REFINE [ SCALE | ISOTROPIC | ANISOTROPIC ] [WILSON]

Alternative to ANALYSE. Defines the type of scale factors to be refined; the program default is REFINE ANISOTROPIC. The choice applies to all derivatives specified in this run.

SCALE: overall scale only
ISOTROPIC: scale and isotropic temperature factor
ANISOTROPIC: scale and anisotropic temperature factor (default)
WILSON: apply a final Wilson scale, after determining relative temperature factors. This can be combined with SCALE, ISOTROPIC or ANISOTROPIC keywords. I have no idea if this is a good thing to do.

ANALYSE

Alternative to REFINE

Analyse differences between derivative and native without refining scale factors. SCALE commands may be given to change the scale and temperature factors from no scaling. If this command is given, no output file is written.

CONVERGE [ NCYC <n> ] [ ABS <m> ] [ TOLR <l> ]

Conditions for convergence.

<n>: number of cycles of refinement required (default 4)
<m>: convergence limit. The refinement will be ended if all the shifts are less than (ABS * the standard deviation of the parameter). (Default = 0.001)
<l>: tolerance (default 0.00000001)
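
The ABS criterion can be read as the following test, sketched here in Python. This is an illustration of the rule as worded above, not the program's own code; the function name has_converged and the sample shifts are hypothetical.

```python
def has_converged(shifts, std_devs, abs_limit=0.001):
    """Return True if every parameter shift is smaller than
    abs_limit times the standard deviation of that parameter."""
    return all(abs(shift) < abs_limit * sd
               for shift, sd in zip(shifts, std_devs))


# Hypothetical shifts and standard deviations for two parameters.
print(has_converged([1e-5, 2e-5], [0.1, 0.1]))   # shifts well below the limit
print(has_converged([1e-3, 2e-5], [0.1, 0.1]))   # first shift is too large
```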

SCATTER

Include scatter plots of scales in logfile. Default is not to.

SCALE [FPHn] Scale [Biso]/[B11 B22 B33 B12 B13 B23]

Alternative to REFINE, not usually used.

Input scales (and temperature factors) to be applied to derivative n (i.e. column assigned to FPHn). If the FPHn key is not given, the scale is used for the 1st derivative FPH1. If any scales are given with this command, then the scales are NOT determined, just applied and the analysis performed. Isotropic and anisotropic temperature factors may NOT be mixed for different derivatives. No scale factor may be given for the native.

SYMMETRY <spacegroup name or number>

Not normally required.

Spacegroup name or number to override symmetry in input file.

RESOLUTION <rmin> <rmax>

Resolution range (either 4sin(theta)**2/lambda**2, or Angstrom). Reflections outside this range will be excluded from scale determination and analysis, but will be scaled and written to the output file. If this command is not given, all data in the file is included.
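
The two ways of expressing resolution are related through Bragg's law (lambda = 2 d sin(theta)), which gives 4sin(theta)**2/lambda**2 = 1/d**2. The Python sketch below shows the conversion; the function names are hypothetical and it says nothing about how SCALEIT itself decides which unit was given.

```python
import math


def angstrom_to_s(d):
    """Convert a d-spacing in Angstrom to 4*sin(theta)**2/lambda**2 (= 1/d**2)."""
    return 1.0 / (d * d)


def s_to_angstrom(s):
    """Convert 4*sin(theta)**2/lambda**2 back to a d-spacing in Angstrom."""
    return 1.0 / math.sqrt(s)


# A 20 A to 2.5 A range expressed in both forms.
for d in (20.0, 2.5):
    print(d, angstrom_to_s(d))
```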

EXCLUDE [ FP | FPH<n> ] [ SIG <nsig> ] [ FMAX <fmax> ] [ DMAX <fmax> ] [ DIFF <diffmax>]

Set criteria for excluding data from the scale determination and analysis. Excluded data will still be scaled and written to the output file. The default is to include all data. If the first key is FP the exclusions apply to the native data, if FPH<n> to the <n>th derivative: if this key is omitted, the exclusions apply to all data, native and derivatives. Several EXCLUDE commands may be given.

SIG <nsig>: exclude reflections if FP < <nsig>* SIGFP or FPH<n> < <nsig>* SIGFPH<n>
FMAX <fmax>: exclude reflections if FP or FPH<n> > <fmax>
DMAX <fmax>: exclude reflections if abs(DPHn) > <fmax>
DIFF <diffmax>: exclude reflections if abs(FPHn-FP) > <diffmax>
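
The four criteria can be summarised in a short Python sketch. It assumes the exclusions apply to both native and derivative (no FP or FPH<n> key given) and a single derivative; the function name is_excluded and its argument names are hypothetical. A criterion that is left as None is not tested.

```python
def is_excluded(fp, sigfp, fph, sigfph, dph=None,
                nsig=None, fmax=None, dmax=None, diffmax=None):
    """Return True if a reflection fails any of the EXCLUDE-style criteria."""
    # SIG: amplitude weaker than nsig standard deviations
    if nsig is not None and (fp < nsig * sigfp or fph < nsig * sigfph):
        return True
    # FMAX: amplitude too large
    if fmax is not None and (fp > fmax or fph > fmax):
        return True
    # DMAX: anomalous difference too large (only if DPH is present)
    if dmax is not None and dph is not None and abs(dph) > dmax:
        return True
    # DIFF: isomorphous difference too large
    if diffmax is not None and abs(fph - fp) > diffmax:
        return True
    return False


print(is_excluded(20.0, 10.0, 100.0, 10.0, nsig=3))       # weak native F
print(is_excluded(100.0, 10.0, 120.0, 10.0, diffmax=50))  # passes the DIFF test
```

As stated above, reflections flagged in this way are only left out of scale determination and analysis; they are still scaled and written to the output file.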

GRAPH [ H K L MODF ]

List of the analyses to be included as well as that against 4sin(theta)**2/lambda**2. H, K, L and MODF can be given in any order. The default is just to analyse against resolution.

NOWT

If this command is present, the scale determination will be unweighted (the default).

WEIGHT

Weight the observations for scale determination according to the input standard deviations. The default is not to weight them.

END

End of input. If present, this must be the last keyword.

INPUT AND OUTPUT FILES

The input files are:

(a): The control data file
(b): The input reflection data file in standard MTZ format.

The output is a reflection data file in standard MTZ format. This is a copy of the input reflection data file but with the data items for the selected derivative re-scaled.

PROGRAM FUNCTION

The program SCALEIT is used to calculate a derivative to native scaling function and apply it to the derivative data. Scales are determined from the squared amplitudes. The scaling function for F may be of the form:

An overall scale (REFINE SCALE)

Isotropic temperature factor (REFINE ISOTROPIC)

       C * exp(-B * sin(theta)**2 / lambda**2)

Anisotropic temperature factor (REFINE ANISOTROPIC) (default)

       C * exp(-(h**2 B11 + k**2 B22 + l**2 B33 + 
                      2hk B12 + 2hl  B13  +  2kl B23))
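
The anisotropic form above can be evaluated for a single reflection as in this Python sketch. It is a direct transcription of the expression, not the program's implementation; the function name aniso_scale and the B values used are hypothetical.

```python
import math


def aniso_scale(c, b, h, k, l):
    """Evaluate C * exp(-(h**2 B11 + k**2 B22 + l**2 B33
                          + 2hk B12 + 2hl B13 + 2kl B23)).

    b is the tuple (B11, B22, B33, B12, B13, B23).
    """
    b11, b22, b33, b12, b13, b23 = b
    exponent = (h * h * b11 + k * k * b22 + l * l * b33
                + 2 * h * k * b12 + 2 * h * l * b13 + 2 * k * l * b23)
    return c * math.exp(-exponent)


# Hypothetical parameters: with all B terms zero only the overall scale remains.
print(aniso_scale(1.05, (0.0,) * 6, 3, 4, 5))
print(aniso_scale(1.05, (1e-4, 2e-4, 1e-4, 0.0, 0.0, 0.0), 3, 4, 5))
```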

An initial (and optional final, see REFINE WILSON) scaling factor is calculated from the expression

              Kinit = Sqrt(Sigma FP**2 / Sigma FPH**2)
                       (relative Wilson scaling)
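
As a small illustration, Kinit can be computed from two lists of amplitudes as follows. This Python sketch only evaluates the expression above over whatever reflections it is given; the function name wilson_scale and the numbers are hypothetical.

```python
import math


def wilson_scale(fp, fph):
    """Kinit = sqrt(sum(FP**2) / sum(FPH**2)) over the supplied reflections."""
    sum_fp_sq = sum(f * f for f in fp)
    sum_fph_sq = sum(f * f for f in fph)
    return math.sqrt(sum_fp_sq / sum_fph_sq)


fp = [10.0, 20.0]    # hypothetical native amplitudes
fph = [20.0, 40.0]   # hypothetical derivative amplitudes
k = wilson_scale(fp, fph)
print(k)
# After multiplying FPH by k, sum(FPH**2) equals sum(FP**2).
print(sum((k * f) ** 2 for f in fph), sum(f * f for f in fp))
```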

The scale and anisotropic temperature factors are then refined using a modification of the method of Fox and Holmes. The function minimised is

          Sigma Sigma  w_i(h) * (I_i(h) - G_i * I(h))**2
            h     i

with respect to all parameters (2 scale factors and 6 beta values in the anisotropic case)

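             G1 = 1.0
             G2 = (1/C) exp(+ 2 h_ B h_ )

where h_ B h_ denotes the quadratic form h**2 B11 + k**2 B22 + l**2 B33 + 2hk B12 + 2hl B13 + 2kl B23.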

Anisotropic temperature factors are determined on data expanded by symmetry to a hemisphere, which constrains certain combinations of coefficients in some space groups, e.g. in orthorhombic symmetry B12=B13=B23=0, in cubic spacegroups B11=B22=B33.

The scale and temperature factors are applied to the derivative data and an output file is written with the corrected data. The scale factor SigmaFP**2/SigmaFPH**2 is then analysed in ranges of h, k, l and 4sin**2 theta/lambda**2.

PROGRAM OUTPUT

The program output starts with details of the input reflection data file produced by the MTZ file handling routines, and details of the control data. Then for each cycle of the refinement the following details are output.

The eigenvalues of the matrix
The mean residual
The scaling parameters giving the new values, the shifts and the standard deviations.

At the end of the refinement, there is an analysis of the scaled data. For each derivative, an estimate of the acceptable isomorphous and anomalous differences is given, followed by a list of individual reflections with abnormally high differences. This information can be used to exclude outliers from Patterson calculations or direct methods calculations.

Then for each derivative, the following information is given as a function of resolution:

Kraut scale factor and relative Wilson scale factor.
R factor ("Rfactor" or "Rfac") and weighted R factor ("Rfactor_W" or "Wted_R") for agreement between native and derivative.
<diso> and max(diso)
<dano> and max(dano)

These statistics may also be given as a function of h, k, l or |FP|, see keyword GRAPH.

Some terms defined:

Rfac = [sum( abs(FPH - FP))]/[sum(FP)]
RF_I = [sum( abs(FPH*FPH - FP*FP))]/[sum(FP*FP)]
Wted_R = [sum( abs(FPsq-FFmean)/Var(FPsq) + abs(FPHsq-FFmean)/Var(FPHsq) )] / [sum( FPsq/Var(FPsq) + FPHsq/Var(FPHsq) )]
FFmean = [FPsq/Var(FPsq) + FPHsq/Var(FPHsq)] / [1/Var(FPsq) + 1/Var(FPHsq)]
Var(FPsq) = Var(FP) * 4FPsq
Diso = abs(FPH - FP)
Dano = abs(DPH)
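
The definitions above translate into the following Python sketch. It is an illustration only, not the program's code. The function names are hypothetical, FPsq is taken to mean FP**2, and Var(FPHsq) is assumed to be defined by analogy with Var(FPsq), i.e. Var(FPH) * 4FPHsq, with Var(F) taken as sigma(F)**2. Reflections with zero amplitude or zero sigma would need special handling, which is omitted here.

```python
def rfac(fp, fph):
    """Rfac = sum(abs(FPH - FP)) / sum(FP)."""
    return sum(abs(b - a) for a, b in zip(fp, fph)) / sum(fp)


def rf_i(fp, fph):
    """RF_I = sum(abs(FPH*FPH - FP*FP)) / sum(FP*FP)."""
    return (sum(abs(b * b - a * a) for a, b in zip(fp, fph))
            / sum(a * a for a in fp))


def wted_r(fp, sigfp, fph, sigfph):
    """Weighted R factor following the Wted_R and FFmean definitions."""
    num = 0.0
    den = 0.0
    for a, sa, b, sb in zip(fp, sigfp, fph, sigfph):
        fpsq, fphsq = a * a, b * b
        var_p = sa * sa * 4.0 * fpsq     # Var(FPsq) = Var(FP) * 4FPsq
        var_ph = sb * sb * 4.0 * fphsq   # assumed analogous form for FPH
        ffmean = ((fpsq / var_p + fphsq / var_ph)
                  / (1.0 / var_p + 1.0 / var_ph))
        num += abs(fpsq - ffmean) / var_p + abs(fphsq - ffmean) / var_ph
        den += fpsq / var_p + fphsq / var_ph
    return num / den


fp, sigfp = [100.0, 200.0], [5.0, 5.0]     # hypothetical native data
fph, sigfph = [110.0, 190.0], [5.0, 5.0]   # hypothetical scaled derivative data
print(rfac(fp, fph), rf_i(fp, fph), wted_r(fp, sigfp, fph, sigfph))
```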

Diso and Dano are very useful analytical tools. Diso should fall off with increasing resolution, and certainly should not increase! An increase is a good indication of either non-isomorphism or data quality falling off. You need to run your Pattersons with resolution ranges which only use reliable data, and with sensible EXCLUDE terms based on the plots of Diso and Dano. However, MLPHARE has a built-in weighting scheme, which means that it doesn't do much harm to include less good data in phasing. After all, the poor reflections should get low FOMs, and then DM can use the few reflections with reasonable phases to help in the phase extension procedure.

If there is only one derivative then the results of a normal probability analysis are also given (see Lynne Howell and Dave Smith, J. Appl. Cryst. 25, 81-86 (1992)). The reflections in each resolution bin are sorted according to the value of:

delta(real) = (FPH - FP)/sqrt(SIGFPH**2 + SIGFP**2)

where FPH and SIGFPH are the scaled values for the derivative. For each reflection, delta(expected) is then calculated based on an assumed normal distribution and the position of the reflection in the sorted list. A plot of delta(real) against delta(expected) is called a normal probability plot.
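
The construction of the plot for one resolution bin can be sketched in Python as follows. This is an illustration rather than the program's algorithm: the function name is hypothetical, and the plotting position (i + 0.5)/n used to obtain delta(expected) from the rank is an assumption, since the text above does not say which convention SCALEIT uses.

```python
import math
from statistics import NormalDist


def normal_probability_points(fp, sigfp, fph, sigfph):
    """Return (delta_expected, delta_real) pairs for one resolution bin."""
    # delta(real) = (FPH - FP) / sqrt(SIGFPH**2 + SIGFP**2), sorted ascending
    delta_real = sorted(
        (b - a) / math.sqrt(sb * sb + sa * sa)
        for a, sa, b, sb in zip(fp, sigfp, fph, sigfph)
    )
    n = len(delta_real)
    unit_normal = NormalDist()  # mean 0, standard deviation 1
    # Expected value at each rank for a normal distribution
    # (assumed plotting position: (i + 0.5) / n).
    delta_expected = [unit_normal.inv_cdf((i + 0.5) / n) for i in range(n)]
    return list(zip(delta_expected, delta_real))


# Hypothetical data for three reflections.
for expected, real in normal_probability_points(
        [100.0, 100.0, 100.0], [1.0, 1.0, 1.0],
        [103.0, 99.0, 100.0], [1.0, 1.0, 1.0]):
    print(round(expected, 3), round(real, 3))
```

A straight-line fit through these points then gives the slope and intercept discussed in the following paragraphs.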

If the native and scaled derivative data sets are essentially identical (in statistical parlance, they represent two samplings of the same population), then the spread of the two data sets will be the same within the errors defined by SIGFP and SIGFPH, and the normal probability plot will be linear with a slope of about 1 and an intercept of 0. However, if the heavy atoms make a significant contribution to the observed structure factors, then (FPH - FP) will be larger than expected from SIGFP and SIGFPH, and the slope will be > 1. The intercept may also be non-zero.

The program plots the slope and intercept of the normal probability plot (obtained by a least squares fit) as a function of resolution for both centric and acentric reflections. These values are also plotted for the case where reflections at the tails of the distribution are excluded: these reflections tend not to lie on the straight line and distort the least squares fit. The existence and size of the heavy atom contribution to the structure factors can be gauged from the values of the slope and intercept, and the variation with resolution indicates to how high a resolution such contributions extend. A similar analysis can be applied to MAD data by assigning FP and FPH to data at different wavelengths (dispersive differences) or to F+ and F- (anomalous differences). In general, the size of the slope will be smaller in this case.

REFERENCES

Normal Probability Analysis:
Lynne Howell and Dave Smith, J. Appl. Cryst. 25, 81-86 (1992)

AUTHORS

Phil Evans / Eleanor Dodson / Richard Dodson

EXAMPLES

Simple unix example script found in $CEXAM/unix/runnable/

scaleit.exam (Example of derivative to native scaling)

(A vms version found in $CEXAM/vms/scaleit.com)

Also found combined with other programs in the example scripts ($CEXAM/unix/runnable/)

rsearch.exam (Use of scaleit in R factor search).

fhscal.exam (Analysis after Kraut scaling of derivative data.)