topp (CCP4: Supported Program)
NAME
topp - an automatic topological and atomic comparison program for protein structures
top3d foo_1.pdb foo_2.pdb
NOTES ON CCP4 VERSION
Note: TOPP has been renamed from the original TOP to avoid a clash with the UNIX command of that name.
TOPP can be run directly using the command topp with Keyworded input, or via the script top3d which takes two file names as arguments and program parameters from the file $CLIBD/TOP.PARM (see examples section). A search with one file against a database of structures can be done using the script topsearch which takes one file name as argument and program parameters from the file $CLIBD/SEARCH.PARM (see examples section).
Use of the browser facility to search a Protein Data Bank site requires two commands to be on the user's path, namely wget and pdbhtf. The latter is part of the CCP4 suite and should have been compiled and installed. On the other hand, wget is not part of CCP4, but is a GNU program available via internet from the usual GNU sites.
TOP is designed to be user friendly. Once the program is properly set up on a Unix computer, a simple command such as top3d file1 file2 superimposes the coordinates in file2 onto those in file1. The program also recognises Protein Data Bank (PDB) entry codes. For example, if the second molecule is PDB entry 2cnd, the user can type top3d file1 2cnd@pdb, and the program will fetch the coordinates of 2cnd to the local disk and perform the comparison. To find out whether the structure in a file is similar to any structure in the PDB, type topsearch file.pdb; the program outputs a list of PDB codes ranked by 3D structural similarity. The user can then type top3d file.pdb code@pdb to superimpose the coordinates of any entry of interest onto the probe model. The program can detect sequence permutation and can be used for special purposes such as motif searching.
Each structure comparison runs in two steps. In the first step, the topology of the secondary structure elements (SSEs) of the two structures is compared. The program represents each SSE (alpha helix or beta strand) by two points, then systematically searches all possible superpositions of these elements between the two protein structures. Once a pair of elements from the two structures fit each other in 3D space (that is, the r.m.s. deviation, the angle between the two lines formed by the two points, and the line-line distance are all smaller than the given values), the program checks whether more SSEs fit under the same superposition operation. If the number of fitting SSEs exceeds a given number, the program declares the two structures similar, outputs the names of the corresponding SSEs in the two proteins, and outputs the superimposed coordinates. It also outputs a matrix with which one molecule can be rotated and translated onto the other. The program outputs a comparison score called "Topological Diversity", which takes into account both the rate of matching SSEs and the structural difference of the representing points. In database searching, this score can be used to rank the topological similarity of SSEs.
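One of the fit criteria in the first step is the angle between the two lines that represent a pair of SSEs. The following is a minimal sketch of that single calculation, not the program's own code: the function name sse_angle and the argument order are assumptions made for illustration, and the r.m.s. and line-line distance tests are left out.

```shell
#!/bin/sh
# sse_angle (hypothetical helper): takes 12 numbers, the x y z of the two
# points representing SSE A followed by the x y z of the two points
# representing SSE B, and prints the angle between the two axes in degrees.
sse_angle() {
    awk -v a="$*" 'BEGIN {
        split(a, p, " ")
        # direction vectors of the two axes
        ax = p[4] - p[1];  ay = p[5] - p[2];  az = p[6] - p[3]
        bx = p[10] - p[7]; by = p[11] - p[8]; bz = p[12] - p[9]
        na = sqrt(ax*ax + ay*ay + az*az)
        nb = sqrt(bx*bx + by*by + bz*bz)
        if (na == 0 || nb == 0) { print "undefined"; exit }
        c = (ax*bx + ay*by + az*bz) / (na * nb)
        # clamp against rounding so that sqrt() stays real
        if (c > 1) c = 1
        if (c < -1) c = -1
        # awk has no acos(); atan2 gives the same angle
        printf "%.1f\n", atan2(sqrt(1 - c*c), c) * 180 / 3.141592653589793
    }'
}

# Two perpendicular axes, one along x and one along y
sse_angle 0 0 0 10 0 0  0 0 0 0 10 0   # prints 90.0
```

A pair of elements would then pass this particular test when the printed angle is below the limit given with the ERRANG keyword.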
When Ca atoms are available, the program can run a second step that finds an alignment based on the Ca atoms of all residues, starting from the initial comparison matrix, and then improves the comparison matrix based on the superposition of the newly aligned Ca atoms. The procedure is iterated until the number of matching residues converges. The program is able to overcome sequence permutation in the superpositions. From both the r.m.s. deviations and the numbers of matching residues, the program calculates a "Structure Diversity" score, which can be used to rank the structural differences of homologous proteins.
The compact SSE library is updated automatically at the Karolinska Institute every week. It includes not only the currently released structures in the Protein Data Bank, but also compact SSE databases of the independent structures, families and super-families classified in the SCOP database, for efficient similarity searching. It can be obtained from ftp://gamma.mbb.ki.se/pub/guoguang/sndlib.tar.Z . After fetching this TAR file by FTP and saving it to your local disk as, for example, /dir/sndlib.tar.Z, use the following commands:
    cd $TOPHOME
    zcat /dir/sndlib.tar.Z | tar -xvf -
You will then have the most recently updated SSE databases.
TOP can read 3D coordinates of protein structures in "Brookhaven" (PDB) format from the user's local disk, from CD-ROM or via the internet. For structure similarity searching there are several ways to read the data. The recommended setup is to search an automatically updated secondary structure element (SSE) library (see automatic updating of SSE library and MOLVEC). In this way the program searches the most recent database through the compact SSE library and fetches detailed coordinates only for those structures found to be similar to molecule 1. This is considerably faster and requires no regular database maintenance after setup.
Example: MOL1 /nfs/disk1/guoguang/examples/test2.pdb
The name of the coordinate file of molecule 1, the structure for which similarities are searched. The coordinate file must be in PDB (Brookhaven) format.
If you do not have the coordinates on your local disk and wish to read them directly from a Web site by giving a PDB entry code, give a file name of the form code@pdb in this command, for example: MOL1 2cnd@pdb. The program will use the code to fetch the coordinates from a PDB mirror site or another web site, whose URL address is specified in the PDBSITE or WEBSITE commands.
MOL2 Coordinate_file_name or @List_file_name or @URL_address [zone]
If the second text string in the command starts with @ and the rest of the text does not start with http: or ftp:, the rest of the string is taken to be the name of a List_file, which lists the names of a number of coordinate files, such as:
    /nfs/protein/pdb/current_release/uncompressed_files/00/pdb200d.ent
    /nfs/protein/pdb/current_release/uncompressed_files/00/pdb200l.ent
    /nfs/protein/pdb/current_release/uncompressed_files/00/pdb300d.ent
    /nfs/protein/pdb/current_release/uncompressed_files/00/pdb100d.ent
    ....
This can be used for searching for structure similarities in the PROTEIN DATA BANK.
If the list instead contains PDB entry codes or entry file names, such as
    200d            pdb200d.ent
    200l     or     pdb200l.ent
    300d            pdb300d.ent
    ....            ....
the program will fetch the coordinates of these PDB entries from a web site, local disk or CD.
This list of PDB codes can be obtained from the "3DB browser" of the Protein Data Bank or from other bioinformatics tools outside the program. It makes it possible for TOP to search a particular group of structures for a special purpose.
LIBDIR directory_name
If the program is searching a number of coordinate files (see MOL2) and those files are all under the same directory, the user can indicate in which directory the coordinate files are located. For example, if the files pdb200d.ent pdb3001.ent ... are in the directory /nfs/pdb/all_entries/uncompressed_files/, the user can run the UNIX command
    ls -1 /nfs/pdb/all_entries/uncompressed_files/ > allpdb.lis
The resulting file will look something like
    pdb100d.ent
    pdb101d.ent
    pdb101m.ent
    pdb102d.ent
    pdb102l.ent
    ...
Then use
    libdir /nfs/pdb/all_entries/uncompressed_files/
    mol2 @allpdb.lis
so that the program compares all the files in the directory /nfs/pdb/all_entries/uncompressed_files/ whose names appear in allpdb.lis, and lists those that are similar to the structure specified in the MOL1 command.
Alternatively, one can use the UNIX command
    find /directory_name/ -name "*.ent" -print > pdball.lis
instead of the ls command. The LIBDIR command is not necessary in this case. This is usually used when the user has the whole Protein Data Bank on local disk or CD-ROM.
If the directory name in the LIBDIR command contains the substring ".../current_release/uncompressed_files", the program assumes that the directory is organised like the "current_release" directory of the Protein Data Bank, i.e. PDB entries are distributed over subdirectories whose names correspond to the 2 middle characters of the PDB id code, e.g.
    ...pub/pdb_data/current_release/uncompressed_files/00
    ...pub/pdb_data/current_release/uncompressed_files/zy
and the program assumes that each line in List_file is a PDB entry code or entry file name, such as
    100d            pdb100d.ent
    100e     or     pdb100e.ent
    .....           ....
Please note that the local PDB must contain the coordinates of the structures whose ID codes appear in the file.
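The directory rule described above (subdirectory named after the 2 middle characters of the id code) can be sketched as a small shell function. This is only an illustration of the layout, not code taken from the program: the function name pdb_entry_path is hypothetical, and lower-casing the code and the pdbXXXX.ent file name pattern are assumptions based on the listings shown earlier.

```shell
#!/bin/sh
# pdb_entry_path (hypothetical helper): print the expected location of a PDB
# entry inside a "current_release"-style tree.
#   $1 = base directory (.../current_release/uncompressed_files)
#   $2 = four-character PDB id code
pdb_entry_path() {
    base=${1%/}                                    # drop one trailing slash
    code=$(printf '%s' "$2" | tr 'A-Z' 'a-z')      # assumption: lower-case names
    sub=$(printf '%s' "$code" | cut -c2-3)         # the 2 middle characters
    printf '%s/%s/pdb%s.ent\n' "$base" "$sub" "$code"
}

pdb_entry_path /nfs/protein/pdb/current_release/uncompressed_files 200d
# prints /nfs/protein/pdb/current_release/uncompressed_files/00/pdb200d.ent
```

Running each code in a list file through such a function and testing the result with "test -f" is one way to check that the local PDB really contains every listed entry before starting a long search.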
If the text after the first character "@" starts with "http:", the program assumes that there is a 3DB browser at this URL address and tries to get a list of the currently released entries from it. (This command is not necessary if the PDBSITE command is present.)
If the text after the first character "@" starts with "ftp:", the program lists all the files under the directory. This can be used with an anonymous ftp site in which one directory contains all the coordinate entries (such as the old PDB directory .../all_release/compressed_files/*.pdb ). In this form, however, all the PDB files must be in one directory, not distributed over sub-directories.
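The different forms of the MOL2 argument described above can be summarised as a shell case statement. This is a sketch of the documented rules only; the function name classify_mol2 and the labels it prints are invented for illustration and are not part of the program.

```shell
#!/bin/sh
# classify_mol2 (hypothetical helper): report how the first argument of MOL2
# would be interpreted according to the rules in this section.
classify_mol2() {
    case $1 in
        @http:*) echo "3db-browser" ;;      # list of entries requested from a 3DB browser
        @ftp:*)  echo "ftp-directory" ;;    # all files in one FTP directory
        @*)      echo "list-file" ;;        # local file listing coordinate files or codes
        *@pdb)   echo "pdb-code" ;;         # entry fetched by its PDB code
        *)       echo "coordinate-file" ;;  # ordinary local coordinate file
    esac
}

classify_mol2 @allpdb.lis   # prints list-file
classify_mol2 1vcp@pdb      # prints pdb-code
```

The order of the patterns matters: the two URL forms must be tested before the general @ form, otherwise every argument beginning with @ would be treated as a list file.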
WEBsite URL_address (or SITE or SERVER)
    WEBSITE http://pdb.pdb.bnl.gov/ or http://www.rcsb.org/pdb/
    WEBSITE ftp://pdb.pdb.bnl.gov/pub/pdb/all_entries/compressed_files
    WEBSITE ftp://gamma.mbb.ki.se/pub/pdb/current_release/uncompressed_files
This command gives the URL address of the Web server. If the address is given correctly, the program is able to fetch coordinates from any site that provides Protein Data Bank data by either http or FTP service, in compressed or uncompressed form. Each issue of the Protein Data Bank Quarterly Newsletter lists the laboratories that provide this service (most likely in the form of an FTP server). A current collection of URL addresses of these sites is listed at http://gamma.mbb.ki.se/~guoguang/webtop/url_collect.html
If the server is an FTP site and the directory name contains the sub-string "current_release", the program can automatically find the PDB entries in the sub-directories. Otherwise, it assumes that all the files are in the single directory given as the argument of this command.
MAKEVEC output_database_filename pdb_list_file_name [format]
example: MAKEVEC sndnew.vec pdb.list
If you have the PROTEIN DATA BANK on disk, the TOP program can make a compact database file so that those who do not have the Protein Data Bank on disk can perform similarity searching. The pdb_list_file_name contains something like
    101l.pdb
    102l.pdb
    103l.pdb
    104l.pdb
    ....
Using this list together with the LIBDIR command, one can make a compact SSE library, sndnew.vec.
example:
    PDBSITE http://www2.ebi.ac.uk
    MAKEVEC sndnew.vec
example:
    MAKEVEC snd.vec ftp://pdb.pdb.bnl.gov/pub/pdb/all_entries/compressed_files/
If the file name starts with "ftp://" and ends with "/", the program checks which PDB files are in that FTP directory and fetches all the coordinates in it. In this case the files must all be in that one directory, not in sub-directories. If the second argument starts with "http://", the program asks the 3DB server at that URL address for a list of all the entries in the PDB.
example: MAKEVEC snd.vec ftp://gamma.mbb.ki.se/pub/guoguang/scop_family.lis scop
If the second argument starts with "ftp://" or "http://" and ends with a file name, the program assumes that the URL address points to a file containing the PDB list. This example shows how to get an updated list for the SCOP database, which contains the PDB code and residue range of a representative structure in each family or super-family. The format of the TOP/SCOP list is the following:
    3sdh a:   1.001.001.001.001.001 d3sdha_
    1phn a:   1.001.001.001.002.001 d1phna_
    1grj 2-79 1.001.002.001.001.001 d1grj_1
    ....
example: makevec.com
    # for PDB on local disk
    $LUEXE/top << 'end-top'
    LIBDIR /nfs/protein/pdb/current_release/
    MAKEVEC sndlib.vec pdblist.txt
    'end-top'
    #
The file pdblist.txt can be made in this way:
    cd /nfs/pdb/full/
    ls -1 *.pdb > /nfs/ylgs/guoguang/pdblist.txt
If LIBDIR is replaced by PDBSITE, the program will read updated data from the PDB via the web.
In fact, the keywords 3DBBEFore and 3DBAFTer together with MAKEVEC make it possible to build automatically an SSE library of newly released structures, which can be appended to the old ones. This should be very quick.
example:
    MATCH RATE 0.35 0.8
    MATCH auto [DEFAULT]
    MATCH 5
If RATE appears as a subcommand, the program reads two more parameters, RAT1 and RAT2.
RAT1 is the minimum matching rate of secondary structures. The program takes the smaller of the two numbers of secondary structures (comparing mode) or the number of secondary structures of mol1 (searching mode) and multiplies it by rat1. If the number of matching secondary structures of the two compared proteins reaches this value, the program considers the two structures similar. For example, if mol1 has 12 secondary structures, mol2 has 10, and rat1 is 0.5, the program considers the two structures similar when 5 secondary structures match each other in comparing mode (or 6 in searching mode).
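The arithmetic of the RAT1 example can be reproduced with a few lines of shell and awk. This is a sketch under stated assumptions: the function name min_matches and the mode labels are invented, and truncating a fractional product to a whole number is a guess, since the text only gives cases where the product is already an integer.

```shell
#!/bin/sh
# min_matches (hypothetical helper): number of matching secondary structures
# required for two structures to count as similar.
#   $1 = number of secondary structures in mol1
#   $2 = number of secondary structures in mol2
#   $3 = rat1
#   $4 = "compare" or "search"
min_matches() {
    awk -v n1="$1" -v n2="$2" -v r="$3" -v m="$4" 'BEGIN {
        n = n1                                   # searching mode uses mol1 only
        if (m == "compare" && n2 < n1) n = n2    # comparing mode uses the smaller
        # assumption: fractions are truncated; the epsilon guards against
        # binary rounding of products that should be whole numbers
        print int(n * r + 1e-9)
    }'
}

min_matches 12 10 0.5 compare   # prints 5
min_matches 12 10 0.5 search    # prints 6
```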
Alternatively, users can give this number directly by estimating, before running the program, the minimum number of secondary structures that can match each other. It has to be lower than the real number. If the number is overestimated, the program will fail to superimpose two similar structures. Underestimating is usually acceptable. However, if the value is too low (for example 3), the program might superimpose motifs instead of overall structures. This can produce many alternative superpositions, most of which are of no real interest to the user. In database searching, an excessively low value can also slow the search down unnecessarily.
Users who have no idea how to set this parameter can start either with 5 or with 30%-50% of the number of secondary structures in molecule 1 (using RATE). This will be successful in 95% of cases. If the comparison fails, see the Hint section for how to fix the problem.
ERRANG errang_alpha, errang_beta
ERRDLL errdll_alpha, errdll_beta
SINGLE/NOSIngle (or MULTiple)
SND1 Yes/No [CA]
AMPLify ampl ampltop [default: 1.5 2.0]
example:
    3DBKEYWORD FAD + FMN + FLAVIN
    3DBKEYWORD NITRATE REDUCTASE
    3DBKEYWORD FAD .or. FMN .or. FLAVIN
Equivalent to the "Keyword" column in 3DB. If this command appears, the TOP program searches only those structures with the given words in their HEADER, TITLE, KEYWDS and COMPND fields. If two keywords are separated by a space, the relation between them is "AND". If they are separated by ".or." or "+", the relation between them is "OR".
example:
    3DBTEXT FAD + FMN + FLAVIN
    3DBTEXT REDUCTASE
Equivalent to the "Text query" column in 3DB. If this command appears, the TOP program searches only those structures with the given words anywhere in the complete PDB text. If two keywords are separated by a space, the relation between them is "AND". If they are separated by "+" or ".or.", the relation between them is "OR".
Example: 3DBSEQ 0.02 GXGXTGGTX or 3DBSEQ 0.02 @zm.seq
Equivalent to the "FASTA" column in 3DB. If this command appears, the TOP program asks the 3DB server running the FASTA program for a list of structures with homology to the given sequence. It then searches for structure similarity only among those structures, and outputs superimposed coordinates if the WRITE command is present. The sequence must be in 1-letter code. It must be given either on 1 line or in a file such as the following example:
    SYTVGTYLAERLVQIGLKHHFAVAGDYNLVLLDNLLLNKNMEQVYCCNEL
    NCGFSAEGYARAKGAAAAVVTYSVGALSAFDAIGGAYAENLPVILISGAP
    NNNDHAAGHVLHHALGKTDYHYQLEMAKNITAAAEAIY
The format is free, but the sequence cannot exceed 5000 residues. For a detailed description of the cutoff value, see the 3DB Browser Help File (for TOP, this value should be between 0.02 and 0.01). This command is useful for finding structures that contain a short sequence fingerprint, or structures in a sequence family, and superimposing them together. This allows TOP to be used as a simple modelling program.
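Because the sequence file is free format but limited to 5000 residues in 1-letter code, it can be worth checking a file before submitting it. The following is a minimal sketch of such a check; the function name check_seq_file is hypothetical, and accepting any letter (including X, as in the one-line example) is an assumption.

```shell
#!/bin/sh
# check_seq_file (hypothetical helper): verify that a sequence file holds only
# letters and at most 5000 residues, ignoring spaces and line breaks.
check_seq_file() {
    seq=$(tr -d ' \t\r\n' < "$1")          # free format: strip all whitespace
    case $seq in
        '' | *[!A-Za-z]*)
            echo "invalid"
            return 1 ;;
    esac
    len=${#seq}
    if [ "$len" -gt 5000 ]; then
        echo "too long ($len)"
        return 1
    fi
    echo "ok $len"
}

# Demonstration with the short motif used in the example above
tmp=$(mktemp)
printf 'GXGXTGGTX\n' > "$tmp"
check_seq_file "$tmp"    # prints ok 9
rm -f "$tmp"
```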
example: 3DBRESOLUTION 0.1-3.0 or 3DBRESOLUTION 0.1 3.0
Equivalent to the "Resolution" column in 3DB. If this command appears, the TOP program searches only those structures whose resolution lies within the given limits, i.e. better (numerically smaller) than the 3.0 A cutoff and not below 0.1 A.
3DBBEFore (or 3DBUPPer) date
3DBAFTer (or 3DBLOWer) date
    HELIX 1 F1 LEU 96 SER 103
    HELIX 2 N1 ILE 148 ARG 160
    HELIX 3 N2 ARG 184 GLU 193
    HELIX 4 N3 GLU 223 HIS 229
    HELIX 5 N4A PRO 245 GLN 249
    HELIX 6 N4B SER 253 GLU 257
    HELIX 7 N5 MET 263 SER 266
    SHEET 1 FB 6 LYS 58 TYR 64 0
    SHEET 2 FB 6 HIS 48 ILE 55 -1
    SHEET 3 FB 6 TYR 109 LEU 116 -1
    SHEET 4 FB 6 ILE 13 SER 24 -1
    SHEET 5 FB 6 VAL 27 SER 33 -1
    SHEET 6 FB 6 HIS 75 LYS 81 -1
If there are no SSE assignments in the coordinate file, the program will take some CPU time to calculate them. If the file contains the coordinates of all main-chain atoms, the program uses the "Smith-Laskowski method" as in the PROCHECK package. If the file contains only Ca coordinates, or many main-chain atoms are missing, the program can still assign the secondary structures automatically using another method, but some elements, especially beta strands, might not be as accurate as when all the main-chain atoms are provided. In most cases, however, this does not influence the structure comparison.
Unix script file
There are several example files available at http://gamma.mbb.ki.se/~guoguang/webtop/examples showing how to use the TOP program. Here is a summary of them.
Example 1: Compare two structures
Two files, 1kxd.pdb and 1vcp.pdb, are compared by the following script file ($TOPHOME/examples/top.com in the distribution package).
    #
    rm fort.10 fort.11 fort.12
    ln -s omatrix.ofm fort.10
    ln -s mol1.ofm fort.11
    ln -s mol2.ofm fort.12
    $LUEXE/top << 'end-top'
    MOL1 1kxd.pdb
    MOL2 1vcp.pdb
    RESIDUE 3
    WRITE
    'end-top'
    #
Type "top.com > top.log"; the program outputs which secondary structure elements correspond to each other in the two structures. Optionally, the program also superimposes the two structures based on the Ca atoms and outputs the sequence comparison (see the instructions for the keyword RESIDUE). The r.m.s. deviation is output. When the WRITE statement appears, the program writes a file in which molecule 2 is superimposed onto molecule 1. In this case the output file name is 1vcp_1kxd.pdb. Sometimes there is more than one way to superimpose the two structures (e.g. when the two structures are dimers, the program can superimpose AB onto A'B' and AB onto B'A'). In this case the program outputs several superimposed coordinate files, called 1vcp_1kxd.pdb, 1vcp_1kxd.pdb_2, 1vcp_1kxd.pdb_3, .... Any graphics program (such as O, Insight or Frodo) can be used to display the superimposed coordinates together with 1kxd.pdb. Look at top.log for more information.
There are other commands that set the parameters for different kinds of comparison. For details, please see "Keyworded Input".
The TOP software can fetch coordinates directly from the Protein Data Bank (PDB) if the URL address of a PDB mirror site is provided. In this example, if you know that the PDB entry code of one of the structures is 1vcp, you can do the following:
1) add a command to indicate from which site you want to fetch the coordinates (see the PDBSITE and WEBSITE commands)
2) use xxxx@pdb in the MOL2 command: MOL2 1vcp@pdb
The program will then read 1vcp directly from Brookhaven National Laboratory.
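The two changes listed above amount to a keyworded input file of the following shape. This is a sketch only: the helper name write_top_input is hypothetical, the site address is simply one that appears elsewhere in this document and may no longer be reachable, and the layout follows Example 1 with the two changes applied.

```shell
#!/bin/sh
# write_top_input (hypothetical helper): write a keyworded input file that
# compares a local file with a PDB entry fetched from a web site.
#   $1 = URL of the PDB site      $2 = local coordinate file for molecule 1
#   $3 = PDB code for molecule 2  $4 = name of the input file to write
write_top_input() {
    cat > "$4" << EOF
PDBSITE $1
MOL1 $2
MOL2 $3@pdb
WRITE
EOF
}

# Demonstration: write the file and show it. The site below is an assumption;
# substitute a PDB server that you can actually reach.
inp=$(mktemp)
write_top_input http://www2.ebi.ac.uk 1kxd.pdb 1vcp "$inp"
cat "$inp"
rm -f "$inp"
```

The resulting file would be fed to the program on standard input, in the same way as the here-document in Example 1.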
The recommended way to run TOP is to search a compact library of Secondary Structure Elements (SSEs) first. If the SSE arrangements of some proteins are found to be similar to the structure under study, the program can carry out further comparisons based on Ca atoms (as shown in pdbsearch.com and topsearch.com). This approach requires a regularly updated SSE library, which can be obtained from ftp://gamma.mbb.ki.se/pub/guoguang/sndlib.tar.Z . It can also be made and updated automatically (see the instructions for "Automatic updating of SSE library").
Users who choose not to use the compact SSE library can use pdbscan.com or topscan.com instead of pdbsearch.com or topsearch.com to search the PDB on local disk or via the internet.
In pdbscan.com, it is assumed that the user has all the Protein Data Bank files under the directory /nfs/protein/pdb/current_release/uncompressed_files and that all the files are called *.ent. In this example file, the command
    find $pdbdir -name "*.ent" -print > current.lis
finds all the PDB entries and writes them to the file current.lis, one entry file path per line.
Still taking pdbscan.com as an example: to run a database search, type "pdbscan.com &". After some hours, all the information will be in pdbscan.log, which users usually do not need to look at. Users can look at the summary files "strdiv.lis" or "topdiv.lis" instead. (If the program crashes, you can still look at the intermediate results by typing "grep Str pdbscan.log | sort +3 -4" or "grep Top pdbscan.log | sort +3 -4".)
The content of strdiv.lis is the following:
    1692 structures are found to be similar under the given criteria
    Best Structure Diversity 7.67 with 52 matched residues to 2cnd
    Best Structure Diversity 7.68 with 56 matched residues to 1azz
    Best Structure Diversity 8.13 with 57 matched residues to 1epa
    Best Structure Diversity 8.33 with 48 matched residues to 1cnf
    Best Structure Diversity 8.48 with 54 matched residues to 1ave
    Best Structure Diversity 8.70 with 54 matched residues to 1hav
    Best Structure Diversity 8.70 with 54 matched residues to 2pia
    Best Structure Diversity 9.28 with 51 matched residues to 1avd
    ............
The structures listed here (2cnd, 1azz, 1epa and so on) are found to be similar to the search model; 2cnd is ranked by the program as the most similar structure. Users can take the command file of example 1 and pick out individual coordinates to run a single comparison, which gives the superimposed structure and details of the comparison such as the r.m.s. deviation and the sequence alignment. (This information is also inside pdbscan.log; run nicelist.com or toplist.com to get a better output.)
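The "sort +3 -4" form used in these scripts is an obsolete key syntax that many current sort implementations reject; "-k4,4" selects the same field. The sketch below, with hypothetical function names, ranks the "Structure Diversity" lines of a log file and picks out the best hit. It adds the -n flag so that scores are compared as numbers, which is an addition to the original commands (without it, a score of 10.2 would sort before 7.67).

```shell
#!/bin/sh
# rank_by_diversity (hypothetical helper): list the "Structure Diversity"
# lines of a log file with the lowest score (most similar structure) first.
rank_by_diversity() {
    # Field 4 holds the score. LC_ALL=C keeps "." as the decimal point.
    grep 'Structure Diversity' "$1" | LC_ALL=C sort -k4,4n
}

# best_match (hypothetical helper): print the code of the top-ranked structure.
best_match() {
    rank_by_diversity "$1" | head -n 1 | awk '{ print $NF }'
}

# Demonstration with three lines taken from the listing above, out of order
log=$(mktemp)
cat > "$log" << 'EOF'
Best Structure Diversity 8.13 with 57 matched residues to 1epa
Best Structure Diversity 9.28 with 51 matched residues to 1avd
Best Structure Diversity 7.67 with 52 matched residues to 2cnd
EOF
best_match "$log"    # prints 2cnd
rm -f "$log"
```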
Example 3: Searching similar structures from a compact SSE library
As explained in the description section, in the first step TOP detects similarities based on the SSE topology of two proteins. Besides coordinate files in PDB format, the program can also read a compact database that contains the SSE topology derived from the Protein Data Bank. Using the SSE library is a fast and recommended way of searching a database for similarity. To make the library from a PDB on local disk, use $TOPHOME/examples/makevec.com. To make the library from the PDB on the Web, use $TOPHOME/examples/makevec_web.com. This SSE library can be updated automatically according to the most recent PDB data. Please see the installation section.
    rm -f fort.10 fort.11 fort.12
    ln -s omatrix.ofm fort.10
    ln -s mol1.ofm fort.11
    ln -s mol2.ofm fort.12
    cat > topsearch.inp << EOF
    MATCH auto
    PDBSITE http://www2.ebi.ac.uk
    !LIBDIR /nfs/pdb/current_release/uncompressed_files/
    MOL1 kinA.pdb
    MOLVEC $TOPHOME/lib/sndlib.vec
    EOF
    $TOPBIN/top < topsearch.inp > topsearch.log
    grep Top topsearch.log | sort +3 -4 >> topdiv.lis
    grep similar topsearch.log > strdiv.lis
    grep Str topsearch.log | sort +3 -4 >> strdiv.lis
The running and analysis procedure is similar to that of example 2.
If you use another SSE database, for example MOLVEC $TOPHOME/lib/scop_structure.vec, you search only about 2000 independent domain structures selected from the SCOP database instead of 8000 in the Protein Data Bank. The search is much faster (taking only 1/10 to 1/5 of the time). For the same reason, you can use $TOPHOME/lib/scop_family.vec (about 900 domain structures) or $TOPHOME/lib/scop_superfamily.vec (about 600 domain structures) for an even shorter search. The SCOP database is not updated as frequently as the PDB, so far about once a year. The SSE database for the most recent SCOP release is always kept on our FTP distribution site.
The Web server of TOP offers another way to search all the structures: the program searches classification units of independent domain structures, families or super-families in SCOP. Once it finds a similarity, it can optionally go on to search the other structures in the same classification unit. Searching in this way is very efficient in terms of speed, although it does not cover the most recent data in the Protein Data Bank. Please have a look at: http://alfa.mbb.ki.se:8000/TOP/search_SCOP_new.html
Example 4: Superimpose all the sequence-homologous proteins in PDB
Users who wish to compare all the structures in the PDB that have sequence homology to a particular structure can use the following simple procedure to produce all the superimposed structures.
    #!/bin/csh
    rm fort.10 fort.11 fort.12
    ln -s omatrix.ofm fort.10
    ln -s mol1.ofm fort.11
    ln -s mol2.ofm fort.12
    $TOPBIN/top << 'end-top'
    MOL1 zmA.pdb
    MOLVEC snd1.vec
    pdbsite http://www2.ebi.ac.uk
    3dbseq 0.02 @zm.seq
    MATCH auto
    WRITE yes
    'end-top'
In this example zmA.pdb contains the PDB coordinates of the probe structure, and zm.seq is the file that contains its sequence in 1-letter code:
    SYTVGTYLAERLVQIGLKHHFAVAGDYNLVLLDNLLLNKNMEQVYCCNEL
    TLKFIANRDKVAVLVGSKLRAAGAEEAAVKFTDALGGAVATMAAAKSFFP
    EENALYIGTSWGEVSYPGVEKTMKEADAVIALAPVFN
    ....
The file names of the superimposed coordinates will be 1pyd_zmA.pdb, 1pvd_zmA.pdb, 1pox_zmA.pdb, ....