You may redistribute this program under the terms of the GPL.
Alternatively you may redistribute the functions and subroutines of this program as an API under the terms of the LGPL.
This program and related scripts are available as
The information below is available for download from arxiv.org/abs/1307.1811.
Revision 0.9 of 11 August 2015 added support for searching cell from the Cambridge Structural Database from Cambridge Crystallographic Data Centre as a development effort in support of image identification for synchrotrons. That version was experimental. Support for specifying the sphere radius as a percentage and for specifying a limit on the number of hits are provided.
Revision 0.8 of 24 April 2014 had revised output, following the suggestions of Graeme Winter and the good example of Nearest-Cell (Ramraj et al., 2011) in giving some of the provenance of the cell and organizing long output lists into collapsed lists of families. Unlike Nearest-Cell we have chosen to organize the families on the PDB HEADER field, rather than the sequence. For clarity, many uses of the term "algorithm" have been changed to "metric". The output of the nearest cell search has been changed to include the distance and to conform to the style of the style of the Sphere search output. The NCDist metric algorithm has been updated to the most current release. -- MA & HJB
(Andrews & Bernstein, 2012) introduced a topological embedding of the Niggli "cone" of reduced cells with the goal of calculating a meaningful distance between unit cells. In the second paper of this series, the embedding was used to determine likely Bravais lattices for a unit cell. Here we apply the embedding to searching within a database for lattices "close" to the lattice of a given probe cell.
A crystallographic cell is a representation of a lattice, but each lattice can be represented just as well by any of an infinite number of such unit cells. Searching for matches to an experimentally determined crystallographic unit cell in a large collection of previously determined unit cells is a useful verification step in synchrotron data collection and can be a screen for "similar" substances (Ramraj et al., 2011) (Mighell, 2002), but it is more useful to search for a match to the lattice represented by the experimentally determined cell, which may involve many more cells. For identification of substances with small cells, a unit cell match may be sufficient for unique identification (Mighell, 2001).
Due to experimental error and multiple cells representing the same lattice and differing choices of lattice centering, simple searches based on raw cell edges and angles can miss similarities. A database of lattices using the G6 representation of the Niggli-reduced cell as the search key provides a more robust and complete search. Searching is implemented by finding the distances from the probe cell to related cells using a topological embedding of the cone of Niggli reduced cells in G6. Comparison of results to those from older cell-based search algorithms suggests significant value in the new approach.
Tabulations of data for the identification of minerals dates to the 18th and 19th centuries. Data collected included interfacial angles of crystals (clearly related to unit cell parameters) and optical effects. See the historical review in (Burchard, 1998). With the discovery of x-ray diffraction, those tables were supplanted by new collections. Early compilations that included unit cell parameters arranged for material identification were "Crystal Structures" (Wyckoff, 1931), "Crystal Data Determinative Tables" (Donnay, 1943), and Handbook for Metals and Alloys (Pearson, 1958). Early computerized searches were created by JCPDS in the mid-1960's (Johnson, 2013) and the Cambridge Structural Data file and its search programs (Allen et al., 1973).
Those first searches were sensitive to the issues of differing equivalent presentations of the same lattice. The first effective algorithm for resolving that issue was (Andrews et al., 1980) using the V7 algorithm (NIH/EPA, 1980). Subsequently, other programs using the V7 algorithm have been described (see Table 1). The V7 algorithm has the advantage over simple Niggli-reduction based cell searches of being stable under experimental error. However, sensitivity to a change in an angle is reduced as that angle nears 90 degrees.
An effective search method must find ways to search for related
unit cells, even when they appear to be tabulated in ways that
make them seem different. A trivial example is:
a = 10.0, b = 10.01, c = 20, α = 65, β = 75, γ = 90
a = 10.0, b = 10.05, c = 20, α = 75, β = 65, γ = 90.
Clearly, these unit cells are almost identical, but simple tabulations
might separate them. A somewhat more complex example
includes the following primitive cells:
a = 3.1457, b = 3.1457, c = 3.1541, α = 60.089, β = 60.0887, γ = 60.104
a = 3.1456, b = 3.1458, c = 3.1541, α = 90.089, β = 119.907, γ = 119.898.
Here the relationship is not as obvious. The embedding of (Andrews & Bernstein, 2012) can be used to show that the distance between these two cells is quite small in G6 (0.004 Ångstrom units squared in G6).
The program SAUC is structured to allow use of several alternative metrics for searching among cells in an attempt to identify cells representing similar lattices. To simplify comparisons among results with the different metrics, all have been linearized and normalized, i.e. converted to Ångstrom units and scaled to be commensurate with the L2 norm given below:
A simple L1 or L2 norm based on
[a, b, c, α(b + c)/2, β(a + c)/2, γ(a + b)/2]
with the distance scaled by 1/√(6) in the case of the L1 norm and unscaled in case of the L2 norm. The angles as assumed to be in radians and the edges in Ångstroms. The angles were converted to Ångstroms by multiplying by the average to the relevant edge lengths.
The square root of the BGAOL Niggli cone embedding
distance NCDist based on
[a2, b2, c2, 2bc cos(α), 2ac cos(β), 2ab cos(γ)].
Before taking the square root, the distances are scaled by √(6) divided by the average length of the cell edges. The overall square root linearizes the metric to Ångstrom units.
The V7 distances based on individual components linearized
to Ångstrom units
[a, b, c, 1/a∗, 1/b∗, 1/c∗,V1/3]
and scaled by √(6/7). V is the volume.
These metrics are applied to reduced primitive cells [a, b, c, α, β, γ] and, when the reciprocal cell [a∗, b∗, c∗, α∗, β∗, γ∗] is needed for the V7 metric, that cell is also reduced.
In order to facilitate comparisons to older searches that just consider simple ranges in [a, b, c, α, β, γ], an option for such searches was also included in SAUC.
The square root satisfies the stated requirements. It is monotone,
√(u + v) ≤ √(u) + √(v)
↔ u + v ≤ (√(u) + √(v))2 = u + v + 2√(uv)
which is clearly true.
Range searching in amapped embedding needs to be done using a nearest-neighbor algorithm (or "post-office problem" algorithm (Knuth, 1973)). Exact matches are unlikely since most unit cells representing lattices in a database are experimental, and probe cells are also likely to be experimental data. Several efficient algorithms are available; we have used an implementation of neartree (Andrews, 2001).
The raw unit cell data is loaded into the tree once and serialized to a dump file on disk; subsequent searches do not need to wait for the O(Nlog(N)) tree build, which for the 70,000+ cells from the PDB can take half an hour in the BGAOL NCDist metric. The linearization makes the search space more compact and reduces the tree depth, thereby speeding searches. Because the PDB unit cell database contains many identical cells, we modified NearTree to handle the duplicates in auxiliary lists, further reducing the tree depth and speeding searches.
The simplest approach to lattice searching is a simple box search on ranges in unit cell a, b and c and possibly on α, β, and γ, as for example in the "cell dimensions" option in the RCSB advanced search at http://www.rcsb.org/pdb/search/advSearch.do for the Protein Data Bank (Berman et al., 2000). In the following examples, we will call that type of search "Range". For the reasons discussed above, such simple searches can fail to find unit cells representing similar lattices but with very different edges that actually represent similar lattices. Such searches are best characterized as cell searches, rather than as lattice searches.
Searching on primitive reduced cells greatly improves the reliability of a search, as for example in (Ramraj et al., 2011) at http://www.strubi.ox.ac.uk/nearest-cell/nearest-cell.cgi, which uses a metric based on the reduced cell and all permutations of axes.While an improvement over simple range searches as discussed above, such searches can also miss similar lattices if the number of alternate lattice presentations considered is not complete. One way to reduce such gaps in searches is to use only parameters that do not depend on the choice of reduced presentation. The (Andrews et al., 1980) approach using 7 parameters (three reduced cell edges, three reduced reciprocal cell edges and the volume), "V7", helps, but has difficulty distinguishing cells with angles near 90 degrees. The NCDist approach used here, derived from (Andrews & Bernstein, 2012) both fills in the gaps and handles angles near to 90 degrees.
Consider, for example, the unit cells of phospholipase A2 discussed
by (Le Trong & Stenkamp, 2007). They present three
alternate cells from three different PDB entries that are actually
for the same structure:
[57.98, 57.98, 57.98, 92.02, 92.02, 92.02] from entry 1FE5 (Singh et al., 2001) in space group R32,
[80.36, 80.36, 99.44, 90, 90, 120] from entry 1U4J (Singh et al., 2005) in space group R3 and
[80.949, 80.572, 57.098, 90.0, 90.35, 90.0] from entry 1G2X (Singh et al., 2004) in space group C2. No simple Range search can bring these three cells together. For example, if we use the PDB advanced cell dimensions search around the cell from IU4J with edge ranges of +-3 Ångstroms and angle ranges of +-1 degree, we get 28 hits: 1CG5, 1CNV, 1FW2, 1G0Z, 1GS7, 1GS8, 1HAU, 1ILD, 1ILZ, 1IM0, 1LR0, 1NDT, 1OE1, 1OE2, 1OE3, 1QD5, 1U4J, 2BM3, 2BO0, 2H8A, 2HZ5, 2OHG, 2REW, 2WCE, 3I06, 3KKU, 3Q98, 3RP2, of which only three actually have cells close to the target using the linearized NCDist metric : 2WCE at 2.96 Ångstroms, 1G0Z at 0 Ångstroms, and 1U4J, the target itself. The remaining cells are, as we will see, rejected under the Nearest-Cell and the V7 metric. The simple Range searches are not appropriate to this problem.
Table 2 shows partial results from a lattice search using Nearest-Cell, and a V7 search using SAUC and a NCDist search using SAUC. We have restricted the searches to NCDist distances ≤ 3.5 Ångstroms. The Nearest-Cell metric appears to be in Å2. The column with the square root of the Nearest- Cell metric facilitates comparison with the linearized SAUC V7 and NCDist metrics. The searches showed consistent behavior: The three cells noted by (Le Trong & Stenkamp, 2007) are found in the same relative positions by all three searches. All cells found by Nearest-Cell are also found by both V7 and NCDist. Of the 42 structures found by all three metrics within 3.5 Ångstroms under the NCDist metric, four (1G0Z, 1G2X, 1DPY and 1FE5) are E.C. class 188.8.131.52 Phospholipase A2 structures, and three (1PKR, 1SGC and 1VRI) are other hydrolases (E. C. classes 184.108.40.206, 220.127.116.11, and 18.104.22.168, respectively) However, ten cells found by V7 and NCDist were not found by Nearest-Cell (2OSN, 2CMP, 3MIJ, 2SGA, 2YZU, 3SGA, 4SGA, 5SGA, 1CDC and 2CVK). Of those ten, one (2OSN) is an E.C class 22.214.171.124 Phospholipase A2 structure and four (2SGA, 3SGA, 4SGA and 5SGA are hydrolases, specifically E.C. class 126.96.36.199 Proteinase A. Two of the ten (2YZU and 2CVK) are thioredoxin, for which the ProMOL (Craig et al., submitted) motif finder shows significant active site homologies to multiple hydorolase motifs (2YZU has site homologies to 132L, 135L and 1LZ1 in E.C. class 188.8.131.52 and to 4HOH in E.C. class 184.108.40.206, 2CVK to 1AMY in E.C. class 220.127.116.11, to 1BF2 in E.C. class 18.104.22.168, to 1EYI in class 22.214.171.124, etc.). For 1CDC, a "metastable structure of CD2", proMOL shows an active site homology to 1ALK of E.C. class 126.96.36.199, another hydrolase.
The significant gaps in the Nearest-Cell search do not appear to be a problem of the distance for the Nearest-Cell search having been cut off at a too-small value. For the common hits between the square root of the Nearest-Cell metric and the linearized NCDist metric, a linear fit is excellent, with R2 = 0.89 and no points are very far from the line. The agreement of the linearized V7 to the other two metrics is much noisier because of loss of sensitivity of the V7 metric for angles near 90 degrees and the inherent difficulty the V7 metric has in discriminating between the + + + and - - - parts of the Niggli cone. For example, 1GUT (Schüttelkopf et al., 2002) is at distances 1.2 and 3.7 from 1UJ4 in the Nearest-Cell and linearized NCDist metrics, respectively, but only 0.1 in the V7 metric. The 1GUT cell is
[78.961, 82.328, 57.031, 90.00, 93.44, 90.00] in C 1 2 1, Z=24, with a primitive cell
[57.031, 57.0367, 57.0367, 92.3918, 92.3804, 92.3804] which corresponds to a G6 vector
[3252.53, 3253.18, 3253.18,-271.53,-270.208,-270.208] and a linearized V7 vector
[52.8004, 52.8057, 52.8057, 52.7101, 52.7101, 52.7053, 52.7569]. The 1U4J cell is
[80.36, 80.36, 99.44, 90, 90, 120] in R3, Z=18, with a primitive cell
[57.02, 57.02, 57.02, 89.605, 89.605, 89.605] which corresponds to a G6 vector
[3251.28, 3251.28, 3251.28, 44.8265, 44.8265, 44.8265] and a linearized V7 vector
[52.7902, 52.7902, 52.7902, 52.7878, 52.7878, 52.7878, 52.789]
This is almost identical to the 1GUT V7 vector, even though the corresponding primitive cells and G6 cells differ significantly.
|Cryst||Andrews et al., 1980|
|Quest||Allen et al., 1973||Reduced cell|
|Nearest-Cell||Ramraj et al., 2011||Reduced cell|
|WebCSD, Conquest||Thomas et al., 2010||G6 iterative|
|SAUC||this work||G6, Niggli embedding|
|1U4J (*)||0||0||0||0||Phospholipase A2 isoform 2||188.8.131.52|
|1G2X (*)||0.11||0.33||0.2||0.9||Phospholipase A2||184.108.40.206|
|2OSN||0.2||0.9||Phospholipase A2 isoform 3||220.127.116.11|
|2CMP||0.7||1.5||Terminase small subunit|
|3KP8||0.43||0.66||1.1||1.7||VKORC1/Thioredoxin domain protein|
|3E56||0.4||0.63||1.5||1.9||Putative uncharacterized protein|
|1CSQ||0.49||0.7||1.8||2.0||Cold Shock Protein B (CSPB)|
|3SVI||0.54||0.73||1.9||2.1||Type III effector HopAB2|
|1FKF||0.83||0.91||2.7||2.4||FK506 binding protein||18.104.22.168|
|1FKJ||0.83||0.91||2.7||2.4||FK506 binding protein||22.214.171.124|
|1FKD||0.86||0.93||2.8||2.5||FK506 binding protein||126.96.36.199|
|2FKE||0.91||0.95||2.9||2.6||FK506 binding protein||188.8.131.52|
|3TJY||0.88||0.94||3.0||2.6||Effector protein hopAB3|
|2I5L||1.06||1.03||3.7||2.7||Cold shock protein cspB|
|1F9P||1.37||1.17||4.7||3.1||Connective tissue activating peptide-III|
|3SGA||5.0||3.1||Proteinase A (SGPA)||184.108.40.206|
|4SGA||4.8||3.1||Proteinase A (SGPA)||220.127.116.11|
|5SGA||4.9||3.1||Proteinase A (SGPA)||18.104.22.168|
|1GUS||1.15||1.07||0.3||3.2||Molybdate binding protein II|
|2VRI||1.5||1.22||5.2||3.2||Non-structural protein 3||22.214.171.124|
|1FE5 (*)||1.24||1.11||2.3||3.3||Phospholipase A2||126.96.36.199|
|1GUT||1.2||1.1||0.1||3.37||Molybdate binding protein II|
|2C9Q||1.46||1.21||4.8||3.3||Copper resistance protein C|
|2HE2||1.48||1.22||4.8||3.3||Discs large homolog 2|
|2IT5||1.62||1.27||5.6||3.3||CD209 antigen, DCSIGN-CRD|
|3SU5||1.58||1.26||5.1||3.3||NS3 protease, NS4A protein|
|3SU6||1.52||1.23||5.0||3.3||NS3 protease, NS4A protein|
|1SL4||1.68||1.3||5.8||3.4||mDC-SIGN1B type I isoform|
|3SU3||1.64||1.28||5.3||3.4||NS3 protease, NS4A protein|
inhibitor family protein
|2E6L||1.78||1.33||5.8||3.5||Nitric oxide synthase, inducible||188.8.131.52|
|3SV6||1.74||1.32||5.6||3.5||NS3 protease, NS4A protein|
|3SV7||1.73||1.32||5.6||3.5||NS3 protease, NS4A protein|
Allen, F. H., Kennard, O., Motherwell, W. D. S., Town, W. G. & Watson, D. G. (1973). Journal of Chemical Documentation, 13(3), 119 -- 123.
Andrews, L. (2001). C/C++ Users Journal, 19, 40 -- 49. http://sf.net/projects/ neartree.
Andrews, L. C. & Bernstein, H. J. (2012). arXiv preprint arXiv:1203.5146. arxiv.org/abs/1203.5146.
Andrews, L. C. & Bernstein, H. J. (2013). arXiv preprint arXiv:1305.6561. arxiv.org/abs/1305.6561.
Andrews, L. C., Bernstein, H. J. & Pelletier, G. A. (1980). Acta Crystallogr. A36, 248 -- 252. Berman, H. M., Westbrook, J., Feng, Z., Gilliland, G., Bhat, T. N., Weissig, H., Shindyalov, I. N. & Bourne, P. E. (2000). Nucleic Acids Res. 28, 235 -- 242.
Bernstein, F. C., Koetzle, T. F., Williams, G. J. B., Meyer, Jr., E. F., Brice, M. D., Rodgers, J. R., Kennard, O., Shimanouchi, T. & Tasumi, M. (1977). J. Mol. Biol. 112, 535 -- 542.
Burchard, U. (1998). The Mineralogical Record, 29(6), 517 -- 583.
Craig, P. A., Hanson, B., Westin, C., Rosa, M., Bernstein, H. J., Grier, A., Osipovitch, M., MacDonald, M., Dodge, G., Boli, P. M., Corwin, C. W. & Kessler, H. (submitted). B. M. C. Bioinformatics.
Donnay, J. D. H. (1943). The American Minerologist, 28, 313.
Johnson, G. G. (2013). Private Communication.
Knuth, D. E. (1973). Sorting and Searching (The Art of Computer Programming volume 3). Addison Wesley, Reading, MA.
Le Trong, I. & Stenkamp, R. E. (2007). Acta Crystallogr. D63, 548 -- 549.
Mighell, A. D. (2001). Journal of Research - National Institute of Standards and Technology, 106(6), 983 -- 996.
Mighell, A. D. (2002). Journal of Research - National Institute of Standards and Technology, 107(5), 425 -- 430.
NIH/EPA (1980). User's Manual NIH-EPA Chemical Information System, chap. User's Guide to CRYST The X-Ray Crystallographic Search System. National Institutes of Health, Environmental Protections Agency.
Pearson, W. B. (1958). Handbook of Lattice Spacings and Structures of Metals and Alloys. International Series of Monographs on Metal and Physics and Physical Metallurgy, G. V. Raynor (ed.). Pergamon Press.
Ramraj, V., Esnouf, R. & Diprose, J. (2011). Nearest-Cell A fast and easy tool for locating crystal matches in the PDB. Tech. rep. Division of Structural Biology, University of Oxford. http://www.strubi.ox.ac.uk/nearest-cell/nearest-cell.cgi.
Schüttelkopf, A. W., Harrison, J. A., Boxer, D. H. & Hunter, W. N. (2002). Journal of Biological Chemistry, 277(17), 15013 -- 15020.
Singh, G., Gourinath, S., Saravanan, K., Sharma, S., Bhanumathi, S., Betzel, C., Srinivasan, A. & Singh, T. P. (2004). Acta Crystallogr. F61(1), 8 -- 13.
Singh, G., Gourinath, S., Sarvanan, K., Sharma, S., Bhanumathi, S., Betzel, C., Yadav, S., Srinivasan, A. & Singh, T. P. (2005). J. Struct. Biol. 149(3), 264 -- 272.
Singh, G., Gourinath, S., Sharma, S., Paramasivam, M., Srinivasan, A. & Singh, T. P. (2001). J. Mol. Biol. 307(4), 1049 -- 1059.
Thomas, I. R., Bruno, I. J., Cole, J. C., Macrae, C. F., Pidcock, E. & Wood, P. A. (2010). J. Appl. Crystallogr. 43(2), 362 -- 366.
Toby, B. (1994). Private Communication.
Wyckoff, R. W. G. (1931). The structure of crystals. No. 19. The Chemical Catalog Company. inc.