CAMBRIDGE STRUCTURAL DATABASE HELP – writing an introduction and citing accordingly
Report Draft 1
As the world is continuously changing, so is the world of science. The accessibility of
knowledge and increased globalisation have allowed for huge advancements in chemistry and
pharmacy. In this report we shall analyse the structures held in the Cambridge
Structural Database and their evolution over time.
A molecule is the smallest particle of a substance that has all the physical and chemical
properties of that substance. Molecules are made up of one or more atoms. When a compound
has a low molecular weight (less than 900 Daltons) it is referred to as a small molecule. i
This is relevant in pharmaceutical practice as small molecule drugs have some distinct
advantages as therapeutics: most can be administered orally, and they can pass through cell
membranes to reach intracellular targets. Due to their relatively low molecular weight and
simple chemical structures, their pharmacokinetics and pharmacodynamics are normally
more predictable than those of biologics, which often leads to simpler dosing protocols. ii
The crystalline arrangement of atoms in a single crystal is recognisable macroscopically, as the
shape of the crystal reflects the atomic structure. The crystallization of proteins, nucleic
acids and large biological complexes, such as viruses, depends on the creation of a solution
that is supersaturated in the macromolecule but exhibits conditions that do not significantly
perturb its natural state. Supersaturation is produced through the addition of mild
precipitating agents such as neutral salts or polymers, and by the manipulation of various
parameters that include temperature, ionic strength and pH. Also important in the
crystallization process are factors that can affect the structural state of the macromolecule,
such as metal ions, inhibitors, cofactors or other conventional small molecules. A variety of
approaches have been developed that combine the spectrum of factors that affect and
promote crystallization, and among the most widely used are vapor diffusion, dialysis, batch
and liquid-liquid diffusion. Successes in macromolecular crystallization have multiplied
rapidly in recent years owing to the advent of practical, easy-to-use screening kits and the
application of laboratory robotics. iii
1. Solid state, basic crystallography concepts e.g. unit cell, atom position, Z’
What is X-ray diffraction:
• The definitive structural probe for the solid state
• Involves shining X-rays onto crystalline materials and observing scattered radiation
• Analysis of the scattered radiation yields much information, including:
• Where atoms are (you can ‘see’ them)
• How they are connected to form molecules
• How they pack to form crystals
• To what extent they are vibrating
Molecular crystals:
• Single crystal: one crystal, size typically 0.1 x 0.1 x 0.1 mm
• Powder: millions of crystals, size typically a few µm
An analogy:
• Lego “building” = crystal
• One brick = one cell
• One dot = one molecule
• There is long-range order
Remember:
• The unit cell is purely imaginary
• We use it because it is easier to deal with one basic block that repeats than to deal with
every atom
• The only real thing is the full array of atoms
Describing a crystal:
• Unit cell type and dimensions
• Atomic coordinates
• Space group symmetry
Most common unit cell types:
• Orthorhombic – like a shoebox
a ≠ b ≠ c; α = β = γ = 90°
• Monoclinic – like a shoebox where you’ve leaned on one edge
a ≠ b ≠ c; α = γ = 90°, β ≠ 90°
Atom positions: X,Y,Z
Are defined relative to the unit cell axes
Refer to points along each axis, treating the origin as ‘0’ and the end of an axis as ‘1’
Coord   Axis   Min   Max
x       a      0     1
y       b      0     1
z       c      0     1
• A typical cell contains more than one molecule
• These molecules are (in general) not independent, but are related by symmetry elements
• So we do not need to know the positions of every molecule in the unit cell
• Rather, we need to find just one molecule, as long as we know the symmetry relationships
between the molecules
Powder X-ray diffraction (PXRD)
• Diffractometer
• Data
Used when?
• Powders are collections of millions of tiny (too small to be seen easily) single crystals
• Generally, when you can’t grow a single crystal of the API of interest
• When dealing with bulk powders in the pharmaceutical industry
Source: X- ray diffraction lecture
Crystal form
• APIs in the solid form can be either
• Crystalline, or
• Amorphous (no regular molecular lattice arrangement)
• Many crystalline APIs exist in more than one crystal form with different crystal packing
arrangements. These different forms are known as polymorphs
Crystal forms – polymorphs
• Polymorphs have different physicochemical and mechanical properties:
• Solubilities, stabilities, dissolution rates, melting points etc.
• As these are key factors in formulation, it is VITAL that the polymorphic form of the API
being formulated is known and is consistent
• i.e. different batches of the API do not consist of different polymorphs
• If they did, formulations would not behave as expected, with important consequences for
the patient
Get polymorphism wrong…
• Example: ritonavir, used for HIV: Abbott lost $250M in sales by developing a formulation
based around a metastable polymorph and then having to re-formulate when the insoluble
stable polymorph was found to be forming upon storage. The insoluble form was not
absorbed and so patients were no longer ‘getting the drug’, despite taking it!
• What to do?
• Try to formulate around the most stable polymorph if possible
• Strategies exist for discovering the most stable polymorph; these are generally known as
‘polymorph screens’ and involve lots of re-crystallisations under different conditions (e.g.
different solvents) to find polymorphs
Source: Powders lectures
2. The importance of crystal structures as a source of information on accurate
molecular structure
CSD data are widely used in establishing standard molecular dimensions,
determining conformational preferences, and in the study of intermolecular
interactions, all of which are crucial in structural chemistry, rational drug design,
pharmaceutical materials design, and drug delivery. More recently, information
derived from the CSD has been used to construct two dynamic libraries of structural
knowledge: Mogul, which stores intramolecular information, and IsoStar, which
stores information about intermolecular interactions. These electronic libraries
provide click-of-a-button access to structural information and, in turn, serve as
sources of knowledge for applications software that address specific problems in
structural chemistry, rational drug design, and crystallography.
Source: https://www.sciencedirect.com/topics/chemistry/crystal-structure-data
3. The importance of 3d structures in drug development
The 3-dimensional (3D) structure of therapeutics and other bioactive molecules is an
important factor in determining the strength and selectivity of their protein–
ligand interactions. Previous efforts have considered the strain introduced and
tolerated through conformational changes induced upon protein binding.
Source: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7294711/
4. What is the CSD and what does it allow you to do
The Cambridge Structural Database (CSD) is a system that scientists around the
world use to access information on organic and metal-organic small-molecule crystal
structures. The database was established in 1965 and by the early 1980s it was
distributed to more than 30 countries. Now it is distributed to academics in 70
countries worldwide. There are many ways to inspect the growth of the database
and determine its significance. The growth in the number of compounds as a function of time
is striking and could have a tremendous impact on the world of science and
technology. The database supports research in structural chemistry, materials
science, life sciences and drug discovery/development.
5. The importance of the CSD as a source of crystallography and molecular
information
Crystal structure data are of fundamental importance in a wide spectrum of scientific
activities.
Source: https://www.sciencedirect.com/topics/chemistry/crystal-structure-data
As all the information is collated in one system, it allows for those using the database
to search for compounds quickly and efficiently, which was not previously possible.
Over the years crystallography has enabled many improvements to our daily lives,
such as the development of more powerful batteries.
The Cambridge Structural Database (CSD) contains a complete record of all
published organic and metal–organic small-molecule crystal structures. The
database has been in operation for over 50 years and continues to be the primary
means of sharing structural chemistry data and knowledge across disciplines. As
well as structures that are made public to support scientific articles, it includes many
structures published directly as CSD Communications. All structures are processed
both computationally and by expert structural chemistry editors prior to entering the
database. A key component of this processing is the reliable association of the
chemical identity of the structure studied with the experimental data. This important
step helps ensure that data is widely discoverable and readily reusable. Content is
further enriched through selective inclusion of additional experimental data. Entries
are available to anyone through free CSD community web services. Linking services
developed and maintained by the CCDC, combined with the use of standard
identifiers, facilitate discovery from other resources. Data can also be accessed
through CCDC and third party software applications and through an application
programming interface.
Source: https://www.researchgate.net/publication/299572828_The_Cambridge_Structural_Database
Organic crystal structures include:
• Drugs and pharmaceuticals
• Agrochemicals
• Pigments
• Explosives
• Protein ligands.
Metal-Organic crystal structures include:
• Metal Organic Frameworks (MOFs)
• Models for new catalysts
• Porous frameworks for gas storage
• Fundamental chemical bonding.
Source: https://www.ccdc.cam.ac.uk/solutions/csd-core/components/csd/
6. The limitations of the CSD due to limited metadata i.e. how much data there is
associated with the crystal data
– Chemical content of the CSD
– Problems with crystal structures
– Accuracy, presentation and extent of CSD data
– The relevance of the CSD data
– Problems endemic to data analysis
Source: Article sent by Dr Shankland
7. Describe how the data were collected etc.
Source: https://www.ccdc.cam.ac.uk/Community/depositastructure/scientific-datapreservation/
8. Factors affecting how the CSD has changed over the years e.g. improved
instrumentation
Women in science – Article: https://scripts.iucr.org/cgibin/paper?S2053273318092860
9. Your aims and objectives
The aim of this project is to observe the factors that have changed within the
database and analyse their change over time. These parameters include: the
number of structures determined by single-crystal diffraction, the number of
structures determined by powder diffraction, the number of co-crystals, the number
of neutron structures, the number of electron structures, the average number of
atoms in structures, the changing distribution of space groups, the number of
hydrates and the rate of change of structure deposition. These changes will then be
analysed to establish their significance and conclude the pros and cons of such a
rapid growth in the CSD.
The CSD has rapidly changed since its creation and now has more than a million
crystal structures. The goal of this project is to examine the evolution of several
important indicators (such as structure size, unit cell volume, and space group
distribution) over time. To better spot trends, searches will be categorised by several
criteria, including radiation source. Conclusions will be drawn from the search results
once they have been plotted or tabulated.
Objectives:
– Carry out a literature survey of reliable sources, using PubMed and Web of
Science, to establish a basis for the research.
– Learn how to use the CSD and to understand its importance in research
– Determine reliable ways in which the CSD has evolved by analysing the
growth of the database over time.
– Determine a reliable way of identifying and extracting the parameters of
interest
– Collate and tabulate results found and discuss any trends found in relation to
crystallography
– Summarise results, form a conclusion and submit a reliable report
We began by accessing the database to extract data and analyse how it has evolved
every 5 years since 1980. We then plotted graphs to show the overall change from 1980 to
2020.
We then analysed the period 2020-2022. Having this separate timeframe allowed the evolution
during the time of the pandemic to be discussed.
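The five-year binning described above can be sketched in a few lines of Python. This is only an illustration of the counting step: the function names are hypothetical, the records are placeholders that show the shape of the input (they are not CSD data), and how the year and method are actually exported from the database is left open.

```python
from collections import Counter

def five_year_bin(year, start=1980, width=5):
    """Return the first year of the bin that contains `year` (e.g. 1987 -> 1985)."""
    return start + ((year - start) // width) * width

def count_by_bin(records, start=1980, end=2020, width=5):
    """Count records per (bin start, method) for years between start and end inclusive."""
    counts = Counter()
    for year, method in records:
        if start <= year <= end:
            counts[(five_year_bin(year, start, width), method)] += 1
    return counts

# Placeholder (year, method) records, NOT real CSD data.
records = [
    (1981, "single-crystal"),
    (1984, "single-crystal"),
    (1986, "powder"),
    (2019, "single-crystal"),
    (1975, "single-crystal"),  # outside the 1980-2020 window, so ignored
]
counts = count_by_bin(records)
for (bin_start, method), n in sorted(counts.items()):
    print(bin_start, method, n)
```

Keeping the bin start and the method together in the key means the same table can be plotted as one line per method or summed into an overall total.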
Future Challenges:
The advent of big data
The CSD was once one of the few, now it is one of the many; there are scientific databases all over the
place. Furthermore, collecting data used to be an unfashionable and boring activity but has now become
exciting and à la mode. Online information can be gathered together by spiders and crowdsourcing and
web indexed; an obvious example is ChemSpider [552]. Add in artificial intelligence and cognitive
augmentation, with neural nets combined together into deep-learning algorithms, and big data is upon us.
There seems no limit to what it can do. If it is any comfort, the human brain has 10^11 neurons and 10^14
synapses, so we might still have a role to play.
Having become a small fish in a big pond (well, “medium-sized fish” is probably a more apposite
metaphor), the first requirement for the CSD of the future is that it should be accessible to client
applications as well as to humans. Its value will be fully realized only if it can be linked to other data
compilations. CCDC has taken the critical step that will enable this goal to be achieved by releasing the
CSD Python API (section 4.1.4). It is likely to be directly used (as opposed to indirectly) by only a minority
of the CSD user community but nonetheless is of primary strategic importance.
The continued evolution of crystallography
We said in the Introduction that a small-molecule crystal structure can now be determined in a few hours.
This is by using standard equipment. Advanced instrumentation with new hybrid pixel detectors enables
data to be collected in a few minutes. Furthermore, a new software feature (Rigaku’s “What
is this?” [553]) will try to solve the structure as the data collection proceeds, enabling a basic atomic
coordinate set to be obtained in less than 2 min. Add to this the increasing number of 3D structures
solved by powder-diffraction and the continued development of techniques like NMR crystallography,
micro-electron diffraction, and cryo-electron microscopy, and the mind boggles at how many structures
appropriate for inclusion in the CSD will be produced annually in the years to come. Also, crystal
structure prediction may become sufficiently reliable that at least some of its results will be deemed
suitable for inclusion in the CSD (albeit, flagged as “theoretical”) [554−556]. We may assume that the
exponential rise in crystallographic output must eventually flatten off, but there is no reason to expect it
any time soon.
It is nevertheless important that the current focus on maintaining CSD data quality is maintained. It might
be thought that, with so many structures, a pool of “broken structures” (incorrect chemistry assignments,
etc.) can be tolerated. The problem is that there will still be many searches that produce few hits; multiply
the number of structures in the CSD a hundred-fold and they will still represent a small fraction of
chemical space. Further, using the CSD in big-data analyses, possibly with sophisticated machine-learning
techniques, is likely to bury the effects of database errors so deeply that they will become impossible to
detect.
The continued evolution of chemistry
proportion of the CSD input comprises very complex molecules, frequently large, often polymeric, and
sometimes with unusual bonding or exotic topologies (e.g., Figure 33).
Representing these chemistries in an accurate, searchable form is already challenging and will only
become more so. There is perhaps a perception that the molecules in the CSD are simple, small, and much
easier for database builders to deal with than biological macromolecules. The ingenuity of synthetic
chemists has changed that.
The solutions to the problems outlined in this and the previous subsections must primarily lie in
improvements to the software infrastructure around the CSD, for both building and searching the
database. This will be a difficult undertaking. Nevertheless, it is appropriate for us to introduce a positive
note. We have talked about the challenges of the future but must remember that meeting them will
greatly increase the value of the CSD. Einstein, as usual, hit the nail on the head: in the middle of difficulty
lies opportunity.
i University lecture
ii https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6072476/
iii https://pubmed.ncbi.nlm.nih.gov/24419610/
Faid Salmoon; Professor K Shankland
What can be learnt from analysing the evolution of the Cambridge Structural Database?
Background:
The Cambridge Structural Database (CSD) is a system that scientists around the world use to
access information on organic and metal-organic small-molecule crystal structures. The database
was established in 1965 and by the early 1980s it was distributed to more than 30 countries. Now it
is distributed to academics in 70 countries worldwide. There are many ways to inspect the growth
of the database and determine its significance. The growth in the number of compounds as a function
of time is striking and could have a tremendous impact on the world of science and technology. The
database supports research in structural chemistry, materials science, life sciences and drug
discovery/development.
Crystallography is used by materials scientists to characterise different materials. The crystalline
arrangement of atoms in a single crystal is recognisable macroscopically, as the shape of the crystal
reflects the atomic structure. Crystallographic databases such as the CSD, the Inorganic
Crystal Structure Database and CrystMet (used for metals and related materials) allow the
structure of a compound to be represented graphically, so that its geometry can be inspected and
easily changed if required. As all the information is collated in one system, it allows for those using the
database to search for compounds quickly and efficiently, which was not previously possible. Over
the years crystallography has enabled many improvements to our daily lives, such as the
development of more powerful batteries.
The aim of this project is to observe the factors that have changed within the database and
analyse their change over time. These parameters include: the number of structures determined by
single-crystal diffraction, the number of structures determined by powder diffraction, the number of
co-crystals, the number of neutron structures, the number of electron structures, the average
number of atoms in structures, the changing distribution of space groups, the number of hydrates
and the rate of change of structure deposition. These changes will then be analysed to establish
their significance and conclude the pros and cons of such a rapid growth in the CSD.
Aims:
The CSD has rapidly changed since its creation and now has more than a million crystal
structures. The goal of this project is to examine the evolution of several important indicators (such
as structure size, unit cell volume, and space group distribution) over time. To better spot trends,
searches will be categorised by several criteria, including radiation source. Conclusions will be
drawn from the search results once they have been plotted or tabulated.
Objectives:
– Carry out a literature survey of reliable sources, using PubMed and Web of Science,
to establish a basis for the research.
– Learn how to use the CSD and to understand its importance in research
– Determine reliable ways in which the CSD has evolved by analysing the growth of the
database over time.
– Determine a reliable way of identifying and extracting the parameters of interest
– Collate and tabulate results found and discuss any trends found in relation to
crystallography
– Summarise results, form a conclusion and submit a reliable report
• Introduction to molecules.
• A combination of atoms creates a
• importance of small molecules to pharmacy and medicine – Small molecules serve
as drugs in therapies and chemical probes to explore the novel function
of a target. Structural biology is a critical method to interpret interactions
between small molecules and their targets, serving as an important tool in
target-based drug discovery (
• importance of crystal structures as a source of info on accurate molecular structures.
Crystal structure data are of fundamental importance in a wide
spectrum of scientific activities. CSD data are widely used in
establishing standard molecular dimensions, determining
conformational preferences, and in the study of intermolecular
interactions, all of which are crucial in structural chemistry,
rational drug design, pharmaceutical materials design, and
drug delivery.
• Importance of 3D structure in drug development in pharmacy – The 3-dimensional
(3D) structure of therapeutics and other bioactive molecules is an important
factor in determining the strength and selectivity of their protein–ligand
interactions. Previous efforts have considered the strain introduced and
tolerated through conformational changes induced upon protein binding.
• Crystallography is the experimental science of determining the
arrangement of atoms in crystalline solids. Crystallography is a
fundamental subject in the fields of materials science and solid-state physics
Solid state dr cobolva second year
X-ray diffraction:
• The definitive structural probe for the solid state
• Involves shining X-rays onto crystalline materials and observing scattered radiation
• Analysis of the scattered radiation yields much information, including:
• Where atoms are (you can ‘see’ them)
• How they are connected to form molecules
• How they pack to form crystals
• To what extent they are vibrating
Molecular crystals:
Single crystal
• One crystal
• Size typically 0.1 x 0.1 x 0.1 mm
Powder
• Millions of crystals
• Size typically a few µm
An analogy:
• Lego “building” = crystal
• One brick = one cell
• One dot = one molecule
• There is long-range order
Remember:
• The unit cell is purely imaginary
• We use it because it is easier to deal with one basic block that repeats than to deal with
every atom
• The only real thing is the full array of atoms
Describing a crystal:
• Unit cell type and dimensions
• Atomic coordinates
• Space group symmetry
Unit cell dimensions:
• An origin
• Three lengths (a, b, c)
• Three angles (α, β, γ)
• Lengths are in Å
• Angles are in degrees
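Unit cell volume is one of the indicators tracked in this project, and it follows directly from the three lengths and three angles. The sketch below uses the standard general (triclinic) volume formula, which reduces to a x b x c when all angles are 90°. The function name and the example cell values are illustrative assumptions, not values taken from the CSD.

```python
import math

def unit_cell_volume(a, b, c, alpha=90.0, beta=90.0, gamma=90.0):
    """Volume in cubic ångströms from lengths in Å and angles in degrees.

    V = abc * sqrt(1 - cos^2(alpha) - cos^2(beta) - cos^2(gamma)
                   + 2 cos(alpha) cos(beta) cos(gamma))
    """
    ca, cb, cg = (math.cos(math.radians(x)) for x in (alpha, beta, gamma))
    return a * b * c * math.sqrt(1 - ca**2 - cb**2 - cg**2 + 2 * ca * cb * cg)

# Hypothetical orthorhombic cell: all angles 90 degrees, so V = a * b * c
v_ortho = unit_cell_volume(5.0, 6.0, 7.0)

# Hypothetical monoclinic cell: only beta differs from 90 degrees, so V = a * b * c * sin(beta)
v_mono = unit_cell_volume(5.0, 6.0, 7.0, beta=100.0)
```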
Most common unit cell types:
• Orthorhombic – like a shoebox
a ≠ b ≠ c; α = β = γ = 90°
• Monoclinic – like a shoebox where you’ve leaned on one edge
a ≠ b ≠ c; α = γ = 90°, β ≠ 90°
Atom positions: X,Y,Z
Are defined relative to the unit cell axes
Refer to points along each axis, treating the origin as ‘0’ and the end of an axis as ‘1’
Coord   Axis   Min   Max
x       a      0     1
y       b      0     1
z       c      0     1
Example:
Atom at (0.33, 0.5, 0.9)
The coordinates are called fractional coordinates, because they are given in fractions of a cell
edge.
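To make the idea of fractional coordinates concrete, the sketch below converts the example atom at (0.33, 0.5, 0.9) into distances in Å. It assumes an orthorhombic cell (all angles 90°), where each coordinate is simply multiplied by its cell edge; a non-orthogonal cell would need a full transformation matrix instead. The cell lengths and function names are hypothetical.

```python
def wrap_fractional(frac):
    """Map fractional coordinates back into the 0-1 range of a single unit cell."""
    return tuple(v % 1.0 for v in frac)

def frac_to_cartesian_orthorhombic(frac, cell_lengths):
    """Convert fractional (x, y, z) to Cartesian coordinates in Å.

    Only valid when all cell angles are 90 degrees (orthorhombic assumption).
    """
    return tuple(f * length for f, length in zip(frac, cell_lengths))

atom = (0.33, 0.5, 0.9)   # fractional coordinates from the example above
cell = (10.0, 12.0, 8.0)  # hypothetical a, b, c in Å
position = frac_to_cartesian_orthorhombic(atom, cell)  # roughly (3.3, 6.0, 7.2) Å
```

Because the crystal repeats, an atom at x = 1.25 is equivalent to one at x = 0.25 in the next cell along, which is what `wrap_fractional` expresses.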
Symmetry:
• A typical cell contains more than one molecule
• These molecules are (in general) not independent, but are related by symmetry elements
• So we do not need to know the positions of every molecule in the unit cell
• Rather, we need to find just one molecule, as long as we know the symmetry relationships
between the molecules
Symmetry example – 1
Symmetry example – 2
The mirror plane
Indomethacin revisited:
• Z = total number of molecules in the unit cell
• Z’ = the number of unique (symmetry-independent) molecules in the unit cell
• So here, Z = 2 and Z’ = 1, because there is a centre of symmetry
• Generally (but not always) Z’ = 1
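The relationship between Z and Z’ can be written as a one-line calculation: Z’ is Z divided by the number of symmetry-equivalent general positions in the space group. The sketch below is a minimal illustration with a small hand-written lookup for three common space groups. The dictionary and function names are assumptions for this example, and it presumes the molecules sit on general positions.

```python
from fractions import Fraction

# Number of symmetry-equivalent general positions per unit cell
# for a few common space groups (small illustrative lookup, not a full table).
GENERAL_POSITIONS = {"P-1": 2, "P21/c": 4, "P212121": 4}

def z_prime(z, space_group):
    """Return Z' = Z / (number of general positions) as an exact fraction."""
    return Fraction(z, GENERAL_POSITIONS[space_group])

# The case above: Z = 2 in a cell with a centre of symmetry (P-1) gives Z' = 1
print(z_prime(2, "P-1"))
```

A `Fraction` is used because Z’ is not always a whole number: a molecule lying on a symmetry element can give Z’ = 1/2.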
The space group:
The space group is a symbol used to summarise all symmetry relationships in the unit cell
e.g. P2₁2₁2₁, P-1
Seeing inside crystals: X-rays
• Electromagnetic radiation with wavelengths of 0.5 Å – 2.5 Å
• 1 Å = 0.1 nm = 1 x 10^-10 m
• Highly penetrating, but interact
• Most commonly used X-ray wavelength is 1.5406 Å
X-rays and matter:
• X-rays scatter off electrons
• Atoms with lots of electrons scatter strongly (e.g. Au) and so can be ‘seen’ easily
• Atoms with few electrons scatter little (e.g. H) and so are much more difficult to ‘see’
A simple X-ray experiment:
Shine some X-rays at a crystal; the crystal diffracts the X-rays
How can we explain diffraction?
• X-ray diffraction is most easily visualised as reflection from a plane in a unit cell
• The planes are drawn within a unit cell but are imaginary
Bragg’s law:
A reflection is only seen when Bragg’s law is satisfied i.e. when
λ = 2 d sin(θ)
λ is the wavelength of the radiation (in Å)
d is the interplanar spacing (in Å)
θ is half the observed diffraction angle (in °)
Use of Bragg’s law:
So if we see a reflection and can measure its angle
We can work out cell dimensions
Bragg’s law example (λ = 0.707 Å)
• Imagine a reflection seen at 2θ = 11.6°
• So θ = 5.8°
• λ = 2 d sin(θ); 0.707 = 2 d sin(5.8°)
• So d = 0.707 / [2 sin(5.8°)]
• d-spacing = 3.5 Å
• The reflection is coming from a set of planes with characteristic d-spacing of 3.5 Å
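The worked example above can be reproduced in a few lines of Python. This is a minimal sketch of Bragg’s law rearranged for d. The function name is an assumption for illustration; the numbers are those of the example.

```python
import math

def d_spacing(two_theta_deg, wavelength):
    """Return the d-spacing (Å) for a reflection observed at 2-theta (degrees).

    Bragg's law uses theta, so the observed angle is halved first:
    d = wavelength / (2 * sin(theta))
    """
    theta = math.radians(two_theta_deg / 2.0)
    return wavelength / (2.0 * math.sin(theta))

# Reflection at 2-theta = 11.6 degrees with a wavelength of 0.707 Å
d = d_spacing(11.6, 0.707)
print(f"d-spacing = {d:.1f} Å")
```

Note that `math.sin` works in radians, so the angle must be converted from degrees before use. Forgetting this is a common source of wrong d-spacings.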
Important link established:
• You should now be able to imagine that if you measure lots of X-ray reflections, you can
work out lots of d-spacings
• From these, mathematically, one can work out the size and the shape of the unit cell
Reflection intensity:
• Strength of scattered X-rays will reflect the distribution of atoms in the cell and their types
• For example, if a plane has lots of atoms with lots of electrons, it is likely to scatter strongly
Reflecting planes in a cell:
• There are 1000s of possible reflecting planes in a cell
• We need a numbering system to identify them, called “Miller indices”: h, k, l
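The link between Miller indices, d-spacings and cell dimensions can be illustrated for the simplest case. For an orthorhombic cell (all angles 90°) the standard relation is 1/d² = h²/a² + k²/b² + l²/c². The sketch below assumes that cell type only, and the function name and cell lengths are hypothetical.

```python
import math

def d_hkl_orthorhombic(h, k, l, a, b, c):
    """d-spacing (Å) of the (h k l) planes in an orthorhombic cell.

    Uses 1/d^2 = h^2/a^2 + k^2/b^2 + l^2/c^2, valid only when all angles are 90 degrees.
    """
    if (h, k, l) == (0, 0, 0):
        raise ValueError("(0 0 0) does not describe a set of lattice planes")
    inv_d_sq = (h / a) ** 2 + (k / b) ** 2 + (l / c) ** 2
    return 1.0 / math.sqrt(inv_d_sq)

# Hypothetical cell with a = 3 Å, b = 4 Å, c = 5 Å
d_100 = d_hkl_orthorhombic(1, 0, 0, 3.0, 4.0, 5.0)  # planes cutting only the a axis
d_110 = d_hkl_orthorhombic(1, 1, 0, 3.0, 4.0, 5.0)  # planes cutting both a and b
```

The (1 0 0) d-spacing equals the cell edge a itself, which is why measuring enough indexed reflections allows the cell lengths to be recovered.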
What can we link up now?
• Crystalline structure to unit cell building blocks
• Unit cells to individual molecules by symmetry
• X-ray reflection angles to individual lattice planes
• Strength of X-ray reflections to atomic positions and types
• Bottom line:
Methods and applications: single crystal diffraction:
• Diffractometer
• Data
Single crystal X-ray diffraction (SC-XRD):
Characteristics
• A single crystal
• Typical size 0.1 x 0.1 x 0.1 mm
• Data collection typically ~ 4 hours
• Thousands of accurate reflection intensities
• Analysis time typically minutes
Main applications
• Full crystal structure determination – accurate atom positions and therefore accurate
molecular geometry
• Best method to use if you can grow a crystal
Single crystal example – fluorobenzoic acid
Application: chirality
• Many drugs exist as racemic mixtures i.e. mixtures of left- and right-handed enantiomers
• Increasingly, we require drugs to be formulated as only one enantiomer
• Suppose we crystallise a drug and find that it crystallises in a space group with a centre of
symmetry or a mirror plane
• Then it must be racemic, as the mirror or centre relates the left-handed molecules to the
right-handed ones
• It cannot be a single enantiomer
Methods and applications: powder X-ray diffraction (PXRD)
• Diffractometer
• Data
Used when?
• Powders are collections of millions of tiny (too small to be seen easily) single crystals
• Generally, when you can’t grow a single crystal of the API of interest
• When dealing with bulk powders in the pharmaceutical industry
IMPORTANT: anatomy of a powder pattern
• By convention, PXRD patterns are shown with an x-axis of TWO THETA (2θ)
• Remember that Bragg’s law uses θ, not 2θ
• For example, a reflection appears at 2θ = 28° in a PXRD pattern. To calculate its d-spacing, in
Bragg’s law you would use a value of 28° / 2 = 14°
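The same convention applies in reverse when predicting where a peak should appear on a PXRD pattern: Bragg’s law gives θ, which must then be doubled to obtain the 2θ value plotted on the x-axis. The sketch below is an illustration only. The function name is an assumption, and the wavelength is passed in explicitly because the example above does not state which radiation was used.

```python
import math

def two_theta_from_d(d, wavelength):
    """Return the 2-theta position (degrees) of a reflection with d-spacing d (Å).

    From Bragg's law, sin(theta) = wavelength / (2 d); the result is doubled
    because PXRD patterns are plotted against 2-theta, not theta.
    """
    ratio = wavelength / (2.0 * d)
    if ratio > 1.0:
        raise ValueError("no reflection possible: wavelength exceeds 2d")
    theta = math.degrees(math.asin(ratio))
    return 2.0 * theta

# Planes with d = 3.5 Å measured with a wavelength of 0.707 Å
peak = two_theta_from_d(3.5, 0.707)
print(f"expected peak near 2-theta = {peak:.1f} degrees")
```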
Geometry:
• The easiest PXRD is flat plate style; fast
PXRD application 1: polymorph identification
• A polymorph is a particular crystalline form of a compound. For any given polymorph, the
PXRD pattern is a ‘fingerprint’
• Overlaid: form 1 in red, form 2 in black
• For any given phase, the PXRD pattern is a fingerprint
Reminder: why are they different?
• Peak positions are different, because the unit cells are different
• Peak intensities are different, because the atomic positions are different
PXRD application 2: phase transformation
• Polymorphs frequently interconvert
o e.g. over time, to a more stable form
o e.g. with temperature
• Change apparent in patterns (peak positions different, peak heights different), because the
crystal structure (unit cell and atom positions) has changed
PXRD application 3: Amorphous
• Why 3d structure is important ,
• basic crystallography concepts e.g. unit cell, atom positions, Z’ (refer back to my part 2
lectures)
• Talk about the database and what it allows you to do.
The database allows us to access records with clear refining factors and parameters. It allows easy,
user-friendly access globally. I will be basing this thesis on the database and analysing
parameters over time. This has been made possible by the database’s straightforward functionality…
• How have things changed over the evolution of the cad
From its humble beginnings, CAD has developed ever more complex capabilities
over the years. No longer are designers constrained to working with 2D drafts, or
even 3D wireframes—now, product designers create 3D solid models, whose
virtual properties can be defined to match those of the intended finished
object. Used by engineers, architects, and construction managers, CAD
has replaced manual drafting. It helps users create designs in either 2D or 3D so
that they can visualize the construction. CAD enables the development, modification,
and optimization of the design process.
• importance of the CSD as a source of crystallographic and molecular information
• limitations of CSD due to limited “metadata” i.e. how much data there is associated
with the crystal data
• describe how the data were collected, etc.
• factors affecting how the CSD has changed over the years e.g. improved
instrumentation
• your aims and objectives
Amorphous solids are homogeneous and isotropic because there is no long
range order or periodicity in their internal atomic arrangement. By
contrast, the crystalline state is characterised by a regular arrangement of
atoms over large distances. Crystals are therefore anisotropic – their
properties vary with direction. — https://www.phasetrans.msm.cam.ac.uk/2001/intro.cryst.pdf
Single crystal X-ray crystallography can be applied to the entire spectrum of molecular size.
If performed correctly, the result is an unambiguous, three-dimensional image of all the atoms
located within a molecule. This applies to small chemical structures all the way through to
biological macromolecules.
The Cambridge Structural Database (CSD; Groom et al., 2016) is a carefully curated collection
of more than 800 000 structures of organic and metal–organic compounds provided by the
Cambridge Crystallographic Data Centre (CCDC). The CSD has proven to be an invaluable resource
for chemistry since its creation in 1965, and is heavily used in pharmaceutical research and
development as well as in academic research. It is to the physical sciences what the Protein Data
Bank (PDB; Berman et al., 2003) is to the life sciences, but the CSD is well used to further
protein crystallographic methods. Indeed, the paper by Engh and Huber describing
parametrizations for macromolecular refinement (Engh & Huber, 1991) opens with the
sentence
“Bond-length and bond-angle parameters are derived from a statistical survey of X-ray structures
of small compounds from the Cambridge Structural Database.”
Historically, the elucidation of protein structures was driven by interest in discovering biological
mechanisms; consequently, the focus of crystallographic methods was often on the protein
component of a crystal structure, and less effort was expended on any associated ligands, as these
were often peripheral to the critical information that a protein structure could provide. In recent
years, there has been significant growth of interest in structures in the PDB directed at structure-based drug design, where protein–ligand crystal structures can be used to provide guidance to
chemists in optimizing binding of small molecules to drug targets. This has been accompanied by
the development of superb software that makes macromolecular crystallography accessible to
those without a strong chemical background. Consequently, there has been a drive to improve
methods for handling small molecules bound within macromolecular structures.
Chemical crystallography and macromolecular crystallography can be of mutual interest and
benefit. Small-molecule crystallography can be useful to provide plausible hypotheses
for molecular conformation. Such mutual relationships have always existed between the
crystallographic communities and carry on today. An early example (Watson et al., 1993) of
such synergy is the case of conformations of glucose analogues bound to glycogen phosphorylase.
These analogues were initially modelled in classical chair conformations when bound to protein
structures, until a small-molecule structure showed that in certain cases glucose analogues could
occupy a skew-boat conformation. In a more recent example (Tatum et al., 2013), the use of
small-molecule crystallography provided accurate starting models for input to docking studies of
binding to the mycobacterial mono-oxygenase EthA. The small-molecule structures generated also
aided in interpretation of the likely conformational changes undergone by the ligands on binding.
Such uses of small-molecule crystallography are invaluable and frequent but are often
underappreciated.
The need for better handling of ligand structures has been highlighted historically (Liebeschuetz et
al., 2012) and consequently efforts have been made to improve the tools and practices
adopted in this area (Adams et al., 2016). The CSD has been used as part of validation
protocols (Read et al., 2011) and in the generation of dictionaries of restraints (Moriarty et al.,
2016; Vagin et al., 2004); work has also been carried out to integrate Mogul (Bruno et al.,
2004) into PHENIX (Adams et al., 2010) and Coot (Emsley et al., 2010). CSD data are
now used routinely in the generation of the wwPDB chemical component dictionary through the
CRESTANO project (wwPDB News, 2015). Distance restraints for small-molecule and protein
structures derived from CSD data are also used in SHELXL (Sheldrick, 2015).
https://journals.iucr.org/d/issues/2017/03/00/ba5250/index.html
40/50 references and material
For the discussion, compare the CSD with other databases: the Protein Data Bank (PDB), which
records protein structures (a target protein is crystallised and the crystal is blasted with X-rays), and the
Inorganic Crystal Structure Database (ICSD), which covers e.g. minerals, salts, rocks, etc.
Objective to look at parameters and see how they’ve evolved with time.
1. Solid state, basic crystallography concepts e.g. unit cell, atom position, Z’
What is X-ray diffraction:
• The definitive structural probe for the solid state
• Involves shining X-rays onto crystalline materials and observing scattered radiation
• Analysis of the scattered radiation yields much information, including:
• Where atoms are (you can ‘see’ them)
• How they are connected to form molecules
• How they pack to form crystals
• To what extent they are vibrating
Molecular crystals:
Single crystal: one crystal; size typically 0.1 x 0.1 x 0.1 mm
Powder: millions of crystals; size typically a few µm
An analogy:
• Lego “building” = crystal
• One brick = one cell
• One dot = one molecule
• There is long-range order
Remember:
• The unit cell is purely imaginary
• We use it because it is easier to deal with one basic block that repeats than to deal with
every atom
• The only real thing is the full array of atoms
Describing a crystal:
• Unit cell type and dimensions
• Atomic coordinates
• Space group symmetry
Most common unit cell types:
• Orthorhombic – like a shoebox
a ≠ b ≠ c; α = β = γ = 90°
• Monoclinic – like a shoebox where you’ve leaned on one edge
a ≠ b ≠ c; α = γ = 90°, β ≠ 90°
Atom positions: X,Y,Z
Are defined relative to the unit cell axes
Refer to points along each axis, treating the origin as ‘0’ and the end of an axis as ‘1’
Coord   Axis   Min   Max
x       a      0     1
y       b      0     1
z       c      0     1
• A typical cell contains more than one molecule
• These molecules are (in general) not independent, but are related by symmetry elements
• So we do not need to know the positions of every molecule in the unit cell
• Rather, we need to find just one molecule, as long as we know the symmetry relationships
between the molecules
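The ideas above (fractional coordinates running from 0 to 1 along each axis, and symmetry-related molecules) can be sketched in a few lines of Python. This is only an illustrative sketch: the function names are made up for this report, the cell is assumed to have α = γ = 90° (so it covers orthorhombic cells and monoclinic cells with b as the unique axis), and the only symmetry operation shown is an inversion centre at the origin.

```python
import math


def frac_to_cart(frac, a, b, c, beta_deg=90.0):
    """Convert fractional (x, y, z) to Cartesian coordinates (same units as a, b, c).

    Assumes alpha = gamma = 90 degrees: orthorhombic when beta = 90,
    monoclinic (unique axis b) when beta != 90.
    """
    x, y, z = frac
    beta = math.radians(beta_deg)
    # a lies along X, b along Y, and c lies in the XZ plane at angle beta to a
    return (a * x + c * z * math.cos(beta), b * y, c * z * math.sin(beta))


def cell_volume(a, b, c, beta_deg=90.0):
    """Unit cell volume for a cell with alpha = gamma = 90 degrees."""
    return a * b * c * math.sin(math.radians(beta_deg))


def apply_inversion(frac):
    """Apply the symmetry operation (-x, -y, -z) and wrap back into the cell."""
    return tuple((-v) % 1.0 for v in frac)


# Illustrative values only (not taken from any real structure)
atom = (0.1, 0.2, 0.3)
print(frac_to_cart(atom, 10.0, 20.0, 30.0))          # orthorhombic cell
print(frac_to_cart(atom, 10.0, 20.0, 30.0, 105.0))   # monoclinic cell
print(apply_inversion(atom))                          # symmetry-related position
print(cell_volume(10.0, 20.0, 30.0, 105.0))
```

Only one atom position is stored here; the second position is generated from it by the symmetry operation, which is the point made in the bullets above.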
Powder X-ray diffraction (PXRD)
• Diffractometer
• Data
Used when?
• Powders are collections of millions of tiny (too small to be seen easily) single crystals
• Generally, when you can’t grow a single crystal of the API of interest
• When dealing with bulk powders in the pharmaceutical industry
Source: X-ray diffraction lecture
Crystal form
• APIs in the solid form can be either:
   • Crystalline, or
   • Amorphous (no regular molecular lattice arrangement)
• Many crystalline APIs exist in more than one crystal form with different crystal packing
arrangements. These different forms are known as polymorphs
Crystal forms – polymorphs
• Polymorphs have different physicochemical and mechanical properties:
• Solubilities, stabilities, dissolution rates, melting points etc.
• As these are key factors in formulation, it is VITAL that the polymorphic form of the API
being formulated is known and is consistent
• i.e. different batches of the API do not consist of different polymorphs
• If they did, formulations would not behave as expected, with important consequences for
the patient
Get polymorphism wrong…
• Example: ritonavir, used for HIV: Abbott lost $250M in sales by developing a formulation
based around a metastable polymorph and then having to re-formulate when the insoluble
stable polymorph was found to be forming upon storage. The insoluble form was not
absorbed and so patients were no longer ‘getting the drug’, despite taking it!
• What to do?
• Try to formulate around the most stable polymorph if possible
• Strategies exist for discovering the most stable polymorph; these are generally known as
‘polymorph screens’ and involve lots of re-crystallisations under different conditions (e.g.
different solvents) to find polymorphs
Source: Powders lectures
2. The importance of crystal structures as a source of information on accurate
molecular structure
CSD data are widely used in establishing standard molecular dimensions,
determining conformational preferences, and in the study of intermolecular
interactions, all of which are crucial in structural chemistry, rational drug design,
pharmaceutical materials design, and drug delivery. More recently, information
derived from the CSD has been used to construct two dynamic libraries of structural
knowledge: Mogul, which stores intramolecular information, and IsoStar, which
stores information about intermolecular interactions. These electronic libraries
provide click-of-a-button access to structural information and, in turn, serve as
sources of knowledge for applications software that address specific problems in
structural chemistry, rational drug design, and crystallography.
Source: https://www.sciencedirect.com/topics/chemistry/crystal-structure-data
3. The importance of 3D structures in drug development
The 3-dimensional (3D) structure of therapeutics and other bioactive molecules is an
important factor in determining the strength and selectivity of their protein–
ligand interactions. Previous efforts have considered the strain introduced and
tolerated through conformational changes induced upon protein binding.
Source: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7294711/
4. What is the CSD and what does it allow you to do
The Cambridge Structural Database (CSD) is a system that scientists around the
world use to access information on organic and metal-organic small-molecule crystal
structures. The database was established in 1965 and by the early 1980s it was
distributed to more than 30 countries. Now it is distributed to academics in 70
countries worldwide. There are many ways to inspect the growth of the database
and determine its significance. The growth in the number of compounds as a function of time
is striking and could have tremendous impacts on the world of science and
technology. The database allows for research in structural chemistry, material
science, life sciences and drug discovery/development.
5. The importance of the CSD as a source of crystallography and molecular
information
Crystal structure data are of fundamental importance in a wide spectrum of scientific
activities.
Source: https://www.sciencedirect.com/topics/chemistry/crystal-structure-data
The Cambridge Structural Database (CSD) contains a complete record of all
published organic and metal–organic small-molecule crystal structures. The
database has been in operation for over 50 years and continues to be the primary
means of sharing structural chemistry data and knowledge across disciplines. As
well as structures that are made public to support scientific articles, it includes many
structures published directly as CSD Communications. All structures are processed
both computationally and by expert structural chemistry editors prior to entering the
database. A key component of this processing is the reliable association of the
chemical identity of the structure studied with the experimental data. This important
step helps ensure that data is widely discoverable and readily reusable. Content is
further enriched through selective inclusion of additional experimental data. Entries
are available to anyone through free CSD community web services. Linking services
developed and maintained by the CCDC, combined with the use of standard
identifiers, facilitate discovery from other resources. Data can also be accessed
through CCDC and third party software applications and through an application
programming interface.
Source: https://www.researchgate.net/publication/299572828_The_Cambridge_Structural_Database
Organic crystal structures include:
• Drugs and pharmaceuticals
• Agrochemicals
• Pigments
• Explosives
• Protein ligands
Metal-Organic crystal structures include:
• Metal Organic Frameworks (MOFs)
• Models for new catalysts
• Porous frameworks for gas storage
• Fundamental chemical bonding
Source: https://www.ccdc.cam.ac.uk/solutions/csd-core/components/csd/
6. The limitations of the CSD due to limited metadata i.e. how much data there is
associated with the crystal data
– Chemical content of the CSD
– Problems with crystal structures
– Accuracy, presentation and extent of CSD data
– The relevance of the CSD data
– Problems endemic to data analysis
Source: Article sent by Dr Shankland
7. Describe how the data were collected, etc.
Source: https://www.ccdc.cam.ac.uk/Community/depositastructure/scientific-datapreservation/
8. Factors affecting how the CSD has changed over the years e.g. improved
instrumentation
Women in science – Article: https://scripts.iucr.org/cgi-bin/paper?S2053273318092860
9. Your aims and objectives
The aim of this project is to observe the factors that have changed within the
database and analyse their change over time. These parameters include: the
number of structures determined by single-crystal diffraction, the number of
structures determined by powder diffraction, the number of co-crystals, the number
of neutron structures, the number of electron structures, the average number of
atoms in structures, the changing distribution of space groups, the number of
hydrates and the rate of change of structure deposition. These changes will then be
analysed to establish their significance and to weigh up the advantages and disadvantages
of such rapid growth in the CSD.
The CSD has rapidly changed since its creation and now has more than a million
crystal structures. The goal of this project is to examine the evolution of several
important indicators (such as structure size, unit cell volume, and space group
distribution) over time. To better spot trends, searches will be categorised by several
criteria, including radiation source. Conclusions will be drawn from the search results
once they have been plotted or tabulated.
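As a rough illustration of how search results might be categorised by radiation source and tabulated, the sketch below computes the share of each space group within each radiation category. The records, field names and function name are hypothetical and stand in for whatever the CSD searches actually return; none of the values are real results.

```python
from collections import Counter, defaultdict

# Hypothetical example records; real values would come from CSD searches.
records = [
    {"year": 1985, "radiation": "X-ray", "space_group": "P21/c"},
    {"year": 1992, "radiation": "X-ray", "space_group": "P-1"},
    {"year": 2004, "radiation": "X-ray", "space_group": "P21/c"},
    {"year": 2010, "radiation": "neutron", "space_group": "P21/c"},
]


def space_group_shares(recs):
    """Return {radiation: {space_group: fraction of that radiation's entries}}."""
    counts = defaultdict(Counter)
    for rec in recs:
        counts[rec["radiation"]][rec["space_group"]] += 1
    shares = {}
    for radiation, counter in counts.items():
        total = sum(counter.values())
        shares[radiation] = {sg: n / total for sg, n in counter.items()}
    return shares


# Print a simple table of space-group shares per radiation source
for radiation, groups in space_group_shares(records).items():
    for space_group, share in sorted(groups.items()):
        print(f"{radiation:8s} {space_group:8s} {share:.2f}")
```

Working with fractions rather than raw counts makes categories of very different size (for example X-ray and neutron structures) easier to compare.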
Objectives:
– Carry out a literature survey of reliable sources, using PubMed and Web of Science,
to establish a basis for the research.
– Learn how to use the CSD and understand its importance in research
– Determine how the CSD has evolved by analysing the
growth of the database over time.
– Determine a reliable way of identifying and extracting the parameters of
interest
– Collate and tabulate results found and discuss any trends found in relation to
crystallography
– Summarise results, form a conclusion and submit a reliable report
We began by accessing the database to extract data and analyse how it has evolved
at 5-year intervals since 1980. We then plotted graphs to show the overall change from 1980 to
2020.
Finally, we analysed the period 2020-2022 separately. This different timeframe allowed the evolution
of the database during the pandemic to be discussed.
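The 5-year binning described above can be sketched as follows. This is a minimal illustration only: it assumes each structure is represented just by a year, the function names are invented for this report, and the years in the usage example are made up rather than taken from the CSD. Years outside 1980 to 2019 (such as the 2020-2022 period, which is treated separately) are ignored.

```python
from collections import Counter


def five_year_bin(year, start=1980, width=5):
    """Return the first year of the 5-year interval that contains `year`."""
    return start + ((year - start) // width) * width


def counts_per_bin(years, start=1980, end=2020, width=5):
    """Count entries per interval for start <= year < end, keeping empty bins."""
    counts = Counter(
        five_year_bin(y, start, width) for y in years if start <= y < end
    )
    return {b: counts.get(b, 0) for b in range(start, end, width)}


def percent_change(counts):
    """Percentage change between consecutive bins (None if the earlier bin is empty)."""
    bins = sorted(counts)
    changes = {}
    for earlier, later in zip(bins, bins[1:]):
        if counts[earlier] == 0:
            changes[later] = None
        else:
            changes[later] = 100.0 * (counts[later] - counts[earlier]) / counts[earlier]
    return changes


# Made-up years, for illustration only
example_years = [1980, 1984, 1985, 1999, 2019, 2020, 1979]
table = counts_per_bin(example_years)
for start_year, n in table.items():
    print(f"{start_year}-{start_year + 4}: {n}")
print(percent_change(table))
```

Keeping empty bins in the table means every interval appears on the plotted axis, even when a particular search returns no hits for that period.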
Future Challenges:
The advent of big data
The CSD was once one of the few, now it is one of the many; there are scientific databases all over the
place. Furthermore, collecting data used to be an unfashionable and boring activity but has now become
exciting and à la mode. Online information can be gathered together by spiders and crowdsourcing and
web indexed; an obvious example is ChemSpider.⁵⁵² Add in artificial intelligence and cognitive
augmentation, with neural nets combined together into deep-learning algorithms, and big data is upon us.
There seems no limit to what it can do. If it is any comfort, the human brain has 10¹¹ neurons and 10¹⁴
synapses, so we might still have a role to play.
Having become a small fish in a big pond (well, “medium-sized fish” is probably a more apposite
metaphor), the first requirement for the CSD of the future is that it should be accessible to client
applications as well as to humans. Its value will be fully realized only if it can be linked to other data
compilations. CCDC has taken the critical step that will enable this goal to be achieved by releasing the
CSD Python API (section 4.1.4). It is likely to be directly used (as opposed to indirectly) by only a minority
of the CSD user community but nonetheless is of primary strategic importance.
The continued evolution of crystallography
We said in the Introduction that a small-molecule crystal structure can now be determined in a few hours.
This is by using standard equipment. Advanced instrumentation with new hybrid pixel detectors enables
data to be collected in a few minutes. Furthermore, a new software feature (Rigaku’s “What
is this?”⁵⁵³) will try to solve the structure as the data collection proceeds, enabling a basic atomic
coordinate set to be obtained in less than 2 min. Add to this the increasing number of 3D structures
solved by powder-diffraction and the continued development of techniques like NMR crystallography,
micro-electron diffraction, and cryo-electron microscopy, and the mind boggles at how many structures
appropriate for inclusion in the CSD will be produced annually in the years to come. Also, crystal
structure prediction may become sufficiently reliable that at least some of its results will be deemed
suitable for inclusion in the CSD (albeit, flagged as “theoretical”).⁵⁵⁴⁻⁵⁵⁶ We may assume that the
exponential rise in crystallographic output must eventually flatten off, but there is no reason to expect it
any time soon.
It is nevertheless important that the current focus on maintaining CSD data quality is maintained. It might
be thought that, with so many structures, a pool of “broken structures” (incorrect chemistry assignments,
etc.) can be tolerated. The problem is that there will still be many searches that produce few hits; multiply
the number of structures in the CSD a hundred-fold and they will still represent a small fraction of
chemical space. Further, using the CSD in big-data analyses, possibly with sophisticated machine-learning
techniques, is likely to bury the effects of database errors so deeply that they will become impossible to
detect.
The continued evolution of chemistry
proportion of the CSD input comprises very complex molecules, frequently large, often polymeric, and
sometimes with unusual bonding or exotic topologies (e.g., Figure 33).
Representing these chemistries in an accurate, searchable form is already challenging and will only
become more so. There is perhaps a perception that the molecules in the CSD are simple, small, and much
easier for database builders to deal with than biological macromolecules. The ingenuity of synthetic
chemists has changed that.
The solutions to the problems outlined in this and the previous subsections must primarily lie in
improvements to the software infrastructure around the CSD, for both building and searching the
database. This will be a difficult undertaking. Nevertheless, it is appropriate for us to introduce a positive
note. We have talked about the challenges of the future but must remember that meeting them will
greatly increase the value of the CSD. Einstein, as usual, hit the nail on the head: in the middle of difficulty
lies opportunity.
i University lecture
ii https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6072476/
iii https://pubmed.ncbi.nlm.nih.gov/24419610/
This is an open access article published under a Creative Commons Non-Commercial No
Derivative Works (CC-BY-NC-ND) Attribution License, which permits copying and
redistribution of the article, and creation of adaptations, all for non-commercial purposes.
Review
Cite This: Chem. Rev. 2019, 119, 9427−9477
pubs.acs.org/CR
A Million Crystal Structures: The Whole Is Greater than the Sum of
Its Parts
Robin Taylor and Peter A. Wood*
Cambridge Crystallographic Data Centre, 12 Union Road, Cambridge CB2 1EZ, United Kingdom
ABSTRACT: The founding in 1965 of what is now called the Cambridge Structural
Database (CSD) has reaped dividends in numerous and diverse areas of chemical
research. Each of the million or so crystal structures in the database was solved for its
own particular reason, but collected together, the structures can be reused to address a
multitude of new problems. In this Review, which is focused mainly on the last 10 years,
we chronicle the contribution of the CSD to research into molecular geometries,
molecular interactions, and molecular assemblies and demonstrate its value in the
design of biologically active molecules and the solid forms in which they are delivered.
Its potential in other commercially relevant areas is described, including gas storage and delivery, thin films, and
(opto)electronics. The CSD also aids the solution of new crystal structures. Because no scientific instrument is without
shortcomings, the limitations of CSD research are assessed. We emphasize the importance of maintaining database quality:
notwithstanding the arrival of big data and machine learning, it remains perilous to ignore the principle of garbage in, garbage
out. Finally, we explain why the CSD must evolve with the world around it to ensure it remains fit for purpose in the years
ahead.
CONTENTS
1. Introduction
2. Fundamental Science
2.1. Molecular Geometry and Structure
2.1.1. Atomic Radii
2.1.2. Conformational Analysis
2.1.3. Standard Geometries and Geometry
Libraries
2.1.4. Crystal Packing Effects on Molecular
Conformations
2.1.5. Metal Coordination
2.1.6. Crystallization Propensity
2.1.7. Tautomerism and Proton Transfer
2.2. Intermolecular Interactions
2.2.1. Hydrogen Bonds
2.2.2. σ-Hole Interactions
2.2.3. Dipole−Dipole and Orthogonal Multipolar (π-Hole) Interactions
2.2.4. Aromatic Interactions
2.2.5. All That Glisters Is Not Gold
2.3. The Systematics of Crystalline Assemblies
2.3.1. Motifs and Synthons
2.3.2. Crystal-Structure Architectures
2.3.3. Symmetry and Chirality
2.3.4. Z′
2.3.5. Polymorphism
2.3.6. Cocrystals
2.3.7. Hydrates and Solvates
3. Design of Biologically-Active Molecules
3.1. Molecular Shapes
3.2. Molecular Recognition
© 2019 American Chemical Society
3.3. The CSD as a Diverse Chemical Database
4. Emerging Applications
4.1. Challenges in Drug Development
4.1.1. Interactions and Packing
4.1.2. Risk of Polymorphism
4.1.3. Cocrystal Design
4.1.4. Morphology and Other Physical Properties
4.2. Other Industrially Relevant Applications
4.2.1. Energetic Materials
4.2.2. Paints, Pigments, and Dyes
4.2.3. Organic Semiconductors
4.2.4. Nonlinear Optical Materials
4.2.5. Ferroelectricity
4.2.6. Magnetic Anisotropy and Single-Molecule Magnets
4.2.7. Catalysts
4.2.8. Gas Storage and Separation
4.2.9. Thin Films and Coatings
4.2.10. Solar Thermal Fuels
4.3. Structure Solution
4.3.1. Macromolecular Crystal Structure Determination
4.3.2. Structure Determination from Powder
Diffraction Data
5. Lessons from the Past and Prospects for the
Future
5.1. Limitations of CSD-Based Research
Received: March 7, 2019
Published: June 17, 2019
DOI: 10.1021/acs.chemrev.9b00155
5.1.1. Chemical Content of the CSD
5.1.2. Problems with Crystal Structures
5.1.3. Accuracy, Presentation, and Extent of
CSD Data
5.1.4. The Relevance of CSD Data
5.1.5. Problems Endemic to Data Analysis
5.2. The Keys to Success
5.3. Future Challenges
5.3.1. The Advent of Big Data
5.3.2. The Continued Evolution of Crystallography
5.3.3. The Continued Evolution of Chemistry
6. Concluding Remarks
Associated Content
Supporting Information
Author Information
Corresponding Author
ORCID
Notes
Biographies
Acknowledgments
References
ments”.3 This seems obvious now, as things often do in
retrospect. In fact, there were four very good reasons why, at
the time, it was not obvious at all. First, there were few if any
precedents. Older scientific databases may exist, but we do not
know of any that were intended from the start to become
research tools in their own right. Second, collecting all
published structures would require the cooperation of many
people across the globe. Such a thing is never to be taken for
granted and must have looked particularly daunting at the
height of the Cold War. Third, computers were in short supply
in 1965 and by today’s standards ludicrously slow. Moore’s
Law4 was only published that year and had yet to gain
credibility. The Internet and World Wide Web must have been
almost beyond imagination (we say “almost”some claim that
Mark Twain predicted it in 18985). So it was by no means
evident that computerized databases could become hugely
important in science.
The final reason was that solving crystal structures was not
easy in 1965, so it was uncertain how large and therefore how
useful the CSD could become. The intensities of X-ray
reflections were often estimated by eye from photographs
taken on Weissenberg cameras. Direct methods for solving the
phase problem were still in their infancy. Structure solution
involved models with plastic balls and metal sticks and usually
took months, with complete failures not uncommon. Bernal
and Kennard would have expected the methodology to
improve, but the extent to which it has done so is breathtaking.
With the current generation of equipment, a structural
chemistry research group can now have their own benchtop
X-ray diffractometer capable of determining the crystal
structure of a small molecule in just a few hours.
The first signs that the founders’ vision would be fulfilled
came in the late 1970s and early 1980s, when a few workers
began to demonstrate that interesting results could be obtained
by using the CSD as a research tool; some of these pioneering
papers are mentioned later. Since then, there has been an everincreasing number of scientists using the CSD for an everexpanding variety of applications. It is this research that we
review here. It has not been comprehensively surveyed for
many years,6−8 and a great number of relevant papers have
been published since then. To reduce our task somewhat, we
focus on the last 10 years, though earlier papers are frequently
mentioned to provide context.
We begin with studies aimed at clarifying fundamental
issues: molecular structure and geometry, intermolecular
interactions, and molecular assemblies (the systematics,
symmetries, and topologies of crystal structures). While
much of this is relatively basic research, it is still being actively
pursued by numerous research groups and generates invaluable
foundations for others to build on. Many of the older CSDbased papers on these topics have become citation classics, and
we have no doubt that the same will prove true of numerous
recent publications.
We then move on to the leading industrial application of the
CSD, its use to aid the discovery of biologically active
molecules and, in particular, pharmaceuticals. In this context, it
primarily serves as a guide to conformational preferences and
intermolecular interactions. Increasingly, however, data derived
from the CSD are being used to drive other software
applications, e.g. for scaffold-hopping, conformer generation.
We then cover emerging applications. The most mature of
these is the use of the CSD in drug development, particularly
formulation. Ever since the disastrous “disappearing poly-
9459
9459
9459
9460
9460
9460
9461
9461
9461
9462
9462
9462
9462
9462
9462
9462
9462
9462
9463
9463
1. INTRODUCTION
Long before there were people on the earth, crystals were
already growing in the earth’s crust. On one day or another,
a human being first came across such a sparkling morsel of
regularity lying on the ground or hit one with his stone tool
and it broke off and fell at his feet, and he picked it up and
regarded it in his open hand, and he was amazed.
M. C. Escher
This Review will be published at about the time that the
millionth structure is added to the Cambridge Structural
Database (CSD).1 The CSD is the definitive collection of
published small-molecule organic and metal−organic crystal
structures and was founded in 1965 at the instigation of the
famous physicist J. D. Bernal and his collaborator Olga
Kennard. In the first decade or so of its existence, Kennard’s
small group (the embryonic Cambridge Crystallographic Data
Centre, CCDC) developed basic infrastructure for maintaining
the CSD, including protocols for acquiring, checking, and
storing crystal structure data and detecting duplicates. A lot of
keyboarding of scarcely legible deposited material was
involved. There was a backlog of structures to be processed,
going back to the earliest X-ray determinations of carbon-containing compounds. In addition, the appearance of new
structures had to be monitored so they could be added too.
Each year’s input to the CSD was summarized in book form.
These first few years of the enterprise were therefore busy and
filled with essential work, but they had little impact on the
outside world. (The books were appreciated. A leading
crystallographer reviewed one of them and said it was good
for propping doors open and pressing wild flowers.2 But his
tongue was firmly in his cheek. He happened at the time to be
the boss of one of the authors of this Review, so we can say
from personal observation that he went through each new
book religiously, looking for interesting new structures.)
Bernal and Kennard had shared a vision, as Kennard
explained many years later: “We had a passionate belief that
the collective use of data would lead to the discovery of new
knowledge which transcends the results of individual experi-
DOI: 10.1021/acs.chemrev.9b00155
Chem. Rev. 2019, 119, 9427−9477
Chemical Reviews
Review
morph” of ritonavir,9 pharmaceutical companies have been
greatly interested in the crystal forms their drugs adopt, or
might adopt. Is the known crystal structure of a development
candidate its most stable polymorph? Could cocrystallization
with a pharmaceutically acceptable additive address formulation problems? The use of the CSD to help investigate these
and related issues has grown significantly over the past decade
or so.
In addition, recent papers report the use of the CSD for
research aimed at designing commercially important materials
such as dyes, energetic materials, crystalline porous materials,
and organic semiconductors. Also, and in a nice quid pro quo,
the CSD, built from solved crystal structures, is becoming
increasingly important in aiding the solution of new structures.
On the one hand, there is growing use of CSD data to restrain
or validate ligand geometries in protein−ligand crystal
structure analysis. On the other, crystal structure solution
from powder diffraction data is becoming increasingly viable, is
potentially of enormous value as an analytical technique, and is
greatly assisted by using CSD information.
There are always downsides. Quite recently, one author
summarized CSD research thus: “That purely statistical
evaluations with data bases such as the CSD can be misleading
is obvious”.10 Of course, the same could be said about a great
many other research techniques that are nevertheless
invaluable. It is important, however, to be open about the
deficiencies of the technique we are espousing. Therefore, the
latter part of our Review contains a discussion of the
limitations of CSD-based research in particular and of database
analysis in general. Conversely, we highlight key reasons why
the CSD has been a success. These considerations are not
merely of parochial interest. At a time when scientific “big
data” and machine-learning are increasingly advocated, the
quality of scientific data is more important than ever, and
lessons can be learned from one of the oldest scientific
databases around. We also consider the challenges that must be
surmounted to keep the CSD fit for purpose in the ever-evolving world of chemical research and with the looming
onset of big data.
Perhaps to some, scientific databases appear boring and
mundane. We aim to show that they are, in fact, a new
generation of scientific instruments. The collective use of data
does indeed lead to the discovery of new knowledge which
transcends the results of individual experiments. Or as Aristotle
more or less said, the whole is greater than the sum of its parts.
a new set of vdw radii using a hugely greater number of
structures taken from the CSD.12 A radius was assigned to each
of the naturally occurring elements. For a given element, X, the
value was determined from the crystallographic distribution of
X···Y interatomic distances, where Y was a probe atom, usually
oxygen. The distribution included vdw contacts and random,
noninteraction X,Y pairs, often separated by long distances.
Figure 1. Distribution of Os···O distances. The bonded pairs are in
black, and the intermolecular contacts are in light blue (fitted by the
blue line). The latter are deconvoluted into random pairs, increasing
with distance cubed (dashed line), and vdw contacts (red line). Figure
prepared for us by Professor Santiago Alvarez, author of ref 12, to
whom we are very grateful.
Alvarez isolated the region of the distribution that
corresponded to vdw contacts (illustrated in Figure 1 for the
example pair Os,O) and determined the distance at which this
subdistribution reached half its maximum height. The half-height distance was deemed to be where X and Y were in vdw
contact, i.e. equal to the sum of their vdw radii. This definition
was taken from an earlier study by Rowland and Taylor
(R&T), who chose half-height distance for the pragmatic
reason that it was the point on a vdw distribution that can most
precisely be determined.13 It was not, of course, the definition
used by Bondi, but the agreement between the Bondi, Alvarez,
and R&T radii is surprisingly good.
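The half-height criterion can be illustrated with a short sketch. It is not the procedure of ref 12 or ref 13. It assumes that the vdw subdistribution has already been isolated as a histogram, it interpolates linearly on the short-distance side of the peak, and both the histogram and the probe-atom radius are invented for illustration.

```python
def half_height_distance(centres, counts):
    """Return the distance on the rising (short-distance) edge of a
    histogram at which the counts first reach half the peak height."""
    peak_index = max(range(len(counts)), key=counts.__getitem__)
    half = counts[peak_index] / 2.0
    # Walk from the short-distance end of the histogram up to the peak.
    for i in range(1, peak_index + 1):
        lo, hi = counts[i - 1], counts[i]
        if lo < half <= hi:
            # Linear interpolation between adjacent bin centres.
            frac = (half - lo) / (hi - lo)
            return centres[i - 1] + frac * (centres[i] - centres[i - 1])
    raise ValueError("rising edge does not cross half height")


# Hypothetical histogram of X...Y contact distances (angstroms); invented data.
centres = [2.8, 2.9, 3.0, 3.1, 3.2, 3.3, 3.4]
counts = [0, 2, 10, 30, 50, 44, 40]

d_half = half_height_distance(centres, counts)
r_probe = 1.50  # assumed radius of the probe atom Y, for illustration only

# The half-height distance is taken as the sum of the two vdw radii,
# so the radius of X follows by subtracting the probe radius.
r_x = d_half - r_probe
print(f"half-height distance: {d_half:.3f} A, radius of X: {r_x:.3f} A")
```

Interpolating on the rising edge reflects the pragmatic argument quoted above: the steep short-distance side of the distribution is the part that can be located most precisely.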
While undoubtedly useful, the vdw radius not only has no
universally accepted definition but also is based on
assumptions that do not stand up to close inspection. One
example is the assumption of perfect sphericity. Analysis of
CSD-derived contact distances showed long ago that this is
untrue for many terminal atoms (e.g., Cl, Br, I, S, and Se),
which tend to be smaller along the extension of the covalent
bond.14 The extent of this flattening was redetermined recently
for several elements.15 It is a hot topic because the effect of any
anisotropy of vdw shapes is convoluted with close atomic
approaches due to “σ-hole” interactions (section 2.2.2).
Another invalid assumption is that the radius of element X is
the same in X···Y and X···Z contacts, where Y and Z are
different elements. That this is only an approximation is shown
by the vdw radius of hydrogen, which was determined as 1.20
Å by Bondi and Alvarez but only 1.10 Å by R&T. The reason is
simple: the R&T value was determined from several different
types of H···Y distributions (Y = H, C, F, etc.), whereas the
Bondi and Alvarez values were determined exclusively from
2. FUNDAMENTAL SCIENCE
2.1. Molecular Geometry and Structure
Because crystallography is the definitive method for determining molecular geometry and structure, it is no surprise that
many CSD-based research studies have been focused on these
topics. They include investigations into atomic radii, conformational preferences, metal coordination, crystallization propensity, and tautomeric preferences.
2.1.1. Atomic Radii. It may be simplistic to regard atoms
as having radii, but Bondi’s 1964 publication on van der Waals
(vdw) radii11 has been cited over 15 000 times. His radii were
primarily based on intermolecular contact distances in a
handful of crystal structures and were intended to enable
calculation of molecular volumes. Now they are used for a
multitude of purposes, including analysis of crystal packing and
protein−ligand binding. Almost 50 years later, Alvarez derived
substituted by π-acceptors.29 Conjugation can occur between
the two moieties and, if steric factors permit, the substituent
adopts a cis- or trans-bisected conformation; carbonyl
acceptors prefer the former (i.e., with the oxygen sitting
“over” the ring) and vinyl the latter. The conjugation causes
the ring bond distal to the substituent to shorten and the
vicinal ones to lengthen. In contrast, the reverse happens if the
ring is substituted with a σ-acceptor (e.g., halogen).30
(d) Comparison of Conformations in Different Environments.
Raghavender compared the backbone conformations of amino
acid residues in (a) small peptides bound to proteins in Protein
Data Bank (PDB)31 structures and (b) unbound peptides in
CSD structures.32 Aliphatic residues (Ala, Ile, Leu, and Val)
occurred often enough in both to allow meaningful comparison
and were found to show broadly similar geometric trends, but
with the CSD residues being somewhat more conformationally
variable. One reason may be that many of the CSD peptides
are cyclic, with the ring-closure constraints forcing rotatable
bonds into particular geometries.
It was argued long ago that multivariate statistical and
pattern recognition techniques (e.g., factor analysis, multidimensional scaling) are helpful for performing conformational
analysis with the CSD.33,34 However, they were not widely
adopted. More recently, Parkin et al. resurrected the idea,
illustrating how these techniques can provide conformational
insights in an objective manner.35,36 The approach may finally
become established when the CSD is used in large big-data
projects (section 5.3.1). Interestingly, Parkin et al. used the
Boltzmann distribution to infer conformational energy differences from the relative frequencies of conformers in the
CSD.36 This was shown to be theoretically invalid a long time
ago37 but may sometimes be a practicable approximation.
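The Boltzmann-type estimate mentioned above can be sketched as follows. This is a minimal illustration of the general idea, not the method of ref 36. The conformer counts and the choice of 298.15 K are assumptions, and, as just noted, the result should be read as a rough guide only.

```python
import math

R = 8.314462618e-3  # gas constant in kJ mol^-1 K^-1


def boltzmann_energy_difference(n_a, n_b, temperature=298.15):
    """Apparent energy of conformer A relative to conformer B (kJ/mol),
    inferred from observed counts by assuming a Boltzmann population ratio."""
    if n_a <= 0 or n_b <= 0:
        raise ValueError("counts must be positive")
    return -R * temperature * math.log(n_a / n_b)


# Invented example: conformer A observed 100 times, conformer B 900 times.
delta_e = boltzmann_energy_difference(n_a=100, n_b=900)
print(f"apparent E(A) - E(B) = {delta_e:.2f} kJ/mol")
```

A positive value means that the rarer conformer A is assigned the higher apparent energy.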
2.1.3. Standard Geometries and Geometry Libraries.
Tabulations of CSD-derived average bond-lengths and -angles
were compiled over 25 years ago and are heavily used.38−40
This type of work is still performed. For example, mean
distances of covalent bonds to hydrogen were evaluated in
2010 from CSD neutron-diffraction structures.41 Arnautova et
al. derived standard geometries for hexopyranoses as part of a
project aimed at the simulation of glycan systems.42
Unfortunately, printed tables and ad hoc residue geometries
are often insufficient for modern research, which needs
comprehensive, continually updated, and computer-searchable
geometry libraries. A step in this direction was the development by CCDC of Mogul,43 which can be used for rapid
retrieval of CSD-derived bond-length, bond-angle, and torsion-angle distributions. The ability to retrieve simple (unfused,
unbridged) ring geometries was added later.44 Mogul works by
describing the substructural environment of a molecular
feature (e.g., a rotatable bond) by a set of keys and then
searching a key-indexed library of preprepared distributions. If
the distribution retrieved for a particular feature contains too
few observations, related distributions are retrieved and pooled
together. This latter step works well but can occasionally be
slow.
Distributions retrieved from Mogul are commonly used in
discussions of new crystal structures. However, they have
several other applications, including (a) setting up refinement
restraints for protein−ligand crystal structures (section
4.3.1),45,46 (b) validation of ligand geometries (section
4.3.1),47,48 (c) aiding the solution of three-dimensional (3D)
structures by powder diffraction (section 4.3.2),49,50 (d) drug
discovery (section 3.1),51 (e) crystal structure prediction,52 (f)
H···H (or D···D) contact distances. Covalently bonded
hydrogen atoms usually carry a small net positive charge, so
H···H contacts are likely to be slightly lengthened by
electrostatic repulsion. The opposite will occur in, for example,
H···C contacts. Hence, R&T got a smaller value.
A number of other relevant publications have appeared. Hu
et al. wrote a very helpful comparison of the different sets of
vdw radii that have been published, some determined from
crystallographic data, some from other sources.16 A study of
the distributions of intramolecular nonbonded contact
distances showed that, for most element pairs, the first
percentile is well estimated by the sum of Bondi vdw radii
minus 0.5 Å.17 Hirshfeld analysis of crystal structures
determined at high pressure showed that H···H contacts do
not appear to compress below 1.7 Å.18 At ambient pressure,
about 1.8% of H···H contacts are shorter than 2.0 Å.
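The rule of thumb for intramolecular nonbonded contacts (sum of Bondi radii minus 0.5 Å) is easily applied. The sketch below uses Bondi radii for a few common elements; the function name and the choice of elements are ours and purely illustrative.

```python
# Bondi van der Waals radii (angstroms) for a few common elements.
BONDI_RADII = {
    "H": 1.20, "C": 1.70, "N": 1.55, "O": 1.52,
    "F": 1.47, "S": 1.80, "Cl": 1.75,
}


def first_percentile_estimate(element_a, element_b, offset=0.5):
    """Estimate the first percentile of intramolecular nonbonded contact
    distances as the sum of the Bondi radii minus a fixed offset."""
    return BONDI_RADII[element_a] + BONDI_RADII[element_b] - offset


estimate = first_percentile_estimate("C", "O")
print(f"C...O: about {estimate:.2f} A")
```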
Cordero et al. determined a new set of covalent atomic radii
by analysis of bond lengths in the CSD.19 Their results showed
clear and smooth periodicity, with the largest element in each
period being the alkali metal and the smallest the halogen
and noble gas. Most of the shrinkage occurs from group 1 to
group 13.
2.1.2. Conformational Analysis. Using the CSD for
conformational analysis is an attractive alternative or adjunct to
popular theoretical methods such as density functional theory
(DFT). The CSD has the advantage that it provides
unequivocal evidence of observed conformations in a
condensed phase. It can also confirm theoretically predicted
relationships between conformations and bond lengths and
angles. Here are a few illustrative examples:
(a) Ring Geometries. Khorasani et al. showed that singly
substituted 12-membered cycloalkanes are surprisingly inflexible.20 With only one possible outlier, all of the rings
examined adopted a square-like conformation with D2
symmetry. Pérez et al. investigated the conformations of the
8-membered ring in the [M(μ-OPO)]2 core of complexes in
which transition metals (M) are double bridged by phosphate
and related groups.21 The large number of CSD structures
containing this fragment made it possible to reach detailed
conclusions that we suspect would have been difficult to obtain
in any other way. Claramunt et al. determined the degree of
nonplanarity of the 7-membered ring of 1,5-benzodiazepine
derivatives as a function of the substitution and protonation
pattern.22 Even the mundane benzene ring has attracted
attention recently. The influence of substituents on the ring’s
degree of aromaticity was estimated by the extent to which the
CC bond lengths differ from the value expected for perfect
aromaticity. The highest reduction in aromaticity was found in
meta-diamino and -dinitro benzene derivatives.23
(b) Conformation-Directing Interactions. The role of intramolecular C−H···π interactions in stabilizing gauche alkylaromatic bonds and axial alkylcyclohexanone conformations
was inferred from theoretical and CSD studies.24,25 So too was
the strong influence of intramolecular S···O interactions on the
conformations of the carboxamides of sulfur-containing
heterocycles.26 Of course, the most important conformation-directing interaction is the hydrogen bond (henceforth “H-bond”). Galek et al. found that over 95% of intramolecular H-bonded rings contain 5, 6, or 7 atoms.27 An exception is the
preponderance of 8-membered rings when the intramolecular
H-bond is of the type N−H···O=S.28
(c) Relationships between Conformations and Bond Lengths.
DFT and CSD analyses were performed on cyclopropane rings
assignment of tautomeric forms,53 and (g) optimization of
molecular geometries. The latter is done by converting Mogul
distributions into smooth, differentiable probability density
functions using kernel density estimation (Figure 2).54,55
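The conversion of a torsion-angle distribution into a smooth objective function can be sketched as follows. This is an illustration under stated assumptions, not the implementation of refs 54 and 55: the Gaussian kernel, the 10° bandwidth, the use of the negative log-density as the penalty, and the observations themselves are all chosen by us for the example.

```python
import math


def circular_difference(a, b):
    """Smallest signed difference between two angles, in degrees."""
    return (a - b + 180.0) % 360.0 - 180.0


def torsion_density(angle, observations, bandwidth=10.0):
    """Gaussian kernel density estimate at `angle` (degrees), using the
    circular difference so that -180 and +180 are treated as the same angle."""
    norm = len(observations) * bandwidth * math.sqrt(2.0 * math.pi)
    total = sum(
        math.exp(-0.5 * (circular_difference(angle, obs) / bandwidth) ** 2)
        for obs in observations
    )
    return total / norm


def objective(angle, observations, bandwidth=10.0, floor=1e-12):
    """Penalty that is low where observed torsions are dense.
    The floor avoids taking the logarithm of zero far from any observation."""
    density = torsion_density(angle, observations, bandwidth)
    return -math.log(max(density, floor))


# Invented torsion angles (degrees): mostly trans, with a few gauche.
observations = [-178.0, 175.0, 180.0, -172.0, 62.0, 58.0, -61.0, 170.0]

# Scan integer angles and report the one with the lowest penalty.
best = min(range(-180, 180), key=lambda a: objective(a, observations))
print(f"lowest-penalty torsion angle: {best} degrees")
```

Because the estimated density is smooth and differentiable, a function of this kind can be passed to a gradient-based optimizer, which a raw histogram cannot.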
Thompson and Day estimated the strain energies of
molecules in crystal structures using dispersion-corrected
DFT (DFT-D).63 For each of 36 molecular geometries from
the CSD, they calculated the difference in energy between the
observed geometry and the nearest local minimum, and
between that local minimum and the global minimum. The
results indicated that these differences could be remarkably
high, the largest value of each being over 20 kJ mol−1.
However, when the ostensibly strained, observed geometry of
DADNUR was compared with the global minimum, it was
noted that the former was extended and the latter folded
(Figure 3; here and elsewhere, structures are referred to by the
Figure 2. Mogul distribution for labeled torsion angle, fitted
probability density function (solid line), and derived objective
function for use in geometry optimization (dashed line); y-axis
dimensionless. Adapted from ref 55. Copyright 2016 American
Chemical Society.
Figure 3. Geometries of DADNUR: (a) gas-phase lowest energy and
(b) crystallographically observed. Reprinted from ref 63 under
Creative Commons License (https://creativecommons.org/licenses/
by/3.0/). Published by Royal Society of Chemistry.
An improved version of Mogul was developed for use in
conformer generation.56,57 It is faster and can produce
templates of low-energy geometries of simple, fused, or
bridged-ring systems (i.e., models of the ring systems in
favorable geometries), and each of its torsion distributions
respects any symmetry or chirality in the substructural
environment of the rotatable bond. Other research groups
have also created torsion libraries for conformer generation,
based either on the CSD alone or both the CSD and PDB.
Manually defined substructures were used to generate the
torsion angle distributions in one of the libraries.58,59 In
contrast, the libraries developed by Sadowski and Boström60
and Kothiwale et al.61 were, like Mogul, generated algorithmically. The manual approach makes use of chemical know-how
but necessarily produces smaller and less comprehensive
libraries than those produced automatically. The library of
Kothiwale et al. places a strong emphasis on multidimensional
torsion distributions that relate to fragments containing more
than one rotatable bond. The idea is to take into account
correlations between the torsion angles of adjacent bonds.
2.1.4. Crystal Packing Effects on Molecular Conformations. Some interesting papers were published in the
last 10 years on the effects of crystal packing forces on
molecular conformations. Weng et al. reviewed flexible
molecules that occur in more than one crystal structure
(polymorphs, solvates, or cocrystals).62 As expected, conformational diversity was found to increase with the number of
rotatable bonds in the molecule. Surprisingly, when the
molecules were subdivided by the number of crystal environments in which they were observed (Nenv), the percentage that
adopted only one conformation was about 60%, irrespective of
Nenv. Common conformational changes were trans↔gauche
and 180° flips of planar groups such as −CO2H. Many of the
changes were forced by different H-bonding schemes.
Conformational variability across different polymorphs or
cocrystals was less, on average, than across differently solvated
structures.
CSD reference code; details in the Supporting Information).
This is explicable. The DFT-D calculations pertained to the gas
phase, where the isolated molecule is likely to fold up to
optimize attractive electrostatic and dispersion interactions.
Conversely, extended conformations in crystal structures allow
attractive interactions with neighboring molecules. The authors
concluded that the calculated gas-phase energies were
inadequate on their own; exposed surface area matters too.
In another study, searches of the CSD and ab initio
calculations were performed to find highly strained molecules.64 The calculations used a polarizable continuum model,
which takes some account of solvent effects. Two types of
molecules were found with high strain energies. The first were
molecules such as biphenyl and bispyridinium, which have long
been known to have an undue tendency to be planar in crystal
structures.65 The second were cyclobutane and its derivatives,
which are puckered in the gas phase but sometimes flat in
crystal structures. Strain energies were calculated to be up to
about 8−10 kJ mol−1. It was noted that the strained planar
conformations almost exclusively occurred for molecules sited
on crystallographic inversion centers. The authors’ conclusion
was summarized in the title of their paper: “Systematic
conformational bias in small-molecule crystal structures is rare
and explicable”.
The final paper focused on CSD hydrocarbon molecules
situated on crystallographic special positions (almost always
inversion centers).66 As in the previous study, there was a
noticeable preference for some of these molecules to adopt
strained planar geometries in the crystalline state, their gasphase optimum geometries being very different and typically
twisted (e.g., Figure 4). It therefore appears that molecules on
inversion centers can have abnormally high strain energies.
Convincing examples of high strain for molecules not on
special positions are much harder to find; one interesting
example was reported in 2012 by Back et al.67
2.1.5. Metal Coordination. 2.1.5.1. Ligand Coordination
Modes. Over half of the CSD comprises crystal structures of
metal−organic compounds, so it is the definitive source of
supported the established view that the trans effect is strongest
for strong σ-binding ligands. However, the π-bonding ability is
also a factor. For several ligands, the octahedral complexes
showed slightly stronger trans effects than the square planar.
A project published in the same year examined bond
length−bond strength correlations by comparing the distances
of metal bonds to alkoxide, carboxylate, and azolate with those
of the corresponding bonds to alcohol, carboxylic acid, and
azole.73 As expected, the anionic ligands tend to form shorter
bonds than their neutral analogues, typically by 0.02−0.05 Å.
However, the differences are relatively small, indicating that
neutral and anionic ligands do not form two distinct classes of
metal−ligand bonds. In another study, Holland found that O−
O and N−N distances of metal-coordinated O2 and N2 are
unreliable guides to the oxidation state of the attached metal
because they may be artificially shortened by libration.74
2.1.5.4. Symmetry and Shape. Alvarez et al. have published
an extensive and elegant series of papers on metal coordination
symmetry and shape.75−80 Given a metal complex, they
determine which polyhedron best describes the coordination
geometry and how big the distortions are from this ideal
polyhedron. The latter is quantified by finding the best
superposition between the actual geometry and the ideal
polyhedron and measuring the deviations between observed
and ideal vertices by a parameter termed the continuous
shape measure (CShM).81 The method has been used to
characterize the geometries of, e.g., 9- (Figure 5) and 10-
Figure 4. Observed geometries of two molecules sited on inversion
centers (top) and their calculated ideal geometries (below). Reprinted
from ref 66. Copyright 2012 American Chemical Society.
information about many aspects of metal coordination. Its
most common use is probably to compare and classify ligand
coordination modes, something often done in the course of
discussing new structures. An illustrative example is an analysis
of azide, thiocyanate, and cyanate binding to first-row
transition metal ions.68 It showed that the ligands are usually
terminal but can also be end-on (μ-1,1) or end-to-end (μ-1,3)
bridging; (μ-1,1) is more common for azides and (μ-1,3) for
thiocyanates. In the (μ-1,1) mode, azides usually bridge
symmetrically while the other ligands can be symmetric or
asymmetric, an observation that can be explained in terms of
the ligand orbitals (σ or π) involved in the bonding.
2.1.5.2. Metal Coordination Numbers. A CSD investigation
into metal coordination numbers looked at their dependence
on the size, charge, and charge-accepting ability of the metal
and the size, charge, charge-donating ability, and denticity of
the ligand.69 For a given type of ligand donor atom, it was
concluded that the size of the metal is more important than its
charge in determining coordination number (but alkali metals
may be an exception70). Conversely, for a given metal, the
ligand’s charge and charge-donating ability is more important
than its size. Almost a hundred types of metal ion were studied,
all of which were found to adopt more than one coordination
number. Unsurprisingly, odd coordination numbers were less
common than even. In a separate study by Kuppuraj et al.,71
the preferred coordination geometries of 63 types of metal ions
were determined.
2.1.5.3. Bond Lengths. The primary aim of the study just
cited was to elucidate how metal−ligand (M−L) bond lengths
depend on the properties of the metal cation and the ligand
donor atom. A large sample of bond lengths from the CSD was
subdivided by the coordination numbers of the metal ion
(CN) and the ligand (LCN). For a given (CN, LCN) pair,
M−L distances going down a group or across a row of the
periodic table were found to be linearly correlated with the
metal ionic radius. The metal ionic radius depends, in turn, on
oxidation state, spin state, and CN.
A 2013 study used CSD data to quantify the trans effect
using −Cl and −PPh3 as probe ligands (PL).72 The average
M−PL bond lengths in d8 square-planar and low-spin d6
octahedral complexes were determined as a function of the
ligand trans to PL. Some ligands had little or no effect on the
metal−PL bond (e.g., pyridine, chloride); at the other extreme,
ligands such as hydride, phenyl, and triphenylphosphine
lengthened the bond significantly (>0.1 Å, implying an
approximately 30% reduction in bond order). The results
Figure 5. Two 9-coordinate complexes. [Pu(NCMe)9]3+ (left) is a
capped square antiprism (CSAPR), while [Nd(H2O)9]3+ (right) is on
the interconversion pathway between CSAPR and tricapped trigonal
prism. Reprinted with permission from ref 76. Copyright 2008 Wiley-VCH Verlag GmbH & Co. KGaA.
coordinate compounds,76,78 Jahn−Teller distorted Cu(II)
complexes,77 and complexes involving double or triple
metal−ligand bonds.79 Davis et al. made a similar analysis of
3-coordinate metal complexes, showing that actual geometries
are usually quite different from any of the textbook ideals
(trigonal planar, T-shaped, and trigonal pyramidal).82
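For reference, the continuous shape measure (CShM) introduced in section 2.1.5.4 is commonly written in the form below. The symbols are our notation, not taken from ref 81: the q_i are the N vertex positions of the observed coordination geometry, q_0 is their centroid, and the p_i are the corresponding vertices of the ideal polyhedron. The minimum is taken over the size, orientation, and vertex pairing of the ideal polyhedron.

```latex
S = \min \left[ \frac{\sum_{i=1}^{N} \left| \mathbf{q}_i - \mathbf{p}_i \right|^{2}}
                     {\sum_{i=1}^{N} \left| \mathbf{q}_i - \mathbf{q}_0 \right|^{2}} \right] \times 100
```

With this normalization, S = 0 for a geometry that matches the ideal polyhedron exactly, and S increases with distortion.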
2.1.5.5. Spin States. When spin crossover occurs in the
crystalline state, the resulting geometry changes can alter
crystal symmetry. A recent review of this phenomenon, based
heavily on examples taken from the CSD, focused on
complexes of first-row transition metals.83 The characteristic
crossover behavior is an abrupt transition to another space
group as the temperature is increased. Also possible is a
lowering of crystal symmetry to accommodate a mixed high-spin/low-spin state. The first type of behavior is relevant to the
design of spin-crossover materials that can exhibit useful
properties (e.g., ferroelectricity) in one of their phases.
2.1.5.6. Ligand Cone Angles. The steric requirements of a
monodentate ligand are often measured by its cone angle. The
concept has now been extended to bidentate ligands, the cone
angles of which depend on the ligand bite angle as well as the
steric bulk of the ligand.84 Values of this parameter were listed
for over 280 different phosphanes.
2.1.5.7. Polynuclear Complexes. A CSD search established
that polynuclear complexes containing an even number of
metal atoms are significantly more common than those with an
odd number.85 Suggested reasons included a possible
preference for high-symmetry complexes. The observation is
reminiscent, however, of an earlier observation that molecules
with even numbers of carbon atoms are more common than
those with odd numbers;86 at least one synthesis-related
explanation of this has been posited.87
Novikov performed theoretical calculations on CSD entries
with short Ni···Ni contacts and concluded that the interactions
are attractive and have significant covalent contributions when
ligand-supported but usually not when ligand-unsupported.88
Other investigations include a survey of Ni(II) and Co(II)
cubanes89 and an extensive analysis of semibridging carbonyl
ligands.90
2.1.5.8. Magnetism. The relationships between magnetic
properties and molecular structure can sometimes be clarified
with the help of the CSD. An example is provided by a study of
dinuclear bis(phenoxo)-bridged Cu(II) complexes.91 Most
exhibit antiferromagnetic coupling, but a small number are
ferromagnetic. DFT calculations and a survey of relevant CSD
structures provided some insight, though the picture was
complex. The coupling had a large dependence on the Cu−
O−Cu angle for planar complexes. However, when the
phenoxo groups were tilted strongly out of plane, the
dependence on Cu−O−Cu was small. This was explained in
terms of energy crossing of the two magnetic orbitals. Four
geometrical characteristics were listed: two associated with
ferromagnetic and two with antiferromagnetic coupling in
these complexes.
2.1.5.9. Software Parametrization. The CSD can provide
information for the parametrization of (semi)empirical
programs (and force fields; section 3.1). For example, bond-length distributions were used to extend the semiempirical
PM3 method to lanthanides. The ultimate aim was to assist the
design of luminescent and other commercially important
materials.92 CSD data were used to derive metal−organic bond
valence parameters for metals with different spin states93 and
for alkali− and alkaline-earth−oxygen pairs.94 CSD structures
for which experimental sublimation energies are available were
used to extend and validate the parametrization of PIXEL so
that it could handle molecules containing transition metals.95
(PIXEL is a popular semiempirical program for calculating
lattice energies.96) Finally, a fragment library has been derived
from the CSD, together with fragment connection rules. The
ambitious aim is the automated design of realistic, synthetically
accessible organometallic molecules.97
2.1.6. Crystallization Propensity. An interesting 2015
paper described an empirical model for predicting crystallization propensity.98 It was developed using two training sets
of molecules. The first was taken from the CSD, so the
molecules had a proven ability to form crystals suitable for
single-crystal diffraction. The other comprised molecules
absent from the CSD; it could confidently be assumed that
some would be unable to form good crystals. A large variety of
descriptors were calculated for each molecule from its two-dimensional structure (e.g., connectivity indices). Those most
useful for discriminating between the two sets of molecules
were determined by use of support vector machines. The final
model had about 80% classification accuracy, determined by
crystallization experiments on a small sample of non-CSD
molecules. Only two of the descriptors were used in the model:
the rotatable bond count (obviously measuring molecular
flexibility) and a connectivity index whose value correlated
with molecular volume. Subsequently, the same authors
invented an improved molecular-flexibility index, which was
the most predictive descriptor of all.99
2.1.7. Tautomerism and Proton Transfer. One of the
most amusing things about James Watson’s book The Double
Helix is his admission that he and Crick wasted their time
trying to build DNA models using the wrong tautomeric form
of guanine. The crystallographer Jerry Donohue put them
right, and the rest is history, a particularly dramatic example
of the importance of understanding tautomeric preferences.
The CSD is an obvious place to look for enlightenment,
though it must be done with care as hydrogen misplacement is
not uncommon.
Henry used CSD structures of sulfonamides and sulfonimides to illustrate how the tautomeric form adopted in the
crystalline state can have a huge effect on the H-bonding
network.100 Nanubolu et al. noticed that conjugation has a
pronounced influence on whether amino or imino forms occur
in the CSD.101 Cruz-Cabeza et al. compared the lowest-energy
tautomers of various heterocycles (calculated with MP2 and a
polarizable continuum model) with those observed in the
CSD, finding good but not perfect agreement.102 Reasonably
enough, discrepancies occurred when the energy difference
between alternative tautomeric forms was small.
Another survey found 108 molecules that crystallize in two
different tautomeric forms.103 This usually happens when
tautomer pairs occur in the same crystal structure; it is very
rare for different polymorphs of a compound to contain
different forms. Milletti and Vulpetti chose 13 ring systems
capable of tautomerism and deduced the forms each adopted
in several protein−ligand complexes.104 This was done by
examining H-bonding networks. They then compared their
results with the tautomers observed in water, the gas phase,
and the CSD. There was a good consensus, but with the
occasional discrepancy, e.g. the adenine tautomer favored in
water, the gas phase, and the CSD was less common in the
PDB than the alternative form.
On a related theme, Cruz-Cabeza studied over six thousand
crystal structures containing ionized or un-ionized acid−base
pairs. She was successful at correlating the occurrence of
proton transfer with the difference in the calculated aqueous
pKa of the two species (ΔpKa = pKa[protonated base] −
pKa[acid]).105 Thus, ionized forms were found if ΔpKa > 4;
un-ionized if ΔpKa < −1; and between these limits, increasing
ΔpKa by 1 increased the probability of proton transfer by
about 17%. The rule was supported by a subsequent test on a
matrix of acid−pyridine cocrystals.106
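The ΔpKa rule can be written as a small piecewise function. The sketch below takes the thresholds (4 and −1) and the slope (about 17% per pKa unit) from the text. The intercept of 0.28 is an assumption that is not stated here, and treating the outer regions as probabilities of exactly 1 and 0 is an idealization. The function names are hypothetical.

```python
def delta_pka(pka_protonated_base, pka_acid):
    """Delta pKa as defined in the text: pKa[protonated base] - pKa[acid]."""
    return pka_protonated_base - pka_acid


def proton_transfer_probability(dpka, slope=0.17, intercept=0.28):
    """Rough probability that an acid-base pair crystallizes in ionized form.

    slope follows the text (about 17% per pKa unit between the limits).
    intercept is an ASSUMPTION, not given in the text.
    """
    if dpka > 4:
        return 1.0  # idealized: ionized forms were found in this region
    if dpka < -1:
        return 0.0  # idealized: un-ionized forms were found in this region
    return slope * dpka + intercept


# Hypothetical pair: protonated base pKa 6.0, acid pKa 4.0, so delta pKa = 2.0
example = proton_transfer_probability(delta_pka(6.0, 4.0))
```

Between the two limits the estimate rises linearly, so each additional unit of ΔpKa adds 0.17 to the returned value. Outside the limits the function simply reports the outcome the rule predicts.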
2.2. Intermolecular Interactions
The CSD has had a greater impact on the study of
intermolecular interactions than on any other topic. Many
seminal studies have been published,107 and interest in the area
shows no sign of abating. As we will see, controversy surrounds
some of the weak interactions that have been investigated
recently.
2.2.1. Hydrogen Bonds. Because the H-bond is by far the most
important intermolecular interaction, it is unsurprising that a
great many CSD research studies have focused on it, including several in the last 10 years or so. One looked at
H-bond “coordination numbers”, i.e. the number of H-bonds
that a donor or acceptor can simultaneously form.108 The
distributions of this parameter were determined for over 70
different types of acceptors and donors. For example, amide
carbonyl oxygen atoms were found to accept 0, 1, 2, and 3 H-bonds on 763, 1214, 189, and 21 occasions, respectively.
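The amide carbonyl counts quoted above can be turned into a frequency distribution and a mean coordination number with a few lines of arithmetic. The sketch below uses only the four counts given in the text. The variable names are illustrative.

```python
# Number of amide carbonyl oxygen atoms accepting 0, 1, 2 or 3 H-bonds (from the text)
counts = {0: 763, 1: 1214, 2: 189, 3: 21}

total = sum(counts.values())

# Fraction of observations at each coordination number
fractions = {k: v / total for k, v in counts.items()}

# Mean number of H-bonds accepted per amide carbonyl oxygen
mean_coordination = sum(k * v for k, v in counts.items()) / total

# Most frequently observed coordination number
most_common = max(counts, key=counts.get)
```

Accepting a single H-bond is the most common case, at a little over half of the observations, and the mean is below one because roughly a third of the oxygen atoms accept none.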
Another paper pointed out that about 2.5% of organic
structures in the CSD contain no H-bonds, despite the
presence of strong donor and acceptor groups.109 In about
two-thirds of these, steric factors were deemed to be
responsible. In carbamazepine, for example,...