
Exploring structural database use in crystallography: a workshop series of the U.S. National Committee for Crystallography

Published online by Cambridge University Press:  05 April 2023

Ana Ferreras*
Affiliation:
Board on International Scientific Organizations, The National Academies of Sciences, Engineering, and Medicine, Washington, DC, USA
Mitchell D. Miller
Affiliation:
Department of BioSciences, Rice University, Houston, TX, USA
*Author to whom correspondence should be addressed. Electronic mail: aferreras@nas.edu

Abstract

The U.S. National Committee for Crystallography (USNC/Cr) of the National Academies of Sciences, Engineering, and Medicine provided an online workshop series for researchers on the use, development, and maintenance of crystallographic and structural databases in the spring of 2022. Encompassing macromolecular, small molecule, and powder diffraction information, the series included 11 modules, each meeting for 1 or 2 days. Graduate students, postdoctoral fellows, faculty members, and researchers in any of the crystallographic, diffraction, and imaging sciences affiliated with the International Union of Crystallography (IUCr) were encouraged to register and participate in the training sessions that interested them.

Type
International Report
Copyright
Copyright © The Author(s), 2023. Published by Cambridge University Press on behalf of International Centre for Diffraction Data

I. BACKGROUND

Crystallographic databases play a vital role in research. They provide comprehensive archives of crystal structures of inorganic, organic, metal–organic, and biological macromolecular compounds to a wide range of structural sciences. Proper database structure and data validations are crucial in making any database useful and reliable. Well-documented procedures and quality metrics enable trust in the data and allow others to build on past studies. To help train students and researchers on the use, development, and maintenance of crystallographic and structural databases, the U.S. National Committee for Crystallography (USNC/Cr) organized an online workshop series.

The USNC/Cr represents U.S. crystallographers in the International Union of Crystallography (IUCr) through the National Academies of Sciences, Engineering, and Medicine (NASEM). Further information about the USNC/Cr can be found at https://www.nationalacademies.org/our-work/us-national-committee-for-crystallography-usnc-cr. The goals of the IUCr are to promote international cooperation in crystallography; promote international publication of crystallographic research; facilitate standardization of methods, units, nomenclatures, and symbols; and form relationships between crystallography and other sciences. The IUCr provides access to a global infrastructure, a global community of experts, and international resources. For more information about the IUCr, visit https://www.iucr.org/.

II. THE SCOPE OF THE ACTIVITY

The USNC/Cr held a workshop series (15 sessions, 2–4 h duration each) on structural database use in crystallography, running from March 21 through April 14, 2022, with the aim of training a future generation of users who can develop and maintain the open databases that the crystallographic field needs. The scope of research areas addressed by the workshop series was very broad, ranging from small molecule crystallography and powder diffraction to macromolecular crystallography and electron microscopy (3DEM). Specific topics ranged from searching for and analyzing data, through quality assessment, deposition, and validation, to considerations for building your own scientific database. The registration system allowed participants to choose the sessions that matched their interests. The workshop series was sponsored by the National Institute of Standards and Technology (NIST).

The webinar series was fully virtual and comprised a keynote kickoff session, followed by 1–3 sessions led by each of our eight database partners on their database resources and related topics, spread over a 4-week period. Most sessions started at 11 am EDT so that live participation was convenient from the U.S. West Coast to Eastern Europe. All sessions were recorded to make the content available to those who could not participate live.

The activity had over 400 registrants (each session had 200–275 registrants), and the live-view attendance numbers varied between 30 and 80 per session. Registration was open and free to anyone in the world. The registrants were diverse in both career level (from full professors through graduate and undergraduate students) and gender (see Figure 1). While 50–100 registrants reported previously using each database resource, others registered for sessions focused on resources they had not used previously. Each session provided an abstract that included the learning objectives and takeaway skills.

Figure 1. Responses from the registration survey questions about career stage (a), gender (b), techniques used to probe structure (c), and which (if any) databases registrants had used before the workshop series (d).

III. ACTIVITY STRUCTURE, SCHEDULE, RECORDINGS, COLLABORATORS, AND SPEAKERS

The home for the workshop series is on the NASEM website https://www.nationalacademies.org/our-work/exploring-structural-database-use-in-crystallography-a-usnccr-workshop-series. Recordings of each session are available on a Vimeo page that showcases the series https://vimeo.com/showcase/9403870. The details about every session can be found in the abstract book on the NASEM website.

The following eight partners provided content to the workshop series and were in charge of their respective sessions.

  • Crystallography Open Database (COD)

  • Cambridge Crystallographic Data Centre (CCDC) / Cambridge Structural Database (CSD)

  • Electron Microscopy Data Bank (EMDB)

  • International Centre for Diffraction Data (ICDD) – Powder Diffraction File (PDF)

  • Inorganic Crystal Structure Database (ICSD)

  • Integrated Resource for Reproducibility in Macromolecular Crystallography (IRRMC)

  • RCSB Protein Data Bank (RCSB PDB)/Protein Data Bank (PDB)

  • SBGrid Software and Data Bank

Table I summarizes the workshop series' intense calendar.

TABLE I. 2022 Workshop at a glance.

On March 21, 2022, the keynote speaker was John R. Helliwell, Emeritus Professor of Chemistry, University of Manchester, UK and DSc in Physics, University of York, UK. Professor Helliwell discussed “The Exemplary Crystallography and Structural Databases” at the opening session to frame the workshop series. Professor Helliwell is the Chairman of the International Union of Crystallography Committee on Data and its Representative to CODATA. He noted how in structural science, the science is in the data. He praised the vision of William Lawrence Bragg (Bragg, 1913) to include the diffraction data in his article on the first crystal structure. Data archival and exchange standards underpin the resources and drive the quest for the best quality data achievable. Collection of these data into databases has facilitated their reuse and contributes to the reproducibility and replicability of the science, which allows others to build on the results (National Academies of Sciences, Engineering, and Medicine, 2019). He encouraged participants to explore a review article (Bruno et al., 2017) and the IUCr website https://www.iucr.org/resources/data/databases, where they could find out more about the many crystallographic databases (both those included in the workshop series and others, including the Pauling File (Villars et al., 2018)). The exemplary databases developed in crystallography and structural science are widely admired and enable trust in the data. The use of vast collections of structural data is very broad and has great scientific and societal impact across many disciplines, including biology, chemistry, and materials science.

On March 23 and 24, 2022, Stephan Rühl, FIZ Karlsruhe, led a 2-day tutorial on “Searching and Using the Inorganic Crystal Structure Database (ICSD)”. The ICSD is the world's largest database for fully determined inorganic crystal structures and is maintained by FIZ Karlsruhe. The first day provided students with an introduction to the resource, its history, and the validation checks that ensure the highest possible quality standards of the data. The cooperation between FIZ Karlsruhe and the Cambridge Crystallographic Data Centre was discussed. There was an emphasis on how to find information in the database, with detailed instruction on the use of search masks, from simple to complex. The second day was a hands-on tutorial in which students were tasked with answering a number of questions by applying the search techniques they had learned. The takeaway skill was the ability to use the ICSD to answer interesting problems, as modeled through practice problems.

On March 28 and 30, 2022, Thomas Blanton and Soorya Kabekkodu, ICDD, presented on “The International Centre for Diffraction Data (ICDD) Powder Diffraction File™: Database Concepts and Applications”. Design, data curation, and data management are all critical factors in developing a successful and useful database. The Powder Diffraction File in Relational Database (RDB) format contains extensive chemical, physical, bibliographic, and crystallographic data, including atomic coordinates, enabling characterization and computational analysis. This workshop focused on the processes used in creating the Powder Diffraction File (PDF®), including quality, reliability, management, and accessibility of data, and highlighted how the database could be used for materials analysis applications. The database applications covered included phase identification, quantitative analysis, and materials characterization using data mining applications.

On March 29 and 31, 2022, Stephen K. Burley, Ezra Peisach, Jasmine Young, Chenghua Shao, and Shuchi Dutta, RCSB PDB, provided a 2-day workshop “Introducing the Protein Data Bank (PDB): 3D Macromolecular Structure Data Deposition, Validation, Biocuration, Archiving, and Delivery for Researchers, Educators, and Students Worldwide”. The PDB is the single global archive for preserving and disseminating information about the experimentally determined three-dimensional (3D) structures/shapes of proteins, nucleic acids, and complex assemblies. The PDB was established in 1971 as the first open-access digital data resource in biology, and since 2003 the archive has been managed by the Worldwide Protein Data Bank (wwPDB) partnership. This workshop, sponsored by the RCSB Protein Data Bank (U.S. PDB data center), introduced participants to the science and technology of 3D macromolecular structure data deposition, validation, biocuration, archiving, and delivery. It covered the PDBx/mmCIF data standard that underpins the data organization of the archive. Participants learned how data are deposited and validated, how to search the archive and retrieve structures, and how to view structures and interpret the quality metrics in the validation reports.

On April 4, 2022, Jack Turner, EMDB, presented “Introducing the Electron Microscopy Data Bank (EMDB)”. The EMDB is a public repository for electron cryo-microscopy volume maps and tomograms of macromolecular complexes and subcellular structures. It covers a variety of techniques, including single-particle analysis, electron tomography, and electron crystallography. During the workshop, participants learned about the different methods used to produce high-resolution EM structures and how to use the EMDB website to browse for EM structures of interest. The various validation metrics that are stored were introduced to allow participants to assess the quality of EM data both in the EMDB archive and more generally. The use of EM visualization software was covered, as was a brief introduction to EMPIAR (the Electron Microscopy Public Image Archive) and AlphaFoldDB. A set of interactive, assisted exercises was used to reinforce participants' understanding of EM and the EMDB.

On April 5, 2022, Piotr Sliz, Peter Meyer, Shaun Rawson, Carol Herre, James Vincent, and Giorgos Boutsioukis, from SBGrid, presented “Using the SBGrid Software Installer, AppCiter, and SBGrid Data Bank on PCs, HPC Clusters and AWS Cloud”. SBGrid (www.sbgrid.org) is a consortium of 427 structural biology groups that utilize homogeneous stacks of scientific applications. The SBGrid “Factory” at Harvard Medical School, which is led by Dr. Sliz, actively curates over 1000 software titles, ranging from tools used in crystallography, electron microscopy, and computational chemistry to computational biology (biogrids.org). The SBGrid consortium members use the extensive software collection and work with the Factory to improve the quality of the SBGrid environment. The session explained how the backend database infrastructure helps SBGrid curate, organize, compile, and deploy this large library of scientific applications. Participants learned how to use the SBGrid Software Installer, AppCiter, and SBGrid Data Bank. The deployment of SBGrid applications and experimental data on personal computers, high-performance computing clusters, and cloud computing clusters was highlighted. The SBGrid Data Bank allows the archival and validation of X-ray diffraction, MicroED, and LLSM datasets. Participants also learned how to utilize SBGrid applications in support of CryoEM, small-molecule docking, and structure prediction.

On April 6 and 11, 2022, Saulius Gražulis, Antanas Vaitkus, and Andrius Merkys, Vilnius University/COD, presented two sessions on the Crystallography Open Database (COD) and Theoretical Crystallography Open Database (TCOD). Using an open-access distribution model, the COD collects all known “small molecule/small to medium-sized unit cell” crystal structures and makes them freely available on the Internet by offering basic search capabilities and the possibility to download all or part of the database. A website allows all registered users to deposit published structures and, as personal communications or pre-publication depositions, structures that are so far unpublished. Such a setup enables many users to extend the COD simultaneously. This increases the possibilities for growth of the COD and is the first step toward establishing a worldwide Internet-based collaborative platform dedicated to the collection and curation of structural knowledge. The first session was titled “Depositing and managing data in the Crystallography Open Database (COD)”. It focused on how to deposit data in the database, including the use of validation tools and quality standards, and on how to manage the data, including correction of errors and release (on-hold vs public release). The peer-review process for entries was also covered. The second session was “Searching and getting data from the Crystallography Open Database (COD) and Theoretical Crystallography Open Database (TCOD)”. It presented various ways to search, extract, and analyze data from the COD and TCOD. Participants learned about the underlying structure of the COD, including its contents, scope, versioning, curation, and identification principles. The way the COD integrates with other databases was presented. Participants learned how to perform computations and queries with the COD and how to view and assess the results.

On April 7, 2022, Wladek Minor and David Cooper, University of Virginia/IRRMC; Dariusz Brzezinski, Technical University of Poznan; and Brinda Vallat, RCSB PDB, were joined by John Helliwell, University of Manchester, to present a session titled “Access to Experimental Data Is Crucial for Scientific Reproducibility”. The session drew on the example of the Integrated Resource for Reproducibility in Macromolecular Crystallography (IRRMC). The IRRMC includes a repository system and website designed to make the raw data of protein crystallography more widely available. Its focus is on identifying, cataloging, and providing the metadata related to datasets, which could be used to reprocess the original diffraction data. This project seeks to make the resulting three-dimensional structures more reproducible and easier to modify and improve as processing methods advance. Participants learned about the proteindiffraction.org repository of diffraction experiments used to determine structures that are in the Protein Data Bank, as well as the IRRMC tools used for working with data from the archive. The application of neural networks to find anomalies in raw diffraction images and malfunctions of experimental systems was presented. The wwPDB's plans for archival of integrative structures, whose models rely on combining information from multiple techniques and sources, were detailed. The experimental data supporting the models may come from techniques such as X-ray crystallography (X-ray), nuclear magnetic resonance (NMR) spectroscopy, three-dimensional electron microscopy (3DEM), small-angle solution scattering (SAS), chemical cross-linking mass spectrometry (CX-MS), Förster resonance energy transfer (FRET), electron paramagnetic resonance spectroscopy (EPR), hydrogen-deuterium exchange mass spectrometry (HDX-MS), and other biophysical and proteomics methods. The final part of the session focused on how access to unprocessed experimental data impacts scientific reproducibility.

On April 12 and 13, 2022, Jeff Lengyel, CCDC, led a 2-day tutorial titled “Working with the Cambridge Structural Database (CSD): Searching, Analyzing and Depositing Data Using Cambridge Crystallographic Data Centre (CCDC) Tools”. After the first day's session, he was joined by Allen Oliver, University of Notre Dame, in a live one-hour “AMA – Ask Me Anything” session on Twitter. The CCDC is the curator of the Cambridge Structural Database (CSD), the world's largest repository of fully curated organic and organometallic experimental crystal structures. The large amount of reliable data makes the CSD an excellent place to gather insights into chemical and structural space. Participants explored how the CCDC's tools (both the graphical user interfaces and Python-based scripting) allow users to search for structures of interest. From the search results, participants learned how to generate knowledge from statistical trends. Finally, the process of data submission and validation was covered.

On April 14, 2022, Saulius Gražulis, Vilnius University/COD, presented the final session of the series, titled “Under the Hood: Building Your Own Database”. The session was targeted at anyone wishing to manage their own data reliably over the long term. Participants learned the basics of preparing their own database, including choosing stable identifiers, versioning, handling sensitive data, the use of relational models, and data curation. Participants learned the nuts and bolts of databases as data formats, low-level data encoding, and data dictionaries were introduced. Participants also explored logistical issues such as web access and web security. By the end of the session, participants were able to work toward preparing a database with an awareness of the potential issues that must be addressed in its design and implementation.

IV. CONCLUSION

This online workshop series introduced participants to eight crystallographic and structural databases. We are grateful for the contributions of all the speakers, who provided participants with such a depth of content on such wide-ranging resources. Sessions covered the content, history, data curation, and use of the databases. Participants were able to select the sessions relevant to their careers and learn how to apply these resources to answer their scientific questions. The recordings of the sessions ensure that those unable to participate live can benefit from this initiative, and they represent a fantastic repository for the scientific community in the U.S. and overseas.

DISCLAIMER

The statements made here are those of the author(s) and do not necessarily represent positions of the National Academies of Sciences, Engineering, and Medicine.

REFERENCES

Bragg, W. L. 1913. “The Structure of Some Crystals as Indicated by Their Diffraction of X-Rays.” Proceedings of the Royal Society London, A 89: 248–77. doi:10.1098/rspa.1913.0083.
Bruno, I., Gražulis, S., Helliwell, J. R., Kabekkodu, S. N., McMahon, B., and Westbrook, J. 2017. “Crystallography and Databases.” Data Science Journal 16 (38): 117. doi:10.5334/dsj-2017-038.
National Academies of Sciences, Engineering, and Medicine. 2019. Reproducibility and Replicability in Science. Washington, DC, The National Academies Press. doi:10.17226/25303.
Villars, P., Cenzual, K., Gladyshevskii, R., and Iwata, S. 2018. “PAULING FILE - Towards a Holistic View.” Chemistry of Metals and Alloys 11: 4376. doi:10.30970/cma11.0382.