SUMMARY
Nearly 90% of structural models in the Protein Data Bank (PDB), the central resource worldwide for three-dimensional structural information, are currently derived from macromolecular crystallography (MX). A major bottleneck in determining MX structures is finding conditions in which a biomolecule will crystallize. Here, we present a searchable database of the chemicals associated with successful crystallization experiments from the PDB. We use these data to examine the relationship between protein secondary structure and average molecular weight of polyethylene glycol and to investigate patterns in crystallization conditions. Our analyses reveal striking patterns of both redundancy of chemical compositions in crystallization experiments and extreme sparsity of specific chemical combinations, underscoring the challenges faced in generating predictive models for de novo optimal crystallization experiments.
In Brief
Free-text metadata in public databases are difficult to extract and leverage. We present a curated dataset of experimental details from the PDB, the primary repository of macromolecular structures. We also contribute a software tool that parses PDB free-text fields so that users can generate updated or customized datasets. Our parsing function handles irregular free-text information and produces usable datasets with a controlled vocabulary. We illustrate the use of the extracted metadata through analyses of relationships between chemicals and protein structure features.
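To make the idea of mapping irregular free text onto a controlled vocabulary concrete, the following is a minimal sketch in Python. It is not the tool described here: the alias table, the regular expressions, the function name `parse_conditions`, and the example string are all hypothetical and chosen only for illustration. A real parser would need a far larger vocabulary and many more patterns.

```python
import re

# Hypothetical controlled vocabulary: lowercase alias -> canonical name.
ALIASES = {
    "peg": "polyethylene glycol",
    "polyethylene glycol": "polyethylene glycol",
    "nacl": "sodium chloride",
    "sodium chloride": "sodium chloride",
    "tris": "tris",
    "tris-hcl": "tris",
    "ammonium sulfate": "ammonium sulfate",
    "ammonium sulphate": "ammonium sulfate",
}

# One component: amount, unit, optional "(w/v)", name, optional molecular weight.
# Examples: "20% PEG 3350", "0.2M NaCl", "0.1 M TRIS-HCL".
PIECE = re.compile(
    r"^(?P<amount>\d+(?:\.\d+)?)\s*(?P<unit>%|mM|M)\s*"
    r"(?:\(?[wv]/v\)?\s*)?"
    r"(?P<name>[A-Za-z][A-Za-z\- ]*?)\s*(?P<mw>\d{3,5})?$"
)
PH = re.compile(r"pH\s*(\d+(?:\.\d+)?)", re.IGNORECASE)


def parse_conditions(text):
    """Split a free-text condition string and normalize each component."""
    components, unparsed, ph = [], [], None
    for piece in re.split(r"[,;]", text):
        piece = piece.strip()
        if not piece:
            continue
        ph_match = PH.fullmatch(piece)
        if ph_match:
            ph = float(ph_match.group(1))
            continue
        m = PIECE.match(piece)
        canonical = ALIASES.get(m.group("name").strip().lower()) if m else None
        if canonical is None:
            # Keep anything unrecognized for manual review instead of guessing.
            unparsed.append(piece)
            continue
        components.append({
            "chemical": canonical,
            "amount": float(m.group("amount")),
            "unit": m.group("unit"),
            "mw": int(m.group("mw")) if m.group("mw") else None,
        })
    return {"components": components, "ph": ph, "unparsed": unparsed}


result = parse_conditions(
    "20% PEG 3350, 0.2M NaCl; 0.1 M TRIS-HCL, pH 8.5, trace of magic"
)
```

In this sketch, fragments that do not match a known pattern or alias are collected under `unparsed` rather than silently dropped, so that irregular entries can be inspected and the vocabulary extended.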
Graphical Abstract