
Protein secondary structure determined from independent and integrated infra-red absorbance and circular dichroism data using the algorithm SELCON

Published online by Cambridge University Press:  03 February 2025

Søren Vrønning Hoffmann
Affiliation:
ISA, Department of Physics and Astronomy, Aarhus University, Aarhus, Denmark
Nykola C. Jones
Affiliation:
ISA, Department of Physics and Astronomy, Aarhus University, Aarhus, Denmark
Alison Rodger*
Affiliation:
Research School of Chemistry, Australian National University, Canberra, Australia
*Corresponding author: Alison Rodger; Email: alison.rodger@anu.edu.au

Abstract

Protein circular dichroism (CD) and infrared absorbance (IR) spectra are widely used to estimate the secondary structure content of proteins in solution. A range of algorithms has been used for CD analysis (SELCON, CONTIN, CDsstr, SOMSpec) and some of these have been applied to IR data, though IR is more commonly analysed by band-fitting or statistical approaches. In this work we provide a Python version of SELCON3 and explore how to combine CD and IR data to best effect. We used CD data in Δε/amino acid residue and scaled the IR spectra to similar magnitudes. Scaling the normalised IR amide I spectra to a maximum absorbance of 15 gives the best general performance. Combining CD and IR improves predictions for both helix and sheet by ~2% and helps identify anomalously large errors for high helix proteins such as haemoglobin when using IR data alone and for high sheet proteins when using CD data alone.

Type
Research Article
Creative Commons
This is an Open Access article, distributed under the terms of the Creative Commons Attribution-NonCommercial-ShareAlike licence (http://creativecommons.org/licenses/by-nc-sa/4.0), which permits non-commercial re-use, distribution, and reproduction in any medium, provided the same Creative Commons licence is used to distribute the re-used or adapted article and the original article is properly cited. The written permission of Cambridge University Press must be obtained prior to any commercial use.
Copyright
© The Author(s), 2025. Published by Cambridge University Press

Introduction

Protein circular dichroism (CD) spectra, particularly with the high-quality data produced by synchrotron radiation sources (Guerra et al., 2023; Bruque et al., 2024; Krokengen et al., 2024), are usually the first choice for determining the average secondary structure of proteins in aqueous solution. However, infra-red absorbance (IR) data are sometimes used in preference, particularly when high concentration samples or high concentrations of buffer components are present. There are a few studies where CD and IR data have been combined to good effect. In this paper we show the value of integrating both types of spectral data where possible.

When extracting secondary structure percentages from CD data alone we have found it valuable to use different reference sets and different algorithms. SELCON and SOMSpec seem to be the most reliable methods for CD analysis (Hall et al., 2013). SELCON (Sreerama and Woody, 2000) for CD is currently available as Fortran code and also via the Dichroweb server with a selection of reference sets (Whitmore and Wallace, 2004). SOMSpec is available as MATLAB code (Pinto Corujo et al., 2022).

Infra-red (IR) absorbance spectra, particularly of the amide I band between 1600 and 1700 cm−1, are also generally recognised to contain information about the protein’s secondary structure. The vibrational contribution of the amide I band is dominated by the C=O stretching of the amide group coupled with the in-phase bending of N–H bonds and stretching of C–N bonds (Krimm and Bandekar, 1986; Bandekar, 1992). A great deal of work has been done on protein IR spectroscopy, but the best way to extract secondary structure information, for example for regulatory or research purposes, remains unclear.

A range of different curve fitting methods, often preceded by band-narrowing (Kauppinen et al., 1981; Maddams and Tooke, 1982; Susi and Byler, 1983; Byler and Susi, 1986; Calero and Gasset, 2005), have been implemented, as summarised and illustrated by Pinto Corujo et al. (2022). The recent consensus (e.g. Kong and Yu, 2007; Yang et al., 2015) is that 1620–1640 cm−1 is attributed to β-sheet, 1640–1650 cm−1 to other structures, 1650–1656 cm−1 to α-helix, and 1670–1685 cm−1 to turns. Errors of between 10 and 20% were found with band-fitting and derivative-band fitting methods (Pinto Corujo et al., 2022).
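The band assignments quoted above can be written down as a simple lookup, which is roughly how a band-fitting workflow would label a fitted peak centre. This is only an illustrative sketch: the names `AMIDE_I_REGIONS` and `assign_band` are hypothetical, and treating each range as half-open (so that a shared boundary such as 1640 cm−1 maps to a single label) is an assumption, not something the cited papers specify.

```python
# Amide I regions (cm^-1) as quoted in the text from Kong and Yu (2007) and
# Yang et al. (2015). Names are hypothetical and not part of SSCalcPy.
AMIDE_I_REGIONS = [
    (1620.0, 1640.0, "beta-sheet"),
    (1640.0, 1650.0, "other"),
    (1650.0, 1656.0, "alpha-helix"),
    (1670.0, 1685.0, "turn"),
]


def assign_band(wavenumber):
    """Return the structure label for a band centre, or None if unassigned."""
    for low, high, label in AMIDE_I_REGIONS:
        # Half-open intervals (assumption) so boundaries map to one label only.
        if low <= wavenumber < high:
            return label
    return None
```

Note that the quoted ranges leave 1656–1670 cm−1 unassigned, so a peak fitted there returns `None` in this sketch.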

Factor analysis methods (Pancoska et al., 1991; Baumruk et al., 1996), including the BioTools (Jupiter, US) program Prota™, provide reasonably good structure estimates, but the details of the fittings cannot be interrogated by the user. Oberg et al. (2004) have extensively explored the application of a partial least squares analysis (PLS) using reference sets and concluded that the most important issue is the quality of the reference set. They observed that larger reference sets usually do not perform better than smaller ones, as they may include more ‘anomalous’ spectra, so it is important to be able to interrogate results rather than simply accept a number. Goormaghtigh et al. had significant success with an approach which identifies three key wavenumbers for the three structural features that can be distinguished in the IR spectrum (Goormaghtigh et al., 2006; De Meutter and Goormaghtigh, 2021). However, their preference for using a data point in the amide II band for films of proteins is in our experience not transferable to biopharmaceutical formulations, as we have observed that the magnitude of this band varies significantly with different solution components.

We have explored the application of our self-organising map circular dichroism structure fitting algorithm, SOMSpec (Hall et al., 2014; Pinto Corujo et al., 2022), to the analysis of protein infrared spectra and found it to be more accurate than most of the other methods used. The key feature of SOMSpec is that it organises a reference set onto a map by similarity of spectral shape, then places the unknown spectra on the same map to extract secondary structure estimates.

There are limited examples in the literature where both CD and IR data have been used to give better estimates of secondary structure. Most applications (e.g. Pancoska et al., 1991; Baumruk et al., 1996; Ascoli et al., 1998; Calero and Gasset, 2005) involve independent consideration of CD and IR spectra, usually with some kind of band-fitting approach for the IR data. Oberg et al. (2004) explored combining CD and IR (amide I and II) spectra. They measured CD and IR of 50 proteins and used Partial Least Squares (PLS) regression as their main analysis method with DSSP annotation for the spectra, and also considered principal component analysis followed by multiple regression and SELCON. Overall, they found that in general IR is better for β-sheet and turn estimates and CD for α-helix and ‘the rest’. They also noted that when the α-helix content from either CD or IR is noticeably lower than from the other, then the larger one will be closer to reality. In general, using the combined data set gave better estimates, but noting when the independent estimates differed significantly was a good indication of failed analyses. The authors also concluded larger reference sets are better but warned against enhancing a reference set with anomalous spectra.

Although Oberg et al. (2004) had success using their PLS approach with CD and IR data, both independently and combined, and we have extensively used our self-organising map approach with both CD and IR data (Hall et al., 2014; Pinto Corujo et al., 2022) and could have used it for combined data sets, neither approach has been implemented for routine use. As the CD community has used SELCON successfully for decades and Oberg et al. (2004) found it worked as well as their approach, the goal of this work was to make SELCON available for CD, IR amide I and CD + IR analysis and to test how well it works.

Methods

The SELCON3 routine used in this work is based on an implementation of the algorithm in the MATLAB script SelMat (Lees et al., 2006), re-written in Python. The new Python-based program and reference set package SSCalcPy includes this implementation of SELCON3 and is available on Zenodo (Hoffmann and Jones, 2024a) and GitHub (Hoffmann and Jones, 2024b). The SSCalcPy package includes two reference sets for CD secondary structure calculations, namely SP175 (Lees et al., 2006) and SMP180 (Abdul-Gader et al., 2011), obtained from the PCDDB (Whitmore et al., 2011; Whitmore et al., 2016; Ramalli et al., 2022). Details of the origin of the original code and the reference datasets are given in the supplementary information as S1. The secondary structure assignments for the CD reference data are based on a DSSP (Kabsch and Sander, 1983) method; see S1 for more details.

The reference set used for SELCON3 analysis of IR data is based on the RaSP50 (Oberg et al., 2003; Goormaghtigh et al., 2006) data available in the Supplementary Material of the SOMSpec analysis publication (Pinto Corujo et al., 2022). A detailed description of the method of sample preparation and data collection can be found in Goormaghtigh et al. (2006). A list of the 50 proteins, their SOMSpec annotation, and crystal secondary structure can be found in S2 of the supplementary information. This reference set contains IR spectra collected on protein samples dried on an ATR crystal and has data in the wavenumber range 1600–1800 cm−1. Oberg et al. (2004) scaled their IR data to a maximum of 1 and, when combining CD and IR data, scaled the CD spectral intensities by 0.0015. The IR data to be analysed must have a baseline spectrum subtracted, making sure that the 2100 cm−1 region is flat, and must be zeroed at 1718 cm−1 prior to normalization.
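The zeroing and normalization steps described above can be sketched as follows. This is a minimal illustration, not the SSCalcPy implementation: the function name `zero_and_normalise` is hypothetical, the baseline spectrum is assumed to have been subtracted already, and taking the value at 1718 cm−1 by linear interpolation is an assumption for spectra whose grid does not contain that exact wavenumber.

```python
import numpy as np


def zero_and_normalise(wavenumbers, absorbance, zero_at=1718.0):
    """Zero a baseline-subtracted IR spectrum at `zero_at` and scale its maximum to 1.

    Hypothetical helper for illustration; not taken from SSCalcPy.
    """
    wn = np.asarray(wavenumbers, dtype=float)
    ab = np.asarray(absorbance, dtype=float)
    order = np.argsort(wn)  # np.interp requires increasing x values
    # Absorbance at the zeroing wavenumber (linear interpolation is an assumption).
    offset = np.interp(zero_at, wn[order], ab[order])
    shifted = ab - offset
    # Normalise so that the largest absorbance equals 1.
    return shifted / shifted.max()
```

The output keeps the input ordering, so it can be passed on together with the original wavenumber array.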

From the proteins included in RaSP50, SP175 and SMP180 we have identified 28 common proteins where both high-quality IR and CD spectra are available. We used these to produce a combined CD-IR reference set (CD-IR28). For a list of the 28 proteins, see S3 in the supplementary information. To perform a SELCON3 Leave One Out Validation (LOOV) analysis of both the RaSP50 and the new CD-IR28 reference sets, each spectrum is removed from the reference set in turn and subjected to SELCON3 analysis with the remaining 49 (RaSP50) or 27 (CD-IR28) spectra used as the reference set. The LOOV Python script is included in the SSCalcPy package under the folder “Tools”.
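The leave-one-out procedure can be sketched generically as below. The structure of the loop follows the description above, but everything else is an assumption for illustration: `leave_one_out` and `least_squares_predict` are hypothetical names, and the plain least-squares predictor is only a stand-in so that the example runs. It is not SELCON3 and not the LOOV script shipped with SSCalcPy.

```python
import numpy as np


def least_squares_predict(ref_spectra, ref_structures, query):
    """Stand-in predictor (NOT SELCON3): linear map from spectrum to structure."""
    # Solve ref_spectra @ X ~= ref_structures in the least-squares sense.
    X, *_ = np.linalg.lstsq(ref_spectra, ref_structures, rcond=None)
    return query @ X


def leave_one_out(spectra, structures, predict):
    """Return predicted-minus-crystal differences, one row per protein.

    spectra:    (n_proteins, n_points) array of reference spectra
    structures: (n_proteins, n_classes) array of crystal structure fractions
    predict:    callable(ref_spectra, ref_structures, query) -> fractions
    """
    spectra = np.asarray(spectra, dtype=float)
    structures = np.asarray(structures, dtype=float)
    n = spectra.shape[0]
    diffs = np.empty_like(structures)
    for i in range(n):
        keep = np.arange(n) != i  # every protein except the one left out
        predicted = predict(spectra[keep], structures[keep], spectra[i])
        diffs[i] = predicted - structures[i]
    return diffs
```

Passing the predictor in as a callable keeps the validation loop independent of the fitting algorithm, so the same loop could in principle wrap any method that maps a reference set and a query spectrum to structure fractions.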

Since the IR reference data are scaled to a maximum absorbance of 1 and the CD spectra are in molar extinction (Δε) units, the CD data magnitudes are typically much larger than the magnitude of the IR data. To optimize the SELCON3 analysis of the combined CD and IR data, the IR data have been scaled further to achieve more similar magnitudes between the CD and IR spectra. This scaling factor (IRscale) has been varied between 1 and 20 in the analysis, and we give suggestions for the optimum scaling of the IR spectra as part of the analysis in the Results and Discussion section. A scaling of zero results in analysis of the secondary structure based on the CD spectra alone. This approach is the opposite of that taken by Oberg et al. (2004), who scaled the CD spectra and used normalised IR spectra.
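A minimal sketch of how a combined spectrum could be assembled is given below. It assumes that the CD and IR parts are simply concatenated into one vector after multiplying the normalised IR spectrum by IRscale; the function name `combine_cd_ir` and its signature are hypothetical and do not describe the SSCalcPy interface.

```python
import numpy as np


def combine_cd_ir(cd, ir_normalised, ir_scale=15.0):
    """Concatenate a CD spectrum with an IR spectrum multiplied by ir_scale.

    Hypothetical helper: `cd` is in delta-epsilon units and `ir_normalised`
    has a maximum of 1, as described in the text.
    """
    cd = np.asarray(cd, dtype=float)
    ir = np.asarray(ir_normalised, dtype=float)
    # With ir_scale = 0 the IR part is all zeros and carries no information,
    # which corresponds to an analysis based on the CD data alone.
    return np.concatenate([cd, ir_scale * ir])
```

The same scaling has to be applied to every reference spectrum and to the query spectrum, otherwise the IR parts are not comparable.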

Results and discussion

The Python SELCON3 code in the SSCalcPy package was tested using CD data files and the reference set SP175, and its results compared satisfactorily with those from the Dichroweb server (Lobley et al., 2002; Whitmore and Wallace, 2004; Whitmore and Wallace, 2008; Miles et al., 2022) prior to its use for IR data. First, we validated the performance of SELCON3 for the 50 IR spectra in the RaSP50 reference dataset by performing LOOV analysis and calculating for each protein the difference between the SELCON3 secondary structure (SSi) and the crystal secondary structure (cSSi), ΔIR,i = SSi − cSSi for protein i (see S2 for a list of SSi and cSSi). The results are shown in Figure 1 and compared to the similar analysis using the SOMSpec method (Pinto Corujo et al., 2022). The reconstructed spectra generated by the SELCON3 algorithm, displayed with the corresponding protein spectrum, are shown in S4 of the supplementary information for each of the members of the RaSP50 reference dataset.

Figure 1. The fractional difference between the calculated helix and sheet content for the 50 proteins in the RaSP50 reference set using SELCON (this work) and SOMSpec (previous work; Pinto Corujo et al., 2022). Helix denotes combined α-helix and 3₁₀ helix and sheet denotes β-sheet. In order to improve visibility of the smallest values, the scale has been limited to ±0.2, so the absolute values of the largest differences are not shown; see the text for a discussion of these outliers.

Visual inspection of Figure 1 indicates that SELCON and SOMSpec secondary structure predictions from IR data are of similar reliability. To quantify the performance of both the SELCON3 and the SOMSpec methods on the IR data, two metrics are calculated: the average of the absolute differences, avg(Δabs,IR) = Σi |ΔIR,i| / n, and the standard deviation of the differences, σ(ΔIR). For the IR SELCON3 analysis, n is 49 for RaSP50 as one protein is left out for the LOOV. Both are calculated individually for the helix and the sheet differences. A summary of this analysis is shown in Table 1.
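The two metrics can be computed as in the following sketch. The function name `difference_metrics` is hypothetical, and the use of the population standard deviation (NumPy's default, `ddof=0`) is an assumption, since the text does not say which estimator was used.

```python
import numpy as np


def difference_metrics(predicted, crystal):
    """Average absolute difference and standard deviation of the differences.

    Inputs are fractions (e.g. helix content) for the same list of proteins.
    """
    delta = np.asarray(predicted, dtype=float) - np.asarray(crystal, dtype=float)
    return {
        "avg_abs": float(np.mean(np.abs(delta))),  # sum_i |delta_i| / n
        "std": float(np.std(delta)),               # ddof=0 is an assumption
    }
```

The function would be called once with the helix fractions and once with the sheet fractions, since the text evaluates the two structure types separately.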

Table 1. The overall performance of SOMSpec and SELCON3 for secondary structure predictions

For both of these metrics, SELCON3 performs slightly better than SOMSpec. However, careful inspection of Figure 1 does reveal that in a few cases SOMSpec outperforms SELCON3 (e.g. F47) and for other cases the opposite is true (e.g. F42). The average SELCON helix deviation is 8% and the sheet deviation is 6%, slightly better than the 9% and 7% of SOMSpec. The standard deviations of the errors are also slightly tighter for SELCON, being 0.10 and 0.07 versus the 0.12 and 0.09 of SOMSpec.

Overall, there is no evidence for a general under- or over-prediction of secondary structure, although helical contents of highly helical proteins tend to be under-estimated. As noted previously, highly helical proteins with very similar spectra may have quite different amounts of helix. Haemoglobin (F4) is particularly problematic, with the crystal structure having 77% helix and the IR prediction being only 51%. Also, Metallothionein II (F50) and Soy Trypsin Inhibitor (F46) structures are not well predicted, essentially because they do not retain any helical structure, and for Metallothionein II not even sheet structure, in their crystal structure.

When combining IR and CD reference datasets for analysis, care should be taken not to emphasize one set over the other, which would skew the results. To this end, we reduced the IR dataset to include data in the 1600–1720 cm−1 wavenumber range in 2 cm−1 steps. The data outside this range are essentially zero after baseline subtraction, and 2 cm−1 steps are sufficient to represent the spectral features in the IR data. This brings the number of data points (wavenumbers) for the IR data down to 61, very similar to the 66 data points (wavelengths) in the CD reference dataset.
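The reduction to 61 points follows from the grid itself: (1720 − 1600)/2 + 1 = 61. A sketch of such a resampling step is shown below; the function name `resample_ir` is hypothetical, and the use of linear interpolation is an assumption, as the text does not state how the reduced grid was obtained.

```python
import numpy as np


def resample_ir(wavenumbers, absorbance, start=1600.0, stop=1720.0, step=2.0):
    """Resample an IR spectrum onto a regular grid from start to stop (inclusive)."""
    wn = np.asarray(wavenumbers, dtype=float)
    ab = np.asarray(absorbance, dtype=float)
    order = np.argsort(wn)  # np.interp requires increasing x values
    n_points = int(round((stop - start) / step)) + 1  # 61 for the defaults
    grid = start + step * np.arange(n_points)
    # Linear interpolation onto the new grid (an assumption for illustration).
    return grid, np.interp(grid, wn[order], ab[order])
```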

The LOOV analysis of this combined CD and IR data set containing 28 proteins, CD-IR28 (see S3 for the full list), was performed for a range of scaling factors (IRscale) of the IR data ranging from 0, that is pure CD data analysis, up to 20. We note that for the highest scaling factor, the IR spectrum is significantly larger than the CD spectrum for low helix content proteins. For each scaling factor, the difference between the SELCON3 results and the crystal structure was calculated for each protein, ΔCD-IR,i, and the metrics avg(Δabs,CD-IR) and σ(ΔCD-IR) derived. The ΔCD-IR,i for all proteins in CD-IR28 and for each scaling factor are shown in the Supplementary Information S5.

In Figure 2, both the standard deviations and the average absolute differences for helix and sheet are shown for each scaling factor. We note that for a scaling factor of zero, that is pure CD analysis, both metrics show a better performance for the determination of helical content compared to the sheet content. In contrast, the metrics in Table 1 show that IR has a better performance in determining the sheet content than the helical content. This is in line with the notion that CD is more sensitive to helical content, whereas IR is more sensitive to sheet content.

Figure 2. The standard deviations and the average absolute differences for helix and sheets for a range of scaling factors (IRscale) of the IR data in the combined CD and IR reference dataset.

From Figure 2 it is clear that the performance of SELCON3 is improved when including IR data in the analysis (IRscale > 0), not only for sheet content, as expected due to the higher sensitivity to sheets, but also for the helix content. For IRscale = 10, the avg(Δabs,CD-IR) values are 0.043/0.058 for helix/sheet, compared with the pure IR SELCON3 fit values of 0.08/0.06; that is, an improvement for helix without sacrificing the performance for sheets. From the analysis in Figure 2, taking both metrics into consideration, the best choice of IR scaling factor is 15, but 10 may also be considered a good choice.

To further analyse the performance of SELCON3 for the CD-IR28 reference set, the maximum of the absolute difference between the SELCON3 results and the cSS, max(|ΔCD-IR,i|), is shown in Figure 3. This metric is a measure of the worst performance of SELCON3 for helix, sheet and their average, and using this metric a scaling factor of 5 minimizes the outliers.
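The worst-case metric and the selection of the scaling factor that minimizes it can be sketched as below. Both function names are hypothetical, and the input is assumed to be a mapping from each IRscale value to the array of per-protein differences obtained from the LOOV analysis at that scale.

```python
import numpy as np


def worst_case_by_scale(differences_by_scale):
    """Map each scaling factor to max(|delta_i|) over all proteins."""
    return {
        scale: float(np.max(np.abs(np.asarray(delta, dtype=float))))
        for scale, delta in differences_by_scale.items()
    }


def best_scale(differences_by_scale):
    """Return the scaling factor with the smallest worst-case difference."""
    worst = worst_case_by_scale(differences_by_scale)
    return min(worst, key=worst.get)
```

Because this metric depends on a single protein at each scale, it is more sensitive to individual outliers than the average and standard deviation metrics, which is why it is considered alongside them rather than on its own.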

Figure 3. The maximum absolute difference between the SELCON3 calculated secondary structure and the crystal secondary structure, individually for helix and sheets, as well as their average.

In combination, these three metrics show that the best choice of IRscale is in the range 5–15. This range seems very reasonable when considering that an IRscale of 15 brings the IR spectrum magnitude close to that of the CD spectra for helix-rich proteins, and a scale of 5 brings the IR spectra close to the magnitude of low helix content proteins.

As the final metric considered to elucidate the optimum IRscale, we have calculated the root mean square deviation (RMSD) between the protein spectrum under analysis and the SELCON3 reconstructed spectrum. The average RMSD for all 28 proteins in CD-IR28 is shown in Figure 4 (top) for both the individual CD and IR parts and for the combined CD-IR spectrum.

Figure 4. The average (top) and the maximum of the RMSD (bottom) between the protein spectrum under analysis in LOOV and the SELCON3 reconstructed spectrum. The RMSD is shown for both the individual CD and IR parts and for the combined CD-IR spectrum.

For scaling factors up to 10, the average RMSD for the IR part of the spectrum increases with increasing scaling factor. If we consider the simple case where the reconstructed spectrum has the same shape and is only scaled, then the RMSD would increase linearly with the scaling factor. Hence, the general increase is well understood for the IR RMSD. However, the increase in the RMSD of the CD part of the spectrum is not a direct result of the IR scaling factor. To understand this increase, we must consider how reference spectra are selected in the SELCON3 method. First the reference spectra are sorted according to their RMSD with respect to the protein query spectrum, and then an increasing number of the reference spectra most similar to the query spectrum are included while searching for valid solutions (Sreerama and Woody, 1993). When we concatenate the IR spectrum to the CD spectrum, other reference spectra might be more similar, that is have lower RMSD with respect to the query spectrum, than those for the CD spectrum only. This gives rise to reconstructed CD spectra that are no longer optimized for the CD part of the combined spectrum, but rather optimized for both the CD and IR spectra. Therefore the overall RMSD increases with scaling factor, while still retaining a better prediction of the secondary structure, as evidenced by the metrics avg(Δabs,CD-IR) and σ(ΔCD-IR) in Figure 2.
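Two points from this paragraph can be illustrated with a short sketch: the RMSD scales linearly when both spectra are multiplied by the same factor, and the ordering of reference spectra by RMSD can change once a scaled IR part is appended. The functions `rmsd` and `rank_references` are hypothetical names, and only the sorting step is shown; the rest of the SELCON3 procedure is not reproduced here.

```python
import numpy as np


def rmsd(a, b):
    """Root mean square deviation between two spectra of equal length."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return float(np.sqrt(np.mean((a - b) ** 2)))


def rank_references(ref_spectra, query):
    """Return reference indices sorted from most to least similar to the query."""
    ref = np.asarray(ref_spectra, dtype=float)
    q = np.asarray(query, dtype=float)
    # One RMSD per reference spectrum (rows of ref).
    distances = np.sqrt(np.mean((ref - q) ** 2, axis=1))
    return np.argsort(distances)
```

As a toy example, a reference whose CD part matches the query exactly but whose IR part differs can drop behind a reference with a slightly worse CD match and an identical IR part once the IR part is scaled up, which is the re-ordering effect described above.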

To understand why the average RMSD is highest at a scaling factor of 10, the maximum RMSD among the proteins in CD-IR28 is shown in Figure 4 (bottom). The scaling factor of 10 is a clear outlier here, driven by a badly reconstructed CD spectrum for Chymotrypsinogen A. The combined CD and IR spectrum for Chymotrypsinogen A is shown in S6 for scaling factors of 10 and 15. For IRscale = 10 the selected reference spectra in SELCON3 give rise to wavelength-shifted peaks in the CD spectrum, resulting in a high RMSD, whereas the more dominant IR part of the spectrum at IRscale = 15 assists SELCON3 in selecting proteins that result in a better fit between the CD part of the Chymotrypsinogen A spectrum and its reconstructed spectrum.

Overall, the analysis of all the considered metrics indicates that a scaling factor of 15 provides reliable results for the predictive power of SELCON3 using combinations of CD and IR reference spectra. The SSCalcPy software allows the user to select other scaling factors, and in particular for proteins with low helix content, that is with lower magnitude CD signal, we suggest that lower scaling factors are examined and compared to higher scaling factor predictions for consistency.
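The suggested consistency check between scaling factors could be automated along the following lines. This is only a sketch: the function name `flag_inconsistent` is hypothetical, and the tolerance of 0.1 (in fractional structure content) is an arbitrary illustrative value, not a threshold recommended in this work.

```python
def flag_inconsistent(predictions_by_scale, tolerance=0.1):
    """Flag a structure estimate whose value varies strongly with IRscale.

    predictions_by_scale maps each scaling factor to the predicted fraction
    (e.g. helix content) for one protein. Returns (is_flagged, spread).
    """
    values = list(predictions_by_scale.values())
    spread = max(values) - min(values)
    # The tolerance is a hypothetical choice made for this illustration.
    return spread > tolerance, spread
```

A flagged result would then be inspected by hand, for example by comparing the reconstructed and measured spectra at each scaling factor.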

Conclusions

We have shown that the algorithm originally created by Sreerama and Woody (2000) to extract secondary structure estimates from circular dichroism spectra can be used with amide I infrared protein absorbance data, with slightly better average accuracy than any other method reported to date for analysis of IR spectra. The SELCON3 IR results are very similar to those we obtained previously using our self-organising map algorithm SOMSpec. Furthermore, the combination of CD and IR data was shown to give improved prediction accuracy in SELCON3 analysis compared to separate CD or IR analysis.

Open peer review

To view the open peer review materials for this article, please visit http://doi.org/10.1017/qrd.2025.4.

Supplementary material

The supplementary material for this article can be found at http://doi.org/10.1017/qrd.2025.4.

Acknowledgements

This project has received funding from the European Union’s Horizon 2020 research and innovation program MOSBRI under grant agreement no. 101004806 and the Australian Research Council Industrial Transformation Training Centre for Facilitated Advancement of Australia’s Bioactives (Grant IC210100040).

Author contribution

All authors contributed equally to the formulation of the research goals and aims and different aspects of the data generation and analysis. SVH coded SELCON3 in Python.

Competing interest

There are no conflicts to declare.

References

Abdul-Gader, A, Miles, AJ and Wallace, BA (2011) A reference dataset for the analyses of membrane protein secondary structures and transmembrane residues using circular dichroism spectroscopy. Bioinformatics 27, 1630–1636.
Ascoli, GA, Pergami, P, Luu, KX, et al. (1998) Use of CD and FT-IR to determine the secondary structure of purified proteins in the low-microgram range. Enantiomer 3(4–5), 371–381.
Bandekar, J (1992) Amide modes and protein conformation. Biochimica et Biophysica Acta (BBA) – Protein Structure and Molecular Enzymology 1120(2), 123–143. https://doi.org/10.1016/0167-4838(92)90261-B
Baumruk, V, Pancoska, P and Keiderling, TA (1996) Predictions of secondary structure using statistical analyses of electronic and vibrational circular dichroism and Fourier transform infrared spectra of proteins in H2O. Journal of Molecular Biology 259(4), 774–791. https://doi.org/10.1006/jmbi.1996.0357
Bruque, MG, Rodger, A, Hoffmann, SV, et al. (2024) Analysis of the structure of 14 therapeutic antibodies using circular dichroism spectroscopy. Analytical Chemistry 96(38), 15151–15159. https://doi.org/10.1021/acs.analchem.4c01882
Byler, DM and Susi, H (1986) Examination of the secondary structure of proteins by deconvolved FTIR spectra. Biopolymers 25(3), 469–487. https://doi.org/10.1002/bip.360250307
Calero, M and Gasset, M (2005) Fourier transform infrared and circular dichroism spectroscopies for amyloid studies. In Sigurdsson, EM (ed.) Amyloid Proteins: Methods and Protocols. Humana Press, pp. 29151.
De Meutter, J and Goormaghtigh, E (2021) FTIR imaging of protein microarrays for high throughput secondary structure determination. Analytical Chemistry 93(8), 3733–3741. https://doi.org/10.1021/acs.analchem.0c03677
Goormaghtigh, E, Ruysschaert, J-M and Raussens, V (2006) Evaluation of the information content in infrared spectra for protein secondary structure determination. Biophysical Journal 90(8), 2946–2957. https://doi.org/10.1529/biophysj.105.072017
Guerra, JPL, Blanchet, CE, Vieira, BJC, et al. (2023) Controlled modulation of the dynamics of the Deinococcus grandis Dps N-terminal tails by divalent metals. Protein Science 32(2), e4567. https://doi.org/10.1002/pro.4567
Hall, V, Nash, A, Hines, E and Rodger, A (2013) Elucidating protein secondary structure with circular dichroism and a neural network. Journal of Computational Chemistry 34, 2774–2786. https://doi.org/10.1002/jcc.23456
Hall, V, Sklepari, M and Rodger, A (2014) Protein secondary structure prediction from circular dichroism spectra using a self-organizing map with concentration correction. Chirality 26, 471–482. https://doi.org/10.1002/chir.22338
Hoffmann, SV and Jones, NC (2024a) Zenodo https://doi.org/10.5281/zenodo.13995323
Hoffmann, SV and Jones, NC (2024b) GitHub github.com/AU-SRCD/SSCalcPy
Kabsch, W and Sander, C (1983) Dictionary of protein secondary structure: Pattern recognition of hydrogen-bonded and geometrical features. Biopolymers 22, 2577–2637.
Kauppinen, JK, Moffatt, DJ, Mantsch, HH and Cameron, DG (1981) Fourier self-deconvolution: A method for resolving intrinsically overlapped bands. Applied Spectroscopy 35(3), 271–276.
Kong, J and Yu, S (2007) Fourier transform infrared spectroscopic analysis of protein secondary structures protein FTIR data analysis and band assignment. Acta Biochimica et Biophysica Sinica 39, 549–559.
Krimm, S and Bandekar, J (1986) Vibrational spectroscopy and conformation of peptides, polypeptides, and proteins. Advances in Protein Chemistry 38, 181–364.
Krokengen, OC, Touma, C, Mularski, A, et al. (2024) The cytoplasmic tail of myelin protein zero induces morphological changes in lipid membranes. Biochimica et Biophysica Acta (BBA) – Biomembranes 1866(7), 184368. https://doi.org/10.1016/j.bbamem.2024.184368
Lees, JG, Miles, AJ, Wien, F and Wallace, BA (2006) A reference database for circular dichroism spectroscopy covering fold and secondary structure space. Bioinformatics 22(16), 1955–1962. https://doi.org/10.1093/bioinformatics/btl327
Lobley, A, Whitmore, L and Wallace, BA (2002) DICHROWEB: An interactive website for the analysis of protein secondary structure from circular dichroism spectra. Bioinformatics 18(1), 211–212. https://doi.org/10.1093/bioinformatics/18.1.211
Maddams, WF and Tooke, PB (1982) Quantitative conformational studies on poly(vinyl chloride). Journal of Macromolecular Science: Part A – Chemistry 17(6), 951–968. https://doi.org/10.1080/00222338208056495
Miles, AJ, Ramalli, SG and Wallace, BA (2022) DichroWeb, a website for calculating protein secondary structure from circular dichroism spectroscopic data. Protein Science 31(1), 37–46. https://doi.org/10.1002/pro.4153
Oberg, KA, Ruysschaert, J-M and Goormaghtigh, E (2003) Rationally selected basis proteins: A new approach to selecting proteins for spectroscopic secondary structure analysis. Protein Science 12, 2015–2031.
Oberg, KA, Ruysschaert, J-M and Goormaghtigh, E (2004) The optimization of protein secondary structure determination with infrared and circular dichroism spectra. European Journal of Biochemistry 271, 2937–2948.
Pancoska, P, Yasui, SC and Keiderling, TA (1991) Statistical analyses of the vibrational circular dichroism of selected proteins and relationship to secondary structures. Biochemistry 30(20), 5089–5103. https://doi.org/10.1021/bi00234a036
Pinto Corujo, M, Olamoyesan, A, Tukova, A, et al. (2022) SOMSpec as a general purpose validated self-organising map tool for rapid protein secondary structure prediction from infrared absorbance data. Frontiers in Chemistry 9. https://doi.org/10.3389/fchem.2021.784625
Ramalli, SG, Miles, AJ, Janes, RW and Wallace, BA (2022) The PCDDB (Protein Circular Dichroism Data Bank): A bioinformatics resource for protein characterisations and methods development. Journal of Molecular Biology 434(11), 167441. https://doi.org/10.1016/j.jmb.2022.167441
Sreerama, N and Woody, RW (1993) A self-consistent method for the analysis of protein secondary structure from circular dichroism. Analytical Biochemistry 209, 32–44.
Sreerama, N and Woody, RW (2000) Estimation of protein secondary structure from circular dichroism spectra: Comparison of CONTIN, SELCON, and CDSSTR methods with an expanded reference set. Analytical Biochemistry 287, 252–260. https://doi.org/10.1006/abio.2000.4880
Susi, H and Byler, MD (1983) Protein structure by Fourier transform infrared spectroscopy: Second derivative spectra. Biochemical and Biophysical Research Communications 115(1), 391–397. https://doi.org/10.1016/0006-291X(83)91016-1
Whitmore, L, Miles, AJ, Mavridis, L, Janes, RW and Wallace, BA (2016) PCDDB: New developments at the protein circular dichroism data bank. Nucleic Acids Research 45(D1), D303–D307. https://doi.org/10.1093/nar/gkw796
Whitmore, L and Wallace, BA (2004) DICHROWEB, an online server for protein secondary structure analyses from circular dichroism spectroscopic data. Nucleic Acids Research 32, W668–73. https://doi.org/10.1093/nar/gkh371
Whitmore, L and Wallace, BA (2008) Protein secondary structure analyses from circular dichroism spectroscopy: Methods and reference databases. Biopolymers 89(5), 392–400. https://doi.org/10.1002/bip.20853
Whitmore, L, Woollett, B, Miles, AJ, Klose, D, Janes, RW and Wallace, BA (2011) PCDDB: The protein circular dichroism data bank, a repository for circular dichroism spectral and metadata. Nucleic Acids Research 39, D480–D486.
Yang, H, Yang, S, Kong, J, Dong, A and Yu, S (2015) Obtaining information about protein secondary structures in aqueous solution using Fourier transform IR spectroscopy. Nature Protocols 10, 382. https://doi.org/10.1038/nprot.2015.024

Figure 1. The fractional difference between the calculated helix and sheet content for the 50 proteins in the RaSP50 reference set using SELCON (this work) and SOMSpec (previous work; Pinto Corujo et al., 2022). Helix denotes combined α-helix and 3₁₀-helix, and sheet denotes β-sheet. To improve visibility of the smallest values, the scale has been limited to ±0.2, so the absolute values of the largest differences are not shown; see the text for a discussion of these outliers.

Table 1. The overall performance of SOMSpec and SELCON3 for secondary structure predictions

Figure 2. The standard deviations and the average absolute differences for helix and sheets for a range of scaling factors (IRscale) of the IR data in the combined CD and IR reference dataset.
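
The role of the scaling factor can be illustrated with a minimal sketch. It assumes, as an illustration only, that a combined spectrum is formed by rescaling the IR amide I absorbance to a chosen maximum (IRscale) and appending it to the CD spectrum in Δε per residue. The function name `combine_cd_ir` and the example values are hypothetical and are not taken from the authors' SELCON3 code.

```python
import numpy as np

def combine_cd_ir(cd, ir, ir_scale=15.0):
    """Append an IR spectrum, rescaled to a maximum of ir_scale, to a CD spectrum.

    Assumption: the combined data vector is a simple concatenation of the two parts.
    """
    cd = np.asarray(cd, dtype=float)
    ir = np.asarray(ir, dtype=float)
    peak = ir.max()
    if peak <= 0:
        raise ValueError("IR spectrum must have a positive maximum absorbance")
    # Rescale so that the largest IR absorbance equals ir_scale.
    return np.concatenate([cd, ir * (ir_scale / peak)])

# Hypothetical values: three CD points (delta-epsilon per residue), three IR points.
cd = [-4.0, -2.0, 6.0]
ir = [0.1, 0.5, 0.25]
combined = combine_cd_ir(cd, ir, ir_scale=15.0)
```

A larger `ir_scale` gives the IR part more weight in any least-squares style fit of the combined vector, which is why the choice of scaling factor affects the predictions.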

Figure 3. The maximum absolute difference between the SELCON3 calculated secondary structure and the crystal secondary structure, individually for helix and sheets, as well as their average.
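
As a hedged illustration of the difference measures named in these captions, the sketch below computes the average and maximum absolute difference between predicted and crystal-structure fractions for one structure class. The helper name `difference_stats` and the numbers are hypothetical; the authors' exact definitions may differ.

```python
import numpy as np

def difference_stats(predicted, crystal):
    """Return (average absolute difference, maximum absolute difference)."""
    diff = np.abs(np.asarray(predicted, dtype=float) - np.asarray(crystal, dtype=float))
    return float(diff.mean()), float(diff.max())

# Hypothetical helix fractions for three proteins.
helix_predicted = [0.70, 0.30, 0.10]
helix_crystal = [0.75, 0.28, 0.16]
mean_abs, max_abs = difference_stats(helix_predicted, helix_crystal)
```

The same function can be applied to the sheet fractions, and the two results averaged to mirror the "average" series in the figure.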

Figure 4. The average (top) and the maximum of the RMSD (bottom) between the protein spectrum under analysis in LOOV and the SELCON3 reconstructed spectrum. The RMSD is shown for both the individual CD and IR parts and for the combined CD-IR spectrum.
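
The RMSD comparison described in this caption can be sketched as follows, assuming the combined spectrum stores the CD points first and the IR points after them. The names `rmsd`, `rmsd_by_part` and `n_cd` are hypothetical and the values are illustrative only.

```python
import numpy as np

def rmsd(a, b):
    """Root-mean-square deviation between two equally sized spectra."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return float(np.sqrt(np.mean((a - b) ** 2)))

def rmsd_by_part(measured, reconstructed, n_cd):
    """RMSD for the CD part, the IR part and the whole combined spectrum.

    Assumption: the first n_cd points are CD and the remainder are IR.
    """
    measured = np.asarray(measured, dtype=float)
    reconstructed = np.asarray(reconstructed, dtype=float)
    return {
        "cd": rmsd(measured[:n_cd], reconstructed[:n_cd]),
        "ir": rmsd(measured[n_cd:], reconstructed[n_cd:]),
        "combined": rmsd(measured, reconstructed),
    }

# Hypothetical combined spectrum: two CD points followed by two IR points.
measured = [1.0, 2.0, 3.0, 4.0]
reconstructed = [1.0, 4.0, 3.0, 1.0]
parts = rmsd_by_part(measured, reconstructed, n_cd=2)
```

In a leave-one-out run, these values would be collected per protein and then averaged or maximised over the reference set.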

Supplementary material: File

Hoffmann et al. supplementary material


Author comment: Protein secondary structure determined from independent and integrated infra-red absorbance and circular dichroism data using the algorithm SELCON — R0/PR1

Comments

Dear Editor

Apologies for the lateness of this manuscript - we took a while to get it right. The manuscript shows the integration of circular dichroism and infrared absorbance data for protein secondary structure fitting. We have created a new version of Bob Woody’s algorithm SELCON3 which can now be used by anyone and applied to CD or IR or combined data sets. We have done some careful analysis.

We do not have any competing interests.

I believe I have chosen the correct corresponding author for your system. Oddly the only place ANU registers on your system is with one subunit of the university so I transferred corresponding author to Aarhus.

Please note dichroism is mis-spelled in the key words for QRB-D.

Best wishes

Alison

Review: Protein secondary structure determined from independent and integrated infra-red absorbance and circular dichroism data using the algorithm SELCON — R0/PR2

Conflict of interest statement

I have collaborated and published research with the authors previously.

Comments

Infrared spectroscopy (IR) is a versatile technique that is coming back into the mainstream with more sensitive instruments and more sophisticated data analysis methods becoming available. This is also driven by a need for more methods capable of characterisation of the emerging biopharmaceuticals and biosimilars.

This paper focuses on extending the existing SELCON algorithm, commonly used for analysis of circular dichroism data (CD), to include both CD and IR data in order to improve the prediction of protein secondary structure. The authors provide a new Python version of the SELCON3 algorithm, which has previously only been available in Fortran. This is an important contribution, as the algorithm is still commonly used and the lack of compatibility of the original code with modern workstations has been increasingly challenging.

The authors also examine combined analysis of CD and IR data and explore different scaling factors to find the best performance without biasing the data. I find the analysis robust and the results will be useful to many research groups and industries where knowing the secondary structure of proteins in solution is of great importance.

I recommend publication.

Review: Protein secondary structure determined from independent and integrated infra-red absorbance and circular dichroism data using the algorithm SELCON — R0/PR3

Conflict of interest statement

Reviewer declares none.

Comments

In the manuscript, Hoffmann et al. propose a method for the combined analysis of CD and IR data to predict protein secondary structure. The method was validated using 28 proteins recognized for having both high-quality CD and IR spectra, with spectra respectively obtained from PCDDB (SP175 and SP180 datasets) and RaSP50 (a dataset of 50 rationally selected proteins based on the quality of their crystal structures and commercial protein preparations). The authors are to be commended for the rigorous and well-executed work presented, which showcases a thoughtful approach to integrating these complementary spectroscopic techniques.

The results clearly demonstrate that a combined approach improves secondary structure prediction compared to using a single spectroscopic tool. This improvement stems from the differential sensitivity of the spectroscopies used towards secondary structure elements, specifically α-helix and β-sheet structures.

As such, the presented work is of significant value to researchers in protein/peptide biochemistry, biophysics, pharmacology, and related fields, as it provides an additional set of tools to probe structural dynamics and functional correlations.

In an additional note to the authors, and for the benefit of all readers, we recommend including a brief summary in the Supplementary Information outlining the key aspects of proper data acquisition and baseline subtraction procedures, which are particularly critical for IR spectral analysis. In this context, Table S2.1 lists the RaSP50 reference set, “50 Dried Thin Film Proteins,” which should be cross-checked with reference 14 (10.1110/ps.0354703). In that reference, it is stated that “(IR) Spectra were collected using 3% protein stock solutions (¹H₂O) placed between CaF₂ windows separated with a 5-μm Teflon spacer.”

Recommendation: Protein secondary structure determined from independent and integrated infra-red absorbance and circular dichroism data using the algorithm SELCON — R0/PR4

Comments

No accompanying comment.

Decision: Protein secondary structure determined from independent and integrated infra-red absorbance and circular dichroism data using the algorithm SELCON — R0/PR5

Comments

No accompanying comment.

Author comment: Protein secondary structure determined from independent and integrated infra-red absorbance and circular dichroism data using the algorithm SELCON — R1/PR6

Comments

ATTN Prof. Dr. Bengt Nordén, Associate Editor, QRB Discovery

Regarding: QRBD-2024-0051

“Protein secondary structure determined from independent and integrated infra-red absorbance and circular dichroism data using the algorithm SELCON”

Thank you very much for considering the above-mentioned paper for publication in QRB Discovery. We are delighted to read the positive evaluations of the manuscript and answer the reviewers’ questions below. The questions and comments from the reviewers are shown in italic.

R1:

Infrared spectroscopy (IR) is a versatile technique that is coming back into the mainstream with more sensitive instruments and more sophisticated data analysis methods becoming available. This is also driven by a need for more methods capable of characterisation of the emerging biopharmaceuticals and biosimilars.

This paper focuses on extending the existing SELCON algorithm, commonly used for analysis of circular dichroism data (CD), to include both CD and IR data in order to improve the prediction of protein secondary structure. The authors provide a new Python version of the SELCON3 algorithm, which has previously only been available in Fortran. This is an important contribution, as the algorithm is still commonly used and the lack of compatibility of the original code with modern workstations has been increasingly challenging.

The authors also examine combined analysis of CD and IR data and explore different scaling factors to find the best performance without biasing the data. I find the analysis robust and the results will be useful to many research groups and industries where knowing the secondary structure of proteins in solution is of great importance.

I recommend publication.

Answer:

We thank the reviewer for the positive evaluation and are very pleased that our new Python code and the data analysis are well received.

R2:

In the manuscript, Hoffmann et al. propose a method for the combined analysis of CD and IR data to predict protein secondary structure. The method was validated using 28 proteins recognized for having both high-quality CD and IR spectra, with spectra respectively obtained from PCDDB (SP175 and SP180 datasets) and RaSP50 (a dataset of 50 rationally selected proteins based on the quality of their crystal structures and commercial protein preparations). The authors are to be commended for the rigorous and well-executed work presented, which showcases a thoughtful approach to integrating these complementary spectroscopic techniques.

The results clearly demonstrate that a combined approach improves secondary structure prediction compared to using a single spectroscopic tool. This improvement stems from the differential sensitivity of the spectroscopies used towards secondary structure elements, specifically α-helix and β-sheet structures.

As such, the presented work is of significant value to researchers in protein/peptide biochemistry, biophysics, pharmacology, and related fields, as it provides an additional set of tools to probe structural dynamics and functional correlations.

Answer:

We thank the reviewer for the insight and for putting the manuscript into perspective.

In an additional note to the authors, and for the benefit of all readers, we recommend including a brief summary in the Supplementary Information outlining the key aspects of proper data acquisition and baseline subtraction procedures, which are particularly critical for IR spectral analysis. In this context, Table S2.1 lists the RaSP50 reference set, “50 Dried Thin Film Proteins,” which should be cross-checked with reference 14 (10.1110/ps.0354703). In that reference, it is stated that “(IR) Spectra were collected using 3% protein stock solutions (¹H₂O) placed between CaF₂ windows separated with a 5-μm Teflon spacer.”

Answer:

We thank the reviewer for the observation about the apparent discrepancy between the Table S2.1 listing of “50 Dried Thin Film Proteins” and SI reference 14 [10.1110/ps.0354703, Oberg et al. 2003]. SI reference 14 is the original reference for the RaSP50 reference dataset and, as the reviewer correctly points out, that publication used liquid samples for the IR data. RaSP50 was later re-measured by the same group as dry films on an ATR crystal, as reported in the 2006 paper by Goormaghtigh et al. [10.1529/biophysj.105.072017], and these data were used in the Pinto Corujo et al. publication in Frontiers in Chemistry [10.3389/fchem.2021.784625]. The RaSP50 data we analyze in our current manuscript are from the Supplementary Material of the latter publication.

We have amended the Table S2.1 caption to clarify this aspect:

“Table S2.1. The RaSP50¹⁴ reference set proteins with the crystal secondary structure (cSSi) annotations (F1-F50) used in¹⁵ where SOMSpec was used to analyse infra-red (IR) data and the SELCON3 prediction (SSi). The reference set has been reordered and numbered to have a decreasing amount of helical structure in the crystal structure as in reference 15. Although the original RaSP50 publication¹⁴ used IR spectra of liquid samples, the analysis presented here is based on a RaSP50 reference set of dried samples¹⁵,¹⁶.”

We also acknowledge that proper data acquisition and baseline subtraction are important aspects of IR spectroscopy on proteins. The procedure is described in the 2006 paper by Goormaghtigh et al. [10.1529/biophysj.105.072017] (reference 20 in our main manuscript). As we did not collect the RaSP50 data set ourselves, we find it most appropriate to refer to this publication, and have included the following clarifying description in the Methods section:

“A detailed description for the method of sample preparation and data collection can be found in the 2006 paper of Goormaghtigh et al.²⁰”

and we have added the Goormaghtigh et al. reference when we mention RaSP50 the first time in the Methods section.

Recommendation: Protein secondary structure determined from independent and integrated infra-red absorbance and circular dichroism data using the algorithm SELCON — R1/PR7

Comments

No accompanying comment.

Decision: Protein secondary structure determined from independent and integrated infra-red absorbance and circular dichroism data using the algorithm SELCON — R1/PR8

Comments

No accompanying comment.