Chapter 5 is dedicated to the most important part of predictive modeling for biomarker discovery based on high-dimensional data: multivariate feature selection. When dealing with sparse biomedical data whose dimensionality is much higher than the number of training observations, the crucial issue is to overcome the curse of dimensionality by using methods capable of elevating signal (predictive information) above the overwhelming noise. One way of doing this is to perform many (hundreds or thousands of) parallel feature selection experiments based on different random subsamples of the original training data and then aggregate their results (for example, by analyzing the distribution of variables among the results of those parallel experiments). Two designs of such parallel feature selection experiments are discussed in detail: one based on recursive feature elimination, and the other on stepwise hybrid selection with T2. The chapter also includes descriptions of three evolutionary feature selection algorithms: simulated annealing, genetic algorithms, and particle swarm optimization.
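A minimal sketch of the subsample-and-aggregate design described above, using scikit-learn's RFE with a logistic regression base learner as a stand-in for the chapter's exact protocol; the data set, number of runs, subsample fraction, and number of features retained per run are all illustrative assumptions.

```python
# Many recursive feature elimination (RFE) runs on random subsamples,
# with features ranked by how often they survive across runs.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X, y = make_classification(n_samples=100, n_features=500,
                           n_informative=10, random_state=0)

n_runs, n_keep = 200, 20          # hundreds of parallel experiments (assumed)
counts = np.zeros(X.shape[1])

for _ in range(n_runs):
    idx = rng.choice(len(y), size=int(0.8 * len(y)), replace=False)
    rfe = RFE(LogisticRegression(max_iter=1000),
              n_features_to_select=n_keep, step=0.2)
    rfe.fit(X[idx], y[idx])
    counts += rfe.support_        # tally which features survive this run

# Features selected most frequently across subsamples carry the signal.
top = np.argsort(counts)[::-1][:n_keep]
print("most frequently selected features:", top)
```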
Multiple linear regression generalizes straight-line regression to allow multiple explanatory (or predictor) variables, treated in this chapter under the normal errors assumption. The focus may be on accurate prediction, or, alternatively or additionally, on the regression coefficients themselves. Simplistic interpretations of coefficients can be grossly misleading. Later chapters elaborate on the ideas and methods developed in this chapter, applying them in new contexts. The attaching of causal interpretations to model coefficients must be justified both by reference to subject-area knowledge and by careful checks to ensure that they are not artefacts of the correlation structure. Attention is given to regression diagnostics and to the assessment and comparison of models. Variable selection strategies can readily over-fit; hence the importance of training/test approaches and of cross-validation. The potential for errors in x to seriously bias regression coefficients is demonstrated. Strong multicollinearity leads to large variance inflation factors, as illustrated in the sketch below.
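A minimal sketch of the variance inflation factor computation behind that last point, with VIF_j = 1 / (1 − R_j²) obtained by regressing predictor j on the remaining predictors; the toy data, in which x2 is nearly collinear with x1, are an assumption.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 200
x1 = rng.normal(size=n)
x2 = x1 + 0.1 * rng.normal(size=n)   # nearly collinear with x1
x3 = rng.normal(size=n)
X = np.column_stack([x1, x2, x3])

def vif(X, j):
    """VIF for column j: regress X[:, j] on the remaining columns."""
    others = np.delete(X, j, axis=1)
    A = np.column_stack([np.ones(len(X)), others])
    coef, *_ = np.linalg.lstsq(A, X[:, j], rcond=None)
    resid = X[:, j] - A @ coef
    r2 = 1 - resid.var() / X[:, j].var()
    return 1.0 / (1.0 - r2)

print([round(vif(X, j), 1) for j in range(X.shape[1])])
# x1 and x2 show large VIFs; x3 stays near 1.
```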
This chapter explains weighting in a manner that allows us to appreciate both the power and the vulnerability of the technique and, by extension, of other techniques that rely on similar assumptions. Once we understand how weighting works, we will better understand when it works. The chapter opens by discussing weighting in general terms; the subsequent sections get more granular. Sections 3.2 and 3.3 cover widely used weighting techniques: cell-weighting and raking. Section 3.4 covers variable selection, a topic that may well be more important than the choice of weighting technique. Section 3.5 covers the effect of weighting on precision, a topic that frequently gets lost in the reporting of polls. The chapter mixes intuitive and somewhat technical descriptions of weighting; the technical details in Sections 3.2 and 3.3 can be skimmed by readers focused on the big picture of how weighting works.
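A minimal sketch of raking (iterative proportional fitting), one of the two techniques mentioned above: weights are scaled one margin at a time until the weighted sample margins match known population margins. The two demographic margins, their target shares, and the iteration count are assumptions.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 1000
sex = rng.integers(0, 2, size=n)        # two categories
age = rng.integers(0, 3, size=n)        # three age bands
targets_sex = np.array([0.5, 0.5])      # assumed population shares
targets_age = np.array([0.3, 0.4, 0.3])

w = np.ones(n)
for _ in range(50):                     # iterate until margins converge
    for var, targets in ((sex, targets_sex), (age, targets_age)):
        total = w.sum()
        for k, t in enumerate(targets):
            mask = var == k
            w[mask] *= t * total / w[mask].sum()

# weighted margins now match the targets (up to convergence tolerance)
print([w[sex == k].sum() / w.sum() for k in range(2)])
print([w[age == k].sum() / w.sum() for k in range(3)])
```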
Chapter 3 introduces readers to correlation and regression analysis. Methods of correlation answer the question of what portion of variance two variables share. Regression uses, in most cases, metric dependent and metric or categorical independent variables. The OLS solution for the standard case of one predictor and one outcome variable is derived. Based on this derivation, characteristics of the model parameters are explained. Real-world data examples are given for simple regression (one predictor and one outcome variable) and for multiple regression (multiple predictors and one outcome variable). In addition, this chapter describes GLM approaches to curvilinear regression (regression lines are curved rather than straight, to approximate non-linear variable relations), to curvilinear regression of repeated observations, to symmetric regression (where the regression of Y on X results in the same solution as the regression of X on Y), to best-subset and stepwise selection regression (which result in an optimal selection from multiple predictors), and to the recently developed direction dependence analysis for evaluating hypotheses concerning the causal flow of a variable association.
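A minimal numerical sketch of the OLS solution the chapter derives for the one-predictor case, slope = cov(x, y) / var(x) and intercept = mean(y) − slope · mean(x); the simulated data are an assumption.

```python
import numpy as np

rng = np.random.default_rng(3)
x = rng.normal(size=100)
y = 2.0 * x + 1.0 + rng.normal(scale=0.5, size=100)

slope = np.cov(x, y, ddof=1)[0, 1] / np.var(x, ddof=1)
intercept = y.mean() - slope * x.mean()
print(slope, intercept)   # close to the true values 2.0 and 1.0

# The same numbers fall out of the general least-squares machinery:
A = np.column_stack([np.ones_like(x), x])
print(np.linalg.lstsq(A, y, rcond=None)[0])
```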
About 30% of patients drop out of cognitive–behavioural therapy (CBT), which has implications for psychiatric and psychological treatment. Findings concerning drop-out remain heterogeneous.
Aims
This paper aims to compare different machine-learning algorithms using nested cross-validation, evaluate their benefit in naturalistic settings, and identify the best model as well as the most important variables.
Method
The data-set consisted of 2543 out-patients treated with CBT. Assessment took place before session one. Twenty-one algorithms and ensembles were compared. Two performance measures, the Brier score and the area under the curve (AUC), were used for evaluation (a schematic nested cross-validation setup is sketched after this abstract).
Results
The best model was an ensemble that used Random Forest and nearest-neighbour modelling. During the training process, it was significantly better than generalised linear modelling (GLM) (Brier score: d = −2.93, 95% CI −3.95 to −1.90; AUC: d = 0.59, 95% CI 0.11 to 1.06). In the holdout sample, the ensemble correctly identified 63.4% of cases, whereas the GLM identified only 46.2% correctly. The most important predictors were lower education, lower scores on the Personality Style and Disorder Inventory (PSSI) compulsive scale, younger age, and higher scores on the PSSI negativistic and PSSI antisocial scales as well as on the Brief Symptom Inventory (BSI) additional scale (mean of the four additional items) and the BSI overall scale.
Conclusions
Machine learning improves drop-out predictions. However, not all algorithms are suited to naturalistic data-sets and binary events. Tree-based and boosted algorithms that include a variable selection process seem well suited, whereas more advanced algorithms such as neural networks do not.
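A minimal sketch of a nested cross-validation comparison of the kind described in the Method section, with scikit-learn stand-ins (logistic regression for the GLM, a random forest for the tree-based learner); the study's 21 algorithms, data set, and tuning grids are not reproduced, and all grids and fold counts here are assumptions.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import brier_score_loss, roc_auc_score
from sklearn.model_selection import GridSearchCV, StratifiedKFold

X, y = make_classification(n_samples=600, n_features=20, random_state=0)

outer = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
inner = StratifiedKFold(n_splits=3, shuffle=True, random_state=0)

models = {
    "GLM": GridSearchCV(LogisticRegression(max_iter=1000),
                        {"C": [0.1, 1.0, 10.0]}, cv=inner),
    "RF": GridSearchCV(RandomForestClassifier(random_state=0),
                       {"max_depth": [3, None]}, cv=inner),
}

for name, model in models.items():
    briers, aucs = [], []
    for tr, te in outer.split(X, y):
        model.fit(X[tr], y[tr])            # inner CV tunes hyperparameters
        p = model.predict_proba(X[te])[:, 1]
        briers.append(brier_score_loss(y[te], p))
        aucs.append(roc_auc_score(y[te], p))
    print(f"{name}: Brier={np.mean(briers):.3f}, AUC={np.mean(aucs):.3f}")
```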
Identifying predictors of patient outcomes evaluated over time may require modeling interactions among variables while addressing within-subject correlation. Generalized linear mixed models (GLMMs) and generalized estimating equations (GEEs) address within-subject correlation, but identifying interactions can be difficult if not hypothesized a priori. We evaluate the performance of several variable selection approaches for clustered binary outcomes to provide guidance for choosing between the methods.
Methods:
We conducted simulations comparing stepwise selection, penalized GLMM, boosted GLMM, and boosted GEE for variable selection, considering main effects and two-way interactions, in data with repeatedly measured binary outcomes, and evaluated a two-stage approach to reduce bias and error in parameter estimates (a GEE sketch follows this abstract). We compared these approaches in real data applications: hypothermia during surgery and treatment response in lupus nephritis.
Results:
Penalized and boosted approaches recovered correct predictors and interactions more frequently than stepwise selection. Penalized GLMM recovered correct predictors more often than boosting, but included many spurious predictors. Boosted GLMM yielded parsimonious models and identified correct predictors well at large sample and effect sizes, but required excessive computation time. Boosted GEE was computationally efficient and selected relatively parsimonious models, offering a compromise between computation and parsimony. The two-stage approach reduced the bias and error in regression parameters in all approaches.
Conclusion:
Penalized and boosted approaches are effective for variable selection in data with clustered binary outcomes. The two-stage approach reduces bias and error and should be applied regardless of method. We provide guidance for choosing the most appropriate method in real applications.
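A minimal sketch of the kind of marginal model the boosted GEE approach builds on: a GEE fit to repeatedly measured binary outcomes with main effects and a two-way interaction, using statsmodels with an exchangeable working correlation. The simulated data and candidate terms are assumptions; the boosting and two-stage steps themselves are not shown.

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(4)
n_subj, n_rep = 100, 4
subj = np.repeat(np.arange(n_subj), n_rep)
x1 = rng.normal(size=n_subj * n_rep)
x2 = rng.normal(size=n_subj * n_rep)
u = np.repeat(rng.normal(scale=0.8, size=n_subj), n_rep)  # within-subject correlation
eta = -0.5 + 1.0 * x1 + 0.5 * x1 * x2 + u
y = rng.binomial(1, 1 / (1 + np.exp(-eta)))

# main effects plus the two-way interaction, exchangeable working correlation
X = sm.add_constant(np.column_stack([x1, x2, x1 * x2]))
model = sm.GEE(y, X, groups=subj, family=sm.families.Binomial(),
               cov_struct=sm.cov_struct.Exchangeable())
print(model.fit().summary())
```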
This chapter focuses on model evaluation and selection in Hierarchical Modelling of Species Communities (HMSC). It starts by noting that even if there are automated procedures for model selection, the most important step is actually taken by the ecologist when deciding what kinds of models will be fitted. The chapter then discusses different ways of measuring model fit by contrasting model predictions with the observed data, as well as the use of information criteria as a method for evaluating model fit. The chapter first discusses general methods that can be used to compare models that differ either in their predictors or in their structure, e.g. models with different sets of environmental covariates, models with and without spatial random effects, models with and without traits or phylogenetic information, or models that differ in their prior distributions. The chapter then presents specific methods for variable selection, aimed at comparing models that are structurally identical but differ in the included environmental covariates: variable selection by the spike-and-slab prior approach, and reduced rank regression, which aims to combine predictors so as to reduce their dimensionality.
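A minimal sketch of reduced rank regression, the second of the two variable selection devices mentioned above: fit the full coefficient matrix by least squares, then project the fitted values onto a few leading singular axes so that predictors act only through low-dimensional combinations. The dimensions and simulated data are assumptions, and the spike-and-slab approach is not shown.

```python
import numpy as np

rng = np.random.default_rng(5)
n, p, m, rank = 200, 15, 8, 2           # sites, covariates, species, target rank
X = rng.normal(size=(n, p))
B_true = rng.normal(size=(p, rank)) @ rng.normal(size=(rank, m))
Y = X @ B_true + 0.5 * rng.normal(size=(n, m))

B_ols, *_ = np.linalg.lstsq(X, Y, rcond=None)   # unconstrained fit, shape (p, m)
U, s, Vt = np.linalg.svd(X @ B_ols, full_matrices=False)
# project fitted values onto the top-`rank` axes, then map back to coefficients
P = Vt[:rank].T @ Vt[:rank]
B_rrr = B_ols @ P

print(np.linalg.matrix_rank(B_rrr))      # 2: predictors act through 2 combinations
```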
We present a model of political networks that integrates both the choice of trade partners (the extensive margin) and trade volumes (the intensive margin). Our model predicts that regimes secure in their survival, including democracies as well as some consolidated authoritarian regimes, will trade more on the extensive margin than vulnerable autocracies, which will block trade in products that would expand interpersonal contact among their citizens. We apply a two-stage Bayesian LASSO estimator to detailed measures of institutional features and highly disaggregated product-level trade data encompassing 131 countries over a half century. Consistent with our model, we find that (a) political institutions matter for the extensive margin of trade but not for the intensive margin and (b) the effects of political institutions on the extensive margin of trade vary across products, falling most heavily on those goods that involve extensive interpersonal contact.
Selecting important variables and estimating coordinate covariation have received considerable attention in the current big-data deluge. Previous work shows that the gradient of the regression function (the objective function in regression and classification problems) can provide both types of information. In this paper, an algorithm for learning this gradient function from non-identically distributed data is proposed. Under some mild assumptions on the data distribution and the model parameters, a result on its learning rate is established, which provides a theoretical guarantee for using this method in dynamical gene selection and in network security for the recognition of malicious online attacks.
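A minimal sketch of the underlying idea that the gradient of the regression function carries variable selection information: estimate local gradients by kernel-weighted local linear fits and rank coordinates by their accumulated gradient norms. This illustrates the principle only; the paper's algorithm for non-identically distributed data is not reproduced, and the bandwidth and data here are assumptions.

```python
import numpy as np

rng = np.random.default_rng(6)
n, p = 300, 10
X = rng.normal(size=(n, p))
y = np.sin(X[:, 0]) + 2.0 * X[:, 1] + 0.1 * rng.normal(size=n)  # only x0, x1 matter

h = 1.0                                   # kernel bandwidth (assumed)
grad_norms = np.zeros(p)
for i in range(n):
    w = np.exp(-np.sum((X - X[i]) ** 2, axis=1) / (2 * h**2))
    A = np.column_stack([np.ones(n), X - X[i]])
    coef, *_ = np.linalg.lstsq(np.sqrt(w)[:, None] * A,
                               np.sqrt(w) * y, rcond=None)
    grad_norms += coef[1:] ** 2           # local gradient estimate at X[i]

print(np.argsort(grad_norms)[::-1][:3])  # coordinates 1 and 0 dominate
```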
This paper deals with variable selection in regression and binary classification frameworks. It proposes an automatic and exhaustive procedure which relies on the use of the CART algorithm and on model selection via penalization. This work, of a theoretical nature, aims at determining adequate penalties, i.e. penalties which allow oracle-type inequalities to be achieved, justifying the performance of the proposed procedure. Since the exhaustive procedure cannot be realized when the number of variables is too large, a more practical procedure is also proposed and is still theoretically validated. A simulation study completes the theoretical results.
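A minimal sketch of the CART-plus-penalization idea using scikit-learn's cost-complexity pruning, where the penalty is proportional to the number of leaves and is chosen by cross-validation; this stands in for the paper's theoretically calibrated penalties, which scikit-learn does not implement, and the data are an assumption.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=400, n_features=10, random_state=0)

# candidate penalties along the cost-complexity pruning path of a full tree
path = DecisionTreeClassifier(random_state=0).cost_complexity_pruning_path(X, y)
scores = [cross_val_score(
              DecisionTreeClassifier(random_state=0, ccp_alpha=a), X, y, cv=5
          ).mean()
          for a in path.ccp_alphas]
best = path.ccp_alphas[int(np.argmax(scores))]
print("selected penalty (ccp_alpha):", best)
```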
Specific Gaussian mixtures are considered to solve variable selection and clustering problems simultaneously. A non-asymptotic penalized criterion is proposed to choose the number of mixture components and the relevant variable subset. Because of the non-linearity of the associated Kullback-Leibler contrast on Gaussian mixtures, a general model selection theorem for maximum likelihood estimation proposed by Massart [Concentration Inequalities and Model Selection, Springer, Berlin (2007); lectures from the 33rd Summer School on Probability Theory held in Saint-Flour, July 6–23, 2003] is used to obtain the form of the penalty function. This theorem requires controlling the bracketing entropy of Gaussian mixture families. Both the ordered and non-ordered variable selection cases are addressed in this paper.
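A minimal sketch of penalized-criterion model selection for Gaussian mixtures, with BIC as a familiar stand-in for the paper's non-asymptotic penalty: score candidate component numbers and keep the minimizer. The data and candidate range are assumptions, and the variable selection part of the criterion is not shown.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(7)
X = np.vstack([rng.normal(-2, 1, size=(150, 3)),
               rng.normal(3, 1, size=(150, 3))])   # two true clusters

# penalized log-likelihood (BIC) over candidate numbers of components
bics = {k: GaussianMixture(n_components=k, random_state=0).fit(X).bic(X)
        for k in range(1, 6)}
print(min(bics, key=bics.get))   # typically selects k = 2
```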
In proteomics, Imaging Mass Spectrometry (IMS) is an emerging and very promising new technique for protein analysis from intact biological tissues. Though it has shown great potential for rapid mapping of protein localization and the detection of sizeable differences in protein expression, challenges remain in data processing due to the high dimensionality of the data and the fact that the number of input variables in the prediction model is significantly larger than the number of observations. To obtain a complete overview of IMS data and find trace features based on both spectral and spatial patterns, one faces a global optimization problem. In this paper, we propose a weighted elastic net (WEN) model based on the IMS data-processing need to use both spectral and spatial information for biomarker selection and classification. Properties of the WEN model, including its variable selection accuracy, are discussed. Experimental IMS data analysis results show that such a model not only reduces the number of side features but also helps the discovery of new biomarkers.
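A minimal sketch of feature-specific penalty weights in an elastic net, the mechanism a weighted elastic net builds on: rescaling column j by 1/w_j before fitting penalizes the original coefficient by w_j in the L1 term (and by w_j squared in the L2 term). The weight vector here is an assumed placeholder for the spatially informed weights the paper derives from IMS images.

```python
import numpy as np
from sklearn.linear_model import ElasticNet

rng = np.random.default_rng(8)
n, p = 60, 200                      # far more features than observations
X = rng.normal(size=(n, p))
y = X[:, 0] - X[:, 1] + 0.1 * rng.normal(size=n)

w = np.ones(p)
w[:10] = 0.2                        # assumed: spatial evidence favours features 0-9

model = ElasticNet(alpha=0.05, l1_ratio=0.7)
model.fit(X / w, y)                 # weighted penalty via column rescaling
beta = model.coef_ / w              # map back to the original scale
print(np.nonzero(beta)[0])          # sparse support, biased toward low-weight features
```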