Clinical prediction models are used frequently in clinical practice to identify patients who are at risk of developing an adverse outcome so that preventive measures can be initiated. A prediction model can be developed in a number of ways; however, an appropriate variable selection strategy needs to be followed in all cases. Our purpose is to introduce readers to the concept of variable selection in prediction modelling, including the importance of variable selection and variable reduction strategies. We will discuss the various variable selection techniques that can be applied during prediction model building (backward elimination, forward selection, stepwise selection and all possible subset selection), and the stopping rule/selection criteria in variable selection (p values, Akaike information criterion, Bayesian information criterion and Mallows’ C_{p} statistic). This paper focuses on the importance of including appropriate variables, following the proper steps, and adopting the proper methods when selecting variables for prediction models.

Prediction models play a vital role in establishing the relation between the variables used in the particular model and the outcomes achieved and help forecast the future of a proposed outcome. A prediction model can provide information on the variables that are determining the outcome, their strength of association with the outcome and predict the future of an outcome using their specific values. Prediction models have countless applications in diverse areas, including clinical settings, where a prediction model can help with detecting or screening high-risk subjects for asymptomatic diseases (to help prevent developing diseases with early interventions), predicting a future disease (to help facilitate patient–doctor communication based on more objective information), assisting in medical decision-making (to help both doctors and patients make an informed choice regarding treatment) and assisting healthcare services with planning and quality management.

Different methodologies can be applied to build a prediction model, and these techniques can be classified broadly into two categories: mathematical/statistical modelling and computer-based modelling. Regardless of the modelling technique used, one needs to apply appropriate variable selection methods during the model building stage. Selecting appropriate variables for inclusion in a model is often considered the most important and difficult part of model building. In this paper, we will discuss what is meant by variable selection, why variable selection is important, the different methods for variable selection and their advantages and disadvantages. We have also used examples of prediction models to demonstrate how these variable selection methods are applied in model building. The concept of variable selection is heavily statistical and general readers may not be familiar with many of the concepts discussed in this paper. However, we have attempted to present a non-technical discussion of the concept in plain language that should be accessible to readers with a basic level of statistical understanding. This paper will be helpful for those who wish to be better informed about variable selection in prediction modelling, have more meaningful conversations with biostatisticians/data analysts about their project or select an appropriate method for variable selection in model building using the information provided in this paper. Our intention is to provide readers with a basic understanding of this extremely important topic to assist them when developing a prediction model.

Variable selection means choosing among many variables which to include in a particular model, that is, to select appropriate variables from a complete list of variables by removing those that are irrelevant or redundant.

Due to rapid digitalisation, big data (a term frequently used to describe a collection of data that is extremely large in size, is complex and continues to grow exponentially with time) have emerged in healthcare and become a critical source of the data that have helped conceptualise precision public health and precision medicine approaches. At its simplest level, precision health involves applying appropriate statistical modelling based on available clinical and biological data to predict patient outcomes more accurately. Big data sets contain thousands of variables, which makes them difficult to handle and manage efficiently using traditional approaches. Consequently, variable selection has become the focus of much research in different areas including health. Variable selection offers many benefits such as improving the performance of models in terms of prediction, delivering variables more quickly and cost-effectively by reducing training and utilisation time, facilitating data visualisation and offering an overall better understanding of the underlying process that generated the data.

There are many reasons why variables should be selected, including practicality issues. It is not practical to use a large set of variables in a model. Information involving a large number of variables may not be available for all patients or may be costly to collect. Some variables also may have a negligible effect on outcome and can therefore be excluded. Having fewer variables in the model means less computational time and complexity.

There is no set rule as to the number of variables to include in a prediction model as it often depends on several factors. The ‘one in ten rule’, a rule that stipulates how many variables/parameters can be estimated from a data set, is quite popular in traditional clinical prediction modelling strategy (eg, logistic regression and survival models). According to this rule, one variable can be considered in a model for every 10 events.
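The arithmetic behind this rule of thumb can be sketched in a few lines. The function name `max_variables`, its parameters and the cohort size are hypothetical, chosen only for illustration; the default of 10 events per variable is the value stated above.

```python
def max_variables(n_events: int, events_per_variable: int = 10) -> int:
    """Largest number of variables the rule of thumb allows for n_events."""
    if n_events < 0 or events_per_variable <= 0:
        raise ValueError("n_events must be >= 0 and events_per_variable > 0")
    # Integer division: a partial block of events does not buy another variable.
    return n_events // events_per_variable


# Hypothetical cohort in which 87 patients experienced the outcome.
print(max_variables(87))  # 8
```

Under this sketch, a data set with 87 events would support at most 8 variables, however many patients it contains in total.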

Existing theory and literature, as well as experience and clinical knowledge, provide a general idea as to which candidate variables should be considered for inclusion in a prediction model. Nevertheless, the actual variables used in the final prediction model should be determined by analysing the data. Determining the set of variables for the final model is called variable selection. Variable selection serves two purposes. First, it helps determine all of the variables that are related to the outcome, which makes the model complete and accurate. Second, it helps select a model with few variables by eliminating irrelevant variables that decrease the precision and increase the complexity of the model. Ultimately, variable selection provides a balance between simplicity and fit.

Variable selection steps. AIC, Akaike information criterion; BIC, Bayesian information criterion.

One way to restrict the list of potential variables is to choose the candidate variables first, particularly if the sample is small. Candidate variables for a specific topic are those that have demonstrated previous prognostic performance with the outcome.

Grouping/combining similar, related variables based on subject knowledge and statistical technique can also help restrict the number of variables. If variables are strongly correlated, combining them into a single variable has been considered prudent.

How variables are distributed can also provide an indication of which ones to restrict. Variables that have a large number of missing values can be excluded, because imputing a large number of missing values will appear suspicious to many readers due to the lack of reliable estimation, and the same problem may recur when the model is applied.

Once the number of potential candidate variables has been identified from the list of all available variables in the data set, a further selection of variables is made for inclusion in the final model. There are different ways of selecting variables for a final model. However, there is no consensus on which method is the best.

It has also been suggested that variable selection should start with the univariate analysis of each variable.

Backward elimination is the simplest of all variable selection methods. This method starts with a full model that considers all of the variables to be included in the model. Variables then are deleted one by one from the full model until all remaining variables are considered to have some significant contribution to the outcome.
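The procedure can be sketched as a short loop. This is a minimal illustration under stated assumptions, not a definitive implementation: `fit_p_values` stands for any user-supplied routine that refits the model on the given variables and returns one p value per variable, and `toy_p_values` is a made-up stand-in with invented numbers so that the sketch runs without data. The 0.05 cut-off is likewise only an example.

```python
from typing import Callable, Dict, List, Sequence

# Hypothetical interface: refit the model, return {variable: p value}.
PValueFn = Callable[[Sequence[str]], Dict[str, float]]


def backward_elimination(candidates: Sequence[str],
                         fit_p_values: PValueFn,
                         alpha_remove: float = 0.05) -> List[str]:
    """Start from the full model and drop the least significant variable
    until every remaining variable has p <= alpha_remove."""
    selected = list(candidates)
    while selected:
        p = fit_p_values(selected)                 # refit on current set
        worst = max(selected, key=lambda v: p[v])  # least significant
        if p[worst] <= alpha_remove:
            break                                  # all remaining are significant
        selected.remove(worst)                     # dropped for good
    return selected


def toy_p_values(variables: Sequence[str]) -> Dict[str, float]:
    """Invented p values (not from real data); sbp looks weaker next to bmi."""
    table = {"age": 0.001,
             "sbp": 0.08 if "bmi" in variables else 0.03,
             "bmi": 0.20,
             "smoker": 0.04}
    return {v: table[v] for v in variables}


print(backward_elimination(["age", "sbp", "bmi", "smoker"], toy_p_values))
# ['age', 'sbp', 'smoker']
```

In the toy run, `bmi` is removed first, after which `sbp` falls below the cut-off and is retained.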


While a set of variables can have significant predictive ability, a particular subset of them may not. Unfortunately, neither forward selection nor stepwise selection has the capacity to identify less predictive individual variables that may not enter the model to demonstrate their joint behaviour. However, backward elimination has the advantage of assessing the joint predictive ability of variables, as the process starts with all variables being included in the model. Backward elimination also removes the least important variables early on and leaves only the most important variables in the model. One disadvantage of the backward elimination method is that once a variable is eliminated from the model it is not re-entered. However, a dropped variable may become significant later in the final model.

The forward selection method of variable selection is the reverse of the backward elimination method. The method starts with no variables in the model and then adds variables one by one until no variable outside the model can add a significant contribution to the outcome of the model.
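A minimal sketch of this procedure follows, assuming a hypothetical `fit_p_values` routine that refits the model on a given set of variables and returns one p value per variable. The stand-in `toy_p_values` uses invented numbers so that the example runs without data, and the 0.05 entry cut-off is only illustrative.

```python
from typing import Callable, Dict, List, Sequence

# Hypothetical interface: refit the model, return {variable: p value}.
PValueFn = Callable[[Sequence[str]], Dict[str, float]]


def forward_selection(candidates: Sequence[str],
                      fit_p_values: PValueFn,
                      alpha_enter: float = 0.05) -> List[str]:
    """Start from the empty model and add the most significant remaining
    variable until none would enter with p < alpha_enter."""
    selected: List[str] = []
    remaining = list(candidates)
    while remaining:
        # p value each remaining variable would have if added on its own
        trial = {v: fit_p_values(selected + [v])[v] for v in remaining}
        best = min(remaining, key=lambda v: trial[v])
        if trial[best] >= alpha_enter:
            break                      # nothing left adds significantly
        selected.append(best)          # entered variables are never removed
        remaining.remove(best)
    return selected


def toy_p_values(variables: Sequence[str]) -> Dict[str, float]:
    """Invented p values (not from real data); sbp looks weaker next to bmi."""
    table = {"age": 0.001,
             "sbp": 0.08 if "bmi" in variables else 0.03,
             "bmi": 0.20,
             "smoker": 0.04}
    return {v: table[v] for v in variables}


print(forward_selection(["age", "sbp", "bmi", "smoker"], toy_p_values))
# ['age', 'sbp', 'smoker']
```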


One advantage of forward selection is that it starts with smaller models. Also, this procedure is less susceptible to collinearity (very high intercorrelations or interassociations among independent variables). Like backward elimination, forward selection also has drawbacks. In forward selection, inclusion of a new variable may make an existing variable in the model non-significant; however, the existing variable cannot be deleted from the model. A balance between backward elimination and forward selection is therefore required, which can be achieved in stepwise selection.

Stepwise selection methods are a widely used variable selection technique, particularly in medical applications. This method is a combination of forward and backward selection procedures that allows moving in both directions, adding and removing variables at different steps. The process can start with either a backward elimination or a forward selection approach. For example, if stepwise selection starts with forward selection, variables are added to the model one at a time based on statistical significance. At each step, after a variable is added, the procedure checks all the variables already added to the model to delete any variable that is not significant in the model. The process continues until every variable in the model is significant and every excluded variable is insignificant. Due to its similarity, this approach is sometimes considered a modified forward selection. However, it differs from forward selection in that variables entered into the model do not necessarily remain in the model. Conversely, if stepwise selection starts with backward elimination, the variables are deleted from the full model based on statistical significance and then added back if they later appear significant. The process alternates between choosing the least significant variable to drop from the model and reconsidering all dropped variables for re-entry into the model. Stepwise selection requires two separate significance levels (cut-offs) for adding and deleting variables from the model. The significance level for adding variables should be less than the significance level for deleting variables so that the procedure does not get into an infinite loop. Within stepwise selection, backward elimination is often given preference because the full model is considered and the effect of all candidate variables is assessed.
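The forward-starting variant can be sketched as follows. This is an illustration under stated assumptions: `fit_p_values` is a hypothetical routine that refits the model and returns one p value per variable, `toy_p_values` contains invented numbers, and the cut-offs 0.05 and 0.10 are examples only. The `max_steps` guard is an added safety measure, not part of the method as described.

```python
from typing import Callable, Dict, List, Sequence

# Hypothetical interface: refit the model, return {variable: p value}.
PValueFn = Callable[[Sequence[str]], Dict[str, float]]


def stepwise_selection(candidates: Sequence[str],
                       fit_p_values: PValueFn,
                       alpha_enter: float = 0.05,
                       alpha_remove: float = 0.10,
                       max_steps: int = 100) -> List[str]:
    """Forward-starting stepwise selection with a backward check each step."""
    if alpha_enter >= alpha_remove:
        raise ValueError("alpha_enter must be smaller than alpha_remove")
    selected: List[str] = []
    for _ in range(max_steps):          # guard against cycling
        changed = False
        remaining = [v for v in candidates if v not in selected]
        if remaining:                   # forward step: try to add one variable
            trial = {v: fit_p_values(selected + [v])[v] for v in remaining}
            best = min(remaining, key=lambda v: trial[v])
            if trial[best] < alpha_enter:
                selected.append(best)
                changed = True
        while selected:                 # backward check on the current model
            p = fit_p_values(selected)
            worst = max(selected, key=lambda v: p[v])
            if p[worst] <= alpha_remove:
                break
            selected.remove(worst)
            changed = True
        if not changed:
            break
    return selected


def toy_p_values(variables: Sequence[str]) -> Dict[str, float]:
    """Invented p values: waist loses significance once bmi is in the model."""
    both = "waist" in variables and "bmi" in variables
    table = {"age": 0.001,
             "waist": 0.30 if both else 0.02,
             "bmi": 0.01 if both else 0.03}
    return {v: table[v] for v in variables}


print(stepwise_selection(["age", "waist", "bmi"], toy_p_values))
# ['age', 'bmi']
```

In the toy run, `waist` enters second but is deleted once `bmi` enters, which plain forward selection could not do.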


The stepwise selection method is perhaps the most widely used method of variable selection. One reason is that it is easy to apply in statistical software.

In all possible subset selection, every possible combination of variables is checked to determine the best subset of variables for the prediction model. With this procedure, all one-variable, two-variable, three-variable models, and so on, are built to determine which one is the best according to some specific criteria. If there are K variables, then there are 2^{K} possible models that can be built.
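The enumeration can be sketched with the standard library. The helper names, the `criterion` callable (lower is better) and the toy scores are assumptions introduced for illustration; in practice the criterion would come from fitting each model to data.

```python
from itertools import combinations
from typing import Callable, Iterator, Sequence, Tuple

Subset = Tuple[str, ...]


def all_subsets(variables: Sequence[str]) -> Iterator[Subset]:
    """Yield every subset of the variables, from the empty model to the full one."""
    for size in range(len(variables) + 1):
        yield from combinations(variables, size)


def best_subset(variables: Sequence[str],
                criterion: Callable[[Subset], float]) -> Subset:
    """Return the subset with the smallest criterion value."""
    return min(all_subsets(variables), key=criterion)


# Invented 'gain in fit' per variable, offset by a penalty of 2 per variable.
gains = {"age": 10.0, "sbp": 5.0, "bmi": 1.0}


def toy_criterion(subset: Subset) -> float:
    return 2 * len(subset) - sum(gains[v] for v in subset)


variables = ["age", "sbp", "bmi"]
print(len(list(all_subsets(variables))))      # 8, that is 2**3
print(best_subset(variables, toy_criterion))  # ('age', 'sbp')
```

Because the number of models doubles with each additional variable (1024 models for K = 10 and 1 048 576 for K = 20), exhaustive search quickly becomes computationally demanding.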


The ability to identify a combination of variables, which is not available in other selection procedures, is an advantage of this method.

In all stepwise selection methods including all subset selection, a stopping rule or selection criteria for inclusion or exclusion of variables need to be set. Generally, a standard significance level for hypothesis testing is used. Other criteria that can serve as a stopping rule include AIC, BIC and Mallows’ C_{p} statistic. We discuss these major selection criteria below.

If the stopping rule is based on p values, the traditional choice for significance level is 0.05 or 0.10. However, the optimum value of the significance level to decide which variable to include in the model is suggested to be 1, which exceeds the traditional choices.

AIC is a tool for model selection that compares different models. Including different variables in the model provides different models, and AIC attempts to select the model by balancing underfitting (too few variables in the model) and overfitting (too many variables in the model).

AIC only provides information about the quality of a model relative to the other models and does not provide information on the absolute quality of the model. With a small sample size (relative to the number of variables/parameters), AIC often selects models with too many variables. However, this issue can be addressed with a modified version of AIC called AIC_{C}, which introduces an extra penalty term for the number of variables/parameters. For a large sample size, this penalty term approaches zero and AIC_{C} converges to AIC, which is why it is suggested that AIC_{C} be used in practice.
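For reference, the commonly used definitions of the two criteria can be written as below. The symbols are introduced here for illustration and do not appear in the text: k is the number of estimated parameters, n is the sample size and \hat{L} is the maximised value of the likelihood. The small-sample correction shown is the usual one, which is exact only for certain models (such as linear regression with normal errors) and is otherwise an approximation.

```latex
\begin{align}
\mathrm{AIC} &= 2k - 2\ln\hat{L}, \\
\mathrm{AIC}_{C} &= \mathrm{AIC} + \frac{2k(k+1)}{n - k - 1}.
\end{align}
```

For fixed k, the correction term 2k(k+1)/(n - k - 1) shrinks towards zero as n grows, so the two criteria agree in large samples. The model with the smallest value is preferred.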

BIC is another variable selection criterion that is similar to AIC, but with a different penalty for the number of variables (parameters) included in the model. Like AIC, BIC also balances between simplicity and goodness of model fitting. In practice, for a given data set, BIC is calculated for each of the candidate models, and the model corresponding to the minimum BIC value is chosen. BIC often chooses models that are more parsimonious than AIC, as BIC penalises bigger models more due to the larger penalty term inherent in its formula.
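The difference in penalties can be illustrated with a short calculation using the standard formulas AIC = 2k - 2 ln(L) and BIC = k ln(n) - 2 ln(L), where k is the number of parameters, n the sample size and ln(L) the maximised log-likelihood. The two candidate models and their log-likelihoods below are invented for illustration only.

```python
import math


def aic(loglik: float, k: int) -> float:
    """Akaike information criterion: penalty of 2 per parameter."""
    return 2 * k - 2 * loglik


def bic(loglik: float, k: int, n: int) -> float:
    """Bayesian information criterion: penalty of ln(n) per parameter."""
    return k * math.log(n) - 2 * loglik


n = 200  # hypothetical sample size
# Hypothetical candidates: name -> (maximised log-likelihood, parameters)
models = {"small": (-120.0, 3), "large": (-116.0, 6)}

best_by_aic = min(models, key=lambda m: aic(models[m][0], models[m][1]))
best_by_bic = min(models, key=lambda m: bic(models[m][0], models[m][1], n))

print(best_by_aic)  # large
print(best_by_bic)  # small
```

Here the larger model wins on AIC (244 versus 246), whereas BIC, whose per-parameter penalty ln(200) is about 5.3 rather than 2, prefers the smaller model. The BIC penalty exceeds the AIC penalty whenever ln(n) > 2, that is, for n of 8 or more.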

Although there are similarities between AIC and BIC, and both criteria balance simplicity and model fit, differences exist between them. The underlying theory behind AIC is that the data stem from a very complex model, there are many candidate models to fit the data and none of the candidate models (including the best model) are the exact functional form of the true model.

Mallows’ C_{p} statistic is another criterion used in variable selection. The purpose of the statistic is to select the best model using a subset of variables from all available variables. This criterion is most widely used in the all subset selection method. Different models derived in all subset selection are compared based on Mallows’ C_{p} statistic, and the model with a low C_{p} value that is closest to the number of variables plus the constant is often chosen. A small Mallows’ C_{p} value near the number of variables indicates that the model is relatively more precise than other models (small variance and less bias).
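For a linear regression setting, the statistic is usually defined as below. The notation is introduced here for illustration: p is the number of parameters in the subset model (variables plus the constant), SSE_{p} is its residual sum of squares, \hat{\sigma}^{2} is the residual mean square of the full model containing all candidate variables, and n is the sample size.

```latex
\[
C_{p} = \frac{\mathrm{SSE}_{p}}{\hat{\sigma}^{2}} - n + 2p
\]
```

When a subset model has little bias, the expected value of C_{p} is approximately p, which is why values close to the number of variables plus the constant are sought. For the full model itself, C_{p} equals p by construction, so the comparison is informative only for proper subsets.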

It is extremely important to include appropriate variables in prediction modelling, as a model’s performance largely depends on which variables are ultimately included in the model. Failure to include the proper variables in the model provides inaccurate results, and the model will fail to capture the true relation that exists in the data between the outcome and the selected variables. There are numerous occasions when prediction models are developed without following the proper steps or adopting the proper method of variable selection. Researchers need to be more aware of and cautious about these very important aspects of prediction modelling.


TCT and MZIC developed the study idea. MZIC prepared the manuscript with critical intellectual inputs from TCT. The manuscript has been finalised by MZIC and TCT.

The authors have not declared a specific grant for this research from any funding agency in the public, commercial or not-for-profit sectors.

None declared.

Not required.

Not commissioned; externally peer reviewed.

There are no data in this work.