There are three kinds of lies: lies, damned lies, and statistics.
To explain the level and inequality of wellbeing, we use the standard tools of quantitative social science. These are mainly the techniques of multiple regression. In this chapter, we shall show how multiple regression can address the following issues.Footnote 1
(1) What is the effect of different factors on the level of wellbeing (using survey data)?
(2) What problems arise in estimating this and how can they be handled?
(3) How far do different factors contribute to the observed inequality of wellbeing?
(4) How can experiments and quasi-experiments show us the effect of interventions to improve wellbeing?
Suppose that a person’s wellbeing (W) is determined by a range of explanatory variables (X1, …, XN) in an additive fashion. In addition there is an unexplained residual (e), which is randomly distributed around an average value of zero. Then the wellbeing of the ith individual (Wi) is given by

$$W_i = a_0 + a_1 X_{i1} + a_2 X_{i2} + \dots + a_N X_{iN} + e_i$$

which we can also write as

$$W_i = a_0 + \sum_{j=1}^{N} a_j X_{ij} + e_i \qquad (1)$$
In this equation, wellbeing is being explained by the Xjs. So wellbeing is the ‘dependent’ variable (or left-hand variable) and the Xjs are the ‘independent’ (or right-hand) variables. These right-hand variables can take many forms. They can be continuous, like income or the logarithm of income, or like age or age squared. Or they can be binary variables like unemployment: you are either unemployed or not unemployed. These binary variables are often called dummy variables: they take the value 1 when you are in that state (e.g., unemployed) and the value 0 when you are not (e.g., not unemployed).
If we want to explain wellbeing, we have to discover the size of the effect of each thing that affects wellbeing. In other words, we have to discover the size of the ajs. For example, suppose

$$W_i = a_0 + a_1 \log \text{Income}_i + a_2 \text{Unemployed}_i + e_i \qquad (2)$$
In Chapter 8 you will find the benchmark values a1 = 0.3 and a2 = −0.7. This means that when a person’s log income increases by one point, her wellbeing increases by 0.3 points (out of 10). Similarly, when a person ceases to be unemployed, her wellbeing increases by 0.7 points (ignoring any effect of a simultaneous change in income). And, if both things happen together, wellbeing increases by a whole point (0.3 + 0.7).
Estimating the Effect of a Variable
But how are we to estimate, as best we can, the true values of these aj coefficients? The best unbiased way of doing this is to find the set of ajs that leaves the smallest sum of squared residuals $e_i^2$ across the whole sample of people being studied.Footnote 2 This is known as the method of Ordinary Least Squares (OLS). Standard programmes like Stata will do it for you automatically. However, there are four possible problems with such estimates when obtained from a cross-section of the population.
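As a minimal sketch of what such a programme does, the following Python snippet fits a regression of wellbeing on log income and an unemployment dummy by OLS. Everything except the benchmark coefficients 0.3 and −0.7 is an illustrative assumption: the data are simulated, the intercept of 4.5 and the noise level are made up, and the helper names `cov` and `ols2` are hypothetical.

```python
import random
from statistics import fmean

def cov(u, v):
    """Population covariance of two equal-length lists."""
    mu, mv = fmean(u), fmean(v)
    return fmean([(a - mu) * (b - mv) for a, b in zip(u, v)])

def ols2(x1, x2, y):
    """OLS of y on a constant, x1 and x2 (solves the two normal equations)."""
    s11, s22, s12 = cov(x1, x1), cov(x2, x2), cov(x1, x2)
    s1y, s2y = cov(x1, y), cov(x2, y)
    det = s11 * s22 - s12 ** 2
    a1 = (s22 * s1y - s12 * s2y) / det
    a2 = (s11 * s2y - s12 * s1y) / det
    a0 = fmean(y) - a1 * fmean(x1) - a2 * fmean(x2)
    return a0, a1, a2

# Simulated survey (assumed values, for illustration only).
rng = random.Random(0)
n = 10_000
log_income = [rng.gauss(10.0, 0.7) for _ in range(n)]
unemployed = [1.0 if rng.random() < 0.08 else 0.0 for _ in range(n)]
wellbeing = [4.5 + 0.3 * li - 0.7 * u + rng.gauss(0.0, 0.5)
             for li, u in zip(log_income, unemployed)]

a0_hat, a1_hat, a2_hat = ols2(log_income, unemployed, wellbeing)
print(f"a1 = {a1_hat:.3f}, a2 = {a2_hat:.3f}")
```

With a sample of this size the estimates should land close to the values used to generate the data, though never exactly on them.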
Omitted variables
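Suppose that equation (2) is not the correct model but that another X variable should also be in the equation. Suppose, for example, that the right model is

$$W_i = a_0 + a_1 \log \text{Income}_i + a_2 \text{Unemployed}_i + a_3 \text{Education}_i + e_i \qquad (3)$$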
where Education means years of education. Clearly education and income are positively correlated. So if a1 and a3 are positive, people with higher income will be getting higher wellbeing for 2 reasons:
the direct effect of income (a1) and
the effect of education in so far as it is correlated with income.
Thus, equation (2) will give an exaggerated estimate of the direct effect (a1) of income on wellbeing.Footnote 3 To leave out education is to leave out a confounding variable. And any such confounding variable must have two properties:
it is causally related to the dependent (LHS) variable and
it is correlated with an independent (RHS) variable.
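The size of this bias can be illustrated with a small simulation. The sketch below assumes a hypothetical population in which education raises both log income and wellbeing; all numbers (including the education coefficient of 0.1) and the helper names are assumptions chosen for illustration. Unemployment is left out to keep the example short.

```python
import random
from statistics import fmean

def cov(u, v):
    """Population covariance of two equal-length lists."""
    mu, mv = fmean(u), fmean(v)
    return fmean([(a - mu) * (b - mv) for a, b in zip(u, v)])

def ols2(x1, x2, y):
    """OLS of y on a constant, x1 and x2 (solves the two normal equations)."""
    s11, s22, s12 = cov(x1, x1), cov(x2, x2), cov(x1, x2)
    s1y, s2y = cov(x1, y), cov(x2, y)
    det = s11 * s22 - s12 ** 2
    a1 = (s22 * s1y - s12 * s2y) / det
    a2 = (s11 * s2y - s12 * s1y) / det
    return fmean(y) - a1 * fmean(x1) - a2 * fmean(x2), a1, a2

rng = random.Random(1)
n = 10_000
education = [rng.gauss(12.0, 3.0) for _ in range(n)]
# Education is positively correlated with income (assumed).
log_income = [8.8 + 0.1 * ed + rng.gauss(0.0, 0.5) for ed in education]
wellbeing = [2.0 + 0.3 * li + 0.1 * ed + rng.gauss(0.0, 0.5)
             for li, ed in zip(log_income, education)]

# Short regression: education omitted, so income picks up its effect.
a1_short = cov(log_income, wellbeing) / cov(log_income, log_income)
# Long regression: education included as a control.
_, a1_long, a3_long = ols2(log_income, education, wellbeing)

print(f"income coefficient without education: {a1_short:.3f}")
print(f"income coefficient with education:    {a1_long:.3f}")
```

In this set-up the short regression overstates the income coefficient, while the long regression recovers a value near the 0.3 used to generate the data.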
If we lack data on the confounding variable, the classic way to overcome this problem is to use time-series panel data on the same people. Provided the omitted variable is constant over time, it can cause no problem, since we can now estimate how changes in income within the same person affect changes in her wellbeing. Thus, if we use time-series data, we cease to compare different individuals at the same point of time and we compare the same individual at different periods of time. Algebraically, we do this by expanding equation (2) to include multiple time periods (t) and adding a fixed-effect dummy variable (fi) for each individual. This picks up the effect of all the fixed characteristics of the individual (which for most adults will include education). Thus, we now explain the wellbeing of the ith person in the tth time period by

$$W_{it} = a_0 + a_1 \log \text{Income}_{it} + a_2 \text{Unemployed}_{it} + f_i + e_{it}$$
There are standard programmes for including fixed-effects. A similar method to this is used for analysing the effect of experiments, but we shall come to this later.
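One way to see what a fixed-effects routine does is to subtract each person's own average from her observations before running the regression, which gives the same slope as including a dummy for every individual. The sketch below does this on a simulated panel; the panel dimensions, the coefficients and the unobserved trait are all assumptions made for illustration.

```python
import random
from statistics import fmean

def slope(x, y):
    """Slope from a simple OLS regression of y on a constant and x."""
    mx, my = fmean(x), fmean(y)
    num = sum((a - mx) * (b - my) for a, b in zip(x, y))
    den = sum((a - mx) ** 2 for a in x)
    return num / den

rng = random.Random(2)
n_people, n_periods = 2000, 5
x_pooled, w_pooled, x_within, w_within = [], [], [], []

for _ in range(n_people):
    f_i = rng.gauss(0.0, 1.0)  # fixed, unobserved trait of this person
    # The trait raises both income and wellbeing (assumed), so it confounds.
    x = [10.0 + 0.5 * f_i + rng.gauss(0.0, 0.5) for _ in range(n_periods)]
    w = [2.0 + 0.3 * xi + f_i + rng.gauss(0.0, 0.5) for xi in x]
    x_pooled += x
    w_pooled += w
    # Within transformation: deviations from the person's own mean.
    mx, mw = fmean(x), fmean(w)
    x_within += [xi - mx for xi in x]
    w_within += [wi - mw for wi in w]

a1_pooled = slope(x_pooled, w_pooled)  # compares different people
a1_fe = slope(x_within, w_within)      # compares each person with herself

print(f"pooled estimate:        {a1_pooled:.3f}")
print(f"fixed-effects estimate: {a1_fe:.3f}")
```

Because the trait is constant within each person, demeaning removes it entirely, and the within estimate should sit near the 0.3 used in the simulation.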
Reverse causality
However, there is another problem. Suppose we are interested in the effect of income on wellbeing. But suppose that there is also the reverse effect – of wellbeing on income.Footnote 4 How can we be sure that, when we estimate equation (2), we are really estimating the effect of income on wellbeing rather than the reverse relationship or a mixture of the two? In other words is equation (2) in principle ‘identifiable’?
For an equation to be identifiable, it must exclude at least one of the variables that appears in the second relationship (the one that determines income).Footnote 5 But, even if it is identifiable, there is still the problem of getting a causal estimate of the effects of the endogenous variable.
The aim has to be to isolate that part of the endogenous variable that is due to something exogenous to the system. A variable that can isolate that part of the endogenous variable is called an instrumental variable. For example, if tax rates or minimum wages changed over time, these would be good instruments. Instrumental variables can also be used to handle the problem of omitted variables. In every case a good instrument
(i) is well related in a causal way to the variable it instruments and
(ii) should not itself appear in the equation (i.e., it is not correlated with the error term in the equation).
There are programmes for the use of instrumental variables (IVs).
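With one endogenous variable and one instrument, the IV estimate reduces to a ratio of two covariances, which is what two-stage least squares delivers in this just-identified case. The sketch below is a simulation under stated assumptions: `z` is a hypothetical instrument (for instance a policy change) that shifts the endogenous variable `x` but has no direct effect on the outcome, and `u` is an unobserved factor that makes ordinary OLS misleading.

```python
import random
from statistics import fmean

def cov(u, v):
    """Population covariance of two equal-length lists."""
    mu, mv = fmean(u), fmean(v)
    return fmean([(a - mu) * (b - mv) for a, b in zip(u, v)])

rng = random.Random(3)
n = 10_000
z = [rng.gauss(0.0, 1.0) for _ in range(n)]  # instrument (assumed exogenous)
u = [rng.gauss(0.0, 1.0) for _ in range(n)]  # unobserved, affects x and y
x = [zi + ui + rng.gauss(0.0, 0.5) for zi, ui in zip(z, u)]
y = [0.3 * xi + ui + rng.gauss(0.0, 0.5) for xi, ui in zip(x, u)]

a_ols = cov(x, y) / cov(x, x)  # contaminated by u
a_iv = cov(z, y) / cov(z, x)   # uses only the variation in x driven by z

print(f"OLS estimate: {a_ols:.3f}")
print(f"IV estimate:  {a_iv:.3f}")
```

The IV estimate is only as good as the two conditions above: if `z` were correlated with `u`, the ratio would be biased as well.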
Another way to isolate causal relationships is through the timing of effects. Suppose, for example, that income affects wellbeing in the next period rather than the current period. We can then identify its effect by regressing current wellbeing on income in the previous period, and similarly for unemployment. This gives us

$$W_{it} = a_0 + a_1 \log \text{Income}_{i,t-1} + a_2 \text{Unemployed}_{i,t-1} + e_{it}$$
Measurement error
Another source of biased estimates is measurement error. If the left-hand variable has high measurement error, this will not bias the estimated coefficients aj. But, if an explanatory variable Xj is measured with error, this will bias aj towards zero. If the measurement error is known, this can be used to correct for the bias. But, if not, an instrumental variable can again come to the rescue, provided it is uncorrelated with the measurement error in the variable it is instrumenting.
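The bias towards zero, and the correction when the size of the measurement error is known, can be sketched in a few lines. The simulation below assumes a true coefficient of 0.3 and a measurement error whose variance equals that of the true variable, so the estimated coefficient should shrink by about half; these numbers are illustrative assumptions only.

```python
import random
from statistics import fmean

def cov(u, v):
    """Population covariance of two equal-length lists."""
    mu, mv = fmean(u), fmean(v)
    return fmean([(a - mu) * (b - mv) for a, b in zip(u, v)])

rng = random.Random(4)
n = 10_000
err_sd = 1.0  # standard deviation of the measurement error (assumed known)
x_true = [rng.gauss(0.0, 1.0) for _ in range(n)]
x_obs = [xt + rng.gauss(0.0, err_sd) for xt in x_true]
w = [0.3 * xt + rng.gauss(0.0, 0.5) for xt in x_true]

# Regression on the mismeasured variable: coefficient is attenuated.
a_obs = cov(x_obs, w) / cov(x_obs, x_obs)

# Share of the observed variance that is true signal.
reliability = (cov(x_obs, x_obs) - err_sd ** 2) / cov(x_obs, x_obs)
a_corrected = a_obs / reliability

print(f"attenuated estimate: {a_obs:.3f}")
print(f"corrected estimate:  {a_corrected:.3f}")
```

The correction divides the attenuated coefficient by the reliability ratio, which is only possible here because the error variance is taken as known.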
Mediating variables
A final issue is this. A multiple regression equation such as (3) shows us the effect of each variable upon wellbeing holding other things constant. But suppose we are interested in the total effect of changing one variable upon wellbeing. For example, we might ask What is the total effect of unemployment upon wellbeing?
The total effect is clearly
a2, plus
a1 times the effect of unemployment upon log income.
That is one way you could estimate it. An alternative way is to take equation (2) and leave income out of the equation, so that the estimated coefficient on unemployment includes any effect that unemployment has on wellbeing via its effect on income.
In a case like this, income is a mediating variable. If we are only interested in the total effect of unemployment, we can simply leave the mediating variable out of the equation. Or we can estimate a system of structural equations consisting of (2) and the equation that determines income. This discussion brings out one crucial point in wellbeing research. We should always be very clear what question we are trying to answer. We should choose our equation or equations accordingly.
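The two routes to the total effect give exactly the same number in any given sample when all equations are estimated by OLS. The sketch below checks this on simulated data; the assumed effect of unemployment on log income (−0.5) and the other generating values are hypothetical, and the helper names are illustrative.

```python
import random
from statistics import fmean

def cov(u, v):
    """Population covariance of two equal-length lists."""
    mu, mv = fmean(u), fmean(v)
    return fmean([(a - mu) * (b - mv) for a, b in zip(u, v)])

def ols2(x1, x2, y):
    """OLS of y on a constant, x1 and x2 (solves the two normal equations)."""
    s11, s22, s12 = cov(x1, x1), cov(x2, x2), cov(x1, x2)
    s1y, s2y = cov(x1, y), cov(x2, y)
    det = s11 * s22 - s12 ** 2
    a1 = (s22 * s1y - s12 * s2y) / det
    a2 = (s11 * s2y - s12 * s1y) / det
    return fmean(y) - a1 * fmean(x1) - a2 * fmean(x2), a1, a2

rng = random.Random(5)
n = 5000
unemployed = [1.0 if rng.random() < 0.1 else 0.0 for _ in range(n)]
# Unemployment lowers log income (assumed size), so income mediates.
log_income = [10.0 - 0.5 * u + rng.gauss(0.0, 0.6) for u in unemployed]
wellbeing = [4.5 + 0.3 * li - 0.7 * u + rng.gauss(0.0, 0.5)
             for li, u in zip(log_income, unemployed)]

# Route 1: direct effect plus a1 times the effect on log income.
_, a1_hat, a2_hat = ols2(log_income, unemployed, wellbeing)
d_hat = cov(unemployed, log_income) / cov(unemployed, unemployed)
total_two_step = a2_hat + a1_hat * d_hat

# Route 2: leave income out of the wellbeing equation.
total_short = cov(unemployed, wellbeing) / cov(unemployed, unemployed)

print(f"direct effect a2:        {a2_hat:.3f}")
print(f"total effect (two-step): {total_two_step:.3f}")
print(f"total effect (short):    {total_short:.3f}")
```

Which number to report depends on the question: the direct effect holds income constant, the total effect does not.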
Standard errors and significance
All coefficients are estimated with a margin of uncertainty. Each estimated coefficient has a ‘standard error’ (se) around the estimated value. The true value will lie within 2 standard errors on either side of the estimated coefficient in 95% of samples. Thus the ‘95% confidence interval’ for the $a_j$ coefficient runs from $\hat{a}_j - 2\,se$ to $\hat{a}_j + 2\,se$, where $\hat{a}_j$ means the estimated value of $a_j$. If this confidence interval does not include the value zero, the estimated coefficient is said to be ‘significantly different from zero at the 95% level’.
For many psychologists, this issue of significance is considered crucial. It answers the question ‘Does X affect W at all?’ But for policy purposes the more important question is ‘How much does X affect W?’ So the coefficient itself is more interesting than its significance level. For any sample size, the estimated coefficient is the best available answer to the question of how much X changes W. And, if you increase the size of the sample, the expected value of the estimated aj does not change but its standard error automatically falls (it is inversely proportional to the square root of the sample size). So in this book we focus more heavily on the size of coefficients than on their significance (though we sometimes show standard errors in brackets in the tables).
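The link between sample size and standard error can be checked numerically. The sketch below uses the textbook formula for the standard error of a slope in a simple regression and compares a sample with one four times as large, where the standard error should be roughly half. The data-generating values and the function names are assumptions for illustration.

```python
import math
import random
from statistics import fmean

def slope_and_se(x, y):
    """Simple OLS slope and its conventional standard error."""
    n = len(x)
    mx, my = fmean(x), fmean(y)
    sxx = sum((a - mx) ** 2 for a in x)
    b = sum((a - mx) * (c - my) for a, c in zip(x, y)) / sxx
    a0 = my - b * mx
    rss = sum((c - a0 - b * a) ** 2 for a, c in zip(x, y))
    se = math.sqrt(rss / (n - 2) / sxx)
    return b, se

def simulate(n, rng):
    """Draw a hypothetical sample and estimate the income coefficient."""
    x = [rng.gauss(10.0, 0.7) for _ in range(n)]
    w = [4.5 + 0.3 * xi + rng.gauss(0.0, 1.5) for xi in x]
    return slope_and_se(x, w)

rng = random.Random(6)
b_small, se_small = simulate(2000, rng)
b_large, se_large = simulate(8000, rng)

for label, b, se in (("n=2000", b_small, se_small),
                     ("n=8000", b_large, se_large)):
    print(f"{label}: estimate {b:.3f}, "
          f"95% interval [{b - 2 * se:.3f}, {b + 2 * se:.3f}]")
print(f"ratio of standard errors: {se_small / se_large:.2f}")
```

Both samples target the same coefficient; the larger one simply pins it down more tightly.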
The question we have been asking thus far in this chapter is How does wellbeing change when an independent variable changes? In algebraic terms, we have been studying dW/dXj. This is the type of number we need in order to evaluate a policy change. For example, suppose we increased the income of poor people by 20%: how much would their wellbeing change (on a scale of 0–10)? If aj = 0.3, it would increase by about 0.06 points (0.3 × 0.2, since a 20% rise in income raises log income by roughly 0.2). A quite different question is In which areas of life should we look hardest in the search for better policies?
The Explanatory Power of a Variable
If our main aim is to help the people with the lowest wellbeing (as we discussed in Chapter 2), then our focus should be on what explains the inequality of wellbeing. To see why, suppose first that wellbeing depends only on one variable X1, with W = a0 + a1X1. Then the distribution of W depends only on the distribution of X1. If W is unequal, it is because X1 is unequal and a1 is high. The higher the standard deviation (σ1) of X1 and the higher a1, the greater the inequality of W. This is illustrated in Figure 7.1. With a high variance of W, the numbers in misery correspond to areas A and B together. But with a low variance of W, the numbers in misery correspond only to area B.
A next natural step is to compare the standard deviation of a1X1 (which is a1σ1) with the standard deviation of wellbeing itself (σw). Obviously, if they were equal in size, the spread of X1 would be ‘explaining’ the whole spread of wellbeing – in other words, the two variables would be perfectly correlated. The correlation coefficient (r) between W and X1 is therefore a1σ1/σw:

$$r = \frac{a_1 \sigma_1}{\sigma_w}$$

However, this can be either positive or negative depending on the sign of a1. So a natural measure of the explanatory power of a right-hand variable is the squared value of r (which is also often written as $R^2$):

$$r^2 = \frac{a_1^2 \sigma_1^2}{\sigma_w^2}$$
Since the denominator is the variance of wellbeing, this shows what proportion of the variance in wellbeing is explained by the variance of X1.
In the real world, wellbeing depends on more than one variable (see equation [1]). The policy-maker may then ask Which of these variables is producing the largest amount of misery?Footnote 6 For this purpose, we need to compare the explanatory power of the different variables. This is done by computing for each variable its partial correlation coefficient with wellbeing. This partial correlation coefficient is normally described as βj where

$$\beta_j = \frac{a_j \sigma_j}{\sigma_w}$$
This β-coefficient will appear frequently throughout this book.Footnote 7
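These β-coefficients are hugely interesting, as we shall see via two steps. First, starting from equation (1) we can readily derive the following equation.Footnote 8

$$\frac{W_i - \bar{W}}{\sigma_w} = \sum_j \beta_j \, \frac{X_{ij} - \bar{X}_j}{\sigma_j} + \frac{e_i}{\sigma_w} \qquad (6)$$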
Here we have standardised each variable by measuring it from its mean and dividing it by its standard deviation. These standardised equations appear many times in this book.Footnote 9
But, to see the importance of these βs, we move on to a second equation, which is derived from (6).Footnote 10 This says

$$r^2 = \sum_j \beta_j^2 + \sum_g \sum_{k \neq g} \beta_g \beta_k r_{gk}$$

Here $r^2$ is the proportion of the variance of W that is explained by the right-hand variables, and $r_{gk}$ is the correlation coefficient between Xg and Xk.
Thus, the left-hand side is the share of the variance of wellbeing that is explained. The right-hand side consists of $\sum_j \beta_j^2$, which includes all the effects of the independent variation of the Xjs, plus the effects of all their covariances. Thus βj (or the partial correlation coefficient) measures the explanatory power of a variable (just as the correlation coefficient does in a simple bivariate relationship).
But some readers may wonder if this approach can handle independent variables that are binary. It can, because the standard deviation of a binary variable is simply $\sqrt{p(1-p)}$, where p is the proportion of people answering Yes to the binary question. For example, the standard deviation of Unemployed is $\sqrt{u(1-u)}$, where u is the unemployment rate. Thus, if Xj is Unemployed, its β coefficient is $a_j \sqrt{u(1-u)}/\sigma_w$.
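To make the β-coefficients concrete, the sketch below computes them for one continuous and one binary regressor and checks two facts numerically: that the standard deviation of a binary variable equals the square root of p(1 − p), and that the explained share of variance equals the sum of squared βs plus the covariance terms. The simulated data and helper names are assumptions for illustration, and population (not sample) standard deviations are used throughout.

```python
import math
import random
from statistics import fmean

def cov(u, v):
    """Population covariance of two equal-length lists."""
    mu, mv = fmean(u), fmean(v)
    return fmean([(a - mu) * (b - mv) for a, b in zip(u, v)])

def ols2(x1, x2, y):
    """OLS of y on a constant, x1 and x2 (solves the two normal equations)."""
    s11, s22, s12 = cov(x1, x1), cov(x2, x2), cov(x1, x2)
    s1y, s2y = cov(x1, y), cov(x2, y)
    det = s11 * s22 - s12 ** 2
    a1 = (s22 * s1y - s12 * s2y) / det
    a2 = (s11 * s2y - s12 * s1y) / det
    return fmean(y) - a1 * fmean(x1) - a2 * fmean(x2), a1, a2

rng = random.Random(7)
n = 5000
unemployed = [1.0 if rng.random() < 0.1 else 0.0 for _ in range(n)]
log_income = [10.0 - 0.5 * u + rng.gauss(0.0, 0.6) for u in unemployed]
wellbeing = [4.5 + 0.3 * li - 0.7 * u + rng.gauss(0.0, 1.0)
             for li, u in zip(log_income, unemployed)]

a0, a1, a2 = ols2(log_income, unemployed, wellbeing)
sd_w = math.sqrt(cov(wellbeing, wellbeing))
sd_1 = math.sqrt(cov(log_income, log_income))
sd_2 = math.sqrt(cov(unemployed, unemployed))
u_rate = fmean(unemployed)

beta_1 = a1 * sd_1 / sd_w
beta_2 = a2 * sd_2 / sd_w
r_12 = cov(log_income, unemployed) / (sd_1 * sd_2)

# Explained share of variance, computed from the residuals.
resid = [w - a0 - a1 * li - a2 * u
         for w, li, u in zip(wellbeing, log_income, unemployed)]
r_squared = 1.0 - cov(resid, resid) / cov(wellbeing, wellbeing)
# The two cross terms (g=1,k=2 and g=2,k=1) are equal, hence the factor 2.
decomposition = beta_1 ** 2 + beta_2 ** 2 + 2 * beta_1 * beta_2 * r_12

print(f"beta (log income) = {beta_1:.3f}, beta (unemployed) = {beta_2:.3f}")
print(f"explained share = {r_squared:.4f}, decomposition = {decomposition:.4f}")
```

A large coefficient on a rare state, such as unemployment, can thus go with a small β, because few people are in that state.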
Binary dependent variables
The matter is more complicated when it is the dependent variable that is binary. For example, suppose we divide the population into those who are in misery (with wellbeing below say 6) and the rest. How can we handle this? The most natural approach is, as normal, to regress the binary variable on all the other variables. This is what we often do in this book and, since it provides statistics of the standard kind, it is easy to understand.Footnote 11
In analysing the effect of one binary variable on another binary variable, psychologists and sociologists often use the concept of an ‘odds ratio’ rather than the values of aj and βj we have been discussing. Suppose, for example, we ask: How much more likely are unemployed people to be in misery, compared with people who are not unemployed? Imagine 100 people were distributed as follows (Table 7.1):
| | In misery | Not in misery | Total |
|---|---|---|---|
| Unemployed | 2 | 8 | 10 |
| Not unemployed | 9 | 81 | 90 |
| Total | 11 | 89 | 100 |
In this situation, the chance of an unemployed person being in misery (2 in 10) is twice the chance of a non-unemployed person being in misery (9 in 90). The odds ratio is

$$\frac{2/8}{9/81} = 2.25$$
But odds ratios do not answer either of the main questions we are addressing in this chapter. First, if we are interested in the effect on wellbeing of reducing unemployment, the proper measure of this effect is not the odds ratio but the absolute difference in the probabilities of misery between unemployed and non-unemployed people, that is, 0.2−0.1 = 0.1. Second, if we are interested in the power of unemployment to explain the prevalence of misery, the correct statistic is the correlation coefficient between the two. So we shall not be showing odds ratios in this book, though the reader is able to compute them, given the necessary information.
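The three statistics discussed here can all be computed directly from the four cells of Table 7.1. The sketch below does so; the variable names are illustrative, and the correlation between two binary variables is computed with the standard formula for a 2 × 2 table (the phi coefficient).

```python
import math

# Cells of Table 7.1.
a, b = 2, 8    # unemployed: in misery, not in misery
c, d = 9, 81   # not unemployed: in misery, not in misery

p_unemployed = a / (a + b)      # probability of misery if unemployed
p_not_unemployed = c / (c + d)  # probability of misery otherwise

odds_ratio = (a / b) / (c / d)
risk_difference = p_unemployed - p_not_unemployed
# Correlation between the two binary variables (phi coefficient).
phi = (a * d - b * c) / math.sqrt((a + b) * (c + d) * (a + c) * (b + d))

print(f"odds ratio:      {odds_ratio:.2f}")
print(f"risk difference: {risk_difference:.2f}")
print(f"correlation:     {phi:.3f}")
```

The odds ratio looks large, but the correlation is small because only 10 of the 100 people are unemployed.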
Effect size of a binary independent variable
We have so far considered two ways in which to report regression results. One is to report the absolute effect of, say, unemployment on wellbeing in units of wellbeing. The other is to look at the relationship when both variables are standardised. However, there is a third approach that is often useful. This is to measure only the dependent variable in a standardised fashion. For example, we might ask ‘When a person becomes unemployed, by how many standard deviations does his wellbeing go down?’ This is a measure known as the effect size of the independent variable (sometimes known as Cohen’s d):

$$\text{Effect size} = \frac{a_j}{\sigma_w}$$
This is particularly useful when reporting the effect of an experiment.Footnote 12
Experiments
So far, we have been discussing the use of naturalistic data – mainly obtained by surveys of the population. As we have mentioned, it is often difficult to establish the causal effect of one variable on another from this type of data. The simplest way to establish a causal relationship is through a properly controlled experiment. Moreover, if you want to examine the effect of a policy that has never been tried before, it is the best way to get convincing evidence of its effects.
So how do we estimate the effects of being ‘treated’ in an experiment? Let’s begin with a simple example. Suppose we want to try introducing a wellbeing curriculum into a school. Our aim is to see whether it makes any difference to those who receive it. So we would select two groups of pupils who were as similar to each other as possible. Then we would give the wellbeing curriculum to the treatment group (T) but not the control group (C). We would also measure the wellbeing of both groups before and after the treatment. So we would have the following values of wellbeing for each of four situations (Table 7.2).
| | Before | After |
|---|---|---|
| Treatment group (T) | $W_{T0}$ | $W_{T1}$ |
| Control group (C) | $W_{C0}$ | $W_{C1}$ |
To find the average effect of the treatment, we would compare the change in wellbeing experienced by the treatment group (T) with that experienced by the control group (C). Thus, the ‘average treatment effect on the treated’ (ATT) would be estimated as

$$\text{ATT} = (W_{T1} - W_{T0}) - (W_{C1} - W_{C0})$$
In other words, the ATT is the ‘difference in differences’, or for short the ‘diff in diff’.
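As a small worked example, the sketch below computes the difference in differences from the four cell averages of Table 7.2. The wellbeing scores are invented for illustration (four pupils per group), and the variable names are hypothetical.

```python
from statistics import fmean

# Hypothetical wellbeing scores (0-10) for four pupils in each group.
treatment_before = [6.0, 5.5, 7.0, 6.5]
treatment_after = [6.8, 6.1, 7.4, 7.3]
control_before = [6.2, 5.8, 6.9, 6.3]
control_after = [6.4, 5.9, 7.1, 6.6]

w_t0, w_t1 = fmean(treatment_before), fmean(treatment_after)
w_c0, w_c1 = fmean(control_before), fmean(control_after)

change_treated = w_t1 - w_t0  # includes the course and everything else
change_control = w_c1 - w_c0  # everything else only
att = change_treated - change_control

print(f"change in treatment group: {change_treated:.2f}")
print(f"change in control group:   {change_control:.2f}")
print(f"difference in differences: {att:.2f}")
```

Subtracting the control group's change removes whatever happened to both groups between the two periods.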
There may of course be many ways in which both groups changed between periods 0 and 1 – they will become older, they may experience a flu epidemic or whatever. But those changes should be similar for both groups. Thus the only observable thing that can produce a different change in wellbeing is the fact that Group T took the course and Group C did not.
Of course, there may also be some unobservable difference in experience, which means that the ATT is always estimated with a standard error. So, to put things into a more general form, let’s imagine we have observations over a number of years. We then estimate

$$W_{it} = a_0 + a\,T_{it} + v_t + f_i + e_{it}$$

Here $T_{it}$ is a variable which takes the value 1 in all periods after someone has taken the course, $a$ is the effect of the treatment, $v_t$ is a year dummy, $f_i$ is a person fixed effect and $e_{it}$ is random noise.
So far, we have assumed that in our experiment we can easily arrange for the treatment group and the control group to be reasonably similar. This is never in fact completely possible. But the method that gets us closest to it is ‘random assignment’.Footnote 13 In this case, we select an overall group for the experiment and then randomly assign people to either Group T or Group C (e.g., by tossing a coin for each individual). In this way, the groups are more likely to be similar than in any other way. Of course we can then check whether they differ in observable characteristics (X) and we can then allow in our equation for the possibility that these variables affect the measured ATT. Our equation then becomes
Estimating equations like this are quite common.
However, randomisation between individuals is often not practicable. For example, suppose you wanted to test whether higher income transfers raised wellbeing enough to justify the cost. You could not randomly allocate money within a given population – it would be considered unfair since the transfer clearly benefits the recipient. You might, however, choose to transfer money to all eligible people in some areas and not in others, with the allocation between areas being random. This might not be considered unfair. Similarly, suppose you wanted to test the effects of improved teaching of life-skills in schools. Within a school it might be organisationally impossible to give improved teaching to some children and not others – or even to some classes. But you could use random assignment across schools. Or you could even argue that it is ‘quasi-random’ whether a child is born in Year t or Year t − 1; in this case, you could use children born in year t as a control group in the trial of a treatment applied to those born in year t + 1 (see Chapter 9). So all experiments should, if at all possible, use randomisation to reduce the unobservable differences between the treatment and control groups.
Selection bias
But suppose an innovation is made without an experiment and we then want to know its effects. For example, an exercise programme has been established, which some people have decided to adopt. Has it done them any good?
The only information that we have is for the period after the innovation. But we do also have information on people who did not opt in to the programme. So, can we answer our question by comparing the wellbeing of those who took the programme with those who didn’t? Probably not, because the people who opted into the programme may have differed from those who didn’t: they may well have started with higher wellbeing in the first place. So, if we just compared their final wellbeing with that of non-participants, the difference could be largely due to ‘selection bias’.
One method to deal with this is called Propensity Score Matching. In it we first take the whole sample of participants and non-participants and do a logit (or probit) analysis to identify the equation that best predicts whether they participate or not. From this analysis, we can say for every participant what was the probability they participated. We then find, for each participant, a non-participant with the same (or nearly the same) probability of participating. It is those non-participants who become the control group and we now compare their wellbeing with that of the treatment group. This gives us our estimate of the average treatment effect on the treated:

$$\text{ATT} = \bar{W}_{\text{participants}} - \bar{W}_{\text{matched non-participants}}$$
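The matching procedure can be sketched in plain Python. The example below is a simplified illustration under several assumptions: there is a single observed characteristic `x` (hypothetical, say a standardised health score) that drives both participation and wellbeing, the true programme effect is set to 0.5, the logit is fitted by Newton's method in a hand-written `fit_logit` helper, and each participant is matched to the nearest non-participant with replacement. Real applications typically use many characteristics and a statistical package.

```python
import math
import random
from statistics import fmean

def fit_logit(x, t, iterations=25):
    """Logit of t on a constant and x, fitted by Newton's method."""
    b0 = b1 = 0.0
    for _ in range(iterations):
        p = [1.0 / (1.0 + math.exp(-(b0 + b1 * xi))) for xi in x]
        g0 = sum(ti - pi for ti, pi in zip(t, p))
        g1 = sum((ti - pi) * xi for ti, pi, xi in zip(t, p, x))
        h00 = sum(pi * (1 - pi) for pi in p)
        h01 = sum(pi * (1 - pi) * xi for pi, xi in zip(p, x))
        h11 = sum(pi * (1 - pi) * xi * xi for pi, xi in zip(p, x))
        det = h00 * h11 - h01 ** 2
        b0 += (h11 * g0 - h01 * g1) / det
        b1 += (h00 * g1 - h01 * g0) / det
    return b0, b1

rng = random.Random(8)
n = 2000
x = [rng.gauss(0.0, 1.0) for _ in range(n)]
# People with higher x are more likely to opt in (assumed selection).
t = [1 if rng.random() < 1.0 / (1.0 + math.exp(-xi)) else 0 for xi in x]
w = [6.0 + 1.0 * xi + 0.5 * ti + rng.gauss(0.0, 0.5)
     for xi, ti in zip(x, t)]

# Step 1: estimate each person's probability of participating.
b0, b1 = fit_logit(x, t)
score = [1.0 / (1.0 + math.exp(-(b0 + b1 * xi))) for xi in x]

treated = [i for i in range(n) if t[i] == 1]
controls = [i for i in range(n) if t[i] == 0]

# Naive comparison, contaminated by selection.
naive = fmean(w[i] for i in treated) - fmean(w[i] for i in controls)

# Step 2: match each participant to the closest non-participant.
gaps = []
for i in treated:
    j = min(controls, key=lambda k: abs(score[k] - score[i]))
    gaps.append(w[i] - w[j])
att_psm = fmean(gaps)

print(f"naive difference: {naive:.2f}")
print(f"matched estimate: {att_psm:.2f}")
```

Matching can only remove selection on characteristics that are observed; if participants also differ in unobserved ways, the matched estimate remains biased.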
Summary
(1) If W = a0 + ∑aj Xj + e, then the best unbiased way to estimate the values of the ajs is by Ordinary Least Squares (choosing the ajs to minimise the sum of squared residuals $e^2$).
(2) Omitted variables are confounders that can lead to biased estimates of the effect of the variables which are included.
(3) Time series estimation can eliminate any problem caused by omitted variables which are constant over time. Time series can also help to identify a causal effect if this takes place with a lag, so that for example $X_{t-1}$ is affecting $W_t$.
(4) If a right-hand variable is endogenous, it should if possible be instrumented by an instrumental variable that is independent of the error in the equation. Instrumental variables can also help with omitted variables and measurement error.
(5) If an explanatory variable is measured with error, its estimated coefficient will be biased towards zero. This problem can again be solved by using an instrumental variable uncorrelated with the original measurement error.
(6) All regression estimates are estimated with ‘standard errors’ (se). The 95% confidence interval is the coefficient ± 2 se. Provided this interval does not include zero, the coefficient is ‘significantly different from zero at the 95% level’. But the coefficient estimate is more interesting than its significance.
(7) To find the explanatory power of the different variables, we run the equation using standardised variables, that is, the original variables minus their mean and divided by their standard deviation. The resulting coefficients (βj) – or partial correlation coefficients – reflect the explanatory power of the independent variation of each variable Xj. They are equal to ajσj/σw where w is the dependent variable.
(8) The surest way to determine a causal effect is by experiment. The best form of experiment is by random assignment. We then measure the wellbeing of the treatment and the control group before and after the experiment. This difference-in-difference measures the average treatment effect on the treated.
(9) Where random assignment is impossible, naturalistic data can be used and the outcome for the treatment group compared with a similar untreated group chosen by Propensity Score Matching.
(10) If the measured effect of a treatment is a (in units of the outcome variable W), the ‘effect size’ is a/σw.
We can now put these tools to work.