1 The Measurement of Happiness

Happiness is typically measured by self-report and cross-national studies on happiness mostly use single questions. An example of such a frequently used question is: “Taking all things together, how would you say things are these days–would you say you are… ?” The respondent is requested to make a choice out of e.g. four possible ratings:

$$ \begin{array}{*{20}c} {\square \quad \hbox{``}{\text{unhappy}}\hbox{''}\quad \left( {{\text{R}}_{ 1} } \right)} \hfill \\ {\square \quad \hbox{``}{\text{not too happy}}\hbox{''}\quad \left( {{\text{R}}_{ 2} } \right)} \hfill \\ {\square \quad \hbox{``}{\text{pretty happy}}\hbox{''}\quad \left( {{\text{R}}_{ 3} } \right)} \hfill \\ {\square \quad \hbox{``}{\text{very happy}}\hbox{''}\quad \left( {{\text{R}}_{ 4} } \right)} \hfill \\ \end{array} $$

In this example, happiness is rated by the respondent on a 4-step verbal rating scale. In this context, the possible ratings are referred to as ‘categories’. This term stems from “the method of successive categories”, the name in use among psychometricians for the above method of measurement; see e.g. Guilford (1954, Chap. 10).

In the World Database of Happiness (Veenhoven 2010), further abbreviated WDH, a set of one question and all admissible responses to that question is referred to as a “measure of happiness”, previously as an “item”. The WDH gathers a great many alternative measures (about 1250 by the end of 2009) that have been reported as used in at least one survey or other study: not only verbal scales, but also numerical scales, pictorial scales using ‘smilies’ and other graphical scales.

In most of them, the respondent has to select one out of a limited number of discrete ratings. In the above example, the four possible responses are denoted as \( R_{1} \), \( R_{2} \), \( R_{3} \) and \( R_{4} \) respectively. In general we shall use the symbol \( R_{j} \) for the j-th response, being a member of a set of k possible alternatives, written as \( \{R_{j} \mid j = 1(1)k\} \); in the above example k = 4.

The notation j = 2(1)5 means that the variable j ranges from 2 to 5 with steps of size 1, so in this case j = 2, 2 + 1 = 3, 3 + 1 = 4 or 4 + 1 = 5.

In this example, \( R_{1} \) corresponds to the most unhappy situation and \( R_{k} \) to the happiest one. This is the most frequently occurring choice and in this paper, we will assume that this choice has been made. In the case of a scale with \( R_{1} \) as the happiest situation, a simple reversal of the order of the code numbers will enable the application of the methods described in this paper.

Questions of the above type are presented to members of a sample from a population, e.g. some nation, in order to obtain information about the happiness situation in that population. The happiness distribution of such a community is defined as the probability distribution of the individual happiness values of all members of this community. This distribution is unknown, but its parameters should be estimated from the frequency distribution of the individual happiness values in the sample that represents that population. The average value and the standard deviation can be estimated from the corresponding frequency distribution parameters of the k responses {R j } in the sample that represents the society of the study.

The basic results in this type of investigation are the counted absolute frequencies \( \{n_{j}\} \) at which members of that sample with size N select one out of the k alternatives \( \{R_{j} \mid j = 1(1)k\} \). Respondents who report “Don’t know” or who do not make any choice are ignored in this context.

From these absolute frequencies, we can compute the k relative frequencies \( \{f_{j} := n_{j}/N\} \) and the k cumulative relative frequencies \( \{F_{j} \mid j = 1(1)k\} \), which in the above example are defined as

$$ F_{ 1} := f_{ 1} $$
$$ F_{ 2} := f_{ 1} + f_{ 2} $$
$$ F_{ 3} := f_{ 1} + f_{ 2} + f_{ 3} ,\quad {\text{and}} $$
$$ F_{ 4} := f_{ 1} + f_{ 2} + f_{ 3} + f_{ 4} \left( { = 1} \right) $$

Here the symbol “:=” means “is defined as”. In general, \( F_{j} := \sum\nolimits_{i = 1}^{j} f_{i} \).

So, the total basic information can be summarized as \( \{N; F_{j} \mid j = 1(1)k\} \) under the condition \( 0 \le F_{1} \le F_{2} \le \cdots \le F_{k-1} \le F_{k} = 1 \).
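As a minimal sketch of this bookkeeping, the following Python fragment computes the relative and the cumulative relative frequencies from a list of counts. The function name `frequency_summary` and the counts are hypothetical and serve only as an illustration; they are not taken from any survey.

```python
from itertools import accumulate


def frequency_summary(counts):
    """Return (N, relative frequencies f_j, cumulative relative frequencies F_j)."""
    n_total = sum(counts)
    if n_total == 0:
        raise ValueError("at least one valid response is required")
    rel = [n / n_total for n in counts]
    cum = list(accumulate(rel))
    cum[-1] = 1.0  # F_k = 1 by definition; this guards against rounding drift
    return n_total, rel, cum


# Hypothetical counts n_1..n_4 for the four-category example.
# "Don't know" and missing responses are assumed to have been dropped already.
counts = [50, 150, 500, 300]
N, f, F = frequency_summary(counts)
```

The pair `(N, F)` then carries the complete basic information referred to above.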

The central issue in this paper is how to convert this information into valid and useful information on the population that is represented by the sample in which the measurements have been performed. There are two major problems in this.

The first one is that happiness, as it is measured above, is always a variable at the ordinal level of measurement. It is common practice to replace the various {R j } with the corresponding j-value as a code, but these k code numbers are essentially ordinal numbers. This implies that it is not admissible to subject them to addition, multiplication or other arithmetical operations, which are applied in the calculation of average values, standard deviations and other current descriptive statistics; such operations are defined on cardinal numbers only. So we have to find a solution for the “cardinalization problem”: how to transform the ordinal code numbers into cardinal numbers?

A second major problem is that in happiness studies happiness is measured with different rating scales, which may even have different numbers of ratings. So there is a need to transform the happiness values as they are measured primarily to a common secondary rating scale. For this common secondary scale, a scale on the interval [0, 10] is the usual choice, where the upper end always represents the most happy situation and “0” the most unhappy one.

Since in practice the solutions of both major problems are interconnected, we shall discuss them jointly.

Plan of this Paper

In Sect. 2, we shall discuss some of the methods by which the cardinalization problem is solved in practice together with the transformation to a common secondary scale. As one of the ways out, the international “Happiness Scale Interval Study” (HSIS) is proposed. For this approach, a model is presented in Sect. 3, where a continuous happiness variable is postulated, which is mapped onto a discrete scale of measurement. In Sect. 4, the underlying assumptions are specified in more detail. In Sect. 5, three possible models are described to convert the basic measurement information into information about the happiness distribution within the population that is assumed to be represented by the sample from which the observations have been obtained. In Sect. 6, we start with a brief description of how the HSIS runs in practice and what has been achieved until now. As an illustration, we present the results of the application to the happiness data in 20 Dutch surveys in the period 1990–2008. On that basis we recommend the application of a specific happiness distribution model, which is not the most attractive from a validity point of view, but which allows the construction of confidence intervals for the mean population happiness value in e.g. a nation. Moreover, this section lists the potential merits of the proposed approach. To what extent these expectations are empirically confirmed will be described in a separate paper.

2 The Cardinalization Problem

The traditional approach for the further condensation of the counted frequencies is to consider happiness as a discrete variable, which can adopt only a limited number (k) of different values, which number has been chosen by the investigator. As has been pointed out above, the responses are recorded as code numbers, R j being recorded as a “rating = j”.

For the subsequent processing, one has to solve the already mentioned cardinalization problem. Three alternatives will be discussed below:

  (1) Simple cardinalization by direct stretching;

  (2) Thurstone values and related approaches;

  (3) The happiness scale interval approach.

There are more alternatives, but a discussion on these is outside the scope of this paper.

2.1 First Alternative: Simple Cardinalization and Linear Stretching

The most frequently occurring solution is to fully ignore (1) the label of the categories, e.g. “unhappy”, and (2) the distinction between ordinal and cardinal numbers. Although the ratings {j} are code numbers and hence are essentially ordinal numbers, they are treated as if they were cardinal. In that case, the various possible ratings are treated as equidistant numbers on a metric [1, k] scale, in our case integer numbers in the closed interval [1, 4]. Such a scale will be referred to as “pseudo-metric”.

For comparing results obtained by using different scales, the results of the primary numerical scale are often subjected to ‘direct rescaling’ or ‘stretching’, which is a linear transformation onto a common ‘secondary’ scale. This linear scale transformation has been described in e.g. Veenhoven and Kalmijn (2005, Appendix C) and in Kalmijn (2010, Appendix B).

In the above example, the primary scale is a [1, 4] scale. For the common secondary scale, we select the [0, 10] scale as usual. Then the result of the [1, 4] scale transformation would be

  • 1 → 0

  • 2 → 3.33

  • 3 → 6.67

  • 4 → 10
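The stretching above can be sketched in a few lines of Python. The function name `stretch` and its parameters are illustrative assumptions; the formula is simply the linear map that sends 1 to the lower end and k to the upper end of the secondary scale.

```python
def stretch(rating, k, lower=0.0, upper=10.0):
    """Map a rating on the primary [1, k] scale linearly onto [lower, upper]."""
    if k < 2:
        raise ValueError("a scale needs at least two ratings")
    if not 1 <= rating <= k:
        raise ValueError("rating must lie in [1, k]")
    return lower + (rating - 1) * (upper - lower) / (k - 1)


# The four-step example, rounded to two decimals.
secondary = [round(stretch(j, 4), 2) for j in range(1, 5)]
print(secondary)  # [0.0, 3.33, 6.67, 10.0]
```

Note that the code reproduces the arithmetic only; whether the transformation is meaningful depends on the assumptions discussed in this section.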

The three underlying assumptions for such a linear scale transformation can be summarized for this example as

  (a) 1 → 0, where “0” on the common secondary scale expresses feelings that are identical to the feelings corresponding to either the lowest or, for inverted scales, the highest rating on all primary scales, irrespective of the phrasing of that category,

  (b) k = 4 → 10 in a similar way, and

  (c) the primary scale is ‘metric’, i.e. the k ratings are considered to represent equidistant happiness intensity feelings, and so are the corresponding secondary values.

2.2 Second Alternative: Thurstone Values and Related Approaches

A possible alternative might be to request all members of a panel to place k marks on a line, one for each of the possible responses, e.g. “Please place a mark on this line at the position which you feel is the most appropriate for the judgment ‘pretty happy’, irrespective of your personal happiness judgment”. The ‘upper’ end (10) of the line represents the most happy conceivable situation of the respondent personally and the ‘lower’ end (0) the most unhappy conceivable one. For each category, the average of the positions given by all panel members is adopted as the transformed position of that category on the [0, 10] scale.

Jones and Thurstone (1955) describe a method in which they presented 51 verbal qualifications to a panel of 905 respondents, who were requested to select the most appropriate appreciation rating on a 9-point Likert scale for each qualification separately. As a result, the 51 qualifications could be mapped on a common interval scale.

Ehrhardt has proposed to apply the basic idea of this method in a similar way to the WDH on the basis of expert ratings. In 1993 Veenhoven and twelve co-workers, all involved in happiness studies at the Erasmus University Rotterdam (NL), were asked independently to assign the number they considered the most appropriate for the position of ratings in the interval [0, 10] on a scale which was presented as continuous. This was done for each of 29 categories that were current in a number of verbal happiness measures in happiness research. The average values obtained in this way are included in the WDH and referred to as “Thurstone values”, although “Jones-Thurstone values” might have been more correct. On this basis, average values and standard deviations of samples are computed by simply replacing the ordinal numbers of the categories with the corresponding Thurstone values.

In the WDH, extensive use is made of this method, in particular for verbal scales with 3 or 4 possible ratings, for which the application of direct rescaling is highly debatable. Although these Thurstone values have been established for one specific language (English), just as Jones and Thurstone did, it is current practice in the WDH to apply them to other languages as well.

A similar study was run by Bartram and Yelding (1973) among 166 adult regular London ITV-watchers. A number of their qualifications overlapped those of the Thurstone values; the absolute differences of the numerical values range from 0.1 to 0.7, which differences are of the same order of magnitude as the inaccuracy of those numbers.

It should be noted that the procedure according to Ehrhardt was not completely identical to that of Jones and Thurstone, nor to that of Bartram and Yelding, since Ehrhardt engaged experts, as opposed to the 905 non-experts of Jones and Thurstone and the 166 of Bartram and Yelding.

2.3 Objections to the Above Approaches

The procedures for measuring happiness and their underlying assumptions, as described above, are by no means uncontested, but as long as no suitable alternatives are available, this has hardly any consequences. At least four objections emerge at (ir)regular intervals.

An obvious criticism with respect to the simple cardinalization concerns the equidistance assumption, lacking any evidence for small k-values. The Thurstone and related methods claim to resolve this problem.

Second, there is a validity problem in the approach in which happiness is measured as a discrete variable, in its relationship to happiness as a psychological concept. The respondent has to make a forced choice out of a limited number of alternatives. However, if we consider happiness as the intensity of something in a subject’s personal situation, it is obvious to look for a continuous variable rather than a discrete one. If we managed to construct some variable that is related to happiness as measured above and that is continuous at the same time, this would improve the validity, at least in this respect.

The third class of objections especially concerns the verbal happiness rating scales. Differences between e.g. “unhappy”, “not too happy” and “extremely unhappy” are ignored as long as they refer to the lowest category of the scale. Moreover, in the comparison of studies in different nations, the usual assumption is that for Spanish people “feliz” has exactly the same significance or meaning as “happy” has for the British. Besides, it is questionable whether this meaning is the same for the Australians and for the (i.e. all) US citizens. As long as we are unable to demonstrate the existence of differences in this respect, we tend simply to declare them non-existent.

Finally, there is a problem caused by the fact that happiness is measured by self-report, not only in different languages, but also by using scales with structural differences. Not all of them have equal numbers of possible ratings. Examples are known in which the same verbal expression is part of two or more scales with different values of k. It is most doubtful to assume that such an expression has an identical meaning in these different contexts.

The above objections to this practice do not concern all scales to the same extent. There exists a type of scales, known as the “Best-Worst Ladder Scales”, that meets reasonably well all three underlying assumptions for direct rescaling. As an example, we mention the adapted version of Cantril’s self-anchoring ladder rating of life (Cantril 1946; Kilpatrick and Cantril 1960). The respondent is presented with Fig. 1 and with the question: “Here is a picture of a ladder. The ‘10’ at the top of the ladder means the best possible life you can imagine. The ‘0’ at the bottom of the ladder means the worst possible life you can imagine. On which place of the ladder is your life as a whole? Please mark the number that best corresponds with how you feel about your life now.”

Fig. 1 Cantril’s ladder scale

On the other hand, violation of the assumptions is presumably rather strong for verbal scales such as the one described in Sect. 1. Especially for relatively small values of k, say for k ≤ 4, we strongly advise against linear scale transformation.

2.4 Third Alternative: The Happiness Scale Interval Study (HSIS)

In order to counter a number of the above problems, Veenhoven (2009) has started his international “Happiness Scale Interval Study”. In this study, local judges are requested to partition the total [0, 10] continuum into k intervals in such a way that each of them corresponds to one of the k possible response ratings. In the example, each panel member has to identify his or her subjective boundary between “unhappy” and “not too happy”, as (s)he sees that boundary, and (s)he is expected to do so irrespective of his or her own happiness. More details are given in Sect. 6.

The proposed approach does not pretend to solve all problems concerning measurement of happiness, nor that of life satisfaction etc., but it (cl)aims at reducing at least a number of them, especially the above ones.

3 The Model Underlying the Happiness Scale Interval Study

The model underlying the Happiness Scale Interval Study postulates the existence of a variable, here denoted H, that—in this application—expresses the intensity of the feelings of happiness of a respondent. In this description, we will deal with the application of the model to the measurement of happiness, but it is equally applicable to the measurement of life satisfaction or some other related subjective self-judgment of the respondent’s hedonic situation.

To this variable H the following properties are assigned:

  I. H is postulated to be a variable, measured at the metric level of measurement and expressed as a real number in the closed interval [0, 10];

  II. the value H = 0 represents the respondent’s subjectively worst conceivable situation with respect to his or her happiness, whereas H = 10 represents the subjectively best conceivable situation. This choice excludes the possibility of any H-value outside the [0, 10] interval;

  III. H is an intensity variable and is a strictly increasing continuous function of the happiness intensity as experienced by the respondent: if a person at the moment \( t_{2} \) feels happier than at the moment \( t_{1} \), then \( h_{2} > h_{1} \), where \( h_{1} \) and \( h_{2} \) are the H-values at \( t_{1} \) and \( t_{2} \) respectively;

  IV. the variable H is a latent variable. It is unobservable as such, but can be mapped by the respondent onto a set of k different verbal, numerical or pictorial observable ordered qualifications (ratings) \( \{R_{j} \mid j = 1(1)k\} \), k being a natural number, usually k ≤ 12. The order of the qualifications is assumed to be unambiguous;

  V. the interval [0, 10] can be partitioned into k contiguous subintervals, each of which is defined as the subset of H-values that are mapped to the same image. All these intervals are half-open intervals that are closed on the right, except the interval including the value H = 0, which is closed;

  VI. the above mapping is monotonic, while the subinterval with the largest H-values is mapped onto the happiest qualification \( R_{k} \);

  VII. the variable H is a random variable; within a population, it has a probability distribution: different individuals in that population will have a happiness which is represented by generally different H-values. In general, different populations will have different probability distributions of H. These are of the same type, but have different values of the parameters;

  VIII. except for H = 0 and H = 10, the H-values of the subinterval boundaries are subjective, since the interpretation of the possible responses is subjective as well. This applies especially to verbal qualifications, which may have a strong cultural component. Not only the language/nation combination will influence their interpretation, but also conditions such as social class, age etc.; moreover, the emotional value of terms may shift over time. Therefore, in linking H-values to qualifications, especially the verbal ones, some degree of variability in the results is to be expected.

As an example, we consider the next situation (Fig. 2).

Fig. 2 Representation of the model for the Happiness Scale Interval Study

In this model, there is a one-to-one correspondence between each of the k presented different qualifications \( R_{j} \) and one of the intervals of {h}. The upper boundary of the jth subinterval will be denoted as \( b_{j} \) and this half-open subinterval as \( (b_{j-1}, b_{j}] \), with j = 1(1)k, \( b_{0} = 0 \) and \( b_{k} = 10 \). For convenience, the set \( \{(b_{j-1}, b_{j}],\ j = 1(1)k\} \) is assumed to include the closed interval \( [b_{0}, b_{1}] \) as well. The values \( \{b_{j};\ j = 1(1)k - 1\} \) are also referred to as ‘cut points’; however, this term is usually extended to include also the values \( b_{0} = 0 \) and \( b_{k} = 10 \). We shall use the terms “boundary values” and “cut points” as synonyms.

In this way, there is also a one-to-one relation between each qualification \( R_{j} \) and the mid-interval value (further abbreviated MIV) of the jth interval, which is defined as \( m_{j} := \tfrac{1}{2}(b_{j-1} + b_{j}) \).
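The mapping from an H-value to a category and the computation of the MIV can be sketched as follows. This is only an illustration under stated assumptions: the function names are hypothetical and the equal-width boundaries are invented for the example, not results of the HSIS.

```python
from bisect import bisect_left


def category_of(h, boundaries):
    """Return j such that h lies in (b_{j-1}, b_j]; h = b_0 is assigned to j = 1."""
    if not boundaries[0] <= h <= boundaries[-1]:
        raise ValueError("h lies outside the scale")
    # bisect_left returns the first index i with boundaries[i] >= h,
    # which is the index of the upper boundary b_j of the interval holding h.
    return max(1, bisect_left(boundaries, h))


def mid_interval_values(boundaries):
    """Return m_j = (b_{j-1} + b_j) / 2 for j = 1..k."""
    return [(lo + hi) / 2 for lo, hi in zip(boundaries, boundaries[1:])]


# Hypothetical cut points b_0..b_4 for a four-category scale.
b = [0.0, 2.5, 5.0, 7.5, 10.0]
mivs = mid_interval_values(b)  # [1.25, 3.75, 6.25, 8.75]
```

With these boundaries, `category_of(2.5, b)` gives 1 and `category_of(2.6, b)` gives 2, which reflects the right-hand closed intervals of the model.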

4 Further Assumptions of the Model

In an ideal world, there would be complete consensus about the H-values of all subinterval boundaries. However, under VIII in the previous section, it has already been pointed out why individual opinions on the same boundary are expected to differ.

Each panel member is requested to report the value of H at which in his personal opinion a shift ought to be made towards a “more happy judgment category”. The average value of these judgments is adopted as the estimated cut point position to be used in the application phase later on.
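A minimal sketch of this averaging step is given below. The panel judgments are invented for the illustration, and the function name `estimated_cut_points` is hypothetical; each panel member is assumed to supply the k − 1 interior boundaries in ascending order.

```python
from statistics import mean


def estimated_cut_points(judgments, lower=0.0, upper=10.0):
    """Average the interior boundaries over all panel members.

    judgments: one list of k-1 interior boundaries per panel member.
    Returns the k+1 estimated cut points b_0..b_k, end points included.
    """
    n_interior = len(judgments[0])
    if any(len(j) != n_interior for j in judgments):
        raise ValueError("every panel member must report the same number of boundaries")
    interior = [mean(column) for column in zip(*judgments)]
    return [lower, *interior, upper]


# Three hypothetical panel members judging a four-category scale.
panel = [
    [2.0, 5.0, 8.0],
    [3.0, 5.5, 8.5],
    [2.5, 4.5, 7.5],
]
cut_points = estimated_cut_points(panel)  # [0.0, 2.5, 5.0, 8.0, 10.0]
```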

The basic assumption of this approach is that every respondent in the application phase with \( R = R_{j} \) will report this rating on the basis of a happiness feeling that corresponds to an H-value in the interval \( (b_{j-1}, b_{j}] \). However, it is conceivable that for some respondent i in the sample \( b_{j} < H_{i} < (b_{j})_{i} \), where \( b_{j} \) is the estimated cut point position as obtained in the construction phase, \( (b_{j})_{i} \) his personal opinion on the position of the boundary between the j-th and the (j + 1)-th interval, and \( H_{i} \) his personal happiness value. This respondent will report “\( R_{j} \)”, and in this way the observed frequency of the j-th category is overestimated. This bias may, however, be compensated by another respondent for whom \( (b_{j})_{i} < H_{i} < b_{j} \). Unless the distribution of individual opinions around their average value is very skewed, the net bias is assumed to be negligible and we will make this assumption, at least for the moment.

Two identical phrasings, but within different items, are judged in the HSIS separately and independently within each item. This practice was not applied to the determination of the Thurstone values nor to similar other approaches. The proposed practice is justified by a comparison of the mid-interval values (MIV) of the judgment “very satisfied” within two different items of the WDH as an example. The item coded O-SLW/c/sq/v/5/p raises the question “All things together, how satisfied are you with your life as-a-whole these days?” with five response categories: completely satisfied/very satisfied/satisfied/not very satisfied/not at all satisfied. In item O-SLS/c/sq/v/3/a it is asked: “How satisfied are you with the way you are getting on now?” with three response categories: very satisfied/all right/not at all. On a [0, 10] scale, the MIV of “very satisfied” for these different questions with different alternatives were 7.6 and 8.9 respectively, which demonstrates that the other categories and their phrasings should not be ignored.

Intuitively, one might expect that the average result of all respondents in the determination of the Thurstone and related values, whether or not done by experts, is a good estimate for the MIV as defined in the HSIS. The answer to the question whether this expectation is correct is negative, at least in general. The reason is that the k MIV are not mutually independent. They have to satisfy a simple criterion which can be described as follows: write down the supposed MIV in descending order of magnitude and connect them with alternating minus and plus signs, starting with a minus sign. Then the result in the case of a [0, 10] scale should be equal to 5. In the case of k = 4, one gets \( m_{4} - m_{3} + m_{2} - m_{1} = 5 \). If the ‘alternating sum’ ≠ 5, the \( \{m_{j}\} \) cannot be considered to be MIV. The proof of this rule is to be found in Kalmijn (2010, Appendix F3).
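For k = 4, the rule can be made plausible by substituting the definition of the MIV, \( m_{j} := \tfrac{1}{2}(b_{j-1} + b_{j}) \), with \( b_{0} = 0 \) and \( b_{4} = 10 \). The following is only a sketch for this special case and does not replace the proof cited above; all interior boundaries cancel:

$$ m_{4} - m_{3} + m_{2} - m_{1} = \tfrac{1}{2}\left[ \left( b_{3} + b_{4} \right) - \left( b_{2} + b_{3} \right) + \left( b_{1} + b_{2} \right) - \left( b_{0} + b_{1} \right) \right] = \tfrac{1}{2}\left( b_{4} - b_{0} \right) = 5 $$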

After substitution of (the positions of) some set of four marks in the above equation, the ‘alternating sum’ will in general differ from 5, and in that case these four average positions \( \{m_{j}\} \) cannot be considered to be a set of unbiased estimates of the MIV. In case of modest departures from this condition, some adjustment procedure of the mark positions may be a ‘solution’ to deliver a more or less valid estimation of the MIV. In practice, however, it appears that acceptable results are rarely obtained along these lines.

Consequently, generally speaking, Thurstone values cannot be considered as pseudo-MIV, since usually they do not satisfy our criterion that their alternating sum equals the value 5. This is easily demonstrated for the scale example in Sect. 1. The Thurstone values of the four responses in the WDH have been agreed to be {0.6; 4.1; 6.7; 9.3}. Since 9.3 − 6.7 + 4.1 − 0.6 = 6.1 ≠ 5, the set of Thurstone values of this item clearly does not satisfy our MIV criterion, in this particular case not even approximately. This can also be demonstrated by the graphical representation below. Suppose that all Thurstone values are MIV, and that at least the largest three of them are correct. Then the boundary values are {0; 3.4; 4.8; 8.6; 10}. Consequently, the smallest Thurstone value in this case should be 1.7 and not 0.6.
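The arithmetic of this example can be retraced with a short Python sketch. The function names are hypothetical; the only relation used is \( b_{j-1} = 2m_{j} - b_{j} \), which follows directly from the definition of the MIV, applied downwards from \( b_{k} = 10 \).

```python
def alternating_sum(mivs):
    """m_k - m_{k-1} + m_{k-2} - ... for values given in ascending order."""
    return sum(m if i % 2 == 0 else -m for i, m in enumerate(reversed(mivs)))


def boundaries_from_upper_mivs(mivs, lower=0.0, upper=10.0):
    """Rebuild b_0..b_k from m_2..m_k, working down from b_k = upper.

    Uses b_{j-1} = 2 * m_j - b_j; the smallest value mivs[0] is NOT used.
    """
    bounds = [upper]
    for m in reversed(mivs[1:]):
        bounds.append(2 * m - bounds[-1])
    bounds.append(lower)
    return bounds[::-1]


thurstone = [0.6, 4.1, 6.7, 9.3]  # values quoted in the text
print(round(alternating_sum(thurstone), 1))            # 6.1

bounds = boundaries_from_upper_mivs(thurstone)
print([round(x, 1) for x in bounds])                   # [0.0, 3.4, 4.8, 8.6, 10.0]

implied_m1 = (bounds[0] + bounds[1]) / 2
print(round(implied_m1, 1))                            # 1.7
```

Rounding to one decimal is applied only for display, since the intermediate results are subject to floating-point error.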

5 Conversion of the Sample Data to Information About the Population Happiness Distribution

The happiness distribution of a community is defined as the probability distribution of the individual H-values of members of that community. This population probability distribution is unknown, but it can be estimated from the frequency distribution of the individual H-values in the sample that represents that population. The expected or mean value and the standard deviation can be estimated from the corresponding frequency distribution parameters of the k responses {R j } in the sample that represents the community of the study.

If the variable H is assumed to be a random variable, it will have a cumulative distribution function, denoted as G(h):= Probability {H ≤ h}. This G(h) is a monotonically nondecreasing function of h with G(−∞) = 0 and G(∞) = 1.

In the case H is assumed to be a discrete random variable, G(h) is a step function with k steps, one at each value h that H can adopt, the size of the j-th step being Prob{H = h j }.

If however H is assumed to be a continuous variable, G(h) is a continuous function. Now we define:

$$ g\left( h \right)\,:= \,{\frac{dG\left( h \right)}{dh}} $$

provided it exists; this derivative is called the probability density function (p.d.f.) of H. Whether or not g(h) exists depends on the further assumptions made on G(h).

We will discuss three possible models, which have been represented in Fig. 3. Under the model described in Sect. 3, it is assumed that each respondent with a happiness feeling corresponding to any H-value in the interval (b j−1, b j ] will respond as R j . However, all we know is the number of respondents with R j , but it is unknown which H-value in the interval (b j−1, b j ] belongs to each of them. Therefore, we have to make assumptions on the unknown distribution of H over [0, 10], more precisely, over each of the k intervals ⊂ [0, 10]. The three models differ in these underlying assumptions.

  I. In model I, it is assumed that all respondents giving the same response \( R_{j} \) are equally happy and have the same H-value, for which the MIV of the jth interval is the obvious one to be selected. These k responses are the only ones available, not only for the sample members, but also in the population as a whole. In other words, the population probability distribution of H is assumed to be discrete with only k possible H-values.

  II. The variable H is assumed to be continuous and has a distribution which is uniform over each of the k intervals.

  III. The variable H is assumed to be a continuous variable with a beta distribution. From the observations, estimates of the two model parameters α and β are calculated. Subsequently, estimates of the mean and the variance of the distribution are calculated on the basis of these estimates of α and β.

Fig. 3 Probability values and densities (left) and cumulative probabilities (right) for h ∈ [0, 10] in three models: I (discrete distribution), II (semi-continuous distribution) and III (beta distribution), all on the basis of a four-point rating scale

A more detailed description of the three models will be given below.

An important property of any estimator is whether it is biased or not. If θ is a parameter, or a function of one or more parameters, of the probability distribution of some random variable, and is estimated by a statistic \( \hat{\theta} \) with expectation \( \text{E}(\hat{\theta}) \), then the bias of \( \hat{\theta} \) is defined as the difference \( \text{E}(\hat{\theta}) - \theta \); θ may be either a scalar or a vector, and \( \hat{\theta} \) accordingly. It should be emphasized that a bias is defined only if the distribution of the statistic is known and that it depends on which type of probability distribution is adopted for the random variable. Hence the same statistic, which is an unbiased estimator for some parameter in model I and/or II, may not necessarily be unbiased for the same parameter in e.g. model III.

5.1 Model I: The Discrete Approach

One way out could be to locate all respondents in the middle of the interval and to use the MIV as an estimate of the H-value of all of them.

This approach is rather similar to the traditional one and yet considers happiness as a discretely distributed variable. The essential difference is the replacement of the transformed code number of the categories with the empirical MIV, but the conversion of sample results into information on the population happiness distribution follows identical lines.

In the traditional approach, it is very unusual to specify the probability distribution in the population explicitly. Implicitly, the situation in the population is assumed to be structurally identical to that of the sample, only with a larger size. The same assumption is made in this model I. This means that the population probability distribution is assumed to be a discrete polytomous distribution with 2k parameters: k for the probabilities \( \{ \pi_{j} \mid 0 \le \pi_{j} \le 1,\ j = 1(1)k,\ \sum \pi_{j} = 1 \} \) and k for the mid-interval values, of which 2(k − 1) parameters are independent. The parameter \( \pi_{j} \) is defined as the probability that an individual, ‘selected’ at random from the population, will report \( R_{j} \). These probabilities are estimated by the k relative frequencies in the sample. In that case the sample mean is an unbiased estimator of the mean of the population happiness distribution. The second moment about the mean of the sample is made an unbiased estimator of the population variance by the application of Bessel’s correction, i.e. by replacing the denominator N with N − 1. Its square root systematically underestimates the population standard deviation σ, but since this estimator is consistent, the sample size is usually large enough for this bias to be neglected.

In this model, the cumulative probability distribution \( G(h) := \text{Prob}\{H \le h\} \) is a step function with a step of size \( \pi_{j} \) at \( H = b_{j} \) for j = 1(1)k − 1, where at each step the value of G(h) is the higher one.
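The estimators described for model I can be sketched as follows. The counts and the MIV in this fragment are hypothetical, and the function name `model_one_estimates` is an assumption made for the illustration.

```python
from math import sqrt


def model_one_estimates(counts, mivs):
    """Sample mean, Bessel-corrected variance and its square root under model I.

    Every respondent in category j is assigned the mid-interval value m_j.
    """
    n_total = sum(counts)
    if n_total < 2:
        raise ValueError("at least two respondents are required")
    mean = sum(n * m for n, m in zip(counts, mivs)) / n_total
    sum_sq = sum(n * (m - mean) ** 2 for n, m in zip(counts, mivs))
    variance = sum_sq / (n_total - 1)  # Bessel's correction: N - 1 instead of N
    return mean, variance, sqrt(variance)


# Hypothetical counts and MIV for a four-category scale.
counts = [50, 150, 500, 300]
mivs = [1.25, 3.75, 6.25, 8.75]
mean_h, var_h, sd_h = model_one_estimates(counts, mivs)  # mean_h is 6.375
```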

5.2 Model II: The ‘Semi-Continuous’ Model

A second alternative is to assume that all H-values in an interval are equally likely, i.e. to assume a uniform distribution of H over each of the k intervals separately. In that case, consecutive points in the cumulative distribution plot with co-ordinates \( (b_{j-1}, G(b_{j-1})) \) and \( (b_{j}, G(b_{j})) \) are connected by straight line segments, making G(h) a broken line with kinks in all cut points where \( H = b_{j} \). At these H-values, G(h) is not differentiable, so there g(h) does not exist. Consequently, in this approach g(h) is a step function with steps in \( H = b_{j} \) for all j = 1(1)k and horizontal lines of different elevations in between. In other words: at each cut point, the probability density changes stepwise and then remains constant until the next boundary/step.

As long as no explanation can be offered for such steps at a number of points, all selected by the investigator, such a model is not very satisfactory. A sufficiently realistic model should at least satisfy the condition that its p.d.f. is continuous over the complete interval (0, 10). We refer to the model II as “semi-continuous”, since it assumes the happiness variable H to be continuous, while its probability density function is not.

Just like the model I, the model II has 2k − 2 parameters. As long as no better alternative is available, we have to accept this model. The consequences of this assumption for the estimation of the population mean and variance have been described in Kalmijn (2010, Appendix F1), including those for the precision of these estimators.
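To make model II concrete, the sketch below builds the piecewise-linear G(h) and computes the mean and standard deviation implied by a uniform distribution within each interval. It uses the standard decomposition of the variance into a between-interval part and a within-interval part (width squared divided by 12 for a uniform distribution); this is an assumption of the sketch and is not necessarily identical to the estimators in Kalmijn (2010, Appendix F1). The function name and the data are hypothetical.

```python
from bisect import bisect_right
from math import sqrt


def model_two(cut_points, rel_freqs):
    """Piecewise-linear CDF, mean and SD under uniformity within each interval."""
    bounds = [0.0, *cut_points, 10.0]
    cum = [0.0]
    for f in rel_freqs:
        cum.append(cum[-1] + f)  # cumulative frequency at each boundary

    def cdf(h):
        if h <= 0.0:
            return 0.0
        if h >= 10.0:
            return 1.0
        j = bisect_right(bounds, h) - 1  # bounds[j] <= h < bounds[j + 1]
        lo, hi = bounds[j], bounds[j + 1]
        return cum[j] + rel_freqs[j] * (h - lo) / (hi - lo)

    mids = [(lo + hi) / 2.0 for lo, hi in zip(bounds, bounds[1:])]
    widths = [hi - lo for lo, hi in zip(bounds, bounds[1:])]
    mean = sum(f * m for f, m in zip(rel_freqs, mids))
    between = sum(f * (m - mean) ** 2 for f, m in zip(rel_freqs, mids))
    within = sum(f * w * w / 12.0 for f, w in zip(rel_freqs, widths))
    return cdf, mean, sqrt(between + within)


# Hypothetical cut points and relative frequencies for a 4-step scale
G, mean_h, sd_h = model_two([2.0, 5.0, 8.0], [0.05, 0.20, 0.60, 0.15])
print(f"mean = {mean_h:.3f}, sd = {sd_h:.3f}, G(6.5) = {G(6.5):.3f}")
```

The mean coincides with the model I mean, while the standard deviation is larger because of the within-interval term.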

5.3 Model III: The Beta Distribution as Continuous Model

Because the model II is not satisfactory in all respects, there is at least one alternative to be considered. This is known as the beta distribution, which has a continuous density function of a random variable in a closed interval with finite boundaries (see e.g. Kendall and Stuart 1977; 35 and 46).

As applied to our situation, the cumulative distribution function is defined by:

$$ dG\left( h \right)\, = \,\left[ {10^{\alpha + \beta - 1} \, \cdot \,B\left( {\alpha ,\beta } \right)} \right]^{ - 1} \,h^{\alpha - 1} \left( {10 - h} \right)^{\beta - 1} \,dh, $$

in which B(α, β) is the complete beta function with parameters α and β, defined as:

$$ B\left( {\alpha ,\beta } \right):= \,\int_{0}^{1} {t^{\alpha - 1} } \left( {1 - t} \right)^{\beta - 1} \,dt. $$

This model of the beta distribution has only two parameters, α and β, which are positive real numbers; they are usually referred to as the two shape parameters of the distribution. This number of parameters is considerably smaller than in the models I and II, because in this model, there are no categories at all in the population distribution. The density function g(h) is continuous over the complete domain, finite and positive for all h ∈ (0, 10) and zero outside the interval [0, 10]. All relevant properties and other information on this application of the beta distribution have been summarized in Kalmijn (2010, Appendix H), most of which can be found in various textbooks on calculus and statistics and/or in other public sources, e.g. Gupta and Nadarajah (2004).

In applying this distribution as the model, the empirical frequency information, available as {F j | j = 1(1)k}, is compared to the corresponding values of G(b j ), minimizing the differences between F and G jointly. The value of G is dependent on both α and β for all {b j | j = 1(1)k − 1}.

The comparison of F and G is possible and meaningful only at k − 1 values of H {b j | j = 1(1)k − 1}, since the equations F(0) = G(0) = 0 and F(10) = G(10) = 1 are trivial. The situation can be considered as one with a screen before the cumulative distribution function G(h), which is observable only through one of the k − 1 very narrow windows at H = b j (j = 1(1)k − 1). From these k − 1 comparisons, the two model parameters {α, β} are to be estimated, leaving k − 3 degrees of freedom (df).

For k = 3, there is always a unique solution with a perfect fit.

For k = 2, the number of solutions for this underdetermined situation is infinite.

For k ≥ 4, we have an overdetermined situation and in general there will be no perfectly fitting distribution, so we have to look for the ‘best fitting’ solution.
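The text does not specify how the differences between F and G are minimized jointly. The following sketch assumes an unweighted least-squares criterion at the k − 1 cut points and a small damped Gauss-Newton (Levenberg-Marquardt type) iteration on log α and log β; both choices are assumptions of this example, not the authors' procedure. The incomplete beta function is evaluated with the usual continued-fraction expansion, and all function names and data are hypothetical.

```python
from math import exp, lgamma, log


def _cont_frac(a, b, x, max_iter=300, tol=1e-14):
    """Continued fraction for the incomplete beta function (modified Lentz)."""
    tiny = 1e-300
    c, d = 1.0, 1.0 - (a + b) * x / (a + 1.0)
    d = 1.0 / (d if abs(d) > tiny else tiny)
    h = d
    for m in range(1, max_iter + 1):
        for num in (m * (b - m) * x / ((a + 2 * m - 1.0) * (a + 2 * m)),
                    -(a + m) * (a + b + m) * x / ((a + 2 * m) * (a + 2 * m + 1.0))):
            d = 1.0 + num * d
            d = 1.0 / (d if abs(d) > tiny else tiny)
            c = 1.0 + num / c
            c = c if abs(c) > tiny else tiny
            h *= d * c
        if abs(d * c - 1.0) < tol:
            break
    return h


def beta_cdf(h, alpha, beta):
    """G(h) for a beta distribution rescaled to the happiness range [0, 10]."""
    x = h / 10.0
    if x <= 0.0:
        return 0.0
    if x >= 1.0:
        return 1.0
    front = exp(lgamma(alpha + beta) - lgamma(alpha) - lgamma(beta)
                + alpha * log(x) + beta * log(1.0 - x))
    if x < (alpha + 1.0) / (alpha + beta + 2.0):
        return front * _cont_frac(alpha, beta, x) / alpha
    return 1.0 - front * _cont_frac(beta, alpha, 1.0 - x) / beta


def fit_beta(cut_points, cum_freqs, start=(2.0, 2.0), iters=100):
    """Least-squares fit of G(b_j) to F_j at the cut points (assumed criterion)."""
    def resid(u, v):
        return [beta_cdf(bj, exp(u), exp(v)) - fj
                for bj, fj in zip(cut_points, cum_freqs)]

    def sse(r):
        return sum(x * x for x in r)

    u, v = log(start[0]), log(start[1])
    r, lam, eps = resid(u, v), 1e-3, 1e-6
    for _ in range(iters):
        # Forward-difference Jacobian with respect to (log alpha, log beta)
        ju = [(x - y) / eps for x, y in zip(resid(u + eps, v), r)]
        jv = [(x - y) / eps for x, y in zip(resid(u, v + eps), r)]
        suu, svv = sum(x * x for x in ju), sum(x * x for x in jv)
        suv = sum(x * y for x, y in zip(ju, jv))
        gu = sum(x * y for x, y in zip(ju, r))
        gv = sum(x * y for x, y in zip(jv, r))
        a11, a22 = suu * (1.0 + lam), svv * (1.0 + lam)
        det = a11 * a22 - suv * suv
        du = (-gu * a22 + gv * suv) / det
        dv = (-gv * a11 + gu * suv) / det
        r_new = resid(u + du, v + dv)
        if sse(r_new) < sse(r):  # accept the step and relax the damping
            u, v, r, lam = u + du, v + dv, r_new, max(lam / 10.0, 1e-12)
            if abs(du) + abs(dv) < 1e-12:
                break
        else:  # reject the step and increase the damping
            lam *= 10.0
            if lam > 1e12:
                break
    return exp(u), exp(v), sse(r)


# Hypothetical cumulative frequencies F_1..F_3 at cut points b_1..b_3 (k = 4)
alpha_hat, beta_hat, loss = fit_beta([2.0, 5.0, 8.0], [0.05, 0.25, 0.85])
print(f"alpha = {alpha_hat:.3f}, beta = {beta_hat:.3f}, SSE = {loss:.2e}")
```

With k = 4 the fit is overdetermined by one comparison, so the residual sum of squares will in general be small but not zero.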

Once this distribution has been found, it would be possible to apply a ‘goodness-of-fit test’ (see e.g. Cramér 1974; 416–424). For this situation, K. Pearson has proposed a test statistic, which is based on the multinomial distribution of N respondents over k possible responses and which is defined as

$$ \sum\limits_{j = 1}^{k} {{\frac{{\left( {n_{j} - E n_{j}|\text{H}_{\text{o}} } \right)^{2} }}{{(E n_{j}|\text{H}_{\text{o}}) }}}} $$

where \( E n_{j} |{\text{H}}_{\text{o}} := \) the expected value of \( n_{j} \) under the null hypothesis \( {\text{H}}_{\text{o}} \) that the estimated distribution is a perfect representation of the actual distribution in the population. Under \( {\text{H}}_{\text{o}} \) and some additional conditions, Pearson’s statistic is approximately distributed as chi-square (\( \chi^{2} \)) with, in our case, k − 3 degrees of freedom (df). These conditions are that k > 3, that N is not too small and that responses with \( E n_{j} |{\text{H}}_{\text{o}}\) ≤ 5 are ‘pooled’ with an adjacent response, which obviously costs degrees of freedom because k is effectively reduced. The use of such a test outside comparative situations is debatable from the point of view of standard statistical test theory.
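A minimal sketch of the computation of Pearson’s statistic is given below. It takes the observed counts and the category probabilities implied by a fitted distribution, returns the statistic with k − 3 degrees of freedom, and lists the responses whose expected count is 5 or less. The pooling itself is left to the user, and the function name and numbers are hypothetical.

```python
def pearson_statistic(counts, fitted_probs, n_fitted_params=2):
    """Pearson's statistic, its degrees of freedom and the responses to pool."""
    if len(counts) != len(fitted_probs):
        raise ValueError("counts and fitted_probs must have the same length")
    n = sum(counts)
    expected = [n * p for p in fitted_probs]  # E n_j under H_o
    stat = sum((o - e) ** 2 / e for o, e in zip(counts, expected))
    df = len(counts) - 1 - n_fitted_params  # k - 3 for the two beta parameters
    to_pool = [j + 1 for j, e in enumerate(expected) if e <= 5.0]
    return stat, df, to_pool


# Hypothetical counts for R_1..R_4 and fitted probabilities G(b_j) - G(b_j-1)
stat, df, to_pool = pearson_statistic([10, 40, 120, 30], [0.06, 0.19, 0.60, 0.15])
print(f"statistic = {stat:.4f}, df = {df}, responses to pool: {to_pool}")
```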

The two parameters of the beta distribution cannot be interpreted directly as a location and a dispersion parameter, as is the case for e.g. the normal distribution. From the relationship between α, β, μ and σ², the mean μ and the variance σ² of the distribution of H can be estimated by direct substitution of the estimates of the shape parameters α and β:

$$ \mu {\kern 1pt} = \,{\frac{10\,\alpha }{\alpha + \beta }} $$

and

$$ \sigma^{2} = \,{\frac{100\,\alpha \beta }{{\left( {\alpha + \beta } \right)^{2} \left( {\alpha + \beta + 1} \right)}}} $$

In general, the values of the estimates obtained in this way will not be identical to those of the corresponding sample statistics. However, they may be more valid as they allow for the assumption of a continuous random variable H with a continuous p.d.f. over (0, 10).

The beta distribution also enables one to compute a statistic that is potentially useful in a comparative study of nations, especially in relationship to other characteristics. It is the “percentage happy”, which is defined in this context as the percentage of the society for which the happiness, expressed as the H-value, is closer to their most happy situation than to the most unhappy one, i.e. for which H > 5. In the above notation, this proposed statistic is defined as the estimate of [1 − G(5)]·100%, and can be computed on the basis of the estimates of the parameters α and β. Since the value of this statistic is influenced by both the mean value and the variance of the distribution, it may be considered as a possible alternative to the ‘Inequality-adjusted happiness’ as has been described by Veenhoven and Kalmijn (2005).
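The quantities derived from the shape parameters can be sketched as follows. For a beta distribution rescaled to [0, 10], the mean is 10α/(α + β) and the standard deviation is 10 times that of the standard beta distribution on [0, 1]. The percentage happy is obtained here by integrating the density over (5, 10) with the composite Simpson rule. This sketch assumes α ≥ 1 and β ≥ 1, so that the density is finite at the end points; the function name and the parameter values are hypothetical.

```python
from math import exp, lgamma, sqrt


def beta_summary(alpha, beta, n_steps=2000):
    """Mean, SD and percentage happy (H > 5) of a beta distribution on [0, 10]."""
    if alpha < 1.0 or beta < 1.0:
        raise ValueError("this sketch assumes alpha >= 1 and beta >= 1")
    s = alpha + beta
    mean = 10.0 * alpha / s
    sd = 10.0 * sqrt(alpha * beta / (s * s * (s + 1.0)))
    inv_b = exp(lgamma(s) - lgamma(alpha) - lgamma(beta))  # 1 / B(alpha, beta)

    def density(h):
        x = h / 10.0
        return inv_b * x ** (alpha - 1.0) * (1.0 - x) ** (beta - 1.0) / 10.0

    # Composite Simpson rule on [5, 10]; n_steps must be even
    width = 5.0 / n_steps
    total = density(5.0) + density(10.0)
    for i in range(1, n_steps):
        total += (4.0 if i % 2 else 2.0) * density(5.0 + i * width)
    pct_happy = 100.0 * total * width / 3.0  # estimate of [1 - G(5)] * 100%
    return mean, sd, pct_happy


# Hypothetical shape parameters
mean_h, sd_h, pct = beta_summary(4.0, 2.0)
print(f"mean = {mean_h:.3f}, sd = {sd_h:.3f}, percentage happy = {pct:.2f}%")
```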

6 Application and Merits of the Model

6.1 The HSIS in Practice

The application of the HSIS method is a two-step process. The first one is the scale construction phase by a panel as has been described by Veenhoven (2009), and the second is its application to characterize the happiness of a population by a sample of subjects using this scale. Note that we use the terms ‘panel’ and ‘judges’ for the scale construction phase and ‘sample’ and ‘respondents’ for the application phase as a contribution to strengthen the distinction—and the separation—of these two phases.

In the HSIS, the judges in the construction phase have to give their personal opinions on the position of each of the k − 1 cut points {b j | j = 1(1)k − 1}, bearing in mind that b 0 = 0 and b k = 10 are fixed. For a given measure of happiness, the values of the k − 1 boundaries or cut points are estimated as the average values reported by the n panel members. Each of these judges has to specify the above mapping by indicating the b-values he feels separate the consecutive categories, ignoring his personal happiness self-judgment.
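The averaging step of the construction phase can be sketched as below. Each judge supplies k − 1 increasing cut points between 0 and 10, and the panel estimate of each b j is the mean over judges. The standard error of each mean is added here only as an illustrative measure of panel consensus; it is an assumption of this sketch and not part of the procedure described above. The function name and the judgments are hypothetical.

```python
from math import sqrt
from statistics import mean, stdev


def panel_cut_points(judgments):
    """Average the cut points reported by the judges of one panel."""
    k_minus_1 = len(judgments[0])
    for row in judgments:
        if len(row) != k_minus_1:
            raise ValueError("every judge must report k - 1 cut points")
        full = [0.0, *row, 10.0]  # b_0 = 0 and b_k = 10 are fixed
        if any(lo >= hi for lo, hi in zip(full, full[1:])):
            raise ValueError("cut points must increase strictly within (0, 10)")
    n = len(judgments)
    columns = list(zip(*judgments))  # one column per cut point b_j
    means = [mean(col) for col in columns]
    std_errors = [stdev(col) / sqrt(n) for col in columns]
    return means, std_errors


# Hypothetical judgments of three panel members for a 4-step scale
b_hat, b_se = panel_cut_points([[2.0, 5.0, 8.0], [2.5, 5.5, 8.5], [1.5, 4.5, 7.5]])
print("cut points:", b_hat, "standard errors:", [round(x, 3) for x in b_se])
```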

In the second phase, the outcomes of the first phase are applied to the observed frequencies of the various categories as counted in a sample of N subjects from the relevant population. From these results, the sample mean and its happiness inequality are calculated, the latter being expressed in the standard deviation. These statistics are used to compute estimates of the parameters of the distribution of the variable H in the population represented by the study sample. As a matter of fact, both stages will contribute to the eventual inaccuracy of these estimates.

We have to emphasize that the application phase of the methods described in this paper is only applicable to samples of which the ‘complete’ empirical sample cumulative distribution {F j | j = 1(1)k} is known, albeit for k happiness values only. Knowledge of both the average value and the standard deviation of the sample happiness only is insufficient.

6.2 First Results

Since the start of the HSIS, a large amount of data has been gathered. Of the first harvest, 100 cases have been analyzed. The observations are also available at http://worlddatabaseofhappiness.eur.nl/scalestudy/datafiles/first100cases.xls and the results have been described by Kalmijn (2010, Chap. VII).

These data have been delivered by 12 institutes and cover 9 different languages. In this context, a case is defined as the set of judgments on the cut points of a specific happiness measure (one leading question + k response categories), obtained within the same participating institute and the same session. The total number of happiness measures involved is 52, since several measures have been presented to judges in more than one institute.

6.3 Some Findings as Illustration

Five of these cases have been applied to 20 already existing happiness distribution data from Dutch surveys in the period 1980–2008. As an illustration, the results have been summarized in Table 1.

Table 1 Estimated mean values and standard deviations 1980–2008 in The Netherlands

For each of the five cases, denoted A, B, C, D and E, we included the text in English of the leading question and all response categories. Each row below this description refers to one of the existing surveys, the year of which has been specified. In the next columns, the estimated mean values have been listed according to the different approaches. We start with the traditional approach (happiness as a discrete variable and equidistant ratings ranging from 0 to 10). Then follows the estimate obtained on the basis of Thurstone values. In the next column, we report the estimate according to the models I and II as described in Sect. 5; both models always give identical estimates for the mean happiness value. Moreover, we have calculated the estimates on the basis of the best fitting beta distribution. In Table 1, we recorded the difference between the latter estimate and the one according to the models I/II. Next there are two columns with estimates of the within-nation standard deviation, one according to the traditional method and the other one on the basis of model II. Finally, the right hand column gives the 95% confidence limits for the true, but unknown mean happiness value of the happiness of the Dutch population.

The number of judges in the panel was about 30, the sample size in the application phase varied between 1000 and 1500, except for case D, in which much larger samples were involved. For comparison reasons we considered the average happiness value measured by using numerical scales. In all those cases, the leading question was at least very similar and incidentally even identical to the one of the verbal scales. Over the total period 1990–2008 this estimate varied between 7.4 and 7.8 on a [0, 10] scale.

From this table, we conclude that there are substantial differences between the estimated mean values. These do not only depend on the text of the happiness measures and the number of categories, but also on the model according to which the observations have been processed. Moreover, the agreement with the above estimates on the basis of the use of numerical scales is not always excellent.

From a validity point of view, the model III on the basis of a beta distribution is the most attractive one, but it has one serious disadvantage: we are unable to estimate the inaccuracy of the estimates, at least on the basis of our present knowledge. As a consequence, we are unable to construct 95% confidence intervals for the true but unknown population mean value. The application of model II does not have this disadvantage. From the column III–II we learned that the difference between the fully continuous and the semi-continuous model is modest (<0.2) and that this difference is always well within the 95% confidence interval. Our final conclusion is that eventually the model II is to be preferred over the beta distribution model III, since it does not only provide us estimates, but also information about their inaccuracy.

A more elaborate analysis and discussion are given by Kalmijn (2010, Chap. VII).

6.4 Potential Merits of the Scale Interval Approach

The main possible merits of the above approach–some of which are potential–can be summarized as follows:

  (a) Improvement of the validity of the method, in the sense that the proposed approach no longer treats happiness as a discretely distributed variable, but allows for its continuous nature. In this way, the method described in this paper is no doubt closer to reality and is to be considered more valid, and hence more relevant for social scientists, than the previously conventional methods.

    Moreover, the criticism of the method of direct rescaling does not apply to the results obtained according to the scale interval approach. This especially includes the objections against the controversial treatment of ordinal ratings as if they were cardinal, since in the proposed approach equidistance between the ratings is no longer assumed.

  (b) A consequence could also be an improvement of correlational findings, at least in the validity perspective. Moreover, it is conceivable, at least theoretically, that this improvement of the validity of the happiness measurement may also result in higher numerical values of the association measures with conditions of happiness. Such an expectation would be based on the assumption that associations that are really present may be blurred by the fact that happiness is measured in a suboptimal way, rather than by the fact that the associations are intrinsically insufficiently strong.

  (c) Meta-analytical studies are almost always hampered by the problem that the different findings that need to be combined arise from the application of different WDH items. It is to be expected that the results obtained according to the scale interval approach will be more reliable than those obtained according to previously current methods, so the method may seriously enlarge our meta-analytical opportunities. Similar considerations can be applied to the investigation of trends of happiness in nations or other societies.

  (d) Finally, the method offers the opportunity to optimize the set of questions. Items with a relatively large skipping rate, with a large interval width inequality and/or for which a relatively poor consensus about the positions of the boundaries has been observed within panels and/or between panels from different nations, are less suitable than those without these problems. All these observations could be good reasons to discontinue the application of these problematic happiness measures, although a number of studies will still remain where they have been applied in the past. In this way, the present approach may contribute to the standardization and to the improvement of the quality of measuring happiness.

In a next paper we will evaluate the application of this approach to a number of verbal scales and test to what extent the underlying assumptions and the model can be corroborated or not.