Source of data
Earnings data were retrieved from the International Labour Organization (ILO). The ILO’s annual wages and retail prices survey collates data on 159 occupations. The ILO’s ISCO-08 classifies occupations into 10 major categories and 4 skill levels. At the time of data extraction, the ILO’s ILOSTAT database contained wages data for only 32 countries from 2009 onwards. The ILO’s older ILOSTA database contains wages data for most countries between 1998 and 2008.
Wages data were retrieved from the ILO wage estimate database [12] for a variety of job titles across countries and then classified into four skill levels according to ISCO-08 Major Groups [13] (Table 1). The dataset used in the analysis contains, for each country, (a) a pooled data point of monthly wages (in 2010 USD) for the 4 skill levels, (b) GDP per capita for 2010, (c) country income level and (d) WHO region. The pooled data point refers to a sub-sample of the broader set of countries’ occupational wages available in the ILOSTAT database from 1999 to 2011. As many occupations are represented for each of the skill levels in the ILO database, we selected earnings for medical professions wherever possible. For skill Level 4, data extraction focused on earnings for general physicians, dentists and professional nurses; for Level 3, on medical X-ray technicians, physiotherapists and auxiliary nurses; for Level 2, on clerks and secretaries; and for Level 1, on physical labourers (Additional file 1: Table S1; further information in Appendix A). For those countries that do not have wage data available in ILOSTAT, the former ILO database ILOSTA, with data up until 2008, was used. A proxy wage value for each skill level was entered using the salary of a representative occupation from the ILO’s annual wages and retail prices survey (“October Inquiry”). A single representative occupation was used because of the complexity and time that would have been involved in calculating an average salary for different occupations at the same skill level corresponding to different years. As often as possible, the same representative occupation was used for all countries. The preferred representative occupations for each skill level were those that were well defined and found in most countries. These preferred representative occupations, as well as a full list of the occupations covered by the ILO October Inquiry, are contained in Additional file 1: Appendix A. Data from the most recent year available in ILOSTAT were used in our econometric analysis. In total, 324 observations from 86 countries were available for analysis.
Table 1
Definition of ISCO-08 Major skill level groups [13]
ISCO-08 major group | Skill level |
Professionals | 4 |
Technicians and associate professionals | 3 |
Clerical support workers | 2 |
Elementary occupations | 1 |
Nearly 75% of the data are from 2006 or later. However, for a number of countries, mainly low-income countries in Africa, the most recent data are over 10 years old. Additional file 1: Table S2 in Appendix A presents the distribution of available data by year. The wages were examined by skill level using a box plot (Additional file 2: Figure S1, Appendix B), which showed all distributions to be right skewed with a few outliers. As the skill level increases, the right tail becomes longer and more populated, and the number of outliers decreases. The distribution suggests larger wage differentials at the lower skill levels across countries. This is consistent with the literature [14], which has shown higher rates of migration for lower skilled workers due to high disparity in wages across countries. We expect wages to increase by skill level, to different degrees, depending on the region. To confirm this, we use a bar graph with mean wage by region and skill level (Additional file 2: Figure S2, Appendix B). We expect mean wages to have increased over time globally. A bar graph of mean wages over time showed mean wages peaking in 2007, falling drastically in 2008, and recovering slightly thereafter (Additional file 2: Figure S3, Appendix B). The drop between 2007 and 2008 coincides with the global financial crisis, which is the main factor contributing to this decline. The ILO highlights that the link between wages and labour productivity levels had been “broken” prior to the global financial crisis, and that market corrections were necessary in order to create a sustainable wage and productivity link [15].
We use the variable Mortality as a dummy that captures countries with high infant and adult death rates. The WHO has divided its 193 Member States into 5 mortality strata, based on their level of child (5q0) and adult male (45q15) mortality, as follows: A = Very low child, very low adult; B = Low child, low adult; C = Low child, high adult; D = High child, high adult; E = High child, very high adult. Countries in strata C, D and E are captured by the Mortality dummy. Countries in stratum A are captured by a second dummy, called Developed. The 6 WHO regions are also used as dummies in the analysis of this paper (Africa, Eastern Mediterranean, Latin America, Asia, Western Pacific and Europe), plus a dummy for North America.
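The coding of the two mortality-based dummies can be illustrated with a short sketch. The function and dictionary key names below are hypothetical and chosen only for illustration; the mapping itself follows the strata definitions given above, with stratum B acting as the implicit reference category.

```python
# Illustrative sketch only: names are hypothetical, not taken from the study's code.
VALID_STRATA = {"A", "B", "C", "D", "E"}
MORTALITY_STRATA = {"C", "D", "E"}  # strata captured by the Mortality dummy
DEVELOPED_STRATA = {"A"}            # stratum captured by the Developed dummy


def stratum_dummies(stratum: str) -> dict:
    """Map a WHO mortality stratum (A-E) to the Mortality and Developed dummies."""
    code = stratum.strip().upper()
    if code not in VALID_STRATA:
        raise ValueError(f"unknown WHO mortality stratum: {stratum!r}")
    return {
        "mortality": int(code in MORTALITY_STRATA),
        "developed": int(code in DEVELOPED_STRATA),
    }
```

Under this coding, a stratum B country has both dummies equal to zero, so coefficients on Mortality and Developed are read relative to low child, low adult mortality countries.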
Model specification
Many countries have incomplete data, with only a small percentage of the total occupations collected by the ILO, which limits the number of possible representative wages for the four skill levels. We had 324 observations from 86 countries available for analysis. As data are classified into 4 skill levels, 4 observations are sought for each country, corresponding to 772 (193 × 4) data points. Thus, 58% of possible observations were missing. If the data are missing at random, maximum likelihood estimation and multiple imputation are the most common techniques, and the literature on them is vast. Multiple imputation methods produce unbiased estimates, even for large percentages of missing values (say, up to 50%), when at least one auxiliary variable is available to predict the value of Y (i.e. strong correlates of Y are available) [16]. However, multiple imputation methods require that the data be missing at random, meaning that the probability of observing the variable of interest (Y = wage) can depend on other explanatory variables, but not on the value of Y [17]. When this assumption fails, modelling the missingness mechanism becomes imperative; otherwise one must accept a level of bias.
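For reference, the share of missing observations quoted above follows directly from the counts reported in the text (193 Member States, 4 skill levels, 324 observed data points):

$$N_{\text{possible}} = 193 \times 4 = 772, \qquad \frac{N_{\text{possible}} - N_{\text{observed}}}{N_{\text{possible}}} = \frac{772 - 324}{772} = \frac{448}{772} \approx 0.58$$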
A partial and semi-partial correlation matrix was used to investigate the relationship between missing data and possible covariates for estimation (Additional file 2: Table S3, Appendix B). In the partial and semi-partial correlations we find that GDP per capita and certain regions are positively correlated with missing data. The non-significance of the semi-partial correlation with GDP per capita tells us that, in a model to predict the probability that the data are missing, the amount by which R² decreases when removing GDP per capita from the model would not be significant. Conversely, we find that removing variables such as mortality, WHO regions, or year results in a significant change in the R² of such a model. Given the substantial amount of missing data (nearly 60%), and the existence of variables that can be used to predict the probability of observing data, the missing data were classified as missing not at random using Rubin’s classification for missing values [18]. Under this assumption, nonresponse mechanisms are considered non-ignorable and should be modelled.
We consider the missing wages to be caused by sample selection, with the data incidentally truncated, where the probability of observing a wage depends on country characteristics (i.e. sample selection depends on a country’s health care system, cultural practices and perhaps economic development).
Because of this incidental truncation, the sample selection of countries in the ILO database gives rise to omitted variable bias in the sample. As a result, the endogenous processes that dictate the probability of observing the variable of interest are modelled separately. Heckman’s [19] two-stage sample selection model allows us to remove endogeneity by replacing missing values with data estimated in an auxiliary model; simple regression models can then be used for the wage estimation. This two-step model offers a means of correcting for non-randomly selected samples with copious missing data, which characterizes the phenomenon of incidental selection in censored data. Censoring is present here since there are non-observable responses where one can nevertheless observe all of the explanatory variables [20]. As such, the distribution that applies to the observed data is a combination of discrete and continuous distributions as defined by the latent variables which describe the outcome [21]. In the absence of collinearity problems, the full-information maximum likelihood estimator is preferable in terms of robustness, although the limited-information two-step estimator also gives reasonable results [22]. For this analysis we find that collinearity is an issue because the key variable (mortality) used in the auxiliary model is correlated (0.63) with the key variable (log GDP per capita) used in the wage equation. Thus, we proceed using Heckman’s subsample OLS (Two Part Model): that is, a probit model for the selection equation (i.e. the probability of observing the response) and ordinary least squares (OLS) for the wage equation.
Regression equation:
$$y_{1i}^{*} = x_{i} \beta + \varepsilon_{i}$$
Selection equation:
$$y_{2i}^{*} = z_{i} \gamma + U_{i}$$
such that:
$$y_{1i} = y_{1i}^{*} \quad \text{if } y_{2i}^{*} > 0$$
$$y_{1i} = 0 \quad \text{if } y_{2i}^{*} \le 0$$
where x is a vector of explanatory variables which determine the wage, and z is a vector of explanatory variables that determine the probability of observing the wage.
In the context of the skill level wages, the explanatory vector x is composed of the logarithm of GDP per capita, the skill level (1–4), and a dummy variable identifying countries with developed health systems (i.e. very low child and adult mortality). Furthermore, z is a vector composed of dummy variables which identify specific regions where there are high or low levels of response (i.e. Western Pacific, Asia, Eastern Mediterranean), the logarithm of GDP per capita, and a dummy variable which identifies high levels of both infant and adult mortality in WHO regions. An elevated level of mortality is used to predict the probability of observing the wage because it implies a low level of development, specifically within health systems, which would be associated with poor data collection within the system.
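In the standard Heckman two-step procedure, the fitted probit index from the selection equation is converted into an inverse Mills ratio, which is then added as an extra regressor in the wage equation estimated on the observed sample. The sketch below computes that term using only the Python standard library. The function name and the index values are hypothetical and serve only to illustrate the calculation; they are not taken from the study.

```python
from statistics import NormalDist

_STD_NORMAL = NormalDist()  # standard normal distribution (mean 0, sd 1)


def inverse_mills_ratio(index: float) -> float:
    """Return pdf(index) / cdf(index) for a fitted probit index z_i * gamma.

    This is the selection-correction term for units whose outcome is observed.
    For extremely negative indices the cdf underflows to zero, so a production
    implementation would need a numerically safer formulation.
    """
    return _STD_NORMAL.pdf(index) / _STD_NORMAL.cdf(index)


# Hypothetical fitted probit indices for three observed country-skill cells.
indices = [-1.0, 0.0, 1.5]
lambdas = [inverse_mills_ratio(v) for v in indices]
```

The ratio falls as the index rises: observations that were very likely to be observed receive a small correction, while those that were unlikely to be observed receive a large one. A significant coefficient on this term in the wage equation is the usual indication of selection bias.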
A density plot of the log wage data indicated that wages are not normally distributed but instead follow a distribution resembling the log-normal. In a Jarque–Bera normality test (Additional file 2: Table S4, Appendix B), the log-wages were tested and the hypothesis of normality was not rejected [23]. It is expected that the distribution of the error term in an OLS regression of untransformed variables would not necessarily follow a normal distribution. Normality in the distribution of the errors, a crucial assumption for the OLS framework, is thus preserved by implementing the OLS regression equation using the logarithm of earnings.
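The Jarque–Bera statistic combines sample skewness S and kurtosis K as JB = n/6 · (S² + (K − 3)²/4) and is compared with a chi-square distribution with 2 degrees of freedom. The following is a minimal standard-library sketch of that calculation; the function name and the sample values are illustrative assumptions, and the asymptotic p-value it returns can be unreliable in small samples.

```python
import math


def jarque_bera(values):
    """Return the Jarque-Bera statistic and its asymptotic chi-square(2) p-value."""
    n = len(values)
    mean = sum(values) / n
    # Central moments computed with the population (1/n) convention.
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    m4 = sum((v - mean) ** 4 for v in values) / n
    skewness = m3 / m2 ** 1.5
    kurtosis = m4 / m2 ** 2
    jb = n / 6.0 * (skewness ** 2 + (kurtosis - 3.0) ** 2 / 4.0)
    # For 2 degrees of freedom the chi-square survival function is exp(-x / 2).
    p_value = math.exp(-jb / 2.0)
    return jb, p_value


# Illustrative values only; in the setting above these would be log wages.
sample = [-2.0, -1.0, 0.0, 1.0, 2.0]
jb_stat, jb_p = jarque_bera(sample)
```

A large p-value means the hypothesis of normality is not rejected, which is the outcome reported above for the log-wages.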
The wage function was estimated using generalized linear models (GLM). In this model, the response variable y_i is assumed to follow an exponential family distribution whose mean is a nonlinear function of x_i. The link function specifies the link between the probability distribution of y and the explanatory variables x. Thus, specifying a logarithmic link function says how the expected value of the response relates to the linear predictor of explanatory variables. The gamma distribution has a property shared by the log-normal, namely that its variance is proportional to the square of its mean (i.e. it has a constant coefficient of variation). Within the GLM framework, both the gamma and Gaussian (normal) distributions were tested; however, the parameter for selection bias (calculated as an inverse Mills ratio) was found to be insignificant under the assumption of a Gaussian distribution. Moreover, a limited information maximum likelihood (LIML) estimator was also tested against two alternative models, OLS and GLM (gamma). The parameter estimates for the LIML were similar to those of the GLM because they both use maximum likelihood methods, and thus the LIML output was omitted. Lastly, because different sub-samples of countries were observed in different years, all the models were compared both with and without a dummy variable for calendar time.
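The log link and the constant coefficient of variation (CV) referred to above can be written out explicitly. The parameter symbols k, θ, m and s are introduced here only for illustration and do not appear in the original analysis.

$$\ln E[y_{i} \mid x_{i}] = x_{i} \beta$$

$$Y \sim \text{Gamma}(k, \theta): \quad E[Y] = \mu = k\theta, \quad \operatorname{Var}(Y) = k\theta^{2} = \frac{\mu^{2}}{k}, \quad \text{CV} = \frac{1}{\sqrt{k}}$$

$$\ln Y \sim N(m, s^{2}): \quad E[Y] = \mu = e^{m + s^{2}/2}, \quad \operatorname{Var}(Y) = \left(e^{s^{2}} - 1\right)\mu^{2}, \quad \text{CV} = \sqrt{e^{s^{2}} - 1}$$

In both cases the variance is a constant multiple of the squared mean, so the CV does not depend on the mean.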