The design of the HL questionnaire considers four dimensions: health concepts and knowledge literacy, healthy lifestyle and behavior, healthy skills, and health status and disease history. This differs from the existing three-dimension methods [28‐30] and the five-dimension approach [31] used in China, because [28‐31] do not consider the respondents’ health status and disease history. The 68-question HL questionnaire is designed by improving the HLS-EU-Q47 and by analyzing the characteristics of Mongolians in China. To verify the presented HL assessment method on a set of cross-sectional data, 742 Mongolians in Inner Mongolia, China, were invited to answer the questionnaire.
Based on the questionnaires completed by the 742 Mongolian respondents, the reliability and validity of the designed HL questionnaire are analyzed using Cronbach’s \(\alpha\) coefficient, the Mutual Information Score (MIS), the Kaiser-Meyer-Olkin (KMO) measure, and the Bartlett Spherical Test Chi-square Value (BSTCV). The results show that the designed questionnaire has high reliability and validity: our Python programs give Cronbach’s \(\alpha =0.807\), MIS=0.803, KMO=0.765, and BSTCV=2486 (\(p<0.001\)). The MIS method is preferable to the Pearson correlation coefficient approach [32], because the latter can only handle linear correlations, whereas the former can deal with both linear and nonlinear correlations.
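As a rough illustration of two of the quantities discussed above, the following sketch computes Cronbach’s \(\alpha\) for a small score matrix and contrasts the Pearson coefficient with mutual information on a purely nonlinear relationship. It is not the authors’ program: the function names and the toy data are hypothetical, and the mutual information is the plain discrete estimate in nats, which may differ from how the reported MIS was computed or normalized.

```python
import math
from collections import Counter
from statistics import pvariance


def cronbach_alpha(rows):
    """Cronbach's alpha for a list of respondents, each a list of k item scores."""
    k = len(rows[0])
    item_vars = [pvariance([r[j] for r in rows]) for j in range(k)]
    total_var = pvariance([sum(r) for r in rows])
    return k / (k - 1) * (1 - sum(item_vars) / total_var)


def pearson(x, y):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def mutual_information(x, y):
    """Discrete mutual information (in nats) estimated from joint frequencies."""
    n = len(x)
    pxy, px, py = Counter(zip(x, y)), Counter(x), Counter(y)
    return sum(
        (c / n) * math.log((c / n) / ((px[a] / n) * (py[b] / n)))
        for (a, b), c in pxy.items()
    )


# Hypothetical toy scores: three respondents, three perfectly consistent items.
scores = [[1, 1, 1], [2, 2, 2], [3, 3, 3]]
alpha = cronbach_alpha(scores)

# Hypothetical nonlinear relationship y = x^2 on a symmetric range.
x = [-2, -1, 0, 1, 2]
y = [v * v for v in x]
r = pearson(x, y)
mi = mutual_information(x, y)

print(f"alpha={alpha:.3f}  pearson={r:.3f}  mutual_information={mi:.3f}")
```

On this toy data the Pearson coefficient vanishes even though \(y\) is fully determined by \(x\), while the mutual information stays clearly positive, which is the behavior the comparison above relies on.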
A data set with 742 samples is constructed, where each sample has 68 features and 1 target. The 68 features correspond to the 68 questions in the HL questionnaire, and the target is the HL score that each respondent obtained by answering the questionnaire. Based on this data set, XGB and LGBM regression models for predicting HL are constructed. 80% of the samples are used for training and the remaining 20% for testing: each model is trained on 594 samples and then tested on the other 148. The \(R^2\) index \((0 < R^2 \le 1)\) is chosen as the accuracy measure; a larger \(R^2\) means higher assessment accuracy. The results show that the \(R^2\) index and the absolute error of the LGBM regression model are 0.98347 and 11, respectively, both better than those obtained with XGB. The LGBM-based HL assessment model can therefore produce assessment results with high accuracy.
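The split sizes and the accuracy index described above can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: the helper names, the random seed, and the four toy score values are hypothetical, rounding 80% of 742 to the nearest integer is one way to arrive at the 594/148 split, and the error shown is a mean absolute error, which may not be the exact error measure behind the reported value of 11.

```python
import random


def split_indices(n_samples, train_fraction=0.8, seed=0):
    """Shuffle sample indices and split them into training and testing parts."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    n_train = round(train_fraction * n_samples)
    return indices[:n_train], indices[n_train:]


def r2_score(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot


def mean_absolute_error(y_true, y_pred):
    """Average absolute difference between true and predicted scores."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


# 742 respondents split 80/20.
train_idx, test_idx = split_indices(742)
print(len(train_idx), len(test_idx))  # 594 148

# Hypothetical true and predicted HL scores for four test respondents.
y_true = [3.0, 5.0, 7.0, 9.0]
y_pred = [2.5, 5.0, 7.5, 9.0]
r2 = r2_score(y_true, y_pred)
mae = mean_absolute_error(y_true, y_pred)
print(f"R^2={r2:.3f}  MAE={mae:.2f}")
```

In practice the predictions would come from the trained XGB or LGBM regressor evaluated on the 148 held-out samples; the toy values only make the two formulas concrete.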
In addition, the existing correlation analysis methods, such as the covariance method, the Pearson correlation coefficient, and the MIS approach, can only give qualitative results when analyzing the correlation among the questions of a questionnaire. This does not meet the growing demand for high-precision HL assessment. Therefore, we quantitatively analyze the influence of each question in the questionnaire on the HL assessment results by using the feature-importance function of the LGBM-based HL assessment model. The quantitative results for all questions are given in Fig. 6. The largest impact index is 1105, and the smallest is 23. Age has the highest influence on the HL level, which indicates a strong correlation between age and HL level and is consistent with other studies [28‐31, 33]. For example, the Japanese HL survey [33] concluded that the HL level of Japanese respondents increased with age, while the HL survey in European countries and Turkey demonstrated that older people tended to have lower HL [33]. The impact index of the salary level of the respondents (
\(\mathrm{Column\_27}\)) is 286, which ranks second but is much smaller than that of age. This result is consistent with the conclusions of [28‐30]. The impact index of the interviewees’ ability to judge relevant health information in the media (\(\mathrm{Column\_36}\)) is 270, which ranks third. The impact indexes of the probability of medical attendance (\(\mathrm{Column\_25}\)), knowledge of vaccinations and checkups (\(\mathrm{Column\_43}\)), and obtaining healthy-eating information (\(\mathrm{Column\_53}\)) rank fourth, fifth, and sixth, with values of 256, 254, and 253, respectively. These analyses are not found in the existing results. The influence of gender (
\(\mathrm{Column\_1}\)) on the HL level is 69. The respondents’ scores show that men’s HL is higher than women’s, which is consistent with the findings in [29, 34], although the influencing factors were not quantified in [29, 34]. The impact indexes of territory (\(\mathrm{Column\_2}\)), education background (\(\mathrm{Column\_20}\)), and profession (\(\mathrm{Column\_21}\)) are 96, 69, and 71, respectively. The respondents’ scores also show that the HL levels of respondents living in cities are higher than those of residents in villages, and that there is a positive linear correlation between HL level and educational background. These results for territory and education background are consistent with those in [29]. The fourth dimension of the HL questionnaire (health status and disease history) is reflected by the
\(\mathrm{Column\_3, 4, 5, 7, 8, 9, 10}\) entries in Fig. 6, among which the impact index of health status (\(\mathrm{Column\_3}\)) is the largest, at 168. However, these factors are not considered in [28‐30]. The impact indexes of the other questions are not discussed individually and can be found in Fig. 6. It is worth mentioning that the question with the least influence on the final HL assessment result is the insurance type (\(\mathrm{Column\_6}\)), with a value of 23. However, this factor is not investigated in other papers.
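The integer impact indexes reported above are the kind of value produced by a split-count feature importance, in which a feature’s score is the number of times it is chosen as a split variable across all trees of the ensemble. Whether the authors used this importance type is an assumption here; it is the default setting (importance_type='split') of LightGBM’s scikit-learn interface. The sketch below does not call LightGBM: the nested-dictionary tree format, the function names, and the two toy trees are hypothetical and only show how such counts arise and are ranked.

```python
from collections import Counter


def count_splits(node, counts):
    """Recursively count how often each feature is used as a split in one tree."""
    if "feature" not in node:  # a leaf holds only a prediction value
        return
    counts[node["feature"]] += 1
    count_splits(node["left"], counts)
    count_splits(node["right"], counts)


def split_importance(trees):
    """Split-count importance: total number of splits per feature over all trees."""
    counts = Counter()
    for tree in trees:
        count_splits(tree, counts)
    return counts


# Hypothetical two-tree ensemble; feature labels are illustrative only.
leaf = {"value": 0.0}
toy_trees = [
    {
        "feature": "Column_0",
        "left": {"feature": "Column_27", "left": leaf, "right": leaf},
        "right": {"feature": "Column_0", "left": leaf, "right": leaf},
    },
    {
        "feature": "Column_0",
        "left": leaf,
        "right": {"feature": "Column_6", "left": leaf, "right": leaf},
    },
]

importance = split_importance(toy_trees)
ranking = importance.most_common()  # features sorted by descending split count
for name, score in ranking:
    print(name, score)
```

Under this reading, an impact index of 1105 would mean that the feature was selected for 1105 splits in the trained model, and an index of 23 that it was selected only 23 times.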