Latent Class Analysis: An example for reporting results

https://doi.org/10.1016/j.sapharm.2016.11.011

Abstract

Objective

The purpose of this paper is to provide a brief non-mathematical introduction to Latent Class Analysis (LCA) and a demonstration for researchers new to the analysis technique in pharmacy and pharmacy administration. LCA is a mathematical technique for examining relationships among observed variables when there may be collections of unobserved categorical variables. Traditionally, LCA focused on polytomous observed variables, but recent work has extended the types of data that can be utilized. Included in this introduction are basic guidelines for the information that should be part of a manuscript submitted for review. For the analysis, LatentGold is used, but I also include basic R code for running LCA and LC Regressions with the poLCA package.

Introduction

Latent classes are unobserved, or latent, segments. Participants, or more generally, cases, within the same latent class are considered homogeneous based on certain pieces of information. Latent Class Analysis (LCA) was developed over 60 years ago as a way to characterize latent variables while analyzing dichotomous items.1 During the past several years, it has expanded to allow for all types of data. In the literature, LCA is referred to in different ways. It has been called Latent Structure Analysis,2 Mixture Likelihood Clustering,3, 4 Model Based Clustering,5, 6, 7 Mixture-model Clustering,8 Bayesian Classification,9 and Latent Class Cluster Analysis.10

LCA allows researchers to create or characterize a multidimensional discrete latent variable based on a cross-classification of two or more observed categorical variables.2 Because of the categorical nature of the latent classes, LCA is different from other latent approaches such as factor analysis and structural equation modeling. LCA makes it possible to develop typologies for understanding and, if desired, to use them in predictive models. In addition to the ability to analyze relationships in categorical data, numeric and non-numeric data can be utilized.
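To make the idea of estimating latent classes from cross-classified items concrete, the following is a minimal sketch in Python. It is not the Latent GOLD or poLCA procedure used in this paper. It assumes dichotomous (0/1) items, independence of items within a class, and a plain expectation-maximization loop with a single random start; the function name `fit_lca`, the starting values, and the toy data are illustrative assumptions.

```python
import math
import random


def fit_lca(data, n_classes, n_iter=200, seed=0):
    """Fit a latent class model to 0/1 items with a basic EM loop.

    Returns (class sizes, item-response probabilities per class,
    log-likelihood evaluated in the final E-step).
    """
    rng = random.Random(seed)
    n_items = len(data[0])
    # Start from equal class sizes and random item-response probabilities.
    weights = [1.0 / n_classes] * n_classes
    probs = [[rng.uniform(0.25, 0.75) for _ in range(n_items)]
             for _ in range(n_classes)]
    log_lik = 0.0
    for _ in range(n_iter):
        # E-step: posterior probability of each class for each case.
        posteriors = []
        log_lik = 0.0
        for row in data:
            joint = []
            for c in range(n_classes):
                p = weights[c]
                for j, y in enumerate(row):
                    p *= probs[c][j] if y == 1 else 1.0 - probs[c][j]
                joint.append(p)
            total = sum(joint)
            log_lik += math.log(total)
            posteriors.append([p / total for p in joint])
        # M-step: update class sizes and item-response probabilities.
        for c in range(n_classes):
            class_mass = sum(post[c] for post in posteriors)
            weights[c] = class_mass / len(data)
            for j in range(n_items):
                endorsed = sum(post[c] * row[j]
                               for post, row in zip(posteriors, data))
                estimate = endorsed / max(class_mass, 1e-12)
                # Clamp so probabilities stay strictly inside (0, 1).
                probs[c][j] = min(max(estimate, 1e-6), 1.0 - 1e-6)
    return weights, probs, log_lik


# Made-up data: two response patterns dominate, with a little noise on item 4.
toy_data = ([[1, 1, 1, 1]] * 30 + [[1, 1, 1, 0]] * 5
            + [[0, 0, 0, 0]] * 30 + [[0, 0, 0, 1]] * 5)

weights_1, probs_1, loglik_1 = fit_lca(toy_data, n_classes=1)
weights_2, probs_2, loglik_2 = fit_lca(toy_data, n_classes=2)
print("1-class log-likelihood:", round(loglik_1, 2))
print("2-class log-likelihood:", round(loglik_2, 2))
print("2-class sizes:", [round(w, 3) for w in weights_2])
```

In practice, dedicated software uses many random starts and handles polytomous, ordinal, and continuous indicators, none of which this sketch attempts.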

Researchers and data analysts commonly need to group cases into classes or clusters. K-means is a common technique for this, but it has several problems. The potential difficulties include sensitivity to outliers (extreme values that can skew the results), the need for interval or ratio data (because the distance calculations are meaningful only when differences between values can be interpreted numerically), and some concerns about the order in which data is assembled.11 In some cases, data may not be appropriate for the K-means method. More fundamentally, the stability of clusters cannot be assumed because traditionally there has been no objective set of criteria to judge the suitability of solutions. K-means will always produce a solution, and some of those solutions are likely to fit your expectations.
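The outlier problem can be illustrated with a small sketch. The code below is a bare-bones one-dimensional K-means written for this illustration only (the function name `kmeans_1d`, the starting centroids, and the numbers are assumptions, not data from this paper). It shows that a single extreme value can capture a centroid of its own and force two otherwise distinct groups to merge, while the algorithm still returns a solution without complaint.

```python
def kmeans_1d(values, centroids, n_iter=20):
    """Plain K-means on a list of numbers, starting from the given centroids."""
    centroids = list(centroids)
    for _ in range(n_iter):
        # Assignment step: each value joins its nearest centroid.
        clusters = [[] for _ in centroids]
        for v in values:
            nearest = min(range(len(centroids)),
                          key=lambda i: abs(v - centroids[i]))
            clusters[nearest].append(v)
        # Update step: each centroid moves to the mean of its members
        # (an empty cluster keeps its previous centroid).
        new_centroids = [sum(c) / len(c) if c else centroids[i]
                         for i, c in enumerate(clusters)]
        if new_centroids == centroids:
            break
        centroids = new_centroids
    return centroids


clean = [1, 2, 3, 11, 12, 13]
with_outlier = clean + [100]

print(kmeans_1d(clean, [1, 13]))         # two groups recovered: [2.0, 12.0]
print(kmeans_1d(with_outlier, [1, 13]))  # outlier takes a centroid: [7.0, 100.0]
```

Nothing in the output signals that the second solution is poor, which is the point made above about the lack of objective criteria.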

LCA overcomes these problems with K-means clustering. The increase of computing power in the 1990s made LCA a very efficient technique. The best way to distinguish between LCA and cluster analysis techniques is to note that LCA is model-based and cluster analysis is not. “Model-based” means that a statistical model is assumed for the population from which the data was gathered.11 Both cluster analysis and LCA seek divisions that maximize the between-cluster differences and minimize the within-cluster differences. But in traditional cluster analysis this decision is arbitrary or subjective. In LCA, a statistical model allows the comparison to be statistically tested, so that the decision to adopt a particular model is less subjective, or at least has some grounding for comparison. In addition, the items used in the analysis do not need to have the same scale or have equal variances. Finally, LCA allows for the examination of the residuals between items used in the analysis. In other words, LCA is useful in examining the data that does not fit the model or models, thus allowing the analyst to judge the overall quality.
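For readers who want to see what “model-based” means in symbols, the basic latent class model can be written as follows. The notation is an assumption introduced here, not taken from this paper: C is the number of classes, π_c is the proportion of the population in class c, and y_ij is the response of case i to item j. The first line assumes that items are independent within a class (local independence); the second line is the posterior probability used to assign a case to a class.

```latex
\[
P(\mathbf{y}_i) = \sum_{c=1}^{C} \pi_c \prod_{j=1}^{J} P(y_{ij} \mid c),
\qquad \sum_{c=1}^{C} \pi_c = 1
\]
\[
P(c \mid \mathbf{y}_i) = \frac{\pi_c \prod_{j=1}^{J} P(y_{ij} \mid c)}{P(\mathbf{y}_i)}
\]
```

Because the model assigns a probability to every response pattern, models with different numbers of classes can be compared on their likelihoods, and departures from local independence show up as residual associations between pairs of items.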

A simulation study compared K-means analysis and LCA against discriminant function analysis with known groups, a method generally considered to be the “gold standard,” and tested how well variables predict group membership.10 In the study, group membership was known in advance, and the authors applied the three methods to the data to examine classification performance. They argued that the data they used favored K-means analysis. Even so, K-means had an 8% misclassification rate versus 1.3% for LCA.10

The remainder of this article focuses on procedures for running LC models, decision points, and what should be included in a manuscript for review.

Sample

This data comes from a larger data set with military and non-military personnel. The full data set involved 1100 individuals who completed several surveys (e.g., Reading Interests, Cognition) along with some pattern recognition tasks, such as the Raven's Matrices. The data was never published and is quite old now, but it did provide the foundation for other studies. For this example, there are five hundred participants who provided demographic data, some health history data, and depression data.

Basic reporting guidelines

  1. Show Evaluative Information for all models tested, all of it.
     a. Covariance Matrix (Table 7)
     b. Profiles
     c. Percentages in Each Class
     d. Evaluative criteria used for choosing a specific model
     e. Residual analysis
  2. Explain reasoning choices for choosing a specific model
  3. Explain choices related to “fixing” a bivariate residual relationship to zero.
  4. Provide software used and the version number. Software changes and you must let your reader know.
  5. Submit the traditional descriptive and frequency data along with correlation
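As one illustration of the evaluative criteria in the guidelines above, the sketch below computes AIC and BIC for a series of candidate models. It is a hedged example, not output from this study: the log-likelihood values and item layout are invented, the helper names are hypothetical, and the parameter count assumes an unrestricted latent class model with nominal items (software may count parameters differently for ordinal or continuous indicators). The sample size of 500 matches the example data set.

```python
import math


def lca_free_parameters(n_classes, categories_per_item):
    """Free parameters in an unrestricted LCA with nominal items."""
    class_size_params = n_classes - 1
    response_params = n_classes * sum(k - 1 for k in categories_per_item)
    return class_size_params + response_params


def information_criteria(log_lik, n_params, n_cases):
    """AIC and BIC from a model's log-likelihood; lower values are preferred."""
    return {
        "AIC": -2.0 * log_lik + 2.0 * n_params,
        "BIC": -2.0 * log_lik + n_params * math.log(n_cases),
    }


# Invented log-likelihoods for 1- to 4-class models (illustration only).
hypothetical_loglik = {1: -1450.0, 2: -1320.0, 3: -1295.0, 4: -1290.0}
categories = [2, 2, 3, 3]  # assumed number of response categories per item
n_cases = 500

table = {}
for n_classes, log_lik in hypothetical_loglik.items():
    n_params = lca_free_parameters(n_classes, categories)
    table[n_classes] = information_criteria(log_lik, n_params, n_cases)
    print(n_classes, "classes:", n_params, "parameters,",
          "AIC =", round(table[n_classes]["AIC"], 2),
          "BIC =", round(table[n_classes]["BIC"], 2))

best_by_bic = min(table, key=lambda c: table[c]["BIC"])
print("Lowest BIC:", best_by_bic, "classes")
```

Reporting the full table for every model tested, rather than only the winning row, lets readers see how close the competing solutions were.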

LC-regression

If examining classes is the starting point but the goal is to predict an outcome based on the classes, then a latent class regression analysis is needed. One could create the classes, save each individual's class assignment, and then use the class designation in a regression or logistic regression; alternatively, one could run the regression model all at once. The example below utilizes the same data and variables, with one switch: CESD-Category is now the outcome variable.
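The first option, saving a class designation for each individual, can be sketched as follows. This is an illustration under stated assumptions rather than the procedure used in this paper: the class sizes and item-response probabilities are invented, the items are treated as dichotomous and independent within a class, and the function names are hypothetical. Each case is assigned to the class with the highest posterior probability (modal assignment).

```python
def posterior_probabilities(row, class_sizes, item_probs):
    """Posterior probability of each class for one case with 0/1 responses."""
    joint = []
    for size, probs in zip(class_sizes, item_probs):
        p = size
        for y, prob_yes in zip(row, probs):
            p *= prob_yes if y == 1 else 1.0 - prob_yes
        joint.append(p)
    total = sum(joint)
    return [p / total for p in joint]


def modal_class(posterior):
    """Index of the class with the highest posterior probability."""
    return max(range(len(posterior)), key=lambda c: posterior[c])


# Invented two-class solution for three yes/no items (illustration only).
class_sizes = [0.6, 0.4]
item_probs = [[0.9, 0.8, 0.7],   # class 0: probability of "yes" on each item
              [0.2, 0.1, 0.3]]   # class 1

for row in ([1, 1, 1], [0, 0, 0], [1, 0, 1]):
    post = posterior_probabilities(row, class_sizes, item_probs)
    print(row, [round(p, 3) for p in post], "-> class", modal_class(post))
```

Modal assignment discards the uncertainty in the posterior probabilities, so a regression on the saved class labels can understate the relationship with the outcome; this is one reason to consider estimating the regression model all at once.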

The evaluation of the models begins

Conclusions

LCA and the advancements within LCA have provided a powerful alternative to previously used statistical techniques such as K-means for clustering. The ability to use a wide variety of data with different variances, along with regression and factor analysis options, makes latent class analysis an appealing option for many researchers. Given the improvements in ease of use and understanding of the analysis technique, I expect many more LC analyses to be published in the near future.

References (16)

  • A. McCutcheon. Latent Class Analysis. Quantitative Applications in the Social Sciences Series (1987)
  • P. Lazarsfeld et al. Latent Structure Analysis (1968)
  • G. McLachlan et al. Mixture Models: Inference and Application to Clustering (1988)
  • B. Everitt. Cluster Analysis (1993)
  • J. Banfield et al. Model-based Gaussian and non-Gaussian clustering. Biometrics (1993)
  • H. Bensmail et al. Inference in model based clustering. Statistics Comput (1997)
  • C. Fraley et al. MCLUST: Software for Model-based Cluster and Discriminant Analysis (1998)
  • G. McLachlan et al. The EMMIX software for the fitting of mixtures of normal and t-components. J Stat Soft (1999)
There are more references available in the full text version of this article.
