Breaking free from the limitations of classical test theory: Developing and measuring information systems scales using item response theory

https://doi.org/10.1016/j.im.2016.06.005

Abstract

Information systems (IS) research frequently uses survey data to measure the interplay between technological systems and human beings. Researchers have developed sophisticated procedures to build and validate multi-item scales that measure latent constructs. The vast majority of IS studies use classical test theory (CTT), but this approach suffers from three major theoretical shortcomings: (1) it assumes a linear relationship between the latent variable and observed scores, which rarely represents the empirical reality of behavioral constructs; (2) the true score can either not be estimated directly or only by making assumptions that are difficult to meet; and (3) parameters such as reliability, discrimination, location, or factor loadings depend on the sample being used. To address these issues, we present item response theory (IRT) as a collection of viable alternatives for measuring continuous latent variables by means of categorical indicators (i.e., measurement variables). IRT offers several advantages: (1) it assumes nonlinear relationships; (2) it allows more appropriate estimation of the true score; (3) it can estimate item parameters independently of the sample being used; (4) it allows the researcher to select items that are in accordance with a desired model; and (5) it applies and generalizes concepts such as reliability and internal consistency, and thus allows researchers to derive more information about the measurement process. We use a CTT approach as well as Rasch models (a special class of IRT models) to demonstrate how a scale for measuring hedonic aspects of websites is developed under both approaches. The results illustrate how IRT can be successfully applied in IS research and can provide better scale results than CTT. We conclude by explaining the circumstances under which applying IRT is most appropriate, as well as the limitations of IRT.

Introduction

Social science research and information systems (IS) research produce a wealth of empirical papers that use survey or experimental data either to create new measurement scales or to apply previously validated scales to measure constructs. In most cases, the authors rely on fundamental measurement principles that have been developed and refined in classical test theory (CTT) over decades. Although several shortcomings of this approach are increasingly understood, the underlying measurement paradigm of CTT remains largely unquestioned in IS. In line with a recent call in the IS literature to improve the methodological foundation of our domain [13], in particular its measurement and validation procedures [50], we present in this paper an alternative to CTT that opens up new perspectives for empirical IS research.

Psychometricians such as Spearman [82], [83], Thurstone [86], [87], Rasch [64], and Birnbaum [9] have formulated different statistical models to measure latent traits, that is, constructs of any type that cannot be observed directly. Two main approaches for measuring continuous latent traits emerged: CTT [e.g., [33], [45]] and Factor Analysis (FA) [e.g., [96]] on the one hand, and Item Response Theory (IRT) [e.g., [44]] on the other, with the former gaining widespread popularity.

Today, most research papers utilizing IRT can be found in psychology and educational testing, while the IRT paradigm is slowly but steadily gaining traction in social science and marketing research [74]. Several publications have clearly shown the advantages of this measurement approach [e.g., [28], [29]] and have thus sparked new interest in using IRT in behavioral research [22], [27], [73], [76]. Despite these promising developments, IS research has so far virtually ignored IRT, perhaps because IRT is frequently associated only with psychological testing. However, as Edelen and Reeve [20] have shown in their comprehensive study, “when used appropriately, IRT can be a powerful tool for questionnaire development, evaluation, and refinement, resulting in precise, valid, and relatively brief instruments that minimize response burden” (p. 5).

A few key example studies show that IRT and Rasch models, which are often perceived as being restricted to specific kinds of psychological testing, are in fact very versatile measurement methods applicable in a wide variety of disciplines. Rasch models are a special class of IRT models that focus on the requirements for fundamental measurement and are relatively easy to understand, whereas IRT in general deals with fitting flexible models to observed data.

An example from IS research is a paper published in Information Systems Research whose authors strove to understand software development practices [17]. They conclude that “The Rasch model analysis describes the likelihood of a practice deployment for any level of evolution and provides precise and meaningful measures” (p. 95). A marketing paper proposed a ten-item instrument for measuring customer satisfaction, a construct also frequently used in IS research [67]. Related examples from marketing include brand equity [99] and the presence of gender item bias [75]. A finance study used IRT to measure corporate social responsibility [57].

Moreover, Reise and Revicki [68] present several useful applications of IRT, including the assessment of data quality and the generation of item banks for hospital patients’ questionnaires, which bears important implications for researchers interested in the healthcare industry. Another interesting example from the healthcare sector is given by Melas et al. [54], who use IRT to illustrate that the previously assumed poor correlation between attitudes toward evidence-based practice and communication technology is a methodological artifact rather than a substantive fact. Additionally, the current PISA study (Programme for International Student Assessment), which is conducted in more than 60 countries, including the OECD member states (OECD, 2014), has successfully applied an extended version of the Rasch model [1]. Finally, Alvarez et al. [4] illustrate the versatility of the Rasch model in their publication on optimal road planning, where they use it to obtain an objective measure of road conditions.

In this paper, we therefore explain why IS researchers should consider adding IRT to their existing pool of methods. Typically, when researchers measure latent variable(s), they strive to find a “good” set of items that allows for reliable, highly informative, and possibly invariant measurement of the underlying construct. Such measurement cannot be sufficiently guaranteed by CTT and related approaches. The many models of IRT were developed to overcome this problem, to meet different goals, and to allow different insights into the measurement process. These models range from exploratory to confirmatory, from flexible to strict, and from parametric to nonparametric (for an overview, see Ref. [94]), and they try to meet different objectives in terms of what constitutes good measurement.

Objective measurement means “the repetition of a unit amount that maintains its size, within an allowable range of error, no matter which instrument, intended to measure the variable of interest, is used and no matter who or what relevant person or thing is measured” [66]. In this paper, we adopt and demonstrate the unique perspective of objective measurement typical for a class of IRT models, the family of Rasch models. Although Rasch models are restrictive in terms of item selection and model fit, they can provide a number of properties that are advantageous for scale development and substantive research based on these scales.
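
For reference, the simplest member of this family, the dichotomous Rasch model, expresses the probability that person v endorses item i as a logistic function of the difference between the person's trait level and the item's location:

    P(X_{vi} = 1 \mid \theta_v, \beta_i) = \frac{\exp(\theta_v - \beta_i)}{1 + \exp(\theta_v - \beta_i)}

Because the person parameter \theta_v and the item parameter \beta_i enter only through their difference, the item parameters can be estimated by conditional maximum likelihood without reference to the person distribution; this separability is the formal basis of the objective measurement described above.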

We argue that in the IS field certain conventions (such as treating measurement variables as metric), as well as the nature of CTT itself, can keep researchers from meeting their measurement goals because of the following limitations of CTT: (1) it assumes a linear relationship between the latent variable and observed scores; (2) the true score can either not be estimated directly or only by making strong assumptions; and (3) parameters such as reliability, discrimination, location, or factor loadings depend on the sample being used.

These limitations have a number of implications when CTT is used with categorical measures in behavioral IS research. For example, by assuming linear relationships, CTT treats a scale that is discrete and restricted to, say, 5 values as if it stretched continuously from minus infinity to plus infinity. Visualizations of data derived from categorical measures, however, show very different behavior, for example, accumulation at certain values, gaps between values, or more than a single peak. For these scales, the continuity assumption can serve only as an approximation. Another implication is that the sample dependence of parameters makes it hard to generalize results to a population, particularly if non-probabilistic sampling was used; constant replication and revalidation of results derived from such measures is needed to gauge their validity. Finally, inference about the behavior of the units in question, about possible group differences, or about the influence of a unit's characteristics can be associated with considerable bias.
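
A minimal sketch in R with simulated (hypothetical) data illustrates the first point: responses restricted to five categories pile up on a handful of values, so a model that assumes a score ranging continuously over the whole real line can only approximate them.

    ## Minimal sketch with simulated data: a 5-point Likert item treated
    ## as continuous hides its discreteness.
    set.seed(1)
    theta  <- rnorm(500)                          # latent trait values
    cuts   <- c(-Inf, -1.5, -0.5, 0.5, 1.5, Inf)  # arbitrary thresholds
    likert <- as.integer(cut(theta + rnorm(500, sd = 0.5), cuts))

    table(likert)   # responses accumulate on the five categories
    hist(likert, breaks = seq(0.5, 5.5, by = 1),
         main = "Observed 5-point responses", xlab = "Response category")
    ## A normal distribution fitted to these scores has support on
    ## (-Inf, Inf), while the observed data live on {1, ..., 5}.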

In contrast, IRT offers five benefits, in that it: (1) allows nonlinear relationships; (2) allows appropriate estimation of the true score; (3) can estimate item parameters independently of the sample being used; (4) allows the researcher to select items that are in accordance with a desired model; and (5) applies and generalizes concepts such as reliability and internal consistency, and thus allows researchers to derive more information about the measurement process.
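
The following sketch, using the eRm R package (see Hatzinger and Rusch in the reference list) on simulated data, illustrates benefits (2) and (3) in practice: conditional maximum likelihood (CML) estimation calibrates the item parameters without reference to the person distribution, after which person parameters, the IRT analogue of the true score, are estimated given the calibrated items.

    ## Sketch: Rasch calibration with the eRm package (simulated data).
    library(eRm)

    set.seed(42)
    X <- sim.rasch(persons = 300, items = 8)  # Rasch-conforming 0/1 data

    res <- RM(X)                 # dichotomous Rasch model, estimated by CML
    summary(res)                 # item location estimates and standard errors

    pp <- person.parameter(res)  # person (trait) estimates given the items
    head(coef(pp))               # latent trait estimates on the logit scale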

As a demonstration of the applicability of Rasch models to IS research, we developed a scale for measuring hedonic IS, an area of IS research that has gained increasing importance in recent years [18], [43], [93], [98]. For the purpose of this research, we initially create an item base that is as broad as possible to reflect the hedonic attributes of websites. To demonstrate the advantages that IRT models can offer, we perform an empirical comparative analysis of the scale results from a CTT versus a Rasch perspective. Our goal is to find those items that measure hedonism as a latent construct unidimensionally and objectively, and to investigate how the underlying construct is measured by the items. Before demonstrating the empirical advantages of IRT scales and our example hedonic measure, we first provide the requisite background on CTT and IRT.

Section snippets

The concepts and assumptions of CTT and IRT

Conceptually, CTT and IRT strive to achieve the same thing—namely, inference about a continuous latent trait based on a number of manifest indicators (i.e., measurement variables). Both approaches are concerned with how to approach reliability, internal consistency, and the construct validity of scales; how to infer estimates of the latent trait value for each subject; and how to gain information about and assert certain properties of the measurement process. They mainly differ in the response …
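
In standard notation, the contrast can be summarized as follows: CTT decomposes the observed score additively into a true score and an error term and defines reliability as the share of true-score variance,

    X = T + E, \qquad \rho_{XX'} = \frac{\sigma_T^2}{\sigma_X^2},

whereas IRT, as in the Rasch model shown earlier, links each categorical response to the latent trait through a nonlinear response function.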

Demonstration of scale development in an IS context

In this section, we demonstrate the applicability of the Rasch-type scale construction and measurement in IS by constructing a scale to measure hedonic IS. Hedonism, a powerful form of intrinsic motivation, has gained a lot of attention in the IS community, and several non-utilitarian constructs (i.e., non-extrinsic motivation) have been integrated into various theoretical models as its importance has become clearer. These constructs include perceived affective quality, cognitive absorption, …
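
A minimal sketch of the kind of polytomous Rasch analysis such a scale construction involves is given below, again using the eRm R package. Here, 'hedonic' stands for a hypothetical persons-by-items matrix of Likert responses recoded to start at 0, as eRm expects; it is not the study's actual data set.

    ## Sketch of a polytomous Rasch workflow on hypothetical data.
    library(eRm)

    res <- RSM(hedonic)          # rating scale model (polytomous Rasch)

    ## Andersen's LR test: item parameters should be invariant across
    ## subsamples if the Rasch model holds (sample independence).
    LRtest(res, splitcr = "median")

    ## Item-level fit: items with large misfit are candidates for removal.
    pp <- person.parameter(res)
    itemfit(pp)

    ## Person-item map: how the item locations cover the latent continuum.
    plotPImap(res)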

Theoretical implications

Zagorsek et al. [101], who use IRT to analyze the reliability of the leadership practices inventory, pointedly emphasize that “an instrument’s measurement precision is crucial for the quality of the inferences and decisions based on that instrument, whether the purpose is leader assessment in organizations or academic theory building” (p. 180). They further elaborate that wrong measurement invariably leads to wrong conclusions with far-reaching consequences. A further prominent example in this …

References (102)

  • T.M. Bechger et al.

    Using classical test theory in combination with item response theory

    Appl. Psychol. Meas.

    (2003)
  • A. Birnbaum

    Some latent trait models

  • D. Borsboom

    The attack of the psychometricians

    Psychometrika

    (2006)
  • A. Burton-Jones et al.

    Reconceptualizing system usage: an approach and empirical test

    Inf. Syst. Res.

    (2006)
  • J. Cadwell, Item response theory: Developing your intuition. (2012), Date last accessed: October 10, 2015, Retrieved...
  • W.W. Chin et al.

    Some considerations for articles introducing new and/or novel quantitative methods to IS researchers

    Eur. J. Inf. Syst.

    (2012)
  • L.J. Cronbach

    Coefficient alpha and the internal structure of tests

    Psychometrika

    (1951)
  • J.G. Dawes

    Do data characteristics change according to the number of scale points used? An experiment using 5 point, 7 point and 10 point scales

    Int. J. Market Res.

    (2008)
  • P. de Boeck et al.

    Explanatory Item Response Models: A Generalized Linear and Nonlinear Approach

    (2004)
  • S. Dekleva et al.

    Measuring software engineering evolution: a Rasch calibration

    Inf. Syst. Res.

    (1997)
  • L. Deng et al.

    User experience, satisfaction, and continual usage intention of IT

    Eur. J. Inf. Syst.

    (2010)
  • R. Dittrich et al.

    A paired comparison approach for the analysis of sets of Likert-scale responses

    Stat. Modell.

    (2007)
  • M.O. Edelen et al.

    Applying item response theory (IRT) modeling to questionnaire development, evaluation, and refinement

    Qual. Life Res.

    (2007)
  • S.E. Embretson et al.

    Item Response Theory for Psychologists

    (2000)
  • M. Ewing et al.

    An alternate approach to assessing cross-cultural measurement equivalence

    J. Advertising

    (2005)
  • G.H. Fischer

    Einführung in Die Theorie Psychologischer Tests [Introduction to Mental Test Theory]

    (1974)
  • G.H. Fischer et al.

    Some applications of logistic latent trait models with linear constraints on the parameters

    Appl. Psychol. Meas.

    (1982)
  • J. Fox, Polycor: Polychoric and Polyserial Correlations. R package version 0.7-7. (2009), Date last accessed: November...
  • J. Fox, Sem: Structural Equation Models. R package version 0.9-16. (2009), Date last accessed: April 18, 2012,...
  • A. Ganglmair-Wooliscroft

    A comparison of affective response to consumption in two contexts

    der markt: Int. J. Market.

    (2007)
  • A. Ganglmair et al.

    Advantages of Rasch modelling for the development of a scale to measure affective response to consumption

    Eur. Adv. Consum. Res.

    (2003)
  • A. Ganglmair et al.

    Measuring affective response to consumption using Rasch modelling

    J. Customer Satisfaction Dissatisfaction Complaining Behav.

    (2003)
  • C. Glas et al.

    Tests of fit for polytomous Rasch models

  • R. Göb et al.

    Ordinal methodology in the analysis of Likert scales

    Qual. Quant.

    (2007)
  • S.B. Green et al.

    Limitations of coefficient alpha as an index of test dimensionality

    Educ. Psychol. Meas.

    (1977)
  • H. Gulliksen

    Theory of Mental Tests

    (1950)
  • R. Hambleton et al.

    Fundamentals of Item Response Theory

    (1991)
  • R.K. Hambleton et al.

    Comparison of classical test theory and item response theory and their applications to test development

    Educ. Meas.

    (1993)
  • M.C. Hart

    Improving the discrimination of SERVQUAL by using magnitude scaling

  • R. Hatzinger et al.

    IRT models with relaxed assumptions in eRm: a manual-like instruction

    Psychol. Sci. Q.

    (2009)
  • D. Hooper et al.

    Structural equation modelling: guidelines for determining model fit

    Electron. J. Bus. Res. Methods

    (2008)
  • E. Huang

    The acceptance of women-centric websites

    J. Comput. Inf. Syst.

    (2005)
  • S. Jamieson

    Likert scales: how to (ab)use them

    Med. Educ.

    (2004)
  • C.B. Jarvis et al.

    A critical review of construct indicators and measurement model misspecification in marketing and consumer research

    J. Consum. Res.

    (2003)
  • R. Likert

    A technique for the measurement of attitudes

    Arch. Psychol.

    (1932)
  • C.-P. Lin et al.

    Extending technology usage models to interactive hedonic technologies: a theoretical model and empirical test

    Inf. Syst. J.

    (2010)
  • F.M. Lord

    Applications of Item Response Theory to Practical Testing Problems

    (1980)
  • F.M. Lord et al.

    Statistical Theories of Mental Test Scores

    (1968)
  • Paul Benjamin Lowry et al.

    Proposing the multimotive information systems continuance model (MISC) to better explain end-user system evaluations and continuance intentions

    J. Assoc. Inf. Syst. (JAIS)

    (2015)
  • P.B. Lowry et al.

    Taking ‘fun and games’ seriously: proposing the hedonic-motivation system adoption model (HMSAM)

    J. Assoc. Inf. Syst.

    (2013)

    Thomas Rusch is Assistant Professor and Statistical Consultant at the Competence Center for Empirical Research Methods at WU Vienna University of Economics and Business. His research focuses on applied statistics and data analysis, computational statistics, multivariate statistics, exploratory data analysis and psychometrics.

    Dr. Paul Benjamin Lowry is a Full Professor of Information Systems at the Information Systems Department of the City University of Hong Kong. He received his Ph.D. in Management Information Systems from the University of Arizona, where he was advised by Jay F. Nunamaker Jr. He serves as the co-editor-in-chief of AIS Transactions on HCI; guest SE and guest AE at MIS Quarterly (MISQ); SE at Decision Sciences; and an Associate Editor at the European Journal of IS (EJIS), Information & Management (I&M), and Communications of the AIS (CAIS). Professor Lowry has published over 180 articles, with 85+ of these in journals such as MIS Quarterly (MISQ), Information Systems Research (ISR), Journal of Management Information Systems (JMIS), Journal of the Association for Information Systems (JAIS), and Information & Management (I&M).

    Patrick Mair is a Senior Lecturer in Statistics at Harvard University. He obtained his PhD in Statistics from the University of Vienna in 2005 and did his Habilitation (Venia Legendi) in Statistics in 2010. From 2005 to 2011, he worked as an Assistant Professor at the Department of Statistics and Mathematics, WU Vienna University of Economics and Business, and from 2007 to 2008 he was a Research Fellow at the Department of Statistics/Department of Psychology at UCLA. Since 2013, he has worked as a Senior Lecturer in Statistics at the Department of Psychology, Harvard University. His research focuses on computational and applied statistics with special emphasis on psychometric methods such as latent variable models and multivariate exploratory techniques. The research typically involves some programming work in the R environment for statistical computing.

    Horst Treiblmaier is a Full Professor at the Faculty of International Management at Modul University Vienna. Previously, he was a Visiting Professor at Purdue University, UCLA and UBC. His work has appeared in journals such as Information Systems Journal, Structural Equation Modeling, Journal of Business Economics, Information & Management, Communications of the AIS, Journal of Electronic Commerce Research and Schmalenbach Business Review. He serves as an AE at AIS Transactions on Human-Computer Interaction. His research interests include gamification, Physical Internet (a novel Supply Chain concept to organize the flow of goods similar to the flow of data in the Internet), e-commerce and methodology.
