Breaking free from the limitations of classical test theory: Developing and measuring information systems scales using item response theory
Introduction
Social science research and information systems (IS) research produce a wealth of empirical papers that use survey or experimental data either to create new measurement scales or to apply previously validated scales to measure constructs. In most cases, the authors rely on fundamental measurement principles that have been developed and refined in classical test theory (CTT) over decades. Although several shortcomings of this approach are increasingly well understood, the underlying measurement paradigm of CTT remains largely unquestioned in IS. In line with a recent call in the IS literature to improve the methodological foundation of our domain [13], in particular its measurement and validation procedures [50], we present in this paper an alternative to CTT that opens up new perspectives for empirical IS research.
Psychometricians such as Spearman [82], [83], Thurstone [86], [87], Rasch [64], and Birnbaum [9] have formulated different statistical models for measuring latent traits, that is, constructs of any type that cannot be observed directly. Two main approaches for measuring continuous latent traits emerged: CTT [e.g., [33], [45]] and factor analysis (FA) [e.g., [96]] on the one hand and item response theory (IRT) [e.g., [44]] on the other, with the former gaining widespread popularity.
Today, most research papers utilizing IRT can be found in psychology and educational testing, while the IRT paradigm is slowly but steadily gaining traction in social science and marketing research [74]. Several publications have clearly shown the advantages of this measurement approach [e.g., [28], [29]] and have thus sparked new interest in using IRT in behavioral research [22], [27], [73], [76]. Despite these promising developments, IS research has so far virtually ignored IRT, perhaps because IRT is frequently associated only with psychological testing. However, as Edelen and Reeve [20] have shown in their comprehensive study, “when used appropriately, IRT can be a powerful tool for questionnaire development, evaluation, and refinement, resulting in precise, valid, and relatively brief instruments that minimize response burden” (p. 5).
A few key example studies show that IRT and Rasch models, which are often perceived as being restricted to specific kinds of psychological testing, are in fact very versatile measurement methods applicable in a wide variety of disciplines. Rasch models are a special class of IRT models that focus on the requirements for fundamental measurement and are relatively easy to understand, whereas IRT in general deals with fitting flexible models to observed data.
One example in IS research is a paper published in Information Systems Research in which the authors strove to understand software development practices [17]. They conclude that “The Rasch model analysis describes the likelihood of a practice deployment for any level of evolution and provides precise and meaningful measures” (p. 95). A marketing paper proposed a ten-item instrument for measuring customer satisfaction, a construct also frequently used in IS research [67]. Related examples from marketing include brand equity [99] and the detection of gender item bias [75]. A finance study used IRT to measure corporate social responsibility [57].
Moreover, Reise and Revicki [68] present several useful applications of IRT, including the assessment of data quality and the generation of item banks for hospital patients’ questionnaires, which bears important implications for researchers interested in the healthcare industry. Another interesting example from the healthcare sector is given by Melas et al. [54], who illustrate with the help of IRT that the previously assumed poor correlation between attitudes toward evidence-based practice and communication technology is a methodological artifact rather than a substantive fact. Additionally, the current PISA study (Programme for International Student Assessment), which is conducted in most OECD member countries (OECD, 2014), has successfully applied an extended version of the Rasch model [1]. Finally, Alvarez et al. [4] illustrate the versatility of the Rasch model in their publication on optimal road planning, where they use it to obtain an objective measure of road conditions.
In this paper, we therefore explain why IS researchers should consider adding IRT to their existing pool of methods. Typically, when researchers measure latent variables, they strive to find a “good” set of items that allows for reliable, highly informative, and ideally invariant measurement of the underlying construct. Such measurement cannot be sufficiently guaranteed by CTT and related approaches. The many models of IRT were developed to overcome this problem, to meet different goals, and to allow different insights into the measurement process. These models range from exploratory to confirmatory, from flexible to strict, and from parametric to nonparametric (for an overview see Ref. [94]), and they pursue different objectives in terms of what constitutes good measurement.
Objective measurement means “the repetition of a unit amount that maintains its size, within an allowable range of error, no matter which instrument, intended to measure the variable of interest, is used and no matter who or what relevant person or thing is measured” [66]. In this paper, we adopt and demonstrate the unique perspective of objective measurement typical for a class of IRT models, the family of Rasch models. Although Rasch models are restrictive in terms of item selection and model fit, they can provide a number of properties that are advantageous for scale development and substantive research based on these scales.
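To make the family concrete, the dichotomous Rasch model specifies the probability that person $v$ endorses item $i$ as a logistic function of the difference between the person parameter $\theta_v$ and the item difficulty $\beta_i$:

```latex
P(X_{vi} = 1 \mid \theta_v, \beta_i) = \frac{\exp(\theta_v - \beta_i)}{1 + \exp(\theta_v - \beta_i)}
```

Because person and item parameters enter the model only through their difference, they can be separated in estimation; this separability is what underlies the "specific objectivity" claimed for Rasch measurement.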
We argue that in the IS field certain conventions (such as treating measurement variables as metric), as well as the nature of CTT itself, can keep research goals from being met because of the following limitations of CTT: (1) it assumes a linear relationship between the latent variable and observed scores; (2) the true score either cannot be estimated directly or only by making strong assumptions; and (3) parameters such as reliability, discrimination, location, or factor loadings depend on the sample being used.
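For comparison, recall the additive decomposition at the core of CTT, which splits an observed score $X$ into a true score $T$ and an error term $E$:

```latex
X = T + E, \qquad \mathbb{E}(E) = 0, \qquad \rho_{XX'} = \frac{\sigma_T^2}{\sigma_T^2 + \sigma_E^2}
```

Since $T$ is defined only as an expectation over hypothetical replications, it cannot be estimated directly for an individual without further assumptions (limitation 2); and because the reliability $\rho_{XX'}$ is a ratio of variances computed in a particular sample, it changes whenever the sample's trait variance changes (limitation 3).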
These limitations have a number of implications when CTT is used with categorical measures in behavioral IS research. For example, by assuming linear relationships, CTT treats a scale that is discrete and restricted to, say, 5 values as if it were stretching continuously from minus infinity to plus infinity. But visualizations of data derived from categorical measures show very different behavior, for example, accumulation at certain values, gaps between values, or more than a single peak. For such scales, the continuity assumption can only serve as an approximation. Another implication is that the sample dependence of parameters makes it hard to generalize results to a population, particularly if non-probabilistic sampling was used. Constant replication and revalidation of results derived from such measures is needed to gauge their validity. Also, inference about the behavior of the units in question, about possible group differences, or about the influence of a unit's characteristics can be subject to considerable bias.
In contrast, IRT offers five benefits, in that it: (1) allows nonlinear relationships; (2) allows appropriate estimation of the true score; (3) can estimate item parameters independently of the sample being used; (4) allows the researcher to select items that are in accordance with a desired model; and (5) applies and generalizes concepts such as reliability and internal consistency, and thus allows researchers to derive more information about the measurement process.
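As an illustration of benefit (3), the following sketch (our own toy simulation, not part of the scale study reported below) generates dichotomous Rasch responses in two samples with very different trait distributions. The CTT item difficulty (the proportion of endorsements) shifts markedly with the sample, while a simple conditional Rasch estimate of the difficulty gap between two items stays close to its true value of 1 logit in both samples:

```python
import random
import math

def simulate(thetas, betas, rng):
    """Simulate dichotomous Rasch responses for given person and item parameters."""
    data = []
    for theta in thetas:
        row = [1 if rng.random() < 1 / (1 + math.exp(-(theta - b))) else 0
               for b in betas]
        data.append(row)
    return data

def classical_difficulty(data, item):
    """CTT item difficulty: proportion of endorsements (sample dependent)."""
    return sum(row[item] for row in data) / len(data)

def conditional_logit(data, i, j):
    """Conditional estimate of beta_j - beta_i, using only persons who
    endorsed exactly one of the two items (sample-free under the Rasch model)."""
    n_i = sum(1 for row in data if row[i] == 1 and row[j] == 0)
    n_j = sum(1 for row in data if row[i] == 0 and row[j] == 1)
    return math.log(n_i / n_j)

rng = random.Random(1)
betas = [0.0, 1.0]                                    # item difficulties, 1 logit apart
able = [rng.gauss(1.0, 1.0) for _ in range(20000)]    # high-trait sample
less = [rng.gauss(-1.0, 1.0) for _ in range(20000)]   # low-trait sample
d_able = simulate(able, betas, rng)
d_less = simulate(less, betas, rng)

# CTT difficulty of item 0 differs markedly between d_able and d_less,
# while conditional_logit(..., 0, 1) is near 1.0 in both samples.
```

The conditional estimate works because, among persons who endorsed exactly one of the two items, the odds that it was the first item equal $\exp(\beta_2 - \beta_1)$ regardless of the person's trait level, so the person parameters cancel out of the estimate.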
As a demonstration of the applicability of Rasch models to IS research, we developed a scale for measuring hedonic IS, an area of IS research that has gained increasing importance in recent years [18], [43], [93], [98]. For the purpose of this research, we initially create an item base that is as broad as possible to reflect the hedonic attributes of websites. To demonstrate the advantages that IRT models can offer, we perform an empirical comparative analysis of the scale results from a CTT versus Rasch perspective. Our goal is to find those items that measure hedonism as a latent construct unidimensionally and objectively, and to investigate how the underlying construct is measured by the items. Before demonstrating the empirical advantages of IRT scales and our example hedonic measure, we first provide the requisite background on CTT and IRT.
Section snippets
The concepts and assumptions of CTT and IRT
Conceptually, CTT and IRT strive to achieve the same thing—namely, inference about a continuous latent trait based on a number of manifest indicators (i.e., measurement variables). Both approaches are concerned with how to approach reliability, internal consistency, and the construct validity of scales; how to infer estimates of the latent trait value for each subject; and how to gain information about and assert certain properties of the measurement process. They mainly differ in the response
Demonstration of scale development in an IS (information systems) context
In this section, we demonstrate the applicability of the Rasch-type scale construction and measurement in IS by constructing a scale to measure hedonic IS. Hedonism, a powerful form of intrinsic motivation, has gained a lot of attention in the IS community, and several non-utilitarian constructs (i.e., non-extrinsic motivation) have been integrated into various theoretical models as its importance has become clearer. These constructs include perceived affective quality, cognitive absorption,
Theoretical implications
Zagorsek et al. [101], who use IRT to analyze the reliability of the leadership practices inventory, pointedly emphasized that “an instrument’s measurement precision is crucial for the quality of the inferences and decisions based on that instrument, whether the purpose is leader assessment in organizations or academic theory building” (p. 180). They further elaborate that wrong measurement invariably leads to wrong conclusions with far-reaching consequences. A further prominent example in this
References
- et al., Development of a measure model for optimal planning of maintenance and improvement of roads, Comput. Ind. Eng. (2007)
- et al., Detecting gender item bias and differential manifest response behavior: a Rasch-based solution, J. Bus. Res. (2014)
- et al., Extrinsic versus intrinsic motivations for consumers to shop on-line, Inf. Manage. (2005)
- et al., Exploratory factor analysis revisited: how robust methods support the detection of hidden multivariate data structures in IS research, Inf. Manage. (2010)
- et al., Application of multivariate Rasch models in international large-scale educational assessments
- et al., Time flies when you’re having fun: cognitive absorption and beliefs about information technology usage, MIS Q. (2000)
- Analysis of Ordinal Categorical Data (2010)
- Rasch Models for Measurement (1988)
- Measuring Intelligence: Facts and Fallacies (2004)
- et al., Latent Variable Models and Factor Analysis (1999)
- Using classical test theory in combination with item response theory, Appl. Psychol. Meas.
- Some latent trait models
- The attack of the psychometricians, Psychometrika
- Reconceptualizing system usage: an approach and empirical test, Inf. Syst. Res.
- Some considerations for articles introducing new and/or novel quantitative methods to IS researchers, Eur. J. Inf. Syst.
- Coefficient alpha and the internal structure of tests, Psychometrika
- Do data characteristics change according to the number of scale points used? An experiment using 5 point, 7 point and 10 point scales, Int. J. Market Res.
- Explanatory Item Response Models: A Generalized Linear and Nonlinear Approach
- Measuring software engineering evolution: a Rasch calibration, Inf. Syst. Res.
- User experience, satisfaction, and continual usage intention of IT, Eur. J. Inf. Syst.
- A paired comparison approach for the analysis of sets of Likert-scale responses, Stat. Modell.
- Applying item response theory (IRT) modeling to questionnaire development, evaluation, and refinement, Qual. Life Res.
- Item Response Theory for Psychologists
- An alternate approach to assessing cross-cultural measurement equivalence, J. Advertising
- Einführung in die Theorie psychologischer Tests [Introduction to Mental Test Theory]
- Some applications of logistic latent trait models with linear constraints on the parameters, Appl. Psychol. Meas.
- A comparison of affective response to consumption in two contexts, der markt: Int. J. Market.
- Advantages of Rasch modelling for the development of a scale to measure affective response to consumption, Eur. Adv. Consum. Res.
- Measuring affective response to consumption using Rasch modelling, J. Customer Satisfaction Dissatisfaction Complaining Behav.
- Tests of fit for polytomous Rasch models
- Ordinal methodology in the analysis of Likert scales, Qual. Quant.
- Limitations of coefficient alpha as an index of test dimensionality, Educ. Psychol. Meas.
- Theory of Mental Tests
- Fundamentals of Item Response Theory
- Comparison of classical test theory and item response theory and their applications to test development, Educ. Meas.
- Improving the discrimination of SERVQUAL by using magnitude scaling
- IRT models with relaxed assumptions in eRm: a manual-like instruction, Psychol. Sci. Q.
- Structural equation modelling: guidelines for determining model fit, Electron. J. Bus. Res. Methods
- The acceptance of women-centric websites, J. Comput. Inf. Syst.
- Likert scales: how to (ab)use them, Med. Educ.
- A critical review of construct indicators and measurement model misspecification in marketing and consumer research, J. Consum. Res.
- A technique for the measurement of attitudes, Arch. Psychol.
- Extending technology usage models to interactive hedonic technologies: a theoretical model and empirical test, Inf. Syst. J.
- Applications of Item Response Theory to Practical Testing Problems
- Statistical Theories of Mental Test Scores
- Proposing the multimotive information systems continuance model (MISC) to better explain end-user system evaluations and continuance intentions, J. Assoc. Inf. Syst. (JAIS)
- Taking ‘fun and games’ seriously: proposing the hedonic-motivation system adoption model (HMSAM), J. Assoc. Inf. Syst.
Thomas Rusch is Assistant Professor and Statistical Consultant at the Competence Center for Empirical Research Methods at WU Vienna University of Economics and Business. His research focuses on applied statistics and data analysis, computational statistics, multivariate statistics, exploratory data analysis and psychometrics.
Dr. Paul Benjamin Lowry is a Full Professor of Information Systems at the Information Systems Department of the City University of Hong Kong. He received his Ph.D. in Management Information Systems from the University of Arizona, where he was advised by Jay F. Nunamaker Jr. He serves as co-editor-in-chief at AIS Transactions on HCI; guest SE and guest AE at MIS Quarterly (MISQ); SE at Decision Sciences; and AE at the European Journal of IS (EJIS), Information & Management (I&M), and Communications of the AIS (CAIS). Professor Lowry has published over 180 articles, with 85+ of these in journals such as MIS Quarterly (MISQ), Information Systems Research (ISR), Journal of Management Information Systems (JMIS), Journal of the Association for Information Systems (JAIS), and Information & Management (I&M).
Patrick Mair is a Senior Lecturer in Statistics at Harvard University. He obtained his PhD in Statistics from the University of Vienna in 2005 and completed his Habilitation (Venia Legendi) in Statistics in 2010. From 2005 to 2011 he worked as an Assistant Professor at the Department of Statistics and Mathematics, WU Vienna University of Economics and Business. From 2007 to 2008 he was a Research Fellow at the Department of Statistics/Department of Psychology at UCLA. Since 2013 he has worked as a Senior Lecturer in Statistics at the Department of Psychology, Harvard University. His research focuses on computational and applied statistics with special emphasis on psychometric methods such as latent variable models and multivariate exploratory techniques. The research typically involves some programming work in the R environment for statistical computing.
Horst Treiblmaier is a Full Professor at the Faculty of International Management at Modul University Vienna. Previously, he was a Visiting Professor at Purdue University, UCLA and UBC. His work has appeared in journals such as Information Systems Journal, Structural Equation Modeling, Journal of Business Economics, Information & Management, Communications of the AIS, Journal of Electronic Commerce Research and Schmalenbach Business Review. He serves as an AE at AIS Transactions on Human-Computer Interaction. His research interests include gamification, Physical Internet (a novel Supply Chain concept to organize the flow of goods similar to the flow of data in the Internet), e-commerce and methodology.