Introduction
A central challenge for referees in sports games is decision-making in dynamic and complex situations under physical load (Helsen & Bultynck, 2004). Over the past decade, the game dynamics of sports such as handball have increased, leading to a higher physical load that referees must cope with (Bilge, 2012; Michalsik, 2018). In handball specifically, well-trained endurance is essential. Handball requires aerobic and anaerobic endurance to perform well at the highest level of competition, as an insufficient physical capacity could impair
referees’ decision-making (RDM; Belcic, Ruzic, & Marošević,
2020; Morillo, Reigal, Hernández-Mendo, Montaña, & Morales-Sánchez,
2017). Also considering that referees are especially assessed through correct RDM, research on the relationship between physical load and RDM is of particular relevance (MacMahon et al.,
2015).
RDM, in terms of knowledge and application of the rules of the game, is considered a central cornerstone of referees’ performance (Mascarenhas, Collins, & Mortimer,
2005b). RDM is understood as a primarily perceptual–cognitive process (Gaoua, de Oliveira, & Hunter,
2017). That is, to ensure high-quality decision-making on dynamically evolving game situations, referees must be capable of picking up, processing and integrating several environmental cues (Helsen, MacMahon, & Spitz,
2019; MacMahon et al.,
2015; Plessner & Haar,
2006). Research suggests that perceptual–cognitive functions can be facilitated or impaired through physical load, depending on the task and the exercise intensity (Chang, Labban, Gapin, & Etnier,
2012). Hence, RDM might also be enhanced or impaired by physical load (Bloß et al.,
2020; Schmidt et al.,
2019). Currently, there are only a few studies investigating the relationship between physical load and RDM in sports, but none in handball.
Current research on the relationship between physical load and RDM can be differentiated into the consideration of external (e.g. running distance) and internal load (e.g. heart rate; Impellizzeri, Marcora, & Coutts,
2019). With regard to studies using
external load as a parameter for physical load (Bloß et al.,
2020; Impellizzeri et al.,
2019), findings indicate a relationship with RDM neither for running time (Paradis, Larkin, & O’Connor, 2015) nor for distance covered (Emmonds et al.,
2015; Mascarenhas et al.,
2009). While findings by Oudejans et al. (
2005) indicate that more RDM errors occur at higher velocities (> 8 km/h), the study by Gomez-Carmona and Pino-Ortega (2016) points to less accurate RDM at lower velocities (< 8 km/h). Furthermore, findings of Catteeuw, Gilis, Wagemans, and Helsen (
2010), Mascarenhas et al. (
2009) and Emmonds et al. (
2015) suggest that RDM does not deteriorate through progressive match periods, while Ahmed, Davison, and Dixon (
2017) and Mallo, Frutos, Juarez, and Navarro (
2012) found an impairment of RDM in the second compared with the first half of a match. RDM was also found to decrease under high physical load in the last 10 minutes (rugby; Emmonds et al., 2015) or the last 15 minutes of a match (soccer; Mallo et al.,
2012). In accordance with previous findings, Samuel, Galily, Guy, Sharoni, and Tenenbaum (
2019) showed in a laboratory setting that RDM of soccer referees decreased from quarter two to three as well as from quarter three to four. In contrast, Emmonds et al. (
2015) found improved RDM in rugby referees from minutes 40–50 to minutes 50–60, and findings from Larkin et al. (2014) indicate an improvement with increasing match period (quarters; Australian football). Descriptively, RDM in soccer referees was less accurate in the first 15 minutes of a match and became more accurate thereafter (Mascarenhas et al.,
2009). Regarding studies investigating the effects of
internal load on RDM, one study did not reveal a relationship between RDM and physical exertion (blood lactate; Larkin et al.,
2014). Similarly, Emmonds et al. (
2015) and Mascarenhas et al. (
2009) did not find a relationship between heart rate and RDM, whereas Gomez-Carmona and Pino-Ortega (
2016) reported that RDM errors occur especially above 95% of referees’ maximum heart rate.
Overall, findings on the relationship between physical load and RDM are heterogeneous. A central issue is that previous studies mostly conducted ex-post video analyses and thus systematically controlled neither external nor internal load (Bloß et al.,
2020; MacMahon et al.,
2015). Also, confounding variables such as psychological load (e.g. crowd noise; Balmer et al.,
2007; e.g. rumination; Poolton, Siu, & Masters,
2011) or environmental conditions (e.g. temperature; Watkins et al.,
2014) were not controlled in previous research. Consequently, subsequent studies should conduct laboratory studies in which potential confounding variables can be controlled. This approach, in contrast to much of the research to date, would provide the added value of being able to control the internal load as the individual response to an external load (Bloß et al.,
2020; Impellizzeri et al.,
2019). It is important to note that these studies must be as externally valid as possible, meaning that the task and exercise are representative to examine the effects of physical load on RDM (Bloß et al.,
2020; Hancock, Bennett, Roaten, Chapman, & Stanley,
2021).
Furthermore, in the past, RDM has been investigated within theoretical frameworks such as the social information processing model (e.g. see Plessner & Haar,
2006). However, recent research suggests that RDM should be examined under more realistic or representative conditions, following the perspective of naturalistic decision-making (NDM), and consequently also under physical load, for instance (Kittel, Cunningham, Larkin, Hawkey, & Rix-Lièvre,
2021; Mascarenhas et al.,
2009; Mascarenhas, Collins, Mortimer, & Morris,
2005; Mascarenhas et al.,
2005). NDM is a framework that aims to investigate decision-making in real-world settings as well as in experiments that approximate real-world conditions as closely as possible. Characteristics of NDM are time pressure, high risks, multiple players, uncertain and dynamic environments, ill-structured problems, shifting or competing goals and action/feedback loops (Orasanu & Connolly, 1993; Zsambok, 1997); however, not all characteristics need to be present for a study to be labelled ‘naturalistic’ (Mascarenhas et al., 2005). Studies oriented towards naturalistic criteria are encouraged to use representative tasks, meaning that studies of RDM in sports games must simulate decision-making situations as closely as possible (Mascarenhas et al., 2009). Researchers must therefore not only generate representative decision-making tasks but also, for instance, investigate RDM under physical load. From the NDM perspective, however, studies must also consider how the decision-making of the decision-maker of interest (in this study, a handball referee) is composed (Zsambok,
1997). Hence, in accordance with the international rule book (International Handball Federation,
2016), a handball referee has to decide whether or not to whistle, that is, whether or not to call a foul. Next, it is important to understand the reasoning behind the decision: if a referee calls a foul, there must be a reason for this decision (‘why whistling’; Mascarenhas, Collins, & Mortimer, 2005a). Knowledge of the reasoning supports referees’ understanding of the situation and may enable them to anticipate a critical situation before it occurs (Mascarenhas et al., 2005). In handball, the underlying reason for a foul decision is a specific rule violation, which can be differentiated into several types of fouls (e.g. clinging or grabbing the throwing arm) that the referee indicates via hand signals during a match. In addition, in handball, the punishment (e.g. yellow card with progressive 2‑minute suspension, 2‑minute suspension) is determined by the type of foul (International Handball Federation, 2016). Collectively, RDM in handball referees is therefore to be differentiated into deciding whether or not to call a foul, i.e. whether or not to whistle (decision), and correctly determining the type of foul (reasoning 1) and the punishment (reasoning 2), which is based on the type of foul. The need to correctly determine both the decision and the reasonings fits with the suggestion that referees must be able to avoid categorisation errors so as not to ‘over-punish’ or ‘under-punish’ players (MacMahon et al., 2015). Hence, to systematically re-examine the evidence on the relationship between physical load and handball referees’ decision-making under naturalistic criteria, research needs to consider both the referees’ decisions and both reasonings.
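The three-part structure described above (decision, reasoning 1, reasoning 2) can be illustrated with a small data structure. The following Python sketch is purely illustrative: the class name `Call`, the function `score`, the string labels, and the rule that reasonings are only scored when a foul is both present and called are assumptions made for this example, not the scoring scheme used in the studies.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Call:
    """One referee response (or the reference solution) for a video scene."""
    foul: bool                        # decision: whistle / call a foul or not
    foul_type: Optional[str] = None   # reasoning 1: type of foul
    punishment: Optional[str] = None  # reasoning 2: punishment


def score(response: Call, reference: Call) -> dict:
    """Score the decision and both reasonings separately (hypothetical scheme)."""
    decision_ok = response.foul == reference.foul
    # Assumption: reasonings are only scored if the scene is a foul AND it was called.
    if not (reference.foul and response.foul):
        return {"decision": decision_ok, "reasoning_1": None, "reasoning_2": None}
    return {
        "decision": decision_ok,
        "reasoning_1": response.foul_type == reference.foul_type,
        "reasoning_2": response.punishment == reference.punishment,
    }


# Correct decision and foul type, but too lenient a punishment ("under-punish").
reference = Call(True, "clinging", "2-minute suspension")
response = Call(True, "clinging", "yellow card")
print(score(response, reference))
```

Scoring the three components separately makes a categorisation error visible even when the whistle itself was correct.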
Here, in two studies, we aimed to examine the effects of physical load on RDM, i.e. referees’ decisions (calling a foul or not) and reasonings (type of foul and punishment), in top-class handball referees by administering externally valid tasks for physical load and RDM in an NDM criteria-oriented approach (Bloß et al.,
2020; Hancock et al.,
2021). Considering recent findings, we hypothesised that physical load affects RDM. Due to the ambiguous evidence about the effects of physical load on referees’ decision-making and the lack of research on referees’ reasonings, however, we could not formulate well-grounded separate hypotheses regarding the effects of physical load on referees’ decisions and reasonings, respectively, and we also refrained from making any directional predictions.
Discussion
Our studies aimed to examine the hypothesised effects of physical load on top-class handball referees’ decisions and reasonings. Although the results of study 1 indicate that physical load affects referees’ decisions, we did not find evidence for an effect in study 2. Furthermore, results of both studies point to an effect of physical load on referees’ reasonings.
In study 1, referees’ decision correctness improved from initial to medium physical load, which corroborates previous studies indicating that decision correctness improves under physical load (Emmonds et al.,
2015; Larkin et al.,
2014; Mascarenhas et al.,
2009). Moreover, decision correctness decreased under maximal physical load. This is consistent with previous research indicating that correctness decreases under maximal physical load or at the end of a match (Ahmed et al.,
2017; Elsworthy et al.,
2014; Emmonds et al.,
2015; Gomez-Carmona & Pino-Ortega,
2016; Mallo et al.,
2012; Oudejans et al.,
2005; Samuel et al.,
2019). The decrease also fits with recent studies pointing out an impairment of cognitive processes under maximal physical load (Schmidt et al.,
2019). Furthermore, in study 1, the decision correctness was lowest at the beginning of the test, which corresponds to previous research showing that referees were less accurate at the beginning of a match and that they made more correct decisions during the match (Emmonds et al.,
2015; Larkin et al.,
2014; Mascarenhas et al.,
2009). The latter may be in line with the calibration effect: in early stages of a match, referees observe foul and no-foul situations to develop an internal rating scale for a match before making strict decisions (e.g. awarding a yellow card for a foul; Unkelbach & Memmert,
2008). Hence, in the video test referees may have adopted a similar ‘cautious’ approach in the beginning. Even though decision correctness decreased under maximal physical load, it was still on a higher level than under initial physical load. In study 2, however, decision correctness remained constant across physical load levels. Besides methodological limitations which will be discussed in detail below, one potential explanation for the diverging results could be the improved endurance performance of the participants. A comparison of the mean running distances of participants who took part in both studies revealed that endurance performance improved from
M = 1,290 m (SD = 299 m; running time: M = 11:10 min, SD = 2:23 min) in study 1 to M = 1,633 m (SD = 340 m; running time: M = 14:00 min, SD = 2:40 min) in study 2. The improvement in running distance reflects an enhanced endurance performance, which could, in turn, have altered subjectively perceived fatigue (Enoka & Duchateau,
2016). As previous research indicates that a higher endurance capacity is associated with better attention (de Sousa et al., 2019) and that a higher physical fitness level is associated with better cognitive performance (Luque-Casado, Zabala, Morales, Mateo-March, & Sanabria, 2013), referees may have been more attentive and concentrated in study 2 (Morillo et al., 2017; Schmidt et al., 2019). Together with familiarisation with the YYTR, this might ultimately have resulted in more correct decisions across the physical load conditions. Furthermore, sensitivity analyses revealed that the smallest effect size to which both studies are sensitive is ηp² = 0.10 (calculations revealed similar values for both studies). Hence, from a statistical point of view, effects of at least this size in the decision analyses can be reliably detected (Perugini, Gallucci, & Costantini,
2018).
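To put the reported sensitivity value into a more familiar metric, a partial eta squared can be converted to Cohen's f with the standard relation f = sqrt(ηp² / (1 − ηp²)). The short Python sketch below only performs this conversion for the value reported above; the function name is chosen for this example, and the sketch does not reproduce the sensitivity analysis itself.

```python
import math


def partial_eta_squared_to_cohens_f(eta_p2: float) -> float:
    """Convert partial eta squared to Cohen's f: f = sqrt(eta_p2 / (1 - eta_p2))."""
    if not 0.0 <= eta_p2 < 1.0:
        raise ValueError("partial eta squared must lie in [0, 1)")
    return math.sqrt(eta_p2 / (1.0 - eta_p2))


# Smallest effect size both studies are sensitive to, as reported in the text.
f = partial_eta_squared_to_cohens_f(0.10)
print(round(f, 3))  # 0.333
```

By Cohen's conventional benchmarks for f (0.25 medium, 0.40 large), this corresponds to an effect between medium and large, so smaller effects may have gone undetected.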
With regard to the effects of physical load on the referees’ reasonings, the correctness of reasonings significantly decreased in study 1 from initial to medium physical load. Results are in line with previous research indicating that specific processes could be negatively affected by physical load (e.g. information processing efficacy; Schmidt et al.,
2019; Tomporowski,
2003). In contrast, reasoning correctness increased particularly under medium physical load in study 2. Although not conclusive, a possible explanation here, too, is that because participants had a better endurance capacity in study 2, they might have been more effective in the specific processing under medium to maximal physical load conditions (Helsen et al.,
2019; Morillo et al.,
2017). However, the results of our studies concerning the reasonings should be treated with caution due to the small number of participants; they may only indicate initial tendencies that have to be investigated further.
Moreover, correctness was higher for referees’ decisions than for their reasonings. On the one hand, participating referees were accustomed to making decisions and reasonings on video sequences shown from the television camera perspective through the official video-rule test of the DHB. On the other hand, potentially relevant information may not be visible from that perspective, but only from an on-field position. The camera perspective could be a limiting factor especially for making correct reasonings, since relevant detailed information (e.g. body parts) may not have been visible, which would have reduced the representativeness of the video test (Kittel et al., 2021). Future studies might therefore consider using video sequences recorded from an on-field perspective (even though referees were tested at rest, e.g. see Spitz, Put, Wagemans, Williams, & Helsen,
2018).
Furthermore, the cornerstone model from Mascarenhas et al. (
2005b) includes other factors, such as psychological components, that are relevant for a referee’s performance. Thus, the cornerstone model can serve as a basis for examining the relationships between the constituent performance characteristics. In this context, the presented YYTR has the potential to integrate further loading factors, next to physical load, such as psychological factors (e.g. see Gillué, Laloux, Alvarez, & Feliu,
2018; Hill, Matthews, & Senior,
2016; Poolton et al.,
2011), to more closely approximate real match demands and to enhance external validity, which is in line with the NDM framework. Moreover, given that RDM correctness in our studies corresponds with that in previous studies conducting video analyses (e.g. Mallo et al.,
2012) or controlled laboratory studies (e.g. Samuel et al.,
2019), the YYTR with the applied video test seems to be a valid tool to measure RDM under physical load and is thus a further step towards controlled analyses of RDM following naturalistic criteria. In addition, the video test used in the YYTR might, at least currently, also be a useful tool to investigate expertise differences in handball referees’ decision-making and might prove beneficial for training RDM under more natural conditions (e.g. see Kittel et al.,
2021; Mascarenhas et al.,
2009).
However, our research is not without limitations. Discussions with the involved referees and officials from the referee board of the DHB led to the presumption that referees would make decisions and reasonings in real matches in a more differentiated manner than illustrated in the video test, i.e. the video test would need a better distinction between a referee’s decision and the two reasonings. Furthermore, as the YYT(R) is an exhaustion test, referees rapidly became physically exhausted (see heart rate in Fig. 4a). However, in real matches, handball referees officiate at a mean heart rate of about 80% (range 72–87%; da Silva et al.,
2010; Pabst et al.,
2012). Thus, the YYTR needs adaptations to generate more trials in the heart rate range of real handball matches. Both a better representation of the differentiation between referees’ decisions and reasonings and the adaptation of the YYTR protocol may help further improve the external validity and the representativeness of the YYTR.
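How well a test protocol covers the match-typical heart rate range can be checked with a simple proportion. The sketch below assumes that each trial's heart rate is expressed as a percentage of the referee's individual maximum and uses the 72–87% range cited above as default bounds; the function name and the example values are made up for illustration and are not data from the studies.

```python
def share_in_match_range(hr_percent, low=72.0, high=87.0):
    """Return the fraction of trials whose heart rate lies within [low, high].

    hr_percent: per-trial heart rates, assumed to be in % of individual maximum.
    """
    if not hr_percent:
        raise ValueError("at least one trial is required")
    in_range = [low <= hr <= high for hr in hr_percent]
    return sum(in_range) / len(in_range)


# Made-up heart rates for one referee across eight trials of an exhaustion test.
example_trials = [65, 74, 81, 86, 90, 94, 97, 99]
print(share_in_match_range(example_trials))  # 0.375
```

In this invented example only three of eight trials fall within the match-typical range, which is the kind of imbalance a protocol adaptation would aim to reduce.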
Concerning the interpretation of our results, first, even though we used different video sequences in study 2, we presumably simplified the video test procedure by splitting the decision and reasoning matrix (Fig. 2). As we tested a large number of the same referees in both studies, the simplification of the decision and reasoning matrix might have helped the referees to make more correct decisions right from the beginning. Second, even though referees were familiar with decision-making on video sequences from the television camera perspective through the official video rule test of the DHB, referees were unaccustomed to the experimental setting (YYTR). The improved RDM in study 2 compared to study 1 might therefore partly reflect a familiarisation effect. To let referees become more accustomed to the YYTR, subsequent studies might integrate practice trials at the beginning of the YYTR, e.g. by letting referees conduct video test trials at rest and in additional walking phases. Third, study 2 took place at the half-season training courses and thus 6 months after study 1, which took place at the season preparation courses. Hence, before study 1, referees had a long recovery phase in which they presumably did not officiate and also tried to recover physically. As a result, referees at the half-season training courses (i.e. study 2) might have benefitted from the deliberate officiating during the season compared to study 1.
Furthermore, by individually assigning the video test trials to blocks of physical load levels (i.e. initial, medium, maximal), we intended to analyse participants at a comparable physical load level and took their individual endurance capacity into account. However, by doing so we might have generated a potential artefact, as different trials (i.e. videos of different fouls, types of fouls, punishments and potentially different item difficulty) were considered in the respective blocks. It is, therefore, necessary not only to treat the present results with caution, but also to consider the above potential limitations and artefacts in future research. This highlights the importance of replication studies: neither the increase in correct decisions from initial to medium physical load nor the decrease in correct decisions under maximal physical load found in study 1 was confirmed in study 2 (in which, however, different videos were used), so these effects should not be overinterpreted.
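The potential artefact of individual block assignment can be made concrete with a toy example. The sketch below assumes, purely for illustration, that every referee sees the videos in the same fixed order and that each referee's completed trials are split into three near-equal consecutive blocks; the actual assignment rule is not specified in this section, and the function names are hypothetical.

```python
BLOCKS = ("initial", "medium", "maximal")


def assign_blocks(n_trials: int) -> list:
    """Label a referee's trials by splitting them into three consecutive parts.

    Assumed rule for illustration only: equal thirds of the completed trials.
    """
    return [BLOCKS[(3 * i) // n_trials] for i in range(n_trials)]


def videos_in_block(n_trials: int, block: str) -> set:
    """Indices of the videos (fixed presentation order assumed) in one block."""
    return {i for i, label in enumerate(assign_blocks(n_trials)) if label == block}


# A referee who completed 6 trials versus one who completed 9 trials.
print(videos_in_block(6, "medium"))  # {2, 3}
print(videos_in_block(9, "medium"))  # {3, 4, 5}
```

Under these assumptions, the "medium" block of the two referees shares only one video, so a difference between load levels could partly reflect differences in item difficulty rather than physical load.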