Introduction
Traditionally, surgical skills have been taught through apprenticeship in the operating room, that is, through observing an experienced doctor performing a procedure and then performing the procedure on patients under supervision. However, during the last two decades, simulation training has gained ground [
1]. Simulation consists of teaching and training in a structured setting that reproduces features of the clinical setting [
1]. It allows the learner to repeat and practice specific tasks and makes it possible to use an objective tool for assessment of skills and structured constructive feedback to the learners [
2]. Hence, errors can be identified, analyzed, and corrected in order to improve surgical efficacy and quality, and to improve the ethics of surgical training, the standard of care, and patient safety [
1,
3]. Indeed, in recent years, greater emphasis has been placed on medical students demonstrating pre-practice/pre-registration core procedural skills to ensure patient safety [
3]. Nonetheless, the formal teaching and training of basic suturing skills to medical students have received relatively little attention, and there is no standard for what should be tested and how [
3].
Assessment tools for procedural skills have to be valid and reliable [
2]. Examples of existing validated assessment tools for suturing skills are the University of Western Ontario Microsurgery Skills Acquisition/Assessment instrument (UWOMSA) [
4‐
6] and the Objective Structured Assessment of Technical Skill global rating scale OSATS [
2,
7‐
9]. Both instruments are developed for surgeons in training. UWOMSA specifically evaluates microsurgical suturing skills and comprises three categories: quality of knot, efficiency, and handling. The learner is scored in each category on a 5-point Likert scale and a global score is calculated, with a maximum of 15 [
4‐
6]. OSATS evaluates macroscopic suturing skills in seven domains: respect for tissue, time and motion, instrument handling, knowledge of instruments, flow of operation, use of assistants, and knowledge of the specific procedure. The learner is scored in each domain on a 5-point Likert scale and a global score is calculated, with a maximum of 35 [
2,
7‐
9]. There are few assessment tools for suturing skills validated for medical students [
2,
3]. Moreover, there are no instruments evaluating both micro- and macrosurgical qualities.
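The global-score construction shared by these instruments is a simple sum of per-category Likert ratings, which can be sketched as follows (the example ratings are illustrative, not real assessments):

```python
# UWOMSA-style global score: each of the three categories is rated
# on a 5-point Likert scale and the ratings are summed (max 3 * 5 = 15).
# The ratings below are illustrative, not real assessments.
ratings = {"quality of knot": 4, "efficiency": 3, "handling": 5}
global_score = sum(ratings.values())

# OSATS is constructed analogously over seven domains (max 7 * 5 = 35).
print(global_score)  # 12
```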
The aim of this study was to develop and validate, using scientific methods, a tool for assessment of medical students’ suturing skills, measuring both micro- and macrosurgical qualities.
Discussion
There are few studies on assessment tools for medical students’ suturing skills [
2,
3]. The present study develops and validates an assessment tool for medical students’ suturing skills, measuring both micro- and macrosurgical quality indicators.
There is no consensus on how reliability and validity should be measured [
14], and the statistical methods chosen can affect the result. For instance, correlation coefficients are typically used to describe reliability, but their weakness is that they do not quantify agreement and are insensitive to systematic measurement errors [
15,
16]. Because correlation only measures how closely a set of paired observations follows a straight line, and not agreement, a correlation analysis could show a near-perfect correlation while the measurements still diverge from the true values, and thus be misleading. On the other hand, intraclass correlation (ICC), which compares more than two sets of measurements, has the strength of accounting for within-subject variability and average variability, but is highly influenced by the homogeneity of the data [
15]. This could explain why the lowest ICC scores for inter-rater reliability are found in the expert control group (Table
3), even though this group has the closest range of scores of all three study groups (Table
2 and Fig.
2). Methodological researchers have advocated the use of repeated measurements to compare agreement between methods and the agreement of a method with itself, thereby quantifying disagreement [
16]. This way, agreement is not only present or absent, but quantified. In the present study, there was no previous gold standard tool with an identical scale of measurement against which the new tool could be validated. Accuracy measurements by repeatability were therefore only conducted by comparing the assessors’ in-house scores of the same subjects at two different time points (inter-test reliability), as this was the only available option for repeated measures (Table
3). However, the assessors were blinded to the fact that they were scoring the exact same tests twice. This was possible because the tests were assessed as films, as opposed to live assessment. Thus, the calculated CR might represent the most appropriate measurement of the accuracy of the test, from which it can be extrapolated whether the test has sufficient accuracy for future purposes.
The results might be affected by the composition of the sample. In this study, only students with no previous suturing experience, either in vivo or in vitro, were included; hence, they were all truly novices. Therefore, differences in previous experience cannot be considered a factor. It can be questioned whether our sample is representative of medical students, or whether it comprises students who are especially interested in acquiring suturing skills or especially apt for surgery. However, when construct validity was tested, the students served as their own controls or were tested against experienced plastic surgeons; hence, the subjects’ aptitude for, or interest in, surgery should not have affected the results.
Evaluation of films of the task performed by the subjects, using checklists, has been done in earlier studies similar to this one [
4,
17]. Several benefits have been found with evaluations of films rather than a live performance [
4]. For instance, it is possible to blind the assessor to the identity of the subject and, in this case, to whether the performance is pre- or post-course. It also makes it possible for several assessors to evaluate the performance simultaneously [
4], and for the assessors to rewind or fast forward as they need [
11]. A possible confounder is that the evaluations could be affected by assessor fatigue when a large number of films have to be watched and analyzed. To minimize this risk, the sample size was kept small and the assessors were not constrained by a deadline. The small variations seen between assessors are inherent, as there is always a degree of subjectivity in any evaluation of a performance. Even though a checklist is used, assessors might find some quality indicators more or less important than others, and therefore be more or less harsh in their evaluation. For example, this might explain why one of the assessors seemed to be more accepting than the other two when making the overall evaluation of whether the subject could suture or not.
Skill proficiency is difficult to define for suturing, especially at the undergraduate level, and an assessment tool needs to capture different aspects as well as give an overall evaluation. Time alone is a poor measure, as it does not take quality into consideration [
18], and as novices might not be aware of all the steps of a procedure, they might take shortcuts, leading to fast procedure times but poor results [
17]. Furthermore, the instrument was able to detect a difference between pre- and post-course performance (
p = 0.03) (Fig.
4), whereas there was no detectable difference in time used (
p = 0.55), indicating that time alone is not sensitive enough. On the other hand, time has to be part of the assessment, as proficiency is characterized not only by a good end result but also by efficiency. In previous studies, time has been incorporated into the overall assessment in different ways, either as a measurement of the time taken to complete a task [
12] or as number of tasks performed during a certain time [
19]. We calculated the cutoff times for the tasks so that 67% of the post-course subjects fell within them (Fig.
2). In a previous study, the authors did not state how the cutoff time was assigned [
12]. As most of the subjects in the remaining 33% have to be considered outliers (Fig.
2), we consider this an adequate cutoff time. The number of errors was weighted by a factor of 10, as previously described [
13], to emphasize the importance of a correct suture technique and good quality knot, in relation to time used. Nonetheless, how time and errors are weighted is arbitrary, but the model used in this study has been successfully utilized in other studies [
12,
13]. The strength of this new instrument is that it evaluates quality indicators important to both micro- and macrosurgery, and that, in addition to the total score, individual qualities can be analyzed specifically.
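As a rough illustration of how such a time-plus-weighted-errors model yields a single penalty score, consider the following sketch. The formula and the times are hypothetical constructions for this example only; the exact model follows refs [12, 13] and is not reproduced here.

```python
# Illustrative post-course completion times in seconds (not the
# study's data), used to derive a cutoff that ~67% fall within.
post_course_times = [150, 160, 165, 170, 175, 180, 190, 240, 300]
times_sorted = sorted(post_course_times)
cutoff = times_sorted[round(0.67 * len(times_sorted)) - 1]  # 180 s here

def penalty_score(time_seconds, errors, cutoff_seconds, error_weight=10):
    """Hypothetical composite penalty: time over the cutoff counts
    second for second, while each error counts error_weight times as
    much, so correct technique dominates raw speed. Lower is better."""
    overtime = max(0, time_seconds - cutoff_seconds)
    return overtime + error_weight * errors

# A fast performance with two errors scores worse than a slower,
# error-free one:
fast_sloppy = penalty_score(170, 2, cutoff)  # 0 + 20 = 20
slow_clean = penalty_score(195, 0, cutoff)   # 15 + 0 = 15
```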
An assessment tool for suturing skills in medical students can be useful both to give formative feedback to the students [
20,
21] as well as to evaluate if the students meet the required standard [
11], and for curriculum development [
19,
22‐
25]. Further studies are needed to evaluate how our instrument can be used for these purposes. Moreover, studies are needed on the implementation of the instrument, that is, on its feasibility, acceptability, educational impact, and effectiveness [
2] and on the transferability to the clinical environment (face validity) [
1,
2].
In conclusion, our findings suggest that the developed in-house assessment tool shows promising reliability and validity when assessing novice medical students’ macroscopic suturing skills. Further validation is needed for microsurgical suturing skills.
Acknowledgements
We would like to thank the volunteering medical students at the University of Bergen and our colleagues at the Department of Plastic and Reconstructive Surgery for their participation as subjects and controls, and the plastic surgeons who were assessors: Dr. Carolin Freccero, MD, PhD, Dr. Eivind Strandenes, MD, and Professor Louis de Weerd, MD, PhD. We are also indebted to Dr. Karl Ove Hufthammer, biostatistician, MSc, PhD, for valuable guidance through the statistical hurdles.