Discussion
We stress that the intention of our study was not to analyze the quality of the applied scoring protocols for their reliability to predict sex. Rather, the intention was to analyze how well observers could repeat observations of the protocols on different modalities. This is important because virtual osteological collections are becoming more numerous alongside their analogous counterparts [17, 21–24].
Our first research question concerned the interchangeability of dry skulls and CT images of the same bones, i.e. the type of error associated with the scoring of the same cranial traits on the two modalities. Results suggested that the two modalities were, for the majority of traits, interchangeable, although with some exceptions.
The highest agreement was for the glabella and the supraorbital margin, the poorest for the ramus flexure trait. Relating to the second research question, an interesting result was the high consistency in the scorings between the two virtual modalities (CT and 3D surface scans), especially when compared with the consistency between scorings on the virtual and the dry bone modalities. One possible explanation is the lack of tactile sensation on both virtual modalities, as opposed to the dry bone modality. Comparing a tactile and a non-tactile modality could thus yield more divergent outcomes than comparing two non-tactile modalities with each other.
We carried out the analysis of the second research question as a pilot study on a subsample of 50 specimens, compared with the 223 specimens used for the first research question. Hence, a more extensive analysis focusing on the comparison of virtual modalities with each other is desirable. While the comparison of the two virtual modalities resulted in low agreement for the ramus flexure trait, the agreement for the other traits was acceptable, especially for the nuchal crest, the supraorbital margin, the glabella, the mental tubercles and the zygomatic extension. Comparing the dry bone and the surface scan modalities with each other, we obtained agreements below the acceptable threshold, except for the nuchal crest, the supraorbital margin, the glabella and the zygomatic extension. Overall, we found superior trait consistency and availability for the glabella and supraorbital margin, an intermediate performance for the other traits (mastoid process, mental eminence, mental protuberance, mental tubercles, nuchal crest and zygomatic extension) and a relatively inferior performance of the ramus flexure.
As we did not intend to analyze the traits for their sex prediction quality, but rather how similarly or differently traits are perceived in visual-tactile versus visual-only environments, it is interesting to discuss possible reasons why some traits resulted in higher intermodality agreement than others. Before discussing this issue, however, the intra- and interobserver agreements reported in earlier publications on the sex estimation protocols are worth noting, as they may indicate why traits are or are not consistent between modalities. Walker's interobserver analysis of the five traits (mastoid process, mental eminence, nuchal crest, glabella and supraorbital margin) yielded an overall agreement of 96%, with significant differences in the scoring process for the mastoid process [4]. For intraobserver agreement, Walker reported 99.5% [4]. Other studies found the highest intraobserver agreement, 78%, for the glabella [49] and κ values below 0.6 for the mental eminence [57]. When Langley et al. added the zygomatic extension to the above-mentioned five traits, its interobserver agreement was second only to that of the glabella [43].
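The figures above mix two kinds of statistic: raw percentage agreement and κ values. The two can differ considerably for the same set of scores, because κ corrects for the agreement expected by chance. The following minimal sketch illustrates this with unweighted Cohen's κ. It is an illustration only: the choice of the unweighted variant is an assumption (the cited studies may have used a weighted κ for ordinal scores), the function names are hypothetical, and the scores are invented and not taken from any study.

```python
from collections import Counter


def percent_agreement(a, b):
    """Share of specimens given the same score in both series."""
    if len(a) != len(b) or not a:
        raise ValueError("score lists must be non-empty and of equal length")
    return sum(x == y for x, y in zip(a, b)) / len(a)


def cohen_kappa(a, b):
    """Unweighted Cohen's kappa for two series of categorical scores."""
    n = len(a)
    p_o = percent_agreement(a, b)  # observed agreement
    freq_a, freq_b = Counter(a), Counter(b)
    # Chance agreement: product of the marginal proportions, summed over categories.
    categories = freq_a.keys() | freq_b.keys()
    p_e = sum(freq_a[c] * freq_b[c] for c in categories) / (n * n)
    if p_e == 1.0:
        return 1.0  # both series consist of one identical category
    return (p_o - p_e) / (1 - p_e)


# Hypothetical scores (scale 1-5) for ten specimens; NOT data from the study.
dry_bone = [3, 3, 3, 3, 3, 3, 3, 3, 2, 4]
ct_image = [3, 3, 3, 3, 3, 3, 3, 3, 4, 2]

print(f"{percent_agreement(dry_bone, ct_image):.2f}")  # prints 0.80
print(f"{cohen_kappa(dry_bone, ct_image):.2f}")        # prints 0.41
```

In this invented example, 80% of the scores coincide, yet κ is only about 0.41, because most specimens fall into a single category and a large share of the agreement would be expected by chance. A high percentage agreement for a trait is therefore not directly comparable with a κ threshold such as 0.6.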
The superiority of the glabella could be owing to its nature as a discernible contour viewed from a lateral perspective. The good results for the supraorbital margin might be due to the lighting and shadows on the virtual modalities, partially compensating for the absence of tactile sensation. The mastoid process performed with a score of 3 in all three comparisons. While this trait was readily available, its κ-values were below 0.6 in all tests. Petaros et al. (2015) reported a similarly unsuccessful analysis of the mastoid process [58], while other studies agreed on its superior performance as a sex indicator [57, 59]. Amending the mastoid process protocol with (geo)metric measurements [58, 60] could possibly improve the repeatability, reproducibility and modality consistency of this trait.
The relatively poor performance of the ramus flexure trait might have originated from a general difficulty in discerning the feature. In fact, the trait has raised controversy in the literature; while the authors of the original publication insisted on the repeatability of the ramus flexure trait [42, 61], they did not test its reproducibility. Other groups attempting to reproduce the observations did not succeed [62–70]. Our results could indicate a similar difficulty with the trait per se and subsequently with its consistency between the modalities. Hence, we can assume that a sex estimation trait with a precise description tested for intra- and interobserver agreement has a chance of being consistent across modalities. If agreements are not tested and other groups are not able to repeat observations, the consistency of the trait on different modalities is questionable. However, we included the ramus flexure protocol on purpose to investigate the performance of a trait that had not been tested for reproducibility. Overall, our findings indicate that the modality is not as influential on the outcome as the description of the trait [30, 52, 71]. Thus, the question may be directed at finding suitable traits to score [72, 73] that are both accurate in predicting sex and applicable to the analogous and the virtual environment. Our study supplies information on the latter question. Further research on the former question could now follow.
The skull is a rather robust skeletal structure, contrasting with ribs, which fracture rather easily. Hence, cranial features were generally observable in 80% to 100% of our specimens. In contrast, the often-fragmented mandibular ramus allowed observations of the ramus flexure trait in approximately a quarter of specimens only. Combined with the poor consistency of this trait between the modalities, the ramus flexure trait might not be worth investigating further.
The intermediate results for the third research question involving the mental eminence corroborated the finding of a previous study, which investigated the consistency of this trait on dry bone and micro-XCT reconstructions of 105 South African individuals from the Pretoria Bone Collection with four observers [27]. Results suggested that the mental eminence was not scored consistently on the analogous (dry bone) and the virtual (micro-XCT) modalities [27]. While a strong expression of the mental tubercles is closely linked to a square, male chin, less pronounced tubercles hint at a more rounded and female chin [44, 74–76]. Hence, since it is generally acknowledged that the menton exhibits quantifiable sexual dimorphism [76–78], this relative inconsistency between the modalities may be caused by an imprecise trait description of the mental eminence. Earlier descriptions of the mental eminence were unclear as to the exact location [3, 4], and later it was stated that "the mental eminence is also known as the mental protuberance" [5]. The different features constituting the menton shape, e.g. protuberance, tubercles, fossa mentalis and incurvatio mandibularis [79], may be expressed to different degrees, independently of each other. Given this intricate anatomy of the menton [49], a precise description of the trait is indispensable in order to promote its consistent scoring across modalities. At the same time, an imprecise trait definition may also lead to an unreliable sex estimation accuracy [49, 57]. This consideration, in conjunction with earlier results [27], suggested the separation of the mental eminence into two components. This led to a higher agreement for the mental tubercles as compared to the mental eminence and the mental protuberance, encouraging an investigation of that trait concerning the accuracy in predicting sex.
The recent paper investigating the modality interchangeability of sex estimation traits on the human pelvis [30] found the greatest consistency in one nonmetric and six metric traits. The iliac tuberosity [80], together with the greater sciatic notch height (adapted definition), the ischium post-acetabular length, the spino-sciatic length, the spino-auricular length, the cotylo-sciatic breadth and the vertical acetabular diameter [81], had resulted in superior consistency and availability [30]. These traits, combined with the glabella and the supraorbital margin, could be merged into a new set of sex estimation traits to be tested for its sex prediction accuracy. If the traits yield satisfactory accuracies, this would be a set for which the modality interchangeability has already been tested, and it could then be confidently used on both the dry bone and the CT modality. Likewise, the group of pelvic (postauricular surface, postauricular space, sciatic notch, composite arch, ischio-pubic proportion, subpubic concavity, acetabulo-symphyseal pubic length, cotylo-pubic width, innominate length and iliac breadth) and cranial (nuchal crest, mastoid process, mental eminence, mental tubercles, mental protuberance and zygomatic extension) traits resulting in intermediate performance could be combined and tested for accuracies in a future study. Moreover, the consistency of age-at-death estimation traits could be another field for investigation in a future study.
Limiting factors of our study were the number of virtual modalities included and the state of bone preservation. Both observers had previous experience with the virtual modalities. Comparisons with a study including an observer without any prior experience with virtual images of bones would be interesting as levels of confidence might vary [82]. Moreover, we used different sample sizes for the analysis of our research questions. While our main focus lay on the dry bone-CT comparison, we consider the addition of 3D surface scans a pilot study, encouraging a more extensive analysis with a larger sample of 3D surface scans.
To the best of our knowledge, no research has been published so far comparing the performance of cranial traits on the analogous and the virtual modalities, encompassing a large sample.