Rule-based approaches
Studies that use rules range from simple rules that match section headings against target text to sophisticated hierarchical rules. In order to characterize these studies, we use the following dimensions (see Table 2):
Table 2
Rule-based studies (types of rules: E = exact matching, R = regular expressions, P = probabilistic; features: F = formatting, C = concepts/terms, H = heading probabilities; NS = not specified; approach: R = rule-based, H = hybrid)

| Study | Type of rules | Required previous information | Features from sample texts | Approach |
| | E | Flat | NS | R |
| | E | Flat | NS | H |
| | E | Flat | NS | R |
| | E | Flat | NS | R |
| | E | Flat | NS | R |
| | R | Flat | F | H |
| Chen et al. [5], Dai et al. [7] | R | Flat | F | H |
| | R | Flat | F | R |
| | R | Flat | NS | R |
| | R | Flat | NS | R |
| | R | Flat | F | R |
| | R | Flat | F | R |
| | R | Flat | NS | R |
| | R | Flat | F | H |
| Schadow and McDonald [49] | R | Flat | NS | R |
| | R | Do not require | F | R |
| | R | Flat | F | R |
| | E,R | Flat | F | H |
| | E,R | Dictionary | NS | H |
| | P | Hierarchy | NS | R |
| | P | Hierarchy | F,C | R |
| | R,P | Flat | F,C | R |
| | E,R,P | Flat | F,H | H |
| | E,R,P | Hierarchy | F,H | R |
| | E,R,P | Hierarchy | F,H | R |
| | E,R,P | Hierarchy | F,H | R |
| | E,R,P | Hierarchy | F,H | R |
| | E,R,P | Flat | F,C | R |
| | E,R,P | Hierarchy | F,H | R |
1. Type of Rules: A rule is an expression that describes the way things happen in a particular situation. In this sense, rules express general or specific characteristics of these situations. Section identification rules describe how sections of a certain type are usually written.
(a) Exact matching: This kind of rule identifies the beginning of a section by the appearance of a heading available in the terminology, and its ending before the next heading starts.
(b) Regular expressions: These describe sets of rules using search patterns, which are then matched against the target text. A search pattern specifies the type of text to find in the target text using constants and operators. A regular expression may use terminologies as input; for instance, it may express that a heading text (from an existing terminology) followed by a colon marks the beginning of a section.
(c) Probabilistic rules: Rules that take into account probabilistic data obtained from sample data sets, such as Bayesian probabilities of words occurring in each section of a sample text.
Different types of rules may often be used in the same study. For instance, a method may first find standard headings using an exact matching rule and then identify variations using regular expressions.
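The interplay of exact matching and regular-expression rules can be sketched as follows. This is a minimal illustration; the heading dictionary and the heading-followed-by-colon pattern are invented, not taken from any cited study.

```python
import re

# Known headings would normally come from a controlled terminology;
# this small set is purely illustrative.
KNOWN_HEADINGS = {"CHIEF COMPLAINT", "MEDICATIONS", "IMPRESSION"}

# Fallback pattern: a run of uppercase words ending with a colon.
HEADING_RE = re.compile(r"^([A-Z][A-Z ]{2,40}):\s*$")

def find_sections(lines):
    """Return (heading, line index) pairs for detected section beginnings."""
    sections = []
    for i, line in enumerate(lines):
        text = line.strip().rstrip(":").upper()
        if text in KNOWN_HEADINGS:          # exact-matching rule
            sections.append((text, i))
            continue
        m = HEADING_RE.match(line.strip())  # regular-expression rule
        if m:
            sections.append((m.group(1).strip(), i))
    return sections

note = [
    "CHIEF COMPLAINT:",
    "Chest pain for two days.",
    "CURRENT MEDS:",  # not in the dictionary, caught by the regex
    "Aspirin 81 mg daily.",
]
print(find_sections(note))  # [('CHIEF COMPLAINT', 0), ('CURRENT MEDS', 2)]
```

The exact-matching rule maximizes precision on standard headings, while the regex fallback recovers non-standard variants, mirroring the combination described above.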
2. Required previous information: This dimension explores the type of external information required as input by the method. Studies included in this review use:
(a) Flat dictionaries of headings: These are plain lists that contain the possible terms used as headings for each section.
(b) Hierarchical heading dictionaries: These include a hierarchy of headings, which implies a parent-child relationship that can be exploited to infer knowledge when a heading coincides with a portion of the target text.
(c) Dictionaries of synonyms and word variants: These provide modifications or abbreviations of standard headings. Some of the studies include a dictionary with the most common variations or synonyms for a heading or the words contained therein.
3. Features Generated from Available Sample Texts: Some studies rely on information extracted from an available sample of texts that can give hints about the existing patterns for different sections. This dimension describes what kinds of features are extracted from these texts to be used later as input for the rules. Common features can be classified into the following groups:
(a) Formatting features: Represent the shape of a section, including the characteristics of its headings (e.g. word count, capitalization), explicit line-ending characters (e.g. question marks), line beginnings matching an enumeration pattern, or the typical size of the section.
(b) Concepts/terms within each section: Frequent concepts or terms occurring in each section are identified.
(c) Heading probabilities: Probabilities are assigned to common and uncommon section headings in a corpus; e.g., the probability of each possible section heading occurring in a document within the corpus.
These features can only be obtained automatically if the sample data set was already annotated with section begin and end tags. Note that feature values may vary with the corpus, which means that rules generated from them should be dynamically adapted to new feature values before being used in different contexts (e.g. the typical size of a section may differ between corpora).
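For instance, the heading-probability feature described in (c) can be derived from an annotated sample roughly as follows; the corpus content is invented for illustration.

```python
from collections import Counter

# An annotated sample: each document is represented simply by the list
# of section headings it contains (invented data for illustration).
corpus = [
    ["HISTORY", "MEDICATIONS", "PLAN"],
    ["HISTORY", "PLAN"],
    ["HISTORY", "ALLERGIES", "PLAN"],
]

def heading_probabilities(annotated_docs):
    """P(heading appears in a document), for each heading in the sample."""
    counts = Counter(h for doc in annotated_docs for h in set(doc))
    n = len(annotated_docs)
    return {h: c / n for h, c in counts.items()}

probs = heading_probabilities(corpus)
print(probs["HISTORY"])                # 1.0 (appears in every document)
print(round(probs["MEDICATIONS"], 2))  # 0.33
```

Because these probabilities are corpus-specific, a rule relying on them would need to be re-derived for a new corpus, as the paragraph above notes.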
The following paragraphs detail the characteristics of section identification studies that use a rule-based approach. Basic approaches rely on the exact matching of well-known headings (and some heading variations) within clinical narratives; different studies follow this approach. Wang et al. [60] detect sections in discharge summaries by finding a set of predetermined section headings and their target forms. Their work, based on MedLEE's section identification method [15], demonstrates that selecting information according to the section in which it is contained may improve the detection of drug disease-manifestation related symptoms and drug-drug event relations. Similarly, the work proposed by Edinger et al. [14] improves cohort selection by looking for the name of the sections where a specific patient characteristic should be found. Likewise, the study by Singh et al. [52] classifies radiology reports into high- and low-priority ones by analyzing the Impression section, which was detected using hand-crafted rules. Finally, the study of Phuong and Chau [42] recognizes the beginning of a section by looking for well-known headings or words, in order to apply de-identification processes that take into account the section where the possible identifiers are located.
The studies above demonstrate that exact matching of heading texts improves the precision of section identification. However, if the level of standardization of the headings is low, recall drops because some sections cannot be detected by this method.
As an evolution of rules that find the exact words of headings in the text, regular expressions, and pattern matching in general, are the most common methods used for section identification. Radiology reports were analyzed using regular expressions in different studies. Taira et al. [55] and Rubin and Desser [45] identified sections in this type of document using classifiers that include a dictionary of common section labels (e.g. Clinical History, Findings) and common formatting layouts (e.g. section order, section length, use of colons). Hsu et al. [18] proposed a quality assurance process for evaluating the diagnostic accuracy of radiology reports, which included section analysis. In a first phase, regular expressions identify relevant sections based on headings. In addition, they exploit formatting characteristics such as capitalization, colon use, and paragraph breaks to identify subsections.
Pathology report sections have also been identified using regular expressions. Schadow and McDonald [49] created regular expressions to find terminology for specimen headings. Kropf et al. [25] improved information retrieval of pathology reports by means of section identification: the results of queries for certain tumors in pathology reports are contextualized using section-sensitive queries. Other works on pathology reports use systems previously created with this functionality; for example, Lin et al. [30] used the cTAKES sectionizer, which is based on regular expressions.
In order to analyze the use of standard terminology in the sections of operative notes, Melton et al. [35] extracted headings using rules for capitalization, semi-colons, hyphens, and line breaks. Potential section headings are then mapped to a controlled terminology. This study found that about 20% of the sections were not covered by these controlled terminologies. Several non-matching sections were, however, important elements of operative notes in certain sub-specialties, such as cardiac and transplant surgeries. This demonstrates that regular expressions based on formatting characteristics complement exact matching rules in the detection of headings that do not follow standard terminologies. Besides, the output of this kind of method can be used to enrich standard terminologies with interface terms (i.e. terms that do not necessarily correspond to standard terminology labels, but which are characteristic of the clinical jargon), which can improve the recall of section identification in unseen texts.
Sections of clinic visit notes have also been identified using regular expressions. Schuemie et al. [50] proposed several section detection heuristics: e.g., lines consisting of a sequence of uppercase letters, optionally followed by a colon, mark the beginning of a section, whereas too many words in the heading, explicit line endings such as question marks and periods, or enumeration-like characters at the beginning of a line prevent flagging these lines as section beginnings.
Finally, Meystre and Haug [36] used regular expressions to extract possible section headings and then, considering the identified sections, extracted clinical problem mentions to populate a problem list.
In summary, studies have used regular expressions to find headings that do not match the standard or expected text but some variation of it. Besides, this type of rule is often used to recognize whether a text corresponds to a heading by analyzing its formatting characteristics (e.g. capitalization, spaces, length).
Complementing exact matching and regular expressions for section identification, a significant subset of studies include more sophisticated rules intended to detect not only labeled but also unlabeled sections. These studies typically use probabilistic information obtained from a corpus of clinical narratives.
Denny et al. [9] created a section identification system called SecTag that relies on a hierarchical heading terminology. The approach first identifies sentence boundaries and list elements. Within each sentence it detects all candidate section headings by using rules, spelling correction, and recognition of word variants and synonyms that map to known terms; finally, the method detects section boundaries. When it is uncertain whether a sentence belongs to a section, it uses a statistical model that contains the prior probability of each possible section heading occurring in a document, the probability of each section heading occurring in a specific position, and the probabilistic sequential order of headings appearing in documents, obtained from a sample data set. Using this model, the system applies different rules to keep or discard the previously detected sections. For instance, it uses the distance of the candidate section heading to other nearby section headings to decide whether the section should be labeled. It includes a set of rules that take into account the Bayesian scores of each detected section, as well as other kinds of input such as the typical section length, according to the corpus. This proposal demonstrated very good results for section identification in history and physical examination texts and has been re-used in other studies [11,12,34].
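The statistical check applied to ambiguous candidate headings can be illustrated schematically. The probabilities and the threshold below are invented, and the real system combines more kinds of evidence (positional order, section length, nearby headings) than this sketch shows.

```python
# Invented corpus statistics: prior probability of a section occurring in a
# document, and the probability of the heading occurring at a given position.
PRIOR = {"ALLERGIES": 0.8, "COURSE": 0.3}
POSITION = {
    ("ALLERGIES", "top"): 0.7, ("ALLERGIES", "bottom"): 0.1,
    ("COURSE", "top"): 0.2, ("COURSE", "bottom"): 0.6,
}

def keep_candidate(heading, position, threshold=0.1):
    """Score = prior * positional probability; discard weak candidates."""
    score = PRIOR.get(heading, 0.0) * POSITION.get((heading, position), 0.0)
    return score >= threshold

print(keep_candidate("ALLERGIES", "top"))     # True  (0.8 * 0.7 = 0.56)
print(keep_candidate("ALLERGIES", "bottom"))  # False (0.8 * 0.1 = 0.08)
```

The point is that corpus-derived probabilities let the system keep a candidate in a plausible position and discard the same candidate in an implausible one.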
The study of Shivade et al. [51] includes two strategies. The first is a rule-based strategy that relies on information such as the length of a string when it is a section heading, the use of camel case, and the set of commonly used words that constitute a section heading. The second one focuses on implicit sections: here, the authors identified sections by the relative frequency of the concepts occurring therein. For instance, a medication section that cannot be explicitly identified using headings can be detected through the identification of concepts that correspond to medication names. Similarly, in Suominen et al. [54], a semiautomatic heading identification was applied using, first, regular expressions, and then a content analysis method that allowed for monitoring shifts in content and style.
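Detecting an implicit section from the relative frequency of its typical concepts might look roughly like this; the concept lists are illustrative and not taken from the cited studies.

```python
# Illustrative concept vocabularies per section type. In a real system these
# would come from a terminology or from frequency counts over a sample.
SECTION_CONCEPTS = {
    "MEDICATIONS": {"aspirin", "metformin", "lisinopril", "mg"},
    "FINDINGS": {"effusion", "opacity", "fracture"},
}

def guess_section(text):
    """Assign the section whose concept vocabulary overlaps the text most."""
    tokens = set(text.lower().replace(".", "").replace(",", "").split())
    scores = {s: len(tokens & c) for s, c in SECTION_CONCEPTS.items()}
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else None

print(guess_section("Aspirin 81 mg daily, metformin 500 mg twice a day."))
# MEDICATIONS
```

An unlabeled block of medication names would thus be recognized as a medication section even though no heading introduces it.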
The study proposed by Tran et al. [58] detected sections using rules that identify concepts from an ontology. Each section is associated with a class of this ontology, and the properties of each class represent the terms frequently found in the section, their frequencies being obtained from sample data. The key point here is that these properties are used to detect implicit sections in new texts.
Other works explore section identification from a temporal point of view. Lee and Choi [27] developed a temporal segmentation method for clinical texts, particularly discharge summaries. Hierarchical rules are created manually based on cue phrases that can indicate topical or temporal boundaries in text structures. Temporal segments can describe information about symptoms, clinical tests, diagnoses, medications, treatments, and clinical department/visit information.
As can be seen, studies have used probabilistic rules mainly to deal with implicit sections, or as a mechanism to improve the precision and recall of sections identified with other types of rules. Even though these rules may vary in complexity, all of them represent knowledge acquired from a sample of texts in which sections have been previously identified. The availability of such a sample is a crucial constraint for using this type of rule.
Finally, some studies that follow a hybrid approach apply rules during their first phases. That is the case of Ni et al. [40], who defined exact matching rules; others used regular expressions to locate headings and boundary markers [1,7,20,46] and [21]; furthermore, one study used probabilistic rules with statistics computed for each kind of narrative (e.g. the section length, the number of sections, and the section order) to identify undetected sections [6]. These studies will be discussed in detail later.
Machine learning approaches
ML algorithms are increasingly relevant for section identification. Table 3 presents the studies that include a ML method. Each study is described using the following characteristics:
1. The ML method(s) used. Note that this review only names the method (aka technique), since most of the studies do not specify the particular implementation or algorithm used.
Table 3
Machine learning studies (data set creation: M = manually created, RB = rule-based or automated method, AL = active learning and distant supervision; sizes in number of documents; CV = cross validation; approach: ML = machine learning, H = hybrid)

| Study | ML method(s) | Data set creation | Training size | Test size | Approach |
| | AdaBoost | M | 60 | CV | ML |
| | Bayesian Network | M | 3483 | CV | ML |
| Chen et al. [5], Dai et al. [7] | Conditional Random Fields | M, RB, CO | 790 | 514 | H |
| | Conditional Random Fields | M | 100 | 600 | ML |
| | Conditional Random Fields and Maximum Entropy Classifier | M, AL | NS | NS | H |
| | Conditional Random Fields and Viterbi | M, RB | 2340 | 1003 | H |
| | Expectation Maximization Classifier | M, RB | NS | NS | H |
| | Hidden Markov Model and Viterbi | M, RB | 7549 | 2130 | ML |
| | Logistic Regression | M | 1106 | CV | ML |
| | Logistic Regression and Viterbi | M, RB | 1800 | 12502 | H |
| | Maximum Entropy Classifier | M, CO | 1365 | 374 | ML |
| | Neural Network | M, RB | 25842 | 2000 | H |
| | Support Vector Machine | M, RB | 3000 | 200 | H |
| | Support Vector Machine | M | 50 | CV | ML |
| | Support Vector Machine and KNN | M | 10694 | CV | ML |
2. Strategy used for training and test data set creation. "Manually created" data sets means that human experts annotate (or label) the sections, or the specific characteristics of the text required for section identification. "Using a rule-based approach or an automated method" means that the study included an automatic phase that obtained the sections using rules. Finally, "Active learning and distant supervision" means that the study facilitates the creation of training and test data sets by combining automatic and manual methods in order to reduce the human effort.
3. The reported size of the training data set. Although some studies reported their training sets in terms of sentences, sections, or documents, this review unifies the metric as the number of clinical narratives used.
4. The reported size of the test data set. It is described using the same unit (i.e. number of documents). "CV" means Cross Validation, i.e., an approach that uses the same data set for training and testing, iteratively changing the part of the data used for each purpose.
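The cross-validation scheme can be illustrated with a minimal fold-splitting sketch (indices stand for documents; the fold assignment below is one simple choice among many):

```python
def k_fold_indices(n_docs, k):
    """Yield (train indices, test indices) pairs for k-fold cross validation:
    each fold serves once as the test set while the rest is used for training."""
    folds = [list(range(i, n_docs, k)) for i in range(k)]
    for i, test in enumerate(folds):
        train = [idx for j, f in enumerate(folds) if j != i for idx in f]
        yield sorted(train), test

splits = list(k_fold_indices(6, 3))
print(splits[0])  # ([1, 2, 4, 5], [0, 3])
```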
Our analysis shows that the most popular ML methods are Conditional Random Fields (CRF) and Support Vector Machine (SVM). The Viterbi algorithm is also very popular to find the optimal section label sequence where section identification is treated as a sequence labeling problem. All these works rely on manually created training and test sets, at least partially.
The size of the training data set differs considerably between studies, ranging from 50 to 25,842 texts. Size variation notably impacted the performance and adaptability of the models on new data sets.
All studies extracted features relevant for section identification. Table 4 assigns the reported features to the following groups:
1. Lexical features: word-level features that can be extracted directly from the text, e.g., the entire word, its prefix, its suffix, its capitalization (e.g. uppercase, title case), and its type (e.g. number, stop word), among others.
Table 4
Machine learning features (NS = not specified; POS = part of speech, Pun = punctuation; approach: ML = machine learning, H = hybrid)

| Study | Lexical | Syntactical | Semantic | Contextual | Approach |
| | U,N | POS | RT,AT,T | LP | ML |
| | N | NS | NS | NS | ML |
| Chen et al. [5], Dai et al. [7] | C,A | Pun | ST | WL | H |
| | NS | NS | NS | NS | ML |
| | NS | NS | NS | NS | H |
| | N | POS,Pun | ST | LP | H |
| | NS | NS | ST | SS,OS | H |
| | N | NS | NS | SB | ML |
| | U | NS | NS | NS | ML |
| | U,N,C | NS | ST | LP,LL,LC,CC | H |
| | U,C | Nu | NS | LP,WL,SS,SB | ML |
| | NS | NS | NS | SB | H |
| | N,C | Pun | ST | LP,WL,SB | H |
| | U,N | POS,VT | ST,DI,MN | LP,LL,SB | ML |
| | NS | NS | NS | SB | ML |
2. Syntactical features: represent the sentence structure, e.g. grammatical roles and parts of speech.
3. Semantic features: represent the meaning of words and terms in a sentence. Most of them refer to a custom dictionary or a controlled terminology.
4. Contextual features: describe relative or absolute characteristics of a line or a section within the clinical narrative, for instance the position of the section in the document and layout characteristics.
All studies that applied ML are methodologically similar, but each of them has certain restrictions according to the type of sections it seeks to identify. For instance, the approach proposed by Bramsen et al. [3] explores sections from a temporal perspective. They identify a new section when there is an abrupt change in the temporal focus of the text; e.g., a patient's admission has a different temporal focus than a previous hospital visit. Using the training data, the method extracts boundaries that delineate temporal transitions, and these boundaries are then represented through a set of features. Finally, they trained a classifier to obtain the segments and their temporal order, using the relations before, after, and incomparable.
Li et al. [29] considered section mapping as a sequence-labeling problem over 15 known section types. To train the model, they first determine text span boundaries in the training set, using section headings and blank lines as cues. A text span may start with a section heading or with a blank line; the start position of the next text span marks the end of the previous one. Recognized section headings are mapped to section labels based on a custom dictionary. Thus, every text is represented as a sequence of text spans.
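The span-building step described above can be sketched as follows; the heading dictionary is illustrative.

```python
# Illustrative heading dictionary; a real system would map recognized
# headings to section labels via a custom dictionary.
HEADINGS = {"HISTORY:", "PLAN:"}

def build_spans(lines):
    """Split a document into (start, end) line spans: a new span starts at a
    recognized heading or a blank line; the next span's start closes the
    previous span."""
    spans, start = [], 0
    for i, line in enumerate(lines):
        if i > 0 and (line.strip() in HEADINGS or not line.strip()):
            spans.append((start, i))  # close the previous span
            start = i
    spans.append((start, len(lines)))
    return spans

doc = ["HISTORY:", "Chest pain.", "", "PLAN:", "Aspirin."]
print(build_spans(doc))  # [(0, 2), (2, 3), (3, 5)]
```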
The study of Waranusast et al. [62] contributes to section identification in systems that interpret freehand anatomical sketches and handwritten text using temporal and spatial information. On-line digital ink documents, composed of sequences of strokes separated by gaps, are used as input. A stroke consists of a sequence of stroke points (x and y coordinates) recorded between a pen-down event and a pen-up event. This information was used to train a classifier that distinguishes text from non-text sections, based on spatial and temporal features of the ink document. Even though the sections identified in this work are not classical clinical sections, the feature extraction methods can be extended to other types of classification.
Tepper et al. [57] used a model based on maximum entropy. They classify every text line using BIO tags with category labels X; i.e., B-X and I-X indicate that the line begins (B) or lies inside (I) a section of category X, while O means the line is not in any section (e.g., a blank line at the beginning of a document).
The approach for section identification proposed by Mowery et al. [39] classifies sentences according to the SOAP framework. To this end, an SVM classifier for each SOAP class was trained. The authors found that most of the semantic features were useless for classification, probably because they were too broad. Features identified as important for section identification were: predictive unigrams and bigrams, word/POS pairs, and some tokens to which UMLS® (Unified Medical Language System) [2] identifiers had been assigned.
Lohr et al. [31] developed a custom dictionary of useful categories for the annotation of CDA (Clinical Document Architecture) [13] compliant sections and presented a guide for section annotation in German discharge summaries and related documents. They narrowed down the appropriate granularity of the annotation unit, the set of relevant and feasible categories, and their internal structure. They also highlighted text passages that do not belong to the category suggested by the subheading they were assigned to; such "out of context" passages were abundant in their corpus and constituted a major source of false assignments.
In Deléger and Névéol [8], section identification considered only two types of sections, viz. core medical content and other content (e.g., headings and footers), in order to select what is relevant for physicians. They trained a CRF model using a training data set with EHR narratives and other medical texts, e.g., email content.
Haug et al. [17] used Bayesian models to detect sections, based on a training set with 98 different topics annotated in seven types of clinical narratives. The model includes different types of narratives because of the high variability of the headings produced during routine medical documentation for each kind of document. The annotations include (sub)section headings and section contents. Similar topics are labeled under the same section name.
Even though the above studies trained their models on data sets from specific institutions, their underlying ML logic makes it possible to retrain the original model on new training data, such that the section identification method can be applied in other institutions. Tepper et al. [57], for instance, stated that their method could be easily retrained, because it "requires only a small annotated dataset". Nevertheless, retraining ML models, although viable in theory, is often laborious and time-consuming in practice, because new annotated data sets are required to obtain acceptable performance. Similarly, the existing methods are constrained to identifying sections in certain types of clinical narratives (see Table 1); however, it is also possible to retrain the models on a new narrative type or to apply the original model to narratives with similar characteristics. Mowery et al. [39] suggested that their method, originally created for Emergency Reports, could be applied to reports with similar structure and lexical distribution, such as History and Physical Exams, Progress Notes, and Discharge Summaries. Likewise, Bramsen et al. [3] claimed that their method, originally created for Discharge Reports, can also be used during the pre-processing phase of other analyses.
Finally, seven hybrid approaches use rule-based methods during the creation of training and test data sets, and then apply ML methods. This is the case of Apostolova et al. [1], Sadoughi et al. [46], Ni et al. [40], Chen et al. [5], Dai et al. [7], Jancsary et al. [20], and Ganesan and Subotin [16]. Others use rules for detecting the explicit sections and a ML algorithm for detecting implicit sections, as done in Cho et al. [6]. These works are characterized in Tables 2, 3, and 4, and explained in detail in the following section.
Hybrid approaches
The study by Cho et al. [6] extracts common heading labels from a training data set and uses them to identify labeled sections in the target texts. Then, it extracts patterns from the training data set and uses them as features of an expectation maximization model for identifying "hidden" (i.e. unlabeled) sections. Features represent how the colon character is used (e.g., whether it expresses time, a ratio, or a list) and the characteristics of the candidate phrase (e.g., whether it is all capitalized or in title case). The model is applied when a text is suspected to contain "hidden" sections according to the statistics computed for that kind of text, in particular the length of a section, the number of sections, and the order of sections.
In a similar vein, Apostolova et al. [1] used training data to identify common local formatting patterns that help identify section headings and boundary markers in radiology reports. The training vectors are built from 3000 reports, computing bi-gram TF-IDF (Term Frequency–Inverse Document Frequency) weights for the eight possible sections in radiology reports. Each sentence from the training data is assigned to one section. For new reports, two classification strategies are applied. The first is rule-based and detects headings; when no section is detected, the similarity of the sentence to the training vectors of each section is computed, and the sentence is assigned to the section of the closest sentence vector (cosine distance). The second strategy uses SVM, creating one classifier per section type trained on features of the sentence itself and of the surrounding sentences. Features are sentence formatting (e.g. capitals, colon), the previous sentence boundary (e.g. white space, special characters), the following sentence boundary, a flag indicating exact heading matches in the sentence, and the cosine distance to each of the eight section vectors. Formatting and boundary features significantly improved the classification of semantically related sections such as Findings and Impression. Other sections, e.g., Recommendation, proved to be hard to classify correctly.
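The cosine-distance fallback strategy can be illustrated schematically. Raw term counts stand in for the bi-gram TF-IDF weights used in the study, and the section vectors are invented.

```python
import math
from collections import Counter

# Invented per-section training vectors (term -> weight).
SECTION_VECTORS = {
    "FINDINGS": Counter({"opacity": 3, "lobe": 2, "effusion": 2}),
    "IMPRESSION": Counter({"consistent": 3, "pneumonia": 2, "recommend": 1}),
}

def cosine(a, b):
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def closest_section(sentence):
    """Assign the sentence to the section with the most similar vector."""
    vec = Counter(sentence.lower().replace(".", "").split())
    return max(SECTION_VECTORS, key=lambda s: cosine(vec, SECTION_VECTORS[s]))

print(closest_section("Opacity in the right lower lobe."))  # FINDINGS
```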
Sadoughi et al. [46] embedded section identification in the processing of medical dictations, as a binary classification problem where "1" corresponds to a boundary token and "0" otherwise. They used a recurrent neural network with an embedding layer initialized with word2vec pre-trained embeddings: a continuous bag-of-words model with an embedding size of 200, trained for 15 iterations over a window of 8 words with no minimum count. We classify this study as hybrid because it also uses regular expressions.
Speech recognition was also the context of the approach by Jancsary et al. [20] to text segmentation. They created a data set using dictations whose completeness was validated by humans. Then, a section type was automatically assigned to each report fragment if its heading matched some known heading from a custom dictionary; when the heading was not known, they assigned it manually. Token features of each section, such as dates, physical units, and dosages, were labeled. Using the labeled data set, they labeled new dictations in raw format (dictations without human intervention) using a similarity function based on semantic and phonetic differences. With both annotated data sets they built a CRF model for the classification of new dictations.
Chen et al. [5] and Dai et al. [7] formulated section identification as a token-based classification using CRF on lexical, syntactical, contextual, and semantic features, the latter resulting from string matches against heading terms from controlled terminologies. The importance of each feature in the model was analyzed. The authors found that adding layout features (e.g. line breaks) enabled the CRF model to recognize section headings that did not appear in the training set. A decrease in precision was observed when very specific terminology was associated with subsections in the lexicon. Finally, the use of non-standard abbreviations was found to be one of the most important causes of false negatives (e.g. "All" for "allergy").
Ganesan and Subotin [16] proposed a segmentation model that identifies header, footer, and top-level sections in clinical texts, using a regularized multi-class logistic regression model to classify each line according to five roles: start of a heading, continuation of a heading, start of a section, continuation of a section, and footer. The training set was built using a rule-based approach applied to different kinds of clinical narratives. Using this training set, the method generates features at the text-line level, including the relative position of the line, which turned out to be the most important indicator of section boundaries. Another useful feature is the "KnownHeading" feature, which represents whether the line contains a heading included in a custom dictionary.
Ni et al. [40] used active learning and distant supervision during the creation of the training data set for a maximum entropy section identification model. Active learning reduces annotation effort by selecting the most informative data, and distant supervision automatically generates "weakly" labeled training data using a knowledge base. The study starts with a manually annotated training set, which is then used to predict the sections of a new data set. From the predicted data set, the examples with the lowest prediction confidence are selected and given to a human annotator. The confidence of the examples (i.e. documents) is calculated by counting the sentences that have a confidence below 0.9 and selecting the top m documents by this count. In the distant supervision approach, the goal is to include all the known medical headings in a custom dictionary and use them for extracting, from the unlabeled data set, the sentences that contain them. The found heading is used to tag the sentence and all subsequent sentences until another heading is found. This study demonstrated that both techniques achieve good accuracy for the final training data set and the models generated from it.