Background
Escherichia coli is a normal inhabitant of the gastrointestinal microbiota of mammals and birds, but at the same time it can cause a variety of diseases relevant for public and animal health, such as diarrhoea, bacteraemia, septicaemia, and urinary tract infections [
1]. From a clinical perspective,
E. coli is broadly classified into commensals, intestinal pathogenic
E. coli (InPEC) and extraintestinal pathogenic
E. coli (ExPEC), the latter group being further divided into uropathogenic
E. coli (UPEC), septicaemia-associated
E. coli (SEPEC), neonatal meningitis
E. coli (NMEC), and avian pathogenic
E. coli (APEC). ExPEC strains are normal colonizers of the gut of humans and animals, but in contrast to intestinal pathogenic variants, they can cause infections of the urinary tract or the bloodstream [
2] once they reach the corresponding body site. Although collectively termed ExPEC, simply to reflect their shared ability to express functionally similar virulence factors and to denote considerable overlaps concerning serotypes and phylogenetic background [
3],[
4], this group of strains exhibits large genome diversity. This has been mainly attributed to the frequent location of virulence associated genes (VAGs) on plasmids, pathogenicity islands, or phages, allowing the VAGs to be highly interchangeable among strains through horizontal gene transfer (HGT) [
5],[
6]. The population structure of
E. coli is characterised by the presence of distinct phylogenetic groups as observed by phylogenetic reconstruction [
7],[
8] or by the use of specific markers [
9]. Based on these approaches, four major phylogroups (A, B1, B2 and D) have been described; depending on the method, two minor (E and F) or two hybrid (AxB1 and ABD) phylogroups have additionally been defined, which are not necessarily equivalent [
8]-[
10]. The distribution (presence/absence) of virulence factors thought to be involved in the ability of a strain to cause diverse diseases also varies among strains of these phylogenetic groups, indicating a role of the genetic background in the expression of virulence [
11]-[
13].
The high diversity of ExPEC and the difficulty of clearly demarcating these facultative pathogenic strains from their commensal counterparts pose a huge challenge to infection medicine in terms of diagnostics and risk assessment. As recently shown, the genetic variability of the
mutS-
rpoS chromosomal region may serve as an indicator, and thus as a chromosomal marker, of the different virulence potential of
E. coli strains [
2],[
14]-[
16]. The crucial genes are
mutS, which encodes one of the four proteins required for methyl-directed mismatch repair (MMR) of DNA, and
rpoS, which encodes a sigma factor (sigma 38) that regulates many stationary-phase and environmental stress response genes [
17]. Although
mutS and
rpoS are generally conserved in Enterobacteriaceae, the
mutS-rpoS intergenic region and its adjacent regions show extensive genetic variability and have been subject to genetic exchange during the evolution of pathogenic lineages. Several studies revealed a pathotype-associated polymorphism in this genetic region [
15],[
18],[
19], suggesting that the region has been shaped by HGT and other evolutionary processes. In comparison to
E. coli K-12, previous studies revealed that enteropathogenic
E. coli (EPEC), enterohaemorrhagic
E. coli (EHEC) and
E. coli group B2 strains harbour specific DNA insertions within the
mutS-rpoS intergenic region [
15],[
16],[
19]. An insertion of 2.1 kb, in place of the initially identified 2.9 kb insert at the proximity of
E. coli O157:H7 [
19] has been found in strains of uropathogenic
E. coli [
15] and larger intergenic regions exist in strains of EPEC and EHEC [
16]. Moreover, phylogenetic analysis of EHEC and EPEC strains, as well as strains of the Ecor collection, revealed that the
mutS gene itself may be frequently subject to horizontal transfer and recombination during the evolution of these strains, which is consistent with a mechanism for stabilizing adaptive changes promoted by
mutS mutators with relaxed recombination barriers [
14],[
20].
In the present study we attempted to assess the importance of the
mutS-rpoS intergenic region as a surrogate marker for the rapid identification of highly virulent ExPEC strains and to delineate biologically meaningful subgroups among this highly diverse group of strains. In particular, we aimed: (i) to characterize the genetic diversity of the
mutS gene and of the
o454-nlpD genomic region among 510
E. coli strains obtained from animal and human sources; (ii) to delineate associations between the polymorphism of this region and features such as phylogenetic background of
E. coli, bacterial class or pathotype, host species, clinical condition, serogroup and virulence-associated genes (VAGs); and (iii) to identify the most important ExPEC-related VAGs for classification of the
o454-nlpD genomic region by using the random forest (RF) algorithm [
21]. The results could be a valuable contribution to ongoing analyses on pathoadaptive alterations in ExPEC strains that affect disease severity and may have consequences for diagnostics of
E. coli infections [15].
Here we provide sound evidence that this polymorphic region is indeed correlated with virulence. Given the large number of newly generated E. coli whole-genome sequences, researchers working with this versatile bacterial pathogen can further test the validity of our data in a larger context.
Discussion
The
mutS-nlpD region represents a major operational region of genomic evolution in Enterobacteriaceae. Previous studies suggested that the polymorphic nature of this genetic region, which is due to a high mutation rate and to the loss or acquisition of genes by horizontal gene transfer, plays an important role in the constant adaptation of the bacteria to environmental changes and to new ecological niches [
2],[
14]-[
16]. So far, the mosaic structure of the
mutS-nlpD region has mainly been investigated in intestinal pathogenic strains. In that context, RFLP analysis of the
fhlA-nlpD genomic region revealed four different clusters, i.e. K-12 and Ecor group A, EPEC 1 group, EPEC 2 & EHEC 2 groups, and O157:H7 & EHEC 1 group [
16]. Soon after, Le Clerc et al. [
19] detected a polymorphism at the proximity of the
rpoS gene. In particular, a 2.9-kb DNA insertion was identified in
E. coli O157:H7 and related enterohaemorrhagic
E. coli (EHEC) strains as well as in
Shigella dysenteriae [
19], while a 2.1-kb DNA insert was observed in
E. coli pyelonephritis strain CFT073 [
15]. These authors investigated diverse clinical isolates, including 23 from urinary tract infections (UTIs), 26 from infantile diarrhoea and haemorrhagic colitis, and seven from catheter-associated infections. Since the DNA insert previously identified in UPEC strain CFT073 was more common among urinary tract infection isolates (82.6%) than among other clinical
E. coli isolates, they suggested that it is specifically linked with uropathogens [
15]. They also stated that all B2 strains of the
E. coli reference collection harboured this specific DNA insert and concluded that it might also predict the virulence of a strain, as B2 strains are known to be highly virulent in terms of extraintestinal pathogenicity.
From these studies it seemed likely that there might be pathotype- or virulence-associated polymorphisms in the
mutS-rpoS region of the
E. coli chromosome, although this has not been verified by using a large set of field strains. In the present study, PCR analyses of the
o454-nlpD intergenic sequence from 510 predominantly extraintestinal pathogenic and commensal
E. coli strains revealed substantial size variation. With only a few exceptions, the strains were grouped into four patterns which resembled previously identified groups determined for the O157:H7 & EHEC 1 group (termed pattern I in our study), K-12 and Ecor A group (pattern II), CFT073 and other uropathogens (pattern III), and the EPEC group (pattern IV) [
15]. While patterns I, II, and IV were randomly distributed among all non-B2 group strains, pattern III was exclusively observed in group B2 strains, both of clinical and faecal origin. This is consistent with the genomic variation in selected Ecor strains described by others [
15],[
19],[
23],[
24]. With respect to the ongoing evolution and due to observations from previous MLST analyses [
8] it is generally recognized that the Ecor collection does not reflect the entire
E. coli population. Previous findings about the presence of the
rpoS-proximal 2.1-kb insertion in all Ecor group B2 isolates and in a limited number of clinical isolates could thus not be extrapolated to currently relevant ExPEC sequence types, such as ST95, ST127, ST372, and the globally emerging multiresistant ST131 clone [
25]-[
29]. In our strain collection, these sequence types as well as other clinically important B2-STs, including ST73 and ST80 [
28],[
30], were frequently represented, suggesting the existence of a B2-associated
mutS-nlpD intergenic region.
Since pattern III was strongly associated with the strain category ExPEC and with a higher number of ExPEC-related VAGs, one could speculate that these genes play a role in the ability of a strain to invade extraintestinal tissues. However, the replacement of
slyA with other genes such as
o347 and
o183 in pattern III and their roles in pathogenesis remain to be investigated.
SlyA, found exclusively in the EHEC- and EPEC-related patterns I and IV, plays an important role in the invasion and survival of
Salmonella in macrophage cells [
31]. The relevance of this transcriptional regulator for the pathogenesis of non-B2 ExPEC strains has not been explored yet. Similarly, the putative role of the
o347-encoded factor, showing a limited level of similarity to enzymes implicated in antibiotic hydrolysis, remains to be determined [
15],[
32]. Girardeau et al. [
33] investigated common traits between animal and human ExPEC isolates positive for afimbrial adhesin gene
afa-8. Interestingly, they found a significant proportion of human pyelonephritis-associated isolates showing the
mutS-rpoS intergenic region found in EPEC and EHEC isolates. Furthermore, a putative function of
SlyA in the pathogenesis of certain extraintestinal
E. coli isolates which largely lack ExPEC-associated traits, such as S-fimbrial gene
sfa, alpha-hemolysin gene
hly, cytotoxic necrotizing factor gene
cnf, was suggested. In accordance with our data, Girardeau et al. [
33] and others observed
mutS-rpoS intergenic patterns among ExPEC strains, which were previously associated with intestinal pathotypes [
15],[
33]. Thus, with respect to the group of ExPEC and its various pathotypes this genetic region may not be regarded as pathotype-associated but more likely as a marker to reflect their diversity and probably to predict highly virulent members of this group, as different patterns are obviously linked with single ExPEC-related virulence genes.
An advanced statistical method, the Random Forest (RF) algorithm, was used to identify the most important VAGs for the prediction of the
o454-
nlpD patterns. The RF classification of the
o454-
nlpD patterns estimated an OOB (out-of-bag) error rate of 18.4% when using 46 virulence-associated genes (VAGs). This relatively high error rate clearly indicated that not all VAGs used were strongly associated with a specific
o454-
nlpD pattern as shown in Table
3, where only pattern III is associated with specific genes. Other genes were associated with at least two different patterns. However, since the error rate in predicting pattern III was zero (Table
5 and Additional file
4), we can conclude that RF performed very well for pattern III prediction. Furthermore, RF allowed us to identify for the first time the top-ranked VAGs for pattern III prediction. The most predictive indicators, the genes
csgA,
malX,
chuA,
vat and
sitD chromosomal, were also statistically significant predictors, and thus worthy of further investigation. Since pattern III was nearly exclusively linked with group B2 strains in our study, these genes may also be regarded as predictive for highly virulent members of the ExPEC group and of commensal
E. coli harbouring virulence potential, respectively. Indeed, the heme binding protein encoding gene
chuA is one of the genetic regions targeted in the PCR-based approach for rapid phylogenetic typing of
E. coli strains and is said to occur regularly in group B2 and D strains [
9],[
34].
ChuA, together with
vat, which encodes an autotransporter serine protease toxin,
fyuA, which encodes the yersiniabactin receptor and finally a gene (
yfcV), which encodes the major subunit of a putative chaperone-usher fimbria, have previously been included in a diagnostic multiplex PCR to identify strains of the UPEC pathotype [
35]. In their study, Spurbeck et al. [
35] suggested that
E. coli isolates that encode these four genes are correlated with high numbers of other VAGs, are able to colonize the bladder in higher numbers than strains lacking these genes, and are nearly 10 times more likely to represent UPEC or NMEC strains than faecal commensal strains [
35]. Likewise positively linked with pattern III strains is
malX, which codes for a phosphotransferase system enzyme II that recognizes maltose and glucose [
36] and is frequently present in ExPEC strains [
37]-[
39]. Östblom et al. [
40] demonstrated that
malX was among those genes that were associated with fitness of
E. coli in the infant bowel microbiota. Here, carriage of various pathogenicity island markers and particularly
malX correlated positively with the time of persistence of individual strains in the colon, supporting their role to increase the fitness of
E. coli in its natural niche, the colon [
38].
The
mutS chromosomal region has long been identified as the location for the insertion of blocks of VAGs, e.g. in the case of two widely diverged pathogens,
Salmonella Typhimurium and
Haemophilus influenzae. In
S. Typhimurium, a 40 kb pathogenicity island (SPI-1) is inserted 5′ to the
mutS gene [
40] and in
H. influenzae, a 3.1 kb tryptophanase gene cluster (
tna) is inserted on the 3′ side of the
mutS gene in strains that cause spinal meningitis in infants [
41]. This insertion allows the utilization of tryptophan and, thus, provides a growth advantage for the pathogen, particularly in the tryptophan-rich environment of cerebrospinal fluid. In our study, there was no indication of an insertion of PAI-like structures or of larger blocks of VAGs in the intergenic
mutS-rpoS region in any of the 510
E. coli strains under investigation. Other studies raised the hypothesis that a certain
o454-nlpD pattern, reflecting distinct evolutionary
E. coli lineages, might be linked with the acquisition of VAGs at chromosomal sites outside this genetic region by HGT [
2],[
15],[
42]. Culham and Wood [
15] suggested that the 2.1-kb insertion upstream of
rpoS arrived earlier than certain virulence determinants linked with urinary tract infections, such as genes for P-fimbriae (
pap), S-fimbriae (
sfa), a polyketide synthetase (
pks), and α-hemolysin (
hly) during the evolution of group B2. Basically, the polymorphism in this genetic region is considered to result from the close linkage of
mutS and
rpoS genes which are frequently mutated in
E. coli evolution due to ecological specialization upon repeated shuttles between different environments [
2]. Here, their inactivation as well as the re-acquisition of functional alleles might have been of selective advantage, e.g. in terms of stress resistance, higher mutation rates, genome plasticity, and stabilization of beneficial adaptive mutations [
5],[
43].
Indeed, in addition to the findings of genetic variability, phylogenetic analysis of EHEC and EPEC pathogens, as well as strains of the Ecor collection, revealed that an unexpected level of recombination between
mutS genes has occurred during the evolution of these strains [
14],[
20]. In a comparison of
mutS phylogeny against predicted
E. coli ‘whole-chromosome’ phylogenies, derived from multilocus enzyme electrophoresis (MLEE) and
mdh sequences, Brown et al. [
14] observed striking levels of phylogenetic discordance among
mutS alleles and their host strains, which basically represented the Ecor collection. To investigate whether this is also true for a larger collection of strains, and using concatenated MLST gene sequences instead of single genes or MLEE for comparison, we extended this approach to 177 ExPEC and commensal strains. Here, many of the
mutS alleles clustered according to the population structure given by the housekeeping gene phylogeny, indicating a low frequency of recombination events across phylogenetic groups. However, for a number of strains we also found incongruence between these two phylogenies, which is likely due to recombination of
mutS between different phylogenetic groups. By investigating the molecular phylogeny of MMR (methyl-directed mismatch repair) genes from natural
E. coli isolates, Denamur et al. [
20] showed that, compared to two housekeeping genes, individual functional MMR genes exhibit high sequence mosaicism derived from diverse phylogenetic lineages. They suggested that the MMR functions have frequently been lost and reacquired in the evolution of
E. coli. To what extent
mutS and other genes of the MMR system as well as the
mutS-nlpD intergenic region represent a hallmark of a mechanism of adaptive evolution in ExPEC and in other
E. coli pathotypes has scarcely been investigated. Certainly,
mutS has a unique role in the formation of mutators with relaxed recombination barriers, and bacteria with a defect in their MMR system, e.g. by a temporary loss of
mutS, are more prone to genetic variations and HGT and, consequently, have an increased capacity to adapt to the host environment or acquire new VAGs, respectively [
2],[
44]. This phenomenon was recently described for UPEC strains, where
mutS and other genes of the MMR system were found to be involved in the reciprocal control of motility and adherence of UPEC due to an increased expression of flagellin [
44]. The authors discussed a possible relationship between MutS and UPEC pathogenesis in general as urinary tract isolates exhibit a higher occurrence of mutator strains than commensal
E. coli or any other
E. coli pathotype [
42].
Methods
Bacterial strains
A total of 510
E. coli isolates, including 72 strains of the
E. coli reference collection (Ecor) were examined for their
o454-nlpD genomic region. This strain collection comprised 367 extraintestinal pathogenic
E. coli (ExPEC) isolated from multiple anatomical sites and 143 strains isolated from the faeces of clinically healthy humans (n = 83) and animals (n = 60). Pathogenic strains were isolated from urinary tract infections in humans (n = 65) and animals, including dogs, cats, horses, and pigs (n = 93), avian systemic and local
E. coli infections (collectively termed colibacillosis) (n = 135), meningitis in infants (n = 24), septicaemia in humans (n = 28) and animals (n = 5), and from other diseases, such as peritonitis, mastitis, metritis, cervicitis, and vaginitis in various animal species (n = 18), the latter three diseases broadly categorized as genital tract infection. Clinical strains were recovered during routine microbiological diagnostics of samples from patients in hospitals and veterinary clinics in Germany. In case of disease outbreaks, only one strain per event was included.
E. coli isolated from healthy individuals were avian faecal strains, which originated from cloacal swabs of clinically healthy poultry from Germany and human commensal strains isolated from the gut of healthy human carriers published earlier [
45]-[
47]. Commensal strains belonging to the Ecor collection originated from the gut of clinically healthy animals, such as cattle, dogs, sheep, pigs, and orang-utan [
48].
We further included enteropathogenic
E. coli E2348/69 [
49] and enterohaemorrhagic
E. coli EDL933 [
50] as well as K-12 laboratory strain MG1655 as references for PCR mapping of the
mutS-nlpD genomic region. Strains were stored at -70°C in brain heart infusion broth with 10% glycerol until further use.
DNA preparations
Bacterial DNA was extracted using the Master-Pure™ Genomic DNA Purification Kit (Biozym Diagnostik GmbH, Hessisch Oldendorf, Germany) according to the manufacturer’s recommendations. DNA concentrations were determined with a NanoDrop ND-1000 spectrophotometer. The DNA was diluted in MilliQ sterilized water to obtain ca. 50 ng/μl, and 4 μl were used for single and multiplex PCR.
Characterization of the o454-nlpD genomic region and sequence analysis of mutS
In order to investigate the
o454-nlpD region size a PCR approach was used. The amplification of the
o454-nlpD regions was carried out with oligonucleotide primers F5 and R2 (Table
1) using standard PCR conditions. The reference sequences selected were: MG1655 (U00096), EDL933 (NC_002655), CFT073 (NC_004431) and E2348/69 (FM180568). The structure of these genomic regions and the relative size of the reference sequences are illustrated in Figure
1.
Long range PCRs for amplification of the whole fhlA-mutS region, using primers fhlA FP and nlpD RP, were performed with an Extensor Hi-Fidelity PCR Master Mix (ABgene, Fisher Scientific - Germany GmbH) as recommended by the manufacturer. Briefly, for a 25 μl reaction, 10.0 μl of Extensor Master Mix was mixed with 0.2 μl of oligonucleotide primers in a 100 pmol concentration, 4 μl (400 ng) template DNA and 10.8 μl deionized water. The amplification was performed in a thermal cycler (Perkin Elmer GeneAmp® PCR System 2400, Applied Biosystems, Darmstadt, Germany) using the following program: 92°C, 2 min for initial denaturation; 10 cycles of 92°C, 10 sec denaturation, 59°C, 30 sec annealing and 68°C, 8 min elongation; the following 15 cycles started with the same extension step at 68°C, which was prolonged by 10 sec per cycle. A final extension cycle was applied at 68°C for 7 min.
E. coli strains were further investigated for various regions within the
fhlA-
mutS-
rpoS-nlpD genomic region by PCR assays according to standard protocols [
51]. Targeted genes, their descriptions as well as sequences of oligonucleotide primers and their positions within the target sequences are given in Table
1 and Figure
1.
Oligonucleotide primers used to amplify the
mutS genes in
E. coli were F1 and R3 and the sequences are given in Table
1. Amplification of
mutS sequences was performed under the following conditions: initial denaturation at 94°C for 4 min; 30 cycles of 94°C for 45 sec, 58°C for 1 min, and 72°C for 2.30 min; and final incubation at 72°C for 10 min. After double strand sequencing of the amplicon (LGC Genomics, Berlin, Germany), the 380-bp segment of the
mutS gene, which corresponds to the conserved ATP-binding domain and to base pair coordinates 1808 to 2187 of the
mutS coding region in
E. coli CFT073 (GenBank accession number AE014075) was used for comparative phylogenetic analyses.
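The coordinate range quoted above is 1-based and inclusive, so it spans 2187 - 1808 + 1 = 380 bp. A minimal sketch of how such a segment could be cut from a coding sequence is given below, in Python for illustration only; the function name `extract_segment` and the placeholder sequence are assumptions and not part of the original analysis.

```python
def extract_segment(coding_seq: str, start: int, end: int) -> str:
    """Return the segment given by 1-based, inclusive coordinates [start, end]."""
    if not (1 <= start <= end <= len(coding_seq)):
        raise ValueError("coordinates fall outside the sequence")
    # 1-based inclusive [start, end] maps to the 0-based half-open slice [start-1, end)
    return coding_seq[start - 1:end]


# Placeholder sequence (NOT real mutS data); it only needs to be long enough
# to contain positions 1808 to 2187.
placeholder_cds = "ACGT" * 650

segment = extract_segment(placeholder_cds, 1808, 2187)
print(len(segment))
```

With a real coding sequence in place of the placeholder, the returned string would be the region used for the comparative phylogenetic analyses.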
In addition to the 88 mutS alleles sequenced here, another 92 partial mutS sequences, originating from the Ecor strain collection (accession no. AF001987 - AF002010, AJ005826 - AJ005828, AF004287, AJ242620, and AF291185 - AF291258) and from fully sequenced strains CFT073 (UPEC, accession no. AE014075), APEC_O1 (APEC, CP000468), UTI89 (UPEC, CP000243), HS (Commensal, CP000802), MG1655 (K-12, U00096), W3110 (K-12, AP009048), SMS-3-5 (Environmental, CP000970), 042 (EAEC, FN554766), 11128 (EHEC, AP010960), 11368 (EHEC, AP010953), 12009 (EHEC, AP010958), E2348/69 (EPEC, FM180568), EC4115 (EHEC, CP001164), EDL933 (EHEC, AE005174), H10407 (ETEC, FN649414), LF82 (AIEC, CU651637), O157:H7 Sakai (EHEC, BA000007), 536 (UPEC, CP000247), IAI39 (UPEC, CU928164), and UMN026 (UPEC, CU928163) were obtained from GenBank.
Phylogenetic trees of
mutS sequences and concatenated sequences of the seven housekeeping genes included in MLST analyses were calculated using RAxML 8 [
52]. For each phylogeny, 100 bootstrap replicates were calculated. The visualization of the tree was performed with Dendroscope 3 [
53].
Nucleotide sequence accession numbers
The mutS nucleotide sequence data reported in this paper have been deposited in the GenBank sequence database with accession numbers KM232523 through KM232607.
Serotyping
Serotyping was performed on 426 strains at the Robert Koch Institute (Wernigerode, Germany) by tube agglutination with rabbit anti-E. coli immune sera produced against a panel of antigenic test strains containing E. coli O-groups 1 to 181. A similar analysis was carried out to determine the flagellar (H) antigens.
Multilocus sequence typing (MLST), Ecor grouping and phylogenetic analyses
Multilocus sequence typing was performed using the scheme published by Wirth et al. [
8]. Allele sequences were allocated to the public database available at the MLST website (
http://mlst.warwick.ac.uk/mlst/dbs/Ecoli). Ancestral groups were determined by an analysis based on the concatenated sequences of the seven housekeeping genes used for MLST. The linkage model implemented in the software Structure (
http://pritch.bsd.uchicago.edu/software.html) was used to identify groups with distinct allele frequencies. Cut-off values for the assignment of individual isolates to one of the four groups (A, B1, B2, and D) as well as to hybrid groups AxB1 and ABD were determined according to Wirth et al. [
8]. Phylogenetic clustering was performed by calculating a minimum spanning tree by means of a graphical software tool implemented in BioNumerics 7.1 (Applied Maths, Belgium). MLST data were partially adopted from previous publications [
37],[
45]-[
47].
Virulence gene typing
Virulence-associated genes (VAGs) of recognized importance in the pathogenesis of ExPEC strains were investigated by multiplex and single PCRs as described previously [
37],[
47]. A total of 47 VAGs were investigated in the present study; they encode factors within the categories of adhesins (
afa/draB,
bmaE, csgA,
fimC,
focG, gafD, hrlA,
iha,
mat,
papAH, papC,
papEF, papG, sfa/foc,
sfaS, tsh), iron acquisition (
chuA,
fyuA,
ireA,
iroN,
irp2,
iucD,
iutA, sit episomal,
sit chromosomal), serum resistance/protectins (
iss,
kpsMTII,
neuC,
ompA,
ompT,
traT), toxins/hemolysins (
astA,
cnf,
sat,
vat,
hlyA,
hlyF), and invasion (
ibeA,
gimB,
tia). Miscellaneous genes such as
malX,
pic,
puvA,
pks, and ColV plasmid operon genes
cvi/cva were investigated as well.
Biostatistics
To test for a significant relationship between two categorical variables with few nominal values, Pearson’s chi-square test [
54] was used. In case the assumptions of the chi-square test did not hold, Fisher’s exact test was applied [
55]. In particular, Pearson’s chi-square test was used to analyse the relationship between the
o454-nlpD regions and the strain category (faecal or ExPEC), while Fisher’s exact test was employed to investigate the relationship between
o454-nlpD patterns and single VAGs. A Cohen-Friendly association plot was used to visualize deviation from independence of rows and columns in a two-way frequency table [
56].
P-values < 0.05 indicate that the relationship between the
o454-nlpD regions and the strain category or single VAG is statistically significant. Pearson’s chi-square test and Fisher’s exact test were performed using the functions ‘chisq.test’ and ‘fisher.test’ in R software (
http://www.r-project.org).
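For readers who want to see the arithmetic behind these two tests, the following is a minimal sketch in Python rather than R, which is what the study used. It computes the Pearson chi-square statistic (without continuity correction) for a two-way table and a two-sided Fisher exact P-value for a 2 x 2 table. The function names and the counts are hypothetical illustrations, not data or code from this study, and the sketch assumes that all row and column totals are non-zero.

```python
from math import comb


def chi_square_statistic(table):
    """Pearson chi-square statistic and degrees of freedom for a two-way table."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Expected count under independence of rows and columns
            expected = row_totals[i] * col_totals[j] / n
            stat += (observed - expected) ** 2 / expected
    df = (len(row_totals) - 1) * (len(col_totals) - 1)
    return stat, df


def fisher_exact_2x2(a, b, c, d):
    """Two-sided Fisher exact P-value for the 2 x 2 table [[a, b], [c, d]]."""
    row1, row2, col1 = a + b, c + d, a + c
    denom = comb(row1 + row2, col1)

    def prob(x):
        # Hypergeometric probability of x in the top-left cell, margins fixed
        return comb(row1, x) * comb(row2, col1 - x) / denom

    p_obs = prob(a)
    support = range(max(0, col1 - row2), min(row1, col1) + 1)
    # Sum all tables that are at most as probable as the observed one
    p = sum(prob(x) for x in support if prob(x) <= p_obs * (1 + 1e-7))
    return min(1.0, p)


# Hypothetical toy counts: rows = VAG present/absent, columns = two patterns
stat, df = chi_square_statistic([[10, 20], [20, 10]])
p_fisher = fisher_exact_2x2(3, 1, 1, 3)
print(round(stat, 3), df, round(p_fisher, 4))
```

The chi-square statistic would still have to be compared against the chi-square distribution with `df` degrees of freedom to obtain a P-value, which is what a statistics package does internally.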
To determine the strength of association between two categorical variables with multiple nominal values, Cramér’s V [
57] was used. In particular, this statistic was applied to measure the association between the
o454-nlpD patterns and one of the following variables: MLST type, Ecor group, class or pathotype, host species, clinical conditions and serotype. The two-way contingency tables fitted for Cramér’s V (see Additional file
2, Additional file
5, Additional file
6, Additional file
7, Additional file
8 and Additional file
9) displayed as column names the
o454-nlpD patterns and as row names one of the variables described above.
Cramers.v values closer to 1 indicate a strong or high association between the two nominal variables, while closer to 0 indicate a weak or low association between them;
uc.RC values, representing Theil’s symmetric Uncertainty Coefficient UC, indicate how much knowing the values of the Row variables decreases uncertainty about the Column variables;
p.uc.RC values indicate the probability of gaining UC(R|C) by chance;
uc.CR values indicate how much knowing the values of the Column variables decreases uncertainty about the Row variables;
p.uc.CR is the probability of gaining UC(C|R) by chance;
uc.sym values closer to 1 indicate a lower uncertainty in predicting one of the variables from another on average. This analysis was performed using the R-package ‘polytomous’ (
http://cran.r-project.org/web/packages/polytomous/index.html).
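The association measures described above can be illustrated with a short Python sketch (the study itself used the R package named above). It computes Cramér's V from the chi-square statistic and Theil's uncertainty coefficients from entropies of the row, column and joint distributions. The function names, the dictionary keys and the toy table are assumptions made for illustration; they are not the package's API, and no claim is made about how these keys map onto the package's output names.

```python
from math import log, sqrt


def _entropy(counts):
    """Shannon entropy (natural log) of a list of counts."""
    n = sum(counts)
    return -sum(c / n * log(c / n) for c in counts if c > 0)


def cramers_v(table):
    """Cramer's V = sqrt(chi2 / (n * min(rows - 1, cols - 1)))."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    chi2 = sum(
        (obs - row_totals[i] * col_totals[j] / n) ** 2
        / (row_totals[i] * col_totals[j] / n)
        for i, row in enumerate(table)
        for j, obs in enumerate(row)
    )
    return sqrt(chi2 / (n * min(len(row_totals) - 1, len(col_totals) - 1)))


def uncertainty_coefficients(table):
    """Theil's U in both directions plus the symmetric version."""
    h_rows = _entropy([sum(row) for row in table])
    h_cols = _entropy([sum(col) for col in zip(*table)])
    h_joint = _entropy([cell for row in table for cell in row])
    mutual_info = h_rows + h_cols - h_joint
    return {
        # Share of row uncertainty removed by knowing the column
        "u_row_given_col": mutual_info / h_rows,
        # Share of column uncertainty removed by knowing the row
        "u_col_given_row": mutual_info / h_cols,
        "u_symmetric": 2 * mutual_info / (h_rows + h_cols),
    }


# Hypothetical toy table with a perfect association between rows and columns
toy = [[10, 0], [0, 10]]
v = cramers_v(toy)
uc = uncertainty_coefficients(toy)
print(round(v, 3), {k: round(val, 3) for k, val in uc.items()})
```

For a table with no association at all, such as `[[5, 5], [5, 5]]`, both Cramér's V and the uncertainty coefficients drop to zero.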
In order to interpret the relevance of individual VAGs for pattern prediction and to filter out non-informative VAGs, the Random Forest (RF) algorithm [
21] was used. RF was performed using the R package ‘randomForest’ (
http://cran.r-project.org/web/packages/randomForest/index.html) with the following settings: 1000 trees and 3 variables tried at each split. The results from RF and the conditional variable importance were verified via multiple random forest runs starting with different seeds and a sufficiently large number of trees to ensure robustness and stability of results [
58]. The commonly used importance measure from RF is the mean decrease in Gini. This value is directly derived from the ‘Gini index’ on the resulting RF trees. The RF classifier uses a splitting function called the ‘Gini index’ to determine which attribute to split on during the tree learning phase. The Gini index measures the level of impurity of the samples assigned to a node based on a split at its parent.
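To make the Gini index concrete, the sketch below shows, in Python and on invented toy labels, how the impurity of a node and the impurity decrease achieved by one split can be computed. It is only an illustration of the quantity that a random forest accumulates per variable over all splits and trees; it is not the implementation used by the R package, and the function names and labels are hypothetical.

```python
from collections import Counter


def gini_impurity(labels):
    """Gini impurity 1 - sum(p_k^2) of the class labels in a node."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


def gini_decrease(parent, left, right):
    """Impurity of the parent minus the size-weighted impurity of its children."""
    n = len(parent)
    weighted_children = (
        len(left) / n * gini_impurity(left)
        + len(right) / n * gini_impurity(right)
    )
    return gini_impurity(parent) - weighted_children


# Hypothetical node: four strains of pattern III and four of pattern I,
# split by presence/absence of a single (unnamed) VAG.
parent = ["III"] * 4 + ["I"] * 4
vag_present = ["III"] * 4   # all VAG-positive strains are pattern III
vag_absent = ["I"] * 4      # all VAG-negative strains are pattern I

print(gini_impurity(parent), gini_decrease(parent, vag_present, vag_absent))
```

A split that separates the classes perfectly, as in this toy case, removes all impurity of the parent node; a VAG that yields many such splits across the forest would rank high in mean decrease in Gini.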
Supporting data
The data sets supporting the results of this article are included within the article (and its additional files).