INNUENDO whole genome and core genome MLST schemas and datasets for Escherichia coli

  1. Rossi, Mirko 1
  2. Silva, Mickael Santos Da 2
  3. Ribeiro-Gonçalves, Bruno Filipe 2
  4. Silva, Diogo Nuno 2
  5. Machado, Miguel Paulo 2
  6. Oleastro, Mónica 3
  7. Borges, Vítor 3
  8. Isidro, Joana 3
  9. Viera, Luis 3
  10. Halkilahti, Jani 4
  11. Jaakkonen, Anniina 5
  12. Palma, Federica 6
  13. Salmenlinna, Saara 4
  14. Hakkinen, Marjaana 5
  15. Garaizar, Javier 7
  16. Bikandi, Joseba 7
  17. Hilbert, Friederike 8
  18. Carriço, João André 2
  1. 1 University of Helsinki
    Helsinki, Finland

    ROR https://ror.org/040af2s02

  2. 2 Universidade de Lisboa
    Lisbon, Portugal

    ROR https://ror.org/01c27hj86

  3. 3 Instituto Nacional de Saúde Dr. Ricardo Jorge
    Lisbon, Portugal

    ROR https://ror.org/03mx8d427

  4. 4 Terveyden ja hyvinvoinnin laitos
  5. 5 Elintarviketurvallisuusvirasto
  6. 6 ANSES
  7. 7 Universidad del País Vasco/Euskal Herriko Unibertsitatea
    Lejona, Spain

    ROR https://ror.org/000xsnr85

  8. 8 University of Veterinary Medicine Vienna
    Vienna, Austria

    ROR https://ror.org/01w6qp003

Publisher: Zenodo

Publication year: 2018

Type: Dataset

Abstract

<strong>Dataset</strong> As a reference dataset, 2,218 public draft or complete genome assemblies of <em>Escherichia coli</em>, together with the available metadata, were downloaded from EnteroBase in April 2017. Genomes were selected on the basis of the ribosomal ST (rST) classification available in EnteroBase: within each rST, genomes were randomly selected and downloaded. The number of samples for each rST in the final dataset is proportional to the number available in EnteroBase in April 2017. The dataset also includes 119 Shiga toxin-producing <em>E. coli</em> genomes from the INNUENDO Sequence Dataset (PRJEB27020), assembled with INNUca v3.1. File 'Metadata/Ecoli_metadata.txt' contains metadata for each strain, including source classification, host taxon, country and year of isolation, serotype, pathotype, classical pubMLST 7-gene ST classification, assembly source/method and EnteroBase barcode. The directory 'Genomes' contains the 119 INNUca v3.1 assemblies of the strains listed in 'Metadata/Ecoli_metadata.txt'. EnteroBase assemblies can be downloaded from http://enterobase.warwick.ac.uk/species/ecoli/search_strains using 'barcode'.

<strong>Schema creation and validation</strong> The wgMLST schema from EnteroBase was downloaded and curated using <em>chewBBACA AutoAlleleCDSCuration</em> to remove all alleles that are not coding sequences (CDS). The quality of the remaining loci was assessed using <em>chewBBACA Schema Evaluation</em>, and three groups of loci were removed: loci with a single allele, loci with high length variability (i.e. more than 1 allele outside the mode +/- 0.05 size), and loci present in less than 0.5% of the <em>Escherichia</em> genomes in EnteroBase at the date of the analysis (April 2017).
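The single-allele and length-variability criteria above can be illustrated with a short Python sketch. This is not the chewBBACA Schema Evaluation code. It assumes that "mode +/- 0.05 size" means a window of 5% of the modal allele length on either side, and that a locus is flagged when more than one allele falls outside that window. The function names `length_mode`, `has_high_length_variability` and `keep_locus` are hypothetical, and the third criterion (presence in less than 0.5% of EnteroBase genomes) is not covered because it needs data outside this repository.

```python
from collections import Counter


def length_mode(allele_lengths):
    """Return the most frequent allele length (smallest length wins ties)."""
    counts = Counter(allele_lengths)
    top = max(counts.values())
    return min(length for length, n in counts.items() if n == top)


def has_high_length_variability(allele_lengths, tolerance=0.05, max_outliers=1):
    """Flag a locus when more than `max_outliers` alleles lie outside
    mode * (1 - tolerance) .. mode * (1 + tolerance).

    Assumption: this is one reading of "mode +/- 0.05 size"; the actual
    chewBBACA implementation may differ in detail.
    """
    mode = length_mode(allele_lengths)
    low = mode * (1 - tolerance)
    high = mode * (1 + tolerance)
    outliers = sum(1 for length in allele_lengths if not low <= length <= high)
    return outliers > max_outliers


def keep_locus(allele_lengths):
    """Keep a locus only if it has more than one allele and its allele
    lengths are not highly variable."""
    if len(allele_lengths) <= 1:
        return False  # single-allele loci are removed
    return not has_high_length_variability(allele_lengths)
```

Under these assumptions, a locus with allele lengths 300, 300, 300, 303 and 297 would be kept, while one with lengths 300, 300, 300, 150 and 450 would be removed because two alleles fall outside the 285 to 315 window.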
The wgMLST schema was further curated by excluding all loci detected as “Repeated Loci” and loci annotated as “non-informative paralogous hit (NIPH/NIPHEM)” or “Allele Larger/Smaller than length mode (ALM/ASM)” by the <em>chewBBACA Allele Calling</em> engine in more than 1% of a dataset composed of 2,337 <em>Escherichia coli</em> genomes. File 'Schema/Ecoli_wgMLST_7601_schema.tar.gz' contains the wgMLST schema formatted for chewBBACA and includes a total of 7,601 loci. File 'Schema/Ecoli_cgMLST_2360_listGenes.txt' contains the list of genes from the wgMLST schema that define the cgMLST schema. The cgMLST schema consists of 2,360 loci, defined as the loci present in at least 99% of the 2,337 <em>Escherichia coli</em> genomes. Genomes have no more than 2% of missing loci. File 'Allele_Profles/Ecoli_wgMLST_alleleProfiles.tsv' contains the wgMLST allelic profiles of the 2,337 <em>Escherichia coli</em> genomes of the dataset. Please note that missing loci follow the annotation of the chewBBACA Allele Calling software. File 'Allele_Profles/Ecoli_cgMLST_alleleProfiles.tsv' contains the cgMLST allelic profiles of the 2,337 <em>Escherichia coli</em> genomes of the dataset. Please note that missing loci are indicated with a zero.

<strong>Additional citations</strong> The schemas are prepared to be used with <strong>chewBBACA</strong>. When using the schemas in this repository, please also cite: Silva M, Machado M, Silva D, Rossi M, Moran-Gilad J, Santos S, Ramirez M, Carriço J. chewBBACA: A complete suite for gene-by-gene schema creation and strain identification. 15/03/2018. M Gen 4(3): doi:10.1099/mgen.0.000166 http://mgen.microbiologyresearch.org/content/journal/mgen/10.1099/mgen.0.000166

The <em>Escherichia coli</em> schema is a derivation of the EnteroBase <em>E. coli</em> wgMLST schema.
When using the schema in this repository, please also cite: Alikhan N-F, Zhou Z, Sergeant MJ, Achtman M (2018) A genomic overview of the population structure of <em>Salmonella</em>. PLoS Genet 14(4): e1007261. https://doi.org/10.1371/journal.pgen.1007261
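For readers who want to see how a core set follows from allelic profiles, the Python sketch below applies the cgMLST definition from the schema section (loci present in at least 99% of genomes) to a tab-separated profile table. It is a minimal illustration, not the procedure used to build this dataset. It assumes the first column holds the genome identifier and each remaining column holds one locus. It also assumes that any call which is not a positive integer (a zero, or a text code such as LNF or NIPH) counts as missing, and that an `INF-` prefix marks a newly inferred allele that counts as present. `TOY_PROFILES`, `is_present` and `core_loci` are hypothetical names introduced here.

```python
import csv
import io

# Toy profile table (hypothetical data): 4 genomes x 3 loci.
TOY_PROFILES = (
    "FILE\tlocusA\tlocusB\tlocusC\n"
    "g1\t1\t2\t0\n"
    "g2\t1\tLNF\t3\n"
    "g3\tINF-4\t2\t0\n"
    "g4\t2\t5\tNIPH\n"
)


def is_present(call):
    """Return True when an allele call is a positive integer identifier.

    Assumption: an 'INF-' prefix marks an inferred allele and is stripped;
    zeros and text codes (e.g. LNF, NIPH) are treated as missing.
    """
    call = call.strip()
    if call.startswith("INF-"):
        call = call[4:]
    return call.isdigit() and int(call) > 0


def core_loci(tsv_text, min_presence=0.99):
    """List loci called in at least `min_presence` of the genomes."""
    reader = csv.reader(io.StringIO(tsv_text), delimiter="\t")
    loci = next(reader)[1:]  # header row minus the genome-ID column
    present = [0] * len(loci)
    n_genomes = 0
    for row in reader:
        if not row:
            continue  # skip blank lines
        n_genomes += 1
        for i, call in enumerate(row[1:]):
            if is_present(call):
                present[i] += 1
    return [
        locus
        for locus, n in zip(loci, present)
        if n >= min_presence * n_genomes
    ]
```

On the toy table, only `locusA` passes the default 99% threshold; lowering `min_presence` to 0.75 also admits `locusB`. To apply the same function to a real profile, the text of the TSV file would be read and passed in place of `TOY_PROFILES`.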