1 Institute of Applied Genetics, Department of Molecular and Medical Genetics, University of North Texas Health Science Center, Fort Worth, TX, USA
2 Center of Excellence in Genomic Medicine Research, King Abdulaziz University, Jeddah, Saudi Arabia

Whole-genome data are invaluable for large-scale comparative genomic studies. Current sequencing technologies have made it feasible to sequence entire bacterial genomes quickly and with relative ease, at a substantially reduced cost per nucleotide and hence per genome. More than 3,000 bacterial genomes have been sequenced and are available at the finished status. Publicly available genomes can be readily downloaded; however, it is challenging to verify the specific supporting data contained within the download and to identify errors and inconsistencies that may be present within the organizational data content and metadata. AutoCurE, an automated tool for bacterial genome database curation in Excel, was developed to facilitate local database curation of supporting data that accompany genomes downloaded from the National Center for Biotechnology Information. AutoCurE provides an automated approach to curating local genomic databases: it compares the downloaded supporting data to the genome reports and flags inconsistencies or errors, verifying genome name, RefSeq accession numbers, the presence of archaea, BioProject/UIDs, and sequence file descriptions. Flags are generated for nine metadata fields if there are inconsistencies between the downloaded genomes and the genome reports and if erroneous or missing data are evident. AutoCurE is an easy-to-use tool for curating local databases of large-scale genome data prior to downstream analyses.

Advancements in sequencing technologies in the past several years have resulted in a substantial increase in the number of bacterial genomes that have been and continue to be sequenced. The first complete bacterial genome was sequenced in 1995 (Fleischmann et al., 1995) and 24 microbial organisms were completely sequenced within the next 5 years (Nierman et al., 2000). Ten years later, in 2005, there were almost 300 prokaryote genomes sequenced (Fraser-Liggett, 2005), and as of May 2015 there were 34,066 bacterial genomes available at the complete (3,725), chromosome (773), scaffold (11,028), and contig (18,540) status as listed by the National Center for Biotechnology Information (NCBI). Integrated Microbial Genomes (IMG) (Markowitz et al., 2012) reported 26,033 bacterial genomes at the finished (3,378), draft (1,683), and permanent draft (20,972) status, and a total of 39,969 bacterial genome sequencing projects are listed in the Genomes OnLine Database (GOLD) (Reddy et al., 2015), an increase from only 1,986 in 2007. As a result of advancements in sequencing technologies, with increased output and decreased costs, the number of completed genomes will continue to rise, resulting in substantial amounts of data.

These whole bacterial genome sequence data are housed in publicly available databases such as NCBI (Benson et al., 2015), the European Molecular Biology Laboratory–European Bioinformatics Institute (EMBL–EBI) (Amid et al., 2012), and the DNA Data Bank of Japan (DDBJ) (Kodama et al., 2012), which make up the International Nucleotide Sequence Database Collaboration (INSDC) (Nakamura et al., 2013). Additional databases with more specific microbial applications and bioinformatics programs include IMG (Markowitz et al., 2012) and PATRIC (Pathosystems Resource Integration Center) (Wattam et al., 2014). Data can be readily downloaded from these databases through ftp sites or download links. The NCBI ftp site provides links to download all bacterial genomes in a number of file types. However, since these downloads include thousands of complete bacterial genomes, it is challenging to identify which genomes were included in the download and to determine whether all files and metadata associated with particular genomes were included and whether the supporting data were correct.

Quality control of supporting data within public databases is crucial to ensure accurate and up-to-date metadata; however, quality control practices and methods are not readily known or clearly stated. Inaccurate identifying information can confound downstream analyses and may cause misinterpretations; therefore, curation of metadata is necessary. High-quality databases are essential for research areas such as comparative genomics, phylogenetics, and metagenomics, especially as they apply to diagnostics, public health, biosafety and biosecurity, and microbial forensics. In this study, a local database was created that contained all publicly available complete bacterial genomes from the NCBI ftp site. Metadata inconsistencies were observed between the downloaded genomes and those listed as complete genomes on the genome reports from NCBI Genome.
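The kind of comparison described in the text, checking downloaded supporting data against a genome report and flagging disagreements, can be illustrated with a short sketch. This is not AutoCurE itself, which is an Excel-based tool whose internals are not given here. The `GenomeRecord` type, the `flag_inconsistencies` function, the three fields checked, and all sample values are hypothetical placeholders chosen for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GenomeRecord:
    # Hypothetical subset of the metadata fields; the real tool checks more.
    name: str
    refseq_accession: str
    bioproject: str


def flag_inconsistencies(downloaded, reported):
    """Return (accession, message) flags for disagreements between the
    downloaded records and the genome report, keyed on accession."""
    reported_by_acc = {r.refseq_accession: r for r in reported}
    downloaded_accs = {d.refseq_accession for d in downloaded}
    flags = []

    # Check every downloaded genome against its entry in the report.
    for rec in downloaded:
        ref = reported_by_acc.get(rec.refseq_accession)
        if ref is None:
            flags.append((rec.refseq_accession, "not in genome report"))
            continue
        if rec.name != ref.name:
            flags.append((rec.refseq_accession, "genome name mismatch"))
        if rec.bioproject != ref.bioproject:
            flags.append((rec.refseq_accession, "BioProject mismatch"))

    # Check for genomes listed in the report but absent from the download.
    for acc in reported_by_acc:
        if acc not in downloaded_accs:
            flags.append((acc, "missing from download"))
    return flags


# Made-up placeholder records, not real accessions or organisms.
downloaded = [
    GenomeRecord("Organism A", "ACC_0001", "PRJ_1"),
    GenomeRecord("Organism B strain 1", "ACC_0002", "PRJ_2"),
    GenomeRecord("Organism C", "ACC_0003", "PRJ_3"),
]
reported = [
    GenomeRecord("Organism A", "ACC_0001", "PRJ_1"),
    GenomeRecord("Organism B", "ACC_0002", "PRJ_2"),
    GenomeRecord("Organism D", "ACC_0004", "PRJ_4"),
]

flags = flag_inconsistencies(downloaded, reported)
for accession, message in flags:
    print(f"{accession}: {message}")
```

Keying both sides on the accession keeps the comparison independent of record order, and reporting flags rather than correcting values leaves the curation decision to the user, in line with the flagging approach the text describes.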