From contiguity to accuracy: Validation-centered perspectives on bacterial genome assembly

Minkyung Kim; Yong-Joon Cho; Ok-Sun Kim

doi:10.71150/jm.2604004

Review. Minkyung Kim¹, Yong-Joon Cho²,³,*, Ok-Sun Kim¹,*
Published online: June 19, 2026

¹Division of Life Sciences, Korea Polar Research Institute, Incheon 21990, Republic of Korea

²Department of Molecular Bioscience, Kangwon National University, Chuncheon 24341, Republic of Korea

³Multidimensional Genomics Research Center, Kangwon National University, Chuncheon 24341, Republic of Korea

*Correspondence Yong-Joon Cho yongjoon@kangwon.ac.kr Ok-Sun Kim oskim@kopri.re.kr

• Received: April 3, 2026 • Revised: May 7, 2026 • Accepted: May 12, 2026

This is an Open Access article distributed under the terms of the Creative Commons Attribution Non-Commercial License (http://creativecommons.org/licenses/by-nc/4.0) which permits unrestricted non-commercial use, distribution, and reproduction in any medium, provided the original work is properly cited.

ABSTRACT

Recent advances in sequencing technologies, particularly long-read platforms, have substantially improved contiguity of bacterial genome assemblies and enabled the routine generation of near-complete or circular genomes. However, achieving a contiguous assembly does not necessarily guarantee accuracy. Assembly errors, including structural misassemblies, collapsed repeats, incorrect circularization, plasmid reconstruction errors, and nucleotide-level inaccuracies, remain prevalent and may lead to misleading biological interpretations if not properly identified. In this review, we provide a comprehensive overview of bacterial genome assembly from a validation-centered perspective and examine the underlying causes of draft genome formation and assembly uncertainty, highlighting the roles of repetitive genomic structures, platform-specific error profiles, and algorithmic limitations. We further emphasize that the central challenge in contemporary bacterial genomics is no longer simply to maximize assembly contiguity, but to determine whether apparently complete genomes are truly correct and sufficiently reliable for their intended downstream applications. We propose a practical decision-making framework that links sequencing strategy, assembly workflow, polishing, and validation rigor, and introduce a tiered confidence classification to guide the interpretation of genome assembly reliability. As bacterial genome sequencing becomes increasingly routine and large-scale, future efforts should prioritize accuracy, reproducibility, transparent reporting, and evidence-supported validation over completeness alone.
Keywords: genome sequencing, genome assembly, assembly validation, assembly algorithms, genome accuracy, draft genome

Introduction

The initiation of large-scale genome sequencing projects in the early 1990s led to a rapid expansion of microbial genome sequencing efforts, thereby contributing substantially to the advancement of environmental microbiology (Bentley et al., 2003; Eiglmeier et al., 2001; Glaser et al., 1993). The introduction of next-generation sequencing (NGS) platforms, particularly Illumina, enabled the generation of large volumes of highly accurate short-read data (Metzker, 2010). However, the limited read length of these platforms has posed challenges in resolving repetitive and structurally complex genomic regions (Koren et al., 2013; Rhoads and Au, 2015). The development of third-generation sequencing technologies in the 2010s, such as Pacific Biosciences (PacBio) and Oxford Nanopore Technologies (ONT), has significantly improved genome assembly contiguity by producing long reads capable of spanning repetitive regions and structural variations (Koren et al., 2013; Reuter et al., 2015). Consequently, the number of complete bacterial genomes reported in public databases has increased substantially, with over 77,000 complete genomes currently available.

Despite these advances, an important question remains: can all reported complete circular genomes be considered accurate representations of the true genome structure? Genome assembly is a complex computational process that reconstructs chromosome-scale sequences from fragmented reads. In the absence of a suitable reference genome, de novo assembly is typically employed, and its accuracy is influenced by multiple factors, including sequencing technology, read length, sequencing depth, and assembly algorithms. With the rapid development of sequencing technologies, a wide range of genome assembly algorithms and software tools have been introduced; however, different assemblers can produce varying results even when applied to the same dataset (Johnson et al., 2023; Rojas-Miranda et al., 2025; Trisakul et al., 2024). These discrepancies may lead to differences in genome structure, contiguity, and the presence of assembly artifacts, raising concerns about the reliability of assembled genomes. Such uncertainties are particularly critical when genome assemblies are used for downstream analyses, including comparative genomics, functional annotation, and evolutionary studies.

To address these challenges, evaluation frameworks based on contiguity, completeness, and correctness (the 3C criteria) have been proposed to systematically assess assembly quality (Molina-Mora et al., 2020). Contiguity describes the degree of assembly fragmentation and is commonly assessed using metrics based on contig number and size, including maximum contig length, total assembly length, and N50. Completeness can be defined along two complementary dimensions: structural completeness, which refers to the reconstruction of an entire chromosome into a single circular contig, and gene content completeness, typically evaluated using marker gene-based approaches such as BUSCO or CheckM. Notably, these dimensions do not necessarily coincide. Correctness, often discussed as accuracy in assembly evaluation frameworks, encompasses both structural accuracy, reflecting the absence of misassemblies such as inversions or rearrangements, and base-level accuracy, defined by the absence of nucleotide substitutions and indels. Together, these criteria aim not only at structural completeness but at a “truly complete genome” that satisfies all 3C criteria and accurately reflects the underlying biological sequence.
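
As an illustration of the contiguity metrics named above, the following minimal Python sketch computes contig count, total length, maximum contig length, and N50 from a list of contig lengths. The function names (`n50`, `contiguity_summary`) and the example lengths are hypothetical and are not taken from any cited tool; N50 is implemented under its common definition as the length of the shortest contig in the smallest set of longest contigs that together cover at least half of the assembly.

```python
def n50(contig_lengths):
    """Return N50: the contig length at which the cumulative sum of
    contigs (sorted longest first) reaches half of the total length."""
    if not contig_lengths:
        raise ValueError("need at least one contig")
    total = sum(contig_lengths)
    running = 0
    for length in sorted(contig_lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length


def contiguity_summary(contig_lengths):
    """Collect the basic contiguity metrics for one assembly."""
    return {
        "n_contigs": len(contig_lengths),
        "total_length": sum(contig_lengths),
        "max_contig": max(contig_lengths),
        "n50": n50(contig_lengths),
    }


# Hypothetical fragmented draft assembly (lengths in kb for readability).
print(contiguity_summary([80, 70, 50, 40, 30, 20]))
# {'n_contigs': 6, 'total_length': 290, 'max_contig': 80, 'n50': 70}
```

A single circular contig maximizes each of these values regardless of whether its sequence is right, which is why contiguity metrics cannot by themselves indicate correctness.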

Existing reviews have largely focused on sequencing platforms, general assembly algorithms, or broad benchmarking of assembly tools. In contrast, this review specifically addresses bacterial genome assembly from the perspective of validation, emphasizing the gap between apparent completeness and actual correctness. Given that long-read sequencing, which improves assembly contiguity, has become a primary approach for genome assembly, this review places greater emphasis on long-read and hybrid assembly strategies. We discuss how repetitive genomic structures, sequencing biases, and assembler-specific behaviors contribute to draft genome formation and assembly errors, and we highlight why these issues remain important even in highly contiguous long-read assemblies. We further argue that, as bacterial genome sequencing rapidly expands and assembled genomes are increasingly used in comparative, functional, and surveillance studies, rigorous validation is becoming more urgent rather than less. By integrating current knowledge on assembly challenges, tool variability, and validation strategies, this review provides a framework for evaluating the reliability of bacterial genome assemblies beyond contiguity alone.

Structural Ambiguity Arising from Complex Genome Architecture

Advances in high-throughput sequencing technologies have led to a rapid expansion of genomic data, with approximately 3 million bacterial genomes currently available in public databases (NCBI, 2026) (Fig. 1A and 1B). These advancements have improved read length, accuracy, and throughput, thereby enabling the generation of increasingly contiguous genome assemblies (Loman and Pallen, 2015). However, as of 2025, the number of contig-level assemblies exceeds that of complete genomes by approximately 32-fold (Fig. 1B). Long-read sequencing generally reduces contig fragmentation; however, draft assemblies remain common even when long-read data are used (Fig. 1C and 1D). Recent studies have shown that assembly breakpoints frequently occur at repetitive regions, and that some of these repeats, particularly long and highly similar sequences, can lead to misassemblies even in genomes previously reported as complete (Acuña-Amador et al., 2018).

Repetitive sequences vary widely in length, ranging from short dinucleotide repeats to long segments spanning several kilobases (Treangen et al., 2009). Historically, genomes containing repeat regions longer than the typical size of rRNA operons (~5–7 kb) were difficult to assemble into complete genomes using earlier sequencing technologies (Koren et al., 2013). With the advancement of third-generation sequencing platforms, increasingly long repeat structures have been reported, including inverted repeats exceeding tens of kilobases, such as those identified in Lactobacillus species (Colombini et al., 2025; El Kafsi et al., 2017). These findings highlight that assembly difficulty is not uniform across species but is strongly influenced by genome architecture.

Although long reads can span repetitive regions, when multiple repeat copies share nearly identical sequences (e.g., > 99–99.9% identity), reads traversing these regions may lack sufficient sequence divergence to uniquely assign their genomic origin (Fig. 2, case 1). Such regions therefore collapse into a single node in the assembly graph, generating ambiguous connections between distinct genomic contexts and leading to incorrect path reconstruction. Tandem repeats comprising two or more nearly identical copies are frequently reduced to a single representation during assembly (Fig. 2, case 2). As a consequence, assemblies can appear circular and complete yet be shorter than the true genome, containing hidden structural errors that are not detectable using standard contiguity metrics alone. When repetitive regions exceed the effective read length or lack sufficient unique flanking sequences, assembly algorithms often fail to resolve their correct genomic placement, resulting in fragmentation or structural misassemblies (Fig. 2, case 3; Waters et al., 2025). Genomes containing large prophage insertions, genomic islands, or multiple highly similar rRNA operons therefore remain challenging to resolve with respect to 3C criteria, particularly in terms of structural correctness.
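
The spanning condition described above can be made concrete with a small sketch. It assumes that read placements are known as 0-based, half-open (start, end) intervals on the genome, and it treats a read as able to resolve a repeat only if it covers the whole repeat plus a unique anchor on both sides. The function names and the 500 bp default anchor are illustrative assumptions, not values recommended by the cited studies.

```python
def spans_repeat(read_start, read_end, repeat_start, repeat_end, min_anchor=500):
    """True if the read covers the repeat and at least `min_anchor` bp
    of flanking sequence on each side (0-based, half-open coordinates)."""
    return (read_start <= repeat_start - min_anchor
            and read_end >= repeat_end + min_anchor)


def count_spanning_reads(reads, repeat, min_anchor=500):
    """Count reads, given as (start, end) tuples, that fully span `repeat`."""
    repeat_start, repeat_end = repeat
    return sum(
        spans_repeat(start, end, repeat_start, repeat_end, min_anchor)
        for start, end in reads
    )


# Hypothetical 6 kb repeat (rRNA-operon scale) and three read placements.
repeat = (10_000, 16_000)
reads = [(9_000, 17_000), (9_800, 17_000), (2_000, 12_000)]
print(count_spanning_reads(reads, repeat))  # 1: only the first read has both anchors
```

Even a spanning read only helps when its flanks are themselves unique; for near-identical repeat copies embedded in similar contexts, the count above is an upper bound on informative reads.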

Plasmids represent inherently dynamic genetic elements that vary in copy number, size, and distribution across cells, often contributing to heteroresistance and population-level genomic heterogeneity (Shintani et al., 2015). These biological characteristics can lead to uneven representation of plasmid sequences and introduce ambiguity in assembly graphs. Insertion sequences (IS elements), transposons, and antimicrobial resistance cassettes are frequently shared between chromosomal and plasmid contexts, such that reads derived from these regions cannot be uniquely assigned to a specific replicon (Carattoli et al., 2014). Moreover, homologous IS elements and transposons shared among plasmids can give rise to chimeric plasmid contigs or misassemblies, resulting in underestimation of plasmid number or structural complexity. In addition to sequence homology, variation in plasmid copy number presents a further challenge. Low-copy or single-copy plasmids often display coverage similar to that of the chromosome, increasing the likelihood of being obscured by background signal or fragmented due to insufficient read support (Antipov et al., 2016a). This challenge is further exacerbated when multiple plasmids with different copy numbers coexist within the same cell, complicating accurate resolution and separation.
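
The coverage relationship between plasmids and the chromosome can be sketched as a simple depth ratio. The example below assumes that mean read depth per replicon has already been computed; the replicon names and depth values are hypothetical, and the ratio is only a rough proxy for copy number because it ignores library preparation and size-selection biases.

```python
def relative_copy_number(mean_depths, chromosome="chromosome"):
    """Estimate copy number of each replicon relative to the chromosome
    as the ratio of mean read depths."""
    chrom_depth = mean_depths[chromosome]
    if chrom_depth <= 0:
        raise ValueError("chromosome depth must be positive")
    return {
        name: depth / chrom_depth
        for name, depth in mean_depths.items()
        if name != chromosome
    }


# Hypothetical mean depths for one isolate.
depths = {"chromosome": 60.0, "plasmid_A": 60.0, "plasmid_B": 900.0}
print(relative_copy_number(depths))
# {'plasmid_A': 1.0, 'plasmid_B': 15.0}
```

In this example, plasmid_A is indistinguishable from the chromosome by depth alone, which is the situation in which low-copy plasmids are most easily overlooked or merged into chromosomal contigs.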

These limitations collectively give rise to the so-called “false completeness” problem, whereby assemblies appear structurally complete despite containing unresolved errors that are not captured by contiguity-based metrics. Such ambiguities can directly affect genome size estimation and lead to inconsistencies among assemblies generated using different computational approaches.

Limitations of Long-Read Sequencing Accuracy

PacBio high-fidelity (HiFi) and ONT long-read sequencing represent two dominant contemporary platforms, offering a fundamental trade-off between base-level accuracy and maximum read length. In 2019, PacBio introduced HiFi sequencing, which addressed several limitations of earlier long-read technologies and has since become a widely adopted platform for modern genome sequencing. PacBio HiFi reads are generated through circular consensus sequencing (CCS), in which multiple passes over the same DNA molecule are combined to produce highly accurate consensus sequences. As a result, read lengths are typically constrained to approximately 10–25 kb, but the reads achieve high accuracy, exceeding Q30 (i.e., > 99.9% per-base accuracy) (Hon et al., 2020; Travers et al., 2010). In contrast, ONT sequencing analyzes native DNA molecules in a single pass and can therefore generate ultra-long reads that frequently exceed 100 kb, thereby improving structural resolution across complex genomic regions, albeit with higher residual error rates (Peng et al., 2025).

These platforms also differ in their characteristic error profiles. PacBio sequencing errors primarily arise from misinterpretation of fluorescence signals and temporal variation in polymerase kinetics. In HiFi sequencing, these errors are largely random and are substantially reduced during CCS consensus generation, resulting in error rates below 1% (Wenger et al., 2019). Nevertheless, homopolymeric regions remain a persistent challenge, as accurately determining the exact number of consecutive identical bases across multiple passes is still difficult, often leading to short insertion-deletion errors (Hu et al., 2024a). ONT sequencing, which infers nucleotide identity from ionic current signals, exhibits even greater difficulty in homopolymeric tracts than PacBio (Bouras et al., 2024b; Chiou et al., 2023). Accuracy declines markedly when homopolymers exceed five bases, and the profile of substitution errors can further vary with GC content and base modifications, including methylation (Delahaye and Nicolas, 2021).

These differences have important implications for genome assembly. The shorter but highly accurate HiFi reads are generally more effective for resolving small-scale sequence variation and reducing base-level errors, whereas the substantially longer ONT reads provide superior ability to span large repetitive regions and structural variants. Nevertheless, residual errors from both platforms can compromise repeat resolution and lead to misassemblies, particularly in regions with high sequence similarity (Phillippy et al., 2008; Treangen and Salzberg, 2011). The absolute number of errors can be substantial: in a genome the size of Escherichia coli (~4.6 Mb), even a 0.5% error rate corresponds to approximately 23,000 incorrect bases. Indel errors, even at low frequency, can introduce frameshifts or premature stop codons that compromise protein-coding gene prediction and functional annotation, resulting in inaccurate estimates of both gene content and functional potential. This highlights the importance of systematic assembly validation and, when necessary, manual curation (Watson and Warr, 2019).
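
The arithmetic behind the E. coli example can be reproduced with a short sketch that converts Phred-scaled quality to a per-base error probability (p = 10^(-Q/10)) and multiplies by genome size. The function names are illustrative, and the calculation assumes errors are uniform and independent across the genome, which real sequencing errors are not.

```python
def phred_to_error_rate(q):
    """Per-base error probability for a Phred quality score Q."""
    return 10 ** (-q / 10)


def expected_errors(genome_size_bp, error_rate):
    """Expected number of erroneous bases under a uniform error rate."""
    return genome_size_bp * error_rate


genome = 4_600_000  # approximate E. coli genome size used in the text
for label, rate in [
    ("0.5% error rate", 0.005),
    ("Q30", phred_to_error_rate(30)),
    ("Q50", phred_to_error_rate(50)),
]:
    print(f"{label}: ~{expected_errors(genome, rate):,.0f} expected errors")
# 0.5% error rate: ~23,000 expected errors
# Q30: ~4,600 expected errors
# Q50: ~46 expected errors
```

The comparison shows why a quality level that sounds high per base can still leave thousands of errors genome-wide, and why each 10-point gain in Q reduces the expected count tenfold.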

Challenges and Limitations of Genome Assembly Algorithms

Assembly algorithms play a central role in determining the structure and quality of the reconstructed genome, and their inherent limitations constrain the accuracy and completeness of assembled genomes (Espinosa et al., 2023; Merda et al., 2024; Rojas-Miranda et al., 2025; Tizabi et al., 2022). These algorithms can be broadly classified into overlap-based approaches and graph-based approaches, each optimized for different sequencing technologies and error profiles. The overlap–layout–consensus (OLC) methods reconstruct genomes by identifying pairwise overlaps between reads and generating a consensus sequence, making them well suited for long-read data (Miller et al., 2010; Myers, 2005). However, high sequencing error rates complicate accurate overlap detection and can lead to misassemblies, which represents a fundamental challenge in reconstructing genome structure. In contrast, de Bruijn graph (DBG)-based methods decompose sequencing reads into shorter k-mers and construct graphs based on exact sequence matches, enabling efficient assembly of short-read data (Bankevich et al., 2012; Simpson et al., 2009; Zerbino and Birney, 2008). However, the reliance on fixed k-mer sizes limits the resolution of repetitive regions and increases sensitivity to uneven sequencing coverage, often resulting in fragmented and ambiguous assemblies (Nagarajan and Pop, 2013). Although extensions such as string graphs, repeat graphs, and fuzzy de Bruijn graphs have been developed to address these limitations, highly identical repeats and complex genomic structures remain difficult to resolve (Jain, 2023; Kolmogorov et al., 2019; Myers, 2005; Ruan and Li, 2020).
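
The effect of k-mer size on repeat resolution can be seen in a toy de Bruijn graph. The sketch below is a deliberately simplified illustration, not the implementation of any cited assembler: it ignores reverse complements, coverage, and error correction, and the two input sequences are hypothetical fragments that share the repeat ACGT in different flanking contexts.

```python
from collections import defaultdict


def de_bruijn_graph(reads, k):
    """Build a de Bruijn graph whose nodes are (k-1)-mers and whose
    edges are the k-mers observed in the reads."""
    graph = defaultdict(set)
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            graph[kmer[:-1]].add(kmer[1:])
    return graph


def branching_nodes(graph):
    """Nodes with more than one successor, i.e. ambiguous continuations."""
    return sorted(node for node, successors in graph.items() if len(successors) > 1)


# Two genomic contexts sharing the 4 bp repeat ACGT.
reads = ["TTACGTGG", "CCACGTAA"]

print(branching_nodes(de_bruijn_graph(reads, k=4)))  # ['CGT']: the repeat collapses
print(branching_nodes(de_bruijn_graph(reads, k=7)))  # []: k-mers now span the repeat
```

With k = 4 both contexts pass through the same nodes and the graph forks on leaving the repeat; with k = 7 each k-mer carries enough flanking sequence to keep the two paths separate. A repeat longer than any usable k stays collapsed, which is the limitation the text describes.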

The diversity of assembly tools reflects fundamental differences in algorithmic design (Table 1). Early assemblers developed for Sanger or short-read sequencing data, including DBG-based tools such as Velvet, SOAPdenovo, and SPAdes, are optimized for handling large volumes of short, high-accuracy reads (Li et al., 2010; Peng et al., 2010; Zerbino and Birney, 2008). Long-read-based assemblers adopt OLC-based strategies to leverage long reads spanning repetitive regions, thereby improving assembly contiguity (Chen et al., 2021a; Koren et al., 2017; Li, 2016; Vaser and Šikić, 2021). Long-read assembly has reduced the fragmentation typical of short-read assemblies, yet substantial inter-assembler variability persists even when identical long-read datasets are used. Small plasmids are frequently underrepresented or entirely missed in assemblies generated from long-read data. Although this limitation is partly attributable to biases introduced during library preparation (Wick et al., 2021b), the present review focuses specifically on differences among assemblers. Substantial variability has been reported for small plasmid recovery across long-read assemblers (Flye, 67–79%; Miniasm, 64%; Raven, 39%), whereas the hybrid assembler Unicycler, which integrates short-read data, achieved complete recovery (Johnson et al., 2023). Additionally, in a benchmark of eight long-read assemblers for prokaryotic genomes, Flye and Canu were generally reliable but differed in circularization behavior (Wick and Holt, 2021). Flye (and Raven) frequently produced assemblies with terminal sequence truncation, whereas Canu (and NextDenovo) often retained terminal overlaps, leading to artificial sequence duplication at circular boundaries. Miniasm/Minipolish most consistently achieved clean circularization, whereas NextDenovo/NextPolish performed well for chromosome completion but poorly for plasmid recovery.
These differences arise from fundamental distinctions in algorithmic design rather than incidental implementation details (Trisakul et al., 2024). Long-read assemblers differ in how they correct noisy reads, represent assembly graphs, resolve repeats, derive consensus sequences, and handle circular replicons (Amarasinghe et al., 2020; Wick and Holt, 2021). Because long-read data still retain substantial indel-heavy and homopolymer-associated errors, assembly quality remains strongly influenced by the interaction between read error profiles and tool-specific correction and polishing strategies. Thus, high contiguity does not necessarily reflect superior base-level accuracy, nor does chromosomal completion ensure correct circularization or complete recovery of all replicons.
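
The terminal-overlap artifact described above, in which the start of a circular replicon is repeated at the end of the contig, can be detected in principle by comparing contig ends. The following sketch is a toy version under strong assumptions: it finds only exact prefix/suffix matches, runs in quadratic time, and uses a hypothetical 30 bp sequence. Real contigs would require error-tolerant alignment of the ends, and the function names are illustrative rather than drawn from any cited tool.

```python
def terminal_overlap(seq, min_len=20):
    """Length of the longest suffix of `seq` that exactly matches its
    prefix (at most half the contig), or 0 if shorter than `min_len`."""
    for n in range(len(seq) // 2, min_len - 1, -1):
        if seq[:n] == seq[-n:]:
            return n
    return 0


def trim_terminal_overlap(seq, min_len=20):
    """Remove the duplicated end so the circular sequence is represented once."""
    n = terminal_overlap(seq, min_len)
    return seq[:-n] if n else seq


# Hypothetical replicon, with its first 10 bases duplicated at the contig end.
replicon = "ATGCGTACGTTAGCCTAGGATCCGATTACG"
contig = replicon + replicon[:10]

print(terminal_overlap(contig, min_len=5))                    # 10
print(trim_terminal_overlap(contig, min_len=5) == replicon)   # True
```

The opposite artifact, terminal truncation, cannot be detected from the contig alone, because the missing bases leave no trace in the sequence; it requires evidence from reads that bridge the junction.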

To further mitigate the limitations of individual sequencing technologies, hybrid assembly approaches have been developed to integrate short and long reads within a single workflow. These approaches can be broadly categorized into algorithm-integrated assemblers and pipeline-based frameworks. Algorithm-integrated methods, such as hybridSPAdes, extend DBG-based assembly by incorporating long-read information directly into the graph structure (Antipov et al., 2016b). In contrast, pipeline-based approaches, including Unicycler, Hybracter, and WENGAN, combine multiple assemblers and post-processing steps such as scaffolding and polishing (Di Genova et al., 2021; Wick et al., 2017). Unicycler uses SPAdes as its core short-read assembler and integrates additional steps, including long-read bridging, graph simplification, and polishing, to generate complete assemblies (Wick et al., 2017). Hybracter instead implements a long-read-first assembly framework in which long reads are initially assembled using Flye, followed by iterative polishing with long-read (Medaka) and short-read tools (Polypolish and Pypolca) (Bouras et al., 2024a). In benchmarking analyses, Hybracter in hybrid mode produced near-zero error rates, with median counts of 0 single nucleotide variants (SNVs) and 0 small indels, compared to substantially higher error rates observed in Unicycler assemblies (median 34 SNVs and 11 indels). In addition to improvements in chromosome-level accuracy, Hybracter incorporates a dedicated plasmid assembly module, Plassembler, enabling more complete and accurate recovery of plasmid sequences. While hybrid assembly strategies substantially improve genome accuracy, the integration of multiple tools and polishing steps introduces dependencies on sequencing depth and data quality, which can influence the final assembly outcome.
Furthermore, certain polishing tools may introduce errors under specific conditions, and therefore hybrid approaches do not guarantee completely error-free assemblies.

Additionally, tools implementing consensus-based approaches, such as Trycycler and Autocycler, have been developed to integrate multiple independent long-read assemblies and reduce tool-specific stochastic variation (Wick et al., 2021a, 2025). However, when assemblers share similar algorithmic assumptions or error profiles, their outputs can converge on the same incorrect structure, allowing systematic errors to persist in the final consensus. In practice, Trycycler also requires manual intervention to resolve irreconcilable graph structures, which limits scalability and introduces operator-dependent variability (Wick et al., 2021a). Therefore, consensus-based assembly should not be considered a substitute for validation, but rather as a complementary strategy that addresses inter-assembler variability without independently confirming biological accuracy.

Polishing and Validation Approaches for Genome Assembly

As the importance of genome sequence accuracy and assembly reliability has become increasingly recognized, a variety of tools have been developed to improve genome assemblies through polishing, assess assembly quality, and resolve assembly-related errors (Tables 2 and 3). The performance of polishing tools varies substantially depending on their underlying algorithms, data integration strategies, and computational requirements. A comparative analysis of parameterized polishing tools applied to ONT assemblies from nine bacterial genomes recommended Polypolish-careful alone under conditions of extremely low sequencing depth (< 5×) or when minimizing false-positive corrections is a primary concern, and Pypolca-careful for single-nucleotide polishing in all other scenarios (Bouras et al., 2024b). In contrast, other tools (e.g., Medaka, NextPolish, and Pilon) were reported to introduce additional errors under suboptimal conditions, particularly at low sequencing depths. Complementary studies further indicate that long-read–based polishing tools, such as Racon, Medaka, and DeepPolisher, improve consensus accuracy by leveraging long-read data; however, their performance generally remains inferior to that of hybrid polishing approaches integrating both short- and long-read datasets (Lee et al., 2021). Consistent with these observations, tests of 132 combinations of assembly and polishing tools demonstrated that polishing performance is largely determined by tool combinations and pipeline design rather than by the choice of a single tool (Luan et al., 2024). Notably, the order of polishing steps is critical, with the best-performing pipeline applying long-read polishing using Medaka followed by short-read polishing with tools such as NextPolish.
These findings collectively indicate that achieving high-confidence genome assemblies requires not only careful selection and combination of tools but also independent validation to ensure that residual errors are minimized.

Deep learning-based approaches, including Medaka and DeepPolisher, leverage patterns in read alignments to improve base-level accuracy, effectively resolving sequencing errors in many contexts. However, given that polishing tools are primarily designed for base-level correction and offer limited capacity for resolving structural errors, independent validation of assembly structure is essential. Among available strategies, read mapping is one of the most widely used approaches for assessing structural consistency. By aligning raw sequencing reads back to the assembled genome, it is possible to identify discrepancies such as mismatches, coverage gaps, and abnormal coverage patterns that may indicate assembly errors (Bzikadze et al., 2022; Firtina et al., 2020; Li et al., 2023). Regions with unusually high or low coverage can reflect repeats, duplications, or missing sequences (Gao et al., 2019). However, such signals may also arise from biological features or sequencing biases, and therefore require careful interpretation (Delahaye and Nicolas, 2021; Gunasekera et al., 2021). When a closely related reference genome is available, alignment-based approaches such as QUAST and Assemblytics provide effective means of assessing structural accuracy (Gurevich et al., 2013; Nattestad and Schatz, 2016). Comparative analysis of assembled genomes against reference sequences enables the identification of large-scale structural discrepancies, including inversions, translocations, insertions, and deletions, as well as inconsistencies that are not captured by contiguity metrics alone. However, reference-based validation is susceptible to bias when the reference genome contains errors or differs substantially from the target genome (Sousa et al., 2019). Additionally, evolutionary divergence is often misinterpreted as structural variation, necessitating careful interpretation. 
In this context, graph-based inspection has also emerged as a valuable strategy for identifying unresolved or ambiguous regions. Tools such as Bandage and gfatools enable visualization and interrogation of assembly graphs, allowing direct observation of branching structures, bubbles, and cycles that reflect uncertainty in the assembly (Marijon et al., 2019; Wick et al., 2015). As such, graph inspection can provide critical insights into regions requiring further validation or manual curation.
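
The coverage-based screening described above can be sketched as a windowed comparison against the genome-wide median depth. The example assumes per-base depth values are already available as a list (for instance, parsed from the output of a tool such as `samtools depth -a`); the window size and the 0.5× and 1.5× cut-offs are arbitrary illustrative choices rather than published thresholds.

```python
from statistics import median


def flag_coverage_windows(depths, window=1000, low=0.5, high=1.5):
    """Return (start, end, mean_depth, label) for windows whose mean depth
    falls below `low` or above `high` times the overall median depth."""
    if not depths:
        return []
    baseline = median(depths)
    if baseline <= 0:
        raise ValueError("median depth must be positive")
    flagged = []
    for start in range(0, len(depths), window):
        chunk = depths[start:start + window]
        mean_depth = sum(chunk) / len(chunk)
        ratio = mean_depth / baseline
        if ratio < low:
            label = "low"    # possible missing or misjoined sequence
        elif ratio > high:
            label = "high"   # possible collapsed repeat or duplication
        else:
            continue
        flagged.append((start, start + len(chunk), mean_depth, label))
    return flagged


# Hypothetical 10 kb contig: one doubled-depth and one low-depth region.
depths = [50] * 3000 + [100] * 1000 + [50] * 3000 + [5] * 1000 + [50] * 2000
print(flag_coverage_windows(depths))
# [(3000, 4000, 100.0, 'high'), (7000, 8000, 5.0, 'low')]
```

Flagged windows are candidates for inspection rather than confirmed errors, since the same signal can come from multi-copy elements, replication-associated depth gradients, or sequencing bias.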

In addition to these approaches, comprehensive assessment of assembly quality requires evaluation of completeness, structural consistency, and biological plausibility. Assembly completeness and contamination represent fundamental aspects of genome quality assessment. Marker gene–based approaches, such as those implemented in BUSCO and CheckM, are widely used to estimate completeness by assessing the presence of conserved single-copy genes (Parks et al., 2015; Simão et al., 2015). However, this approach relies on the assumption that conserved single-copy genes are stably maintained across lineages; violations of this assumption in genomes shaped by extensive horizontal gene transfer or lineage-specific gene expansions can result in over- or underestimation of completeness. Additionally, the predefined marker gene sets used by these tools do not uniformly represent all bacterial clades, which can introduce lineage-dependent biases in the completeness estimates. Therefore, the completeness and contamination indicators derived from these approaches should be interpreted in the context of the evolutionary background of the organism and in combination with complementary lines of evidence.

Importantly, reliance on a single validation strategy can lead to false confidence in assembly correctness, particularly in repeat-rich or structurally complex genomes. Although polishing effectively corrects small-scale nucleotide errors, it does not resolve structural inaccuracies. Similarly, mapping-based approaches often fail to detect errors in repetitive regions, and the resulting signals do not always clearly distinguish between biological variation and assembly artifacts. In particular, some assembly errors persist even after multiple rounds of validation and polishing, especially in complex genomic regions. These limitations underscore the necessity of applying multiple complementary validation strategies to achieve high-confidence genome assemblies. A minimum validation checklist should therefore include: (i) read remapping to evaluate coverage uniformity and identify potential misassemblies, (ii) cross-assembler comparison or consensus-based approaches to assess structural consistency, (iii) inspection of assembly graphs to resolve ambiguous repeat junctions, and (iv) marker gene-based evaluation of completeness and contamination. While these approaches provide a robust foundation for validation, defining universally applicable thresholds remains challenging due to variability in sequencing technologies, genome complexity, and dataset-specific characteristics. Nevertheless, practical benchmarks can serve as general guidance. For example, base-level accuracy is often considered high when quality values (QV) approach or exceed ~50, whereas marker gene-based completeness above 95% and contamination below 2% are commonly used as indicative criteria for high-quality bacterial genomes. However, such thresholds should be interpreted cautiously and in context, rather than as absolute indicators of correctness.
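
The indicative benchmarks mentioned above can be expressed as a simple check. In this sketch, the consensus quality value is derived from an error count as QV = -10·log10(errors / bases), and the three cut-offs (QV 50, 95% completeness, 2% contamination) are taken from the text but applied as hard thresholds, which is a simplifying assumption; the function names and example numbers are hypothetical.

```python
import math


def consensus_qv(n_errors, n_bases):
    """Phred-scaled consensus quality from an error count.
    Returns infinity when no errors were detected."""
    if n_bases <= 0:
        raise ValueError("n_bases must be positive")
    if n_errors == 0:
        return math.inf
    return -10 * math.log10(n_errors / n_bases)


def check_indicative_thresholds(qv, completeness_pct, contamination_pct,
                                min_qv=50.0, min_completeness=95.0,
                                max_contamination=2.0):
    """Report which of the indicative quality criteria an assembly meets."""
    return {
        "base_accuracy": qv >= min_qv,
        "completeness": completeness_pct > min_completeness,
        "contamination": contamination_pct < max_contamination,
    }


# Hypothetical assembly: 10 residual errors in a 4.6 Mb genome.
qv = consensus_qv(10, 4_600_000)
print(round(qv, 1))  # 56.6
print(check_indicative_thresholds(qv, completeness_pct=99.2, contamination_pct=0.4))
```

Passing all three checks says nothing about structural correctness, so a result of this kind covers only item (iv) and the base-level aspect of the checklist.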

Building on these considerations, we propose a tiered validation framework (bronze, silver, and gold) to facilitate practical assessment of genome assembly reliability based on validation rigor rather than sequencing technology alone. Assemblies classified as bronze represent those generated using a single assembler and subjected to polishing and basic statistical evaluation (e.g., contiguity metrics and marker gene completeness), but lacking thorough validation of structural and base-level accuracy. Assemblies classified as silver include additional validation steps, such as cross-assembler comparison, read mapping-based coverage assessment, and graph-based inspection of structurally ambiguous regions, thereby providing improved confidence in structural integrity. Finally, gold-level assemblies represent the highest-confidence genomes, for which all validation steps have been performed, including independent reference comparison and/or manual curation to verify both structural correctness and base-level accuracy. Importantly, this classification framework is intended to reflect the depth and confidence of validation rather than the sequencing strategy itself.
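
One possible encoding of this tiered classification is sketched below. It is an interpretation rather than a formal specification: the field names are hypothetical, silver is assumed to require all three of the listed additional steps (the text presents them as examples), and an "unclassified" label is introduced here for assemblies that do not meet the bronze description.

```python
from dataclasses import dataclass


@dataclass
class ValidationRecord:
    """Which validation steps were performed for an assembly."""
    polished: bool = False
    basic_stats: bool = False            # contiguity metrics, marker gene completeness
    cross_assembler: bool = False
    read_mapping: bool = False
    graph_inspection: bool = False
    reference_comparison: bool = False
    manual_curation: bool = False


def confidence_tier(rec):
    """Map performed validation steps to bronze, silver, or gold."""
    bronze = rec.polished and rec.basic_stats
    silver = (bronze and rec.cross_assembler
              and rec.read_mapping and rec.graph_inspection)
    gold = silver and (rec.reference_comparison or rec.manual_curation)
    if gold:
        return "gold"
    if silver:
        return "silver"
    if bronze:
        return "bronze"
    return "unclassified"  # assumption: not defined in the framework itself


record = ValidationRecord(polished=True, basic_stats=True,
                          cross_assembler=True, read_mapping=True,
                          graph_inspection=True)
print(confidence_tier(record))  # silver
```

Note that the record contains no field for sequencing platform, mirroring the point that the tier reflects validation depth rather than the technology used.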

A Practical Framework for High-Confidence Complete Genomes

Obtaining complete genome assemblies requires not only appropriate sequencing technologies but also well-designed experimental procedures and careful selection of computational tools. We propose an integrated framework that encompasses multiple stages, from sample preparation to iterative validation, to achieve reliable genome reconstruction (Fig. 3). However, because multiple variables such as genome complexity and research objectives can influence sequencing and assembly strategies, it should be regarded as a conditional decision-making guide rather than a fixed rule.

A critical first step is the extraction of high-quality, high-molecular-weight DNA. In bacterial genome studies, the use of cultures derived from a single colony is generally recommended to minimize genomic heterogeneity (Wick et al., 2023). Fragmentation of DNA during extraction can limit the effective read length, particularly for long-read sequencing technologies, thereby reducing the ability to resolve repetitive genomic regions. To minimize DNA shearing, harsh mechanical handling such as excessive vortexing, pipetting, and repeated freeze–thaw cycles should be avoided (Branton and Deamer, 2019). In addition, when performing hybrid assembly using multiple sequencing platforms, it is preferable to use a single DNA extract to maintain consistency across datasets (Wick et al., 2023). Therefore, the use of intact, high-quality DNA is essential for achieving highly contiguous assemblies.

Sequencing design and assembly strategies also play a crucial role, and the appropriate choice depends on the research objective. For applications focused primarily on specific gene detection or comparative genomic analyses, Illumina-only sequencing is generally sufficient. Such studies do not aim to achieve contiguity-level completeness, and the issue of false completeness discussed in this review is therefore not applicable to these workflows. To achieve a complete genome assembly with high contiguity and accuracy, two complementary long-read–based strategies are commonly employed. The first is a hybrid approach combining ONT long reads with Illumina short reads, in which ONT provides the structural backbone for resolving long repetitive regions and Illumina reads are used to correct residual base-level errors. The second is a PacBio HiFi-based approach, in which the inherently high per-read accuracy of HiFi reads can substantially reduce the requirement for short-read polishing. ONT can produce ultra-long reads that resolve exceptionally long repetitive regions, while HiFi offers higher per-read accuracy that simplifies polishing requirements. Both strategies may encounter difficulties in recovering small plasmids, and supplementation with short-read data is generally beneficial in such cases. Consistent with the validation-centered perspective of this review, we recommend that any long-read–based assembly, including HiFi-only assemblies, undergo systematic validation; where feasible, orthogonal short-read data can provide additional support for residual base-level correction and small plasmid recovery. The choice between these two workflows depends on platform accessibility, cost, and the length and complexity of repetitive regions to be resolved.

The optimal sequencing depth depends on genome characteristics and the sequencing platform (Lerminiaux et al., 2024); however, excessively high coverage does not improve assembly quality and can reduce performance due to increased complexity and error accumulation (Rojas-Miranda et al., 2025; Wick et al., 2023). In Illumina-only sequencing, increasing coverage beyond high depths (e.g., >100×) does not lead to further improvements in assembly outcomes. For short-read polishing, most of the benefit is achieved at approximately 25× coverage, with minimal additional gains at higher depths (Bouras et al., 2024b). Long-read sequencing has been shown to produce high-quality assemblies even at approximately 30–40× coverage using assemblers such as Flye; however, when aiming for near-complete or highly accurate genome reconstruction, substantially higher coverage, typically in the range of 100× to 200×, should be considered (Kolmogorov et al., 2019; Wick et al., 2023). In a hybrid sequencing approach, genome assembly typically involves comparative evaluation of multiple assemblers and is commonly performed by constructing a structural backbone from ONT data using assemblers such as Flye, followed by error correction with Illumina data using tools such as NextPolish.
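
The depth figures above translate into simple arithmetic when planning a run or deciding whether to downsample. The Python sketch below is a minimal illustration with hypothetical function names. It treats depth as total read bases divided by an estimated genome size and ignores unmapped reads and plasmid copy number.

```python
import math


def mean_depth(total_read_bases: int, genome_size_bp: int) -> float:
    """Approximate mean sequencing depth as total read bases over genome size."""
    if genome_size_bp <= 0:
        raise ValueError("genome_size_bp must be positive")
    return total_read_bases / genome_size_bp


def bases_for_depth(target_depth: float, genome_size_bp: int) -> int:
    """Read bases needed to reach a target depth for a genome of the given size."""
    return math.ceil(target_depth * genome_size_bp)


def subsample_fraction(current_depth: float, target_depth: float) -> float:
    """Fraction of reads to keep so that depth drops to the target.

    Returns 1.0 when the data set is already at or below the target.
    """
    if current_depth <= 0:
        raise ValueError("current_depth must be positive")
    return min(1.0, target_depth / current_depth)


if __name__ == "__main__":
    genome_size = 5_000_000            # assumed 5 Mb bacterial genome
    long_read_bases = 1_500_000_000    # assumed 1.5 Gb of long reads
    depth = mean_depth(long_read_bases, genome_size)
    print(depth)
    print(subsample_fraction(depth, 100.0))
    print(bases_for_depth(25, genome_size))
```

For the assumed 5 Mb genome, 25× short-read coverage for polishing corresponds to 125 Mb of reads, while 1.5 Gb of long reads gives 300×, so keeping about one third of the reads would bring the data set to 100×.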

Following assembly, a quantitative overview of assembly contiguity is obtained by comparison with a reference genome, using basic assembly statistics such as total genome size, number of contigs, and N50 values. Although N50 is widely used to summarize contig length distribution and is often interpreted as an indicator of assembly quality (Mäkinen et al., 2012), it does not directly reflect assembly accuracy. Assemblies with high N50 values can still contain misassemblies, including incorrectly merged contigs or collapsed repeat regions (Thrash et al., 2020). Therefore, assembly statistics should be regarded as technical indicators rather than definitive measures of accuracy. Subsequently, raw reads are mapped back to the assembly to assess coverage uniformity across the genome. This step enables the identification of sequence discrepancies and coverage gaps through manual inspection. Contigs exhibiting coverage substantially lower than the genome-wide average (e.g., less than half of the mean coverage) are indicative of potential contamination, whereas plasmid sequences typically display elevated coverage. In addition, graph-based visualization is employed to evaluate structural consistency and identify ambiguous regions within the assembly. Inspection of assembly graphs (e.g., GFA files) is essential to confirm that no structurally inconsistent or unsupported connections have been introduced during assembly. Finally, tools such as QUAST, CheckM, and Circlator can be employed for assembly error detection and for assessing completeness and contamination, thereby supporting the generation of a high-confidence genome (Table S1).
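
Two of the checks described above, N50 and the per-contig coverage screen, can be sketched in a few lines of Python. The function names are hypothetical, and the inputs are assumed to be contig lengths and per-contig mean depths already computed from a read mapping. The lower ratio of 0.5 follows the example in the text, whereas the upper ratio of 1.5 for "elevated" coverage is an arbitrary assumption.

```python
from typing import Dict, Sequence


def n50(contig_lengths: Sequence[int]) -> int:
    """Return the N50: the length L such that contigs of length >= L
    together cover at least half of the total assembly length."""
    if not contig_lengths:
        raise ValueError("at least one contig length is required")
    half_total = sum(contig_lengths) / 2
    running = 0
    for length in sorted(contig_lengths, reverse=True):
        running += length
        if running >= half_total:
            return length
    raise RuntimeError("unreachable: cumulative length always reaches the total")


def flag_contigs_by_depth(
    depth: Dict[str, float],
    length: Dict[str, int],
    low_ratio: float = 0.5,    # example value from the text
    high_ratio: float = 1.5,   # assumption, not from the text
) -> Dict[str, str]:
    """Label each contig 'low', 'high', or 'typical' relative to the
    length-weighted mean depth across all contigs."""
    total_length = sum(length[name] for name in depth)
    weighted = sum(depth[name] * length[name] for name in depth)
    genome_mean = weighted / total_length
    if genome_mean <= 0:
        raise ValueError("mean depth must be positive")
    flags = {}
    for name, contig_depth in depth.items():
        ratio = contig_depth / genome_mean
        if ratio < low_ratio:
            flags[name] = "low"       # candidate contamination; inspect manually
        elif ratio > high_ratio:
            flags[name] = "high"      # consistent with a multi-copy plasmid or collapsed repeat
        else:
            flags[name] = "typical"
    return flags


if __name__ == "__main__":
    lengths = {"chromosome": 5_000_000, "contig_2": 50_000, "contig_3": 20_000}
    depths = {"chromosome": 100.0, "contig_2": 400.0, "contig_3": 20.0}
    print(n50(list(lengths.values())))
    print(flag_contigs_by_depth(depths, lengths))
```

The labels are prompts for manual inspection, not classifications. The N50 of this toy assembly equals the chromosome length, which illustrates why a high N50 alone says nothing about whether the small contigs are plasmids, contaminants, or misassemblies.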

If structural inconsistencies are detected at any stage, reassembly is required. No universally applicable tool-based workflow currently ensures the generation of complete genomes. In cases where circularity cannot be achieved despite the use of multiple assemblers, reporting a validated draft genome is preferable to artificially increasing contiguity through parameter adjustment. Where appropriate, a complete genome may be reconstructed by ordering and orienting contigs based on alignment to a completed reference genome, although this process necessitates extensive manual curation. Genome assembly remains an inherently iterative process that requires repeated rounds of validation and refinement. As discussed above, validation approaches—including read mapping, graph inspection, reference-based comparison, and completeness assessment—should be applied in combination to evaluate both structural integrity and nucleotide-level accuracy. Although these strategies do not guarantee the generation of complete genomes in all cases, they substantially increase the likelihood of achieving high-quality assemblies. Therefore, careful evaluation and iterative improvement are critical for obtaining reliable genome sequences suitable for downstream analyses.

Future Perspectives

Future progress will depend not only on improved sequencing and assembly algorithms, but also on the development of scalable, standardized, and transparent validation frameworks that enable researchers to assess both structural integrity and base-level accuracy. In this context, emerging computational approaches are beginning to extend the scope of assembly validation. Recent advances in machine learning and deep learning already influence genome assembly workflows, particularly in basecalling, polishing, and variant detection. These approaches offer the potential to improve assembly validation by integrating multiple sources of evidence, such as read mapping patterns, coverage profiles, and graph structures, to identify inconsistencies that may not be captured by conventional metrics. AI-driven approaches could also extend beyond base-level correction to support structural validation. For example, models trained on large-scale genome databases could, in principle, learn characteristic patterns of genome organization, including repeat structure, gene synteny, and breakpoint consistency, and identify deviations that may indicate assembly artifacts. However, such approaches remain largely unexplored and are likely to be limited by the availability of representative reference data and the diversity of genome architectures. Moreover, true biological structural variation and assembly artifacts can produce similar signals in sequencing and assembly data, so distinguishing between them is expected to continue to require orthogonal validation strategies. As these limitations are addressed, machine learning-based validation frameworks are likely to become an important complement to existing validation strategies.

Acknowledgments

This work was supported by the Korea Polar Research Institute (KOPRI) [PE26100] and by the National Research Foundation of Korea (NRF) grant funded by the Korea government (MSIT) (RS-2026-25478655 and RS-2026-25491179). This work was also supported by Global-Learning & Academic research institution for Master, PhD students, and Postdocs (G-LAMP) Program (RS-2023-00301850).

Conflict of Interest

The authors declare no competing interests.

Supplementary Information

The online version contains supplementary material available at https://doi.org/10.71150/jm.2604004

Table S1.

Comparative assessment of validation capabilities across genome assembly evaluation tools

Fig. 1.

Overview of bacterial genome assemblies and sequencing platforms. (A) Number of bacterial genomes by assembly level deposited in NCBI from 1995 to 2025, with the y-axis shown on a logarithmic scale. (B) Distribution of bacterial genomes by assembly level in 2025. (C) Proportion of sequencing platforms (Illumina, PacBio, and ONT) used for bacterial genome assemblies across different assembly levels. (D) Distribution of contig counts according to sequencing platform for genomes at the contig assembly level.

Fig. 2.

Mechanism of repeat-induced assembly ambiguity and misassembly in long-read sequencing. Case 1. Highly identical repeats. Two repeat copies (R1 and R2) with near-identical sequences (~99.9% identity) are flanked by distinct genomic regions. Although reads may partially span these regions, the lack of sufficient sequence divergence prevents unique assignment of reads to specific repeat copies, resulting in ambiguous assembly paths and potential misassembly. Case 2. Tandem repeat structures. Multiple consecutive repeat units arranged in tandem can lead to uncertainty in repeat copy number during assembly. Depending on the algorithm and supporting evidence, repeat arrays may be collapsed into fewer copies, producing assemblies that underestimate the true genomic length. Case 3. Read length distribution effects. Even when ultra-long reads are present, only a fraction of reads may exceed the length of long repeat regions (e.g., 20 kb). When the majority of reads are shorter than the repeat, insufficient spanning evidence can lead to repeat collapse or fragmentation at repeat boundaries.

Fig. 3.

Workflow for generating a high-confidence complete bacterial genome. Schematic overview of genome assembly and validation. High-molecular-weight DNA is subjected to hybrid sequencing using short- and long-read platforms. Assembly is performed using either an ONT-based hybrid approach, in which ONT structural reconstruction is followed by short-read polishing, or a PacBio HiFi-based approach. Validation includes evaluation of assembly statistics (genome size, plasmids, N50), read mapping for coverage assessment, assembly graph inspection for structural ambiguity, completeness and contamination assessment (e.g., CheckM, BUSCO), and comparative analysis with reference genomes (e.g., dnaA alignment) to detect structural discrepancies such as inversions and rearrangements.

Table 1.

De novo assembly tools used in bacterial genome assembly

Tool	Input	Core engine	Note	Reference
ARACHNE	Short (Sanger)	OLC	Efficient scaling; primarily suited for large eukaryotic genomes	Batzoglou et al. (2002)
PCAP	Short (Sanger)	OLC (Sanger-era)	Parallel Sanger-era assembler; large-genome oriented	Huang et al. (2003)
Newbler	Short (454)	OLC	Widely used in early bacterial genome assembly	Margulies et al. (2005)
MIRA	Short	OLC	Supports mapping-based assembly and polishing	Chevreux (2005)
PHRAP	Short (Sanger)	OLC (Sanger-era)	Sanger-era shotgun assembler	de la Bastide and McCombie (2007)
EULER-SR	Short (454)	DBG	A-Bruijn–inspired DBG	Chaisson and Pevzner (2008)
Velvet	Short	DBG	Early short-read DBG assembler	Zerbino and Birney (2008)
ALLPATHS-LG	Short	DBG	Optional long-read integration for gap filling	MacCallum et al. (2009)
IDBA	Short	Iterative DBG	Handles uneven coverage	Peng et al. (2010)
SOAPdenovo	Short	DBG	Scaffold-oriented; widely used for large genomes	Li et al. (2010)
SOAPdenovo2	Short	DBG	Improved memory efficiency and accuracy	Luo et al. (2012)
Minia version 3	Short	Compacted DBG	Unitig-based; evolved from Bloom filter–based approach	Salikhov et al. (2013)
MEGAHIT^*	Short	Succinct DBG	Optimized for metagenome assembly	Li et al. (2015)
SPAdes^*	Short	DBG	Multisized and paired-end integration	Prjibelski et al. (2020)
HybridSPAdes^*	Hybrid	DBG	Multisized, paired-end, and long-read integration	Antipov et al. (2016b)
FALCON	Long	String graph	Diploid-aware; optimized for complex eukaryotic genomes	Chin et al. (2016)
Miniasm	Long	OLC	Consensus-free; requires polishing	Li (2016)
HINGE	Long	OLC (repeat-aware)	Improves repeat resolution using hinge-based graph construction	Kamath et al. (2017)
Canu^*	Long	OLC	Designed for noisy long reads	Koren et al. (2017)
Flye^*	Long	Repeat graph	Robust to complex repeats	Kolmogorov et al. (2019)
HiCanu^*	Long (HiFi)	OLC	Improved accuracy and repeat resolution	Nurk et al. (2020)
wtdbg2^*	Long	Fuzzy Bruijn graph	Fast and memory-efficient; designed for noisy long reads	Ruan and Li (2020)
Shasta^*	Long (ONT)	Marker graph	Fast and memory-efficient	Shafin et al. (2020)
Raven^*	Long	OLC	Optimized for long uncorrected reads; fast and memory-efficient	Vaser and Šikić (2021)
Hifiasm^*	Long (HiFi)	String graph	High accuracy and repeat resolution	Cheng et al. (2021)
NECAT^*	Long (ONT)	OLC	Efficient assembly of noisy long reads	Chen et al. (2021a)
SMARTdenovo	Long	OLC	No error correction	Liu et al. (2021)
NextDenovo^*	Long	OLC	Improved accuracy	Hu et al. (2024b)

^*Tools marked with an asterisk are actively maintained and commonly used in contemporary bacterial genome assembly pipelines.

Table 2.

Genome polishing and error correction tools

Tool	Category	Input	Core method	Key function	Reference
Pilon	Short-read polishing	Illumina reads	Read mapping	Error correction of SNPs and small indels	Walker et al. (2014)
Polypolish	Short-read polishing	Illumina reads	Multi-mapping	Repeat-aware error correction	Wick and Holt (2022)
Pypolca	Short-read polishing	Illumina reads	Read mapping	Error correction with threshold-based variant filtering	Bouras et al. (2024b)
Racon	Long-read polishing	Long reads	Read mapping	Consensus-based error correction	Vaser et al. (2017)
NeuralPolish	Long-read polishing	Long reads	Deep learning	Improved base accuracy using neural networks	Huang et al. (2021)
DeepPolisher	Long-read polishing	Long reads	Deep learning	Deep learning–based error correction	Mastoras et al. (2025)
Medaka	Long-read polishing	ONT reads	Deep learning	Signal-aware error correction	Medaka (2018)
NextPolish	Hybrid polishing	Short + long reads	Iterative polishing	Multi-platform error correction	Hu et al. (2020)

Table 3.

Tools for genome assembly validation and quality assessment

Tool	Category	Input	Core method	Key function	Reference
REAPR	Read-based	Short reads + assembly	Read mapping	Error detection	Hunt et al. (2013)
Inspector	Read-based	Long reads	Read mapping	Structural and local error detection with correction	Chen et al. (2021b)
QUAST	Reference-based	Assembly + reference	Whole-genome alignment	Assembly quality assessment and misassembly detection	Gurevich et al. (2013)
Assemblytics	Reference-based	Assembly + reference	Whole-genome alignment	Structural variation detection	Nattestad and Schatz (2016)
CheckM	Completeness	Assembly	Lineage-specific marker genes	Completeness + contamination	Parks et al. (2015)
BUSCO	Completeness	Assembly	Marker genes (Single-copy orthologs)	Completeness assessment	Seppey et al. (2019)
KAT	k-mer based	Reads + assembly	k-mer comparison	Coverage bias, duplication detection	Mapleson et al. (2017)
Merqury	k-mer based	Reads + assembly	k-mer spectrum	Accuracy (QV) + completeness	Rhie et al. (2020)
Bandage	Graph-based	Assembly graph	Visualization	Graph inspection	Wick et al. (2015)
gfatools	Graph-based	GFA	Graph parsing	Structural analysis	Pani et al. (2024)
Circlator	Structural	Assembly	Overlap detection	Circularization validation	Hunt et al. (2015)
MOB-suite	Plasmid	Assembly	Database + typing	Plasmid reconstruction	Robertson and Nash (2018)

References

Acuña-Amador L, Primot A, Cadieu E, Roulet A, Barloy-Hubler F. 2018. Genomic repeats, misassembly and reannotation: A case study with long-read resequencing of Porphyromonas gingivalis reference strains. BMC Genomics. 19: 54.
Amarasinghe SL, Su S, Dong X, Zappia L, Ritchie ME, et al. 2020. Opportunities and challenges in long-read sequencing data analysis. Genome Biol. 21: 30.
Antipov D, Hartwick N, Shen M, Raiko M, Lapidus A, et al. 2016a. plasmidSPAdes: Assembling plasmids from whole genome sequencing data. Bioinformatics. 32: 3380–3387.
Antipov D, Korobeynikov A, McLean JS, Pevzner PA. 2016b. hybridSPAdes: An algorithm for hybrid assembly of short and long reads. Bioinformatics. 32: 1009–1015.
Bankevich A, Nurk S, Antipov D, Gurevich AA, Dvorkin M, et al. 2012. SPAdes: A new genome assembly algorithm and its applications to single-cell sequencing. J Comput Biol. 19: 455–477.
Batzoglou S, Jaffe DB, Stanley K, Butler J, Gnerre S, et al. 2002. ARACHNE: A whole-genome shotgun assembler. Genome Res. 12: 177–189.
Bentley SD, Maiwald M, Murphy LD, Pallen MJ, Yeats CA, et al. 2003. Sequencing and analysis of the genome of the Whipple's disease bacterium Tropheryma whipplei. Lancet. 361: 637–644.
Bouras G, Houtak G, Wick RR, Mallawaarachchi V, Roach MJ, et al. 2024a. Hybracter: Enabling scalable, automated, complete and accurate bacterial genome assemblies. Microb Genom. 10: 001244.
Bouras G, Judd LM, Edwards RA, Vreugde S, Stinear TP, et al. 2024b. How low can you go? Short-read polishing of Oxford Nanopore bacterial genome assemblies. Microb Genom. 10: 001254.
Branton D, Deamer DW. 2019. Nanopore sequencing: An introduction. World Scientific Publishing.
Bzikadze AV, Mikheenko A, Pevzner PA. 2022. Fast and accurate mapping of long reads to complete genome assemblies with VerityMap. Genome Res. 32: 2107–2118.
Carattoli A, Zankari E, García-Fernández A, Voldby Larsen M, Lund O, et al. 2014. In silico detection and typing of plasmids using PlasmidFinder and plasmid multilocus sequence typing. Antimicrob Agents Chemother. 58: 3895–3903.
Chaisson MJ, Pevzner PA. 2008. Short read fragment assembly of bacterial genomes. Genome Res. 18: 324–330.
Chen Y, Nie F, Xie SQ, Zheng YF, Dai Q, et al. 2021a. Efficient assembly of nanopore reads via highly accurate and intact error correction. Nat Commun. 12: 60.
Chen Y, Zhang Y, Wang AY, Gao M, Chong Z. 2021b. Accurate long-read de novo assembly evaluation with Inspector. Genome Biol. 22: 312.
Cheng H, Concepcion GT, Feng X, Zhang H, Li H. 2021. Haplotype-resolved de novo assembly using phased assembly graphs with hifiasm. Nat Methods. 18: 170–175.
Chevreux B. 2005. Ph.D. thesis. MIRA: An automated genome and EST assembler. The Ruprecht-Karls-University, Heidelberg, Germany.
Chin CS, Peluso P, Sedlazeck FJ, Nattestad M, Concepcion GT, et al. 2016. Phased diploid genome assembly with single-molecule real-time sequencing. Nat Methods. 13: 1050–1054.
Chiou CS, Chen BH, Wang YW, Kuo NT, Chang CH, et al. 2023. Correcting modification-mediated errors in nanopore sequencing by nucleotide demodification and reference-based correction. Commun Biol. 6: 1215.
Colombini L, Santoro F, Tirziu M, Cuppone AM, Pozzi G, et al. 2025. A 69.9-kb long inverted repeat increases genome instability in a strain of Lactobacillus crispatus. NAR Genom Bioinform. 7: lqaf085.
de la Bastide M, McCombie WR. 2007. Assembling genomic DNA sequences with PHRAP. Curr Protoc Bioinformatics. 17: 11.4.1–11.4.15.
Delahaye C, Nicolas J. 2021. Sequencing DNA with nanopores: Troubles and biases. PLoS One. 16: e0257521.
Di Genova A, Buena-Atienza E, Ossowski S, Sagot MF. 2021. Efficient hybrid de novo assembly of human genomes with WENGAN. Nat Biotechnol. 39: 422–430.
Eiglmeier K, Parkhill J, Honoré N, Garnier T, Tekaia F, et al. 2001. The decaying genome of Mycobacterium leprae. Lepr Rev. 72: 387–398.
El Kafsi H, Loux V, Mariadassou M, Blin C, Chiapello H, et al. 2017. Unprecedented large inverted repeats at the replication terminus of circular bacterial chromosomes suggest a novel mode of chromosome rescue. Sci Rep. 7: 44331.
Espinosa E, Bautista R, Fernandez I, Larrosa R, Zapata EL, et al. 2023. Comparing assembly strategies for third-generation sequencing technologies across different genomes. Genomics. 115: 110700.
Firtina C, Kim JS, Alser M, Senol Cali D, Cicek AE, et al. 2020. Apollo: A sequencing-technology-independent, scalable and accurate assembly polishing algorithm. Bioinformatics. 36: 3669–3679.
Gao S, Tran Q, Phan V. 2019. Understand effective coverage by mapped reads using genome repeat complexity. Proceedings of 11th International Conference on Bioinformatics and Computational Biology, BiCOB. 65–73. Available from https://digitalcommons.memphis.edu/facpubs/3302.
Glaser P, Kunst F, Arnaud M, Coudart MP, Gonzales W, et al. 1993. Bacillus subtilis genome project: Cloning and sequencing of the 97 kb region from 325 degrees to 333 degrees. Mol Microbiol. 10: 371–384.
Gunasekera S, Abraham S, Stegger M, Pang S, Wang P, et al. 2021. Evaluating coverage bias in next-generation sequencing of Escherichia coli. PLoS One. 16: e0253440.
Gurevich A, Saveliev V, Vyahhi N, Tesler G. 2013. QUAST: Quality assessment tool for genome assemblies. Bioinformatics. 29: 1072–1075.
Hon T, Mars K, Young G, Tsai YC, Karalius JW, et al. 2020. Highly accurate long-read HiFi sequencing data for five complex genomes. Sci Data. 7: 399.
Hu J, Fan J, Sun Z, Liu S. 2020. NextPolish: A fast and efficient genome polishing tool for long-read assembly. Bioinformatics. 36: 2253–2255.
Hu J, Wang Z, Liang F, Liu SL, Ye K, et al. 2024a. NextPolish2: A repeat-aware polishing tool for genomes assembled using HiFi long reads. Genomics Proteomics Bioinformatics. 22: qzad009.
Hu J, Wang Z, Sun Z, Hu B, Ayoola AO, et al. 2024b. NextDenovo: An efficient error correction and accurate assembly tool for noisy long reads. Genome Biol. 25: 107.
Huang N, Nie F, Ni P, Luo F, Gao X, et al. 2021. NeuralPolish: A novel nanopore polishing method based on alignment matrix construction and orthogonal Bi-GRU networks. Bioinformatics. 37: 3120–3127.
Huang X, Wang J, Aluru S, Yang SP, Hillier L. 2003. PCAP: A whole-genome assembly program. Genome Res. 13: 2164–2170.
Hunt M, Kikuchi T, Sanders M, Newbold C, Berriman M, et al. 2013. REAPR: A universal tool for genome assembly evaluation. Genome Biol. 14: R47.
Hunt M, Silva ND, Otto TD, Parkhill J, Keane JA, et al. 2015. Circlator: Automated circularization of genome assemblies using long sequencing reads. Genome Biol. 16: 294.
Jain C. 2023. Coverage-preserving sparsification of overlap graphs for long-read assembly. Bioinformatics. 39: btad124.
Johnson J, Soehnlen M, Blankenship HM. 2023. Long read genome assemblers struggle with small plasmids. Microb Genom. 9: mgen001024.
Kamath GM, Shomorony I, Xia F, Courtade TA, Tse DN. 2017. HINGE: Long-read assembly achieves optimal repeat resolution. Genome Res. 27: 747–756.
Kolmogorov M, Yuan J, Lin Y, Pevzner PA. 2019. Assembly of long, error-prone reads using repeat graphs. Nat Biotechnol. 37: 540–546.
Koren S, Harhay GP, Smith TP, Bono JL, Harhay DM, et al. 2013. Reducing assembly complexity of microbial genomes with single-molecule sequencing. Genome Biol. 14: R101.
Koren S, Walenz BP, Berlin K, Miller JR, Bergman NH, et al. 2017. Canu: Scalable and accurate long-read assembly via adaptive k-mer weighting and repeat separation. Genome Res. 27: 722–736.
Lee JY, Kong M, Oh J, Lim J, Chung SH, et al. 2021. Comparative evaluation of nanopore polishing tools for microbial genome assembly and polishing strategies for downstream analysis. Sci Rep. 11: 20740.
Lerminiaux N, Fakharuddin K, Mulvey MR, Mataseje L. 2024. Do we still need Illumina sequencing data? Evaluating Oxford Nanopore Technologies R10.4.1 flow cells and the Rapid v14 library prep kit for Gram negative bacteria whole genome assemblies. Can J Microbiol. 70: 178–189.
Li H. 2016. Minimap and miniasm: Fast mapping and de novo assembly for noisy long sequences. Bioinformatics. 32: 2103–2110.
Li D, Liu CM, Luo R, Sadakane K, Lam TW. 2015. MEGAHIT: An ultra-fast single-node solution for large and complex metagenomics assembly via succinct de Bruijn graph. Bioinformatics. 31: 1674–1676.
Li K, Xu P, Wang J, Yi X, Jiao Y. 2023. Identification of errors in draft genome assemblies at single-nucleotide resolution for quality assessment and improvement. Nat Commun. 14: 6556.
Li R, Zhu H, Ruan J, Qian W, Fang X, et al. 2010. De novo assembly of human genomes with massively parallel short read sequencing. Genome Res. 20: 265–272.
Liu H, Wu S, Li A, Ruan J. 2021. SMARTdenovo: A de novo assembler using long noisy reads. GigaByte. 2021: gigabyte15.
Loman NJ, Pallen MJ. 2015. Twenty years of bacterial genome sequencing. Nat Rev Microbiol. 13: 787–794.
Luan T, Commichaux S, Hoffmann M, Jayeola V, Jang JH, et al. 2024. Benchmarking short and long read polishing tools for nanopore assemblies: Achieving near-perfect genomes for outbreak isolates. BMC Genomics. 25: 679.
Luo R, Liu B, Xie Y, Li Z, Huang W, et al. 2012. SOAPdenovo2: An empirically improved memory-efficient short-read de novo assembler. Gigascience. 1: 18.
MacCallum I, Przybylski D, Gnerre S, Burton J, Shlyakhter I, et al. 2009. ALLPATHS 2: Small genomes assembled accurately and with high continuity from short paired reads. Genome Biol. 10: R103.
Mäkinen V, Salmela L, Ylinen J. 2012. Normalized N50 assembly metric using gap-restricted co-linear chaining. BMC Bioinformatics. 13: 255.
Mapleson D, Garcia Accinelli G, Kettleborough G, Wright J, Clavijo BJ. 2017. KAT: A K-mer analysis toolkit to quality control NGS datasets and genome assemblies. Bioinformatics. 33: 574–576.
Margulies M, Egholm M, Altman WE, Attiya S, Bader JS, et al. 2005. Genome sequencing in microfabricated high-density picolitre reactors. Nature. 437: 376–380.
Marijon P, Chikhi R, Varré JS. 2019. Graph analysis of fragmented long-read bacterial genome assemblies. Bioinformatics. 35: 4239–4246.
Mastoras M, Asri M, Brambrink L, Hebbar P, Kolesnikov A, et al. 2025. Highly accurate assembly polishing with DeepPolisher. Genome Res. 35: 1595–1608.
Medaka. 2018. Sequence correction provided by ONT Research. Available from https://github.com/nanoporetech/medaka (accessed April 2026).
Merda D, Vila-Nova M, Bonis M, Boutigny AL, Brauge T, et al. 2024. Unraveling the impact of genome assembly on bacterial typing: A one health perspective. BMC Genomics. 25: 1059.
Metzker M. 2010. Sequencing technologies — the next generation. Nat Rev Genet. 11: 31–46.
Miller JR, Koren S, Sutton G. 2010. Assembly algorithms for next-generation sequencing data. Genomics. 95: 315–327.
Molina-Mora JA, Campos-Sánchez R, Rodríguez C, Shi L, García F. 2020. High quality 3C de novo assembly and annotation of a multidrug resistant ST-111 Pseudomonas aeruginosa genome: Benchmark of hybrid and non-hybrid assemblers. Sci Rep. 10: 1392.
Myers EW. 2005. The fragment assembly string graph. Bioinformatics. 21: ii79–ii85.
Nagarajan N, Pop M. 2013. Sequence assembly demystified. Nat Rev Genet. 14: 157–167.
Nattestad M, Schatz MC. 2016. Assemblytics: a web analytics tool for the detection of variants from an assembly. Bioinformatics. 32: 3021–3023.
NCBI. Genome datasets. 2026. Available from https://www.ncbi.nlm.nih.gov/datasets/genome/ (accessed April 2026).
Nurk S, Walenz BP, Rhie A, Vollger MR, Logsdon GA, et al. 2020. HiCanu: Accurate assembly of segmental duplications, satellites, and allelic variants from high-fidelity long reads. Genome Res. 30: 1291–1305.
Pani S, Dabbaghie F, Marschall T, Söylev A. 2024. A toolkit for analyzing and manipulating pangenome alignments. bioRxiv. doi: https://doi.org/10.1101/2024.12.10.627813.
Parks DH, Imelfort M, Skennerton CT, Hugenholtz P, Tyson GW. 2015. CheckM: Assessing the quality of microbial genomes recovered from isolates, single cells, and metagenomes. Genome Res. 25: 1043–1055.
Peng Y, Leung HCM, Yiu SM, Chin FYL. 2010. IDBA—A practical iterative de Bruijn graph de novo assembler. In Berger B. (ed.), Research in computational molecular biology, vol. 6044, pp. 426–440. Springer.
Peng K, Li C, Wang Q, Xin X, Wang Z, et al. 2025. The applications and advantages of nanopore sequencing in bacterial antimicrobial resistance surveillance and research. NPJ Antimicrob Resist. 3: 87.
Phillippy AM, Schatz MC, Pop M. 2008. Genome assembly forensics: Finding the elusive mis-assembly. Genome Biol. 9: R55.
Prjibelski A, Antipov D, Meleshko D, Lapidus A, Korobeynikov A. 2020. Using SPAdes de novo assembler. Curr Protoc Bioinformatics. 70: e102.
Reuter JA, Spacek DV, Snyder MP. 2015. High-throughput sequencing technologies. Mol Cell. 58: 586–597.
Rhie A, Walenz BP, Koren S, Phillippy AM. 2020. Merqury: Reference-free quality, completeness, and phasing assessment for genome assemblies. Genome Biol. 21: 245.Article PubMed PMC PDF
Rhoads A, Au KF. 2015. PacBio sequencing and its applications. Genom Proteom Bioinform. 13: 278–289. Article PubMed PMC PDF
Robertson J, Nash JHE. 2018. MOB-suite: Software tools for clustering, reconstruction and typing of plasmids from draft assemblies. Microb Genom. 4: e000206. Article PubMed PMC
Rojas-Miranda H, Madrigal-Ly V, Molina-Mora JA. 2025. Benchmarking genome assemblers for four bacterial models based on contiguity, correctness, and completeness. Sci Rep. 15: 42858.Article PubMed PMC PDF
Ruan J, Li H. 2020. Fast and accurate long-read assembly with wtdbg2. Nat Methods. 17: 155–158. Article PubMed PDF
Salikhov K, Sacomoto G, Kucherov G. 2013. Using cascading Bloom filters to improve the memory usage for de Bruijn graphs. Algorithms Mol Biol. 9: 2.Article
Seppey M, Manni M, Zdobnov EM. 2019. BUSCO: Assessing genome assembly and annotation completeness. Methods Mol Biol. 1962: 227–245. Article PubMed
Shafin K, Pesout T, Lorig-Roach R, Haukness M, Olsen HE, et al. 2020. Nanopore sequencing and the Shasta toolkit enable efficient de novo assembly of eleven human genomes. Nat Biotechnol. 38: 1044–1053. Article PubMed PDF
Shintani M, Sanchez ZK, Kimbara K. 2015. Genomics of microbial plasmids: Classification and identification based on replication and transfer systems and host taxonomy. Front Microbiol. 6: 242.Article PubMed PMC
Simão FA, Waterhouse RM, Ioannidis P, Kriventseva EV, Zdobnov EM. 2015. BUSCO: Assessing genome assembly and annotation completeness with single-copy orthologs. Bioinformatics. 31: 3210–3212. Article PubMed PDF
Simpson JT, Wong K, Jackman SD, Schein JE, Jones SJ, et al. 2009. ABySS: A parallel assembler for short read sequence data. Genome Res. 19: 1117–1123. Article PubMed PMC
Sousa TJ, Parise D, Profeta R, Parise MTD, Gomide ACP, et al. 2019. Re-sequencing and optical mapping reveals misassemblies and real inversions on Corynebacterium pseudotuberculosis genomes. Sci Rep. 9: 16387.Article PubMed PMC PDF
Thrash A, Hoffmann F, Perkins A. 2020. Toward a more holistic method of genome assembly assessment. BMC Bioinformatics. 21: 249.
Tizabi D, Bachvaroff T, Hill RT. 2022. Comparative analysis of assembly algorithms to optimize biosynthetic gene cluster identification in novel marine actinomycete genomes. Front Mar Sci. 9: 914197.
Travers KJ, Chin CS, Rank DR, Eid JS, Turner SW. 2010. A flexible and efficient template format for circular consensus sequencing and SNP detection. Nucleic Acids Res. 38: e159.
Treangen TJ, Abraham AL, Touchon M, Rocha EP. 2009. Genesis, effects and fates of repeats in prokaryotic genomes. FEMS Microbiol Rev. 33: 539–571.
Treangen TJ, Salzberg SL. 2011. Repetitive DNA and next-generation sequencing: Computational challenges and solutions. Nat Rev Genet. 13: 36–46.
Trisakul K, Hinwan Y, Eisiri J, Salao K, Chaiprasert A, et al. 2024. Comparisons of genome assembly tools for characterization of Mycobacterium tuberculosis genomes using hybrid sequencing technologies. PeerJ. 12: e17964.
Vaser R, Šikić M. 2021. Time- and memory-efficient genome assembly with Raven. Nat Comput Sci. 1: 332–336.
Vaser R, Sović I, Nagarajan N, Šikić M. 2017. Fast and accurate de novo genome assembly from long uncorrected reads. Genome Res. 27: 737–746.
Walker BJ, Abeel T, Shea T, Priest M, Abouelliel A, et al. 2014. Pilon: An integrated tool for comprehensive microbial variant detection and genome assembly improvement. PLoS One. 9: e112963.
Waters EV, Cameron SK, Langridge GC, Preston A. 2025. Bacterial genome structural variation: Prevalence, mechanisms, and consequences. Trends Microbiol. 33: 875–886.
Watson M, Warr A. 2019. Errors in long-read assemblies can critically affect protein prediction. Nat Biotechnol. 37: 124–126.
Wenger AM, Peluso P, Rowell WJ, Chang PC, Hall RJ, et al. 2019. Accurate circular consensus long-read sequencing improves variant detection and assembly of a human genome. Nat Biotechnol. 37: 1155–1162.
Wick RR, Holt KE. 2021. Benchmarking of long-read assemblers for prokaryote whole genome sequencing. F1000Res. 8: 2138.
Wick RR, Holt KE. 2022. Polypolish: Short-read polishing of long-read bacterial genome assemblies. PLoS Comput Biol. 18: e1009802.
Wick RR, Howden BP, Stinear TP. 2025. Autocycler: Long-read consensus assembly for bacterial genomes. Bioinformatics. 41: btaf474.
Wick RR, Judd LM, Cerdeira LT, Hawkey J, Méric G, et al. 2021a. Trycycler: Consensus long-read assemblies for bacterial genomes. Genome Biol. 22: 266.
Wick RR, Judd LM, Gorrie CL, Holt KE. 2017. Unicycler: Resolving bacterial genome assemblies from short and long sequencing reads. PLoS Comput Biol. 13: e1005595.
Wick RR, Judd LM, Holt KE. 2023. Assembling the perfect bacterial genome using Oxford Nanopore and Illumina sequencing. PLoS Comput Biol. 19: e1010905.
Wick RR, Judd LM, Wyres KL, Holt KE. 2021b. Recovery of small plasmid sequences via Oxford Nanopore sequencing. Microb Genom. 7: 000631.
Wick RR, Schultz MB, Zobel J, Holt KE. 2015. Bandage: Interactive visualization of de novo genome assemblies. Bioinformatics. 31: 3350–3352.
Zerbino DR, Birney E. 2008. Velvet: Algorithms for de novo short read assembly using de Bruijn graphs. Genome Res. 18: 821–829.

Fig. 1. Overview of bacterial genome assemblies and sequencing platforms. (A) Number of bacterial genomes by assembly level deposited in NCBI from 1995 to 2025, with the y-axis shown on a logarithmic scale. (B) Distribution of bacterial genomes by assembly level in 2025. (C) Proportion of sequencing platforms (Illumina, PacBio, and ONT) used for bacterial genome assemblies across different assembly levels. (D) Distribution of contig counts according to sequencing platform for genomes at the contig assembly level.

Fig. 2. Mechanism of repeat-induced assembly ambiguity and misassembly in long-read sequencing. Case 1. Highly identical repeats. Two repeat copies (R1 and R2) with near-identical sequences (~99.9% identity) are flanked by distinct genomic regions. Although reads may partially span these regions, the lack of sufficient sequence divergence prevents unique assignment of reads to specific repeat copies, resulting in ambiguous assembly paths and potential misassembly. Case 2. Tandem repeat structures. Multiple consecutive repeat units arranged in tandem can lead to uncertainty in repeat copy number during assembly. Depending on the algorithm and supporting evidence, repeat arrays may be collapsed into fewer copies, producing assemblies that underestimate the true genomic length. Case 3. Read length distribution effects. Even when ultra-long reads are present, only a fraction of reads may exceed the length of long repeat regions (e.g., 20 kb). When the majority of reads are shorter than the repeat, insufficient spanning evidence can lead to repeat collapse or fragmentation at repeat boundaries.

Fig. 3. Workflow for generating a high-confidence complete bacterial genome. Schematic overview of genome assembly and validation. High-molecular-weight DNA is subjected to hybrid sequencing using short- and long-read platforms. Assembly is performed using either an ONT-based hybrid approach, in which ONT structural reconstruction is followed by short-read polishing, or a PacBio HiFi-based approach. Validation includes evaluation of assembly statistics (genome size, plasmids, N50), read mapping for coverage assessment, assembly graph inspection for structural ambiguity, completeness and contamination assessment (e.g., CheckM, BUSCO), and comparative analysis with reference genomes (e.g., dnaA alignment) to detect structural discrepancies such as inversions and rearrangements.
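
The assembly statistics step of this workflow (genome size, contig count, N50) can be sketched as follows. This is an illustrative example only: the function names are hypothetical, and in practice these values are typically taken from an assessment tool such as QUAST rather than computed by hand.

```python
def n50(contig_lengths):
    """Length of the contig at which the running total, taken from the
    longest contig downward, first reaches half of the assembly size."""
    lengths = sorted(contig_lengths, reverse=True)
    half = sum(lengths) / 2
    running = 0
    for length in lengths:
        running += length
        if running >= half:
            return length
    return 0  # empty assembly


def assembly_summary(contig_lengths):
    """Basic contiguity statistics for a list of contig lengths."""
    return {
        "total_length": sum(contig_lengths),
        "num_contigs": len(contig_lengths),
        "largest_contig": max(contig_lengths, default=0),
        "n50": n50(contig_lengths),
    }


if __name__ == "__main__":
    # Hypothetical assembly: one chromosome-sized contig and one plasmid-sized contig.
    print(assembly_summary([5_000_000, 100_000]))
```

These numbers describe contiguity only. A single circular contig with a high N50 can still contain collapsed repeats or base-level errors, so the remaining validation steps in the workflow are needed to assess correctness.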

Table 1. De novo assembly tools used in bacterial genome assembly

Tool	Input	Core engine	Note	Reference
ARACHNE	Short (Sanger)	OLC	Efficient scaling; primarily suited for large eukaryotic genomes	Batzoglou et al. (2002)
PCAP	Short (Sanger)	OLC (Sanger-era)	Parallel Sanger-era assembler; large-genome oriented	Huang et al. (2003)
Newbler	Short (454)	OLC	Widely used in early bacterial genome assembly	Margulies et al. (2005)
Mira	Short	OLC	Supports mapping-based assembly and polishing	Chevreux (2005)
PHRAP	Short (Sanger)	OLC (Sanger-era)	Sanger-era shotgun assembler	de la Bastide and McCombie (2007)
EULER-SR	Short (454)	DBG	A-Bruijn–inspired DBG	Chaisson and Pevzner (2008)
Velvet	Short	DBG	Early short-read DBG assembler	Zerbino and Birney (2008)
ALLPATHS-LG	Short	DBG	Optional long-read integration for gap filling	MacCallum et al. (2009)
IDBA	Short	Iterative DBG	Handles uneven coverage	Peng et al. (2010)
SOAPdenovo	Short	DBG	Scaffold-oriented; widely used for large genomes	Li et al. (2010)
SOAPdenovo2	Short	DBG	Improved memory efficiency and accuracy	Luo et al. (2012)
Minia version 3	Short	Compacted DBG	Unitig-based; evolved from Bloom filter–based approach	Salikhov et al. (2013)
MEGAHIT^*	Short	Succinct DBG	Optimized for metagenome assembly	Li et al. (2015)
SPAdes^*	Short	DBG	Multisized and paired-end integration	Prjibelski et al. (2020)
HybridSPAdes^*	Hybrid	DBG	Multisized, paired-end, and long-read integration	Antipov et al. (2016b)
FALCON	Long	String graph	Diploid-aware; optimized for complex eukaryotic genomes	Chin et al. (2016)
Miniasm	Long	OLC	Consensus-free; requires polishing	Li (2016)
HINGE	Long	OLC (repeat-aware)	Improves repeat resolution using hinge-based graph construction	Kamath et al. (2017)
Canu^*	Long	OLC	Designed for noisy long reads	Koren et al. (2017)
Flye^*	Long	Repeat graph	Robust to complex repeats	Kolmogorov et al. (2019)
HiCanu^*	Long (HiFi)	OLC	Improved accuracy and repeat resolution	Nurk et al. (2020)
wtdbg2^*	Long	Fuzzy Bruijn graph	Fast and memory-efficient; designed for noisy long reads	Ruan and Li (2020)
Shasta^*	Long (ONT)	Marker graph	Fast and memory-efficient	Shafin et al. (2020)
Raven^*	Long	OLC	Optimized for long uncorrected reads; fast and memory-efficient	Vaser and Šikić (2021)
Hifiasm^*	Long (HiFi)	String graph	High accuracy and repeat resolution	Cheng et al. (2021)
NECAT^*	Long (ONT)	OLC	Efficient assembly of noisy long reads	Chen et al. (2021a)
SmartDenovo	Long	OLC	No error correction	Liu et al. (2021)
NextDenovo^*	Long	OLC	Improved accuracy	Hu et al. (2024b)

Tools marked with an asterisk are actively maintained and commonly used in contemporary bacterial genome assembly pipelines.

Tool	Category	Input	Core method	Key function	Reference
Pilon	Short-read polishing	Illumina reads	Read mapping	Error correction of SNPs and small indels	Walker et al. (2014)
Polypolish	Short-read polishing	Illumina reads	Multi-mapping	Repeat-aware error correction	Wick and Holt (2022)
Pypolca	Short-read polishing	Illumina reads	Read mapping	Error correction with threshold-based variant filtering	Bouras et al. (2024b)
Racon	Long-read polishing	Long reads	Read mapping	Consensus-based error correction	Vaser et al. (2017)
NeuralPolish	Long-read polishing	Long reads	Deep learning	Improved base accuracy using neural networks	Huang et al. (2021)
DeepPolisher	Long-read polishing	Long reads	Deep learning	Deep learning–based error correction	Mastoras et al. (2025)
Medaka	Long-read polishing	ONT reads	Deep learning	Signal-aware error correction	Medaka (2018)
NextPolish	Hybrid polishing	Short + long reads	Iterative polishing	Multi-platform error correction	Hu et al. (2020)

Tool	Category	Input	Core method	Key function	Reference
REAPR	Read-based	Short reads + assembly	Read mapping	Error detection	Hunt et al. (2013)
Inspector	Read-based	Long reads	Read mapping	Structural and local error detection with correction	Chen et al. (2021b)
QUAST	Reference-based	Assembly + reference	Whole-genome alignment	Assembly quality assessment and misassembly detection	Gurevich et al. (2013)
Assemblytics	Reference-based	Assembly + reference	Whole-genome alignment	Structural variation detection	Nattestad and Schatz (2016)
CheckM	Completeness	Assembly	Lineage-specific marker genes	Completeness + contamination	Parks et al. (2015)
BUSCO	Completeness	Assembly	Marker genes (Single-copy orthologs)	Completeness assessment	Seppey et al. (2019)
KAT	k-mer based	Reads + assembly	k-mer comparison	Coverage bias, duplication detection	Mapleson et al. (2017)
Merqury	k-mer based	Reads + assembly	k-mer spectrum	Accuracy (QV) + completeness	Rhie et al. (2020)
Bandage	Graph-based	Assembly graph	Visualization	Graph inspection	Wick et al. (2015)
gfatools	Graph-based	GFA	Graph parsing	Structural analysis	Pani et al. (2024)
Circlator	Structural	Assembly	Overlap detection	Circularization validation	Hunt et al. (2015)
MOB-suite	Plasmid	Assembly	Database + typing	Plasmid reconstruction	Robertson and Nash (2018)