GenHap

A novel computational method based on genetic algorithms for highly efficient haplotype assembly.

GenHap is a high-performance computational tool that leverages Genetic Algorithms (GAs) to solve the complex problem of haplotype assembly from high-throughput sequencing data.

Fully characterizing an individual's genome requires reconstructing the two distinct copies of each chromosome, called haplotypes. Haplotype assembly is commonly formulated as the weighted Minimum Error Correction (wMEC) problem, which is NP-hard: partition the sequencing reads into two disjoint subsets, one per haplotype, so that the weighted number of corrections to Single Nucleotide Polymorphism (SNP) values is minimal. GenHap searches for this partition with a GA-driven global search and is designed to handle the long, high-coverage reads of third-generation sequencing technologies such as PacBio RS II and Oxford Nanopore. As a heuristic, a GA does not guarantee an optimal solution; the authors report consistently high accuracy in terms of haplotype error rate.
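To make the wMEC objective concrete, here is a minimal C++ sketch of the cost of one candidate bipartition. It is an illustration under stated assumptions, not GenHap's source code: the `Fragment` layout, the `-1` marker for uncovered sites, and the name `wmecCost` are hypothetical. For each subset and each SNP site, the cheaper option is to correct the allele with the smaller total weight, so the cost is the sum of those minima.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical layout: one read over n SNP sites.
// allele[j] is 0 or 1, or -1 if the read does not cover site j.
// weight[j] is the cost of correcting that call (e.g. derived from base quality).
struct Fragment {
    std::vector<int> allele;
    std::vector<double> weight;
};

// Weighted correction cost of a bipartition; side[i] (0 or 1) is the subset of read i.
double wmecCost(const std::vector<Fragment>& reads,
                const std::vector<int>& side,
                std::size_t numSnps) {
    assert(reads.size() == side.size());
    double total = 0.0;
    for (std::size_t j = 0; j < numSnps; ++j) {
        // w[s][a]: total weight of reads in subset s that call allele a at site j
        double w[2][2] = {{0.0, 0.0}, {0.0, 0.0}};
        for (std::size_t i = 0; i < reads.size(); ++i) {
            const int a = reads[i].allele[j];
            if (a < 0) continue;  // site not covered by this read
            w[side[i]][a] += reads[i].weight[j];
        }
        // Within each subset, correct whichever allele carries less weight.
        for (int s = 0; s < 2; ++s) total += std::min(w[s][0], w[s][1]);
    }
    return total;
}
```

With four unit-weight reads `001`, `001`, `110`, `100`, grouping the first two against the last two needs a single correction, whereas interleaving them needs five. The optimizer's job is to find the grouping with the lowest such cost.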


Technical Architecture & Distributed Computing

GenHap keeps running times low by combining an optimized C++ implementation with a Master-Slave distributed programming paradigm built on the Message Passing Interface (MPI).

Instead of attempting to solve the entire matrix of sequencing reads at once, GenHap employs a divide-et-impera (divide and conquer) strategy. The Master process detects haplotype blocks and splits the fragment matrix into smaller, manageable sub-matrices. These sub-problems are then distributed across multiple CPU cores (Slaves), where independent Genetic Algorithm instances optimize the partitions in parallel before the Master recombines them into a complete, highly accurate haplotype structure.
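The block-detection step can be sketched as follows. This is a simplified, single-process illustration, not GenHap's actual routine: it assumes each read is summarized by the first and last SNP column it spans, and it starts a new block wherever no read links two adjacent columns. The names `Span` and `findBlocks` are hypothetical, and the further split of each block into sub-matrices and the MPI distribution are omitted.

```cpp
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

// Closed interval [first, last] of SNP column indices spanned by one read.
using Span = std::pair<std::size_t, std::size_t>;

// Returns the haplotype blocks as closed column intervals [begin, end].
std::vector<Span> findBlocks(const std::vector<Span>& readSpans, std::size_t numSnps) {
    // linked[j] is true if at least one read spans both column j and column j + 1.
    std::vector<bool> linked(numSnps > 0 ? numSnps - 1 : 0, false);
    for (const Span& r : readSpans) {
        assert(r.first <= r.second);
        for (std::size_t j = r.first; j < r.second && j + 1 < numSnps; ++j)
            linked[j] = true;
    }

    std::vector<Span> blocks;
    std::size_t begin = 0;
    for (std::size_t j = 0; j < numSnps; ++j) {
        const bool lastColumn = (j + 1 == numSnps);
        if (lastColumn || !linked[j]) {  // no read bridges j and j + 1: close the block
            blocks.push_back({begin, j});
            begin = j + 1;
        }
    }
    return blocks;
}
```

For example, reads spanning columns 0-2, 1-3, and 4-5 of a six-column matrix yield two blocks, 0-3 and 4-5, which can then be optimized independently because no read carries phase information across the gap.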


Research Highlights

Evolutionary Optimization

Uses tournament selection, crossover, and mutation operators to iteratively evolve the partition of sequencing reads, reducing the risk of getting trapped in local optima.
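A compact sketch of how these three operators fit together is shown below. It is a generic generational GA over binary strings, where gene i encodes the subset of read i. It is not GenHap's implementation: the tournament size of 3, the single elite, the single-point crossover, the 1/length mutation rate, and all function names are assumptions chosen for illustration.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <functional>
#include <random>
#include <utility>
#include <vector>

using Individual = std::vector<int>;                    // gene i is 0 or 1: subset of read i
using Cost = std::function<double(const Individual&)>;  // lower is better

// Tournament selection: draw k candidates uniformly at random and
// return the index of the one with the lowest cost.
std::size_t tournament(const std::vector<double>& costs, std::size_t k, std::mt19937& rng) {
    std::uniform_int_distribution<std::size_t> pick(0, costs.size() - 1);
    std::size_t best = pick(rng);
    for (std::size_t t = 1; t < k; ++t) {
        const std::size_t challenger = pick(rng);
        if (costs[challenger] < costs[best]) best = challenger;
    }
    return best;
}

// Single-point crossover: prefix of parent a followed by suffix of parent b.
Individual crossover(const Individual& a, const Individual& b, std::mt19937& rng) {
    assert(a.size() == b.size());
    std::uniform_int_distribution<std::size_t> cutPoint(0, a.size());
    const auto cut = static_cast<std::ptrdiff_t>(cutPoint(rng));
    Individual child(a.begin(), a.begin() + cut);
    child.insert(child.end(), b.begin() + cut, b.end());
    return child;
}

// Bit-flip mutation: each gene switches subset with probability `rate`.
void mutate(Individual& x, double rate, std::mt19937& rng) {
    std::bernoulli_distribution flip(rate);
    for (int& gene : x)
        if (flip(rng)) gene = 1 - gene;
}

// Generational GA with one elite; returns the cheapest individual of the final population.
// Assumes length > 0 and popSize > 0.
Individual evolve(std::size_t length, const Cost& cost, std::size_t popSize,
                  std::size_t generations, unsigned seed) {
    std::mt19937 rng(seed);
    std::bernoulli_distribution coin(0.5);
    std::vector<Individual> pop(popSize, Individual(length, 0));
    for (Individual& ind : pop)
        for (int& gene : ind) gene = coin(rng) ? 1 : 0;

    std::vector<double> costs(popSize);
    // Re-scores the whole population and returns the index of the cheapest individual.
    auto evaluate = [&] {
        for (std::size_t i = 0; i < popSize; ++i) costs[i] = cost(pop[i]);
        return static_cast<std::size_t>(
            std::min_element(costs.begin(), costs.end()) - costs.begin());
    };

    std::size_t best = evaluate();
    for (std::size_t g = 0; g < generations; ++g) {
        std::vector<Individual> next;
        next.reserve(popSize);
        next.push_back(pop[best]);  // elitism: the best individual survives unchanged
        while (next.size() < popSize) {
            const Individual& p1 = pop[tournament(costs, 3, rng)];
            const Individual& p2 = pop[tournament(costs, 3, rng)];
            Individual child = crossover(p1, p2, rng);
            mutate(child, 1.0 / static_cast<double>(length), rng);
            next.push_back(std::move(child));
        }
        pop.swap(next);
        best = evaluate();
    }
    return pop[best];
}
```

In a haplotyping setting, the `cost` callback would be the wMEC correction cost of the partition encoded by the individual. Because the elite is copied unchanged, the best cost never gets worse from one generation to the next, while crossover and mutation keep introducing partitions that a purely greedy search would not reach.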

Unmatched Speed

Reported to be up to 4× faster than HapCol, a state-of-the-art haplotyping tool, on Roche/454 datasets, and up to 20× faster on PacBio RS II datasets.

Future-Gen Ready

Engineered to handle the long reads and high coverage characteristic of third-generation and future sequencing technologies.

Core Publication

2019

  1. GenHap: a novel computational method based on genetic algorithms for haplotype assembly
    Andrea Tangherloni, Simone Spolaor, Leonardo Rundo, and 6 more authors
    BMC Bioinformatics, 2019