The human genome contains numerous highly polymorphic loci rich in tandem repeats and structural variants. Read alignments at these loci are often ambiguous and unreliable, leaving hundreds of disease-associated genes inaccessible to accurate variant calling. In such regions, structural variant callers show limited sensitivity, k-mer-based tools cannot exploit the full linkage information of a sequencing read, and gene-specific methods cannot easily be extended to additional loci. An improved ability to genotype highly polymorphic genes can increase diagnostic power and uncover novel disease associations.
We present Locityper, a targeted tool capable of genotyping complex polymorphic loci using both short- and long-read whole genome sequencing (WGS), including error-prone ONT data. For each target, Locityper recruits WGS reads and aligns them to candidate locus haplotypes (e.g. extracted from a pangenome). By optimizing read alignment, insert size, and read depth profiles across haplotypes, Locityper efficiently estimates the likelihood of each haplotype pair. This is achieved by solving integer linear programming problems or by employing stochastic optimization.
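To make the idea of scoring haplotype pairs concrete, the following is a deliberately simplified sketch and not Locityper's actual model. It assumes that a per-read alignment log-likelihood against each candidate haplotype is already available, assigns each read to the better of the two haplotypes in a pair, and enumerates all pairs exhaustively. It ignores insert size and read depth, and it replaces the integer linear programming and stochastic optimization described above with brute-force search. The function name `best_haplotype_pair` and the toy values are hypothetical.

```python
from itertools import combinations_with_replacement


def best_haplotype_pair(read_loglik, haplotypes):
    """Return the haplotype pair with the highest total read log-likelihood.

    read_loglik: list of dicts, one per read, mapping haplotype name to the
                 (assumed precomputed) alignment log-likelihood of that read.
    haplotypes:  list of candidate haplotype names.

    Simplification: each read contributes the better of its two scores,
    and all unordered pairs (including homozygous ones) are enumerated.
    """
    best_pair, best_score = None, float("-inf")
    for h1, h2 in combinations_with_replacement(haplotypes, 2):
        score = sum(max(ll[h1], ll[h2]) for ll in read_loglik)
        if score > best_score:
            best_pair, best_score = (h1, h2), score
    return best_pair, best_score


# Toy data: two reads fit haplotype A well, two fit haplotype B well.
reads = [
    {"A": -1.0, "B": -9.0, "C": -8.0},
    {"A": -7.0, "B": -2.0, "C": -6.0},
    {"A": -1.5, "B": -8.0, "C": -9.0},
    {"A": -9.0, "B": -1.0, "C": -3.0},
]
pair, score = best_haplotype_pair(reads, ["A", "B", "C"])
print(pair, score)
```

On this toy input the heterozygous pair (A, B) scores highest, because every read has a good alignment to one of the two haplotypes. Exhaustive enumeration grows quadratically with the number of candidate haplotypes, which is one reason a practical tool would rely on dedicated optimization rather than this brute-force loop.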
Across 256 challenging medically relevant loci and 40 HPRC Illumina datasets, 95% of Locityper haplotypes were accurate (QV, Phred-scaled divergence, ≥33), compared to 27% of haplotypes reconstructed from the phased NYGC call set. In leave-one-out (LOO) evaluation, Locityper produced 60% accurate haplotypes, a fraction that will increase with larger reference panels, as >91% of haplotypes were very close (ΔQV≤5) to the best available haplotypes. Overall, 82% of 1KGP trio haplotypes were concordant. Finally, across 36 HLA genes, LOO Locityper correctly predicted the protein product in 94% of cases, outperforming the specialized HLA genotyper T1K at 78%.
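For readers unfamiliar with the accuracy metric, QV here is the Phred-scaled sequence divergence, QV = -10 log10(divergence). The sketch below assumes that divergence is computed as edit distance divided by alignment length. That definition is an assumption for illustration and may differ from the exact one used in the evaluation. The function name `phred_qv` is hypothetical.

```python
import math


def phred_qv(edit_distance, aligned_length):
    """Phred-scaled divergence between a predicted and a true haplotype.

    Assumes divergence = edit_distance / aligned_length (an illustrative
    definition). A perfect match has zero divergence and infinite QV.
    """
    if aligned_length <= 0:
        raise ValueError("aligned_length must be positive")
    if edit_distance == 0:
        return math.inf
    return -10.0 * math.log10(edit_distance / aligned_length)


# One difference per 1,000 bp gives QV 30; one per 2,000 bp gives just over 33.
print(phred_qv(1, 1000))
print(phred_qv(1, 2000))
```

Under this definition, the QV ≥ 33 threshold corresponds to a divergence of about 5 × 10⁻⁴, or roughly one difference per 2,000 bp, and a gap of ΔQV = 5 corresponds to about a 3.2-fold difference in divergence.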