A pangenome reference of 36 Chinese populations

Title: MAIB-Talk-020: A pangenome reference of 36 Chinese populations, Nature (2023)
Date: 10:00 pm US Eastern time, 07/01/2023
Date: 10:00 am Beijing time, 07/02/2023
Zoom ID: 933 1613 9423
Zoom PWD: 416262
Zoom: https://uwmadison.zoom.us/meeting/register/tJcudu-prTIuGNda1MsF8PKyRQlnGn06TP2E

Presentation Record (a previous presentation will be shown here if the video for this talk has not been released)

MAIB: Manifold learning, Artificial Intelligence, Biology Forum (MAIB)

Dr. Yang Gao, Fudan University, Shanghai, China

State Key Laboratory of Genetic Engineering, Human Phenome Institute, Zhangjiang Fudan International Innovation Center, Center for Evolutionary Biology, School of Life Sciences, Fudan University, Shanghai, China; Ministry of Education Key Laboratory of Contemporary Anthropology, Collaborative Innovation Center for Genetics and Development, Fudan University, Shanghai, China; Key Laboratory of Computational Biology, Shanghai Institute of Nutrition and Health, University of Chinese Academy of Sciences, Chinese Academy of Sciences, Shanghai, China; School of Life Science and Technology, ShanghaiTech University, Shanghai, China

Reference

Gao, Y., Yang, X., Chen, H. et al. A pangenome reference of 36 Chinese populations. Nature (2023). https://doi.org/10.1038/s41586-023-06173-7

Abstract

Human genomics is witnessing an ongoing paradigm shift from a single reference sequence to a pangenome form, but populations of Asian ancestry are underrepresented. Here we present data from the first phase of the Chinese Pangenome Consortium, including a collection of 116 high-quality and haplotype-phased de novo assemblies based on 58 core samples representing 36 minority Chinese ethnic groups. With an average 30.65× high-fidelity long-read sequence coverage, an average contiguity N50 of more than 35.63 megabases and an average total size of 3.01 gigabases, the CPC core assemblies add 189 million base pairs of euchromatic polymorphic sequences and 1,367 protein-coding gene duplications to GRCh38. We identified 15.9 million small variants and 78,072 structural variants, of which 5.9 million small variants and 34,223 structural variants were not reported in a recently released pangenome reference (ref. 1). The Chinese Pangenome Consortium data demonstrate a remarkable increase in the discovery of novel and missing sequences when individuals are included from underrepresented minority ethnic groups. The missing reference sequences were enriched with archaic-derived alleles and genes that confer essential functions related to keratinization, response to ultraviolet radiation, DNA repair, immunological responses and lifespan, implying great potential for shedding new light on human evolution and recovering missing heritability in complex disease mapping.
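The contiguity figure in the abstract, N50, is the contig length at which contigs of that length or longer cover at least half of the assembly. The following is a minimal sketch of that definition in Python, not the consortium's evaluation pipeline; the function name `n50` and the toy contig lengths are assumptions made for illustration. Note also that 116 assemblies from 58 samples corresponds to two haplotype-phased assemblies per diploid sample.

```python
def n50(contig_lengths):
    """Return the N50 of a collection of contig lengths.

    N50 is the length L such that contigs of length >= L together
    cover at least half of the total assembly size.
    """
    lengths = sorted(contig_lengths, reverse=True)
    if not lengths:
        raise ValueError("need at least one contig")
    half_total = sum(lengths) / 2
    running = 0
    for length in lengths:
        running += length
        if running >= half_total:
            return length


# Hypothetical toy assembly, lengths in megabases (not CPC data).
toy_contigs_mb = [50, 40, 30, 20, 10]
toy_n50 = n50(toy_contigs_mb)
print(f"toy N50: {toy_n50} Mb")
```

Because N50 depends only on the sorted length distribution, a few very long contigs can raise it sharply, which is why it is usually reported together with total assembly size, as the abstract does.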

Background

Over the past two decades, the reference human genome sequence has served as the foundation for genetic and biomedical research and applications; however, there is a broad consensus that no single reference sequence can represent the genomic diversity of global populations. On one hand, high-quality population-specific and haplotype-resolved genome references are necessary for genetic and medical analysis (ref. 2). On the other hand, there is a clear need to shift from a single reference to a pangenome form that better represents genomic diversity, or allelic variation, within and across human populations (ref. 3). With the advancement of long-read sequencing technologies as well as computational methods, it is now feasible to enable pan-genomic construction to capture the missed variations from a large collection of diverse genomes (ref. 4). The Human Pangenome Reference Consortium (HPRC) recently constructed a draft human pangenome reference based on 47 samples of worldwide populations, but with East Asian population samples underrepresented (n = 4) (ref. 1). In particular, only three Southern Han Chinese (CHS) samples were included in the HPRC reference, too few to represent the genomic diversity of ethnic groups in a region such as China, which is populated by 1.44 billion people. We showed previously that the genetic diversity in Asia was not well covered by large-scale international collaborative projects such as the 1000 Genomes Project (refs. 5, 6). Although the need to improve the representation of diverse ancestral backgrounds in genomic research is well known (refs. 7, 8), substantially fewer genomic studies have been conducted in populations of Asian ancestry compared with populations of European ancestry. China harbours a great genetic diversity, with 55 officially recognized minority ethnic groups in addition to the Han Chinese majority and a considerable number of unrecognized ethnic groups.
Despite advances in sequencing technologies leading to the achievement of the telomere-to-telomere haploid assembly T2T-CHM13 (ref. 9), only a limited number of Chinese genomes have been de novo assembled to high-quality haplotype sequences using long-read DNA sequencing technologies (refs. 2, 10-14). The only two published studies on the Chinese pangenome were limited to the short-read sequencing data of Han Chinese samples (refs. 15, 16). There is an urgent need to establish a high-quality pangenome reference that better represents the great genomic diversity of Chinese populations. We anticipate such an effort to broaden the reference to represent genomic diversity, resolve allelic and locus heterogeneity, support unbiased and comprehensive detection of structural variation within and across populations, and improve genotyping accuracy in genomic regions enriched with complex sequence variations, such as human leukocyte antigen genes, and ultimately facilitate genomic analysis for both evolutionary and medical research.

What we can learn from this study

The study conducted by the Chinese Pangenome Consortium (CPC) provides several valuable insights and contributions to the field of human genomics. Here are some key takeaways from the study:

Pangenome Representation: The study highlights the shift from a single reference genome to a pangenome representation, which better captures the genomic diversity and allelic variation within and across populations. By including individuals from underrepresented minority ethnic groups, the CPC data significantly increased the discovery of novel and missing sequences.

Genomic Diversity of Chinese Populations: The CPC project specifically focuses on constructing a high-quality pangenome reference for Chinese populations. The study emphasizes the need to improve the representation of diverse ancestral backgrounds in genomic research, particularly in populations of Asian ancestry. China has 55 officially recognized minority ethnic groups in addition to the Han Chinese majority; the first phase of the CPC sampled 36 minority ethnic groups, and the study stresses the importance of incorporating their genomic data for comprehensive analysis.

High-Quality Haplotype-Resolved Assemblies: The CPC generated 116 high-quality and haplotype-phased de novo genome assemblies using long-read sequencing technology. These assemblies provide a detailed understanding of the genetic variation within Chinese populations and enable accurate haplotype phasing, which is crucial for genetic and medical analysis.

Novel Variants and Missing Sequences: Of the 15.9 million small variants and 78,072 structural variants identified in the CPC data, 5.9 million small variants and 34,223 structural variants were not reported in the recently released HPRC draft pangenome reference. These novel variants expand our knowledge of genetic diversity and offer insights into human evolution and complex disease mapping.

Functional Significance of Missing Sequences: The missing reference sequences identified in the CPC data were found to be enriched with archaic-derived alleles and genes associated with essential functions related to various biological processes such as keratinization, DNA repair, immune responses, and lifespan. This suggests that the CPC data has the potential to provide new insights into human evolution and uncover missing heritability in complex diseases.

Improving Genotyping Accuracy and Structural Variant Detection: The authors anticipate that incorporating diverse Chinese populations in the pangenome reference will improve genotyping accuracy, particularly in genomic regions enriched with complex sequence variations such as the human leukocyte antigen genes. They also expect it to support unbiased and comprehensive detection of structural variants, which is crucial for understanding genomic rearrangements and their impact on health and disease.

Overall, this study contributes to a more comprehensive understanding of the genomic diversity of Chinese populations, highlights the importance of including underrepresented populations in genomics research, and provides valuable resources for future genetic and medical studies.
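The takeaway on novel variants amounts to comparing a new call set against an existing catalog and counting what is absent from it. The sketch below illustrates that idea with exact matching on chromosome, position, and alleles; the `Variant` record, the `split_known_novel` helper, and the toy data are all hypothetical. It is a simplification and not the study's method: real comparisons typically require variant normalization, and structural variants are usually matched with tolerance rather than exactly. The last lines compute the fractions implied by the counts quoted in the abstract, roughly 37% of small variants and 44% of structural variants.

```python
from typing import NamedTuple


class Variant(NamedTuple):
    """A small variant keyed by position and alleles (hypothetical minimal record)."""
    chrom: str
    pos: int
    ref: str
    alt: str


def split_known_novel(callset, prior_catalog):
    """Split callset into variants present in prior_catalog and those absent from it."""
    prior = set(prior_catalog)
    known = [v for v in callset if v in prior]
    novel = [v for v in callset if v not in prior]
    return known, novel


# Toy data, invented for illustration only (not CPC or HPRC calls).
prior_catalog = [Variant("chr1", 100, "A", "G"), Variant("chr2", 250, "C", "T")]
new_callset = [
    Variant("chr1", 100, "A", "G"),   # already catalogued
    Variant("chr1", 480, "T", "TA"),  # absent from the prior catalog
    Variant("chr3", 75, "G", "C"),    # absent from the prior catalog
]
known, novel = split_known_novel(new_callset, prior_catalog)

# Fractions implied by the counts quoted in the abstract.
novel_small_fraction = 5.9e6 / 15.9e6
novel_sv_fraction = 34_223 / 78_072
print(f"toy novel variants: {len(novel)} of {len(new_callset)}")
print(f"small variants not previously reported: {novel_small_fraction:.1%}")
print(f"structural variants not previously reported: {novel_sv_fraction:.1%}")
```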

The concept of a pangenome is important because it challenges the traditional understanding of genomes as fixed and static entities. A pangenome refers to the complete set of genes and genetic variations within a species, including the core genome shared by all individuals and the dispensable genome, which consists of genes that are present in some individuals but absent in others.
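The core/dispensable split described above can be expressed with set operations on per-individual gene content. This is a minimal sketch under the assumption that each individual is summarized as a set of gene identifiers; the function name `partition_pangenome`, the sample names, and the gene IDs are hypothetical, and graph-based human pangenomes such as the one in this talk represent variation at the sequence level rather than as simple gene lists.

```python
def partition_pangenome(gene_sets):
    """Split a pangenome into core and dispensable genes.

    gene_sets maps each individual to the set of gene IDs it carries.
    Core genes are present in every individual; dispensable genes are
    present in at least one individual but not in all of them.
    """
    all_sets = list(gene_sets.values())
    if not all_sets:
        return set(), set()
    pan = set().union(*all_sets)         # every gene seen in any individual
    core = set.intersection(*all_sets)   # genes shared by all individuals
    dispensable = pan - core
    return core, dispensable


# Hypothetical presence/absence data for three individuals.
toy_gene_sets = {
    "sample_A": {"g1", "g2", "g3"},
    "sample_B": {"g1", "g2"},
    "sample_C": {"g1", "g3", "g4"},
}
core, dispensable = partition_pangenome(toy_gene_sets)
print("core:", sorted(core))
print("dispensable:", sorted(dispensable))
```

As more individuals are added, the core set can only shrink or stay the same while the pangenome as a whole can only grow, which is one way to see why sampling underrepresented populations uncovers sequence missing from a single reference.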

Here are some reasons why the pangenome is important:

Genetic Diversity: The pangenome captures the genetic diversity within a species. It helps us understand the genetic variations and differences between individuals and populations. This diversity is crucial for adaptation, evolution, and resilience to environmental changes and diseases.

Functional Gene Annotation: The pangenome provides a comprehensive catalog of genes and genetic elements that can be associated with specific functions. By comparing genomes within the pangenome of a species, researchers can identify genes responsible for various traits, such as disease susceptibility, drug response, or agricultural productivity. This knowledge has implications for personalized medicine, crop improvement, and biotechnological applications.

Evolutionary Insights: The pangenome allows researchers to study the evolutionary dynamics of a species. By analyzing the variations in the pangenome, scientists can track the emergence, transfer, and loss of genes over time. This information helps in understanding the mechanisms of evolution, speciation, and adaptation.

Disease Research: In the context of human health, the pangenome contributes to our understanding of genetic diseases. It reveals genetic variations that may be associated with disease susceptibility, drug response, or treatment outcomes. By studying the pangenome, researchers can identify rare or population-specific variants that are missed in traditional reference genomes, improving diagnostic accuracy and personalized medicine.

Conservation and Biodiversity: The pangenome aids in conservation efforts by providing insights into the genetic diversity of endangered species. Understanding the pangenome can help conservationists identify genetically distinct populations, monitor genetic health, and develop strategies for preserving biodiversity.

In summary, the pangenome is important because it expands our understanding of genetic diversity, functional genomics, evolution, disease research, and conservation. By considering the full spectrum of genetic variations, rather than relying on a single reference genome, we gain a more comprehensive and accurate view of the genomic landscape within a species.