Evolution & Comparative Genomics
We are interested in various topics in phylogenetics, molecular evolution and comparative genomics, including:
1. Phylogenetic tree construction
2. Detection of selection signals from comparative or population data
3. Comparative genomics and population genetics
4. Development of related bioinformatics tools and websites
5. Algorithm design and improvement
Projects:
1) Artificial selection during rice domestication project.
Domestication involves a series of profound genetic changes, resulting from selection, that make a wild species more amenable to cultivation and consumption by humans. Many traits, such as grain size, shape, color, fragrance and amylose content, distinguish modern rice varieties from their wild ancestor. Our rice domestication project aims to identify and catalog the genes and functional regions that may underlie these traits in different rice populations. To achieve this, we need to identify regions with low SNP incidence (which we call SNP deserts) and regions with high SNP incidence (which we call SNP forests), and then map these regions to trait data and to measures such as linkage disequilibrium between genes. In addition to finding loci associated with cultivation traits, which are of great benefit for breeding, the project may also clarify the history of rice domestication.
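As a rough illustration of how SNP deserts and SNP forests could be located, the sketch below counts SNPs in fixed-size windows along one chromosome and labels each window. The window size and both thresholds are assumptions chosen only for the example, and the function name is hypothetical; the project's actual criteria are not described here.

```python
from collections import Counter

def classify_windows(snp_positions, chrom_length, window=100_000,
                     desert_max=1, forest_min=5):
    """Label fixed-size windows by SNP count.

    snp_positions: 0-based SNP coordinates on one chromosome.
    window, desert_max and forest_min are illustrative assumptions.
    Returns a list of (start, end, snp_count, label) tuples.
    """
    n_windows = (chrom_length + window - 1) // window  # ceiling division
    counts = Counter(pos // window for pos in snp_positions)
    labelled = []
    for i in range(n_windows):
        count = counts.get(i, 0)
        if count <= desert_max:
            label = "desert"
        elif count >= forest_min:
            label = "forest"
        else:
            label = "intermediate"
        start = i * window
        end = min(start + window, chrom_length)
        labelled.append((start, end, count, label))
    return labelled

if __name__ == "__main__":
    demo = classify_windows([10, 20, 30, 40, 50, 150_000], 300_000)
    for row in demo:
        print(row)
```

In practice the thresholds would more likely be derived from the genome-wide distribution of window counts than fixed in advance.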
2) Searching for functional elements from recent selection signals
Natural selection has strongly influenced recent human evolution; by scanning the entire human genome for genetic variation, we may detect the genomic regions that have been targets of natural selection. The functions of these regions may help us understand the recent evolution of humans.
3) Adaptive evolution on phylogenetic branches
The project has three primary goals: (1) determine the best method for detecting adaptive evolution by testing against previously characterized biochemical pathways, (2) design improved methods of adaptive evolution detection, and (3) create a comprehensive database of these results.
4) De novo sequencing of D. albomicans.
We aim to find evidence of chromosome fusion and to illustrate the pattern of stepwise chromosome evolution.
5) Detecting genome structure variation
Many forms of variation, including insertion, deletion, tandem duplication, inversion and segmental duplication, shape and alter the structure of chromosomes. We are developing a tool to detect these variations without any prior knowledge.
Re-sequencing
Our group works to develop standard bioinformatics protocols and pipelines for all kinds of resequencing-related projects. Members work on a wide range of topics, from alignment programs and genetic variant identification to the implementation of algorithms in population genomics, association studies, comparative genomics and other research guided by reference DNA sequences.
Ongoing Projects:
1) Yanhuang: the first diploid Asian genome
The project sequenced a Han Chinese genome using Now-Gen sequencing technology. Beyond its impact on science, during the project we developed solutions for sequencing a human genome using short reads and for identifying single nucleotide polymorphisms, insertions/deletions and structural variations involving segmental changes.
2) International 1000 genomes project (participating)
The International 1000 Genomes Project was announced by the Sanger Institute in the UK, BGI-Shenzhen in China and NHGRI in the US on January 22, 2008. We participate in developing utilities and in general data analysis for the project. The project will provide a deep catalog of human genomic variants, facilitating medical studies and evolutionary research.
3) Resequencing utilities
SOAP: short oligonucleotide alignment program, developed by Dr. Ruiqiang Li et al. We are developing a new, substantially faster program, SOAP2, based on new data structures and implementations.
SoapSNP, SoapIndel, SoapSV: utilities to build consensus sequences and call variants from SOAP alignment results.
SoapSOLiD: color-space short oligonucleotide alignment program for SOLiD sequencing.
SoapSNP (population version): utilities to call consensus and identify genetic variants from population resequencing data.
Variation Detection
Our group mainly focuses on detecting variations based on resequencing. These variations are of three types: single nucleotide polymorphisms (SNPs), short insertions and deletions (short indels) and structural variations (SVs).
Solexa provides high-throughput, low-cost sequencing technology, speeding up studies of human genetic variation. BGI has developed a pipeline for variation detection based on Solexa sequencing, which was evaluated in the Yanhuang project.
SNPs are the most common genetic variation in the genome. The SNP detection program developed by BGI takes into account the characteristics of Solexa sequencing, such as sequencing quality, alignment errors and the dependency of experimental errors. The program finally calculates a quality score for each SNP candidate.
Indel refers to two types of genetic mutation, insertion and deletion, that are often considered together when comparing two sequences. Because Solexa sequencing reads are short, gapped alignment of these reads has low confidence, so we focus only on detecting short indels (1-3 bp) based on gapped alignment.
SVs include duplication, inversion, translocation, insertion, deletion and complex events (where several rearrangements have occurred in the same region and it is hard to distinguish exactly what happened). Paired-end sequencing is useful for detecting structural variations. When paired-end reads map to the reference abnormally (with an improper insert size or orientation), the cause may be a structural variation in the sequenced individual, so we can detect SVs based on abnormal paired-end alignments.
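The idea of flagging abnormally mapped read pairs can be sketched as follows. This is a minimal illustration, not the group's pipeline: it assumes a library in which a properly mapped pair has the forward-strand read upstream of the reverse-strand read, and the read length, mean insert size, standard deviation and cutoff are all placeholder values. The function name is hypothetical.

```python
def classify_pair(chrom1, pos1, strand1, chrom2, pos2, strand2,
                  read_len=35, mean_insert=200, sd_insert=20, n_sd=3):
    """Classify one aligned read pair as normal or abnormal.

    pos1/pos2 are leftmost mapping coordinates; strands are "+" or "-".
    All numeric defaults are illustrative assumptions.
    """
    if chrom1 != chrom2:
        return "inter-chromosomal"
    if strand1 == strand2:
        return "abnormal orientation"
    # Order the two reads by mapping position.
    (left_pos, left_strand), (right_pos, _) = sorted(
        [(pos1, strand1), (pos2, strand2)]
    )
    if left_strand != "+":
        # Reverse-strand read lies upstream of the forward-strand read.
        return "abnormal orientation"
    span = right_pos + read_len - left_pos  # outer distance of the pair
    if span > mean_insert + n_sd * sd_insert:
        return "insert too large"
    if span < mean_insert - n_sd * sd_insert:
        return "insert too small"
    return "normal"

if __name__ == "__main__":
    print(classify_pair("chr1", 1000, "+", "chr1", 1165, "-"))
    print(classify_pair("chr1", 1000, "+", "chr1", 5000, "-"))
```

A single abnormal pair is weak evidence on its own; callers typically require several abnormal pairs supporting the same event before reporting a candidate SV.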
Any two individuals differ by only about 0.1%, or a bit more, at the genome level. It is important to understand these differences, as they help explain individual differences in susceptibility to disease and in response to drugs and the environment. This helps us understand ourselves better.
Project:
Yanhuang project:
Sequence the genomes of at least 100 Chinese people to study the genetic characteristics of the Chinese population and construct a Chinese genetic polymorphism map. The findings will then be applied to medical science, in the hope of solving problems related to Chinese-specific genetic diseases.
1000 Genomes Project:
This is an international collaboration aiming to sequence at least 1000 people from all over the world to create a detailed and medically useful map of human genetic variation. The scientific goals of the 1000 Genomes Project are to produce a catalog of variants that are present at 1% or greater frequency in the human population across most of the genome, and down to 0.5% or lower within genes. The project will provide a deep understanding of human genetic variation and may contribute greatly to human health.
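To see how the sample size relates to the frequency targets above, the short calculation below estimates the chance that a variant of a given population frequency is carried by at least one of the sampled chromosomes. It is a simplified sketch: it assumes diploid individuals drawn independently from one population and ignores sequencing depth and calling errors, so it is an upper bound on what a real study would detect. The function name is hypothetical.

```python
def prob_variant_sampled(allele_freq, n_individuals):
    """Probability that at least one of 2N sampled chromosomes carries
    an allele of population frequency allele_freq (independent sampling)."""
    n_chromosomes = 2 * n_individuals
    return 1.0 - (1.0 - allele_freq) ** n_chromosomes

if __name__ == "__main__":
    for freq in (0.01, 0.005, 0.001):
        print(freq, prob_variant_sampled(freq, 1000))
```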
Association Studies
The major research objective of this group is to understand the relationship between DNA variations and phenotypic traits of interest. The susceptibility alleles of complex diseases are much harder to locate than those of monogenic diseases, because multiple loci contribute to the predisposition to a complex disease and each locus accounts for only a small proportion of the phenotype. Since the release of the HapMap phase II data, high-density SNP arrays have been used to scan the whole genome to capture alleles with lower frequencies and smaller effects. Furthermore, as new sequencing technology develops, it is now feasible and affordable to sequence the whole set of exons (the exome) of each individual in cases and controls to obtain whole-exome allele patterns. We focus on designing and applying the bioinformatics tools and statistical methods involved in association studies. We also work on establishing a bioinformatics pipeline for SNP/SV detection that handles both genotyping data and resequencing data effectively.
Ongoing Projects:
International collaboration: Genome-wide detection of alleles related to complex disorders
We participate in the LUCAMP project, a Sino-Danish collaboration that aims to detect high-risk rare alleles (MAF < 0.5%) whose frequencies differ between diabetic cases and healthy controls, to validate those alleles experimentally and to use them to aid disease prediction and prevention. In the first phase of this study, the goal is to sequence the whole exomes of 2,000 cases and 2,000 controls and to detect the top SNPs with the most significant differences in allele frequency between these two groups of subjects.
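A comparison of allele frequencies between cases and controls can be sketched with a Pearson chi-square test on a 2x2 table of allele counts. This is only one common choice and is not stated to be the test used in the project; for very rare alleles with small expected counts, an exact test such as Fisher's is generally preferred. The function name and the example counts are made up for illustration.

```python
import math

def allele_chi2(case_alt, case_total, ctrl_alt, ctrl_total):
    """Pearson chi-square test (1 degree of freedom) on allele counts.

    case_alt / ctrl_alt: alternate-allele counts in cases / controls.
    case_total / ctrl_total: total alleles observed (2 per individual).
    Returns (chi2_statistic, p_value).
    """
    a, b = case_alt, case_total - case_alt
    c, d = ctrl_alt, ctrl_total - ctrl_alt
    n = a + b + c + d
    denom = (a + b) * (c + d) * (a + c) * (b + d)
    if denom == 0:
        return 0.0, 1.0  # a margin is empty, so no test is possible
    chi2 = n * (a * d - b * c) ** 2 / denom
    # Survival function of chi-square with 1 df: erfc(sqrt(x / 2)).
    p_value = math.erfc(math.sqrt(chi2 / 2.0))
    return chi2, p_value

if __name__ == "__main__":
    # Hypothetical counts: 2,000 cases and 2,000 controls, 4,000 alleles each.
    print(allele_chi2(30, 4000, 10, 4000))
```

Ranking SNPs by this p-value gives a list of "top" candidates, though a genome-wide scan would also need a correction for the number of SNPs tested.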
The Ministry of Science and Technology, 863 funding: Identifying susceptibility sites for type II diabetes (T2D) in the Chinese population
We collected 1800 DNA samples from the Chinese population and classified them into three groups: 1) patients with T2D; 2) glucose tolerance; 3) controls, which were randomly selected and matched well with cases in age and sex. Employing both genotyping and Solexa sequencing, we are identifying susceptibility alleles related to the occurrence of T2D in the Chinese population.
The Silkworm Genomic Variation Map
A total of 34 silkworms, collected from several geographic regions around the world, are being resequenced at ~2X each, and genomic variation patterns, mostly SNPs, will be detected. We also aim to screen for evidence of artificial selection and/or domestication by comparing the genomes of domesticated and wild silkworms. This project is phase II of the Silkworm Genomics Project, and Southwest University in Chongqing, China provides all the silkworm samples. In phase I of the project, BGI and Southwest University completed the sequencing and analysis of the genome of the domesticated silkworm Bombyx mori, and the work has been published in Science.
Cancer Genomics
With low-cost whole-genome resequencing and target-region resequencing, it is feasible to identify the genetic features of tissue and blood samples from cancer patients. The cancer genomics group in the bioinformatics center of BGI-Shenzhen aims to build a pipeline for analyzing large-scale sequencing data, focusing on explaining the relationships among genetic context, expression profiles and epigenetic diversity and their corresponding effects on phenotypes.