User Guide

PGG.Population Database

The PGG.Population database documents 7122 genomes representing 356 global populations from countries, and provides the information that researchers and medical doctors need to understand the genomic diversity and genetic ancestry of human populations. It includes a range of functions and a friendly graphical user interface for visualizing genomic diversity, population relationships (genetic affinity), ancestral makeup, footprints of natural selection, and population history. PGG.Population also provides a dynamic feature that lets users analyze their own data and compare it with the population samples in the database, which are updated promptly when new data become available. The long-term aim of PGG.Population, together with the joint efforts of other researchers who publish or re-publish their data and visualize their results dynamically through online illustrations in our database, is to bridge evolutionary genetic studies and future precision medicine.

Data processing

Data Collection
We manually searched for information on each enrolled population online and in the literature (Figure right). Genome-wide genotyping data or whole-genome NGS data were collected for each human population. These genomic data cover not only general populations studied by international projects such as HapMap [1], HGDP [2], the 1000 Genomes Project [3], the HUGO Pan-Asia SNP Project [4], the Human Origins data set [5], and the Simons Genome Diversity Project [6], but also indigenous or isolated populations contributed by regional sequencing efforts, such as Tibetans [7], Sherpas [8], Xinjiang’s Uyghurs [9], and ethnic groups with genomes deposited in the Estonian Biocentre. The list of populations and genomes will be updated as new data are published. In the current version of PGG.Population, all genomic positions are based on Human Genome Build 37.

Data integration
Genotyping data sets from diverse platforms were assembled for further analysis (Figure right). Individual data genotyped on the same platform, such as Illumina arrays, were pooled directly, and the data of the respective individuals were then extracted with PLINK using the filters described in the next section (DC1 in Figure right).

In principle, we do not merge data obtained from distinct platforms (Illumina and Affymetrix arrays), owing to the small intersection of SNPs between them. However, data were combined when the data from both platforms were valuable for understanding the demographic events of ethnic groups (DC2 in Figure right). For example, we pooled individual data genotyped on the Affymetrix Genome-Wide Human SNP Array 6.0 and the Illumina HumanOmni1-Quad BeadChip when exploring and reconstructing the population structure of Sherpas and Tibetans; the method is described in detail in Zhang et al. [8]. To minimize strand ambiguity, all A/T and G/C markers were removed before the data were combined.
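The removal of A/T and G/C markers can be illustrated with a minimal Python sketch. It is not the pipeline's actual code: the function names `is_strand_ambiguous` and `drop_ambiguous` and the dictionary layout are assumptions made for illustration. The idea is that the two alleles of an A/T or G/C SNP are complements of each other, so the strand cannot be inferred from the alleles alone.

```python
# Hypothetical helper names; each allele is a single-letter string.
AMBIGUOUS_PAIRS = {frozenset("AT"), frozenset("CG")}


def is_strand_ambiguous(allele1, allele2):
    """Return True for A/T and C/G SNPs, whose alleles are each other's complement."""
    return frozenset((allele1.upper(), allele2.upper())) in AMBIGUOUS_PAIRS


def drop_ambiguous(snps):
    """Keep only SNPs whose strand can be resolved from the alleles.

    `snps` maps a SNP ID to an (allele1, allele2) tuple (assumed layout).
    """
    return {
        snp_id: alleles
        for snp_id, alleles in snps.items()
        if not is_strand_ambiguous(*alleles)
    }
```

For example, calling `drop_ambiguous` on a panel containing an A/T SNP and an A/G SNP would keep only the A/G SNP.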

The NGS data were combined flexibly with (DC1 in Figure 1) or without genotyping data (DC3 in Figure right), depending on the requirements of downstream analyses. When NGS data were combined with genotyping data, strand information was determined from the whole-genome sequence data based on Human Genome Build 37 positions, and the strand of the genotyping data was flipped to match that of the sequence data. Only SNPs shared between the NGS and genotyping data were retained for further analysis.
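The strand matching and SNP intersection described above can be sketched as follows. This is an illustrative sketch under stated assumptions, not the database's implementation: the function names and the `(chromosome, position) -> (allele1, allele2)` layout are hypothetical, and strand-ambiguous A/T and G/C SNPs are assumed to have been removed beforehand.

```python
COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}


def flip(alleles):
    """Return the alleles as read from the opposite strand."""
    return tuple(COMPLEMENT[a] for a in alleles)


def harmonize(array_snps, ngs_snps):
    """Return the array SNPs shared with the NGS data, on the NGS strand.

    Both inputs map (chromosome, position) on Build 37 to an allele pair
    (assumed layout). SNPs present in only one input are dropped.
    """
    merged = {}
    for key in array_snps.keys() & ngs_snps.keys():
        target = set(ngs_snps[key])
        alleles = array_snps[key]
        if set(alleles) == target:
            merged[key] = alleles            # already on the NGS strand
        elif set(flip(alleles)) == target:
            merged[key] = flip(alleles)      # flip to match the NGS strand
        # Otherwise the alleles disagree even after flipping: drop the SNP.
    return merged
```

Dropping SNPs whose alleles still disagree after flipping is a conservative choice made for this sketch; the source does not say how such sites were handled.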

High-coverage NGS data (≥ 20×) were integrated from BAM files for further analysis (DC3 in Figure right). Low-coverage NGS data (< 20×) were not combined, because their VCF files were generated from data that differ in read depth, coverage, and variant-calling process. NGS reads were mapped to the human reference genome (Build 37) using the Burrows-Wheeler Aligner (BWA) [10]. SNP calling and raw variant filtering were carried out using the HaplotypeCaller module and the variant quality score recalibration (VQSR) module in GATK [11, 12], respectively.

These steps generate different pooled data sets (‘Illumina’, ‘Affymetrix’, ‘Illumina-Affymetrix’, and ‘NGS’ data sets), which allow flexible reconstruction of the histories of diverse populations. When a population is covered by several data sets, we selected the latest and most representative one. A notable example is Xinjiang’s Uyghurs, for which the data published by Feng et al. [9] were included as the most representative data set because they consist of around 1,000 samples from diverse geographical regions.

Quality control
We filtered each assembled data set at both the single nucleotide polymorphism (SNP) level and the individual level (QC1 in Figure right). At the SNP level, we removed SNPs with a call rate of < 90% (across all individuals). At the individual level, we required at least 90% genotyping completeness for each individual (across all SNPs). We also removed recently related individuals by filtering out one individual from every pair whose identity by descent (IBD) was > 35%. All of these analyses were performed with PLINK v1.07 [13]. Only biallelic variants were used for downstream analysis. Outliers were removed on the basis of principal component analysis (PCA) for each data set. For each “NGS” data set, only nucleotide sites that passed universal filters were retained, because variant calling can be challenging in complex genomic regions [6] (QC2 in Figure right).
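The call-rate and relatedness thresholds above can be made concrete with a small pure-Python sketch. It only illustrates the logic and is not a substitute for PLINK. The function names, the `None` encoding for missing calls, the choice to compute individual completeness over the SNPs that survive the SNP-level filter, and the choice of which member of a related pair to drop are all assumptions.

```python
def call_rate(values):
    """Fraction of non-missing genotype calls; missing calls are None."""
    if not values:
        return 0.0
    return sum(v is not None for v in values) / len(values)


def qc_filter(genotypes, min_snp_rate=0.90, min_ind_rate=0.90):
    """Apply the SNP-level filter, then the individual-level filter.

    `genotypes` maps an individual ID to a list of calls (0/1/2 or None),
    all in the same SNP order (assumed layout).
    Returns (kept individual IDs, kept SNP indices).
    """
    ids = list(genotypes)
    n_snps = len(genotypes[ids[0]])
    keep_snps = [
        j for j in range(n_snps)
        if call_rate([genotypes[i][j] for i in ids]) >= min_snp_rate
    ]
    keep_ids = [
        i for i in ids
        if call_rate([genotypes[i][j] for j in keep_snps]) >= min_ind_rate
    ]
    return keep_ids, keep_snps


def drop_related(pairs, threshold=0.35):
    """Pick one individual to remove from each pair with IBD above threshold.

    `pairs` is an iterable of (id1, id2, ibd_proportion). Removing the second
    member of a pair is an arbitrary choice made for this sketch.
    """
    removed = set()
    for id1, id2, ibd in pairs:
        if ibd > threshold and id1 not in removed and id2 not in removed:
            removed.add(id2)
    return removed
```

Skipping pairs in which one member has already been removed avoids discarding more individuals than necessary.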

To test for batch effects in each merged data set, PCA and FST-based analyses were performed (QC1 in Figure right). Population samples from the same group that were genotyped on different platforms were used specifically to examine potential batch effects. In the absence of a considerable batch effect, these samples are expected to show close genetic affinity (FST < 0.004) and to cluster together in the PCA plot [14]. Data sets with a significant batch effect were excluded from further analysis.
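The FST-based part of this check can be sketched in Python. Note that this sketch swaps in Hudson's FST estimator (a ratio of sums over SNPs) because it is compact; it is not the Weir and Cockerham estimator used elsewhere in this guide, and the two can give slightly different values. The function names and the input layout are assumptions.

```python
def hudson_fst(p1, p2, n1, n2):
    """Hudson's FST between two samples, as a ratio of sums over SNPs.

    p1, p2: per-SNP allele frequencies in the two samples.
    n1, n2: number of sampled chromosomes in each sample (2 x individuals).
    """
    numerator = 0.0
    denominator = 0.0
    for a, b in zip(p1, p2):
        numerator += (a - b) ** 2 - a * (1 - a) / (n1 - 1) - b * (1 - b) / (n2 - 1)
        denominator += a * (1 - b) + b * (1 - a)
    return numerator / denominator


def shows_batch_effect(p1, p2, n1, n2, threshold=0.004):
    """Flag two same-population samples from different platforms.

    The 0.004 threshold is the one quoted in the text above.
    """
    return hudson_fst(p1, p2, n1, n2) >= threshold
```

Two samples of the same population with identical allele frequencies give a slightly negative estimate, which is well below the threshold, whereas samples fixed for different alleles give an estimate of 1.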

Analysis of genomic diversity and inference of genetic ancestry
Y-chromosomal haplogroups were determined on the basis of key mutations commonly used in the nomenclature of human paternal lineages. We developed an algorithm that searches our sequence data for all possible combinations of these key mutations to determine the fine-scale paternal haplogroup of each sample. mtDNA haplogroups were defined as described by Weissensteiner et al. [15]. To estimate genetic affinities, FST between each pair of populations was calculated following Weir and Cockerham [16]. To investigate fine-scale population structure, we performed a series of PCAs using EIGENSOFT [17]. We applied ADMIXTURE v1.30 [18] for unsupervised clustering analysis. Because the ADMIXTURE model does not take linkage disequilibrium (LD) into account, we pruned each data set with PLINK v1.07 using an r² cutoff of 0.1 in sliding windows of 50 SNPs, advanced by 10 SNPs at a time (--indep-pairwise 50 10 0.1). For each data set, we ran ADMIXTURE with random seeds from K = 2 to K = 20, with 10 replicates for each K, using default parameters and 5-fold cross-validation (--cv=5). We used runs of homozygosity (ROH) to measure genetic diversity for each population. Natural selection analysis was performed only for NGS data sets, using selscan [19].
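The LD pruning and ADMIXTURE runs described above can be laid out as command lines. The Python sketch below only builds the argument lists and does not execute anything. The file prefixes, the function names, and the way seeds are drawn are assumptions; only the `--indep-pairwise 50 10 0.1` and `--cv=5` settings and the K range and replicate count come from the text.

```python
import random


def plink_prune_commands(bfile, out_prefix):
    """Two PLINK calls: compute the prune list, then extract the kept SNPs."""
    return [
        ["plink", "--bfile", bfile,
         "--indep-pairwise", "50", "10", "0.1", "--out", out_prefix],
        ["plink", "--bfile", bfile,
         "--extract", out_prefix + ".prune.in",
         "--make-bed", "--out", out_prefix + "_pruned"],
    ]


def admixture_commands(bed_file, k_min=2, k_max=20, replicates=10, rng_seed=0):
    """One ADMIXTURE call per (K, replicate), each with its own random seed.

    Drawing the seeds from a seeded generator (rng_seed) is an assumption
    made here so that the list of runs is reproducible.
    """
    rng = random.Random(rng_seed)
    commands = []
    for k in range(k_min, k_max + 1):
        for _ in range(replicates):
            seed = rng.randrange(1, 2**31)
            commands.append(
                ["admixture", "--cv=5", f"--seed={seed}", bed_file, str(k)]
            )
    return commands
```

Each list could be passed to `subprocess.run`. Because ADMIXTURE names its output files after the input file and K, replicates for the same K would likely need to run in separate working directories to avoid overwriting one another.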

References

1. Mathieson I, Lazaridis I, Rohland N, Mallick S, Patterson N, Roodenberg SA, Harney E, Stewardson K, Fernandes D, Novak M, et al: Genome-wide patterns of selection in 230 ancient Eurasians. Nature 2015, 528:499-503.

2. Li JZ, Absher DM, Tang H, Southwick AM, Casto AM, Ramachandran S, Cann HM, Barsh GS, Feldman M, Cavalli-Sforza LL, Myers RM: Worldwide human relationships inferred from genome-wide patterns of variation. Science 2008, 319:1100-1104.

3. 1000 Genomes Project Consortium, Auton A, Brooks LD, Durbin RM, Garrison EP, Kang HM, Korbel JO, Marchini JL, McCarthy S, McVean GA, Abecasis GR: A global reference for human genetic variation. Nature 2015, 526:68-74.

4. HUGO Pan-Asian SNP Consortium, Abdulla MA, Ahmed I, Assawamakin A, Bhak J, Brahmachari SK, Calacal GC, Chaurasia A, Chen CH, Chen J, et al: Mapping human genetic diversity in Asia. Science 2009, 326:1541-1545.

5. Patterson N, Moorjani P, Luo Y, Mallick S, Rohland N, Zhan Y, Genschoreck T, Webster T, Reich D: Ancient admixture in human history. Genetics 2012, 192:1065-1093.

6. Mallick S, Li H, Lipson M, Mathieson I, Gymrek M, Racimo F, Zhao M, Chennagiri N, Nordenfelt S, Tandon A, et al: The Simons Genome Diversity Project: 300 genomes from 142 diverse populations. Nature 2016, advance online publication.

7. Lu D, Lou H, Yuan K, Wang X, Wang Y, Zhang C, Lu Y, Yang X, Deng L, Zhou Y, et al: Ancestral Origins and Genetic History of Tibetan Highlanders. Am J Hum Genet 2016, 99:580-594.

8. Zhang C, Lu Y, Feng Q, Wang X, Lou H, Liu J, Ning Z, Yuan K, Wang Y, Zhou Y, et al: Differentiated demographic histories and local adaptations between Sherpas and Tibetans. Genome Biol 2017, 18:115.

9. Feng Q, Lu Y, Ni X, Yuan K, Yang Y, Yang X, Liu C, Lou H, Ning Z, Wang Y, et al: Genetic history of Xinjiang's Uyghurs suggests Bronze Age multiple-way contacts in Eurasia. Mol Biol Evol 2017.

10. Li H, Durbin R: Fast and accurate long-read alignment with Burrows-Wheeler transform. Bioinformatics 2010, 26:589-595.

11. DePristo MA, Banks E, Poplin R, Garimella KV, Maguire JR, Hartl C, Philippakis AA, del Angel G, Rivas MA, Hanna M, et al: A framework for variation discovery and genotyping using next-generation DNA sequencing data. Nat Genet 2011, 43:491-498.

12. McKenna A, Hanna M, Banks E, Sivachenko A, Cibulskis K, Kernytsky A, Garimella K, Altshuler D, Gabriel S, Daly M, DePristo MA: The Genome Analysis Toolkit: a MapReduce framework for analyzing next-generation DNA sequencing data. Genome Res 2010, 20:1297-1303.

13. Purcell S, Neale B, Todd-Brown K, Thomas L, Ferreira MA, Bender D, Maller J, Sklar P, de Bakker PI, Daly MJ, Sham PC: PLINK: a tool set for whole-genome association and population-based linkage analyses. Am J Hum Genet 2007, 81:559-575.

14. Nakatsuka N, Moorjani P, Rai N, Sarkar B, Tandon A, Patterson N, Bhavani GS, Girisha KM, Mustak MS, Srinivasan S, et al: The promise of discovering population-specific disease-associated genes in South Asia. Nat Genet 2017, 49:1403-1407.

15. Weissensteiner H, Pacher D, Kloss-Brandstatter A, Forer L, Specht G, Bandelt HJ, Kronenberg F, Salas A, Schonherr S: HaploGrep 2: mitochondrial haplogroup classification in the era of high-throughput sequencing. Nucleic Acids Res 2016, 44:W58-63.

16. Weir BS: Estimating F-statistics: A historical view. Philos Sci 2012, 79:637-643.

17. Price AL, Patterson NJ, Plenge RM, Weinblatt ME, Shadick NA, Reich D: Principal components analysis corrects for stratification in genome-wide association studies. Nat Genet 2006, 38:904-909.

18. Alexander DH, Novembre J, Lange K: Fast model-based estimation of ancestry in unrelated individuals. Genome Res 2009, 19:1655-1664.

19. Szpiech ZA, Hernandez RD: selscan: an efficient multithreaded program to perform EHH-based scans for positive selection. Mol Biol Evol 2014, 31:2824-2827.

How to interact with the figures

We provide several interactive elements for each figure to enhance the user experience. Here we use the ADMIXTURE section of our website (Figure attached below) as an example to introduce the interactive events and how to control them to obtain the best view.

(1). Mouse click event: Clicking on a color bar (or the figure legend) hides the components represented by the corresponding color in the plot; clicking the inactive color (shown in grey) again restores it to the plot.

(2). Mouse hover event: Hovering over an element in a plot triggers an information box containing detailed data about that element. For instance, by hovering over a bar in the ADMIXTURE plot, users can find which

(3). Mouse wheel scroll: In the ADMIXTURE and ROH plots, scrolling zooms the plot in or out, from a minimum of 1 individual to a maximum of all samples.

(4). Data view and figure download toolbox: For each plot we provide a toolbox that lets users check the numerical data and download the adjusted figure. The buttons for these functions are at the upper right of each plot: the page-shaped button opens a new box containing tab-separated data, and the download button converts the current plot to JPEG format and offers it for download.

(5). Option menus: In the ADMIXTURE and ROH sections, we provide option menus with which users can change the number of ancestries (K) in the ADMIXTURE plot and the ROH length in the ROH plot. These options change the level of the analysis results and their presentation.

(6). Data download button: We provide a download button for each plot so that users can obtain the data underlying the figure.

Data Resources

The following table lists the data resources and their platforms, so that users interested in a population can access the original data and learn the background of the original study.

Author | Title | Year | Journal | Platform | PMID | RawDataLink