Additional file 1: Figure S1. Benchmarking of libraries generated with low-pass kits sequenced at intermediate coverage levels. a) Mean coverage across the library types. b) Per-sample duplicate rate over (deduplicated) sequencing coverage. c) Genotype quality (GQ) as a function of mean GQ (averaged over 2 × 12 samples). Fraction of variant calls that overlap between replicates (d), and their genotype concordance (e) at either all variants (GQ > 0) or high-confidence variants (GQ > 20). f) Recall, g) Precision, and h) Non-reference concordance rates (NCR) computed per sample against the 1000 Genomes high-coverage call set [32] as “truth”. The single HP4 sample with coverage > 10x was excluded from this comparison. Figure S2. a) Overview of data types available for participants and how they overlap. b) Distribution of deduplicated sequencing coverage per sample for low-pass samples, c) TruSeq PCR-free high-coverage samples, d) TruSeq Nano high-coverage samples. e) Distribution of sequencing duplicate rates per sample for low-pass samples, f) for TruSeq PCR-free samples, and g) for TruSeq Nano samples. h) Breakdown of number of individuals by self-reported ethnicity and sequencing type. Figure S3. Effect of GQ filtering on indel calling performance. a) Recall, b) Precision, and c) NCR for indels over varying GQ thresholds. Figure S4. Accuracy of flagged sites by flag type. a) Overview of the different flag types that characterize variants by comparing the (filtered) sequencing-based genotype with the genotype after imputation. A call is flagged with IM = 0 if the sequencing-based genotype and the imputed genotype agree fully. Given low coverage, we consider the lack of sequencing evidence for an imputed allele as “not inconsistent”, while the disappearance of an allele after imputation is categorized as “inconsistent”.
IM = 1 therefore flags imputed calls that are not inconsistent with the sequencing-based call (either because it was missing or because we may have observed only one of two alleles in sequencing). IM = 2 and IM = 3 flag sites that are inconsistent between sequencing-based and imputed calls, where IM = 2 calls were heterozygous in sequencing (potentially due to sequencing or mapping artifacts, contamination, or an error in imputation) and IM = 3 calls were homozygous for the opposite allele. b) Fraction of SNV calls in each IM flag category. c) Fraction of indel calls in each IM flag category. d) Recall (normalized to each individual’s overall SNV recall), e) Precision, and f) NCR of SNVs. g) Normalized recall, h) Precision, and i) NCR of indels. Figure S5. Detailed performance (recall, precision, and NCR) of SNV and indel calling both genomewide (including repetitive regions) and in high-confidence regions only, shown over coverage. a) SNVs genomewide, b) SNVs in high-confidence regions, c) Indels genomewide, d) Indels in high-confidence regions. Figure S6. Performance comparison across different pipeline stages/runs. a) Overview of tested call sets. “Single” refers to individually called mid-pass data (GQ > 17). “MP” and “MP-HP” refer to the joint-called (“joint”) and imputed (“imp”) call sets using mid-pass data from 1510 individuals (MP) and mid-pass data from 1410 individuals plus high-pass data from 100 high-pass individuals (MP-HP). For more details, see Methods. b) Recall, c) Precision, and d) NCR for SNVs. e) Recall, f) Precision, and g) NCR for indels. Figure S7. Analysis of European admixture in the study cohort. ADMIXTURE was run assuming two populations on the cohort, with 91 British individuals from 1000 Genomes (GBR) included to capture European ancestry. Shown are the proportions of ancestry estimated (population 1 = red, population 2 = orange). Individuals are ordered by cohort (GBR/Polynesian).
Analysis of PC1 from PC analysis versus proportion of population 1 ancestry from ADMIXTURE analysis found that PC1 is highly correlated with the degree of estimated European ancestry (Spearman’s ρ = −0.89, p 5%) variants in the study cohort that are either absent from (a) or rare in (b) 1000 Genomes. Indels located in high-confidence regions of the genome and all SNVs were included in the analysis.
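
The IM flag categories described for Figure S4a can be read as a small decision rule. The following is a minimal sketch of that rule, not the pipeline's actual implementation: it assumes biallelic sites, genotypes encoded as pairs of allele indices (0 = reference, 1 = alternate), and a missing sequencing-based call encoded as None. The function name im_flag and this encoding are hypothetical and chosen only for illustration.

```python
from typing import Optional, Tuple

Genotype = Tuple[int, int]  # assumed encoding: allele indices, e.g. (0, 1)


def im_flag(seq_gt: Optional[Genotype], imp_gt: Genotype) -> int:
    """Return a hypothetical IM flag (0-3) for one call at a biallelic site."""
    # Missing sequencing-based call: nothing contradicts the imputed genotype.
    if seq_gt is None:
        return 1
    # Full agreement, ignoring allele order.
    if sorted(seq_gt) == sorted(imp_gt):
        return 0
    seq_alleles, imp_alleles = set(seq_gt), set(imp_gt)
    # Every sequenced allele survives imputation; the imputed call only adds
    # an allele that sequencing may have missed at low coverage.
    if seq_alleles <= imp_alleles:
        return 1
    # An allele seen in sequencing disappeared after imputation.
    if len(seq_alleles) == 2:
        return 2  # sequencing-based call was heterozygous
    return 3      # sequencing-based call was homozygous for the opposite allele


if __name__ == "__main__":
    examples = [((0, 1), (0, 1)), (None, (0, 1)), ((0, 0), (0, 1)),
                ((0, 1), (1, 1)), ((0, 0), (1, 1))]
    for seq, imp in examples:
        print(seq, imp, im_flag(seq, imp))
```

Under these assumptions, a homozygous-reference sequencing call imputed as heterozygous receives IM = 1, because only one of the two alleles may have been sampled, whereas a heterozygous sequencing call imputed as homozygous receives IM = 2.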