Summary

This report documents the steps to prepare genotype data for imputation with the TOPMed Imputation Server and to process the imputed results:

  • SNP-level quality control using PLINK
  • Conversion to VCF format
  • (Optional) Liftover to GRCh38
  • Sorting, compressing, and indexing VCF files
  • Preparing files for upload
  • Decompressing, merging, and annotating the imputed data

Required Tools and Files

Software Tools

Tool       Purpose                                   Install Command
PLINK      Quality control, format conversion        conda install -c bioconda plink
vcf-sort   Sort VCF files (part of vcftools)         Included in vcftools or use bcftools sort
bgzip      Compress VCF files to .vcf.gz             conda install -c bioconda htslib
tabix      Index compressed VCFs                     conda install -c bioconda htslib
bcftools   Annotate VCFs (alternative VCF sorter)    conda install -c bioconda bcftools
Perl       Required to run HRC-1000G-check-bim.pl    Pre-installed on most systems

Reference Files

File                                  Description
HRC-1000G-check-bim.pl                Script to harmonize SNP positions/alleles
HRC.r1-1.GRCh37.wgs.mac5.sites.tab    Reference SNP list for QC

Preparing pre-Imputation data

Additional SNP-Level Quality Control

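plinkFile <- "ADNI1_QC_FINAL"  # PLINK file prefix; must match the prefix used in the shell steps below
dataDir <- getwd()             # change this to the folder that holds the PLINK files
setwd(dataDir)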

# Read BIM file
bim <- read.table(paste0(plinkFile, ".bim"), header = FALSE, stringsAsFactors = FALSE)
colnames(bim) <- c("CHR", "SNP", "CM", "BP", "A1", "A2")

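# Keep chromosomes 1–22 and X (PLINK may code X as 23 in the BIM file)
bim <- bim[bim$CHR %in% c(as.character(1:22), "X", "23"), ]

# Keep SNPs whose two alleles are both A/C/G/T (drops indels and missing alleles)
valid_alleles <- c("A", "C", "T", "G")
bim <- bim[bim$A1 %in% valid_alleles & bim$A2 %in% valid_alleles, ]

# Remove every SNP that shares its chromosome and position with another SNP
pos_key <- paste(bim$CHR, bim$BP, sep = ":")
dup_pos <- pos_key[duplicated(pos_key)]
bim <- bim[!pos_key %in% dup_pos, ]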

# Save valid SNPs
write.table(bim$SNP, "ValidSNPs.txt", quote = FALSE, row.names = FALSE, col.names = FALSE)
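
For readers who prefer the command line, the same three checks can be sketched with awk. This is only an illustration of the filtering logic described above, not a replacement for the R code: filter_bim is a hypothetical helper name, and the sketch assumes a standard six-column BIM file and treats a duplicate as two SNPs sharing both chromosome and position.

```shell
# Hypothetical helper: print the IDs of SNPs that pass the three BIM checks.
# Assumes columns: CHR, SNP, CM, BP, A1, A2 (standard PLINK .bim layout).
filter_bim() {
  local bim="$1"
  # The file is read twice: pass 1 counts SNPs per chromosome:position,
  # pass 2 prints the SNP ID when that position occurs exactly once.
  awk '
    function ok_chr(c)    { return (c ~ /^([1-9]|1[0-9]|2[0-3]|X)$/) }
    function ok_allele(a) { return (a ~ /^[ACGT]$/) }
    ok_chr($1) && ok_allele($5) && ok_allele($6) {
      key = $1 ":" $4
      if (NR == FNR) { n[key]++ }
      else if (n[key] == 1) { print $2 }
    }
  ' "$bim" "$bim"
}
```

Usage would look like: filter_bim ADNI1_QC_FINAL.bim > ValidSNPs.txt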

Convert to Binary Format

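# --extract keeps only the SNPs listed in ValidSNPs.txt. --file expects text input (.ped/.map); use --bfile if the data are already in binary format.
system(paste("plink --file", plinkFile, "--extract ValidSNPs.txt --output-chr M --make-bed --out", plinkFile))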

Allele Frequency Calculation

system(paste("plink --bfile", plinkFile, "--freq --out", plinkFile))

BIM Check

hrc_script <- "/path/to/HRC-1000G-check-bim.pl"
hrc_ref <- "/path/to/HRC.r1-1.GRCh37.wgs.mac5.sites.tab"

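# -h: the reference is the HRC panel, -r: reference sites file, -b: BIM file, -f: PLINK frequency file
system(paste("perl", hrc_script,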
             "-h -r", hrc_ref,
             "-b", paste0(plinkFile, ".bim"),
             "-f", paste0(plinkFile, ".frq"),
             "-c -p EUR -o"))

Run SNP Update Script

# Run-plink.sh is written by HRC-1000G-check-bim.pl in the previous step
system("chmod 755 Run-plink.sh")
system("./Run-plink.sh")

Save Reference Alleles

for (i in 1:22) {
  bim_chr <- read.table(paste0(plinkFile, "-updated-chr", i, ".bim"), header = FALSE)
  write.table(bim_chr[, c(2, 6)], paste0("snps_", i, ".txt"),
              quote = FALSE, row.names = FALSE, col.names = FALSE, sep = "\t")
}

Convert to Sorted VCF and Index

for i in {1..22}; do
  vcf-sort ADNI1_QC_FINAL-updated-chr$i.vcf | bgzip -c > ADNI1-updated-chr$i.vcf.gz
  tabix -p vcf ADNI1-updated-chr$i.vcf.gz
  echo "Processed chr$i"
done
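
Before uploading, it can be worth confirming that each per-chromosome VCF really is sorted and contains a single chromosome. The sketch below is one possible check, not part of the original workflow: check_vcf_sorted is a hypothetical helper that reads an uncompressed VCF on standard input and returns a non-zero status when positions decrease or chromosomes are mixed.

```shell
# Hypothetical helper: verify that a VCF stream holds one chromosome
# and that positions never decrease. Reads uncompressed VCF on stdin.
check_vcf_sorted() {
  awk -F'\t' '
    /^#/ { next }                                  # skip header lines
    chrom == "" { chrom = $1 }                     # remember the first chromosome seen
    $1 != chrom { printf "mixed chromosomes: %s and %s\n", chrom, $1; bad = 1; exit }
    ($2 + 0) < prev { printf "unsorted at %s:%s\n", $1, $2; bad = 1; exit }
    { prev = $2 + 0; n++ }
    END { if (bad) exit 1; printf "ok: %d records on %s\n", n, chrom }
  '
}
```

Because bgzip output is gzip-compatible, a compressed file can be checked with: gzip -dc ADNI1-updated-chr1.vcf.gz | check_vcf_sorted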

Output pre-Imputation

Each chromosome will produce:

  • ADNI1-updated-chr<i>.vcf.gz
  • ADNI1-updated-chr<i>.vcf.gz.tbi

These files can now be uploaded to the TOPMed or Michigan Imputation Server.


Running Imputation


Preparing imputed data

Decompress results

for f in *.zip
do
  unzip -P XXXXXXX "$f"  # replace XXXXXXX with the password emailed by the imputation server
done

Merge chromosome datasets in one file

rm -f merge.list  # start from an empty list so reruns do not append duplicate entries
for i in {1..22}
do
  plink --vcf chr$i.dose.vcf.gz --make-bed --double-id --out chr$i.final
  echo chr$i.final >> merge.list
done

plink --merge-list merge.list --make-bed --out ADNI1.merged
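
A merge fails late if one chromosome was not converted, so a small pre-check can save time. The sketch below is an assumption about how such a check might look: build_merge_list is a hypothetical helper that rewrites the list from scratch and reports any chrN.final file that is missing or empty.

```shell
# Hypothetical helper: write the 22 PLINK prefixes to a merge list and
# return non-zero if any .bed/.bim/.fam file is missing or empty.
build_merge_list() {
  local out="$1" missing=0 i ext
  : > "$out"                       # truncate so reruns do not append duplicates
  for i in $(seq 1 22); do
    for ext in bed bim fam; do
      if [ ! -s "chr$i.final.$ext" ]; then
        echo "missing or empty: chr$i.final.$ext" >&2
        missing=1
      fi
    done
    echo "chr$i.final" >> "$out"
  done
  return "$missing"
}
```

It could be combined with the merge as: build_merge_list merge.list && plink --merge-list merge.list --make-bed --out ADNI1.merged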

Annotate to rsID format (optional)

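plink --bfile ADNI1.merged --recode vcf-iid bgz --out ADNI1.impQC  # vcf-iid: use the IID alone as the sample name
tabix -p vcf ADNI1.impQC.vcf.gz

# The annotation VCF must be tabix-indexed and use the same genome build and chromosome names as the imputed data.
bcftools annotate \
  --annotations /nfs/users2/rg/nvilortejedor/ALFA-GWAS/HRC_Imputation/annotation_GRCh37p13/All_20180423.vcf.gz \
  --columns ID --threads 20 -O z \
  -o ADNI1.impQC.rs.vcf.gz ADNI1.impQC.vcf.gz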
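
bcftools annotate matches records by chromosome name and position, so no rsIDs are transferred if one file uses names such as chr1 and the other uses 1. A quick way to compare the two files is to list the chromosome names each one contains. The helper below, vcf_chroms, is a hypothetical name used only for this sketch.

```shell
# Hypothetical helper: print each chromosome name found in a VCF stream,
# once, in order of first appearance. Reads uncompressed VCF on stdin.
vcf_chroms() {
  awk -F'\t' '!/^#/ && !seen[$1]++ { print $1 }'
}
```

For example: gzip -dc ADNI1.impQC.vcf.gz | vcf_chroms. For a large indexed file such as the annotation VCF, tabix -l lists the names from the index without reading the whole file.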

Additional post-Imputation QC

This step is left to the user. We normally check imputation quality, minor allele frequency (MAF), Hardy-Weinberg equilibrium (HWE), …
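
As one illustration of the imputation quality and MAF checks, the sketch below filters a dosage VCF on its INFO column. Everything specific here is an assumption: filter_imputed is a hypothetical helper, the INFO keys R2 and MAF are assumed to be present as in Minimac-style output, and the default thresholds (R2 of at least 0.3, MAF of at least 0.01) are common illustrative choices rather than values taken from this workflow. HWE is not covered by this sketch.

```shell
# Hypothetical helper: keep header lines and variants whose INFO column has
# R2 >= min_r2 and MAF >= min_maf. Reads uncompressed VCF on stdin.
filter_imputed() {
  local min_r2="${1:-0.3}" min_maf="${2:-0.01}"   # illustrative defaults
  awk -F'\t' -v min_r2="$min_r2" -v min_maf="$min_maf" '
    /^#/ { print; next }
    {
      r2 = ""; maf = ""
      n = split($8, kv, ";")
      for (j = 1; j <= n; j++) {
        # Anchored matches, so a key such as ER2= is not mistaken for R2=
        if (kv[j] ~ /^R2=/)  r2  = substr(kv[j], 4)
        if (kv[j] ~ /^MAF=/) maf = substr(kv[j], 5)
      }
      if (r2 != "" && maf != "" && r2 + 0 >= min_r2 + 0 && maf + 0 >= min_maf + 0) print
    }
  '
}
```

A possible call is: gzip -dc chr1.dose.vcf.gz | filter_imputed 0.3 0.01 | bgzip -c > chr1.filtered.vcf.gz, where the output name is a placeholder.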

Remove intermediate files

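plink --vcf ADNI1.impQC.rs.vcf.gz --make-bed --double-id --out ADNI1.impQC.rs  # convert to PLINK format first: the rm below also deletes the annotated VCF
rm ADNI1.impQC*.vcf.gz*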

Output After Imputation

The merged, annotated dataset consists of:

  • ADNI1.impQC.rs.bed
  • ADNI1.impQC.rs.bim
  • ADNI1.impQC.rs.fam

References

  • Li Y, Willer C, Sanna S, Abecasis G. Genotype imputation. Annu Rev Genomics Hum Genet. 2009;10:387-406. doi: 10.1146/annurev.genom.9.081307.164242. PMID: 19715440; PMCID: PMC2925172.

  • Das S, Forer L, Schönherr S, Sidore C, Locke AE, Kwong A, Vrieze S, Chew EY, Levy S, McGue M, Schlessinger D, Stambolian D, Loh PR, Iacono WG, Swaroop A, Scott LJ, Cucca F, Kronenberg F, Boehnke M, Abecasis GR, Fuchsberger C. Next-generation genotype imputation service and methods. Nature Genetics 48, 1284–1287 (2016).


Organized by Alzheimer’s Association, ISTAART Neuroimaging PIA. Working group Brain Imaging Genetics.

Special thanks to ADNI for providing the datasets.

© 2025 AAIC Workshop Basics of Genetics • Maintained by @GeneticNeuroStats