Genome-wide Association Studies

Author

Yuetiva Robles

Published

July 20, 2025

Summary

This document includes instructions for running a simple genome-wide association study (GWAS) using the command line program PLINK (version 1.9) and visualizing data using R and other tools. Additional information and the PLINK software can be found on the PLINK website.

Genome-Wide Association Studies using PLINK

Two primary genome-wide association tests performed in PLINK are linear regression for continuous outcomes like biomarkers and logistic regression for binary outcomes like AD case/control status. This tutorial will give an example of each analysis type. These analyses and additional association analyses are described in more detail on the PLINK Association analysis webpage.

Deconstructing PLINK commands

The commands for running linear and logistic regression analyses in PLINK are very similar, so we'll deconstruct the PLINK commands here before moving on to running the analyses.

--bfile fileprefix
Here fileprefix is the filename before the .bed/.bim/.fam in the set of files containing your genetic data.

--allow-no-sex
This is necessary if your PLINK .fam file does not have SEX included, since PLINK will automatically ignore missing-sex samples.

--adjust
Optional: this command creates an additional file of adjusted significance values that correct for multiple testing (Bonferroni, Benjamini & Hochberg, etc). The real benefit to using this command is that this file is sorted by unadjusted p-value, smallest to largest and provides a quick and easy way to view top SNPs.

--linear hide-covar (or --logistic hide-covar)
Use the --linear command for linear regression when the outcome/phenotype is continuous or --logistic command for logistic regression for binary (case/control). Optional: adding the hide-covar modifier removes the regression results for the covariates from the output file and is highly recommended (see the end of this document for discussion).

--covar filename
Here filename is the name of the file with covariate data. This file should be text format and tab-delimited. It does not need to include everyone in the .fam file and there can be data for individuals not listed in the .fam file; PLINK will automatically ignore sample data with non-matching IDs.

--covar-name covar1,covar2
List the covariates to be added to the analysis, where covar1, covar2, etc are the column names used in the covariate file. When listing more than one covariate, separate each with a comma (no space). The order doesn't matter and PLINK will ignore any columns not listed. All covariates used must be numeric, use dummy variables for categories.

--pheno filename
Here filename is the name of the file with phenotype/outcome data. Since specific columns can be named, best practices are to use the same file for phenotype/outcome data and covariates.

--pheno-name phenotype
List the phenotype/outcome to be tested, where phenotype is the column name used in the phenotype file. Phenotypes must be numeric, continuous for linear regression and binary for logistic (control = 1, case = 2).

--pfilter 1
Optional: this command is used to specify a maximum p-value cutoff. Some SNPs will have p = NA, --pfilter 1 will remove these from the final results. Using other cutoff values (for example --pfilter 0.05) can provide even smaller results files if you don't care about SNPs with larger p-values.

--maf 0.02
Optional: this command is used to specify a minimum minor allele frequency (MAF) for SNPs to test. Typically smaller datasets may have some false positives with lower frequency SNPs and usually an MAF cutoff will help - typically values between 0.01 and 0.05 will be used for smaller datasets and larger datasets will usually use smaller values around 0.001.

--out outfileprefix
Here outfileprefix is the filename prefix to be used for output files created using this command. PLINK will automatically add the filename extension.

Linear regression analysis

To test SNP associations with hippocampal volume (HV), we'll repeat the linear regression analysis from the previous tutorial, this time in a GWAS using PLINK.

At the shell prompt, type:

PLINK="./plink/plink"
BFILE="data/files-imputation/postImputation/ADNI1.impQC.rs"
PHENO="my-analysis/ADNI1_HV_covars.txt"
OUTDIR="my-analysis"
cd ~
$PLINK --bfile $BFILE --allow-no-sex --adjust --linear hide-covar --covar $PHENO --covar-name AGE,SEX,ICV --pheno $PHENO --pheno-name HV --pfilter 1 --maf 0.02 --out $OUTDIR/ADNI1_HV_GWAS_p1_maf02

PLINK is very efficient and can run simple GWAS with relatively small datasets like this on personal computers. Note that this exact analysis has been run on a 2020 MacBook Pro laptop (16 GB RAM reserved) and took less than 20 minutes for >6.8 million SNPs and 562 samples.

Output files created with this command include:

ADNI1_HV_GWAS_p1_maf02.log
The log file created using the specified outfileprefix.

ADNI1_HV_GWAS_p1_maf02.assoc.linear
A text file created using the specified outfileprefix: columns are CHR, SNP, BP, A1, TEST, NMISS, BETA, STAT, P. SNPs are listed in the same order as in the .bim file (by chromosome and base pair position).

ADNI1_HV_GWAS_p1_maf02.assoc.linear.adjusted
Optional text file created using --adjusted in the PLINK command. Columns are CHR, SNP, UNADJ, GC, BONF, HOLM, SIDAK_SS, SIDAK_SD, FDR_BH, FDR_BY (where UNADJ is the unadjusted p-value and the other columns are p-values adjusted for the various multiple test correction methods). SNPs are listed in order by smallest unadjusted p-value to largest.

Logistic regression analysis

To test SNP associations with AD case/control status, we'll repeat the logistic regression analysis from the previous tutorial, this time in a GWAS using PLINK.

At the shell prompt, type:

PLINK="./plink/plink"
BFILE="data/files-imputation/postImputation/ADNI1.impQC.rs"
PHENO="my-analysis/ADNI1_HV_covars.txt"
OUTDIR="my-analysis"
cd ~
$PLINK --bfile $BFILE --allow-no-sex --adjust --logistic hide-covar --covar $PHENO --covar-name AGE,SEX --pheno $PHENO --pheno-name ADstatus --pfilter 1 --maf 0.02 --out $OUTDIR/ADNI1_ADstatus_GWAS_p1_maf02

Output files created with this command include:

ADNI1_ADstatus_GWAS_p1_maf02.log
The log file created using the specified outfileprefix.

ADNI1_ADstatus_GWAS_p1_maf02.assoc.logistic
A text file created using the specified outfileprefix: columns are CHR, SNP, BP, A1, TEST, NMISS, OR, STAT, P. SNPs are listed in the same order as in the .bim file (by chromosome and base pair position).

ADNI1_ADstatus_GWAS_p1_maf02.assoc.logistic.adjusted
Optional text file created using --adjusted in the PLINK command. Columns are CHR, SNP, UNADJ, GC, BONF, HOLM, SIDAK_SS, SIDAK_SD, FDR_BH, FDR_BY (where UNADJ is the unadjusted p-value and the other columns are p-values adjusted for the various multiple test correction methods). SNPs are listed in order by smallest unadjusted p-value to largest.

Visualize results

With the .adjusted file, a very quick and easy preview of the results can be done using the command-line. Previewing the first few rows of the .adjusted file will help identify the SNPs with the smallest p-values and let you determine if there were any genome-wide significant associations (p<5×10^-8). We will also check to see if our results from the previous tutorial are similar for rs429358 and rs769449. At the shell prompt, type:

HVGWAS="ADNI1_HV_GWAS_p1_maf02.assoc.linear"
ADGWAS="ADNI1_ADstatus_GWAS_p1_maf02.assoc.logistic"
cd ~/my-analysis
head $HVGWAS.adjusted
head $ADGWAS.adjusted

egrep -w 'rs429358|rs769449' $HVGWAS
egrep -w 'rs429358|rs769449' $ADGWAS

Manhattan and QQ plots

There are a number of R packages available to create Manhattan plots and QQ-plots such as ggplot2 and qqman. Another option is to use online tools like FUMA GWAS (Functional Mapping and Annotation of Genome-Wide Association Studies) which can provide a "one-stop shop" for visualizing, annotating, and prioritizing SNPs for downstream post-GWAS analyses covered in the later tutorial.

One of the easiest R packages to use and customize is qqman, which is what we'll demonstrate here. First, import the GWAS results and prepare the data. Warning, these are very large files and can take a while to load and manipulate. Note that in the Manhattan plot keeping millions of SNPs with P>0.05 is not going to be very different than the plot after removing these SNPs and shifting the y-axis to start at 1.3 instead of 0. The reduction in processing load and file size is noticeable, which is why we are filtering our GWAS results by p-value in the preparation below.

## load libraries
library(tidyverse)
library(here)
library(qqman)

## create variables for files/directories
gwasfile <- here("~","my-analysis","ADNI1_HV_GWAS_p1_maf02.assoc.linear")
outdir <- here("~","my-analysis")

## import GWAS results file
gwas <- read_table(gwasfile)
glimpse(gwas)
# 6813933 SNPs

## create and save QQ plot before filtering SNPs
## use png for a smaller file; pdf for publication quality
## IMPORTANT: The QQ plot takes several minutes to be generated.
png(file = paste0(outdir, "/ADNI1_HV_GWAS-qqplot.png"))
qq(gwas$P)
dev.off()

## filter by p-value to reduce number of observations
gwas <- gwas %>%
  filter(P <= 0.05)
glimpse(gwas)
# 327182

## create and save Manhattan plot
## use png for a smaller file; pdf for publication quality
png(file = paste0(outdir, "/ADNI1_HV_GWAS-manhattan.png"))
manhattan(gwas)
dev.off()

Organized by Alzheimer’s Association, ISTAART Neuroimaging PIA. Working group Brain Imaging Genetics.

Special thanks to ADNI for providing the datasets.