Single SNP Association Analysis

Author

Yuetiva Robles

Published

July 20, 2025

Summary

This document includes instructions for running simple single-SNP analysis using R. Genome-wide association studies (GWAS) are essentially large-scale single-SNP analyses repeated up to millions of times. Here we demonstrate a simple single-SNP regression analysis, the traditional GWAS analysis. There are several other statistical analysis methods available for neuroimaging genetic studies, but these are outside the scope of this tutorial.

Extract SNP genotypes

Our data is stored in PLINK format (.bed/.bim/.fam) and for single SNP analyses we need to extract SNP genotypes from these data to test. Here is the PLINK command to extract genotype data for two individual SNPs of interest: rs429358 and rs769449.

Fun fact: These are the SNP for APOE ε4 (rs429358) and a SNP in linkage disquilibrium (LD) with rs429358 that often appears as the top hit in GWAS (rs769449). We refer to rs769449 as a "tag SNP" for APOE ε4.

At the shell prompt, type:

PLINK="./plink/plink"
BFILE="data/files-imputation/postImputation/ADNI1.impQC.rs"
OUTDIR="my-analysis"
cd ~
mkdir $OUTDIR
$PLINK --bfile $BFILE --recode A --snps rs429358,rs769449 --out $OUTDIR/ADNI1_APOE_snps

Deconstructing the PLINK command:

--bfile fileprefix
Here fileprefix is the filename before the .bed/.bim/.fam in the set of files containing your genetic data.

--recode A
This creates a new text file, the A modifier provides additive genotype data (0, 1, or 2) based on the A1 allele by default (PLINK default A1 = minor allele).

--snps rsID,snpID
This specifies a subset of one or more SNPs where rsID or snpID are SNPs listed in the .bim file, multiple SNPs are separated by commas (no space). Note that if any SNPs are not present in your data you'll get an error and no file will be generated, best practice is to verify the SNP names in .bim file or use --extract with a text file list of SNPs instead.

--out outfileprefix
Here outfileprefix is the filename prefix to be used for output files created using this command. PLINK will automatically add the filename extension.

Output files created with this command include:

outfileprefix.log
The log file created using the specified outfileprefix.

outfileprefix.raw
A text file created using the specified outfileprefix. In this example, columns are FID, IID, PAT, MAT, SEX, PHENOTYPE, snp1_A1, ... Here snp1_A1 is the additive genotype data for extracted SNPs named by the SNP ID followed by the A1 allele/counted allele).

Run single-SNP regression in R

Now that we have a text-format file with additive genotype data, we can combine that with our phenotype data and perform regression analyses in R. This is a simplified model for the tutorial, remember to use robust statistical modeling when running your own analyses.

## load libraries
library(tidyverse)
library(here)
library(broom)

## create variables for files/directories
genofile <- here("~","my-analysis","ADNI1_APOE_snps.raw")
phenofile <- here("~","data","ADNI1_demographics_immersive.txt")
outdir <- here("~","my-analysis")

## Import genotype data
geno <- read_table(genofile)
## preview imported data
glimpse(geno)

## Import phenotype data
pheno <- read_table(phenofile)
## preview imported data
glimpse(pheno)

Phenotype variables of interest:

ID : Individual identifier that matches PLINK IID in .fam file.

DISEASE_STATUS : Summary diagnosis provided by ADNI as AD, CN, LMCI.

DISEASE_STATUS_DICO : Derived variable for dichotomous AD status (0 = does not have AD summary diagnosis, 1 = has AD summary diagnosis).

AGE : Age provided by ADNI.

REPORTED_SEX : Reported sex provided by ADNI as Female, Male.

Whole_hippocampus : Hippocampal volume provided by ADNI.

ICV : Intracranial volume provided by ADNI.

C1, C2, C3, C4 : First four principal components for genetic population structure calculated from the genetic data preparation in the previous tutorial.

## Extract IDs and SNP data; 

geno_sub <- geno %>%
  select(FID, IID, rs769449_A, rs429358_C)

## combine genotype data, age, sex, AD status, HV, ICV, and PCs for analyses

## note that the variable renaming below is not all necessary, 
## simply preference/convenience in later scripts

hv <- pheno %>%
  select("IID"  = ID, "Diagnosis" = DISEASE_STATUS, 
         "ADstatus" = DISEASE_STATUS_DICO, 
         AGE, "SEX" = REPORTED_SEX, "HV" = Whole_hippocampus, 
         ICV, C1, C2, C3, C4) %>%
  right_join(geno_sub, by = "IID")
glimpse(hv)
#dfSummary(hv)

#################################
## Extra table: Cross-tabulation of rs429358 and rs769449

## This is just an extra demonstration showing that rs769449 
## genotype is very similar to rs429358 but not the same

table(hv$rs429358_C, hv$rs769449_A, useNA = 'always')

Here we'll run a simple linear regression testing the relationship between hippocampal volume and rs429358 (or rs769449), adjusting for age, sex, and intracranial volume.

rs429358hv.mod <- lm(
  HV ~ rs429358_C + AGE + SEX + ICV, 
  data = hv)
summary(rs429358hv.mod)
tidy(rs429358hv.mod)

rs769449hv.mod <- lm(
  HV ~ rs769449_A + AGE + SEX + ICV, 
  data = hv)
summary(rs769449hv.mod)
tidy(rs769449hv.mod)

Here we'll run a simple logistic regression testing the relationship between AD case/control status and rs429358 (or rs769449), adjusting for age and sex.

rs429358ad.mod <- glm(
  ADstatus ~ rs429358_C + AGE + SEX, 
  data = hv, 
  family = binomial)
summary(rs429358ad.mod)
tidy(rs429358ad.mod)

rs769449hv.mod <- glm(
  HV ~ rs769449_A + AGE + SEX, 
  data = hv, 
  family = binomial)
summary(rs769449ad.mod)
tidy(rs769449ad.mod)