Note
All auxiliary scripts used in the examples are documented in Scripts.
Simulated datasets
QTL-MAS 2012
- The dataset is available for download here.
- This article describes how the dataset was simulated.
- 3k related animals, 10k SNPs, 5 chromosomes, and 3 traits
Sequence genotypes
- The dataset is available for download here.
- Sequence genotypes were simulated by genosim.
- 10k unrelated individuals, 5M sequence variants, and 30 chromosomes
- Functional annotations were quickly simulated by assigning LDSC baseline annotations to the simulated sequence variants in order.
- Phenotypes were simulated using the S-LDSC baseline model enrichment estimates for human traits.
- The enrichment estimates were used to compute variance component (VC) estimates for intercept and 24 main functional annotations.
- The VC estimates were set as true values in phenotype simulation.
- A small value was added to the intercept's VC to enhance the all-in-one GRM's positive definiteness.
Partitioning heritability
By chromosomes
Partitioning heritability by chromosomes for the QTL-MAS 2012 dataset
- Create a SNP info file: chr.snp_info.csv.
- Make a GRM for each chromosome.
- Create a GRM list: chr.grms.txt.
- Run REML/MINQUE.
# Making GRMs
# Input: geno and chr.snp_info.csv
mkdir chromosomes
for chr in {1..5}
do
mph --make_grm --binary_genotype geno --min_maf 0 --min_hwe_pval 1e-8 --snp_info chr.snp_info.csv --snp_weight $chr --num_threads 10 --out ./chromosomes/$chr
done
# Running REML
# Input: chr.grms.txt, phen.csv, and covar.csv
mph --reml --grm_list chr.grms.txt --phenotype phen.csv --trait milk --error_weight milk_wt --covariate_file covar.csv --covariate_names all --num_threads 10 --out ./chromosomes/milk
By functional annotations
Partitioning heritability by functional annotations for the sequence genotypes dataset
# 1. Create a new SNP info file including only variants that pass quality control.
# SNPs with an MAF < 0.01 are removed from funct.snp_info.csv, producing qc.funct.snp_info.csv.
snpinfo="qc.funct.snp_info.csv"
# Read the first line of the SNP info file and split it by a comma to get the annotation list.
IFS=',' read -ra elements < $snpinfo
# 2. Make a GRM for each functional annotation category.
# It may take about one hour to contruct all the GRMs.
# Note that precomputed GRMs are included in the dataset. Go to the step 3 to save time.
mkdir grms
# Iterate over the list of annotation categories.
# Use PLINK to extract SNPs in each category for faster I/O.
for ((i = 1; i < ${#elements[@]}; i++)); do
awk -F ',' -v colname="${elements[i]}" 'NR == 1 { for (j = 1; j <= NF; j++) if ($j == colname) col = j } NR > 1 && $col == 1 { print $1 }' $snpinfo > "${elements[i]}.extract.txt"
plink --bfile geno --chr-set 30 --extract "${elements[i]}.extract.txt" --threads 14 --make-bed --out ${elements[i]}
mph --make_grm --binary_geno ${elements[i]} --snp_info $snpinfo --snp_weight ${elements[i]} --num_threads 14 --out ./grms/${elements[i]}
rm ${elements[i]}.*
done
# 3. Create a GRM list and run REML.
# Create a GRM list.
grmlist="funct.grms.txt"
if [ -e $grmlist ]; then
rm $grmlist
fi
for ((i = 1; i < ${#elements[@]}; i++)); do
echo "grms/${elements[i]} 1" >> $grmlist
done
# Run REML.
mkdir reml
mph --reml --save_memory --grm_list $grmlist --phenotype pheno/hsq0.9.sim.csv --trait 1 --num_threads 14 --out reml/1
Below is an R script for recomputing the proportions of genetic variance explained and enrichments.
# mph_functs.R is documented at https://jiang18.github.io/mph/scripts/.
source("mph_functs.R")
library(data.table)
# Read the SNP info file to get the SNP weighting matrix.
sw = as.matrix(fread("qc.funct.snp_info.csv"),rownames=1)
sw[is.na(sw)] = 0
# Get the SNP incidence matrix.
# The SNP weighting matrix is often the incidence matrix, but not always.
si = (sw != 0) * 1
# More annotation categories can be included in the incidence matrix.
# For example, a new annotation of gene is derived here.
gene = as.numeric( rowSums(si[, c("Coding_UCSC", "Intron_UCSC", "Promoter_UCSC", "UTR_3_UCSC", "UTR_5_UCSC")]) > 0 )
si = cbind(si, gene)
# Calculate the crossproduct of 'si' and 'sw'.
# The columns of 'sw' should match the rows of 'vcfile'; otherwise, they should be reordered.
index = 1:25; sw = sw[, index]
cp = crossprod(si, sw)
# Number of SNPs in each annotation category of interest
annot_size = colSums(si)
# Total number of SNPs
nsnps = nrow(si)
vcfile = "reml/1.mq.vc.csv"
result = recompute_enrichments(vcfile, cp, nsnps=nsnps, annot.size=annot_size)
result
# Visualization
library(ggplot2)
result = as.data.frame(result[-1,])
pct = result$prop * 100
pct = round(pct,2)
rownames(result) = paste0(rownames(result)," (",pct,"%)")
df = data.frame(FunctionalAnnotation=rownames(result), Enrichment=result$enrichment, SE=result$enrichment.se)
p = ggplot(df, aes(x=FunctionalAnnotation, y=Enrichment)) +
geom_bar(stat="identity", color="black", position=position_dodge()) +
geom_errorbar(aes(ymin=Enrichment-SE, ymax=Enrichment+SE), width=.5, position=position_dodge(.9))
p = p + theme(text = element_text(size=22), axis.text.x = element_text(angle = 65, vjust = 1, hjust=1))
p = p + geom_hline(yintercept=1, linetype="dashed", color = "red") + xlab("") + ylab("Per-SNP heritability enrichment estimate") + ggtitle("Trait 1")
p
The following figure will be produced.
Dominance and epistasis
Decomposing genetic variance into additive, dominance, and epistatic components for the QTL-MAS 2012 dataset
- Make GRMs from SNPs: one for additive and one for dominance.
- Create a GRM list for
--make_fore
: AD.grms.txt. - Make first-order interaction GRMs.
- Create a GRM list for
--reml
, listing A, D, AxA, AxD, and DxD: ADE.grms.txt. - Run REML/MINQUE.
# Making one additive GRM and one dominance GRM
# Input: geno and chr.snp_info.csv
mkdir nonadditive
mph --make_grm --binary_genotype geno --min_maf 0 --min_hwe_pval 1e-8 --snp_info chr.snp_info.csv --num_threads 10 --out ./nonadditive/genome
mph --make_grm --binary_genotype geno --min_maf 0 --min_hwe_pval 1e-8 --snp_info chr.snp_info.csv --num_threads 10 --out ./nonadditive/genome --dom
# Making three first-order interaction GRMs
# Input: AD.grms.txt listing the GRMs that have been made
mph --make_fore --grm_list AD.grms.txt --num_threads 10 --out ./nonadditive/genome
# Running REML
# Input: ADE.grms.txt, phen.csv, and covar.csv
mph --reml --grm_list ADE.grms.txt --phenotype phen.csv --trait milk --covariate_file covar.csv --covariate_names all --num_threads 10 --out ./nonadditive/milk
Genetic correlation
Estimating genetic and environmental correlations for the QTL-MAS 2012 dataset
Genome-wide
mkdir multi-trait
mph --make_grm --binary_genotype geno --min_maf 0 --min_hwe_pval 1e-8 --snp_info chr.snp_info.csv --num_threads 10 --out ./multi-trait/genome
mph --reml --save_mem --grm_list A.grm.txt --phenotype phen.csv --trait milk,fat,fat_percent --covariate_file covar.csv --covariate_names all --num_threads 10 --out ./multi-trait/genome
Chromosome-wise
# This step may have been run.
mkdir chromosomes
for chr in {1..5}
do
mph --make_grm --binary_genotype geno --min_maf 0 --min_hwe_pval 1e-8 --snp_info chr.snp_info.csv --snp_weight $chr --num_threads 10 --out ./chromosomes/$chr
done
# The directory may have been created.
mkdir multi-trait
mph --reml --save_mem --grm_list chr.grms.txt --phenotype phen.csv --trait milk,fat,fat_percent --num_threads 10 --out ./multi-trait/chromosomes
Genotype–covariate interaction
MPH can effectively do GCI-GREML.
Estimating the proportion of phenotypic variance contributed by genotype–covariate interaction effects for the QTL-MAS 2012 dataset
- Make a genotype–covariate interaction GRM for a categorical covariate.
- Run REML using a GRM list file like this.
# The R scripts are documented at https://jiang18.github.io/mph/scripts/.
mkdir GCI
mph --make_grm --binary_genotype geno --min_maf 0 --min_hwe_pval 1e-8 --snp_info chr.snp_info.csv --num_threads 10 --out ./GCI/genome
Rscript --no-save make_gci_grm.R ./GCI/genome covar.csv ./GCI/genome
mph --reml --save_mem --grm_list GCI.grms.txt --phenotype phen.csv --trait milk --covariate_file covar.csv --covariate_names all --num_threads 10 --out ./GCI/milk