Examples - MPH

Note

All auxiliary scripts used in the examples are documented in Scripts.

Simulated datasets

QTL-MAS 2012

The dataset is available for download here.
This article describes how the dataset was simulated.
3k related animals, 10k SNPs, 5 chromosomes, and 3 traits

Sequence genotypes

The dataset is available for download here.
Sequence genotypes were simulated by genosim.
10k unrelated individuals, 5M sequence variants, and 30 chromosomes
Functional annotations were quickly simulated by assigning LDSC baseline annotations to the simulated sequence variants in order.
Phenotypes were simulated using the S-LDSC baseline model enrichment estimates for human traits.
- The enrichment estimates were used to compute variance component (VC) estimates for intercept and 24 main functional annotations.
- The VC estimates were set as true values in phenotype simulation.
- A small value was added to the intercept's VC to enhance the all-in-one GRM's positive definiteness.

Partitioning heritability

By chromosomes

Partitioning heritability by chromosomes for the QTL-MAS 2012 dataset

Create a SNP info file: chr.snp_info.csv.
Make a GRM for each chromosome.
Create a GRM list: chr.grms.txt.
Run REML/MINQUE.

# Making GRMs
# Input: geno and chr.snp_info.csv
mkdir chromosomes
for chr in {1..5}
do
    mph --make_grm --binary_genotype geno --min_maf 0 --min_hwe_pval 1e-8 --snp_info chr.snp_info.csv --snp_weight $chr --num_threads 10 --out ./chromosomes/$chr
done

# Running REML
# Input: chr.grms.txt, phen.csv, and covar.csv
mph --reml --grm_list chr.grms.txt --phenotype phen.csv --trait milk --error_weight milk_wt --covariate_file covar.csv --covariate_names all --num_threads 10 --out ./chromosomes/milk

By functional annotations

Partitioning heritability by functional annotations for the sequence genotypes dataset

# 1. Create a new SNP info file including only variants that pass quality control.
# SNPs with an MAF < 0.01 are removed from funct.snp_info.csv, producing qc.funct.snp_info.csv.
snpinfo="qc.funct.snp_info.csv"

# Read the first line of the SNP info file and split it by a comma to get the annotation list.
IFS=',' read -ra elements < $snpinfo

# 2. Make a GRM for each functional annotation category.
# It may take about one hour to contruct all the GRMs. 
# Note that precomputed GRMs are included in the dataset. Go to the step 3 to save time. 
mkdir grms

# Iterate over the list of annotation categories.
# Use PLINK to extract SNPs in each category for faster I/O.
for ((i = 1; i < ${#elements[@]}; i++)); do
    awk -F ',' -v colname="${elements[i]}" 'NR == 1 { for (j = 1; j <= NF; j++) if ($j == colname) col = j } NR > 1 && $col == 1 { print $1 }' $snpinfo >  "${elements[i]}.extract.txt"
    plink --bfile geno --chr-set 30 --extract "${elements[i]}.extract.txt" --threads 14 --make-bed --out ${elements[i]}
    mph --make_grm --binary_geno ${elements[i]} --snp_info $snpinfo --snp_weight ${elements[i]} --num_threads 14 --out ./grms/${elements[i]}
    rm ${elements[i]}.*
done

# 3. Create a GRM list and run REML.
# Create a GRM list.
grmlist="funct.grms.txt"
if [ -e $grmlist ]; then
    rm $grmlist
fi
for ((i = 1; i < ${#elements[@]}; i++)); do
    echo "grms/${elements[i]} 1" >> $grmlist
done

# Run REML.
mkdir reml
mph --reml --save_memory --grm_list $grmlist --phenotype pheno/hsq0.9.sim.csv --trait 1 --num_threads 14 --out reml/1

Below is an R script for recomputing the proportions of genetic variance explained and enrichments.

# mph_functs.R is documented at https://jiang18.github.io/mph/scripts/.
source("mph_functs.R")
library(data.table)

# Read the SNP info file to get the SNP weighting matrix.
sw = as.matrix(fread("qc.funct.snp_info.csv"),rownames=1)
sw[is.na(sw)] = 0

# Get the SNP incidence matrix.
# The SNP weighting matrix is often the incidence matrix, but not always.
si = (sw != 0) * 1

# More annotation categories can be included in the incidence matrix.
# For example, a new annotation of gene is derived here.
gene = as.numeric( rowSums(si[, c("Coding_UCSC", "Intron_UCSC", "Promoter_UCSC", "UTR_3_UCSC", "UTR_5_UCSC")]) > 0 )
si = cbind(si, gene)

# Calculate the crossproduct of 'si' and 'sw'.
# The columns of 'sw' should match the rows of 'vcfile'; otherwise, they should be reordered.
index = 1:25; sw = sw[, index]
cp = crossprod(si, sw)

# Number of SNPs in each annotation category of interest
annot_size = colSums(si)
# Total number of SNPs
nsnps = nrow(si)

vcfile = "reml/1.mq.vc.csv"
result = recompute_enrichments(vcfile, cp, nsnps=nsnps, annot.size=annot_size) 
result

# Visualization
library(ggplot2)

result = as.data.frame(result[-1,])
pct = result$prop * 100
pct = round(pct,2)
rownames(result) = paste0(rownames(result)," (",pct,"%)")

df = data.frame(FunctionalAnnotation=rownames(result), Enrichment=result$enrichment, SE=result$enrichment.se)
p = ggplot(df, aes(x=FunctionalAnnotation, y=Enrichment)) + 
  geom_bar(stat="identity", color="black", position=position_dodge()) +
  geom_errorbar(aes(ymin=Enrichment-SE, ymax=Enrichment+SE), width=.5, position=position_dodge(.9)) 
p = p + theme(text = element_text(size=22), axis.text.x = element_text(angle = 65, vjust = 1, hjust=1))
p = p + geom_hline(yintercept=1, linetype="dashed", color = "red") + xlab("") + ylab("Per-SNP heritability enrichment estimate") + ggtitle("Trait 1")
p

The following figure will be produced. FunctAnnotTrait1

Dominance and epistasis

Decomposing genetic variance into additive, dominance, and epistatic components for the QTL-MAS 2012 dataset

Make GRMs from SNPs: one for additive and one for dominance.
Create a GRM list for --make_fore: AD.grms.txt.
Make first-order interaction GRMs.
Create a GRM list for --reml, listing A, D, AxA, AxD, and DxD: ADE.grms.txt.
Run REML/MINQUE.

# Making one additive GRM and one dominance GRM
# Input: geno and chr.snp_info.csv
mkdir nonadditive

mph --make_grm --binary_genotype geno --min_maf 0 --min_hwe_pval 1e-8 --snp_info chr.snp_info.csv --num_threads 10 --out ./nonadditive/genome

mph --make_grm --binary_genotype geno --min_maf 0 --min_hwe_pval 1e-8 --snp_info chr.snp_info.csv --num_threads 10 --out ./nonadditive/genome --dom

# Making three first-order interaction GRMs
# Input: AD.grms.txt listing the GRMs that have been made
mph --make_fore --grm_list AD.grms.txt --num_threads 10 --out ./nonadditive/genome

# Running REML
# Input: ADE.grms.txt, phen.csv, and covar.csv
mph --reml --grm_list ADE.grms.txt --phenotype phen.csv --trait milk --covariate_file covar.csv --covariate_names all --num_threads 10 --out ./nonadditive/milk

Genetic correlation

Estimating genetic and environmental correlations for the QTL-MAS 2012 dataset

Genome-wide

mkdir multi-trait

mph --make_grm --binary_genotype geno --min_maf 0 --min_hwe_pval 1e-8 --snp_info chr.snp_info.csv --num_threads 10 --out ./multi-trait/genome

mph --reml --save_mem --grm_list A.grm.txt --phenotype phen.csv --trait milk,fat,fat_percent --covariate_file covar.csv --covariate_names all --num_threads 10 --out ./multi-trait/genome

Chromosome-wise

# This step may have been run. 
mkdir chromosomes
for chr in {1..5}
do
    mph --make_grm --binary_genotype geno --min_maf 0 --min_hwe_pval 1e-8 --snp_info chr.snp_info.csv --snp_weight $chr --num_threads 10 --out ./chromosomes/$chr
done

# The directory may have been created.
mkdir multi-trait
mph --reml --save_mem --grm_list chr.grms.txt --phenotype phen.csv --trait milk,fat,fat_percent --num_threads 10 --out ./multi-trait/chromosomes

Genotype–covariate interaction

MPH can effectively do GCI-GREML.

Estimating the proportion of phenotypic variance contributed by genotype–covariate interaction effects for the QTL-MAS 2012 dataset

Make a genotype–covariate interaction GRM for a categorical covariate.
Run REML using a GRM list file like this.

# The R scripts are documented at https://jiang18.github.io/mph/scripts/.
mkdir GCI

mph --make_grm --binary_genotype geno --min_maf 0 --min_hwe_pval 1e-8 --snp_info chr.snp_info.csv --num_threads 10 --out ./GCI/genome

Rscript --no-save make_gci_grm.R ./GCI/genome covar.csv ./GCI/genome

mph --reml --save_mem --grm_list GCI.grms.txt --phenotype phen.csv --trait milk --covariate_file covar.csv --covariate_names all --num_threads 10 --out ./GCI/milk

Keys	Action
`?`	Open this help
`n`	Next page
`p`	Previous page
`s`	Search