CEPH samples were used as technical control in QIMR projects too. Scott helped to extract all the CEPH samples with around 200 samples, and imputed them with two imputation servers, Michigan and TopMed. The data is both useful to compare across other studies, and compare two different imputation panels.
QIMR have many batches of data and they were processed with standard QC with MAF >=1%, and flip to hg19 ‘+’ strand (which I know, based on Will Rayner’s BLAST search results).
For each chip family (GSA, Omni, HapMap-based), Scott processed the data in steps of:
Identify all of the NA/GM control samples which passed QC, and the sample IDs which they were read as after extracting from GenomeStudio.
Make a tiny PLINK binary-format file which has their raw genotypes under those IDs (still distinct), but restricted to just the markers which passed QC in their batch. In the Omni case I added 50 ordinary individuals who passed QC in the same batch, otherwise there were too few for the Imputation Server to accept the job. and added 200 individuals added to each batch of genotyping for a decent sample size.(Scott just chose the first 200/batch which were covered by the UQ data transfer agreement)
Merge the batches from one chip family for the set of markers used in the original imputation run (which is the lowest common denominator of QC-passing markers across a number of batches), and flip to hg19 ‘+’ strand (which I know, based on Will Rayner’s BLAST search results).
Upload to Michigan Imputation Server (3 uploads per chip family : (a) autosomes; (b) chr X PAR regions; (c) chr X non-PAR region).
Uploaded to TOPMED imputation server for TOPMed (r3) reference
Download and extract the files and transfer to Tian through RDM.
plink --vcf $vcffile --const-fid --keep-allele-order --make-bed --out $plinkfile
mkdir -p ${folder}_PLINK/${folder}_PLINK_original_bim
cp ${plinkfile}.bim ${folder}_PLINK/${folder}_PLINK_original_bim/
Rscript update_TOPMED_SNPs.R ${plinkfile}.bim SNP_lists/dbsnp_146_hg38_${chr}.txt /QRISdata/Q6913/Pipeline/ukb20k_7M_4cM/snp.info
cp ${plinkfile}.bim_updated ${plinkfile}.bim
The purpose of this step is to rename the duplicated SNP IDs, and fill in an ID for the SNPs without an ID.
Duplicate SNP IDs are distinguished by adding dup and a number, while keeping the SNP that have the same pair of alleles vs. the predictors intact.
This R script was used for TopMed imputed data
library(dplyr)
library(tidyr)
library(data.table)
args=commandArgs(trailingOnly = TRUE)
bimfile=args[1]
pred.file=args[2]
# Read the files
predictor <- read.table(pred.file, header = TRUE, stringsAsFactors = FALSE,
col.names = c("chr", "SNP", "cm", "pos", "A1", "A2", "A1Freq", "N", "blk" ))
bim <- read.table(bimfile, header = FALSE, stringsAsFactors = FALSE,
col.names = c("chr", "SNP", "cm", "pos", "A1", "A2"))
# add an index column
bim$index = c(1:nrow(bim))
# Create chrbpAB for bim
bim <- bim %>%
mutate(A = pmin(toupper(A1), toupper(A2)),
B = pmax(toupper(A1), toupper(A2)),
chrbpAB = paste0("chr", chr, ":", pos, ":", A, ":", B))
# If 'SNP' is ".", replace it with the value in 'chrbpAB'
bim <- bim %>% mutate(SNP = ifelse(SNP == ".", chrbpAB, SNP))
# Handle duplicates in bim
predictor <- predictor %>%
mutate(A = pmin(toupper(A1), toupper(A2)),
B = pmax(toupper(A1), toupper(A2)))
# Match with predictor and handle duplicates
bim <- bim %>% left_join(predictor, by = c("SNP", "A", "B"), suffix = c("", ".pred"))
bim <- bim %>% mutate(exact_match = !is.na(A1.pred) & !is.na(A2.pred))
dup.bim = bim[bim$SNP %in% bim[duplicated(bim$SNP) , "SNP"],]
dup.bim <- dup.bim %>%
group_by(SNP) %>%
arrange(desc(exact_match), .by_group = T) %>%
ungroup()
dup.bim = data.frame(dup.bim)
# If there are duplicated SNPs for any reason then label them as _2, _3 etc
temp <- rle(dup.bim$SNP)
temp2 <- paste0(dup.bim$SNP, "_", unlist(lapply(temp$lengths, seq_len)))
temp2 <- gsub("_1$", "", temp2)
dup.bim$SNP <- temp2
## get the key columns
uniq.bim = bim[!bim$SNP %in% bim[duplicated(bim$SNP) , "SNP"],]
fixed.bim = rbind(uniq.bim %>% select(chr, SNP, cm, pos, A1, A2, index),
dup.bim %>% select(chr, SNP, cm, pos, A1, A2, index))
fixed.bim = fixed.bim[order(fixed.bim$index),]
# Save the modified bim file
write.table(x = fixed.bim[,1:6], file = paste0(bimfile,"_updated" ), sep = "\t", quote = FALSE, row.names = FALSE, col.names = FALSE)
HRC imputed data were processed in the same way as HSU data.
HRC imputed data were merged into one file from all the chromosomes.
i=$SLURM_ARRAY_TASK_ID
cd /scratch/project/genetic_data_analysis/uqtlin5/CEPH_samples/QIMR
dir=Cleaned_Plink/
## define data
studies=("GSA" "Hap_HRCr1.1" "Omni_HRCr1.1")
panel="HRCr1.1"
for cohort in "${studies[@]}"; do
cohort=${studies[i-1]}
## generate the merge list
prefix=${dir}/${cohort}_${panel}_PLINK/${cohort}_${panel}"_chr"
for chr in $(seq 2 22); do
echo "${prefix}${chr}.bed ${prefix}${chr}.bim ${prefix}${chr}.fam" >> ${cohort}_${panel}_bmerge_files.txt
done
## merge the data
mkdir -p Merged_plink
plink --bfile ${prefix}1 --merge-list ${cohort}_${panel}_bmerge_files.txt --allow-no-sex --make-bed --out Merged_plink/${cohort}_${panel}_autosomes
done
TopMed imputed data were filtered with the 7.3M SNP in the SBayesRC predictors, and then merged onto one file.
i=$SLURM_ARRAY_TASK_ID
dir=Cleaned_Plink/
## define data
studies=("GSA_TOPMedr3" "Hap_TOPMedr3" "Omni_TOPMedr3")
mkdir -p Cleaned_Plink/${cohort}_PLINK/
bfile=/QRISdata/Q5740/QIMR_Berghofer/${cohort}_PLINK/${cohort}_chr${i}
plink --bfile $bfile --extract SNP_lists/SNPs_in_7.3M.txt --make-bed --out Cleaned_Plink/${cohort}_PLINK/${cohort}_subset_chr${i}
for cohort in "${studies[@]}"; do
cohort=${studies[i-1]}
## generate the merge list
prefix=${dir}/${cohort}_PLINK/${cohort}"_subset_chr"
for chr in $(seq 2 22); do
echo "${prefix}${chr}.bed ${prefix}${chr}.bim ${prefix}${chr}.fam" >> ${cohort}_bmerge_files.txt
done
## merge the data
mkdir -p Merged_plink
plink --bfile ${prefix}1 --merge-list ${cohort}_bmerge_files.txt --allow-no-sex --make-bed --out Merged_plink/${cohort}_autosomes
done
The same plink command line was used to profile PGS from the 6 QIMR data sets.
i=$SLURM_ARRAY_TASK_ID
traitfile="/QRISdata/Q6913/GCTB_predictor_list_for_batch_profiling.txt"
trait=$(sed "${i}q;d" $traitfile | awk '{print $1}' )
predictor=$(sed "${i}q;d" $traitfile | awk '{print $3}' )
outdir=PRS_all_GCTB
plink \
--bfile $bfile \
--score ${predictor} 2 5 8 header sum \
--out ${outdir}/${cohort}_${trait}_SBayesRC
The parameters after your predictor file means