CEPH samples were used as technical control in QIMR projects too. Scott helped to extract all the CEPH samples with around 200 samples, and imputed them with two imputation servers, Michigan and TopMed. The data is both useful to compare across other studies, and compare two different imputation panels.


1 Data QC and Imputation

QIMR have many batches of data and they were processed with standard QC with MAF >=1%, and flip to hg19 ‘+’ strand (which I know, based on Will Rayner’s BLAST search results).

For each chip family (GSA, Omni, HapMap-based), Scott processed the data in steps of:

  1. Identify all of the NA/GM control samples which passed QC, and the sample IDs which they were read as after extracting from GenomeStudio.

  2. Make a tiny PLINK binary-format file which has their raw genotypes under those IDs (still distinct), but restricted to just the markers which passed QC in their batch. In the Omni case I added 50 ordinary individuals who passed QC in the same batch, otherwise there were too few for the Imputation Server to accept the job. and added 200 individuals added to each batch of genotyping for a decent sample size.(Scott just chose the first 200/batch which were covered by the UQ data transfer agreement)

  3. Merge the batches from one chip family for the set of markers used in the original imputation run (which is the lowest common denominator of QC-passing markers across a number of batches), and flip to hg19 ‘+’ strand (which I know, based on Will Rayner’s BLAST search results).

  4. Upload to Michigan Imputation Server (3 uploads per chip family : (a) autosomes; (b) chr X PAR regions; (c) chr X non-PAR region).

  5. Uploaded to TOPMED imputation server for TOPMed (r3) reference

  6. Download and extract the files and transfer to Tian through RDM.

2 Post-imputation processing

2.2 Update SNP ID

The purpose of this step is to rename the duplicated SNP IDs, and fill in an ID for the SNPs without an ID.

Duplicate SNP IDs are distinguished by adding dup and a number, while keeping the SNP that have the same pair of alleles vs. the predictors intact.

This R script was used for TopMed imputed data

library(dplyr)
library(tidyr)
library(data.table)

args=commandArgs(trailingOnly = TRUE)

bimfile=args[1]
pred.file=args[2]

# Read the files
predictor <- read.table(pred.file, header = TRUE, stringsAsFactors = FALSE,
                        col.names = c("chr", "SNP", "cm", "pos", "A1", "A2", "A1Freq",  "N", "blk" ))
bim <- read.table(bimfile, header = FALSE, stringsAsFactors = FALSE,
                  col.names = c("chr", "SNP", "cm", "pos", "A1", "A2"))


# add an index column

bim$index = c(1:nrow(bim))

# Create chrbpAB for bim
bim <- bim %>%
  mutate(A = pmin(toupper(A1), toupper(A2)),
         B = pmax(toupper(A1), toupper(A2)),
         chrbpAB = paste0("chr", chr, ":", pos, ":", A, ":", B))

# If 'SNP' is ".", replace it with the value in 'chrbpAB'
bim <- bim %>%  mutate(SNP = ifelse(SNP == ".", chrbpAB, SNP))


# Handle duplicates in bim
predictor <- predictor %>%
  mutate(A = pmin(toupper(A1), toupper(A2)),
         B = pmax(toupper(A1), toupper(A2)))

# Match with predictor and handle duplicates
bim <- bim %>%  left_join(predictor, by = c("SNP", "A", "B"), suffix = c("", ".pred")) 
bim <- bim %>%  mutate(exact_match = !is.na(A1.pred) & !is.na(A2.pred)) 

dup.bim = bim[bim$SNP %in% bim[duplicated(bim$SNP) , "SNP"],]
dup.bim <- dup.bim %>% 
  group_by(SNP) %>% 
  arrange(desc(exact_match), .by_group = T) %>%
  ungroup()

dup.bim = data.frame(dup.bim)

# If there are duplicated SNPs for any reason then label them as _2, _3 etc
temp <- rle(dup.bim$SNP)
temp2 <- paste0(dup.bim$SNP, "_", unlist(lapply(temp$lengths, seq_len)))
temp2 <- gsub("_1$", "", temp2)
dup.bim$SNP <- temp2

## get the key columns
uniq.bim =  bim[!bim$SNP %in% bim[duplicated(bim$SNP) , "SNP"],]

fixed.bim = rbind(uniq.bim %>%  select(chr, SNP, cm, pos, A1, A2, index),
                  dup.bim %>%  select(chr, SNP, cm, pos, A1, A2, index))

fixed.bim = fixed.bim[order(fixed.bim$index),]

# Save the modified bim file
write.table(x = fixed.bim[,1:6], file = paste0(bimfile,"_updated" ), sep = "\t", quote = FALSE, row.names = FALSE, col.names = FALSE)

HRC imputed data were processed in the same way as HSU data.

2.3 merge the chromosomes

HRC imputed data were merged into one file from all the chromosomes.

i=$SLURM_ARRAY_TASK_ID
cd /scratch/project/genetic_data_analysis/uqtlin5/CEPH_samples/QIMR
dir=Cleaned_Plink/

## define data
studies=("GSA" "Hap_HRCr1.1" "Omni_HRCr1.1")
panel="HRCr1.1"

for cohort in "${studies[@]}"; do
    cohort=${studies[i-1]}
    ## generate the merge list
    prefix=${dir}/${cohort}_${panel}_PLINK/${cohort}_${panel}"_chr"    
    for chr in $(seq 2 22); do
        echo "${prefix}${chr}.bed ${prefix}${chr}.bim ${prefix}${chr}.fam"  >>  ${cohort}_${panel}_bmerge_files.txt
    done    
    ## merge the data
    mkdir -p Merged_plink
    plink --bfile ${prefix}1  --merge-list ${cohort}_${panel}_bmerge_files.txt  --allow-no-sex  --make-bed  --out  Merged_plink/${cohort}_${panel}_autosomes
done

TopMed imputed data were filtered with the 7.3M SNP in the SBayesRC predictors, and then merged onto one file.

i=$SLURM_ARRAY_TASK_ID
dir=Cleaned_Plink/

## define data
studies=("GSA_TOPMedr3" "Hap_TOPMedr3" "Omni_TOPMedr3")

mkdir -p Cleaned_Plink/${cohort}_PLINK/
bfile=/QRISdata/Q5740/QIMR_Berghofer/${cohort}_PLINK/${cohort}_chr${i}
plink --bfile $bfile --extract  SNP_lists/SNPs_in_7.3M.txt   --make-bed --out Cleaned_Plink/${cohort}_PLINK/${cohort}_subset_chr${i}

for cohort in "${studies[@]}"; do
cohort=${studies[i-1]}
## generate the merge list
prefix=${dir}/${cohort}_PLINK/${cohort}"_subset_chr"    
for chr in $(seq 2 22); do
    echo "${prefix}${chr}.bed ${prefix}${chr}.bim ${prefix}${chr}.fam"  >>  ${cohort}_bmerge_files.txt
done    
## merge the data
mkdir -p Merged_plink
plink --bfile ${prefix}1  --merge-list ${cohort}_bmerge_files.txt  --allow-no-sex  --make-bed  --out  Merged_plink/${cohort}_autosomes
done

3 PGS profiling

The same plink command line was used to profile PGS from the 6 QIMR data sets.

i=$SLURM_ARRAY_TASK_ID
traitfile="/QRISdata/Q6913/GCTB_predictor_list_for_batch_profiling.txt"
trait=$(sed "${i}q;d" $traitfile | awk '{print $1}' )
predictor=$(sed  "${i}q;d"  $traitfile  | awk '{print $3}' )
outdir=PRS_all_GCTB

plink  \
--bfile  $bfile \
--score  ${predictor}  2 5 8  header sum    \
--out ${outdir}/${cohort}_${trait}_SBayesRC

The parameters after your predictor file means

  • 2 5 8: the three useful columns in the predictor file.
  • header: The predictor file has a header row.
  • sum: Plink prefers to divide the score by the number of SNPs in predictor. Using “sum” will prevent the division step.