You should now be at the stage where all of the phenotypes have been generated and are stored as R data objects. The following script converts these objects into tables, converting binary phenotypes into cases, controls and exclusions and (optionally) transforming quantitative phenotypes using the inverse normal transformation. This is also the stage at which you can add the optional sex-stratified and age-of-onset phenotypes. Sex-stratified versions apply only to phenotypes that are not already stratified by sex, and age-of-onset versions only to phenotypes that have a reliable date of onset.
To run any of the scripts from this point on, first change to the scripts directory:
cd $package_folder/extdata/scripts/association_testing/
01_phenotype_preparation.R
Usage
01_phenotype_preparation.R (--phenotype_folder=<FOLDER> | --phenotype_files=<FILES>)
(--phenotype_filtered_save_name=<FILE>) [(--relate_remove --kinship_file=<FILE>)]
[(--sex_split_phenotypes=<FILE> --sex_info=<FILE> | --sex_split_all --sex_info=<FILE>)]
[(--age_of_onset_phenotypes=<FILE> --DOB_file=<FILE> | --age_of_onset_all --DOB_file=<FILE>)]
[--groupings=<FILE> --quantitative_Case_N=<number> --binary_Case_N=<number> --male=<number>
--female=<number> --PheWAS_manifest_overide=<FILE> --IVNT --save_RDS=<FILE> --stats_save=<FILE>]
There are appreciably more inputs for this script than for the scripts used to generate phenotypes, which reflects the larger number of options available to the user. We will go through several of these in the examples; for full details of all of the arguments, use the help argument -h.
Example 1 - standard input
Below is the standard input for preparing phenotypes for association testing. The phenotypes are first filtered by case number (default values of 50 and 100 are used if --quantitative_Case_N=<number> and --binary_Case_N=<number> respectively are not supplied), then filtered by relatedness, and then filtered again by case number, in case removing related individuals has dropped the case number below the threshold for inclusion. The inverse normal transformation is applied to all quantitative traits using the --IVNT argument. The optional argument --groupings is not used in this example but would commonly be used in a real PheWAS to subset the population, usually by ancestry. If --groupings is used, the script performs all filtering steps per grouping and produces a table of phenotypes per group. The script also produces two stats files (when the --stats_save argument is used), which record the case and control numbers per phenotype for all groupings after initial filtering and after related individuals have been removed.
./01_phenotype_preparation.R \
--phenotype_filtered_save_name $phewas_folder/phenotype_table/example_table \
--phenotype_folder $phewas_folder/data/phenotypes/ \
--relate_remove \
--kinship_file $phewas_folder/data/related_callrate \
--IVNT \
--stats_save $phewas_folder/phenotype_table/example_stats
As no --groupings file is provided, this saves a single phenotype table, all_pop_example_table.gz, which uses the default group name "all" in its file name, and two stats files, example_stats_N_filtered.csv and example_stats_relate_remove.csv.
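For orientation, the inverse normal transformation requested with --IVNT is usually a rank-based transformation. The formula below is the common textbook form, not necessarily the exact variant the script implements: the offset c is an assumption here (3/8 and 1/2 are both widely used), r_i is the rank of individual i among the n non-missing values of a trait, and \Phi^{-1} is the standard normal quantile function.

```latex
y_i = \Phi^{-1}\!\left( \frac{r_i - c}{\,n - 2c + 1\,} \right),
\qquad c \in \left\{ \tfrac{3}{8}, \tfrac{1}{2} \right\}
```

Whatever the offset, the transformed trait is approximately standard normal, which is why the transformation is applied per trait after filtering.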
Example 2 - sex stratified phenotypes
Below we have added the --sex_split_phenotypes and --sex_info arguments (both are required). The --sex_info argument takes the combined_sex file that was created by the 02_data_preparation.R script. The --sex_split_phenotypes argument requires a file with the header PheWAS_ID that lists each phenotype requiring a sex-stratified version. Another option is the --sex_split_all argument, which creates sex-stratified phenotypes for all of the existing non-sex-specific phenotypes. Try replacing --sex_split_phenotypes (and its file) with --sex_split_all; remember to change the save names given to --phenotype_filtered_save_name and --stats_save if you want to keep both versions.
./01_phenotype_preparation.R \
--phenotype_filtered_save_name $phewas_folder/phenotype_table/example_table_sex \
--phenotype_folder $phewas_folder/data/phenotypes/ \
--relate_remove \
--kinship_file $phewas_folder/data/related_callrate \
--IVNT \
--stats_save $phewas_folder/phenotype_table/example_stats_sex \
--sex_split_phenotypes $package_folder/extdata/worked_example/sex_filter.txt.gz \
--sex_info $phewas_folder/data/combined_sex
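If you want to supply your own list to --sex_split_phenotypes, a file with the required PheWAS_ID header can be put together as in the minimal sketch below. The output location, the file name my_sex_filter.txt.gz and the IDs P0001 and P0002 are placeholders chosen for illustration, not real phenotype names; compressing the file simply mirrors the sex_filter.txt.gz used in the worked example.

```shell
# Hypothetical output location; replace with a folder in your own project.
out_dir=$(mktemp -d)
sex_list="$out_dir/my_sex_filter.txt.gz"

# One column with the header PheWAS_ID, then one phenotype ID per row.
# P0001 and P0002 are placeholder IDs, not real phenotypes.
printf '%s\n' PheWAS_ID P0001 P0002 | gzip > "$sex_list"

# Inspect the file before passing it to --sex_split_phenotypes.
gzip -dc "$sex_list"
```

The resulting path would then be given to --sex_split_phenotypes in place of the worked-example file.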
Example 3 - age of onset phenotypes
Below we have added the --age_of_onset_phenotypes and --DOB_file arguments (both are required). The --DOB_file argument takes the DOB file that was created by the 02_data_preparation.R script. The --age_of_onset_phenotypes argument requires a file containing four columns: PheWAS_ID, lower_limit, upper_limit and transformation.
PheWAS_ID lists the phenotypes for which to create age-of-onset phenotypes. lower_limit is the lower age boundary, read as <= (lower_limit <= age of onset), and upper_limit is the upper age boundary, read as >= (upper_limit >= age of onset). transformation is the type of transformation (if any) that should be applied to the phenotype. Currently the only accepted transformation is "IVNT" for inverse normal transformation; if no transformation is required, enter NA. All age-of-onset phenotypes are labelled (PheWAS_ID)_age_of_onset. Unlike the equivalent sex-stratified --sex_split_all argument, the --age_of_onset_all argument is not yet stable and should not be used.
Not all phenotypes can produce a useful age_of_onset variant. Quantitative phenotypes, for example, have no meaningful age of onset, so take care when selecting the phenotypes for which an age-of-onset variant is generated.
./01_phenotype_preparation.R \
--phenotype_filtered_save_name $phewas_folder/phenotype_table/example_table_sex_aoo \
--phenotype_folder $phewas_folder/data/phenotypes/ \
--relate_remove \
--kinship_file $phewas_folder/data/related_callrate \
--IVNT \
--stats_save $phewas_folder/phenotype_table/example_stats_sex_aoo \
--sex_split_phenotypes $package_folder/extdata/worked_example/sex_filter.txt.gz \
--sex_info $phewas_folder/data/combined_sex \
--age_of_onset_phenotypes $package_folder/extdata/worked_example/aoo_filter.txt.gz \
--DOB_file $phewas_folder/data/DOB
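A file for --age_of_onset_phenotypes with the four columns described above could be assembled along the following lines. This is only a sketch under stated assumptions: the tab delimiter is assumed (the text does not say which separator the script expects), and the file name, the IDs P0001 and P0002 and the age limits are made-up placeholders.

```shell
# Hypothetical output location; replace with a folder in your own project.
out_dir=$(mktemp -d)
aoo_list="$out_dir/my_aoo_filter.txt.gz"

# Header plus one row per phenotype. Tab separation is an assumption.
# P0001/P0002 and the limits are placeholders; NA means no transformation.
{
  printf 'PheWAS_ID\tlower_limit\tupper_limit\ttransformation\n'
  printf 'P0001\t18\t80\tIVNT\n'
  printf 'P0002\t0\t100\tNA\n'
} | gzip > "$aoo_list"

# Sanity check: exit non-zero if any row does not have exactly four fields.
gzip -dc "$aoo_list" | awk -F '\t' 'NF != 4 { bad = 1 } END { exit bad + 0 }'
```

Checking the field count up front is a cheap way to catch a stray delimiter before the file reaches the preparation script.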