vignettes/input_datasets.rmd
Note: it is highly recommended to check the number of datapoints in your dataset before you plot, since a very large dataset will take a long time to plot.
I usually reduce the size of the GWAS dataset prior to plotting by filtering out variants with P > 1e-03.
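The check-then-filter step described above can be sketched in base R as follows. The data frame `gwas` is a made-up toy example (it is not one of topr's datasets), and the column name `P` is assumed to follow the naming described in this vignette.

```r
# Toy GWAS results; the values are invented for illustration only
gwas <- data.frame(
  CHROM = c("chr1", "chr1", "chr2", "chr3"),
  POS   = c(1006415, 1341559, 2500000, 900000),
  P     = c(4.7e-04, 0.25, 3.0e-09, 0.0012)
)

# Check how many datapoints there are before plotting
nrow(gwas)

# Keep only variants with P < 1e-03 (drops everything with P >= 1e-03)
gwas_small <- gwas[gwas$P < 1e-03, ]
nrow(gwas_small)
```

The filtered data frame keeps the same columns, so it can be used wherever the full dataset would have been.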
topr comes with three example GWAS datasets: one on ulcerative colitis retrieved from the UKBB (UC_UKBB), and two on Crohn’s disease (CD_UKBB and CD_FINNGEN), obtained from the UK Biobank and FinnGen respectively. topr uses gene and exon datasets from Ensembl (GRCh38.pxx) (ENSGENES and ENSEXONS).
See topr reference for more details on the in-built datasets.
Input datasets must include at least three columns (CHROM, POS and P). The naming of the columns is flexible (e.g. the chromosome column can be labelled either chr or chrom, and the label is case insensitive).
topr has three built-in GWAS datasets. Take a look at the Crohn’s disease GWAS (CD_UKBB) by issuing the following command:
head(CD_UKBB)
CHROM POS ID P OR
1 chr1 1006415 rs145588482 0.000468758 0.583384
2 chr1 1006415 rs145588482 0.000468758 0.583384
3 chr1 1007256 rs76233940 0.000401567 0.579783
4 chr1 1007256 rs76233940 0.000401567 0.579783
5 chr1 1007256 rs76233940 0.000401567 0.579783
6 chr1 1341559 rs376494450 0.000151216 1.320130
The chromosome in the CHROM column can be represented with or without the chr prefix, e.g. chr1 or 1.
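Since both forms are accepted, no conversion is needed for topr itself. If you need to match chromosome labels against another data source, though, a small base R sketch like the one below converts between the two forms. The vector `chrom` is an invented example, and this is not a description of how topr handles the labels internally.

```r
# Invented example labels mixing both styles
chrom <- c("chr1", "1", "chrX", "22")

# Strip a leading "chr" (case-insensitively) to get the bare form
bare <- sub("^chr", "", chrom, ignore.case = TRUE)

# Add the prefix back to get a uniform "chr"-prefixed form
prefixed <- paste0("chr", bare)
```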
Input data column names and alternatives understood by topr:
Required columns and alternative namings:
  CHROM: CHR, chr, Chrom, chrom, chromosome, CHROMOSOME
  POS: pos, BP, bp, base_pair_location
  P: p, PVAL, pval, PVALUE, pvalue, p_value
Optional columns and alternative namings:
  ID: rsid, rsId, RSID, SNP, snp, rsName, rsname, RSNAME
  Gene_Symbol: Gene_symbol, GENENAME, geneName, genename, GENE, gene
  Max_Impact: max_impact, Impact, impact
  REF: ref
  ALT: alt
  BETA: beta, b, B
  OR: or, odds_ratio
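topr recognizes these alternative names itself, so renaming is not required. As an illustration of how such an alias table works, and as a way to check your own data before plotting, here is a minimal base R sketch. The helper `standardize_columns()` is hypothetical (it is not part of topr), the `raw` data frame is invented, and only the required columns from the list above are covered.

```r
# Alias table copied from the list above (required columns only)
aliases <- list(
  CHROM = c("CHR", "chr", "Chrom", "chrom", "chromosome", "CHROMOSOME"),
  POS   = c("pos", "BP", "bp", "base_pair_location"),
  P     = c("p", "PVAL", "pval", "PVALUE", "pvalue", "p_value")
)

# Hypothetical helper: rename known aliases to the canonical column name
# and stop with an error if a required column is still missing afterwards
standardize_columns <- function(df, aliases) {
  for (canonical in names(aliases)) {
    hit <- names(df) %in% aliases[[canonical]]
    names(df)[hit] <- canonical
  }
  missing <- setdiff(names(aliases), names(df))
  if (length(missing) > 0) {
    stop("Missing required column(s): ", paste(missing, collapse = ", "))
  }
  df
}

# Invented one-row example using alternative column names
raw <- data.frame(chromosome = "chr1", bp = 1006415, pvalue = 0.000468758)
clean <- standardize_columns(raw, aliases)
names(clean)
```

Matching here is exact against the listed spellings, so a spelling that is not in the table (for example `Pval`) would be reported as missing by this sketch even if topr itself accepts it.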