{ "cells": [ { "cell_type": "markdown", "id": "e7c16e8b-2abd-4e7b-a68f-14a84ff7ce01", "metadata": {}, "source": [ "# Python API for the pycmplot package\n", "\n", "-----" ] }, { "cell_type": "markdown", "id": "f593ebb1-2155-4b53-9159-64631ff683dc", "metadata": {}, "source": [ "## Installation\n", "\n", "Follow instructions in https://pypi.org/project/pycmplot/\n", "\n", "-----" ] }, { "cell_type": "markdown", "id": "8f49dab7-5795-49b4-88e9-dba30208fa49", "metadata": {}, "source": [ "## Import the package functions\n", "\n", "Eight main functions are involved in generating plots\n", "\n", " - `prep_pycmplot_input_info`: \n", " formats input streams correctly for data processing function\n", " - --> auto-detects chrom, pos, snp, pval, and build columns\n", " - --> auto-detects delimiter\n", " - --> returns a dictionary with these information that the function below needs to appropriately handle sumstats.\n", " \n", " - `get_sumstats_and_merged_sector_list`: \n", " processes sumstat for plotting functions (work horse of the package)\n", " - --> trims variants based on `trim_pval` threshold to increase speed and memory efficiency of plotting functions\n", " - --> fixes chrom labels\n", " - --> adds **logP** column if specified\n", " - --> adds a **LABEL** column which is simply the labels provided by user.\n", " - --> converts hg19 to hg38 genome build\n", " - --> identifies lead SNPs based on significance threshold and generates hits table\n", " - --> generates sector size dictionary for circular plots\n", " - --> generates dictionary of genome-wide significance and suggestive thresholds per summary stat\n", " - --> returns a **pycmplot_dict** dictionary containing input information for plotting functions. These include (these are the dict keys):\n", " - **sectors**: a dictionary of sector sizes for circular plotting.\n", " - **dfs**: the loaded summary stats with `trim_pval` applied if set.\n", " - **annot**: a hits summary table generated for potential annotation.\n", " - **lines**: a dictionary of values to use to plot genome-wide significance and suggestive lines.\n", " - **pvals**: a dictionary of raw p-values extracted for qq-plotting before applying `trim_pval` for Manhattan plotting.\n", "\n", "\n", " - `plot_linear`: \n", " generates linear Manhattan plots\n", " - --> returns plt.Axes\n", "\n", " - `plot_circular`:\n", " generates circular Manhattan plots\n", " \n", " - `plot_qq_single`: \n", " generates a qq-plot in user-supplied axis.\n", " - Useful in creating customized multi-panel qq-plots in one figure.\n", " - --> returns plt.Axes\n", " \n", " - `plot_qq_separate`:\n", " generates separate qq-plots per sumstat in different figures.\n", " \n", " - `plot_qq_overlay`:\n", " generates qq-plots in one figures on a single shared axes with different colors per sumstat,\n", " and lambda values included in legend.\n", "\n", " - `plot_qq_combined`:\n", " generates multi-panel qq-plots using `plot_qq_single` arranged \n", " in a configurable column grid in a single figure.\n", " \n", " \n", "***NB:***\n", "> _Auto-detection of column names and delimiter is very important when dealing with multiple sumstats from different projects or tools with different\n", "> column naming styles._" ] }, { "cell_type": "code", "execution_count": 1, "id": "f778281e-94cc-441a-8ec6-143d3fcb6ad1", "metadata": { "tags": [] }, "outputs": [], "source": [ "from pycmplot import (\n", " prep_pycmplot_input_info, \n", " get_sumstats_and_merged_sector_list, \n", " plot_linear, plot_circular, \n", " plot_qq_single, \n", " plot_qq_separate, \n", " plot_qq_overlay, \n", " plot_qq_combined,\n", ")" ] }, { "cell_type": "markdown", "id": "7d0afb20-2412-4161-a7f3-54caf87ee43b", "metadata": {}, "source": [ "## Getting help\n", "\n", "To see the options of a function, simply run `help(function_name)`\n", "\n", "Example:\n", "\n", " - `help(prep_pycmplot_input_info)`" ] }, { "cell_type": "markdown", "id": "8381f317-72a7-4c0b-92c1-4b402ba82229", "metadata": {}, "source": [ "# Example plotting\n", "\n", "- Prepare lists of your summary stats and their corresponding labels" ] }, { "cell_type": "code", "execution_count": 6, "id": "6d216762-004e-4b94-9f09-13c4fa7c58d7", "metadata": {}, "outputs": [], "source": [ "sumstats_list = [\n", " '/data/awonkam1/kesoh/gwas/sitt/sumher/pycmplot/sitt_all_panels_nHbF_agecut-5_kvik1.step2.assoc.gz', \n", " '/data/awonkam1/kesoh/gwas/sitt/sumher/pycmplot/sitt_all_panels_nMCV_agecut-5_kvik1.step2.assoc.gz'\n", "]\n", "labels_list = ['HbF','MCV']" ] }, { "cell_type": "markdown", "id": "d6862585-29a4-4fb2-8d3e-59e56383351e", "metadata": {}, "source": [ "- Pass these lists to the `prep_pycmplot_input_info` function\n", "- These files include summary stats from multiple imputation panels in different genomic \n", " coordinates concatenated together and a column 'BUILD' added to indicate the build of each position" ] }, { "cell_type": "code", "execution_count": 3, "id": "c27038b7-04c9-4029-80ec-017ab390f94f", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "{'HbF': [['CHR', 'POS', 'SNP', 'P', 'BUILD'],\n", " {'CHR': 'category',\n", " 'POS': object,\n", " 'SNP': str,\n", " 'P': float,\n", " 'BUILD': 'category'},\n", " {'CHR': 'CHR', 'POS': 'POS', 'SNP': 'SNP', 'P': 'P', 'BUILD': 'BUILD'},\n", " '\\t'],\n", " 'MCV': [['CHR', 'POS', 'SNP', 'P', 'BUILD'],\n", " {'CHR': 'category',\n", " 'POS': object,\n", " 'SNP': str,\n", " 'P': float,\n", " 'BUILD': 'category'},\n", " {'CHR': 'CHR', 'POS': 'POS', 'SNP': 'SNP', 'P': 'P', 'BUILD': 'BUILD'},\n", " '\\t']}" ] }, "execution_count": 3, "metadata": {}, "output_type": "execute_result" } ], "source": [ "sumstat_info_dict = prep_pycmplot_input_info(\n", " sum_stats=sumstats_list,\n", " labels=labels_list\n", ")\n", "\n", "# take a look at the dictionary: the function has auto-detected column names and delimiter\n", "sumstat_info_dict" ] }, { "cell_type": "markdown", "id": "5c726956-fedb-47e2-aa3a-2c2018e1ec92", "metadata": {}, "source": [ "## Important notes\n", "- Hits table will be generated here containing variants and their gene annotations at the set significance threshold using the `signific_threshold` option.\n", "- If say you use a different p-value threshold to highlight positions in the plotting functions, those positions will be highlighted but will not be annotated because annotations in the `hits_table` were generated using a different p-value threshold.\n", "- If you wish to annotate all highlighted positions, then use the same `signif_threshold` and `highlight_threshold` everywhere.\n", "\n", "- For instance, below we use the default genome-wide significance threshold by not setting any value" ] }, { "cell_type": "code", "execution_count": 4, "id": "ab4a538f-1ea6-43f2-81db-e7d469945ab4", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "[INFO] Loading MCV from /data/awonkam1/kesoh/gwas/sitt/sumher/pycmplot/sitt_all_panels_nMCV_agecut-5_kvik1.step2.assoc.gz …\n", "[INFO] Loaded 82710042 variants from summary stat file, using 6.84 GB of memory\n", "[INFO] Extracting raw p-values for QQ-plotting ...\n", "[INFO] Excluding variants with p-value less than 0.01 to speed up Manhattan plotting ...\n", "[INFO] 854187 variants remain after trimming, using 7.74 MB of memory\n", "[INFO] Adding a 'logP' column ...\n", "[INFO] Normalizing chromosome names {\"23\": \"X\", \"24\": \"Y\", \"M\": \"MT\", \"MTDNA\": \"MT\"} ...\n", "[INFO] Converting hg19 coordinates to hg38 ...\n", "[INFO] Loading LiftOver chain file: /home/kesohku1/projects/gwas/scripts/pycmplot/pycmplot/data/hg19ToHg38.over.chain.gz\n", "[INFO] Extracting variants to highlight ...\n", "[INFO] Loading HbF from /data/awonkam1/kesoh/gwas/sitt/sumher/pycmplot/sitt_all_panels_nHbF_agecut-5_kvik1.step2.assoc.gz …\n", "[INFO] Loaded 82710042 variants from summary stat file, using 6.84 GB of memory\n", "[INFO] Extracting raw p-values for QQ-plotting ...\n", "[INFO] Excluding variants with p-value less than 0.01 to speed up Manhattan plotting ...\n", "[INFO] 890427 variants remain after trimming, using 8.07 MB of memory\n", "[INFO] Adding a 'logP' column ...\n", "[INFO] Normalizing chromosome names {\"23\": \"X\", \"24\": \"Y\", \"M\": \"MT\", \"MTDNA\": \"MT\"} ...\n", "[INFO] Converting hg19 coordinates to hg38 ...\n", "[INFO] Extracting variants to highlight ...\n", "[INFO] Loading gene info from: /home/kesohku1/projects/gwas/scripts/pycmplot/pycmplot/data/Homo_sapiens.GRCh38.geneinfo.tsv.gz\n", "[INFO] Annotating lead variants and generating hits summary table ...\n", "[INFO] Locus summary written to: test_pycmplot_python_api.tsv\n", "[INFO] Computing per-sumstat sector sizes (chrom → [min_pos, max_pos])\n", "CPU times: user 4min 26s, sys: 22.6 s, total: 4min 49s\n", "Wall time: 4min 50s\n" ] }, { "data": { "text/html": [ "
| \n", " | CHR | \n", "POS | \n", "SNP | \n", "P | \n", "BUILD | \n", "logP | \n", "LABEL | \n", "OLD_POS | \n", "OLD_BUILD | \n", "genic | \n", "... | \n", "promoter_upstream_flag | \n", "gene_density | \n", "top_gene | \n", "biotype | \n", "priority_score | \n", "distance | \n", "promoter_flag | \n", "distance_score | \n", "biotype_weight | \n", "promoter_bonus | \n", "
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | \n", "2 | \n", "60493111 | \n", "2:60493111:SNV | \n", "3.039700e-18 | \n", "hg38 | \n", "17.517169 | \n", "HbF | \n", "60493111 | \n", "hg38 | \n", "True | \n", "... | \n", "False | \n", "46 | \n", "BCL11A | \n", "protein_coding | \n", "4.0 | \n", "0 | \n", "False | \n", "1.0 | \n", "1.0 | \n", "0.0 | \n", "
| 0 | \n", "16 | \n", "258003 | \n", "16:258003:SNV | \n", "1.094200e-13 | \n", "hg38 | \n", "12.960903 | \n", "MCV | \n", "258003 | \n", "hg38 | \n", "True | \n", "... | \n", "False | \n", "150 | \n", "FAM234A | \n", "protein_coding | \n", "4.0 | \n", "0 | \n", "False | \n", "1.0 | \n", "1.0 | \n", "0.0 | \n", "
2 rows × 24 columns
\n", "| \n", " | CHR | \n", "POS | \n", "SNP | \n", "P | \n", "BUILD | \n", "logP | \n", "LABEL | \n", "OLD_POS | \n", "OLD_BUILD | \n", "genic | \n", "... | \n", "promoter_upstream_flag | \n", "gene_density | \n", "top_gene | \n", "biotype | \n", "priority_score | \n", "distance | \n", "promoter_flag | \n", "distance_score | \n", "biotype_weight | \n", "promoter_bonus | \n", "
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | \n", "2 | \n", "60493111 | \n", "2:60493111:SNV | \n", "3.039700e-18 | \n", "hg38 | \n", "17.517169 | \n", "HbF | \n", "60493111 | \n", "hg38 | \n", "True | \n", "... | \n", "False | \n", "46.0 | \n", "BCL11A | \n", "protein_coding | \n", "4.0 | \n", "0 | \n", "False | \n", "1.0 | \n", "1.0 | \n", "0.0 | \n", "
| 0 | \n", "16 | \n", "258003 | \n", "16:258003:SNV | \n", "1.094200e-13 | \n", "hg38 | \n", "12.960903 | \n", "MCV | \n", "258003 | \n", "hg38 | \n", "True | \n", "... | \n", "False | \n", "150.0 | \n", "FAM234A | \n", "protein_coding | \n", "4.0 | \n", "0 | \n", "False | \n", "1.0 | \n", "1.0 | \n", "0.0 | \n", "
| 2 | \n", "X | \n", "63094194 | \n", "23:62313664:SNV | \n", "5.633700e-08 | \n", "hg38 | \n", "7.249206 | \n", "HbF | \n", "62313664 | \n", "hg19 | \n", "False | \n", "... | \n", "False | \n", "NaN | \n", "SPIN4-ARHGEF9 | \n", "intergenic | \n", "NaN | \n", "253034-540773 | \n", "None | \n", "NaN | \n", "NaN | \n", "NaN | \n", "
3 rows × 24 columns
\n", "| \n", " | CHR | \n", "SNP | \n", "POS | \n", "P | \n", "logP | \n", "LABEL | \n", "
|---|---|---|---|---|---|---|
| 5 | \n", "2 | \n", "2:60721347 | \n", "60721347 | \n", "1.567800e-16 | \n", "15.804709 | \n", "HbF | \n", "
| 3 | \n", "16 | \n", "16:258003:G:A | \n", "258003 | \n", "1.094200e-13 | \n", "12.960903 | \n", "MCV | \n", "
| \n", " | CHR | \n", "SNP | \n", "POS | \n", "P | \n", "BUILD | \n", "logP | \n", "LABEL | \n", "OLD_POS | \n", "OLD_BUILD | \n", "genic | \n", "... | \n", "promoter_upstream_flag | \n", "gene_density | \n", "top_gene | \n", "biotype | \n", "priority_score | \n", "distance | \n", "promoter_flag | \n", "distance_score | \n", "biotype_weight | \n", "promoter_bonus | \n", "
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | \n", "2 | \n", "2:60721347 | \n", "60494212 | \n", "1.567800e-16 | \n", "hg38 | \n", "15.804709 | \n", "HbF | \n", "60721347.0 | \n", "hg19 | \n", "True | \n", "... | \n", "False | \n", "46 | \n", "BCL11A | \n", "protein_coding | \n", "4.0 | \n", "0 | \n", "False | \n", "1.0 | \n", "1.0 | \n", "0.0 | \n", "
| 0 | \n", "16 | \n", "16:258003:G:A | \n", "258003 | \n", "1.094200e-13 | \n", "hg38 | \n", "12.960903 | \n", "MCV | \n", "NaN | \n", "NaN | \n", "True | \n", "... | \n", "False | \n", "150 | \n", "FAM234A | \n", "protein_coding | \n", "4.0 | \n", "0 | \n", "False | \n", "1.0 | \n", "1.0 | \n", "0.0 | \n", "
2 rows × 24 columns
\n", "