Quickstart
This page walks through the most common workflows in a few lines each. For a detailed walkthrough with real data, see Python API Tutorial.
Input file format
pycmplot accepts any whitespace- or comma-delimited summary statistics file,
including gzip-compressed (.gz) files. Required columns (auto-detected from
common names) are:
Chromosome (e.g. CHR, CHROM, #CHROM)
Base-pair position (e.g. BP, POS, pos)
Variant identifier (e.g. SNP, RSID, MarkerName)
P-value or test statistic (e.g. P, pvalue, Wald_P)
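Auto-detection of this kind can be pictured as a lookup against lists of known aliases. The sketch below is illustrative only: the alias sets contain just the example names from the list above, case-insensitive matching is an assumption, and `detect_columns` is a hypothetical helper rather than pycmplot's actual resolver.

```python
# Illustrative only: alias sets hold just the example names listed above.
ALIASES = {
    "chrom": {"CHR", "CHROM", "#CHROM"},
    "pos": {"BP", "POS"},
    "snp": {"SNP", "RSID", "MARKERNAME"},
    "pval": {"P", "PVALUE", "WALD_P"},
}


def detect_columns(header):
    """Map each required role to the first header name matching a known alias."""
    found = {}
    for name in header:
        key = name.strip().upper()  # assumption: matching ignores case
        for role, names in ALIASES.items():
            if key in names and role not in found:
                found[role] = name
    missing = sorted(set(ALIASES) - set(found))
    if missing:
        raise ValueError(f"could not detect columns for: {', '.join(missing)}")
    return found


print(detect_columns(["#CHROM", "pos", "MarkerName", "Wald_P", "BETA"]))
```

Running a check like this on the header line before plotting is a quick way to confirm that a file carries all four required columns.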
Optionally, a genome-build column (hg19 / hg38) enables automatic
liftover (see Plotting behaviour options). Alternatively, pass per-file builds on
the command line with --build hg19,hg38,....
Tip
For large summary statistics files, always pass --trim_pval 0.01 to
discard variants with p > 0.01 before plotting. This can reduce memory usage
by an order of magnitude.
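To see what trimming amounts to, here is a minimal standard-library sketch that streams a gzip-compressed file and keeps only rows with p <= 0.01. It assumes a tab-delimited file whose p-value column is named P; `trim_by_pval` is a hypothetical helper for illustration, not how pycmplot implements --trim_pval.

```python
import csv
import gzip


def trim_by_pval(src, dst, threshold=0.01, pcol="P"):
    """Stream a gzipped TSV and keep only rows with p <= threshold."""
    kept = 0
    with gzip.open(src, "rt", newline="") as fin, \
            gzip.open(dst, "wt", newline="") as fout:
        reader = csv.DictReader(fin, delimiter="\t")
        writer = csv.DictWriter(fout, fieldnames=reader.fieldnames, delimiter="\t")
        writer.writeheader()
        for row in reader:
            try:
                p = float(row[pcol])
            except (TypeError, ValueError):
                continue  # skip rows with a missing or non-numeric p-value
            if p <= threshold:
                writer.writerow(row)
                kept += 1
    return kept
```

Because rows are processed one at a time, memory use stays flat regardless of file size; the saving in the plotting step comes from the much smaller number of variants that survive the filter.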
Command line
Most users will use the CLI. The typical invocation for a multi-track plot is:
pycmplot \
    --sum_stats HbF.tsv.gz,MCV.txt.gz,MCH.tsv.gz \
    --labels HbF,MCV,MCH \
    --logp \
    --signif_line \
    --highlight \
    --annotate GENE \
    --trim_pval 0.01 \
    --output_dir ./results
For the full CLI reference, see Command-Line Interface.
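When the file list is generated by a script, it can be convenient to assemble the same invocation from Python. The sketch below uses only the flags shown in the example above and the standard library; it prints the command rather than running it, since execution assumes pycmplot is installed and on PATH.

```python
import shlex

files = ["HbF.tsv.gz", "MCV.txt.gz", "MCH.tsv.gz"]
labels = ["HbF", "MCV", "MCH"]

# Comma-join the per-track values, mirroring the CLI example.
cmd = [
    "pycmplot",
    "--sum_stats", ",".join(files),
    "--labels", ",".join(labels),
    "--logp",
    "--signif_line",
    "--highlight",
    "--annotate", "GENE",
    "--trim_pval", "0.01",
    "--output_dir", "./results",
]

print(shlex.join(cmd))  # inspect the command before running it

# To execute (requires pycmplot on PATH):
# import subprocess
# subprocess.run(cmd, check=True)
```

Passing the command as a list to subprocess.run avoids shell quoting problems with unusual file names.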
Python API
The Python API exposes the same pipeline as the CLI, in three explicit
steps: (1) resolve per-file column names, (2) load and pre-process the
summary statistics, (3) render the plot. Step 2 returns a dictionary; the
rendering functions take its dfs entry as their sumstats_loaded argument.
Linear Manhattan plot (single trait)
from pycmplot import (
    prep_pycmplot_input_info,
    get_sumstats_and_merged_sector_list,
    plot_linear,
)

files = ["HbF.tsv.gz"]
labels = ["HbF"]

# 1. Resolve per-file column names and delimiters
file_info = prep_pycmplot_input_info(
    sum_stats=files,
    labels=labels,
)

# 2. Load data, run liftover (if needed), extract lead SNPs,
#    build the hits table, and compute merged sector sizes
result = get_sumstats_and_merged_sector_list(
    sum_stats=files,
    labels=labels,
    logp=True,
    trim_pval=0.01,
    file_info=file_info,
    signif_threshold=5e-8,
)

# 3. Render the plot
plot_linear(
    sumstats_loaded=result["dfs"],
    logp=True,
    signif_lines=result["lines"],
    hits_table=result["annot"],
    annotate="GENE",
    label_col="top_gene",
    colors=["steelblue", "silver"],
    output_dir="./results",
    output_format="png",
    dpi=300,
)
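Assuming logp=True plots -log10(p) on the y-axis, as is conventional for Manhattan plots (this page does not spell out the transform), the signif_threshold=5e-8 used above corresponds to a line at roughly 7.30. The short sketch below shows that arithmetic; `neg_log10` and its floor value are hypothetical, included only to show one way of guarding against p-values that underflow to zero.

```python
import math

signif_threshold = 5e-8

# Height of the genome-wide significance line on a -log10(p) axis.
line_height = -math.log10(signif_threshold)
print(round(line_height, 2))


def neg_log10(p, floor=1e-300):
    """-log10(p), clamping p at a small floor so that p == 0 stays finite."""
    return -math.log10(max(p, floor))
```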
Multi-track linear Manhattan plot
Compare three RBC traits in a single stacked figure:
files = ["HbF.tsv.gz", "MCV.txt.gz", "MCH.tsv.gz"]
labels = ["HbF", "MCV", "MCH"]

file_info = prep_pycmplot_input_info(sum_stats=files, labels=labels)

result = get_sumstats_and_merged_sector_list(
    sum_stats=files,
    labels=labels,
    logp=True,
    trim_pval=0.01,
    file_info=file_info,
    signif_threshold=5e-8,
)

plot_linear(
    sumstats_loaded=result["dfs"],
    logp=True,
    signif_lines=result["lines"],
    hits_table=result["annot"],
    annotate="GENE",
    label_col="top_gene",
    colors=["steelblue", "silver"],
    output_dir="./results",
)
Circular (Circos-style) Manhattan plot
from pycmplot import plot_circular

plot_circular(
    sumstats_loaded=result["dfs"],
    sector_sizes=result["sectors"],
    logp=True,
    signif_lines=result["lines"],
    hits_table=result["annot"],
    annotate="GENE",
    label_col="top_gene",
    colors=["steelblue", "silver"],
    plot_title="RBC Traits",
    output_dir="./results",
)
QQ plots
Each QQ function accepts, as pval_dict, the pvals entry of the dictionary
returned by get_sumstats_and_merged_sector_list():
from pycmplot import plot_qq_combined, plot_qq_overlay, plot_qq_separate

# Grid of per-track QQ panels
plot_qq_combined(
    pval_dict=result["pvals"],
    thin=True,
    max_points=50_000,
    ncols=3,
    title="RBC Traits",
    output_path="./results/rbc_qq",
    fig_format="png",
)

# All traits overlaid on one axes, with lambda in the legend
plot_qq_overlay(
    pval_dict=result["pvals"],
    thin=True,
    max_points=50_000,
    title="RBC Traits",
    output_path="./results/rbc_qq_overlay",
    fig_format="png",
)

# One file per trait
plot_qq_separate(
    pval_dict=result["pvals"],
    base_name="RBC",
    thin=True,
    max_points=50_000,
    output_path="./results/rbc_qq",
    fig_format="png",
)
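The lambda shown in the overlay legend is presumably the conventional genomic inflation factor: the median observed 1-d.f. chi-square statistic divided by its expected median under the null. This page does not say how pycmplot computes it, so the sketch below is a standard-library illustration of that conventional definition, with `genomic_inflation` as a hypothetical helper.

```python
import statistics
from statistics import NormalDist

CHI2_1DF_MEDIAN = 0.4549364  # median of a chi-square distribution with 1 d.f.


def genomic_inflation(pvals):
    """Lambda GC: median observed 1-d.f. chi-square over its null median."""
    norm = NormalDist()
    # A two-sided p-value maps to z = inv_cdf(p / 2); chi2 = z ** 2.
    # Using the lower tail keeps very small p-values numerically stable.
    chi2 = [norm.inv_cdf(p / 2) ** 2 for p in pvals if 0 < p < 1]
    return statistics.median(chi2) / CHI2_1DF_MEDIAN


# Evenly spaced p-values mimic a well-calibrated null distribution.
n = 10_000
uniform = [(i + 0.5) / n for i in range(n)]
print(round(genomic_inflation(uniform), 3))
```

Values close to 1 indicate well-calibrated test statistics, while values well above 1 suggest inflation. Note that lambda should be computed on untrimmed p-values, because discarding variants with p > 0.01 shifts the median.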
Mixed genome builds with liftover
If your summary statistics were generated on different genome builds
(hg18, hg19, or hg38), pycmplot can lift hg18 and hg19 coordinates over to
hg38 before plotting. Supply the builds either through a BUILD column
in the files or by passing --build (CLI) / build_list= (API):
pycmplot \
    --sum_stats study_hg18.tsv.gz,study_hg19.tsv.gz,study_hg38.tsv.gz \
    --labels Study_A,Study_B,Study_C \
    --build hg18,hg19,hg38 \
    --logp \
    --annotate GENE \
    --output_dir ./results
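Since --build is positional (one build per file, in the same order as --sum_stats), a mismatch between the two lists is easy to introduce. The sketch below checks the pairing before the command is built. `build_argument` is a hypothetical helper, and the set of accepted labels is simply the three builds named in this section.

```python
SUPPORTED_BUILDS = {"hg18", "hg19", "hg38"}  # builds named in this section


def build_argument(files, builds):
    """Return the comma-separated value for --build, one build per file."""
    if len(files) != len(builds):
        raise ValueError(f"{len(files)} files but {len(builds)} builds")
    unknown = sorted(set(builds) - SUPPORTED_BUILDS)
    if unknown:
        raise ValueError(f"unsupported build(s): {', '.join(unknown)}")
    return ",".join(builds)


files = ["study_hg18.tsv.gz", "study_hg19.tsv.gz", "study_hg38.tsv.gz"]
print(build_argument(files, ["hg18", "hg19", "hg38"]))
```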
Next steps
Full CLI reference: Command-Line Interface
Complete Python API: API Reference
Interactive notebook tutorial: Python API Tutorial