Quickstart

This page walks through the most common workflows in a few lines each. For a detailed walkthrough with real data, see Python API Tutorial.

Input file format

pycmplot accepts any whitespace- or comma-delimited summary statistics file, including gzip-compressed (.gz) files. Required columns (auto-detected from common names) are:

  • Chromosome (e.g. CHR, CHROM, #CHROM)

  • Base-pair position (e.g. BP, POS, pos)

  • Variant identifier (e.g. SNP, RSID, MarkerName)

  • P-value or test statistic (e.g. P, pvalue, Wald_P)

Optionally, a genome-build column (hg19 / hg38) enables automatic liftover (see Plotting behaviour options). Alternatively, pass per-file builds on the command line with --build hg19,hg38,....
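To make the expected layout concrete, the sketch below writes a tiny gzip-compressed, tab-delimited file using one of the recognised header names for each required column (CHR, BP, SNP, P). The variant IDs, positions, p-values, and the file name toy_sumstats.tsv.gz are made-up illustration values, not real association results.

```python
import csv
import gzip
import os
import tempfile

# Toy records (hypothetical values): chromosome, position, variant ID, p-value.
rows = [
    ("1", 1000, "rs_toy_1", 0.5),
    ("1", 2000, "rs_toy_2", 0.003),
    ("2", 1500, "rs_toy_3", 1e-9),
]

# Write a gzip-compressed TSV with a header row naming the four required columns.
out_path = os.path.join(tempfile.mkdtemp(), "toy_sumstats.tsv.gz")
with gzip.open(out_path, "wt", newline="") as handle:
    writer = csv.writer(handle, delimiter="\t")
    writer.writerow(["CHR", "BP", "SNP", "P"])
    writer.writerows(rows)

# Read the header back to confirm the layout.
with gzip.open(out_path, "rt", newline="") as handle:
    header = next(csv.reader(handle, delimiter="\t"))
print(header)
```

A file like this should be usable as a small smoke-test input before running on full-size summary statistics.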

Tip

For large summary statistics files, always pass --trim_pval 0.01 to discard variants with p > 0.01 before plotting. This can reduce memory usage by an order of magnitude.
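The effect of trimming can be illustrated with a small standard-library filter. This is only a sketch of the idea (keep rows with p <= threshold), not pycmplot's own implementation; the function name trim_by_pvalue and the toy rows are hypothetical. Under the null hypothesis roughly 99% of p-values exceed 0.01, which is why trimming shrinks the data so sharply.

```python
import csv
import io

def trim_by_pvalue(lines, threshold=0.01, p_col="P", delimiter="\t"):
    """Yield only the rows whose p-value is at or below the threshold."""
    reader = csv.DictReader(lines, delimiter=delimiter)
    for row in reader:
        if float(row[p_col]) <= threshold:
            yield row

# Toy input (hypothetical values) with one non-significant variant.
toy = io.StringIO(
    "CHR\tBP\tSNP\tP\n"
    "1\t1000\trs_toy_1\t0.5\n"
    "1\t2000\trs_toy_2\t0.003\n"
    "2\t1500\trs_toy_3\t1e-9\n"
)
kept = list(trim_by_pvalue(toy))
print([row["SNP"] for row in kept])  # ['rs_toy_2', 'rs_toy_3']
```

Because the filter is a generator, rows are processed one at a time and the discarded variants never accumulate in memory.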

Command line

Most users will use the CLI. The typical invocation for a multi-track plot is:

pycmplot \
  --sum_stats HbF.tsv.gz,MCV.txt.gz,MCH.tsv.gz \
  --labels HbF,MCV,MCH \
  --logp \
  --signif_line \
  --highlight \
  --annotate GENE \
  --trim_pval 0.01 \
  --output_dir ./results

For the full CLI reference, see Command-Line Interface.

Python API

The Python API exposes the same pipeline as the CLI, with three explicit steps: (1) resolve per-file column names, (2) load and pre-process the summary statistics, (3) render the plot. The rendering functions take the sumstats_loaded dictionary produced in step 2.

Linear Manhattan plot (single trait)

from pycmplot import (
    prep_pycmplot_input_info,
    get_sumstats_and_merged_sector_list,
    plot_linear,
)

files  = ["HbF.tsv.gz"]
labels = ["HbF"]

# 1. Resolve per-file column names and delimiters
file_info = prep_pycmplot_input_info(
    sum_stats=files,
    labels=labels,
)

# 2. Load data, run liftover (if needed), extract lead SNPs,
#    build the hits table, and compute merged sector sizes
result = get_sumstats_and_merged_sector_list(
    sum_stats=files,
    labels=labels,
    logp=True,
    trim_pval=0.01,
    file_info=file_info,
    signif_threshold=5e-8,
)

# 3. Render the plot
plot_linear(
    sumstats_loaded=result["dfs"],
    logp=True,
    signif_lines=result["lines"],
    hits_table=result["annot"],
    annotate="GENE",
    label_col="top_gene",
    colors=["steelblue", "silver"],
    output_dir="./results",
    output_format="png",
    dpi=300,
)

Multi-track linear Manhattan plot

Compare three red blood cell (RBC) traits in a single stacked figure:

files  = ["HbF.tsv.gz", "MCV.txt.gz", "MCH.tsv.gz"]
labels = ["HbF", "MCV", "MCH"]

file_info = prep_pycmplot_input_info(sum_stats=files, labels=labels)

result = get_sumstats_and_merged_sector_list(
    sum_stats=files,
    labels=labels,
    logp=True,
    trim_pval=0.01,
    file_info=file_info,
    signif_threshold=5e-8,
)

plot_linear(
    sumstats_loaded=result["dfs"],
    logp=True,
    signif_lines=result["lines"],
    hits_table=result["annot"],
    annotate="GENE",
    label_col="top_gene",
    colors=["steelblue", "silver"],
    output_dir="./results",
)

Circular (Circos-style) Manhattan plot

# Reuses `result` from the multi-track example above
from pycmplot import plot_circular

plot_circular(
    sumstats_loaded=result["dfs"],
    sector_sizes=result["sectors"],
    logp=True,
    signif_lines=result["lines"],
    hits_table=result["annot"],
    annotate="GENE",
    label_col="top_gene",
    colors=["steelblue", "silver"],
    plot_title="RBC Traits",
    output_dir="./results",
)

QQ plots

Each of the QQ functions accepts the p-value dictionary stored under the "pvals" key of the result returned by get_sumstats_and_merged_sector_list():

from pycmplot import plot_qq_combined, plot_qq_overlay, plot_qq_separate

# Grid of per-track QQ panels
plot_qq_combined(
    pval_dict=result["pvals"],
    thin=True,
    max_points=50_000,
    ncols=3,
    title="RBC Traits",
    output_path="./results/rbc_qq",
    fig_format="png",
)

# All traits overlaid on a single set of axes, with lambda in the legend
plot_qq_overlay(
    pval_dict=result["pvals"],
    thin=True,
    max_points=50_000,
    title="RBC Traits",
    output_path="./results/rbc_qq_overlay",
    fig_format="png",
)

# One file per trait
plot_qq_separate(
    pval_dict=result["pvals"],
    base_name="RBC",
    thin=True,
    max_points=50_000,
    output_path="./results/rbc_qq",
    fig_format="png",
)
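For readers unfamiliar with the quantities on a QQ plot, the sketch below computes the expected versus observed -log10(p) pairs and the genomic inflation factor lambda using the conventional definition (median observed 1-df chi-square statistic divided by the null median, about 0.455). It assumes pycmplot reports lambda in this conventional sense, which the text above does not state; the function names genomic_inflation and qq_points are hypothetical and not part of the pycmplot API.

```python
import math
from statistics import NormalDist, median

def genomic_inflation(pvals):
    """Conventional lambda GC. Requires every p-value to satisfy 0 < p <= 1."""
    norm = NormalDist()
    # A two-sided p-value maps to a 1-df chi-square statistic z**2.
    chi2 = [norm.inv_cdf(p / 2.0) ** 2 for p in pvals]
    null_median = norm.inv_cdf(0.75) ** 2  # median of chi-square(1), ~0.455
    return median(chi2) / null_median

def qq_points(pvals):
    """Return (expected, observed) -log10(p) pairs, smallest p-value first."""
    n = len(pvals)
    observed = sorted(pvals)
    return [
        (-math.log10((i + 0.5) / n), -math.log10(p))
        for i, p in enumerate(observed)
    ]

# Perfectly uniform p-values lie on the diagonal and give lambda close to 1.
uniform = [(i + 0.5) / 1000 for i in range(1000)]
lam = genomic_inflation(uniform)
points = qq_points(uniform)
print(round(lam, 2))
```

Values of lambda well above 1 are usually read as a sign of inflation (for example from population stratification), although polygenic traits can also raise it.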

Mixed genome builds with liftover

If your summary statistics were generated on different genome builds (hg18, hg19, or hg38), pycmplot can lift over hg18 and hg19 coordinates to hg38 before plotting. Supply the builds either through a BUILD column in the files or by passing --build (CLI) / build_list= (API):

pycmplot \
  --sum_stats study_hg18.tsv.gz,study_hg19.tsv.gz,study_hg38.tsv.gz \
  --labels Study_A,Study_B,Study_C \
  --build hg18,hg19,hg38 \
  --logp \
  --annotate GENE \
  --output_dir ./results
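When the file list is assembled in a script, it can help to check that files, labels, and builds line up before launching the command. The sketch below builds the same invocation from Python lists using only the flags shown above; the helper name build_pycmplot_command is hypothetical, and whether pycmplot performs its own length check is not stated here.

```python
import shlex

def build_pycmplot_command(files, labels, builds, output_dir="./results"):
    """Assemble the CLI call, requiring one label and one build per file."""
    if not (len(files) == len(labels) == len(builds)):
        raise ValueError("files, labels and builds must have the same length")
    return [
        "pycmplot",
        "--sum_stats", ",".join(files),
        "--labels", ",".join(labels),
        "--build", ",".join(builds),
        "--logp",
        "--annotate", "GENE",
        "--output_dir", output_dir,
    ]

cmd = build_pycmplot_command(
    files=["study_hg18.tsv.gz", "study_hg19.tsv.gz", "study_hg38.tsv.gz"],
    labels=["Study_A", "Study_B", "Study_C"],
    builds=["hg18", "hg19", "hg38"],
)
print(shlex.join(cmd))
# To execute it, pass the list to subprocess.run(cmd, check=True).
```

Passing the command as a list rather than a single string avoids shell-quoting problems with unusual file names.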

Next steps