pycmplot.stats

Statistical utilities for identifying significant loci from GWAS summary statistics. Provides lead-SNP extraction using an LD-free, distance-based clumping approach, and helpers for selecting SNPs to highlight in the plot.

Statistical utilities for identifying independent lead SNPs and locus boundaries from GWAS summary statistics.

get_lead_snps() applies greedy distance-based clumping to return one representative SNP per independent locus. get_highlight_snps() extends that to mark all variants within a locus window, enabling per-locus colouring on Manhattan plots.

Notes

Both functions operate on a single-trait DataFrame. When comparing multiple traits, call them independently per track; lead SNP extraction across traits is handled by get_sumstats_and_merged_sector_list().

pycmplot.stats.get_highlight_snps(df: DataFrame, highlight: bool = False, highlight_thresh: float = 5e-08, logp: bool = False, window: int = 500000) → tuple[DataFrame, DataFrame]

Mark all variants within window bp of a lead SNP.

Calls get_lead_snps() to identify independent loci, then sets a boolean in_locus flag on every variant whose chromosomal position falls within ±window bp of any lead SNP on the same chromosome.
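The flagging step can be illustrated with a minimal pandas sketch. This is not pycmplot's implementation: mark_in_locus is a hypothetical helper written for this example, the lead table is supplied by hand rather than by get_lead_snps(), and treating the window boundary as inclusive is an assumption.

```python
import pandas as pd


def mark_in_locus(df, leads, window=500_000):
    """Hypothetical helper: flag variants near a same-chromosome lead SNP."""
    out = df.copy()  # leave the caller's DataFrame untouched
    out["in_locus"] = False
    for chrom, lead_pos in zip(leads["CHR"], leads["POS"]):
        # Same chromosome and at most `window` bp away (inclusive by assumption).
        near = (out["CHR"] == chrom) & ((out["POS"] - lead_pos).abs() <= window)
        out["in_locus"] = out["in_locus"] | near
    return out


variants = pd.DataFrame({
    "CHR": [1, 1, 1, 1, 2],
    "POS": [100_000, 900_000, 1_000_000, 1_600_000, 1_000_000],
})
leads = pd.DataFrame({"CHR": [1], "POS": [1_000_000]})

annotated = mark_in_locus(variants, leads)
print(annotated)
```

Only the two chromosome 1 variants at 900,000 and 1,000,000 bp are flagged here; the chromosome 2 variant shares the lead's position but not its chromosome, so it stays outside the locus.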

Parameters:

df (pandas.DataFrame) – Summary statistics with canonical columns CHR, POS, P (and logP when logp is True).
highlight_thresh (float, optional) – Significance threshold passed to get_lead_snps(). Default is 5e-8.
logp (bool, optional) – If True, use the logP column for thresholding and ranking. Default is False.
window (int, optional) – Half-width of the locus window in base pairs. Default is 500_000 (500 kb).

Returns:

df_annotated (pandas.DataFrame) – A copy of df with an additional boolean column in_locus. Variants inside at least one locus window have in_locus = True.
leads_df (pandas.DataFrame) – The lead-SNP DataFrame returned by get_lead_snps().

See also

get_lead_snps: Used internally to identify independent loci.

Examples

>>> from pycmplot.stats import get_highlight_snps
>>> df_ann, leads = get_highlight_snps(df, highlight_thresh=5e-8)
>>> df_ann["in_locus"].sum()
1842

pycmplot.stats.get_lead_snps(df: DataFrame, signif_threshold: float = 5e-08, logp: bool = False, window: int = 500000) → DataFrame

Identify independent lead SNPs by greedy distance-based clumping.

Starting from the most significant variant, each subsequent variant is retained as a new lead only if it lies more than window base-pairs from all previously accepted leads on the same chromosome.
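The greedy procedure described above can be sketched as follows. This is an illustrative stand-in, not the library's source: greedy_distance_clump is a hypothetical name, only the P-based path (logp=False) is covered, and the result is ordered by global significance, which may differ from the ordering get_lead_snps() returns.

```python
import pandas as pd


def greedy_distance_clump(df, signif_threshold=5e-8, window=500_000):
    """Hypothetical sketch of distance-only clumping on CHR, POS, P."""
    # Keep significant variants and visit them from smallest P upwards.
    hits = df[df["P"] <= signif_threshold].sort_values("P")
    accepted = {}   # chromosome -> positions of accepted leads
    lead_index = []
    for idx, chrom, pos in zip(hits.index, hits["CHR"], hits["POS"]):
        positions = accepted.setdefault(chrom, [])
        # Accept only if more than `window` bp from every lead on this chromosome.
        if all(abs(pos - lead_pos) > window for lead_pos in positions):
            positions.append(pos)
            lead_index.append(idx)
    return df.loc[lead_index]


toy = pd.DataFrame({
    "CHR": [1, 1, 1, 2],
    "POS": [1_000_000, 1_200_000, 2_000_000, 500_000],
    "P": [1e-12, 1e-10, 1e-9, 1e-3],
})
leads = greedy_distance_clump(toy)
print(leads)
```

In this toy table the variant at 1,200,000 bp is absorbed by the stronger signal 200 kb away, the variant at 2,000,000 bp becomes a second lead, and the chromosome 2 variant never passes the significance filter.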

Parameters:

df (pandas.DataFrame) – Summary statistics with canonical columns CHR, POS, P. When logp is True, a logP column (–log₁₀(P)) must also be present.
signif_threshold (float, optional) – Significance cutoff. When logp is False, variants with P > signif_threshold are excluded; when logp is True, variants with logP < -log10(signif_threshold) are excluded. Default is 5e-8.
logp (bool, optional) – If True, filter and rank by the logP column (descending) instead of P (ascending). Default is False.
window (int, optional) – Clumping window half-width in base-pairs. A candidate SNP is excluded if it falls within window bp of any already-accepted lead on the same chromosome. Default is 500_000 (500 kb).
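As a quick check of the two filtering modes described for signif_threshold, the sketch below converts the P-value cutoff to its logP equivalent and applies both filters to a toy table. The DataFrame and variable names are made up for illustration; only the comparison rules come from the parameter description above.

```python
import math

import pandas as pd

signif_threshold = 5e-8
logp_cutoff = -math.log10(signif_threshold)  # about 7.30

toy = pd.DataFrame({"P": [1e-9, 5e-8, 1e-7, 0.03]})
toy["logP"] = -toy["P"].apply(math.log10)

# logp=False: variants with P > signif_threshold are excluded.
keep_by_p = toy["P"] <= signif_threshold
# logp=True: variants with logP < -log10(signif_threshold) are excluded.
keep_by_logp = toy["logP"] >= logp_cutoff

print(toy.assign(keep_by_p=keep_by_p, keep_by_logp=keep_by_logp))
```

Both masks select the same first two rows, so the choice of logp changes which column is read, not which variants pass.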

Returns:

Subset of df containing only the lead SNPs, one row per independent locus, in the order they were selected (most significant first within each chromosome).

Return type:

pandas.DataFrame

Notes

This is a distance-only approach; it does not use linkage disequilibrium information. Users requiring LD-based clumping should post-process the returned table with PLINK or a dedicated LD-clumping tool.