pycmplot.stats
Statistical utilities for identifying significant loci from GWAS summary statistics. Provides lead-SNP extraction using distance-based clumping that requires no LD information, and helpers for selecting SNPs to highlight in the plot.
Statistical utilities for identifying independent lead SNPs and locus boundaries from GWAS summary statistics.
get_lead_snps() applies greedy distance-based clumping to return one
representative SNP per independent locus. get_highlight_snps() extends
that to mark all variants within a locus window, enabling per-locus colouring
on Manhattan plots.
Notes
Both functions operate on a single-trait DataFrame. When comparing multiple
traits, call them independently per track; lead SNP extraction across traits
is handled by get_sumstats_and_merged_sector_list().
- pycmplot.stats.get_highlight_snps(df: DataFrame, highlight: bool = False, highlight_thresh: float = 5e-08, logp: bool = False, window: int = 500000) → tuple[DataFrame, DataFrame]
Mark all variants within window bp of a lead SNP.
Calls get_lead_snps() to identify independent loci, then sets an in_locus boolean flag on every variant whose chromosomal position falls within ±window bp of any lead SNP on the same chromosome.
- Parameters:
df (pandas.DataFrame) – Summary statistics with canonical columns CHR, POS, P (and logP when logp is True).
highlight_thresh (float, optional) – Significance threshold passed to get_lead_snps(). Default is 5e-8.
logp (bool, optional) – If True, use the logP column for thresholding and ranking. Default is False.
window (int, optional) – Half-width of the locus window in base-pairs. Default is 500_000 (500 kb).
- Returns:
df_annotated (pandas.DataFrame) – A copy of df with an additional boolean column in_locus. Variants inside at least one locus window have in_locus = True.
leads_df (pandas.DataFrame) – The lead-SNP DataFrame returned by get_lead_snps().
See also
get_lead_snps: Used internally to identify independent loci.
Examples
>>> from pycmplot.stats import get_highlight_snps
>>> df_ann, leads = get_highlight_snps(df, highlight_thresh=5e-8)
>>> df_ann["in_locus"].sum()
1842
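The window-flagging step described above can be sketched in plain pandas. This is only an illustration, not the pycmplot implementation: flag_in_locus is a hypothetical helper, the toy data are made up, and treating the window boundary as inclusive (distance <= window) is an assumption.

```python
import pandas as pd


def flag_in_locus(df, leads, window=500_000):
    """Hypothetical helper (not part of pycmplot): flag variants near a lead SNP."""
    out = df.copy()  # leave the caller's DataFrame untouched
    out["in_locus"] = False
    for chrom, pos in zip(leads["CHR"], leads["POS"]):
        # Same chromosome and within +/- window bp (inclusive boundary is an assumption).
        near = (out["CHR"] == chrom) & ((out["POS"] - pos).abs() <= window)
        out.loc[near, "in_locus"] = True
    return out


# Toy data: one lead SNP on chromosome 1 at 900 kb.
toy = pd.DataFrame({
    "CHR": [1, 1, 1, 2],
    "POS": [100_000, 700_000, 1_500_000, 700_000],
})
leads = pd.DataFrame({"CHR": [1], "POS": [900_000]})

flagged = flag_in_locus(toy, leads)
print(flagged["in_locus"].tolist())  # [False, True, False, False]
```

Only the chromosome 1 variant at 700 kb lies within 500 kb of the lead. The chromosome 2 variant at the same position is not flagged, because windows apply per chromosome.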
- pycmplot.stats.get_lead_snps(df: DataFrame, signif_threshold: float = 5e-08, logp: bool = False, window: int = 500000) → DataFrame
Identify independent lead SNPs by greedy distance-based clumping.
Starting from the most significant variant, each subsequent variant is retained as a new lead only if it lies more than window base-pairs from all previously accepted leads on the same chromosome.
- Parameters:
df (pandas.DataFrame) – Summary statistics with canonical columns CHR, POS, P. When logp is True, a logP column (–log₁₀(P)) must also be present.
signif_threshold (float, optional) – Significance cutoff. When logp is False, variants with P > signif_threshold are excluded; when logp is True, variants with logP < -log10(signif_threshold) are excluded. Default is 5e-8.
logp (bool, optional) – If True, filter and rank by the logP column (descending) instead of P (ascending). Default is False.
window (int, optional) – Clumping window half-width in base-pairs. A candidate SNP is excluded if it falls within window bp of any already-accepted lead on the same chromosome. Default is 500_000 (500 kb).
- Returns:
Subset of df containing only the lead SNPs, one row per independent locus, in the order they were selected (most significant first within each chromosome).
- Return type:
pandas.DataFrame
Notes
This is a distance-only approach; it does not use linkage disequilibrium information. Users requiring LD-based clumping should post-process the returned table with PLINK or a dedicated LD-clumping tool.
See also
get_highlight_snps: Flags all variants within locus windows by adding an in_locus column.
Examples
>>> from pycmplot.stats import get_lead_snps
>>> leads = get_lead_snps(df, signif_threshold=5e-8, logp=True, window=500_000)
>>> leads[["SNP", "CHR", "POS", "P"]].head()
        SNP  CHR       POS          P
0  rs123456    2  60718043  1.20e-120
1  rs789012   11   5246696   3.40e-85
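To make the greedy procedure concrete, here is a minimal sketch of distance-based clumping written from the description above. It is not the library's source: greedy_distance_clump is a hypothetical stand-in, the toy data are invented, and details such as tie handling and whether a variant exactly at the threshold is kept are assumptions.

```python
import numpy as np
import pandas as pd


def greedy_distance_clump(df, signif_threshold=5e-8, logp=False, window=500_000):
    """Hypothetical stand-in for the documented behaviour (not pycmplot code)."""
    if logp:
        # On the -log10(P) scale, significant variants sit at or above the cutoff.
        cutoff = -np.log10(signif_threshold)
        hits = df[df["logP"] >= cutoff].sort_values("logP", ascending=False)
    else:
        hits = df[df["P"] <= signif_threshold].sort_values("P", ascending=True)

    accepted = {}  # CHR -> positions of leads accepted so far
    keep = []      # index labels of accepted leads, in selection order
    for idx, chrom, pos in zip(hits.index, hits["CHR"], hits["POS"]):
        leads_on_chrom = accepted.setdefault(chrom, [])
        # Accept only if more than `window` bp from every lead on this chromosome.
        if all(abs(pos - lead) > window for lead in leads_on_chrom):
            leads_on_chrom.append(pos)
            keep.append(idx)
    return df.loc[keep]


# Toy summary statistics (made-up values).
toy = pd.DataFrame({
    "SNP": ["rs1", "rs2", "rs3", "rs4", "rs5"],
    "CHR": [1, 1, 1, 2, 2],
    "POS": [1_000_000, 1_200_000, 2_000_000, 500_000, 600_000],
    "P": [1e-20, 1e-30, 1e-9, 1e-10, 0.01],
})

leads = greedy_distance_clump(toy)
print(leads["SNP"].tolist())  # ['rs2', 'rs4', 'rs3']
```

In this toy run rs1 is dropped because it sits 200 kb from the stronger rs2, rs5 fails the significance cutoff, and rs3 survives because it is 800 kb from rs2. Adding a logP column computed as -log10(P) and passing logp=True selects the same three variants.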