Changelog
All notable changes to pycmplot are documented here.
The format is based on Keep a Changelog and this project adheres to Semantic Versioning.
[0.3.0] - 20266001
Fixed
Linear plotting: fixed a failure when accessing the local variable t_heights.
Changed
Suggestive line color from lightblue to navy in circular plotting
Significance line color from red to orangered in linear plotting to match circular plotting
Suggestive line color from blue to navy in linear plotting to match circular plotting
Added
ylabel: optional ylabel text in circular plotting to match linear plotting
[0.2.8] — 2026-05-30
Added
Dual annotation renderer architecture
Two complementary annotation functions now handle sparse and dense annotation scenarios independently:
_draw_annotation_arrows: sparse annotation renderer with tiered label placement, chromosome-boundary spreading, cumulative-distance stacking, and straight arc3 arrows (curvature fixed at zero for visual clarity in low-density contexts).
_draw_annotation_arrows_multirail: dense annotation renderer implementing a three-step layout pipeline (see below) with curved arc arrows and adaptive ylim.
Three-step dense annotation layout pipeline (_draw_annotation_arrows_multirail)
Relaxation pass: bidirectional min_sep enforcement starting from x_signal positions. Labels in dense regions drift further from their signals than labels in sparse regions, producing a natural density signal with no explicit cluster detection.
Drift-based rail assignment: each label’s relaxation drift is binned into a rail index using rail_stride = rail_width / max_rails. Denser regions automatically receive higher rail indices proportionally across the full rail range. No per-rail queue processing or max_drift threshold is required.
linspace rank-reassignment: labels are sorted by x_signal and assigned evenly spaced x_text slots via np.linspace(rail_start, rail_end, n). This guarantees that x_text rank equals x_signal rank (no arrow crossings by construction) and full rail coverage regardless of rail_frac or signal distribution.
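The three steps above can be sketched as follows. This is a minimal illustration under stated assumptions, not the package's implementation: the function name layout_multirail_sketch, its signature, and the exact relaxation and binning rules are hypothetical and chosen only to match the description.

```python
import numpy as np

def layout_multirail_sketch(x_signal, min_sep, rail_start, rail_end, max_rails):
    """Hypothetical three-step layout: relax, bin drift into rails, linspace slots."""
    x_signal = np.asarray(x_signal, dtype=float)
    n = x_signal.size
    if n == 0:
        return np.empty(0), np.empty(0, dtype=int)
    order = np.argsort(x_signal, kind="stable")    # signal rank
    xs = x_signal[order]

    # Step 1: bidirectional min_sep enforcement starting from the signal positions.
    relaxed = np.clip(xs, rail_start, rail_end)
    for i in range(1, n):                          # left-to-right pass
        relaxed[i] = max(relaxed[i], relaxed[i - 1] + min_sep)
    relaxed[-1] = min(relaxed[-1], rail_end)       # rail_end acts as a hard clamp
    for i in range(n - 2, -1, -1):                 # right-to-left pass
        relaxed[i] = min(relaxed[i], relaxed[i + 1] - min_sep)

    # Step 2: bin each label's drift into a rail index.
    rail_stride = (rail_end - rail_start) / max_rails
    drift = np.abs(relaxed - xs)
    rail_sorted = np.minimum((drift // rail_stride).astype(int), max_rails - 1)

    # Step 3: evenly spaced slots handed out in signal order.
    slots = np.linspace(rail_start, rail_end, n)

    # Scatter results back to the caller's original label order.
    x_text = np.empty(n)
    rail_id = np.empty(n, dtype=int)
    x_text[order] = slots
    rail_id[order] = rail_sorted
    return x_text, rail_id

x_signal = [30.0, 10.0, 10.5, 11.0, 11.5]
x_text, rail_id = layout_multirail_sketch(
    x_signal, min_sep=6.0, rail_start=0.0, rail_end=40.0, max_rails=4
)
```

In this toy input the four clustered signals drift progressively further during relaxation, so the later ones land on a higher rail, while the x_text slots keep the same left-to-right order as the signals.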
Auto char_width from axes geometry
For vertical text (rotation=90), the horizontal label footprint is
one character wide regardless of string length. char_width is now
derived from the axes pixel extent and figure DPI at draw time:
px_per_bp = ax_bbox.width / (xmax - xmin)
char_width = 0.6 * fsize * (fig.dpi / 72.0) / px_per_bp
The char_width_factor parameter has been removed from
_draw_annotation_arrows_multirail; char_width is computed
automatically and scales correctly with figure size, DPI, and font
size.
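As a worked instance of the two formulas above, the sketch below evaluates them for one assumed figure geometry. The function name and the example numbers are illustrative; in matplotlib the pixel width would typically come from ax.get_window_extent().width, but it is passed in directly here so the arithmetic can be checked without a figure.

```python
def char_width_data_units(ax_width_px, xmin, xmax, fsize, dpi):
    """Width of one rotated character expressed in data (bp) units."""
    px_per_bp = ax_width_px / (xmax - xmin)
    return 0.6 * fsize * (dpi / 72.0) / px_per_bp

# Assumed example: axes 1000 px wide spanning 3e9 bp, fontsize 6, 100 dpi.
cw = char_width_data_units(ax_width_px=1000.0, xmin=0.0, xmax=3.0e9, fsize=6, dpi=100)
print(f"{cw:.3g}")  # about 1.5e7 bp per character
```

For this assumed geometry the derived value is roughly twice a fixed 8e6, which is consistent with the underestimate described under Fixed for this release.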
Proportional space budgeting and rail_frac awareness
Rail width is derived from rail_frac as
rail_width = genome_width * rail_frac, centred on the genome
midpoint. rail_stride and slot spacing scale proportionally with
rail_frac, ensuring even label distribution at any rail fraction
without choking at rail boundaries.
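The proportional budgeting can be illustrated with a few lines of arithmetic. The helper name rail_geometry and the slot-spacing rule (rail width divided by the number of gaps between labels) are assumptions made for this sketch.

```python
def rail_geometry(xmin, xmax, rail_frac, max_rails, n_labels):
    """Rail bounds, rail stride and slot spacing derived from rail_frac (illustrative)."""
    genome_width = xmax - xmin
    rail_width = genome_width * rail_frac
    midpoint = xmin + genome_width / 2.0
    rail_start = midpoint - rail_width / 2.0      # rail is centred on the genome midpoint
    rail_end = midpoint + rail_width / 2.0
    rail_stride = rail_width / max_rails
    slot_spacing = rail_width / (n_labels - 1) if n_labels > 1 else 0.0
    return rail_start, rail_end, rail_stride, slot_spacing

rail_start, rail_end, rail_stride, slot_spacing = rail_geometry(
    xmin=0.0, xmax=1000.0, rail_frac=0.8, max_rails=4, n_labels=5
)
```

Halving rail_frac in this sketch halves both rail_stride and slot_spacing, which is the proportional scaling the entry describes.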
Layout table (pd.DataFrame)
Placement, relaxation, and rendering are now cleanly separated via a
layout table with columns label, x_signal, x_text,
rail_id. rail_id is written during placement and not read
again until the rendering pass, enforcing strict separation of layout
and rendering concerns.
Chromosome-boundary detection (_draw_annotation_arrows)
For each adjacent chromosome pair, the inter-chromosome gap is
computed. If the gap is narrower than spread_width, both boundary
annotations receive an x_bound value encoding direction and
magnitude, used downstream to push boundary labels apart before
general spreading.
Cumulative x-position porting from tracks
Annotation cumulative x positions are now ported directly from track
DataFrames via a three-column merge on (chr_col, pos_col, LABEL)
rather than being recomputed independently, guaranteeing exact
consistency between annotation and track coordinates.
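A small pandas sketch of the three-column merge follows. The column names CHR, POS, LABEL and x, and the toy values, are assumptions for illustration; the point is that the annotation rows inherit the track's own cumulative x rather than recomputing it.

```python
import pandas as pd

# Track table: cumulative x was computed once, when the track was prepared.
track = pd.DataFrame({
    "CHR": ["1", "1", "2"],
    "POS": [100, 200, 50],
    "LABEL": ["trackA", "trackA", "trackA"],
    "x": [100.0, 200.0, 1050.0],
})

# Annotation table: knows chromosome, position and track label, but not x.
annot = pd.DataFrame({
    "CHR": ["2", "1"],
    "POS": [50, 200],
    "LABEL": ["trackA", "trackA"],
    "GENE": ["G2", "G1"],
})

# Port x from the track; validate= guards against duplicated keys in the track.
annot_x = annot.merge(
    track[["CHR", "POS", "LABEL", "x"]],
    on=["CHR", "POS", "LABEL"],
    how="left",
    validate="many_to_one",
)
```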
track_heights sanity check and y-label positioning
track_heights is validated against the expected count
(n_tracks + 1 when annotating, n_tracks otherwise) with
explicit ValueError and TypeError messages. The y-label
position (-log10(P)) is computed from actual height ratios
accounting for top-to-bottom track orientation:
y_lab_pos = data_total / (2 * total_height)
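The check and the label position can be sketched together. The function name, the error wording, and the assumption that the annotation row is the first (top) entry of track_heights are all hypothetical.

```python
def check_track_heights(track_heights, n_tracks, annotate):
    """Validate track_heights and return the shared y-label position (illustrative)."""
    if not isinstance(track_heights, (list, tuple)):
        raise TypeError("track_heights must be a list or tuple of numbers")
    expected = n_tracks + 1 if annotate else n_tracks
    if len(track_heights) != expected:
        raise ValueError(
            f"track_heights needs {expected} entries, got {len(track_heights)}"
        )
    # Assumption: when annotating, the first (top) entry is the annotation row.
    data_heights = track_heights[1:] if annotate else track_heights
    data_total = sum(data_heights)
    total_height = sum(track_heights)
    # Data tracks fill the bottom data_total / total_height of the figure;
    # the label sits at the midpoint of that band.
    return data_total / (2 * total_height)

y_lab_pos = check_track_heights([1, 2, 2], n_tracks=2, annotate=True)
```

Without an annotation row the data tracks fill the whole figure and the same formula gives 0.5, the vertical centre.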
Changed
_draw_annotation_arrows: max_rad parameter removed; curvature is intentionally fixed at zero (straight arrows) for sparse annotation contexts. Dense annotation curvature is handled exclusively by _draw_annotation_arrows_multirail.
Annotation deduplication now occurs at the top of both renderers via drop_duplicates(subset=[chr_col, "x", label_col]) to prevent replicated arrows when annot_df is a merged multi-track table.
Chromosome order in boundary detection now uses natsorted instead of set to guarantee correct genomic ordering.
x_bound is only set when the inter-chromosome gap is <= spread_width (previously unconditional), preventing spurious boundary constraints between well-separated chromosomes.
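The deduplication step can be seen on a toy merged table. The column names and values below are assumptions; only the drop_duplicates call mirrors the entry above.

```python
import pandas as pd

# A merged multi-track table can carry the same annotation once per track.
annot_df = pd.DataFrame({
    "CHR": ["1", "1", "2"],
    "x": [150.0, 150.0, 1200.0],
    "GENE": ["G1", "G1", "G2"],
    "LABEL": ["trackA", "trackB", "trackA"],
})

chr_col, label_col = "CHR", "GENE"
# Keep one row per (chromosome, cumulative x, label) so each arrow is drawn once.
unique_annots = annot_df.drop_duplicates(subset=[chr_col, "x", label_col])
```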
Fixed
Arrow crossings eliminated unconditionally by the linspace rank-reassignment step: x_text rank is guaranteed equal to x_signal rank for all labels across all rails.
Annotation spill past genome right boundary resolved: rail_end acts as a hard clamp during relaxation; labels cannot exceed it regardless of local density.
Higher-rail priority inversion fixed: the drift-based rail assignment correctly places the densest labels (largest drift) on higher rails, not the labels nearest the rail boundary.
x_texts sort-order mismatch resolved: cumulative-scaled positions are now mapped back to original signal order via np.argsort before use, preventing label-to-wrong-position assignment.
char_width underestimation fixed: replacing the hardcoded 8e6 fallback with axes-geometry derivation corrects the ~2× underestimate that caused stacking to never fire for typical figure sizes at fontsize=6.
natsorted applied to chromosome order throughout to prevent incorrect pairing of chromosomes (e.g. chr3 with chr17) caused by set iteration order.
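The chromosome-ordering fix can be illustrated without the natsort dependency. The natural_key helper below is a standard-library stand-in for natsorted, written only for this sketch; it shows why adjacent pairs must be taken from a naturally sorted list rather than from a set.

```python
import re

def natural_key(name):
    """Split digits from text so 'chr10' sorts after 'chr3' (stand-in for natsorted)."""
    return [int(tok) if tok.isdigit() else tok for tok in re.split(r"(\d+)", name)]

chroms = {"chr17", "chr3", "chr10", "chr2"}      # a set has no meaningful order
ordered = sorted(chroms, key=natural_key)
adjacent_pairs = list(zip(ordered, ordered[1:]))  # pairs used for boundary checks
```

With natural ordering chr3 is paired with chr10, its real neighbour in this toy set, and never with chr17.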
0.2.7 — 2026-04-27
Added
Default-on density-aware auto-thinning for Manhattan / circular
rendering, inspired by gwaslab and applied on top of (i.e. in
addition to) the existing --trim_pval. A new helper
auto_thin_for_manhattan() keeps every variant
whose “interestingness” signal is at or above --auto_thin_threshold
and uniformly sub-samples the dense bulk to at most
--auto_thin_max_below rows per track (default 200 000). Lead
SNPs are still extracted from the full unthinned data, so peak
annotations are unaffected.
Two modes, switched by --logp:
P-value mode (--logp set, the GWAS default). Signal is -log10(P); --auto_thin_threshold is in -log10(P) units (default 2.0 => P <= 0.01). Every suggestive / genome-wide-significant variant survives untouched.
Raw-statistic mode (--logp off). The P column is interpreted as a raw test statistic and the signal becomes |value|, so the same machinery works for selection scans like iHS, XP-EHH, F_ST, Fay & Wu’s H, Tajima’s D, etc. The default threshold of 2.0 works for the standardised |iHS| / |XP-EHH| scans; override (e.g. --auto_thin_threshold 0.05) for F_ST.
Negative extremes are preserved as well as positive ones, so for signed statistics (iHS, XP-EHH) both tails of the distribution survive intact.
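The thinning rule in both modes can be sketched in a few lines of NumPy. This is not auto_thin_for_manhattan() itself: the function name auto_thin_sketch, its signature, and the use of a seeded random sample for the uniform sub-sampling are assumptions.

```python
import numpy as np

def auto_thin_sketch(values, logp=True, threshold=2.0, max_below=200_000, seed=0):
    """Return sorted row indices to keep (hypothetical stand-in for the real helper)."""
    values = np.asarray(values, dtype=float)
    # P-value mode uses -log10(P); raw-statistic mode uses |value| so both tails survive.
    signal = -np.log10(values) if logp else np.abs(values)
    above = np.flatnonzero(signal >= threshold)   # always kept
    below = np.flatnonzero(signal < threshold)    # dense bulk, sub-sampled
    if below.size > max_below:
        rng = np.random.default_rng(seed)
        below = rng.choice(below, size=max_below, replace=False)
    return np.sort(np.concatenate([above, below]))

rng = np.random.default_rng(1)
p = rng.uniform(size=10_000)
p[:3] = [1e-9, 5e-8, 1e-3]                        # three planted signals
keep = auto_thin_sketch(p, logp=True, threshold=2.0, max_below=500)

# Raw-statistic mode: strongly negative and strongly positive scores are both kept.
keep_raw = auto_thin_sketch([-3.1, 0.2, 2.5, -0.1, 0.4], logp=False, max_below=1)
```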
New CLI flags:
| Flag | Description |
|---|---|
| | Disable auto-thinning entirely. |
| | |
| --auto_thin_max_below | Cap on background variants per track (default 200 000). |
| | Counterpart for QQ log-uniform thinning, which is now ON by default. |
Combined with the rendering and data-prep optimisations from earlier in this release, this brings pycmplot’s untrimmed timings to:
| Size | manhattan (s) | qq (s) | circular (s) |
|---|---|---|---|
| 500K | 4.4 (was 32.6) | 4.1 (19.0) | 18.5 (119) |
| 1M | 5.1 (was 63.7) | 4.9 (37) | 19.6 (235) |
| 2M | 6.6 (was 127) | 6.4 (75) | 21.3 (469) |
| 5M | 12.7 (was 317) | 11.7 (191) | 28.7 (1169) |
i.e. circular plotting at 5 M variants is now 41x faster than the pre-0.2.7 untrimmed path, and projects to ~38 s at 10 M variants (down from ~38 minutes — and faster than CMplot’s circular path).
Performance
Linear Manhattan rendering switched from ax.scatter (one PathCollection carrying a path-per-point with per-point should_simplify checks) to one ax.plot(..., marker='.', linestyle='none') per chromosome (a single Line2D whose marker-draw loop is much cheaper). Visually identical rasterised output; on a 1 M-variant single-track plot this alone shrinks plot_linearm from ~6 s to ~0.5 s.
QQ plots (plot_qq_single and plot_qq_combined) make the same scatter → plot switch for the observed points.
Chromosome-name normalisation in get_sumstats_and_merged_sector_list() is now applied to the categories of the CHR Categorical (≤25 distinct values) rather than to the underlying N-row code array. The result is stored as a Categorical ordered by CHROM_ORDER so downstream code can derive chr_idx from cat.codes directly.
Linear-plot _prep recognises the canonical Categorical CHR column produced by the loader and skips the redundant str.replace + str.upper + replace pass that was running on every plot call.
Optional CSV reader switched to engine='pyarrow' with safe fallback to the default C engine when pyarrow is unavailable.
New compute_pvals parameter on get_sumstats_and_merged_sector_list() (default True); _core.py now sets it to False when no QQ plot is requested, skipping an ~80 MB-at-10 M-variants p-value-array copy that was unused on Manhattan- or circular-only runs.
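The category-level normalisation can be sketched with pandas. The contents of CHROM_ORDER and the normalisation rule (upper-case, strip a leading CHR) are assumptions for illustration; the idea shown is that the string work touches only the handful of categories, never the N-row column.

```python
import pandas as pd

# Assumed canonical order: 22 autosomes plus X, Y and MT (25 values).
CHROM_ORDER = [str(i) for i in range(1, 23)] + ["X", "Y", "MT"]

def normalise_chr(chr_col):
    """Normalise chromosome names on the categories only (illustrative)."""
    cat = chr_col.astype("category")
    # The callable runs once per distinct category, not once per row.
    renamed = cat.cat.rename_categories(lambda c: str(c).upper().replace("CHR", ""))
    # Re-express against the canonical order so codes are usable as indices.
    return renamed.cat.set_categories(CHROM_ORDER, ordered=True)

chr_col = pd.Series(["chr1", "chrX", "chr2", "chr1"])
norm = normalise_chr(chr_col)
chr_idx = norm.cat.codes.to_numpy()   # position of each row's chromosome in CHROM_ORDER
```

Note that rename_categories requires the renamed categories to stay unique, so a column mixing "chr1" and "1" would need to be collapsed first; this sketch assumes a single naming style per file.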
Combined effect (measured, single-track untrimmed, fresh subprocess):
| plot_type | 500K before | 500K after | speed-up |
|---|---|---|---|
| manhattan | 32.6 s | 4.6 s | 7.1x |
| qq | 19.0 s | 6.7 s | 2.8x |
| circular | 119.0 s | 39.9 s | 3.0x |

| plot_type | 1M before | 1M after | speed-up |
|---|---|---|---|
| manhattan | 63.7 s | 6.0 s | 10.6x |
| qq | 37.1 s | 10.2 s | 3.6x |
| circular | 235.3 s | 73.3 s | 3.2x |
Fixed
POS is now stored as plain int64 after a to_numeric + dropna pass, rather than the nullable Int64 that leaked pd.NA into reductions like groupby(...).max() and caused TypeError: boolean value of NA is ambiguous further down the pipeline.
plot_linearm’s df.groupby(CHR)[POS].max() now passes observed=True so categorical chromosomes with no rows in a particular track produce no entry (s.get(c, 0) handles the missing case), avoiding the NA-propagation crash described above.
Stripped 5 288 stray NUL bytes that had been appended to the end of pycmplot/plotting/linear.py (filesystem-level corruption from a partial overwrite; the file imported only after the trailing zeros were removed).
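The first two fixes can be reproduced on a toy table. The column names and values are assumptions; the calls shown (to_numeric, dropna, astype, and groupby with observed=True) are the ones named in the entries above.

```python
import pandas as pd

df = pd.DataFrame({
    # Chromosome "3" is a known category but has no rows in this track.
    "CHR": pd.Categorical(["1", "1", "2", "2"], categories=["1", "2", "3"], ordered=True),
    "POS": ["100", "250", "n/a", "900"],
})

# Coerce to numbers, drop unparseable rows, then store as plain int64.
df = (
    df.assign(POS=pd.to_numeric(df["POS"], errors="coerce"))
      .dropna(subset=["POS"])
      .astype({"POS": "int64"})
)

# observed=True: categories without rows produce no entry at all.
chr_max = df.groupby("CHR", observed=True)["POS"].max()
lengths = [int(chr_max.get(c, 0)) for c in ["1", "2", "3"]]
```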
0.2.5 — 2026-04-20
Fixed
Chromosome-22 positions falling outside hg38 chr22 limits after liftover no longer crash circular plotting. The liftover post-filter now guards against unknown chromosomes with an informative warning.
prep_pycmplot_input_info now resolves and stores column mappings per file rather than collapsing everything onto the last file. This fixes incorrect column resolution when the input summary statistics files use different header names.
io.get_file_header now correctly honours the delim argument when reading the header line.
stats.get_highlight_snps now forwards logp through to get_lead_snps instead of hard-coding it to False; highlighting now works correctly when plotting on the −log₁₀(p) axis.
_core.py annotation resolution now uses the value of --annotate (not the column name) when checking whether the requested annotation column exists in the hits table, and falls back to SNP with a warning when it does not.
Chromosome-length sort (--sort_track chrom_len) now actually sorts by the number of chromosomes (most chromosomes first) rather than by track label.
resources.ResourceConfig.require now imports as_file from importlib.resources so the bundled-resource fallback no longer raises NameError. The fallback also now verifies that the resolved file actually exists before returning, rather than silently returning a phantom path.
prep_pycmplot_input_info no longer emits a spurious “no build column detected” warning when the input files contain a BUILD column. The check previously inspected the length of the top-level info list, which only distinguishes the --build path from the no-build path; the fix also checks whether a build column was appended to old_cols.
Linear Manhattan plot: per-track labels and the shared -log₁₀(p-value) y-axis label no longer overlap in the left margin. Track labels are now rendered as a right-aligned sub-title above each axes (ax.set_title(..., loc='right')), which keeps them out of the data region entirely, so labels remain legible for dense null tracks, iHS/F_ST/XP-EHH panels, or any other plot where data can reach the upper-right corner. The figure also reserves an explicit left strip for the shared y-label via fig.subplots_adjust instead of relying on tight_layout (which was incompatible with the shared-x gridspec and triggered a matplotlib warning).
Linear Manhattan plot: the df = df[df[p_col] >= 0] sanity filter is now only applied when plotting -log₁₀(p). For non-p-value statistics (iHS, XP-EHH, Fay & Wu’s H) negative values are legitimate and are preserved. The filter was also previously applied after color_cycle was constructed, which caused a latent ValueError: 'c' argument has N elements, which is inconsistent with 'x' and 'y' whenever the filter actually dropped rows.
Circular plotting: fixed annotation using SNP labels when GENE was selected.
Added
--ylabel / -yl flag (and ylabel= kwarg on plot_linear() and plot_linearm()) for overriding the shared y-axis label on linear Manhattan plots. Intended for non-p-value statistics, e.g. --ylabel 'iHS' or --ylabel 'F_ST'.
All QQ-plotting functions (plot_qq_single(), plot_qq_combined(), plot_qq_separate(), plot_qq_overlay()) are now re-exported at the top level (from pycmplot import plot_qq_combined) and through the pycmplot.plotting subpackage.
hg18 → hg38 liftover. BUILD column values of hg18 (or --build hg18) now trigger direct hg18 → hg38 coordinate conversion via a bundled UCSC chain file (pycmplot/data/hg18ToHg38.over.chain.gz). A new liftover_hg18_to_hg38() helper and ResourceConfig.chain_hg18_hg38 attribute (overridable via PYCMPLOT_CHAIN_HG18_HG38) are exposed alongside the existing hg19 → hg38 path. Together these cover virtually all publicly available GWAS summary statistics.
python -m pycmplot entry point (via a new __main__.py).
New Jupyter notebook demonstrating QQ-plotting workflows.
All module-, class-, and function-level docstrings now use the bare """...""" form so that Sphinx autodoc / numpydoc and help() render them correctly.
Information about the sumstats printed to screen now includes the number of variants before and after trimming, memory usage, and a progress bar.
Changed
Improved memory efficiency by changing the CHR and BUILD column dtypes from str to category in io.py.
Licence changed to MIT Licence.
0.2.2 — 2026-04-18
Added
QQ plots (pycmplot.plotting.qq):
plot_qq_single(): single QQ panel on a provided axes, with 95% CI band, null diagonal, optional genome-wide line, and λ annotation.
plot_qq_combined(): all sumstats as a per-panel grid with configurable column count.
plot_qq_separate(): one file per sumstat.
plot_qq_overlay(): all sumstats on one shared axes, with λ in legend entries.
thin_pvals(): log-uniform p-value thinning helper that preserves tail density while sparsifying the bulk, with no hard threshold seam.
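One way to picture log-uniform thinning is to keep ranks that are evenly spaced on a log scale. The sketch below is an assumption about the general technique, not the body of thin_pvals(): the name thin_log_uniform and the geomspace-based rank selection are hypothetical.

```python
import numpy as np

def thin_log_uniform(pvals, max_points=50_000):
    """Keep p-values at log-spaced ranks: dense in the tail, sparse in the bulk."""
    p = np.sort(np.asarray(pvals, dtype=float))   # ascending, smallest p first
    n = p.size
    if n <= max_points:
        return p
    # Ranks 1..n sampled uniformly in log(rank); duplicates collapse at the small end.
    ranks = np.unique(np.round(np.geomspace(1, n, num=max_points)).astype(np.int64))
    return p[ranks - 1]

rng = np.random.default_rng(0)
p = rng.uniform(size=200_000)
thinned = thin_log_uniform(p, max_points=1000)
```

Because the spacing grows smoothly with rank, there is no threshold at which the sampling density changes abruptly; the smallest p-values are all retained and the bulk near p = 1 is sampled sparsely.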
CLI flags for QQ plotting:
| Flag | Description |
|---|---|
| | Generate QQ plot(s) alongside the Manhattan plot. |
| | Save one file per sumstat instead of a combined figure. |
| | Overlay all sumstats on a single QQ axes. |
| | Number of columns in the combined grid (default 3). |
| | Maximum points per track after thinning (default 50 000). |
| | Enable log-uniform p-value thinning (off by default). |
| | P-value floor below which all points are kept (default 0.01). |
Performance
Log-uniform thinning reduces a 10 M-SNP dataset to ≤ 50 000 plotted points with no perceptible visual difference.
Scatter points are rasterised inside PDF/SVG output (rasterized=True), reducing file sizes from hundreds of MB to a few MB for large datasets.
Fixed
_qq_arrays: removed an erroneous reverse on the observed array that paired the largest expected quantile with the smallest observed p-value, breaking the diagonal.
thin_pvals: replaced the two-region split that could produce a zero bulk budget (silently dropping the diagonal below −log₁₀(p) = 2) with seamless log-uniform thinning.
_plot_circularm: increased padding between the first and last tracks to improve visibility of track labels and y-axis ticks.
--build_column detection no longer fails when the flag is omitted.
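The pairing that the _qq_arrays fix restores can be shown with a short sketch. The function name qq_arrays_sketch and the (i - 0.5) / n plotting positions are assumptions; what matters is that both arrays are built from ascending p-values, so the smallest observed p-value is paired with the smallest expected quantile.

```python
import numpy as np

def qq_arrays_sketch(pvals):
    """Return (expected, observed) -log10 arrays with matching order (illustrative)."""
    p = np.sort(np.asarray(pvals, dtype=float))     # ascending: smallest p first
    n = p.size
    expected_p = (np.arange(1, n + 1) - 0.5) / n     # ascending null quantiles
    # Both arrays are ascending in p, hence descending in -log10; no reverse needed.
    return -np.log10(expected_p), -np.log10(p)

# P-values that sit exactly on the null quantiles fall on the diagonal.
exp_q, obs_q = qq_arrays_sketch([0.9, 0.1, 0.5, 0.3, 0.7])
```

Reversing only one of the two arrays would pair the largest expected value with the smallest observed one, which is the broken diagonal described above.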
0.2.1 — 2026-04-16
Added
--build option for supplying per-file genome builds when the summary statistics files do not carry a BUILD column.
--build and --build_column are both optional; plotting proceeds without genome-build information when neither is supplied.
Changed
Expanded --annotate choices from snp / gene to any column in the hits table (and any column in a user-supplied annotation table in the Python API).
Caveat
When multiple summary statistics files use different coordinate systems and --annotate is set, annotation defaults to hg38 coordinates, which may mis-annotate hg19 variants. Supplying correct builds avoids this.
0.1.9 — 2026-04-14
Fixed
Column name auto-detection now covers both lower- and upper-case variants of every built-in candidate.
build parameter of prep_pycmplot_input_info() is now consistent with the CLI equivalent (required instead of optional).
0.1.8 — 2026-04-14
Added
--highlight_color and --highlight_line_color options.
Short form for --colors.
Long forms for -r_min, -r_max, -t_space, -pad.
Fixed
from __future__ import annotations import bug.
Short form for --highlight_line.
0.1.0 — 2026-04-18
Initial release.
Added
Package structure:
Installable Python package with src/ layout, pyproject.toml, setup.cfg, and a setup.py compatibility shim.
Console script pycmplot (also runnable as python -m pycmplot).
Modules:
pycmplot.constants: hg38 chromosome lengths, biotype priority weights, standard chromosome order.
pycmplot.resources: ResourceConfig dataclass for reference-file paths, configurable via environment variables (PYCMPLOT_CHAIN_HG19_HG38, PYCMPLOT_GENEINFO_HG38, PYCMPLOT_GENEINFO_HG19).
pycmplot.liftover: lazy-initialised hg19 → hg38 coordinate conversion.
pycmplot.io: summary statistics loader with auto-detection of delimiters and column names.
pycmplot.annotation: strand-aware nearest-gene annotation with biotype-weighted prioritisation, promoter flagging, and get_hits_summary_table().
pycmplot.plotting.linear: multi-track stacked linear Manhattan plotter.
pycmplot.plotting.circular: multi-track Circos-style circular Manhattan plotter.
pycmplot.cli: argparse CLI.
pycmplot._core: the main() orchestration function.
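The lazy initialisation listed for pycmplot.liftover can be pictured with a cached factory. Everything in this sketch is hypothetical, including FakeConverter, get_converter and the placeholder file name; it only illustrates deferring an expensive load from import time to first use.

```python
from functools import lru_cache

LOADS = []   # records each (simulated) chain-file load

class FakeConverter:
    """Stand-in for an expensive chain-file loader (hypothetical)."""
    def __init__(self, chain_path):
        LOADS.append(chain_path)      # the real cost would be paid here
        self.chain_path = chain_path

@lru_cache(maxsize=None)
def get_converter(chain_path):
    """Build the converter on first request and reuse it afterwards."""
    return FakeConverter(chain_path)

# Nothing has been loaded at definition (import) time; the first call triggers it.
first = get_converter("example.chain.gz")
second = get_converter("example.chain.gz")
```

With this pattern a missing chain file can only cause an error when a conversion is actually requested, not when the module is imported.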
Fixed (relative to the original monolithic script):
Module-level LiftOver(hardcoded_path) call replaced by a lazy singleton; import pycmplot no longer raises FileNotFoundError.
Hardcoded /vast/awonkam1/...resourc