Changelog
=========
All notable changes to **pycmplot** are documented here.
The format is based on `Keep a Changelog <https://keepachangelog.com/>`_
and this project adheres to `Semantic Versioning <https://semver.org/>`_.
----
[0.3.0] — 20266001
------------------
**Fixed**
- Linear plotting: fixed a failure when accessing the ``t_heights`` local
  variable.
**Changed**
- Suggestive line color from ``lightblue`` to ``navy`` in circular plotting
- Significance line color from ``red`` to ``orangered`` in linear plotting to match
circular plotting
- Suggestive line color from ``blue`` to ``navy`` in linear plotting to match
circular plotting
**Added**
- ``ylabel``: optional y-axis label text in circular plotting, matching linear plotting
----
[0.2.8] — 2026-05-30
------------------------------------------------------------------------------
**Added**
- **Dual annotation renderer architecture**
Two complementary annotation functions now handle sparse and dense
annotation scenarios independently:
- ``_draw_annotation_arrows`` — sparse annotation renderer with tiered
label placement, chromosome-boundary spreading, cumulative-distance
stacking, and straight ``arc3`` arrows (curvature fixed at zero for
visual clarity in low-density contexts).
- ``_draw_annotation_arrows_multirail`` — dense annotation renderer
implementing a three-step layout pipeline (see below) with curved
``arc`` arrows and adaptive ``ylim``.
- **Three-step dense annotation layout pipeline** (``_draw_annotation_arrows_multirail``)
1. *Relaxation pass* — bidirectional ``min_sep`` enforcement starting
from ``x_signal`` positions. Labels in dense regions drift further
from their signals than labels in sparse regions, producing a
natural density signal with no explicit cluster detection.
2. *Drift-based rail assignment* — each label's relaxation drift is
binned into a rail index using
``rail_stride = rail_width / max_rails``. Denser regions
automatically receive higher rail indices proportionally across the
full rail range. No per-rail queue processing or ``max_drift``
threshold is required.
3. *linspace rank-reassignment* — labels are sorted by ``x_signal``
and assigned evenly-spaced ``x_text`` slots via
``np.linspace(rail_start, rail_end, n)``. This guarantees
``x_text`` rank equals ``x_signal`` rank (no arrow crossings by
construction) and full rail coverage regardless of ``rail_frac`` or
signal distribution.
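
The three steps above can be sketched as follows. This is an illustrative
reconstruction under stated assumptions, not the package's code: the name
``layout_dense_labels`` and its defaults are hypothetical, the relaxation is
reduced to one forward and one backward sweep, and the linspace step is
applied across a single rail span.

.. code-block:: python

    import numpy as np


    def layout_dense_labels(x_signal, min_sep, rail_start, rail_end, max_rails=4):
        """Toy relax -> rail-bin -> linspace layout (hypothetical helper)."""
        x_signal = np.asarray(x_signal, dtype=float)
        order = np.argsort(x_signal, kind="stable")  # left-to-right signal order
        xs = x_signal[order]
        n = len(xs)

        # Step 1: relaxation. The forward sweep pushes labels right until
        # neighbours are at least min_sep apart; rail_end acts as a hard clamp,
        # and the backward sweep pushes labels left to restore the separation.
        x_relaxed = xs.copy()
        for i in range(1, n):
            x_relaxed[i] = max(x_relaxed[i], x_relaxed[i - 1] + min_sep)
        x_relaxed = np.minimum(x_relaxed, rail_end)
        for i in range(n - 2, -1, -1):
            x_relaxed[i] = min(x_relaxed[i], x_relaxed[i + 1] - min_sep)

        # Step 2: drift-based rail assignment. x_relaxed is only used to
        # measure drift; a larger drift means a denser neighbourhood and
        # therefore a higher rail index.
        rail_width = rail_end - rail_start
        rail_stride = rail_width / max_rails
        drift = np.abs(x_relaxed - xs)
        rail_sorted = np.minimum((drift // rail_stride).astype(int), max_rails - 1)

        # Step 3: linspace rank-reassignment. Evenly spaced slots are handed
        # out in signal order, so x_text rank equals x_signal rank.
        x_text_sorted = np.linspace(rail_start, rail_end, n)

        # Map both results back to the caller's original ordering.
        x_text = np.empty(n)
        rail_id = np.empty(n, dtype=int)
        x_text[order] = x_text_sorted
        rail_id[order] = rail_sorted
        return x_text, rail_id


    x_signal = [30.0, 10.0, 10.5, 11.0, 11.5, 12.0]  # made-up positions
    x_text, rail_id = layout_dense_labels(
        x_signal, min_sep=4.0, rail_start=0.0, rail_end=40.0
    )

In this toy input the tightly packed labels near ``x = 10`` accumulate the
largest drift, so the last two of them land on rail 1 while the isolated label
at ``x = 30`` stays on rail 0. Sorting ``x_text`` and sorting ``x_signal``
give the same label order, which is the no-crossing property.
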
- **Auto char_width from axes geometry**
For vertical text (``rotation=90``), the horizontal label footprint is
one character wide regardless of string length. ``char_width`` is now
derived from the axes pixel extent and figure DPI at draw time::
px_per_bp = ax_bbox.width / (xmax - xmin)
char_width = 0.6 * fsize * (fig.dpi / 72.0) / px_per_bp
The ``char_width_factor`` parameter has been removed from
``_draw_annotation_arrows_multirail``; ``char_width`` is computed
automatically and scales correctly with figure size, DPI, and font
size.
- **Proportional space budgeting and rail_frac awareness**
Rail width is derived from ``rail_frac`` as
``rail_width = genome_width * rail_frac``, centred on the genome
midpoint. ``rail_stride`` and slot spacing scale proportionally with
``rail_frac``, ensuring even label distribution at any rail fraction
without choking at rail boundaries.
- **Layout table** (``pd.DataFrame``)
Placement, relaxation, and rendering are now cleanly separated via a
layout table with columns ``label``, ``x_signal``, ``x_text``,
``rail_id``. ``rail_id`` is written during placement and not read
again until the rendering pass, enforcing strict separation of layout
and rendering concerns.
- **Chromosome-boundary detection** (``_draw_annotation_arrows``)
For each adjacent chromosome pair, the inter-chromosome gap is
computed. If the gap is narrower than ``spread_width``, both boundary
annotations receive an ``x_bound`` value encoding direction and
magnitude, used downstream to push boundary labels apart before
general spreading.
- **Cumulative x-position porting from tracks**
Annotation cumulative x positions are now ported directly from track
DataFrames via a three-column merge on ``(chr_col, pos_col, LABEL)``
rather than being recomputed independently, guaranteeing exact
consistency between annotation and track coordinates.
- **track_heights sanity check and y-label positioning**
``track_heights`` is validated against the expected count
(``n_tracks + 1`` when annotating, ``n_tracks`` otherwise) with
explicit ``ValueError`` and ``TypeError`` messages. The y-label
position (``-log10(P)``) is computed from actual height ratios
accounting for top-to-bottom track orientation::
y_lab_pos = data_total / (2 * total_height)
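
A minimal sketch of the ``track_heights`` check and the y-label formula
follows. The helper names ``check_track_heights`` and ``y_label_position``
are hypothetical, and the sketch assumes that the annotation row, when
present, is the first (top) entry and that ``data_total`` is the summed
height of the remaining data tracks.

.. code-block:: python

    from numbers import Real


    def check_track_heights(track_heights, n_tracks, annotate):
        """Validate height ratios (hypothetical helper, not the package's)."""
        if not isinstance(track_heights, (list, tuple)):
            raise TypeError("track_heights must be a list or tuple of numbers")
        if not all(
            isinstance(h, Real) and not isinstance(h, bool) for h in track_heights
        ):
            raise TypeError("every entry of track_heights must be a number")
        # One extra row is expected when an annotation panel is drawn.
        expected = n_tracks + 1 if annotate else n_tracks
        if len(track_heights) != expected:
            raise ValueError(
                f"expected {expected} track heights, got {len(track_heights)}"
            )
        return list(track_heights)


    def y_label_position(track_heights, annotate):
        """Return data_total / (2 * total_height) in figure coordinates.

        Assumption: with annotation enabled, the first entry is the annotation
        row and the data tracks fill the rest of the figure below it.
        """
        total_height = sum(track_heights)
        data_total = sum(track_heights[1:]) if annotate else total_height
        return data_total / (2 * total_height)


    heights = check_track_heights([1, 3, 3], n_tracks=2, annotate=True)
    y_lab_pos = y_label_position(heights, annotate=True)

With heights ``[1, 3, 3]`` the data tracks occupy the bottom 6/7 of the
figure, so the label sits at 6 / 14 of the figure height, the vertical
midpoint of the data region rather than of the whole figure.
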
**Changed**
- ``_draw_annotation_arrows``: ``max_rad`` parameter removed; curvature
is intentionally fixed at zero (straight arrows) for sparse
annotation contexts. Dense annotation curvature is handled
exclusively by ``_draw_annotation_arrows_multirail``.
- Annotation deduplication now occurs at the top of both renderers
via ``drop_duplicates(subset=[chr_col, "x", label_col])`` to prevent
replicated arrows when ``annot_df`` is a merged multi-track table.
- Chromosome order in boundary detection now uses ``natsorted`` instead
of ``set`` to guarantee correct genomic ordering.
- ``x_bound`` is only set when the inter-chromosome gap is
``<= spread_width`` (previously unconditional), preventing spurious
boundary constraints between well-separated chromosomes.
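
The deduplication step can be illustrated on a small, made-up table. The
column names and values below are assumptions chosen for the example; only
the ``drop_duplicates`` call mirrors the entry above.

.. code-block:: python

    import pandas as pd

    chr_col, label_col = "CHR", "GENE"  # example column names

    # A merged multi-track table: each hit appears once per track.
    annot_df = pd.DataFrame(
        {
            "CHR": ["1", "1", "2", "2"],
            "x": [1_500_000, 1_500_000, 260_000_000, 260_000_000],
            "GENE": ["GENE_A", "GENE_A", "GENE_B", "GENE_B"],
            "track": ["trait_a", "trait_b", "trait_a", "trait_b"],
        }
    )

    # One row per (chromosome, cumulative x, label) means one arrow per label.
    unique_annot = annot_df.drop_duplicates(subset=[chr_col, "x", label_col])

Because ``track`` is not part of the subset, rows that differ only in their
track collapse to the first occurrence, and each label is drawn once.
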
**Fixed**
- Arrow crossings eliminated unconditionally by the linspace
rank-reassignment step: ``x_text`` rank is guaranteed equal to
``x_signal`` rank for all labels across all rails.
- Annotation spill past genome right boundary resolved: ``rail_end``
acts as a hard clamp during relaxation; labels cannot exceed it
regardless of local density.
- Higher-rail priority inversion fixed: the drift-based rail assignment
correctly places the densest labels (largest drift) on higher rails,
not the labels nearest the rail boundary.
- ``x_texts`` sort-order mismatch resolved: cumulative-scaled positions
are now mapped back to original signal order via ``np.argsort``
before use, preventing label-to-wrong-position assignment.
- ``char_width`` underestimation fixed: replacing the hardcoded
``8e6`` fallback with axes-geometry derivation corrects the ~2×
underestimate that caused stacking to never fire for typical figure
sizes at ``fontsize=6``.
- ``natsorted`` applied to chromosome order throughout to prevent
incorrect pairing of chromosomes (e.g. chr3 with chr17) caused by
``set`` iteration order.
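
The ordering problem is easy to reproduce. The sketch below uses a small
standard-library key function as a stand-in for ``natsorted`` (which comes
from the third-party ``natsort`` package); ``natural_key`` is a hypothetical
helper written for this example only.

.. code-block:: python

    import re


    def natural_key(name):
        """Split 'chr17' into ['chr', 17, ''] so digits compare as integers."""
        return [
            int(tok) if tok.isdigit() else tok.lower()
            for tok in re.split(r"(\d+)", name)
        ]


    chroms = {"chr17", "chr3", "chr1", "chr10", "chr2"}  # a set has no defined order

    ordered = sorted(chroms, key=natural_key)
    adjacent_pairs = list(zip(ordered, ordered[1:]))

Iterating over the set directly could pair any two chromosomes as
"neighbours", and a plain ``sorted(chroms)`` would place ``chr10`` before
``chr2``. With a natural sort, every adjacent pair is a genuine genomic
neighbour within the set of chromosomes present.
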
0.2.7 — 2026-04-27
------------------
**Added**
- **Default-on density-aware auto-thinning** for Manhattan / circular
rendering, inspired by ``gwaslab`` and applied on top of (i.e. in
addition to) the existing ``--trim_pval``. A new helper
:func:`~pycmplot.io.auto_thin_for_manhattan` keeps **every** variant
whose "interestingness" signal is at or above ``--auto_thin_threshold``
and uniformly sub-samples the dense bulk to at most
``--auto_thin_max_below`` rows per track (default ``200 000``). Lead
SNPs are still extracted from the *full* unthinned data, so peak
annotations are unaffected.
Two modes, switched by ``--logp``:
* **P-value mode** (``--logp`` set, the GWAS default). Signal is
``-log10(P)``; ``--auto_thin_threshold`` is in ``-log10(P)`` units
(default ``2.0`` => ``P <= 0.01``). Every suggestive /
genome-wide-significant variant survives untouched.
* **Raw-statistic mode** (``--logp`` off). The ``P`` column is
interpreted as a raw test statistic and the signal becomes
``|value|``, so the same machinery works for selection scans like
**iHS, XP-EHH, F_ST, Fay & Wu's H, Tajima's D**, etc. The default
threshold of ``2.0`` works for the standardised \|iHS\| / \|XP-EHH\|
scans; override (e.g. ``--auto_thin_threshold 0.05``) for F_ST.
Negative extremes are preserved as well as positive ones, so for
signed statistics (iHS, XP-EHH) both tails of the distribution
survive intact.
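
The two modes can be sketched with a single keep-mask function. This is a
hedged illustration, not the packaged ``auto_thin_for_manhattan``: the name
``auto_thin_mask``, its signature, and the use of uniform random sampling
for the background are assumptions.

.. code-block:: python

    import numpy as np


    def auto_thin_mask(values, logp=True, threshold=2.0, max_below=200_000, seed=0):
        """Boolean keep-mask (hypothetical stand-in for the packaged helper)."""
        values = np.asarray(values, dtype=float)
        if logp:
            # P-value mode: the signal is -log10(P); clip avoids log10(0).
            signal = -np.log10(np.clip(values, 1e-300, 1.0))
        else:
            # Raw-statistic mode: both tails of a signed statistic count.
            signal = np.abs(values)

        keep = signal >= threshold       # every interesting variant survives
        below = np.flatnonzero(~keep)    # indices of the dense bulk
        if below.size > max_below:
            rng = np.random.default_rng(seed)
            below = rng.choice(below, size=max_below, replace=False)
        keep[below] = True               # add back the sub-sampled background
        return keep


    rng = np.random.default_rng(1)
    pvals = rng.uniform(size=10_000)
    pvals[:5] = 1e-9                     # five strong hits
    mask_p = auto_thin_mask(pvals, logp=True, threshold=2.0, max_below=1_000)

    ihs = np.array([-3.1, -0.2, 0.1, 0.4, 2.7])  # signed statistic, made-up values
    mask_raw = auto_thin_mask(ihs, logp=False, threshold=2.0, max_below=1)

In the p-value run all variants with ``-log10(P) >= 2`` are kept and the
remaining background is capped at 1 000 rows. In the raw-statistic run both
``-3.1`` and ``2.7`` survive, since the signal is the absolute value.
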
New CLI flags:
============================== ================================================
Flag                           Description
============================== ================================================
``--no_auto_thin``             Disable auto-thinning entirely.
``--auto_thin_threshold``      ``-log10(P)`` floor above which every variant
                               is kept (default 2.0).
``--auto_thin_max_below``      Cap on background variants per track
                               (default 200 000).
``--no_qq_thin``               Counterpart for QQ log-uniform thinning,
                               which is now ON by default.
============================== ================================================
Combined with the rendering and data-prep optimisations from earlier
in this release, this brings pycmplot's untrimmed timings to:
+-------+-------------------+--------------+----------------+
| Size  | manhattan (s)     | qq (s)       | circular (s)   |
+=======+===================+==============+================+
| 500K  | 4.4 (was 32.6)    | 4.1 (19.0)   | 18.5 (119)     |
+-------+-------------------+--------------+----------------+
| 1M    | 5.1 (was 63.7)    | 4.9 (37)     | 19.6 (235)     |
+-------+-------------------+--------------+----------------+
| 2M    | 6.6 (was 127)     | 6.4 (75)     | 21.3 (469)     |
+-------+-------------------+--------------+----------------+
| 5M    | 12.7 (was 317)    | 11.7 (191)   | 28.7 (1169)    |
+-------+-------------------+--------------+----------------+
i.e. circular plotting at 5 M variants is now **41x faster** than the
pre-0.2.7 untrimmed path, and projects to ~38 s at 10 M variants
(down from ~38 minutes — and faster than CMplot's circular path).
**Performance**
- Linear Manhattan rendering switched from ``ax.scatter`` (one ``PathCollection``
carrying a path-per-point with per-point ``should_simplify`` checks) to
one ``ax.plot(..., marker='.', linestyle='none')`` per chromosome
(a single ``Line2D`` whose marker-draw loop is dramatically cheaper).
Visually identical rasterised output; on a 1 M-variant single-track plot
this alone shrinks ``plot_linearm`` from ~6 s to ~0.5 s.
- QQ plots (``plot_qq_single`` and ``plot_qq_combined``) make the same
scatter → plot switch for the observed points.
- Chromosome-name normalisation in
:func:`~pycmplot.io.get_sumstats_and_merged_sector_list` is now applied
to the **categories** of the CHR ``Categorical`` (≤25 distinct values)
rather than to the underlying N-row code array. The result is stored
as a ``Categorical`` ordered by ``CHROM_ORDER`` so downstream code can
derive ``chr_idx`` from ``cat.codes`` directly.
- Linear-plot ``_prep`` recognises the canonical Categorical CHR column
produced by the loader and skips the redundant ``str.replace +
str.upper + replace`` pass that was running on every plot call.
- Optional CSV reader switched to ``engine='pyarrow'`` with safe fallback
to the default C engine when pyarrow is unavailable.
- New ``compute_pvals`` parameter on
:func:`~pycmplot.io.get_sumstats_and_merged_sector_list` (default
``True``); ``_core.py`` now sets it to ``False`` when no QQ plot is
requested, skipping an ~80 MB-at-10 M-variants p-value-array copy that
was unused on Manhattan- or circular-only runs.
Combined effect (measured, single-track untrimmed, fresh subprocess):
========== =========== ========== ========
plot_type  500K before 500K after speed-up
========== =========== ========== ========
manhattan  32.6 s      4.6 s      7.1x
qq         19.0 s      6.7 s      2.8x
circular   119.0 s     39.9 s     3.0x
========== =========== ========== ========

========== ========== ========== ========
plot_type  1M before  1M after   speed-up
========== ========== ========== ========
manhattan  63.7 s     6.0 s      10.6x
qq         37.1 s     10.2 s     3.6x
circular   235.3 s    73.3 s     3.2x
========== ========== ========== ========
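
The category-level chromosome normalisation listed in the Performance notes
can be sketched as below. This is an assumption-laden illustration: the
contents of ``CHROM_ORDER`` and the helpers ``canonical_name`` and
``normalise_chr`` are hypothetical, and the real loader may differ.

.. code-block:: python

    import numpy as np
    import pandas as pd

    # Assumed canonical order; the package defines its own CHROM_ORDER.
    CHROM_ORDER = [str(i) for i in range(1, 23)] + ["X", "Y", "MT"]


    def canonical_name(name):
        """'chr1' -> '1', 'chrx' -> 'X', 'chrM' -> 'MT'."""
        name = str(name).upper()
        if name.startswith("CHR"):
            name = name[3:]
        return "MT" if name == "M" else name


    def normalise_chr(chr_series):
        """Normalise names via the few categories rather than every row."""
        cat = chr_series.astype("category").cat
        # One string operation per distinct category, not per row.
        canon = [canonical_name(c) for c in cat.categories]
        per_category_code = pd.Categorical(
            canon, categories=CHROM_ORDER, ordered=True
        ).codes
        # Translate row codes through the small lookup table; -1 marks
        # missing values and names outside CHROM_ORDER.
        old_codes = cat.codes.to_numpy()
        new_codes = np.where(old_codes >= 0, per_category_code[old_codes], -1)
        return pd.Series(
            pd.Categorical.from_codes(
                new_codes, categories=CHROM_ORDER, ordered=True
            ),
            index=chr_series.index,
        )


    chr_col = pd.Series(["chr2", "chr1", "chrX", "1", "chrM"])
    chr_norm = normalise_chr(chr_col)
    chr_idx = chr_norm.cat.codes.to_numpy()  # integer index in CHROM_ORDER

The string work scales with the number of distinct chromosome names, while
the per-row step is a single integer lookup. ``chr_idx`` then comes straight
from the codes, as the entry describes for downstream code.
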
**Fixed**
- ``POS`` is now stored as plain ``int64`` after a ``to_numeric +
dropna`` pass, rather than the nullable ``Int64`` that leaked ``pd.NA``
into reductions like ``groupby(...).max()`` and caused
``TypeError: boolean value of NA is ambiguous`` further down the
pipeline.
- ``plot_linearm``'s ``df.groupby(CHR)[POS].max()`` now passes
``observed=True`` so categorical chromosomes with no rows in a
particular track produce no entry (``s.get(c, 0)`` handles the missing
case), avoiding the ``NA``-propagation crash described above.
- Stripped 5 288 stray ``NUL`` bytes that had been appended to the end
of ``pycmplot/plotting/linear.py`` (filesystem-level corruption from a
partial overwrite — the file imported only after the trailing zeros
were removed).
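
The two dtype-related fixes above (plain ``int64`` positions and
``observed=True``) can be illustrated on a small, made-up table. The column
values are invented for the example, and the code is a sketch of the
technique rather than the loader itself.

.. code-block:: python

    import pandas as pd

    raw = pd.DataFrame(
        {
            "CHR": pd.Categorical(
                ["1", "1", "2", "2"], categories=["1", "2", "3"], ordered=True
            ),
            "POS": ["100", "250", "n/a", "75"],  # raw text, one unparseable value
        }
    )

    # Coerce to numbers, drop unparseable rows, then store as plain int64 so
    # that no pd.NA can appear in later reductions.
    pos = pd.to_numeric(raw["POS"], errors="coerce")
    df = raw.loc[pos.notna()].assign(POS=pos.dropna().astype("int64"))

    # observed=True: chromosome "3" has no rows, so it gets no entry at all.
    chr_max = df.groupby("CHR", observed=True)["POS"].max()

    # Missing chromosomes fall back to 0 through Series.get.
    lengths = [int(chr_max.get(c, 0)) for c in df["CHR"].cat.categories]

Chromosome ``"3"`` is a declared category with no rows. It is absent from
``chr_max``, and the ``get(c, 0)`` fallback supplies a length of zero instead
of a missing value.
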
----
0.2.5 — 2026-04-20
------------------
**Fixed**
- Chromosome-22 positions falling outside hg38 chr22 limits after liftover
no longer crash circular plotting. The liftover post-filter now guards
against unknown chromosomes with an informative warning.
- ``prep_pycmplot_input_info`` now resolves and stores column mappings
**per file** rather than collapsing everything onto the last file. This
fixes incorrect column resolution when the input summary statistics files
use different header names.
- ``io.get_file_header`` now correctly honours the ``delim`` argument when
reading the header line.
- ``stats.get_highlight_snps`` now forwards ``logp`` through to
``get_lead_snps`` instead of hard-coding it to ``False`` — highlighting
works correctly when plotting on the −log₁₀(p) axis.
- ``_core.py`` annotation resolution now uses the value of ``--annotate``
(not the column name) when checking whether the requested annotation
column exists in the hits table, and falls back to ``SNP`` with a warning
when it does not.
- Chromosome-length sort (``--sort_track chrom_len``) now actually sorts by
the number of chromosomes (most chromosomes first) rather than by track
label.
- ``resources.ResourceConfig.require`` now imports ``as_file`` from
``importlib.resources`` so the bundled-resource fallback no longer raises
``NameError``. The fallback also now verifies that the resolved file
actually exists before returning, rather than silently returning a
phantom path.
- ``prep_pycmplot_input_info`` no longer emits a spurious "no build column
detected" warning when the input files contain a ``BUILD`` column. The
check previously inspected the length of the top-level info list, which
only distinguishes the ``--build`` path from the no-build path; the fix
also checks whether a build column was appended to ``old_cols``.
- Linear Manhattan plot: per-track labels and the shared
``-log₁₀(p-value)`` y-axis label no longer overlap in the left margin.
Track labels are now rendered as a right-aligned sub-title above each
axes (``ax.set_title(..., loc='right')``), which keeps them out of the
data region entirely — so labels remain legible for dense null tracks,
iHS/F_ST/XP-EHH panels, or any other plot where data can reach the
upper-right corner. The figure also reserves an explicit left strip
for the shared y-label via ``fig.subplots_adjust`` instead of relying
on ``tight_layout`` (which was incompatible with the shared-x gridspec
and silently emitted a matplotlib warning).
- Linear Manhattan plot: the ``df = df[df[p_col] >= 0]`` sanity filter is
now only applied when plotting ``-log₁₀(p)``. For non-p-value
statistics (iHS, XP-EHH, Fay & Wu's H) negative values are legitimate
and are preserved. The filter was also previously applied *after*
``color_cycle`` was constructed, which caused a latent
``ValueError: 'c' argument has N elements, which is inconsistent with
'x' and 'y'`` whenever the filter actually dropped rows.
- Circular plotting: fixed annotation labels showing **SNP** when **GENE**
  was selected.
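
The ordering issue behind the ``color_cycle`` fix above can be shown without
any plotting. The table and palette below are made up; the point is only that
per-point colours must be derived after the row filter, and that the filter
applies only to ``-log10(p)`` data.

.. code-block:: python

    import pandas as pd

    df = pd.DataFrame({"CHR": ["1", "1", "2", "2"], "P": [3.2, -1.0, 0.4, 5.1]})
    palette = {"1": "navy", "2": "grey"}  # example colours only
    plotting_logp = True

    # Filter first, and only when the column holds -log10(p) values ...
    if plotting_logp:
        df = df[df["P"] >= 0]

    # ... then build per-point colours from the rows that will be drawn.
    colors = df["CHR"].map(palette).tolist()

Building ``colors`` before the filter would yield four colours for three
points, which is the length mismatch reported in the ``ValueError``. With
``plotting_logp = False`` the negative value is kept, as required for signed
statistics.
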
**Added**
- ``--ylabel`` / ``-yl`` flag (and ``ylabel=`` kwarg on
:func:`~pycmplot.plotting.linear.plot_linear` and
:func:`~pycmplot.plotting.linear.plot_linearm`) for overriding the
shared y-axis label on linear Manhattan plots. Intended for non-p-value
statistics, e.g. ``--ylabel 'iHS'`` or ``--ylabel 'F_ST'``.
- All QQ-plotting functions (:func:`~pycmplot.plotting.qq.plot_qq_single`,
:func:`~pycmplot.plotting.qq.plot_qq_combined`,
:func:`~pycmplot.plotting.qq.plot_qq_separate`,
:func:`~pycmplot.plotting.qq.plot_qq_overlay`) are now re-exported at the
top level (``from pycmplot import plot_qq_combined``) and through the
:mod:`pycmplot.plotting` subpackage.
- **hg18 → hg38 liftover.** ``BUILD`` column values of ``hg18`` (or
``--build hg18``) now trigger direct hg18 → hg38 coordinate conversion
via a bundled UCSC chain file
(``pycmplot/data/hg18ToHg38.over.chain.gz``). A new
:func:`~pycmplot.liftover.liftover_hg18_to_hg38` helper and
``ResourceConfig.chain_hg18_hg38`` attribute (overridable via
``PYCMPLOT_CHAIN_HG18_HG38``) are exposed alongside the existing
hg19 → hg38 path. Together these cover virtually all publicly available
GWAS summary statistics.
- ``python -m pycmplot`` entry point (via a new ``__main__.py``).
- New Jupyter notebook demonstrating QQ-plotting workflows.
- All module-, class-, and function-level docstrings now use the bare
``"""..."""`` form so that Sphinx autodoc / numpydoc and
:func:`help` render them correctly.
- Console output about the summary statistics now includes the number of
  variants before and after trimming, memory usage, and a progress bar.
**Changed**
- Reduced memory use by changing the dtypes of the **CHR** and **BUILD**
  columns from ``str`` to ``category`` in ``io.py``.
- Licence changed to MIT.
----
0.2.2 — 2026-04-18
------------------
**Added**
QQ plots (:mod:`pycmplot.plotting.qq`):
- :func:`~pycmplot.plotting.qq.plot_qq_single` — single QQ panel on a
provided axes, with 95% CI band, null diagonal, optional genome-wide
line, and λ annotation.
- :func:`~pycmplot.plotting.qq.plot_qq_combined` — all sumstats as
per-panel grid with configurable column count.
- :func:`~pycmplot.plotting.qq.plot_qq_separate` — one file per sumstat.
- :func:`~pycmplot.plotting.qq.plot_qq_overlay` — all sumstats on one
shared axes, with λ in legend entries.
- :func:`~pycmplot.plotting.qq.thin_pvals` — log-uniform p-value thinning
helper that preserves tail density while sparsifying the bulk, with no
hard threshold seam.
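
One way to obtain log-uniform thinning is sketched below. It is not the
implementation of ``thin_pvals``: the function name ``thin_log_uniform`` and
the grid-plus-``searchsorted`` approach are assumptions chosen to show the
idea of spacing retained points evenly on the ``log10(p)`` axis.

.. code-block:: python

    import numpy as np


    def thin_log_uniform(pvals, max_points=50_000):
        """Keep at most max_points p-values, spread evenly in log10(p)."""
        p = np.sort(np.asarray(pvals, dtype=float))   # ascending p
        if p.size <= max_points:
            return p
        lp = np.log10(np.clip(p, 1e-300, 1.0))        # ascending log10(p)
        # Evenly spaced targets from the strongest hit to the weakest.
        targets = np.linspace(lp[0], lp[-1], max_points)
        # For each target take the first sorted value at or past it;
        # np.unique drops targets that resolve to the same point.
        idx = np.searchsorted(lp, targets, side="left")
        idx = np.unique(np.clip(idx, 0, p.size - 1))
        return p[idx]


    rng = np.random.default_rng(0)
    pvals = rng.uniform(size=200_000)
    thinned = thin_log_uniform(pvals, max_points=2_000)

Points in the sparse tail are further apart than the grid step, so they are
all retained, while the crowded bulk near ``p = 1`` collapses to roughly one
point per grid cell. No threshold separates the two regimes.
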
CLI flags for QQ plotting:
===================================== =======================================
Flag                                  Description
===================================== =======================================
``-qq`` / ``--qq_plot``               Generate QQ plot(s) alongside the Manhattan plot.
``-qq_sep`` / ``--qq_separate``       Save one file per sumstat instead of a combined figure.
``-qq_ov`` / ``--qq_overlay``         Overlay all sumstats on a single QQ axes.
``-qq_cols`` / ``--qq_ncols``         Number of columns in the combined grid (default 3).
``-qq_max_pts`` / ``--qq_max_points`` Maximum points per track after thinning (default 50 000).
``-qq_thin`` / ``--qq_thin``          Enable log-uniform p-value thinning (off by default).
``-thin_below`` / ``--thin_below``    P-value floor below which all points are kept (default 0.01).
===================================== =======================================
**Performance**
- Log-uniform thinning reduces a 10 M-SNP dataset to ≤ 50 000 plotted
points with no perceptible visual difference.
- Scatter points are rasterised inside PDF/SVG output
(``rasterized=True``), reducing file sizes from hundreds of MB to a
few MB for large datasets.
**Fixed**
- ``_qq_arrays``: removed an erroneous reverse on the ``observed`` array
that paired the largest expected quantile with the smallest observed
p-value, breaking the diagonal.
- ``thin_pvals``: replaced the two-region split that could produce a
zero bulk budget (silently dropping the diagonal below
−log₁₀(p) = 2) with seamless log-uniform thinning.
- ``_plot_circularm``: increased padding between the first and last
tracks to improve visibility of track labels and y-axis ticks.
- ``--build_column`` detection no longer fails when the flag is omitted.
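
The pairing behind the ``_qq_arrays`` fix above can be sketched as follows.
The function ``qq_arrays`` is a hypothetical stand-in, and the
``(i - 0.5) / n`` plotting position is an assumed convention; the package may
use a different one.

.. code-block:: python

    import numpy as np


    def qq_arrays(pvals):
        """Return (expected, observed) -log10 values, both largest first."""
        p = np.sort(np.asarray(pvals, dtype=float))            # smallest p first
        n = p.size
        expected = -np.log10((np.arange(1, n + 1) - 0.5) / n)  # largest first
        observed = -np.log10(p)                                # largest first
        return expected, observed


    n = 1000
    # A perfectly "null" sample, shuffled so that the sort matters.
    null_p = np.random.default_rng(0).permutation((np.arange(1, n + 1) - 0.5) / n)
    expected, observed = qq_arrays(null_p)

Both arrays run from the strongest to the weakest signal, so element ``i`` of
one pairs with element ``i`` of the other and a null sample falls on the
diagonal. Reversing only ``observed`` would pair the largest expected value
with the smallest observed one, producing the broken diagonal described
above.
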
----
0.2.1 — 2026-04-16
------------------
**Added**
- ``--build`` option for supplying per-file genome builds when the
summary statistics files do not carry a ``BUILD`` column.
- ``--build`` and ``--build_column`` are both optional; plotting
proceeds without genome-build information when neither is supplied.
**Changed**
- Expanded ``--annotate`` choices from ``snp``/``gene`` to any column in
the hits table (and any column in a user-supplied annotation table in
the Python API).
**Caveat**
- When multiple summary statistics files use different coordinate
systems and ``--annotate`` is set, annotation defaults to hg38
coordinates, which may mis-annotate hg19 variants. Supplying correct
builds avoids this.
----
0.1.9 — 2026-04-14
------------------
**Fixed**
- Column name auto-detection now covers both lower- and upper-case
variants of every built-in candidate.
- ``build`` parameter of
:func:`~pycmplot.io.prep_pycmplot_input_info` is now consistent with
the CLI equivalent (required instead of optional).
----
0.1.8 — 2026-04-14
------------------
**Added**
- ``--highlight_color`` and ``--highlight_line_color`` options.
- Short form for ``--colors``.
- Long forms for ``-r_min``, ``-r_max``, ``-t_space``, ``-pad``.
**Fixed**
- ``from __future__ import annotations`` import bug.
- Short form for ``--highlight_line``.
----
0.1.0 — 2026-04-18
------------------
Initial release.
**Added**
Package structure:
- Installable Python package with ``src/`` layout, ``pyproject.toml``,
``setup.cfg``, and a ``setup.py`` compatibility shim.
- Console script ``pycmplot`` (also runnable as ``python -m pycmplot``).
Modules:
- :mod:`pycmplot.constants` — hg38 chromosome lengths, biotype priority
weights, standard chromosome order.
- :mod:`pycmplot.resources` — :class:`~pycmplot.resources.ResourceConfig`
dataclass for reference-file paths, configurable via environment
variables (``PYCMPLOT_CHAIN_HG19_HG38``, ``PYCMPLOT_GENEINFO_HG38``,
``PYCMPLOT_GENEINFO_HG19``).
- :mod:`pycmplot.liftover` — lazy-initialised hg19 → hg38 coordinate
conversion.
- :mod:`pycmplot.stats` — :func:`~pycmplot.stats.get_lead_snps` and
:func:`~pycmplot.stats.get_highlight_snps`.
- :mod:`pycmplot.io` — summary statistics loader with auto-detection of
delimiters and column names.
- :mod:`pycmplot.annotation` — strand-aware nearest-gene annotation
with biotype-weighted prioritisation, promoter flagging, and
:func:`~pycmplot.annotation.get_hits_summary_table`.
- :mod:`pycmplot.plotting.linear` — multi-track stacked linear
Manhattan plotter.
- :mod:`pycmplot.plotting.circular` — multi-track Circos-style circular
Manhattan plotter.
- :mod:`pycmplot.cli` — ``argparse`` CLI.
- :mod:`pycmplot._core` — the ``main()`` orchestration function.
**Fixed** (relative to the original monolithic script):
- Module-level ``LiftOver(hardcoded_path)`` call replaced by a lazy
singleton; ``import pycmplot`` no longer raises ``FileNotFoundError``.
- Hardcoded ``/vast/awonkam1/...`` resourc