pycmplot.io

Functions for loading, validating, and pre-processing GWAS summary statistics files. Handles delimiter auto-detection (whitespace, tab, comma), gzip decompression, and resolution of column-name variants to the canonical set used throughout the package.

The primary entry point for the plotting pipeline is get_sumstats_and_merged_sector_list(), which loads all tracks, runs coordinate liftover when needed, extracts lead SNPs, generates the hits summary table, and computes the merged Circos sector-size dictionary, all in a single call.

Notes

This module is called automatically by the command-line entry point and by pycmplot._core.main(); most users will not need to import it directly. It is documented here for users who wish to load and pre-process summary statistics programmatically before passing them to the plotting functions.

pycmplot.io.auto_thin_for_manhattan(df: DataFrame, keep_threshold: float = 2.0, max_below: int = 200000, logp: bool = True, logp_col: str = 'logP', p_col: str = 'P', seed: int = 42) → DataFrame[source]

Density-aware sub-sampling for Manhattan-style scatter plots.

Inspired by gwaslab’s default behaviour, this helper keeps every variant whose “interestingness” signal is at or above keep_threshold (so peaks, suggestive hits, genome-wide-significant hits, and extreme selection-scan values are kept verbatim) and uniformly sub-samples the dense bulk below the threshold down to at most max_below rows. For a 10 M-variant scan with the defaults below, this cuts the below-threshold bulk from nearly 10 M points to 200 K, while every variant at or above the threshold is still drawn. The plot is visually indistinguishable above the suggestive band, and the bulk to render shrinks by a factor of about 50.

Two modes, switched by logp:

P-value mode (logp=True, the default). signal = -log10(P). keep_threshold is in -log10(P) units (default 2.0, i.e. P <= 0.01). Variants with -log10(P) >= keep_threshold are all retained.
Raw-statistic mode (logp=False). signal = |value| of p_col — the column carrying the test statistic. This is the right mode for non-p-value scans such as iHS, XP-EHH, F_ST, Fay & Wu’s H, Tajima’s D, etc., where “interesting” means large magnitude (positive or negative). keep_threshold is then in the units of the underlying statistic (default still 2.0, which is a sensible cutoff for standardised selection scans; override with e.g. 0.05 for F_ST).
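
The selection rule behind both modes can be sketched in plain Python. This is an illustration only, not pycmplot's implementation: thin_indices is a hypothetical helper that works on a list of values rather than a DataFrame, returns row indices instead of rows, and assumes every p-value is strictly positive.

```python
import math
import random


def thin_indices(values, keep_threshold=2.0, max_below=200_000, logp=True, seed=42):
    """Return sorted row indices to keep (hypothetical helper, not part of pycmplot)."""
    if logp:
        # P-value mode: the signal is -log10(P); assumes every P is > 0.
        signal = [-math.log10(p) for p in values]
    else:
        # Raw-statistic mode: the signal is the magnitude of the statistic.
        signal = [abs(v) for v in values]

    keep = [i for i, s in enumerate(signal) if s >= keep_threshold]
    below = [i for i, s in enumerate(signal) if s < keep_threshold]

    # Sub-sample the bulk only when it exceeds the cap.
    if len(below) > max_below:
        below = random.Random(seed).sample(below, max_below)

    # Sorting restores the original row order.
    return sorted(keep + below)
```

Seeding a private random.Random instance keeps the sub-sample reproducible without touching the global RNG state.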

Parameters:

df (pandas.DataFrame) – Input DataFrame. In p-value mode, must contain either logp_col (preferred) or p_col. In raw-statistic mode, must contain p_col. When the relevant column is absent, df is returned unchanged.
keep_threshold (float, optional) – Threshold at or above which all variants are retained. Interpreted in -log10(P) units when logp=True (default 2.0), or in the natural units of the underlying statistic when logp=False.
max_below (int, optional) – Maximum number of below-threshold rows to retain, sampled uniformly at random. Default 200_000.
logp (bool, optional) – When True (default), interpret the data as p-values and use -log10(P) as the signal. When False, treat p_col as a raw statistic and use |value| as the signal.
logp_col (str, optional) – Name of the precomputed -log10(P) column for p-value mode. Default 'logP'.
p_col (str, optional) – Name of the raw p-value column (p-value mode) or test-statistic column (raw-statistic mode). Default 'P'.
seed (int, optional) – Seed for the RNG used to sub-sample the bulk. Default 42.

Returns:

Sub-sampled view of df preserving the original index ordering. When the below-threshold count is already <= max_below, the input is returned unchanged.

Return type:

pandas.DataFrame

Examples

GWAS p-values (default):

>>> thinned = auto_thin_for_manhattan(df, keep_threshold=2.0)

iHS / XP-EHH (signed selection statistics, |value| >= 2):

>>> thinned = auto_thin_for_manhattan(
...     df, logp=False, keep_threshold=2.0, p_col="iHS",
... )

F_ST (unsigned, 0–1, outlier cutoff e.g. 0.05):

>>> thinned = auto_thin_for_manhattan(
...     df, logp=False, keep_threshold=0.05, p_col="FST",
... )

pycmplot.io.detect_delimiter(file_path: str, sample_size: int = 5000)[source]

Infer the field delimiter of a summary statistics file automatically.

Reads the first sample_size bytes of file_path and passes the content to csv.Sniffer. Falls back to a character-frequency heuristic (testing ',', '\t', ' ', ';', '|') if csv.Sniffer raises csv.Error.
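
The sniff-then-fallback pattern described above might look like the following sketch. It operates on an already-read text sample instead of a file, and the names sniff_delimiter and fallback_delimiter are hypothetical. The tie-breaking rule in the fallback (earliest candidate wins) is an assumption, not documented behaviour.

```python
import csv

CANDIDATES = [",", "\t", " ", ";", "|"]


def fallback_delimiter(sample):
    """Pick the candidate character that occurs most often in the sample."""
    # Assumption: ties resolve to the earliest candidate in CANDIDATES.
    return max(CANDIDATES, key=sample.count)


def sniff_delimiter(sample):
    """Return (delimiter, dialect); dialect is None when the fallback was used."""
    try:
        dialect = csv.Sniffer().sniff(sample, delimiters="".join(CANDIDATES))
        return dialect.delimiter, dialect
    except csv.Error:
        return fallback_delimiter(sample), None
```

Restricting csv.Sniffer to the candidate set keeps it from choosing characters that are common in the data but never used as separators.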

Parameters:

file_path (str or pathlib.Path) – Path to the summary statistics file. Gzip-compressed files (.gz) are supported transparently via smart_open().
sample_size (int, optional) – Number of bytes to read for delimiter detection. Default is 5000.

Returns:

delimiter (str) – The inferred single-character field separator (e.g. '\t', ',', ' ').
dialect (csv.Dialect or None) – The csv.Dialect object returned by csv.Sniffer, or None when the fallback heuristic was used.

Examples

>>> from pycmplot.io import detect_delimiter
>>> delim, dialect = detect_delimiter("HbF.tsv.gz")
>>> delim
'\t'

pycmplot.io.generate_random_string(length)[source]

Generate a random alphanumeric string.

Used internally to create a unique output file-name component when no --plot_title is provided.
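
A string of this kind can be produced with the standard library alone. The sketch below shows one common way to do it; random_alnum is a hypothetical name, and the actual implementation may differ.

```python
import random
import string

ALPHABET = string.ascii_letters + string.digits  # [A-Za-z0-9]


def random_alnum(length):
    """Return `length` characters drawn uniformly from [A-Za-z0-9]."""
    return "".join(random.choices(ALPHABET, k=length))
```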

Parameters:

length (int) – Number of characters in the output string.

Returns:

Random string drawn from ASCII letters (upper- and lower-case) and digits ([A-Za-z0-9]).

Return type:

str

Examples

>>> from pycmplot.io import generate_random_string
>>> s = generate_random_string(10)
>>> len(s)
10

pycmplot.io.get_file_header(file_path: str, delim: str | None = None, dialect=None) → list[str][source]

Read and return the column names from the header line of a file.

Opens file_path, reads the first row using csv.DictReader configured with the supplied delimiter or dialect, and returns the field names as an ordered list of strings.
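
As a rough sketch of this behaviour, the hypothetical helper read_header below reads the header with csv.DictReader and gives delim priority over dialect. It detects gzip from the .gz extension only, which is an assumption. The package itself opens files through smart_open(), which is not reproduced here.

```python
import csv
import gzip


def read_header(path, delim=None, dialect=None):
    """Return the header fields of a delimited text file (hypothetical helper)."""
    # Assumption: gzip compression is detected from the file extension alone.
    opener = gzip.open if str(path).endswith(".gz") else open
    with opener(path, "rt", newline="") as fh:
        if delim is not None:
            # An explicit delimiter takes priority over the dialect.
            reader = csv.DictReader(fh, delimiter=delim)
        elif dialect is not None:
            reader = csv.DictReader(fh, dialect=dialect)
        else:
            reader = csv.DictReader(fh)  # csv default: comma-separated
        # fieldnames is None for an empty file; return [] in that case.
        return list(reader.fieldnames or [])
```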

Parameters:

file_path (str or pathlib.Path) – Path to the summary statistics file (plain text or .gz).
delim (str, optional) – Field separator character (e.g. '\t'). Takes priority over dialect when both are provided.
dialect (csv.Dialect, optional) – A csv.Dialect object (e.g. as returned by detect_delimiter()). Used only when delim is None.

Returns:

Ordered list of column names exactly as they appear in the file header. Returns an empty list and logs a warning if the header cannot be determined.

Return type:

list of str

Examples

>>> from pycmplot.io import detect_delimiter, get_file_header
>>> delim, dialect = detect_delimiter("HbF.tsv.gz")
>>> header = get_file_header("HbF.tsv.gz", delim=delim)
>>> header[:4]
['CHR', 'POS', 'SNP', 'P']

pycmplot.io.get_output_paths(labels, mode: str | None = 'lm', logp: bool = False, output_dir: str | None = None, plot_title: str | None = None, output_format: str | None = 'png')[source]

Construct output file paths for the plot image and locus summary table.

Creates output_dir (including any missing parent directories) and derives deterministic, human-readable file names from the plot title, track labels, plot mode, and y-axis scale.
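
The naming scheme can be approximated as follows. This is a sketch under stated assumptions: the order of the name components is inferred from the documented example output, sketch_plot_paths is a hypothetical helper, and the locus-table path is left out because its suffix is not documented here.

```python
import re
from pathlib import Path


def sketch_plot_paths(labels, mode="lm", logp=False, output_dir=".",
                      plot_title="plot", output_format="png"):
    """Return (plot_path, base_path) as absolute strings (hypothetical helper)."""
    out_dir = Path(output_dir)
    out_dir.mkdir(parents=True, exist_ok=True)

    # Strip everything except letters, digits and spaces, then use underscores.
    safe_title = re.sub(r"[^A-Za-z0-9 ]", "", plot_title).replace(" ", "_")
    scale = "_logp" if logp else "_pval"
    # Assumed component order: title, labels, mode, then the scale suffix.
    stem = "_".join([safe_title, *labels, mode]) + scale

    base = out_dir.resolve() / stem
    return f"{base}.{output_format}", str(base)
```

Because the name is built only from the arguments, repeated runs with the same title and labels write to the same files.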

Parameters:

labels (list of str) – Track labels joined with underscores in the output file name.
mode ({'lm', 'cm'}, optional) – Plot mode: 'lm' for linear Manhattan, 'cm' for circular. Default is 'lm'.
logp (bool, optional) – When True the string '_logp' is appended to the base name; otherwise '_pval' is appended. Default is False.
output_dir (str or pathlib.Path, optional) – Directory in which output files will be written. Created with mkdir(parents=True, exist_ok=True) if it does not already exist. When None (the default), the current directory '.' is used.
plot_title (str, optional) – Human-readable plot title. Non-alphanumeric characters are stripped and spaces replaced with underscores for safe use in file names. When None a 10-character random alphanumeric string is used instead.
output_format (str, optional) – Image file extension without the leading dot (e.g. 'png', 'pdf', 'svg'). Default is 'png'.

Returns:

plt_name (str) – Absolute path to the output plot image file.
table_out (str) – Absolute path to the output locus summary table TSV file.
plt_base (str) – Absolute path base (no extension) used to derive the QQ-plot output stems.

Examples

>>> from pycmplot.io import get_output_paths
>>> plt_name, table_out, plt_base = get_output_paths(
...     labels=["HbF", "MCV"],
...     mode="lm",
...     logp=True,
...     output_dir="./results",
...     plot_title="RBC Traits",
... )
>>> plt_name
'.../results/RBC_Traits_HbF_MCV_lm_logp.png'

pycmplot.io.get_sumstats_and_merged_sector_list(sum_stats: list[str], labels: list[str], logp: bool = False, trim_pval: float | None = None, file_info: dict | None = None, sort_tracks: str | None = None, table_out: str | None = None, signif_threshold: float | None = None, signif_line: float | None = None, suggest_threshold: float | None = None, highlight: bool | None = False, highlight_thresh: float | None = 5e-08, resources: ResourceConfig | None = None, compute_pvals: bool = True, auto_thin: bool = True, auto_thin_threshold: float = 2.0, auto_thin_max_below: int = 200000)[source]

Load summary statistics, run liftover, extract lead SNPs, and compute merged Circos sector sizes.

This is the primary data-loading function for the plotting pipeline. For each track it reads the file using the column mapping from file_info, optionally filters by trim_pval, normalises chromosome names (chr prefix stripped; 23 → X, 24 → Y, M / MTDNA → MT), lifts over hg19 coordinates when a build column is present, and extracts lead SNPs. After all tracks are loaded it builds the hits summary table, derives significance thresholds, optionally sorts tracks, and computes the merged sector-size dict consumed by both plotters.
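
The chromosome-name normalisation step can be illustrated with a small standalone function. normalise_chrom is a hypothetical name, and the case-insensitive matching is an assumption that goes beyond what is documented above.

```python
def normalise_chrom(chrom):
    """Map a raw chromosome label to its canonical form (hypothetical helper)."""
    name = str(chrom).strip()
    # Strip a leading "chr" prefix (assumption: matched case-insensitively).
    if name.lower().startswith("chr"):
        name = name[3:]
    name = name.upper()
    # Numeric sex-chromosome codes and mitochondrial aliases.
    aliases = {"23": "X", "24": "Y", "M": "MT", "MTDNA": "MT"}
    return aliases.get(name, name)
```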

Parameters:

sum_stats (list of str) – Paths to summary statistics files (gzip supported).
labels (list of str) – Track labels in the same order as sum_stats.
logp (bool, optional) – If True, a logP column (–log₁₀(P)) is added to every loaded DataFrame and used for lead-SNP ranking and threshold-line computation. Default is False.
trim_pval (float, optional) – Drop variants with P > trim_pval before any further processing. Strongly recommended for large files (e.g. 0.01). Default is None (no trimming; variants with P > 1 are still removed).
file_info (dict, optional) – Column-resolution mapping as returned by prep_pycmplot_input_info(). Must be supplied for data to be loaded.
sort_tracks ({'label', 'chrom_len', None}, optional) – Track ordering after loading. 'label' sorts alphabetically; 'chrom_len' sorts by the number of distinct chromosomes (most chromosomes first). None preserves input order. Default is None.
table_out (str, optional) – File path at which to write the locus summary table TSV. Passed through to get_hits_summary_table().
signif_threshold (float, optional) – Genome-wide significance threshold for lead-SNP extraction and the significance line. When None, computed as max(0.05 / N, 5e-8) where N is the variant count in the last loaded track; falls back to 5e-8 when trim_pval is set.
signif_line (float, optional) – Explicit significance-line value drawn on the plot. When None, signif_threshold is used. If logp is True and the value is < 1, it is converted to –log₁₀ scale automatically.
suggest_threshold (float, optional) – Suggestive significance threshold for a second dashed line. Defaults to 1e-5.
resources (ResourceConfig, optional) – ResourceConfig instance supplying paths to the liftover chain file and gene-info reference files. Falls back to default_resources.
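
The default-threshold rule for signif_threshold and the scale conversion for signif_line can be written out as a short sketch. Both function names are hypothetical, and the guard for a non-positive variant count is an added assumption.

```python
import math


def default_signif_threshold(n_variants, trim_pval=None):
    """Documented default: max(0.05 / N, 5e-8), or 5e-8 when trimming is on."""
    # Assumption: an empty track also falls back to 5e-8.
    if trim_pval is not None or n_variants <= 0:
        return 5e-8
    return max(0.05 / n_variants, 5e-8)


def line_value(threshold, logp):
    """Convert a p-value threshold to -log10 scale when plotting on that scale."""
    if logp and threshold < 1:
        return -math.log10(threshold)
    return threshold
```

With 1,000 variants the Bonferroni term 0.05 / N gives 5e-5; with 10 million variants it would give 5e-9, so the 5e-8 floor applies instead.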

Returns:

A dictionary with the following keys:

'sectors' — dict mapping chromosome → [min_pos, max_pos] across all tracks, in natural chromosome order ('1', '2', …, 'X', 'Y'), with a 'Spacer1' entry appended for y-axis labelling.
'dfs' — dict mapping label → [DataFrame, n_chroms]. Each DataFrame contains canonical columns CHR, POS, SNP, P, LABEL and optionally logP, BUILD, OLD_POS, OLD_BUILD (when a build column and liftover were applied).
'annot' — pandas.DataFrame containing the clumped locus summary with nearest-gene annotations. Empty when no variants pass the significance threshold.
'lines' — list of {'genome': float, 'suggestive': float} dicts, one per track, in the final sorted order.
'pvals' — dict mapping label → numpy.ndarray of raw (un-trimmed) p-values for QQ plotting.

Return type:

dict
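
To make the shape of the 'sectors' entry concrete, the sketch below merges per-track positions into a {chromosome: [min_pos, max_pos]} dict in natural chromosome order. It uses plain (chrom, pos) tuples instead of DataFrames, and merge_sectors and chrom_sort_key are hypothetical names. The 'Spacer1' entry is omitted because its value is not documented here, and placing MT after Y is an assumption.

```python
def chrom_sort_key(chrom):
    """Numeric chromosomes first in numeric order, then X, Y, MT, then others."""
    if chrom.isdigit():
        return (0, int(chrom), "")
    rank = {"X": 0, "Y": 1, "MT": 2}.get(chrom, 3)
    return (1, rank, chrom)


def merge_sectors(tracks):
    """Merge per-track (chrom, pos) pairs into {chrom: [min_pos, max_pos]}."""
    sectors = {}
    for rows in tracks.values():
        for chrom, pos in rows:
            if chrom in sectors:
                lo, hi = sectors[chrom]
                sectors[chrom] = [min(lo, pos), max(hi, pos)]
            else:
                sectors[chrom] = [pos, pos]
    # Re-insert in natural chromosome order (dicts preserve insertion order).
    return {c: sectors[c] for c in sorted(sectors, key=chrom_sort_key)}
```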