pycmplot.io

Handles loading and pre-processing of summary statistics files. Auto-detects delimiters and resolves non-standard column names to the canonical names used internally by pycmplot.

Functions for loading, validating, and pre-processing GWAS summary statistics files. Handles delimiter auto-detection (whitespace, tab, comma), gzip decompression, and resolution of column-name variants to the canonical set used throughout the package.

The primary entry point for the plotting pipeline is get_sumstats_and_merged_sector_list(), which loads all tracks, runs coordinate liftover when needed, extracts lead SNPs, generates the hits summary table, and computes the merged Circos sector-size dictionary — all in a single call.

Notes

This module is called automatically by the command-line entry point and by pycmplot._core.main(); most users will not need to import it directly. It is documented here for users who wish to load and pre-process summary statistics programmatically before passing them to the plotting functions.

pycmplot.io.auto_thin_for_manhattan(df: DataFrame, keep_threshold: float = 2.0, max_below: int = 200000, logp: bool = True, logp_col: str = 'logP', p_col: str = 'P', seed: int = 42) -> DataFrame

Density-aware sub-sampling for Manhattan-style scatter plots.

Inspired by gwaslab’s default behaviour, this helper preserves every variant whose “interestingness” signal is at or above keep_threshold (so peaks, suggestive hits, genome-wide-significant hits, and extreme selection-scan values are kept verbatim) and uniformly sub-samples the dense bulk below the threshold down to at most max_below rows. For a 10 M-variant scan with the defaults below, this typically cuts the plotted point count from 10 M to about 200 K below-threshold points plus every variant at or above the threshold (on the order of 100 K when p-values are roughly uniform, since about 1% of them fall at or below 0.01). The result is visually indistinguishable above the suggestive band and leaves far fewer points to render.

Two modes, switched by logp:

  • P-value mode (logp=True, the default). signal = -log10(P). keep_threshold is in -log10(P) units (default 2.0, i.e. P <= 0.01). Variants with -log10(P) >= keep_threshold are all retained.

  • Raw-statistic mode (logp=False). signal = |value| of p_col — the column carrying the test statistic. This is the right mode for non-p-value scans such as iHS, XP-EHH, F_ST, Fay & Wu’s H, Tajima’s D, etc., where “interesting” means large magnitude (positive or negative). keep_threshold is then in the units of the underlying statistic (default still 2.0, which is a sensible cutoff for standardised selection scans; override with e.g. 0.05 for F_ST).

Parameters:
  • df (pandas.DataFrame) – Input DataFrame. In p-value mode, must contain either logp_col (preferred) or p_col. In raw-statistic mode, must contain p_col. When the relevant column is absent, df is returned unchanged.

  • keep_threshold (float, optional) – Threshold above which all variants are retained. Interpreted in -log10(P) units when logp=True (default 2.0), or in the natural units of the underlying statistic when logp=False.

  • max_below (int, optional) – Maximum number of below-threshold rows to retain, sampled uniformly at random. Default 200_000.

  • logp (bool, optional) – When True (default), interpret the data as p-values and use -log10(P) as the signal. When False, treat p_col as a raw statistic and use |value| as the signal.

  • logp_col (str, optional) – Name of the precomputed -log10(P) column for p-value mode. Default 'logP'.

  • p_col (str, optional) – Name of the raw p-value column (p-value mode) or test-statistic column (raw-statistic mode). Default 'P'.

  • seed (int, optional) – Seed for the RNG used to sub-sample the bulk. Default 42.

Returns:

Sub-sampled view of df preserving the original index ordering. When the below-threshold count is already <= max_below, the input is returned unchanged.

Return type:

pandas.DataFrame

Examples

GWAS p-values (default):

>>> thinned = auto_thin_for_manhattan(df, keep_threshold=2.0)

iHS / XP-EHH (signed selection statistics, |value| >= 2):

>>> thinned = auto_thin_for_manhattan(
...     df, logp=False, keep_threshold=2.0, p_col="iHS",
... )

F_ST (unsigned, 0–1, outlier cutoff e.g. 0.05):

>>> thinned = auto_thin_for_manhattan(
...     df, logp=False, keep_threshold=0.05, p_col="FST",
... )
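
The selection rule described above can be sketched in plain Python. This is an illustrative sketch only, not pycmplot's implementation: the helper name thin_indices is hypothetical, and it works on a list of signal values rather than a DataFrame.

```python
import math
import random


def thin_indices(signals, keep_threshold=2.0, max_below=200_000, seed=42):
    """Hypothetical helper: return the row indices to keep.

    Every signal >= keep_threshold is kept; the rest is sampled uniformly
    at random down to at most max_below entries.
    """
    above = [i for i, s in enumerate(signals) if s >= keep_threshold]
    below = [i for i, s in enumerate(signals) if s < keep_threshold]
    if len(below) <= max_below:
        return list(range(len(signals)))  # nothing to thin
    rng = random.Random(seed)  # seeded for reproducible plots
    sampled = rng.sample(below, max_below)
    return sorted(above + sampled)  # keep the original row order


# P-value mode: signal = -log10(P)
pvals = [1e-9, 0.5, 0.2, 0.9, 0.003, 0.7, 0.04, 0.6]
kept = thin_indices([-math.log10(p) for p in pvals], max_below=2)

# Raw-statistic mode: signal = |value|
ihs = [-3.1, 0.4, 2.5, -0.2, 1.1, -1.9]
kept_raw = thin_indices([abs(v) for v in ihs], max_below=1)
```

With a DataFrame, the same list of positions could be applied with df.iloc[kept].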

pycmplot.io.detect_delimiter(file_path: str, sample_size: int = 5000)

Infer the field delimiter of a summary statistics file automatically.

Reads the first sample_size bytes of file_path and passes the content to csv.Sniffer. Falls back to a character-frequency heuristic (testing ',', '\t', ' ', ';', '|') if csv.Sniffer raises csv.Error.

Parameters:
  • file_path (str or pathlib.Path) – Path to the summary statistics file. Gzip-compressed files (.gz) are supported transparently via smart_open().

  • sample_size (int, optional) – Number of bytes to read for delimiter detection. Default is 5000.

Returns:

  • delimiter (str) – The inferred single-character field separator (e.g. '\t', ',', ' ').

  • dialect (csv.Dialect or None) – The csv.Dialect object returned by csv.Sniffer, or None when the fallback heuristic was used.

Examples

>>> from pycmplot.io import detect_delimiter
>>> delim, dialect = detect_delimiter("HbF.tsv.gz")
>>> delim
'\t'
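
The two-stage strategy (csv.Sniffer first, character frequency as a fallback) might look roughly like the sketch below. The function names are hypothetical and the tie-breaking rule in the fallback is an assumption; the real heuristic may differ.

```python
import csv

CANDIDATES = [",", "\t", " ", ";", "|"]


def frequency_fallback(sample):
    """Assumed fallback: pick the candidate character that occurs most often."""
    counts = {c: sample.count(c) for c in CANDIDATES}
    return max(counts, key=counts.get)


def sniff_delimiter(sample):
    """Return (delimiter, dialect); dialect is None when the fallback is used."""
    try:
        dialect = csv.Sniffer().sniff(sample)
        return dialect.delimiter, dialect
    except csv.Error:
        return frequency_fallback(sample), None


sample = "CHR\tPOS\tSNP\tP\n1\t12345\trs1\t0.5\n2\t67890\trs2\t0.01\n"
delim, dialect = sniff_delimiter(sample)
```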

pycmplot.io.generate_random_string(length)

Generate a random alphanumeric string.

Used internally to create a unique output file-name component when no --plot_title is provided.

Parameters:

length (int) – Number of characters in the output string.

Returns:

Random string drawn from ASCII letters (upper- and lower-case) and digits ([A-Za-z0-9]).

Return type:

str

Examples

>>> from pycmplot.io import generate_random_string
>>> s = generate_random_string(10)
>>> len(s)
10
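
One common way to build such a string with the standard library is shown below. It is a sketch under the assumption that no cryptographic strength is needed; random_alnum is a hypothetical name, not the package function.

```python
import random
import string

ALPHABET = string.ascii_letters + string.digits  # [A-Za-z0-9]


def random_alnum(length):
    """Hypothetical equivalent: draw `length` characters with replacement."""
    return "".join(random.choices(ALPHABET, k=length))


token = random_alnum(10)
```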

pycmplot.io.get_file_header(file_path: str, delim: str | None = None, dialect=None) -> list[str]

Read and return the column names from the header line of a file.

Opens file_path, reads the first row using csv.DictReader configured with the supplied delimiter or dialect, and returns the field names as an ordered list of strings.

Parameters:
  • file_path (str or pathlib.Path) – Path to the summary statistics file (plain text or .gz).

  • delim (str, optional) – Field separator character (e.g. '\t'). Takes priority over dialect when both are provided.

  • dialect (csv.Dialect, optional) – A csv.Dialect object (e.g. as returned by detect_delimiter()). Used only when delim is None.

Returns:

Ordered list of column names exactly as they appear in the file header. Returns an empty list and logs a warning if the header cannot be determined.

Return type:

list of str

Examples

>>> from pycmplot.io import detect_delimiter, get_file_header
>>> delim, dialect = detect_delimiter("HbF.tsv.gz")
>>> header = get_file_header("HbF.tsv.gz", delim=delim)
>>> header[:4]
['CHR', 'POS', 'SNP', 'P']
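
The precedence between delim and dialect can be illustrated with csv.DictReader directly. The sketch below reads from an already open text handle instead of a path, and read_header is a hypothetical name; falling back to the csv default (comma) when neither argument is given is an assumption.

```python
import csv
import io


def read_header(handle, delim=None, dialect=None):
    """Return the column names from the first line of an open text handle."""
    if delim is not None:
        reader = csv.DictReader(handle, delimiter=delim)  # delim takes priority
    elif dialect is not None:
        reader = csv.DictReader(handle, dialect=dialect)
    else:
        reader = csv.DictReader(handle)  # assumed default: comma-separated
    # fieldnames is None for an empty file, so fall back to an empty list.
    return list(reader.fieldnames or [])


sample = io.StringIO("CHR\tPOS\tSNP\tP\n1\t12345\trs1\t0.5\n")
header = read_header(sample, delim="\t")
print(header)  # ['CHR', 'POS', 'SNP', 'P']
```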

pycmplot.io.get_output_paths(labels, mode: str | None = 'lm', logp: bool = False, output_dir: str | None = None, plot_title: str | None = None, output_format: str | None = 'png')

Construct output file paths for the plot image and locus summary table.

Creates output_dir (including any missing parent directories) and derives deterministic, human-readable file names from the plot title, track labels, plot mode, and y-axis scale.

Parameters:
  • labels (list of str) – Track labels joined with underscores in the output file name.

  • mode ({'lm', 'cm'}, optional) – Plot mode: 'lm' for linear Manhattan, 'cm' for circular. Default is 'lm'.

  • logp (bool, optional) – When True the string '_logp' is appended to the base name; otherwise '_pval' is appended. Default is False.

  • output_dir (str or pathlib.Path, optional) – Directory in which output files will be written. Created with mkdir(parents=True, exist_ok=True) if it does not already exist. Default is '.'.

  • plot_title (str, optional) – Human-readable plot title. Non-alphanumeric characters are stripped and spaces replaced with underscores for safe use in file names. When None a 10-character random alphanumeric string is used instead.

  • output_format (str, optional) – Image file extension without the leading dot (e.g. 'png', 'pdf', 'svg'). Default is 'png'.

Returns:

  • plt_name (str) – Absolute path to the output plot image file.

  • table_out (str) – Absolute path to the output locus summary table TSV file.

  • plt_base (str) – Absolute path base (no extension) used to derive the QQ-plot output stems.

Examples

>>> from pycmplot.io import get_output_paths
>>> plt_name, table_out, plt_base = get_output_paths(
...     labels=["HbF", "MCV"],
...     mode="lm",
...     logp=True,
...     output_dir="./results",
...     plot_title="RBC Traits",
... )
>>> plt_name
'.../results/RBC_Traits_HbF_MCV_lm_logp.png'
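
The naming scheme for the plot file can be approximated as follows. This sketch covers only the plot image path; sketch_plot_path and its exact sanitising regex are assumptions chosen to reproduce the file name shown in the example above.

```python
import re
import tempfile
from pathlib import Path


def sketch_plot_path(labels, mode="lm", logp=False, output_dir=".",
                     plot_title="plot", output_format="png"):
    """Hypothetical helper: build <title>_<labels>_<mode>_<scale>.<ext>."""
    out_dir = Path(output_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    # Drop everything except letters, digits, and spaces, then use underscores.
    safe_title = re.sub(r"[^A-Za-z0-9 ]", "", plot_title).strip().replace(" ", "_")
    scale = "logp" if logp else "pval"
    base = "_".join([safe_title, *labels, mode, scale])
    return str((out_dir / f"{base}.{output_format}").resolve())


with tempfile.TemporaryDirectory() as tmp:
    plot_path = sketch_plot_path(
        ["HbF", "MCV"], mode="lm", logp=True,
        output_dir=tmp, plot_title="RBC Traits",
    )
print(Path(plot_path).name)  # RBC_Traits_HbF_MCV_lm_logp.png
```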

pycmplot.io.get_sumstats_and_merged_sector_list(sum_stats: list[str], labels: list[str], logp: bool = False, trim_pval: float | None = None, file_info: dict | None = None, sort_tracks: str | None = None, table_out: str | None = None, signif_threshold: float | None = None, signif_line: float | None = None, suggest_threshold: float | None = None, highlight: bool | None = False, highlight_thresh: float | None = 5e-08, resources: ResourceConfig | None = None, compute_pvals: bool = True, auto_thin: bool = True, auto_thin_threshold: float = 2.0, auto_thin_max_below: int = 200000)

Load summary statistics, run liftover, extract lead SNPs, and compute merged Circos sector sizes.

This is the primary data-loading function for the plotting pipeline. For each track it reads the file using the column mapping from file_info, optionally filters by trim_pval, normalises chromosome names (chr prefix stripped; 23 → X, 24 → Y, M / MTDNA → MT), lifts over hg19 coordinates when a build column is present, and extracts lead SNPs. After all tracks are loaded it builds the hits summary table, derives significance thresholds, optionally sorts tracks, and computes the merged sector-size dict consumed by both plotters.
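
The chromosome-name normalisation step can be illustrated with a small stand-alone function. normalise_chrom is a hypothetical name, and the case-insensitive handling of the prefix and of the names is an assumption of this sketch.

```python
def normalise_chrom(value):
    """Hypothetical helper mirroring the rules listed above."""
    chrom = str(value).strip()
    if chrom.lower().startswith("chr"):
        chrom = chrom[3:]  # strip the 'chr' prefix
    chrom = chrom.upper()  # assumption: names are compared case-insensitively
    aliases = {"23": "X", "24": "Y", "M": "MT", "MTDNA": "MT"}
    return aliases.get(chrom, chrom)


examples = [normalise_chrom(c) for c in ["chr1", 23, "24", "chrM", "MTDNA", "X"]]
print(examples)  # ['1', 'X', 'Y', 'MT', 'MT', 'X']
```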

Parameters:
  • sum_stats (list of str) – Paths to summary statistics files (gzip supported).

  • labels (list of str) – Track labels in the same order as sum_stats.

  • logp (bool, optional) – If True, a logP column (–log₁₀(P)) is added to every loaded DataFrame and used for lead-SNP ranking and threshold-line computation. Default is False.

  • trim_pval (float, optional) – Drop variants with P > trim_pval before any further processing. Strongly recommended for large files (e.g. 0.01). Default is None (no trimming; variants with P > 1 are still removed).

  • file_info (dict, optional) – Column-resolution mapping as returned by prep_pycmplot_input_info(). Must be supplied for data to be loaded.

  • sort_tracks ({'label', 'chrom_len', None}, optional) – Track ordering after loading. 'label' sorts alphabetically; 'chrom_len' sorts by the number of distinct chromosomes (most chromosomes first). None preserves input order. Default is None, as in the signature above.

  • table_out (str, optional) – File path at which to write the locus summary table TSV. Passed through to get_hits_summary_table().

  • signif_threshold (float, optional) – Genome-wide significance threshold for lead-SNP extraction and the significance line. When None, computed as max(0.05 / N, 5e-8) where N is the variant count in the last loaded track; falls back to 5e-8 when trim_pval is set.

  • signif_line (float, optional) – Explicit significance-line value drawn on the plot. When None, signif_threshold is used. If logp is True and the value is < 1, it is converted to –log₁₀ scale automatically.

  • suggest_threshold (float, optional) – Suggestive significance threshold for a second dashed line. Defaults to 1e-5.

  • resources (ResourceConfig, optional) – ResourceConfig instance supplying paths to the liftover chain file and gene-info reference files. Falls back to default_resources.
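
The default threshold rules described for signif_threshold and signif_line can be written out as a short sketch. Both function names are hypothetical, and the guard for an empty track is an assumption that the documentation does not mention.

```python
import math


def default_signif_threshold(n_variants, trim_pval=None):
    """max(0.05 / N, 5e-8), or 5e-8 when trimming is enabled."""
    if trim_pval is not None or n_variants <= 0:  # n <= 0 guard is an assumption
        return 5e-8
    return max(0.05 / n_variants, 5e-8)


def line_value(threshold, logp):
    """Convert a p-value threshold to -log10 scale when plotting logP."""
    if logp and threshold < 1:
        return -math.log10(threshold)
    return threshold  # already on the plotted scale


small_scan = default_signif_threshold(1_000)          # 0.05 / 1000
trimmed = default_signif_threshold(1_000, trim_pval=0.01)
suggestive_line = line_value(1e-5, logp=True)
```

A value that is already on the -log10 scale (for example 7.3) is left unchanged, because it is not below 1.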

Returns:

A dictionary with the following keys:

  • 'sectors' – dict mapping chromosome -> [min_pos, max_pos] across all tracks, in natural chromosome order ('1', '2', …, 'X', 'Y'), with a 'Spacer1' entry appended for y-axis labelling.

  • 'dfs' – dict mapping label -> [DataFrame, n_chroms]. Each DataFrame contains canonical columns CHR, POS, SNP, P, LABEL and optionally logP, BUILD, OLD_POS, OLD_BUILD (when a build column and liftover were applied).

  • 'annot' – pandas.DataFrame containing the clumped locus summary with nearest-gene annotations. Empty when no variants pass the significance threshold.

  • 'lines' – list of {'genome': float, 'suggestive': float} dicts, one per track, in the final sorted order.

  • 'pvals' – dict mapping label -> numpy.ndarray of raw (un-trimmed) p-values for QQ plotting.

Return type:

dict

See also

prep_pycmplot_input_info

Resolves column names and delimiters; its output is passed as file_info.

pycmplot.annotation.get_hits_summary_table

Gene annotation and distance-based clumping of the locus table.

pycmplot.liftover.liftover_position

hg19 → hg38 coordinate conversion applied row-wise.

Examples

>>> from pycmplot.io import prep_pycmplot_input_info
>>> from pycmplot.io import get_sumstats_and_merged_sector_list
>>> files  = ["HbF.tsv.gz", "MCV.txt.gz"]
>>> labels = ["HbF", "MCV"]
>>> file_info = prep_pycmplot_input_info(files, labels)
>>> result = get_sumstats_and_merged_sector_list(
...     sum_stats=files,
...     labels=labels,
...     logp=True,
...     trim_pval=0.01,
...     file_info=file_info,
...     signif_threshold=5e-8,
... )
>>> sorted(result.keys())
['annot', 'dfs', 'lines', 'pvals', 'sectors']
>>> list(result["sectors"].keys())[:4]
['1', '2', '3', '4']
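
To make the shape of result["sectors"] concrete, the sketch below merges per-track positions into one chromosome -> [min_pos, max_pos] dict in natural chromosome order. The function names and the input layout are hypothetical, and the 'Spacer1' entry is left out.

```python
def chrom_sort_key(chrom):
    """Numeric chromosomes first (by value), then X, Y, MT, then the rest."""
    if chrom.isdigit():
        return (0, int(chrom), "")
    special = {"X": 0, "Y": 1, "MT": 2}
    return (1, special.get(chrom, 99), chrom)


def merge_sectors(tracks):
    """tracks: {label: [(chrom, pos), ...]} -> {chrom: [min_pos, max_pos]}"""
    merged = {}
    for rows in tracks.values():
        for chrom, pos in rows:
            lo, hi = merged.get(chrom, (pos, pos))
            merged[chrom] = [min(lo, pos), max(hi, pos)]
    return {c: merged[c] for c in sorted(merged, key=chrom_sort_key)}


tracks = {
    "HbF": [("2", 500), ("1", 100), ("X", 50)],
    "MCV": [("1", 900), ("10", 20), ("2", 300)],
}
sectors = merge_sectors(tracks)
print(list(sectors))  # ['1', '2', '10', 'X']
```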

pycmplot.io.prep_pycmplot_input_info(sum_stats: list[str], labels: list[str], build_column: str | None = None, build_list: list[str] = None, delim: str | None = None, chrom: str | None = None, pos: str | None = None, snp: str | None = None, pcol: str | None = None)

Resolve column names and delimiters for each summary statistics file.

Iterates over every file in sum_stats, auto-detects (or uses the supplied) delimiter, reads the file header, and maps each required column (chromosome, position, SNP ID, p-value, genome build) to the first matching entry in an ordered candidate-name list. Returns a per-label mapping that tells get_sumstats_and_merged_sector_list() exactly which columns to read and how to rename them.

Parameters:
  • sum_stats (list of str) – Paths to one or more summary statistics files (gzip supported).

  • labels (list of str) – Track labels in the same order as sum_stats.

  • build_column (str, optional) – Genome-build column name (candidates: 'BUILD', 'Genome', 'Genome_Build', 'Genome-build', …). Alternatively, genome builds can be supplied as a list via --build.

  • build_list (list, optional) – List of genome builds, in the same order as sum_stats and labels.

  • delim (str, optional) – Field delimiter shared by all files. Accepts human-readable names ('tab', 'space', 'comma') or single characters. When None the delimiter is auto-detected independently for each file using detect_delimiter().

  • chrom (str, optional) – Chromosome column name. When None, the first header field that matches any built-in candidate ('CHR', 'CHROM', '#CHROM', 'chrom', 'chr', …) is used.

  • pos (str, optional) – Base-pair position column name (candidates: 'BP', 'POS', 'bp', 'pos', 'Basepair').

  • snp (str, optional) – Variant / marker ID column name (candidates: 'SNP', 'RSID', 'rsID', 'MarkerName', 'MarkerID', 'SNPID', 'ID', …).

  • pcol (str, optional) – P-value column name (candidates: 'P', 'P-value', 'pvalue', 'p_val', 'pval', 'Wald_P').

Returns:

Mapping of label -> [old_columns, col_dtypes, rename_map, sep]:

  • old_columns – list of the five original column names as found in the file header.

  • col_dtypes – {column_name: dtype} passed to pandas.read_csv().

  • rename_map – {old_name: canonical_name} for CHR, POS, SNP, P, BUILD.

  • sep – the resolved delimiter character for this file.

Return type:

dict

Raises:

SystemExit – If any required column (chromosome, position, SNP ID, p-value, or build) cannot be resolved from the file header.

See also

get_sumstats_and_merged_sector_list

The main loading function that consumes the mapping returned here.

detect_delimiter

Auto-detects the file delimiter when delim is None.
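
The candidate-based column resolution can be sketched as follows. The candidate lists contain only the names quoted above (the built-in lists are longer), the BUILD column is omitted, and resolve_columns is a hypothetical helper that raises KeyError where the real function exits.

```python
CANDIDATES = {
    "CHR": ["CHR", "CHROM", "#CHROM", "chrom", "chr"],
    "POS": ["BP", "POS", "bp", "pos", "Basepair"],
    "SNP": ["SNP", "RSID", "rsID", "MarkerName", "MarkerID", "SNPID", "ID"],
    "P": ["P", "P-value", "pvalue", "p_val", "pval", "Wald_P"],
}


def resolve_columns(header, overrides=None):
    """Return {old_name: canonical_name} for the columns found in header."""
    overrides = overrides or {}
    rename_map = {}
    for canonical, names in CANDIDATES.items():
        wanted = overrides.get(canonical)
        if wanted is not None:
            match = wanted if wanted in header else None  # explicit name wins
        else:
            # First candidate, in list order, that is present in the header.
            match = next((name for name in names if name in header), None)
        if match is None:
            raise KeyError(f"no column found for {canonical}")
        rename_map[match] = canonical
    return rename_map


header = ["#CHROM", "BP", "MarkerName", "P-value", "BETA"]
rename_map = resolve_columns(header)
```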

pycmplot.io.resolve_delimiter(delim: str) -> str

Map a human-readable delimiter name to its single-character representation.

Parameters:

delim (str) – A delimiter name — one of 'space', 'tab', 'comma', 'colon', 'semi-colon', 'semicolon' — or a single bare character (e.g. '|'). Matching is case-insensitive.

Returns:

The corresponding single-character separator string.

Return type:

str

Raises:
  • TypeError – If delim is not a string.

  • ValueError – If delim is neither a recognised name nor a single character.

Examples

>>> from pycmplot.io import resolve_delimiter
>>> resolve_delimiter("tab")
'\t'
>>> resolve_delimiter(",")
','
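
A lookup table is enough to reproduce the documented behaviour. The sketch below is an assumption about the structure, not the package source; to_separator is a hypothetical name.

```python
NAMED = {
    "space": " ",
    "tab": "\t",
    "comma": ",",
    "colon": ":",
    "semi-colon": ";",
    "semicolon": ";",
}


def to_separator(delim):
    """Map a delimiter name (case-insensitive) or bare character to a separator."""
    if not isinstance(delim, str):
        raise TypeError("delim must be a string")
    name = delim.lower()
    if name in NAMED:
        return NAMED[name]
    if len(delim) == 1:
        return delim  # already a single character, e.g. '|'
    raise ValueError(f"unrecognised delimiter: {delim!r}")


resolved = [to_separator(d) for d in ["TAB", "semi-colon", "|", " "]]
print(resolved)  # ['\t', ';', '|', ' ']
```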

pycmplot.io.smart_open(file_path: str)

Open a plain-text or gzip-compressed file transparently.

Detects gzip compression from the .gz file suffix; all other paths are opened as plain text.

Parameters:

file_path (str or pathlib.Path) – Path to the file to open.

Returns:

An open, readable text-mode file object. Must be used as a context manager (with smart_open(...) as f: ...).

Return type:

io.TextIOWrapper or gzip.GzipFile

Examples

>>> from pycmplot.io import smart_open
>>> with smart_open("HbF.tsv.gz") as f:
...     header = f.readline()
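
Suffix-based dispatch of this kind is usually a few lines around gzip.open. The sketch below shows the idea with a hypothetical open_text helper; the explicit UTF-8 encoding is an assumption.

```python
import gzip
import tempfile
from pathlib import Path


def open_text(file_path):
    """Open .gz files through gzip in text mode, everything else as plain text."""
    path = Path(file_path)
    if path.suffix == ".gz":
        return gzip.open(path, "rt", encoding="utf-8")
    return open(path, "r", encoding="utf-8")


with tempfile.TemporaryDirectory() as tmp:
    gz_path = Path(tmp) / "demo.tsv.gz"
    with gzip.open(gz_path, "wt", encoding="utf-8") as out:
        out.write("CHR\tPOS\tSNP\tP\n")
    with open_text(gz_path) as handle:
        first_line = handle.readline()
```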

pycmplot.io.strip_comma_separated_input_streams(sum_stats, labels, colors_raw='steelblue,grey', track_heights=None, builds=None)

Parse comma-separated CLI strings into Python lists.

Converts the raw string arguments produced by argparse (e.g. "HbF.tsv.gz,MCV.txt.gz,MCH.tsv.gz") into the lists expected by the rest of the API. Validates that sum_stats, labels and builds (when supplied) have the same number of elements.

Parameters:
  • sum_stats (str) – Comma-separated list of summary statistics file paths.

  • labels (str) – Comma-separated list of track labels. Must contain the same number of elements as sum_stats.

  • colors_raw (str, optional) – Comma-separated list of matplotlib colour strings. Default is 'steelblue,grey'.

  • track_heights (str, optional) – Comma-separated list of relative track heights (floats), one per track.

  • builds (str, optional) – Comma-separated list of genome builds (e.g. 'hg19,hg38,hg38,hg19'), one per summary statistics file.

Returns:

  • sum_stats (list of str) – Parsed file paths, whitespace-stripped.

  • labels (list of str) – Parsed track labels, whitespace-stripped.

  • colors (list of str) – Parsed colour strings, whitespace-stripped.

  • t_heights (list of float or None) – Parsed track heights converted to float. None when track_heights was not supplied.

  • builds (list of str or None) – Parsed build strings, whitespace-stripped. None when builds was not supplied.

Raises:

SystemExit – If sum_stats, labels and builds have mismatched lengths.
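
The parsing and length check can be illustrated with a stand-alone sketch. parse_cli_lists and split_csv_arg are hypothetical names, and the sketch raises ValueError where the documented function exits with SystemExit.

```python
def split_csv_arg(raw):
    """Split a comma-separated string and strip whitespace; None stays None."""
    if raw is None:
        return None
    return [item.strip() for item in raw.split(",")]


def parse_cli_lists(sum_stats, labels, colors_raw="steelblue,grey",
                    track_heights=None, builds=None):
    files = split_csv_arg(sum_stats)
    names = split_csv_arg(labels)
    colors = split_csv_arg(colors_raw)
    heights = split_csv_arg(track_heights)
    t_heights = [float(h) for h in heights] if heights is not None else None
    build_list = split_csv_arg(builds)

    # Files, labels, and builds (when given) must line up one-to-one.
    lengths = {len(files), len(names)}
    if build_list is not None:
        lengths.add(len(build_list))
    if len(lengths) != 1:
        raise ValueError("sum_stats, labels and builds must have equal lengths")
    return files, names, colors, t_heights, build_list


parsed = parse_cli_lists(
    "HbF.tsv.gz, MCV.txt.gz", "HbF,MCV",
    track_heights="1, 0.5", builds="hg19,hg38",
)
```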