pycmplot.io
Functions for loading, validating, and pre-processing GWAS summary statistics files. Handles delimiter auto-detection (whitespace, tab, comma), gzip decompression, and resolution of column-name variants to the canonical set used throughout the package.
The primary entry point for the plotting pipeline is
get_sumstats_and_merged_sector_list(), which loads all tracks, runs
coordinate liftover when needed, extracts lead SNPs, generates the hits
summary table, and computes the merged Circos sector-size dictionary — all
in a single call.
Notes
This module is called automatically by the command-line entry point and by
pycmplot._core.main(); most users will not need to import it directly.
It is documented here for users who wish to load and pre-process summary
statistics programmatically before passing them to the plotting functions.
- pycmplot.io.auto_thin_for_manhattan(df: DataFrame, keep_threshold: float = 2.0, max_below: int = 200000, logp: bool = True, logp_col: str = 'logP', p_col: str = 'P', seed: int = 42) -> DataFrame
Density-aware sub-sampling for Manhattan-style scatter plots.
Inspired by gwaslab’s default behaviour, this helper preserves every variant whose “interestingness” signal is at or above keep_threshold (so peaks, suggestive hits, genome-wide-significant hits, and extreme selection-scan values are kept verbatim) and uniformly sub-samples the dense bulk below the threshold down to at most max_below rows in total. For a 10 M-variant scan with the defaults below, this typically cuts the plotted point count from 10 M to ~200 K plus a few hundred peaks. The result is visually indistinguishable above the suggestive band, but two orders of magnitude faster to render.
Two modes, switched by logp:
P-value mode (logp=True, the default). signal = -log10(P). keep_threshold is in -log10(P) units (default 2.0, i.e. P <= 0.01). Variants with -log10(P) >= keep_threshold are all retained.
Raw-statistic mode (logp=False). signal = |value| of p_col, the column carrying the test statistic. This is the right mode for non-p-value scans such as iHS, XP-EHH, F_ST, Fay & Wu’s H, Tajima’s D, etc., where “interesting” means large magnitude (positive or negative). keep_threshold is then in the units of the underlying statistic (default still 2.0, which is a sensible cutoff for standardised selection scans; override with e.g. 0.05 for F_ST).
- Parameters:
df (pandas.DataFrame) – Input DataFrame. In p-value mode, must contain either logp_col (preferred) or p_col. In raw-statistic mode, must contain p_col. When the relevant column is absent, df is returned unchanged.
keep_threshold (float, optional) – Threshold at or above which all variants are retained. Interpreted in -log10(P) units when logp=True (default 2.0), or in the natural units of the underlying statistic when logp=False.
max_below (int, optional) – Maximum number of below-threshold rows to retain, sampled uniformly at random. Default 200_000.
logp (bool, optional) – When True (default), interpret the data as p-values and use -log10(P) as the signal. When False, treat p_col as a raw statistic and use |value| as the signal.
logp_col (str, optional) – Name of the precomputed -log10(P) column for p-value mode. Default 'logP'.
p_col (str, optional) – Name of the raw p-value column (p-value mode) or test-statistic column (raw-statistic mode). Default 'P'.
seed (int, optional) – Seed for the RNG used to sub-sample the bulk. Default 42.
- Returns:
Sub-sampled view of df preserving the original index ordering. When the below-threshold count is already <= max_below, the input is returned unchanged.
- Return type:
pandas.DataFrame
Examples
GWAS p-values (default):
>>> thinned = auto_thin_for_manhattan(df, keep_threshold=2.0)
iHS / XP-EHH (signed selection statistics, |value| >= 2):
>>> thinned = auto_thin_for_manhattan(
...     df, logp=False, keep_threshold=2.0, p_col="iHS",
... )
F_ST (unsigned, 0–1, outlier cutoff e.g. 0.05):
>>> thinned = auto_thin_for_manhattan(
...     df, logp=False, keep_threshold=0.05, p_col="FST",
... )
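The thinning rule described for auto_thin_for_manhattan can be sketched without pandas. The helper below is a hypothetical stand-in (thin_signal is not part of pycmplot): it works on a plain list of values and returns the row indices to keep, assuming the rule is "keep everything at or above the threshold, cap the rest at max_below by seeded uniform sampling". Missing-value handling is left out.

```python
import math
import random


def thin_signal(values, keep_threshold=2.0, max_below=200_000, logp=True, seed=42):
    """Return sorted indices of the rows to keep (illustrative sketch only)."""
    if logp:
        # P-value mode: the signal is -log10(P); P == 0 is treated as infinite.
        signal = [-math.log10(p) if p > 0 else math.inf for p in values]
    else:
        # Raw-statistic mode: the signal is the magnitude of the statistic.
        signal = [abs(v) for v in values]

    keep = [i for i, s in enumerate(signal) if s >= keep_threshold]
    below = [i for i, s in enumerate(signal) if s < keep_threshold]

    # Sub-sample the bulk only when it exceeds the cap.
    if len(below) > max_below:
        below = random.Random(seed).sample(below, max_below)

    # Restore the original row order.
    return sorted(keep + below)


pvals = [0.5, 1e-9, 0.2, 0.3, 0.005, 0.9]
kept = thin_signal(pvals, max_below=2)  # two peaks plus two sampled bulk rows
```

Using a dedicated random.Random(seed) instance rather than the module-level generator keeps the sub-sample reproducible without disturbing other random state.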
- pycmplot.io.detect_delimiter(file_path: str, sample_size: int = 5000)
Infer the field delimiter of a summary statistics file automatically.
Reads the first sample_size bytes of file_path and passes the content to csv.Sniffer. Falls back to a character-frequency heuristic (testing ',', '\t', ' ', ';', '|') if csv.Sniffer raises csv.Error.
- Parameters:
file_path (str or pathlib.Path) – Path to the summary statistics file. Gzip-compressed files (.gz) are supported transparently via smart_open().
sample_size (int, optional) – Number of bytes to read for delimiter detection. Default is 5000.
- Returns:
delimiter (str) – The inferred single-character field separator (e.g. '\t', ',', ' ').
dialect (csv.Dialect or None) – The csv.Dialect object returned by csv.Sniffer, or None when the fallback heuristic was used.
Examples
>>> from pycmplot.io import detect_delimiter
>>> delim, dialect = detect_delimiter("HbF.tsv.gz")
>>> delim
'\t'
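The sniff-then-fall-back strategy can be illustrated on an in-memory sample. This is a sketch under assumptions, not the package's code: sniff_delimiter and most_frequent_delimiter are hypothetical names, and the fallback is assumed to pick the candidate character that occurs most often in the sample.

```python
import csv

CANDIDATES = (",", "\t", " ", ";", "|")


def most_frequent_delimiter(sample, candidates=CANDIDATES):
    """Fallback heuristic: the candidate that occurs most often in the sample."""
    return max(candidates, key=sample.count)


def sniff_delimiter(sample):
    """Return (delimiter, dialect); dialect is None when the fallback is used."""
    try:
        dialect = csv.Sniffer().sniff(sample)
        return dialect.delimiter, dialect
    except csv.Error:
        return most_frequent_delimiter(sample), None


sample = "CHR\tPOS\tSNP\tP\n1\t100\trs1\t0.5\n2\t200\trs2\t0.01\n"
delim, dialect = sniff_delimiter(sample)
```

csv.Sniffer can return an unexpected character on unusual input instead of raising, so checking the result against the candidate set may be a sensible extra guard in real use.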
- pycmplot.io.generate_random_string(length)
Generate a random alphanumeric string.
Used internally to create a unique output file-name component when no --plot_title is provided.
- Parameters:
length (int) – Number of characters in the output string.
- Returns:
Random string drawn from ASCII letters (upper- and lower-case) and digits ([A-Za-z0-9]).
- Return type:
str
Examples
>>> from pycmplot.io import generate_random_string
>>> s = generate_random_string(10)
>>> len(s)
10
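One common way to draw such a string with the standard library is shown below. The function name random_alnum is hypothetical, and the package may generate its string differently.

```python
import random
import string


def random_alnum(length):
    """Return `length` characters drawn from [A-Za-z0-9]."""
    alphabet = string.ascii_letters + string.digits
    return "".join(random.choices(alphabet, k=length))


token = random_alnum(10)
```

The random module is sufficient for naming output files; it is not suitable where the string must be unpredictable.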
- pycmplot.io.get_file_header(file_path: str, delim: str | None = None, dialect=None) -> list[str]
Read and return the column names from the header line of a file.
Opens file_path, reads the first row using csv.DictReader configured with the supplied delimiter or dialect, and returns the field names as an ordered list of strings.
- Parameters:
file_path (str or pathlib.Path) – Path to the summary statistics file (plain text or .gz).
delim (str, optional) – Field separator character (e.g. '\t'). Takes priority over dialect when both are provided.
dialect (csv.Dialect, optional) – A csv.Dialect object (e.g. as returned by detect_delimiter()). Used only when delim is None.
- Returns:
Ordered list of column names exactly as they appear in the file header. Returns an empty list and logs a warning if the header cannot be determined.
- Return type:
list[str]
Examples
>>> from pycmplot.io import detect_delimiter, get_file_header
>>> delim, dialect = detect_delimiter("HbF.tsv.gz")
>>> header = get_file_header("HbF.tsv.gz", delim=delim)
>>> header[:4]
['CHR', 'POS', 'SNP', 'P']
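The delim-over-dialect priority and the empty-list fallback can be sketched with csv.DictReader on an open text handle. read_header is a hypothetical helper for illustration; it takes a file object rather than a path and omits the warning that the documented function logs.

```python
import csv
import io


def read_header(handle, delim=None, dialect=None):
    """Return the header fields of an open text handle, or [] if there is none."""
    if delim is not None:
        reader = csv.DictReader(handle, delimiter=delim)  # delim wins over dialect
    elif dialect is not None:
        reader = csv.DictReader(handle, dialect=dialect)
    else:
        reader = csv.DictReader(handle)  # csv default: comma-separated
    # fieldnames is None when the input is empty.
    return list(reader.fieldnames or [])


text = "CHR\tPOS\tSNP\tP\n1\t100\trs1\t0.5\n"
header = read_header(io.StringIO(text), delim="\t")
```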
- pycmplot.io.get_output_paths(labels, mode: str | None = 'lm', logp: bool = False, output_dir: str | None = None, plot_title: str | None = None, output_format: str | None = 'png')
Construct output file paths for the plot image and locus summary table.
Creates output_dir (including any missing parent directories) and derives deterministic, human-readable file names from the plot title, track labels, plot mode, and y-axis scale.
- Parameters:
labels (list of str) – Track labels joined with underscores in the output file name.
mode ({'lm', 'cm'}, optional) – Plot mode: 'lm' for linear Manhattan, 'cm' for circular. Default is 'lm'.
logp (bool, optional) – When True the string '_logp' is appended to the base name; otherwise '_pval' is appended. Default is False.
output_dir (str or pathlib.Path, optional) – Directory in which output files will be written. Created with mkdir(parents=True, exist_ok=True) if it does not already exist. Default is '.'.
plot_title (str, optional) – Human-readable plot title. Non-alphanumeric characters are stripped and spaces replaced with underscores for safe use in file names. When None a 10-character random alphanumeric string is used instead.
output_format (str, optional) – Image file extension without the leading dot (e.g. 'png', 'pdf', 'svg'). Default is 'png'.
- Returns:
plt_name (str) – Absolute path to the output plot image file.
table_out (str) – Absolute path to the output locus summary table TSV file.
plt_base (str) – Absolute path base (no extension) used to derive the QQ-plot output stems.
Examples
>>> from pycmplot.io import get_output_paths
>>> plt_name, table_out, plt_base = get_output_paths(
...     labels=["HbF", "MCV"],
...     mode="lm",
...     logp=True,
...     output_dir="./results",
...     plot_title="RBC Traits",
... )
>>> plt_name
'.../results/RBC_Traits_HbF_MCV_lm_logp.png'
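The file-name pattern visible in the example (title, labels, mode, scale) can be reproduced with a few lines. This is a guess at the naming logic based only on that example: sketch_plot_path is a hypothetical name, the order of the sanitising steps is an assumption, and the sketch does not create the output directory or derive the table path.

```python
import re
from pathlib import Path


def sketch_plot_path(labels, mode="lm", logp=False, output_dir=".",
                     plot_title="plot", output_format="png"):
    """Build an absolute plot path following the pattern in the example above."""
    # Spaces become underscores, then anything outside [A-Za-z0-9_] is dropped.
    safe_title = re.sub(r"[^A-Za-z0-9_]", "", plot_title.replace(" ", "_"))
    scale = "logp" if logp else "pval"
    base = "_".join([safe_title, *labels, mode, scale])
    return str(Path(output_dir).resolve() / f"{base}.{output_format}")


path = sketch_plot_path(["HbF", "MCV"], mode="lm", logp=True,
                        output_dir="./results", plot_title="RBC Traits")
```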
- pycmplot.io.get_sumstats_and_merged_sector_list(sum_stats: list[str], labels: list[str], logp: bool = False, trim_pval: float | None = None, file_info: dict | None = None, sort_tracks: str | None = None, table_out: str | None = None, signif_threshold: float | None = None, signif_line: float | None = None, suggest_threshold: float | None = None, highlight: bool | None = False, highlight_thresh: float | None = 5e-08, resources: ResourceConfig | None = None, compute_pvals: bool = True, auto_thin: bool = True, auto_thin_threshold: float = 2.0, auto_thin_max_below: int = 200000)
Load summary statistics, run liftover, extract lead SNPs, and compute merged Circos sector sizes.
This is the primary data-loading function for the plotting pipeline. For each track it reads the file using the column mapping from file_info, optionally filters by trim_pval, normalises chromosome names (chr prefix stripped; 23 to X, 24 to Y, M/MTDNA to MT), lifts over hg19 coordinates when a build column is present, and extracts lead SNPs. After all tracks are loaded it builds the hits summary table, derives significance thresholds, optionally sorts tracks, and computes the merged sector-size dict consumed by both plotters.
- Parameters:
sum_stats (list of str) – Paths to summary statistics files (gzip supported).
labels (list of str) – Track labels in the same order as sum_stats.
logp (bool, optional) – If True, a logP column (–log₁₀(P)) is added to every loaded DataFrame and used for lead-SNP ranking and threshold-line computation. Default is False.
trim_pval (float, optional) – Drop variants with P > trim_pval before any further processing. Strongly recommended for large files (e.g. 0.01). Default is None (no trimming; variants with P > 1 are still removed).
file_info (dict, optional) – Column-resolution mapping as returned by prep_pycmplot_input_info(). Must be supplied for data to be loaded.
sort_tracks ({'label', 'chrom_len', None}, optional) – Track ordering after loading. 'label' sorts alphabetically; 'chrom_len' sorts by the number of distinct chromosomes (most chromosomes first). None preserves input order. Default is 'chrom_len'.
table_out (str, optional) – File path at which to write the locus summary table TSV. Passed through to get_hits_summary_table().
signif_threshold (float, optional) – Genome-wide significance threshold for lead-SNP extraction and the significance line. When None, computed as max(0.05 / N, 5e-8) where N is the variant count in the last loaded track; falls back to 5e-8 when trim_pval is set.
signif_line (float, optional) – Explicit significance-line value drawn on the plot. When None, signif_threshold is used. If logp is True and the value is < 1, it is converted to –log₁₀ scale automatically.
suggest_threshold (float, optional) – Suggestive significance threshold for a second dashed line. Defaults to 1e-5.
resources (ResourceConfig, optional) – ResourceConfig instance supplying paths to the liftover chain file and gene-info reference files. Falls back to default_resources.
- Returns:
A dictionary with the following keys:
'sectors': dict mapping chromosome → [min_pos, max_pos] across all tracks, in natural chromosome order ('1', '2', …, 'X', 'Y'), with a 'Spacer1' entry appended for y-axis labelling.
'dfs': dict mapping label → [DataFrame, n_chroms]. Each DataFrame contains canonical columns CHR, POS, SNP, P, LABEL and optionally logP, BUILD, OLD_POS, OLD_BUILD (when a build column and liftover were applied).
'annot': pandas.DataFrame containing the clumped locus summary with nearest-gene annotations. Empty when no variants pass the significance threshold.
'lines': list of {'genome': float, 'suggestive': float} dicts, one per track, in the final sorted order.
'pvals': dict mapping label → numpy.ndarray of raw (un-trimmed) p-values for QQ plotting.
- Return type:
dict
See also
prep_pycmplot_input_info – Resolves column names and delimiters; its output is passed as file_info.
pycmplot.annotation.get_hits_summary_table – Gene annotation and distance-based clumping of the locus table.
pycmplot.liftover.liftover_position – hg19 → hg38 coordinate conversion applied row-wise.
Examples
>>> from pycmplot.io import prep_pycmplot_input_info
>>> from pycmplot.io import get_sumstats_and_merged_sector_list
>>> files = ["HbF.tsv.gz", "MCV.txt.gz"]
>>> labels = ["HbF", "MCV"]
>>> file_info = prep_pycmplot_input_info(files, labels)
>>> result = get_sumstats_and_merged_sector_list(
...     sum_stats=files,
...     labels=labels,
...     logp=True,
...     trim_pval=0.01,
...     file_info=file_info,
...     signif_threshold=5e-8,
... )
>>> sorted(result.keys())
['annot', 'dfs', 'lines', 'pvals', 'sectors']
>>> list(result["sectors"].keys())[:4]
['1', '2', '3', '4']
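Two of the per-track steps described for get_sumstats_and_merged_sector_list, chromosome-name normalisation and the default significance threshold, are simple enough to sketch in isolation. All three helper names below are hypothetical, the alias matching is assumed to be case-sensitive, and the real function applies these rules to DataFrame columns rather than to single values.

```python
import math


def normalise_chrom(chrom):
    """Strip a leading 'chr' and map numeric / mitochondrial aliases."""
    name = str(chrom).strip()
    if name.startswith("chr"):
        name = name[3:]
    aliases = {"23": "X", "24": "Y", "M": "MT", "MTDNA": "MT"}
    return aliases.get(name, name)


def default_signif_threshold(n_variants, trimmed=False):
    """max(0.05 / N, 5e-8), or 5e-8 when the track was trimmed by p-value."""
    if trimmed or n_variants <= 0:
        return 5e-8
    return max(0.05 / n_variants, 5e-8)


def line_value(threshold, logp):
    """Convert a p-value threshold to -log10 scale when logp is set."""
    return -math.log10(threshold) if logp and threshold < 1 else threshold


chroms = [normalise_chrom(c) for c in ["chr1", "23", "chrM", "X"]]
genome_line = line_value(default_signif_threshold(10_000_000), logp=True)
```

With ten million variants the Bonferroni term 0.05 / N is 5e-9, so the max() keeps the threshold at 5e-8; the formula only relaxes the threshold for small variant counts.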
- pycmplot.io.prep_pycmplot_input_info(sum_stats: list[str], labels: list[str], build_column: str | None = None, build_list: list[str] = None, delim: str | None = None, chrom: str | None = None, pos: str | None = None, snp: str | None = None, pcol: str | None = None)
Resolve column names and delimiters for each summary statistics file.
Iterates over every file in sum_stats, auto-detects (or uses the supplied) delimiter, reads the file header, and maps each required column (chromosome, position, SNP ID, p-value, genome build) to the first matching entry in an ordered candidate-name list. Returns a per-label mapping that tells get_sumstats_and_merged_sector_list() exactly which columns to read and how to rename them.
- Parameters:
sum_stats (list of str) – Paths to one or more summary statistics files (gzip supported).
labels (list of str) – Track labels in the same order as sum_stats.
build_column (str, optional) – Genome-build column name (candidates: 'BUILD', 'Genome', 'Genome_Build', 'Genome-build', …). Alternatively, a list of genome builds can be supplied via --build.
build_list (list, optional) – List of genome builds in the same order as sum_stats and labels.
delim (str, optional) – Field delimiter shared by all files. Accepts human-readable names ('tab', 'space', 'comma') or single characters. When None the delimiter is auto-detected independently for each file using detect_delimiter().
chrom (str, optional) – Chromosome column name. When None, the first header field that matches any built-in candidate ('CHR', 'CHROM', '#CHROM', 'chrom', 'chr', …) is used.
pos (str, optional) – Base-pair position column name (candidates: 'BP', 'POS', 'bp', 'pos', 'Basepair').
snp (str, optional) – Variant / marker ID column name (candidates: 'SNP', 'RSID', 'rsID', 'MarkerName', 'MarkerID', 'SNPID', 'ID', …).
pcol (str, optional) – P-value column name (candidates: 'P', 'P-value', 'pvalue', 'p_val', 'pval', 'Wald_P').
- Returns:
Mapping of label -> [old_columns, col_dtypes, rename_map, sep]:
old_columns – list of the five original column names as found in the file header.
col_dtypes – {column_name: dtype} passed to pandas.read_csv().
rename_map – {old_name: canonical_name} for CHR, POS, SNP, P, BUILD.
sep – the resolved delimiter character for this file.
- Return type:
dict
- Raises:
SystemExit – If any required column (chromosome, position, SNP ID, p-value, or build) cannot be resolved from the file header.
See also
get_sumstats_and_merged_sector_list – The main loading function that consumes the mapping returned here.
detect_delimiter – Auto-detects the file delimiter when delim is None.
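Column resolution of this kind reduces to a lookup of candidate names against the file header. The sketch below assumes that candidates are tried in list order and that a user-supplied name must exist in the header; resolve_column is a hypothetical helper, and the candidate lists contain only the names spelled out above (the build column is omitted).

```python
def resolve_column(header, candidates, override=None):
    """Return the header field to use, or None when nothing matches."""
    if override is not None:
        return override if override in header else None
    # First candidate (in priority order) that is present in the header.
    return next((name for name in candidates if name in header), None)


CANDIDATES = {
    "CHR": ["CHR", "CHROM", "#CHROM", "chrom", "chr"],
    "POS": ["BP", "POS", "bp", "pos", "Basepair"],
    "SNP": ["SNP", "RSID", "rsID", "MarkerName", "MarkerID", "SNPID", "ID"],
    "P": ["P", "P-value", "pvalue", "p_val", "pval", "Wald_P"],
}

header = ["#CHROM", "BP", "rsID", "pval", "BETA"]
rename_map = {resolve_column(header, names): canon
              for canon, names in CANDIDATES.items()}
```

A None key in rename_map would signal an unresolved column, which is the condition under which the documented function exits.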
- pycmplot.io.resolve_delimiter(delim: str) -> str
Map a human-readable delimiter name to its single-character representation.
- Parameters:
delim (str) – A delimiter name (one of 'space', 'tab', 'comma', 'colon', 'semi-colon', 'semicolon') or a single bare character (e.g. '|'). Matching is case-insensitive.
- Returns:
The corresponding single-character separator string.
- Return type:
str
- Raises:
TypeError – If delim is not a string.
ValueError – If delim is neither a recognised name nor a single character.
Examples
>>> from pycmplot.io import resolve_delimiter
>>> resolve_delimiter("tab")
'\t'
>>> resolve_delimiter(",")
','
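The documented contract (name lookup, single-character pass-through, TypeError and ValueError) maps onto a small dictionary-based function. name_to_delimiter is a hypothetical re-implementation for illustration; the error messages are invented.

```python
DELIMITER_NAMES = {
    "space": " ",
    "tab": "\t",
    "comma": ",",
    "colon": ":",
    "semi-colon": ";",
    "semicolon": ";",
}


def name_to_delimiter(delim):
    """Map a delimiter name (case-insensitive) or bare character to a separator."""
    if not isinstance(delim, str):
        raise TypeError(f"delimiter must be a string, got {type(delim).__name__}")
    key = delim.lower()
    if key in DELIMITER_NAMES:
        return DELIMITER_NAMES[key]
    if len(delim) == 1:
        return delim  # already a single character such as '|'
    raise ValueError(f"unrecognised delimiter: {delim!r}")


separators = [name_to_delimiter(d) for d in ("TAB", "comma", "|")]
```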
- pycmplot.io.smart_open(file_path: str)
Open a plain-text or gzip-compressed file transparently.
Detects gzip compression from the .gz file suffix; all other paths are opened as plain text.
- Parameters:
file_path (str or pathlib.Path) – Path to the file to open.
- Returns:
An open, readable text-mode file object. Must be used as a context manager (with smart_open(...) as f: ...).
- Return type:
Examples
>>> from pycmplot.io import smart_open
>>> with smart_open("HbF.tsv.gz") as f:
...     header = f.readline()
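Suffix-based dispatch between gzip.open and the built-in open is a common way to get this behaviour. The sketch below uses a hypothetical name, open_text, and assumes UTF-8 text; the package's own encoding choice is not documented here.

```python
import gzip
from pathlib import Path


def open_text(file_path):
    """Open a plain or gzip-compressed file for reading in text mode."""
    path = Path(file_path)
    if path.suffix == ".gz":
        # 'rt' makes gzip.open return decoded text instead of bytes.
        return gzip.open(path, mode="rt", encoding="utf-8")
    return open(path, mode="r", encoding="utf-8")
```

Both branches return objects that support the with statement, so callers can treat compressed and plain files identically.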
- pycmplot.io.strip_comma_separated_input_streams(sum_stats, labels, colors_raw='steelblue,grey', track_heights=None, builds=None)
Parse comma-separated CLI strings into Python lists.
Converts the raw string arguments produced by argparse (e.g. "HbF.tsv.gz,MCV.txt.gz,MCH.tsv.gz") into the lists expected by the rest of the API. Validates that sum_stats, labels and builds (when supplied) have the same number of elements.
- Parameters:
sum_stats (str) – Comma-separated list of summary statistics file paths.
labels (str) – Comma-separated list of track labels. Must contain the same number of elements as sum_stats.
colors_raw (str, optional) – Comma-separated list of matplotlib colour strings. Default is 'steelblue,grey'.
track_heights (str, optional) – Comma-separated list of relative track heights (floats), one per track.
builds (str, optional) – Comma-separated list of genome builds (e.g. 'hg19,hg38,hg38,hg19'), one per summary statistics file.
- Returns:
sum_stats (list of str) – Parsed file paths, whitespace-stripped.
labels (list of str) – Parsed track labels, whitespace-stripped.
colors (list of str) – Parsed colour strings, whitespace-stripped.
t_heights (list of float or None) – Parsed track heights converted to float. None when track_heights was not supplied.
builds (list of str or None) – Parsed build strings, whitespace-stripped. None when builds was not supplied.
- Raises:
SystemExit – If sum_stats, labels and builds have mismatched lengths.
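The parsing and length check can be sketched as follows. split_csv and parse_tracks are hypothetical names, colour parsing is omitted, and the exit message is invented; only the split, strip, float-conversion and SystemExit behaviour come from the description above.

```python
import sys


def split_csv(raw):
    """Split a comma-separated string into stripped items; None stays None."""
    if raw is None:
        return None
    return [item.strip() for item in raw.split(",")]


def parse_tracks(sum_stats, labels, builds=None, track_heights=None):
    """Parse CLI strings and exit when the per-track lists disagree in length."""
    files, names = split_csv(sum_stats), split_csv(labels)
    build_list = split_csv(builds)
    heights = split_csv(track_heights)
    if heights is not None:
        heights = [float(h) for h in heights]

    lengths = {len(files), len(names)}
    if build_list is not None:
        lengths.add(len(build_list))
    if len(lengths) != 1:
        sys.exit("sum_stats, labels and builds must have the same number of elements")
    return files, names, build_list, heights


parsed = parse_tracks("HbF.tsv.gz, MCV.txt.gz", "HbF,MCV",
                      builds="hg19,hg38", track_heights="1, 2.5")
```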