.. _quickstart:

Quickstart
==========

This page walks through the most common workflows in a few lines each.
For a detailed walkthrough with real data, see :ref:`python_api_notebook`.

Input file format
-----------------

pycmplot accepts any whitespace- or comma-delimited summary statistics file,
including gzip-compressed (``.gz``) files. Required columns (auto-detected
from common names) are:

- Chromosome (e.g. ``CHR``, ``CHROM``, ``#CHROM``)
- Base-pair position (e.g. ``BP``, ``POS``, ``pos``)
- Variant identifier (e.g. ``SNP``, ``RSID``, ``MarkerName``)
- P-value or test statistic (e.g. ``P``, ``pvalue``, ``Wald_P``)

Optionally, a genome-build column (``hg19`` / ``hg38``) enables automatic
liftover (see :ref:`cli_liftover`). Alternatively, pass per-file builds on the
command line with ``--build hg19,hg38,...``.

.. tip::

   For large summary statistics files, always pass ``--trim_pval 0.01`` to
   discard variants with p > 0.01 before plotting. This can reduce memory
   usage by an order of magnitude.

Command line
------------

Most users will use the CLI. The typical invocation for a multi-track plot is:

.. code-block:: bash

   pycmplot \
     --sum_stats HbF.tsv.gz,MCV.txt.gz,MCH.tsv.gz \
     --labels HbF,MCV,MCH \
     --logp \
     --signif_line \
     --highlight \
     --annotate GENE \
     --trim_pval 0.01 \
     --output_dir ./results

For the full CLI reference, see :ref:`cli`.

Python API
----------

The Python API exposes the same pipeline as the CLI, with three explicit steps:
(1) resolve per-file column names, (2) load and pre-process the summary
statistics, (3) render the plot. The rendering functions take the
``sumstats_loaded`` dictionary produced in step 2.

Linear Manhattan plot (single trait)
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

.. code-block:: python

   from pycmplot import (
       prep_pycmplot_input_info,
       get_sumstats_and_merged_sector_list,
       plot_linear,
   )

   files = ["HbF.tsv.gz"]
   labels = ["HbF"]

   # 1. Resolve per-file column names and delimiters
   file_info = prep_pycmplot_input_info(
       sum_stats=files,
       labels=labels,
   )

   # 2. Load data, run liftover (if needed), extract lead SNPs,
   #    build the hits table, and compute merged sector sizes
   result = get_sumstats_and_merged_sector_list(
       sum_stats=files,
       labels=labels,
       logp=True,
       trim_pval=0.01,
       file_info=file_info,
       signif_threshold=5e-8,
   )

   # 3. Render the plot
   plot_linear(
       sumstats_loaded=result["dfs"],
       logp=True,
       signif_lines=result["lines"],
       hits_table=result["annot"],
       annotate="GENE",
       label_col="top_gene",
       colors=["steelblue", "silver"],
       output_dir="./results",
       output_format="png",
       dpi=300,
   )

Multi-track linear Manhattan plot
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Compare three RBC traits in a single stacked figure:

.. code-block:: python

   # Reuses the imports from the single-trait example above
   files = ["HbF.tsv.gz", "MCV.txt.gz", "MCH.tsv.gz"]
   labels = ["HbF", "MCV", "MCH"]

   file_info = prep_pycmplot_input_info(sum_stats=files, labels=labels)

   result = get_sumstats_and_merged_sector_list(
       sum_stats=files,
       labels=labels,
       logp=True,
       trim_pval=0.01,
       file_info=file_info,
       signif_threshold=5e-8,
   )

   plot_linear(
       sumstats_loaded=result["dfs"],
       logp=True,
       signif_lines=result["lines"],
       hits_table=result["annot"],
       annotate="GENE",
       label_col="top_gene",
       colors=["steelblue", "silver"],
       output_dir="./results",
   )

Circular (Circos-style) Manhattan plot
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

.. code-block:: python

   from pycmplot import plot_circular

   # Reuses ``result`` from the previous example
   plot_circular(
       sumstats_loaded=result["dfs"],
       sector_sizes=result["sectors"],
       logp=True,
       signif_lines=result["lines"],
       hits_table=result["annot"],
       annotate="GENE",
       label_col="top_gene",
       colors=["steelblue", "silver"],
       plot_title="RBC Traits",
       output_dir="./results",
   )

QQ plots
~~~~~~~~

Each of the QQ functions accepts the dictionary stored under the ``pvals`` key
of the result returned by
:func:`~pycmplot.io.get_sumstats_and_merged_sector_list`:

.. code-block:: python

   from pycmplot import plot_qq_combined, plot_qq_overlay, plot_qq_separate

   # Grid of per-track QQ panels
   plot_qq_combined(
       pval_dict=result["pvals"],
       thin=True,
       max_points=50_000,
       ncols=3,
       title="RBC Traits",
       output_path="./results/rbc_qq",
       fig_format="png",
   )

   # All traits overlaid on a single set of axes, with lambda in the legend
   plot_qq_overlay(
       pval_dict=result["pvals"],
       thin=True,
       max_points=50_000,
       title="RBC Traits",
       output_path="./results/rbc_qq_overlay",
       fig_format="png",
   )

   # One file per trait
   plot_qq_separate(
       pval_dict=result["pvals"],
       base_name="RBC",
       thin=True,
       max_points=50_000,
       output_path="./results/rbc_qq",
       fig_format="png",
   )

Mixed genome builds with liftover
---------------------------------

If your summary statistics were generated on different genome builds (hg18,
hg19, or hg38), pycmplot can lift over hg18 and hg19 coordinates to hg38
before plotting. Supply the builds either through a ``BUILD`` column in the
files or by passing ``--build`` (CLI) / ``build_list=`` (API):

.. code-block:: bash

   pycmplot \
     --sum_stats study_hg18.tsv.gz,study_hg19.tsv.gz,study_hg38.tsv.gz \
     --labels Study_A,Study_B,Study_C \
     --build hg18,hg19,hg38 \
     --logp \
     --annotate GENE \
     --output_dir ./results

Next steps
----------

- Full CLI reference: :ref:`cli`
- Complete Python API: :ref:`api`
- Interactive notebook tutorial: :ref:`python_api_notebook`
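
To make the column auto-detection and p-value trimming described under
*Input file format* more concrete, here is a minimal standalone sketch using
only the Python standard library. It is an illustration, not pycmplot's
implementation: the ``ALIASES`` table and the functions ``open_text``,
``split_line``, ``detect_columns``, and ``read_trimmed`` are hypothetical
names, and the alias sets only mirror the example column names listed above.

.. code-block:: python

   import gzip

   # Hypothetical alias table; pycmplot's real list of recognised names may differ.
   ALIASES = {
       "chrom": {"chr", "chrom", "#chrom"},
       "pos": {"bp", "pos"},
       "snp": {"snp", "rsid", "markername"},
       "pval": {"p", "pvalue", "wald_p"},
   }


   def open_text(path):
       """Open a plain or gzip-compressed file in text mode."""
       if str(path).endswith(".gz"):
           return gzip.open(path, "rt")
       return open(path, "r")


   def split_line(line):
       """Split on commas if the line contains any, otherwise on whitespace."""
       line = line.rstrip("\r\n")
       return line.split(",") if "," in line else line.split()


   def detect_columns(header):
       """Map each required role to the index of the first matching header name."""
       found = {}
       for idx, name in enumerate(header):
           for role, names in ALIASES.items():
               if role not in found and name.strip().lower() in names:
                   found[role] = idx
       missing = sorted(set(ALIASES) - set(found))
       if missing:
           raise ValueError(f"missing required columns: {missing}")
       return found


   def read_trimmed(path, trim_pval=0.01):
       """Return (chrom, pos, snp, p) tuples for rows with p <= trim_pval."""
       rows = []
       with open_text(path) as handle:
           cols = detect_columns(split_line(next(handle, "")))
           for line in handle:
               fields = split_line(line)
               try:
                   p = float(fields[cols["pval"]])
                   pos = int(fields[cols["pos"]])
                   chrom = fields[cols["chrom"]]
                   snp = fields[cols["snp"]]
               except (ValueError, IndexError):
                   continue  # skip blank, short, or unparseable rows (e.g. "NA")
               if p <= trim_pval:
                   rows.append((chrom, pos, snp, p))
       return rows

Calling ``read_trimmed("HbF.tsv.gz")`` on a file with recognisable headers
would keep only the variants at or below the threshold, which is the effect
the ``--trim_pval`` tip relies on to reduce memory usage.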