pycmplot.annotation

Nearest-gene annotation for lead SNPs and generation of the hits summary table. Annotations are based on physical proximity to gene boundaries using a bundled gene reference, with optional biotype weighting.

Nearest-gene annotation for GWAS lead SNPs and generation of the structured locus summary table.

The main public function, get_hits_summary_table(), accepts the lead SNP DataFrame produced by get_lead_snps() and annotates each lead with the nearest, most biologically plausible gene. Annotation uses a two-pass strategy: a strand-aware boundary-distance search followed by a composite priority score. When table_out is given, the function also writes a tab-delimited locus summary file alongside the plot.

Gene reference files

Annotation relies on a bundled Ensembl gene-info TSV (hg38 or hg19). The file is resolved through ResourceConfig; custom paths can be supplied via the PYCMPLOT_GENEINFO_HG38 / PYCMPLOT_GENEINFO_HG19 environment variables.
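A custom gene-info file could be supplied along these lines. This is a minimal sketch: the file path is a placeholder, and setting the variable before importing pycmplot is a precaution, since this page does not state when ResourceConfig reads the environment.

```python
import os

# Hypothetical path to a custom Ensembl gene-info TSV; replace with your own file.
custom_hg38 = "/data/reference/ensembl_geneinfo_hg38.tsv"

# Set the override before importing pycmplot, in case the variable is read
# at import time (an assumption; the exact timing is not documented here).
os.environ["PYCMPLOT_GENEINFO_HG38"] = custom_hg38
```

The hg19 file can be overridden the same way through PYCMPLOT_GENEINFO_HG19.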

pycmplot.annotation.get_annotation_column(annotate: str = None, hits_table: DataFrame = None, label_col: str = None)

pycmplot.annotation.get_hits_summary_table(leads_df: DataFrame, window_kb: int = 500, table_out: str | None = None, resources: ResourceConfig | None = None) → DataFrame

Annotate lead SNPs with nearest genes and write the locus summary table.

For each lead SNP in leads_df, runs two complementary annotation passes:

1. Strand-aware boundary search (_annotate_variant()): identifies the nearest upstream and downstream genes and detects genic / promoter overlap.

2. Priority scoring (_annotate_and_prioritize_variant()): ranks all candidate genes within window_kb by a composite score that weights biotype, promoter proximity, and distance, then selects the single top-ranked gene (or the two flanking genes for intergenic hits).
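The two passes can be pictured with a small standalone sketch. It is not the library's implementation: the Gene record, the weights in BIOTYPE_WEIGHT, and the 0.5 promoter bonus are hypothetical values chosen for illustration, and the special handling of intergenic hits (two flanking genes) is left out.

```python
from dataclasses import dataclass


@dataclass
class Gene:
    """Hypothetical gene record; the real gene-info TSV layout may differ."""
    symbol: str
    start: int
    end: int
    strand: str  # "+" or "-"
    biotype: str


def boundary_distance(pos, gene):
    """Return 0 inside the gene body, else bp to the nearest gene boundary."""
    if gene.start <= pos <= gene.end:
        return 0
    return gene.start - pos if pos < gene.start else pos - gene.end


def in_promoter(pos, gene, promoter_bp=2000):
    """True when pos lies within promoter_bp upstream of the TSS (strand-aware)."""
    if gene.strand == "+":
        # TSS is the gene start; upstream means lower coordinates.
        return gene.start - promoter_bp <= pos < gene.start
    # Minus strand: TSS is the gene end; upstream means higher coordinates.
    return gene.end < pos <= gene.end + promoter_bp


# Illustrative weights only; the library's actual weights are not documented here.
BIOTYPE_WEIGHT = {"protein_coding": 1.0, "lncRNA": 0.5}


def priority_score(pos, gene, window_bp):
    """Combine biotype, distance, and promoter proximity into one score."""
    proximity = 1.0 - min(boundary_distance(pos, gene), window_bp) / window_bp
    score = BIOTYPE_WEIGHT.get(gene.biotype, 0.25) + proximity
    if in_promoter(pos, gene):
        score += 0.5
    return score


def top_gene(pos, genes, window_kb=500):
    """Return the highest-scoring gene within window_kb, or None if there is none."""
    window_bp = window_kb * 1000
    candidates = [g for g in genes if boundary_distance(pos, g) <= window_bp]
    if not candidates:
        return None
    return max(candidates, key=lambda g: priority_score(pos, g, window_bp))
```

In this sketch a protein-coding gene can outrank a closer lncRNA, which is the kind of trade-off a composite score is meant to express.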

After annotation, the table is deduplicated with distance-based clumping (_clump_by_distance()) and optionally written to table_out.
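Distance-based clumping is often done as a greedy pass over leads ranked by significance. The sketch below shows that general idea only. The internals of _clump_by_distance() are not described on this page, so the clump_kb radius, the use of P as the ranking key, and the list-of-dicts input are all assumptions.

```python
def clump_by_distance(leads, clump_kb=500):
    """Greedy clumping sketch: keep the most significant lead per neighbourhood.

    leads is a list of dicts with CHR, POS and P keys (a stand-in for the
    DataFrame rows). clump_kb is a hypothetical radius, not a documented default.
    """
    clump_bp = clump_kb * 1000
    kept = []
    # Visit leads from most to least significant.
    for lead in sorted(leads, key=lambda r: r["P"]):
        near_kept_lead = any(
            lead["CHR"] == k["CHR"] and abs(lead["POS"] - k["POS"]) <= clump_bp
            for k in kept
        )
        if not near_kept_lead:
            kept.append(lead)
    # Return in a stable chromosome/position order (chromosomes sorted as text).
    return sorted(kept, key=lambda r: (str(r["CHR"]), r["POS"]))
```

A lead is dropped only when a more significant lead on the same chromosome has already been kept within the radius, so each retained row represents one locus.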

Parameters:
  • leads_df (pandas.DataFrame) – DataFrame of lead SNPs as returned by get_lead_snps(). Must contain columns CHR, POS, P, BUILD.

  • window_kb (int, optional) – Search radius in kilobases around each lead SNP. Default is 500.

  • table_out (str or None, optional) – File path at which to write the annotated locus summary table as a tab-delimited file. Default is None, which suppresses file output.

  • resources (ResourceConfig, optional) – ResourceConfig instance providing paths to the Ensembl gene-info TSV (hg38 or hg19). Defaults to default_resources.

Returns:

Clumped locus summary table. Contains all columns from leads_df plus annotation fields from both passes, including:

  • genic – True when the lead SNP overlaps a gene body.

  • nearest_upstream_gene – nearest upstream gene symbol (strand-aware).

  • upstream_distance – distance to nearest_upstream_gene in bp.

  • nearest_downstream_gene – nearest downstream gene symbol (strand-aware).

  • downstream_distance – distance to nearest_downstream_gene in bp.

  • promoter_upstream_flag – True when the SNP is within 2 kb upstream of a TSS.

  • gene_density – number of genes within the search window.

  • top_gene – top-priority gene from the scoring pass.

  • biotype – Ensembl biotype of top_gene ('intergenic' when no genic overlap).

  • priority_score – composite priority score (genic hits only).

Return type:

pandas.DataFrame
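The flag columns lend themselves to ordinary pandas filtering. The following sketch uses a toy table with a subset of the documented columns. The rows are placeholders rather than real output, and the boolean dtype of genic and promoter_upstream_flag is assumed from the descriptions above.

```python
import pandas as pd

# Toy stand-in for the returned table; values are illustrative only.
hits = pd.DataFrame(
    {
        "SNP": ["rs123456", "rs789012", "rs000001"],
        "genic": [True, True, False],
        "promoter_upstream_flag": [False, False, True],
        "biotype": ["protein_coding", "protein_coding", "intergenic"],
    }
)

# Leads that fall inside a gene body.
genic_hits = hits[hits["genic"]]

# Leads that sit in a promoter region upstream of a TSS.
promoter_hits = hits[hits["promoter_upstream_flag"]]

# How many leads map to each biotype.
biotype_counts = hits["biotype"].value_counts()
```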

Notes

The gene reference (hg38 or hg19) is selected automatically based on the BUILD column in leads_df. hg19 builds are matched to the GRCh37 gene-info file; all others use the GRCh38 file.
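The selection rule amounts to a two-way switch, sketched below. The function name is hypothetical and the case-insensitive comparison is an assumption. Note the consequence of the "all others" clause: under this rule any unrecognised build label falls through to the GRCh38 file.

```python
def select_gene_reference(build):
    """Return 'hg19' for hg19 builds and 'hg38' for everything else.

    Hypothetical helper mirroring the rule described above; the exact
    string matching used by the library is not documented here.
    """
    label = str(build).strip().lower()
    return "hg19" if label == "hg19" else "hg38"
```

Checking the BUILD values in leads_df before annotation is therefore worthwhile, since a mislabelled build would not raise an error under this rule.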

See also

pycmplot.stats.get_lead_snps

Provides the leads_df input to this function.

pycmplot.resources.ResourceConfig

Controls the paths to the gene-info reference files.

Examples

>>> from pycmplot.annotation import get_hits_summary_table
>>> hits = get_hits_summary_table(
...     leads_df=leads,
...     window_kb=500,
...     table_out="./results/HbF_locus_summary.tsv",
... )
>>> hits[["SNP", "CHR", "POS", "top_gene", "biotype"]].head()
        SNP CHR       POS  top_gene           biotype
0  rs123456   2  60718043    BCL11A    protein_coding
1  rs789012  11   5246696       HBB    protein_coding