pycmplot.annotation

Nearest-gene annotation for GWAS lead SNPs and generation of the structured locus (hits) summary table. Annotations are based on physical proximity to gene boundaries in a bundled gene reference, with optional biotype weighting.

The main public function, get_hits_summary_table(), accepts the lead SNP DataFrame produced by get_lead_snps() and annotates each lead with the nearest (and most biologically plausible) gene. It uses a two-pass strategy: a strand-aware boundary-distance search followed by a composite priority score. It can also write a tab-delimited locus summary file alongside the plot.

Gene reference files

Annotation relies on a bundled Ensembl gene-info TSV (hg38 or hg19). The file is resolved through ResourceConfig; custom paths can be supplied via the PYCMPLOT_GENEINFO_HG38 / PYCMPLOT_GENEINFO_HG19 environment variables.
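The environment-variable override can be set from Python as well as from the shell. The snippet below is a minimal sketch: the file path is a placeholder, and it assumes the variable is read when pycmplot resolves its resources, so setting it before importing the package is the cautious choice.

```python
import os

# Placeholder path: substitute the location of your own Ensembl gene-info TSV.
custom_hg38 = os.path.join(os.path.expanduser("~"), "refs", "geneinfo_hg38.tsv")

# Assumption: pycmplot reads this variable when it resolves resources, so it is
# set here before any pycmplot import.
os.environ["PYCMPLOT_GENEINFO_HG38"] = custom_hg38
```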

pycmplot.annotation.get_annotation_column(annotate: str = None, hits_table: DataFrame = None, label_col: str = None)

pycmplot.annotation.get_hits_summary_table(leads_df: DataFrame, window_kb: int = 500, table_out: str | None = None, resources: ResourceConfig | None = None) → DataFrame

Annotate lead SNPs with nearest genes and write the locus summary table.

For each lead SNP in leads_df, runs two complementary annotation passes:

1. Strand-aware boundary search (_annotate_variant()): identifies the nearest upstream and downstream genes and detects genic / promoter overlap.
2. Priority scoring (_annotate_and_prioritize_variant()): ranks all candidate genes within window_kb by a composite score that weights biotype, promoter proximity, and distance, then selects the single top-ranked gene (or the two flanking genes for intergenic hits).
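The idea behind the priority-scoring pass can be illustrated with a small standalone sketch. It is not the code in _annotate_and_prioritize_variant(): the Gene class, the weights, and the way the three terms are combined are all hypothetical, chosen only to show how biotype, promoter proximity, and distance might feed one score. The two-flanking-genes fallback for intergenic hits is not covered.

```python
from dataclasses import dataclass
from typing import List, Optional

# All weights below are illustrative placeholders, not pycmplot's actual values.
BIOTYPE_WEIGHT = {"protein_coding": 3.0, "lncRNA": 1.5}
DEFAULT_BIOTYPE_WEIGHT = 1.0
PROMOTER_BONUS = 2.0
PROMOTER_BP = 2_000  # 2 kb upstream of the TSS


@dataclass
class Gene:  # hypothetical record type for this sketch
    symbol: str
    start: int
    end: int
    strand: str  # "+" or "-"
    biotype: str


def boundary_distance(pos: int, gene: Gene) -> int:
    """Distance in bp from pos to the nearest gene boundary (0 inside the gene)."""
    if gene.start <= pos <= gene.end:
        return 0
    return min(abs(pos - gene.start), abs(pos - gene.end))


def in_promoter(pos: int, gene: Gene) -> bool:
    """True when pos lies within PROMOTER_BP upstream of the strand-aware TSS."""
    if gene.strand == "+":
        return gene.start - PROMOTER_BP <= pos < gene.start
    return gene.end < pos <= gene.end + PROMOTER_BP


def priority_score(pos: int, gene: Gene, window_bp: int) -> float:
    """Biotype weight + closeness in [0, 1] + optional promoter bonus."""
    closeness = 1.0 - boundary_distance(pos, gene) / window_bp
    score = BIOTYPE_WEIGHT.get(gene.biotype, DEFAULT_BIOTYPE_WEIGHT) + closeness
    if in_promoter(pos, gene):
        score += PROMOTER_BONUS
    return score


def top_gene(pos: int, genes: List[Gene], window_kb: int = 500) -> Optional[Gene]:
    """Return the highest-scoring gene within the window, or None if there is none."""
    window_bp = window_kb * 1_000
    candidates = [g for g in genes if boundary_distance(pos, g) <= window_bp]
    if not candidates:
        return None
    return max(candidates, key=lambda g: priority_score(pos, g, window_bp))


genes = [
    Gene("GENE_A", 10_000, 20_000, "+", "lncRNA"),
    Gene("GENE_B", 30_000, 50_000, "+", "protein_coding"),
]
best = top_gene(21_000, genes)
```

With these placeholder weights, the protein-coding GENE_B outranks the nearer lncRNA GENE_A, because the biotype term outweighs the small difference in distance.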

After annotation, the table is deduplicated with distance-based clumping (_clump_by_distance()) and optionally written to table_out.
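Distance-based clumping is commonly implemented as a greedy pass over leads sorted by p-value. The sketch below shows that generic approach on plain dictionaries. It is an assumption about how such a step can work, not the code in _clump_by_distance(), whose window size and tie-breaking rules are not documented here; clump_by_distance and window_bp are illustrative names.

```python
def clump_by_distance(leads, window_bp=500_000):
    """Greedy clumping: keep the most significant lead in each neighbourhood.

    leads is a list of dicts with keys CHR, POS and P. A lead is dropped when
    an already-kept lead on the same chromosome lies within window_bp of it.
    """
    kept = []
    for lead in sorted(leads, key=lambda row: row["P"]):
        clashes = any(
            k["CHR"] == lead["CHR"] and abs(k["POS"] - lead["POS"]) <= window_bp
            for k in kept
        )
        if not clashes:
            kept.append(lead)
    # Return the survivors in genomic order.
    return sorted(kept, key=lambda row: (row["CHR"], row["POS"]))


leads = [
    {"CHR": 1, "POS": 1_000_000, "P": 1e-9},
    {"CHR": 1, "POS": 1_200_000, "P": 1e-12},
    {"CHR": 1, "POS": 5_000_000, "P": 1e-8},
    {"CHR": 2, "POS": 1_100_000, "P": 1e-10},
]
clumped = clump_by_distance(leads)
```

In this example the lead at chromosome 1, position 1,000,000 is dropped, because a more significant lead sits 200 kb away on the same chromosome. The other three survive.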

Parameters:

leads_df (pandas.DataFrame) – DataFrame of lead SNPs as returned by get_lead_snps(). Must contain columns CHR, POS, P, BUILD.
window_kb (int, optional) – Search radius in kilobases around each lead SNP. Default is 500.
table_out (str or None, optional) – File path at which to write the annotated locus summary table as a tab-delimited file (TSV). Set to None (the default) to suppress file output.
resources (ResourceConfig, optional) – ResourceConfig instance providing paths to the Ensembl gene-info TSV (hg38 or hg19). Defaults to default_resources.

Returns:

Clumped locus summary table. Contains all columns from leads_df plus annotation fields from both passes, including:

genic — True when the lead SNP overlaps a gene body.
nearest_upstream_gene — nearest upstream gene symbol (strand-aware).
upstream_distance — distance to nearest_upstream_gene in bp.
nearest_downstream_gene — nearest downstream gene symbol (strand-aware).
downstream_distance — distance to nearest_downstream_gene in bp.
promoter_upstream_flag — True when the SNP is within 2 kb upstream of a TSS.
gene_density — number of genes within the search window.
top_gene — top-priority gene from the scoring pass.
biotype — Ensembl biotype of top_gene ('intergenic' when no genic overlap).
priority_score — composite priority score (genic hits only).

Return type:

pandas.DataFrame
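To make the first-pass fields under Returns more concrete, the sketch below derives genic, the flanking genes with their distances, and gene_density for a single position. It is deliberately simplified and is not the code in _annotate_variant(): it ignores strand and treats the lower-coordinate side as "upstream", whereas the documented search is strand-aware. The function name and the tuple layout are hypothetical.

```python
def flanking_summary(pos, genes, window_bp=500_000):
    """Summarise the genes around pos.

    genes is a list of (symbol, start, end) tuples on the same chromosome.
    Simplification: "upstream" means lower genomic coordinate; strand is ignored.
    """
    genic = any(start <= pos <= end for _, start, end in genes)

    # Genes that end before pos, and genes that start after pos, with distances.
    left = [(pos - end, symbol) for symbol, start, end in genes if end < pos]
    right = [(start - pos, symbol) for symbol, start, end in genes if start > pos]
    up_dist, up_gene = min(left) if left else (None, None)
    down_dist, down_gene = min(right) if right else (None, None)

    # Count genes overlapping the interval [pos - window_bp, pos + window_bp].
    density = sum(
        1 for _, start, end in genes
        if start <= pos + window_bp and end >= pos - window_bp
    )
    return {
        "genic": genic,
        "nearest_upstream_gene": up_gene,
        "upstream_distance": up_dist,
        "nearest_downstream_gene": down_gene,
        "downstream_distance": down_dist,
        "gene_density": density,
    }


toy_genes = [("A", 100, 200), ("B", 500, 900), ("C", 5_000, 6_000)]
summary = flanking_summary(300, toy_genes, window_bp=1_000)
```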

Notes

The gene reference (hg38 or hg19) is selected automatically based on the BUILD column in leads_df. hg19 builds are matched to the GRCh37 gene-info file; all others use the GRCh38 file.
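One consequence of this rule is that an unrecognised BUILD value falls through to the GRCh38 file. The sketch below illustrates the documented fallback. The function name and the exact set of strings treated as hg19 are assumptions, since the matching logic is not spelled out here.

```python
# Assumption: which spellings count as an hg19 build is not documented;
# this set is illustrative only.
HG19_ALIASES = {"hg19", "grch37"}


def select_geneinfo_build(build):
    """Return "hg19" for hg19-like builds and "hg38" for everything else."""
    normalized = str(build).strip().lower()
    return "hg19" if normalized in HG19_ALIASES else "hg38"


choices = {b: select_geneinfo_build(b) for b in ["hg19", "GRCh37", "hg38", "T2T"]}
```

Under these assumptions, a build label such as "T2T" would be annotated against the GRCh38 gene-info file, so it is worth checking the BUILD column before annotating.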