pycmplot.annotation
Nearest-gene annotation for lead SNPs and generation of the hits summary table. Annotations are based on physical proximity to gene boundaries using a bundled gene reference, with optional biotype weighting.
Nearest-gene annotation for GWAS lead SNPs and generation of the structured locus summary table.
The main public function, get_hits_summary_table(), accepts the lead
SNP DataFrame produced by get_lead_snps(), annotates
each lead with the nearest (and most biologically plausible) gene using a
two-pass strategy — strand-aware boundary distance followed by a composite
priority score — and writes a tab-delimited locus summary file alongside the
plot.
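As a rough illustration of what "strand-aware" means in the first pass, the sketch below classifies a position relative to a single gene. The Gene record, the classify_position helper, and the coordinates are hypothetical and are not part of pycmplot; the 2 kb promoter window follows the promoter_upstream_flag description in this reference.

```python
from typing import NamedTuple


class Gene(NamedTuple):
    # Hypothetical record; the real gene-info TSV columns may differ.
    symbol: str
    start: int   # lower genomic coordinate
    end: int     # higher genomic coordinate
    strand: str  # "+" or "-"


def classify_position(pos: int, gene: Gene, promoter_bp: int = 2000) -> str:
    """Return 'genic', 'promoter', 'upstream', or 'downstream' for pos."""
    if gene.start <= pos <= gene.end:
        return "genic"
    # The TSS sits at the start of a plus-strand gene and at the end of a
    # minus-strand gene, so "upstream" flips with the strand.
    if gene.strand == "+":
        upstream = pos < gene.start
        tss_distance = gene.start - pos
    else:
        upstream = pos > gene.end
        tss_distance = pos - gene.end
    if upstream:
        return "promoter" if tss_distance <= promoter_bp else "upstream"
    return "downstream"


plus_gene = Gene("GENE_A", 10_000, 20_000, "+")
minus_gene = Gene("GENE_B", 10_000, 20_000, "-")
print(classify_position(9_500, plus_gene))    # promoter
print(classify_position(9_500, minus_gene))   # downstream
print(classify_position(21_000, minus_gene))  # promoter
```

The same coordinate is a promoter hit for the plus-strand gene and a downstream hit for the minus-strand gene, which is why a strand-blind distance would mislabel it.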
Gene reference files
Annotation relies on a bundled Ensembl gene-info TSV (hg38 or hg19). The
file is resolved through ResourceConfig; custom
paths can be supplied via the PYCMPLOT_GENEINFO_HG38 /
PYCMPLOT_GENEINFO_HG19 environment variables.
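One way to supply a custom gene-info file is to set the environment variable from Python. The path below is a placeholder. Whether pycmplot reads the variable at import time or at call time is not stated here, so setting it before the import is the conservative choice. The resolve_geneinfo helper is hypothetical and only illustrates the override-with-fallback pattern; it is not the ResourceConfig implementation.

```python
import os

# Placeholder path; replace with the location of your own gene-info TSV.
os.environ["PYCMPLOT_GENEINFO_HG38"] = "/data/refs/custom_geneinfo_hg38.tsv"
# Make sure no hg19 override is active for this demonstration.
os.environ.pop("PYCMPLOT_GENEINFO_HG19", None)


def resolve_geneinfo(build: str, bundled_default: str) -> str:
    """Hypothetical resolver: prefer the environment override, else the bundled file."""
    var = f"PYCMPLOT_GENEINFO_{build.upper()}"
    return os.environ.get(var) or bundled_default


print(resolve_geneinfo("hg38", "bundled_hg38.tsv"))  # /data/refs/custom_geneinfo_hg38.tsv
print(resolve_geneinfo("hg19", "bundled_hg19.tsv"))  # bundled_hg19.tsv
```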
- pycmplot.annotation.get_annotation_column(annotate: str = None, hits_table: DataFrame = None, label_col: str = None)
- pycmplot.annotation.get_hits_summary_table(leads_df: DataFrame, window_kb: int = 500, table_out: str | None = None, resources: ResourceConfig | None = None) -> DataFrame
Annotate lead SNPs with nearest genes and write the locus summary table.
For each lead SNP in leads_df, runs two complementary annotation passes:
1. Strand-aware boundary search (_annotate_variant()): identifies the nearest upstream and downstream genes and detects genic / promoter overlap.
2. Priority scoring (_annotate_and_prioritize_variant()): ranks all candidate genes within window_kb by a composite score that weights biotype, promoter proximity, and distance, then selects the single top-ranked gene (or the two flanking genes for intergenic hits).
After annotation, the table is deduplicated with distance-based clumping (_clump_by_distance()) and optionally written to table_out.
- Parameters:
leads_df (pandas.DataFrame) – DataFrame of lead SNPs as returned by get_lead_snps(). Must contain columns CHR, POS, P, BUILD.
window_kb (int, optional) – Search radius in kilobases around each lead SNP. Default is 500.
table_out (str or None, optional) – File path at which to write the annotated locus summary table as a tab-delimited TSV. Set to None to suppress file output.
resources (ResourceConfig, optional) – ResourceConfig instance providing paths to the Ensembl gene-info TSV (hg38 or hg19). Defaults to default_resources.
- Returns:
Clumped locus summary table. Contains all columns from leads_df plus annotation fields from both passes, including:
genic: True when the lead SNP overlaps a gene body.
nearest_upstream_gene: nearest upstream gene symbol (strand-aware).
upstream_distance: distance to nearest_upstream_gene in bp.
nearest_downstream_gene: nearest downstream gene symbol (strand-aware).
downstream_distance: distance to nearest_downstream_gene in bp.
promoter_upstream_flag: True when the SNP is within 2 kb upstream of a TSS.
gene_density: number of genes within the search window.
top_gene: top-priority gene from the scoring pass.
biotype: Ensembl biotype of top_gene ('intergenic' when no genic overlap).
priority_score: composite priority score (genic hits only).
- Return type:
pandas.DataFrame
Notes
The gene reference (hg38 or hg19) is selected automatically based on the BUILD column in leads_df. hg19 builds are matched to the GRCh37 gene-info file; all others use the GRCh38 file.
See also
pycmplot.stats.get_lead_snps: Provides the leads_df input to this function.
pycmplot.resources.ResourceConfig: Controls the paths to the gene-info reference files.
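Before calling the function, it can help to confirm that the input carries the required columns and to see which reference each row would map to under the rule in the Notes. The check_leads helper below is hypothetical. It matches only the literal string "hg19" (case-insensitive); how pycmplot normalizes other spellings of the build is not documented here.

```python
import pandas as pd

REQUIRED_COLUMNS = ["CHR", "POS", "P", "BUILD"]


def check_leads(leads_df: pd.DataFrame) -> pd.Series:
    """Hypothetical pre-flight check: validate columns, report the reference per row."""
    missing = [col for col in REQUIRED_COLUMNS if col not in leads_df.columns]
    if missing:
        raise ValueError(f"leads_df is missing required columns: {missing}")
    # Assumption: only the exact label "hg19" selects the GRCh37 file.
    is_hg19 = leads_df["BUILD"].astype(str).str.lower().eq("hg19")
    return is_hg19.map({True: "GRCh37", False: "GRCh38"})


# Made-up rows for illustration only.
demo = pd.DataFrame({
    "CHR": [2, 11],
    "POS": [60_718_043, 5_246_696],
    "P": [1e-20, 1e-15],
    "BUILD": ["hg19", "hg38"],
})
print(check_leads(demo).tolist())  # ['GRCh37', 'GRCh38']
```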
Examples
>>> from pycmplot.annotation import get_hits_summary_table
>>> hits = get_hits_summary_table(
...     leads_df=leads,
...     window_kb=500,
...     table_out="./results/HbF_locus_summary.tsv",
... )
>>> hits[["SNP", "CHR", "POS", "top_gene", "biotype"]].head()
        SNP  CHR       POS top_gene         biotype
0  rs123456    2  60718043   BCL11A  protein_coding
1  rs789012   11   5246696      HBB  protein_coding
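The distance-based clumping step mentioned above can be pictured with a simple greedy scheme. This is a minimal sketch under stated assumptions, not the _clump_by_distance() implementation: the clump_by_distance function, its window_bp parameter, the 500 kb default, and the example rows are all hypothetical, and the actual clumping distance used by pycmplot is not given in this reference.

```python
import pandas as pd


def clump_by_distance(leads: pd.DataFrame, window_bp: int = 500_000) -> pd.DataFrame:
    """Greedy clumping: keep the lowest-P lead, drop same-chromosome leads within window_bp."""
    kept = []  # (row label, chromosome, position) of retained leads
    for row in leads.sort_values("P").itertuples():
        near_kept = any(
            chrom == row.CHR and abs(pos - row.POS) <= window_bp
            for _, chrom, pos in kept
        )
        if not near_kept:
            kept.append((row.Index, row.CHR, row.POS))
    labels = [label for label, _, _ in kept]
    return leads.loc[labels].sort_values(["CHR", "POS"]).reset_index(drop=True)


# Made-up leads for illustration only.
leads = pd.DataFrame({
    "SNP": ["rs_a", "rs_b", "rs_c"],
    "CHR": [2, 2, 11],
    "POS": [60_718_043, 60_900_000, 5_246_696],
    "P": [1e-20, 1e-9, 1e-15],
})
clumped = clump_by_distance(leads)
print(clumped["SNP"].tolist())  # ['rs_a', 'rs_c']
```

Here rs_b lies about 182 kb from the more significant rs_a on the same chromosome, so it is folded into that locus, while rs_c on chromosome 11 is kept as its own row.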