Workshop: Agentic Variant Interpretation

Introduction

What is ClawBio?

The first bioinformatics-native AI agent skill library. Curated, reproducible, open-source.

Bioinformatics skills for AI agents

ClawBio is a collection of self-contained, reproducible bioinformatics skills that any AI agent can call. Each skill handles a specific task: annotating variants, scoring pharmacogenomic risk, running differential expression, searching clinical trials, and more. The skills run locally, keep genetic data on your machine, and produce structured, auditable outputs.

Think of it as a toolbox. The AI agent decides which tool to pick up. The tool does the analysis. You review the results.

Local-first. Reproducible. Open.

Local-first: your genomic data never leaves your laptop. Skills process everything in-place.
Reproducible: every skill exports commands.sh, environment.yml, and SHA-256 checksums. Re-run any analysis and get the same output.
Open-source: MIT licensed. Fork it, extend it, contribute back.
Equity-aware: built-in HEIM diversity metrics flag when analyses depend on biased reference data.
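
The checksum idea behind the reproducibility claim can be sketched with the Python standard library. This is a minimal illustration, not ClawBio's own implementation: the function name `sha256_of_file` and the stand-in file are hypothetical.

```python
import hashlib
import os
import tempfile


def sha256_of_file(path, chunk_size=65536):
    """Return the hex SHA-256 digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


# Demo: write a small stand-in report (hypothetical content) and hash it twice.
with tempfile.TemporaryDirectory() as workdir:
    report_path = os.path.join(workdir, "report.md")
    with open(report_path, "wb") as handle:
        handle.write(b"demo report\n")
    first = sha256_of_file(report_path)
    second = sha256_of_file(report_path)

print(first == second)  # identical bytes always give the same digest
```

Comparing a stored digest with a freshly computed one is a quick way to confirm that a re-run produced byte-identical output.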

From raw data to clinical-grade report in one command

          # Traditional approach: 6 manual steps across 3 tools
          vep --input_file sample.vcf --output_file vep_out.txt --cache
          # ...parse the VEP output, cross-reference ClinVar, check gnomAD, read CPIC tables...

          # ClawBio: one command
          python clawbio.py run variant-annotation --input sample.vcf --output report/
          Annotated 21 variants. 3 Tier 1 (pathogenic). 6 Tier 2 (drug response). Report saved.

The problem

Why bioinformatics needs agent skills

🚫

Reproducibility is broken

Dependency hell, dead links to reference data, hardcoded paths, missing configs. Most published bioinformatics analyses cannot be re-run a year later.

🤖

AI hallucinates biology

Large language models guess star alleles, invent gene-drug associations, and cite retracted papers. Without grounded skills, AI in genomics is unreliable.

⏱

Manual pipelines take weeks

A clinical-grade variant annotation requires VEP, ClinVar lookup, gnomAD frequency checks, CPIC cross-referencing, and manual prioritisation. Each step is a separate tool with its own interface.

🌍

Equity gaps persist

86% of GWAS participants are of European descent. Polygenic risk scores lose up to 80% accuracy in non-European populations. AI trained on biased data amplifies existing disparities.

ClawBio's solution

Every skill is self-contained with pinned dependencies, demo data, and reproducibility metadata. The AI agent routes to the right skill, the skill does the grounded analysis, and the human reviews the structured output. No hallucination. No broken pipelines.

Community

Growth and contributors

488 GitHub Stars

85 Forks

39 Skills

13 Contributors

Project milestones

March 2026

v0.4 — Galaxy Integration

Bridge to 8,000+ Galaxy tools. BioBlend SDK integration. Cross-platform skill chaining.

March 2026

v0.3.1 — Agent-Friendly

Added llms.txt, AGENTS.md, and machine-readable catalog.json so any AI agent can discover and use ClawBio skills automatically.

March 2026

v0.3 — Imperial College AI Agent Hackathon

Security audit (32 fixes). Full README overhaul. Production deployment of RoboTerri Telegram bot.

February 2026

v0.2 — Tests and CI

57 automated tests. GitHub Actions CI pipeline. ClawHub skill registry.

January 2026

v0.1 — First public release

Core skills: variant annotation, pharmacogenomics, equity scoring, nutrigenomics. The Corpasome as demo genome.

Join the community

ClawBio is open to contributions. Wanted skills include GWAS automation (PLINK/REGENIE), clinical ACMG classification, pathway enrichment (GO/KEGG), phylogenetics, and spatial transcriptomics. See the contributing guide to get started.

Background

Variant interpretation: the biology before the AI

This section covers the key concepts you need before running the practical.

Genomic variation in a nutshell

Every human genome carries 4 to 5 million single nucleotide polymorphisms (SNPs) compared to the reference genome. Most are benign. A small fraction affect protein function, drug metabolism, or disease risk. The challenge is finding the variants that matter in a sea of noise.

Beyond SNPs, variation includes insertions/deletions (indels), copy number variants (CNVs), and structural rearrangements. This workshop focuses on SNPs and small indels because they are what consumer genotyping platforms (23andMe, AncestryDNA) measure.

The ACMG five-tier classification

The American College of Medical Genetics and Genomics (ACMG) defines five categories for variant classification:

Category	Meaning	Clinical action
Pathogenic	Directly contributes to disease	Report. Genetic counselling.
Likely pathogenic	Strong evidence, not conclusive	Report with caveat.
VUS	Uncertain significance	Do not act on. May be reclassified.
Likely benign	Probably no clinical effect	Generally not reported.
Benign	No disease association	Not reported.

Key point: VUS (Variant of Uncertain Significance) is the honest answer when there is not enough evidence. You will never catch up with the classification backlog. Neither will AI. Learning to communicate uncertainty is a core clinical skill.

The annotation pipeline

A standard variant interpretation workflow follows this chain:

VCF → VEP → ClinVar → gnomAD → ACMG → Report

VCF: Variant Call Format, the standard file for storing genomic variants
VEP: Ensembl Variant Effect Predictor, determines functional consequence (missense, synonymous, etc.)
ClinVar: NCBI database of variant-disease associations
gnomAD: Genome Aggregation Database, population allele frequencies across 76,000+ genomes
ACMG: Classification framework that combines all evidence into a five-tier verdict

Pharmacogenomics: when your genome affects your medication

Pharmacogenomics (PGx) studies how genetic variation affects drug response. The key genes and their clinical impact:

Gene	Drugs affected	Clinical consequence
CYP2D6	Codeine, tamoxifen, SSRIs (51 drugs total)	Poor metabolisers get no pain relief from codeine
CYP2C19	Clopidogrel (Plavix), PPIs	Poor metabolisers: clopidogrel does not work
CYP2C9 + VKORC1	Warfarin	Wrong dose causes dangerous bleeding or clotting
TPMT	Azathioprine, 6-MP	Poor metabolisers: severe bone marrow toxicity
DPYD	5-fluorouracil, capecitabine	Deficiency can be fatal at standard chemotherapy doses

CPIC (Clinical Pharmacogenetics Implementation Consortium) publishes evidence-based guidelines that map genotype to drug recommendation. ClawBio's pharmgx-reporter skill implements these guidelines directly.

The equity problem in genomics

Genomic databases are heavily biased toward European populations:

86% of GWAS participants are of European ancestry
BRCA variant databases have 30x more entries for European populations
Polygenic risk scores lose up to 80% accuracy in non-European populations
44% of neglected tropical diseases have zero dedicated genomic research infrastructure

AI trained on biased data amplifies existing disparities. A variant classified as "benign" in European databases may be pathogenic in another population but simply unstudied. ClawBio's equity-scorer skill quantifies this gap using the HEIM (Health Equity Impact Metric) framework.

The Corpasome: a real open genome

In 2013, Manuel Corpas published his 23andMe genotype data under a CC0 (public domain) licence, making it one of the first fully open personal genomes. This workshop uses the Corpasome as its primary dataset.

Real findings from this genome include:

Factor V Leiden (rs6025): carrier for thrombophilia risk
HFE C282Y (rs1800562): carrier for hereditary haemochromatosis
CFTR deltaF508 (rs113993960): carrier for cystic fibrosis
VKORC1 + CYP2C9: warfarin sensitivity (AVOID standard dose)
MTHFR C677T (rs1801133): folate metabolism variant
APOE e3/e4: elevated Alzheimer's risk factor

Citation: Corpas, M. (2013). Crowdsourcing the Corpasome. Source Code for Biology and Medicine, 8, 13. doi:10.1186/1751-0473-8-13

Materials

Workshop materials and links

Everything you need to run the practical. No local installation required.

Essential links

Resource	Link	Notes
Google Colab notebook	Open in Colab	Main practical. Runs in browser, free tier.
Lecture slides (PPTX)	Download	ClawBio overview deck
ClawBio GitHub	github.com/ClawBio/ClawBio	Source code, skills, documentation
Corpasome paper	doi:10.1186/1751-0473-8-13	Corpas (2013), Source Code Biol Med
Ensembl VEP	ensembl.org/vep	Variant Effect Predictor (public REST API)
ClinVar	ncbi.nlm.nih.gov/clinvar	Variant-disease associations database
gnomAD	gnomad.broadinstitute.org	Population allele frequency data
CPIC Guidelines	cpicpgx.org	Pharmacogenomics clinical guidelines
ACMG Standards	Richards et al. (2015)	Genetics in Medicine 17(5):405-24

Skills used in this workshop

variant-annotation

Annotates VCF variants via Ensembl VEP REST API. Extracts ClinVar significance, gnomAD frequencies, and assigns Tier 1-4 priority. Outputs report, TSV, and JSON.

pharmgx-reporter

Pharmacogenomic report from 23andMe/AncestryDNA data. 12 genes, 31 SNPs, 51 drugs. CPIC-grounded, zero external dependencies, runs offline.

clinpgx

Deep gene-drug lookup via the ClinPGx API. Provides detailed CPIC guideline context, PharmGKB annotations, and FDA label information.

Requirements

A Google account (for Colab access)
A web browser (Chrome, Firefox, or Safari)
No installation, no API keys, no payment required
Approximately 30 minutes for the guided practical

Walkthrough

Step-by-step workshop instructions

Open the Colab notebook and follow along.

Setup (2 minutes)

Run the first two code cells. They clone the ClawBio repository and install dependencies (pysam, requests, pandas, matplotlib). You should see:

          ClawBio loaded successfully

          Skills available: 39

If Colab is slow

The git clone takes 10-20 seconds. If it times out, click "Runtime > Restart and run all". The Colab free tier occasionally throttles new sessions.

Explore the Corpasome (5 minutes)

The notebook loads Manuel Corpas's 23andMe genotype file (gzipped, ~600,000 SNPs). You will see:

The 23andMe file format: rsID, chromosome, position, genotype
Total SNP count across all chromosomes
A per-chromosome breakdown showing Chr 1 has the most variants and Chr 22 the fewest

Discussion point: Why does chromosome 1 have the most SNPs? (It is the largest chromosome, ~249 Mb.)
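
The four-column format described above can be read with a few lines of standard-library Python. This is a sketch under stated assumptions: the sample rows, rsIDs, and positions are made up, and `read_genotypes` is a hypothetical helper, not part of the notebook.

```python
import io
from collections import Counter

# Illustrative stand-in for a 23andMe raw data file (made-up rsIDs and positions).
SAMPLE = """\
# rsid\tchromosome\tposition\tgenotype
rs0000001\t1\t1000\tAG
rs0000002\t1\t2000\tCC
rs0000003\t22\t3000\t--
rs0000004\tMT\t400\tA
"""


def read_genotypes(handle):
    """Yield (rsid, chromosome, position, genotype) tuples from a 23andMe-style file."""
    for line in handle:
        line = line.rstrip("\n")
        if not line or line.startswith("#"):
            continue  # header and comment lines start with '#'
        rsid, chrom, pos, genotype = line.split("\t")
        yield rsid, chrom, int(pos), genotype


records = list(read_genotypes(io.StringIO(SAMPLE)))
per_chromosome = Counter(chrom for _, chrom, _, _ in records)
no_calls = sum(1 for _, _, _, genotype in records if genotype == "--")

print(len(records), dict(per_chromosome), no_calls)
```

For a gzipped file, the same function should work on a handle opened with `gzip.open(path, "rt")`.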

Convert to VCF (3 minutes)

The notebook extracts 21 clinically relevant variants from the full genotype file and converts them to VCF format. These span:

Pharmacogenomics: CYP2C19, CYP2C9, CYP2D6, VKORC1, TPMT, MTHFR
Cancer risk: BRCA1, TP53
Cardiovascular: Factor V (F5), Prothrombin (F2), HFE
Other Mendelian: CFTR, APOE, SERPINA1

The output VCF is small enough to annotate in seconds using the free Ensembl REST API.
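
A conversion of this kind can be sketched as follows. The notebook's actual converter is not shown here, so everything in this block is an assumption: the `ALLELES` lookup, its coordinates, and the helper names are hypothetical, and a real converter also has to handle strand, haploid calls, and reference-build differences.

```python
# Hypothetical lookup: rsID -> (chrom, pos, ref, alt). Values are made up.
ALLELES = {
    "rs0000001": ("1", 1000, "A", "G"),
    "rs0000002": ("1", 2000, "C", "T"),
}

VCF_HEADER = [
    "##fileformat=VCFv4.2",
    '##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">',
    "\t".join(["#CHROM", "POS", "ID", "REF", "ALT", "QUAL", "FILTER",
               "INFO", "FORMAT", "SAMPLE"]),
]


def genotype_to_gt(genotype, ref, alt):
    """Translate a two-letter genotype such as 'AG' into a VCF GT string."""
    if len(genotype) != 2:
        return None  # haploid calls are out of scope for this sketch
    codes = []
    for base in genotype:
        if base == ref:
            codes.append("0")
        elif base == alt:
            codes.append("1")
        else:
            return None  # no-call ('--') or an allele missing from the lookup
    return "/".join(sorted(codes))


def to_vcf(genotypes):
    """Build VCF text for the rsIDs that appear in the lookup table."""
    lines = list(VCF_HEADER)
    for rsid, genotype in genotypes.items():
        if rsid not in ALLELES:
            continue
        chrom, pos, ref, alt = ALLELES[rsid]
        gt = genotype_to_gt(genotype, ref, alt)
        if gt is None:
            continue
        lines.append("\t".join([chrom, str(pos), rsid, ref, alt,
                                ".", ".", ".", "GT", gt]))
    return "\n".join(lines) + "\n"


vcf_text = to_vcf({"rs0000001": "AG", "rs0000002": "CC", "rs0000003": "--"})
print(vcf_text)
```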

Run variant annotation (5 minutes)

This is the core analysis step. The variant-annotation skill sends the 21 variants to Ensembl VEP and enriches them with ClinVar and gnomAD data. The output includes:

A report.md with a prioritised summary of findings
An annotated_variants.tsv table with per-variant details
A result.json for programmatic access

What to watch for

The VEP API processes 21 variants in a single batch (under the 200-variant limit). You should see status messages as each batch is submitted and cached. If the API is slow, the skill will retry automatically.
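
The batching described above can be sketched without touching the network. The block below only prepares requests for the public Ensembl REST `vep/human/region` POST endpoint. The helper names and the placeholder variants are assumptions, and the skill's real retry and caching logic is not reproduced.

```python
import json
import urllib.request

VEP_REGION_URL = "https://rest.ensembl.org/vep/human/region"
MAX_BATCH = 200  # batch limit quoted in the text above


def chunk(items, size):
    """Split a list into consecutive batches of at most `size` items."""
    return [items[i:i + size] for i in range(0, len(items), size)]


def build_vep_request(variant_lines):
    """Prepare (but do not send) a POST request for one batch of variants."""
    body = json.dumps({"variants": variant_lines}).encode("utf-8")
    return urllib.request.Request(
        VEP_REGION_URL,
        data=body,
        headers={"Content-Type": "application/json",
                 "Accept": "application/json"},
        method="POST",
    )


# 21 placeholder variants in "CHROM POS ID REF ALT . . ." form (made-up values).
variants = [f"1 {1000 + i} rs{i:07d} A G . . ." for i in range(1, 22)]
batches = chunk(variants, MAX_BATCH)
requests_to_send = [build_vep_request(batch) for batch in batches]

print(len(batches), requests_to_send[0].get_method())
# To submit a batch: urllib.request.urlopen(request), then json.load() the response.
```

With 21 variants there is a single batch; a 450-variant file would be split into three.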

Pharmacogenomic interpretation (5 minutes)

The clinpgx skill maps the annotated variants to CPIC drug recommendations. The key output is a gene-by-gene metaboliser profile and a drug recommendation table.

The warfarin finding is the highlight of this step: the combination of VKORC1 TT (high sensitivity) and CYP2C9 *1/*2 (intermediate metaboliser) triggers a recommendation to avoid the standard dose or to reduce it significantly. Without genotyping, a standard dose could cause dangerous bleeding.
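
The step from diplotype to metaboliser label can be illustrated with an activity-score lookup. This is a simplified sketch, not the clinpgx skill's code: it covers only three CYP2C9 alleles, and the allele values and thresholds follow the activity-score convention as understood here, so treat them as assumptions and check the current CPIC allele tables before relying on them.

```python
# Simplified allele activity values (assumption: only *1, *2 and *3 are covered).
CYP2C9_ACTIVITY = {"*1": 1.0, "*2": 0.5, "*3": 0.0}


def cyp2c9_phenotype(diplotype):
    """Return (activity score, phenotype label) for a diplotype such as '*1/*2'."""
    first, second = diplotype.split("/")
    score = CYP2C9_ACTIVITY[first] + CYP2C9_ACTIVITY[second]
    if score >= 2.0:
        label = "normal metaboliser"
    elif score >= 1.0:
        label = "intermediate metaboliser"
    else:
        label = "poor metaboliser"
    return score, label


print(cyp2c9_phenotype("*1/*2"))
```

A diplotype containing an allele outside the table raises `KeyError`, which is deliberate: guessing a function value for an unknown allele is exactly the failure mode grounded skills are meant to avoid.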

Exercises (15 minutes, independent work)

Three exercises for students:

Exercise	Task	Status
5a	Run variant-annotation on the bundled 20-variant synthetic ClinVar panel (`--demo` flag). Compare the findings with your Corpasome results.	Required
5b	Upload your own 23andMe or AncestryDNA file and re-run Steps 2-4 on your data. Privacy note: data stays in Colab, deleted on session end.	Optional
5c	Pick one gene from the results. Research its function, ACMG classification, gnomAD frequency, and write a brief interpretation: would you report this to a patient?	Required

Results guide

Understanding your results

What the output tables and reports mean, and how to interpret the key findings.

Priority tiers

The variant-annotation skill assigns every variant a priority tier based on clinical significance, population frequency, and functional impact:

Tier	Criteria	Example from Corpasome
Tier 1	Pathogenic or likely pathogenic in ClinVar. Rare in gnomAD (AF < 0.001).	CFTR deltaF508 (rs113993960), carrier for cystic fibrosis
Tier 2	Drug response variant or established risk factor. CPIC-actionable.	VKORC1 rs9923231 TT, warfarin high sensitivity
Tier 3	Variant of uncertain significance. Insufficient evidence to classify.	Rare missense variants with no ClinVar entry
Tier 4	Benign or likely benign. Common in populations (> 1% frequency).	MTHFR A1298C (rs1801131), common polymorphism
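
The tier criteria in the table can be expressed as a small decision function. This is a sketch of the stated rules, not the skill's actual scoring code. The function name is hypothetical, and rarity is treated as supporting evidence rather than a hard requirement for Tier 1, which is an assumption.

```python
def assign_tier(clinvar_significance, gnomad_af=None):
    """Map a ClinVar label and an optional gnomAD frequency to a tier (1-4)."""
    significance = (clinvar_significance or "").strip().lower()
    if significance in {"pathogenic", "likely pathogenic"}:
        return 1
    if significance in {"drug response", "risk factor"}:
        return 2
    if significance in {"benign", "likely benign"}:
        return 4
    if gnomad_af is not None and gnomad_af > 0.01:
        return 4  # common in populations (> 1% frequency)
    return 3  # uncertain significance or no ClinVar entry


print(assign_tier("Pathogenic", 0.0004))
print(assign_tier("Drug response", 0.35))
print(assign_tier("", None))
print(assign_tier("Benign", 0.30))
```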

Key findings from the Corpasome

Factor V Leiden (rs6025) Tier 1

Gene: F5 (coagulation factor V). Genotype: heterozygous carrier.
Clinical meaning: 3-8x increased risk of venous thromboembolism (blood clots). The most common inherited thrombophilia in Europeans (~5% carrier frequency). Relevant for oral contraceptive prescribing, surgery planning, and long-haul travel advice.
Action: Reportable finding. Genetic counselling recommended for family cascade testing.

HFE C282Y (rs1800562) Tier 1

Gene: HFE (homeostatic iron regulator). Genotype: heterozygous carrier.
Clinical meaning: Carrier for hereditary haemochromatosis. Homozygotes (C282Y/C282Y) accumulate excess iron, leading to liver damage, diabetes, and heart failure if untreated. Heterozygous carriers have mildly elevated iron but rarely develop clinical disease.
Action: Monitor serum ferritin periodically. No treatment needed for carriers.

CFTR deltaF508 (rs113993960) Tier 1

Gene: CFTR (cystic fibrosis transmembrane conductance regulator). Genotype: heterozygous carrier.
Clinical meaning: Carrier for cystic fibrosis, the most common lethal autosomal recessive condition in Europeans (~1 in 25 carrier frequency). Two copies needed for disease. Relevant for reproductive planning.
Action: Partner testing recommended before family planning.

Warfarin: CYP2C9 + VKORC1 Tier 2

Genes: CYP2C9 (*1/*2, intermediate metaboliser) + VKORC1 (rs9923231 TT, high sensitivity).
Clinical meaning: This combination means warfarin is metabolised more slowly than average AND the drug target is more sensitive. Standard dosing would cause dangerously high drug levels and serious bleeding risk.
CPIC recommendation: AVOID standard dose. Use pharmacogenomic-guided dosing algorithm or consider alternative anticoagulants (DOACs).
Why this matters: Warfarin has a narrow therapeutic window. Too little means clotting; too much means haemorrhage. This is the textbook example of pharmacogenomics saving lives.

APOE e3/e4 (rs429358 + rs7412) Risk factor

Gene: APOE (apolipoprotein E). Genotype: e3/e4.
Clinical meaning: The e4 allele is the strongest common genetic risk factor for late-onset Alzheimer's disease. One copy (e3/e4) increases risk approximately 3-fold compared to e3/e3. Two copies (e4/e4) increase risk ~12-fold. However, many e4 carriers never develop Alzheimer's, and many Alzheimer's patients do not carry e4.
Ethical note: APOE is not on the ACMG secondary findings list (SF v3.2), so ACMG does not recommend routine disclosure. If the result is returned, it should be accompanied by counselling. The result is probabilistic, not deterministic.
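
The APOE genotype is inferred from the two SNPs named in the heading. A minimal sketch, assuming forward-strand alleles as reported in consumer genotype files and ignoring the very rare haplotype that carries both changes; the function name is hypothetical.

```python
# Haplotype definitions as (rs429358 allele, rs7412 allele):
#   e2 = (T, T)   e3 = (T, C)   e4 = (C, C)


def apoe_genotype(rs429358, rs7412):
    """Infer the APOE genotype from two unphased two-letter genotypes."""
    for call in (rs429358, rs7412):
        if len(call) != 2 or not set(call) <= {"C", "T"}:
            return None  # no-call or unexpected allele
    e4_count = rs429358.count("C")  # each C at rs429358 marks an e4 allele
    e2_count = rs7412.count("T")    # each T at rs7412 marks an e2 allele
    if e4_count + e2_count > 2:
        return None  # would need the rare haplotype carrying both changes
    alleles = ["e2"] * e2_count + ["e4"] * e4_count
    alleles += ["e3"] * (2 - len(alleles))
    return "/".join(sorted(alleles))


print(apoe_genotype("CT", "CC"))
```

Because the two genotypes are unphased, a double heterozygote is reported as e2/e4 here; that call rests on the assumption stated above.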

MTHFR C677T (rs1801133) Tier 2

Gene: MTHFR (methylenetetrahydrofolate reductase). Genotype: heterozygous.
Clinical meaning: Reduced enzyme activity for folate metabolism. Heterozygotes retain ~65% activity (not clinically significant for most people). Homozygotes (~35% activity) may benefit from methylfolate supplementation, especially during pregnancy. The variant is extremely common (~30-40% of Europeans are carriers).
Context: MTHFR is frequently over-interpreted in direct-to-consumer reports. Most carriers require no clinical action.

Reading the annotated variants table

The TSV output contains one row per variant. Key columns:

Column	What it means
`gene`	Gene symbol (e.g., CYP2D6, CFTR)
`consequence`	Functional effect: missense_variant, synonymous, frameshift, etc.
`impact`	VEP impact tier: HIGH, MODERATE, LOW, MODIFIER
`clinvar_significance`	ClinVar classification: Pathogenic, Likely pathogenic, VUS, Benign, Drug response
`gnomad_af`	Global allele frequency in gnomAD. Values below 0.001 (0.1%) are considered rare.
`priority_tier`	ClawBio's computed tier (1-4) combining all evidence fields
`priority_score`	Numeric score for ranking within a tier. Higher means more clinically relevant.
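
A table with these columns can be filtered and ranked with the standard `csv` module. The rows below are invented for illustration (gene names and scores are placeholders); only the column names come from the table above.

```python
import csv
import io

# Illustrative rows only; a real file comes from the variant-annotation skill.
SAMPLE_TSV = (
    "gene\tconsequence\timpact\tclinvar_significance\tgnomad_af\t"
    "priority_tier\tpriority_score\n"
    "GENE_A\tmissense_variant\tMODERATE\tPathogenic\t0.0004\t1\t9.1\n"
    "GENE_B\tintron_variant\tMODIFIER\tDrug response\t0.35\t2\t6.0\n"
    "GENE_C\tmissense_variant\tMODERATE\t\t\t3\t2.5\n"
    "GENE_D\tsynonymous_variant\tLOW\tBenign\t0.30\t4\t0.5\n"
    "GENE_E\tframeshift_variant\tHIGH\tPathogenic\t0.0001\t1\t9.8\n"
)

rows = list(csv.DictReader(io.StringIO(SAMPLE_TSV), delimiter="\t"))

# Keep Tier 1 and Tier 2, ordered by tier, then by descending score.
actionable = sorted(
    (row for row in rows if int(row["priority_tier"]) <= 2),
    key=lambda row: (int(row["priority_tier"]), -float(row["priority_score"])),
)

# Rare variants: frequency present and below 0.001 (0.1%).
rare = [row["gene"] for row in rows
        if row["gnomad_af"] and float(row["gnomad_af"]) < 0.001]

print([row["gene"] for row in actionable])
print(rare)
```

An empty `gnomad_af` is skipped rather than treated as zero, since a missing frequency is not evidence that a variant is rare.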

Important limitations

Consumer genotyping arrays (23andMe, AncestryDNA) test ~600,000 of the genome's ~3 billion positions (roughly 0.02%). They miss structural variants and most rare variants, and they cannot reliably detect copy number changes. A "clear" report from a genotyping array does not mean the genome is free of pathogenic variants. Clinical-grade whole genome sequencing covers far more ground.

Summary

Take-home messages

The fundamentals have not changed

Variant interpretation still requires understanding of molecular biology, population genetics, clinical context, and the ACMG framework. AI accelerates the mechanical steps (annotation, database lookups, prioritisation), but it does not replace the human judgement needed to decide whether a variant is clinically actionable.

Speed has changed dramatically

What used to take a bioinformatician days (downloading tools, configuring environments, running VEP, parsing output, cross-referencing databases) now takes minutes with agent-driven skills. The bottleneck shifts from data processing to interpretation and clinical decision-making.

Pharmacogenomics is actionable today

Drug-gene interactions like warfarin/CYP2C9/VKORC1 are well-established, guideline-supported, and directly affect prescribing decisions. This is not hypothetical future medicine. It is already implemented in leading hospitals through pre-emptive PGx testing.

VUS is the honest answer

Over half of all variants in ClinVar are classified as VUS. The backlog is growing faster than reclassification efforts. Communicating uncertainty to patients, rather than overpromising on what genomics can deliver, is a core skill for anyone working in this field.

Equity gaps are real and growing

AI systems trained on biased data amplify existing disparities. A variant that appears benign in European databases may be pathogenic in an understudied population. Every genomic analysis should consider the ancestry context of the individual and the reference databases being queried.

Open data enables open science

This entire workshop runs on a CC0-licensed genome, open-source skills, free public APIs, and a free Colab notebook. Reproducible, accessible, and transparent. Anyone in the world can run the same analysis and get the same results. That is the standard to aim for.

Medical disclaimer

ClawBio is a research and educational tool. It is not a medical device and does not provide clinical diagnoses. The findings discussed in this workshop are for educational purposes only. Consult a healthcare professional before making any medical decisions based on genetic data.
