Systems Biology Workshop

Agentic Variant Interpretation

Hands-on with a real human genome. Annotate clinically relevant variants, interpret pharmacogenomic findings, and understand what AI changes (and does not change) about genomic analysis.

Dr Manuel Corpas · University of Westminster · 27 March 2026

What is ClawBio?

The first bioinformatics-native AI agent skill library. Curated, reproducible, open-source.

Bioinformatics skills for AI agents

ClawBio is a collection of self-contained, reproducible bioinformatics skills that any AI agent can call. Each skill handles a specific task: annotating variants, scoring pharmacogenomic risk, running differential expression, searching clinical trials, and more. The skills run locally, keep genetic data on your machine, and produce structured, auditable outputs.

Think of it as a toolbox. The AI agent decides which tool to pick up. The tool does the analysis. You review the results.

Local-first. Reproducible. Open.

  • Local-first: your genomic data never leaves your laptop. Skills process everything in-place.
  • Reproducible: every skill exports commands.sh, environment.yml, and SHA-256 checksums. Re-run any analysis and get the same output.
  • Open-source: MIT licensed. Fork it, extend it, contribute back.
  • Equity-aware: built-in HEIM diversity metrics flag when analyses depend on biased reference data.
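
The checksum part of the reproducibility claim can be illustrated with the Python standard library. This is a minimal sketch, not ClawBio's own code: the function names `sha256_of_file` and `checksum_manifest` are hypothetical, and a temporary directory stands in for a real report folder.

```python
import hashlib
import tempfile
from pathlib import Path


def sha256_of_file(path: Path, chunk_size: int = 65536) -> str:
    """Return the hex SHA-256 digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def checksum_manifest(directory: Path) -> dict[str, str]:
    """Map each file under `directory` (relative path) to its digest."""
    return {
        str(p.relative_to(directory)): sha256_of_file(p)
        for p in sorted(directory.rglob("*"))
        if p.is_file()
    }


# Demo on a throwaway directory standing in for a report/ folder.
with tempfile.TemporaryDirectory() as tmp:
    report_dir = Path(tmp)
    (report_dir / "report.md").write_text("demo report\n")
    manifest = checksum_manifest(report_dir)
    rerun = checksum_manifest(report_dir)

print(manifest)
```

Comparing the manifest from a re-run against the stored one is a quick way to confirm that an analysis reproduced byte for byte.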

From raw data to clinical-grade report in one command

# Traditional approach: 6 manual steps across 3 tools
vep --input_file sample.vcf --output_file vep_out.txt --cache
# ...parse JSON, cross-reference ClinVar, check gnomAD, read CPIC tables...

# ClawBio: one command
python clawbio.py run variant-annotation --input sample.vcf --output report/
Annotated 21 variants. 3 Tier 1 (pathogenic). 6 Tier 2 (drug response). Report saved.

Why bioinformatics needs agent skills

🚫

Reproducibility is broken

Dependency hell, dead links to reference data, hardcoded paths, missing configs. Most published bioinformatics analyses cannot be re-run a year later.

🤖

AI hallucinates biology

Large language models guess star alleles, invent gene-drug associations, and cite retracted papers. Without grounded skills, AI in genomics is unreliable.

Manual pipelines take weeks

A clinical-grade variant annotation requires VEP, ClinVar lookup, gnomAD frequency checks, CPIC cross-referencing, and manual prioritisation. Each step is a separate tool with its own interface.

🌍

Equity gaps persist

86% of GWAS participants are of European descent. Polygenic risk scores lose up to 80% accuracy in non-European populations. AI trained on biased data amplifies existing disparities.

ClawBio's solution

Every skill is self-contained with pinned dependencies, demo data, and reproducibility metadata. The AI agent routes to the right skill, the skill does the grounded analysis, and the human reviews the structured output. No hallucination. No broken pipelines.

Growth and contributors

488 GitHub Stars
85 Forks
39 Skills
13 Contributors

Project milestones

March 2026

v0.4 — Galaxy Integration

Bridge to 8,000+ Galaxy tools. BioBlend SDK integration. Cross-platform skill chaining.

March 2026

v0.3.1 — Agent-Friendly

Added llms.txt, AGENTS.md, and machine-readable catalog.json so any AI agent can discover and use ClawBio skills automatically.

March 2026

v0.3 — Imperial College AI Agent Hackathon

Security audit (32 fixes). Full README overhaul. Production deployment of RoboTerri Telegram bot.

February 2026

v0.2 — Tests and CI

57 automated tests. GitHub Actions CI pipeline. ClawHub skill registry.

January 2026

v0.1 — First public release

Core skills: variant annotation, pharmacogenomics, equity scoring, nutrigenomics. The Corpasome as demo genome.

Join the community

ClawBio is open to contributions. Wanted skills include GWAS automation (PLINK/REGENIE), clinical ACMG classification, pathway enrichment (GO/KEGG), phylogenetics, and spatial transcriptomics. See the contributing guide to get started.

Variant interpretation: the biology before the AI

This section covers the key concepts you need before running the practical.

Genomic variation in a nutshell

Every human genome carries 4 to 5 million single nucleotide polymorphisms (SNPs) compared to the reference genome. Most are benign. A small fraction affect protein function, drug metabolism, or disease risk. The challenge is finding the variants that matter in a sea of noise.

Beyond SNPs, variation includes insertions/deletions (indels), copy number variants (CNVs), and structural rearrangements. This workshop focuses on SNPs and small indels because they are what consumer genotyping platforms (23andMe, AncestryDNA) measure.

The ACMG five-tier classification

The American College of Medical Genetics and Genomics (ACMG) defines five categories for variant classification:

Category | Meaning | Clinical action
Pathogenic | Directly contributes to disease | Report. Genetic counselling.
Likely pathogenic | Strong evidence, not conclusive | Report with caveat.
VUS | Uncertain significance | Do not act on. May be reclassified.
Likely benign | Probably no clinical effect | Generally not reported.
Benign | No disease association | Not reported.

Key point: VUS (Variant of Uncertain Significance) is the honest answer when there is not enough evidence. You will never catch up with the classification backlog. Neither will AI. Learning to communicate uncertainty is a core clinical skill.

The annotation pipeline

A standard variant interpretation workflow follows this chain:

VCF → VEP → ClinVar → gnomAD → ACMG → Report
  • VCF: Variant Call Format, the standard file for storing genomic variants
  • VEP: Ensembl Variant Effect Predictor, determines functional consequence (missense, synonymous, etc.)
  • ClinVar: NCBI database of variant-disease associations
  • gnomAD: Genome Aggregation Database, population allele frequencies across 76,000+ genomes
  • ACMG: Classification framework that combines all evidence into a five-tier verdict
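
The first link in this chain, reading variants out of a VCF, can be sketched in a few lines. This is a simplified illustration that parses only the five leading fixed columns; the `VcfRecord` class and function names are hypothetical, and the example record is a placeholder rather than Corpasome data.

```python
from dataclasses import dataclass


@dataclass
class VcfRecord:
    chrom: str
    pos: int
    variant_id: str
    ref: str
    alts: list[str]


def parse_vcf_line(line: str) -> VcfRecord:
    """Parse the first five fixed columns of a VCF data line."""
    chrom, pos, variant_id, ref, alt = line.rstrip("\n").split("\t")[:5]
    return VcfRecord(chrom, int(pos), variant_id, ref, alt.split(","))


def read_vcf(lines):
    """Yield records, skipping '##' meta lines and the '#CHROM' header."""
    for line in lines:
        if line.startswith("#") or not line.strip():
            continue
        yield parse_vcf_line(line)


# Illustrative record: the coordinate and alleles are placeholders.
example = [
    "##fileformat=VCFv4.2\n",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\n",
    "1\t1000\trs0000001\tC\tT\t.\t.\t.\n",
]
records = list(read_vcf(example))
print(records[0])
```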

Pharmacogenomics: when your genome affects your medication

Pharmacogenomics (PGx) studies how genetic variation affects drug response. The key genes and their clinical impact:

Gene | Drugs affected | Clinical consequence
CYP2D6 | Codeine, tamoxifen, SSRIs (51 drugs total) | Poor metabolisers get no pain relief from codeine
CYP2C19 | Clopidogrel (Plavix), PPIs | Poor metabolisers: clopidogrel does not work
CYP2C9 + VKORC1 | Warfarin | Wrong dose causes dangerous bleeding or clotting
TPMT | Azathioprine, 6-MP | Poor metabolisers: severe bone marrow toxicity
DPYD | 5-fluorouracil, capecitabine | Deficiency can be fatal at standard chemotherapy doses

CPIC (Clinical Pharmacogenetics Implementation Consortium) publishes evidence-based guidelines that map genotype to drug recommendation. ClawBio's pharmgx-reporter skill implements these guidelines directly.
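
To make the genotype-to-phenotype step concrete, here is a small sketch for CYP2C19. It is not the pharmgx-reporter implementation. The allele functions and phenotype pairs reflect my understanding of CPIC's CYP2C19 assignments for four common star alleles and should be checked against the current CPIC tables before any real use; the function name is hypothetical.

```python
# Allele function for four common CYP2C19 star alleles (assumed from CPIC tables).
ALLELE_FUNCTION = {
    "*1": "normal",
    "*2": "none",
    "*3": "none",
    "*17": "increased",
}

# Phenotype for each alphabetically sorted pair of allele functions.
PHENOTYPE = {
    ("none", "none"): "Poor metaboliser",
    ("none", "normal"): "Intermediate metaboliser",
    ("increased", "none"): "Intermediate metaboliser",
    ("normal", "normal"): "Normal metaboliser",
    ("increased", "normal"): "Rapid metaboliser",
    ("increased", "increased"): "Ultrarapid metaboliser",
}


def cyp2c19_phenotype(diplotype: str) -> str:
    """Map a diplotype such as '*1/*2' to a metaboliser phenotype."""
    first, second = diplotype.split("/")
    pair = tuple(sorted([ALLELE_FUNCTION[first], ALLELE_FUNCTION[second]]))
    return PHENOTYPE[pair]


for diplotype in ["*1/*1", "*1/*2", "*2/*2", "*1/*17"]:
    print(diplotype, cyp2c19_phenotype(diplotype))
```

Star alleles outside the table raise a `KeyError`, which is deliberate: guessing a function for an unknown allele is the failure mode grounded skills are meant to avoid.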

The equity problem in genomics

Genomic databases are heavily biased toward European populations:

  • 86% of GWAS participants are of European ancestry
  • BRCA variant databases have 30x more entries for European populations
  • Polygenic risk scores lose up to 80% accuracy in non-European populations
  • 44% of neglected tropical diseases have zero dedicated genomic research infrastructure

AI trained on biased data amplifies existing disparities. A variant classified as "benign" in European databases may be pathogenic in another population but simply unstudied. ClawBio's equity-scorer skill quantifies this gap using the HEIM (Health Equity Impact Metric) framework.

The Corpasome: a real open genome

In 2013, Manuel Corpas published his 23andMe genotype data under a CC0 (public domain) licence, making it one of the first fully open personal genomes. This workshop uses the Corpasome as its primary dataset.

Real findings from this genome include:

  • Factor V Leiden (rs6025): carrier for thrombophilia risk
  • HFE C282Y (rs1800562): carrier for hereditary haemochromatosis
  • CFTR deltaF508 (rs113993960): carrier for cystic fibrosis
  • VKORC1 + CYP2C9: warfarin sensitivity (AVOID standard dose)
  • MTHFR C677T (rs1801133): folate metabolism variant
  • APOE e3/e4: elevated Alzheimer's risk factor

Citation: Corpas, M. (2013). Crowdsourcing the Corpasome. Source Code for Biology and Medicine, 8, 13. doi:10.1186/1751-0473-8-13

Workshop materials and links

Everything you need to run the practical. No local installation required.

Essential links

Resource | Link | Notes
Google Colab notebook | Open in Colab | Main practical. Runs in browser, free tier.
Lecture slides (PPTX) | Download | ClawBio overview deck
ClawBio GitHub | github.com/ClawBio/ClawBio | Source code, skills, documentation
Corpasome paper | doi:10.1186/1751-0473-8-13 | Corpas (2013), Source Code Biol Med
Ensembl VEP | ensembl.org/vep | Variant Effect Predictor (public REST API)
ClinVar | ncbi.nlm.nih.gov/clinvar | Variant-disease associations database
gnomAD | gnomad.broadinstitute.org | Population allele frequency data
CPIC Guidelines | cpicpgx.org | Pharmacogenomics clinical guidelines
ACMG Standards | Richards et al. (2015) | Genetics in Medicine 17(5):405-24

Skills used in this workshop

variant-annotation

Annotates VCF variants via Ensembl VEP REST API. Extracts ClinVar significance, gnomAD frequencies, and assigns Tier 1-4 priority. Outputs report, TSV, and JSON.

pharmgx-reporter

Pharmacogenomic report from 23andMe/AncestryDNA data. 12 genes, 31 SNPs, 51 drugs. CPIC-grounded, zero external dependencies, runs offline.

clinpgx

Deep gene-drug lookup via the ClinPGx API. Provides detailed CPIC guideline context, PharmGKB annotations, and FDA label information.

Requirements

Step-by-step workshop instructions

Open the Colab notebook and follow along.

Setup (2 minutes)

Run the first two code cells. They clone the ClawBio repository and install dependencies (pysam, requests, pandas, matplotlib). You should see:

ClawBio loaded successfully
Skills available: 39

If Colab is slow

The git clone takes 10-20 seconds. If it times out, click "Runtime > Restart and run all". The Colab free tier occasionally throttles new sessions.

Explore the Corpasome (5 minutes)

The notebook loads Manuel Corpas's 23andMe genotype file (gzipped, ~600,000 SNPs). You will see:

  • The 23andMe file format: rsID, chromosome, position, genotype
  • Total SNP count across all chromosomes
  • A per-chromosome breakdown showing Chr 1 has the most variants and Chr 22 the fewest

Discussion point: Why does chromosome 1 have the most SNPs? (It is the largest chromosome, ~249 Mb.)
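
The per-chromosome count can be reproduced with the standard library alone. The sketch below assumes the tab-separated four-column layout described above with `#` comment lines, and uses a tiny made-up gzipped file built in memory; it is not the notebook's own code, and the function names are hypothetical.

```python
import gzip
import io
from collections import Counter


def iter_genotypes(handle):
    """Yield (rsid, chrom, pos, genotype) from a 23andMe-style text stream."""
    for line in handle:
        if line.startswith("#") or not line.strip():
            continue
        rsid, chrom, pos, genotype = line.rstrip("\n").split("\t")[:4]
        yield rsid, chrom, int(pos), genotype


def count_per_chromosome(handle) -> Counter:
    """Count genotyped positions per chromosome."""
    return Counter(chrom for _, chrom, _, _ in iter_genotypes(handle))


# Build a tiny gzipped file in memory; rsIDs and positions are made up.
raw = (
    "# rsid\tchromosome\tposition\tgenotype\n"
    "rs0000001\t1\t1000\tAG\n"
    "rs0000002\t1\t2000\tCC\n"
    "rs0000003\t22\t3000\t--\n"
)
buffer = io.BytesIO(gzip.compress(raw.encode()))
with gzip.open(buffer, "rt") as handle:
    counts = count_per_chromosome(handle)

print(counts.most_common())
```

For a real file, pass the path to `gzip.open(path, "rt")` instead of the in-memory buffer.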

Convert to VCF (3 minutes)

The notebook extracts 21 clinically relevant variants from the full genotype file and converts them to VCF format. These span:

  • Pharmacogenomics: CYP2C19, CYP2C9, CYP2D6, VKORC1, TPMT, MTHFR
  • Cancer risk: BRCA1, TP53
  • Cardiovascular: Factor V (F5), Prothrombin (F2), HFE
  • Other Mendelian: CFTR, APOE, SERPINA1

The output VCF is small enough to annotate in seconds using the free Ensembl REST API.
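
The conversion step needs one thing the genotype file does not contain: the reference and alternate allele at each site. The sketch below assumes those are supplied by the caller, handles single-base biallelic sites only, and assumes genotypes are reported on the same strand as the reference. Function names and the example site are hypothetical.

```python
HEADER = (
    "##fileformat=VCFv4.2\n"
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\tSAMPLE"
)


def genotype_to_gt(genotype: str, ref: str, alt: str) -> str:
    """Translate a two-letter array genotype into a VCF GT string."""
    if len(genotype) != 2 or any(base not in (ref, alt) for base in genotype):
        return "./."  # no-call, haploid call, indel code, or allele mismatch
    indices = sorted("0" if base == ref else "1" for base in genotype)
    return "/".join(indices)


def to_vcf_line(rsid, chrom, pos, genotype, ref, alt) -> str:
    """Format one genotyped site as a tab-separated VCF data line."""
    gt = genotype_to_gt(genotype, ref, alt)
    return "\t".join([chrom, str(pos), rsid, ref, alt, ".", ".", ".", "GT", gt])


# Placeholder site: the position and alleles are illustrative only.
line = to_vcf_line("rs0000001", "1", 1000, "CT", ref="C", alt="T")
print(HEADER)
print(line)
```

Returning `./.` for anything unexpected keeps ambiguous calls visible in the VCF instead of silently coercing them to a genotype.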

Run variant annotation (5 minutes)

This is the core analysis step. The variant-annotation skill sends the 21 variants to Ensembl VEP and enriches them with ClinVar and gnomAD data. The output includes:

  • A report.md with a prioritised summary of findings
  • An annotated_variants.tsv table with per-variant details
  • A result.json for programmatic access

What to watch for

The VEP API processes 21 variants in a single batch (under the 200-variant limit). You should see status messages as each batch is submitted and cached. If the API is slow, the skill will retry automatically.
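
The batching logic described here can be sketched against the public Ensembl REST endpoint `POST /vep/human/region`, which accepts a JSON body with a `variants` list of space-separated VCF-style strings. This is an illustration, not the skill's code: the helper names are hypothetical, retries and caching are omitted, and the network call is defined but not executed.

```python
import json
import urllib.request

VEP_URL = "https://rest.ensembl.org/vep/human/region"
MAX_BATCH = 200  # batch limit quoted in the text above


def vcf_to_vep_string(vcf_line: str) -> str:
    """Keep the first eight VCF columns, space-separated."""
    return " ".join(vcf_line.rstrip("\n").split("\t")[:8])


def make_batches(vcf_lines, size=MAX_BATCH):
    """Split VCF data lines into lists of at most `size` variant strings."""
    variants = [vcf_to_vep_string(line) for line in vcf_lines
                if not line.startswith("#")]
    return [variants[i:i + size] for i in range(0, len(variants), size)]


def build_request(batch) -> urllib.request.Request:
    """Prepare the POST request for one batch without sending it."""
    body = json.dumps({"variants": batch}).encode()
    return urllib.request.Request(
        VEP_URL, data=body, method="POST",
        headers={"Content-Type": "application/json",
                 "Accept": "application/json"},
    )


def annotate(batch):
    """Send one batch to VEP (needs network access; not called here)."""
    with urllib.request.urlopen(build_request(batch), timeout=60) as response:
        return json.load(response)


# 21 placeholder variants fit in one batch, as in the workshop.
lines = [f"1\t{1000 + i}\t.\tC\tT\t.\t.\t." for i in range(21)]
batches = make_batches(lines)
print(len(batches), len(batches[0]))
```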

Pharmacogenomic interpretation (5 minutes)

The clinpgx skill maps the annotated variants to CPIC drug recommendations. The key output is a gene-by-gene metaboliser profile and a drug recommendation table.

The warfarin finding is the highlight of this step: the combination of VKORC1 TT (high sensitivity) and CYP2C9 *1/*2 (intermediate metaboliser) triggers an "AVOID or significantly reduce dose" recommendation. Without genotyping, a standard dose could cause dangerous bleeding.

Exercises (15 minutes, independent work)

Three exercises for students:

Exercise | Task | Status
5a | Run variant-annotation on the bundled 20-variant synthetic ClinVar panel (--demo flag). Compare the findings with your Corpasome results. | Required
5b | Upload your own 23andMe or AncestryDNA file and re-run Steps 2-4 on your data. Privacy note: data stays in Colab, deleted on session end. | Optional
5c | Pick one gene from the results. Research its function, ACMG classification, gnomAD frequency, and write a brief interpretation: would you report this to a patient? | Required

Understanding your results

What the output tables and reports mean, and how to interpret the key findings.

Priority tiers

The variant-annotation skill assigns every variant a priority tier based on clinical significance, population frequency, and functional impact:

Tier | Criteria | Example from Corpasome
Tier 1 | Pathogenic or likely pathogenic in ClinVar. Rare in gnomAD (AF < 0.001). | CFTR deltaF508 (rs113993960), carrier for cystic fibrosis
Tier 2 | Drug response variant or established risk factor. CPIC-actionable. | VKORC1 rs9923231 TT, warfarin high sensitivity
Tier 3 | Variant of uncertain significance. Insufficient evidence to classify. | Rare missense variants with no ClinVar entry
Tier 4 | Benign or likely benign. Common in populations (> 1% frequency). | MTHFR A1298C (rs1801131), common polymorphism
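
A rule-based reading of the tier table might look like the sketch below. It is an assumption-laden illustration, not the variant-annotation skill's actual logic: how the skill weighs the frequency criterion against the ClinVar label is not specified here, so this version lets the ClinVar label take precedence and falls back to frequency only when no decisive label exists. The function name is hypothetical.

```python
from typing import Optional


def assign_tier(clinvar_significance: Optional[str],
                gnomad_af: Optional[float]) -> int:
    """Return a priority tier (1-4) from a ClinVar label and allele frequency."""
    label = (clinvar_significance or "").strip().lower()
    if label in {"pathogenic", "likely pathogenic"}:
        return 1
    if label in {"drug response", "risk factor"}:
        return 2
    if label in {"benign", "likely benign"}:
        return 4
    # No decisive label: treat variants above 1% frequency as Tier 4.
    if gnomad_af is not None and gnomad_af > 0.01:
        return 4
    return 3  # uncertain significance or no evidence


examples = [
    ("Pathogenic", 0.0005),
    ("Drug response", 0.4),
    ("Uncertain significance", 0.0001),
    (None, 0.2),
]
for label, af in examples:
    print(label, af, assign_tier(label, af))
```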

Key findings from the Corpasome

Factor V Leiden (rs6025) Tier 1

Gene: F5 (coagulation factor V). Genotype: heterozygous carrier.
Clinical meaning: 3-8x increased risk of venous thromboembolism (blood clots). The most common inherited thrombophilia in Europeans (~5% carrier frequency). Relevant for oral contraceptive prescribing, surgery planning, and long-haul travel advice.
Action: Reportable finding. Genetic counselling recommended for family cascade testing.

HFE C282Y (rs1800562) Tier 1

Gene: HFE (homeostatic iron regulator). Genotype: heterozygous carrier.
Clinical meaning: Carrier for hereditary haemochromatosis. Homozygotes (C282Y/C282Y) accumulate excess iron, leading to liver damage, diabetes, and heart failure if untreated. Heterozygous carriers have mildly elevated iron but rarely develop clinical disease.
Action: Monitor serum ferritin periodically. No treatment needed for carriers.

CFTR deltaF508 (rs113993960) Tier 1

Gene: CFTR (cystic fibrosis transmembrane conductance regulator). Genotype: heterozygous carrier.
Clinical meaning: Carrier for cystic fibrosis, the most common lethal autosomal recessive condition in Europeans (~1 in 25 carrier frequency). Two copies needed for disease. Relevant for reproductive planning.
Action: Partner testing recommended before family planning.

Warfarin: CYP2C9 + VKORC1 Tier 2

Genes: CYP2C9 (*1/*2, intermediate metaboliser) + VKORC1 (rs9923231 TT, high sensitivity).
Clinical meaning: This combination means warfarin is metabolised more slowly than average AND the drug target is more sensitive. Standard dosing could cause dangerously high drug levels and serious bleeding risk.
CPIC recommendation: AVOID standard dose. Use pharmacogenomic-guided dosing algorithm or consider alternative anticoagulants (DOACs).
Why this matters: Warfarin has a narrow therapeutic window. Too little means clotting; too much means haemorrhage. This is the textbook example of pharmacogenomics saving lives.
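
The "intermediate metaboliser" call for CYP2C9 *1/*2 can be reproduced with an activity-score calculation. The sketch below covers only *1, *2, and *3, with allele values and phenotype cut-offs as I understand the CPIC CYP2C9 scheme; treat them as assumptions to verify against the current CPIC tables. It says nothing about dosing, and the function name is hypothetical.

```python
# Assumed CPIC activity values for three CYP2C9 star alleles.
ACTIVITY = {"*1": 1.0, "*2": 0.5, "*3": 0.0}


def cyp2c9_phenotype(diplotype: str):
    """Return (activity score, phenotype) for a diplotype such as '*1/*2'."""
    first, second = diplotype.split("/")
    score = ACTIVITY[first] + ACTIVITY[second]
    if score >= 2.0:
        phenotype = "Normal metaboliser"
    elif score >= 1.0:
        phenotype = "Intermediate metaboliser"
    else:
        phenotype = "Poor metaboliser"
    return score, phenotype


for diplotype in ["*1/*1", "*1/*2", "*2/*3", "*3/*3"]:
    print(diplotype, *cyp2c9_phenotype(diplotype))
```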

APOE e3/e4 (rs429358 + rs7412) Risk factor

Gene: APOE (apolipoprotein E). Genotype: e3/e4.
Clinical meaning: The e4 allele is the strongest common genetic risk factor for late-onset Alzheimer's disease. One copy (e3/e4) increases risk approximately 3-fold compared to e3/e3. Two copies (e4/e4) increase risk ~12-fold. However, many e4 carriers never develop Alzheimer's, and many Alzheimer's patients do not carry e4.
Ethical note: APOE is not on the ACMG secondary findings list (SF v3.2), so laboratories are not expected to report it routinely. Disclosure should be opt-in and accompanied by counselling. The result is probabilistic, not deterministic.
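
The e3/e4 call comes from combining two SNPs: a C at rs429358 marks e4 and a T at rs7412 marks e2, with e3 carrying T and C respectively. The sketch below infers the diplotype from unphased plus-strand genotypes. It assumes the very rare e1 haplotype (C at rs429358 together with T at rs7412) is absent, so a double heterozygote is read as e2/e4. The function name is hypothetical.

```python
def apoe_genotype(rs429358: str, rs7412: str) -> str:
    """Infer the APOE diplotype from two unphased plus-strand genotypes."""
    calls = rs429358 + rs7412
    if len(rs429358) != 2 or len(rs7412) != 2 or not set(calls) <= {"C", "T"}:
        return "undetermined"  # no-call or unexpected allele
    e4_count = rs429358.count("C")  # C at rs429358 marks e4
    e2_count = rs7412.count("T")    # T at rs7412 marks e2
    if e4_count + e2_count > 2:
        return "undetermined"       # would require the rare e1 haplotype
    e3_count = 2 - e2_count - e4_count
    alleles = ["e2"] * e2_count + ["e3"] * e3_count + ["e4"] * e4_count
    return "/".join(alleles)


for pair in [("TT", "CC"), ("CT", "CC"), ("TT", "CT"), ("CT", "CT")]:
    print(pair, apoe_genotype(*pair))
```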

MTHFR C677T (rs1801133) Tier 2

Gene: MTHFR (methylenetetrahydrofolate reductase). Genotype: heterozygous.
Clinical meaning: Reduced enzyme activity for folate metabolism. Heterozygotes retain ~65% activity (not clinically significant for most people). Homozygotes (~35% activity) may benefit from methylfolate supplementation, especially during pregnancy. The variant is extremely common (~30-40% of Europeans are carriers).
Context: MTHFR is frequently over-interpreted in direct-to-consumer reports. Most carriers require no clinical action.

Reading the annotated variants table

The TSV output contains one row per variant. Key columns:

Column | What it means
gene | Gene symbol (e.g., CYP2D6, CFTR)
consequence | Functional effect: missense_variant, synonymous, frameshift, etc.
impact | VEP impact tier: HIGH, MODERATE, LOW, MODIFIER
clinvar_significance | ClinVar classification: Pathogenic, Likely pathogenic, VUS, Benign, Drug response
gnomad_af | Global allele frequency in gnomAD. Values below 0.001 (0.1%) are considered rare.
priority_tier | ClawBio's computed tier (1-4) combining all evidence fields
priority_score | Numeric score for ranking within a tier. Higher means more clinically relevant.
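
For programmatic review, the table can be loaded with Python's `csv` module and filtered by tier. The sketch below uses the column names listed above, but the three rows are made up for illustration and the real file may contain additional columns. `load_variants` and `top_findings` are hypothetical helper names.

```python
import csv
import io

# Made-up rows that follow the column names listed above; not real output.
SAMPLE_TSV = (
    "gene\tconsequence\timpact\tclinvar_significance\tgnomad_af\t"
    "priority_tier\tpriority_score\n"
    "GENE_A\tmissense_variant\tMODERATE\tPathogenic\t0.0004\t1\t9.0\n"
    "GENE_B\tintron_variant\tMODIFIER\tDrug response\t0.35\t2\t6.5\n"
    "GENE_C\tsynonymous_variant\tLOW\tBenign\t0.22\t4\t1.0\n"
)


def load_variants(handle):
    """Read the TSV and convert numeric columns from strings."""
    rows = []
    for row in csv.DictReader(handle, delimiter="\t"):
        row["gnomad_af"] = float(row["gnomad_af"]) if row["gnomad_af"] else None
        row["priority_tier"] = int(row["priority_tier"])
        row["priority_score"] = float(row["priority_score"])
        rows.append(row)
    return rows


def top_findings(rows, max_tier=2):
    """Keep tiers 1..max_tier, best tier first, then highest score first."""
    kept = [r for r in rows if r["priority_tier"] <= max_tier]
    return sorted(kept, key=lambda r: (r["priority_tier"], -r["priority_score"]))


variants = load_variants(io.StringIO(SAMPLE_TSV))
for row in top_findings(variants):
    print(row["gene"], row["priority_tier"], row["clinvar_significance"])
```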

Important limitations

Consumer genotyping arrays (23andMe, AncestryDNA) test ~600,000 of the genome's ~3 billion positions. They miss structural variants, most rare variants, and cannot reliably detect copy number changes. A "clear" report from a genotyping array does not mean the genome is free of pathogenic variants. Clinical-grade whole genome sequencing covers far more ground.

Take-home messages

1. The fundamentals have not changed

Variant interpretation still requires understanding of molecular biology, population genetics, clinical context, and the ACMG framework. AI accelerates the mechanical steps (annotation, database lookups, prioritisation), but it does not replace the human judgement needed to decide whether a variant is clinically actionable.

2. Speed has changed dramatically

What used to take a bioinformatician days (downloading tools, configuring environments, running VEP, parsing output, cross-referencing databases) now takes minutes with agent-driven skills. The bottleneck shifts from data processing to interpretation and clinical decision-making.

3. Pharmacogenomics is actionable today

Drug-gene interactions like warfarin/CYP2C9/VKORC1 are well-established, guideline-supported, and directly affect prescribing decisions. This is not hypothetical future medicine. It is already implemented in leading hospitals through pre-emptive PGx testing.

4. VUS is the honest answer

Over half of all variants in ClinVar are classified as VUS. The backlog is growing faster than reclassification efforts. Communicating uncertainty to patients, rather than overpromising on what genomics can deliver, is a core skill for anyone working in this field.

5. Equity gaps are real and growing

AI systems trained on biased data amplify existing disparities. A variant that appears benign in European databases may be pathogenic in an understudied population. Every genomic analysis should consider the ancestry context of the individual and the reference databases being queried.

6. Open data enables open science

This entire workshop runs on a CC0-licensed genome, open-source skills, free public APIs, and a free Colab notebook. Reproducible, accessible, and transparent. Anyone in the world can run the same analysis and get the same results. That is the standard to aim for.

Medical disclaimer

ClawBio is a research and educational tool. It is not a medical device and does not provide clinical diagnoses. The findings discussed in this workshop are for educational purposes only. Consult a healthcare professional before making any medical decisions based on genetic data.

Continue exploring

ClawBio has 39 skills covering pharmacogenomics, ancestry analysis, equity scoring, single-cell RNA-seq, GWAS, proteomics, metagenomics, and more.