Hands-on with a real human genome. Annotate clinically relevant variants, interpret pharmacogenomic findings, and understand what AI changes (and does not change) about genomic analysis.
Introduction
The first bioinformatics-native AI agent skill library. Curated, reproducible, open-source.
The problem
Dependency hell, dead links to reference data, hardcoded paths, missing configs. Most published bioinformatics analyses cannot be re-run a year later.
Large language models guess star alleles, invent gene-drug associations, and cite retracted papers. Without grounded skills, AI in genomics is unreliable.
A clinical-grade variant annotation requires VEP, ClinVar lookup, gnomAD frequency checks, CPIC cross-referencing, and manual prioritisation. Each step is a separate tool with its own interface.
86% of GWAS participants are of European descent. Polygenic risk scores lose up to 80% accuracy in non-European populations. AI trained on biased data amplifies existing disparities.
Every skill is self-contained with pinned dependencies, demo data, and reproducibility metadata. The AI agent routes to the right skill, the skill does the grounded analysis, and the human reviews the structured output. No hallucination. No broken pipelines.
Community
Bridge to 8,000+ Galaxy tools. BioBlend SDK integration. Cross-platform skill chaining.
Added llms.txt, AGENTS.md, and machine-readable catalog.json so any AI agent can discover and use ClawBio skills automatically.
Security audit (32 fixes). Full README overhaul. Production deployment of RoboTerri Telegram bot.
57 automated tests. GitHub Actions CI pipeline. ClawHub skill registry.
Core skills: variant annotation, pharmacogenomics, equity scoring, nutrigenomics. The Corpasome as demo genome.
ClawBio is open to contributions. Wanted skills include GWAS automation (PLINK/REGENIE), clinical ACMG classification, pathway enrichment (GO/KEGG), phylogenetics, and spatial transcriptomics. See the contributing guide to get started.
Background
This section covers the key concepts you need before running the practical.
Materials
Everything you need to run the practical. No local installation required.
| Resource | Link | Notes |
|---|---|---|
| Google Colab notebook | Open in Colab | Main practical. Runs in browser, free tier. |
| Lecture slides (PPTX) | Download | ClawBio overview deck |
| ClawBio GitHub | github.com/ClawBio/ClawBio | Source code, skills, documentation |
| Corpasome paper | doi:10.1186/1751-0473-8-13 | Corpas (2013), Source Code Biol Med |
| Ensembl VEP | ensembl.org/vep | Variant Effect Predictor (public REST API) |
| ClinVar | ncbi.nlm.nih.gov/clinvar | Variant-disease associations database |
| gnomAD | gnomad.broadinstitute.org | Population allele frequency data |
| CPIC Guidelines | cpicpgx.org | Pharmacogenomics clinical guidelines |
| ACMG Standards | Richards et al. (2015) | Genetics in Medicine 17(5):405-24 |
Annotates VCF variants via Ensembl VEP REST API. Extracts ClinVar significance, gnomAD frequencies, and assigns Tier 1-4 priority. Outputs report, TSV, and JSON.
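As a rough illustration of the first step, the sketch below converts VCF data lines into the JSON body accepted by the public Ensembl REST endpoint `POST /vep/human/region`. It is not ClawBio's actual implementation: the function names are hypothetical, the sample variant is only a placeholder, and the request is built but never sent.

```python
import json
import urllib.request

# Public Ensembl REST endpoint for VEP region queries on human variants.
VEP_URL = "https://rest.ensembl.org/vep/human/region"


def vcf_line_to_vep_variant(line):
    """Convert one tab-separated VCF data line into VEP's space-separated form."""
    chrom, pos, var_id, ref, alt = line.rstrip("\n").split("\t")[:5]
    if chrom.startswith("chr"):
        chrom = chrom[3:]  # Ensembl names chromosomes "21", not "chr21"
    return f"{chrom} {pos} {var_id} {ref} {alt} . . ."


def build_vep_request(vcf_lines):
    """Build (but do not send) a batched POST request, skipping VCF header lines."""
    variants = [
        vcf_line_to_vep_variant(line)
        for line in vcf_lines
        if line.strip() and not line.startswith("#")
    ]
    body = json.dumps({"variants": variants}).encode("utf-8")
    return urllib.request.Request(
        VEP_URL,
        data=body,
        headers={"Content-Type": "application/json", "Accept": "application/json"},
        method="POST",
    )


# Placeholder input: one header line and one data line.
request = build_vep_request([
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO",
    "chr21\t26960070\trs116645811\tG\tA\t.\t.\t.",
])
print(request.data.decode("utf-8"))
```

Passing the request to `urllib.request.urlopen` would submit it. The public server is rate-limited, so large VCFs need to be sent in batches.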
Pharmacogenomic report from 23andMe/AncestryDNA data. 12 genes, 31 SNPs, 51 drugs. CPIC-grounded, zero external dependencies, runs offline.
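To make the input concrete, here is a minimal sketch of reading genotypes from a 23andMe-style raw data file (tab-separated `rsid`, `chromosome`, `position`, `genotype`, with `#` comment lines). The function name and the sample rows are hypothetical, the positions are placeholders, and the skill's own parser may differ. AncestryDNA exports use a different column layout and are not handled here.

```python
import io


def read_genotypes(handle, rsids):
    """Return {rsid: genotype} for the requested rsIDs from a 23andMe-style file."""
    wanted = set(rsids)
    found = {}
    for line in handle:
        if line.startswith("#") or not line.strip():
            continue  # skip comment/header lines and blank lines
        parts = line.rstrip("\n").split("\t")
        if len(parts) < 4:
            continue  # skip malformed rows
        rsid, _chrom, _pos, genotype = parts[:4]
        if rsid in wanted:
            found[rsid] = genotype
    return found


# Illustrative rows only; the positions are placeholders, not real coordinates.
sample = io.StringIO(
    "# rsid\tchromosome\tposition\tgenotype\n"
    "rs9923231\t16\t1000\tTT\n"
    "rs0000001\t1\t2000\tAG\n"
)
print(read_genotypes(sample, ["rs9923231"]))
```

Only the lookup is shown. Translating a genotype into a phenotype and a drug recommendation is the part that has to be grounded in the CPIC guidelines.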
Deep gene-drug lookup via the ClinPGx API. Provides detailed CPIC guideline context, PharmGKB annotations, and FDA label information.
Results guide
What the output tables and reports mean, and how to interpret the key findings.
The variant-annotation skill assigns every variant a priority tier based on clinical significance, population frequency, and functional impact:
| Tier | Criteria | Example from Corpasome |
|---|---|---|
| Tier 1 | Pathogenic or likely pathogenic in ClinVar. Rare in gnomAD (AF < 0.001). | CFTR deltaF508 (rs113993960), carrier for cystic fibrosis |
| Tier 2 | Drug response variant or established risk factor. CPIC-actionable. | VKORC1 rs9923231 TT, warfarin high sensitivity |
| Tier 3 | Variant of uncertain significance. Insufficient evidence to classify. | Rare missense variants with no ClinVar entry |
| Tier 4 | Benign or likely benign. Common in populations (> 1% frequency). | MTHFR A1298C (rs1801131), common polymorphism |
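The tier criteria above can be sketched as a small decision function. This is an illustration, not the skill's actual logic: the function name, the order in which the rules are checked, and the handling of cases the table does not cover (such as a pathogenic variant that is not rare) are all assumptions.

```python
RARE_AF = 0.001    # Tier 1 rarity cut-off from the table
COMMON_AF = 0.01   # Tier 4 "common" cut-off (1%) from the table


def assign_tier(clinvar_significance, gnomad_af):
    """Map a ClinVar label and a gnomAD allele frequency to a tier from 1 to 4.

    Both arguments may be None when the database has no entry for the variant.
    """
    sig = (clinvar_significance or "").strip().lower()
    pathogenic = sig in {"pathogenic", "likely pathogenic"}
    benign = sig in {"benign", "likely benign"}

    # Tier 1: pathogenic or likely pathogenic AND rare.
    if pathogenic and gnomad_af is not None and gnomad_af < RARE_AF:
        return 1
    # Tier 2: drug response or established risk factor, regardless of frequency.
    if sig in {"drug response", "risk factor"}:
        return 2
    # Tier 4: benign in ClinVar, or common and not flagged pathogenic.
    if benign or (not pathogenic and gnomad_af is not None and gnomad_af > COMMON_AF):
        return 4
    # Tier 3: everything else, including cases the table leaves undefined
    # (assumption: these are held for manual review).
    return 3


print(assign_tier("Pathogenic", 0.0002))   # 1
print(assign_tier("drug response", 0.4))   # 2
print(assign_tier(None, 0.0005))           # 3
print(assign_tier("Benign", 0.3))          # 4
```

Checking the drug-response rule before the frequency rule matters: pharmacogenomic variants are often common, and would otherwise fall into Tier 4.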
The TSV output contains one row per variant. Key columns:
| Column | What it means |
|---|---|
| gene | Gene symbol (e.g., CYP2D6, CFTR) |
| consequence | Functional effect: missense_variant, synonymous, frameshift, etc. |
| impact | VEP impact tier: HIGH, MODERATE, LOW, MODIFIER |
| clinvar_significance | ClinVar classification: Pathogenic, Likely pathogenic, VUS, Benign, Drug response |
| gnomad_af | Global allele frequency in gnomAD. Values below 0.001 (0.1%) are considered rare. |
| priority_tier | ClawBio's computed tier (1-4) combining all evidence fields |
| priority_score | Numeric score for ranking within a tier. Higher means more clinically relevant. |
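A quick way to review the TSV is to load it and sort by tier, then by score. The sketch below uses only the standard library and the column names listed above. The function name and the sample rows (`GENE_A` and so on) are made up for illustration.

```python
import csv
import io


def rank_variants(handle):
    """Read the TSV and return rows ordered by tier, then by descending score."""
    rows = list(csv.DictReader(handle, delimiter="\t"))
    # Lower tier number first; within a tier, higher priority_score first.
    rows.sort(key=lambda r: (int(r["priority_tier"]), -float(r["priority_score"])))
    return rows


# Made-up rows using a subset of the documented columns.
sample = io.StringIO(
    "gene\tpriority_tier\tpriority_score\n"
    "GENE_A\t3\t1.5\n"
    "GENE_B\t1\t7.0\n"
    "GENE_C\t1\t9.0\n"
)
for row in rank_variants(sample):
    print(row["gene"], row["priority_tier"], row["priority_score"])
```

With a real report, replace the `io.StringIO` sample with `open(path, newline="")` on the TSV file.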
Consumer genotyping arrays (23andMe, AncestryDNA) test ~600,000 of the genome's ~3 billion positions. They miss structural variants and most rare variants, and they cannot reliably detect copy number changes. A "clear" report from a genotyping array does not mean the genome is free of pathogenic variants. Clinical-grade whole genome sequencing covers far more ground.
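To put those two figures in proportion, the short calculation below works out the fraction of positions an array samples. Both inputs are the approximate numbers quoted above, and the function name is hypothetical.

```python
ARRAY_SITES = 600_000               # approximate positions on a consumer array
GENOME_POSITIONS = 3_000_000_000    # approximate positions in the human genome


def array_coverage(sites=ARRAY_SITES, genome=GENOME_POSITIONS):
    """Fraction of genome positions directly tested by the array."""
    return sites / genome


fraction = array_coverage()
print(f"{fraction:.4%} of positions")                          # 0.0200% of positions
print(f"about 1 in {GENOME_POSITIONS // ARRAY_SITES:,} positions")  # about 1 in 5,000 positions
```

Array sites are chosen to be informative, so the raw fraction understates what an array captures for common variation. It does show how much of the genome is never looked at directly.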
Summary
Variant interpretation still requires understanding of molecular biology, population genetics, clinical context, and the ACMG framework. AI accelerates the mechanical steps (annotation, database lookups, prioritisation), but it does not replace the human judgement needed to decide whether a variant is clinically actionable.
What used to take a bioinformatician days (downloading tools, configuring environments, running VEP, parsing output, cross-referencing databases) now takes minutes with agent-driven skills. The bottleneck shifts from data processing to interpretation and clinical decision-making.
Drug-gene interactions like warfarin/CYP2C9/VKORC1 are well-established, guideline-supported, and directly affect prescribing decisions. This is not hypothetical future medicine. It is already implemented in leading hospitals through pre-emptive PGx testing.
Over half of all variants in ClinVar are classified as VUS. The backlog is growing faster than reclassification efforts. Communicating uncertainty to patients, rather than overpromising on what genomics can deliver, is a core skill for anyone working in this field.
AI systems trained on biased data amplify existing disparities. A variant that appears benign in European databases may be pathogenic in an understudied population. Every genomic analysis should consider the ancestry context of the individual and the reference databases being queried.
This entire workshop runs on a CC0-licensed genome, open-source skills, free public APIs, and a free Colab notebook. Reproducible, accessible, and transparent. Anyone in the world can run the same analysis and get the same results. That is the standard to aim for.
ClawBio is a research and educational tool. It is not a medical device and does not provide clinical diagnoses. The findings discussed in this workshop are for educational purposes only. Consult a healthcare professional before making any medical decisions based on genetic data.
Get involved
ClawBio has 39 skills covering pharmacogenomics, ancestry analysis, equity scoring, single-cell RNA-seq, GWAS, proteomics, metagenomics, and more.