ClawBio Workshop

Agentic GWAS

From summary statistics to causal variants

Dr Manuel Corpas · University of Westminster · 2026

Where we are

In the previous session you annotated a single genome and found clinically relevant variants using VEP, ClinVar, gnomAD, and CPIC.

Now we scale up. GWAS looks across thousands of genomes to find variants associated with disease at the population level.

This session: ~10 min slides, then ~20 min hands-on in Google Colab. By the end you will have queried nine GWAS databases, computed a polygenic risk score, and fine-mapped a locus to identify causal variants.

Learning objectives

Explain what a GWAS is and what summary statistics contain
Query nine federated databases for variant associations in seconds
Calculate polygenic risk scores from published PGS Catalog scores
Apply SuSiE fine-mapping to identify credible sets of causal variants
Recognise cross-ancestry gaps in GWAS representation
Run all three analyses in Google Colab with zero infrastructure

Part 1

What is a GWAS?

GWAS in one slide

Test every variant in the genome for association with a trait across thousands of people.

6,000+ published GWAS

500M+ participants (cumulative)

90,000+ trait associations

86% European ancestry

The last number is the problem. Most GWAS findings may not transfer to non-European populations.
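To make "test every variant" concrete, here is a minimal sketch of the per-variant test for a quantitative trait: a simple linear regression of phenotype on allele dosage, repeated for each variant. The data are simulated, the function name `association_test` is illustrative, and real GWAS software also adjusts for covariates, ancestry, and relatedness, which this sketch ignores.

```python
import numpy as np

def association_test(dosage, phenotype):
    """Regress a quantitative phenotype on allele dosage (0, 1 or 2).

    Returns (beta, se, z). No covariates, no relatedness correction.
    """
    x = np.asarray(dosage, dtype=float)
    y = np.asarray(phenotype, dtype=float)
    n = x.size
    xc = x - x.mean()
    yc = y - y.mean()
    sxx = np.sum(xc ** 2)
    beta = np.sum(xc * yc) / sxx             # least-squares slope
    residuals = yc - beta * xc
    sigma2 = np.sum(residuals ** 2) / (n - 2)  # residual variance
    se = np.sqrt(sigma2 / sxx)                 # standard error of the slope
    return beta, se, beta / se

# Simulated cohort: 2,000 people, 50 variants, one truly causal variant.
rng = np.random.default_rng(0)
n_people, n_variants, causal = 2000, 50, 7
maf = rng.uniform(0.05, 0.5, size=n_variants)
maf[causal] = 0.3
genotypes = rng.binomial(2, maf, size=(n_people, n_variants))
phenotype = 0.3 * genotypes[:, causal] + rng.normal(size=n_people)

# One test per variant; the collected (beta, se, z) rows are the summary statistics.
results = [association_test(genotypes[:, j], phenotype) for j in range(n_variants)]
z_scores = np.array([r[2] for r in results])
top = int(np.argmax(np.abs(z_scores)))
print("strongest signal at variant index", top)
```

The per-variant rows collected in `results` are exactly the kind of output that gets published as summary statistics.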

What are summary statistics?

Published GWAS release summary statistics: per-variant results without individual genotypes.

Field	Meaning	Example
`rsid`	Variant identifier	rs7903146
`beta`	Per-allele effect size (log odds ratio for binary traits, trait units for quantitative traits)	0.31
`se`	Standard error of beta	0.02
`p`	P-value for association	5.2 x 10^-38
`MAF`	Minor allele frequency	0.28

Key insight: Summary statistics are public, free, and enough to run PRS, meta-analysis, and fine-mapping. No HPC. No data access agreements. No individual-level data.
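As a sketch of how little is needed to work with these fields, the snippet below reads a tiny tab-separated table and derives a z-score and two-sided p-value from `beta` and `se`. The rows are made-up placeholders (not real results), the helper names are illustrative, and the 5e-8 cutoff is the conventional genome-wide significance threshold rather than anything specific to ClawBio.

```python
import csv
import io
import math

# Placeholder rows using the column names from the table above; values are invented.
TSV = "rsid\tbeta\tse\nrs_example_1\t0.12\t0.02\nrs_example_2\t0.01\t0.02\n"

def z_and_p(beta, se):
    """Wald z-score and two-sided normal p-value from an effect size and its SE."""
    z = beta / se
    p = math.erfc(abs(z) / math.sqrt(2))  # equals 2 * (1 - Phi(|z|))
    return z, p

def genome_wide_significant(p, threshold=5e-8):
    """Conventional genome-wide significance cutoff."""
    return p < threshold

rows = list(csv.DictReader(io.StringIO(TSV), delimiter="\t"))
for row in rows:
    z, p = z_and_p(float(row["beta"]), float(row["se"]))
    print(row["rsid"], round(z, 2), f"{p:.2e}", genome_wide_significant(p))
```

The same two columns, `beta` and `se`, are the inputs that PRS weighting, meta-analysis, and fine-mapping build on.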

Part 2

Three skills, one pipeline

The GWAS pipeline

GWAS Lookup

→

PRS Calculator

→

Fine-Mapping

GWAS Lookup

Query 9 databases in parallel for a single variant, including GWAS Catalog, Open Targets, UKB, FinnGen, Biobank Japan, GTEx, and eQTL Catalogue.

PRS Calculator

Compute polygenic risk scores from 23andMe or AncestryDNA data using 3,000+ published scores from the PGS Catalog.

Fine-Mapping

Apply SuSiE to identify credible sets of likely causal variants from summary statistics. No individual-level data needed.

Skill 1: GWAS Lookup

Give it an rsID. It queries nine databases in parallel and returns a unified report.

Database	What it returns	Ancestry
GWAS Catalog	Published trait associations	Mixed
Open Targets	Credible sets, L2G scores	Mixed
UKB-TOPMed PheWeb	PheWAS across 4,500 phenotypes	Multi-ancestry
FinnGen r12	Finnish disease endpoints	Finnish
Biobank Japan	East Asian PheWAS	Japanese
GTEx v8	eQTL tissue expression	Mostly European
EBI eQTL Catalogue	Multi-tissue eQTL associations	Mixed

One command: python gwas_lookup.py --rsid rs7903146
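The "query in parallel, return one report" pattern can be sketched with a thread pool. This is not ClawBio's implementation: `make_stub`, `lookup`, and the returned fields are hypothetical names, and the fetchers are stubs that make no network calls. A real fetcher would call the corresponding source's public API.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed

def make_stub(source_name):
    """Return a placeholder fetcher; a real one would call that source's API."""
    def fetch(rsid):
        return {"source": source_name, "rsid": rsid, "associations": []}
    return fetch

# A subset of the sources named in the table above, each backed by a stub.
SOURCES = ["GWAS Catalog", "Open Targets", "FinnGen", "Biobank Japan"]
FETCHERS = {name: make_stub(name) for name in SOURCES}

def lookup(rsid, fetchers=FETCHERS, timeout=30):
    """Query every source concurrently and merge the answers into one report."""
    report = {}
    with ThreadPoolExecutor(max_workers=len(fetchers)) as pool:
        futures = {pool.submit(fn, rsid): name for name, fn in fetchers.items()}
        for future in as_completed(futures, timeout=timeout):
            name = futures[future]
            try:
                report[name] = future.result()
            except Exception as exc:
                # One failing source should not sink the whole report.
                report[name] = {"source": name, "error": str(exc)}
    return report

report = lookup("rs7903146")
for name in SOURCES:
    print(name, report[name])
```

Threads suit this workload because each query spends most of its time waiting on the network, and catching errors per source keeps a single outage from hiding the other results.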

Cross-ancestry: why it matters

The problem

86% of GWAS participants are of European ancestry
Effect sizes and allele frequencies differ between populations
PRS trained on Europeans perform poorly in African and South Asian populations
Risk of widening health disparities

What ClawBio does

Queries UKB, FinnGen, and Biobank Japan in one call
Flags allele frequency differences across populations
MVP (Million Veteran Program) is the most diverse GWAS cohort: 33% non-European
Uganda Genome Resource: 6,407 samples, African ancestry GWAS
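Flagging allele frequency differences can be as simple as comparing every pair of cohorts. The sketch below assumes a hypothetical `flag_frequency_gaps` helper, an arbitrary 0.10 gap threshold, and placeholder frequencies for anonymous cohorts. None of these numbers come from a real biobank.

```python
from itertools import combinations

def flag_frequency_gaps(freqs, min_gap=0.10):
    """Return (cohort_a, cohort_b, gap) for pairs whose allele frequencies
    differ by at least min_gap, largest gap first."""
    flagged = []
    for (a, fa), (b, fb) in combinations(freqs.items(), 2):
        gap = abs(fa - fb)
        if gap >= min_gap:
            flagged.append((a, b, gap))
    return sorted(flagged, key=lambda item: item[2], reverse=True)

# Placeholder effect-allele frequencies for one variant (illustrative only).
frequencies = {"cohort_A": 0.30, "cohort_B": 0.25, "cohort_C": 0.05}

flags = flag_frequency_gaps(frequencies)
for a, b, gap in flags:
    print(f"{a} vs {b}: frequency differs by {gap:.2f}")
```

A large gap does not by itself mean the effect differs between populations, but it is a prompt to check whether an association or score transfers.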

Skill 2: Polygenic Risk Scores

A PRS sums the effects of many variants into a single risk estimate.

Formula: PRS = ∑ (dosage_i × effect_weight_i) across all matched variants
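The formula translates almost directly into code. In this minimal sketch the function name `polygenic_score`, the variant IDs, and the weights are all hypothetical placeholders. Real scoring also has to check that the counted allele is the score's effect allele, which is omitted here.

```python
def polygenic_score(dosages, weights):
    """Sum dosage * effect weight over variants present in both inputs.

    dosages: {variant_id: copies of the effect allele (0, 1 or 2)}
    weights: {variant_id: effect weight from the score file}
    Returns (score, number of matched variants, number of variants in the score).
    """
    matched = [vid for vid in weights if vid in dosages]
    score = sum(dosages[vid] * weights[vid] for vid in matched)
    return score, len(matched), len(weights)

# Placeholder score file and genotype (illustrative values only).
weights = {"var_1": 0.20, "var_2": -0.10, "var_3": 0.05}
dosages = {"var_1": 2, "var_2": 1, "var_4": 0}

score, n_matched, n_total = polygenic_score(dosages, weights)
print(f"PRS = {score:.2f} using {n_matched} of {n_total} variants")
```

Reporting how many variants matched matters: consumer genotyping arrays do not cover every variant in a published score, and a score built on few matches is less reliable.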

ClawBio ships with 6 curated scores for instant demos:

Trait	PGS ID	Variants
Type 2 diabetes	PGS000013	8
Coronary artery disease	PGS000004	46
Breast cancer	PGS000001	77
Prostate cancer	PGS000057	147
Atrial fibrillation	PGS000011	12
BMI	PGS000039	97

Risk categories: Low (<20th) · Average (20-80th) · Elevated (80-95th) · High (>95th)
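Assigning a category needs a reference distribution to turn a raw score into a percentile. The sketch below assumes a list of reference scores is available and uses hypothetical helper names. How the exact boundary values (20, 80, 95) are assigned is an assumption, since the cut-offs above do not say which side they fall on.

```python
from bisect import bisect_left

def percentile_of(score, reference_scores):
    """Percentage of reference scores strictly below the given score."""
    ordered = sorted(reference_scores)
    return 100.0 * bisect_left(ordered, score) / len(ordered)

def risk_category(percentile):
    """Map a percentile to the four categories listed above."""
    if percentile < 20:
        return "Low"
    if percentile < 80:
        return "Average"
    if percentile <= 95:
        return "Elevated"
    return "High"

# Placeholder reference distribution: 100 evenly spaced scores.
reference = list(range(100))
for s in (10, 50, 85, 97):
    pct = percentile_of(s, reference)
    print(s, pct, risk_category(pct))
```

The category is only as meaningful as the reference population: a percentile computed against a poorly matched ancestry group can be misleading.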

Skill 3: SuSiE Fine-Mapping

GWAS finds associated regions. Fine-mapping finds the likely causal variants within them.

Without fine-mapping

A GWAS hit spans 10-200 correlated SNPs in LD
Which one is causal? All look equally significant
Manual triage: slow, subjective, error-prone

With SuSiE

Credible sets: minimal set of SNPs capturing 95% of causal probability
PIPs: posterior inclusion probability per variant
Handles multiple causal signals per locus
Works from summary statistics alone
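To show what a credible set is, here is a deliberately simplified sketch. It is not SuSiE: it assumes a single causal variant at the locus and uses Wakefield approximate Bayes factors computed from `beta` and `se`, whereas SuSiE also uses the LD matrix and can model several signals. The prior standard deviation of 0.2 and all summary statistics are illustrative assumptions.

```python
import math

def log_abf(beta, se, prior_sd=0.2):
    """Log approximate Bayes factor (Wakefield) in favour of association."""
    v = se ** 2
    w = prior_sd ** 2
    z = beta / se
    shrink = w / (v + w)
    return 0.5 * math.log(1.0 - shrink) + 0.5 * z * z * shrink

def posterior_probabilities(stats, prior_sd=0.2):
    """Per-variant probability of being the causal one, assuming exactly one is."""
    log_bfs = {vid: log_abf(b, s, prior_sd) for vid, (b, s) in stats.items()}
    peak = max(log_bfs.values())  # subtract the max to avoid overflow in exp
    weights = {vid: math.exp(l - peak) for vid, l in log_bfs.items()}
    total = sum(weights.values())
    return {vid: w / total for vid, w in weights.items()}

def credible_set(probs, coverage=0.95):
    """Smallest set of variants whose probabilities sum to at least `coverage`."""
    chosen, cumulative = [], 0.0
    for vid, p in sorted(probs.items(), key=lambda kv: kv[1], reverse=True):
        chosen.append(vid)
        cumulative += p
        if cumulative >= coverage:
            break
    return chosen

# Placeholder (beta, se) pairs for four variants at one locus.
stats = {"snp1": (0.300, 0.03), "snp2": (0.295, 0.03),
         "snp3": (0.100, 0.03), "snp4": (0.020, 0.03)}
probs = posterior_probabilities(stats)
print(credible_set(probs))
```

Two variants with nearly identical evidence both end up in the set, which is the point: a credible set reports the uncertainty instead of forcing a single answer.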

Democratising GWAS

The infrastructure barrier is gone.

No HPC

Summary statistics are small. Everything runs in Google Colab on a free tier.

No data access agreements

Summary statistics are publicly released. No application, no waiting, no ethics board.

No bioinformatics team

ClawBio wraps the full pipeline. One command per analysis.

No cost

Google Colab is free. ClawBio is MIT-licensed. PGS Catalog is open. All databases are public APIs.

Implication for the Global South: A researcher in Lima, Kampala, or Dhaka can run the same GWAS analyses as one at the Broad Institute. Today.

Part 3

Hands-on Practical

Open Google Colab now

What you will do

Step	Task	Time
1	Setup: install ClawBio in Colab (same as variant interpretation)	2 min
2	GWAS Lookup: query rs7903146 (type 2 diabetes) across 9 databases	5 min
3	Compare allele frequencies across UKB, FinnGen, and Biobank Japan	3 min
4	PRS: compute polygenic risk scores for 6 traits using the Corpasome	5 min
5	Fine-mapping: run SuSiE on a demo locus with 2 causal signals	5 min

Requirements: Same as before. A Google account and a web browser. Nothing to install locally.

Variants we will explore

rsID	Gene / Locus	Trait	Why it matters
rs7903146	TCF7L2	Type 2 diabetes	Strongest common T2D signal. OR 1.4 per allele.
rs429358	APOE	Alzheimer's	Already found in variant interpretation. Now see the GWAS context.
rs3798220	LPA	Heart disease	Lipoprotein(a). Risk factor for coronary events.
rs1801282	PPARG	Type 2 diabetes	Drug target for thiazolidinediones.

Take-home messages

GWAS summary statistics are free and public. You do not need individual-level data to do meaningful research.
Three ClawBio skills cover the full GWAS workflow: lookup, PRS, and fine-mapping.
Cross-ancestry analysis is not optional. Most GWAS are European-biased. Query multiple biobanks to check transferability.
Fine-mapping narrows GWAS hits to likely causal variants. SuSiE credible sets are the state of the art.
Infrastructure is no longer a barrier. Google Colab + ClawBio = publication-quality GWAS analysis for free.

Resources

Resource	Link
ClawBio GitHub	github.com/ClawBio/ClawBio
Variant Interpretation Workshop	Previous workshop
GWAS Catalog	ebi.ac.uk/gwas
PGS Catalog	pgscatalog.org
Open Targets	genetics.opentargets.org
Corpasome (Zenodo)	doi:10.5281/zenodo.19297389
Ensembl VEP	ensembl.org/vep
gnomAD	gnomad.broadinstitute.org