Whole-Genome Sequencing Workshop

Exploring a Personal Genome at 30x Depth

Go beyond SNP chips. Explore 3.7 million SNPs, 912,000 indels, and 8,925 structural variants from a real human genome. Everything runs in Google Colab. No installation, no API keys, no cost.

Dr Manuel Corpas · University of Westminster · 2026

What is this dataset?

A complete human genome, sequenced at 30x depth, published under open access. What SNP arrays miss, WGS captures.

3.7M SNPs
912K Indels
8,925 Structural Variants
1,387 CNVs

A real genome, fully open

Manuel Corpas published his 30x whole-genome sequence under CC0 (public domain) on Zenodo (DOI: 10.5281/zenodo.19297389). The genome was sequenced by Dante Labs on an Illumina platform and aligned to the GRCh37 reference.

The dataset contains SNPs, indels, and structural variants (DEL, DUP, INV, BND, INS), plus copy number variant calls. This is a complete picture of one human genome, not a sparse sampling of pre-selected positions.

Research and educational use

This dataset is provided for research and educational purposes only. It is not intended for clinical decision-making.

What WGS captures that SNP arrays miss

SNP chips test approximately 600,000 pre-selected positions. WGS sequences every base pair across the entire genome. The difference is not incremental; it is a different category of data.

Metric | SNP Array | 30x WGS
SNPs | ~600,000 | 3,716,648
Indels | 0 | 912,009
Structural variants | 0 | 8,925
CNVs | 0 | 1,387
Gene coverage | Sparse (pre-selected) | Complete
Ti/Tv ratio | N/A | 2.03
Het/Hom ratio | N/A | 1.63

SNP arrays are useful for population-scale screening, but they are blind to insertions, deletions, inversions, duplications, and novel variants not on the chip. WGS sees all of these.

Pre-built subsets for instant analysis

The full genome VCF is large. For this workshop, we provide pre-extracted subsets committed to the repository. No bulk download needed for the tutorial.

Subset | Contents | File
chr20 | SNPs + indels from chromosome 20 (tutorial focus) | corpas-30x/subsets/chr20_snps_indels.vcf.gz
PGx loci | 5 pharmacogenomic variants | corpas-30x/subsets/pgx_loci.vcf.gz
NutriGx loci | 11 nutrigenomic variants | corpas-30x/subsets/nutrigx_loci.vcf.gz
SV calls | 8,925 structural variant calls | corpas-30x/subsets/sv_calls.vcf.gz
CNV calls | 1,387 copy number variant calls | corpas-30x/subsets/cnv_calls.vcf.gz

WGS quality, structural variants, and clinical context

The concepts you need before running the practical.

QC metrics: how to tell if a genome is good

Three numbers tell you most of what you need to know about a WGS callset:

  • Ti/Tv ratio: transitions vs transversions. Expected value for WGS is ~2.0. A ratio well below 2.0 suggests contamination, alignment artefacts, or low-quality variant calls.
  • Het/Hom ratio: heterozygous vs homozygous variant calls. Expected range for a single individual is 1.5 to 1.7. Values outside this range may indicate sample contamination (too high) or consanguinity (too low).
  • Total variant count: a European genome typically carries 3.5 to 4.5 million SNPs relative to GRCh37. Counts far outside this range warrant investigation.

This genome: Ti/Tv = 2.03, Het/Hom = 1.63, total SNPs = 3,716,648. All within expected ranges.
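
The two ratios above can be computed with a few lines of Python. What follows is a minimal sketch, not the notebook's own implementation: the function name `qc_ratios` and the example records are made up for illustration, and it assumes biallelic sites supplied as (REF, ALT, genotype) tuples. Het/Hom is counted over all variant calls here; a given pipeline may restrict it to SNPs.

```python
# Minimal sketch: Ti/Tv and Het/Hom from VCF-style records.
# The records at the bottom are made-up illustrations, not data from this genome.

TRANSITIONS = {("A", "G"), ("G", "A"), ("C", "T"), ("T", "C")}


def qc_ratios(records):
    """records: iterable of (ref, alt, genotype) tuples for biallelic sites."""
    ti = tv = het = hom = 0
    for ref, alt, gt in records:
        alleles = gt.replace("|", "/").split("/")
        if "." in alleles or set(alleles) == {"0"}:
            continue  # skip missing and homozygous-reference calls
        if len(set(alleles)) > 1:
            het += 1
        else:
            hom += 1
        # Ti/Tv is only defined for single-base substitutions
        if len(ref) == 1 and len(alt) == 1:
            if (ref, alt) in TRANSITIONS:
                ti += 1
            else:
                tv += 1
    ti_tv = ti / tv if tv else float("nan")
    het_hom = het / hom if hom else float("nan")
    return ti_tv, het_hom


example = [
    ("A", "G", "0/1"),  # transition, heterozygous
    ("C", "T", "1/1"),  # transition, homozygous
    ("G", "T", "0/1"),  # transversion, heterozygous
    ("T", "C", "0|1"),  # transition, heterozygous (phased)
    ("A", "C", "1/1"),  # transversion, homozygous
    ("G", "A", "0/1"),  # transition, heterozygous
]
ti_tv, het_hom = qc_ratios(example)
print(f"Ti/Tv: {ti_tv:.2f} | Het/Hom: {het_hom:.2f}")
```

Six toy records are far too few for a meaningful ratio; the point is only to show which calls land in which bucket.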

Structural variants: the hidden layer

Structural variants (SVs) are genomic rearrangements of 50 bp or larger. They are harder to detect than SNPs and indels, require specialised callers, and are frequently missed by standard short-read pipelines. Yet SVs contribute disproportionately to rare disease.

SV Type | Description | Count in this genome
DEL | Deletion: segment of DNA removed | 5,854
BND | Breakend / translocation: rearrangement across chromosomes | 1,413
DUP | Duplication: segment of DNA copied | 778
INV | Inversion: segment flipped in orientation | 673
INS | Insertion: new segment of DNA added | 207

SVs range in size from 50 bp to megabases. An estimated 20% of rare disease cases involve structural variants, but most clinical pipelines still focus primarily on SNPs and small indels.

From SNP chip to WGS: what changes clinically

Moving from a genotyping array to whole-genome sequencing does not just increase the variant count. It opens entire categories of analysis that are impossible with arrays:

  • Full CYP2D6 haplotyping: gene deletions and duplications that change metaboliser status are invisible to SNP chips. WGS detects the structural rearrangements directly.
  • HLA typing: the HLA region is too polymorphic and structurally complex for array-based genotyping. WGS reads through the full region.
  • Novel variant discovery: arrays only test known positions. WGS finds variants that have never been catalogued before.
  • Coding indels: frameshift insertions and deletions in protein-coding genes are a major source of loss-of-function. Arrays do not detect them.
  • SVs disrupting gene function: large deletions or inversions that break genes are only visible with sequencing data.

The Corpasome: 13 years of open genomics

The Corpasome is one of the longest-running open personal genome projects:

  • 2013: Manuel Corpas published his 23andMe genotype data under CC0 (public domain). One of the first fully open personal genomes. Used in dozens of studies and teaching exercises worldwide.
  • 2026: The same individual, now sequenced at 30x depth by Dante Labs. Full WGS data released under CC0 on Zenodo.

Two datasets, same genome, 13 years apart. The 23andMe file has ~600,000 SNPs. The WGS file has 3.7 million SNPs, 912,000 indels, and 8,925 structural variants. Same person, radically different resolution.

Citations:
Corpas, M. (2013). Crowdsourcing the Corpasome. Source Code for Biology and Medicine, 8, 13. doi:10.1186/1751-0473-8-13
Corpas, M. (2026). 30x WGS of the Corpasome. Zenodo. doi:10.5281/zenodo.19297389

Workshop materials and links

Everything you need to run the practical. No local installation required.

Essential links

Resource | Link | Notes
Google Colab notebook | Open in Colab | Main practical. Runs in browser, free tier.
Zenodo dataset | doi:10.5281/zenodo.19297389 | Full 30x WGS VCF (CC0 licence)
ClawBio GitHub | github.com/ClawBio/ClawBio | Source code, skills, documentation
Variant interpretation workshop | Previous workshop | SNP chip analysis with 23andMe data
Ensembl VEP | ensembl.org/vep | Variant Effect Predictor (public REST API)
ClinVar | ncbi.nlm.nih.gov/clinvar | Variant-disease associations database
gnomAD | gnomad.broadinstitute.org | Population allele frequency data

Skills used in this workshop

variant-annotation

Annotates VCF variants via Ensembl VEP REST API. Extracts ClinVar significance, gnomAD frequencies, and assigns Tier 1-4 priority. Outputs report, TSV, and JSON.

pharmgx-reporter

Pharmacogenomic report from genotype data. 12 genes, 31 SNPs, 51 drugs. CPIC-grounded, zero external dependencies, runs offline. Now compatible with WGS VCF input.

bio-orchestrator

Routes queries to the right skill automatically. Handles multi-step analyses that span variant annotation, pharmacogenomics, and structural variant exploration.

Requirements

Step-by-step workshop instructions

Open the Colab notebook and follow along.

Setup (2 minutes)

Run the first two code cells. They clone the ClawBio repository and install dependencies (pysam, requests, pandas, matplotlib). You should see:

ClawBio loaded successfully
Skills available: 39
WGS subsets found: chr20, pgx_loci, nutrigx_loci, sv_calls, cnv_calls

If Colab is slow

The git clone takes 10-20 seconds. If it times out, click "Runtime > Restart and run all". The Colab free tier occasionally throttles new sessions.

Load and explore the chr20 subset (5 minutes)

The notebook loads the chromosome 20 VCF subset. This is a manageable slice of the full genome, small enough to process quickly but large enough to demonstrate real WGS data. You will see:

  • Total variant count on chr20 (SNPs + indels)
  • Breakdown by variant type: SNPs vs insertions vs deletions
  • Ti/Tv ratio for chr20 (should be close to 2.0)
  • Het/Hom ratio for chr20 (should be close to 1.6)

# Example output
chr20 variants loaded: 98,412
SNPs: 84,291 | Insertions: 6,832 | Deletions: 7,289
Ti/Tv: 2.05 | Het/Hom: 1.61
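
The SNP / insertion / deletion breakdown can be reproduced by comparing REF and ALT lengths. This is a rough sketch rather than the notebook's code: `variant_type` and `count_types` are illustrative names, the VCF lines are invented, and symbolic alleles such as `<DEL>` are not handled, so it suits only SNP and indel files.

```python
from collections import Counter


def variant_type(ref, alt):
    """Classify one REF/ALT pair by allele length."""
    if len(ref) == 1 and len(alt) == 1:
        return "SNP"
    if len(alt) > len(ref):
        return "insertion"
    if len(alt) < len(ref):
        return "deletion"
    return "other"  # e.g. a multi-nucleotide substitution


def count_types(vcf_lines):
    """Count variant types over an iterable of VCF text lines."""
    counts = Counter()
    for line in vcf_lines:
        if line.startswith("#"):
            continue  # header lines
        fields = line.rstrip("\n").split("\t")
        ref, alts = fields[3], fields[4].split(",")
        for alt in alts:  # multi-allelic sites contribute one count per ALT
            counts[variant_type(ref, alt)] += 1
    return counts


# Made-up records for illustration only
toy_vcf = [
    "##fileformat=VCFv4.2",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO",
    "20\t1000\t.\tA\tG\t50\tPASS\t.",
    "20\t2000\t.\tC\tCAT\t50\tPASS\t.",
    "20\t3000\t.\tGTT\tG\t50\tPASS\t.",
    "20\t4000\t.\tT\tC,TA\t50\tPASS\t.",
]
type_counts = count_types(toy_vcf)
print(dict(type_counts))
```

To run it on the workshop subset, pass a text handle from `gzip.open("corpas-30x/subsets/chr20_snps_indels.vcf.gz", "rt")` instead of the toy list.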

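Discussion point: Why is the Ti/Tv ratio a useful QC metric? (Transitions are biochemically more likely than transversions, even though twice as many transversion changes are possible. Random errors would therefore give a ratio near 0.5, so a genome-wide ratio near 2.0 indicates that most calls reflect real variation rather than artefacts.)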

Explore structural variants (8 minutes)

Load the SV VCF and examine the types, sizes, and chromosomal distribution of structural variants across the genome.

  • Count variants by type: DEL, DUP, INV, BND, INS
  • Plot size distribution of deletions and duplications
  • Identify the largest deletion and the largest duplication
  • Check which genes overlap with large SVs

# SV type counts
DEL: 5,854
BND: 1,413
DUP: 778
INV: 673
INS: 207
Total: 8,925
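
Counting SVs by type and finding the largest event of each type can be sketched as below. The sketch assumes the callset uses the standard VCF INFO keys `SVTYPE`, `SVLEN`, and `END`; whether this particular file populates all three is an assumption, and the records are invented. The function names are illustrative, not part of ClawBio.

```python
from collections import Counter


def parse_info(info):
    """Turn 'KEY=VALUE;FLAG;...' into a dict (flags map to an empty string)."""
    out = {}
    for item in info.split(";"):
        key, _, value = item.partition("=")
        out[key] = value
    return out


def summarise_svs(vcf_lines):
    """Return (counts by SVTYPE, largest event per SVTYPE)."""
    counts = Counter()
    largest = {}  # svtype -> (size, chrom, pos)
    for line in vcf_lines:
        if line.startswith("#"):
            continue
        f = line.rstrip("\n").split("\t")
        chrom, pos, info = f[0], int(f[1]), parse_info(f[7])
        svtype = info.get("SVTYPE", "NA")
        counts[svtype] += 1
        if "SVLEN" in info:
            size = abs(int(info["SVLEN"].split(",")[0]))
        elif "END" in info:
            size = int(info["END"]) - pos
        else:
            continue  # e.g. breakends, which have no single length
        if size > largest.get(svtype, (0,))[0]:
            largest[svtype] = (size, chrom, pos)
    return counts, largest


# Made-up records for illustration only
toy_svs = [
    "1\t10000\t.\tN\t<DEL>\t.\tPASS\tSVTYPE=DEL;END=10500;SVLEN=-500",
    "1\t20000\t.\tN\t<DEL>\t.\tPASS\tSVTYPE=DEL;END=23000;SVLEN=-3000",
    "2\t5000\t.\tN\t<DUP>\t.\tPASS\tSVTYPE=DUP;END=6200",
    "3\t700\t.\tN\tN]5:900]\t.\tPASS\tSVTYPE=BND",
]
sv_counts, sv_largest = summarise_svs(toy_svs)
print(dict(sv_counts))
print(sv_largest)
```

Breakends are counted but excluded from the size summary, because a BND record describes one side of a junction rather than an interval.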

Discussion point: Why are deletions the most common SV type? (Deletions are easier to detect from short reads than insertions or complex rearrangements. This is partly biological and partly a detection bias.)

Extract pharmacogenomic variants from WGS (5 minutes)

Load the PGx loci subset extracted from the WGS data. Compare what the whole-genome sequence found at pharmacogenomic loci with what the original 23andMe chip reported.

  • Load the 5 PGx variants from the WGS subset
  • Cross-reference with the 23andMe findings from the previous workshop
  • Identify any variants that WGS captured but the SNP chip missed
  • Note differences in genotype calling between platforms

Why this matters

SNP arrays can only report on positions that are physically printed on the chip. If a pharmacogenomic variant is not on the chip design, it will never appear in results, regardless of whether the patient carries it. WGS is not restricted to a predefined list of positions.
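
The cross-platform comparison comes down to set operations on variant identifiers. Here is a small sketch under simple assumptions: genotypes are already loaded into dictionaries keyed by rsID, allele order does not matter, and strand flips are not handled. The identifiers and genotypes are placeholders, not findings from this genome.

```python
def compare_platforms(wgs, chip):
    """wgs, chip: dicts mapping a variant ID to a genotype string such as 'AG'."""

    def norm(gt):
        return "".join(sorted(gt))  # treat 'AG' and 'GA' as the same call

    shared = wgs.keys() & chip.keys()
    return {
        "wgs_only": sorted(wgs.keys() - chip.keys()),
        "chip_only": sorted(chip.keys() - wgs.keys()),
        "concordant": sorted(v for v in shared if norm(wgs[v]) == norm(chip[v])),
        "discordant": sorted(v for v in shared if norm(wgs[v]) != norm(chip[v])),
    }


# Placeholder IDs and genotypes for illustration only
wgs_calls = {"rs_example_1": "AG", "rs_example_2": "CC", "rs_example_3": "TT"}
chip_calls = {"rs_example_1": "GA", "rs_example_2": "CT"}

comparison = compare_platforms(wgs_calls, chip_calls)
for label, ids in comparison.items():
    print(f"{label}: {ids}")
```

Variants in `wgs_only` are the candidates for "captured by WGS but missed by the chip"; anything in `discordant` deserves a closer look at read depth and strand before drawing conclusions.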

Annotate chr20 variants (5 minutes)

Run the variant-annotation skill on a selection of chr20 variants. The Ensembl VEP REST API has a limit of 200 variants per batch, so the notebook selects a clinically relevant subset.

  • Select variants in coding regions on chr20
  • Submit to VEP for functional annotation
  • Retrieve ClinVar significance and gnomAD frequencies
  • Review the prioritised output: Tier 1 through Tier 4

# Run annotation on chr20 coding variants
python clawbio.py run variant-annotation --input chr20_coding.vcf --output report/
Annotated 87 variants. 1 Tier 1 (pathogenic). 4 Tier 2 (drug response). Report saved.
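
When more than 200 variants need annotating, the usual approach is to split them into batches that respect the limit. The sketch below shows one way to do that and to build the JSON body for each request. It is an illustration, not the variant-annotation skill's actual code; the helper names and the example variants are made up.

```python
import json

VEP_BATCH_LIMIT = 200  # per-request limit quoted in the text above


def to_vep_string(chrom, pos, ref, alt, var_id="."):
    """Format one variant as a VCF-style string for a VEP region request."""
    return f"{chrom} {pos} {var_id} {ref} {alt} . . ."


def vep_batches(variants, batch_size=VEP_BATCH_LIMIT):
    """Yield successive slices of at most batch_size variants."""
    for start in range(0, len(variants), batch_size):
        yield variants[start:start + batch_size]


def vep_payload(batch):
    """JSON request body for one batch."""
    return json.dumps({"variants": batch})


# 450 made-up chr20 variants, purely to exercise the batching
toy_variants = [to_vep_string("20", 100000 + i, "A", "G") for i in range(450)]
batches = list(vep_batches(toy_variants))
print([len(b) for b in batches])
```

Each payload would then be sent with something like `requests.post(url, headers={"Content-Type": "application/json", "Accept": "application/json"}, data=vep_payload(batch))`. Because this genome is aligned to GRCh37, the URL would presumably point at the GRCh37 Ensembl REST server rather than the default GRCh38 one; check the Ensembl REST documentation for the current endpoint and rate limits.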

Exercises (15 minutes, independent work)

Three exercises for students:

Exercise | Task | Status
5a | Compare WGS PGx findings with the 23andMe PGx results from the previous workshop. Which variants are shared? Which are unique to WGS? | Required
5b | Pick the largest structural variant (by size) from the SV callset. What genes does it overlap? What is the likely functional impact? Search ClinVar and gnomAD for supporting evidence. | Required
5c | From the chr20 annotation results, identify any variants not present in ClinVar. These are novel or unclassified. Would you report them? What additional evidence would you need? | Optional
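
For exercise 5b, checking which genes an SV overlaps is an interval test: two closed intervals overlap when each starts before the other ends. A minimal sketch follows; the gene names and coordinates are hypothetical placeholders, and a real analysis would take gene intervals from a GRCh37 annotation file.

```python
def overlaps(start_a, end_a, start_b, end_b):
    """True when two closed intervals share at least one position."""
    return start_a <= end_b and start_b <= end_a


def genes_hit(sv, genes):
    """sv: (chrom, start, end); genes: name -> (chrom, start, end)."""
    chrom, start, end = sv
    return [
        name
        for name, (g_chrom, g_start, g_end) in genes.items()
        if g_chrom == chrom and overlaps(start, end, g_start, g_end)
    ]


# Hypothetical gene intervals and SV, for illustration only
toy_genes = {
    "GENE_A": ("20", 1000, 5000),
    "GENE_B": ("20", 8000, 9000),
    "GENE_C": ("21", 1000, 5000),
}
toy_sv = ("20", 4500, 8200)

hits = genes_hit(toy_sv, toy_genes)
print(hits)
```

A linear scan like this is fine for a single SV; for thousands of SVs against a full gene set, an interval tree or a tool such as bedtools is the more usual choice.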

Understanding your results

How to interpret QC metrics, structural variant types, and the key findings from this genome.

QC interpretation guide

Metric | Expected range | This genome | What abnormal values mean
Ti/Tv ratio | ~2.0 for WGS | 2.03 | Below 1.5: possible contamination or alignment errors. Above 2.5: possible reference bias.
Het/Hom ratio | 1.5 to 1.7 | 1.63 | Above 2.0: possible sample contamination. Below 1.2: possible consanguinity or inbreeding.
Total SNPs | 3.5M to 4.5M (European) | 3,716,648 | Counts far outside range: possible ancestry mismatch, contamination, or pipeline error.
Total indels | 700K to 1.2M | 912,009 | Very low counts may indicate caller sensitivity issues.
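
The thresholds in this table are easy to turn into an automated check. The sketch below uses the "abnormal" cut-offs for the two ratios and the expected ranges for the counts; treating those as hard limits is a simplification chosen for illustration, and the names `QC_LIMITS` and `qc_flags` are made up.

```python
# Limits taken from the QC interpretation table; using them as hard
# pass/fail boundaries is a simplifying assumption.
QC_LIMITS = {
    "ti_tv": (1.5, 2.5),
    "het_hom": (1.2, 2.0),
    "total_snps": (3_500_000, 4_500_000),
    "total_indels": (700_000, 1_200_000),
}


def qc_flags(metrics, limits=QC_LIMITS):
    """Return the names of metrics that fall outside their limits."""
    flags = []
    for name, (low, high) in limits.items():
        if not low <= metrics[name] <= high:
            flags.append(name)
    return flags


# Values reported for this genome
this_genome = {
    "ti_tv": 2.03,
    "het_hom": 1.63,
    "total_snps": 3_716_648,
    "total_indels": 912_009,
}
genome_flags = qc_flags(this_genome)

# A hypothetical problem sample: low Ti/Tv and an inflated Het/Hom ratio
suspect = dict(this_genome, ti_tv=1.2, het_hom=2.4)
suspect_flags = qc_flags(suspect)

print("This genome:", genome_flags or "all metrics within limits")
print("Suspect sample:", suspect_flags)
```

A flag is a prompt to investigate, not a verdict: ancestry, caller choice, and filtering all shift these numbers.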

Structural variant type guide

Type | What it means | Biological significance
DEL | A segment of DNA is missing compared to the reference | Can remove entire exons or genes. Common cause of loss-of-function. Relevant to carrier screening and rare disease.
DUP | A segment of DNA is duplicated (extra copy) | Gene duplications can increase gene dosage. In CYP2D6, extra copies create ultra-rapid metabolisers who burn through drugs too fast.
INV | A segment is flipped in orientation (reversed) | Can disrupt genes at breakpoints or alter regulatory elements. Some inversions are polymorphic and benign.
BND | A breakend: one side of a rearrangement, often a translocation | Suggests complex rearrangement across chromosomes. Can create fusion genes (relevant in cancer).
INS | New DNA inserted relative to the reference | Can be mobile element insertions (Alu, LINE) or novel sequence. Hardest SV type to detect from short reads.

Key statistics from this genome

3.7M SNPs
912K Indels
8,925 SVs
1,387 CNVs
2.03 Ti/Tv
1.63 Het/Hom

What WGS found that the SNP chip missed

The 23andMe chip reported ~600,000 SNPs from this individual. The 30x WGS found over 3.7 million SNPs (roughly six times as many), plus 912,000 indels and 8,925 structural variants that were invisible to the array. The indels include frameshift variants in coding genes, and the structural variants include deletions spanning multiple exons. None of this information exists in a SNP chip report.

Important limitations

Short-read WGS (like this Illumina dataset) still has blind spots. Highly repetitive regions, segmental duplications, and some complex SVs are better resolved by long-read sequencing (PacBio, Oxford Nanopore). The SV callset here is conservative; the true number of structural variants is likely higher.

Take-home messages

1. WGS is the gold standard for variant discovery

Whole-genome sequencing captures SNPs, indels, structural variants, and CNVs in a single assay, with no pre-selection bias and no fixed list of positions. In principle, any variant in a region that short reads can map to is detectable. SNP arrays, by contrast, only interrogate a fixed set of known positions.

2. Structural variants are clinically significant but underexplored

SVs account for an estimated 20% of rare disease diagnoses, yet they are routinely overlooked in standard analysis pipelines. Deletions can remove entire genes. Duplications can double gene dosage. Inversions can disrupt regulatory elements. This genome alone carries 8,925 SVs, most of which have never been individually characterised.

3. QC metrics are your first line of defence

Before interpreting a single variant, check Ti/Tv, Het/Hom, and total variant counts. These three numbers catch contamination, pipeline errors, and sample mix-ups faster than any downstream analysis. If the QC is wrong, everything built on top of it is unreliable.

4. The same genome tells different stories at different resolution

A 23andMe chip sees 600,000 SNPs and reports carrier status for a handful of conditions. A 30x WGS of the same person reveals 3.7 million SNPs, 912,000 indels, and thousands of structural variants. The biology did not change. The resolution did. Clinical interpretation depends entirely on what you measure.

5. Open data accelerates science

This genome was published under CC0 in 2013 (23andMe) and again in 2026 (30x WGS). Both datasets are freely available for anyone to download, analyse, and build upon. Open data enables reproducible science, independent validation, and educational use at no cost. This entire workshop exists because one person chose to share their genome.

6. Agent-driven analysis makes WGS accessible

WGS data is large and complex. Analysing it traditionally requires bioinformatics expertise, command-line tools, and significant compute. Agent-driven skills reduce this barrier. The AI routes to the right tool, the tool runs the grounded analysis, and the user reviews structured output. WGS interpretation is no longer limited to specialist bioinformaticians.

Medical disclaimer

ClawBio is a research and educational tool. It is not a medical device and does not provide clinical diagnoses. The findings discussed in this workshop are for educational purposes only. Consult a healthcare professional before making any medical decisions based on genetic data.

Continue exploring

ClawBio has 39 skills covering pharmacogenomics, ancestry analysis, equity scoring, single-cell RNA-seq, GWAS, proteomics, metagenomics, and more.