Workshop: Exploring a 30x Whole-Genome Sequence

The dataset

What is this dataset?

A complete human genome, sequenced at 30x depth, published under open access. What SNP arrays miss, WGS captures.

3.7M SNPs

912K Indels

8,925 Structural Variants

1,387 CNVs

A real genome, fully open

Manuel Corpas published his 30x whole-genome sequence under CC0 (public domain) on Zenodo (DOI: 10.5281/zenodo.19297389). The genome was sequenced by Dante Labs on an Illumina platform and aligned to the GRCh37 reference.

The dataset contains SNPs, indels, and structural variants (DEL, DUP, INV, BND, INS), plus copy number variant calls. This is a complete picture of one human genome, not a sparse sampling of pre-selected positions.

Research and educational use

This dataset is provided for research and educational purposes only. It is not intended for clinical decision-making.

What WGS captures that SNP arrays miss

SNP chips test approximately 600,000 pre-selected positions. WGS reads across essentially the whole genome rather than a fixed list of sites. The difference is not incremental; it is a different category of data.

Metric	SNP Array	30x WGS
SNPs	~600,000	~3,716,648
Indels	0	~912,009
Structural variants	0	8,925
CNVs	0	1,387
Gene coverage	Sparse (pre-selected)	Complete
Ti/Tv ratio	N/A	2.03
Het/Hom ratio	N/A	1.63

SNP arrays are useful for population-scale screening, but they are blind to insertions, deletions, inversions, duplications, and novel variants not on the chip. WGS sees all of these.

Pre-built subsets for instant analysis

The full genome VCF is large. For this workshop, we provide pre-extracted subsets committed to the repository. No bulk download needed for the tutorial.

Subset	Contents	File
chr20	SNPs + indels from chromosome 20 (tutorial focus)	`corpas-30x/subsets/chr20_snps_indels.vcf.gz`
PGx loci	5 pharmacogenomic variants	`corpas-30x/subsets/pgx_loci.vcf.gz`
NutriGx loci	11 nutrigenomic variants	`corpas-30x/subsets/nutrigx_loci.vcf.gz`
SV calls	8,925 structural variant calls	`corpas-30x/subsets/sv_calls.vcf.gz`
CNV calls	1,387 copy number variant calls	`corpas-30x/subsets/cnv_calls.vcf.gz`
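To get a feel for a subset before opening the notebook, a few lines of standard-library Python are enough to stream a compressed VCF. This is a minimal sketch, not the notebook's own code: the helper names `iter_records` and `count_records` are illustrative, and the path in the usage comment assumes you are working from the root of a cloned repository.

```python
import gzip
from typing import Iterator, List


def iter_records(path: str) -> Iterator[List[str]]:
    """Yield each VCF data line as a list of tab-separated fields."""
    # gzip.open in text mode also reads bgzip-compressed files,
    # because BGZF is a series of concatenated gzip members.
    with gzip.open(path, "rt") as handle:
        for line in handle:
            if line.startswith("#"):  # skip meta-information and header lines
                continue
            yield line.rstrip("\n").split("\t")


def count_records(path: str) -> int:
    """Count data lines (one per variant site) in a compressed VCF."""
    return sum(1 for _ in iter_records(path))


# Hypothetical usage from the repository root:
# print(count_records("corpas-30x/subsets/pgx_loci.vcf.gz"))
```

Streaming line by line keeps memory use flat, which matters more for the chr20 subset than for the small PGx and NutriGx files.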

Background

WGS quality, structural variants, and clinical context

The concepts you need before running the practical.

QC metrics: how to tell if a genome is good

Three numbers tell you most of what you need to know about a WGS callset:

Ti/Tv ratio: transitions vs transversions. Expected value for WGS is ~2.0. A ratio well below 2.0 suggests contamination, alignment artefacts, or low-quality variant calls.
Het/Hom ratio: heterozygous vs homozygous variant calls. Expected range for a single individual is 1.5 to 1.7. Values outside this range may indicate sample contamination (too high) or consanguinity (too low).
Total variant count: a European genome typically carries 3.5 to 4.5 million SNPs relative to GRCh37. Counts far outside this range warrant investigation.

This genome: Ti/Tv = 2.03, Het/Hom = 1.63, total SNPs = 3,716,648. All within expected ranges.
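The two ratios above can be computed directly from the genotype column of a single-sample VCF. The sketch below is one way to do it under stated assumptions: the function name `qc_ratios` is illustrative, Ti/Tv is restricted to biallelic SNPs, and Het/Hom counts every non-reference diploid genotype (SNPs and indels alike). The pipeline that produced the published figures may define either ratio slightly differently.

```python
from typing import Dict, Iterable, Optional

# The four purine<->purine and pyrimidine<->pyrimidine substitutions.
TRANSITIONS = {("A", "G"), ("G", "A"), ("C", "T"), ("T", "C")}


def qc_ratios(lines: Iterable[str]) -> Dict[str, Optional[float]]:
    """Compute Ti/Tv and Het/Hom from the lines of a single-sample VCF."""
    ti = tv = het = hom = 0
    for line in lines:
        if line.startswith("#"):
            continue
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 10:  # no sample column, nothing to genotype
            continue
        ref, alt = fields[3].upper(), fields[4].upper()
        # The VCF spec requires GT to be the first FORMAT key when present.
        gt = fields[9].split(":")[0].replace("|", "/").split("/")
        if len(gt) != 2 or "." in gt:  # skip haploid or missing calls
            continue
        if gt[0] == gt[1] == "0":  # homozygous reference is not a variant
            continue
        if gt[0] == gt[1]:
            hom += 1
        else:
            het += 1
        # Ti/Tv is only defined for single-base substitutions.
        if len(ref) == 1 and len(alt) == 1 and ref in "ACGT" and alt in "ACGT":
            if (ref, alt) in TRANSITIONS:
                ti += 1
            else:
                tv += 1
    return {
        "ti_tv": ti / tv if tv else None,
        "het_hom": het / hom if hom else None,
    }
```

Passing the lines of the chr20 subset to this function should give values in the neighbourhood of those shown in the walkthrough, though small differences are expected if the notebook filters variants differently.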

Structural variants: the hidden layer

Structural variants (SVs) are genomic rearrangements of 50 bp or larger. They are harder to detect than SNPs and indels, require specialised callers, and are frequently missed by standard short-read pipelines. Yet SVs contribute disproportionately to rare disease.

SV Type	Description	Count in this genome
DEL	Deletion: segment of DNA removed	5,854
BND	Breakend: one side of a rearrangement junction, often a translocation between chromosomes	1,413
DUP	Duplication: segment of DNA copied	778
INV	Inversion: segment flipped in orientation	673
INS	Insertion: new segment of DNA added	207

SVs range in size from 50 bp to megabases. An estimated 20% of rare disease cases involve structural variants, but most clinical pipelines still focus primarily on SNPs and small indels.

From SNP chip to WGS: what changes clinically

Moving from a genotyping array to whole-genome sequencing does not just increase the variant count. It opens entire categories of analysis that are impossible with arrays:

Full CYP2D6 haplotyping: gene deletions and duplications that change metaboliser status are invisible to SNP chips. WGS detects the structural rearrangements directly.
HLA typing: the HLA region is too polymorphic and structurally complex for array-based genotyping. WGS reads through the full region.
Novel variant discovery: arrays only test known positions. WGS finds variants that have never been catalogued before.
Coding indels: frameshift insertions and deletions in protein-coding genes are a major source of loss-of-function. Arrays do not detect them.
SVs disrupting gene function: large deletions or inversions that break genes are only visible with sequencing data.

The Corpasome: 13 years of open genomics

The Corpasome is one of the longest-running open personal genome projects:

2013: Manuel Corpas published his 23andMe genotype data under CC0 (public domain). One of the first fully open personal genomes. Used in dozens of studies and teaching exercises worldwide.
2026: The same individual, now sequenced at 30x depth by Dante Labs. Full WGS data released under CC0 on Zenodo.

Two datasets, same genome, 13 years apart. The 23andMe file has ~600,000 SNPs. The WGS file has 3.7 million SNPs, 912,000 indels, and 8,925 structural variants. Same person, radically different resolution.

Citations:
Corpas, M. (2013). Crowdsourcing the Corpasome. Source Code for Biology and Medicine, 8, 13. doi:10.1186/1751-0473-8-13
Corpas, M. (2026). 30x WGS of the Corpasome. Zenodo. doi:10.5281/zenodo.19297389

Materials

Workshop materials and links

Everything you need to run the practical. No local installation required.

Essential links

Resource	Link	Notes
Google Colab notebook	Open in Colab	Main practical. Runs in browser, free tier.
Zenodo dataset	doi:10.5281/zenodo.19297389	Full 30x WGS VCF (CC0 licence)
ClawBio GitHub	github.com/ClawBio/ClawBio	Source code, skills, documentation
Variant interpretation workshop	Previous workshop	SNP chip analysis with 23andMe data
Ensembl VEP	ensembl.org/vep	Variant Effect Predictor (public REST API)
ClinVar	ncbi.nlm.nih.gov/clinvar	Variant-disease associations database
gnomAD	gnomad.broadinstitute.org	Population allele frequency data

Skills used in this workshop

variant-annotation

Annotates VCF variants via Ensembl VEP REST API. Extracts ClinVar significance, gnomAD frequencies, and assigns Tier 1-4 priority. Outputs report, TSV, and JSON.

pharmgx-reporter

Pharmacogenomic report from genotype data. 12 genes, 31 SNPs, 51 drugs. CPIC-grounded, zero external dependencies, runs offline. Now compatible with WGS VCF input.

bio-orchestrator

Routes queries to the right skill automatically. Handles multi-step analyses that span variant annotation, pharmacogenomics, and structural variant exploration.

Requirements

A Google account (for Colab access)
A web browser (Chrome, Firefox, or Safari)
No installation, no API keys, no payment required
Approximately 40 minutes for the guided practical

Walkthrough

Step-by-step workshop instructions

Open the Colab notebook and follow along.

Setup (2 minutes)

Run the first two code cells. They clone the ClawBio repository and install dependencies (pysam, requests, pandas, matplotlib). You should see:

          ClawBio loaded successfully

          Skills available: 39

          WGS subsets found: chr20, pgx_loci, nutrigx_loci, sv_calls, cnv_calls

If Colab is slow

The git clone takes 10-20 seconds. If it times out, click "Runtime > Restart and run all". The Colab free tier occasionally throttles new sessions.

Load and explore the chr20 subset (5 minutes)

The notebook loads the chromosome 20 VCF subset. This is a manageable slice of the full genome, small enough to process quickly but large enough to demonstrate real WGS data. You will see:

Total variant count on chr20 (SNPs + indels)
Breakdown by variant type: SNPs vs insertions vs deletions
Ti/Tv ratio for chr20 (should be close to 2.0)
Het/Hom ratio for chr20 (should be close to 1.6)

          # Example output

          chr20 variants loaded: 98,412

          SNPs: 84,291 | Insertions: 6,832 | Deletions: 7,289

          Ti/Tv: 2.05 | Het/Hom: 1.61

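Discussion point: Why is the Ti/Tv ratio a useful QC metric? (Each base has one possible transition but two possible transversions, so random errors push the ratio towards 0.5. Real mutations favour transitions, so a ratio near 2.0 indicates that most calls reflect genuine variation rather than noise.)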
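The SNP / insertion / deletion breakdown comes down to comparing the lengths of the REF and ALT alleles. The following is a hedged sketch of that classification: `classify_allele` and `count_types` are illustrative names, and this version counts each ALT allele separately at multi-allelic sites, which may not match how the notebook tallies its totals.

```python
from collections import Counter
from typing import Iterable


def classify_allele(ref: str, alt: str) -> str:
    """Classify one REF/ALT pair as SNP, insertion, deletion, or other."""
    # Symbolic alleles (<DEL>), breakends, and the '*' placeholder are not
    # small variants, so they fall into a separate bucket.
    if alt.startswith("<") or "[" in alt or "]" in alt or alt in ("*", "."):
        return "other"
    if len(ref) == 1 and len(alt) == 1:
        return "SNP"
    if len(alt) > len(ref):
        return "insertion"
    if len(alt) < len(ref):
        return "deletion"
    return "other"  # same-length multi-base substitution (MNP)


def count_types(lines: Iterable[str]) -> Counter:
    """Tally variant types over VCF lines, one count per ALT allele."""
    counts: Counter = Counter()
    for line in lines:
        if line.startswith("#"):
            continue
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 5:
            continue
        for alt in fields[4].split(","):
            counts[classify_allele(fields[3], alt)] += 1
    return counts
```

Comparing allele lengths is a simplification that works for normalised, left-aligned calls; complex substitutions where both alleles are longer than one base need more careful handling.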

Explore structural variants (8 minutes)

Load the SV VCF and examine the types, sizes, and chromosomal distribution of structural variants across the genome.

Count variants by type: DEL, DUP, INV, BND, INS
Plot size distribution of deletions and duplications
Identify the largest deletion and the largest duplication
Check which genes overlap with large SVs

          # SV type counts

          DEL:  5,854

          BND:  1,413

          DUP:    778

          INV:    673

          INS:    207

          Total: 8,925

Discussion point: Why are deletions the most common SV type? (Deletions are easier to detect from short reads than insertions or complex rearrangements. This is partly biological and partly a detection bias.)
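Counting SV types and finding the largest event of each type can be sketched with the INFO column alone. This example assumes the callset follows the common VCF conventions of `SVTYPE`, `SVLEN`, and `END` keys in INFO; the actual keys depend on the SV caller used, and the function names here are illustrative rather than part of ClawBio.

```python
from collections import Counter
from typing import Dict, Iterable, Optional, Tuple


def parse_info(info: str) -> Dict[str, str]:
    """Turn 'KEY=VALUE;FLAG' into a dict (flags map to an empty string)."""
    parsed = {}
    for item in info.split(";"):
        key, _, value = item.partition("=")
        parsed[key] = value
    return parsed


def sv_size(pos: int, info: Dict[str, str]) -> Optional[int]:
    """Return the SV length in bp, or None when it cannot be determined."""
    svlen = info.get("SVLEN", "").split(",")[0]
    if svlen.lstrip("-").isdigit():  # deletions are often reported as negative
        return abs(int(svlen))
    end = info.get("END", "")
    if end.isdigit():
        return max(int(end) - pos, 0)
    return None  # typical for BND records, which have no single length


def sv_summary(lines: Iterable[str]) -> Tuple[Counter, Dict[str, Tuple[int, str]]]:
    """Count SVs by type and record the largest event per type."""
    counts: Counter = Counter()
    largest: Dict[str, Tuple[int, str]] = {}
    for line in lines:
        if line.startswith("#"):
            continue
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 8:
            continue
        info = parse_info(fields[7])
        svtype = info.get("SVTYPE", "UNKNOWN")
        counts[svtype] += 1
        size = sv_size(int(fields[1]), info)
        if size is not None and size > largest.get(svtype, (0, ""))[0]:
            largest[svtype] = (size, f"{fields[0]}:{fields[1]}")
    return counts, largest
```

The `largest` dictionary gives a starting coordinate for each type, which is a convenient entry point for the gene-overlap step and for the structural variant exercise later in the workshop.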

Extract pharmacogenomic variants from WGS (5 minutes)

Load the PGx loci subset extracted from the WGS data. Compare what the whole-genome sequence found at pharmacogenomic loci with what the original 23andMe chip reported.

Load the 5 PGx variants from the WGS subset
Cross-reference with the 23andMe findings from the previous workshop
Identify any variants that WGS captured but the SNP chip missed
Note differences in genotype calling between platforms

Why this matters

SNP arrays can only report on positions that are physically printed on the chip. If a pharmacogenomic variant is not on the chip design, it will never appear in results, regardless of whether the patient carries it. WGS has no such blind spots.
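The platform comparison can be expressed as set operations on rsIDs. The sketch below assumes the 23andMe raw-data layout (tab-separated rsid, chromosome, position, genotype, with `#` comment lines) and assumes the WGS subset carries rsIDs in the VCF ID column; the function names are illustrative. It is only meaningful for SNPs, since the chip encodes indels as `I`/`D` rather than as bases.

```python
from typing import Dict, Iterable, List


def load_23andme(lines: Iterable[str]) -> Dict[str, str]:
    """Map rsID -> sorted genotype string (e.g. 'AG') from 23andMe raw data."""
    calls = {}
    for line in lines:
        if line.startswith("#") or not line.strip():
            continue
        rsid, _chrom, _pos, genotype = line.rstrip("\n").split("\t")[:4]
        calls[rsid] = "".join(sorted(genotype))
    return calls


def load_vcf_genotypes(lines: Iterable[str]) -> Dict[str, str]:
    """Map rsID -> sorted genotype string from a single-sample VCF."""
    calls = {}
    for line in lines:
        if line.startswith("#"):
            continue
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 10 or fields[2] == ".":  # need a sample and an ID
            continue
        alleles = [fields[3]] + fields[4].split(",")
        indices = fields[9].split(":")[0].replace("|", "/").split("/")
        if "." in indices:  # missing genotype
            continue
        calls[fields[2]] = "".join(sorted(alleles[int(i)] for i in indices))
    return calls


def compare_platforms(chip: Dict[str, str], wgs: Dict[str, str]) -> Dict[str, List[str]]:
    """Split rsIDs into concordant, discordant, and platform-specific sets."""
    shared = chip.keys() & wgs.keys()
    return {
        "concordant": sorted(r for r in shared if chip[r] == wgs[r]),
        "discordant": sorted(r for r in shared if chip[r] != wgs[r]),
        "wgs_only": sorted(wgs.keys() - chip.keys()),
        "chip_only": sorted(chip.keys() - wgs.keys()),
    }
```

One caveat when reading the result: a VCF normally lists only sites that differ from the reference, so an rsID in `chip_only` may simply be homozygous reference in the WGS data rather than missed by sequencing.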

Annotate chr20 variants (5 minutes)

Run the variant-annotation skill on a selection of chr20 variants. The Ensembl VEP REST API has a limit of 200 variants per batch, so the notebook selects a clinically relevant subset.

Select variants in coding regions on chr20
Submit to VEP for functional annotation
Retrieve ClinVar significance and gnomAD frequencies
Review the prioritised output: Tier 1 through Tier 4

          # Run annotation on chr20 coding variants

          python clawbio.py run variant-annotation --input chr20_coding.vcf --output report/

          Annotated 87 variants. 1 Tier 1 (pathogenic). 4 Tier 2 (drug response). Report saved.
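The 200-variant batch limit means any larger selection has to be split before submission. The sketch below shows one way to do that with the standard library only. It is not the variant-annotation skill's implementation: the helper names are hypothetical, and the endpoint shown is the public Ensembl REST server for GRCh37, chosen here on the assumption that annotation should match the reference the genome was aligned to.

```python
import json
import urllib.request
from typing import Iterable, Iterator, List

# Assumed endpoint: Ensembl's GRCh37 REST server, VEP region POST route.
VEP_URL = "https://grch37.rest.ensembl.org/vep/human/region"
MAX_BATCH = 200  # per-request limit described in the text


def vcf_to_vep_strings(lines: Iterable[str]) -> List[str]:
    """Convert VCF data lines to the space-separated form VEP accepts."""
    variants = []
    for line in lines:
        if line.startswith("#"):
            continue
        f = line.rstrip("\n").split("\t")
        if len(f) < 5:
            continue
        # CHROM POS ID REF ALT, with QUAL/FILTER/INFO left as '.'
        variants.append(f"{f[0]} {f[1]} {f[2]} {f[3]} {f[4]} . . .")
    return variants


def batches(items: List[str], size: int = MAX_BATCH) -> Iterator[List[str]]:
    """Yield consecutive slices of at most `size` items."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def build_request(batch: List[str]) -> urllib.request.Request:
    """Build (but do not send) a JSON POST request for one batch."""
    body = json.dumps({"variants": batch}).encode("utf-8")
    headers = {"Content-Type": "application/json", "Accept": "application/json"}
    return urllib.request.Request(VEP_URL, data=body, headers=headers, method="POST")


def annotate(variants: List[str]) -> list:
    """Send every batch in turn and concatenate the JSON results."""
    results = []
    for batch in batches(variants):
        with urllib.request.urlopen(build_request(batch), timeout=60) as response:
            results.extend(json.load(response))
    return results
```

Keeping request construction separate from sending makes the batching logic easy to check offline; a production client would also add retries and respect the server's rate-limit headers.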

Exercises (15 minutes, independent work)

Three exercises for students:

Exercise	Task	Status
5a	Compare WGS PGx findings with the 23andMe PGx results from the previous workshop. Which variants are shared? Which are unique to WGS?	Required
5b	Pick the largest structural variant (by size) from the SV callset. What genes does it overlap? What is the likely functional impact? Search ClinVar and gnomAD for supporting evidence.	Required
5c	From the chr20 annotation results, identify any variants not present in ClinVar. These are novel or unclassified. Would you report them? What additional evidence would you need?	Optional

Results guide

Understanding your results

How to interpret QC metrics, structural variant types, and the key findings from this genome.

QC interpretation guide

Metric	Expected range	This genome	What abnormal values mean
Ti/Tv ratio	~2.0 for WGS	2.03	Below 1.5: possible contamination or alignment errors. Above 2.5: possible reference bias.
Het/Hom ratio	1.5 to 1.7	1.63	Above 2.0: possible sample contamination. Below 1.2: possible consanguinity or inbreeding.
Total SNPs	3.5M to 4.5M (European)	3,716,648	Counts far outside range: possible ancestry mismatch, contamination, or pipeline error.
Total indels	700K to 1.2M	912,009	Very low counts may indicate caller sensitivity issues.
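The thresholds in this table translate naturally into a small checker. This is a sketch under stated assumptions: `qc_flags` is an illustrative name, the ratio checks use the "abnormal" cut-offs from the last column, and the count checks use the expected ranges, which for SNPs apply to a European genome called against GRCh37.

```python
from typing import List


def qc_flags(ti_tv: float, het_hom: float, snps: int, indels: int) -> List[str]:
    """Return a warning for each metric outside the table's thresholds."""
    flags = []
    if ti_tv < 1.5:
        flags.append("Ti/Tv below 1.5: possible contamination or alignment errors")
    elif ti_tv > 2.5:
        flags.append("Ti/Tv above 2.5: possible reference bias")
    if het_hom > 2.0:
        flags.append("Het/Hom above 2.0: possible sample contamination")
    elif het_hom < 1.2:
        flags.append("Het/Hom below 1.2: possible consanguinity or inbreeding")
    if not 3_500_000 <= snps <= 4_500_000:
        flags.append("SNP count outside 3.5M to 4.5M: check ancestry, contamination, pipeline")
    if not 700_000 <= indels <= 1_200_000:
        flags.append("Indel count outside 700K to 1.2M: check caller sensitivity")
    return flags


# Values reported for this genome produce no warnings.
print(qc_flags(2.03, 1.63, 3_716_648, 912_009))  # → []
```

An empty list means every metric passed; values between the expected range and the abnormal cut-off (for example a Het/Hom of 1.8) are not flagged here and are left to the analyst's judgement.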

Structural variant type guide

Type	What it means	Biological significance
DEL	A segment of DNA is missing compared to the reference	Can remove entire exons or genes. Common cause of loss-of-function. Relevant to carrier screening and rare disease.
DUP	A segment of DNA is duplicated (extra copy)	Gene duplications can increase gene dosage. In CYP2D6, extra copies create ultra-rapid metabolisers who burn through drugs too fast.
INV	A segment is flipped in orientation (reversed)	Can disrupt genes at breakpoints or alter regulatory elements. Some inversions are polymorphic and benign.
BND	A breakend: one side of a rearrangement, often a translocation	Suggests complex rearrangement across chromosomes. Can create fusion genes (relevant in cancer).
INS	New DNA inserted relative to the reference	Can be mobile element insertions (Alu, LINE) or novel sequence. Hardest SV type to detect from short reads.

Key statistics from this genome

3.7M SNPs

912K Indels

8,925 SVs

1,387 CNVs

2.03 Ti/Tv

1.63 Het/Hom

What WGS found that the SNP chip missed

The 23andMe chip reported ~600,000 SNPs from this individual. The 30x WGS found over 3.7 million SNPs (roughly six times as many), plus 912,000 indels and 8,925 structural variants that were completely invisible to the array. The indels include frameshift variants in coding genes, and the structural variants include deletions spanning multiple exons. None of this information exists in a SNP chip report.

Important limitations

Short-read WGS (like this Illumina dataset) still has blind spots. Highly repetitive regions, segmental duplications, and some complex SVs are better resolved by long-read sequencing (PacBio, Oxford Nanopore). The SV callset here is conservative; the true number of structural variants is likely higher.

Summary

Take-home messages

WGS is the gold standard for variant discovery

Whole-genome sequencing captures SNPs, indels, structural variants, and CNVs in a single assay. No pre-selection bias: no chip design decides in advance which positions are tested. Any variant in a region that short reads can map is, in principle, discoverable. SNP arrays, by contrast, only interrogate a fixed set of known positions.

Structural variants are clinically significant but underexplored

SVs account for an estimated 20% of rare disease diagnoses, yet they are routinely overlooked in standard analysis pipelines. Deletions can remove entire genes. Duplications can double gene dosage. Inversions can disrupt regulatory elements. This genome alone carries 8,925 SVs, most of which have never been individually characterised.

QC metrics are your first line of defence

Before interpreting a single variant, check Ti/Tv, Het/Hom, and total variant counts. These three numbers catch contamination, pipeline errors, and sample mix-ups faster than any downstream analysis. If the QC is wrong, everything built on top of it is unreliable.

The same genome tells different stories at different resolution

A 23andMe chip sees 600,000 SNPs and reports carrier status for a handful of conditions. A 30x WGS of the same person reveals 3.7 million SNPs, 912,000 indels, and thousands of structural variants. The biology did not change. The resolution did. Clinical interpretation depends entirely on what you measure.

Open data accelerates science

This genome was published under CC0 in 2013 (23andMe) and again in 2026 (30x WGS). Both datasets are freely available for anyone to download, analyse, and build upon. Open data enables reproducible science, independent validation, and educational use at no cost. This entire workshop exists because one person chose to share their genome.

Agent-driven analysis makes WGS accessible

WGS data is large and complex. Analysing it traditionally requires bioinformatics expertise, command-line tools, and significant compute. Agent-driven skills reduce this barrier. The AI routes to the right tool, the tool runs the grounded analysis, and the user reviews structured output. WGS interpretation is no longer limited to specialist bioinformaticians.

Medical disclaimer

ClawBio is a research and educational tool. It is not a medical device and does not provide clinical diagnoses. The findings discussed in this workshop are for educational purposes only. Consult a healthcare professional before making any medical decisions based on genetic data.
