Go beyond SNP chips. Explore 3.7 million SNPs, 912,000 indels, and 8,925 structural variants from a real human genome. Everything runs in Google Colab. No installation, no API keys, no cost.
The dataset
A complete human genome, sequenced at 30x depth, published under open access. What SNP arrays miss, WGS captures.
Background
The concepts you need before running the practical.
Materials
Everything you need to run the practical. No local installation required.
| Resource | Link | Notes |
|---|---|---|
| Google Colab notebook | Open in Colab | Main practical. Runs in browser, free tier. |
| Zenodo dataset | doi:10.5281/zenodo.19297389 | Full 30x WGS VCF (CC0 licence) |
| ClawBio GitHub | github.com/ClawBio/ClawBio | Source code, skills, documentation |
| Variant interpretation workshop | Previous workshop | SNP chip analysis with 23andMe data |
| Ensembl VEP | ensembl.org/vep | Variant Effect Predictor (public REST API) |
| ClinVar | ncbi.nlm.nih.gov/clinvar | Variant-disease associations database |
| gnomAD | gnomad.broadinstitute.org | Population allele frequency data |
Annotates VCF variants via Ensembl VEP REST API. Extracts ClinVar significance, gnomAD frequencies, and assigns Tier 1-4 priority. Outputs report, TSV, and JSON.
Pharmacogenomic report from genotype data. 12 genes, 31 SNPs, 51 drugs. CPIC-grounded, zero external dependencies, runs offline. Now compatible with WGS VCF input.
Routes queries to the right skill automatically. Handles multi-step analyses that span variant annotation, pharmacogenomics, and structural variant exploration.
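To make the annotation step above more concrete, here is a minimal sketch of how VCF lines could be batched for the public Ensembl VEP REST endpoint (`POST /vep/homo_sapiens/region`). It is an illustration, not ClawBio's actual implementation. The function names, the batch size of 200, and the choice of the GRCh38 server are assumptions, and the tiering logic is deliberately left out because this page does not define it.

```python
import json
import urllib.request

# Public VEP REST endpoint (GRCh38 server). Assumption: the VCF uses GRCh38 coordinates.
VEP_URL = "https://rest.ensembl.org/vep/homo_sapiens/region"


def vcf_line_to_vep(line):
    """Convert one VCF data line into the space-separated string VEP accepts."""
    chrom, pos, vid, ref, alt = line.rstrip("\n").split("\t")[:5]
    if chrom.startswith("chr"):
        chrom = chrom[3:]  # Ensembl uses "21", not "chr21"
    return f"{chrom} {pos} {vid} {ref} {alt} . . ."


def build_vep_payloads(vcf_lines, batch_size=200):
    """Yield JSON request bodies holding at most batch_size variants each."""
    batch = []
    for line in vcf_lines:
        if line.startswith("#") or not line.strip():
            continue  # skip header and blank lines
        batch.append(vcf_line_to_vep(line))
        if len(batch) == batch_size:
            yield json.dumps({"variants": batch})
            batch = []
    if batch:
        yield json.dumps({"variants": batch})


def post_to_vep(body):
    """Send one batch to VEP and return the parsed JSON (needs network access)."""
    request = urllib.request.Request(
        VEP_URL,
        data=body.encode("utf-8"),
        headers={"Content-Type": "application/json", "Accept": "application/json"},
    )
    with urllib.request.urlopen(request) as response:
        return json.load(response)
```

Building the payloads separately from sending them keeps the parsing testable offline and makes it easy to throttle requests to the public server.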
Results guide
How to interpret QC metrics, structural variant types, and the key findings from this genome.
| Metric | Expected range | This genome | What abnormal values mean |
|---|---|---|---|
| Ti/Tv ratio | ~2.0 for WGS | 2.03 | Below 1.5: possible contamination or alignment errors. Above 2.5: possible reference bias. |
| Het/Hom ratio | 1.5 to 1.7 | 1.63 | Above 2.0: possible sample contamination. Below 1.2: possible consanguinity or inbreeding. |
| Total SNPs | 3.5M to 4.5M (European) | 3,716,648 | Counts far outside range: possible ancestry mismatch, contamination, or pipeline error. |
| Total indels | 700K to 1.2M | 912,009 | Very low counts may indicate caller sensitivity issues. |
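As a rough illustration of where the first two metrics come from, the sketch below counts transitions, transversions, heterozygous calls, and homozygous-alternate calls from single-sample VCF lines. It assumes biallelic SNPs only, a genotype in the first sample column, and Het/Hom defined as het calls over hom-alt calls. Dedicated tools may use slightly different definitions, so treat the function name `snp_qc` and its output keys as hypothetical.

```python
TRANSITIONS = {("A", "G"), ("G", "A"), ("C", "T"), ("T", "C")}
BASES = set("ACGT")


def snp_qc(vcf_lines):
    """Return Ti/Tv and Het/Hom counts and ratios for biallelic SNPs."""
    ti = tv = het = hom = 0
    for line in vcf_lines:
        if line.startswith("#"):
            continue  # header line
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 10:
            continue  # no sample column
        ref, alt = fields[3].upper(), fields[4].upper()
        if ref not in BASES or alt not in BASES:
            continue  # skips indels, multi-allelic sites, symbolic alleles
        # GT is the first FORMAT key when present; treat phased and unphased alike.
        alleles = fields[9].split(":")[0].replace("|", "/").split("/")
        if len(alleles) != 2 or "." in alleles or alleles == ["0", "0"]:
            continue  # missing, non-diploid, or hom-ref genotype
        if (ref, alt) in TRANSITIONS:
            ti += 1
        else:
            tv += 1
        if alleles[0] == alleles[1]:
            hom += 1
        else:
            het += 1
    return {
        "ti": ti,
        "tv": tv,
        "het": het,
        "hom": hom,
        "ti_tv": ti / tv if tv else None,
        "het_hom": het / hom if hom else None,
    }
```

In the notebook, passing an open VCF file handle (or any iterable of lines) to `snp_qc` would give ratios to compare against the table above.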
| Type | What it means | Biological significance |
|---|---|---|
| DEL | A segment of DNA is missing compared to the reference | Can remove entire exons or genes. Common cause of loss-of-function. Relevant to carrier screening and rare disease. |
| DUP | A segment of DNA is duplicated (extra copy) | Gene duplications can increase gene dosage. In CYP2D6, extra copies create ultra-rapid metabolisers who burn through drugs too fast. |
| INV | A segment is flipped in orientation (reversed) | Can disrupt genes at breakpoints or alter regulatory elements. Some inversions are polymorphic and benign. |
| BND | A breakend: one side of a rearrangement, often a translocation | Can indicate a complex rearrangement, sometimes across chromosomes. Can create fusion genes (relevant in cancer). |
| INS | New DNA inserted relative to the reference | Can be mobile element insertions (Alu, LINE) or novel sequence. Hardest SV type to detect from short reads. |
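The types in this table correspond to the `SVTYPE` key that SV callers write into the VCF INFO column. A minimal sketch for tallying them is shown below. The helper names are hypothetical, and the code assumes every SV record carries `SVTYPE`, which is common but depends on the caller.

```python
from collections import Counter


def parse_info(info):
    """Split a VCF INFO string into a dict; flag keys map to an empty string."""
    parsed = {}
    for item in info.split(";"):
        key, _, value = item.partition("=")
        parsed[key] = value
    return parsed


def count_sv_types(vcf_lines):
    """Count records per SVTYPE value (DEL, DUP, INV, BND, INS, ...)."""
    counts = Counter()
    for line in vcf_lines:
        if line.startswith("#"):
            continue  # header line
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 8:
            continue  # malformed or truncated record
        svtype = parse_info(fields[7]).get("SVTYPE")
        if svtype:
            counts[svtype] += 1
    return counts
```

One caveat when reading the totals: callers often emit one BND record per breakend, so a single rearrangement may be counted more than once.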
The 23andMe chip reported ~600,000 SNPs from this individual. The 30x WGS found over 3.7 million SNPs (roughly six times as many), plus 912,000 indels and 8,925 structural variants that were completely invisible to the array. The indels alone include frameshift variants in coding genes, and the structural variants include deletions spanning multiple exons. None of this information exists in a SNP chip report.
Short-read WGS (like this Illumina dataset) still has blind spots. Highly repetitive regions, segmental duplications, and some complex SVs are better resolved by long-read sequencing (PacBio, Oxford Nanopore). The SV callset here is conservative; the true number of structural variants is likely higher.
Summary
Whole-genome sequencing captures SNPs, indels, structural variants, and CNVs in a single assay, with no pre-selection of positions. In principle, any variant in a region where reads can be mapped is detectable, although short-read data still has blind spots in highly repetitive sequence. SNP arrays, by contrast, only interrogate a fixed set of known positions.
SVs account for an estimated 20% of rare disease diagnoses, yet they are routinely overlooked in standard analysis pipelines. Deletions can remove entire genes. Duplications can increase gene dosage. Inversions can disrupt regulatory elements. This genome alone carries 8,925 SVs, most of which have never been individually characterised.
Before interpreting a single variant, check Ti/Tv, Het/Hom, and total variant counts. These three numbers catch contamination, pipeline errors, and sample mix-ups faster than any downstream analysis. If the QC is wrong, everything built on top of it is unreliable.
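A QC gate like this can be sketched as a small range check. The thresholds below are taken from the QC table in the results guide (the values it describes as abnormal for Ti/Tv and Het/Hom, and the expected ranges for SNP and indel counts). The dictionary keys and the function name `qc_flags` are hypothetical, and the ranges would need adjusting for other ancestries or pipelines.

```python
# (low, high) bounds per metric, taken from the workshop's QC table.
QC_RANGES = {
    "ti_tv": (1.5, 2.5),
    "het_hom": (1.2, 2.0),
    "total_snps": (3_500_000, 4_500_000),
    "total_indels": (700_000, 1_200_000),
}


def qc_flags(metrics):
    """Return a list of human-readable warnings; an empty list means QC passed."""
    flags = []
    for name, (low, high) in QC_RANGES.items():
        value = metrics.get(name)
        if value is None:
            flags.append(f"{name}: missing")
        elif value < low:
            flags.append(f"{name}: {value} is below {low}")
        elif value > high:
            flags.append(f"{name}: {value} is above {high}")
    return flags


# Values reported for this genome in the QC table.
this_genome = {
    "ti_tv": 2.03,
    "het_hom": 1.63,
    "total_snps": 3_716_648,
    "total_indels": 912_009,
}
print(qc_flags(this_genome) or "QC passed")
```

Returning a list of warnings rather than a single pass/fail value makes it clear which metric to investigate before any variant is interpreted.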
A 23andMe chip sees about 600,000 SNPs and reports carrier status for a handful of conditions. A 30x WGS of the same person reveals 3.7 million SNPs, 912,000 indels, and thousands of structural variants. The biology did not change. The resolution did. Clinical interpretation depends entirely on what you measure.
This genome was published under CC0 in 2013 (23andMe) and again in 2026 (30x WGS). Both datasets are freely available for anyone to download, analyse, and build upon. Open data enables reproducible science, independent validation, and educational use at no cost. This entire workshop exists because one person chose to share their genome.
WGS data is large and complex. Analysing it traditionally requires bioinformatics expertise, command-line tools, and significant compute. Agent-driven skills reduce this barrier. The AI routes to the right tool, the tool runs the grounded analysis, and the user reviews structured output. WGS interpretation is no longer limited to specialist bioinformaticians.
ClawBio is a research and educational tool. It is not a medical device and does not provide clinical diagnoses. The findings discussed in this workshop are for educational purposes only. Consult a healthcare professional before making any medical decisions based on genetic data.
Get involved
ClawBio has 39 skills covering pharmacogenomics, ancestry analysis, equity scoring, single-cell RNA-seq, GWAS, proteomics, metagenomics, and more.