Spring 2026 · KAIST · Co-taught with Prof. Homin Kim
This course explores the intersection of artificial intelligence and protein science. The AI section covers modern deep learning approaches — from foundational concepts to state-of-the-art models including AlphaFold, RFDiffusion, and ProteinMPNN. All lecture notes are written in a textbook narrative style, accessible to students new to machine learning.
Prerequisites: Python programming, linear algebra, basic probability. Prior deep learning experience is not required, but students without it are expected to study the preliminary notes.
Textbooks: Our lecture notes are self-contained, but students who want deeper background may find these open-access textbooks helpful:
- White et al., Deep Learning for Molecules and Materials — applied deep learning for molecular and materials science, with interactive examples.
- Zhang et al., Dive into Deep Learning (CC BY-SA 4.0) — hands-on, code-first introduction to deep learning with PyTorch.
- Prince, Understanding Deep Learning (CC BY-NC-ND) — conceptual and mathematical treatment with excellent figures.
Preliminary Notes
Self-study material for students with a biology background to prepare for in-class lectures.
- Introduction to Machine Learning with Linear Regression — What machine learning is, how to represent data as tensors, and how a linear model learns from protein data—one gradient step at a time.
- Protein Features and Neural Networks — How to turn protein sequences into numerical features—one-hot encodings, PyTorch tensors—and the neural network architectures that learn representations from them.
- Training Neural Networks for Protein Science — Loss functions, optimizers, the training loop, data loading, validation, and overfitting—everything you need to train a protein model.
- Case Study: Predicting Protein Solubility — An end-to-end case study—building an MLP solubility predictor from sequence features, and learning to evaluate honestly with sequence-identity splits, class weighting, and early stopping.
- Code Walkthrough: nano-solubility — Build a protein solubility classifier from scratch — learned embeddings, 1D convolutions, and evaluation on real E. coli expression data.
Lectures
In-class lectures covering advanced architectures and landmark protein AI models.
- Transformers for Protein Sequences — Attention mechanisms for protein sequences—from adaptive weight matrices to multi-head self-attention and the full transformer architecture.
- Graph Neural Networks for Protein Structures — Message-passing networks for protein structures—from graph representations to GCN, GAT, MPNN, and SE(3)-equivariant architectures.
- Variational Autoencoders for Proteins — Variational autoencoders for generating novel proteins—the encoder-decoder framework, the ELBO derivation, and the reparameterization trick.
- Code Walkthrough: nano-polymer-vae — Build a variational autoencoder for 2D bead-spring polymers — data preprocessing, encoder-decoder MLP, latent-space visualization, and property optimization.
- Diffusion Models for Protein Generation — Denoising diffusion probabilistic models—the forward noising process, reverse denoising, score matching, and conditional generation for protein structures.
- Code Walkthrough: nano-polymer-diffusion — Build a denoising diffusion model (DDPM) for 2D bead-spring polymers — noise schedule, MLP denoiser with timestep embedding, and side-by-side comparison with the VAE.
- Protein Language Models: Tasks and Applications — What protein language models learn from evolution, how their embeddings capture structure and function, and practical applications from mutation scoring to structure prediction.
- Protein Language Models: Architecture and Training — Inside ESM-2's transformer backbone—masked language modeling, SwiGLU activations, embedding extraction, mutation scoring, LoRA fine-tuning, and attention-based contact prediction.
- Code Walkthrough: nano-esm2 — Build ESM-2 from scratch in 288 lines of PyTorch — masked language modeling for protein sequences.
- AlphaFold: The Structure Prediction Problem — Why protein structure prediction matters, from Anfinsen's hypothesis to CASP14—and how AlphaFold2 and its successors are transforming biology, drug discovery, and protein engineering.
- AlphaFold: Architecture and Training — Inside AlphaFold2—input embedding, the Evoformer's triangle updates, invariant point attention, FAPE loss, and the full prediction pipeline.
- Code Walkthrough: nano-alphafold2 — Build AlphaFold2 from scratch in ~650 lines of PyTorch — Pairformer, SE(3) diffusion, and FAPE loss.
- Protein Backbone Design: Problems and Applications — The de novo protein design revolution—why controlling backbone shape gives control over function, and how RFDiffusion and related methods are enabling designed binders, enzymes, and assemblies.
- RFDiffusion: SE(3) Diffusion for Protein Backbones — The mathematics of structure generation—rotation representations, IGSO(3) noise, SE(3)-equivariant networks, and the RFDiffusion architecture for de novo protein backbone design.
- Code Walkthrough: nano-rfdiffusion — Build RFDiffusion from scratch in 607 lines of PyTorch — SE(3) diffusion with IPA-based denoising.
- Inverse Folding and the Protein Design Pipeline — The inverse folding problem—why multiple sequences fold into the same structure, how ProteinMPNN bridges backbone design to experimental testing, and the complete computational protein design workflow.
- ProteinMPNN: Architecture and Training — Inside ProteinMPNN—k-nearest neighbor graph construction, geometric edge features, message-passing encoder, random-order autoregressive decoder, and training with coordinate noise.
- Code Walkthrough: nano-proteinmpnn — Build ProteinMPNN from scratch in 448 lines of PyTorch — inverse folding with graph neural networks.
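To preview the transformer lecture, scaled dot-product attention reduces to a few lines once a softmax and a dot product are in hand. The sketch below uses plain Python lists in place of batched PyTorch tensors, and the names (`attend`, `softmax`) are illustrative rather than taken from the course code.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of scores."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def attend(queries, keys, values):
    """Scaled dot-product attention: for each query, return a weighted
    average of value vectors, weighted by softmax(q . k / sqrt(d_k))."""
    d_k = len(keys[0])
    out = []
    for q in queries:
        scores = [dot(q, k) / math.sqrt(d_k) for k in keys]
        weights = softmax(scores)
        out.append([sum(w * v[i] for w, v in zip(weights, values))
                    for i in range(len(values[0]))])
    return out
```

A query that strongly matches one key pulls its output almost entirely from that key's value vector; this content-based routing is what the lecture builds into multi-head self-attention.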
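Similarly, the forward noising process from the diffusion lectures fits in a short sketch: a linear beta schedule, the running product alpha_bar, and the closed-form sample x_t = sqrt(alpha_bar_t) * x_0 + sqrt(1 - alpha_bar_t) * eps. The endpoints (1e-4, 0.02) and T = 1000 follow common DDPM practice and are illustrative defaults, not necessarily the course's exact settings.

```python
import math

T = 1000
# Linear beta schedule from 1e-4 to 0.02 (common DDPM defaults).
betas = [1e-4 + (0.02 - 1e-4) * t / (T - 1) for t in range(T)]

# alpha_bar_t is the running product of (1 - beta_s) for s <= t.
alpha_bars = []
prod = 1.0
for beta in betas:
    prod *= 1.0 - beta
    alpha_bars.append(prod)

def q_sample(x0, t, eps):
    """Sample x_t ~ q(x_t | x_0) for a scalar coordinate x0 and noise eps."""
    ab = alpha_bars[t]
    return math.sqrt(ab) * x0 + math.sqrt(1.0 - ab) * eps

# Early timesteps barely perturb x_0; by t = T - 1 the signal is nearly gone.
print(q_sample(1.0, 0, 0.0), q_sample(1.0, T - 1, 0.0))
```

The reverse (denoising) direction is what the MLP denoiser in nano-polymer-diffusion learns; the schedule above fixes everything about the forward direction in closed form.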
Key References
- Jumper et al. (2021). “Highly accurate protein structure prediction with AlphaFold.” Nature.
- Watson et al. (2023). “De novo design of protein structure and function with RFdiffusion.” Nature.
- Dauparas et al. (2022). “Robust deep learning-based protein sequence design using ProteinMPNN.” Science.
- Lin et al. (2023). “Evolutionary-scale prediction of atomic-level protein structure with a language model.” Science.