Spring 2026 · KAIST · Co-taught with Prof. Homin Kim
This course explores the intersection of artificial intelligence and protein science. The AI section covers modern deep learning approaches — from foundational concepts to state-of-the-art models including AlphaFold, RFDiffusion, and ProteinMPNN. All lecture notes are written in a textbook narrative style, accessible to students new to machine learning.
Prerequisites: Python programming, linear algebra, basic probability. Prior deep learning experience is not required, but students without it are expected to study the preliminary notes.
Textbooks: Our lecture notes are self-contained, but students who want deeper background may find these open-access textbooks helpful:
- White et al., Deep Learning for Molecules and Materials — applied deep learning for molecular and materials science, with interactive examples.
- Zhang et al., Dive into Deep Learning (CC BY-SA 4.0) — hands-on, code-first introduction to deep learning with PyTorch.
- Prince, Understanding Deep Learning (CC BY-NC-ND) — conceptual and mathematical treatment with excellent figures.
Preliminary Notes
Self-study material for students with a biology background to prepare for in-class lectures.
- Introduction to Machine Learning with Linear Regression — What machine learning is, how to represent data as tensors, and how a linear model learns from protein data—one gradient step at a time.
- Protein Features and Neural Networks — How to turn protein sequences into numerical features—one-hot encodings, PyTorch tensors—and the neural network architectures that learn representations from them.
- Training Neural Networks for Protein Science — Loss functions, optimizers, the training loop, data loading, validation, and overfitting—everything you need to train a protein model.
- Case Study: Predicting Protein Solubility — An end-to-end case study—building an MLP solubility predictor from sequence features, and learning to evaluate honestly with sequence-identity splits, class weighting, and early stopping.
- Code Walkthrough: nano-solubility — Build a protein solubility classifier from scratch — learned embeddings, 1D convolutions, and evaluation on real E. coli expression data.
Lectures
In-class lectures covering advanced architectures and landmark protein AI models.
- Transformers for Protein Sequences — Attention mechanisms for protein sequences—from adaptive weight matrices to multi-head self-attention and the full transformer architecture.
- Graph Neural Networks for Protein Structures — Message-passing networks for protein structures—from graph representations to GCN, GAT, MPNN, and SE(3)-equivariant architectures.
- Variational Autoencoders for Proteins — Variational autoencoders for generating novel proteins—the encoder-decoder framework, the ELBO derivation, and the reparameterization trick.
- Code Walkthrough: nano-polymer-vae — Build a variational autoencoder for 2D bead-spring polymers — data preprocessing, encoder-decoder MLP, latent-space visualization, and property optimization.
- Diffusion Models for Protein Generation — Denoising diffusion probabilistic models—the forward noising process, reverse denoising, score matching, and conditional generation for protein structures.
- Code Walkthrough: nano-polymer-diffusion — Build a denoising diffusion model (DDPM) for 2D bead-spring polymers — noise schedule, MLP denoiser with timestep embedding, and side-by-side comparison with the VAE.
- Protein Language Models: Tasks and Applications — What protein language models learn from evolution, how their embeddings capture structure and function, and practical applications from mutation scoring to structure prediction.
- Protein Language Models: Architecture and Training — Inside ESM-2's transformer backbone—masked language modeling, SwiGLU activations, embedding extraction, mutation scoring, LoRA fine-tuning, and attention-based contact prediction.
- Code Walkthrough: nano-esm2 — Build ESM-2 from scratch in 288 lines of PyTorch — masked language modeling for protein sequences.
- AlphaFold: The Structure Prediction Problem — Why protein structure prediction matters, from Anfinsen's hypothesis to CASP14—and how AlphaFold2 and its successors are transforming biology, drug discovery, and protein engineering.
- AlphaFold: Architecture and Training — Inside AlphaFold2—input embedding, the Evoformer's triangle updates, invariant point attention, FAPE loss, and the full prediction pipeline.
- Code Walkthrough: nano-alphafold2 — Build AlphaFold2 from scratch in ~650 lines of PyTorch — Pairformer, SE(3) diffusion, and FAPE loss.
- Protein Backbone Design: Problems and Applications — The de novo protein design revolution—why controlling backbone shape gives control over function, and how RFDiffusion and related methods are enabling designed binders, enzymes, and assemblies.
- RFDiffusion: SE(3) Diffusion for Protein Backbones — The mathematics of structure generation—rotation representations, IGSO(3) noise, SE(3)-equivariant networks, and the RFDiffusion architecture for de novo protein backbone design.
- Code Walkthrough: nano-rfdiffusion — Build RFDiffusion from scratch in 607 lines of PyTorch — SE(3) diffusion with IPA-based denoising.
- Inverse Folding and the Protein Design Pipeline — The inverse folding problem—why multiple sequences fold into the same structure, how ProteinMPNN bridges backbone design to experimental testing, and the complete computational protein design workflow.
- ProteinMPNN: Architecture and Training — Inside ProteinMPNN—k-nearest neighbor graph construction, geometric edge features, message-passing encoder, random-order autoregressive decoder, and training with coordinate noise.
- Code Walkthrough: nano-proteinmpnn — Build ProteinMPNN from scratch in 448 lines of PyTorch — inverse folding with graph neural networks.
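To preview the transformer lecture, scaled dot-product attention reduces to a few lines once a softmax and a dot product are in hand. The sketch below uses plain Python lists in place of batched PyTorch tensors, and the names (`attend`, `softmax`) are illustrative rather than taken from the course code.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of scores."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def attend(queries, keys, values):
    """Scaled dot-product attention: for each query, return a weighted
    average of value vectors, weighted by softmax(q . k / sqrt(d_k))."""
    d_k = len(keys[0])
    out = []
    for q in queries:
        scores = [dot(q, k) / math.sqrt(d_k) for k in keys]
        weights = softmax(scores)
        out.append([sum(w * v[i] for w, v in zip(weights, values))
                    for i in range(len(values[0]))])
    return out
```

A query that strongly matches one key pulls its output almost entirely from that key's value vector; this content-based routing is what the lecture builds into multi-head self-attention.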
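Similarly, the forward noising process from the diffusion lectures fits in a short sketch: a linear beta schedule, the running product alpha_bar, and the closed-form sample x_t = sqrt(alpha_bar_t) * x_0 + sqrt(1 - alpha_bar_t) * eps. The endpoints (1e-4, 0.02) and T = 1000 follow common DDPM practice and are illustrative defaults, not necessarily the course's exact settings.

```python
import math

T = 1000
# Linear beta schedule from 1e-4 to 0.02 (common DDPM defaults).
betas = [1e-4 + (0.02 - 1e-4) * t / (T - 1) for t in range(T)]

# alpha_bar_t is the running product of (1 - beta_s) for s <= t.
alpha_bars = []
prod = 1.0
for beta in betas:
    prod *= 1.0 - beta
    alpha_bars.append(prod)

def q_sample(x0, t, eps):
    """Sample x_t ~ q(x_t | x_0) for a scalar coordinate x0 and noise eps."""
    ab = alpha_bars[t]
    return math.sqrt(ab) * x0 + math.sqrt(1.0 - ab) * eps

# Early timesteps barely perturb x_0; by t = T - 1 the signal is nearly gone.
print(q_sample(1.0, 0, 0.0), q_sample(1.0, T - 1, 0.0))
```

The reverse (denoising) direction is what the MLP denoiser in nano-polymer-diffusion learns; the schedule above fixes everything about the forward direction in closed form.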
Key References
- Jumper et al. (2021). “Highly accurate protein structure prediction with AlphaFold.” Nature.
- Watson et al. (2023). “De novo design of protein structure and function with RFdiffusion.” Nature.
- Dauparas et al. (2022). “Robust deep learning-based protein sequence design using ProteinMPNN.” Science.
- Lin et al. (2023). “Evolutionary-scale prediction of atomic-level protein structure with a language model.” Science.