Data-Centric Deep Learning
Learn to build, improve, and repair deep learning models with a data-centric approach. This course puts you in the shoes of a deep learning engineer and simulates the real-world challenges of improving data quality, building and testing deep learning models, and improving performance with a human in the loop. Week by week, we will develop an understanding of the critical role of data in deep learning operations – from integration tests to deep learning tooling to iterative annotation. Learn the best practices for deep learning in the real world.
Senior Manager at Apple and Instructor at Stanford
Andrew Maas is currently at Apple working on data-centric deep learning. He completed a PhD in Computer Science at Stanford in 2015, advised by Andrew Ng and Dan Jurafsky; his dissertation focused on large-scale deep learning methods for spoken and written language. Andrew has worked as an engineer and scientific advisor to several startups, including Wit.ai, Coursera, and Semantic Machines. Prior to Apple, he built an NLP platform for precise healthcare language as cofounder of Roam Analytics. He also teaches CS224S: Spoken Language Processing as a visiting lecturer at Stanford University.
PhD Scholar at Stanford
Mike Wu is currently a fifth-year PhD student at Stanford University, advised by Noah Goodman. His research spans inference algorithms, deep generative models, and unsupervised learning. Mike’s research has appeared at NeurIPS, ICLR, AISTATS, and other top ML conferences, with two best-paper awards, and his work has been featured in the New York Times. Mike previously worked as a software engineer at Lattice Data, an AI startup, and as a research engineer in Meta’s applied machine learning group. Mike and Andrew designed and taught a new version of Stanford’s CS224S: Spoken Language Processing in 2022.
As deep learning becomes more deeply embedded in real-world applications, fundamental questions arise around scalability, reproducibility, and quality. Unlike its predecessors, neural network systems introduce a new relationship between the practitioner and data – trained deep learning engineers take a “data-centric” approach to building, improving, and repairing models so that they are high performing and reliable in the real world. This calls for a new skill set and a toolkit of best practices for ensuring the quality of data, annotations, and models alike.
In this course, students are given a series of projects that showcase best practices in both natural language processing and computer vision. Students will receive a mix of practical knowledge (the best tools and frameworks for deep learning engineering) and decision-making guidelines (what are the different ways data can be used in the modern AI workflow?). The course takes students through multiple stages, from inspecting annotations to continuous testing to iterative annotation to protecting models against distribution shift and adversarial examples. By the end of the course, students will have built a web application with an embedded model and gained a thorough understanding of what it means to take a data-centric approach to AI.
- How to inspect and improve data quality and annotation quality.
- How to identify and remove data anomalies or outliers.
- The types of annotation errors and their effects on model performance.
- Data analysis in NLP and computer vision.
- Simulations of annotation errors and a model evaluation framework.
- Annotation analysis for (1) a bounding-box task for object detection and (2) a text span task for entity recognition.
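As a concrete illustration of annotation analysis for the bounding-box task, agreement between two annotators is commonly measured with intersection-over-union (IoU). A minimal sketch (the box coordinates below are made up for illustration, not from the course materials):

```python
def iou(box_a, box_b):
    """Intersection-over-union of two boxes given as (x1, y1, x2, y2)."""
    x1 = max(box_a[0], box_b[0])
    y1 = max(box_a[1], box_b[1])
    x2 = min(box_a[2], box_b[2])
    y2 = min(box_a[3], box_b[3])
    inter = max(0, x2 - x1) * max(0, y2 - y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union else 0.0

# Two annotators label the same object; a low IoU flags a disagreement to review.
annotator_1 = (10, 10, 50, 50)
annotator_2 = (12, 8, 48, 52)
print(iou(annotator_1, annotator_2))  # ≈ 0.83, strong agreement
```

Averaging this score over many doubly-annotated images gives a quick, quantitative read on annotation quality before any model is trained.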
- How to train deep learning models in two different modalities: text and images.
- How to construct reproducible end-to-end machine learning workflows.
- How to finetune small networks on top of foundation models in computer vision.
- Post-training processing (such as exporting, tracking, compression) of deep learning models for deployment.
- Best practices for continuous testing of deep learning models.
- Comfort with popular deep learning tools like Weights & Biases, ONNX, and FastAPI.
- Integration tests, regression tests, and directionality tests for model quality assurance.
- A MetaFlow pipeline that chains together training, evaluation, and deployment on a benchmark dataset of handwritten digits.
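To give a flavor of the behavioral tests listed above, the sketch below shows invariance- and directionality-style checks in the spirit of such test suites. The toy_sentiment scorer is a hypothetical stand-in for a real model; any callable returning a score in [0, 1] would slot in the same way:

```python
# Toy lexicon-based sentiment scorer (a stand-in for a trained model).
POSITIVE = {"great", "good", "love"}
NEGATIVE = {"bad", "awful", "hate"}

def toy_sentiment(text):
    words = text.lower().split()
    score = 0.5 + 0.1 * sum(w in POSITIVE for w in words) \
                - 0.1 * sum(w in NEGATIVE for w in words)
    return min(max(score, 0.0), 1.0)

def test_invariance():
    # A meaning-preserving edit should barely change the prediction.
    assert abs(toy_sentiment("the movie was great")
               - toy_sentiment("the film was great")) < 0.05

def test_directionality():
    # Adding a negative word should push the score down, not up.
    assert toy_sentiment("the movie was great but awful") \
           < toy_sentiment("the movie was great")

test_invariance()
test_directionality()
print("all behavioral tests passed")
```

In practice these checks would live in a pytest suite and run automatically on every retrain, so a regression in model behavior blocks deployment.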
- The role of active learning and self-learning in a deep learning framework.
- How to use unlabeled data and model uncertainty to improve performance.
- Best practices for designing web applications with embedded ML models.
- Tools to identify which examples to prioritize for labeling.
- Tools to noisily label large batches of data quickly without a third party service.
- A lightweight web application in Flask that supports human-in-the-loop labeling.
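One common way to decide which examples to prioritize for labeling is uncertainty sampling. A minimal sketch, assuming the model exposes class probabilities over an unlabeled pool (the probabilities below are made up):

```python
import math

def entropy(probs):
    """Predictive entropy of a class-probability distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)

def prioritize_for_labeling(pool_probs, budget):
    """Return indices of the `budget` most uncertain examples."""
    ranked = sorted(range(len(pool_probs)),
                    key=lambda i: entropy(pool_probs[i]),
                    reverse=True)
    return ranked[:budget]

# Hypothetical model outputs over an unlabeled pool of four examples.
pool = [
    [0.98, 0.01, 0.01],  # confident -> low labeling priority
    [0.40, 0.35, 0.25],  # uncertain -> high priority
    [0.90, 0.05, 0.05],
    [0.34, 0.33, 0.33],  # near-uniform -> highest priority
]
print(prioritize_for_labeling(pool, budget=2))  # → [3, 1]
```

The selected indices would then be routed to the human-in-the-loop labeling interface, closing the active-learning loop.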
- How to identify and handle distribution shift and adversarial examples.
- The different types of distribution shift in NLP and computer vision.
- Data augmentation techniques for model robustness.
- How to leverage the implemented workflows to quickly retrain and deploy a model.
- A pipeline to handle the appearance of a new label class.
- How to repair models in response to adversarial examples in a visual classification task with outlier image watermarks.
- Monitoring tools to track model performance and detect distribution shifts.
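One simple monitoring signal for distribution shift is the Population Stability Index (PSI) over a scalar feature. A rough standard-library-only sketch; the bin count and the 0.2 alarm threshold are conventional rules of thumb rather than course specifics:

```python
import math

def psi(expected, actual, bins=5):
    """Population Stability Index between a reference sample and a live
    sample of a scalar feature; PSI above ~0.2 is a common drift alarm."""
    lo = min(min(expected), min(actual))
    hi = max(max(expected), max(actual))
    width = (hi - lo) / bins or 1.0

    def frac(sample, b):
        # Fraction of the sample falling in bin b (top bin includes hi).
        count = sum(lo + b * width <= x < lo + (b + 1) * width or
                    (b == bins - 1 and x == hi) for x in sample)
        return max(count / len(sample), 1e-6)  # avoid log(0)

    return sum((frac(actual, b) - frac(expected, b))
               * math.log(frac(actual, b) / frac(expected, b))
               for b in range(bins))

reference = [0.1 * i for i in range(100)]      # training-time feature values
shifted = [5.0 + 0.1 * i for i in range(100)]  # live traffic, shifted upward
print(psi(reference, reference) < 0.01)  # identical data: no drift
print(psi(reference, shifted) > 0.2)     # shifted data: drift alarm
```

A monitoring job would compute this per feature on a rolling window of live traffic and page the team, or trigger the retraining pipeline, when the index crosses the threshold.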
Students who want to learn the infrastructure and operations behind practical deep learning for real world applications.
Students who have taken the first two courses in the co:rise ML foundations track.
Data scientists and research engineers looking for best practices in building and maintaining deep learning models.
Familiarity with Python and comfort reading documentation to learn new tools. The co:rise Python for Machine Learning course or equivalent.
Experience with basic machine learning and data science. The co:rise Introduction to Applied ML: Supervised Learning course or equivalent.
Basic web development with tools like Flask. Students do not need to be experts at building web applications.
Basic experience with deep learning, including PyTorch. The co:rise Deep Learning Essentials course, the Coursera ML course, or equivalent.