DataPrepOps and the Practice of Data-Centric AI
Data preparation is rarely treated as a discipline that can be taught and learned, and many data scientists still improvise their approach to it. Yet a rigorous approach to this first step yields information that data scientists can use to develop, train, and tune ML models. We've developed this course to close that gap and to change the perception of data preparation. The DataPrepOps concepts covered in this course help machine-learning practitioners develop solid, practical data preparation skills they can use across their ML projects.
Course taught by expert instructors
Founder and CEO, Alectio
Dr. Jennifer Prendki is the founder and CEO of Alectio, the first startup fully focused on DataPrepOps, the discipline concerned with automating and operationalizing data preparation. Her team is on a mission to help ML teams build models with less data, and hence more cost-efficiently. Prior to Alectio, Jennifer was the VP of Machine Learning at Figure Eight; she also built an entire ML function from scratch at Atlassian and led multiple data science projects on the Search team at Walmart Labs.
Learn and apply skills with real-world projects.
Experienced data scientists who wish to build a theoretical understanding of Data Preparation techniques
Data scientists interested in both the theoretical and practical aspects of Data-Centric AI
MLOps engineers who desire to learn about the operational side of Data-Centric AI
Familiarity with fundamental machine learning concepts, especially supervised machine learning
Familiarity with software development in Python
Familiarity with basic data orchestration tools such as Airflow or Flyte
You will perform a few experiments using the same models from prior weeks, but with different subsets of data.
- A brief history of AI winters, and how Big Data and ImageNet got us moving forward
- How data-centric AI is different from model-centric AI, and why the data science field is adopting the data-centric AI model
- The many aspects of data-centric AI, and why it isn't just another term for active learning or Human-in-the-Loop machine learning
- What data preparation is, and what it is not
- Operational challenges with data-centric AI
- The economic benefits of data-centric AI
- You will test different data-centric AI techniques like data labeling and data augmentation, and evaluate their impact on model performance
- You will build intuition on why bad data preparation can lead to unrecoverable biases
- You will begin to develop automated approaches to data preparation, specifically data curation
Your goal with this project will be to fully annotate from scratch the data that we will use for the rest of the course, and to annotate the data as accurately as possible.
- Types of data annotation for all data modalities
- Commercial aspects of data labeling, and how to best choose a labeling partner for a particular project
- Best practices for manual data labeling
- Human-in-the-Loop data labeling and how to set up a Human-in-the-Loop data labeling pipeline in practice
- Auto-labeling and when (and when not) to use it
- The Snorkel algorithm and when (and when not) to use it
- New concepts in data labeling
- You will upload the dataset
- You will manually annotate some of the data
- You will find a suitable model to annotate the remaining data using an auto-labeling approach
- Advanced participants will work on an implementation of the Snorkel algorithm
- You will have access to several open-source annotation tools throughout this project
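To give a feel for the auto-labeling part of this project, here is a minimal sketch of programmatic labeling with a plain majority vote. It is not the Snorkel label model, which goes further by estimating how accurate each labeling function is; majority vote is only the simplest baseline for the same idea. The labeling functions, the label values (1 for "complaint", 0 for "not a complaint"), and the example texts are all hypothetical and are not course material.

```python
from collections import Counter

ABSTAIN = -1  # a labeling function returns this when it has no opinion

# Hypothetical labeling functions: each encodes one noisy heuristic.
def lf_mentions_refund(text):
    return 1 if "refund" in text.lower() else ABSTAIN

def lf_says_thanks(text):
    return 0 if "thank" in text.lower() else ABSTAIN

def lf_many_exclamations(text):
    return 1 if text.count("!") >= 2 else ABSTAIN

LABELING_FUNCTIONS = [lf_mentions_refund, lf_says_thanks, lf_many_exclamations]

def majority_vote(text, lfs=LABELING_FUNCTIONS):
    """Combine labeling-function votes; abstain when there are no votes or a tie."""
    votes = [lf(text) for lf in lfs]
    votes = [v for v in votes if v != ABSTAIN]
    if not votes:
        return ABSTAIN
    ranked = Counter(votes).most_common()
    if len(ranked) > 1 and ranked[0][1] == ranked[1][1]:
        return ABSTAIN  # tie between labels: leave the item for a human
    return ranked[0][0]

if __name__ == "__main__":
    for text in ["I want a refund!!", "Thank you", "hello"]:
        print(repr(text), "->", majority_vote(text))
```

Items on which the vote abstains are natural candidates for the manual annotation step.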
You will run your own active learning process on a given notebook.
- Everything you need to know about active learning
- The difference between pool-based and stream-based active learning
- How active learning relates to online learning
- Many, many querying strategies
- Basic machine-learning and reinforcement-learning techniques for active learning
- You will practice some common querying strategies
- You will tune off-the-shelf querying strategies and measure the impact of that tuning on the learning process
- You will code and test several of your own querying strategies
- You will learn how to measure and track the performance of your active learning process
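As an illustration of what a querying strategy looks like, here is a minimal sketch of pool-based uncertainty sampling, one of the most common baselines. It assumes a model has already produced class probabilities for every unlabeled item in the pool; the function names and the toy probabilities are illustrative and are not taken from the course notebook.

```python
import math

def entropy(probs):
    """Shannon entropy of one predicted class distribution (higher = less certain)."""
    return -sum(p * math.log(p) for p in probs if p > 0)

def least_confidence(probs):
    """One minus the top class probability (higher = less certain)."""
    return 1.0 - max(probs)

def query(pool_probs, k, score=entropy):
    """Return the indices of the k pool items the model is least certain about."""
    ranked = sorted(
        range(len(pool_probs)),
        key=lambda i: score(pool_probs[i]),
        reverse=True,
    )
    return ranked[:k]

if __name__ == "__main__":
    # Toy predicted probabilities for three unlabeled items, two classes each.
    pool = [[0.9, 0.1], [0.5, 0.5], [0.6, 0.4]]
    print(query(pool, k=2))
    print(query(pool, k=2, score=least_confidence))
```

Passing the scoring function as a parameter keeps the selection loop unchanged when you swap in a different strategy, which is useful when comparing several of them on the same data.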
You will be challenged to build a mini data-centric AI MLOps pipeline with Airflow.
- Practical challenges around building a training pipeline to support data-centric AI
- How to build an MLOps pipeline that incorporates the iteration and feedback loops required for Human-in-the-Loop ML and data-centric AI
- DataPrepOps MVP: a basic pipeline to get things off the ground
- Tips for integrating popular and open-source data-labeling APIs into an iterative training pipeline
- What continuous labeling is, and why it is necessary for ML observability and online learning
- How to incorporate data augmentation and synthetic data generation into traditional MLOps
- You will set up a basic data-centric AI pipeline
- You will integrate into the pipeline an auto-labeling process that allows you to annotate data continuously
- (Time permitting) You will incorporate automated data-and-labeling quality management and basic control loops into your pipeline to ensure it runs properly
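A basic control loop for label quality can be as small as the sketch below: it compares auto-generated labels against a human-audited sample and decides whether the pipeline should keep running. This is only one possible design, written in plain Python rather than as an Airflow task; the function names, the returned status strings, and the 0.9 threshold are assumptions made for illustration.

```python
def agreement_rate(auto_labels, audited_labels):
    """Fraction of audited items whose auto-label matches the human label.

    Both arguments map item id -> label. Returns None when no item was audited.
    """
    shared = auto_labels.keys() & audited_labels.keys()
    if not shared:
        return None
    matches = sum(auto_labels[key] == audited_labels[key] for key in shared)
    return matches / len(shared)

def quality_gate(auto_labels, audited_labels, min_agreement=0.9):
    """Decide what the pipeline should do next (hypothetical status strings)."""
    rate = agreement_rate(auto_labels, audited_labels)
    if rate is None:
        return "needs_audit"      # nothing to compare against yet
    if rate >= min_agreement:
        return "continue"         # auto-labels look trustworthy enough
    return "halt_for_review"      # too many disagreements: stop and inspect

if __name__ == "__main__":
    auto = {"a": 1, "b": 0, "c": 1, "d": 1}
    audited = {"a": 1, "b": 0, "c": 0}
    print(agreement_rate(auto, audited), quality_gate(auto, audited))
```

In an orchestrated pipeline, the returned status would typically drive a branching step that either continues to training or routes the batch back to annotators.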
Work on projects that bring your learning to life.
Made to be directly applicable in your work.
Live access to experts
Sessions and Q&As with our expert instructors, along with real-world projects.
Network & community
Code reviews and study groups. Share experiences and learn alongside a global network of professionals.
Support & accountability
We have a system in place to make sure you complete the course, and to help nudge you along the way.
Get reimbursed by your company
More than half of learners get their Courses and Memberships reimbursed by their company.
Hundreds of companies have dedicated L&D and education budgets that have covered the costs.