Vijay Karunamurthy is the head of engineering at Scale AI. He has been at the forefront of new developments in AI across personalization, privacy, search, monetization, and more. co:rise instructor Mike Wu sat down with Vijay to talk about his career journey, recent developments in AI, and his advice for getting into this fast-growing field.
This interview has been edited and condensed for clarity.
Mike: Vijay, to kick things off, I’d love to hear how you got started working in AI.
Vijay: It started with a lot of stumbling and teaching myself during my undergrad years, in the late nineties. I was studying biochemistry and biology, and I came across some literature on the evolution of neural networks and how they could be brought to bear on problems in those fields. I was just reading through papers and trying to reimplement work that I saw there.
It was so early that the techniques I was using seem ridiculous now. I wasn't even using matrix multiplication to train my network. I was just going node by node, calculating the gradient and applying backpropagation by hand. It was really slow. And it was limited in terms of the insights I could get — I was often just rediscovering things that were already well-known. But from there, I was hooked.
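To illustrate the node-by-node approach Vijay describes, here is a toy sketch of computing gradients for a single sigmoid neuron via the chain rule, one weight at a time, with no matrix math. All names here are illustrative, not from any library:

```python
import math

# A single sigmoid neuron: y = sigmoid(w1*x1 + w2*x2 + b).
# Gradients are computed "node by node" with the chain rule,
# rather than as one vectorized matrix operation.

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def forward(w, b, x):
    z = sum(wi * xi for wi, xi in zip(w, x)) + b
    return sigmoid(z)

def backward(w, b, x, target):
    """Return dLoss/dw for squared-error loss, one weight at a time."""
    y = forward(w, b, x)
    dL_dy = 2.0 * (y - target)   # derivative at the loss node
    dy_dz = y * (1.0 - y)        # derivative at the sigmoid node
    return [dL_dy * dy_dz * xi for xi in x]  # one gradient per weight

w, b, x, t = [0.5, -0.3], 0.1, [1.0, 2.0], 1.0
grads = backward(w, b, x, t)
```

Modern frameworks do exactly this chain-rule bookkeeping, but vectorized across whole layers at once, which is why matrix multiplication made training so much faster.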
When I came out to Silicon Valley after undergrad, I worked at a startup that ended up getting acquired by a financial services company. That gave me a chance to explore other machine learning techniques. We did a lot with support vector machines, and a lot of thinking about maximum margin models and other ways of learning from data.
Mike: I started in biology, too. I remember doing those 40-hour research weeks in the lab…and then deciding to switch to the computation side.

Vijay: I think a lot of people in machine learning have started in other fields. If you have a love of statistics, and you like exploring how statistics can be applied in different domains, there’s a good chance you’ll start to pick up some machine learning techniques. And then you can apply those same techniques to so many different kinds of problems.
Mike: You were an early employee at YouTube. What were some of the big problems you worked on while you were there?
Vijay: Well, we knew discoverability was critical to driving the growth of the site. If people don’t find something interesting to watch, they’ll leave, and if someone uploads a video that doesn’t get any views, they’re not going to upload more content.
At first, we were just making pretty simple optimizations to our search and discoverability features. But when we got to the point where we had around a million videos being uploaded per day, we started to have really difficult search problems that were very specific to video. We needed to move beyond searching titles and descriptions—which weren’t always reliable—to searching the video content itself. But the machine learning techniques that you could apply to video at that time were pretty basic. They looked at video as a series of individual static frames. No one had tried to train an algorithm that could tell you something interesting about the full video sequence.
It turns out that the solution to this problem is something called a convolutional neural network. Instead of relying on hand-picked features of the video, convolutional networks use convolutional kernels to find areas of interest and track how they might change over time. Convolutional networks have other applications, too, but they were developed largely in response to this problem of video search.
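A drastically simplified sketch of the idea of a kernel sliding over time: here each frame is reduced to a single feature value, and a 1-D kernel responds where a pattern persists across consecutive frames. Real video models use 3-D convolutions over full frames; the values and kernel below are purely illustrative:

```python
import numpy as np

def temporal_conv(frame_features, kernel):
    """Valid-mode 1-D cross-correlation over a sequence of frames."""
    k = len(kernel)
    return np.array([
        float(np.dot(frame_features[i:i + k], kernel))
        for i in range(len(frame_features) - k + 1)
    ])

frames = np.array([0.0, 0.0, 1.0, 1.0, 0.0, 0.0])  # a brief burst of motion
kernel = np.array([1.0, 1.0])                       # responds to sustained activity
responses = temporal_conv(frames, kernel)           # peaks where motion persists
```

The response is largest where two consecutive frames are both active, which is the one-dimensional analogue of a kernel detecting how an area of interest changes over time.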
The pivotal moment for me was when we found a video that someone had uploaded without any title or any description. With no metadata to work from, the convolutional network was able to say, “This is a hip hop video with breakdancing in it.” That’s something that a human might not even be able to tell you — to recognize the features of breakdancing and label it correctly. It was a big moment for the entire team.
Mike: We’re seeing another new type of architecture appearing now: the transformer. How do you see that evolving?
Vijay: Great question. For people who aren’t as familiar with this area, a transformer is a neural network that learns context — that is, it learns how all the different elements in a data set relate to each other, even when they’re far apart. We call this an attention-based technique, in the sense that every element “pays attention” to every other element. This approach was first demonstrated in a 2017 paper called “Attention Is All You Need,” which I still think is one of the most prescient paper names ever created. It really spells it out.
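The core operation of that paper, scaled dot-product attention, can be written in a few lines. In this minimal NumPy sketch (shapes and values are illustrative), each of the four elements produces a weighted combination of every element's value, with weights given by query–key similarity:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    """Scaled dot-product attention: every element attends to every other."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)     # pairwise query-key similarity
    weights = softmax(scores, axis=-1)  # each row is a distribution over elements
    return weights @ V, weights

rng = np.random.default_rng(0)
Q = rng.standard_normal((4, 8))  # 4 elements, 8-dimensional queries
K = rng.standard_normal((4, 8))
V = rng.standard_normal((4, 8))
out, weights = attention(Q, K, V)
```

Because the weights are computed between all pairs of elements, distant parts of a sequence can influence each other directly — which is what lets transformers capture context across a long video or document.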
It’s probably easiest to understand the power of transformers by looking at an example. Going back to video search — in the early days of convolutional neural networks, we could only look at a sequence of about 10 seconds of video. You could pull out some knowledge about what's happening within that short scene. But now, with these attention-based techniques, we can actually understand what’s happening in much longer sequences. We can look at scenes and interpret them in the context of the entire video.
The same principle applies across a lot of different domains. Natural language processing is a big one, because context is so critical in written and spoken communication. Image processing, too. Really any data processing challenge where context might come in from unexpected locations in the larger dataset.
The other thing about these attention-based models is that one model can work for a lot of different contexts. Researchers at DeepMind recently developed a single model that learned to do about 50 different tasks — playing Atari, manipulating a robot arm, answering questions about images, and the list goes on.
Essentially, this field is wide open. There are so many different approaches that are being tried, all based upon how powerful these transformer models are. I think the next several years are really going to blow the door open in what we're able to accomplish.
Mike: Your last role before Scale AI was at Apple. I know privacy was a big focus across a lot of your work there. How did that push you in terms of the way you build products and think about AI?
Vijay: In my first year at Apple, Tim Cook came out at a conference and took a strong stance on Apple's values around privacy. He said privacy is a human right. When you take a stance that strong, it really influences every team to think about how they might work differently.
It was a big shift from my time at YouTube and Google, where I was working with data that was uploaded by users. In that context, you can do all the learning on the server. But when you’re dealing with private data — like the data that gets collected by an Apple Watch, for example — you never want data sent to the server unnecessarily. You have to adapt your techniques so that models can be trained offline and still applied in an online setting.
We really started to blaze the trail to see what we could do with privacy-first approaches like federated learning and differential privacy. Essentially, we were looking at different ways to have a lot of data and information about what's happening to the user on their device, but not have that data getting sent to servers in an identifiable way. Those techniques are still developing, and I think we’re really just starting to see how powerful they can be.
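One building block of the differential-privacy techniques Vijay mentions is the Laplace mechanism: instead of sending a user's exact statistic to the server, calibrated noise is added so that any single user's data has a bounded effect on what the server sees. This is a generic sketch with illustrative parameter values, not Apple's implementation:

```python
import numpy as np

def laplace_mechanism(true_value, sensitivity, epsilon, rng):
    """Release true_value with Laplace noise scaled to sensitivity/epsilon."""
    scale = sensitivity / epsilon
    return true_value + rng.laplace(0.0, scale)

rng = np.random.default_rng(42)
# A count query where adding or removing one user changes the
# answer by at most 1, so sensitivity = 1; smaller epsilon means
# more noise and stronger privacy.
noisy_count = laplace_mechanism(100.0, sensitivity=1.0, epsilon=0.5, rng=rng)
```

Aggregated across many users, the noise averages out, so the server can still learn accurate population-level statistics without seeing any individual's exact value.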
Mike: In addition to privacy, we’re seeing that there are still other issues that come up with deep learning systems — things like bias, adversarial examples, or unpredictable failure modes. How do you think about those challenges when you're deploying ML products into the real world?
Vijay: There's a lot of fascinating work being done in this area. And it often starts with just understanding the data you’re using to train your model, and seeing where bias and problematic scenarios are arising. That gives you the best head start on addressing those issues.
One of the things we're doing at Scale AI is helping customers measure model performance for scenarios that are specific to their domains. For example, in the autonomous vehicle space, you’re combining a ton of data to construct your knowledge of how a car is situated and the route that it should take. That in itself is really challenging. But within that bigger problem, there are certain scenarios where it’s especially important for you to know how the model is performing — like how you’re handling pedestrian traffic. You need to specifically track those scenarios to fine-tune the model.
It turns out that this is where a lot of teams need to spend their time and attention — understanding their data, tracking scenarios, and validating how the models are performing on those specific scenarios. The initial model training is just the tip of the iceberg.
It’s exciting to see how synthetic data is starting to be brought to bear on some of these problems, because there are actually a lot of important scenarios that you don’t often get to see in real life. For instance, in the pedestrian example that I just gave, you might need to look at pedestrians crossing the crosswalk in different positions, at different paces, different times of day, etc. With synthetic data you could go from having maybe 30 minutes of video footage that was really relevant to the pedestrian scenario, to suddenly having hundreds of hours of footage that represent all the different ways that scenario can play out.
Mike: Switching gears a little bit — one question I get from a lot of students is whether we need to learn the mathematical foundations of machine learning and deep learning, now that we have frameworks like PyTorch and TensorFlow that abstract so much away for us. What do you think? Can we get away with just thinking at a high level, or are the fundamentals still important?
Vijay: That's a great question. I think it’s still really important to understand what’s happening behind the scenes.
For one thing, at some point in your career you're going to go beyond using pre-trained models, or using a model architecture that you've already seen. You’re going to want to make changes and improvements so your models work better for the particular problems you’re trying to solve.
The other issue is that if something isn’t working, you need to understand the fundamentals to figure out why. Especially in machine learning, you can waste hours or days of your time tuning hyper-parameters, and then it turns out to be something a lot more basic that's actually breaking.
Mike: Last question. What advice do you have for someone who’s trying to get started in AI, whether they’re early in their career or looking to make a career transition into this space?
Vijay: I’d definitely recommend taking a course that’s very tactical and hands-on, like the co:rise Machine Learning Foundations Track. You want to do something where you’re really delving in, looking at code, and starting to learn a little bit about how PyTorch and Keras are laid out.
It’s probably going to feel similar to when you first learned to code — starting with small tasks in an IDE or in a browser window, building an understanding of what coding is, and then enriching your knowledge over time as you get into more complicated programs. With machine learning, I think you can start by building a pretty simple model and applying it to a very simple problem. That's a great foundation for getting into the more complicated techniques and adapting models for other domains.
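As one concrete example of a "pretty simple model on a very simple problem": logistic regression trained by gradient descent on a toy, linearly separable dataset. It's written in plain NumPy so every step is visible; the data and hyperparameters are illustrative:

```python
import numpy as np

# Toy dataset: 200 points in 2-D, labeled by which side of a line they fall on.
rng = np.random.default_rng(1)
X = rng.standard_normal((200, 2))
y = (X[:, 0] + X[:, 1] > 0).astype(float)

# Logistic regression trained with plain gradient descent.
w, b, lr = np.zeros(2), 0.0, 0.5
for _ in range(300):
    p = 1.0 / (1.0 + np.exp(-(X @ w + b)))  # predicted probabilities
    grad_w = X.T @ (p - y) / len(y)          # cross-entropy gradient w.r.t. weights
    grad_b = (p - y).mean()                  # ...and w.r.t. bias
    w -= lr * grad_w
    b -= lr * grad_b

preds = (1.0 / (1.0 + np.exp(-(X @ w + b)))) > 0.5
accuracy = (preds == y).mean()
```

Once something like this makes sense end to end — model, loss, gradient, update — moving to PyTorch is mostly a matter of letting the framework compute those gradients for you.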
I wouldn’t recommend spending a lot of time on academic descriptions of how attention works or how these transformer models are built — I think it's much more helpful to build that first model, learn how it works, see how things can go wrong and delve into fixing them. That’s going to help you understand what this field is and how you can apply it to the particular problems you’re trying to solve.