Data Science for Security and Fraud
This course provides an introduction to tackling digital security and fraud challenges using data science. We will kick off with an overview of problem areas like fake account creation, account takeover, bot traffic, and phishing, and propose a framework for understanding and addressing them. We will then explore common data sources such as web application logs and telemetry collected from devices, networks, and user behavior. Each week, we will apply data science techniques like anomaly detection and graph analysis to tackle a specific problem. At the end of this course, you will be able to build solutions to identify malicious and fraudulent behavior on the internet, and design ways to stop it.
Yiing Chau Mak
Head of Data at MetaMap
Mak is currently Head of Data at MetaMap. Previously, he led data science at Shape Security, where he helped Fortune 500 companies detect and stop malicious traffic – by bad bots, bad humans, and everything in between. As Director of Data Science at F5 (which acquired Shape), he built a real-time machine learning-based system that tackled many digital security and fraud challenges, such as fake account creation, account takeover, and fraudulent unemployment claims. In his past life, Mak worked at the intersection of cybersecurity, cybercrime, and personal data protection at the Cyber Security Agency of Singapore, where he led the creation of Singapore’s cybersecurity legislation and strategy.
As long as humans exist, fraud will also exist. With so much of our lives now spent online, the opportunities to commit fraud digitally have exploded. The internet is a gold mine for fraudsters: once they get past web security, they are able to access a wealth of personal and financial data, often under a cloak of anonymity. Anything of value that can be exploited will be exploited: stolen credit cards, bank transfers, airline miles, credit card points, new account sign-up bonuses, unemployment claims – the list goes on. Not to mention that some two to three billion login credentials are leaked every year, so attackers have more options than ever when it comes to taking over victims’ accounts.
We cannot talk about online fraud without talking about online security – specifically, the security of the web applications that we use to transact, and that we often trust and take for granted. What may seem like a highly secure website (uses “https”, looks clean and professional, and has an unwieldy 18-character alpha-numeric-symbolic password requirement alongside 2FA for every login) can appear very different to a trained attacker, and therefore may not be secure at all, especially if the attacker already has access to your email account! (Captchas do not help.)
The good news is that attackers inevitably leave digital footprints, even if they might seem small and insignificant compared to the vast amount of (very noisy) data that web systems generate. But how do you find the right ‘signals’ in this noise? How do you know what is ‘normal’ behavior and therefore what is ‘abnormal’?
I believe that with the right data science techniques, combined with a good understanding of both web security concepts and fraud schemes, it is possible to identify malicious and fraudulent behavior on the internet, and design ways to stop attackers.
Over the four weeks of this course, you will learn how to think from an attacker’s perspective. To do that, you will first be tasked with hacking into a provided (real!) web application. Once you have attempted that, we will switch over to “defense mode”. You will learn to apply various data science techniques to analyze the telemetry and logs generated by that web application, with the goal of identifying potential attack traffic and figuring out how to stop the attacker. We will cover common fraud issues like fake account creation and account takeover, and explore some real-life case studies.
Fighting against online fraud is a never-ending cat-and-mouse game. I hope that this course will provide you with the fundamentals to stay one step ahead of attackers, and enable you to take some initial steps to defend your organization’s web applications and users from fraud.
- How web applications work. In particular, how web content is delivered over the internet, and how to inspect websites and web/API traffic.
- How to think about web application security. Core security paradigms, what “identity” really means online, how attackers evolve over time, and some thoughts on rules-based vs. AI/ML systems for security.
- How to think like an attacker. Common vulnerabilities and process/security loopholes in web applications, and how to probe for them. Top techniques employed by attackers (hint: it is not all about technology).
- Analyzing web application data. Understand what qualifies as “useful” data for security and fraud, where to obtain such data, and how to process and analyze it.
- Bots on the internet: a primer. Learn all about bots and why they are necessary to commit fraud at scale. Understand how they work, how to differentiate between bots and humans, and how bots manifest in web application traffic. Also: why captchas are ineffective at stopping bots.
- Detecting (bad) bot traffic. Explore concepts such as traffic entropy, and use readily available signals from devices, networks and user behavior to engineer useful features for bot detection. Apply anomaly detection to identify bot traffic in web application logs.
- A framework for detecting (human) fraud. Understand the spectrum of humans, bots, and everything in between, and frame online fraud detection as a two-stage binary classification problem.
- Graph theory 101. What graphs are, and why we should leverage graph databases in fraud detection.
- Detecting (bad) human traffic. Design and structure a graph database using TigerGraph, and populate the graph from web application logs. Use the graph to discover signs of potential fraud.
- Why user journeys are useful. Pros and cons, challenges.
- Modeling user journeys on a web application. Construct a view of common vs. uncommon user journeys on a given web application, by modeling the probability of navigating from any page to any other page.
- Pinpointing the “user” in a user journey. Deterministic and probabilistic ways of identifying users across multiple sessions. Strategies used by attackers to evade detection across sessions, and how to overcome (some of) them.
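To give a flavor of the user-journey topic above, here is a minimal sketch of modeling the probability of navigating from any page to any other page. It is not the course's own material: it assumes a simple first-order transition model estimated by counting, and the session data, page paths, function names, and the `floor` parameter are all hypothetical choices made for illustration.

```python
import math
from collections import Counter, defaultdict


def transition_probabilities(sessions):
    """Estimate P(next_page | current_page) from ordered page sequences."""
    counts = defaultdict(Counter)
    for pages in sessions:
        # Count every consecutive (current, next) page pair in the session.
        for current, nxt in zip(pages, pages[1:]):
            counts[current][nxt] += 1
    probs = {}
    for current, next_counts in counts.items():
        total = sum(next_counts.values())
        probs[current] = {nxt: n / total for nxt, n in next_counts.items()}
    return probs


def journey_score(pages, probs, floor=1e-6):
    """Average log-probability per transition; lower means a less common journey.

    `floor` is an assumed stand-in probability for transitions never observed.
    """
    steps = list(zip(pages, pages[1:]))
    if not steps:
        return 0.0
    total = sum(math.log(probs.get(a, {}).get(b, floor)) for a, b in steps)
    return total / len(steps)


# Hypothetical sessions, each an ordered list of pages visited.
sessions = [
    ["/home", "/login", "/account"],
    ["/home", "/login", "/account"],
    ["/home", "/products", "/cart", "/checkout"],
    ["/home", "/login", "/account", "/logout"],
]

probs = transition_probabilities(sessions)
common = journey_score(["/home", "/login", "/account"], probs)
unusual = journey_score(["/login", "/checkout"], probs)

print(probs["/home"])
print(f"common journey score:  {common:.3f}")
print(f"unusual journey score: {unusual:.3f}")
```

In this toy data, a journey that jumps straight from `/login` to `/checkout` scores far lower than the familiar home, login, account path, because that transition was never observed. Real traffic would likely need smoothing, per-user identification, and much more data before such scores mean anything.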
I worked with Mak when he led the data science team at Shape Security. I recall when Mak first expressed interest in working for Shape, he was asked to complete a practical exercise to demonstrate his analytical and presentation skills. The exercise required that applicants review a .csv file with tens of thousands of entries, find the anomalies, and present/explain them to a non-technical panel. Applicants weren't told how many anomalies there were, but there were ten, and previous applicants typically found 5-7. Mak finished well under the three-hour time limit and provided the best analysis and presentation we had ever seen. What's more, Mak not only found all ten anomalies, he found an 11th anomaly we didn't even know about. After joining Shape as a data scientist, Mak was quickly promoted to lead the entire team. The net is, I would cancel a family vacation to attend one of Mak's sessions.
Mak is the real deal. In a space that's flooded with hype and FUD, Mak brings actionable knowledge to the table by focusing his curriculum on the highest-impact practical problems in security and risk. The diversity of his professional experience means that you will not only get top-tier technical instruction on applying data science to fraud problems, but will also get a rare opportunity to combine this with attacker economics and philosophy. This is not a course to miss.
Data scientists and analysts who are curious about the security and fraud space, or who need to defend their organizations and products from online fraud and abuse
Cybersecurity practitioners and fraud/abuse/trust and safety analysts who want to tackle online security and fraud problems at scale
- Ability to write Python fluently, and manipulate data within Python.
- Basic understanding of statistics and probability.
- Data science and fraud/security experience are not required.