In one of my previous articles, I wrote about how data engineering teams can scale - both from a technical and a human point of view. In this article, I’ll play my own devil’s advocate and describe the kinds of things that can kill your team, and what practices to implement to boost your team’s immune system. I can’t emphasize enough how important the human element is.
Below, I’ll cover symptoms that can really do damage, how to prevent them, and how to treat them.
Just remember where we are in the series!
Anything Can Kill You With a Compromised Immune System
Teams are living beings - sometimes they get sick, and bringing them back to health requires proper diagnosis and treatment.
Many things can happen, so instead of covering every potential scenario, I’ll focus on the symptoms that can actually kill a team.
Damage From Data Contamination Can Spread Like the Flu
No one likes drinking out of a pipe contaminated with germs or shaking hands with someone who just sneezed into their hand. Likewise, data users don’t want to use dirty data. Dirty data destroys your team’s credibility and costs you customers. If it happens often enough, the best-case scenario is jobs being at risk. The worst case is selling low-quality AI products and dealing with legal issues, or facing financial death as customers leave.
Preventing the Flu Is the Best Choice
Here’s a fun story.
3 years ago, my team and I had to demo our model in a live setting to a customer. We took in some fresh data prepared by the clients to show the model could handle both mundane and edge cases. In our testing environment, everything worked perfectly: when the model saw a high-traffic history without anomalies, it predicted high traffic; when it saw a low-traffic history plus an upcoming anomaly event, it predicted a spike in traffic. Everything seemed fine. But the model ended up doing the complete opposite in front of the clients.
Luckily, we’d worked with them before and they’d been pretty happy. But we were getting questions left and right - “Hey, that’s weird - maybe you can just turn things upside down, but can we make sure the other models never do this?”, “We’ll give you more test cases, so can you send us a full simulation analysis of those based on the model?”, “You said you have other models cooked up - can you send simulation results from those as well?”, and the list goes on. Good questions, but also very painful for the team.
After the demo, we found the issue in 15 minutes: the data pipeline was flipping the sign of an important feature by subtracting the original number from 1. 100 became -99, -100 became 101, 0 turned into 1, and 1 into 0. It turned out someone had pushed a code change that was only supposed to fix comments, but had accidentally pasted a line meant for somewhere else onto the next line. Almost everyone on the team has done something equally dumb - just never right before showing it to clients. We were all embarrassed.
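To make the failure mode concrete, here’s a minimal sketch of that kind of bug. The function and column names are hypothetical, not our actual pipeline code:

```python
import pandas as pd

def clean_traffic_features(df: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical feature-prep step (not the real pipeline)."""
    out = df.copy()
    # A line meant for a different feature, pasted one step too far:
    # it silently turns x into 1 - x, flipping the feature's sign.
    out["traffic_trend"] = 1 - out["traffic_trend"]
    return out

# 100 -> -99, -100 -> 101, 0 -> 1, 1 -> 0: exactly the flip we saw live.
print(clean_traffic_features(pd.DataFrame({"traffic_trend": [100, -100, 0, 1]})))
```

Notice there’s no crash and no warning - the pipeline runs green while quietly feeding the model the opposite of reality. That’s what makes this class of bug so dangerous.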
The biggest issue here was that this simple mistake created a ton of work for everyone and made us look less trustworthy. This kind of stuff stings, and where you get stung leaves a scar.
Key takeaways from this story:
1) This can happen to anyone, so don’t blame others.
2) Practice good data hygiene like you wash your hands - wash them after coming back inside (check your code every time before an important training run), wash them before you cook (check existing code before you write new code), and wash them after you cook (check your code after you write it).
3) If you don’t do this, the best case is that you embarrass yourself and the rest of the team; the worst case is that you lose a client or get fired.
If the Flu Spreads, It’s an Epidemic
Let’s imagine a case where not even our clients caught this simple mistake.
If the model had gone into production as is, it would have been a disaster. All the time and effort that had been put into it would have been wasted, because no customer would ever again trust the product or the people who made it. That is BAD.
Worse, let’s pretend these kinds of mistakes happen regularly. The model gets to a point where it may look like it’s working, but not even the team knows exactly what’s being fed into it. This is where it becomes an epidemic.
Without the ability to diagnose correctly, every new issue that comes up gets a one-off, single-touchpoint fix - “Oh, let’s just tune the model”, “Let’s just redo the model architecture”, “No no no, it’s a data issue, let’s add another feature”. If you’re at this point, everyone’s already been infected. Anything you try just makes things worse and worse. The only course of action is to revert and re-implement the original design from scratch. This is an absolute waste of time that could have been avoided with good hygiene. So always wash your hands.
Vaccines & Herd Immunity - Handful Practices For Prevention
Herd immunity is an important concept here: if you train almost everyone on the team to practice good data hygiene and give them a dose of other best practices, you can prevent an epidemic.
If practicing data hygiene is like washing your hands, implementing best practices is like getting vaccinated. Sometimes you need multiple shots, but only one at a time. These best practices come down to training and growing good habits.
So what are these best practices?
All About The Basics
Anyone who touches data should know some basics - including those who are not data engineers. The goal is twofold: i) to understand just enough to tell whether a data engineer is bad, good, or great; and ii) to be able to follow data engineering conversations and the code data engineers write.
Quick disclaimer: no pro tips can make you an overnight expert at figuring out who’s bad, good, or great. That takes experience and a lot of trial and error. Non-data engineers should also be teaching data engineers what goes on in their own craft.
Let’s First Identify Sources of Data Contamination
There are 2 major ways (and combinations of the 2) in which data contamination occurs:
1) A model treats a feature that doesn’t contain much information as if it contains a lot.
2) A feature that does contain a lot of information gets recorded the wrong way.
It’s important to treat them in this order. Learning how to prevent the 1st kind of data contamination teaches you enough about how information works and what data users go through; learning how to prevent the 2nd kind then just becomes a matter of digging very deep into the data.
1st Shot: Data Is Tricky, So Grow Data Intuition
This part treats the 1st type of data contamination mentioned above.
Which parts of the data are really important? How do you know how someone might use them, or whether there’s even enough data in the first place? If most people working with data understand these questions (call it having data intuition), your team will be able to tell which features should be added or taken out. The entire team having data intuition decreases the chance of a weird feature slipping into processed data and contaminating it.
Let’s go through how this might play out in practice. You can have 100 petabytes of data, but if all your data contains is “Person A did XYZ at Time T”, is that really enough information to predict what Person B will do at Time S?
Think about the information contained in the data. Does the way people and actions are labeled carry any specific meaning? Is there a finite number of actions in the data? Can you find a real, consistent pattern in when a person takes some action?
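One way to start building that intuition is to profile features before trusting them. Here’s a minimal sketch, assuming pandas and hypothetical column names - a near-constant feature with entropy close to zero is a hint that it carries little usable information on its own:

```python
import numpy as np
import pandas as pd

def profile_feature(s: pd.Series) -> dict:
    """Quick informativeness check for a single feature."""
    probs = s.value_counts(normalize=True)
    return {
        "null_rate": float(s.isna().mean()),
        "distinct_values": int(s.nunique()),
        # Shannon entropy in bits: near 0 means the feature is
        # close to constant and likely uninformative on its own.
        "entropy_bits": float(-(probs * np.log2(probs)).sum()),
    }

# Hypothetical usage on a toy "Person A did XYZ at Time T" log.
events = pd.DataFrame({"action": ["XYZ", "XYZ", "XYZ", "ABC"]})
print(profile_feature(events["action"]))
```

This won’t tell you whether a feature is predictive - only modeling can do that - but it catches the obvious cases where a feature can’t possibly carry the information you hope it does.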
If everyone who works with data can sit down and share these findings, then any changes in the data pipeline can be quickly reviewed and verified for quality. There is no magic formula here. It is a constant effort, sort of like exercising.
2nd Shot: Data Has Irregularities That Need the Right Treatment, Make People Aware
This part concerns the 2nd type of data contamination.
Irregularities in data can be anywhere and can be quite random. Sometimes the patterns are obvious or well documented, so you can fix them easily. But sometimes the patterns are unrecognizable because there’s little information about how the data was even recorded. Digging those up takes a lot of time, and everyone working with data should be aware that this can happen. And when it happens, everyone should know there’s a list somewhere of these unrecognizable patterns and how each is currently being treated.
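That list doesn’t need to be fancy. Here’s a minimal sketch of what a shared irregularity registry could look like, kept in version control so changes get reviewed like any other code. Every entry below is a hypothetical example, not a real data issue:

```python
# Shared registry of known data irregularities and their treatments.
# All entries are hypothetical examples.
KNOWN_IRREGULARITIES = [
    {
        "pattern": "negative timestamps in events before 2019-03",
        "cause": "unknown; predates the current logging system",
        "treatment": "drop affected rows (~0.01% of data)",
        "last_reviewed_by": "alice",
    },
    {
        "pattern": "duplicate user_id rows on daily snapshot boundaries",
        "cause": "snapshot job overlaps with late-arriving events",
        "treatment": "keep the row with the latest updated_at",
        "last_reviewed_by": "bob",
    },
]
```

The point is less the format and more that the treatments live in one reviewable place instead of in individual people’s heads.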
Another issue is when there are simply too many patterns for any one person to remember how to handle. This is a bit ironic, since we’re talking about how to scale data pipeline building, yet the data itself can be intractable to begin with. This is where herd immunity really kicks in: if enough people who work with the data have each looked at different parts of it, in aggregate you get a team that can handle all the irregularities. So make sure people always know which parts of the data have already been covered by the rest of the team. You also don’t want too many people looking at the same part of the data.
Vitamins, Flu Pills, and White Blood Cells
Vaccines get you most of the way there. But there are more specific practices that can prevent data contamination on a day-to-day (or weekly) basis.
Review Your Data Pipeline Frequently
Don’t repeat my mistakes. Review your pipelines to make sure they are all working properly. No matter how simple they are, I cannot emphasize enough how important it is to go over the list below when you audit your data pipeline (a small test sketch follows the list).
Encoding done well? - Remember, the way a computer stores information really does matter. Check that ASCII vs. UTF vs. binary representations are handled the way you intend. Make sure floating- vs. fixed-point issues don’t exist. Make sure there’s no obscure encoding unless data users specifically request it.
Right numbers at the right places? - In the code, is that number 2 being multiplied really supposed to be 2? Could it actually be 3? 0.2? 20? 21?
Right functions at the right places? - Similar to the above. Is that “x + 1” really correct? Could it be “x - 1” or “x / 1”?
Any weird bottlenecks that shouldn’t be there? - If there’s a bottleneck that isn’t caused by algorithmic efficiency issues, something is very likely wrong there. Investigate further.
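A cheap way to enforce the “right numbers, right functions” checks is to pin a transform’s behavior on a few known values. A minimal sketch, assuming pytest and a hypothetical transform - the pasted “1 - x” line from my story would fail this test long before any demo:

```python
import pandas as pd

def prepare_traffic_feature(df: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical transform under audit: doubles the traffic column."""
    out = df.copy()
    out["traffic"] = out["traffic"] * 2  # is this 2 really supposed to be 2?
    return out

def test_prepare_traffic_feature_known_values():
    # Pin known input/output pairs, including sign-sensitive edge cases.
    # A wrong constant or an accidental "1 - x" fails immediately.
    df = pd.DataFrame({"traffic": [0, 1, 100, -100]})
    assert prepare_traffic_feature(df)["traffic"].tolist() == [0, 2, 200, -200]
```

Run it on every change, however trivial the change looks - comment-only fixes included.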
Get Reality Checks From Data Users
Once you’ve mapped out the feature and information groups, walk through your data with the data users. Does it make sense to them? Do they think anything is unnecessary or missing? Iterate on this quickly - don’t wait until the data users have already developed a model.
Culture Management = White Blood Cell Management
A transparent culture where everyone feels great about what they do is always the goal. It’s not really my place to explain exactly how to pull this off, since there are many other great resources on the topic.
Article Too Long, Will Post Other Parts Throughout the Week.
I apologize to the readers. Data hygiene is so important and so critical to the survival of any team that makes AI that I more than doubled my usual length. I’ll post the rest next week, and both this article and the continuing ones will be considered part of the 4th Digest.
The 5th Digest will be posted on the normal schedule - approximately 2 weeks after this article is published.