In one of my previous articles, I wrote about how data engineering teams can scale - both from a technical and a human point of view. In this article, I’ll play my own devil’s advocate and describe the kinds of things that can kill your team, and what practices to implement to boost your team’s immune system. I can’t emphasize enough how important the human element is.
Below, I’ll cover symptoms that can really do damage, how to prevent them, and how to treat them.
Just remember where we are in the series!
Anything Can Kill You With a Compromised Immune System
Teams are living beings - sometimes they get sick, and bringing them back to health requires proper diagnosis and treatment.
Many things can happen, so instead of covering every potential scenario, I’ll focus on the symptoms that can actually kill a team.
Damage From Data Contamination Can Spread Like the Flu
No one likes drinking out of a pipe contaminated with germs or shaking hands with someone who just sneezed into their hand. Likewise, data users don’t want to use dirty data. Dirty data destroys your team’s credibility and costs you customers. If it happens often enough, the best-case scenario is jobs being at risk. The worst case is selling low-quality AI products and dealing with legal issues, or facing financial death as customers leave.
Preventing the Flu Is the Best Choice
Here’s a fun story.
3 years ago, my team and I had to demo our model in a live setting to a customer. We took in some fresh data prepared by the clients to show the model could handle both mundane and edge cases. In our testing environment, everything worked perfectly: when the model saw a high-traffic history without anomalies, it predicted high traffic; when it saw a low-traffic history plus an upcoming anomaly event, it predicted a spike in traffic. Everything seemed fine. But the model ended up doing the complete opposite in front of the clients.
Luckily, we’d worked with them before and they’d been pretty happy. But we were getting questions left and right - “Hey, that’s weird - maybe you can just turn things upside down, but can we make sure the other models never do this?”, “We’ll give you more test cases, so can you send us a full simulation analysis of those based on the model?”, “You said you have other models cooked up - can you send simulation results from those as well?”, and the list goes on. Good questions, but also very painful for the team.
After the demo, we found the issue in 15 minutes: the data pipeline was flipping the sign of an important feature by subtracting the original number from 1. 100 became -99, -100 became 101, 0 turned into 1, and 1 into 0. It turned out someone had pushed a code change that was only supposed to fix comments, but had accidentally pasted a line meant for somewhere else onto the next line. Almost everyone on the team has done something equally dumb - just never right before showing it to clients. We were all embarrassed.
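To make the failure mode concrete, here’s a minimal sketch of that kind of bug. The function and column names are hypothetical, not our actual pipeline code:

```python
import pandas as pd

def clean_traffic_features(df: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical feature-prep step (not the real pipeline)."""
    out = df.copy()
    # A line meant for a different feature, pasted one step too far:
    # it silently turns x into 1 - x, flipping the feature's sign.
    out["traffic_trend"] = 1 - out["traffic_trend"]
    return out

# 100 -> -99, -100 -> 101, 0 -> 1, 1 -> 0: exactly the flip we saw live.
print(clean_traffic_features(pd.DataFrame({"traffic_trend": [100, -100, 0, 1]})))
```

Notice there’s no crash and no warning - the pipeline runs green while quietly feeding the model the opposite of reality. That’s what makes this class of bug so dangerous.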
The biggest issue here was that this simple mistake created a ton of work for everyone and made us look less trustworthy. This kind of stuff stings, and where you get stung leaves a scar.
Key takeaways from this story:
1) This can happen to anyone, so don’t blame others.
2) Practice good data hygiene like you wash your hands - wash them after coming back inside (check your code every time before an important training run), wash them before you cook (check existing code before you write new code), and wash them after you cook (check your code after you write it).
3) If you don’t do this, the best case is that you embarrass yourself and the rest of the team; the worst case is that you lose a client or get fired.
If the Flu Spreads, It’s an Epidemic
Let’s imagine a case where not even our clients caught this simple mistake.
If the model had gone into production as is, it would have been a disaster. All the time and effort that had been put into it would have been wasted, because no customer would ever again trust the product or the people who made it. That is BAD.
Worse, let’s pretend these kinds of mistakes happen regularly. The model gets to a point where it may look like it’s working, but not even the team knows exactly what’s being fed into it. This is where it becomes an epidemic.
Without the ability to diagnose correctly, every new issue that comes up gets a one-off, single-touchpoint fix - “Oh, let’s just tune the model”, “Let’s just redo the model architecture”, “No no no, it’s a data issue, let’s add another feature”. If you’re at this point, everyone’s already been infected. Anything you try just makes things worse and worse. The only course of action is to revert and re-implement the original design from scratch. This is an absolute waste of time that could have been avoided with good hygiene. So always wash your hands.
Vaccines & Herd Immunity - Handful Practices For Prevention
Herd immunity is an important concept here: if you train almost everyone on the team to practice good data hygiene and give them a dose of other best practices, you can prevent an epidemic.
If practicing data hygiene is like washing your hands, implementing best practices is like getting vaccinated. Sometimes you need multiple shots, but only one at a time. These best practices come down to training and growing good habits.
So what are these best practices?
All About The Basics
Anyone who touches data should know some basics - including those who are not data engineers. The goal is twofold: i) to understand just enough to tell whether a data engineer is bad, good, or great; and ii) to be able to follow data engineering conversations and the code data engineers write.
Quick disclaimer: no pro tips can make you an overnight expert at figuring out who’s bad, good, or great. That takes experience and a lot of trial and error. Non-data engineers should also be teaching data engineers what goes on in their own craft.
Let’s First Identify Sources of Data Contamination
There are 2 major ways (and combinations of the 2) in which data contamination occurs:
1) A model treats a feature that doesn’t contain much information as if it contains a lot.
2) A feature that does contain a lot of information gets recorded the wrong way.
It’s important to treat them in this order. Learning how to prevent the 1st kind of data contamination teaches you enough about how information works and what data users go through; learning how to prevent the 2nd kind then just becomes a matter of digging very deep into the data.
1st Shot: Data Is Tricky, So Grow Data Intuition
This part treats the 1st type of data contamination mentioned above.
Which parts of the data are really important? How do you know how someone might use them, or whether there’s even enough data in the first place? If most people working with data understand these questions (call it having data intuition), your team will be able to tell which features should be added or taken out. The entire team having data intuition decreases the chance of a weird feature slipping into processed data and contaminating it.
Let’s go through how this might play out in practice. You can have 100 petabytes of data, but if all your data contains is “Person A did XYZ at Time T”, is that really enough information to predict what Person B will do at Time S?
Think about the information contained in the data. Does the way people and actions are labeled carry any specific meaning? Is there a finite number of actions in the data? Can you find a real, consistent pattern in when a person takes some action?
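One way to start building that intuition is to profile features before trusting them. Here’s a minimal sketch, assuming pandas and hypothetical column names - a near-constant feature with entropy close to zero is a hint that it carries little usable information on its own:

```python
import numpy as np
import pandas as pd

def profile_feature(s: pd.Series) -> dict:
    """Quick informativeness check for a single feature."""
    probs = s.value_counts(normalize=True)
    return {
        "null_rate": float(s.isna().mean()),
        "distinct_values": int(s.nunique()),
        # Shannon entropy in bits: near 0 means the feature is
        # close to constant and likely uninformative on its own.
        "entropy_bits": float(-(probs * np.log2(probs)).sum()),
    }

# Hypothetical usage on a toy "Person A did XYZ at Time T" log.
events = pd.DataFrame({"action": ["XYZ", "XYZ", "XYZ", "ABC"]})
print(profile_feature(events["action"]))
```

This won’t tell you whether a feature is predictive - only modeling can do that - but it catches the obvious cases where a feature can’t possibly carry the information you hope it does.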
If everyone who works with data can sit down and share these findings, then any changes in the data pipeline can be quickly reviewed and verified for quality. There is no magic formula here. It is a constant effort, sort of like exercising.
2nd Shot: Data Has Irregularities That Need the Right Treatment, Make People Aware
This part concerns the 2nd type of data contamination.
Irregularities in data can be anywhere and can be quite random. Sometimes the patterns are obvious or well documented, so you can fix them easily. But sometimes the patterns are unrecognizable because there’s little information about how the data was even recorded. Digging those up takes a lot of time, and everyone working with data should be aware that this can happen. And when it happens, everyone should know there’s a list somewhere of these unrecognizable patterns and how each is currently being treated.
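That list doesn’t need to be fancy. Here’s a minimal sketch of what a shared irregularity registry could look like, kept in version control so changes get reviewed like any other code. Every entry below is a hypothetical example, not a real data issue:

```python
# Shared registry of known data irregularities and their treatments.
# All entries are hypothetical examples.
KNOWN_IRREGULARITIES = [
    {
        "pattern": "negative timestamps in events before 2019-03",
        "cause": "unknown; predates the current logging system",
        "treatment": "drop affected rows (~0.01% of data)",
        "last_reviewed_by": "alice",
    },
    {
        "pattern": "duplicate user_id rows on daily snapshot boundaries",
        "cause": "snapshot job overlaps with late-arriving events",
        "treatment": "keep the row with the latest updated_at",
        "last_reviewed_by": "bob",
    },
]
```

The point is less the format and more that the treatments live in one reviewable place instead of in individual people’s heads.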
Another issue is when there are simply too many patterns for any one person to remember how to handle. This is a bit ironic, since we’re talking about how to scale data pipeline building, yet the data itself can be intractable to begin with. This is where herd immunity really kicks in: if enough people who work with the data have each looked at different parts of it, in aggregate you get a team that can handle all the irregularities. So make sure people always know which parts of the data have already been covered by the rest of the team. You also don’t want too many people looking at the same part of the data.
Vitamins, Flu Pills, and White Blood Cells
Vaccines get you most of the way there. But there are more specific practices that can prevent data contamination on a day-to-day (or weekly) basis.
Review Your Data Pipeline Frequently
Don’t repeat my mistakes. Review your pipelines to make sure they are all working properly. No matter how simple they are, I cannot emphasize enough how important it is to go over the list below when you audit your data pipeline (a small test sketch follows the list).
Encoding done well? - Remember, the way a computer stores information really does matter. Check that ASCII vs. UTF vs. binary representations are handled the way you intend. Make sure floating- vs. fixed-point issues don’t exist. Make sure there’s no obscure encoding unless data users specifically request it.
Right numbers at the right places? - In the code, is that number 2 being multiplied really supposed to be 2? Could it actually be 3? 0.2? 20? 21?
Right functions at the right places? - Similar to the above. Is that “x + 1” really correct? Could it be “x - 1” or “x / 1”?
Any weird bottlenecks that shouldn’t be there? - If there’s a bottleneck that isn’t caused by algorithmic efficiency issues, something is very likely wrong there. Investigate further.
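A cheap way to enforce the “right numbers, right functions” checks is to pin a transform’s behavior on a few known values. A minimal sketch, assuming pytest and a hypothetical transform - the pasted “1 - x” line from my story would fail this test long before any demo:

```python
import pandas as pd

def prepare_traffic_feature(df: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical transform under audit: doubles the traffic column."""
    out = df.copy()
    out["traffic"] = out["traffic"] * 2  # is this 2 really supposed to be 2?
    return out

def test_prepare_traffic_feature_known_values():
    # Pin known input/output pairs, including sign-sensitive edge cases.
    # A wrong constant or an accidental "1 - x" fails immediately.
    df = pd.DataFrame({"traffic": [0, 1, 100, -100]})
    assert prepare_traffic_feature(df)["traffic"].tolist() == [0, 2, 200, -200]
```

Run it on every change, however trivial the change looks - comment-only fixes included.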
Get Reality Checks From Data Users
Once you’ve mapped out the feature and information groups, walk through your data with the data users. Does it make sense to them? Do they think anything is unnecessary or missing? Iterate on this quickly - don’t wait until the data users have already developed a model.
Culture Management = White Blood Cell Management
A transparent culture where everyone feels great about what they do is always the goal. It’s not really my place to explain exactly how to pull this off, since there are many other great resources on the topic.
Article Too Long, Will Post Other Parts Throughout the Week.
I apologize to the readers. Data hygiene is so important and so critical to the survival of any team that makes AI that I more than doubled my usual length. I’ll post the rest next week, and both this article and the continuing ones will be considered part of the 4th Digest.
The 5th Digest will be posted on the normal schedule - approximately 2 weeks after this article is published.