This is written for everyone - students, engineers, researchers, and managers alike.
Cost-Effective AI Starts with Cost-Effective Data Pipelines
Data first comes in, then goes out. Data engineers build the part where data comes in, called a data pipeline (aka an ETL pipeline). A data pipeline is meant to deliver usable data, without leaks, to the users who will do the “AI agent stuff” before data goes back out.
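As a minimal sketch of that in-and-out shape, reading “without leaks” loosely as “without silently dropping records”: the file paths, column names, and pandas-based implementation below are illustrative assumptions, not a prescribed design.

```python
# Minimal, illustrative extract-transform-load sketch (hypothetical paths and columns).
import pandas as pd

def extract(path: str) -> pd.DataFrame:
    # data comes in (here a CSV file; in practice an API, a queue, or a warehouse)
    return pd.read_csv(path)

def transform(raw: pd.DataFrame) -> pd.DataFrame:
    # make the data usable without silently dropping ("leaking") records
    before = len(raw)
    usable = raw.dropna(subset=["record_id"]).assign(
        event_time=lambda d: pd.to_datetime(d["event_time"])
    )
    if len(usable) < 0.99 * before:
        raise ValueError("more than 1% of records dropped; investigate before shipping")
    return usable

def load(df: pd.DataFrame, path: str) -> None:
    # hand usable data to the people doing the "AI agent stuff"
    df.to_parquet(path, index=False)

if __name__ == "__main__":
    load(transform(extract("raw_events.csv")), "usable_events.parquet")
```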
Data Pipelining Costs Scale With the Amount of Human Work
Testing environments should run on small subsets of data, so compute costs can largely be ignored. But keeping too many engineers around for far too long is expensive, especially in AI, where data engineers bring deep expertise and play a critical role in AI quality.
But Being Cheap Kills
If you cheap out, say hello to low-quality AI. Would you put harmful gasoline in your car that could destroy your engine just because it’s cheap? Of course not.
Good Data Pipelines Optimize Consumption Costs, Good Pipeline Infrastructure Optimizes Usage Costs
A good data pipeline keeps data users from having to wrangle through the data themselves. Making pipeline usage scalable is the role of infrastructure and of the data users themselves.
How About Storage?
Storage costs are fixed by data size. There’s plenty of room for innovation to bring storage costs down, but that is outside my wheelhouse. It’s best to choose storage options based on security and access requirements.
Building Pipelines Scales With Data Complexity
*This part is a bit technical.
Data engineers must focus on reducing data complexity as much as possible. Three key factors create complexity: the number of different attributes (also called features), the number of data points, and the data representation.
There are also factors outside of data engineers’ control: AI model architecture, explainability of features, and data throughput. These are not always mutually exclusive and are harder to measure; they can be covered in the future when discussing how to scale model training.
Number of Features Should Linearly Impact Number of Tasks
To be precise, the number of different “feature groups” should scale linearly with the number of “task groups”. There are two types of feature group: information groups and representation groups. Each feature can belong to multiple groups.
An information group contains lists of data features, where each list holds features that can be added to or subtracted from one another to create new information. For example, if you have anonymized patient data and you need each patient’s age, the data may contain the year of birth and the year of treatment rather than the age itself. Every list of features within a group should be processable in a similar way.
A representation group contains features that are represented similarly in format and encoding.
Each “task group” contains a set of specific tasks that, when all accomplished, reduce the complexity of the data to a desired state. Task groups can have different numbers of tasks.
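As a rough illustration only, here is a small Python sketch of this bookkeeping; the group names, features, and tasks are hypothetical, and the point is simply that task groups grow in step with feature groups.

```python
# Hypothetical bookkeeping for feature groups and the task groups they imply.
# A feature may appear in several groups; each feature group maps to one task group.
from dataclasses import dataclass, field

@dataclass
class FeatureGroup:
    name: str
    kind: str                                          # "information" or "representation"
    feature_lists: list = field(default_factory=list)  # lists processed the same way

# Information group: features that can be combined/subtracted into new information.
age_info = FeatureGroup(
    name="age_derivable",
    kind="information",
    feature_lists=[["birth_year", "treatment_year"]],
)

# Representation group: features sharing the same format and encoding.
year_format = FeatureGroup(
    name="four_digit_years",
    kind="representation",
    feature_lists=[["birth_year", "treatment_year", "followup_year"]],
)

# One task group per feature group; task counts per group may differ.
task_groups = {
    age_info.name:    ["derive age = treatment_year - birth_year"],
    year_format.name: ["validate 4-digit years", "cast to int16"],
}
assert len(task_groups) == 2   # task groups grow linearly with feature groups
```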
Number of Data Points Can Impact Required Code Efficiency
If the data size is large enough, then we need efficient algorithms, not just easy-to-write ones. There’s no magic formula for the inflection point, so engineers should run their own experiments.
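One way to run such an experiment, sketched below with a deliberately toy problem (a duplicate check) and made-up sizes; on real data you would slice your own dataset instead.

```python
# Time an easy-to-write O(n^2) duplicate check against a set-based O(n) one
# at increasing sizes, and note where the easy version stops being acceptable.
import random
import time

def has_duplicates_naive(values):
    # easy to write, but quadratic in the number of data points
    return any(values[i] == values[j]
               for i in range(len(values))
               for j in range(i + 1, len(values)))

def has_duplicates_fast(values):
    # linear time and memory using a set
    return len(set(values)) != len(values)

for n in (500, 2_000, 4_000):                 # made-up sizes; use slices of your own data
    data = random.sample(range(10 * n), n)    # unique values, so both must scan everything
    for fn in (has_duplicates_naive, has_duplicates_fast):
        start = time.perf_counter()
        fn(data)
        print(f"n={n:>5}  {fn.__name__:<22} {time.perf_counter() - start:.4f}s")
```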
Many programmers have a love-hate relationship with algorithms. Those managing a project or running a team are going to have a love-hate relationship with the challenges that come from this.
If you are in the LLM field, there is plenty of open-source research out there. But if you work with proprietary data, you may be left on your own. Plan accordingly. Have a couple of friends who are experts in algorithms and can help you over a beer or a coffee. Even contacting a CS professor who researches algorithms may be helpful if it’s critical. This is an important area of active problem solving.
Funky Ways Data is Represented Are Curveballs
Let’s go back to the anonymized patient data example. In the simplest case, the raw data just shows the age, so there’s nothing to be done. If it shows the year of birth and the year of treatment, then we can just subtract.
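A minimal sketch of those two easy cases, assuming hypothetical column names:

```python
# Hypothetical columns: use 'age' if present, otherwise derive it by subtraction.
import pandas as pd

def add_age(df: pd.DataFrame) -> pd.DataFrame:
    if "age" in df.columns:                  # simplest case: nothing to be done
        return df
    # otherwise derive it from the two year columns
    return df.assign(age=df["treatment_year"] - df["birth_year"])

patients = pd.DataFrame({"birth_year": [1960, 1984], "treatment_year": [2021, 2023]})
print(add_age(patients))                     # adds an 'age' column: 61 and 39
```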
Now, assume the raw data collectors were really, really concerned about privacy and decided to encrypt the years in some obscure way. If age is necessary for your data users (maybe because you want your AI to predict how much lifespan a patient has left), you’re going to be stuck.
Many questions will arise. What are the odds someone on the team can also break encryption? Is it worth the time for someone on the team to catch up on cryptography? Should you consult a specialist? What options did the raw data collectors give you? Can you even talk to them?
It’s much better to anticipate potential curveballs like this and answer all of these questions during the planning phase than to respond to them ad hoc.
The best-case scenario is that the data collectors provide some way to decrypt the information securely, or that they can send some usable transformation of the age. But how often do you luck out like that?
There can be many different types of curveballs. Be prepared by identifying them during planning and don’t shy away from asking for help.
Don’t Try To Have Perfect Raw Data
Even if it’s data engineers’ job to make the lives of data users easier, they too wish for data that is easy to work with. But we are all limited by how raw data is collected. Limitations can come from hardware, from collection methods, and from how much data the collectors can handle at once.
It’s advantageous for data engineers to understand these limitations so they can get working right away and avoid making bad requests to data collectors. There are likely some patterns. If features share commonalities and can be processed in similar ways, that’s a feature group!
Information groups are more about the contents of the data, so they must come after understanding the raw data. Do several features carry information that is somehow related? Can they represent the same information if they are subtracted from one another? Can they represent new information if they are combined? If any of the answers is “yes”, bucket that list of features into the same information group.
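As a rough, illustrative heuristic only: flag highly correlated numeric columns as candidates for the same information group, and let domain knowledge make the final call. The threshold and sample columns below are assumptions, not a rule.

```python
# Crude heuristic: highly correlated numeric columns may encode related information.
import pandas as pd

def candidate_information_pairs(df: pd.DataFrame, threshold: float = 0.9):
    corr = df.select_dtypes("number").corr().abs()
    cols = corr.columns
    pairs = []
    for i, a in enumerate(cols):
        for b in cols[i + 1:]:
            if corr.loc[a, b] >= threshold:
                pairs.append((a, b, round(corr.loc[a, b], 3)))
    return pairs

sample = pd.DataFrame({
    "birth_year": [1950, 1960, 1980, 1990],
    "treatment_year": [2010, 2018, 2021, 2024],
    "ward_id": [3, 1, 4, 2],
})
print(candidate_information_pairs(sample))   # likely flags birth_year vs treatment_year
```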
Domain Expertise Reduces Learning Curve
If someone has dealt with a specific type of data before, of course they will know what to do. But this can be either rare or almost irrelevant depending on the field.
Number of Tasks = Headcount
Although I have some opinions, I can’t speak on how much data engineers should be paid. I just know they should be paid very well. However, I can certainly say that the amount of work to be done impacts headcount. There’s no formula for the headcount needed, but good teams are larger than great teams.
Good Data Engineers Are Lone Wolves, Great Data Engineers Are Freaks
Smart readers should be prepared to handle a lot of complexity. The right team can crack the data quickly.
What distinguishes good from great is how well the teams scale.
A team of lone-wolf engineers and an efficient manager scales linearly, with a lump-sum cost for training
Lone wolves are self-sufficient creatures. Give them a reasonable number of tasks and a reasonable timeline, and they’ll deliver to perfection. An individual lone wolf can and should get things done linearly with respect to the number of tasks.
Good managers allocate tasks efficiently to the right people. They may prepare training materials to quickly get everyone up to speed as domain experts. However, this still scales linearly: if a task takes 8 hours on average, maybe a good manager can bring it down to 6 hours.
So what is the secret to greatness?
A team of freaks and a leader who attracts freaks scales sub-linearly
“Freak” is a technical term. You know those oddballs who’d rather work on some obscure thing than anything else? Freaks are obsessed with data and with uncovering its hidden secrets. They love sharing what they found, and they love to communicate before doing anything that might be considered “freaky”. If they don’t like sharing, maybe they are more mad scientist than freak.
Freaks always find the quickest paths and don’t mind spending a lot of energy digging very deep. They are not just any regular nerds - they are quick learners because they can identify the basic building blocks and apply them elsewhere. They’re great abstract thinkers and have great intuition on what secrets may be hidden in data.

Freaks are hard to find and even harder to manage. They don’t need an ordinary manager, but a leader. Ideal data engineering leaders are well-rounded freaks. They are even harder to find because the requirements are: i) they need to have learned the ropes well enough to know the nitty-gritty details, ii) they need the charisma to attract other freaks, and iii) most importantly, they need to care about others. The leader is the glue that holds the team together. A good manager, on the other hand, just knows which levers to pull and how to pull them well.
A team of freaks and a great leader is a powerhouse. They can scale sub-linearly because they figure out how to reduce the number of tasks sub-linearly, or they find simple solutions whose cost per task scales sub-linearly compared to what everyone else expects. They achieve this as a team and by planning thoroughly.
Freaks in Practice - Great Planning Reduces Execution Time & Open Communication Builds Knowledge Assets
A team of freaks does three things well:
1. Find commonality across tasks and reduce the total number.
2. Find unique yet intuitive solutions to curveballs.
3. Gain domain expertise very quickly.
Good planning is for identifying potential areas of commonality and potential curveballs. Great teams don’t really need a training phase to become domain experts; they can learn on the go during the planning phase.
With open, transparent, and creative communication, freaks can fill each other’s gaps very quickly and accumulate knowledge assets for the entire team. This is the key to sub-linear scaling.
The role of a leader is threefold: i) enable free behavior and open communication, ii) give incentives that guide behavior toward the team’s goal, and iii) remove any blockers very quickly, sometimes anticipating them before they happen.
Lone Wolves vs. Freaks - State of Mind
Some questions should naturally arise:
What sets lone wolves apart from freaks?
How do you find and attract freaks?
Can you turn lone wolves into freaks?
But there really is only one question to answer: how do you make sure freaks stay freaks?
Weak leadership and weak incentives turn freaks into lone wolves. Freaks don’t like to feel like their wings are clipped. If you present freaks with an environment they don’t enjoy, they will behave like lone wolves for you.
Strong leadership that communicates its vision clearly and lets people do their own thing, while helping fill whatever gaps might exist, may even turn lone wolves into freaks. Freakiness is a fire that needs to be kept burning.
The best way to keep this fire burning consistently is to make them feel at home: safe and comfortable to be their true selves. If you can create an environment like that, you won’t have a problem attracting and retaining freaks.
Two Types of Freaks - You Need Both
The freaks I’ve met either love data itself or love problem solving. Data lovers enjoy finding all the nooks and crannies of data sets. They make great data scientists and analysts as well, but usually want more control over the data. And what gives the ultimate sense of control, if not the ability to impact the lives of many data users?
Those who love problem solving just love endlessly solving puzzles. Working with data simply presents them with a long list of small puzzles.
If someone is both, maybe they are already great.
Start By Caring About Something & Be Open
If you already know what you want to build with AI, then a great place to start is having something you really care about and being open about it. Stay open-minded to what others really care about and be yourself. You may just be able to scale AI data pipelines.