Engineering as the data team

Matt Allen, data team

Adding data capabilities to an engineering organization is a daunting task. There are a million opinions out there on how to go about it: do you centralize or embed analysts? Do you lean on streaming or batch? What technologies do you build on, and how do you hire people to do this work anyway? For better or worse, I’m here to add one more set of opinions to the pile.

Background

I began my professional career working on data platform teams, mostly building pipelines for people to send events through. Our teams were always centralized, with the goal of building expertise and facilitating communication between data producers and consumers across the company. Something I have noticed as my career progressed is the sense of exclusivity among data practitioners and experts. That problem needs Data Science, and Data Science is hard in ways that normal engineers don’t understand. This decision impacts data, and so it needs special consideration by an expert. There’s too much context around how data is used to possibly educate the product engineering teams on how to play nice with the consumers.

These kinds of opinions are, in the end, no more valid than any other claims to special status and privilege from engineers. I wouldn’t tolerate a devops engineer who acted this way about core infrastructure, or a security engineer who ruled by fiat instead of educating, and you shouldn’t tolerate gatekeeping and status seeking from the data world just because it’s somewhat different from the rest of your engineering projects. The most effective data people I have worked with had none of this ego or sense of exclusivity. They worked to educate those around them and to promote good decision making by teams with a wide range of experience with data tasks, and that allowed them to get more done and solve more important problems than people who stayed within a data silo.

Centralization

I have found that team structure does a lot to promote or discourage this kind of attitude. In fully centralized teams, it’s hard not to feel like it’s “us vs. the world.” Your requests come from people who don’t understand what you do or why you do it in particular ways, you constantly feel like people are breaking things you depend on (because they don’t understand how to avoid it), and you end up feeling pressure from every area of the business. Fully centralized data teams quickly become the bottleneck for a wide range of projects and tasks, and that leads to resentment on both sides. In addition, because a centralized team is responsible for building all of the data products, it will eventually grow to an unsustainable size. When your data organization is as big as the rest of product engineering, something has gone wrong.

Embedding

The other popular way to structure a data team seems to be to keep the data people as close to their stakeholders as possible, embedding them on teams like marketing and product. This leads to a splintering of knowledge and expertise among departments that have little to do with each other. It’s difficult to share learnings, communicate problems, or plan a cohesive strategy when there is no central authority who oversees all of the disparate uses of your data. This is also where things like duplication of core functionality and infrastructure tend to come from, as people work independently to solve problems that would benefit from reuse and standardization. If no one is thinking about data in a holistic way, you lose a lot of opportunities for efficiency.

Data as engineering

One thing you may notice about both of the previous approaches is that they presuppose some separation between “data people” and everyone else. The industry promotes this thinking through separate titles and career tracks, and I think that it does a lot of harm. Rather than trying to bolt data onto an engineering org like an afterthought, I think a lot could be gained from just training product engineers to think of data in a different way. You already have a working engineering team; why not teach them to solve data problems instead of building a parallel team? Now, this is easier said than done, to be sure; I don’t mean to imply that there is an easy road to data excellence that doesn’t take hard work and planning. Rather, I think that by breaking down the silos between engineering and data you can gain a lot of efficiency and build reliable, cost-effective data solutions.

Centralize tools, not solutions

The first step to accomplishing this task of empowering engineering to solve data problems is to give them good tools. There are significant extra considerations to account for when building a data-driven product, and asking every engineer to learn those techniques on top of their other expertise is a lot to ask. If you give people a paved road, however, you can empower them to build good solutions without understanding the “why” behind every decision. This means building some standards around how people produce and consume datasets, and educating all of your engineers on when to choose each tool. This is really no different from what we do with any other core infrastructure; Kubernetes is just a tool that simplifies server deployment and networking so that not every engineer needs to understand the details, but everyone can still get good outcomes.
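To make the "paved road" idea concrete, here is a hypothetical sketch in Python of what a sanctioned event-production helper might look like. Everything here is invented for illustration (the field names, the envelope shape, the `build_event` function): the point is that a product engineer calls one blessed function and gets schema enforcement, dedup keys, and versioning for free, without needing to know why downstream consumers want them.

```python
import time
import uuid

# Fields the (hypothetical) platform standard says every event must carry.
REQUIRED_FIELDS = {"event_name", "entity_id"}


def build_event(payload: dict) -> dict:
    """Validate a payload and wrap it in the standard event envelope."""
    missing = REQUIRED_FIELDS - payload.keys()
    if missing:
        # Fail loudly at the producer, not silently downstream.
        raise ValueError(f"missing required fields: {sorted(missing)}")
    return {
        "event_id": str(uuid.uuid4()),  # stable key for downstream dedup
        "emitted_at": time.time(),      # producer-side timestamp
        "schema_version": 1,            # lets consumers handle evolution
        "payload": payload,
    }


event = build_event({"event_name": "signup", "entity_id": "user-42"})
```

The details don't matter much; what matters is that the paved road encodes the platform team's hard-won decisions (dedup keys, versioning, timestamps) so that every producing team gets them by default.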

Datasets as products

One of the keys to implementing this strategy is to encourage people to think about the datasets their solutions produce, and to consider the quality of those datasets as part of their requirements. Too often I have seen dataset production treated as an afterthought by product teams: something bolted on later in development, never tested manually or automatically, and only thought of when someone complains about a breakage. This isn’t because engineers don’t care about producing good, reliable products; it’s because they don’t think of the dataset as part of the product at all. I think the most important bit of data education that can be done is to teach people that these datasets are actually critical to other teams and parts of the business, that their work improving them will be appreciated and rewarded, and that they have as much responsibility to the dataset consumers as they do to the people who use their product or API. Engineers are already great at maintaining datasets and data products; they just don’t think of them in the same terms as a data engineer and don’t support the same access patterns and technologies. After all, something like an auth product already needs to ingest, maintain, and make available a complex set of entities and relationships. Is it really such a big ask to add usage logging and batch access to such a product? I think if engineers are given the right support and incentives, they can think of their products as holistic combinations of real-time and offline behaviors, and that this gives the data “team” access to a lot of great resources you already have.
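What "the dataset is part of the product" might look like in practice is automated checks on the dataset's invariants, run in CI alongside the rest of the test suite. A hypothetical sketch (the field names and invariants are invented for illustration):

```python
def check_dataset(rows: list[dict]) -> list[str]:
    """Return a list of invariant violations; empty means the dataset is healthy."""
    problems = []
    seen_ids = set()
    for i, row in enumerate(rows):
        # Invariant: every row identifies a user, exactly once.
        if row.get("user_id") is None:
            problems.append(f"row {i}: missing user_id")
        elif row["user_id"] in seen_ids:
            problems.append(f"row {i}: duplicate user_id {row['user_id']}")
        else:
            seen_ids.add(row["user_id"])
        # Invariant: amounts are never negative.
        if row.get("amount", 0) < 0:
            problems.append(f"row {i}: negative amount")
    return problems


assert check_dataset([{"user_id": "a", "amount": 5}]) == []
```

Nothing here is exotic; it is the same assertion-based testing engineers already apply to APIs, just pointed at the dataset instead of the endpoint.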

Contracts and boundaries

Engineers are already used to thinking about obligations to other teams: contracts, API definitions, migration strategies, and so on. If you extend these concepts to datasets as well as online behavior, the mindset really doesn’t change much. In data land you need to think more about historical behavior and how things have changed over time, and the access patterns and volumes are different, but the key idea of a contract between producer and consumer is nothing new. With the right tooling and support, every engineering team can use contracts in the dataset space just as smoothly as they use them in the API space.
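As an illustration of how familiar this can look, here is a minimal, hypothetical sketch of a dataset contract in Python (the fields and types are invented): the producing team declares what consumers may rely on, and a check enforces it the same way an API schema would.

```python
# The producing team's declared contract: field name -> expected type.
# Changing this is a breaking change, handled like any API migration.
CONTRACT = {
    "user_id": str,
    "signed_up_at": float,  # epoch seconds
    "plan": str,
}


def conforms(record: dict) -> bool:
    """True if the record satisfies every field in the contract."""
    return all(
        field in record and isinstance(record[field], expected)
        for field, expected in CONTRACT.items()
    )


assert conforms({"user_id": "u1", "signed_up_at": 1.7e9, "plan": "free"})
```

In a real system this role is usually played by a schema registry or a serialization format with explicit schemas, but the producer/consumer obligation is the same idea either way.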

Putting it all together

In closing, my next data team is going to be less of a group responsible for solutions and more of a source of knowledge and best practices for a wider engineering organization. There’s still a need for some experienced people to build tools and a platform for everyone else to use, just like there’s a need for experts in deployment and cloud infrastructure. There’s also a need for someone to look at the big picture of who is using what data in what ways, and to find places for improvements to accuracy, timeliness, and efficiency. This is similar to a systems architect or principal engineer who looks at the big picture of API usage and the relationships between services, so again it really isn’t anything new. Finally, there is still a need for cross-pollination and knowledge sharing between people like analysts and data scientists working on different teams. There are numerous solutions to this problem, which really is the same problem you face with things like frontend experts working on different projects. The key takeaway from all of this is to stop thinking of data as a special beast that needs special solutions, and to start thinking about how the solutions you already use to build a great engineering team can expand to embrace data as well.

© Matt Allen.