What makes a good data platform?

Matt Allen, data platform

Data platforms come in as many shapes and sizes as any other part of your infrastructure, and the decisions made during their construction often have ramifications that aren't felt until a lot of time has passed. There are plenty of articles out there telling you what you need and what tools to use, so I see no need to write another. What I'd like to do instead is discuss strategies for building a data platform that fits your particular needs and organization, rather than suggesting that there's one size that fits all.

If you're thinking of building (or largely buying) a data platform, you probably already see the value of data in terms of some particular use cases that would benefit your organization. My first nugget of advice is that if you can't think of concrete use cases, you shouldn't be building any kind of platform. I don't say this to be facetious; not all organizations are in a place where they would benefit from dedicated data infrastructure, and building tools for no one isn't a productive use of time or money.

Let's assume, then, that you have a handful of ideas that seem data-oriented, and that those use cases have enough stakeholder buy-in to be worth pursuing. It may be tempting to take a sort of "foundational" approach: start from the bottom and think about the way things "should" be, with an eye towards correctness and purity. These are certainly noble goals, but they can lead to building things that offer no concrete value to your company in the short term. And since requirements and technology change on rapid timescales, no value in the short term often means no value at all.

Another approach is to build exactly what is needed for your current projects or use cases, and nothing more. While this will certainly keep you from building things you don't need, the danger of over-fitting to the current environment should not be ignored. If you build yourself into a corner because you didn't consider the need to grow, you may end up rebuilding your platform every six months, a waste to be sure.

My guiding light is therefore a mixture of the two approaches. I try to build things that solve the current use cases first and foremost, but include the need to change, adapt, and grow as a requirement for all use cases I consider. This isn't a panacea or a simple task, but it does give me a base to work from.

Let's consider some concrete examples to understand this strategy better. If you were designing a communication protocol between two areas of the product, it would be clear from the immediate requirements that some sort of schema or data validation is essential to ensuring that the two systems are speaking the same language. Adding the requirements of change and growth might encourage you to think about things like backwards compatibility, schema versioning, or data migrations. Now you have a solution that solves your problem but doesn't assume that the current state of the world will remain stable for very long.

If you were building an event bus, you might consider requirements such as latency, durability, freshness, and partition tolerance. Adding a requirement for flexibility encourages you to think about replay or overwrite scenarios, about heterogeneous data versions living in one stream, and about new consumers and producers with different demands. This isn't to say that you should build for every conceivable eventuality; some assumptions can drastically reduce the scope and cost of work and are entirely realistic to make. Rather, the goal is to consider plausible future needs and make sure that if they do arise, you can meet them without a complete rebuild. The key to a good platform, as always, is separating prudent concerns about the future from flights of fancy.
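Returning to the protocol example: here's a minimal sketch of the backwards-compatibility idea, assuming a hypothetical JSON OrderEvent schema whose v2 added a currency field. A version-aware decoder upgrades old payloads on read, so producers and consumers can migrate independently.

```python
import json

# Hypothetical event schema: v2 added a "currency" field that v1
# producers never sent. The decoder accepts both versions and upgrades
# v1 payloads on read, so neither side has to migrate in lockstep.

def decode_order_event(raw: bytes) -> dict:
    event = json.loads(raw)
    version = event.get("schema_version", 1)
    if version == 1:
        # Upgrade in place with the documented v1 default.
        event["currency"] = "USD"
        event["schema_version"] = 2
    elif version != 2:
        raise ValueError(f"unsupported schema_version: {version}")
    return event

# A v1 producer and a v2 producer can coexist on the same channel.
old = decode_order_event(b'{"schema_version": 1, "order_id": "a1", "amount": 100}')
new = decode_order_event(b'{"schema_version": 2, "order_id": "b2", "amount": 5, "currency": "EUR"}')
assert old["currency"] == "USD" and new["currency"] == "EUR"
```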

Features that enable flexibility

Tools, not Solutions

One of the biggest things you can do to promote a flexible data platform is to think of the components within it as tools for solving problems, not solutions in themselves. The solutions to your business problems don’t belong in platform-land, and trying to make them live there is a bit like putting your web application logic in with your infrastructure definitions. Just as a traditional web platform enables teams to build solutions on top of it without knowing everything about networking, a data platform should enable people to build data-driven solutions to problems without needing the knowledge of a data engineer. This means providing guard rails around things like throughput, latency, cost, and access patterns for each tool. Make sure your users know the implications of decisions like streaming vs. batch, but don’t try to make the decision for them.
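To sketch what guard rails without decisions might look like, here's a toy example (all names and numbers are hypothetical) where the platform surfaces the latency and cost implications of streaming vs. batch at the point of decision, rather than choosing on the user's behalf.

```python
from dataclasses import dataclass
from enum import Enum

class DeliveryMode(Enum):
    STREAMING = "streaming"  # low latency, higher per-record cost
    BATCH = "batch"          # high latency, cheap at volume

@dataclass(frozen=True)
class Guardrails:
    max_latency_seconds: int
    relative_cost: str

# The platform documents the tradeoff; the user makes the call.
GUARDRAILS = {
    DeliveryMode.STREAMING: Guardrails(max_latency_seconds=5, relative_cost="high"),
    DeliveryMode.BATCH: Guardrails(max_latency_seconds=3600, relative_cost="low"),
}

def create_pipeline(name: str, mode: DeliveryMode) -> str:
    g = GUARDRAILS[mode]
    # Surface the implications at the moment of decision instead of
    # hiding them inside the tool.
    print(f"{name}: {mode.value} delivery, "
          f"latency <= {g.max_latency_seconds}s, cost: {g.relative_cost}")
    return name

create_pipeline("clickstream", DeliveryMode.STREAMING)
```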

Versioning

One of the biggest pain points with data-driven systems is the need to evolve data schemas and access patterns as business requirements change. To that end, making sure that every data artifact in your system is versioned, and that old versions continue to work long enough for consumers to adopt the newer versions, is key. This could mean there will be periods when you are duplicating lots of data and work, but the decrease in friction between teams is well worth the overhead. In addition, some version changes don’t require any duplication of work because they are backwards compatible, or the old version can be created as a view on top of the new world.
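As a rough sketch of the old-version-as-a-view trick, assume a hypothetical users dataset whose v2 split a single name column in two. The old version stays alive as a view over the new one, with no duplicated storage or pipeline work.

```python
import sqlite3

# Sketch: v2 of a "users" dataset split "name" into first/last. Rather
# than maintaining two copies, v1 is recreated as a view over v2 so
# existing consumers keep working while they migrate.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE users_v2 (id INTEGER, first_name TEXT, last_name TEXT)")
db.execute("INSERT INTO users_v2 VALUES (1, 'Ada', 'Lovelace')")

# Backwards-compatible view: v1 consumers see the schema they expect.
db.execute("""
    CREATE VIEW users_v1 AS
    SELECT id, first_name || ' ' || last_name AS name FROM users_v2
""")

assert db.execute("SELECT name FROM users_v1").fetchone() == ("Ada Lovelace",)
```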

Access logging

No one would build a web API platform without some kind of access monitoring and logging, and the same should be true of data products. Knowing who is using your data, how often, and what version they are using is key to managing change within a data platform. This means that the platform needs to provide a golden path for collecting this audit data and surfacing it to data producers.
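A minimal sketch of what that golden path could look like, with a hypothetical platform client that records who read which dataset and version on every access (the in-memory list stands in for a real logging sink):

```python
import datetime
import getpass

# Hypothetical audit record emitted on every read through the
# platform's client, so producers can see who depends on what.
ACCESS_LOG: list[dict] = []  # stand-in for a real logging sink

def read_dataset(name: str, version: int) -> str:
    ACCESS_LOG.append({
        "dataset": name,
        "version": version,
        "user": getpass.getuser(),
        "at": datetime.datetime.now(datetime.timezone.utc).isoformat(),
    })
    return f"contents of {name} v{version}"  # stand-in for the actual read

read_dataset("orders", version=2)
read_dataset("orders", version=1)  # a straggler on the old version

# A producer can now answer: is anyone still on v1?
stragglers = [r["user"] for r in ACCESS_LOG
              if r["dataset"] == "orders" and r["version"] == 1]
print(f"still on orders v1: {stragglers}")
```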

Documentation and metadata

One of the most powerful tools in a platform maintainer’s arsenal is building things that let people answer their own questions. Things like, “is this dataset ready for production use?”, “who produces this dataset?”, “who owns the provisioning of this identifier?”, and “when was the last time this was used?” are all questions that people should be able to answer without hopping into a support channel. This means that lots of metadata about datasets and systems needs to be maintained and surfaced to everyone in the org.
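One possible shape for that metadata, sketched as a hypothetical catalog entry that carries enough fields to answer each of the questions above (all names and values here are made up):

```python
from dataclasses import dataclass, field

# Hypothetical catalog entry: enough metadata to answer the common
# questions without a trip to the support channel.
@dataclass
class DatasetMetadata:
    name: str
    owner: str                 # "who produces this dataset?"
    production_ready: bool     # "is this ready for production use?"
    last_accessed: str         # "when was this last used?" (fed by access logs)
    docs_url: str = ""
    tags: list = field(default_factory=list)

CATALOG = {
    "orders": DatasetMetadata(
        name="orders",
        owner="payments-team",
        production_ready=True,
        last_accessed="2024-01-05",
        docs_url="https://wiki.example.com/datasets/orders",
        tags=["finance", "pii"],
    ),
}

entry = CATALOG["orders"]
print(f"{entry.name}: owned by {entry.owner}, prod-ready={entry.production_ready}")
```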

Agnosticism to underlying technology

The number of opinions about databases, query engines, storage solutions, and streaming platforms could fill a book. Chances are, your organization will not be perfectly homogeneous in its choice of technology for all of these categories, and even if things start that way, they may still change in the future as new products become available. If you build the core framework of your data platform in a way that is agnostic to specific technologies, you can more easily weather those changes. This is not to suggest that you need to reinvent the wheel for every problem; it is probably enough to think about some simple abstraction and encapsulation around the borders of your platform, which will give you hooks to add flexibility in the future. This is a technique that can bite you if you overdo it, so it's worth spending some time thinking about where to abstract over an underlying tool and where to just use the tool as-is.
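A sketch of what a seam at the platform border might look like, using a hypothetical object-store interface; the in-memory backend here is a toy stand-in for S3, GCS, or whatever comes next.

```python
from typing import Protocol

# A thin seam at the platform border: callers depend on this interface,
# not on any one storage engine.
class ObjectStore(Protocol):
    def put(self, key: str, data: bytes) -> None: ...
    def get(self, key: str) -> bytes: ...

class InMemoryStore:
    def __init__(self) -> None:
        self._blobs: dict[str, bytes] = {}

    def put(self, key: str, data: bytes) -> None:
        self._blobs[key] = data

    def get(self, key: str) -> bytes:
        return self._blobs[key]

def archive_report(store: ObjectStore, report_id: str, body: bytes) -> None:
    # Platform code only touches the interface; swapping in a different
    # backend later means writing one new adapter, not a rebuild.
    store.put(f"reports/{report_id}", body)

store = InMemoryStore()
archive_report(store, "q3", b"quarterly numbers")
assert store.get("reports/q3") == b"quarterly numbers"
```

The judgment call from the paragraph above applies here too: an interface this thin is cheap to maintain, but abstracting over every feature of every backend is exactly the overdoing that bites.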

© Matt Allen.