It’s been a hot second but I’m really excited to start writing again in 2023!
I’ve been a PM for (internal) compute infrastructure aka Borg at Google for a year now and something that comes up a lot is the tension between abstraction and customization. Of course, this isn’t a novel problem; almost every industry is plagued with it in one way or another. Do you build solutions for the average user or the power user? Do you optimize for price (niche audience, high margin) or volume (wide audience, low margin)?
I dealt with this at VMware when we created a Kubernetes CRD for Virtual Machines (VMs). There is an extensive list of configuration options available for VMs and figuring out which of these to expose to users and which of these to hide was an imperfect science. The balance between simplicity and complexity is hard, especially when there isn’t much data to reference.
At Google too, this problem comes up in a variety of ways. I will caveat by saying that internal platform PM-ing brings up unique challenges because you have a captive audience and you don’t generate revenue.
There are power users, i.e. a small number of Google’s largest, most trafficked apps and services, which care about NUMA nodes, memory bandwidth, throttling, and noisy-neighbor problems (among other things). You also have the long tail of regular apps and services that simply want enough compute to deploy and run.
If measured by sheer impact, it’s tempting to solve problems that will help large customers. A small optimization can result in massive gains, although over time the low-hanging fruit in this area runs out and you’re left with expensive multi-year projects. As a funny aside, I learnt the term “watermelon” today for projects that are green in status on the outside, but crumbling (aka red) on the inside. Over the length of my career, I’ve come to the realization that a lot of these big projects with the promise of a “holy grail” result turn out to be watermelons.
On the other hand, if you’re building for the long-tail of small customers, then you’re dealing with multiple user-personas with multiple needs, all of whom need something slightly different, but you can’t afford to build custom solutions… so you slap together a union of the most-requested features and send out thoughts and prayers for adoption (jk but not really).
Now switching from the customer PoV to being a provider of the underlying compute, there is also a tradeoff between vertically integrating and not. In other words, you can either abstract away the hardware underneath or expose custom features. As an internal compute platform at Google, we have deep knowledge of all the apps and services we run. Some apps run best on certain types of hardware, so we can make intelligent scheduling and placement decisions. But just because we can doesn’t always mean we should, because the more picky applications are about where they run, the less efficient the platform might be overall.
It’s like seats on a plane: if the airline doesn’t have butts in every seat, they’re losing money. This is the essential premise of every cloud provider really. Even if it is less optimal for the application, it’s probably in the best interest of the platform to have scheduling flexibility. It’s often a call between optimizing for a global maximum and a local maximum, and the platform team is the only one with broad enough visibility to make this call (and the call can be really hard to make).
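To make the plane-seat point a bit more concrete, here is a toy sketch in Python. It is not how Borg (or any real scheduler) works. The `Machine` and `Task` types, the "gen1"/"gen2" hardware labels, and the first-fit placement loop are all made up for illustration. It just shows that when every task insists on one kind of hardware, the same fleet ends up with empty seats.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Machine:
    name: str
    hardware: str   # hypothetical hardware label, e.g. "gen1" or "gen2"
    free_cpus: int


@dataclass
class Task:
    name: str
    cpus: int
    required_hardware: Optional[str] = None  # None means "run me anywhere"


def schedule(tasks, machines):
    """Toy first-fit placement. Returns (placements, names of unplaced tasks)."""
    free = {m.name: m.free_cpus for m in machines}
    placements, unplaced = {}, []
    for task in tasks:
        for m in machines:
            hardware_ok = task.required_hardware in (None, m.hardware)
            if hardware_ok and free[m.name] >= task.cpus:
                free[m.name] -= task.cpus
                placements[task.name] = m.name
                break
        else:
            # No machine satisfied both the hardware constraint and the CPU ask.
            unplaced.append(task.name)
    return placements, unplaced


def utilization(placements, tasks, machines):
    """Fraction of the fleet's CPUs that ended up in use."""
    used = sum(t.cpus for t in tasks if t.name in placements)
    return used / sum(m.free_cpus for m in machines)


machines = [Machine("m1", "gen1", 8), Machine("m2", "gen2", 8)]

# Same four tasks, once with no constraint and once pinned to "gen2".
flexible = [Task(f"t{i}", 4) for i in range(4)]
picky = [Task(f"t{i}", 4, required_hardware="gen2") for i in range(4)]

flex_placed, flex_unplaced = schedule(flexible, machines)
picky_placed, picky_unplaced = schedule(picky, machines)

print(utilization(flex_placed, flexible, machines))  # 1.0, every seat filled
print(utilization(picky_placed, picky, machines))    # 0.5, m1 sits idle
```

In the picky case two tasks are left waiting while a perfectly good machine sits empty, which is the local-versus-global tension in miniature.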
There are also times when you have to bite the bullet and take the hit because performance is more valuable than efficiency, especially when you’re running a latency-sensitive, highly available application that brings in the moolah.
As much as I wish there were a framework to pick between abstraction and customization, there isn’t one that I have found (and if you know of something, please send it my way). I’ve concluded that every application is different and needs to be assessed on its unique characteristics, and as a platform PM, I have no choice but to accept that.
P.S. I’m trying to more actively angel invest this year in the infra, compute, devops, dev tools/productivity space and would love to talk to anyone who is doing this successfully or is on a similar journey. Bonus if you’re an operator angel and/or a woman in tech! Please reach out if this is you.