“Do you know what it takes to sell real estate? It takes brass balls to sell real estate.”

— Blake, Glengarry Glen Ross

Last week, Zillow made such a whopper of an oopsie that they shut down an entire division and wrote off more than $500 million in losses. I’m not gonna mince words here. Zillow simply did not have the guts to be in the “iBuying” business. Here is Max Levchin, founder of Affirm and the man most responsible for kickstarting the algorithmic underwriting craze in Silicon Valley:

Host: “How are you thinking about judging what is and isn’t important, and how do we know, like, is it [the underwriting model] going to work?”

Max: “So you don’t. The honest answer is you don’t, and the only way of building a successful anti-fraud and risk underwriting system is rigor and, for lack of a better term, balls of steel.”

Zillow couldn’t stomach this uncertainty, with their CEO saying that “buying and selling thousands of homes every month required the company to put too much capital at risk.” The $500 million question is: why couldn’t they? They oversimplified the problem. They thought they needed to build a machine learning model when they really needed to build an entirely new organization, one that possessed the technical and cultural mindset necessary to succeed in this space.

A machine learning organization thinks of risk entirely differently than an automated risk underwriting organization. But this is actually a byproduct of a much more fundamental difference: how these organizations treat data. Machine learning engineers tend to treat data as fungible and, as such, more is always better. This seems to have been Zillow’s approach (emphasis mine):

“The business model rested on the assumption that Zillow’s algorithm, fed by the company’s trove of data, would be able to predict home prices with pinpoint accuracy.”

—Wall Street Journal

“We used historical data and countless simulations to test [our housing price forecasts].”

—Rich Barton, Zillow CEO, Q4 earnings call

Contrast that with how Max Levchin thinks about data (emphasis again mine):

“The most valuable data is not social data, not what you ate for lunch, not even debt-to-income ratio, even though you know that’s fairly predictive, but your own data because every dataset that you’re looking at internally describes your own process, including your bugs, including your delays, including what changed, including the merchant base that you sign and the merchant base that churned. All those changes to your system are encoded in your own logs and building models from your own data is the only way to build a really successful system.”

Which leads us back to risk. If data is truly fungible, then the path to building a business like Zillow Offers is fairly simple: acquire as much housing data as possible, train a model on it, ensure it spits out sufficiently accurate predictions, and use those predictions to make offers. Judging whether your model meets this threshold is a pretty binary decision: it either is good enough, or it isn’t. The perceived risk with this approach is therefore fairly low. Again, this seems to be Zillow’s approach:

“We set unit economics targets that required us to stay within plus or minus 200 basis points of breakeven.”

—Rich Barton, Zillow CEO, Q4 earnings call
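
To make the fungible-data mindset concrete, here is a minimal sketch of that workflow in Python. The model object, the datasets, and the error threshold are all hypothetical stand-ins; this is the shape of the reasoning, not anything Zillow actually shipped:

```python
from statistics import mean

# A deliberately naive sketch of the "data is fungible" approach:
# train on whatever data you can acquire, then make one binary call.
# Every name and number here is a hypothetical stand-in.
def model_is_good_enough(model, train_X, train_y, holdout_X, holdout_y,
                         max_error_pct=2.0):
    """Train on acquired data, then make the go/no-go decision."""
    model.fit(train_X, train_y)  # data is fungible, so more is always better
    predictions = model.predict(holdout_X)
    pct_errors = [100 * abs(pred - actual) / actual
                  for pred, actual in zip(predictions, holdout_y)]
    # One aggregate number against one threshold: the model either is
    # good enough, or it isn't.
    return mean(pct_errors) <= max_error_pct
```

Notice how the entire bet collapses into a single comparison. That is exactly why the perceived risk feels so low.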

But automated risk underwriting engineers know that you can’t bootstrap a model on existing datasets. The only way to get the data you need to be successful is to pay for it in time, talent, and treasure. Needless to say, this is a significantly riskier undertaking. Here’s Max again:

“Of course, it’s very hard to build models from your own data if you don’t have a lot of data and the only way to get a lot of data is to lose a bunch of money.”

[…]

“The punchline is you only find out what works when you use the data that describes your own system, and that means processing a lot of transactions and bracing for impacts because a lot of the transactions are gonna go sour. One of the things that happens for a brand-new launched credit card: done right, you lose about 50% of the dollar volume in the first several months which is terrifying because it’s half the money, literally.”

So what lessons can we learn from Zillow Offers’ collapse? Mark my words: they are only the first company to go under. The algorithmic underwriting sector is the new Hot Thing in Silicon Valley, and many of these companies are taking shortcuts as they try to outrun the competition. Yet, in the end, these shortcuts will be their downfall. Here are some common mistakes I’m seeing in the space:

  1. You cannot bootstrap off an existing dataset. Full stop. These datasets can contain implicit assumptions or associations that you are not aware of. This is the original sin of many an algorithmic risk underwriting startup.
  2. You are operating in an adversarial environment. Most folks in ML are used to working with pretty boring data—demographic data, handwriting samples, etc. That changes as soon as you introduce cold, hard cash into the equation. As soon as there is money to be made, fraudsters will be hard at work reverse engineering your model. Have you separated your fraud detection models from your risk underwriting models? Do you have systems in place to detect fraudulent requests, and are you routing the right requests into the correct training pipelines? (A minimal sketch of this routing is the first example after this list.)
  3. Startups underestimate how much money it will take to train the model. As previously noted, you should expect to lose about 50% of the capital you allocate toward underwriting. I suspect many startups drastically underestimate this amount, then realize they are going to run out of money, which means raising capital under duress, which means extremely bad terms, which makes future success even less likely.
  4. Startups do not implement sufficient rigor in their data collection and analysis processes. As your understanding of the problem evolves and your technical infrastructure matures, it is only natural to change your data model to better fit the problem you are trying to solve. Problems emerge, however, when these updates break backwards compatibility in subtle ways. Even as your data model evolves, you want to make sure you can compare the decisions you are making today with the decisions you made on Day 1. (The second example after this list shows one way to do this with versioned decision records.)
  5. Startups let business concerns affect model development. There are often valid business considerations that lead startups to make decisions that are sub-optimal in the eyes of the model. Consider the following example from the Buy Now, Pay Later (BNPL) space. You are the ML lead at a company having a bake-off with a competitor to win a major deal. Perhaps your decisioning model has a lower risk tolerance than your competitor’s, which means you are going to approve fewer loans. But does the merchant care about that? No! Every denial you make means a lost sale for them, which means lost revenue. So someone from finance comes over and says, “Hey, can you loosen your risk controls on this merchant for the next two weeks? We’ve run the numbers and we can afford increased default rates – we’ll just consider this a marketing expense.” When you feed those loans back into your training pipeline, are you marking them differently than a normal loan? How about when you calculate the health of your book? (The third example after this list shows one way to tag them.) It is likely that Zillow made this mistake, adding automatic price boosts to the model to increase the take-up rate of its offers.
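
On point 2, here is one possible shape for that separation: the fraud model scores first, and suspected-fraud traffic never reaches the underwriting training queue. Everything here (the class, the model interfaces, and the threshold) is an illustrative assumption, not a reference design:

```python
from dataclasses import dataclass
from typing import List

FRAUD_THRESHOLD = 0.9  # illustrative; tuned per portfolio in practice

@dataclass
class Application:
    features: dict
    suspected_fraud: bool = False

def route(app: Application, fraud_model, underwriting_model,
          fraud_queue: List[Application],
          underwriting_queue: List[Application]) -> str:
    # Score fraud first, with a model that shares no weights or
    # training data with the underwriting model.
    if fraud_model.score(app.features) >= FRAUD_THRESHOLD:
        app.suspected_fraud = True
        fraud_queue.append(app)  # only the fraud model trains on this
        return "deny"
    decision = underwriting_model.decide(app.features)
    underwriting_queue.append(app)  # keeps the underwriting data cleaner
    return decision
```

The value of the split is that an adversary probing your approval boundary cannot simultaneously poison the underwriting model’s training data on the way in.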
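On point 4, one common pattern for keeping Day 1 decisions comparable with today’s is to stamp every decision record with a schema version and maintain explicit migrations between versions. The versions and field names below are hypothetical:

```python
# Stamp every decision record with a schema version and keep migrations,
# so historical records stay comparable with today's. Hypothetical fields.
SCHEMA_VERSION = 3

def migrate_v1_to_v2(record):
    # v2 split a single "income" field into stated and verified income.
    record["stated_income"] = record.pop("income")
    record["verified_income"] = None  # unknown for historical records
    return record

def migrate_v2_to_v3(record):
    # v3 renamed "score" to "risk_score" -- subtle, and easy to forget.
    record["risk_score"] = record.pop("score")
    return record

MIGRATIONS = {1: migrate_v1_to_v2, 2: migrate_v2_to_v3}

def upgrade(record):
    """Bring any historical decision record up to the current schema."""
    version = record.get("schema_version", 1)
    while version < SCHEMA_VERSION:
        record = MIGRATIONS[version](record)
        version += 1
    record["schema_version"] = SCHEMA_VERSION
    return record
```

The v3 rename is exactly the kind of subtle break described above: nothing crashes, but any analysis that compares old and new records without migrating them silently goes wrong.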
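And on point 5, a sketch of tagging override loans so they can be excluded from training and broken out separately when measuring the health of the book. The field names are again illustrative:

```python
from dataclasses import dataclass

@dataclass
class Loan:
    amount: float
    defaulted: bool
    # Empty for ordinary approvals; e.g. "bakeoff-merchant-123" when
    # finance asked for loosened risk controls on a specific merchant.
    override_reason: str = ""

def training_set(loans):
    # Override loans were not approved by the model's own policy, so
    # feeding them back in unmarked teaches the model the wrong lesson.
    return [loan for loan in loans if not loan.override_reason]

def default_rate(loans, include_overrides=False):
    """Book health, with override loans broken out by default."""
    book = loans if include_overrides else training_set(loans)
    return sum(loan.defaulted for loan in book) / len(book) if book else 0.0
```

Whether you drop override loans or reweight them is a judgment call; the mistake is letting them flow into your training pipeline unmarked.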

At a high level, the story of Zillow Offers is a story of our industry at its best. It’s a story of an industry giant facing an existential threat from a scrappy upstart and deciding to confront the challenge head-on, even though their business (high-margin lead generation driven by a large sales team) and their technical expertise (SEO) are completely orthogonal to the problem at hand (a low-margin business driven by state-of-the-art machine learning).

And this storyline isn’t going away. The activity in the algorithmic risk underwriting space is frenetic, to say the least. Unfortunately, it is inevitable that we will see many more companies, of all shapes and sizes, make these same mistakes. Because when you have a hammer, everything tends to look like a nail, and when you have TensorFlow, everything tends to look like an ML problem.