How Gojek’s early team managed 1 million drivers with 12 engineers
In Norse mythology, Yggdrasil is a gigantic mythical tree that connects the nine worlds of the Norse cosmology. It’s a holy tree whose branches touch the heavens. The gods hold court beneath it, debating the fates of other gods, men, and monsters. It’s the tree of life and the center of everything.
At Gojek, the Allocations Team is the eternal green ash tree. The company’s multiple products rely on this team to assign drivers and ensure completion of orders. The team cuts across products and services and is responsible for more than 1 million registered drivers on the platform. Back then, the team had only 12 engineers.
This is their story.
In 2010, Gojek started as a call center. Customers would call a number, then someone from the call center would “allocate” a driver after looking at a list of drivers in an Excel sheet.
Nadiem Makarim, Gojek’s CEO, was the first person to test the Gojek app. As soon as a dummy order was created and Makarim got the first notification, the app was open for business.
At its peak in 2014, the company had 200 drivers. The three-person engineering team then had a stopgap model for matchmaking drivers and customers. The codebase eventually had to be reworked to match scale, so the new matchmaking codebase was built in two days and could handle about 700 drivers. But everything was manual, and something had to be done to automate the whole process.
At the beginning of 2015, a version of the app was launched to keep pace with increasing demand. The company’s other products – Go-Ride, Go-Food, and Go-Send – went live. Customers would “bid” for a driver, then drivers would get a notification and accept the order.
This was the birth of the “bid engine,” a classic matchmaker between supply and demand (in this case, driver and customer). It would become the foundation on which Gojek was going to be built.
But there was a problem: siloed boxes.
There were three products built on the same underlying infrastructure, but with no interconnection between them. That is a problem that shouldn’t exist when success depends on deeper linkage across teams, and it was serious because all three teams were working on that same infrastructure.
Moreover, Gojek simply wasn’t prepared to handle the kind of adoption and growth it began to witness as soon as the app was launched. The algorithm started to crumble. The engineering was pretty straightforward, with 10 to 15 lines of code solving problems for a small set of data. As the number of drivers increased, downtimes became a routine affair. There were way too many bottlenecks.
Gojek was failing – and fast.
Soon enough, the company had one of India’s best consulting firms, Code Monk, in its arsenal. The task for the engineering team then was to work on the bidding engine and ensure that there’d be no more downtimes.
It’s about the ones and zeros
The old codebase was written in Java. It’s a programming language that belonged to the internet age, but it wasn’t suited to a startup in 2015 that was exploding in demand and imploding from a lack of resources and engineering wherewithal.
Why were we using Java? Because of the Golden Hammer anti-pattern: “I know Java, so Java is the best.”
Niranjan Paranjape, Gojek’s chief technology officer, plugged in the hard drive and checked the code. When he opened the readme file, the first line was:
mvn install -DskipTests.
That Maven flag builds and installs the project without running its tests. The code had never been tested. In other words, the code was live without ever passing a single quality check.
The old codebase was called Stan Marsh. For the uninitiated, Stan Marsh is a character from the animated TV series South Park.
Stan Marsh was the legacy code on which Gojek was going to be built. Because there was no test harness, it was difficult to understand which portion of the app was working and which wasn’t. Considering the app was live, no one wanted to touch a “ticking time bomb.”
Paranjape rewrote the entire codebase in Golang, a language he didn’t know. Golang could handle concurrency and manage the kind of load Gojek was experiencing. It wasn’t an easy decision to go with Golang, though, because not many knew the programming language.
But some risks are worth taking, and that separates good engineering from great.
Luckily, this initial decision-making set the tone for the team to learn, experiment, adapt, and take responsibility or be disrupted.
After three nights, a dozen bottles of energy drinks, and several cups of coffee, the mothership was ready. Or so the team thought.
In a month, Gojek’s driver count tripled, and more problems came.
The new normal
For the next three months, downtimes were the new normal.
The “broadcast algorithm” the bid engine team was relying on was failing. It was broadcasting the same order to the entire driver database. So, every driver could see the same order multiple times, but they couldn’t necessarily fulfill it.
The algorithm had a three-fold problem:
Accountability: How could the app reward high-performing drivers (i.e., drivers who do more orders or never cancel on a customer) when they simply couldn’t accept an order due to the algorithm’s issue? Likewise, how could the app deny bonuses for low-performing drivers? There was no accountability for the driver or the business fundamentals.
High concurrency: The sheer volume of orders meant that drivers were missing out on orders. The unfulfilled orders were caused by the multiple order blasts and server loads. That resulted in a poor customer experience.
Note: Location-based orders are a peculiar problem for Gojek. Within a distance of 20 meters, more than 30 Go-Ride scooters can be spotted.
Unhealthy competition: Since the orders were sent to all drivers, the service couldn’t guarantee quality for customers. The app also couldn’t get the nearest driver to take on an order.
The algorithm’s very design was proving questionable. Who got an order became a function of the smartphone: better GPS, hardware, internet, software, etc. These all played a critical role, and that was unfair. Zero accountability and unreasonable driver congestion meant things were going awry.
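To make the contrast concrete, here is a minimal sketch in Go of the two approaches: blasting an order to every driver versus offering it only to the closest few. It is an illustration under stated assumptions, not Gojek’s actual code. The Driver type, the function names, the coordinates, and the use of the haversine distance are all hypothetical choices made for this example.

```go
package main

import (
	"fmt"
	"math"
	"sort"
)

// Driver is a hypothetical record; the fields are assumptions for illustration.
type Driver struct {
	ID       string
	Lat, Lon float64
}

// broadcast mimics the failing approach: every driver is notified of the order.
func broadcast(drivers []Driver) []string {
	ids := make([]string, 0, len(drivers))
	for _, d := range drivers {
		ids = append(ids, d.ID)
	}
	return ids
}

// haversineKm returns the great-circle distance in kilometres between two points.
func haversineKm(lat1, lon1, lat2, lon2 float64) float64 {
	const earthRadiusKm = 6371.0
	toRad := func(deg float64) float64 { return deg * math.Pi / 180 }
	dLat := toRad(lat2 - lat1)
	dLon := toRad(lon2 - lon1)
	a := math.Sin(dLat/2)*math.Sin(dLat/2) +
		math.Cos(toRad(lat1))*math.Cos(toRad(lat2))*math.Sin(dLon/2)*math.Sin(dLon/2)
	return 2 * earthRadiusKm * math.Asin(math.Sqrt(a))
}

// nearest returns up to k driver IDs ordered by distance from the pickup point.
func nearest(drivers []Driver, lat, lon float64, k int) []string {
	sorted := make([]Driver, len(drivers))
	copy(sorted, drivers) // leave the caller's slice untouched
	sort.SliceStable(sorted, func(i, j int) bool {
		return haversineKm(lat, lon, sorted[i].Lat, sorted[i].Lon) <
			haversineKm(lat, lon, sorted[j].Lat, sorted[j].Lon)
	})
	if k < 0 {
		k = 0
	}
	if k > len(sorted) {
		k = len(sorted)
	}
	ids := make([]string, 0, k)
	for _, d := range sorted[:k] {
		ids = append(ids, d.ID)
	}
	return ids
}

func main() {
	// Made-up coordinates around Jakarta, purely for the example.
	drivers := []Driver{
		{ID: "far", Lat: -6.30, Lon: 106.90},
		{ID: "near", Lat: -6.2001, Lon: 106.8167},
		{ID: "mid", Lat: -6.21, Lon: 106.83},
	}
	fmt.Println(len(broadcast(drivers)))                // 3: everyone is notified
	fmt.Println(nearest(drivers, -6.2000, 106.8166, 2)) // [near mid]
}
```

With the broadcast version, the number of notifications grows with the size of the driver base for every single order. With the nearest-k version, it stays bounded by k, and who receives the offer no longer depends on whose phone responds fastest.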
Seeing 10x growth, but 100% failure
When Paranjape pulled a couple of all-nighters and reworked the code, the core portion was rewritten as a spike.
What’s a spike? You break the rules and throw caution to the wind with the objective of shipping something out to keep the company afloat.
The problem with the spike, however, was that it wasn’t the end solution. That meant more downtimes and more failures. By late 2015, the team was in murky waters.
At that point, Gojek was managing more than 300,000 orders every day. Failures were routine – again. Wherever Makarim went, he was questioned on why the app was crashing or why users couldn’t find drivers. By then, the tech team was made up of around 10 people who were firefighting every day.
When Shobhit Srivastava, one of the company’s programmers, went to a nearby pizza store to grab a quick bite, the drivers who were in the store approached him to question the app. Anyone who wore a Gojek T-shirt became the unofficial complaint box. Something needed to change – and fast.
The big rewrite
The team needed to work on a different algorithm, do one-to-one personalization, pin accountability on drivers, identify what a perfect driver looks like, and ideate how to frame this persona. The big rewrite began in the middle of 2016, and the bid engine team became the Allocations Team.
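One way to picture one-to-one allocation with accountability is a scoring function that weighs a driver’s track record against distance and then hands the order to exactly one driver. The sketch below, in Go, is only an illustration under assumptions: the Candidate type, the completion-rate field, the distance weight, and the scoring formula are all hypothetical, and the source does not describe the formula the team actually used.

```go
package main

import "fmt"

// Candidate is a hypothetical view of a driver at allocation time.
type Candidate struct {
	ID             string
	DistanceKm     float64 // distance to the pickup point
	CompletionRate float64 // share of accepted orders completed, from 0 to 1
}

// score is an assumed formula: reward reliability, penalise distance.
func score(c Candidate, distanceWeight float64) float64 {
	return c.CompletionRate - distanceWeight*c.DistanceKm
}

// allocate picks exactly one driver for the order.
// ok is false when there are no candidates; ties go to the earlier candidate.
func allocate(cands []Candidate, distanceWeight float64) (best Candidate, ok bool) {
	for i, c := range cands {
		if i == 0 || score(c, distanceWeight) > score(best, distanceWeight) {
			best = c
			ok = true
		}
	}
	return best, ok
}

func main() {
	cands := []Candidate{
		{ID: "a", DistanceKm: 0.2, CompletionRate: 0.60},
		{ID: "b", DistanceKm: 0.5, CompletionRate: 0.95},
		{ID: "c", DistanceKm: 3.0, CompletionRate: 0.99},
	}
	if d, ok := allocate(cands, 0.1); ok {
		fmt.Println(d.ID) // b: slightly farther than a, but far more reliable
	}
}
```

Because each order goes to a single driver, an accepted or cancelled order can be attributed to that driver, which is what makes rewarding high performers possible in the first place.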
At that point, the company was still losing customers, so it was time to go back to square one – back to taking risks.
The team then switched to using Clojure, an obvious choice given the specific complexities it could handle. The language could design better abstractions for a specific problem the team needed to solve. While Golang was the modern superbike that had it all, Clojure was the cruiser: really simple and capable of designing complex code. It ushered in the idea of getting organized and ensuring good software development practices.
“Only two in the team knew Clojure then, but it solved an important business problem. We went with it and we all had to learn,” Paranjape recalls.
A six-member team started the work with Clojure, with the first task being to replicate the bid engine logic.
This is not to say that one language is better than the other. It’s tempting to arrive at that conclusion when you see the image above.
But there were trade-offs when we made the switch: While Golang is superior in performance, making changes and adding features was hard. Language was traded for design.
The innate ability to sniff out what works when, how, and why is what makes lean engineering so special at Gojek.
After two months, a stable product was live. Three days after the release, no one had noticed there was a new codebase or algorithm in place. There were no issues, and the app was starting to achieve scale.
Shaping a mindset
That’s just half of the story. A million mistakes later, we’re still making mistakes. But that’s the good part. We fail fast, and then we build fast. There’s no hierarchy. There’s an ingrained mentality of managing more with less. Anything that’s repetitive gets automated.
One could argue this mindset was born out of desperation to make Gojek’s products the arteries crisscrossing through the heart of Indonesia. Regardless, the engineering psyche was passed down even to our recruitment process.
Remember Stan Marsh, the old legacy code on which Gojek was being built?
Ten percent of it still survives today (although there’s a plan to eventually put it to rest). It still exists for legacy reasons. I suspect the team is also sentimental about it – after all, it was the foundation on which Gojek was built.
Smart engineering is about working with a legacy codebase and improving it. Fly with what you have and make it better, and everything else will follow. The team embraced that challenge.
It all boils down to the kind of people you let in the system. People should also be empowered to make decisions. As our India head, Sidu Ponnappa, often says: “Don’t throw people at a problem.” Having more people does not mean better work. Having more people does not mean better code.