What a decade of production systems taught me

My first production system served ten million people. It was a citizen portal for Kazakhstan—the kind of system where downtime means people cannot access government services, cannot file documents, cannot do the bureaucratic tasks that society runs on. There was no “we’ll fix it in the next sprint.” There was only: this must work.

I was 19 years old. I did not have a computer science degree from Stanford. I did not have mentors who had built at scale. I had books, documentation, Stack Overflow, and the absolute terror of being responsible for something that millions of people depended on.

That system is still running. Over a decade later, still serving citizens, still processing documents, still doing its job while governments changed, while the world went through a pandemic, while I moved to different continents and built other things.

Here is what building systems like that taught me.

Lesson 1: The boring choices are the right choices

When you are young and technical, you want to use the new thing. The framework that just launched. The database that promises ten times the performance. The architecture pattern you read about on Hacker News.

Production systems taught me the opposite. The boring choice is almost always the right choice.

PostgreSQL over the new distributed database that promises infinite scale. React over the framework that launched last month. REST over GraphQL unless you have a specific reason. Kubernetes only if you actually need it, and you probably do not.

Boring technologies have one thing that exciting technologies lack: they have been debugged by millions of people before you. Every edge case has been discovered. Every failure mode has been documented. Every solution has been Stack Overflow’d.

When your system is down at 3am on a Saturday, you do not want to be the first person to encounter a bug. You want to be the thousandth person, with a clear Google result explaining exactly what went wrong and how to fix it.

Lesson 2: Observability is not optional

The government portal taught me this the hard way. The first version had minimal logging. When something went wrong, we knew it was wrong but not why. We would stare at metrics, guess at causes, deploy fixes that might help, and hope.

After the third incident where we spent hours guessing, I rebuilt the logging system. Every request logged. Every database query logged. Every external service call logged. Structured logs that could be queried. Metrics that could be graphed. Traces that showed the exact path of every request through the system.

The next incident took fifteen minutes to diagnose instead of four hours. The system had not gotten more reliable. We had gotten better at seeing what was happening.

I now tell every founder the same thing: spend 20% of your initial development time on observability. Logs, metrics, traces. It feels like overhead when you are building. It feels like salvation when you are debugging.
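To make "structured logs that could be queried" concrete, here is a minimal sketch in Python using only the standard library. It is not the portal's actual code: the logger name `portal`, the `timed` helper, and field names such as `request_id` and `duration_ms` are assumptions chosen for illustration.

```python
import json
import logging
import time
from contextlib import contextmanager


class JsonFormatter(logging.Formatter):
    """Render each log record as one JSON object per line."""

    def format(self, record: logging.LogRecord) -> str:
        payload = {
            "ts": self.formatTime(record),
            "level": record.levelname,
            "event": record.getMessage(),
        }
        # Extra fields arrive via `extra={"fields": {...}}` (a convention of this sketch).
        payload.update(getattr(record, "fields", {}))
        return json.dumps(payload)


def build_logger(stream=None) -> logging.Logger:
    """Return a logger that writes JSON lines to `stream` (stderr if None)."""
    logger = logging.getLogger("portal")  # hypothetical logger name
    logger.setLevel(logging.INFO)
    logger.handlers.clear()
    handler = logging.StreamHandler(stream)
    handler.setFormatter(JsonFormatter())
    logger.addHandler(handler)
    logger.propagate = False
    return logger


@contextmanager
def timed(logger: logging.Logger, event: str, **fields):
    """Log the outcome and duration of the wrapped block, even if it raises."""
    start = time.perf_counter()
    outcome = "ok"
    try:
        yield
    except Exception:
        outcome = "error"
        raise
    finally:
        duration_ms = round((time.perf_counter() - start) * 1000, 2)
        logger.info(
            event,
            extra={"fields": {**fields, "outcome": outcome, "duration_ms": duration_ms}},
        )


if __name__ == "__main__":
    log = build_logger()
    # Wrap a database query or an external call the same way.
    with timed(log, "db.query", request_id="req-123", query="select_documents"):
        time.sleep(0.01)
```

Because every line is a JSON object carrying the same `request_id`, the path of one request can be pulled out of the logs with a simple filter instead of a guess.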

Lesson 3: The database is the system

After the citizen portal, I built a maritime safety tool for enterprise clients. Ships in the ocean, tracking compliance, managing inspections. Different domain, same lesson reinforced: the database is the system. Everything else is just interface.

Your API can be rewritten. Your frontend can be redesigned. Your business logic can be refactored. But your database schema is permanent. The decisions you make about how to model data will outlive every other decision you make.

I have seen startups rewrite their entire application three times. Same database. The schema they designed in month one was still constraining their options in year three.

This is why I spend more time on data modeling than on any other part of system design. Getting the entities right. Getting the relationships right. Getting the constraints right. Everything else flows from there.
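As a rough illustration of what "getting the constraints right" can mean, the sketch below uses Python's built-in `sqlite3` module. The `vessel` and `inspection` tables, their columns, and the allowed `result` values are hypothetical, loosely inspired by the maritime example, and are not the schema of any real system.

```python
import sqlite3

# Hypothetical schema: the database itself refuses invalid data,
# no matter which version of the application is writing to it.
SCHEMA = """
CREATE TABLE vessel (
    id          INTEGER PRIMARY KEY,
    imo_number  TEXT NOT NULL UNIQUE,
    name        TEXT NOT NULL
);
CREATE TABLE inspection (
    id           INTEGER PRIMARY KEY,
    vessel_id    INTEGER NOT NULL REFERENCES vessel(id),
    inspected_on TEXT NOT NULL,
    result       TEXT NOT NULL CHECK (result IN ('pass', 'fail', 'pending'))
);
"""


def open_db(path: str = ":memory:") -> sqlite3.Connection:
    conn = sqlite3.connect(path)
    # SQLite only enforces foreign keys when this pragma is enabled.
    conn.execute("PRAGMA foreign_keys = ON")
    conn.executescript(SCHEMA)
    return conn


def rejected(conn: sqlite3.Connection, sql: str, params: tuple) -> bool:
    """Return True if the database refuses the statement on integrity grounds."""
    try:
        conn.execute(sql, params)
    except sqlite3.IntegrityError:
        return True
    return False


if __name__ == "__main__":
    conn = open_db()
    conn.execute(
        "INSERT INTO vessel (imo_number, name) VALUES (?, ?)",
        ("IMO0000001", "Example Vessel"),  # placeholder values
    )
    insert = "INSERT INTO inspection (vessel_id, inspected_on, result) VALUES (?, ?, ?)"
    print(rejected(conn, insert, (999, "2024-01-01", "pass")))   # unknown vessel
    print(rejected(conn, insert, (1, "2024-01-01", "maybe")))    # invalid result
    print(rejected(conn, insert, (1, "2024-01-01", "pass")))     # valid row
```

The application code around these tables can be rewritten freely. The constraints stay with the data.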

Lesson 4: Complexity is debt

Every feature you add is debt. Every abstraction you create is debt. Every “clever” solution is debt that will come due when someone needs to understand what the code does.

The citizen portal started with too many microservices. We ended up merging most of them. Not because we removed functionality, but because we learned that most of the separation was artificial. Services that always deployed together. Services that shared databases. Services that could not function independently.

The system got simpler. Deployments got easier. Debugging got faster. We had paid down complexity debt.

I now follow a rule: no abstraction until the third time. The first time you need something, write it inline. The second time, write it inline again. The third time, maybe extract it. Maybe.

Most abstractions are created too early, based on imagined future requirements that never materialize. The code ends up optimized for a future that never came, while being harder to understand in the present that exists.

Lesson 5: Failure modes matter more than success paths

When a junior developer builds a feature, they think about what happens when everything works. When a senior developer builds a feature, they think about what happens when things fail.

What happens when the database is slow? What happens when the external API times out? What happens when the user submits the form twice? What happens when the network drops in the middle of a transaction?

Every integration point is a failure point. Every external dependency is something that can go wrong. The production system is not the code you wrote. It is the code you wrote plus everything that can interfere with it.
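For the external API that times out, one common pattern is a bounded retry with exponential backoff. The sketch below is one possible shape, not a prescription: `UpstreamTimeout`, `call_with_retries`, and the default attempt count and delays are all hypothetical names and values.

```python
import time


class UpstreamTimeout(Exception):
    """Raised by an operation when an external dependency does not answer in time."""


def call_with_retries(operation, attempts=3, base_delay=0.1, sleep=time.sleep):
    """Call `operation()`, retrying on UpstreamTimeout with exponential backoff.

    The delay doubles after each failed attempt. After the final attempt the
    timeout is re-raised so the caller can detect the failure and degrade
    gracefully instead of hanging.
    """
    for attempt in range(1, attempts + 1):
        try:
            return operation()
        except UpstreamTimeout:
            if attempt == attempts:
                raise
            sleep(base_delay * 2 ** (attempt - 1))


if __name__ == "__main__":
    calls = {"count": 0}

    def flaky():
        # Simulated dependency: times out twice, then answers.
        calls["count"] += 1
        if calls["count"] < 3:
            raise UpstreamTimeout()
        return "ok"

    print(call_with_retries(flaky), "after", calls["count"], "calls")
```

Retrying is only safe when the operation can be repeated without side effects, so the retry limit and the set of operations wrapped this way deserve as much thought as the call itself.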

I design for failure first. What is the worst thing that can happen? How do we detect it? How do we recover? Only after I have answered these questions do I think about the happy path.
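Take the form submitted twice. One way to design for it is an idempotency key: the client sends the same key on every attempt, and the database refuses to store it a second time. This is a minimal sketch using Python's `sqlite3` module; the `submission` table and the `submit` function are hypothetical.

```python
import sqlite3
import uuid


def open_db() -> sqlite3.Connection:
    conn = sqlite3.connect(":memory:")
    # The primary key on idempotency_key is what makes duplicates impossible.
    conn.execute(
        """
        CREATE TABLE submission (
            idempotency_key TEXT PRIMARY KEY,
            payload         TEXT NOT NULL
        )
        """
    )
    return conn


def submit(conn: sqlite3.Connection, idempotency_key: str, payload: str) -> bool:
    """Store the submission once.

    Returns True if this call stored it, False if the same key was already
    stored (for example, the user clicked twice or the client retried).
    """
    try:
        with conn:  # commits on success, rolls back if the insert fails
            conn.execute(
                "INSERT INTO submission (idempotency_key, payload) VALUES (?, ?)",
                (idempotency_key, payload),
            )
        return True
    except sqlite3.IntegrityError:
        return False


if __name__ == "__main__":
    conn = open_db()
    key = str(uuid.uuid4())  # generated once per form, reused on every retry
    print(submit(conn, key, "application form"))
    print(submit(conn, key, "application form"))
```

The check lives in the database rather than in application memory, so it still holds when two requests arrive at the same time on different servers.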

Lesson 6: Humans are part of the system

The maritime safety system taught me something the government portal had not: the humans operating the system are part of the system.

We built beautiful dashboards, elegant workflows, comprehensive features. The operators used about 10% of it. They had their own ways of doing things. They copy-pasted between applications. They kept spreadsheets on the side. They invented workarounds for problems we did not know existed.

The system we designed was not the system that ran. The system that ran was our software plus human behavior plus organizational constraints plus unwritten rules plus years of accumulated habit.

After that project, I started spending more time observing how people actually use systems. Not how they say they use them. Not how we designed them to be used. How they actually use them, including the workarounds, the shortcuts, the “wrong” ways that are actually right for their context.

Lesson 7: Technical decisions are business decisions

Every technical choice is a business choice wearing engineering clothes.

Choosing Kubernetes is a business decision about operational complexity. Choosing a microservices architecture is a business decision about team structure. Choosing to build versus buy is a business decision about time versus money.

I have seen startups make technically elegant decisions that killed the business. Beautiful architectures that took too long to build. Sophisticated infrastructure that required engineers they could not hire. Optimal solutions for problems they would never have at their scale.

The right technical decision is the one that serves the business. Sometimes that means choosing the “worse” technology because it ships faster. Sometimes that means technical debt because the market window is closing. Sometimes that means building something ugly that works over something beautiful that does not exist yet.

The synthesis

Over a decade. Government systems serving millions. Maritime safety tools where bugs have real-world consequences. Products used by people who do not care how clever the architecture is, only whether it works.

What I learned: production systems are not about code. They are about reliability, observability, simplicity, and humility. Humility about what you do not know. Humility about what can go wrong. Humility about the difference between the system you designed and the system that actually runs.

Through Prodmake, I now work with founders who are building their first production systems. They come to me with architecture questions. I give them architecture answers. But more importantly, I try to give them the mindset that took me over a decade to develop.

The mindset is simple: your job is not to build software. Your job is to build something that works, keeps working, and can be understood and maintained by whoever comes after you.

Everything else is vanity.
