Felipe Polo Site

Legacy migrations: You will fail

October 20, 2020

Rebuilding a legacy platform from scratch is really fun. Sometimes you need to replicate a lot of business logic buried inside a huge monolith full of unreadable hacks and massive functions, with no tests of any sort. Sometimes it even shares responsibilities with stored procedures in different databases with restricted access, and no one is able to tell you a single thing about them.

Under these circumstances, let me tell you a little secret: you will fail. There is no way a team can understand all that complexity and rebuild the whole thing in a sensible amount of time. So you’ve got two options:

1.- Focus on building the perfect system that never fails

2.- Focus on building a system that fails perfectly

I typically go for option number 2. Working on legacy system rebuilds teaches you to trust very few things, because so many things can go wrong. Replacing a legacy system is all about managing risk. Here are a few tips:

Back up your data.

I’m not only talking about backing up your databases here. Especially if you’re the source of truth, make sure you also back your data up to separate, cheap storage like Azure Blob Storage or Amazon S3. This protects you not only from server failure, but also from your data model getting corrupted along the way.

Use a data bus for resilience.

It helps to decouple your services
It offloads requests
It improves and eases recoverability and retries
It makes error handling easier
It helps to avoid bottlenecks
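To make the retry and error-handling points concrete, here is a minimal sketch in Python. It is not a real message bus: `InMemoryBus`, `Message` and `flaky_handler` are hypothetical names invented for this illustration, and a production setup would use an actual broker. It only shows the idea that the producer publishes and moves on, failed messages are retried, and messages that keep failing are parked in a dead-letter list instead of being lost.

```python
from collections import deque
from dataclasses import dataclass, field
from typing import Callable, Deque, Dict, List


@dataclass
class Message:
    payload: dict
    attempts: int = 0


@dataclass
class InMemoryBus:
    """Hypothetical stand-in for a real broker (illustration only)."""
    max_attempts: int = 3
    queue: Deque[Message] = field(default_factory=deque)
    dead_letters: List[Message] = field(default_factory=list)

    def publish(self, payload: dict) -> None:
        # The producer returns immediately; it never calls the consumer directly.
        self.queue.append(Message(payload))

    def drain(self, handler: Callable[[dict], None]) -> None:
        # Deliver every queued message, re-queueing failures until max_attempts.
        while self.queue:
            message = self.queue.popleft()
            message.attempts += 1
            try:
                handler(message.payload)
            except Exception:
                if message.attempts < self.max_attempts:
                    self.queue.append(message)  # retry later
                else:
                    self.dead_letters.append(message)  # park for inspection


processed: List[int] = []
seen: Dict[int, int] = {}


def flaky_handler(payload: dict) -> None:
    # Simulated consumer: fails the first time it sees each item,
    # and always rejects item 2.
    seen[payload["id"]] = seen.get(payload["id"], 0) + 1
    if payload["id"] == 2 or seen[payload["id"]] == 1:
        raise RuntimeError("downstream unavailable")
    processed.append(payload["id"])


bus = InMemoryBus()
for item_id in (1, 2, 3):
    bus.publish({"id": item_id})
bus.drain(flaky_handler)
```

After `drain`, items 1 and 3 have been processed on their second attempt, and item 2 sits in `bus.dead_letters` after three attempts, where someone can look at it without blocking everything else.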

Replay your events.

As I said, you will get it wrong. If your input data comes from people operating other systems, like an external CRM or CMS, you don’t want to ask them to resave 1000 items three times a day just because your team needs to reprocess last week’s data. Store the incoming events in your own platform and build the ability to replay them, so you have more control internally.
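One way to picture this is an append-only log of the raw events you receive, plus a function that feeds them back through your processing code. The sketch below assumes a JSON Lines file as storage; `EventLog`, its methods and the sample events are hypothetical, and a real platform might keep the events in a database or in blob storage instead.

```python
import json
import tempfile
from datetime import datetime, timezone
from pathlib import Path
from typing import Callable, Iterator


class EventLog:
    """Append-only JSON Lines log of raw inbound events (hypothetical helper)."""

    def __init__(self, path: Path) -> None:
        self.path = path

    def append(self, source: str, payload: dict, received_at: datetime) -> None:
        # Store the event exactly as received, before any processing.
        record = {
            "source": source,
            "received_at": received_at.isoformat(),
            "payload": payload,
        }
        with self.path.open("a", encoding="utf-8") as handle:
            handle.write(json.dumps(record) + "\n")

    def read(self, since: datetime) -> Iterator[dict]:
        # Yield stored events received at or after `since`, in arrival order.
        with self.path.open(encoding="utf-8") as handle:
            for line in handle:
                record = json.loads(line)
                if datetime.fromisoformat(record["received_at"]) >= since:
                    yield record

    def replay(self, since: datetime, handler: Callable[[dict], None]) -> int:
        # Push stored payloads back through the processing code.
        count = 0
        for record in self.read(since):
            handler(record["payload"])
            count += 1
        return count


log = EventLog(Path(tempfile.mkdtemp()) / "events.jsonl")
log.append("crm", {"id": 1}, datetime(2020, 10, 12, tzinfo=timezone.utc))
log.append("crm", {"id": 2}, datetime(2020, 10, 14, tzinfo=timezone.utc))
log.append("cms", {"id": 3}, datetime(2020, 10, 16, tzinfo=timezone.utc))

replayed = []
count = log.replay(datetime(2020, 10, 13, tzinfo=timezone.utc), replayed.append)
```

Here the replay picks up the two events received from October 13 onwards, and nobody in the CRM or CMS team had to resave anything.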

Build a cockpit.

Would you jump onto a plane that is flown without any sort of environmental feedback? Make sure you build an operational cockpit to be able to fly your software with guarantees: monitors displaying operational metrics, proactive alerts and a good logging system are some of the aspects you will need to consider on every platform you build.
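As a tiny illustration of a proactive alert, the sketch below tracks the outcome of the most recent operations and records an alert once the failure ratio crosses a threshold. `ErrorRateMonitor`, the window size and the threshold are all assumptions made up for this example; in practice you would most likely rely on your monitoring stack rather than hand-rolled code.

```python
from collections import deque
from typing import Deque, List


class ErrorRateMonitor:
    """Watches the last `window` operations and records an alert when the
    failure ratio reaches `threshold` (hypothetical helper)."""

    def __init__(self, window: int = 100, threshold: float = 0.2) -> None:
        self.outcomes: Deque[bool] = deque(maxlen=window)
        self.threshold = threshold
        self.alerts: List[str] = []

    def error_rate(self) -> float:
        if not self.outcomes:
            return 0.0
        return self.outcomes.count(False) / len(self.outcomes)

    def record(self, ok: bool) -> None:
        self.outcomes.append(ok)
        rate = self.error_rate()
        # Only alert once the window is full, to avoid noise at startup.
        if len(self.outcomes) == self.outcomes.maxlen and rate >= self.threshold:
            self.alerts.append(
                f"error rate {rate:.0%} over last {len(self.outcomes)} operations"
            )


monitor = ErrorRateMonitor(window=10, threshold=0.25)
for ok in [True] * 8 + [False] * 3:
    monitor.record(ok)
```

With eight successes followed by three failures, the third failure pushes the window to three failures out of ten and a single alert is recorded. A real cockpit would send that alert to a person or a pager instead of appending it to a list.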

Keep the dev team close to the problem.

To understand how concepts like agile, delivery and DevOps play together, we first need to understand what the value of a software team is. Its main value is solving real business problems, usually by delivering working software into a running production environment frequently and sustainably.

To achieve this, developers need to stay very close to the problem. They need to talk directly to the stakeholders, work collaboratively with the product owner, and be able to operate, release, roll back and watch the software they build. This increases accountability and reduces communication overhead (which usually brings delays).

Do not trust 3rd parties. Ever.

“This old system never failed in 10 years”. This is what I heard from a customer when we started a rebuild of their legacy platform. Well, guess what? It did fail.

Always make sure you assess the old systems you will depend on, and take into account technical constraints like rate limits, payload sizes…

Also consider how to process data when these systems fail, because they might: a token revoked without prior notice, a 500 error, a memory leak, or an API contract that is no longer respected. Plan for failure.
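A common way to plan for transient failures is to retry with exponential backoff and give up loudly after a few attempts. The sketch below assumes a hypothetical `ThirdPartyError` raised by your own client code, and `flaky_legacy_api` simulates an external system that returns two 500 errors before recovering. The `sleep` parameter is injectable so the example runs instantly.

```python
import time
from typing import Callable, List, TypeVar

T = TypeVar("T")


class ThirdPartyError(Exception):
    """Raised by the (hypothetical) client when the external system misbehaves."""


def call_with_retries(
    operation: Callable[[], T],
    max_attempts: int = 4,
    base_delay: float = 0.5,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Retry `operation` with exponential backoff; re-raise after the last attempt."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(1, max_attempts + 1):
        try:
            return operation()
        except ThirdPartyError:
            if attempt == max_attempts:
                raise  # out of attempts: let the caller park or report the item
            sleep(base_delay * 2 ** (attempt - 1))  # 0.5s, 1s, 2s, ...
    raise AssertionError("unreachable")


calls = {"count": 0}
delays: List[float] = []


def flaky_legacy_api() -> str:
    # Simulated third party: two failures, then a success.
    calls["count"] += 1
    if calls["count"] < 3:
        raise ThirdPartyError("HTTP 500")
    return "ok"


result = call_with_retries(flaky_legacy_api, sleep=delays.append)
```

Retrying only helps with failures that can go away on their own. A revoked token or a broken API contract will fail on every attempt, so those cases are better surfaced to a human right away, with the unprocessed data kept somewhere safe.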

Figure out where your bottlenecks are.

When you’re building a new platform (especially if it needs to integrate with third parties), always run performance tests that exercise these external systems before going live. Which areas of your platform will suffer first as load grows? Focus on making those areas stronger or they will bite you sooner or later.

In one of the legacy migrations we recently implemented, we noticed the old system would take 6 seconds to digest data. Because we ran performance tests in time, we could change the architecture, offloading that process to the background so users wouldn’t be affected by this legacy constraint.
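The general shape of that change, not our actual implementation, can be sketched with Python's standard `concurrent.futures` module. `slow_legacy_digest` and `handle_request` are hypothetical names, and the delay is shortened to half a second so the example runs quickly.

```python
import time
from concurrent.futures import Future, ThreadPoolExecutor

executor = ThreadPoolExecutor(max_workers=2)


def slow_legacy_digest(item: dict) -> dict:
    # Stand-in for the slow legacy call (6 seconds in our case, 0.5 here).
    time.sleep(0.5)
    return {**item, "digested": True}


def handle_request(item: dict) -> Future:
    # Accept the item and return right away; the digest runs on a worker thread.
    return executor.submit(slow_legacy_digest, item)


started = time.monotonic()
future = handle_request({"id": 42})
accepted_after = time.monotonic() - started  # time the user actually waited

result = future.result(timeout=5)  # the background work finishes later
executor.shutdown()
```

The user-facing call returns as soon as the work is queued, while the slow digest completes on its own schedule. In a distributed setup the same idea is usually implemented with a queue and separate worker processes rather than threads in one process.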

Hopefully these tips will be valuable to you! What other lessons from your own experience would you add?