What indicators would suggest to you that an architecture is too complex and not resilient enough to bounce back from a failure?


CIO, 21 days ago

One indicator is that it's difficult to track the system's performance, identify issues, or diagnose failures.

VP of Engineering, 20 days ago

An indicator of too much complexity might be how difficult it is to make changes. Is feature development slowing down over time? Is it common for changes in one area to introduce bugs in another? How many teams must be involved in making a typical change?

Regarding resiliency, how long does it take to do a deployment? Has the team practiced rollbacks and have they been successful? If there are processes that might need to be re-run in a failure scenario, does that happen automatically or manually? Are those processes idempotent or will they create duplicate records/transactions?
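
To make the idempotency question concrete, here is a minimal sketch of a re-runnable process that does not create duplicate records. Everything in it is an assumption for illustration: the `PaymentLedger` class, the `idempotency_key` parameter, and the in-memory dictionary standing in for a real datastore.

```python
from dataclasses import dataclass, field


@dataclass
class PaymentLedger:
    """Hypothetical in-memory stand-in for a real datastore."""
    processed: dict = field(default_factory=dict)  # idempotency key -> stored record

    def apply(self, idempotency_key: str, amount_cents: int) -> dict:
        # If this key was already handled, return the stored record
        # instead of writing a second one.
        if idempotency_key in self.processed:
            return self.processed[idempotency_key]
        record = {"key": idempotency_key, "amount_cents": amount_cents}
        self.processed[idempotency_key] = record
        return record


ledger = PaymentLedger()
ledger.apply("order-1001", 2500)
ledger.apply("order-1001", 2500)  # re-run after a simulated failure
print(len(ledger.processed))      # number of records actually stored
```

If a re-run of the same job leaves the record count unchanged, the process can be retried automatically after a failure; if the count grows, a manual cleanup step is probably hiding in the recovery path.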

Director of IT in Manufacturing, 16 days ago

I would say that MTTR (Mean Time To Repair) is a fairly good indicator of the inherent complexity of an architecture, since that complexity makes operations complex or, in some cases, outright complicated.
The shorter the MTTR, the more quickly the infrastructure recovers from failures. Resilient infrastructures are typically self-healing, and self-healing capabilities can push the MTTR down to hundreds of milliseconds, often even lower.
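
As a rough illustration of how MTTR could be tracked, the sketch below averages the time between detection and restoration over a list of incidents. The function name `mean_time_to_repair` and the sample timestamps are hypothetical; in practice the data would come from an incident-tracking tool.

```python
from datetime import datetime, timedelta


def mean_time_to_repair(incidents):
    """Average repair duration over (detected_at, restored_at) pairs."""
    if not incidents:
        raise ValueError("at least one incident is required")
    total = sum(
        (restored - detected for detected, restored in incidents),
        timedelta(),  # start value so the sum works on timedelta objects
    )
    return total / len(incidents)


# Hypothetical sample data: two incidents lasting 20 and 40 minutes.
incidents = [
    (datetime(2024, 1, 5, 10, 0), datetime(2024, 1, 5, 10, 20)),
    (datetime(2024, 2, 9, 14, 0), datetime(2024, 2, 9, 14, 40)),
]
mttr = mean_time_to_repair(incidents)
print(mttr)
```

Watching this value over time, rather than as a single number, is what makes it useful as a complexity signal: a rising trend suggests failures are getting harder to diagnose or repair.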

Director of Engineering in Healthcare and Biotech, 15 days ago

I would add that external dependencies are something to consider. We have elite DORA metrics, but we lost an industry-wide provider this year. One of our competitors just reported a significant financial loss because they were unable to switch providers in a timely manner.

Strategic Banking IT advisor in Banking, 14 days ago

My answer may sound off track, but it is still a first step in raising awareness:

A few years ago, we ran a half-day simulation that brought together all the "normal" people who would be involved if a crisis happened.

The scenario was: a plane crashed into our main datacenter facility. Everything burned.

Someone got called by the lead manager of incident management.

"How do we restart our primary services (ATM, Branches, Online Banking, Payments, Call Centers, etc.)?"

Then we watched everyone explain, at the appropriate moment, the actions they would take.

The conclusion: NO ONE had mastered the END-TO-END infrastructure required. For example, everyone assumed that IP addresses would be redirected to our DR site. Then, after some questions, someone realized that manual interventions were required, and so on.

In the end, everyone agreed (including execs) that we could do better on this and that better documentation was required. At the same time, we've worked on the architecture itself to increase its resiliency. Moving crucial elements to the cloud is part of this work.

Hope it makes a little sense.