When reporting system uptime for a SaaS product, how are you handling partial impact or intermittent issues, where the system is running but some customers are experiencing issues while others are unimpacted? Is there a best practice for determining what is reported as "down" in system uptime metrics? With all of the redundancies built in, the application we're supporting is rarely truly down; most issues fall into this partial/intermittent category, where the impacted users/customers are limited. Unfortunately, it can often be challenging to identify exactly which customers had issues and for how long. How have you handled similar issues?
Chief Technology Officer in Software, a year ago
Defined P1, P2, P3, P4:
P1 - System down for all users
P2 - System up but critical functionality down for more than 5% of users
P3 - System up but critical functionality down for less than 1% of users
P4 - Feature issue.
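As a rough illustration, the P1 to P4 scheme above could be encoded as a small classifier. This is only a sketch: the function name `classify`, its parameters, and the `Severity` enum are hypothetical, and the scheme as posted does not say how to treat incidents affecting between 1% and 5% of users, so the `gap_severity` default of P2 is an assumption to replace with your own rule.

```python
from enum import Enum


class Severity(Enum):
    P1 = "System down for all users"
    P2 = "Critical functionality down for more than 5% of users"
    P3 = "Critical functionality down for less than 1% of users"
    P4 = "Feature issue"


def classify(system_down: bool, critical_impacted: bool, pct_users: float,
             gap_severity: Severity = Severity.P2) -> Severity:
    """Map an incident onto the P1-P4 levels.

    pct_users is the percentage (0-100) of users affected.
    The posted scheme leaves the 1%-5% band undefined; gap_severity is an
    assumed placeholder for that band, not part of the original scheme.
    """
    if system_down:
        return Severity.P1
    if not critical_impacted:
        return Severity.P4
    if pct_users > 5:
        return Severity.P2
    if pct_users < 1:
        return Severity.P3
    return gap_severity  # assumption: 1%-5% band


# Example: critical functionality is down for 12% of users.
print(classify(system_down=False, critical_impacted=True, pct_users=12.0).name)
```

Only the levels you decide to count as "down" (for example P1 and P2) would then feed the uptime figure.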
SVP - Software Engineering in Finance (non-banking), a year ago
This is tough to measure and not easy to fit into a typical uptime measure, but we use third-party tools like AlertSite to check uptime independently of whatever measures our vendor provides, hitting these systems every 5 minutes. It might not catch every downtime, but if there truly are frequent intermittent system issues, the independent checks are frequent enough that patterns will eventually show up over time.
Manager in Construction, a year ago
I'd agree with the other comment here, describing the key issue types and % impact on the user base. If it is affecting a small number of users, then it's a P3.
P1 - System down for all users
P2 - System up but critical functionality down for more than 5% of users
P3 - System up but critical functionality down for less than 1% of users
P4 - Feature issue.
VP of IT in Software, a year ago
For the most part, we count a partial outage as a full outage unless the partial affects only very limited functionality. For intermittent disruptions, if it's very up and down we count the entire period; otherwise we evaluate it on a case-by-case basis and may count a large percentage of the time as outage time.
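To make the "count the entire period" rule for very up-and-down incidents concrete, here is a minimal sketch. It assumes outage windows are already known as (start, end) pairs that fall inside the reporting period; the function names and the 15-minute `max_gap` threshold are hypothetical choices, not something stated in the thread.

```python
from datetime import datetime, timedelta


def merge_flapping(outages, max_gap=timedelta(minutes=15)):
    """Merge outage windows separated by recoveries of max_gap or less.

    outages: iterable of (start, end) datetime pairs.
    max_gap is an assumed threshold for what counts as "very up and down".
    """
    merged = []
    for start, end in sorted(outages):
        if merged and start - merged[-1][1] <= max_gap:
            # Short recovery: treat it as part of the previous outage.
            prev_start, prev_end = merged[-1]
            merged[-1] = (prev_start, max(prev_end, end))
        else:
            merged.append((start, end))
    return merged


def uptime_percent(outages, period_start, period_end,
                   max_gap=timedelta(minutes=15)):
    """Uptime for the period, counting flapping stretches as fully down."""
    total = period_end - period_start
    down = sum((end - start for start, end in merge_flapping(outages, max_gap)),
               timedelta())
    return 100.0 * (1 - down / total)


# Example: three short blips five minutes apart, plus one isolated outage.
day = datetime(2024, 4, 3)
blips = [
    (day.replace(hour=10, minute=0), day.replace(hour=10, minute=5)),
    (day.replace(hour=10, minute=10), day.replace(hour=10, minute=20)),
    (day.replace(hour=10, minute=25), day.replace(hour=10, minute=30)),
    (datetime(2024, 4, 20, 14, 0), datetime(2024, 4, 20, 14, 10)),
]
print(uptime_percent(blips, datetime(2024, 4, 1), datetime(2024, 5, 1)))
```

In this example the three blips are reported as one 30-minute outage rather than 20 minutes of scattered downtime. Disruptions that do not meet the threshold would still be left to the case-by-case evaluation described above.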
When it comes to the "Up/Down" judgement call, an Operations Director had to apply wisdom to make the metric decision, and they needed to be able to defend that decision in the monthly metrics.