Saturday, July 31, 2021

Cascading Failures

This article, https://www.infoq.com/articles/anatomy-cascading-failure/, lists six anti-patterns behind cascading failures. After reading it, I think we do have some of these anti-patterns in our system.

  1. Accepting unbounded numbers of incoming requests
  2. Dangerous client retry behavior
  3. Crashing on bad input — the ‘Query of Death’
  4. Proximity-based failover and the domino effect
  5. Work prompted by failure
  6. Long startup times 

Cascading failures are failures that involve a vicious cycle or a feedback loop. A system in cascading failure won't self-heal; it will only be restored through human intervention. Now let's look at the incidents and outages we had in the past that were cascading failures.

Incident 1: The client sent unnecessary API calls, polling to list meetings, which overloaded the backend system and then impacted start/join meeting.

Setting a limit on the load each instance of your service will accept is very important. Since we control the client, we should also avoid polling to be kind to the backend.
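To make this concrete, here is a minimal sketch of a per-instance request gate. The class name, the `maxInFlight` value, and the idea of mapping rejection to an HTTP 429 are my assumptions, not our actual implementation:

```java
import java.util.concurrent.Semaphore;
import java.util.function.Supplier;

/** Caps the number of requests a single instance will work on at once. */
public class BoundedRequestGate {
    private final Semaphore permits;

    public BoundedRequestGate(int maxInFlight) {
        this.permits = new Semaphore(maxInFlight);
    }

    /**
     * Runs the handler if capacity is available; otherwise rejects immediately
     * so the caller can return HTTP 429 instead of queueing unbounded work.
     */
    public <T> T handle(Supplier<T> handler, Supplier<T> rejected) {
        if (!permits.tryAcquire()) {
            return rejected.get();   // shed load instead of piling it up
        }
        try {
            return handler.get();
        } finally {
            permits.release();
        }
    }
}
```

Rejecting early keeps the queue short, so the instance keeps serving the requests it has already accepted instead of slowly timing out on all of them.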

Incident 2: The healthcheck API detected an error, retried 3 times, and then restarted the pods. The error was caused by a network blip, but the consecutive retries had no exponential back-off mechanism.

In this case, turning off the health checks (health checks, liveness checks, readiness checks, whatever you call them) can really help. This applies both to your load balancer systems and your orchestration systems. The way to reduce this feedback loop is to delay replication, because sometimes the failure is transient, or to use a token bucket algorithm that limits the amount of in-flight replication, putting a brake on the reinforcing loop and preventing the feedback cycle from running away.
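Here is a minimal token bucket sketch, with an assumed capacity and refill rate (not our production settings), that could cap how many pods get restarted, or how much replication gets kicked off, in response to failed health checks:

```java
/**
 * Before restarting (or re-replicating) anything in response to a failed
 * health check, take a token; if none are available, skip this round.
 */
public class TokenBucket {
    private final long capacity;
    private final double refillPerSecond;
    private double tokens;
    private long lastRefillNanos;

    public TokenBucket(long capacity, double refillPerSecond) {
        this.capacity = capacity;
        this.refillPerSecond = refillPerSecond;
        this.tokens = capacity;
        this.lastRefillNanos = System.nanoTime();
    }

    public synchronized boolean tryAcquire() {
        long now = System.nanoTime();
        double elapsedSeconds = (now - lastRefillNanos) / 1_000_000_000.0;
        tokens = Math.min(capacity, tokens + elapsedSeconds * refillPerSecond);
        lastRefillNanos = now;
        if (tokens >= 1.0) {
            tokens -= 1.0;
            return true;    // OK to restart / replicate one unit of work
        }
        return false;       // brake the feedback loop: do nothing this round
    }
}
```

With, say, `new TokenBucket(2, 0.1)`, at most two restarts can happen in a burst and then only one more every ten seconds, which gives a transient network blip time to pass.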

Incident 3: Listening to a voicemail but hanging up in the middle could crash the recording service. This was a code bug, so it eventually crashed all the servers. Unfortunately, these servers were not properly monitored.

Avoid crashing on bad input or unexpected user behavior. If you allow a request to your system to cause a crash, that means a single request can reduce the capacity of your service. Practical fuzz testing can be really helpful for detecting unintentional or intentional crash-inducing inputs.
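Below is a hand-rolled fuzzing sketch (not a real fuzzing framework such as Jazzer); the `parserUnderTest` hook is a hypothetical stand-in for whatever handles the raw recording/voicemail input:

```java
import java.util.Random;
import java.util.function.Consumer;

/**
 * Feeds random strings into the code under test and checks that bad input is
 * rejected with an expected exception instead of crashing the process.
 */
public class CrashInputFuzzer {
    public static void fuzz(Consumer<String> parserUnderTest, int iterations) {
        Random random = new Random(42);              // fixed seed => reproducible
        for (int i = 0; i < iterations; i++) {
            StringBuilder input = new StringBuilder();
            int length = random.nextInt(256);
            for (int j = 0; j < length; j++) {
                input.append((char) random.nextInt(0x10000));   // arbitrary chars
            }
            try {
                parserUnderTest.accept(input.toString());
            } catch (IllegalArgumentException expected) {
                // Rejecting bad input is fine; anything else propagates and
                // flags a "query of death" candidate before it reaches prod.
            }
        }
    }

    public static void main(String[] args) {
        // Hypothetical hook: plug in the real input handler here.
        fuzz(input -> { /* e.g. recordingService.handle(input) */ }, 100_000);
    }
}
```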

Incident 4: Elasticsearch was upgraded from 7.1 to 7.9 and then had a performance issue: search latency suddenly increased 10x. The slow ES added latency to the web service, which quickly used up all of its Tomcat threads because there was no circuit breaker in the web service (which is a client in this scenario). After all HTTP threads were used up, the web application failed, so customers could not access the web portal.

A circuit breaker is very good because it protects backend services in overload while still allowing fast retries in the common case where just one or two backends are overloaded. Client retries can easily go exponential; that's a really good reason to limit the number of retries. Use exponential backoff, so 100 milliseconds, 200 milliseconds, 400 milliseconds, and so on. Jitter your retries; jitter just smears the excess retry load over time.
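Here is a minimal sketch of a retry helper with capped attempts, exponential backoff, and full jitter. In the web-service-to-ES path this would sit behind a circuit breaker (for example, a library like Resilience4j); the attempt count and base delay below are assumptions:

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ThreadLocalRandom;

/** Capped retries with exponential backoff and full jitter. */
public class RetryWithBackoff {
    public static <T> T call(Callable<T> op, int maxAttempts, long baseDelayMillis)
            throws Exception {
        Exception last = null;
        for (int attempt = 0; attempt < maxAttempts; attempt++) {
            try {
                return op.call();
            } catch (Exception e) {
                last = e;
                if (attempt == maxAttempts - 1) {
                    break;                       // out of attempts, give up
                }
                // Backoff ceiling doubles: 100 ms, 200 ms, 400 ms, ...
                long ceiling = baseDelayMillis << attempt;
                // Full jitter: sleep a random time in [0, ceiling] so a fleet
                // of clients does not retry in lockstep.
                Thread.sleep(ThreadLocalRandom.current().nextLong(ceiling + 1));
            }
        }
        throw last;
    }

    public static void main(String[] args) throws Exception {
        // Usage sketch; search() stands in for the hypothetical ES call.
        // String result = call(() -> search("meetings"), 3, 100);
    }
}
```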

Incident 5: The server added a SQL query that didn't use a proper index and caused a full table scan. The query was very slow and held one DB connection for a long time. Because connections were not released quickly, all DB connections were soon used up. The healthcheck API detected DB timeouts, retried, and then restarted the pods. Restarting did not solve the problem because all DB connections were still in use; human intervention was needed.

We took the backend out of DNS, added the DB index, and reintroduced traffic to bring the service back. The other typical way to do it is to use some mechanism to block the clients' ability to connect.
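To keep one bad query from holding connections indefinitely, both the pool size and the per-query time can be bounded. Here is a sketch using HikariCP; the pool size, timeouts, table, and query are illustrative assumptions, not our schema or settings:

```java
import com.zaxxer.hikari.HikariConfig;
import com.zaxxer.hikari.HikariDataSource;
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.ResultSet;

/** Bounds both the pool size and how long any single query may hold a connection. */
public class MeetingDao {
    private final HikariDataSource pool;

    public MeetingDao(String jdbcUrl) {
        HikariConfig config = new HikariConfig();
        config.setJdbcUrl(jdbcUrl);
        config.setMaximumPoolSize(20);       // size to what the DB can handle
        config.setConnectionTimeout(3_000);  // ms; fail fast instead of queueing forever
        this.pool = new HikariDataSource(config);
    }

    public int countMeetings(String hostId) throws Exception {
        try (Connection conn = pool.getConnection();
             PreparedStatement stmt = conn.prepareStatement(
                     "SELECT COUNT(*) FROM meetings WHERE host_id = ?")) {
            stmt.setQueryTimeout(5);         // seconds; a full table scan gets cut off
            stmt.setString(1, hostId);
            try (ResultSet rs = stmt.executeQuery()) {
                rs.next();
                return rs.getInt(1);
            }
        }
    }
}
```

With a query timeout in place, the unindexed full table scan would have failed fast and released its connection instead of starving every other request.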

Incident 6: We have not hit this one, but our DR (disaster recovery) design does use proximity-based failover, which is not necessarily an anti-pattern in itself. We have a geographically distributed service, and the setup lets the system fail over from a failed region to another region.

If you have this pattern, you want to think about imposing maximum capacities, or really overprovisioning, to deal with the possible implications of this kind of failure. That is why we reserve 50% capacity for possible DR in terms of registrations and concurrent sessions. The current challenges are ongoing calls and RPS when failing over to the DR region.
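As a sketch, the 50% headroom rule could be expressed as an admission check like the one below; the class, the session-count input, and the `drMode` flag are illustrative assumptions, not our actual implementation:

```java
/**
 * In normal operation a region only admits up to half of its provisioned
 * capacity, so the surviving region can absorb a failed peer without
 * tipping into overload itself.
 */
public class RegionAdmission {
    private final long provisionedCapacity;   // e.g. max concurrent sessions the region can hold
    private final double normalShare;         // fraction usable outside of DR, 0.5 in our policy

    public RegionAdmission(long provisionedCapacity, double normalShare) {
        this.provisionedCapacity = provisionedCapacity;
        this.normalShare = normalShare;
    }

    /** drMode is true once the peer region has failed over to us. */
    public boolean admit(long currentSessions, boolean drMode) {
        long limit = drMode
                ? provisionedCapacity
                : (long) (provisionedCapacity * normalShare);
        return currentSessions < limit;
    }
}
```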
