Gotta Catch ’Em All

https://blog.twitter.com/engineering/en_us/topics/insights/2019/gotta-catch–em-all.html

This article is a follow on of thoughts related to The Testing Renaissance article.

If you have ever worked in a server architecture before, then odds are that you have come across many different styles of error handling (especially within the same server). This is very common in legacy code bases where many engineers have come and gone, influencing the code with their own techniques and styles.

From our experience, if you can design your core system around a catch-all architecture, then it is trivial to reason about the success and failure of that system. Additionally, writing tests (especially TDD) is much easier and more natural. The lead time from a customer-reported bug to a regression test capturing that bug is extremely fast. Most importantly, overall confidence in your system is high because testing is easier and error handling is in a centralized location.

Our main goal in this article is to describe an error handling design that we have successfully utilized in our own servers to simplify and improve the readability of our systems.

Definition

  • Controllers – Where our endpoints are implemented 
  • Filters – Request/Response wrappers around our controllers.Filters allow pre- and post-handling of a client’s request and response for example, request logging, statistics reporting, etc.
  • Statistics reporting (AKA stats) – Things like requests per minute or percentage of successful requests (success rate) or even counts of specific failures
  • Error handling – The handling of exceptional cases; perhaps a service failed for an unexpected reason like a network failure

Advantages

  • Nearly all error handling code around the services you depend on disappears.
  • You’re happy, and error logic paths are much easier to understand.  
  • More easily create systems with 100% success rates as in this example (even with millions of requests).

This Tweet is unavailable

As you can see in the charts above, 100% success rate and not a single failure in last 2 hours with requests per minute at 6 million! In fact, at the time of this writing, the last week had not had a single failure which is pretty impressive.

High Level Design

The high level design is a few steps:

  1. Prefer exceptions to response codes (A great example that demonstrates why follows later in the blog on both)
  2. Prefer putting stats recording in the top level filter (design shown later) rather than mixed in with business logic
  3. Categorizing BadClientRequestExceptions as not impacting your success rate (that’s the client’s problem, not yours)
  4. Record all exceptions as stats in the top-level filter for ease of monitoring. Our code looks like this: stats.getCounter(exc.getClass.getName).increment()

When our system’s success rate goes down, we look at the single graph from step 4 with all these exception stats and know exactly the issue immediately most of the time.

How-to

First, to kick this pattern off, you will need to create a few base exceptions. This allows many common components/filters to respond to common exceptions and reused in most servers.

  • Abstract class BadClientRequestException – Signifies the client sent a bad request to us and should not impact our success rate
  • Abstract class DependencyException – Signifies a system that we call is failing and should impact our success rate
  • Abstract class ValidDiscardingRequestException – Signifies the request was valid but is a shortcut to saying, “I am not going to complete this request”
  • All other exceptions – Signifies my server has a bug

Now, if you have an http server, you will have a few more such as:

  • NotFoundException
  • BadGatewayException
  • One exception per http response code (note that you generally subclass as sometimes you have many errors to one response code and want clarity in stats as to which one it was)

Some may disagree with having a normal case like NotFoundException in your system, but the advantages on the practical side have been ten-fold. I guess it’s whether you stick with theory or practicality on making your life easier. Customers (an operations team perhaps) generally don’t care about your theory and only care about you delivering a good experience, so we lean towards making the operations team happy.

As an example we have hit on 404, some systems just record number of 404s that happen. We instead throw a few different exceptions:

  • UserNotFoundException extends NotFoundException in the case where a user hits URL http://twitter.com/user/{userid} if the {userid} is not found
  • AccountNotFoundException extends NotFoundException in the case where an account is not found from URL http://twitter.com/account/{accountid}
  • PageNotFoundException where the general URL/route is not found at all

This in turn gives us statistics on these specific events happening while all still returning a 404 to the clients. It also records better in our logs about what specifically went wrong if a client asks us so we can distinguish between a client hitting the wrong route and using the wrong account id.

Next, let’s look at a picture of some reusable filters

This Tweet is unavailable

CatchAllFilter [Req, Resp]

The first filter is called CatchAllFilter and is reusable if the protocol exceptions it translates to are the same for every server (hopefully they are). It then has the simple job of translating any exceptions into the protocol exception to be returned to the client. In the case of http, there would be an HttpCatchAllFilter that translates all the base exceptions into the http status codes. That is completely reusable as well. The request and response are generically typed to work with any request/response type.

StatsFilter[Req, Resp]

This is where we get some huge power in our design. Many people add stats throughout their system, littering the code with more unnecessary code. Instead, this lets you do it all in one place. Ours also separates exceptions into ‘ignored failures’ and ‘failures’.  ‘Ignored failures’ are things like BadClientRequestException and ValidDiscardingRequestException. They do not impact our success rate, which is why we have systems at 1,000,000 rps that also have 100% success rates. The real beauty of StatsFilter though is in how it records stats like so

     stats.getCounter(exception.getClass).increment()

Notice that exception.getClass returns the subclass of BadClientRequestException (or a subclass of any of the base exceptions). This way, anytime someone adds a new subclass, we automatically get a new statistic of how often it is happening. Even for the success cases of ValidDiscardingException, we see things like DuplicateException or SettingNotOnForUserException so we can visually see all the stats. One of our failures graphs has many lines and we easily cursor over the offending spike, pinpointing the issue without looking at logs.

 

This Tweet is unavailable

And looking at the legend tells us the exact exception of what issue just occured in production.

This Tweet is unavailable

TxLoggingFilter[Req, Resp]  

Since we have systems that are doing tons of requests every second, frequently we want to log the result of every request and the result of the last seven days to one month for looking into what happened when we get inquiries. This filter typically logs to a persistent store the request and response where the response may have been an exception. Again, this is a reusable piece that can be configured to write to a different table each time.

Now, let’s say we did all this with response codes — many people will propose we do that instead, actually. Great, now you are going down the path of no reusability as each system’s response codes are different (and you can’t subclass those codes). The one thing that remains the same, however, is the base exceptions or categories they all fall under so while we have infinite base exceptions, we only have four abstract exceptions.

The other option is to just do response codes from server back to client and internally throw exceptions. In this case, however, each client following this pattern will have to create translations from response codes to exceptions for all their clients creating much more work for client teams. This is much better than doing a whole system internally with response codes, but also don’t like to give client teams more work.

Concrete Code

There is a big debate on response codes vs. exceptions, so let’s go with specific examples as in each case we have applied this to a system built on response codes, it has been quite a reduction in code and an increase in simplicity. The specific example below only talks to two services, but in reality, our systems talk to five or more. With five or more, you can imagine the complexity exploding to really see how much more complex code becomes.

Complex Code with Response Codes

Below is an example of the complexities

This Tweet is unavailable

Simplified Code with Exceptions

Note that we isolate ‘most’ exception handling in the filters making business logic very quick to ramp up on and maintain.

This Tweet is unavailable

Now, we can explain what the code above is doing. The first step is to issue a search query to get all the Tweet IDs to the search cluster. Once we have the Tweet IDs, we then need to issue a request to the Tweet store to get the actual Tweets. In our real system, we further need to get all the user IDs out of the Tweets and geo IDs and go to those systems as well. However, just by using two remote systems, we can easily see how much more complex our code is made by trying to use response codes. Next imagine if it were five systems and how the code grows exponentially.

Stats with Return Codes or Exceptions

Another issue that comes up with response codes and exceptions is monitoring those cases. In our team, we monitor every exception automatically. As seen above, it pops up with a human-readable name. Our stats graphs do not even know the name of the exception ahead of time. In a response code based system, we see the system littered with counters throughout the code to record all the different paths. With exceptions (including exceptions that mean success but discard the request), we record all those paths and isolate them into one ‘aspect’ of the system, a StatsFilter.scala class.

Imagine a world in which you further increase the complexity of the above ‘Response Code’ example. You now need to sprinkle more code throughout hiding your intention of what the business logic success path is really doing. It is very important for a developer to ramp on the success path first and then start understanding the failure cases after that.

The naming of ‘Exception’ is actually quite unfortunate. The way we use it is actually an ‘InterruptCode’. One example is that we have a UserSettingOffInterruptCode, though we call it UserSettingOffException(ick). In this case, a request comes in, and then deep in the logic we find out the user has a setting off. We don’t want to continue down the normal logic chain, so we short-circuit with a UserSettingOffException, which jumps all the way up to the filters and records the count of this code path happening. This is very useful to us as we

  • See how many notifications we tried to send but the user setting was turned off
  • Can easily add any response type interrupt exception and it shows up as a statistic without us doing any extra graphing work
  • Can easily choose to impact or not impact our success rates from deep within our system by the type of interrupt we subclass

Performance

Naturally constructing an exception for normal success cases can hurt performance a little as it fills in the stack trace.  In those cases, you can override the method fillInStackTrace with a no-op to bring the performance back to normal(ie. Very, very fast).  We actually use this on ValidDiscardingRequestException above and for each unique reason, we have a unique exception so, at the very least, we do not need the very last line in the stack trace as each unique exception is only thrown from one location in the code.

Advanced Cases

There are two cases where you may not be able to throw the exception all the way to the top level stats filter entry point into your server:

1 to N events

In this first case, your system intakes one request and sends N requests where one of those N events may fail, but you want the others to continue and succeed. One example is if you personally have a tablet, a phone, and an email, and we tell a server to send out a notification to all your devices. The first request is send this information. This fans out into three requests and N of those three can fail. In a case like this, we may throw a PartialFailureException or a FullFailureException back to the client that called us with the initial request. They can then inspect and decide what to do next.

Fail Open

When using this catch-all pattern, it is always best to fail closed to start with (keep your system simple, KISS) and only go to a fail open strategy if failing closed is truly causing issues.

In failing open, you now have more work to do. Here are some of the new issues now:

  • The exception won’t make it to the top level transaction logger, so you need to record it to make sure the information makes it to the top.
  • You need new stats and graph sections to monitor this that you did not need before as failures on this system won’t show up in the top level graphs.
  • You need new alerts/pages since we fail open, it will not impact your Success Rate.
  • You may/may not need to tell your customer in the response about the partial failure, depending on the system.

Disadvantages

One disadvantage is lack of documentation on all the possible exceptions a client can receive. Of course, if you fail open, there is no exception back to the client, so this only applies for exceptions that go back to the client. Perhaps there is some hidden exception you think you can recover from. Perhaps there is an exception that is so critical that you want to be paged. The way we deal with this today is by putting all the exceptions in a single file that we can refer clients to. Of course, clients don’t always know about that file, or even to ask for it, which is a downside. In the end, this trade-off definitely did not outweigh the benefit of this pattern.

Final Thoughts

In summary, we have presented a way to organize your business logic separate from all the failure cases, stats, and logging, keeping your code extremely clear and clean for the next developer you hire. We have even converted legacy projects to return exceptions rather than codes which worked out very well for us.  Please give it a try and feel free to contact us.

This Tweet is unavailable

Leave a Reply

Your email address will not be published. Required fields are marked *

Next Post

Banish the � with...

Thu Apr 4 , 2019
https://shkspr.mobi/blog/2019/04/banish-the-%ef%bf%bd-with-unifont/

You May Like