Saturday, November 22, 2014

Exception handling and cleanup

So there's been a bit of a slog recently on whether or not cleanup actions (e.g. stuff you'd put in the second argument to bracket or finally) should be run in an implicit uninterruptibleMask. I have argued that this is a desirable change. I'm writing this post mostly to lay out the logic I've used to reach (and support) this position, because it's a bit tricky (much like the issue itself).

A bit of background: currently, if an interruptible function is used within an exception handler, asynchronous exceptions may arrive and cause the handler to behave in unexpected ways. In particular, if an async exception arrives during an interruptible period in a cleanup action run by bracket[1], you can get resource leaks. This is noted in the documentation, but it can be surprisingly difficult to get right. Notably, this has resulted in several bugs in GHC itself (details in the thread).

Two arguments have thus far been presented against the proposal, one that the change is undesired and secondly that more caution should be exercised before making such a silent semantic change, especially as it's possible that currently-working (although likely resource-leaking) programs would deadlock. I reject the first argument (for reasons explained below) and have some sympathy for the second, although I believe the benefits outweigh the costs. I'll refer to the argument that this change is undesired as Undesired below.

Basically there are two cases of "complex" cleanup actions that have been under discussion. Here are naive (incorrect) examples of each:

cleanup1 (Structure h1 h2) = hClose h1 >> hClose h2
cleanup2 var = withMVar var >>= hClose

The first, cleanup1, is a handler that has to perform multiple, unrelated actions. The second, cleanup2, is a handler that has to perform a blocking call to access a resource, then close that resource.

Undesired argues that cleanup1 is better written hClose h1 `finally` hClose h2. I agree that is a better solution (although I tend to use nested brackets myself). That structure is nearly fully exception-safe (except hClose itself isn't async-safe), and it handles synchronous exceptions properly. In particular, uninterruptibleMask is neither necessary nor sufficient to write cleanup1 properly[2].

There is no way to safely write cleanup2 without using uninterruptibleMask[3]. It is possible to write a much more sophisticated handler that could eventually time out and possibly take an alternative action, but I don't believe I've ever seen an example of such code. Even then, it still wouldn't close the handle.

Undesired further argues that adding an uninterruptibleMask to bracket will tend to make people think that the naive handler in the first case is safe, even though it isn't, and therefore we shouldn't add it. Users should instead know to use finally appropriately in that case, and uninterruptibleMask appropriately in the second.

I reject this conclusion. I believe the second case is quite common, likely more common than the first. Both Handle and Process use this structure internally (and hence are both vulnerable to async-exception issues that can only be fixed with uninterruptibleMask, which their cleanup functions do not currently employ). I also think it's quite common in client code when there's a shared structure that requires cleanup. Such cleanups can only be handled safely via uninterruptibleMask, and in general use of that function should be discouraged.

If we want to be leak-free (in the second case, and with hClose), we need to call uninterruptibleMask somewhere. Where should we call it? The obvious choices are in the cleanup action (e.g. within hClose), within bracket, or manually by the user. I reject the last as being undesired and error-prone; we certainly don't want the common idiom to be .. `finally` (uninterruptibleMask_ (hClose h)). I believe within bracket is better than within hClose. First of all, we really only expect exception-safety in the context of a function like bracket. Having a bare hClose is already not regarded as a safe idiom, so having it be vulnerable to async exceptions in another manner should not be a problem. Given that, I believe it's better to change the semantics of bracket so that all such handlers will work properly, rather than requiring that Handle, Process, and others be fixed (and countless 3rd-party libraries audited. I have at least two of my own I need to audit.).

Some opponents of the proposal have presented various control-flow constructs built upon bracket that would lead to deadlock under the proposed semantics. If hClose et al used uninterruptibleMask internally, those control flows would not be expressible at all, unless a new function "interruptibleHClose" were added. Similarly for Process and 3rd-party code. If instead bracket is changed, then we'd need to preserve the current semantics as "interruptibleBracket". Anyone relying on this specific control flow would just need to change from bracket to interruptibleBracket to preserve the current semantics, instead of needing to change hClose, process termination functions, dbClose, etc.[4]

Changing bracket means we'll probably fix a bunch of currently-buggy code (anything like cleanup2 that doesn't use uninterruptibleMask, including Handle and Process cleanups), and it will be less work to adapt for anyone relying upon the current semantics.

As I mentioned above, IMHO there is merit to the argument that we should be cautious when making a silent change like this. However, I think the pain is unavoidable. The current situation is not async-safe, and cannot be made async-safe without changing semantics (i.e. adding uninterruptibleMask somewhere). Here are all the alternative possibilities that have thus far been discussed:

  1. Users need to be intimately familiar with exceptions and interruptible operations to write truly exception-safe code, including knowing when to wrap built-in functions in uninterruptibleMask (the status quo).
  2. Add uninterruptibleMask internally to hClose, certain process-related functions, and anywhere else similar "cleanup2". Anyone relying upon the current interruptible behavior must audit their code, and (if available) switch to interruptible variants. 3rd-party code must all be audited. Concerns about unkillable threads/deadlocks apply.[5]
  3. Add uninterruptibleMask internally to bracket, finally, etc. Anyone relying upon the current interruptible behavior must audit their code, and switch to interruptible variants. 3rd-party code will be async-safe, although if it's currently vulnerable to sync exceptions it will remain vulnerable. (the proposal). Concerns about unkillable threads/deadlocks apply.

Of these choices, I consider the first untenable. I simply do not believe we should expect this level of async exception sophistication from most users. I will gladly accept the possibility of unkillable threads in this context to achieve the greater exception safety of 2 or 3, of which I contend 3 is better. It places the least burden of knowledge on the greatest number of users, involves less code-change than the second option, and also leads to the greatest safety by default.

[1] From now on I'll just write "bracket", but will generally mean the whole cleanup-handler family of functions.

[2] Except uninterruptibleMask is the only way to make hClose fully async-exception-safe.

[3] Or its semantic equivalent. It's often possible to change from a blocking operation to looping over a non-blocking operation under mask, but this has the same effect of making the code unresponsive to async operations.

[4] Given the examples that have been presented thus far, I strongly believe that any user relying upon the interruptibility of cleanup operations understands the Haskell exception system intimately and therefore is in the best position to adapt to any semantic changes of exception-handling functions.

[5] I expect the extra risk from changing bracket to be uninterruptible to be minimal over changing hClose etc, as Handles are ubiquitous.