
    Akka.Persistence Failure Handling and Supervision

    The official Akka.NET documentation offers guidance on handling Akka.Persistence failures: use a BackoffSupervisor to recreate your PersistentActor, after a delay, in the event of a recovery or write failure.

    Here's an example of how a BackoffSupervisor might be created using just Akka.NET and Akka.Persistence:

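    // PersistentActor here stands in for your own concrete persistent actor type.
    var childProps = Props.Create<PersistentActor>();
    var props = BackoffSupervisor.Props(
        Backoff.OnStop(
            childProps,
            "myActor",                 // name of the child actor
            TimeSpan.FromSeconds(3),   // minimum backoff
            TimeSpan.FromSeconds(30),  // maximum backoff
            0.2));                     // random factor: adds up to 20% jitter to each delay
    var supervisor = MyActorSystem.ActorOf(props, name: "mySupervisor");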
    

    [Figure: Akka.Persistence working with a BackoffSupervisor]

    The BackoffSupervisor forwards its messages down to its child, the PersistentActor, which in turn does the actual work. But then disaster strikes: the database goes down.

    [Figure: Akka.Persistence working with a BackoffSupervisor when the database goes down]

    In Akka.Persistence, when the PersistentActor is unable to commit any of its events to the database via its Persist<TEvent>(TEvent @event, Action<TEvent> handler) methods, or is unable to recover its previous state from the database at startup, the actor will throw an exception and usually stop itself. The actor is effectively in a non-operable state, and attempting to do any further work with the failed PersistentActor would violate most developers' consistency requirements.

    Thus, the right thing to do is to kill and recreate the actor after waiting for a brief period of time to see if the database is available again. This is exactly what the BackoffSupervisor built into Akka.NET does.

    [Figure: Akka.Persistence working with a BackoffSupervisor to recreate the failed child upon database failure]

    Using the settings provided to the BackoffSupervisor by the end-user, the BackoffSupervisor will compute a randomized delay, one that grows with each consecutive failure between the configured minimum and maximum backoff, to determine when it's "safe" to recreate the child and have it try to recover its state from the database and resume processing events. In a situation where your database is buckling under high load and periodically becoming unavailable, the absolute worst recovery strategy is to immediately hammer the database with additional retry requests. This will exacerbate the issue and compound the load problems that caused the outage in the first place.

    A far better approach is to use a traditional "backoff-and-retry" strategy, which is what the BackoffSupervisor does - give the database a moment to get back up and running prior to retrying any previously failed requests, and then stagger those retry requests across a range of intervals rather than all at once.
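
    As a rough illustration of the backoff-and-retry idea, the sketch below computes a delay from the same kind of settings passed to Backoff.OnStop earlier (minimum backoff, maximum backoff, random factor). BackoffSketch and NextDelay are hypothetical names used only for this example, and the formula is an approximation of exponential backoff with jitter rather than a copy of the BackoffSupervisor's internals.

```csharp
using System;

// Hypothetical helper for illustration only; not part of Akka.NET.
public static class BackoffSketch
{
    // Returns the delay to wait before restart attempt `restartCount` (0-based).
    public static TimeSpan NextDelay(
        int restartCount,
        TimeSpan minBackoff,
        TimeSpan maxBackoff,
        double randomFactor,
        Random rng)
    {
        // Exponential growth: minBackoff * 2^restartCount, capped at maxBackoff.
        double exponentialMs = minBackoff.TotalMilliseconds * Math.Pow(2, restartCount);
        double cappedMs = Math.Min(maxBackoff.TotalMilliseconds, exponentialMs);

        // Jitter: stretch the delay by a random multiplier in [1, 1 + randomFactor)
        // so that many failed actors do not all retry at the same instant.
        double jitter = 1.0 + rng.NextDouble() * randomFactor;
        return TimeSpan.FromMilliseconds(cappedMs * jitter);
    }
}
```

    With a 3 second minimum, a 30 second maximum and a random factor of 0.2, the first retry in this sketch lands somewhere between 3 and 3.6 seconds, and later retries level off between 30 and 36 seconds because the jitter is applied after the cap.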

    Problems with BackoffSupervisor

    The approach that has been traditionally recommended by the Akka.NET and Akka teams for years is highly generalized and solid, but it has some major shortcomings:

    1. No ability to replay failed Persist calls;
    2. All messages sent to the BackoffSupervisor while the underlying PersistentActor is waiting to be recreated during the backoff period are lost; and
    3. Puts the onus onto the Akka.NET end-user to handle both of these scenarios.

    Enter the PersistenceSupervisor from the Akka.Persistence.Extras NuGet package.

    The PersistenceSupervisor works using more or less the same strategy as the BackoffSupervisor, but with some key differences:

    1. All PersistenceSupervisor actors keep track of which of the messages they receive are events (state changes) versus commands for their child PersistentActor - when an event is detected it is decorated with an IConfirmableMessage so it can be ACKed via a Confirmation message sent back to the PersistenceSupervisor from the PersistentActor.
    2. PersistenceSupervisors will buffer all events in the order in which they were received until receiving a Confirmation from their child - in the event of a write failure, after the child is killed and recreated, all buffered events will be replayed to the child in their original order. Once the child ACKs each event individually, those events will be freed from the PersistenceSupervisor's memory.
    3. While the underlying PersistentActor is waiting to be recreated, all messages sent to the PersistenceSupervisor will be buffered - only to be released once the child is recreated.
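
    The buffering behavior described in the list above can be pictured with a small bookkeeping class. This is only a sketch under assumptions: EventBufferSketch and its members are hypothetical names, it has no dependency on Akka.NET, and it is not how the PersistenceSupervisor is actually implemented. It shows the idea of assigning each event an id, holding it until it is confirmed, and listing whatever is still unconfirmed in original order for replay.

```csharp
using System.Collections.Generic;

// Hypothetical stand-in for the supervisor's bookkeeping; illustration only.
public sealed class EventBufferSketch
{
    // Keyed by delivery id, so enumeration follows the order of arrival.
    private readonly SortedDictionary<long, object> _unconfirmed =
        new SortedDictionary<long, object>();

    private long _lastDeliveryId;

    // Number of events still waiting for a confirmation.
    public int PendingCount => _unconfirmed.Count;

    // Buffers an event and returns the delivery id assigned to it.
    public long Buffer(object evt)
    {
        long id = ++_lastDeliveryId;
        _unconfirmed.Add(id, evt);
        return id;
    }

    // Frees a confirmed event; returns false if the id was unknown or already confirmed.
    public bool Confirm(long deliveryId) => _unconfirmed.Remove(deliveryId);

    // Events to resend after the child is recreated, in their original order.
    public IReadOnlyList<KeyValuePair<long, object>> PendingInOrder() =>
        new List<KeyValuePair<long, object>>(_unconfirmed);
}
```

    In this sketch, an event that is never confirmed stays in the buffer and is returned by PendingInOrder on every replay, which is why the child's acknowledgment matters so much.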

    Designing Persistent Actors to Work with PersistenceSupervisor

    The PersistenceSupervisor and the underlying persistent actor it creates need to be designed to work together as a tandem pair - blindly slapping a PersistenceSupervisor on top of a pre-existing UntypedPersistentActor or a ReceivePersistentActor is a recipe for bad times and tears.

    First, setting up the PersistenceSupervisor:

    var childProps = Props.Create(() => new WorkingPersistentActor("fuber"));
    var supervisor = PersistenceSupervisor.PropsFor(
        // Mapping function: wraps each event in an IConfirmableMessage carrying the id `l`.
        (o, l) =>
        {
            if (o is int i)
                return new WorkingPersistentActor.AddToCount(l, string.Empty, i);

            return new ConfirmableMessageEnvelope(l, string.Empty, o);
        },
        // Predicate: only int messages are treated as events.
        o => o is int,
        childProps,
        "myActor",
        strategy: SupervisorStrategy.StoppingStrategy.WithMaxNrOfRetries(100));

    var sup = Sys.ActorOf(supervisor, "fuber");
    

    The PersistenceSupervisor takes a configuration object that helps it determine which messages are events and which ones are not, but for convenience we hide this via the PersistenceSupervisor.PropsFor method. With this method you still need to supply the following arguments:

    1. A function of type Func<object, long, IConfirmableMessage> - this is the mapping function that the PersistenceSupervisor uses to wrap events inside a message of type IConfirmableMessage. Your persistent actors will need to be programmed to handle IConfirmableMessage and send a reply back of type Confirmation. The long passed into this method is the deliveryId, used to enable the PersistenceSupervisor to correlate each event prior to knowing the event's sequence number. That long needs to be assigned to the IConfirmableMessage.ConfirmationId property, which the persistent actor later echoes back in its Confirmation.
    2. A function of type Func<object, bool> - this predicate function is used to tell the PersistenceSupervisor which messages are events (when the function returns true) and which messages are stateless commands (when the function returns false).
    3. The Props of the persistent child actor that will be created and recreated in the event of an Akka.Persistence failure.
    4. The name of the child actor once it's created.
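
    To make the roles of the first two arguments concrete, here is a minimal sketch of how a predicate and a mapping function could cooperate. RoutingSketch, Wrapped and Route are hypothetical names invented for this example, and the mapping function returns a plain object instead of an IConfirmableMessage so the sketch has no dependency on Akka.Persistence.Extras. It assumes that messages the predicate rejects are passed along unchanged.

```csharp
using System;

// Hypothetical illustration of the predicate / mapping-function pairing.
public static class RoutingSketch
{
    // Stand-in for a confirmable wrapper around an event.
    public sealed class Wrapped
    {
        public Wrapped(long confirmationId, object payload)
        {
            ConfirmationId = confirmationId;
            Payload = payload;
        }

        public long ConfirmationId { get; }
        public object Payload { get; }
    }

    // Decides what would be handed to the child for a given incoming message.
    public static object Route(
        object message,
        Func<object, bool> isEvent,
        Func<object, long, object> makeConfirmable,
        ref long lastDeliveryId)
    {
        // Commands (predicate returns false) are passed along untouched.
        if (!isEvent(message))
            return message;

        // Events get the next delivery id and are wrapped by the mapping function.
        lastDeliveryId++;
        return makeConfirmable(message, lastDeliveryId);
    }
}
```

    With a predicate of o => o is int, an int message would come out of Route wrapped with the next id, while any other message would come out as-is.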

    Next, we're going to need to code our PersistentActor to be able to handle these additional message types introduced by the PersistenceSupervisor.

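    public class WorkingPersistentActor : ReceivePersistentActor
    {
        // Message types inferred from how they are used in this article;
        // the original listing does not show their definitions.
        public sealed class AddToCount : IConfirmableMessage
        {
            public AddToCount(long confirmationId, string senderId, int countValue)
            {
                ConfirmationId = confirmationId;
                SenderId = senderId;
                CountValue = countValue;
            }

            public long ConfirmationId { get; }
            public string SenderId { get; }
            public int CountValue { get; }
        }

        public sealed class GetCount { }

        private readonly ILoggingAdapter _log = Context.GetLogger();
        private int _currentCount;

        public WorkingPersistentActor(string persistenceId)
        {
            PersistenceId = persistenceId;

            // Replay previously persisted events during recovery.
            Recover<int>(i =>
            {
                _log.Info("Recovery: Adding [{0}] to current count [{1}] - new count: [{2}]", i, _currentCount,
                    _currentCount + i);
                _currentCount += i;
            });

            Recover<SnapshotOffer>(o =>
            {
                if (o.Snapshot is int i)
                {
                    _log.Info("Recovery: Setting initial count to [{0}]", i);
                    _currentCount = i;
                }
            });

            Command<AddToCount>(e =>
            {
                Persist(e.CountValue, iN =>
                {
                    _log.Info("Command: Adding [{0}] to current count [{1}] - new count: [{2}]", iN, _currentCount,
                        _currentCount + iN);
                    _currentCount += iN;

                    // ACK the message back to parent
                    Context.Parent.Tell(new Confirmation(e.ConfirmationId, PersistenceId));
                });
            });

            Command<GetCount>(g => { Sender.Tell(_currentCount); });
        }

        // ReceivePersistentActor requires PersistenceId to be overridden.
        public override string PersistenceId { get; }
    }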
    

    The most important bit of code is where the WorkingPersistentActor processes the AddToCount command, which is an IConfirmableMessage packaged by the PersistenceSupervisor:

    Command<AddToCount>(e =>
    {
        Persist(e.CountValue, iN =>
        {
            _log.Info("Command: Adding [{0}] to current count [{1}] - new count: [{2}]", iN, _currentCount,
                _currentCount + iN);
            _currentCount += iN;
    
            // ACK the message back to parent
            Context.Parent.Tell(new Confirmation(e.ConfirmationId, PersistenceId));
        });
    });
    

    Every persisted event should be decorated with an IConfirmableMessage of some kind - this is what drives the acknowledgment protocol that the PersistenceSupervisor uses to guarantee delivery:

    [Figure: Guaranteeing delivery of Akka.Persistence events with the Akka.Persistence.Extras.PersistenceSupervisor]

    If the WorkingPersistentActor never sent its Confirmation messages back, those events would be retained inside the parent actor's buffers indefinitely. Therefore, the Context.Parent.Tell(new Confirmation(e.ConfirmationId, PersistenceId)) call is absolutely critical to ensuring that this tandem pair works correctly.

    Taking the PersistenceSupervisor for a Test-Drive

    If you'd like to see the PersistenceSupervisor in action in a short sample application before using it in your own code, clone the Akka.Persistence.Extras GitHub repository and run the PersistenceSupervisor sample. This example uses an intentionally unreliable Akka.Persistence journal implementation that fails frequently upon both recovery and write, but as you'll see, the PersistentActor is still able to process all of its work in the end despite numerous failures. That's a testament to the power and reliability of the PersistenceSupervisor.

    Copyright © 2015-2018 Petabridge®