Akka.Persistence Failure Handling and Supervision
The official Akka.NET documentation has some guidance on how to handle Akka.Persistence failures. The recommended approach is to use a BackoffSupervisor to recreate your PersistentActor, after a period of time, in the event of a recovery or write failure.
Here's an example of how a BackoffSupervisor might be created using just Akka.NET and Akka.Persistence:
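// MyPersistentActor is a placeholder for your own PersistentActor subclass
// (the abstract PersistentActor base class cannot be instantiated directly).
var childProps = Props.Create<MyPersistentActor>();
var props = BackoffSupervisor.Props(
    Backoff.OnStop(
        childProps,
        "myActor",                 // name of the child actor
        TimeSpan.FromSeconds(3),   // minimum backoff
        TimeSpan.FromSeconds(30),  // maximum backoff
        0.2));                     // random factor: adds up to 20% jitter
var supervisor = MyActorSystem.ActorOf(props, name: "mySupervisor");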
The BackoffSupervisor forwards its messages down to its child, the PersistentActor, which in turn does the actual work. But then disaster strikes: the database goes down.
In Akka.Persistence, when the PersistentActor is unable to commit any of its events to the database via its Persist<TEvent>(TEvent @event, Action<TEvent> handler) methods, or if the actor is unable to recover its previous state from the database at startup, the actor will throw an exception and usually stop itself. The actor is, effectively, in a non-operable state, and attempting to do any further work with the failed PersistentActor would be a tremendous violation of most developers' consistency requirements.
Thus, the right thing to do is to kill and recreate the actor after waiting for a brief period of time to see if the database is available again. This is exactly what the BackoffSupervisor built into Akka.NET does.
Using the settings provided to the BackoffSupervisor by the end-user (a minimum backoff, a maximum backoff, and a random factor), the BackoffSupervisor will compute a randomized interval to determine when it's "safe" to recreate the child again and have it try to recover its state from the database and resume processing events. In a situation where your database is buckling under high loads and is periodically becoming unavailable, the absolute worst recovery strategy is to immediately hammer the database with additional retry requests. This will exacerbate the issue and compound the load problems that caused the outage in the first place.
A far better approach is to use a traditional "backoff-and-retry" strategy, which is what the BackoffSupervisor does - give the database a moment to get back up and running prior to retrying any previously failed requests, and then stagger those retry requests across a range of intervals rather than all at once.
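To make the "backoff-and-retry" idea concrete, here is a minimal sketch of how such a delay could be computed from the three settings shown earlier. It assumes exponential growth capped at the maximum backoff, plus random jitter. BackoffSketch and ComputeDelay are hypothetical names used for illustration only; they are not part of the Akka.NET API, and the real supervisor's internals may differ.

```csharp
using System;

// Hypothetical helper for illustration; not an Akka.NET type.
public static class BackoffSketch
{
    // Computes the wait before the child is recreated after `restartCount` prior restarts.
    public static TimeSpan ComputeDelay(
        int restartCount, TimeSpan minBackoff, TimeSpan maxBackoff,
        double randomFactor, Random rng)
    {
        // Double the delay on every restart, but never exceed maxBackoff.
        double exponential = minBackoff.TotalMilliseconds * Math.Pow(2, restartCount);
        double capped = Math.Min(exponential, maxBackoff.TotalMilliseconds);

        // Jitter: stretch the delay by a random amount up to randomFactor
        // (0.2 means up to +20%), so many actors do not all retry at once.
        double jitter = 1.0 + rng.NextDouble() * randomFactor;
        return TimeSpan.FromMilliseconds(capped * jitter);
    }
}
```

With the settings from the earlier example (3 seconds, 30 seconds, 0.2), the first restart would wait somewhere between 3 and 3.6 seconds, and once the doubled delay reaches the cap the wait stays between 30 and 36 seconds.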
Problems with BackoffSupervisor
The approach that has been traditionally recommended by the Akka.NET and Akka teams for years is highly generalized and solid, but it has some major shortcomings:
- No ability to replay failed Persist calls;
- All messages sent to the BackoffSupervisor while the underlying PersistentActor is waiting to be recreated during the backoff period are lost; and
- Puts the onus onto the Akka.NET end-user to handle both of these scenarios.
Enter the PersistenceSupervisor from the Akka.Persistence.Extras NuGet package.
The PersistenceSupervisor works using more or less the same strategy as the BackoffSupervisor, but with some key differences:
- All PersistenceSupervisor actors keep track of which of the messages they receive are events (state changes) versus commands for their child PersistentActor - when an event is detected it is decorated with an IConfirmableMessage so it can be ACKed via a Confirmation message sent back to the PersistenceSupervisor from the PersistentActor. PersistenceSupervisors will buffer all events in the order in which they were received UNTIL receiving a Confirmation from their child - in the event of a write failure, after the child is killed and recreated, all buffered events will be replayed to the child in their original order. Once the child ACKs each event individually, those events will be freed from the PersistenceSupervisor's memory.
- While the underlying PersistentActor is waiting to be recreated, all messages sent to the PersistenceSupervisor will be buffered - only to be released once the child is recreated.
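The buffer-until-ACKed bookkeeping described in the list above can be sketched in plain C#, without any actors. This is only an illustration of the idea, not the library's implementation. PendingEventBuffer and its members are hypothetical names.

```csharp
using System.Collections.Generic;
using System.Linq;

// Hypothetical stand-in for the supervisor's bookkeeping; not the library's actual code.
public sealed class PendingEventBuffer
{
    // Keyed by confirmation id, so enumeration follows the original arrival order.
    private readonly SortedDictionary<long, object> _pending =
        new SortedDictionary<long, object>();
    private long _nextId;

    // Assigns the next confirmation id and retains the event until it is ACKed.
    public long Track(object evt)
    {
        long id = ++_nextId;
        _pending[id] = evt;
        return id;
    }

    // Called when the child's Confirmation arrives; frees the event from memory.
    public bool Confirm(long confirmationId) => _pending.Remove(confirmationId);

    // After the child is recreated: everything still unconfirmed, in original order.
    public IReadOnlyList<object> Unconfirmed() => _pending.Values.ToList();

    public int Count => _pending.Count;
}
```

For example, after tracking three events and confirming only the second, Unconfirmed() would return the first and third events, in that order, ready to be replayed to the recreated child.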
Designing Persistent Actors to Work with PersistenceSupervisor
The PersistenceSupervisor and the underlying persistent actor it creates need to be designed to work together as a tandem pair - blindly slapping a PersistenceSupervisor on top of a pre-existing UntypedPersistentActor or a ReceivePersistentActor is a recipe for bad times and tears.
First, setting up the PersistenceSupervisor:
var childProps = Props.Create(() => new WorkingPersistentActor("fuber"));
var supervisor = PersistenceSupervisor.PropsFor((o, l) =>
    {
        // Mapping function: wrap each event in an IConfirmableMessage.
        // l is the id the supervisor assigns for correlating the ACK.
        if (o is int i)
            return new WorkingPersistentActor.AddToCount(l, string.Empty, i);
        return new ConfirmableMessageEnvelope(l, string.Empty, o);
    },
    o => o is int, // predicate: only ints are treated as events
    childProps, "myActor",
    strategy: SupervisorStrategy.StoppingStrategy.WithMaxNrOfRetries(100));
var sup = Sys.ActorOf(supervisor, "fuber");
The PersistenceSupervisor takes a configuration object that helps it determine which messages are events and which ones are not, but for convenience we hide this via the PersistenceSupervisor.PropsFor method - in this method you still need to supply the following arguments:
- A function of type Func<object, long, IConfirmableMessage> - this is the mapping function that the PersistenceSupervisor uses to wrap events inside a message of type IConfirmableMessage. Your persistent actors will need to be programmed to handle IConfirmableMessage and send a reply back of type Confirmation. The long passed into this method is the deliveryId, used to enable the PersistenceSupervisor to correlate each event prior to knowing the event's sequence number. That long needs to be assigned to the IConfirmableMessage.ConfirmationId property.
- A function of type Func<object, bool> - this predicate function is used to tell the PersistenceSupervisor which messages are events (when the function returns true) and which messages are stateless commands (when the function returns false).
- The Props of the persistent child actor that will be created and recreated in the event of an Akka.Persistence failure.
- The name of the child actor once it's created.
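The mapping function returns your own event type whenever it recognizes an event, so that type has to implement IConfirmableMessage. The following is a sketch of what the AddToCount message used in these examples might look like. It assumes IConfirmableMessage exposes a ConfirmationId and a SenderId, matching the arguments passed to AddToCount and Confirmation in the examples. In the examples the class is nested inside WorkingPersistentActor; it is shown at the top level here only for brevity.

```csharp
using Akka.Persistence.Extras; // requires the Akka.Persistence.Extras NuGet package

// Sketch of the event message; the member names on IConfirmableMessage
// (ConfirmationId, SenderId) are an assumption based on the examples.
public sealed class AddToCount : IConfirmableMessage
{
    public AddToCount(long confirmationId, string senderId, int countValue)
    {
        ConfirmationId = confirmationId;
        SenderId = senderId;
        CountValue = countValue;
    }

    // The id assigned by the PersistenceSupervisor; echoed back in the Confirmation.
    public long ConfirmationId { get; }

    public string SenderId { get; }

    // The actual payload that gets persisted.
    public int CountValue { get; }
}
```

Keeping the message immutable means the same instance can safely be buffered by the supervisor and replayed to a recreated child.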
Next, we're going to need to code our PersistentActor to be able to handle these additional message types introduced by the PersistenceSupervisor.
public class WorkingPersistentActor : ReceivePersistentActor
{
    // AddToCount and GetCount message classes omitted here
    private readonly ILoggingAdapter _log = Context.GetLogger();
    private int _currentCount;

    public WorkingPersistentActor(string persistenceId)
    {
        PersistenceId = persistenceId;

        // Replay journaled events during recovery
        Recover<int>(i =>
        {
            _log.Info("Recovery: Adding [{0}] to current count [{1}] - new count: [{2}]", i, _currentCount,
                _currentCount + i);
            _currentCount += i;
        });

        // Restore state from a snapshot, if one is offered
        Recover<SnapshotOffer>(o =>
        {
            if (o.Snapshot is int i)
            {
                _log.Info("Recovery: Setting initial count to [{0}]", i);
                _currentCount = i;
            }
        });

        Command<AddToCount>(e =>
        {
            Persist(e.CountValue, iN =>
            {
                _log.Info("Command: Adding [{0}] to current count [{1}] - new count: [{2}]", iN, _currentCount,
                    _currentCount + iN);
                _currentCount += iN;
                // ACK the message back to parent
                Context.Parent.Tell(new Confirmation(e.ConfirmationId, PersistenceId));
            });
        });

        Command<GetCount>(g => { Sender.Tell(_currentCount); });
    }

    // Required by the persistent actor base class; assigned in the constructor
    public override string PersistenceId { get; }
}
The most important bit of code is where the WorkingPersistentActor processes the AddToCount command, which is an IConfirmableMessage packaged by the PersistenceSupervisor:
Command<AddToCount>(e =>
{
    Persist(e.CountValue, iN =>
    {
        _log.Info("Command: Adding [{0}] to current count [{1}] - new count: [{2}]", iN, _currentCount,
            _currentCount + iN);
        _currentCount += iN;
        // ACK the message back to parent
        Context.Parent.Tell(new Confirmation(e.ConfirmationId, PersistenceId));
    });
});
Every persisted event should be decorated with an IConfirmableMessage of some kind - this is what drives the acknowledgment protocol that the PersistenceSupervisor uses to guarantee delivery.
If the WorkingPersistentActor never sent its Confirmation messages back, those events would be retained inside the parent actor's buffers indefinitely. Therefore, the Context.Parent.Tell(new Confirmation(e.ConfirmationId, PersistenceId)) call is absolutely critical to ensuring that this tandem pair works correctly.
Taking the PersistenceSupervisor for a Test-Drive
If you'd like to try the PersistenceSupervisor in a short sample application before wiring it into your own code, please clone the Akka.Persistence.Extras GitHub repository and run the PersistenceSupervisor sample. This example uses an intentionally unreliable Akka.Persistence journal implementation that fails frequently upon both recovery and write, but as you'll see, the PersistentActor is still able to process all of its work in the end despite numerous failures. That's a testament to the power and reliability of the PersistenceSupervisor.