Trustbit

The Power of EventSourcing

We just had a nice experience while maintaining one of the applications we created for our customer and we wanted to share this with you!

We are building a small ‘on-the-side’ project for a customer that supports the planning of a huge event with more than 1,500 guests. It's not in their major line of work, and they use the application for only about 1.5 months a year to organise the seating plan for their annual company party. Together we decided that it’s OK to take some shortcuts during the implementation to reduce the overall effort.

Over the last few weeks we have been busy working on the application. Lots of small change requests ('I want a different text in the generated email', ‘Can you add this field on that page?’, ‘We thought this feature works differently’) were coming in every couple of days.

Due to the ‘ad-hoc’ (or, as we would call it in German, ‘hemdsärmelig’) setup of the project, my colleagues Rinat Abdullin and Aigiz Kunafin and I decided it would make sense to take an event-sourcing approach for the application.
One of the reasons is that storing all changes in the application as raw events allows us to move fast and evolve the application as new requirements arise, without spending too much time modelling the system upfront.
Rinat sketched it briefly in a Tweet a while ago:

  • we have an event sourced application with a single aggregate, operating on a single event stream,

  • running on one application server is sufficient,

  • we keep all state in memory and persist only new events to a SQL Server database (currently we have approximately 10,000 events and the application, built on ASP.NET Core, consumes 250 MB of memory on Windows),

  • and we have isolated, dedicated pages which communicate with the user via WebSockets. (It's a mixture of server-side rendering and some vanilla/jQuery-based JavaScript.)
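
To make the setup above more concrete, here is a minimal sketch in Python rather than the ASP.NET Core stack we actually use. It is not our production code: the event names, the `SeatingPlan` model and the use of SQLite as a stand-in for SQL Server are all assumptions made for illustration. It shows the core idea of one aggregate held in memory, with only new events appended to the database and the state rebuilt by replaying the stream on start-up.

```python
import json
import sqlite3
from dataclasses import dataclass, field


@dataclass
class SeatingPlan:
    """The single aggregate; its whole state lives in memory (hypothetical model)."""
    assignments: dict = field(default_factory=dict)  # logical table -> physical table

    def apply(self, event: dict) -> None:
        # Event type names are invented for this sketch.
        if event["type"] == "PhysicalTableAssigned":
            self.assignments[event["logical"]] = event["physical"]
        elif event["type"] == "PhysicalTableCleared":
            self.assignments.pop(event["logical"], None)


class EventStore:
    """Append-only event stream in one SQL table (SQLite stands in for SQL Server)."""

    def __init__(self, conn: sqlite3.Connection) -> None:
        self.conn = conn
        conn.execute(
            "CREATE TABLE IF NOT EXISTS events ("
            "id INTEGER PRIMARY KEY AUTOINCREMENT, body TEXT NOT NULL)"
        )

    def append(self, event: dict) -> int:
        cur = self.conn.execute(
            "INSERT INTO events (body) VALUES (?)", (json.dumps(event),)
        )
        self.conn.commit()
        return cur.lastrowid

    def load(self) -> list:
        rows = self.conn.execute("SELECT body FROM events ORDER BY id")
        return [json.loads(body) for (body,) in rows]


class App:
    """Rebuilds state from the stream on start-up, then persists only new events."""

    def __init__(self, store: EventStore) -> None:
        self.store = store
        self.state = SeatingPlan()
        for event in store.load():
            self.state.apply(event)

    def record(self, event: dict) -> None:
        self.store.append(event)   # persist first ...
        self.state.apply(event)    # ... then update the in-memory state


conn = sqlite3.connect(":memory:")
app = App(EventStore(conn))
app.record({"type": "PhysicalTableAssigned", "logical": "VIP-1", "physical": "A7"})

# A "restart": a fresh App instance replays the same stream.
restarted = App(EventStore(conn))
```

After the simulated restart, `restarted.state.assignments` holds the same mapping as `app.state.assignments`, because both were derived from the same events.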

The setup allows us to move fast, but it leaves little time to write tests: we have only one unit and integration test. To mitigate this, we wanted to make sure that we had a good, fast and reliable CI/CD pipeline and comprehensive logging in the application.

A consequence of not having a suite of tests is that things tend to break from time to time. Usually the errors are either

  • isolated on one particular page, or

  • mitigated via event sourcing.

The errors are usually fixed within a few minutes and rolled out via another deployment.

The incident

Just a couple of days ago we had another incident with the customer where the architecture and the setup paid off big time.

In the application we have functionality where users can assign physical dining tables, spread over several rooms of the event location, to logical tables to organise and plan the event. Prior to this assignment the customer does a physical random table-picking, drawing the table assignments one by one, to determine which table should be positioned in which room.

Afterwards they enter the randomly assigned tables into an Excel file and upload it to the application.

However, there is a catch: some tables, those for VIPs, are placed in a different way and shouldn’t be updated by this table-picking mechanism.
This time they kept the logical VIP tables in the Excel file they uploaded, but with no physical tables assigned to them.
During the upload of the file we therefore reset all previously assigned physical tables for the VIPs. You could argue that it was not a bug and worked as designed, but the customer was still not very happy about this, because they thought that they had lost a lot of work!

So what did we do to mitigate this issue? We simply deleted the events generated by the upload, fixed the implementation, re-uploaded the file and everything was fine. And the best thing: in the meantime the customer could keep on working with the application! We did, however, ask them not to work in areas directly related to VIP tables.
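
The repair can be sketched roughly as follows. This is an illustration under assumptions, not our actual code: the post does not describe how the upload's events were identified, so the `batch` tag, the event shapes and the function names here are hypothetical. The point is that state is only ever derived from the event log, so removing the faulty events and replaying restores the VIP assignments.

```python
def replay(events):
    """Rebuild the logical -> physical table mapping from the event log."""
    state = {}
    for event in events:
        if event["type"] == "assigned":
            state[event["logical"]] = event["physical"]
        elif event["type"] == "cleared":
            state.pop(event["logical"], None)
    return state


def upload_original(rows, batch):
    """Behaviour before the fix: a row without a physical table resets it."""
    events = []
    for logical, physical in rows:
        if physical:
            events.append({"type": "assigned", "logical": logical,
                           "physical": physical, "batch": batch})
        else:
            events.append({"type": "cleared", "logical": logical, "batch": batch})
    return events


def upload_fixed(rows, batch):
    """Behaviour after the fix: rows without a physical table are skipped."""
    return [
        {"type": "assigned", "logical": logical, "physical": physical, "batch": batch}
        for logical, physical in rows
        if physical
    ]


def drop_batch(events, batch):
    """Remove every event produced by one upload (hypothetical batch tag)."""
    return [event for event in events if event.get("batch") != batch]


# Work the customer did by hand before the upload.
log = [{"type": "assigned", "logical": "VIP-1", "physical": "A7", "batch": "manual"}]

# The Excel file lists the VIP table without a physical table.
rows = [("VIP-1", None), ("T-12", "B3")]

log += upload_original(rows, "upload-1")
broken = replay(log)      # the VIP-1 assignment is gone

log = drop_batch(log, "upload-1") + upload_fixed(rows, "upload-2")
repaired = replay(log)    # VIP-1 is back, T-12 is assigned
```

Because only the events tagged with the faulty upload are removed, anything else recorded in the meantime stays in the log and survives the replay.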

Everything was resolved in approximately one hour, and the result was a very happy customer.
The customer knew upfront that we were storing every change they made in the application and that this empowers us to react very quickly to errors, whether on our side or the customer's. We proved to them once again that there are no catastrophic errors and that we have chosen the right architecture for the application!