BUILDING A SAAS PLATFORM BASED ON REACTIVE ARCHITECTURE
Striving to meet modern demands for robustness, resilience, and flexibility, we as software engineers face daily challenges implementing and maintaining systems that are often critical to business operations. Time and again we find ourselves having to maximize value for our customers with fewer resources, becoming more efficient in business terms while providing a good user experience and supporting the operation with a reliable system.
On one of those challenging days, working on a legacy logistics system that supports a critical manufacturing operation, we were asked to provide the same solution to any other customer that might require our services. But we knew that replicating the software, along with its entire architecture, for every new customer would increase the technical effort dramatically, and the operational effort even more.
That effort represents cost, and rising costs make new business unfeasible because they break the business plan, causing the company to lose many opportunities if it pursues that market share. A software architecture that was not properly designed to meet the requirements will impact the code quality and the reliability of the entire system.
We had read about Reactive Architecture before and were aware that its advantages matched exactly what we needed.
In this paper we want to tell how it went, pointing out the challenges in our journey and how we successfully overcame all of them.
As a logistics system, it has to deal with a complex chain of processes in order to deliver consistent data to the manufacturing operation, which needs to monitor all warehouse outcomes and get a clear view of what is in transit and what has or has not been delivered on time, for every truck or plane trip. The original solution was based on a set of scheduled jobs that receive data from shipping companies and the manufacturer's ERP in a certain flow: the system applies business logic to the input data and hands the result as input to another scheduled job, which determines whether the invoice was issued correctly and grouped properly in the shipping process, then how it was dispatched, and so on.
Following this scheme, the system was able to queue incoming data into the process chain while providing resilience in case one stage of the process needed to be reprocessed.
But each scheduled job runs on a cycle: once new data arrives, it takes a long time to make its way through the whole chain.
Also, the entire queue is based on a database, recording data that is never used as a result and is completely discarded, requiring a periodic purge. The I/O time is another unnecessary cost and another opportunity for improvement. For this purpose, which is simply to organize process input, you do not need relational database consistency.
Another point relates to scalability: if you need to scale these processes for another manufacturer, it becomes another pain, because you have to bring the whole RDBMS along, since much of the business logic lives inside the database.
The application front end, built in JSF, couples some business logic into the same WAR package. We know that in this case the problem was not the chosen technology but the code implementation, yet we saw an opportunity to decouple it into API services consumed by a Single Page Application providing the front end.
According to the new requirements, we needed to build a multi-tenant architecture in which we could see the customization effort separately. In a productization process, this gives the commercial team a clear view of costs and effort when they write a commercial proposal to bring in new customers.
Based on our logistics system and our new non-functional needs, we reflected the conceptual layers presented before in our infrastructure architecture, as in the image below.
From top to bottom, you can see the site domain, load balancer, API gateway, Cognito, and Lambda: all AWS services in the same account, related to the VPC (Amazon Virtual Private Cloud) that wraps the applications. Conceptually, they sit on the Tenant Data Integration Layer, demanding cloud skills whenever a new customer is onboarded as a tenant.
Some APIs exposed to customers and partners for integration purposes run on the Data Serializer Layer, and much of their data is input to the system's processes. These APIs must be highly available and very elastic, absorbing all the data volume incoming from customers and third parties. At first we chose Lambda functions to meet these requirements, but we faced compatibility problems with MSK due to the security layer when running Python code, so we moved these API applications into the EKS cluster, where they performed satisfactorily.
The Amazon Virtual Private Cloud isolates the context of our application cluster, enabling the servers to communicate with each other while remaining completely safe and isolated from the internet. A load balancer redirects traffic from the internet to the specific service port of the cluster and helps prevent attacks from outside.
Once the security layer allows the incoming traffic and the API gateway authenticates the requests through its integration with Cognito, EKS receives the requests and starts interacting with the applications.
All effort made on this layer belongs to the Tenant Data Integration Layer, carried out mainly by the SRE team or a cloud specialist.
The managed AWS service for Kafka (MSK) has been used as our message broker due to its performance and flexibility. All the data we previously stored in the legacy system's IFTables now flows here. Every topic represents an event, and every event receives its proper message type, already orchestrated by the Inbound layer.
Once the data reaches the Transactional Data Processes, a microservice applies business logic, transforms the data, and publishes another message through its producer to a topic of the next service, which executes the next step of the logistics process.
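This consume-transform-produce step can be sketched as follows. It is a minimal illustration, not our production code: in-memory queues stand in for the MSK topics, and the topic and field names (`invoice_issued`, `invoice_validated`, `invoice_id`) are hypothetical.

```python
from queue import Queue

# In-memory queues stand in for Kafka topics in this sketch; in production
# these would be MSK topics accessed through a Kafka client library.
invoice_issued = Queue()     # hypothetical input topic: invoices from the ERP
invoice_validated = Queue()  # hypothetical output topic for the next service

def validate_invoice(message: dict) -> dict:
    """Business-logic step: check the invoice and enrich the event."""
    message["valid"] = bool(message.get("invoice_id")) and message.get("total", 0) > 0
    return message

def process_one() -> None:
    """Consume one event, apply business logic, produce to the next topic."""
    event = invoice_issued.get()
    invoice_validated.put(validate_invoice(event))

invoice_issued.put({"invoice_id": "INV-1", "total": 120.0})
process_one()
print(invoice_validated.get())  # {'invoice_id': 'INV-1', 'total': 120.0, 'valid': True}
```

Each microservice in the chain repeats this same pattern, consuming the previous service's output topic and producing to its own.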
Kafka enables asynchronous processing that follows the business workflow reactively, meaning asynchronous, non-blocking, and event-driven. EKS provides the resilience and elasticity that will be explained in the next section.
MSK enables the communication between all layers. Because of this, during our journey we noticed that we should be more careful with the Kafka configuration so it properly supports what our architecture needs; we had some headaches the first time the application went to production, and learned some lessons.
Be aware of the relationship between topic partitions and the microservice replicas that consume the same topic. It can hurt ISR metrics when Kafka needs to rebalance the message load across consumers. One solution we found was to keep the number of topic partitions equal to the number of replica pods. When Kafka lost the reference to the leader broker, the microservice would reprocess all messages from the beginning in an endless loop. Proper Consumer Group configuration helps balance the processing and, in our experience, worked efficiently.
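The reasoning behind matching partitions to replicas can be seen with a small sketch of how Kafka spreads partitions across the consumers in a group (shown here as a simple round-robin, in the spirit of Kafka's round-robin assignor; the real assignors have more nuances):

```python
def assign_partitions(partitions: int, consumers: int) -> dict:
    """Round-robin assignment of topic partitions to the consumers of one
    consumer group, roughly how Kafka's RoundRobinAssignor spreads load."""
    assignment = {c: [] for c in range(consumers)}
    for p in range(partitions):
        assignment[p % consumers].append(p)
    return assignment

# 6 partitions, 6 replica pods: every consumer owns exactly one partition.
print(assign_partitions(6, 6))
# 6 partitions, 8 replica pods: two consumers sit idle, wasting resources.
print(assign_partitions(6, 8))
```

Since a partition is consumed by at most one member of a group, extra replicas beyond the partition count do nothing, and fewer replicas than partitions concentrate load; equal counts keep the load even.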
EKS provides a cloud-managed Kubernetes cluster with exposed node ports redirecting traffic to the Tenant Data API, which conceptually sits on the Tenant Data Integration Layer.
Amazon Elastic Kubernetes Service is a managed service that, as the name says, provides elasticity both vertically and horizontally, since all stateless services run as Docker images. With the HPA (Horizontal Pod Autoscaler) we can automatically scale the number of pods on demand, and EKS itself auto-scales the number of nodes when the whole workload needs to grow. We can also scale vertically, changing the instance class of each node to increase memory and CPU when needed.
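The HPA's scaling decision itself is simple; the Kubernetes documentation defines it as desired = ceil(current × currentMetric / targetMetric). A tiny sketch of that rule:

```python
import math

def hpa_desired_replicas(current_replicas: int,
                         current_metric: float,
                         target_metric: float) -> int:
    """The core HPA scaling rule from the Kubernetes docs:
    desired = ceil(current * currentMetric / targetMetric)."""
    return math.ceil(current_replicas * current_metric / target_metric)

# 4 pods averaging 90% CPU against a 60% target -> scale out to 6 pods.
print(hpa_desired_replicas(4, 90, 60))  # 6
# 4 pods averaging 30% against a 60% target -> scale in to 2 pods.
print(hpa_desired_replicas(4, 30, 60))  # 2
```

The real controller adds tolerances, stabilization windows, and min/max replica bounds on top of this formula.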
In case of a crash, another pod takes over the task, as long as we have enough workload resources, giving resilience to the entire system. All of this provides enough structure to receive any number of tenants at a reasonable cloud resource cost.
Optimized Resource Management: Proper sizing of replicas and their relationship to Kafka Brokers is crucial. In a past incident, minimal infrastructure requirements led to unavailability during a cluster collapse. The system, comprising 16 microservices with at least 2 replicas each, had 10 microservices consuming from Kafka topics. When horizontal scaling occurred, Kafka became a bottleneck, increasing ping times and eventually causing consumer microservices to fail. Kubernetes attempted to restart these services, exhausting resources and causing EKS to stop functioning.
To prevent such scenarios, it is essential to have the right deployment configuration, including resource limits in the YAML files, and enough brokers and hardware resources.
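A minimal deployment fragment illustrates the kind of limits we mean; the service name and image are hypothetical, and the actual values must come from load-testing your own workload. Requests and limits keep one failing consumer from exhausting a node and dragging the whole cluster down.

```yaml
# Hypothetical deployment fragment: requests reserve capacity for scheduling,
# limits cap what a misbehaving pod can consume.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: invoice-consumer          # hypothetical microservice name
spec:
  replicas: 2
  selector:
    matchLabels:
      app: invoice-consumer
  template:
    metadata:
      labels:
        app: invoice-consumer
    spec:
      containers:
        - name: invoice-consumer
          image: registry.example.com/invoice-consumer:1.0.0
          resources:
            requests:
              cpu: "250m"
              memory: "256Mi"
            limits:
              cpu: "500m"
              memory: "512Mi"
```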
Once we solved the process orchestration with the Inbound layer and MSK, we found there was no need for an RDBMS in this case, gaining performance and data flexibility with MongoDB collections. The possibility of running multiple databases on one managed cluster gave us another benefit for working with microservices and their dedicated databases with little effort, plus a centralized management tool, Atlas. MongoDB Atlas was crucial to the project's success at a reasonable cost, enabling a flexibility in our architecture that we had never seen before. Its autoscaling gave us the confidence to build auto-scalable microservices plugged into our Kubernetes cluster. In our model, these efforts are classified as Transactional Data Processes, along with the microservices that connect to them.
Transitioning from RDBMS to Document Databases, A Journey: Transitioning from traditional relational database management systems (RDBMS) to document-based databases posed significant challenges for our team. We struggled for an extended period to create an effective application within this new paradigm. The primary pain points were data modeling and querying.
Efficient data modeling in document databases is crucial. Properly representing data in collections can prevent difficulties during queries and aggregations, concepts that were new to us. Once our team mastered efficient collection modeling and correctly set up data indexes, resource usage decreased dramatically. This optimization allowed us to downgrade from an M20 to an M10 instance, effectively halving our costs.
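To give a flavor of the modeling shift: instead of spreading a shipment across several joined relational tables, the document model embeds everything one screen needs in a single document. The shape below is a hypothetical sketch in plain Python (field names invented for illustration), not our actual schema:

```python
# Hypothetical shipment document: the invoice and tracking events are
# embedded in one document instead of joined relational tables.
shipment = {
    "_id": "SHP-001",
    "tenant": "acme-manufacturing",   # field we would index for tenant queries
    "status": "in_transit",
    "invoice": {"number": "INV-1", "issued_ok": True},
    "events": [
        {"type": "dispatched", "at": "2024-01-10T08:00:00Z"},
        {"type": "customs_cleared", "at": "2024-01-11T14:30:00Z"},
    ],
}

def delivery_summary(doc: dict) -> dict:
    """One read returns everything the monitoring screen needs -- no joins."""
    return {
        "shipment": doc["_id"],
        "status": doc["status"],
        "invoice_ok": doc["invoice"]["issued_ok"],
        "last_event": doc["events"][-1]["type"],
    }

print(delivery_summary(shipment))
```

Pairing a shape like this with indexes on the fields you actually filter by (the tenant field, for instance) is what let queries stop scanning whole collections and drove our resource usage down.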
Applying this four-layer concept, we can easily map the product structure to the incremental architecture of the project, readily identifying what is a core feature or core process at the moment of receiving a new tenant. This gives the technical team clearer support for estimating effort in different customization scenarios, with an infrastructure that is centralized, scalable, and flexible.
We could also see clear evidence of the benefits of these technologies, architecture, and concepts, which were implemented for a logistics domain but can easily be replicated in any other domain with the same needs.
Originally published at https://marcalcantara.substack.com.