The data platform is built as a modular system of independently functional, independently deployable, and replaceable modules. Each module and layer is designed so that it can be substituted with a different solution or an already existing service (e.g. one provided as SaaS), with maximum flexibility and scalability in mind.
Scalability and modularity
Because future developments (in data volume, number of data suppliers, and individual use cases) are difficult to anticipate, strong emphasis is placed on scalability. For example, the layer for receiving data is designed as a minimal (lightweight) stateless application that can be deployed, in combination with a load balancer, in any number of instances for fast reception of data and enqueuing of requests for gradual processing. The queue is an integral part of the system and handles the reception and persistence of data, as well as the distribution and synchronisation of work between the individual computational nodes. The system is therefore able to react to spikes in the data flow.
Most modules are implemented as stateless applications, with state kept in a shared database cluster. Computational workers (individual instances that process, transform, and calculate data) are also arbitrarily scalable; increasing the number of their instances speeds up the processing and picking up of data from the queue.
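The role of the lightweight, stateless receiving layer can be illustrated with a minimal sketch: the handler only validates the payload and enqueues it, so any number of identical instances can run behind a load balancer. The function and queue names below are illustrative assumptions, not the platform's actual code; an in-memory queue stands in for the message broker.

```python
import json
import queue

# Stands in for the message broker (RabbitMQ in the real platform).
processing_queue = queue.Queue()

def receive(raw_payload: str) -> bool:
    """Accept a request, validate it, enqueue it, and return immediately.

    No state is kept between calls, so the receiver scales horizontally.
    """
    try:
        payload = json.loads(raw_payload)
    except json.JSONDecodeError:
        return False                 # reject malformed input right away
    processing_queue.put(payload)    # heavy work happens later, elsewhere
    return True
```

Because `receive` holds no state of its own, running three instances instead of one triples the reception capacity without any coordination between them.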
The system architecture combines the advantages of a microservice-oriented architecture (flexibility, individual scaling, substitutability, fast rollout of updates) with a traditional design. This, of course, brings not only advantages but also drawbacks, which had to be taken into account during design and implementation (mainly during operation and during version promotions with backward-incompatible changes), because some modules are not purely independent microservices.
The whole system runs on a virtual infrastructure, and the individual applications run in a containerized environment (Docker). Above the containers there is an orchestration layer responsible for load balancing, clustering, deployment of new versions, and so on. New versions are deployed as rolling updates of the individual services: a second instance running the new version is started first, traffic is redirected to it, and only then is the old version switched off, which makes zero-downtime deployment possible.
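The zero-downtime rolling update described above can be configured declaratively in a Compose file for Docker Swarm, roughly as follows (the service name and image are placeholders, not the platform's actual configuration):

```yaml
services:
  input-gateway:
    image: registry.example.com/input-gateway:latest   # placeholder image
    deploy:
      replicas: 2
      update_config:
        order: start-first       # start the new instance before stopping the old
        parallelism: 1           # update one instance at a time
        failure_action: rollback # revert automatically if the update fails
      restart_policy:
        condition: on-failure
```

The key setting is `order: start-first`, which ensures a new instance is healthy and receiving traffic before the old one is removed.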
The architecture is designed for continuous integration and deployment (CI/CD), so new versions of the individual services can be deployed automatically and gradually in short iterations. The platform is built on top of a cloud environment, using microservices, containerization, scaling of the environment, and 'as a service' offerings.
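A CI/CD pipeline for such a setup might look roughly like the sketch below. GitLab CI syntax, the registry URL, and the service name are assumptions for illustration only; the source does not specify which CI system the platform uses.

```yaml
stages: [build, deploy]

build:
  stage: build
  script:
    # Build and publish an image tagged with the commit hash.
    - docker build -t registry.example.com/input-gateway:$CI_COMMIT_SHORT_SHA .
    - docker push registry.example.com/input-gateway:$CI_COMMIT_SHORT_SHA

deploy:
  stage: deploy
  script:
    # Trigger a rolling update of the running Swarm service.
    - docker service update
      --image registry.example.com/input-gateway:$CI_COMMIT_SHORT_SHA
      input-gateway
```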
The application is divided into individual and independent modules:
- Input interface (Input Gateway),
- Access layer (ACL, Permission Proxy),
- Queue (Message Broker, Queue),
- Integration layer (Integration Engine),
- Database layer,
- Output interface (Output Gateway),
- Admin panel (Admin Panel),
- Dispatching and data analysis (Client Panel),
- Management of time-controlled tasks (CRON).
Architecture framework scheme:
The following scheme describes the individual modules of the solution, including how they are connected (type of connection, interface) and an indication of communication and dependencies. Each module is labelled with its name and the name of its project (repository).
Description of the individual modules can be found on the page Integration modules.
Description of the frontend modules can be found on the page Web applications.
The following scheme describes the technological stack of the project and the use of technologies on individual parts of the platform. A more accurate definition of the technologies and frameworks is available in the description of each module in the section Integration modules.
Basic characteristics of system architecture:
- Individual components have a clearly defined interface and they can be scaled or replaced with a different solution in the future
- All incoming requests have to pass through the Permission Proxy, a layer responsible for authentication and authorization
- Input gateway defines the incoming endpoints and validates the structure of incoming data
- Data is processed through the RabbitMQ message queue with the help of workers
- Worker flow: pick up a message -> perform the task -> pass the result as a message to another queue, or simply switch off
- Data can be downloaded from external sources on a time basis (cron) or triggered by another event that saves a message to the queue
- The use of the Keboola ETL platform is optional; it enables acquiring data from many external sources without having to write your own connector
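The worker flow from the list above can be sketched as follows. This is a minimal illustration using Python's standard library in place of RabbitMQ; the queue names and the doubling "task" are assumptions, and `None` serves as a shutdown signal (the "switch off" step).

```python
import queue
import threading

input_queue = queue.Queue()   # stands in for a RabbitMQ input queue
result_queue = queue.Queue()  # stands in for the follow-up queue

def worker():
    """Pick up a message -> perform the task -> pass the result on."""
    while True:
        message = input_queue.get()
        if message is None:                # shutdown signal: switch off
            break
        result = message["value"] * 2      # placeholder for the real task
        result_queue.put({"source": message["id"], "result": result})

# Processing scales by running more worker instances on the same queue.
workers = [threading.Thread(target=worker) for _ in range(3)]
for w in workers:
    w.start()

for i in range(5):
    input_queue.put({"id": i, "value": i})
for _ in workers:
    input_queue.put(None)                  # tell each worker to switch off
for w in workers:
    w.join()

results = sorted((result_queue.get() for _ in range(5)),
                 key=lambda r: r["source"])
```

All three workers compete for messages on the shared queue, so adding instances drains the queue faster without any changes to the worker code.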
The whole project runs on Docker Swarm, and the individual components are Docker containers.
The reverse proxy is provided by Traefik, which distributes requests to the individual containers that should be reachable from the internet. The other containers are accessible only from the internal network.
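In a Traefik setup like this, exposure is typically controlled per container via labels; roughly as in the sketch below. The service names, hostname, and port are placeholders, not the platform's actual configuration.

```yaml
services:
  output-gateway:            # publicly reachable: routed by Traefik
    labels:
      - "traefik.enable=true"
      - "traefik.http.routers.output.rule=Host(`api.example.com`)"
      - "traefik.http.services.output.loadbalancer.server.port=3000"
  postgres:                  # internal only: Traefik never routes to it
    labels:
      - "traefik.enable=false"
```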
The assignment of containers to the individual cluster nodes is driven by their needs. Containers reachable directly from the internet share a common node, performance-demanding applications run on a more powerful node type, and database containers run on nodes with larger storage. The last node type is the service node, where administration and monitoring services run.
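This kind of node assignment can be expressed in Docker Swarm with placement constraints on node labels. The tier names and service names below are illustrative assumptions:

```yaml
services:
  output-gateway:
    deploy:
      placement:
        constraints: ["node.labels.tier == edge"]     # internet-facing node
  integration-engine:
    deploy:
      placement:
        constraints: ["node.labels.tier == compute"]  # high-performance node
  postgres:
    deploy:
      placement:
        constraints: ["node.labels.tier == storage"]  # large-storage node
```

Nodes are labelled once (e.g. `docker node update --label-add tier=edge <node>`), and Swarm then schedules each service only onto nodes matching its constraint.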