Troubleshooting

Common error scenarios in the system and how they are resolved

Temporary unavailability of the database

The database is an integral part of the system, and its availability requirements mean that a longer, unexpected outage should not occur.

If the connection to the database is lost for a short time (e.g. while the supplier deploys a new version of the database), the Integration Engine tries to reconnect several times in a row with a short timeout, so the application does not fail immediately after a connection failure.
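The reconnect behaviour described above can be sketched as a generic retry helper. This is an illustrative sketch, not the actual Integration Engine code; the function name, retry count, and delay are our own assumptions:

```javascript
// Illustrative sketch (not the real Integration Engine implementation):
// call a connect function several times in a row with a short delay,
// so a brief database outage does not bring the application down.
async function connectWithRetry(connectFn, { retries = 5, delayMs = 1000 } = {}) {
  let lastError;
  for (let attempt = 1; attempt <= retries; attempt++) {
    try {
      return await connectFn(); // success: return the open connection
    } catch (err) {
      lastError = err;
      // wait a short time before the next attempt
      await new Promise((resolve) => setTimeout(resolve, delayMs));
    }
  }
  throw lastError; // all attempts failed: only now let the caller handle it
}
```

Only after all attempts fail does the error propagate and normal error handling take over.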

Temporary unavailability of Message Broker (RabbitMQ)

Message Broker is deployed as a 2-node cluster behind a load balancer; if one node fails, all traffic is switched over to and processed by the second node. To ensure a high level of availability, the two nodes are geographically separated.

If a communication failure occurs (an error in the network layer), the Input Gateway (where the connection to the queue is critical for not losing received data) switches to an alternative connection to the Message Broker (a separate IP address is reserved for this connection). If the queue itself is briefly unavailable, the Input Gateway retries the connection afterwards.
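The switch to an alternative broker connection can be sketched as trying a list of addresses in order. This is a hedged sketch under our own assumptions; the addresses are placeholders, not the real broker endpoints:

```javascript
// Sketch of the failover idea (addresses are placeholders, not real):
// try the primary broker address first; on a network-layer error,
// fall through to the alternative address.
async function connectWithFailover(connect, addresses) {
  let lastError;
  for (const address of addresses) {
    try {
      return await connect(address); // first working address wins
    } catch (err) {
      lastError = err; // network error: try the next (alternative) address
    }
  }
  throw lastError; // no address worked
}
```

Unlike a plain retry over time, this loops over distinct network paths, which is what makes it useful when a single route fails.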

An error in the data source

PUSH data – when the external system that actively sends us data has an outage, there are few reliable ways to identify it (the system may be down, may have just stopped sending us data, or there may be an interruption or a delay on the network). We therefore monitor the amount of received data in real time and alert on time windows with no data. When such an alert fires, we inform our source of the data outage and hand the problem over to their side for resolution.
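The "time window without data" alert boils down to comparing the timestamp of the last received message against an expected window. A minimal sketch, assuming our own function name and a caller-supplied window:

```javascript
// Minimal sketch of the data-gap check (names and thresholds are our own
// assumptions, not the real monitoring configuration): alert when no PUSH
// data has arrived within the expected time window.
function isDataGap(lastReceivedAt, now, windowMs) {
  return now - lastReceivedAt > windowMs;
}
```

In practice the timestamps would come from the received-data metrics (e.g. in InfluxDB) and the check would run periodically, firing the alert when it returns true.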

PULL data – we have an alerting set for the error of calling an external API or validation of data. In the case of an error, we solve it with the supplier (non-functional, not available API, wrong data). We store the log of all the recorded errors to the database for a general overview and an eventual history search or an addition to a published data set (if we publish the data set as OpenData, we add an information about errors/unavailability of data, so the users will know where is the data missing, where is the ‘data gap’ and what was the reason, so they can work with it in their statistical calculations, etc.).

Errors within individual modules

Input Gateway

The received data and their counts are logged to InfluxDB, with custom alerts on thresholds and times, plus the normal error logs.

Integration Engine

Custom errors are logged to PostgreSQL; record counts go to InfluxDB.

Depending on the status/severity of the error, the Integration Engine either sends the message back to RabbitMQ (nack, on an ERROR) or confirms its takeover and processing despite a non-critical error (ack, on a WARNING).
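The acknowledgement decision can be expressed as a small pure function. This is our own illustrative naming, not the actual Integration Engine code:

```javascript
// Illustrative sketch of the acknowledgement decision (the function name is
// our own, not the real implementation). An ERROR returns the message to
// RabbitMQ (nack) so it can be retried or dead-lettered; a WARNING means the
// message was processed with a non-critical error and is confirmed (ack).
function resolveAcknowledgement(severity) {
  switch (severity) {
    case "ERROR":
      return "nack"; // give the message back to the broker
    case "WARNING":
    default:
      return "ack"; // taken over and processed, possibly with a non-critical error
  }
}
```

With a client such as amqplib, "ack" would map to `channel.ack(msg)` and "nack" to `channel.nack(msg)`.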

In RabbitMQ, messages that are not received, are rejected, or are not processed end up in the 'dead queue', where we monitor their number and have the option to process them manually or move them to a different queue. Every unprocessed message ends in the dead queue: it fails with ERROR severity, fails with an unknown error, or never finishes at all and its TTL expires.
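RabbitMQ supports this pattern natively through per-queue arguments. A minimal sketch of how such a queue could be declared (the exchange name and TTL value are assumptions, not the real configuration):

```javascript
// Minimal sketch (queue/exchange names and TTL are our own assumptions):
// build the options for declaring a RabbitMQ queue whose rejected (nack)
// or expired (TTL) messages are routed to a dead-letter exchange.
function deadLetterQueueOptions(ttlMs) {
  return {
    durable: true,
    arguments: {
      // rejected or expired messages are re-published to this exchange,
      // from which they reach the dead queue
      "x-dead-letter-exchange": "dead",
      // messages older than ttlMs expire automatically
      "x-message-ttl": ttlMs,
    },
  };
}
// usage with amqplib (sketch):
// channel.assertQueue("workers-queue", deadLetterQueueOptions(60000));
```

The `x-dead-letter-exchange` and `x-message-ttl` arguments are standard RabbitMQ queue features; the monitoring of the dead queue's depth is done separately.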

All error codes and their meanings are documented, and alerting is set up for every error. If we identify errors that do not affect the running of the system, or that are 'ordinary', we explicitly lower their severity to a warning. The full list of errors follows.

Connection Errors

1001: Error while connecting to {name}.

1002: {name} connection not exists. First call connect() method.

1003: Sending the message to exchange failed.

1004: Error while saving data to InfluxDB.

Datasources Errors

2001: Retrieving of the source data failed.

2002: Error while getting data from server.

2003: Error while parsing source data.

2004: Error while validating source data.

Transformations Errors

3001: Sorted Waste Containers were not set. Use method setContainers().

3002: {name} must be a valid number.

Models Errors

4001: Error while saving to database.

4002: Error while truncating data.

4003: Model data was not found.

4004: Error while getting from database.

4005: Error while validating data.

Workers Errors

5001: Error while updating {name}.

Error `5001` is processed as a `warning`

5002: Error while purging old data.

5003: Error while sending the POST request.

5004: Error while checking RopidGTFS saved rows.

5005: Worker and worker method must be defined.

Other Errors

6001: Method is not implemented.

6002: Retrieving of the open street map nominatim data failed.

Error `6002` is processed as a `warning`

6003: The ENV variable {name} cannot be undefined.

Output Gateway

All output APIs return standard HTTP status codes; the implementation is in the package @golemio/errors. The standard possible errors in API responses are part of the OpenAPI specification – 404 Not Found, 401 Unauthorized, 403 Forbidden, 429 Too Many Requests, etc.
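The mapping from an internal error condition to a standard HTTP status code can be sketched as a lookup table. This is our own illustrative mapping, not the @golemio/errors implementation:

```javascript
// Illustrative sketch (our own mapping, NOT the @golemio/errors API):
// translate an internal error condition to one of the standard HTTP
// status codes listed in the OpenAPI specification.
function toHttpStatus(error) {
  const codes = {
    NOT_FOUND: 404,        // resource does not exist
    UNAUTHORIZED: 401,     // missing or invalid credentials
    FORBIDDEN: 403,        // authenticated but not allowed
    TOO_MANY_REQUESTS: 429 // rate limit exceeded
  };
  // unknown/unexpected errors fall back to 500 Internal Server Error
  return codes[error.code] || 500;
}
```

An Express-style error-handling middleware would then call this mapper and send `res.status(toHttpStatus(err))` with a sanitized error body.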

Errors in the code (JS)

Throughout the project's JS code we use centralized error handling via our own package @golemio/errors, which extends the native JavaScript Error but allows us to capture much more information and provides additional functionality. We distinguish operational errors – expected errors that the application is able to handle – from non-operational (programmer) errors. We capture the whole stack trace (although we do not send it to the user), so we know the origin of every error.
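The idea behind such a centralized error type can be sketched as follows. This is NOT the real @golemio/errors API – the class and field names are our own assumptions used purely for illustration:

```javascript
// Illustrative sketch of a centralized error type (NOT the real
// @golemio/errors API): extend the native Error, keep the full stack
// trace internally, and flag whether the error is operational
// (expected, handleable) or a programmer error.
class OperationalError extends Error {
  constructor(message, isOperational = true, code) {
    super(message);
    this.name = "OperationalError";
    this.isOperational = isOperational; // expected error vs. a bug
    this.code = code;                   // e.g. one of the numeric codes above
  }

  // What we expose to the user: safe fields only, never the stack trace.
  toObject() {
    return { error_message: this.message, error_code: this.code };
  }
}
```

Because the class extends the native Error, `instanceof Error` checks, stack traces, and standard logging all keep working, while the `isOperational` flag lets a central handler decide whether to recover or crash.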

The package @golemio/errors can easily be reused in other projects as well.