For storing data we use PostgreSQL (with the PostGIS extension for geodata), MongoDB, and InfluxDB. We run several databases because of the different characteristics of our data and data sets: for some structured data a relational database is optimal, for other data a document or time-series database is a better fit. We try to stay flexible and treat each kind of data appropriately, storing it and serving it to data analysts in the most meaningful way possible.
General database description
Basic database schema
The database is divided into individual schemas:
Meta (stores metadata about the individual data sets)
Public (contains the data themselves)
History (history tables for the individual data tables)
Analytic (a schema for creating and storing analytic units and views, and for data analysts' access)
Tmp (for creating temporary extracts from the primary data)
The meta schema contains basic information about the registered data sets, as well as information about the processing progress of the individual sets. The batch table records the individual data batches as they arrived in the data platform. The extract_summary table is a detailed record of how the data were stratified into the individual extracts.
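To make the relationship between the two meta tables concrete, here is a minimal sketch of a batch record and its extract_summary rows. The column names and the sample data are assumptions for illustration, and an in-memory SQLite database stands in for PostgreSQL so the snippet is self-contained:

```python
import sqlite3

# Illustrative sketch of the meta-schema tables; names/columns are assumptions.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE batch (
    id         INTEGER PRIMARY KEY,
    dataset    TEXT NOT NULL,   -- which registered data set the batch belongs to
    status     TEXT NOT NULL,   -- e.g. OPEN / CLOSE / ERROR
    started_at TEXT NOT NULL
);
CREATE TABLE extract_summary (
    batch_id   INTEGER NOT NULL REFERENCES batch(id),
    extract    TEXT NOT NULL,   -- target extract (data table)
    row_count  INTEGER NOT NULL -- rows routed into that extract
);
""")

# One incoming batch, stratified into two extracts:
conn.execute("INSERT INTO batch VALUES (1, 'parking', 'OPEN', '2020-01-01T00:00:00')")
conn.executemany(
    "INSERT INTO extract_summary VALUES (?, ?, ?)",
    [(1, 'parking_sessions', 120), (1, 'parking_meters', 8)],
)

# extract_summary lets us reconstruct how many rows the batch produced in total:
total = conn.execute(
    "SELECT SUM(row_count) FROM extract_summary WHERE batch_id = 1"
).fetchone()[0]
print(total)  # 128
```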
The public schema contains the individual data sets. Each of them may consist of one or more extracts (data tables); whether and how these are linked depends on the domain of the given data set. Depending on the characteristics of the stored data, it is sometimes appropriate to create a history table that records all changes to the primary extract. Such a table is created in the history schema and is populated by a trigger attached to the extract itself. All data-set tables share the same group of audit fields, which give an overview of the origin and state of a given record:
create_batch_id – identifier of the batch that created the record
created_by – identifier of the user or system that created the record
created_at – timestamp of when the record was created in the data platform. If a database record later changes, information about the change is stored in the following fields:
update_batch_id – identifier of the batch that most recently changed the record
update_by – identifier of the user or system that most recently changed the record
update_at – timestamp of when the record was most recently changed in the data platform
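A minimal sketch of what a data-set table with this shared group of audit fields might look like. The table and its payload columns are hypothetical; SQLite is used here purely for illustration:

```python
import sqlite3

# Hypothetical data-set table carrying the standard audit fields.
conn = sqlite3.connect(":memory:")
conn.execute("""
CREATE TABLE sensor_reading (
    id              INTEGER PRIMARY KEY,
    value           REAL,
    -- standard audit fields shared by all data-set tables:
    create_batch_id INTEGER,  -- batch that created the record
    created_by      TEXT,     -- user/system that created it
    created_at      TEXT,     -- when it entered the data platform
    update_batch_id INTEGER,  -- batch of the most recent change
    update_by       TEXT,     -- user/system behind the most recent change
    update_at       TEXT      -- when it was most recently changed
)
""")
conn.execute(
    "INSERT INTO sensor_reading VALUES (1, 21.5, 42, 'etl', '2020-01-01', NULL, NULL, NULL)"
)

# The audit fields reveal the origin of the record:
row = conn.execute(
    "SELECT create_batch_id, created_by FROM sensor_reading WHERE id = 1"
).fetchone()
print(row)  # (42, 'etl')
```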
The history schema contains the history tables populated by the triggers. These tables exist so that the state of the data can be examined as of a given point in time.
The analytic schema is intended for specialized analyses where the primary data need to be transformed in an appropriate way (e.g. aggregated).
The tmp schema is intended for creating the temporary tables needed in individual partial calculations.
A description of the internal processes of the database
The meta schema also contains basic procedures and functions intended for general administration:
add_audit_fields (extends a table with the standard set of audit fields)
create_batch (creates a batch record with status OPEN)
close_batch (terminates the processing of a batch – CLOSE/ERROR)
count_batch (run when a batch terminates – writes the record to extract_summary)
close_extract (a function that closes an existing extract within a batch – called from MongoDB)
create_historization_process (creates a history table and its trigger)
retention (the retention process over the database – driven by metadata)
trg_general_write_history (a generic trigger function for the history triggers)
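To illustrate the batch lifecycle managed by create_batch and close_batch, here is a hedged Python sketch of the two operations. Only the function names and the OPEN/CLOSE/ERROR statuses come from the list above; the table layout and signatures are assumptions, and SQLite stands in for the actual stored procedures:

```python
import sqlite3
from datetime import datetime, timezone

# Hypothetical batch table; the real one lives in the meta schema.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE batch (id INTEGER PRIMARY KEY, status TEXT, opened_at TEXT, closed_at TEXT)"
)

def create_batch(conn):
    """Open a new batch (status OPEN) and return its id."""
    now = datetime.now(timezone.utc).isoformat()
    cur = conn.execute(
        "INSERT INTO batch (status, opened_at, closed_at) VALUES ('OPEN', ?, NULL)", (now,)
    )
    return cur.lastrowid

def close_batch(conn, batch_id, status="CLOSE"):
    """Terminate batch processing: CLOSE on success, ERROR on failure."""
    now = datetime.now(timezone.utc).isoformat()
    conn.execute(
        "UPDATE batch SET status = ?, closed_at = ? WHERE id = ?", (status, now, batch_id)
    )

bid = create_batch(conn)
close_batch(conn, bid)
status = conn.execute("SELECT status FROM batch WHERE id = ?", (bid,)).fetchone()[0]
print(status)  # CLOSE
```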
A description of the retention policy of data
The retention policy of the database is driven by information stored in metadata. In general, individual data sets differ in how long their production data need to be kept, and this period is therefore recorded in the metadata.
Every data set carries two retention attributes: one for its production tables and one for its history tables. These state after how many days data will be deleted from the database, individually for the production and history tables. If they are left empty, data are not continuously deleted. The values apply to the whole data set; if some extract needs different retention, the same attributes can be filled in on that extract's record in the extract table, and when filled in there they take priority over the data-set values.
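The override logic described above can be sketched as a lookup over the metadata: the extract-level retention period wins when set, otherwise the data-set value applies. Table and column names here are assumptions, with SQLite used for a self-contained illustration:

```python
import sqlite3

# Hypothetical metadata tables driving the retention process.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE dataset_meta (
    dataset        TEXT PRIMARY KEY,
    retention_days INTEGER            -- NULL = never delete
);
CREATE TABLE extract_meta (
    extract        TEXT PRIMARY KEY,
    dataset        TEXT,
    retention_days INTEGER            -- overrides the data-set value if set
);
INSERT INTO dataset_meta VALUES ('parking', 30);
INSERT INTO extract_meta VALUES ('parking_sessions', 'parking', 7);
INSERT INTO extract_meta VALUES ('parking_meters', 'parking', NULL);
""")

def effective_retention(conn, extract):
    """Extract-level value takes priority; fall back to the data-set value."""
    row = conn.execute("""
        SELECT COALESCE(e.retention_days, d.retention_days)
        FROM extract_meta e JOIN dataset_meta d ON d.dataset = e.dataset
        WHERE e.extract = ?
    """, (extract,)).fetchone()
    return row[0]

print(effective_retention(conn, 'parking_sessions'))  # 7  (extract override)
print(effective_retention(conn, 'parking_meters'))    # 30 (data-set default)
```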