Integration of a dataset
These instructions describe how to integrate a new data set, from analysis, through receiving PUSH data or periodically downloading PULL data, transforming it and storing it in the database, to publishing the data through an output API.
1. Analysis of a data set
- The process of acquiring data (PULL, PUSH)
- The type and size of data
- The format of storage and target DB (mongo, postgresql)
2. Creating schemas for a data set (Schema Definitions)
- git repo: gitlab.com/operator-ict/golemio/code/schema-definitions-public
- Creating schemas by the nature of data (input or data source schema, output schema, history schema, etc.)
- Creating a migration of the DB (new tables, indexes)
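As an illustration of an input (data source) schema, the following is a minimal sketch in plain TypeScript. It is not the actual schema-definitions API: the record shape (`id`, `measured_at`, `value`) and the names `NewDatasetInput` and `isNewDatasetInput` are assumptions made up for this example.

```typescript
// Hypothetical shape of one record as delivered by the data source.
interface NewDatasetInput {
  id: string;
  measured_at: string; // ISO 8601 timestamp (assumed)
  value: number;
}

// Type guard: checks an unknown payload against the input schema at runtime.
function isNewDatasetInput(data: unknown): data is NewDatasetInput {
  if (typeof data !== "object" || data === null) {
    return false;
  }
  const record = data as Record<string, unknown>;
  const measuredAt = record["measured_at"];
  return (
    typeof record["id"] === "string" &&
    typeof measuredAt === "string" &&
    !Number.isNaN(Date.parse(measuredAt)) &&
    typeof record["value"] === "number"
  );
}
```

An output or history schema could be described the same way, with the field names and types matching the columns created by the DB migration.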
3. Input Gateway
- git repo: https://gitlab.com/operator-ict/golemio/code/input-gateway
- This step is needed only if the data is sent actively from the source (PUSH)
- Creating an endpoint for receiving data
- Validation of incoming data
- Sending data to the queue
- Documentation (OpenAPI)
See: gitlab.com/operator-ict/golemio/code/input-gateway/blob/master/docs/new_dataset_integration.md.
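The flow of this step (receive, validate, send to the queue) can be sketched without any framework. Everything named here is hypothetical: the `MessageQueue` interface, the in-memory stand-in, the queue key and the status codes are assumptions for illustration, not the input-gateway's real API.

```typescript
// Hypothetical minimal queue abstraction for this sketch.
interface MessageQueue {
  send(key: string, message: string): void;
}

// In-memory stand-in so the example runs without a message broker.
class InMemoryQueue implements MessageQueue {
  messages: Array<{ key: string; message: string }> = [];

  send(key: string, message: string): void {
    this.messages.push({ key, message });
  }
}

interface HandlerResult {
  status: number;
  body: string;
}

// Handles one PUSH request body: validate first, then enqueue for later processing.
function handleNewDatasetPush(
  body: unknown,
  isValid: (data: unknown) => boolean,
  queue: MessageQueue,
): HandlerResult {
  if (!isValid(body)) {
    // Status code is an assumption; invalid data never reaches the queue.
    return { status: 422, body: "Validation failed" };
  }
  // The queue key is a made-up example name.
  queue.send("input.newdataset.saveData", JSON.stringify(body));
  return { status: 204, body: "" };
}
```

Keeping the handler this thin leaves all transformation and storage logic to the worker that consumes the queue.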
4. Integration Engine
- git repo: gitlab.com/operator-ict/golemio/code/integration-engine
- Creating transformation of data, e.g.: modules/NewDataset/NewDatasetTransformation.ts
- Creating a worker, e.g.: modules/NewDataset/NewDatasetWorker.ts
- Adding a record to queueDefinitions.ts
- Defining a data source in the worker, only if it is necessary to actively download the data (PULL)
- Defining a model in the worker
- Implementing the methods that process messages from the queues (the whole processing logic)
- Writing tests
- Documentation (docs/datasets.md)
See: gitlab.com/operator-ict/golemio/code/integration-engine/blob/master/docs/new_dataset_integration.md.
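The relationship between a transformation, a model and a worker can be sketched as follows. This is a simplified illustration under assumptions, not the integration-engine's actual base classes: the record shapes, the `Model` interface and the method name `saveDataToDB` are hypothetical.

```typescript
// Hypothetical input and stored record shapes.
interface NewDatasetInput {
  id: string;
  measured_at: string;
  value: number;
}

interface NewDatasetOutput {
  id: string;
  measuredAt: Date;
  value: number;
}

// Transformation: maps source records to the shape stored in the database.
class NewDatasetTransformation {
  transform(items: NewDatasetInput[]): NewDatasetOutput[] {
    return items.map((item) => ({
      id: item.id,
      measuredAt: new Date(item.measured_at),
      value: item.value,
    }));
  }
}

// Hypothetical model abstraction hiding the target DB.
interface Model<T> {
  save(rows: T[]): Promise<void>;
}

// Worker: holds the processing logic for messages taken from the queue.
class NewDatasetWorker {
  transformation: NewDatasetTransformation;
  model: Model<NewDatasetOutput>;

  constructor(transformation: NewDatasetTransformation, model: Model<NewDatasetOutput>) {
    this.transformation = transformation;
    this.model = model;
  }

  // Processes one message whose content is a JSON array of input records
  // and returns the number of rows passed to the model.
  async saveDataToDB(messageContent: string): Promise<number> {
    const input = JSON.parse(messageContent) as NewDatasetInput[];
    const rows = this.transformation.transform(input);
    await this.model.save(rows);
    return rows.length;
  }
}
```

Injecting the model through the constructor makes the worker easy to test with an in-memory fake, which fits the test-writing step above. For a PULL data set, the worker would additionally own a data source that downloads the records before transforming them.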
5. Definition of a cron task
- git repo: gitlab.com/operator-ict/golemio/code/cron-tasks
- Only if it is necessary to actively download the data (PULL)
- Defining the periodicity and the queue for sending messages from the cron
- Sending the definition to DevOps
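A cron task definition essentially pairs a periodicity with the queue that should receive the message. The sketch below shows one possible shape; the `CronTaskDefinition` interface, its field names and the queue key are assumptions and may differ from the format the cron-tasks repository actually uses.

```typescript
// Hypothetical shape of one cron task definition.
interface CronTaskDefinition {
  name: string;
  cronTime: string; // standard five-field cron expression
  routingKey: string; // key of the queue the message is sent to
  message: string; // message content sent on every run
}

// Example definition; all values are made up for illustration.
const refreshNewDataset: CronTaskDefinition = {
  name: "newdataset-refresh",
  cronTime: "*/5 * * * *", // every five minutes
  routingKey: "cron.newdataset.refreshDataInDB",
  message: "{}",
};

// Basic sanity check: a standard cron expression has five whitespace-separated fields.
function hasFiveCronFields(expression: string): boolean {
  return expression.trim().split(/\s+/).length === 5;
}
```

The periodicity should follow how often the source actually publishes new data, since a shorter interval only adds load.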
6. Output Gateway
- git repo: gitlab.com/operator-ict/golemio/code/output-gateway
- Creating routes for the data set
- Definition of all the filters, limits, etc. by the nature of data
- Definition of data enrichment (linking) by the nature of data
- Documentation (OpenAPI)
See: gitlab.com/operator-ict/golemio/code/output-gateway/blob/master/docs/new_dataset_integration.md.
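Filters and limits on a route usually come down to parsing query parameters defensively and applying them to the result set. The following framework-free sketch assumes parameters named `limit`, `offset` and `from`, and the default and maximum limits are invented values; none of this is taken from the output-gateway code.

```typescript
// Assumed defaults for this sketch.
const DEFAULT_LIMIT = 100;
const MAX_LIMIT = 1000;

interface ListQuery {
  limit: number;
  offset: number;
  from: Date | undefined; // optional lower bound on the record timestamp
}

// Parses raw query-string values into a bounded, well-typed query.
function parseListQuery(raw: Record<string, string | undefined>): ListQuery {
  const limit = Number.parseInt(raw["limit"] ?? "", 10);
  const offset = Number.parseInt(raw["offset"] ?? "", 10);
  const fromRaw = raw["from"];
  const fromMs = fromRaw === undefined ? Number.NaN : Date.parse(fromRaw);
  return {
    limit: Number.isNaN(limit) || limit < 1 ? DEFAULT_LIMIT : Math.min(limit, MAX_LIMIT),
    offset: Number.isNaN(offset) || offset < 0 ? 0 : offset,
    from: Number.isNaN(fromMs) ? undefined : new Date(fromMs),
  };
}

// Applies the filter first, then offset and limit, to rows already in memory.
function applyListQuery<T extends { measuredAt: Date }>(rows: T[], query: ListQuery): T[] {
  const from = query.from;
  const filtered =
    from === undefined ? rows : rows.filter((row) => row.measuredAt.getTime() >= from.getTime());
  return filtered.slice(query.offset, query.offset + query.limit);
}
```

In a real route the same bounds would normally be pushed down into the database query instead of being applied in memory.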