Integration of a dataset

These instructions describe the process of integrating a new dataset: from analysis, through implementing the reception of PUSH data or the periodic download of PULL data, transforming the data, and storing it in the database, to publishing it through an output API.

1. Analysis of a dataset

  • The process of acquiring the data (PULL or PUSH)
  • The type and volume of the data
  • The storage format and target database (MongoDB, PostgreSQL)

2. Creating schemas for a dataset (Schema Definitions)

See: gitlab.com/operator-ict/golemio/code/schema-definitions-public/blob/master/docs/new_dataset_integration.md.
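
As an illustration, here is a minimal sketch of what a schema definition might contain, assuming a Mongoose schema for the MongoDB target paired with a JSON Schema for validating input data. Every name and field below (NewDataset, newDatasetMongooseSchema, etc.) is a hypothetical placeholder; the linked document is authoritative:

```typescript
// Hypothetical schema definition for a "NewDataset" entity; the actual
// structure is prescribed by the schema-definitions documentation linked above.
import { SchemaDefinition } from "mongoose";

// Shape of the stored documents (MongoDB target)
const newDatasetMongooseSchema: SchemaDefinition = {
    id: { type: Number, required: true },
    name: { type: String },
    updated_at: { type: Number, required: true },
};

// JSON Schema for validating the raw input data (PUSH payloads or PULL downloads)
const newDatasetInputSchema = {
    type: "array",
    items: {
        type: "object",
        required: ["id"],
        properties: {
            id: { type: "number" },
            name: { type: "string" },
        },
    },
};

export { newDatasetInputSchema, newDatasetMongooseSchema };
```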

3. Input Gateway

  • git repo: https://gitlab.com/operator-ict/golemio/code/input-gateway
  • This step is needed only if the data is actively sent by the source (PUSH)
  • Creating an endpoint for receiving the data
  • Validating the incoming data
  • Sending the data to the queue (a sketch of these steps follows below)
  • Documentation (OpenAPI)

See: gitlab.com/operator-ict/golemio/code/input-gateway/blob/master/docs/new_dataset_integration.md.
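
To make the implementation bullets above concrete, here is a minimal sketch of a PUSH endpoint, assuming Express for routing and amqplib for publishing to RabbitMQ; the route path, queue name, and validation are hypothetical placeholders:

```typescript
// Hypothetical PUSH endpoint: receive data, validate it, and forward it
// to the message queue for the Integration Engine to consume.
import amqplib from "amqplib";
import express, { NextFunction, Request, Response } from "express";

const router = express.Router();
const QUEUE_NAME = "dataplatform.newdataset.saveDataToDB"; // hypothetical queue name

router.post("/newdataset", async (req: Request, res: Response, next: NextFunction) => {
    try {
        // Minimal shape check; the real gateway validates against the
        // JSON Schema defined in schema-definitions.
        if (!Array.isArray(req.body)) {
            res.status(400).send({ error: "Payload must be an array" });
            return;
        }

        // Publish the raw payload; the Integration Engine processes it asynchronously.
        const connection = await amqplib.connect(process.env.RABBIT_CONN ?? "amqp://localhost");
        const channel = await connection.createChannel();
        await channel.assertQueue(QUEUE_NAME, { durable: true });
        channel.sendToQueue(QUEUE_NAME, Buffer.from(JSON.stringify(req.body)), { persistent: true });
        await channel.close();
        await connection.close();

        res.status(204).send();
    } catch (err) {
        next(err);
    }
});

export { router };
```

In a real gateway the AMQP connection would be opened once at startup and reused; it is created per request here only to keep the sketch self-contained.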

4. Integration Engine

  • git repo: https://gitlab.com/operator-ict/golemio/code/integration-engine
  • Creating the data transformation, e.g. modules/NewDataset/NewDatasetTransformation.ts
  • Creating a worker, e.g. modules/NewDataset/NewDatasetWorker.ts
  • Adding a record to queueDefinitions.ts
  • Defining a data source in the worker, only if the data has to be actively downloaded (PULL)
  • Defining a model in the worker
  • Implementing the methods that process messages from the queues, i.e. the whole processing logic (see the sketch below)
  • Writing tests
  • Documentation (docs/datasets.md)

See: gitlab.com/operator-ict/golemio/code/integration-engine/blob/master/docs/new_dataset_integration.md.
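
The pieces listed above fit together roughly as follows. This is a hypothetical sketch: the base classes, method names, and wiring are prescribed by the integration-engine docs, and NewDatasetTransformation, NewDatasetWorker, and saveDataToDB are placeholders:

```typescript
// modules/NewDataset/NewDatasetTransformation.ts (sketch)
interface ISourceElement {
    id: number;
    name?: string;
}

class NewDatasetTransformation {
    // Transform one source element into the shape defined in schema-definitions
    public transformElement(element: ISourceElement) {
        return {
            id: element.id,
            name: element.name ?? null,
            updated_at: Date.now(),
        };
    }

    public transform(data: ISourceElement[]) {
        return data.map((element) => this.transformElement(element));
    }
}

// modules/NewDataset/NewDatasetWorker.ts (sketch)
class NewDatasetWorker {
    private transformation = new NewDatasetTransformation();
    // private dataSource = ...; // defined only for PULL datasets
    // private model = ...;      // persistence model bound to the schema definition

    // Handler for messages from the queue registered in queueDefinitions.ts
    public async saveDataToDB(msg: { content: Buffer }): Promise<void> {
        const inputData: ISourceElement[] = JSON.parse(msg.content.toString());
        const transformedData = this.transformation.transform(inputData);
        // await this.model.save(transformedData);
    }
}

export { NewDatasetTransformation, NewDatasetWorker };
```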

5. Definition of a cron task
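
For PULL datasets, the cron task typically triggers the worker on a schedule by publishing a message to its queue. Below is a hypothetical sketch using node-cron and amqplib; the platform's actual scheduler, configuration format, and queue names may differ:

```typescript
// Hypothetical schedule: every 10 minutes, tell the NewDataset worker
// to refresh its data (PULL). node-cron is used here only for illustration.
import amqplib from "amqplib";
import cron from "node-cron";

const QUEUE_NAME = "dataplatform.newdataset.refreshDataInDB"; // hypothetical

cron.schedule("*/10 * * * *", async () => {
    const connection = await amqplib.connect(process.env.RABBIT_CONN ?? "amqp://localhost");
    const channel = await connection.createChannel();
    await channel.assertQueue(QUEUE_NAME, { durable: true });
    channel.sendToQueue(QUEUE_NAME, Buffer.from("{}"), { persistent: true });
    await channel.close();
    await connection.close();
});
```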

6. Output Gateway

See: gitlab.com/operator-ict/golemio/code/output-gateway/blob/master/docs/new_dataset_integration.md.
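
For illustration, here is a minimal sketch of an output endpoint that exposes the stored data as a REST resource, assuming Express; the model, route, and query parameters are hypothetical, and the linked document defines the real conventions:

```typescript
// Hypothetical output endpoint: read transformed data from the database
// and return it to API consumers.
import express, { NextFunction, Request, Response } from "express";

const router = express.Router();

// Placeholder for a model bound to the NewDataset schema definition
const newDatasetModel = {
    findAll: async (limit: number, offset: number): Promise<unknown[]> => [],
};

router.get("/newdataset", async (req: Request, res: Response, next: NextFunction) => {
    try {
        const limit = Number(req.query.limit ?? 100);
        const offset = Number(req.query.offset ?? 0);
        res.status(200).send(await newDatasetModel.findAll(limit, offset));
    } catch (err) {
        next(err);
    }
});

export { router };
```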