Priciples
Rationale
Goal is to build up and keep updated the analytical database by regularly pull structured data from different sources.
This analytical database is our target database. This shouldn't be analytical, this can be migration data from legacy database to new database (into dedicated schema). In general this database (or schema(s)) is considered read-only for others but/except pulling/loading system.
Limitations
Target database is PostgreSQL database.
We don't consider it as limitation but pure fun. But there are lot of different backgrounds and goals and some may exclude PostgreSQL.
Possibilities (from PostgreSQL)
DataPuller is optimized for PostgreSQL and uses lot of its internal ways to think (existence of schemas, access rights system, temporary tables, hashing, uuid-ing, returning in inserts and updates).
DataPuller user can get benefits from other features of PostgreSQL:
- PostGIS extension for spatial data management
- parsing XML and JSON with SQL
- no network comsumption for internal "transformations"
Way to think - Everything is pulling
Everything is pulling.
- We can pull data from different sources (to our target database).
- And we can pull data from our own target database (to our target database) to generate new results (eg. aggregates or some other sort of excerpts)
The pulling task
One pulling task is meant to fullfil one target table in target database.
- Yes indeed, in past we have made several pulling tasks to fullfil the same table, but it makes overall system more fragile. So, we don't suggest it any more. Preferred way is to have N+1 tasks: N tasks to pull data into own tables, and last task to join them (eg. SQL UNION).
Loose sync (Brownian motion)
As data warehouse developer You cannot control what happenes inside source (operational information system) database. This lives its own life driven by changing needs.
DataPuller tries to pull one and unother table from one and other sources. Eventually at some point in time all data may be syncronized. However (depending how You build up system using DataPuller) some fresh data may still be missing.
- If Your analytics need long term trends then DataPuller is good.
- If they need last second data... then You may want to consider real-time replication or build analitical solution directly into source database.
But eventually data from sources are at some refreshness level in target database. Now happenes the controlled part. You can declare dependencies and after successfully loading one table the dependant pulling starts. You can clean, obscure, aggregate and eliminate existing data and get new tables with new data.