Data Storage


Data is stored in two places in Orchestral, each with its own use and structure.


Health Data Model (canonical database)

Stores data that has been processed through the ingestion pipelines in a PostgreSQL database.

Features

  • Individual data elements stored in subject domains specific to healthcare:

    • Audit - logs all data access, generating alerts if exceptions are detected.

    • Consent - records what data a patient has consented to make available, to whom, and for what purpose.

    • Business Interaction - interactions including encounters, procedures, referrals and immunizations.

    • Finance - financial information about the provision of healthcare including costs incurred and claim payment advisories.

    • Finding - clinical data including observations, vitals, conditions, allergies, problems and results.

    • Party - details of individuals and organizations involved in healthcare (e.g. patients, practitioners and facilities).

    • Provenance - source item and sender information stamped onto each data element.

    • Service Delivery - general information about the provision of healthcare including care teams and care plans.

    • Substance - details about specimens sent for testing and medications.

  • A relationship web of data that becomes more organized and contextualized as it grows.

  • Data accessed via GraphQL, MCP, and REST APIs; JSON event subscription (pub/sub); and write APIs.

  • ACID compliance for all data going in and out of the Health Data Model.
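
To illustrate one of the access paths above, the sketch below assembles a GraphQL request body for querying patient data from the HDM. This is a minimal sketch using only Python's standard library; the query fields (`patient`, `givenName`, `conditions`, etc.) and the `mrn` variable are illustrative assumptions, not the actual Orchestral schema.

```python
import json

# Hypothetical GraphQL query against the Health Data Model.
# Field names are illustrative assumptions, not the real schema.
query = """
query PatientFindings($mrn: String!) {
  patient(mrn: $mrn) {
    givenName
    familyName
    conditions {
      code
      onsetDate
    }
  }
}
"""

def build_request(mrn: str) -> str:
    """Serialize the query and variables into a GraphQL POST body."""
    return json.dumps({"query": query, "variables": {"mrn": mrn}})

# In practice this body would be POSTed to the GraphQL endpoint;
# here we only build and print it.
body = build_request("MRN-12345")
print(body)
```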

How it works

The Health Data Model (HDM) comes fully formed and ready to use with Orchestral. It was built by healthcare data specialists with decades of experience and has been tested on real-world data.

It can be extended to meet unique needs in the Domain Modeller application, a visual interface for extending the HDM without writing code: drag boxes, add fields, create links between related data, then deploy the result to your environment.

The deployed HDM can be viewed in the Data Catalog application, a searchable, auto-generated schema explorer for instant data discovery. It shows you what your Health Data Model looks like, where every piece of data is stored, and how it links to related data.

Your ingested data can be explored using pre-built Jupyter Notebooks filled with queries, so you can start analyzing your data the moment you start ingesting it.
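
The kind of analysis such a notebook might run can be sketched in plain Python. The rows below are shaped like records from the Finding domain, but the field names (`patient_id`, `code`, `display`) are illustrative assumptions rather than the actual HDM schema.

```python
from collections import Counter

# Sample rows shaped like Finding-domain records; field names are
# illustrative assumptions, not the actual HDM schema.
findings = [
    {"patient_id": "p1", "code": "8480-6", "display": "Systolic BP"},
    {"patient_id": "p1", "code": "8462-4", "display": "Diastolic BP"},
    {"patient_id": "p2", "code": "8480-6", "display": "Systolic BP"},
]

def count_by_code(rows):
    """Count how many findings were recorded for each observation code."""
    return Counter(row["code"] for row in rows)

counts = count_by_code(findings)
print(counts)  # Counter({'8480-6': 2, '8462-4': 1})
```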

Data Lake

Stores raw data, datasets extracted from the Health Data Model, and files/blobs in Amazon S3 storage buckets.

Features

  • Raw data backup - everything ingested into Orchestral is stored here first and can be traced back to the original file or message, allowing data to be replayed from the source.

  • Catalog and statistics entries - stored in compressed parquet files, all data tagged and traceable.

  • Processed datasets - just-in-time warehousing of datasets for quick analysis and reporting.

  • Anonymized datasets - deidentified datasets ready for analysis, dashboards, and external applications via secure API endpoints.

  • Secure access - data is accessed via Apache Spark SQL queries or direct file/blob access.
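
Since running Spark requires a live cluster, the sketch below only assembles the kind of Spark SQL statement used to read catalogued data from the Data Lake. The table name (`lake.raw_messages`) and provenance column are assumptions for illustration.

```python
# Sketch of a Spark SQL query over catalogued Data Lake parquet data.
# Table and column names are illustrative assumptions.
def lake_query(table: str, source_system: str) -> str:
    """Build a Spark SQL statement filtering a Data Lake table by provenance."""
    return (
        f"SELECT * FROM {table} "
        f"WHERE provenance.source_system = '{source_system}'"
    )

sql = lake_query("lake.raw_messages", "hospital_a")
print(sql)
# Against a live cluster this would be executed with spark.sql(sql).
```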

How It Works

All raw data saved into the Data Lake S3 buckets is catalogued and stamped with its provenance (origin), ensuring the Data Lake never degrades into a data swamp.

Data is stored here during the ingestion process along with its catalogue entry. The Data Lake serves as a backup of the data stored in the Health Data Model, and every individual data item in the canonical database can be traced back to its originating file or message here.

When running queries (e.g. via Jupyter Notebooks), processed and anonymized datasets can be saved back to the Data Lake as just-in-time warehousing for rapid analysis, reporting, and delivery to external reporting applications.
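
As an illustration of preparing an anonymized dataset before saving it back, the sketch below strips direct identifiers from each record. The set of identifying fields (`name`, `mrn`, `date_of_birth`) is an assumption; a real deidentification pipeline would follow a defined policy rather than a hard-coded list.

```python
# Sketch of deidentifying a processed dataset before writing it back
# to the Data Lake. Identifying field names are illustrative assumptions.
IDENTIFYING_FIELDS = {"name", "mrn", "date_of_birth"}

def deidentify(rows):
    """Strip direct identifiers from each record, keeping clinical fields."""
    return [
        {k: v for k, v in row.items() if k not in IDENTIFYING_FIELDS}
        for row in rows
    ]

rows = [{"mrn": "MRN-1", "name": "Pat Doe", "code": "8480-6", "value": 128}]
clean = deidentify(rows)
print(clean)  # [{'code': '8480-6', 'value': 128}]
```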