Ingestion Pipeline

The ingestion pipeline is the part of Orchestral that does the heavy lifting: it takes your data from all of your sources, in all of their different formats, through the appropriate connector, then moves it through a series of services to store it in the canonical database, our Health Data Model (HDM).

Data can come from any source, in any format (FHIR, HL7, CCDA, JSON, CSV, XML, OMOP, Images, VCF / BAM, Audio / Video, Text / Word, Excel, PDF, email, and others), through any arrival method or connector: stream (AWS SQS/Kafka, asynchronous messaging), batch (files), ad-hoc uploads, API, events, or database.

[Diagram: Ingestion Pipeline]

Features

  • Accepts data from any source, in any format.

  • Validates data at the front door and raises any issues on the data quality dashboard.

  • All raw data is tagged and saved into a Data Lake.

  • ETL - extracts, transforms and loads data into the canonical database, the HDM.

How it works

Configuration

Pipelines are configured specifically for each type of data, but can accept that type of data from any data provider.

For example, a pipeline for Medicaid provider registry CSV (batch) files is configured to understand exactly what data is required, the shape it should be in, how to handle each individual data item, and the relationships items have to each other and to any data that may already be in the Health Data Model.
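
To make this concrete, the sketch below shows the kind of information such a configuration captures. It is an illustration only - the field names, rules, and handler references are assumptions for the example, not Orchestral's actual configuration format.

```python
# Hypothetical sketch of a pipeline configuration for Medicaid provider
# registry CSV (batch) files. Names and structure are illustrative only.
medicaid_provider_registry_pipeline = {
    "pipeline_id": "medicaid-provider-registry-csv",
    "arrival_type": "batch",           # CSV files delivered in batches
    "expected_format": "csv",
    "required_columns": ["npi", "provider_name", "address_line_1", "city", "state", "zip"],
    "column_rules": {
        "npi": {"type": "string", "pattern": r"^\d{10}$"},          # 10-digit NPI
        "zip": {"type": "string", "pattern": r"^\d{5}(-\d{4})?$"},  # ZIP or ZIP+4
    },
    # How items relate to each other and to data already in the HDM.
    "relationships": {
        "npi": {"hdm_domain": "party", "match_on": "identifier"},
        "address_line_1": {"hdm_domain": "location", "deduplicate": True},
    },
}
```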

The following tools are used to understand and configure your platform:

  • Environment Manager - the interface to manage your data providers, data items, and pipeline configurations in your Orchestral environment. Choose from a list of pre-built pipelines to get up and running quickly.

  • Data Catalogue - browse and understand your HDM.

  • Domain Modeller - our HDM comes out-of-the-box with a comprehensive set of subject domains to handle your healthcare data. We also provide a visual interface to extend the HDM without the need to write any code, which means you won’t run into limitations with processing all your data.

  • Domain Mapper - our pre-built pipelines come with the mapping ready to go, but if you need to extend or build your own pipelines, or tweak an existing one, the Domain Mapper interface lets you configure exactly where each value in your source data is mapped into the HDM.

Ingestion pipeline services

Once configured, you can start ingesting data through your ingestion pipelines.

Each service in the pipeline has a specific task.

File preparation

This first service is optional and is configured when the pipeline will be ingesting non-standard data. The file preparation service transforms and cleans the data to the relevant industry standard specification, and any source data that does not fit into the standard is carried along with the standardized data through the ingestion pipeline, ensuring a lossless process. No data left behind!
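
A minimal sketch of the lossless idea, assuming a simple record-per-row source; the field names and the "carried_forward" key are assumptions for the example, not the actual file preparation output.

```python
# Illustrative only: normalize recognized fields to a standard shape while
# carrying unrecognized source fields along, so nothing is dropped.
STANDARD_FIELDS = {"patient_id", "given_name", "family_name", "birth_date"}

def prepare_record(raw: dict) -> dict:
    standardized = {k: v for k, v in raw.items() if k in STANDARD_FIELDS}
    extras = {k: v for k, v in raw.items() if k not in STANDARD_FIELDS}
    # Non-standard data travels with the record through the rest of the pipeline.
    return {"standard": standardized, "carried_forward": extras}

print(prepare_record({"patient_id": "123", "given_name": "Ana", "local_ward_code": "W7"}))
```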

Validation

At this step the data is checked and compared against the expected structure, which is configured for the pipeline. If validation fails, the data is flagged for review and goes no further. Any items that fail validation are highlighted in the data quality dashboard.
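
As a rough illustration of this step, the sketch below checks records against an expected structure and collects failures for the dashboard; the required fields and the shape of the flagged entries are assumptions for the example.

```python
# Illustrative validation pass: records that meet the expected structure
# continue; records that fail are flagged and surfaced on the data quality
# dashboard instead of moving further down the pipeline.
REQUIRED_FIELDS = {"npi", "provider_name", "state"}

def validate(records: list[dict]) -> tuple[list[dict], list[dict]]:
    passed, flagged = [], []
    for record in records:
        missing = REQUIRED_FIELDS - record.keys()
        if missing:
            flagged.append({"record": record,
                            "issues": [f"missing field: {name}" for name in sorted(missing)]})
        else:
            passed.append(record)
    return passed, flagged
```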

Catalog

Now that the data has been cleared for ingestion, a catalog entry is created that records:

  • Metadata - details about the batch / document / window, including the source file. This includes the catalog entry ID, pipeline ID, sender, receiver, arrival timestamps, size, and arrival type (i.e. batch, stream, document).

  • Basic statistics - details about the items in the batch / document / window including count and total size.

    • For window (stream), statistics are also calculated for mean size, minimum size, maximum size, and standard deviation.

This becomes part of its provenance record. Any single item of data in the HDM can be traced back to this point, this catalog entry, and the raw copy of the data that is saved in the Data Lake.
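
For illustration only, a catalog entry could be modelled along these lines; the class and field names are assumptions, not the actual catalog schema.

```python
from dataclasses import dataclass
from datetime import datetime, timezone

# Illustrative shape of a catalog entry recorded for a batch arrival.
@dataclass
class CatalogEntry:
    catalog_entry_id: str
    pipeline_id: str
    sender: str
    receiver: str
    arrival_type: str                # "batch", "stream" or "document"
    arrived_at: datetime
    size_bytes: int
    item_count: int
    source_file: str | None = None   # present for batch / document arrivals

entry = CatalogEntry(
    catalog_entry_id="cat-0001",
    pipeline_id="medicaid-provider-registry-csv",
    sender="state-medicaid-agency",
    receiver="orchestral",
    arrival_type="batch",
    arrived_at=datetime.now(timezone.utc),
    size_bytes=1_048_576,
    item_count=5_000,
    source_file="providers_2024_06.csv",
)
```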

Statistics

Optional and configurable in-depth statistics can be recorded against any data that arrives in a pipeline. These include the following (a short profiling sketch follows the list):

  • Text - analyses all values in a specific string field / attribute. This includes minimum length, maximum length, count (how many strings), and missing count (i.e. number of empties) in that attribute across the batch / window.

  • Categorical - analyses distinct values in a specific string field / attribute. Results are stored in separate tables for:

    • Descriptive - statistics on the specific attribute. This includes count (of all values) and missing count (i.e. number of empties).

    • Value count - statistics on each value stored in the specific attribute. This includes the value (as stored) and count (of each value).

  • Numeric - analyses all values in a specific number (double or integer) field / attribute. This includes minimum, maximum, average, standard deviation, count (how many numbers), and missing count (i.e. number of empties) in that attribute across the batch / window.
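
As a rough sketch of how these profiles could be computed for one attribute across a batch / window; the function and key names are assumptions, not the actual statistics schema.

```python
from collections import Counter
from statistics import mean, pstdev

# Illustrative profiling of a single attribute's values across a batch / window.
def profile_text(values):
    present = [v for v in values if v not in (None, "")]
    return {
        "count": len(present),
        "missing_count": len(values) - len(present),
        "min_length": min((len(v) for v in present), default=None),
        "max_length": max((len(v) for v in present), default=None),
    }

def profile_categorical(values):
    present = [v for v in values if v not in (None, "")]
    return {
        "descriptive": {"count": len(present), "missing_count": len(values) - len(present)},
        "value_counts": dict(Counter(present)),   # value (as stored) -> count
    }

def profile_numeric(values):
    present = [v for v in values if v is not None]
    return {
        "count": len(present),
        "missing_count": len(values) - len(present),
        "min": min(present, default=None),
        "max": max(present, default=None),
        "average": mean(present) if present else None,
        "std_dev": pstdev(present) if present else None,
    }
```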

Store raw data

At this point in the pipeline, a copy of the raw data (file, message) is stored in the Data Lake along with its catalog entry and (if applicable) statistics entry.

Learn about ⑤ Data Storage

Parser

Performs document-level actions to optimize the data elements, e.g. identifying data types (string, int64) and standardizing dates. This is a more fine-tuned preparation of the data to get it ready for the next service.
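
A simple sketch of this kind of preparation, assuming string input values; the supported date formats and the fallback rules are illustrative only.

```python
from datetime import datetime

# Illustrative document-level parsing: infer basic types and standardize
# dates to ISO 8601. The formats handled here are an assumption.
def parse_value(value: str):
    for fmt in ("%m/%d/%Y", "%d-%m-%Y", "%Y%m%d"):
        try:
            return datetime.strptime(value, fmt).date().isoformat()
        except ValueError:
            pass
    try:
        return int(value)      # int64-style values
    except ValueError:
        return value           # leave as a plain string

print(parse_value("06/30/2024"))   # -> "2024-06-30"
print(parse_value("12345"))        # -> 12345
```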

Object creation (interim object service)

Uses a specific schema, configured in the pipeline, to sort all of the data values from the raw data item into a single, organized JSON file.

During this process, this service can complete basic if/then actions and add context to information, such as whether it came from a clinical origin or not.

This step ensures that no matter which format the file arrives in, it is organized and prepared for handing off to the domain services.
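
To illustrate the idea, the sketch below uses a toy schema to sort raw values into a single organized JSON structure and applies a basic if/then rule to tag clinical origin; the schema paths and source-system names are assumptions for the example.

```python
import json

# Illustrative interim-object construction: the configured schema decides
# where each raw value lands, and a simple if/then rule adds context.
SCHEMA = {
    "given_name": ("person", "name", "given"),
    "family_name": ("person", "name", "family"),
    "systolic_bp": ("observations", "blood_pressure", "systolic"),
}

def build_interim_object(raw: dict, source_system: str) -> str:
    obj: dict = {}
    for source_key, path in SCHEMA.items():
        if source_key not in raw:
            continue
        node = obj
        for part in path[:-1]:
            node = node.setdefault(part, {})
        node[path[-1]] = raw[source_key]
    # Basic if/then context: tag whether the data came from a clinical origin.
    obj["context"] = {"clinical_origin": source_system in {"ehr", "lab"}}
    return json.dumps(obj, indent=2)

print(build_interim_object({"given_name": "Ana", "systolic_bp": 120}, source_system="ehr"))
```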

Mapping

The domains use the mapping definitions, which have been configured in the Domain Mapper, to determine exactly which elements in the data belong where in the Health Data Model, and which handler is required to apply the business logic to process the data.
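
A mapping definition of this kind could look roughly like the sketch below; the domain, attribute, and handler names are placeholders, not the actual Domain Mapper configuration.

```python
# Illustrative mapping definitions: each source element points at an HDM
# target and names the handler that applies the business logic.
MAPPING = {
    "person.name.given":  {"hdm_domain": "party",    "attribute": "given_name",  "handler": "person_handler"},
    "person.name.family": {"hdm_domain": "party",    "attribute": "family_name", "handler": "person_handler"},
    "provider.npi":       {"hdm_domain": "party",    "attribute": "identifier",  "handler": "practitioner_handler"},
    "address.line_1":     {"hdm_domain": "location", "attribute": "street",      "handler": "address_handler"},
}

def route(element_path: str, value):
    """Return the responsible handler and the HDM target for one element."""
    target = MAPPING[element_path]
    return target["handler"], (target["hdm_domain"], target["attribute"], value)
```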

Domain services store data

This is the part of the pipeline that does the bulk of the work and makes use of the years of experience our team has with healthcare data.

There is a range of specialized service coordinators and handlers that act as gatekeepers for their relevant subject domains. The domain services store data into the Health Data Model (the canonical database) and largely work in parallel; the data is not stored all at once as it arrived, but as each individual component is processed.

Learn about ⑤ Data Storage

Reduce duplication

Before saving data, the handlers check they have enough information to create a valid record, and then check whether a record already exists. If there is no existing record, the relevant handler creates one. If there is an existing record, the handler checks whether it needs to be updated. This ensures we are not saving duplicate records when one suffices. For example - a physical address is unique, so the domain service checks whether that address has already been recorded; if it has, it doesn't create a new record - it just creates a relationship to it.
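
A minimal sketch of that get-or-create behaviour for addresses, using in-memory stand-ins for the HDM; the store, key, and function names are assumptions for the example.

```python
# Illustrative get-or-create for a unique record such as a physical address:
# reuse the existing record and link to it rather than creating a duplicate.
addresses: dict[str, dict] = {}            # stand-in for the HDM address records
relationships: list[tuple[str, str]] = []  # (party_id, address_key) links

def store_address(party_id: str, address: dict) -> str:
    key = "|".join(str(address.get(f, "")).strip().lower()
                   for f in ("line_1", "city", "state", "zip"))
    if key not in addresses:               # no existing record, so create one
        addresses[key] = address
    relationships.append((party_id, key))  # always link, never duplicate
    return key

store_address("party-1", {"line_1": "1 Main St", "city": "Austin", "state": "TX", "zip": "78701"})
store_address("party-2", {"line_1": "1 Main St", "city": "Austin", "state": "TX", "zip": "78701"})
print(len(addresses), len(relationships))  # 1 address record, 2 relationships
```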

Smart matching logic

As each coordinator and handler is designed to understand its specific area, they are configured with business logic that knows how healthcare data works. A person can be both a patient and a practitioner, so the domain service will check the Party Domain not only for existing patient records, but for any person record that meets the rigorous matching criteria. If a match is found, a relationship between the records is formed - they are the same person after all! Where there is any ambiguity, the record is flagged for review by a data steward; this reduces the risk of falsely linking or overwriting records that can happen in other systems.
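
As a simplified illustration of the decision the handlers make (the real matching criteria are far more rigorous than the two fields used here):

```python
# Illustrative matching decision: link when exactly one existing person record
# meets the criteria, flag for a data steward when the result is ambiguous.
def match_person(incoming: dict, existing_people: list[dict]) -> dict:
    candidates = [
        person for person in existing_people
        if person["family_name"].lower() == incoming["family_name"].lower()
        and person["birth_date"] == incoming["birth_date"]
    ]
    if len(candidates) == 1:
        return {"action": "link", "match": candidates[0]}     # same person: form a relationship
    if len(candidates) > 1:
        return {"action": "flag_for_review", "candidates": candidates}
    return {"action": "create_new"}
```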

Through this smart logic, a relationship web is formed between pieces of data that could otherwise remain disparate - patient to practitioner, practitioner to healthcare provider, healthcare provider to insurance provider, patient to next-of-kin (who might themselves be an existing known patient elsewhere), and much more. As more data is added, the relationship web grows.

The ability to accurately match incoming data to existing records can be further enhanced by adding in the Indexity product to handle the identity management logic.

Learn about ② Indexity

Terminology

The Code Resolver Service will resolve any codes against our in-built library of code sets (valuesets and vocabularies) and store the result in the Health Data Model.
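
A toy sketch of what code resolution looks like conceptually; the code sets shown are tiny stand-ins, not the in-built library.

```python
# Illustrative code resolution against a library of code sets.
CODE_SETS = {
    "loinc": {"8867-4": "Heart rate"},
    "icd-10": {"E11.9": "Type 2 diabetes mellitus without complications"},
}

def resolve_code(system: str, code: str) -> dict:
    display = CODE_SETS.get(system.lower(), {}).get(code)
    if display is None:
        return {"system": system, "code": code, "resolved": False}
    return {"system": system, "code": code, "display": display, "resolved": True}

print(resolve_code("ICD-10", "E11.9"))
```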

Learn about ③ Code Resolver