Documentation: Data Set Transformations

Data set transformations allow admins to create new (derived) data sets by applying various operations such as filtering, field manipulation, and joining. To create, edit, delete, schedule, or execute a data set transformation, go to Admin→Data Set Transformations.

To create a new transformation, select the desired type from the drop-down on the right side. Once it is configured, execute the transformation by clicking the run button in its row of the list table. Each transformation produces a result data set with a configurable data set id, name, and storage type.

Transformations can be scheduled for periodic execution (daily at a specified time) and support stream-based processing for large data sets.

Note

Data set transformations can be created and executed only by admins.

Copy
Filter
Drop Fields
Rename Fields
Change Field Types
Change Field Enums
Infer
Link Two Data Sets
Link Sorted Two Data Sets
Link Multi Data Sets
Link Sorted Multi Data Sets
Merge Multi Data Sets
Merge Fully Multi Data Sets
Match Groups With Confounders

Copy

Creates a complete copy of a source data set. This is useful as a starting point for further modifications or as a backup before applying destructive changes.

Filter

Creates a new data set containing only the rows that match specified filter conditions. This allows you to create subsets of data based on criteria such as field values, ranges, or combinations using AND logic.
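To illustrate how AND-combined conditions select rows, here is a minimal Python sketch that filters rows held as dictionaries. The (field, comparison, value) condition format and the name filter_rows are assumptions made for this example; they are not the application's configuration format.

```python
import operator

# Hypothetical condition format: (field name, comparison symbol, value).
COMPARISONS = {
    "==": operator.eq,
    "!=": operator.ne,
    "<": operator.lt,
    "<=": operator.le,
    ">": operator.gt,
    ">=": operator.ge,
}


def filter_rows(rows, conditions):
    """Return only the rows that satisfy every condition (AND logic)."""
    return [
        row
        for row in rows
        if all(COMPARISONS[symbol](row[field], value) for field, symbol, value in conditions)
    ]


patients = [
    {"id": 1, "age": 34, "gender": "F"},
    {"id": 2, "age": 61, "gender": "M"},
    {"id": 3, "age": 58, "gender": "F"},
]

# Keep female patients aged 50 or older.
older_women = filter_rows(patients, [("gender", "==", "F"), ("age", ">=", 50)])
```

A row is kept only if every condition holds, so adding conditions can only shrink the result.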

Drop Fields

Creates a new data set with specified fields removed. Use this to strip unnecessary or sensitive columns from a data set.
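The effect on the data can be sketched in a few lines of Python. The function name drop_fields and the sample fields are hypothetical; the sketch only shows that the source rows stay intact while the result omits the listed fields.

```python
def drop_fields(rows, fields_to_drop):
    """Return new rows without the listed fields; the input rows are not modified."""
    dropped = set(fields_to_drop)
    return [
        {name: value for name, value in row.items() if name not in dropped}
        for row in rows
    ]


source = [{"id": 1, "name": "Alice", "email": "alice@example.com", "age": 34}]

# Strip the identifying columns before sharing the data set.
result = drop_fields(source, ["name", "email"])
```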

Rename Fields

Creates a new data set with specified fields renamed using old-to-new name mappings. This is useful for standardizing field names across data sets or making them more descriptive.
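An old-to-new mapping can be pictured as a dictionary applied to every row. In this Python sketch the function name rename_fields and the field names are made up for illustration; fields absent from the mapping keep their original names.

```python
def rename_fields(rows, name_mapping):
    """Return new rows with fields renamed according to an old-to-new mapping."""
    return [
        {name_mapping.get(name, name): value for name, value in row.items()}
        for row in rows
    ]


source = [{"pt_id": 7, "dob": "1980-02-11", "sex": "F"}]
result = rename_fields(source, {"pt_id": "patient_id", "dob": "date_of_birth"})
```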

Change Field Types

Creates a new data set where specified fields have their data types changed (e.g., from String to Integer, or from Integer to Enum). This is helpful when automatic type inference did not produce the desired result.
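As a rough picture of a type change, the following Python sketch converts string values to numbers. The converter table, the "Double" type name, and the rule that missing values stay missing are assumptions of this example, not documented behavior.

```python
# Hypothetical type names mapped to Python converters.
CONVERTERS = {"Integer": int, "Double": float, "String": str}


def change_field_types(rows, new_types):
    """Return new rows in which the listed fields are converted to the requested types."""
    converted = []
    for row in rows:
        new_row = dict(row)
        for field, type_name in new_types.items():
            value = new_row[field]
            # Assumption: missing values stay missing instead of raising an error.
            new_row[field] = None if value is None else CONVERTERS[type_name](value)
        converted.append(new_row)
    return converted


source = [{"id": "1", "weight": "70.5"}, {"id": "2", "weight": None}]
result = change_field_types(source, {"id": "Integer", "weight": "Double"})
```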

Change Field Enums

Updates enumeration values of specified fields in-place (without creating a new data set). Use this to relabel categories, e.g., renaming "0" and "1" to "Control" and "Case".
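The relabeling example can be sketched in Python under one loud assumption: that an enum field stores a code per row and keeps a separate code-to-label table in its metadata. That storage model, the metadata layout, and the name change_field_enums are hypothetical.

```python
# Hypothetical field metadata: each enum field maps a stored code to a display label.
field_metadata = {
    "group": {"type": "Enum", "enum": {"0": "0", "1": "1"}},
}


def change_field_enums(metadata, field, new_labels):
    """Relabel enum values of one field in place; the stored codes are unchanged."""
    enum = metadata[field]["enum"]
    for code, label in new_labels.items():
        if code not in enum:
            raise KeyError(f"unknown enum code {code!r} for field {field!r}")
        enum[code] = label


change_field_enums(field_metadata, "group", {"0": "Control", "1": "Case"})
```

Because only the label table is touched in this model, no rows need to be rewritten, which is consistent with the operation not creating a new data set.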

Infer

Re-infers field types, enumerations, and statistics from the source data. This is useful after manual data modifications or when the original type inference needs to be refreshed.
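To give a feel for what type inference involves, this Python sketch guesses a type for each column of raw string values. The rules (try Integer, then Double, else String; ignore empty values) and the type names are assumptions for illustration and may differ from the application's actual inference.

```python
def infer_type(values):
    """Guess the narrowest type that fits every non-missing string value."""
    present = [v for v in values if v not in (None, "")]
    if not present:
        return "String"  # Assumption: a column with no values defaults to String.
    for type_name, parse in (("Integer", int), ("Double", float)):
        try:
            for value in present:
                parse(value)
            return type_name
        except ValueError:
            continue
    return "String"


columns = {
    "age": ["34", "61", ""],
    "weight": ["70.5", "82"],
    "city": ["Oslo", "Rome"],
}
inferred = {name: infer_type(values) for name, values in columns.items()}
```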

Link Two Data Sets

Joins two data sets on specified key fields, similar to a SQL join. You specify the left and right data sets along with the fields to join on. The result data set contains the combined fields from both sources.
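The join can be sketched in Python as a hash join over rows held as dictionaries. The text does not say which join type is used, so this example assumes inner-join semantics; the function name link_two and the rule that left values win on a field-name clash are also assumptions.

```python
from collections import defaultdict


def link_two(left_rows, right_rows, left_key, right_key):
    """Inner-join two lists of dict rows on one key field using a hash index."""
    index = defaultdict(list)
    for right in right_rows:
        index[right[right_key]].append(right)

    joined = []
    for left in left_rows:
        for right in index.get(left[left_key], []):
            # Assumption: on a field-name clash the left value wins.
            joined.append({**right, **left})
    return joined


patients = [{"patient_id": 1, "age": 34}, {"patient_id": 2, "age": 61}]
visits = [
    {"pid": 1, "visit": "2020-01-05"},
    {"pid": 1, "visit": "2020-03-10"},
    {"pid": 3, "visit": "2020-02-01"},
]
linked = link_two(patients, visits, "patient_id", "pid")
```

Patient 1 appears twice in the result because two visits share its key; patient 2 and the visit for key 3 have no partner and are dropped under the inner-join assumption.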

Link Sorted Two Data Sets

Joins two pre-sorted data sets on specified key fields. This variant is optimized for data sets that are already sorted by the join key, enabling more efficient processing for large data sets.
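Why sorting helps can be seen in a merge join, sketched here in Python: both inputs are read once, front to back, without building an index in memory. This is a generic merge join, not necessarily the application's algorithm, and it assumes ascending order, unique keys on each side, and inner-join semantics; the name link_sorted_two is hypothetical.

```python
def link_sorted_two(left_rows, right_rows, key):
    """Inner-join two inputs sorted ascending by `key`, assuming unique keys per side."""
    left_iter, right_iter = iter(left_rows), iter(right_rows)
    left, right = next(left_iter, None), next(right_iter, None)
    while left is not None and right is not None:
        if left[key] < right[key]:
            left = next(left_iter, None)    # left key has no partner; advance left
        elif left[key] > right[key]:
            right = next(right_iter, None)  # right key has no partner; advance right
        else:
            yield {**left, **right}
            left, right = next(left_iter, None), next(right_iter, None)


left_set = [{"id": 1, "a": "x"}, {"id": 2, "a": "y"}, {"id": 4, "a": "z"}]
right_set = [{"id": 2, "b": 10}, {"id": 3, "b": 20}, {"id": 4, "b": 30}]
linked = list(link_sorted_two(left_set, right_set, "id"))
```

Because the function is a generator that holds only one row from each side at a time, it also works when the inputs are streams rather than in-memory lists.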

Link Multi Data Sets

Joins multiple data sets on specified key fields. This extends the two-data-set join to handle any number of source data sets in a single operation.

Link Sorted Multi Data Sets

Joins multiple pre-sorted data sets on specified key fields. Like the sorted two-data-set variant, this is optimized for inputs that are already sorted by the join key.

Merge Multi Data Sets

Merges multiple data sets by appending rows (union-like operation) with explicit field name mappings. Use this when combining data sets that have different field names but represent the same data.
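A Python sketch of appending rows with explicit mappings follows. Each source comes with its own mapping from source field name to result field name; that pairing, the name merge_multi, and the choice to drop unmapped fields are assumptions of this example.

```python
def merge_multi(sources):
    """Append rows from several sources, renaming fields to shared result names.

    `sources` is a list of (rows, mapping) pairs, where mapping is
    {source field name: result field name}. Unmapped fields are dropped here.
    """
    merged = []
    for rows, mapping in sources:
        for row in rows:
            merged.append({result: row[source] for source, result in mapping.items()})
    return merged


site_a = [{"pid": 1, "sex": "F"}]
site_b = [{"patient": 2, "gender": "M"}]

merged = merge_multi([
    (site_a, {"pid": "patient_id", "sex": "gender"}),
    (site_b, {"patient": "patient_id", "gender": "gender"}),
])
```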

Merge Fully Multi Data Sets

Merges multiple data sets with automatic field matching. All fields from all source data sets are included in the result, with automatic alignment of matching field names.
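Automatic alignment by field name can be sketched in Python as a union of all field names, with rows appended in source order. Filling fields that a source lacks with None is an assumption of this example, as is the name merge_fully.

```python
def merge_fully(*data_sets):
    """Append all rows, giving every row the union of all field names."""
    # Collect field names in first-seen order.
    field_names = []
    for rows in data_sets:
        for row in rows:
            for name in row:
                if name not in field_names:
                    field_names.append(name)

    # Assumption: a field missing from a source row is filled with None.
    return [
        {name: row.get(name) for name in field_names}
        for rows in data_sets
        for row in rows
    ]


first = [{"id": 1, "age": 34}]
second = [{"id": 2, "weight": 70.5}]
merged = merge_fully(first, second)
```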

Match Groups With Confounders

Creates a new data set by matching groups while controlling for confounding variables. This is commonly used in clinical studies to create balanced cohorts (e.g., matching cases and controls by age and gender).
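One simple way to build such a balanced cohort is greedy 1:1 exact matching, sketched below in Python. This is an illustrative technique only; the text does not describe the matching algorithm the application uses, and the function name, parameters, and exact-equality rule are assumptions.

```python
from collections import defaultdict


def match_groups(rows, group_field, case_value, control_value, confounders):
    """Pair each case with one unused control that has identical confounder values."""
    # Index the controls by their confounder values.
    controls = defaultdict(list)
    for row in rows:
        if row[group_field] == control_value:
            controls[tuple(row[c] for c in confounders)].append(row)

    matched = []
    for row in rows:
        if row[group_field] != case_value:
            continue
        candidates = controls.get(tuple(row[c] for c in confounders))
        if candidates:
            # Each control is used at most once; unmatched cases are left out.
            matched.append(row)
            matched.append(candidates.pop(0))
    return matched


cohort = [
    {"id": 1, "group": "Case", "age": 50, "gender": "F"},
    {"id": 2, "group": "Case", "age": 60, "gender": "M"},
    {"id": 3, "group": "Control", "age": 50, "gender": "F"},
    {"id": 4, "group": "Control", "age": 50, "gender": "F"},
    {"id": 5, "group": "Control", "age": 70, "gender": "M"},
]
balanced = match_groups(cohort, "group", "Case", "Control", ["age", "gender"])
```

In this sample, case 1 is paired with control 3, while case 2 is dropped because no control shares its age and gender.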
