Data Set Transformations
Data set transformations allow admins to create new (derived) data sets by applying various operations such as filtering, field manipulation, and joining. To create, edit, delete, schedule, or execute a data set transformation, go to Admin→Data Set Transformations.
To create a new transformation, select the desired type from the drop-down on the right side. Once it is configured, execute the transformation by clicking the run button in its row of the list table. Each transformation produces a result data set with a configurable data set id, name, and storage type.
Transformations can be scheduled for periodic execution (daily at a specified time) and support stream-based processing for large data sets.
Transformation Types
- Copy
- Filter
- Drop Fields
- Rename Fields
- Change Field Types
- Change Field Enums
- Infer
- Link Two Data Sets
- Link Sorted Two Data Sets
- Link Multi Data Sets
- Link Sorted Multi Data Sets
- Merge Multi Data Sets
- Merge Fully Multi Data Sets
- Match Groups With Confounders
Copy
Creates a complete copy of a source data set. This is useful as a starting point for further modifications or as a backup before applying destructive changes.
Filter
Creates a new data set containing only the rows that match the specified filter conditions. Conditions can test field values or ranges, and multiple conditions are combined with AND logic, so a row is kept only if it satisfies all of them.
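The AND semantics can be illustrated with a minimal Python sketch. It is not the tool's implementation; the rows, field names, and the `filter_rows` helper are hypothetical and only show how several conditions combine.

```python
# Hypothetical rows; the field names are illustrative only.
rows = [
    {"id": 1, "age": 34, "gender": "F"},
    {"id": 2, "age": 61, "gender": "M"},
    {"id": 3, "age": 58, "gender": "F"},
]

# Each condition is a predicate over one row.
conditions = [
    lambda r: r["age"] >= 50,      # range condition
    lambda r: r["gender"] == "F",  # value condition
]


def filter_rows(rows, conditions):
    """Return a new list with the rows that satisfy every condition (AND)."""
    return [r for r in rows if all(cond(r) for cond in conditions)]


filtered = filter_rows(rows, conditions)
```

Only the row with id 3 satisfies both conditions, and the source list is left unchanged.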
Drop Fields
Creates a new data set with specified fields removed. Use this to strip unnecessary or sensitive columns from a data set.
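As a rough illustration of the effect (not the tool's own code), dropping fields amounts to copying every row without the listed keys. The `drop_fields` helper and the field names below are hypothetical.

```python
def drop_fields(rows, fields_to_drop):
    """Return new rows without the listed fields; the source rows are untouched."""
    drop = set(fields_to_drop)
    return [{k: v for k, v in r.items() if k not in drop} for r in rows]


# Hypothetical data with a sensitive column.
patients = [{"id": 1, "name": "A. Smith", "age": 34}]
anonymized = drop_fields(patients, ["name"])
```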
Rename Fields
Creates a new data set with specified fields renamed using old-to-new name mappings. This is useful for standardizing field names across data sets or making them more descriptive.
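The old-to-new mapping can be pictured as a dictionary lookup applied to every field name. The sketch below is illustrative only; `rename_fields` and the field names are assumptions, not part of the product.

```python
def rename_fields(rows, mapping):
    """Apply an old-name -> new-name mapping; unmapped fields keep their name."""
    return [{mapping.get(k, k): v for k, v in r.items()} for r in rows]


# Hypothetical source rows with terse field names.
source = [{"pid": 1, "dob_yr": 1980}]
renamed = rename_fields(source, {"pid": "patient_id", "dob_yr": "birth_year"})
```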
Change Field Types
Creates a new data set where specified fields have their data types changed (e.g., from String to Integer, or from Integer to Enum). This is helpful when automatic type inference did not produce the desired result.
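A String to Integer change can be sketched as follows. This is a simplified illustration: the `change_field_types` helper is hypothetical, and treating empty strings as missing values is an assumption rather than documented behavior.

```python
def change_field_types(rows, converters):
    """Return new rows with each listed field passed through its converter.

    Empty strings and None are kept as None (an assumption about missing values).
    """
    result = []
    for r in rows:
        new_row = dict(r)
        for field, convert in converters.items():
            value = new_row.get(field)
            new_row[field] = None if value in (None, "") else convert(value)
        result.append(new_row)
    return result


# Hypothetical rows where every value was imported as a string.
raw = [{"id": "1", "age": "34"}, {"id": "2", "age": ""}]
typed = change_field_types(raw, {"id": int, "age": int})
```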
Change Field Enums
Updates the enumeration values of specified fields in place, without creating a new data set. Use this to relabel categories, e.g., renaming "0" and "1" to "Control" and "Case".
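The relabeling example can be sketched in Python as below. How the tool stores enumerations internally is not described here, so the in-memory rows and the `relabel_in_place` helper are purely hypothetical; the point is that the existing data is modified rather than copied.

```python
def relabel_in_place(rows, field, labels):
    """Replace enum values of one field in the existing rows (no copy is made)."""
    for r in rows:
        r[field] = labels.get(r[field], r[field])


# Hypothetical rows with a binary group field.
study = [{"id": 1, "group": "0"}, {"id": 2, "group": "1"}]
relabel_in_place(study, "group", {"0": "Control", "1": "Case"})
```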
Infer
Re-infers field types, enumerations, and statistics from the source data. This is useful after manual data modifications or when the original type inference needs to be refreshed.
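To give a feel for what type inference involves, here is a deliberately simplified sketch. The rules, the `max_enum_size` threshold, and the "Double" and "Null" type names are assumptions made for illustration; the tool's actual inference logic is not documented in this section.

```python
def infer_type(values, max_enum_size=5):
    """Guess a field type from string values (simplified, hypothetical rules)."""
    non_empty = [v for v in values if v not in (None, "")]
    if not non_empty:
        return "Null"

    def all_parse(parse):
        # True if every non-empty value can be parsed by the given function.
        try:
            for v in non_empty:
                parse(v)
            return True
        except ValueError:
            return False

    if all_parse(int):
        return "Integer"
    if all_parse(float):
        return "Double"
    # Few distinct text values are treated as an enumeration.
    if len(set(non_empty)) <= max_enum_size:
        return "Enum"
    return "String"


integer_field = infer_type(["1", "2", ""])
enum_field = infer_type(["a", "b", "a"])
```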
Link Two Data Sets
Joins two data sets on specified key fields, similar to a SQL join. You specify the left and right data sets along with the fields to join on. The result data set contains the combined fields from both sources.
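The join can be pictured as a hash join, sketched below. The join type is not stated in this section, so the sketch assumes inner-join behavior (rows without a match are dropped); the `link` helper and the field names are hypothetical.

```python
from collections import defaultdict


def link(left, right, keys):
    """Inner-join two lists of rows on the given key fields (illustrative)."""
    # Index the right rows by their key tuple.
    index = defaultdict(list)
    for r in right:
        index[tuple(r[k] for k in keys)].append(r)

    result = []
    for l in left:
        for r in index.get(tuple(l[k] for k in keys), []):
            merged = dict(l)
            for name, value in r.items():
                merged.setdefault(name, value)  # left value wins on a name clash
            result.append(merged)
    return result


# Hypothetical data sets sharing a patient_id key.
demographics = [{"patient_id": 1, "age": 34}, {"patient_id": 2, "age": 61}]
labs = [{"patient_id": 2, "glucose": 5.4}, {"patient_id": 3, "glucose": 6.1}]
linked = link(demographics, labs, ["patient_id"])
```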
Link Sorted Two Data Sets
Joins two pre-sorted data sets on specified key fields. This variant is optimized for data sets that are already sorted by the join key, enabling more efficient processing for large data sets.
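The efficiency gain from sorted inputs comes from being able to walk both data sets once, in step, instead of building an index. The sketch below shows a classic merge join under two assumptions that are not stated in the source: keys are unique within each input, and unmatched rows are dropped. The `link_sorted` helper is hypothetical.

```python
def link_sorted(left, right, key):
    """Merge-join two lists already sorted by `key` (unique keys assumed)."""
    result = []
    i = j = 0
    while i < len(left) and j < len(right):
        lk, rk = left[i][key], right[j][key]
        if lk < rk:
            i += 1  # left row has no partner, advance left
        elif lk > rk:
            j += 1  # right row has no partner, advance right
        else:
            result.append({**right[j], **left[i]})
            i += 1
            j += 1
    return result


# Hypothetical inputs, both sorted by id.
sorted_left = [{"id": 1, "a": "x"}, {"id": 2, "a": "y"}, {"id": 4, "a": "z"}]
sorted_right = [{"id": 2, "b": 10}, {"id": 3, "b": 20}, {"id": 4, "b": 30}]
merged_sorted = link_sorted(sorted_left, sorted_right, "id")
```

Because each input is read once from start to end, this style of join also suits streaming over data sets that do not fit in memory.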
Link Multi Data Sets
Joins multiple data sets on specified key fields. This extends the two-data-set join to handle any number of source data sets in a single operation.
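One way to think about a multi-way link is to index every data set after the first and look each row of the first one up in all of them. The sketch below assumes a single key field, unique keys per data set, and inner-join behavior; none of these details are confirmed by the source, and `link_multi` is a hypothetical helper.

```python
def link_multi(data_sets, key):
    """Join any number of data sets on one key field (illustrative sketch)."""
    first, *rest = data_sets
    # One lookup table per remaining data set (unique keys assumed).
    indexes = [{r[key]: r for r in ds} for ds in rest]

    result = []
    for row in first:
        matches = [idx.get(row[key]) for idx in indexes]
        if all(m is not None for m in matches):
            merged = dict(row)
            for m in matches:
                for name, value in m.items():
                    merged.setdefault(name, value)
            result.append(merged)
    return result


# Hypothetical data sets sharing an id key.
ds_a = [{"id": 1, "age": 34}, {"id": 2, "age": 61}]
ds_b = [{"id": 1, "bmi": 22.5}, {"id": 2, "bmi": 27.0}]
ds_c = [{"id": 2, "smoker": False}]
linked_all = link_multi([ds_a, ds_b, ds_c], "id")
```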
Link Sorted Multi Data Sets
Joins multiple pre-sorted data sets on specified key fields. Like the sorted two-data-set variant, this is optimized for inputs that are already sorted by the join key.
Merge Multi Data Sets
Merges multiple data sets by appending rows (union-like operation) with explicit field name mappings. Use this when combining data sets that have different field names but represent the same data.
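The role of the explicit mappings can be sketched as follows. The mapping structure used here (one result field name plus the corresponding source field name in each data set) is a hypothetical representation chosen for illustration, not the tool's configuration format.

```python
def merge(data_sets, field_mappings):
    """Append rows from all data sets, renaming fields per the mappings.

    field_mappings: list of (result_name, [source_name for each data set]);
    a source name of None means the data set has no such field.
    """
    result = []
    for ds_index, rows in enumerate(data_sets):
        for r in rows:
            new_row = {}
            for result_name, source_names in field_mappings:
                source = source_names[ds_index]
                new_row[result_name] = r.get(source) if source is not None else None
            result.append(new_row)
    return result


# Hypothetical data sets describing the same data under different names.
site_a = [{"age": 30, "sex": "F"}]
site_b = [{"Age_years": 41, "gender": "M"}]
mappings = [("age", ["age", "Age_years"]), ("gender", ["sex", "gender"])]
merged = merge([site_a, site_b], mappings)
```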
Merge Fully Multi Data Sets
Merges multiple data sets with automatic field matching. All fields from all source data sets are included in the result, with automatic alignment of matching field names.
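Automatic matching by name can be illustrated like this: collect the union of all field names, then append every row, leaving fields a source does not have empty. Filling missing fields with `None` is an assumption, and `merge_fully` is a hypothetical helper.

```python
def merge_fully(data_sets):
    """Append all rows; the result has the union of all field names."""
    all_fields = []
    for rows in data_sets:
        for r in rows:
            for name in r:
                if name not in all_fields:
                    all_fields.append(name)  # keep first-seen order
    return [
        {name: r.get(name) for name in all_fields}
        for rows in data_sets
        for r in rows
    ]


# Hypothetical data sets with one shared field name and one unique field each.
cohort_1 = [{"id": 1, "age": 34}]
cohort_2 = [{"id": 2, "weight": 70}]
combined = merge_fully([cohort_1, cohort_2])
```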
Match Groups With Confounders
Creates a new data set by matching groups while controlling for confounding variables. This is commonly used in clinical studies to create balanced cohorts (e.g., matching cases and controls by age and gender).
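The cases-and-controls example can be sketched with a simple greedy 1:1 matching. The matching strategy the tool actually uses is not described in this section, so everything below is an assumption: exact matching on gender, nearest age within a tolerance, and the `match_controls` helper itself.

```python
def match_controls(cases, controls, max_age_diff=5):
    """Greedily pair each case with the closest unused control.

    A control qualifies if it has the same gender and an age within
    max_age_diff years; each control is used at most once.
    """
    available = list(controls)
    pairs = []
    for case in cases:
        candidates = [
            c for c in available
            if c["gender"] == case["gender"]
            and abs(c["age"] - case["age"]) <= max_age_diff
        ]
        if not candidates:
            continue  # no suitable control, the case stays unmatched
        best = min(candidates, key=lambda c: abs(c["age"] - case["age"]))
        available.remove(best)
        pairs.append((case, best))
    return pairs


# Hypothetical cohort.
cases = [{"id": 1, "age": 60, "gender": "F"}, {"id": 2, "age": 45, "gender": "M"}]
controls = [
    {"id": 10, "age": 62, "gender": "F"},
    {"id": 11, "age": 59, "gender": "F"},
    {"id": 12, "age": 30, "gender": "M"},
    {"id": 13, "age": 47, "gender": "M"},
]
pairs = match_controls(cases, controls)
matched_ids = [(case["id"], control["id"]) for case, control in pairs]
```

Greedy matching is easy to follow but depends on the order of the cases; it is shown here only to make the idea of a balanced cohort concrete.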