Pipelines
Pipelines is a library designed to generate and evaluate data analysis pipelines.
Transformation interface
Pipelines.Card — Type
abstract type Card end

Abstract supertype to encompass all possible cards.
Current implementations:
- SplitCard (type = "split")
- RescaleCard (type = "rescale")
- ClusterCard (type = "cluster")
- DimensionalityReductionCard (type = "dimensionality_reduction")
- GLMCard (type = "glm")
- MixedModelCard (type = "mixed_model")
- InterpCard (type = "interp")
- GaussianEncodingCard (type = "gaussian_encoding")
- StreamlinerCard (type = "streamliner")
- WildCard
Pipelines.Card — Method
Card(d::AbstractDict)

Generate a Card based on a configuration dictionary d.
Examples
Dictionaries are given in TOML format for clarity.
Card configuration d:
type = "cluster"
method = "kmeans"
method_options = {classes = 3}
inputs = [
"wind_10m",
"wind_20m",
"temperature_10m",
"temperature_20m",
"precipitation",
"irradiance",
"humidity"
]

Resulting card:
julia> card = Card(d);
julia> typeof(card)
ClusterCard
julia> card.clusterer
Pipelines.KMeansMethod(3, 100, 1.0e-6, nothing)
julia> card.inputs
7-element Vector{String}:
"wind_10m"
"wind_20m"
"temperature_10m"
"temperature_20m"
"precipitation"
"irradiance"
"humidity"

Pipelines.Card — Method
Card(d::AbstractDict, params::AbstractDict; recursive::Integer = 1)

Generate a Card based on a parametric configuration dictionary d and a parameter dictionary params. The value recursive denotes how many times to process replaced variables: use recursive = 0 to avoid recursion altogether, or a large number to allow arbitrarily deep recursion.
Parametric configurations are experimental; the API is not yet fully stabilized and documented.
Current implementation
- Variable substitution based on key "-v"
- Splicing variable substitution based on key "-s"
- Range substitution based on key "-r"
- Splicing and joining with underscore based on key "-j"
Examples
Dictionaries are given in TOML format for clarity.
Initial card configuration d:
type = "cluster"
method = "kmeans"
method_options = {classes = {"-v" = "nclasses"}}
inputs = [
{"-j" = ["component", {"-r" = 3}]},
{"-j" = [["wind", "temperature"], ["10m", "20m"]]},
{"-s" = "additional_input_vars"},
"humidity"
]

Parameter dictionary params:
nclasses = 3
additional_input_vars = ["precipitation", "irradiance"]

Final card configuration Pipelines.apply_helpers(d, params; recursive):
method = "kmeans"
classes = 3
type = "cluster"
inputs = [
"component_1",
"component_2",
"component_3",
"wind_10m",
"wind_20m",
"temperature_10m",
"temperature_20m",
"precipitation",
"irradiance",
"humidity"
]

Pipelines.train — Function
train(repository::Repository, card::Card, source; schema = nothing)::CardState

Return a trained model for a given card on the table source in the database repository.db.
Pipelines.evaluate — Function
evaluate(
repository::Repository,
card::Card,
state::CardState,
(source, destination)::Pair,
id::AbstractString;
schema = nothing
)

Replace table destination in the database repository.db with the outcome of executing the card on the table source. The new table destination will have an additional column id, to be joined with the row number of the original table.
Here, state represents the result of train(repository, card, source; schema). See also train.
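As a minimal sketch of how these two functions fit together (the repository, table names, and the id column name "_id" are illustrative assumptions, not part of the documented API):

```julia
# Assumes `repository::Repository` exists and `d` is a card configuration
# dictionary such as the cluster example above.
card = Card(d)
state = train(repository, card, "source")
# Write the card's outputs to a new table "predictions", keyed by "_id".
evaluate(repository, card, state, "source" => "predictions", "_id")
```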
Pipelines.get_inputs — Function
get_inputs(c::Card; invert::Bool = false, train::Bool = !invert)::Vector{String}

Return the list of inputs for a given card.
Pipelines.get_outputs — Function
get_outputs(c::Card; invert::Bool = false)::Vector{String}

Return the list of outputs for a given card.
Pipelines.invertible — Function
invertible(c::Card)::Bool

Return true for invertible cards, false otherwise.
Pipeline computation
Pipelines.Node — Type
Node(
card::Card,
state = CardState();
update::Bool = true,
train::Bool = true
)

Generate a Node object from a Card.
Pipelines.train! — Function
train!(
repository::Repository,
node::Node,
table::AbstractString;
schema = nothing
)

Train node on the table table in repository. The field state of node is modified.
See also evaljoin, train_evaljoin!.
Pipelines.evaljoin — Function
evaljoin(
repository::Repository,
nodes::AbstractVector,
table::AbstractString,
[keep_vars];
schema = nothing
)
evaljoin(
repository::Repository,
node::Node,
(source, destination)::Pair,
[keep_vars];
schema = nothing
)

Replace table in the database repository.db with the outcome of executing all the transformations in nodes, without training the nodes. The resulting outputs of the pipeline are joined with the original columns keep_vars (defaults to keeping all columns).
If only a node is provided, then one should pass both source and destination tables.
See also train!, train_evaljoin!.
Return pipeline graph and metadata.
Pipelines.train_evaljoin! — Function
train_evaljoin!(
repository::Repository,
nodes::AbstractVector,
table::AbstractString,
[keep_vars];
schema = nothing
)
train_evaljoin!(
repository::Repository,
node::Node,
(source, destination)::Pair,
[keep_vars];
schema = nothing
)

Replace table in the database repository.db with the outcome of executing all the transformations in nodes, after having trained the nodes. The resulting outputs of the pipeline are joined with the original columns keep_vars (defaults to keeping all columns).
If only a node is provided, then one should pass both source and destination tables.
Return pipeline graph and metadata.
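The node-based functions above can be combined into a short end-to-end sketch (the repository, the card configurations d1 and d2, and the table name are assumptions):

```julia
# Wrap each card in a Node, then train and evaluate the whole pipeline.
nodes = [Node(Card(d1)), Node(Card(d2))]
train_evaljoin!(repository, nodes, "data")
# Afterwards, each node's `state` field holds its trained model.
```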
Pipeline reports
Pipelines.report — Function
report(repository::Repository, nodes::AbstractVector)

Create default reports for all nodes referring to a given repository. Each node must be of type Node.
report(::Repository, ::Card, ::CardState)

Overload this method (replacing Card with a specific card type) to implement a default report for a given card type.
Pipeline visualizations
Pipelines.visualize — Function
visualize(repository::Repository, nodes::AbstractVector)

Create default visualizations for all nodes referring to a given repository. Each node must be of type Node.
visualize(::Repository, ::Card, ::CardState)

Overload this method (replacing Card with a specific card type) to implement a default visualization for a given card type.
Cards
Pipelines.SplitCard — Type
struct SplitCard <: Card
type::String
label::String
method::String
splitter::SplittingMethod
order_by::Vector{String}
by::Vector{String}
output::String
end

Card to split the data into two groups according to a given function splitter.
Currently supported methods are
- tiles (requires the tiles argument, e.g., tiles = [1, 1, 2, 1, 1, 2])
- percentile (requires the percentile argument, e.g., percentile = 0.9)
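For illustration, a hypothetical SplitCard configuration in TOML, following the pattern of the cluster example above (the column names and the placement of the percentile option under method_options are assumptions):

```toml
type = "split"
method = "percentile"
method_options = {percentile = 0.9}
order_by = ["timestamp"]
output = "partition"
```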
Pipelines.RescaleCard — Type
struct RescaleCard <: Card
type::String
label::String
by::Vector{String}
inputs::Vector{String}
targets::Vector{String}
partition::Union{String, Nothing}
suffix::String
target_suffix::Union{String, Nothing}
end

Card to rescale one or more columns according to a given rescaler. The supported methods are
- zscore
- maxabs
- minmax
- log
- logistic
The resulting rescaled variable is added to the table under the name "$(originalname)_$(suffix)".
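For illustration, a hypothetical RescaleCard configuration in TOML (the column names are assumptions; the method key follows the pattern of the other cards):

```toml
type = "rescale"
method = "zscore"
inputs = ["temperature_10m", "temperature_20m"]
suffix = "rescaled"
```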
Pipelines.ClusterCard — Type
struct ClusterCard <: Card
type::String
label::String
method::String
clusterer::ClusteringMethod
inputs::Vector{String}
weights::Union{String, Nothing}
partition::Union{String, Nothing}
output::String
end

Cluster inputs based on clusterer. Save the resulting column as output.
Pipelines.DimensionalityReductionCard — Type
struct DimensionalityReductionCard <: Card
type::String
label::String
method::String
projector::ProjectionMethod
inputs::Vector{String}
partition::Union{String, Nothing}
n_components::Int
output::String
end

Project inputs based on projector. Save the resulting column as output.
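For illustration, a hypothetical DimensionalityReductionCard configuration in TOML (the "pca" method name and the column names are assumptions):

```toml
type = "dimensionality_reduction"
method = "pca"
inputs = ["wind_10m", "wind_20m", "temperature_10m", "temperature_20m"]
n_components = 2
output = "component"
```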
Pipelines.GLMCard — Type
struct GLMCard <: Card
type::String
label::String
distribution_name::String
distribution::Distribution
link_name::Union{String, Nothing}
link::Link
inputs::Vector{Any}
target::String
formula::FormulaTerm
weights::Union{String, Nothing}
partition::Union{String, Nothing}
suffix::String
end

Run a Generalized Linear Model (GLM) based on formula.
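For illustration, a hypothetical GLMCard configuration in TOML (the distribution key name and all column names are assumptions):

```toml
type = "glm"
inputs = ["wind_10m", "temperature_10m"]
target = "energy_output"
distribution = "normal"
suffix = "hat"
```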
Pipelines.MixedModelCard — Type
struct MixedModelCard <: AbstractGLMCard
type::String
label::String
distribution_name::String
distribution::Distribution
link_name::Union{String, Nothing}
link::Link
inputs::MixedInputs
target::String
formula::FormulaTerm
weights::Union{String, Nothing}
partition::Union{String, Nothing}
suffix::String
end

Run a Mixed Model based on formula. To use this card, you must load the MixedModels.jl package first.
Pipelines.InterpCard — Type
struct InterpCard <: Card
type::String
label::String
method::String
interpolator::InterpolationMethod
input::String
targets::Vector{String}
partition::Union{String, Nothing} = nothing
suffix::String = "hat"
end

Interpolate targets based on input.
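For illustration, a hypothetical InterpCard configuration in TOML (the "linear" method name and the column names are assumptions):

```toml
type = "interp"
method = "linear"
input = "timestamp"
targets = ["temperature_10m", "wind_10m"]
suffix = "hat"
```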
Pipelines.GaussianEncodingCard — Type
struct GaussianEncodingCard <: Card

Defines a card for applying Gaussian transformations to a specified column.
Fields:
- type::String: Card type, i.e., "gaussian_encoding".
- label::String: Label to represent the card in a UI.
- method::String: Name of the processing method (see below).
- temporal_preprocessor::TemporalProcessingMethod: Transformation to process a given column (see below).
- input::String: Name of the column to transform.
- n_components::Int: Number of Gaussian curves to generate.
- lambda::Float64: Coefficient for scaling the standard deviation.
- suffix::String: Suffix added to the output column names.
Notes:
- The method field determines the preprocessing applied to the column.
- No automatic selection based on column type. The user must ensure compatibility:
  - "identity": Assumes the column is numeric.
  - "dayofyear": Assumes the column is a date or timestamp.
  - "hourofday": Assumes the column is a time or timestamp.
Methods:
- Defined in the TEMPORAL_PREPROCESSING_METHODS dictionary:
  - "identity": No transformation.
  - "dayofweek": Applies the SQL dayofweek function.
  - "dayofyear": Applies the SQL dayofyear function.
  - "hourofday": Applies the SQL hour function.
  - "minuteofhour": Computes the minute within the hour.
  - "minuteofday": Computes the minute within the day.
Train:
- Returns: SimpleTable (Dict{String, AbstractVector}) with Gaussian parameters:
  - σ: Standard deviation for Gaussian transformations.
  - d: Normalization value.
  - μ_1, μ_2, ..., μ_n: Gaussian means.
Evaluate:
- Steps:
  - Preprocesses the column using the specified method.
  - Temporarily registers the Gaussian parameters (params_tbl) using with_table.
  - Joins the source table with the params table via a CROSS JOIN.
  - Computes the Gaussian-transformed columns.
  - Selects only the required columns (original and transformed).
  - Replaces the target table with the final results.
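For illustration, a hypothetical GaussianEncodingCard configuration in TOML, using the fields listed above (the column name and parameter values are assumptions):

```toml
type = "gaussian_encoding"
method = "hourofday"
input = "timestamp"
n_components = 4
lambda = 0.5
suffix = "gaussian"
```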
Pipelines.StreamlinerCard — Type
struct StreamlinerCard <: Card
type::String
label::String
model_name::String
model::Model
training_name::String
training::Training
order_by::Vector{String}
inputs::Vector{String}
targets::Vector{String}
partition::Union{String, Nothing} = nothing
suffix::String = "hat"
end

Run a Streamliner model, predicting targets from inputs.
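For illustration, a partial, hypothetical StreamlinerCard configuration in TOML (the model and training sections are omitted because their schema is not documented here; the column names are assumptions):

```toml
type = "streamliner"
order_by = ["timestamp"]
inputs = ["wind_10m", "wind_20m"]
targets = ["energy_output"]
suffix = "hat"
```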
Pipelines.WildCard — Type
struct WildCard{train, evaluate} <: Card
type::String
label::String
order_by::Vector{String}
inputs::Vector{String}
targets::Vector{String}
weights::Union{String, Nothing}
partition::Union{String, Nothing}
outputs::Vector{String}
end

Custom card that uses arbitrary training and evaluation functions.
Card registration
Pipelines.register_card — Function
register_card(config::CardConfig)

Set a given card configuration as globally available.
See also CardConfig.
Pipelines.CardConfig — Type
@kwdef struct CardConfig{T <: Card}
key::String
label::String
needs_targets::Bool
needs_order::Bool
allows_weights::Bool
allows_partition::Bool
widget_configs::StringDict = StringDict()
methods::StringDict = StringDict()
end

Configuration used to register a card.
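As a sketch of how registration might look (MyCard and all field values are illustrative; this assumes MyCard <: Card is already defined, together with its train and evaluate methods):

```julia
# Register a hypothetical custom card type so it becomes globally available.
config = CardConfig{MyCard}(
    key = "my_card",
    label = "My Card",
    needs_targets = true,
    needs_order = false,
    allows_weights = false,
    allows_partition = true,
)
register_card(config)
```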