Pipelines

Pipelines is a library designed to generate and evaluate data analysis pipelines.

Transformation interface

Pipelines.train (Function)
train(repository::Repository, card::Card, source; schema = nothing)::CardState

Return a trained model for a given card on the table source in the database repository.db.

Pipelines.evaluate (Function)
evaluate(repository::Repository, card::Card, state::CardState, (source, destination)::Pair; schema = nothing)

Replace table destination in the database repository.db with the outcome of executing the card on the table source.

Here, state represents the result of train(repository, card, source; schema). See also train.

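The two functions form a fit/apply contract: train computes reusable state once, and evaluate applies that state to produce a destination table. The toy code below illustrates the same contract with stand-in types; MeanImputer, toy_train, and toy_evaluate are hypothetical and not part of Pipelines:

```julia
# Toy illustration of the train/evaluate contract: `train` returns state,
# `evaluate` applies it. All types here are stand-ins, not the library's own.
abstract type ToyCard end

struct MeanImputer <: ToyCard
    column::Symbol
end

# "train": compute the state (here, the mean over non-missing values)
function toy_train(card::MeanImputer, table::Dict{Symbol,Vector})
    vals = [x for x in table[card.column] if x !== missing]
    return (; mean = sum(vals) / length(vals))
end

# "evaluate": use the trained state to produce the destination table
function toy_evaluate(card::MeanImputer, state, table::Dict{Symbol,Vector})
    out = copy(table)
    out[card.column] = [x === missing ? state.mean : x for x in table[card.column]]
    return out
end
```

Separating the two steps is what lets a card trained on one table (say, a training split) be evaluated on another.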
Pipelines.inputs (Function)
inputs(c::Card)::OrderedSet{String}

Return the ordered set of inputs for a given card.

Pipelines.outputs (Function)
outputs(c::Card)::OrderedSet{String}

Return the ordered set of outputs for a given card.


Pipeline computation

Pipelines.evaluate (Method)
evaluate(repository::Repository, cards::AbstractVector, table::AbstractString; schema = nothing)

Replace table in the database repository.db with the outcome of executing all the transformations in cards.

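Because each card declares its inputs and outputs, a vector of cards defines a dependency graph, and the pipeline can run each card once its inputs exist. A minimal sketch of such an ordering, with plain named tuples standing in for cards (this is not the library's scheduler):

```julia
# Order card-like objects so that every one runs only after all of its
# declared inputs are available. `available` is the set of columns already
# present in the source table.
function topo_order(cards::Vector{<:NamedTuple}, available::Set{String})
    done = copy(available)
    order = Int[]
    remaining = Set(eachindex(cards))
    while !isempty(remaining)
        ready = [i for i in remaining if issubset(cards[i].inputs, done)]
        isempty(ready) && error("cycle or missing input")
        for i in sort(ready)
            push!(order, i)
            union!(done, cards[i].outputs)
            delete!(remaining, i)
        end
    end
    return order
end
```

For example, a card reading "a_scaled" runs after the card that produces it, regardless of their order in the vector.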

Pipeline reports

Pipelines.report (Function)
report(repository::Repository, nodes::AbstractVector)

Create default reports for all nodes referring to a given repository. Each node must be of type Node.

report(::Repository, ::Card, ::CardState)

Overload this method (replacing Card with a specific card type) to implement a default report for a given card type.


Pipeline visualizations

Pipelines.visualize (Function)
visualize(repository::Repository, nodes::AbstractVector)

Create default visualizations for all nodes referring to a given repository. Each node must be of type Node.

visualize(::Repository, ::Card, ::CardState)

Overload this method (replacing Card with a specific card type) to implement a default visualization for a given card type.


Cards

Pipelines.SplitCard (Type)
struct SplitCard <: Card
    splitter::SQLNode
    order_by::Vector{String}
    by::Vector{String}
    output::String
end

Card to split the data into two groups according to a given function splitter.

Currently supported methods are

  • tiles (requires the tiles argument, e.g., tiles = [1, 1, 2, 1, 1, 2]),
  • percentile (requires the percentile argument, e.g., percentile = 0.9).
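The two strategies can be sketched in plain Julia. The exact semantics assumed here (the tiles pattern repeated over the rows in order, the percentile fraction taken from the front) are an illustration, not the library's implementation:

```julia
# `tiles`: assign each of n rows the group at its position in the
# repeating pattern, e.g. [1, 1, 2] -> 1, 1, 2, 1, 1, 2, ...
split_tiles(n::Int, tiles::Vector{Int}) =
    [tiles[mod1(i, length(tiles))] for i in 1:n]

# `percentile`: the first fraction `p` of the rows go to group 1,
# the remainder to group 2.
split_percentile(n::Int, p::Float64) =
    [i <= ceil(Int, p * n) ? 1 : 2 for i in 1:n]
```

In either case the group label is written to the column named by the card's output field.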
Pipelines.RescaleCard (Type)
struct RescaleCard <: Card
    rescaler::Rescaler
    by::Vector{String} = String[]
    columns::Vector{String}
    suffix::String = "rescaled"
end

Card to rescale one or more columns according to a given rescaler. The supported methods are

  • zscore,
  • maxabs,
  • minmax,
  • log,
  • logistic.

The resulting rescaled variable is added to the table under the name "$(originalname)_$(suffix)".

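One possible implementation of each listed method, as a sketch (the library's Rescaler types define the real behavior):

```julia
using Statistics

# Plausible definitions for each rescaling method named above.
const RESCALERS = Dict(
    "zscore"   => x -> (x .- mean(x)) ./ std(x),
    "maxabs"   => x -> x ./ maximum(abs, x),
    "minmax"   => x -> (x .- minimum(x)) ./ (maximum(x) - minimum(x)),
    "log"      => x -> log.(x),
    "logistic" => x -> 1 ./ (1 .+ exp.(-x)),
)

# Naming convention from the docstring: "$(originalname)_$(suffix)"
rescaled_name(name, suffix = "rescaled") = string(name, "_", suffix)
```

With the default suffix, rescaling a column "price" adds a column "price_rescaled".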
Pipelines.ClusterCard (Type)
struct ClusterCard <: Card
    clusterer::Clusterer
    columns::Vector{String}
    partition::Union{String, Nothing}
    output::String
end

Cluster columns based on clusterer. Save the resulting column as output.

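As an illustration of what a clusterer does, here is a minimal k-means (Lloyd's algorithm) in base Julia that maps rows to cluster labels; the clusterers actually accepted by ClusterCard are defined by the library:

```julia
using Statistics, Random

# Minimal Lloyd's algorithm: rows of X are observations, columns are the
# features selected by the card; returns one integer label per row.
function kmeans_labels(X::Matrix{Float64}, k::Int; iters = 20,
                       rng = Random.MersenneTwister(1))
    n = size(X, 1)
    centers = X[randperm(rng, n)[1:k], :]   # init from k distinct rows
    labels = zeros(Int, n)
    for _ in 1:iters
        # assignment step: nearest center by squared Euclidean distance
        for i in 1:n
            labels[i] = argmin([sum(abs2, X[i, :] .- centers[j, :]) for j in 1:k])
        end
        # update step: move each center to the mean of its members
        for j in 1:k
            members = X[labels .== j, :]
            isempty(members) || (centers[j, :] = vec(mean(members, dims = 1)))
        end
    end
    return labels
end
```

The labels would then be stored under the card's output column name.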
Pipelines.DimensionalityReductionCard (Type)
struct DimensionalityReductionCard <: Card
    projector::Projector
    columns::Vector{String}
    n_components::Int
    partition::Union{String, Nothing}
    output::String
end

Project columns based on projector. Save the resulting column as output.

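As an illustration of a projector, PCA can be sketched via the SVD of the centered data; this is an assumption for illustration, not the library's code:

```julia
using LinearAlgebra, Statistics

# PCA sketch: center the selected columns, then project onto the top
# `n_components` right singular vectors of the centered matrix.
function pca_project(X::Matrix{Float64}, n_components::Int)
    Xc = X .- mean(X, dims = 1)
    F = svd(Xc)
    return Xc * F.V[:, 1:n_components]
end
```

Note that singular vectors are determined only up to sign, so projected components may differ by a sign flip between runs or libraries.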
Pipelines.GLMCard (Type)
struct GLMCard <: Card
    formula::FormulaTerm
    weights::Union{String, Nothing}
    distribution::Distribution
    link::Link
    partition::Union{String, Nothing}
    suffix::String
end

Run a Generalized Linear Model (GLM) based on formula.

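For a Normal distribution with identity link, a GLM reduces to ordinary least squares, which makes the card's job easy to picture. A hand-rolled sketch of fitting y ~ 1 + x (the card itself delegates to a GLM solver; ols_fit is a hypothetical helper):

```julia
using LinearAlgebra

# Normal/identity GLM special case: ordinary least squares for `y ~ 1 + x`.
function ols_fit(x::Vector{Float64}, y::Vector{Float64})
    X = hcat(ones(length(x)), x)   # design matrix: intercept plus predictor
    return X \ y                   # coefficients [intercept, slope]
end
```

Other distribution/link pairs (e.g. Binomial with a logit link) require iteratively reweighted least squares rather than a single solve.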
Pipelines.InterpCard (Type)
struct InterpCard <: Card
    interpolator::Interpolator
    predictor::String
    targets::Vector{String}
    extrapolation_left::ExtrapolationType.T
    extrapolation_right::ExtrapolationType.T
    dir::Union{Symbol, Nothing} = nothing
    partition::Union{String, Nothing} = nothing
    suffix::String = "hat"
end

Interpolate targets based on predictor.

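A sketch of interpolating one target over the predictor, with constant (flat) extrapolation on both sides; the card's actual interpolators and ExtrapolationType options are supplied by the library, and interp_linear here is a hypothetical helper:

```julia
# Linear interpolation over sorted predictor values `xs`, with flat
# extrapolation beyond both endpoints.
function interp_linear(xs::Vector{Float64}, ys::Vector{Float64}, x::Float64)
    x <= xs[1] && return ys[1]       # flat extrapolation on the left
    x >= xs[end] && return ys[end]   # flat extrapolation on the right
    i = searchsortedlast(xs, x)      # bracketing interval [xs[i], xs[i+1]]
    t = (x - xs[i]) / (xs[i + 1] - xs[i])
    return (1 - t) * ys[i] + t * ys[i + 1]
end
```

The extrapolation_left and extrapolation_right fields let the two sides be handled differently, e.g. flat on one side and linear on the other.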
Pipelines.GaussianEncodingCard (Type)
struct GaussianEncodingCard <: Card

Defines a card for applying Gaussian transformations to a specified column.

Fields:

  • column::String: Name of the column to transform.
  • processed_column::Union{FunClosure, Nothing}: Processed column using a given method (see below).
  • n_modes::Int: Number of Gaussian curves to generate.
  • max::Float64: Maximum value used for normalization (denominator).
  • lambda::Float64: Coefficient for scaling the standard deviation.
  • suffix::String: Suffix added to the output column names.

Notes:

  • The method field determines the preprocessing applied to the column.
  • No automatic selection based on column type. The user must ensure compatibility:
    • "identity": Assumes the column is numeric.
    • "dayofyear": Assumes the column is a date or timestamp.
    • "hourofday": Assumes the column is a time or timestamp.

Methods:

  • Defined in the TEMPORAL_PREPROCESSING dictionary:
    • "identity": No transformation.
    • "dayofyear": Applies the SQL dayofyear function.
    • "hourofday": Applies the SQL hour function.

Train:

  • Returns: SimpleTable (Dict{String, AbstractVector}) with Gaussian parameters:
    • σ: Standard deviation for Gaussian transformations.
    • d: Normalization value.
    • μ_1, μ_2, ..., μ_n: Gaussian means.

Evaluate:

  • Steps:
    1. Preprocesses the column using the specified method.
    2. Temporarily registers the Gaussian parameters (params_tbl) using with_table.
    3. Joins the source table with the params table via a CROSS JOIN.
    4. Computes Gaussian-transformed columns.
    5. Selects only the required columns (original and transformed).
    6. Replaces the target table with the final results.
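The Train and Evaluate steps above can be sketched as follows; the exact formulas (means evenly spaced on [0, 1], σ proportional to the spacing, d set to max) are assumptions for illustration, not the library's definitions:

```julia
# Plausible Gaussian-encoding parameters for n_modes >= 2:
# means evenly spaced on [0, 1], spread scaled by `lambda`.
function gaussian_params(n_modes::Int, max::Float64, lambda::Float64)
    step = 1 / (n_modes - 1)
    μ = [step * (k - 1) for k in 1:n_modes]   # μ_1, …, μ_n
    σ = lambda * step                         # standard deviation
    return (; σ, d = max, μ)                  # d: normalization denominator
end

# One encoded feature per mode from a raw value `v` (e.g. a day of year),
# normalized by `d` before evaluating each Gaussian.
gaussian_encode(v, p) = [exp(-((v / p.d - m)^2) / (2 * p.σ^2)) for m in p.μ]
```

For a "dayofyear" column, v would be the SQL-extracted day of year and d would typically be around 365, so each row yields n_modes smooth periodic features.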
Pipelines.StreamlinerCard (Type)
struct StreamlinerCard <: Card
    model::Model
    training::Training
    order_by::Vector{String}
    predictors::Vector{String}
    targets::Vector{String}
    partition::Union{String, Nothing} = nothing
    suffix::String = "hat"
end

Run a Streamliner model, predicting targets from predictors.
