Pipelines

Pipelines is a library designed to generate and evaluate data analysis pipelines.

Transformation interface

Pipelines.train (Function)
train(repository::Repository, card::Card, source; schema = nothing)::CardState

Return a trained model for a given card on the table source in the database repository.db.

Pipelines.evaluate (Function)
evaluate(repository::Repository, card::Card, state::CardState, (source, destination)::Pair; schema = nothing)

Replace table destination in the database repository.db with the outcome of executing the card on the table source.

Here, state represents the result of train(repository, card, source; schema). See also train.

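The two functions form a fit/apply contract: train computes reusable state once, and evaluate applies that state to produce a destination table. The toy code below illustrates the same contract with stand-in types; MeanImputer, toy_train, and toy_evaluate are hypothetical and not part of Pipelines:

```julia
# Toy illustration of the train/evaluate contract: `train` returns state,
# `evaluate` applies it. All types here are stand-ins, not the library's own.
abstract type ToyCard end

struct MeanImputer <: ToyCard
    column::Symbol
end

# "train": compute the state (here, the mean over non-missing values)
function toy_train(card::MeanImputer, table::Dict{Symbol,Vector})
    vals = [x for x in table[card.column] if x !== missing]
    return (; mean = sum(vals) / length(vals))
end

# "evaluate": use the trained state to produce the destination table
function toy_evaluate(card::MeanImputer, state, table::Dict{Symbol,Vector})
    out = copy(table)
    out[card.column] = [x === missing ? state.mean : x for x in table[card.column]]
    return out
end
```

Separating the two steps is what lets a card trained on one table (say, a training split) be evaluated on another.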
Pipelines.inputs (Function)
inputs(c::Card)::OrderedSet{String}

Return the ordered set of inputs for a given card.

Pipelines.outputs (Function)
outputs(c::Card)::OrderedSet{String}

Return the ordered set of outputs for a given card.


Pipeline computation

Pipelines.evaluate (Method)
evaluate(repository::Repository, cards::AbstractVector, table::AbstractString; schema = nothing)

Replace table in the database repository.db with the outcome of executing all the transformations in cards.

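Because each card declares its inputs and outputs, a vector of cards defines a dependency graph, and the pipeline can run each card once its inputs exist. A minimal sketch of such an ordering, with plain named tuples standing in for cards (this is not the library's scheduler):

```julia
# Order card-like objects so that every one runs only after all of its
# declared inputs are available. `available` is the set of columns already
# present in the source table.
function topo_order(cards::Vector{<:NamedTuple}, available::Set{String})
    done = copy(available)
    order = Int[]
    remaining = Set(eachindex(cards))
    while !isempty(remaining)
        ready = [i for i in remaining if issubset(cards[i].inputs, done)]
        isempty(ready) && error("cycle or missing input")
        for i in sort(ready)
            push!(order, i)
            union!(done, cards[i].outputs)
            delete!(remaining, i)
        end
    end
    return order
end
```

For example, a card reading "a_scaled" runs after the card that produces it, regardless of their order in the vector.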

Pipeline reports

Pipelines.report (Function)
report(repository::Repository, nodes::AbstractVector)

Create default reports for all nodes referring to a given repository. Each node must be of type Node.

report(::Repository, ::Card, ::CardState)

Overload this method (replacing Card with a specific card type) to implement a default report for a given card type.


Pipeline visualizations

Pipelines.visualize (Function)
visualize(repository::Repository, nodes::AbstractVector)

Create default visualizations for all nodes referring to a given repository. Each node must be of type Node.

visualize(::Repository, ::Card, ::CardState)

Overload this method (replacing Card with a specific card type) to implement a default visualization for a given card type.


Cards

Pipelines.SplitCard (Type)
struct SplitCard <: Card
    splitter::SQLNode
    order_by::Vector{String}
    by::Vector{String}
    output::String
end

Card to split the data into two groups according to a given function splitter.

Currently supported methods are

  • tiles (requires the tiles argument, e.g., tiles = [1, 1, 2, 1, 1, 2]),
  • percentile (requires the percentile argument, e.g., percentile = 0.9).
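The two strategies can be sketched in plain Julia. The exact semantics assumed here (the tiles pattern repeated over the rows in order, the percentile fraction taken from the front) are an illustration, not the library's implementation:

```julia
# `tiles`: assign each of n rows the group at its position in the
# repeating pattern, e.g. [1, 1, 2] -> 1, 1, 2, 1, 1, 2, ...
split_tiles(n::Int, tiles::Vector{Int}) =
    [tiles[mod1(i, length(tiles))] for i in 1:n]

# `percentile`: the first fraction `p` of the rows go to group 1,
# the remainder to group 2.
split_percentile(n::Int, p::Float64) =
    [i <= ceil(Int, p * n) ? 1 : 2 for i in 1:n]
```

In either case the group label is written to the column named by the card's output field.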
Pipelines.RescaleCard (Type)
struct RescaleCard <: Card
    rescaler::Rescaler
    by::Vector{String} = String[]
    columns::Vector{String}
    suffix::String = "rescaled"
end

Card to rescale one or more columns according to a given rescaler. The supported methods are

  • zscore,
  • maxabs,
  • minmax,
  • log,
  • logistic.

The resulting rescaled variable is added to the table under the name "$(originalname)_$(suffix)".

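One possible implementation of each listed method, as a sketch (the library's Rescaler types define the real behavior):

```julia
using Statistics

# Plausible definitions for each rescaling method named above.
const RESCALERS = Dict(
    "zscore"   => x -> (x .- mean(x)) ./ std(x),
    "maxabs"   => x -> x ./ maximum(abs, x),
    "minmax"   => x -> (x .- minimum(x)) ./ (maximum(x) - minimum(x)),
    "log"      => x -> log.(x),
    "logistic" => x -> 1 ./ (1 .+ exp.(-x)),
)

# Naming convention from the docstring: "$(originalname)_$(suffix)"
rescaled_name(name, suffix = "rescaled") = string(name, "_", suffix)
```

With the default suffix, rescaling a column "price" adds a column "price_rescaled".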
Pipelines.ClusterCard (Type)
struct ClusterCard <: Card
    clusterer::Clusterer
    columns::Vector{String}
    partition::Union{String, Nothing}
    output::String
end

Cluster columns based on clusterer. Save the resulting column as output.

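As an illustration of what a clusterer does, here is a minimal k-means (Lloyd's algorithm) in base Julia that maps rows to cluster labels; the clusterers actually accepted by ClusterCard are defined by the library:

```julia
using Statistics, Random

# Minimal Lloyd's algorithm: rows of X are observations, columns are the
# features selected by the card; returns one integer label per row.
function kmeans_labels(X::Matrix{Float64}, k::Int; iters = 20,
                       rng = Random.MersenneTwister(1))
    n = size(X, 1)
    centers = X[randperm(rng, n)[1:k], :]   # init from k distinct rows
    labels = zeros(Int, n)
    for _ in 1:iters
        # assignment step: nearest center by squared Euclidean distance
        for i in 1:n
            labels[i] = argmin([sum(abs2, X[i, :] .- centers[j, :]) for j in 1:k])
        end
        # update step: move each center to the mean of its members
        for j in 1:k
            members = X[labels .== j, :]
            isempty(members) || (centers[j, :] = vec(mean(members, dims = 1)))
        end
    end
    return labels
end
```

The labels would then be stored under the card's output column name.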
Pipelines.DimensionalityReductionCard (Type)
struct DimensionalityReductionCard <: Card
    projector::Projector
    columns::Vector{String}
    n_components::Int
    partition::Union{String, Nothing}
    output::String
end

Project columns based on projector. Save the resulting column as output.

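As an illustration of a projector, PCA can be sketched via the SVD of the centered data; this is an assumption for illustration, not the library's code:

```julia
using LinearAlgebra, Statistics

# PCA sketch: center the selected columns, then project onto the top
# `n_components` right singular vectors of the centered matrix.
function pca_project(X::Matrix{Float64}, n_components::Int)
    Xc = X .- mean(X, dims = 1)
    F = svd(Xc)
    return Xc * F.V[:, 1:n_components]
end
```

Note that singular vectors are determined only up to sign, so projected components may differ by a sign flip between runs or libraries.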
Pipelines.GLMCard (Type)
struct GLMCard <: Card
    formula::FormulaTerm
    weights::Union{String, Nothing}
    distribution::Distribution
    link::Link
    partition::Union{String, Nothing}
    suffix::String
end

Run a Generalized Linear Model (GLM) based on formula.

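For a Normal distribution with identity link, a GLM reduces to ordinary least squares, which makes the card's job easy to picture. A hand-rolled sketch of fitting y ~ 1 + x (the card itself delegates to a GLM solver; ols_fit is a hypothetical helper):

```julia
using LinearAlgebra

# Normal/identity GLM special case: ordinary least squares for `y ~ 1 + x`.
function ols_fit(x::Vector{Float64}, y::Vector{Float64})
    X = hcat(ones(length(x)), x)   # design matrix: intercept plus predictor
    return X \ y                   # coefficients [intercept, slope]
end
```

Other distribution/link pairs (e.g. Binomial with a logit link) require iteratively reweighted least squares rather than a single solve.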
Pipelines.InterpCard (Type)
struct InterpCard <: Card
    interpolator::Interpolator
    predictor::String
    targets::Vector{String}
    extrapolation_left::ExtrapolationType.T
    extrapolation_right::ExtrapolationType.T
    dir::Union{Symbol, Nothing} = nothing
    partition::Union{String, Nothing} = nothing
    suffix::String = "hat"
end

Interpolate targets based on predictor.

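A sketch of interpolating one target over the predictor, with constant (flat) extrapolation on both sides; the card's actual interpolators and ExtrapolationType options are supplied by the library, and interp_linear here is a hypothetical helper:

```julia
# Linear interpolation over sorted predictor values `xs`, with flat
# extrapolation beyond both endpoints.
function interp_linear(xs::Vector{Float64}, ys::Vector{Float64}, x::Float64)
    x <= xs[1] && return ys[1]       # flat extrapolation on the left
    x >= xs[end] && return ys[end]   # flat extrapolation on the right
    i = searchsortedlast(xs, x)      # bracketing interval [xs[i], xs[i+1]]
    t = (x - xs[i]) / (xs[i + 1] - xs[i])
    return (1 - t) * ys[i] + t * ys[i + 1]
end
```

The extrapolation_left and extrapolation_right fields let the two sides be handled differently, e.g. flat on one side and linear on the other.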
Pipelines.GaussianEncodingCard (Type)
struct GaussianEncodingCard <: Card

Defines a card for applying Gaussian transformations to a specified column.

Fields:

  • column::String: Name of the column to transform.
  • processed_column::Union{FunClosure, Nothing}: Processed column using a given method (see below).
  • n_modes::Int: Number of Gaussian curves to generate.
  • max::Float64: Maximum value used for normalization (denominator).
  • lambda::Float64: Coefficient for scaling the standard deviation.
  • suffix::String: Suffix added to the output column names.

Notes:

  • The method field determines the preprocessing applied to the column.
  • No automatic selection based on column type. The user must ensure compatibility:
    • "identity": Assumes the column is numeric.
    • "dayofyear": Assumes the column is a date or timestamp.
    • "hourofday": Assumes the column is a time or timestamp.

Methods:

  • Defined in the TEMPORAL_PREPROCESSING dictionary:
    • "identity": No transformation.
    • "dayofyear": Applies the SQL dayofyear function.
    • "hourofday": Applies the SQL hour function.

Train:

  • Returns: SimpleTable (Dict{String, AbstractVector}) with Gaussian parameters:
    • σ: Standard deviation for Gaussian transformations.
    • d: Normalization value.
    • μ_1, μ_2, ..., μ_n: Gaussian means.

Evaluate:

  • Steps:
    1. Preprocesses the column using the specified method.
    2. Temporarily registers the Gaussian parameters (params_tbl) using with_table.
    3. Joins the source table with the params table via a CROSS JOIN.
    4. Computes Gaussian-transformed columns.
    5. Selects only the required columns (original and transformed).
    6. Replaces the target table with the final results.
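The Train and Evaluate steps above can be sketched as follows; the exact formulas (means evenly spaced on [0, 1], σ proportional to the spacing, d set to max) are assumptions for illustration, not the library's definitions:

```julia
# Plausible Gaussian-encoding parameters for n_modes >= 2:
# means evenly spaced on [0, 1], spread scaled by `lambda`.
function gaussian_params(n_modes::Int, max::Float64, lambda::Float64)
    step = 1 / (n_modes - 1)
    μ = [step * (k - 1) for k in 1:n_modes]   # μ_1, …, μ_n
    σ = lambda * step                         # standard deviation
    return (; σ, d = max, μ)                  # d: normalization denominator
end

# One encoded feature per mode from a raw value `v` (e.g. a day of year),
# normalized by `d` before evaluating each Gaussian.
gaussian_encode(v, p) = [exp(-((v / p.d - m)^2) / (2 * p.σ^2)) for m in p.μ]
```

For a "dayofyear" column, v would be the SQL-extracted day of year and d would typically be around 365, so each row yields n_modes smooth periodic features.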
Pipelines.StreamlinerCard (Type)
struct StreamlinerCard <: Card
    model::Model
    training::Training
    order_by::Vector{String}
    predictors::Vector{String}
    targets::Vector{String}
    partition::Union{String, Nothing} = nothing
    suffix::String = "hat"
end

Run a Streamliner model, predicting targets from predictors.
