Pipelines
Pipelines is a library designed to generate and evaluate data analysis pipelines.
Transformation interface
Pipelines.Card — Type
abstract type Card end
Abstract supertype to encompass all possible cards.
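Current implementations: SplitCard, RescaleCard, ClusterCard, DimensionalityReductionCard, GLMCard, InterpCard, GaussianEncodingCard, StreamlinerCard.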
Pipelines.train — Function
train(repository::Repository, card::Card, source; schema = nothing)::CardState
Return a trained model for a given card on the table source in the database repository.db.
Pipelines.evaluate — Function
evaluate(repository::Repository, card::Card, state::CardState, (source, destination)::Pair; schema = nothing)
Replace table destination in the database repository.db with the outcome of executing the card on the table source.
Here, state represents the result of train(repository, card, source; schema). See also train.
Pipelines.inputs — Function
inputs(c::Card)::OrderedSet{String}
Return the list of inputs for a given card.
Pipelines.outputs — Function
outputs(c::Card)::OrderedSet{String}
Return the list of outputs for a given card.
Pipelines.invertible — Function
invertible(c::Card)::Bool
Return true for invertible cards, false otherwise.
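To make the interface above concrete, here is a minimal, self-contained sketch of the train/evaluate pattern. It does not use the library: Card is redeclared as a stand-in, CenterCard is a hypothetical card, tables are plain dictionaries rather than database tables, there is no Repository argument, and inputs/outputs return vectors instead of an OrderedSet.

```julia
# Stand-in for the library's abstract type so the sketch runs on its own.
abstract type Card end

# Hypothetical card: subtract the training mean from one column.
struct CenterCard <: Card
    column::String
    suffix::String
end

inputs(c::CenterCard) = [c.column]
outputs(c::CenterCard) = [string(c.column, "_", c.suffix)]
invertible(::CenterCard) = true

# Toy `train`: the state holds whatever `evaluate` needs later.
function train(c::CenterCard, table::Dict{String, Vector{Float64}})
    col = table[c.column]
    return (; mean = sum(col) / length(col))
end

# Toy `evaluate`: returns a new table instead of replacing a database table.
function evaluate(c::CenterCard, state, table::Dict{String, Vector{Float64}})
    result = copy(table)
    result[only(outputs(c))] = table[c.column] .- state.mean
    return result
end

source = Dict("x" => [1.0, 2.0, 3.0])
card = CenterCard("x", "centered")
state = train(card, source)
destination = evaluate(card, state, source)
```

The point of the split is that the state is computed once from the training table and can then be applied to any table with the same input columns.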
Pipeline computation
Pipelines.Card — Method
Card(d::AbstractDict)
Generate a Card based on a configuration dictionary.
Pipelines.evaluate — Method
evaluate(repository::Repository, cards::AbstractVector, table::AbstractString; schema = nothing)
Replace table in the database repository.db with the outcome of executing all the transformations in cards.
Pipeline reports
Pipelines.report — Function
report(repository::Repository, nodes::AbstractVector)
Create default reports for all nodes referring to a given repository. Each node must be of type Node.
report(::Repository, ::Card, ::CardState)
Overload this method (replacing Card with a specific card type) to implement a default report for a given card type.
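The overloading pattern described above relies on ordinary multiple dispatch. The sketch below illustrates it with stand-in types only: Repository, ToyGLMCard, and ToyState are hypothetical, and the fallback method returning nothing is an assumption, not documented library behavior.

```julia
# Stand-ins so the sketch runs without the library (all hypothetical).
abstract type Card end
struct Repository end
struct ToyGLMCard <: Card end
struct ToyState
    coefficients::Dict{String, Float64}
end

# Assumed fallback: cards without a dedicated report produce nothing.
report(::Repository, ::Card, ::Any) = nothing

# Specialized method, selected whenever the card is a ToyGLMCard.
report(::Repository, ::ToyGLMCard, state::ToyState) =
    sort!(["$name = $value" for (name, value) in state.coefficients])

lines = report(Repository(), ToyGLMCard(), ToyState(Dict("a" => 1.5, "b" => -0.5)))
```

The same dispatch pattern applies to any function that is specialized on the card type in its second argument.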
Pipeline visualizations
Pipelines.visualize — Function
visualize(repository::Repository, nodes::AbstractVector)
Create default visualizations for all nodes referring to a given repository. Each node must be of type Node.
visualize(::Repository, ::Card, ::CardState)
Overload this method (replacing Card with a specific card type) to implement a default visualization for a given card type.
Cards
Pipelines.SplitCard — Type
struct SplitCard <: Card
    splitter::SQLNode
    order_by::Vector{String}
    by::Vector{String}
    output::String
end
Card to split the data into two groups according to a given function splitter.
Currently supported methods are
- tiles (requires tiles argument, e.g., tiles = [1, 1, 2, 1, 1, 2]),
- percentile (requires percentile argument, e.g., percentile = 0.9).
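The source does not spell out how the two methods assign groups, so the following is only one plausible reading, written in plain Julia rather than SQL. It assumes that tiles buckets the ordered rows into length(tiles) tiles and uses tiles[k] as the group of bucket k, and that percentile sends rows whose relative rank is at most percentile to group 1. The function names split_by_tiles and split_by_percentile are hypothetical.

```julia
# Assumed semantics, not taken from the library: rows are bucketed by position
# into length(tiles) tiles, and tiles[k] is the group assigned to bucket k.
function split_by_tiles(n_rows::Int, tiles::Vector{Int})
    n_tiles = length(tiles)
    # Bucket index in 1:n_tiles for each row; this agrees with SQL ntile
    # when n_rows is a multiple of n_tiles.
    buckets = [fld((i - 1) * n_tiles, n_rows) + 1 for i in 1:n_rows]
    return [tiles[b] for b in buckets]
end

# Assumed semantics: rows whose relative rank is at most `percentile` go to group 1.
function split_by_percentile(n_rows::Int, percentile::Real)
    # Relative rank in [0, 1], in the spirit of SQL percent_rank.
    ranks = [n_rows == 1 ? 0.0 : (i - 1) / (n_rows - 1) for i in 1:n_rows]
    return [r <= percentile ? 1 : 2 for r in ranks]
end

tile_groups = split_by_tiles(12, [1, 1, 2, 1, 1, 2])
percentile_groups = split_by_percentile(11, 0.9)
```

Under this reading, tiles = [1, 1, 2, 1, 1, 2] interleaves the two groups along the ordering, whereas percentile = 0.9 places the split near the end of the ordering.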
Pipelines.RescaleCard — Type
struct RescaleCard <: Card
    rescaler::Rescaler
    by::Vector{String} = String[]
    columns::Vector{String}
    suffix::String = "rescaled"
end
Card to rescale one or more columns according to a given rescaler. The supported methods are
- zscore,
- maxabs,
- minmax,
- log,
- logistic.
The resulting rescaled variable is added to the table under the name "$(originalname)_$(suffix)".
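For orientation, the sketch below applies the textbook definitions of the five rescalers to an in-memory table and follows the documented output naming. It is not the library's implementation, which works on database tables: the RESCALERS dictionary and the rescale function are hypothetical, grouping via by is omitted, and details such as the standard deviation convention are assumptions.

```julia
using Statistics

# Textbook definitions; the library's versions may differ in detail
# (for example in how the standard deviation is normalized).
const RESCALERS = Dict(
    "zscore"   => x -> (x .- mean(x)) ./ std(x),
    "maxabs"   => x -> x ./ maximum(abs, x),
    "minmax"   => x -> (x .- minimum(x)) ./ (maximum(x) - minimum(x)),
    "log"      => x -> log.(x),
    "logistic" => x -> 1 ./ (1 .+ exp.(-x)),
)

# Add one "$(originalname)_$(suffix)" column per rescaled column.
function rescale(table::Dict{String, Vector{Float64}}, method, columns; suffix = "rescaled")
    result = copy(table)
    for name in columns
        result["$(name)_$(suffix)"] = RESCALERS[method](table[name])
    end
    return result
end

table = Dict("x" => [1.0, 2.0, 3.0, 4.0])
rescaled = rescale(table, "minmax", ["x"])
```

Because the rescaled values are written to a new column, the original column stays available for later cards.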
Pipelines.ClusterCard — Type
struct ClusterCard <: Card
    clusterer::Clusterer
    columns::Vector{String}
    partition::Union{String, Nothing}
    output::String
end
Cluster columns based on clusterer. Save resulting column as output.
Pipelines.DimensionalityReductionCard — Type
struct DimensionalityReductionCard <: Card
    projector::Projector
    columns::Vector{String}
    n_components::Int
    partition::Union{String, Nothing}
    output::String
end
Project columns based on projector. Save resulting column as output.
Pipelines.GLMCard — Type
struct GLMCard <: Card
    formula::FormulaTerm
    weights::Union{String, Nothing}
    distribution::Distribution
    link::Link
    partition::Union{String, Nothing}
    suffix::String
end
Run a Generalized Linear Model (GLM) based on formula.
Pipelines.InterpCard — Type
struct InterpCard <: Card
    interpolator::Interpolator
    predictor::String
    targets::Vector{String}
    extrapolation_left::ExtrapolationType.T
    extrapolation_right::ExtrapolationType.T
    dir::Union{Symbol, Nothing} = nothing
    partition::Union{String, Nothing} = nothing
    suffix::String = "hat"
end
Interpolate targets based on predictor.
Pipelines.GaussianEncodingCard — Type
struct GaussianEncodingCard <: Card
Defines a card for applying Gaussian transformations to a specified column.
Fields:
- column::String: Name of the column to transform.
- processed_column::Union{FunClosure, Nothing}: Processed column using a given method (see below).
- n_modes::Int: Number of Gaussian curves to generate.
- max::Float64: Maximum value used for normalization (denominator).
- lambda::Float64: Coefficient for scaling the standard deviation.
- suffix::String: Suffix added to the output column names.
Notes:
- The method field determines the preprocessing applied to the column.
- No automatic selection based on column type. The user must ensure compatibility:
  - "identity": Assumes the column is numeric.
  - "dayofyear": Assumes the column is a date or timestamp.
  - "hourofday": Assumes the column is a time or timestamp.
Methods:
- Defined in the TEMPORAL_PREPROCESSING dictionary:
  - "identity": No transformation.
  - "dayofyear": Applies the SQL dayofyear function.
  - "hourofday": Applies the SQL hour function.
Train:
- Returns: SimpleTable (Dict{String, AbstractVector}) with Gaussian parameters:
  - σ: Standard deviation for Gaussian transformations.
  - d: Normalization value.
  - μ_1, μ_2, ..., μ_n: Gaussian means.
Evaluate:
- Steps:
  - Preprocesses the column using the specified method.
  - Temporarily registers the Gaussian parameters (params_tbl) using with_table.
  - Joins the source table with the params table via a CROSS JOIN.
  - Computes Gaussian-transformed columns.
  - Selects only the required columns (original and transformed).
  - Replaces the target table with the final results.
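The train and evaluate steps above can be sketched in plain Julia. The source does not give the formulas, so the following are assumptions made for illustration: means evenly spaced on [0, 1), σ equal to lambda times the spacing between means, an unnormalized Gaussian applied to value / d, and output columns named "$(column)_$(suffix)_$i". The functions train_gaussian and evaluate_gaussian are hypothetical, and preprocessing is taken to have already happened.

```julia
# Toy `train`: a one-row parameter table with the documented keys σ, d, μ_1, ..., μ_n.
# Assumed (not confirmed by the source): evenly spaced means and σ = lambda * spacing.
function train_gaussian(n_modes::Int, max_value::Float64, lambda::Float64)
    params = Dict{String, AbstractVector}("σ" => [lambda / n_modes], "d" => [max_value])
    for i in 1:n_modes
        params["μ_$i"] = [(i - 1) / n_modes]
    end
    return params
end

# Toy `evaluate`: because the parameter table has a single row, the CROSS JOIN
# amounts to making its values available on every source row.
function evaluate_gaussian(values::Vector{Float64}, params, column, suffix)
    σ, d = only(params["σ"]), only(params["d"])
    n_modes = length(params) - 2  # every key except σ and d is a mean
    result = Dict{String, Vector{Float64}}(column => values)
    for i in 1:n_modes
        μ = only(params["μ_$i"])
        # Assumed output naming and unnormalized Gaussian form.
        result["$(column)_$(suffix)_$i"] = @. exp(-((values / d - μ)^2) / (2 * σ^2))
    end
    return result
end

params = train_gaussian(4, 366.0, 0.5)
encoded = evaluate_gaussian([0.0, 91.5, 183.0], params, "dayofyear", "gaussian")
```

This sketch ignores the periodic nature of day-of-year and hour-of-day values, so a value near the maximum is not treated as close to a value near zero.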
Pipelines.StreamlinerCard — Type
struct StreamlinerCard <: Card
    model::Model
    training::Training
    order_by::Vector{String}
    predictors::Vector{String}
    targets::Vector{String}
    partition::Union{String, Nothing} = nothing
    suffix::String = "hat"
end
Run a Streamliner model, predicting targets from predictors.