DataIngestion

Ingestion interface

DataIngestion.is_supported - Function
is_supported(file::AbstractString)

Return whether a file is in one of the supported formats:

  • json,
  • tsv,
  • txt,
  • csv,
  • parquet.
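
As an illustration of the check described above, here is a minimal sketch that decides support from the file extension. The name is_supported_sketch, the constant SUPPORTED_FORMATS_SKETCH, and the case-insensitive matching are assumptions made for this example; the actual implementation in DataIngestion may work differently.

```julia
# Hypothetical stand-in for the list of formats given in the docstring.
const SUPPORTED_FORMATS_SKETCH = ("json", "tsv", "txt", "csv", "parquet")

# Sketch only: decide support from the file extension (assumed case-insensitive).
function is_supported_sketch(file::AbstractString)
    _, ext = splitext(file)               # e.g. ("data/a", ".csv")
    format = lowercase(lstrip(ext, '.'))  # drop the leading dot
    return format in SUPPORTED_FORMATS_SKETCH
end

is_supported_sketch("data/measurements.csv")   # supported extension
is_supported_sketch("data/report.pdf")         # unsupported extension
```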
DataIngestion.load_files - Function
load_files(
    repository::Repository,
    files::AbstractVector{<:AbstractString},
    table = "source";
    format::AbstractString,
    schema = nothing,
    union_by_name = true, kwargs...
)

Load files into a table named table (defaults to "source") within the schema named schema (defaults to the main schema) of the database repository.db, where repository is a Repository.

The format is inferred or can be passed explicitly.

The following formats are supported:

  • json,
  • tsv,
  • txt,
  • csv,
  • parquet.

union_by_name and the remaining keyword arguments are forwarded to the reader for the given format.
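
The rule "inferred or passed explicitly" can be pictured with the following sketch. It is not the library's code: resolve_format_sketch is a hypothetical helper, and inferring the format from a shared file extension (and rejecting mixed extensions) is an assumption made only for illustration.

```julia
# Sketch only: an explicit format wins; otherwise try to infer a single
# format from the file extensions (assumed behavior, not confirmed by the docs).
function resolve_format_sketch(files::AbstractVector{<:AbstractString}; format = nothing)
    isnothing(format) || return String(format)
    exts = unique(lowercase(lstrip(last(splitext(f)), '.')) for f in files)
    length(exts) == 1 ||
        throw(ArgumentError("cannot infer a single format from extensions $exts"))
    return only(exts)
end

resolve_format_sketch(["a.csv", "b.csv"])             # inferred from the extension
resolve_format_sketch(["a.dat"]; format = "tsv")      # explicit format takes priority
```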


Internal

DataIngestion.parse_paths - Function
parse_paths(d::AbstractDict)::Vector{String}

Generate a list of file paths based on a configuration dictionary. The file paths are interpreted as relative to DataIngestion.DATA_DIR[].
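
The sketch below shows one way relative paths could be resolved against a directory stored in a Ref. The layout of the configuration dictionary is not documented here, so the "files" key, the name parse_paths_sketch, and the constant DATA_DIR_SKETCH are all hypothetical.

```julia
# Hypothetical stand-in for DataIngestion.DATA_DIR: a Ref holding the base directory.
const DATA_DIR_SKETCH = Ref("data")

# Sketch only: assumes the configuration stores relative paths under a "files" key.
function parse_paths_sketch(d::AbstractDict)::Vector{String}
    return String[joinpath(DATA_DIR_SKETCH[], file) for file in d["files"]]
end

config = Dict("files" => ["a.csv", "sub/b.json"])
paths = parse_paths_sketch(config)   # each entry is prefixed with DATA_DIR_SKETCH[]
```

Storing the directory in a Ref lets it be changed at runtime (DATA_DIR_SKETCH[] = "other") without redefining the constant.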


Metadata for filter generation

DataIngestion.summarize - Function
summarize(repository::Repository, tbl::AbstractString; schema = nothing)

Compute summaries of variables in table tbl within the database repository.db. The summary of a variable depends on its type, according to the following rules.

  • Categorical variable => list of unique values.
  • Continuous variable => extrema.
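
The two rules can be sketched on in-memory columns as follows. This is only an illustration: summarize operates on a database table, whereas the hypothetical summarize_sketch below works on a NamedTuple of vectors, and treating every real-valued column as continuous is an assumption.

```julia
# Sketch only: real-valued columns are treated as continuous (assumption),
# everything else as categorical.
summarize_column(v::AbstractVector{<:Real}) = (kind = :continuous, bounds = extrema(v))
summarize_column(v::AbstractVector) = (kind = :categorical, values = unique(v))

# Apply the per-column rule to every column of a table given as a NamedTuple.
summarize_sketch(columns::NamedTuple) = map(summarize_column, columns)

table = (age = [25, 41, 33], city = ["Rome", "Paris", "Rome"])
result = summarize_sketch(table)
result.age.bounds    # minimum and maximum of the age column
result.city.values   # distinct cities, in order of first appearance
```

Summaries of this shape are what a filter interface needs: bounds for a range control and distinct values for a list of options.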

Filtering interface

DataIngestion.select - Function
select(
    repository::Repository,
    filters::AbstractVector,
    (src, tgt)::Pair = "source" => "selection";
    schema = nothing
)

Create a table with name tgt (defaults to "selection") within the schema schema (defaults to main schema) inside repository.db, where repository is a Repository. The table tgt is filled with rows from the table src (defaults to "source") that are kept by the filters in filters.

Each filter should be an instance of Filter.


Filters

DataIngestion.IntervalFilter - Type
struct IntervalFilter{T} <: Filter
    colname::String
    interval::ClosedInterval{T}
end

Object to retain only those rows for which the variable colname lies inside the interval.

DataIngestion.ListFilter - Type
struct ListFilter{T} <: Filter
    colname::String
    list::Vector{T}
end

Object to retain only those rows for which the variable colname belongs to a list of options.
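
To make the row-retention semantics of the two filter types concrete, here is a self-contained sketch on in-memory rows. All names ending in Sketch, the keeps helper, and select_rows are hypothetical; the interval is stored as two bounds instead of a ClosedInterval to avoid an external dependency, and combining several filters with a logical AND is an assumption about how select behaves.

```julia
abstract type FilterSketch end

# Stand-in for IntervalFilter: a closed interval includes both endpoints.
struct IntervalFilterSketch{T} <: FilterSketch
    colname::String
    lo::T
    hi::T
end

# Stand-in for ListFilter: the value must be one of the listed options.
struct ListFilterSketch{T} <: FilterSketch
    colname::String
    list::Vector{T}
end

keeps(f::IntervalFilterSketch, row) = f.lo <= row[Symbol(f.colname)] <= f.hi
keeps(f::ListFilterSketch, row) = row[Symbol(f.colname)] in f.list

# Assumption: a row is kept only if every filter keeps it.
select_rows(rows, filters) = [row for row in rows if all(f -> keeps(f, row), filters)]

rows = [
    (age = 25, city = "Rome"),
    (age = 41, city = "Paris"),
    (age = 33, city = "Rome"),
]
filters = [IntervalFilterSketch("age", 30, 50), ListFilterSketch("city", ["Rome"])]
selection = select_rows(rows, filters)   # keeps only (age = 33, city = "Rome")
```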
