DataIngestion

Ingestion interface

DataIngestion.is_supported - Function
is_supported(file::AbstractString)

Return whether file is in one of the supported formats:

  • json,
  • tsv,
  • txt,
  • csv,
  • parquet.
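
As a rough illustration, the check could be approximated by comparing the file extension against the list above. This is only a sketch: the helper name is_supported_sketch and the extension-based rule are assumptions, not the package's actual implementation.

```julia
# Hypothetical stand-in for is_supported; assumes the format is
# identified by the file extension alone.
const SUPPORTED_EXTENSIONS = ("json", "tsv", "txt", "csv", "parquet")

function is_supported_sketch(file::AbstractString)
    _, ext = splitext(file)   # ext includes the leading dot, e.g. ".csv"
    return lstrip(ext, '.') in SUPPORTED_EXTENSIONS
end

is_supported_sketch("measurements.csv")   # true
is_supported_sketch("notes.docx")         # false
```

A file without an extension yields an empty ext and is therefore rejected by this sketch.
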
DataIngestion.load_files - Function
load_files(
    repository::Repository,
    files::AbstractVector{<:AbstractString};
    format::AbstractString,
    schema = nothing,
    union_by_name = true,
    kwargs...
)

Load files into a table named TABLE_NAMES.source inside repository.db, within the schema given by schema (defaults to the main schema).

The format is inferred unless it is passed explicitly through the format keyword.

The following formats are supported:

  • json,
  • tsv,
  • txt,
  • csv,
  • parquet.

union_by_name and the remaining keyword arguments are forwarded to the reader for the given format.
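
To make the forwarding concrete, here is a minimal sketch of how a load query might be assembled. It assumes a DuckDB backend, where read_csv, read_json and read_parquet accept a list of files and a union_by_name option. The function load_query_sketch, the READERS mapping and the table name "source" are hypothetical and do not reflect the package's internals.

```julia
# Hypothetical mapping from format to a DuckDB reader function.
# Treating tsv and txt as delimited text read by read_csv is an assumption.
const READERS = Dict(
    "csv" => "read_csv",
    "tsv" => "read_csv",
    "txt" => "read_csv",
    "json" => "read_json",
    "parquet" => "read_parquet",
)

# Quote a string as a SQL literal, doubling embedded single quotes.
sql_literal(s::AbstractString) = "'" * replace(s, "'" => "''") * "'"

# Build the SQL text only; nothing is executed against a database here.
function load_query_sketch(files; format, table = "source", union_by_name = true)
    reader = READERS[format]
    file_list = join(sql_literal.(files), ", ")
    return "CREATE OR REPLACE TABLE $(table) AS " *
           "SELECT * FROM $(reader)([$(file_list)], union_by_name = $(union_by_name))"
end

println(load_query_sketch(["a.csv", "b.csv"]; format = "csv"))
```

With union_by_name enabled, DuckDB aligns the columns of the input files by name rather than by position, which is useful when the files do not all share the same column order.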

Internal

DataIngestion.parse_paths - Function
parse_paths(d::AbstractDict)::Vector{String}

Generate a list of file paths based on a configuration dictionary. The file paths are interpreted as relative to DataIngestion.DATA_DIR[].
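
The path resolution can be sketched as follows. The layout of the configuration dictionary is not documented here, so the "paths" key is purely an assumption, and DATA_DIR_SKETCH is a hypothetical stand-in for DataIngestion.DATA_DIR.

```julia
# Hypothetical stand-in for DataIngestion.DATA_DIR; a Ref so that it can be
# reassigned at runtime and read with [].
const DATA_DIR_SKETCH = Ref("data")

# Assumes the configuration stores relative paths under a "paths" key;
# the real configuration layout may differ.
function parse_paths_sketch(d::AbstractDict)::Vector{String}
    return [joinpath(DATA_DIR_SKETCH[], path) for path in d["paths"]]
end

config = Dict("paths" => ["2023/a.csv", "2023/b.csv"])
parse_paths_sketch(config)
```

Storing the base directory in a Ref lets it be changed once, for instance at application startup, without touching the configuration dictionaries.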

Metadata for filter generation

DataIngestion.summarize - Function
summarize(repository::Repository, tbl::AbstractString; schema = nothing)

Compute summaries of variables in table tbl within the database repository.db. The summary of a variable depends on its type:

  • Categorical variable => list of unique values.
  • Continuous variable => extrema.
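
The rules above can be illustrated on in-memory columns. This is a sketch rather than the package's database-backed implementation: summarize_column is a hypothetical helper, and treating "continuous" as "numeric element type" is an assumption.

```julia
# Hypothetical in-memory analogue of summarize: columns are plain vectors.
# Numeric columns are summarized by their extrema, all others by their
# unique entries.
summarize_column(values::AbstractVector{<:Real}) = extrema(values)
summarize_column(values::AbstractVector) = unique(values)

columns = Dict(
    "city" => ["Rome", "Paris", "Rome"],
    "temperature" => [21.5, 18.0, 24.3],
)

summaries = Dict(name => summarize_column(col) for (name, col) in columns)
```

Summaries of this kind are what a filtering interface needs to offer sensible choices: a range for a continuous variable and a list of options for a categorical one.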

Filtering interface

DataIngestion.select - Function
select(repository::Repository, filters::AbstractVector; schema = nothing)

Create a table named TABLE_NAMES.selection within the database repository.db. The table is filled with the rows of TABLE_NAMES.source that are kept by the filters in filters.

Each filter should be an instance of Filter.
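
The effect of a list of filters can be sketched on in-memory rows. Here each filter is modelled as a plain predicate rather than a Filter instance, and a row is kept only if every predicate accepts it. Combining filters with a logical AND is an assumption about the package's behavior.

```julia
# Hypothetical in-memory analogue of select, using NamedTuples as rows.
rows = [
    (city = "Rome", temperature = 21.5),
    (city = "Paris", temperature = 18.0),
    (city = "Oslo", temperature = 9.0),
]

# Stand-ins for filters: one interval-like and one list-like predicate.
predicates = [
    row -> 10.0 <= row.temperature <= 25.0,
    row -> row.city in ("Rome", "Oslo"),
]

# Keep the rows accepted by all predicates.
selection = filter(row -> all(p -> p(row), predicates), rows)
```

With an empty list of predicates, all returns true for every row, so the selection is the whole source.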

Filters

DataIngestion.IntervalFilter - Type
struct IntervalFilter{T} <: Filter
    colname::String
    interval::ClosedInterval{T}
end

Object to retain only those rows for which the variable colname lies inside the interval.

DataIngestion.ListFilter - Type
struct ListFilter{T} <: Filter
    colname::String
    list::Vector{T}
end

Object to retain only those rows for which the variable colname belongs to a list of options.
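
The semantics of the two filter types can be sketched with self-contained stand-ins. All names ending in Sketch, and the keeps function, are hypothetical. ClosedIntervalSketch replaces the ClosedInterval type, which this sketch does not import.

```julia
abstract type FilterSketch end

# Minimal closed interval: both endpoints belong to the interval.
struct ClosedIntervalSketch{T}
    left::T
    right::T
end

struct IntervalFilterSketch{T} <: FilterSketch
    colname::String
    interval::ClosedIntervalSketch{T}
end

struct ListFilterSketch{T} <: FilterSketch
    colname::String
    list::Vector{T}
end

# A row is a NamedTuple; the column is looked up by name.
function keeps(f::IntervalFilterSketch, row)
    value = getproperty(row, Symbol(f.colname))
    return f.interval.left <= value <= f.interval.right
end

keeps(f::ListFilterSketch, row) = getproperty(row, Symbol(f.colname)) in f.list

row = (city = "Rome", temperature = 21.5)
keeps(IntervalFilterSketch("temperature", ClosedIntervalSketch(10.0, 25.0)), row)  # true
keeps(ListFilterSketch("city", ["Paris", "Oslo"]), row)                             # false
```

Dispatching keeps on the filter type means a further kind of filter would need only a new struct and one more method.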
