DataIngestion
Ingestion interface
DataIngestion.is_supported — Function
is_supported(file::AbstractString)
Indicate whether a file is in one of the supported formats:
json, tsv, txt, csv, parquet.
DataIngestion.acceptable_paths — Function
acceptable_paths()
Return the list of relative paths corresponding to supported files within DATA_DIR[].
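The behavior of these two functions can be illustrated with a minimal sketch that uses only Julia's standard library. The names SUPPORTED_EXTENSIONS, sketch_is_supported and sketch_acceptable_paths are hypothetical, and the sketch assumes that the format is recognized from the file extension alone (case-insensitively); the actual DataIngestion implementation may differ.

```julia
# Hypothetical stand-ins for illustration; not the DataIngestion implementation.
const SUPPORTED_EXTENSIONS = ("json", "tsv", "txt", "csv", "parquet")

# Assumption: the format is recognized from the file extension alone,
# ignoring letter case.
function sketch_is_supported(file::AbstractString)
    _, ext = splitext(file)  # ext includes the leading dot, e.g. ".csv"
    return lowercase(lstrip(ext, '.')) in SUPPORTED_EXTENSIONS
end

# Collect the supported files found under `dir`, as paths relative to `dir`.
function sketch_acceptable_paths(dir::AbstractString)
    paths = String[]
    for (root, _, files) in walkdir(dir)
        for file in files
            if sketch_is_supported(file)
                push!(paths, relpath(joinpath(root, file), dir))
            end
        end
    end
    return sort!(paths)
end
```

Here the directory is passed explicitly as an argument, whereas acceptable_paths() reads it from DATA_DIR[].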
DataIngestion.load_files — Function
load_files(
    repository::Repository,
    files::AbstractVector{<:AbstractString},
    table = "source";
    format::AbstractString,
    schema = nothing,
    union_by_name = true,
    kwargs...
)
Load files into a table called table (defaults to "source") within the schema schema (defaults to the main schema) inside repository.db, where repository is a Repository.
The format is inferred or can be passed explicitly.
The following formats are supported:
json, tsv, txt, csv, parquet.
union_by_name and the remaining keyword arguments are forwarded to the reader for the given format.
Internal
DataIngestion.parse_paths — Function
parse_paths(d::AbstractDict)::Vector{String}
Generate a list of file paths based on a configuration dictionary. The file paths are interpreted as relative to DataIngestion.DATA_DIR[].
Metadata for filter generation
DataIngestion.summarize — Function
summarize(repository::Repository, tbl::AbstractString; schema = nothing)
Compute summaries of variables in table tbl within the database repository.db. The summary of a variable depends on its type, according to the following rules.
- Categorical variable => list of unique values.
- Continuous variable => extrema.
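The two rules above can be illustrated on in-memory columns. This is only a sketch under stated assumptions: sketch_summarize is a hypothetical name, the categorical summary is read as the distinct values of the column, and a column is treated as continuous when all its non-missing entries are non-Boolean real numbers. The real summarize operates on a table inside repository.db and may classify variables differently.

```julia
# Hypothetical sketch; not the DataIngestion implementation.
# Assumption: a column is continuous when all its non-missing entries are
# non-Boolean real numbers; every other column is categorical.
function sketch_summarize(columns::AbstractDict)
    summaries = Dict{String, Any}()
    for (name, values) in columns
        present = collect(skipmissing(values))
        if !isempty(present) && all(v -> v isa Real && !(v isa Bool), present)
            summaries[name] = extrema(present)  # (minimum, maximum)
        else
            summaries[name] = unique(present)   # distinct values, in order of first appearance
        end
    end
    return summaries
end

columns = Dict(
    "city" => ["Rome", "Paris", "Rome"],
    "temperature" => [21.5, 18.0, 25.1],
)
summary = sketch_summarize(columns)
```

The extrema of a continuous variable give the natural bounds for an interval filter, and the distinct values of a categorical variable give the options for a list filter.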
Filtering interface
DataIngestion.Filter — Type
abstract type Filter end
Abstract supertype to encompass all possible filters.
Current implementations:
- IntervalFilter
- ListFilter
DataIngestion.Filter — Method
Filter(d::AbstractDict)
Generate a Filter based on a configuration dictionary.
DataIngestion.select — Function
select(
    repository::Repository,
    filters::AbstractVector,
    (src, tgt)::Pair = "source" => "selection";
    schema = nothing
)
Create a table with name tgt (defaults to "selection") within the schema schema (defaults to the main schema) inside repository.db, where repository is a Repository. The table tgt is filled with rows from the table src (defaults to "source") that are kept by the filters in filters.
Each filter should be an instance of Filter.
Filters
DataIngestion.IntervalFilter — Type
struct IntervalFilter{T} <: Filter
    colname::String
    interval::ClosedInterval{T}
end
Object to retain only those rows for which the variable colname lies inside the interval.
DataIngestion.ListFilter — Type
struct ListFilter{T} <: Filter
    colname::String
    list::Vector{T}
end
Object to retain only those rows for which the variable colname belongs to a list of options.
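The row-level rule that each filter expresses can be sketched in memory with plain Julia. Everything named Sketch* or sketch_* below, as well as keeps, is hypothetical. The sketch stores the interval as two plain bounds instead of a ClosedInterval to stay free of dependencies, and it assumes that a row must satisfy every filter to be selected. The package itself applies filters to a table inside repository.db, so this is an illustration of the semantics, not of the implementation.

```julia
# Hypothetical stand-ins for illustration; not the DataIngestion types.
abstract type SketchFilter end

struct SketchIntervalFilter{T} <: SketchFilter
    colname::String
    lo::T  # assumption: plain closed bounds instead of a ClosedInterval
    hi::T
end

struct SketchListFilter{T} <: SketchFilter
    colname::String
    list::Vector{T}
end

# A row is kept when its value lies in the closed interval [lo, hi] ...
keeps(f::SketchIntervalFilter, row) = f.lo <= row[f.colname] <= f.hi
# ... or when its value is one of the listed options.
keeps(f::SketchListFilter, row) = row[f.colname] in f.list

# Assumption: a row must satisfy every filter to be selected.
sketch_select(rows, filters) = filter(row -> all(f -> keeps(f, row), filters), rows)

rows = [
    Dict("city" => "Rome", "temperature" => 21.5),
    Dict("city" => "Paris", "temperature" => 18.0),
    Dict("city" => "Oslo", "temperature" => 9.0),
]
filters = SketchFilter[
    SketchIntervalFilter("temperature", 15.0, 25.0),
    SketchListFilter("city", ["Rome", "Oslo"]),
]
selected = sketch_select(rows, filters)
```

Dispatching keeps on the filter type mirrors the role of the abstract Filter supertype: a new kind of filter only needs its own struct and its own keeps method.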