DataIngestion
Ingestion interface
DataIngestion.is_supported — Function
is_supported(file::AbstractString)
Return whether a file is in one of the supported formats: json, tsv, txt, csv, parquet.
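The check described above can be sketched in plain Julia. This is only an illustration: `SUPPORTED_FORMATS` and `is_supported_sketch` are hypothetical names, and matching on the lowercased file extension is an assumption, not necessarily how DataIngestion decides.

```julia
# Hypothetical stand-in for illustration; not DataIngestion's actual implementation.
const SUPPORTED_FORMATS = ("json", "tsv", "txt", "csv", "parquet")

# Return true when the file extension (without the leading dot, lowercased)
# is one of the supported formats. Case-insensitivity is an assumption.
function is_supported_sketch(file::AbstractString)
    _, ext = splitext(file)
    return lowercase(lstrip(ext, '.')) in SUPPORTED_FORMATS
end

is_supported_sketch("data/experiment.csv")  # true
is_supported_sketch("notes.md")             # false
```

A file without any extension yields an empty string after `splitext`, so it is rejected as well.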
DataIngestion.acceptable_paths — Function
acceptable_paths()
List of relative paths corresponding to supported files within DATA_DIR[].
DataIngestion.load_files — Function
load_files(
    repository::Repository,
    files::AbstractVector{<:AbstractString};
    format::AbstractString,
    schema = nothing,
    union_by_name = true,
    kwargs...
)
Load files into a table called TABLE_NAMES.source inside repository.db within the schema schema (defaults to the main schema).
The format is either inferred or passed explicitly through the format keyword.
The following formats are supported: json, tsv, txt, csv, parquet.
union_by_name and the remaining keyword arguments are forwarded to the reader for the given format.
Internal
DataIngestion.parse_paths — Function
parse_paths(d::AbstractDict)::Vector{String}
Generate a list of file paths based on a configuration dictionary. The file paths are interpreted as relative to DataIngestion.DATA_DIR[].
Metadata for filter generation
DataIngestion.summarize — Function
summarize(repository::Repository, tbl::AbstractString; schema = nothing)
Compute summaries of variables in table tbl within the database repository.db. The summary of a variable depends on its type, according to the following rules.
- Categorical variable => list of unique values.
- Continuous variable => extrema.
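The two rules can be illustrated on in-memory columns. This is a minimal sketch, not the package's code: `summarize_column` is a hypothetical helper, and treating every `Real`-valued column as continuous is an assumption. The actual summarize works on a table inside repository.db.

```julia
# Hypothetical helper for illustration; the real summarize queries repository.db.

# Continuous variable (assumed here: any column of real numbers) => extrema.
summarize_column(col::AbstractVector{<:Real}) = extrema(col)

# Categorical variable (assumed here: any other column) => unique entries,
# in order of first appearance.
summarize_column(col::AbstractVector) = unique(col)

summarize_column([1.5, 3.0, -2.0])           # (-2.0, 3.0)
summarize_column(["a", "b", "a"])            # ["a", "b"]
```

Dispatching on the element type keeps the rule for each kind of variable in its own method, which mirrors the per-type description above.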
Filtering interface
DataIngestion.Filter — Type
abstract type Filter end
Abstract supertype to encompass all possible filters.
Current implementations:
- IntervalFilter
- ListFilter
DataIngestion.Filter — Method
Filter(d::AbstractDict)
Generate a Filter based on a configuration dictionary.
DataIngestion.select — Function
select(repository::Repository, filters::AbstractVector; schema = nothing)
Create a table with name TABLE_NAMES.selection within the database repository.db, where repository is a Repository. The table TABLE_NAMES.selection is filled with rows from the table TABLE_NAMES.source that are kept by the filters in filters.
Each filter should be an instance of Filter.
Filters
DataIngestion.IntervalFilter — Type
struct IntervalFilter{T} <: Filter
    colname::String
    interval::ClosedInterval{T}
end
Object to retain only those rows for which the variable colname lies inside the interval.
DataIngestion.ListFilter — Type
struct ListFilter{T} <: Filter
    colname::String
    list::Vector{T}
end
Object to retain only those rows for which the variable colname belongs to a list of options.
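To make the semantics of the two filters concrete, here is a self-contained sketch that applies them to in-memory rows. All names (`SketchFilter`, `SketchIntervalFilter`, `SketchListFilter`, `keeps`, `select_rows`) are hypothetical. The closed interval is represented by plain `lo` and `hi` fields instead of `ClosedInterval{T}` to avoid external dependencies, and combining filters with a logical AND is an assumption. The actual select operates on tables in repository.db.

```julia
# Hypothetical sketch types for illustration; not DataIngestion's own structs.
abstract type SketchFilter end

# Stand-in for IntervalFilter: the closed interval [lo, hi] is stored as two
# fields rather than a ClosedInterval{T}.
struct SketchIntervalFilter{T} <: SketchFilter
    colname::String
    lo::T
    hi::T
end

# Stand-in for ListFilter.
struct SketchListFilter{T} <: SketchFilter
    colname::String
    list::Vector{T}
end

# A row is kept when the value in the named column satisfies the filter.
keeps(f::SketchIntervalFilter, row) = f.lo <= row[Symbol(f.colname)] <= f.hi
keeps(f::SketchListFilter, row) = row[Symbol(f.colname)] in f.list

# Keep only the rows accepted by every filter (AND semantics, assumed).
select_rows(rows, filters) = filter(row -> all(f -> keeps(f, row), filters), rows)

rows = [
    (age = 25, city = "Rome"),
    (age = 40, city = "Paris"),
    (age = 33, city = "Rome"),
]
filters = SketchFilter[
    SketchIntervalFilter("age", 30, 50),
    SketchListFilter("city", ["Rome"]),
]
selected = select_rows(rows, filters)  # only the row with age 33 in Rome remains
```

Both endpoints of the interval are included, matching the closed interval in the IntervalFilter definition.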