Tablecloth Backend Implementation
Tablecloth is the default backend for Datajure. It is a layer built on top of tech.ml.dataset that reorganises its existing functions into a simple-to-use API.
Detailed documentation for Tablecloth can be found on its official website.
Dataset Construction
Datajure provides a function dataset to construct a Tablecloth dataset from an associative map. See Examples for specific usage.
Alternatively, you can create the dataset with the functions tc/dataset or tc/let-dataset provided by Tablecloth, which can build a dataset from:
- single values
- sequence of maps
- map of sequences or values
- sequence of columns (taken from other dataset or created manually)
- sequence of pairs: [string column-data] or [keyword column-data]
- array of any arrays
- file types: raw/gzipped csv/tsv, json, xls(x) taken from the local file system or a URL
- input stream
For examples, please refer to the official Tablecloth documentation.
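As a quick illustration of two of the input shapes listed above, the following sketch builds the same small dataset from a map of sequences and from a sequence of maps. It calls Tablecloth directly rather than Datajure's own dataset function, and the column names and values are made up for the example.

```clojure
(require '[tablecloth.api :as tc])

;; Map of sequences: each key becomes a column.
(def from-map
  (tc/dataset {:name ["Ann" "Bob" "Cid"]
               :age  [31 25 40]}))

;; Sequence of maps: each map becomes a row.
(def from-rows
  (tc/dataset [{:name "Ann" :age 31}
               {:name "Bob" :age 25}
               {:name "Cid" :age 40}]))

(tc/row-count from-map)      ; number of rows in the dataset
(tc/column-names from-rows)  ; column names as a sequence
```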
Row Selection
The row selection operations are implemented with the tc/select-rows function in Tablecloth, which natively supports both "by-filter" and "by-index" selection. When multiple filters are provided, we use reduce to apply tc/select-rows once per filter.
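The sketch below shows both selection styles and the reduce pattern for several filters. It is only an approximation of the idea: the sample data and the filters vector are hypothetical, and Datajure's actual filter representation may differ.

```clojure
(require '[tablecloth.api :as tc])

(def ds (tc/dataset {:name ["Ann" "Bob" "Cid" "Dee"]
                     :age  [31 25 40 19]
                     :dept ["ops" "dev" "dev" "dev"]}))

;; "By-index" selection: keep rows 0 and 2.
(def by-index (tc/select-rows ds [0 2]))

;; "By-filter" selection: the predicate receives each row as a map.
(def over-20 (tc/select-rows ds (fn [row] (> (:age row) 20))))

;; Several filters (hypothetical list): reduce threads the dataset through
;; tc/select-rows once per predicate, so a row must pass all of them.
(def filters [(fn [row] (> (:age row) 20))
              (fn [row] (= (:dept row) "dev"))])

(def filtered (reduce tc/select-rows ds filters))
```

Because each step narrows the result of the previous one, the combined effect is a logical AND of the filters.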
Column Selection
The column selection operations are implemented with the tc/select-columns function in Tablecloth.
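A minimal sketch of column selection with tc/select-columns, using made-up sample data:

```clojure
(require '[tablecloth.api :as tc])

(def ds (tc/dataset {:name ["Ann" "Bob" "Cid"]
                     :age  [31 25 40]
                     :dept ["ops" "dev" "dev"]}))

;; Keep only the listed columns; all rows are retained.
(def subset (tc/select-columns ds [:name :age]))

(tc/column-names subset)  ; the remaining column names
```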
Optional Selection
Group by
The group-by operation is implemented using the tc/group-by function provided by Tablecloth, which accepts a regular dataset and returns a grouped dataset. The grouped dataset is then handled by the aggregate functions.
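The following sketch shows the regular-to-grouped transition on made-up data. It stops at the grouped dataset and does not reproduce how Datajure hands it to its aggregate functions.

```clojure
(require '[tablecloth.api :as tc])

(def ds (tc/dataset {:name ["Ann" "Bob" "Cid" "Dee"]
                     :age  [31 25 40 19]
                     :dept ["ops" "dev" "dev" "dev"]}))

;; Group rows by the value of the :dept column.
(def grouped (tc/group-by ds :dept))

(tc/grouped? grouped)  ; truthy for a grouped dataset
(tc/grouped? ds)       ; falsey for a regular dataset

;; tc/ungroup concatenates the groups back into a regular dataset.
(def flat (tc/ungroup grouped))
```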
Sort by
The sort-by operation is implemented using the tc/order-by function provided by Tablecloth.
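As an illustration, tc/order-by sorts ascending by default and accepts an order keyword such as :desc. The data below is made up for the example.

```clojure
(require '[tablecloth.api :as tc])

(def ds (tc/dataset {:name ["Ann" "Bob" "Cid" "Dee"]
                     :age  [31 25 40 19]}))

;; Ascending order on :age (the default).
(def youngest-first (tc/order-by ds :age))

;; Descending order on :age.
(def oldest-first (tc/order-by ds :age :desc))

(vec (tc/column youngest-first :age))  ; ages after the ascending sort
```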
Aggregate Function
Tablecloth lacks dedicated support for these aggregate functions. Therefore, when a group-by operation is performed, we use the tc/info function to generate summary statistics for the relevant columns, and then pick out the required columns during column selection.
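To give a feel for what tc/info produces, the sketch below runs it on a small ungrouped dataset and keeps a few of the statistic columns. The data is made up, and the statistic column names used here (:col-name, :min, :mean, :max) are assumed from Tablecloth's descriptive-statistics output for numeric columns. How Datajure applies this per group is not shown.

```clojure
(require '[tablecloth.api :as tc])

(def ds (tc/dataset {:age    [31 25 40 19]
                     :salary [5200 4100 6300 3000]}))

;; With no extra argument, tc/info returns a dataset of descriptive
;; statistics with one row per column of ds.
(def stats (tc/info ds))

;; Keep only the statistics of interest (assumed column names).
(def wanted (tc/select-columns stats [:col-name :min :mean :max]))
```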
We defined the get-agg-key function to solve the naming problem of aggregated columns.