tech.ml.dataset Backend Implementation

2023-09-11 By: TAM Pai-Lok

Dataset Construction
Row Selection
Column Selection
Optional Selection

Group by
Sort by

Aggregate Function

tech.ml.dataset is a Clojure library for tabular data processing, similar to Python's pandas or R's data.table. It supports pragmatic data-intensive work on the JVM by providing powerful abstractions that simplify implementing efficient solutions to real problems. Detailed documentation of tech.ml.dataset can be found on its official website.

Dataset Construction

Alternatively, you can create the dataset directly with the ds/->dataset function provided by tech.ml.dataset, which can build a dataset from:

common file types (e.g. csv, tsv, xls(x), json) taken from the local file system or a URL
a sequence of maps

For detailed usage, please refer to the official documentation of tech.ml.dataset.
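As a minimal sketch of the sequence-of-maps case, the snippet below builds a small dataset and inspects it. It assumes the library is loaded from the tech.v3.dataset namespace under the alias ds; the sample rows and the name people are made up for illustration.

```clojure
(require '[tech.v3.dataset :as ds])

;; Hypothetical sample data: one map per row, keys become column names.
(def people
  (ds/->dataset [{:name "Ann" :age 31}
                 {:name "Bob" :age 25}
                 {:name "Cid" :age 42}]))

(ds/row-count people)     ; number of rows in the dataset
(ds/column-names people)  ; column names taken from the map keys

;; A file path or URL string can be passed in the same position,
;; e.g. (ds/->dataset "data.csv"), where "data.csv" is a placeholder.
```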

Row Selection

The row selection operations are implemented with the ds/select-rows function in tech.ml.dataset, which natively supports both "by-filter" selection and "by-index" selection. When multiple filters are provided, we use reduce to repeatedly apply ds/filter-column.
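The two selection styles can be sketched as follows. This assumes the tech.v3.dataset namespace (alias ds) with the dataset as the first argument; older releases may order arguments differently. The sample data and the [column predicate] pair format for filters are hypothetical and not necessarily the representation used in the backend.

```clojure
(require '[tech.v3.dataset :as ds])

;; Hypothetical sample data.
(def people
  (ds/->dataset [{:name "Ann" :age 31}
                 {:name "Bob" :age 25}
                 {:name "Cid" :age 42}]))

;; "By-index" selection: keep the rows at positions 0 and 2.
(def first-and-last (ds/select-rows people [0 2]))

;; "By-filter" selection with several filters.
;; Each filter is a hypothetical [column predicate] pair.
(def filters [[:age #(> % 20)]
              [:age #(< % 40)]])

;; reduce threads the dataset through one ds/filter-column call per filter,
;; so a row survives only if it passes every predicate.
(def filtered
  (reduce (fn [dataset [col pred]]
            (ds/filter-column dataset col pred))
          people
          filters))
```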

Column Selection

The column selection operations are implemented with the ds/select-columns function in tech.ml.dataset.
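A short sketch of column selection, again assuming the tech.v3.dataset namespace under the alias ds and made-up sample data:

```clojure
(require '[tech.v3.dataset :as ds])

;; Hypothetical sample data.
(def people
  (ds/->dataset [{:name "Ann" :age 31 :city "Oslo"}
                 {:name "Bob" :age 25 :city "Rome"}]))

;; Keep only the listed columns; all rows are retained.
(def names-and-ages (ds/select-columns people [:name :age]))
```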

Optional Selection

Group by

The group-by operation is implemented using the ds/group-by-column function provided by tech.ml.dataset, which accepts a regular dataset and returns a grouped dataset. The grouped dataset is then handled by the aggregate functions.
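The sketch below shows what grouping looks like under the assumption that the library is loaded as tech.v3.dataset (alias ds). In that namespace ds/group-by-column returns a map from each distinct column value to the sub-dataset holding its rows; the sample data is hypothetical.

```clojure
(require '[tech.v3.dataset :as ds])

;; Hypothetical sample data.
(def staff
  (ds/->dataset [{:dept "eng" :salary 100}
                 {:dept "eng" :salary 120}
                 {:dept "ops" :salary 90}]))

;; One entry per distinct :dept value, each holding a sub-dataset.
(def groups (ds/group-by-column staff :dept))

(keys groups)                       ; the distinct department values
(ds/row-count (get groups "eng"))   ; rows that belong to the "eng" group
```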

Sort by

The sort-by operation is implemented using the ds/sort-by-column function provided by tech.ml.dataset.
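A minimal sorting sketch, assuming the tech.v3.dataset namespace (alias ds) with the dataset passed first and the default ascending order; the sample data is made up.

```clojure
(require '[tech.v3.dataset :as ds])

;; Hypothetical sample data.
(def people
  (ds/->dataset [{:name "Ann" :age 31}
                 {:name "Bob" :age 25}
                 {:name "Cid" :age 42}]))

;; Reorder all rows by the values in the :age column.
(def by-age (ds/sort-by-column people :age))

(vec (ds/column by-age :name))  ; names in order of increasing age
```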

Aggregate Function

tech.ml.dataset does not provide dedicated support for aggregate functions. Therefore, when a group-by operation is performed, we use the ds/descriptive-stats function to generate statistics for the relevant columns, and then pick out the required columns during column selection. We defined the get-agg-key function to solve the naming problem of aggregated columns.
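The idea can be sketched roughly as below, assuming the tech.v3.dataset namespace (alias ds). The helpers agg-key-sketch and group-stat are hypothetical names invented for this illustration; in particular, agg-key-sketch is only a guess at the kind of naming get-agg-key performs, not its actual implementation.

```clojure
(require '[tech.v3.dataset :as ds])

;; Hypothetical sample data.
(def staff
  (ds/->dataset [{:dept "eng" :salary 100}
                 {:dept "eng" :salary 120}
                 {:dept "ops" :salary 90}]))

;; Hypothetical naming helper: combines column and statistic into one key,
;; e.g. :salary and :mean become :salary-mean.
(defn agg-key-sketch [col stat]
  (keyword (str (name col) "-" (name stat))))

;; Hypothetical helper: for each group, run ds/descriptive-stats on the
;; value column and keep only the requested statistic.
(defn group-stat [dataset group-col value-col stat]
  (into {}
        (map (fn [[group-value sub-ds]]
               (let [stats (ds/descriptive-stats
                            (ds/select-columns sub-ds [value-col])
                            {:stat-names [stat]})
                     ;; stats has one row per described column
                     row   (first (ds/mapseq-reader stats))]
                 [group-value
                  {(agg-key-sketch value-col stat) (get row stat)}])))
        (ds/group-by-column dataset group-col)))

(def mean-salary (group-stat staff :dept :salary :mean))
```

Computing a full statistics table per group and then discarding most of it is wasteful, but it keeps the aggregation step expressible with functions the library already provides.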
