tech.ml.dataset Backend Implementation
tech.ml.dataset is a Clojure library for tabular data processing similar to Python's Pandas, or R's data.table. It supports pragmatic data-intensive work on the JVM by providing powerful abstractions that simplify implementing efficient solutions to real problems. Detailed documentations of tech.ml.dataset can be found on its official website.
Dataset Construction
Alternatively, you can also create the dataset using the function ds/->dataset
provided by tech.ml.dataset, with which you can create a dataset from:
- common file types(e.g. csv, tsv, xls(x), json)taken from local - file system or URL
- sequence of maps For detail usage, please refer to the official document of tech.ml.dataset.
Row Selection
The row selection operations are implemented with the ds/select-rows
function in tech.ml.dataset, which natively supports both "by-filter" selection and "by-index" selection. When multiple filters are provided, we use reduce
to repeatedly apply ds/filter-column
.
Column Selection
The column selection operations are implemented with the ds/select-columns
function in tech.ml.dataset.
Optional Selection
Group by
The grouping by operation is implemented using the ds/group-by-column
function provided by tech.ml.dataset, which accepts a regular dataset and returns a grouped dataset. The grouped dataset will then be handled by the aggregate functions.
Sort by
The sorting by operation is implemented using the ds/sort-by-column
function provided by tech.ml.dataset.
Aggregate Function
tech.ml.dataset lacks customized support for aggregate functions. Therefore, we need to use the ds/descriptive-stats
function to generate relevant statistical data of relevant columns when performing grouping by operations, and select the required columns during column selection.
We defined the get-agg-key
function to solve the naming problem of aggregated columns.