tech.ml.dataset Backend Implementation

​ tech.ml.dataset is a Clojure library for tabular data processing similar to Python's Pandas, or R's data.table. It supports pragmatic data-intensive work on the JVM by providing powerful abstractions that simplify implementing efficient solutions to real problems. ​ Detailed documentations of tech.ml.dataset can be found on its official website. ​

Dataset Construction

​ Alternatively, you can also create the dataset using the function ds/->dataset provided by tech.ml.dataset, with which you can create a dataset from: ​

  • common file types(e.g. csv, tsv, xls(x), json)taken from local - file system or URL
  • sequence of maps ​ For detail usage, please refer to the official document of tech.ml.dataset. ​

Row Selection

​ The row selection operations are implemented with the ds/select-rows function in tech.ml.dataset, which natively supports both "by-filter" selection and "by-index" selection. When multiple filters are provided, we use reduce to repeatedly apply ds/filter-column. ​

Column Selection

​ The column selection operations are implemented with the ds/select-columns function in tech.ml.dataset. ​

Optional Selection

Group by

​ The grouping by operation is implemented using the ds/group-by-column function provided by tech.ml.dataset, which accepts a regular dataset and returns a grouped dataset. The grouped dataset will then be handled by the aggregate functions. ​

Sort by

​ The sorting by operation is implemented using the ds/sort-by-column function provided by tech.ml.dataset. ​

Aggregate Function

​ tech.ml.dataset lacks customized support for aggregate functions. Therefore, we need to use the ds/descriptive-stats function to generate relevant statistical data of relevant columns when performing grouping by operations, and select the required columns during column selection. ​ We defined the get-agg-key function to solve the naming problem of aggregated columns.



Copyright © 2024 Datajure
Powered by Cryogen