Geni Backend Implementation
Geni is a Clojure dataframe library that runs on Apache Spark. It provides an idiomatic Spark interface for Clojure without the hassle of Java or Scala interop. Therefore, this backend is actually supported by Apache Spark, while we use Geni's APIs in the implementation.
For the details of Geni APIs, please refer to Geni Documentation. For the details of Apache Spark behaviour, please refer to Spark Documentation.
Dataset Construction
Datajure provides a function dataset
to construct a Geni dataset from an associative map. See Examples for specific usage.
In addition, Geni also provides a variety of APIs for creating/importing dataframes or datasets, such as g/to-df
, g/create-dataframe
, g/table->dataset
, g/map->dataset
, and g/records->dataset
.
For examples, please refer to the official document of Geni.
Row Selection
The row selection operations are implemented with the g/filter
function in Geni, which supports "by-filter" selection but not "by-index" selection.
Row Selection by Filter
The filter operation is implemented with the g/filter
function provided by Geni. When multiple filters are provided, we use reduce
to repeatedly apply g/filter
. However, due to the limitations of Geni itself, the filter functions must be expressions written with Geni operators, e.g., g/<
, instead of Clojure operators such as <
.
For more information, please refer to Geni Semantics.
Row Selection by Index
Since row selection by index is not natively supported by Geni and the order of the rows in the dataset is not guaranteed, the users are highly suggested to manually add a column for indices and then perform filter operations on that column.
Column Selection
The column selection operations are implemented with the g/select
function in Geni.
Optional Selection
Group by
The grouping by operation is implemented using the g/group-by
function provided by Geni, which accepts a regular dataset and returns a grouped dataset. The grouped dataset will then be handled by the aggregate functions.
Sort by
The sorting by operation is implemented using the g/sort
function provided by Geni. However, due to the limitations of Geni itself, customized comparators are not supported.
Aggregate Function
We use the g/agg
function to calculate statistical data of relevant columns when performing grouping by operations. To avoid runtime errors, we require that columns participating in aggregation must be of numeric type.
The aggregation result column naming follows Geni’s default. We define the get-agg-key
function to convert the Datajure-style column description statements in the query statement into Geni style.