Aggregation Functions
In Clojask, you can aggregate either the whole dataframe or a grouped dataframe. The former is called "simple aggregation" and the latter "group-by aggregation". The functions provided for simple aggregation are defined in the namespace clojask.api.aggregate, and those for group-by aggregation are defined in the namespace clojask.api.gb-aggregate.
Below is the full list of given functions for the two types of aggregation.
clojask.api.aggregate:
- max: Find the maximum value (uses clojure.core/compare as the comparator)
- min: Find the minimum value (uses clojure.core/compare as the comparator)
Note: by default, clojask/min may return nil (null) as the minimum value, because clojure.core/compare orders nil before every other value.
clojask.api.gb-aggregate:
- max: Find the maximum value (uses clojure.core/compare as the comparator)
- min: Find the minimum value (uses clojure.core/compare as the comparator)
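The note on min can be illustrated with plain Clojure. The sketch below does not use Clojask's own code: min-by-compare is a hypothetical helper written only to show how a minimum based on clojure.core/compare behaves when a column contains nil.

```clojure
;; Hypothetical helper for illustration; this is NOT Clojask's implementation.
(defn min-by-compare
  "Returns the smallest element of a non-empty coll according to clojure.core/compare."
  [coll]
  (reduce (fn [smallest x]
            ;; keep x only when it compares strictly lower than the current smallest
            (if (neg? (compare x smallest)) x smallest))
          coll))

(compare nil 1)            ;; negative number: nil sorts before any non-nil value
(min-by-compare [3 nil 1]) ;; => nil
(min-by-compare [3 2 1])   ;; => 1
```

If nil should not win, one option is to drop missing values from the column before aggregating.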
In addition to these given functions, you are also welcome to define your own aggregation functions.
How to define group-by aggregation functions?
This is the template:
(defn gb-aggre-template
  [col] ;; takes exactly one argument: the aggregation column, as a vector
  ;; ... your implementation
  result) ;; returns a single value (an int, a double, a string, or a collection of these)
The function takes exactly one argument: the full aggregation column. This assumes that the column fits in memory.
Many built-in Clojure functions already meet this requirement, for example count, as do countless functions built from reduce (a mean, for instance).
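As a minimal sketch of such functions, the two examples below follow the one-argument template. The names gb-mean and gb-distinct-count are made up for illustration, and gb-mean assumes the column already holds numbers rather than unparsed strings.

```clojure
;; Hypothetical group-by aggregation functions following the one-argument template.

(defn gb-mean
  "Arithmetic mean of the aggregation column; returns nil for an empty column.
   Assumes every element of col is a number."
  [col]
  (if (empty? col)
    nil
    (/ (reduce + col) (double (count col)))))

(defn gb-distinct-count
  "Number of distinct values in the aggregation column."
  [col]
  (count (set col)))

(gb-mean [1 2 3 4])               ;; => 2.5
(gb-distinct-count ["a" "b" "a"]) ;; => 2
```

Both functions read the whole column at once, which is why the column needs to fit in memory.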
How to define simple aggregation functions?
This is the template:
(defn aggre-template
  [old-result new-value]
  ;; old-result: the result returned by the previous call to aggre-template
  ;; new-value: the value of the column on the current row
  ;; ... your implementation
  new-result) ;; returns the new result, which is passed as old-result to the next call to aggre-template
Notes:
- The old-result for the first call to aggre-template is clojask.api.aggregate/start, so your function must handle the case where its first argument is clojask.api.aggregate/start.
- Your function should be self-sustainable, meaning that the result of aggre-template must be safe to use as the input for aggre-template.
To better understand this template, refer to the documentation of reduce: aggre-template should be usable as the reducing function passed to reduce.
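The sketch below shows two functions that follow this template, driven by plain reduce. To stay runnable without Clojask, it defines its own start sentinel as a stand-in; the value of the real clojask.api.aggregate/start is not shown here, and in real use you would compare against that var instead. The names running-sum and running-count are hypothetical.

```clojure
;; Stand-in for clojask.api.aggregate/start so this sketch runs without Clojask.
;; This is an assumption for illustration, not the library's actual value.
(def start ::start)

(defn running-sum
  "Simple aggregation sketch: adds each new value to the running total."
  [old-result new-value]
  (if (= old-result start)
    new-value                    ;; first row: nothing accumulated yet
    (+ old-result new-value)))   ;; later rows: the result feeds back in as old-result

(defn running-count
  "Simple aggregation sketch: counts rows, ignoring their values."
  [old-result _new-value]
  (if (= old-result start)
    1
    (inc old-result)))

(reduce running-sum start [1 2 3 4])   ;; => 10
(reduce running-count start [:a :b :c]) ;; => 3
```

Each function returns a value of the same kind it accepts as old-result, which is what keeps it safe to feed back into itself.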