Get Started

Clojask Types

string

The default type for all columns

Class: java.lang.String

Examples

(set-type dataframe "col-name" "string")

int

Most efficiently stores an integer

Class: java.lang.Integer

Examples

(set-type dataframe "col-name" "int")

double

Accepts floats and integers

Class: java.lang.Double

Examples

(set-type dataframe "col-name" "double")

date

Transform a date string (no time field)

Class: java.util.Date (default format string: yyyy-MM-dd)

Examples

;; if the date looks like this 2020/11/12
(set-type dataframe "col-name" "date:yyyy/MM/dd")

datetime

Transform a date string (no time field)

Class: java.util.Date (default format string: yyyy-MM-dd HH:mm:ss)

Examples

;; if the date looks like this 2020/11/12 12:12:36
(set-type dataframe "col-name" "datetime:yyyy/MM/dd HH:mm:ss")

Hybrid Column & Empty Fields

Take this dataset as an example.

EmployeeEmployeeNameDepartmentSalary
1Alice11300
2Bob1134,000
3Carla 900
4Daniel121,000
5Evelyn13800
............

The first thing to do after creating this dataframe is to set the type of column Salary to int.

(ck/set-type "Salary" "int")

However, when you take a look at the data type of the latest Salary column using function print-df. It contains both int and string. This is because some numbers (contains ",") are not recognized by the integer parser, such as 34,000 and 1,000. Therefore, when you operate function to this column, you should accept both int and string as the input. One way to solve this issue is to create your own parser for integer that ignores comma.

(ck/set-parser "Salary" customized-parser)

The column will empty fields is just a special case of hybrid column. The Department column has an empty field. The type of it is string & nil. You can tell that empty fields simply contain value nil. It is worth notice that

(ck/compute ... :exception true)

will also assign nil to fields that throw exceptions during execution. Therefore, your operation functions have to consider nil input in these two scenarios.

System of operations

Clojask is defined by two key types: incremental and replaceable operations.

Replaceable operations

Replaceable operatons include

  • set-type
  • set-parser
  • group-by
  • add-formatter
  • join

Replaceable operations can be called multiple times to replace the previous call of the operation

(def x (dataframe "path/to/a.csv"))
;; defines dataframe from csv file "a.csv" 

(set-type x "int" "employee_no")
;; makes column "employee numbers" integers
 
(set-type x "str" "employee_no")
;; makes column "employee numbers" strings  

In the case above, Clojask will ultimately interpret the "employeee_no" column as a string, rather than an integer.

Incremental operations

Incremental operations include

  • filter
  • operate
  • aggregate

Incremental operations are able to build on top of one another and build up incrementally in their operation.

(filter x "Salary" (fn [salary] (<= salary 800)))
;; this statement deletes all the rows that have a salary larger than 800

(filter x "Department" (fn [dept] (= dept "computer science")))
;; this statements deletes all the rows that contain people not in the computer science department

(filter x ["Salary" "Department"] (fn [salary dept] (and (<= salary 800) (= dept "computer science"))))
;; keeps only people from computer science department with salary not larger than 800

This means additional incremental operations are applied on top of existing operations


Copyright © 2023 Clojask
Powered by Cryogen