Docs
- Background
- Overview
- Logical Processing Order
- Syntax
- Backends
Background
Clojure is a functional programming language and a dialect of Lisp. Its concise syntax and immutable core data structures simplify concurrent and multithreaded programming. The built-in macro system of the Lisp languages, together with the code-as-data philosophy, also makes programs highly flexible.
Macro System in Clojure
The macro system in Clojure allows the compiler to be extended by user code. The reader accepts the input code and constructs an Abstract Syntax Tree (AST), which the macro expander then transforms into another AST for evaluation. Users can define these transformations themselves by writing macros. This provides an excellent foundation for defining a custom syntax in Clojure, within the limits of the language's fundamental syntax.
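As a small illustration of this transformation step, the sketch below defines a hypothetical `unless` macro (not part of Clojure core or Datajure) and inspects its expansion with `macroexpand-1`. It uses only `clojure.core`.

```clojure
;; Hypothetical macro for illustration: evaluates the body only when
;; `test` is falsey. The macro receives its arguments as unevaluated data.
(defmacro unless [test & body]
  `(if ~test nil (do ~@body)))

;; macroexpand-1 shows the form produced by one expansion step.
(macroexpand-1 '(unless false (println "ran")))
;; => (if false nil (do (println "ran")))

(unless false :ran) ; body is evaluated, returns :ran
(unless true :ran)  ; body is skipped, returns nil
```

The macro body is ordinary Clojure code operating on lists and symbols, which is what makes it practical to build a custom query syntax on top of the language.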
The figure below shows the overview of the evaluation process of Clojure.
Domain-specific Language
A Domain-Specific Language (DSL) is a computer language, with its own declared syntax or grammar, that is specialised for a particular application domain. In contrast to a General-Purpose Language (GPL), a DSL is designed with the specific goals of that domain in mind. Macros in Lisp dialects let developers rewrite source code at compile time, which makes implementing a DSL more convenient. As a Lisp dialect, Clojure inherits this advantage. In addition to macros, Clojure's heavy use of core data literals gives further room for implementing DSLs.
Overview
Datajure takes query code written in the custom syntax as input and parses it into a Clojure map containing the arguments of the different operations. The library functions are then called according to our logical processing order, returning the data table.
Logical Processing Order
Simply put, Datajure implements the functionality of the SELECT statement in Structured Query Language (SQL), a declarative query language designed for managing data in a Relational Database Management System (RDBMS), but with a slightly different set of operations, due to the nature of our target usage scenario. Therefore, Datajure adopts the logical processing order of the SELECT statement.
The following table compares the supported operations and their logical processing order in Datajure and in the SQL SELECT statement. A dash marks an operation that has no counterpart on that side.
Datajure Order | SQL Order | Operations | Description |
---|---|---|---|
- | 1 | FROM | Specifies a table, view, table variable, or derived table source, with or without an alias, to use in the Transact-SQL statement |
- | 2 | ON | Specifies arbitrary conditions or specifies columns to join |
- | 3 | JOIN | Retrieves data from two or more tables based on logical relationships between the tables |
1 | 4 | WHERE | Specifies the search condition for the rows returned by the query |
2 | - | ROW | Specifies the row index for the rows returned by the query |
3 | 5 | GROUP BY | Divides the query result into groups of rows |
- | 6 | WITH CUBE / WITH ROLLUP | Extension functions for GROUP BY |
4 | 7 | HAVING | Specifies a search condition for a group or an aggregate |
5 | 8 | SELECT | Specifies the columns to be returned by the query |
- | 9 | DISTINCT | Specifies to return only distinct values |
6 | 10 | ORDER BY | Sorts data returned by a query |
- | 11 | TOP | Specifies the number of records to return |
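To make the ordering concrete, the Datajure steps in the table can be imitated with plain Clojure sequence functions. This is only a sketch over a vector of maps with made-up data. It does not use Datajure or any backend, and the ROW step is left out.

```clojure
;; Made-up sample data for illustration.
(def rows
  [{:name "Alice" :age 30 :salary 900}
   {:name "Bob"   :age 25 :salary 1200}
   {:name "Carol" :age 30 :salary 1500}
   {:name "Dave"  :age 25 :salary 400}
   {:name "Eve"   :age 41 :salary 2000}])

(def result
  (->> rows
       (filter #(> (:salary %) 500))               ; 1. WHERE
       (group-by :age)                             ; 3. GROUP BY
       (map (fn [[age grp]]                        ;    aggregate each group
              {:age age
               :sum-salary (reduce + (map :salary grp))}))
       (filter #(> (:sum-salary %) 1500))          ; 4. HAVING
       (map #(select-keys % [:age :sum-salary]))   ; 5. SELECT
       (sort-by :sum-salary >)))                   ; 6. ORDER BY (descending)

result
;; => ({:age 30, :sum-salary 2400} {:age 41, :sum-salary 2000})
```

The HAVING step runs after grouping, so it can filter on an aggregated value, while the WHERE step only sees the original rows.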
Syntax
A query statement has three sections: the row selection section, the column selection section and the optional section. Each section is represented by a sequence of operations enclosed within [].
(dtj/query data [ROW-SELECTION-SECTION] [COLUMN-SELECTION-SECTION] [OPTIONS])
Row Selection Section
The first section of the argument input is the row selection section. It corresponds to the WHERE, HAVING and ROW operations in the logical processing order table. The user can select rows either with filters or by row index. If both are given, the filter overrides the row index selection.
To select all rows, leave the section empty.
Row Selection by Filter
[col filter-function]
This shows the syntax of row selection using a filter. col refers to the column to be filtered, and filter-function refers to the filtering function. One powerful feature is that the filtering function can be any custom function that returns a boolean result. For example, it can be written with Clojure's built-in anonymous function shorthand: #( ... ).
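A few predicates of the kind that could serve as a filter-function are sketched below using only `clojure.core`. The names `over-20?`, `over-20-fn?` and `known-name?` are made up for illustration, and the final `dtj/query` call is left inside a `comment` form because it assumes the library is loaded under the `dtj` alias and that a dataset `data` with an `:age` column exists.

```clojure
;; Anonymous function shorthand: % stands for the value being tested.
(def over-20? #(> % 20))

;; The same predicate written with an explicit fn form.
(def over-20-fn? (fn [age] (> age 20)))

;; Note that #{...} is a set literal, not the function shorthand.
;; A set can still back a membership test that returns a strict boolean.
(def known-name? #(contains? #{"Alice" "Bob"} %))

(filter over-20? [18 21 35])            ; => (21 35)
(filter known-name? ["Alice" "Zed"])    ; => ("Alice")

;; Following the query syntax described above (not evaluated here):
(comment
  (dtj/query data [:age over-20?] [] []))
```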
Row Selection by Index
row-index
This shows the syntax of row selection using a row index. row-index refers to the index of the desired row.
Row Selection with Both Filter and Row Index
[col filter-function] row-index
This shows the case where both a filter and a row index are given. The filter overrides the row index, and the pipeline ignores the row-index part.
Column Selection Section
The second section of the argument input is the selection of columns.
col
This is the syntax of column selection, where col refers to the selected column.
To select all columns, put an empty list [] instead.
Optional Section
The third section of the argument input is the optional section. It specifies all the optional operations, including the GROUP BY and SORT BY operations (SORT BY corresponds to ORDER BY in the logical processing order table).
Optional Operation
operation-keyword operation-arguments
This shows the syntax of an optional operation. operation-keyword is the keyword the program uses to identify the operation, either :group-by or :sort-by. operation-arguments refers to the corresponding arguments, which depend on the operation.
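One way such a keyword-then-arguments sequence could be turned into a map is sketched below. `operation-keywords` and `parse-options` are hypothetical names used only for this illustration, and the code is not Datajure's actual parser. It assumes the section starts with an operation keyword and that no column shares a name with one.

```clojure
;; Keywords that start a new optional operation (taken from the text above).
(def operation-keywords #{:group-by :sort-by})

(defn parse-options
  "Hypothetical helper: splits a flat options vector into a map from
  operation keyword to the vector of arguments that follow it."
  [options]
  (first
   (reduce (fn [[acc current] item]
             (if (operation-keywords item)
               [(assoc acc item []) item]               ; start a new operation
               [(update acc current conj item) current])) ; collect an argument
           [{} nil]
           options)))

(parse-options [:group-by :age :sort-by :salary >])
;; => a map with :group-by -> [:age] and :sort-by -> [:salary >]
```

Collecting arguments until the next keyword lets an operation such as :sort-by take an optional trailing argument without a separate rule per operation.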
Group by
:group-by col
This shows the syntax of a group by operation. col refers to the column(s) to group by.
Sort by
:sort-by col sort-by-function
This shows the syntax of a sort by operation. col refers to the column to sort by. sort-by-function refers to the sorting function, which defaults to < (ascending order). As with the filtering function, the sorting function can be any custom function returning a boolean result. It can also be a Clojure operator such as < or >, clojure.core/compare, or a custom java.util.Comparator.
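The kinds of sorting function listed above can be tried out with `clojure.core/sort` and `sort-by`, independently of Datajure. The data and the `by-length` comparator below are made up for illustration.

```clojure
(def salaries [1200 900 1500])

(sort < salaries)        ; boolean-returning operator => (900 1200 1500)
(sort > salaries)        ; descending                 => (1500 1200 900)
(sort compare salaries)  ; three-way comparator       => (900 1200 1500)

;; A custom java.util.Comparator that orders strings by length.
;; Inside the method body, `compare` refers to clojure.core/compare.
(def by-length
  (reify java.util.Comparator
    (compare [_ a b]
      (compare (count a) (count b)))))

(sort by-length ["ccc" "a" "bb"])  ; => ("a" "bb" "ccc")

;; Sorting rows (maps) by a column with an explicit sorting function.
(sort-by :salary > [{:salary 900} {:salary 1500}])
;; => ({:salary 1500} {:salary 900})
```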
Aggregate Function
Since the group-by operation is implemented, the syntax also needs to support aggregate functions.
aggregate-keyword col
This shows the syntax of an aggregated column. aggregate-keyword specifies the aggregate function, and col refers to the column to be aggregated. An aggregated column can be used in place of any column argument. The table below lists all available aggregate functions and their corresponding keywords.
Aggregate Function | Keyword |
---|---|
Minimum | :min |
Maximum | :max |
Mode | :mode |
Summation | :sum |
Standard Deviation | :sd |
Skew | :skew |
Number of Valid Rows | :n-valid |
Number of Missing Rows | :n-missing |
Total Number of Rows | :n |
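As a rough picture of what an aggregated column computes, the sketch below maps a few of the keywords to plain Clojure functions and applies them per group. The `aggregates` map and the `aggregate` helper are hypothetical and cover only a subset of the table; in Datajure itself the computation is carried out by the selected backend.

```clojure
;; Made-up sample data for illustration.
(def rows
  [{:name "Alice" :age 30 :salary 900}
   {:name "Bob"   :age 25 :salary 1200}
   {:name "Carol" :age 30 :salary 1500}])

;; Hypothetical keyword -> function table for a few aggregates.
(def aggregates
  {:min #(apply min %)
   :max #(apply max %)
   :sum #(reduce + %)
   :n   count})

(defn aggregate
  "Applies the aggregate named by agg-keyword to column col of rows."
  [agg-keyword col rows]
  ((aggregates agg-keyword) (map col rows)))

(aggregate :sum :salary rows)  ; => 3600
(aggregate :max :salary rows)  ; => 1500

;; Aggregating within each group, as :group-by :age would require.
(into {}
      (for [[age grp] (group-by :age rows)]
        [age (aggregate :sum :salary grp)]))
;; => {30 2400, 25 1200}
```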
Backends
Currently, Datajure supports the following data processing libraries as backends: tech.ml.dataset, Tablecloth, Clojask and Geni.
Although Datajure uses Tablecloth by default, the users can still specify their preferred backend. The statement to specify the backend has the following syntax:
(dtj/set-backend BACKEND)
For example, we can write (dtj/set-backend "tech.ml.dataset") to specify tech.ml.dataset as the backend.
For technical details, please refer to our posts.
We strive for consistent behavior across backends. However, because their APIs differ, some operations are not fully supported in some backends.
tech.ml.dataset
All operations above are supported.
Tablecloth
All operations above are supported.
Clojask
All operations above are supported. However, due to the limitations of Clojask itself, the user must manually load the dataset from a .csv file and store the final result in a file.
Example:
(ck/dataframe "example.csv")
In addition, users should be aware that Clojask offers mutable data structures, whereas the other backends provide immutable data structures. We could have made the Clojask backend behave more like the other backends, for instance by copying the dataset before each modification, but this approach proved inefficient and we chose not to pursue it.
Users should therefore take this inconsistency into account. We advise users to rely primarily on Clojask's native API and complement it with Datajure's operations.
For more information, please refer to the API Docs of Clojask.
Geni
All operations above are supported. However, due to the limitations of Geni itself, customized comparators are not supported in the :sort-by operation, and the filter-function field must be an expression written with Geni operators, e.g., g/<, instead of Clojure operators such as <.
Example:
(g/=== :name (g/lit "Alice"))
(g/&& (g/> :age 20) (g/< :salary 1000))
For more information, please refer to the Docs of Geni.