Welcome to the Clojask Library
Clojask is an open-source library for parallel computing on larger-than-memory datasets, developed at HKU Business School.
Website Navigation
- Part 1: About
- Part 2: Get Started
- Part 3: API Documentation
- Part 4: Examples
- Part 5: Extensions
- Part 6: Archives
Features
- Unlimited size
It supports datasets larger than memory!
- Various operations
Although Clojask, like NoSQL systems, is designed for larger-than-memory datasets, it does not sacrifice the common operations of relational dataframes, such as group by, aggregate, and join.
- Lazy operations
Most operations are not executed immediately. Instead, the dataframe pipelines them together and runs them at computation time.
- Fast
Faster than Dask in most operations, and the advantage grows with the size of the dataframe. Please find the benchmarks here.
- All native types
All the datatypes used to store data are native Clojure (or Java) types!
- From file to file
IO is integrated into the dataframe. No need to write your own input and output functions!
- Parallel
Most operations can be executed across multiple threads or even multiple machines. See Onyx for the underlying principle.
Demo Video
This demo video gives a basic introduction to Clojask and some of its applications, including inner join and group-by aggregation.
Report Bugs
Clojask is currently under active development.
If you find any bugs or errors, we would appreciate it if you could report them so that we can fix them.