Talk at the Clojure data-recur meeting

This talk gives a general overview of the project and its status as of Oct 2022. (From 9:09 to 58:02)


Benchmarks

Number of workers = 4

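| Operation | Dask (N=1.8M) | Dask (N=3.6M) | Dask (N=80M)* | Clojask (N=1.8M) | Clojask (N=3.6M) | Clojask (N=80M) |
| --- | --- | --- | --- | --- | --- | --- |
| Element-wise operation | 119.3 | 261.3 | N/A | 72.3 | 133.3 | 1836.6 |
| Row-wise selection | 115.0 | 232.0 | N/A | 67.9 | 145.6 | 1757.5 |
| Aggregation | 116.0 | 226.7 | N/A | 58.6 | 112.1 | 1236.9 |
| Groupby-aggregate | 116.7 | 229.3 | N/A | 459.4 | 803.1 | 25860.0 |
| Left join | 114.7 | 248.7 | N/A | 1174.4 | 2310.2 | 14007.9 |
| Inner join | 116.7 | 242.0 | N/A | 1138.8 | 2768.5 | 21609.3 |
| Rolling join | - | - | - | 2812.1 | 3943.1 | > 28800 |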
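
To compare the two systems row by row, the Dask and Clojask figures can be divided directly. The following is a minimal sketch, assuming the figures are running times where lower is better (the table does not state the unit). The names `timings` and `dask_over_clojask` are illustrative. Only the two dataset sizes where both systems report a number are included.

```python
# Figures copied from the benchmark table (4 workers).
# Layout: operation -> {dataset size: (dask_figure, clojask_figure)}
timings = {
    "Element-wise operation": {"1.8M": (119.3, 72.3), "3.6M": (261.3, 133.3)},
    "Row-wise selection": {"1.8M": (115.0, 67.9), "3.6M": (232.0, 145.6)},
    "Aggregation": {"1.8M": (116.0, 58.6), "3.6M": (226.7, 112.1)},
    "Groupby-aggregate": {"1.8M": (116.7, 459.4), "3.6M": (229.3, 803.1)},
    "Left join": {"1.8M": (114.7, 1174.4), "3.6M": (248.7, 2310.2)},
    "Inner join": {"1.8M": (116.7, 1138.8), "3.6M": (242.0, 2768.5)},
}


def dask_over_clojask(table):
    """Return operation -> {size: dask / clojask}, rounded to two decimals.

    Assuming lower figures are better, a ratio above 1 means Clojask
    reported the smaller (better) figure for that operation and size.
    """
    return {
        op: {size: round(dask / clojask, 2) for size, (dask, clojask) in sizes.items()}
        for op, sizes in table.items()
    }


ratios = dask_over_clojask(timings)
for op, by_size in ratios.items():
    cells = "  ".join(f"N={size}: {ratio:.2f}" for size, ratio in by_size.items())
    print(f"{op:<24} {cells}")
```

With these figures, the ratios for element-wise operation, row-wise selection, and aggregation come out above 1, while those for groupby-aggregate and both joins come out below 1.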

Remarks:


System info

'platform': 'Darwin',
'platform-release': '20.4.0',
'platform-version': 'Darwin Kernel Version 20.4.0: Thu Apr 22 21:46:47 PDT 2021; root:xnu-7195.101.2~1/RELEASE_X86_64',
'architecture': 'x86_64',
'processor': 'i386',
'ram': '8 GB'
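
For anyone reproducing the benchmarks on another machine, a comparable listing can be gathered with Python's standard library. This is a sketch only: the source does not say how the listing above was produced, the function name `collect_system_info` is hypothetical, and the RAM lookup relies on POSIX `sysconf` values that are not available on every platform.

```python
import os
import platform


def collect_system_info():
    """Gather the same fields as the listing above, using only the standard library."""
    info = {
        "platform": platform.system(),
        "platform-release": platform.release(),
        "platform-version": platform.version(),
        "architecture": platform.machine(),
        "processor": platform.processor(),
    }
    try:
        # Physical memory = page size * number of physical pages (POSIX only).
        total_bytes = os.sysconf("SC_PAGE_SIZE") * os.sysconf("SC_PHYS_PAGES")
        info["ram"] = f"{round(total_bytes / 1024 ** 3)} GB"
    except (AttributeError, ValueError, OSError):
        # os.sysconf or these names are missing on some platforms (e.g. Windows).
        info["ram"] = "unknown"
    return info


info = collect_system_info()
for key, value in info.items():
    print(f"{key!r}: {value!r}")
```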

Source code

The benchmarking code for Dask and Clojask can be found here, respectively:


Comparison with other larger-than-memory systems

Hadoop MapReduce

| Functions | Clojask | Hadoop MapReduce |
| --- | --- | --- |
| Larger-than-memory source file | | |
| Write intermediate results to tmp files | | |
| MapReduce paradigm | | |
| Join, filter, aggregate, etc. on large files | | |
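
The rows above name a general out-of-core pattern: stream a source that does not fit in memory, spill intermediate results to temporary files, then reduce each spill file independently. The sketch below illustrates that pattern for a groupby-sum in Python. It is not Clojask's or Hadoop's implementation. The function name `groupby_sum_out_of_core` and its parameters are hypothetical, and the final result dictionary is assumed to fit in memory.

```python
import csv
import os
import tempfile
from collections import defaultdict


def groupby_sum_out_of_core(rows, key_index, value_index, num_partitions=4):
    """Sum one column per key while holding only one partition's keys at a time.

    Map step: stream the rows and spill each (key, value) pair to a temp
    file chosen by hash(key), so all rows of a key land in the same file.
    Reduce step: aggregate the partition files one by one.
    """
    totals = {}
    with tempfile.TemporaryDirectory() as tmp_dir:
        paths = [os.path.join(tmp_dir, f"part-{i}.csv") for i in range(num_partitions)]
        files = [open(path, "w", newline="") for path in paths]
        try:
            writers = [csv.writer(f) for f in files]
            for row in rows:
                key = row[key_index]
                writers[hash(key) % num_partitions].writerow([key, row[value_index]])
        finally:
            for f in files:
                f.close()

        for path in paths:
            partial = defaultdict(float)
            with open(path, newline="") as f:
                for key, value in csv.reader(f):
                    partial[key] += float(value)
            # Keys never span partitions, so partial results can be merged as-is.
            totals.update(partial)
    return totals


# Tiny in-memory input for illustration; a real source would be streamed from disk.
rows = [("a", 1), ("b", 2), ("a", 3), ("c", 4), ("b", 5)]
result = groupby_sum_out_of_core(rows, key_index=0, value_index=1)
print(sorted(result.items()))
```

Because the keys pass through CSV files, they come back as strings and the sums as floats.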

Spark

| Functions | Clojask | Spark |
| --- | --- | --- |
| Construct operations' DAG | | |
| Join, filter, aggregate, etc. | | |
| Cache intermediate results between stages in memory | | |
| Minimum memory usage | | |

Clojask Library Ecosystem

Clojask ecosystem

Clojask Logic Flow Diagram

Clojask logic



Copyright © 2023 Clojask
Powered by Cryogen