Datajure REPL Implementation

2024-01-21 By: YANG Ming-Tian

Uberjar Server and Client
Shell Launcher

Starting from v1.1.0, each version of Datajure will come with a standalone REPL, which can run independently on devices with Java Runtime Environment (JRE) installed. In addition to including Datajure as a library into projects, users can also use Datajure REPL for data processing.

The Datajure REPL consists of two components:

Uberjar Server and Client
Shell Launcher

Uberjar Server and Client

The core functionality of the server and client is provided by REPL-y and nREPL, of which Datajure REPL has only limited customization. The relevant source codes are written in repl.dtj and dsl.dtj.

Datajure REPL does the following:

Prints a welcome note.
Launches an nREPL server, which writes to .nrepl-port for a text editor to connect to.
Starts a REPL(-y).

After that, the function init-eval defined in dsl.clj will be called. It will require all the needed libraries and put their APIs to the corresponding namespaces named after their standard abbreviations. The library name and the corresponding namespace correspond as follows:

Library	Namespace
`tech.v3.dataset`	`ds`
`tablecloth.api`	`tc`
`clojask.dataframe`	`ck`
`zero-one.geni.core`	`g`
`datajure.dsl`	`dtj`

By default, the APIs of each backend are also introduced. This is to facilitate users to use Datajure syntax while also calling the more basic API provided by the backend. However, users can also choose to cancel them manually.

In its command line parameters, the Datajure REPL accepts files with the suffix .dtj that contain Datajure statements. Technically, this is implemented as input stream redirection.

Shell Launcher

We provide users with shell scripts to automatically download and launch the Datajure REPL. It is divided into Linux/macOS version and Windows version. The former is implemented based on Bash, but it is also compatible with most other shells; the latter is implemented based on PowerShell, and it is not compatible with cmd.exe.

The launcher scripts are stored in the scripts folder of the main repository.

When the launcher is run, it will first try to parse parameters, defined by the following syntax:

datajure [<path>] [--force-download] [--proxy <https-proxy>] [--help]

<path> is an optional input file that should be a Datajure source file (.dtj suffix).

Afterwards, it will call the get_latest_release() fucntion to check the latest update.

If no Datajure REPL Uberjar is found on the local device, or the --force-download flag is specified, the launcher will download the latest version from GitHub releases.

In particular, if the --proxy parameter is specified, all network communications above will be conducted through the proxy. This will be convenient for users in special network environments.

Thereafter, Datajure will run the Datajure REPL using java in the system path. For compatibility with newer JRE versions, the following JVM options are specified:

--add-opens=java.base/java.nio=ALL-UNNAMED
--add-opens=java.base/java.net=ALL-UNNAMED
--add-opens=java.base/java.lang=ALL-UNNAMED
--add-opens=java.base/java.util=ALL-UNNAMED
--add-opens=java.base/java.util.concurrent=ALL-UNNAMED
--add-opens=java.base/sun.nio.ch=ALL-UNNAMED

The launcher for Windows will also automatically checks and downloads the Windows Utilities for Hadoop to ensure that the Datajure REPL runs smoothly.

tech.ml.dataset Backend Implementation