
jarasandha

Jarasandha is a small Java library to help build an archive of records




Introduction

What is it?

Jarasandha is a small Java (version 8+) library to help build an archive of records. It has very few moving parts, embraces immutability, and provides efficient compression, buffer management and zero-copy transfer. It delegates advanced functions to external services through interfaces.

It is composed of these parts:

File

  1. A file format that has blocks, records and an index
  2. Blocks contain records and can optionally be compressed
  3. The file is immutable, meaning once the file with all its records is written it cannot be modified
  4. The file is a “write once and read many times” format
  5. Checksums and compression are applied to the internal index and the blocks
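
To make the idea of a compressed, checksummed block concrete, here is a minimal sketch that uses only the JDK. It is an illustration, not Jarasandha's on-disk format: the class name `BlockSketch`, the `|` record separator, and the choice of DEFLATE and CRC32 are all assumptions made for this example.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;
import java.util.zip.CRC32;
import java.util.zip.DataFormatException;
import java.util.zip.Deflater;
import java.util.zip.Inflater;

// Hypothetical illustration only; this is NOT Jarasandha's file format or API.
final class BlockSketch {

    /** Compresses the raw bytes of one block with DEFLATE. */
    static byte[] compress(byte[] raw) {
        Deflater deflater = new Deflater();
        try {
            deflater.setInput(raw);
            deflater.finish();
            ByteArrayOutputStream out = new ByteArrayOutputStream();
            byte[] chunk = new byte[1024];
            while (!deflater.finished()) {
                int n = deflater.deflate(chunk);
                out.write(chunk, 0, n);
            }
            return out.toByteArray();
        } finally {
            deflater.end(); // release native memory held by the deflater
        }
    }

    /** Restores the raw bytes of a block; rejects truncated or corrupt input. */
    static byte[] decompress(byte[] compressed) {
        Inflater inflater = new Inflater();
        try {
            inflater.setInput(compressed);
            ByteArrayOutputStream out = new ByteArrayOutputStream();
            byte[] chunk = new byte[1024];
            while (!inflater.finished()) {
                int n = inflater.inflate(chunk);
                if (n == 0 && (inflater.needsInput() || inflater.needsDictionary())) {
                    throw new IllegalArgumentException("Truncated or invalid block");
                }
                out.write(chunk, 0, n);
            }
            return out.toByteArray();
        } catch (DataFormatException e) {
            throw new IllegalArgumentException("Corrupt block", e);
        } finally {
            inflater.end();
        }
    }

    /** CRC32 over the stored bytes, so corruption can be detected before decompressing. */
    static long checksum(byte[] bytes) {
        CRC32 crc = new CRC32();
        crc.update(bytes, 0, bytes.length);
        return crc.getValue();
    }

    public static void main(String[] args) {
        // Three records packed into one block (the separator is an assumption).
        byte[] block = "record-1|record-2|record-3".getBytes(StandardCharsets.UTF_8);

        // Write side: compress the block and remember its checksum.
        byte[] stored = compress(block);
        long storedChecksum = checksum(stored);

        // Read side: verify the checksum first, then decompress.
        if (checksum(stored) != storedChecksum) {
            throw new IllegalStateException("Block failed checksum verification");
        }
        System.out.println(new String(decompress(stored), StandardCharsets.UTF_8));
    }
}
```

Compressing a whole block rather than each record lets the compressor exploit redundancy across records, and checking the checksum before decompressing avoids wasting work on a damaged block.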

Writer

  1. Records are written one at a time to the file using a “writer”. The writer returns a logical position within the file that has to be stored in an external system
  2. Internally, the records are flushed to the file one block at a time
  3. The “writer” and related classes provide ways to manage collections of files and hooks to archive to external stores
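
The relationship between records, blocks and logical positions can be sketched as follows. This is a toy in-memory model, not the library's writer: `PositionSketch`, `InMemoryWriter`, `Position`, and the idea that a position is a (block number, record index) pair are all assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical names throughout; this is NOT Jarasandha's writer API.
final class PositionSketch {

    /** Assumed shape of a logical position: a block, and a record inside that block. */
    static final class Position {
        final int blockNumber;
        final int recordIndex;

        Position(int blockNumber, int recordIndex) {
            this.blockNumber = blockNumber;
            this.recordIndex = recordIndex;
        }

        @Override
        public String toString() {
            return "block=" + blockNumber + ", record=" + recordIndex;
        }
    }

    /** Accepts one record at a time, but "flushes" a whole block at a time. */
    static final class InMemoryWriter {
        private final int recordsPerBlock;
        private final List<List<byte[]>> flushedBlocks = new ArrayList<>();
        private List<byte[]> currentBlock = new ArrayList<>();

        InMemoryWriter(int recordsPerBlock) {
            if (recordsPerBlock < 1) {
                throw new IllegalArgumentException("recordsPerBlock must be at least 1");
            }
            this.recordsPerBlock = recordsPerBlock;
        }

        /** Buffers the record and returns the position the caller must store elsewhere. */
        Position write(byte[] record) {
            Position position = new Position(flushedBlocks.size(), currentBlock.size());
            currentBlock.add(record.clone());
            if (currentBlock.size() == recordsPerBlock) {
                flush();
            }
            return position;
        }

        /** Moves the buffered records out as one block; a real writer would hit the file here. */
        void flush() {
            if (!currentBlock.isEmpty()) {
                flushedBlocks.add(currentBlock);
                currentBlock = new ArrayList<>();
            }
        }

        int flushedBlockCount() {
            return flushedBlocks.size();
        }
    }

    public static void main(String[] args) {
        InMemoryWriter writer = new InMemoryWriter(2);
        for (String record : new String[] {"a", "b", "c"}) {
            System.out.println(record + " -> " + writer.write(record.getBytes()));
        }
        writer.flush(); // flush the final, partially filled block
        System.out.println("blocks flushed: " + writer.flushedBlockCount());
    }
}
```

The caller keeps each returned position in its own store (a database or an index, for example), because the archive file itself offers no lookup by key.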

Reader

  1. A record can be retrieved using a “reader” by providing its logical position
  2. It also supports iterating over the records or blocks of records in the file
  3. The “reader” and related classes provide efficient, selective loading and caching of blocks and files for repeated reads
  4. It also has hooks to read from external stores
  5. It is meant to be embedded inside an application that serves records from a remote archive and a local file system
  6. Both the reader and writer components make heavy use of Netty’s ByteBuf to keep heap usage, and memory usage in general, low and within a controllable budget
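
As a rough picture of selective loading and caching of blocks, the sketch below keeps a bounded number of blocks in memory and loads a block only when a read misses the cache. It is not the library's reader: the class `BlockCacheSketch`, the loader function, and the use of plain `byte[]` in place of Netty's ByteBuf are assumptions made to keep the example JDK-only.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.function.IntFunction;

// Hypothetical illustration; this is NOT Jarasandha's reader or its cache.
final class BlockCacheSketch {
    private final IntFunction<byte[]> loader; // stands in for reading a block from a file
    private final Map<Integer, byte[]> cache;
    private int loads;

    BlockCacheSketch(final int maxBlocks, IntFunction<byte[]> loader) {
        this.loader = loader;
        // An access-ordered LinkedHashMap evicts the least recently used block
        // once the cache holds more than maxBlocks entries.
        this.cache = new LinkedHashMap<Integer, byte[]>(16, 0.75f, true) {
            @Override
            protected boolean removeEldestEntry(Map.Entry<Integer, byte[]> eldest) {
                return size() > maxBlocks;
            }
        };
    }

    /** Returns the block, loading it only if it is not already cached. */
    byte[] block(int blockNumber) {
        byte[] cached = cache.get(blockNumber);
        if (cached == null) {
            cached = loader.apply(blockNumber);
            loads++;
            cache.put(blockNumber, cached);
        }
        return cached;
    }

    /** Number of times the loader was invoked, i.e. the number of cache misses. */
    int loads() {
        return loads;
    }

    public static void main(String[] args) {
        BlockCacheSketch cache = new BlockCacheSketch(2, n -> new byte[] {(byte) n});
        cache.block(0);
        cache.block(1);
        cache.block(0); // served from the cache, no load
        System.out.println("loads so far: " + cache.loads());
    }
}
```

Bounding the cache by block count is the simplest possible budget; a budget expressed in bytes would be closer to the memory-controlled behaviour described above.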

What it is not

Jarasandha does not aim to compete with systems or libraries like Apache ORC, Apache Parquet, PalDB, embedded key-value stores, Ambry or Apache HBase.

  1. It does not provide key-value access; instead, it provides simple position-based access to records
  2. It is not a database of any sort
  3. It has no opinion on what you store as a record, but it can compress a block of multiple records before storing it in the file
  4. It does not support querying or searching by keys or values, only retrieval by logical position

What’s with the name?

The name (Jarasandha) is a reference to an Indian mythological character named Jarasandha who was put back together from two halves. I found the name vaguely related to this Java library which puts your records back together from blocks of compressed records in a file. Well, I did say - “vaguely related”.

License

The Jarasandha library is licensed under the Apache License.

Possible use cases

Hot-cold store

Store records in Jarasandha and move the files out to object stores like Amazon S3 or Minio when they are not in use.

Jarasandha can be the underlying layer that efficiently stores and retrieves records and blocks based on logical key positions. A second index layer using Lucene or RocksDB could provide a more advanced mapping from keys, labels or queries to Jarasandha’s logical key positions.

Assuming that the keys and metadata to service queries are much smaller than the actual records, they can be stored onsite, on fast and expensive hardware. The actual record can then be retrieved from the Jarasandha files and blocks that are cached locally or downloaded on demand from remote object stores.

See Hot-cold store for details.
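
The two-layer arrangement can be sketched as a small key index that maps each key to the place where its record lives. Everything here is hypothetical: a `HashMap` stands in for Lucene or RocksDB, and the `Location` type with its `fileId` and numeric `position` fields is an assumed shape, not the library's representation.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Optional;

// Hypothetical second-layer index; NOT part of Jarasandha.
final class KeyIndexSketch {

    /** Assumed shape of what the index stores per key: which file, and where in it. */
    static final class Location {
        final String fileId;
        final long position;

        Location(String fileId, long position) {
            this.fileId = fileId;
            this.position = position;
        }
    }

    // Small, hot data: keys and locations only. The records stay in the archive files.
    private final Map<String, Location> index = new HashMap<>();

    /** Records where a key's record was written. */
    void put(String key, String fileId, long position) {
        index.put(key, new Location(fileId, position));
    }

    /** Resolves a key to a location; the caller then asks the archive layer for the record. */
    Optional<Location> find(String key) {
        return Optional.ofNullable(index.get(key));
    }

    public static void main(String[] args) {
        KeyIndexSketch keys = new KeyIndexSketch();
        keys.put("order-42", "archive-0001", 7L);

        keys.find("order-42").ifPresent(location ->
                System.out.println("fetch from " + location.fileId + " at " + location.position));
    }
}
```

Only the index needs fast, expensive storage; the file named by a location can be cached locally or fetched from the object store on demand.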

Basics

ReadersAndWritersDemoTest demonstrates how to write to files and read them back using the APIs.


⚠️ The following content is work in progress ⚠️


Advanced

A command line tool to import and inspect encoded files is also available.

Efficiency

Compression, blocks, memory efficiency of ByteBuf, native heap size.

Zero copy with compressed index and uncompressed blocks. Zip the entire file while archiving.

Writing - NoOpFileWriteProgressListener to push files to S3

Reading - DefaultFileEventListener to build archiving and retrieval

Architecture

File format

Index and block format

Logical record position, need for a secondary store

Compression and caching

Writer and reader efficiency - ByteBuf