Configuration
TANK will operate in either standalone mode or cluster/distributed-data mode. Currently, only standalone mode is implemented; support for clusters will be provided later.
In standalone mode, you need to specify the base topics directory via the -p command line option:
$> ./tank -p /data/TankTopics/
In that base directory, you can create a directory for every topic - the directory should be named after the topic.
$> ls /data/TankTopics/
ads_clicks alerts notifications orders
Within each topic directory, you can create directories for however many partitions you want that topic to have. You need to make sure that they are named consecutively, starting from 0 and incrementing by 1; otherwise TANK will report a startup error.
$> ls /data/TankTopics/orders/
0 1 2 3 4
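For example, assuming a bash-compatible shell with brace expansion, and that TANK scans the base directory at startup as described above, you could lay out a new topic with four partitions by hand like so (the topic name here is hypothetical):
$> mkdir -p /data/TankTopics/payments/{0..3}
TANK will pick the new topic up the next time it starts.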
Please note that you can also create a new topic using tank-cli, like so:
$> ./tank-cli -t messages -b :11011 create_topic 64
This will create a new topic named 'messages' with 64 partitions on the broker at endpoint localhost:11011.
You can optionally create a file named config in any partition directory, and in there specify key=value configuration options that override the topic options, which in turn can be specified in a file named config in the topic directory.
The config file contains key=value declarations, one per line. Most of them match Kafka's options in both name and semantics, with a few exceptions. This was chosen for simplicity, and because many people are already familiar with or using Kafka anyway, so there is no need to come up with a different naming scheme.
The configuration displayed here contains all available keys, along with their default values (used when a key is unspecified).
$> cat /data/TankTopics/orders/0/config
retention.segments.count=0
log.retention.secs=0
log.retention.bytes=0
log.segment.bytes=1gb
log.index.interval.bytes=4k
log.index.size.max.bytes=10mb
log.roll.jitter.secs=0
log.roll.secs=1week
flush.messages=0
flush.secs=0
log.cleanup.policy=delete
log.cleaner.min.cleanable.ratio=0.5
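To illustrate the override mechanism described earlier, a topic-level config can set options for all partitions of that topic, and a partition-level config can override them for a single partition. For example, with hypothetical values:
$> cat /data/TankTopics/orders/config
log.retention.secs=1week
$> cat /data/TankTopics/orders/3/config
log.retention.secs=1week+2hours
Every key not specified in either file falls back to the defaults listed above.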
Available configuration options include:
- retention.segments.count: Maximum number of immutable segments per partition. If this is not 0 and the number of such segments exceeds it, as many of the oldest segments as needed will be removed to comply with this limit. Please note that this count does not include the current read/write segment.
- log.retention.secs: The approximate maximum age, in seconds, of the oldest segment to retain. Set it to 0 to disregard this option.
- log.retention.bytes: Maximum aggregate file size of all immutable segments per partition. If this is not 0 and that sum exceeds it, as many of the oldest segments as needed will be removed to comply with this limit. Please note that this does not factor in the size of the current active segment.
- log.segment.bytes: If this value is not 0 and the current mutable segment file reaches this size, it will become immutable and be added to the list of tracked immutable segments (a roll operation), and a new append-only segment will be created.
- log.index.interval.bytes: Whenever more than this many bytes have been added to the current append-only segment, a new index entry is added that maps an absolute sequence number to an absolute physical file offset. The lower this value, the more accurately sequence numbers resolve to file offsets, and the less excess data for partial bundles needs to be transferred to clients and other replicas; however, the index file will grow larger, and the broker may take longer to compute the file range to stream (typically in the order of a few dozen to 100 microseconds).
- log.index.size.max.bytes: If this value is not 0 and the current segment's index size exceeds this many bytes, TANK will roll to a new segment.
- log.roll.secs: The maximum time before a new log segment is rolled (applies if the value is not 0), even if it has not reached the size limit.
- log.roll.jitter.secs: If not 0, a random jitter is drawn from [0, value) and subtracted from the log.roll.secs value, in order to avoid thundering-herd problems when rolling segments.
- flush.messages: The number of messages accumulated on a log before they are flushed to disk by means of fdatasync(). If the value is set to 0, this is ignored. Flushing is useful for resilience, and always takes place on a dedicated thread.
- flush.secs: The amount of time the log can hold dirty data before a flush is forced, if the value is not 0. See the previous option for more information on flush semantics.
- log.cleanup.policy: This can be set to either "delete" or "cleanup". The default delete setting considers the retention options described above in order to determine when and which tail (earliest) segments to delete from a partition. If cleanup is specified instead, a compaction for that partition may be scheduled; see compactions for how this works and why it is useful (an example follows this list).
- log.cleaner.min.cleanable.ratio: If the ratio of the log (the aggregate file size of all segments that have not been compacted before) to the aggregate file size of all segments is greater than or equal to this number, a compaction will be scheduled for the partition. The lower this ratio, the less duplication in the partition log, but compactions will be scheduled more frequently and will thus require more I/O.
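As a hypothetical example, a topic used as a keyed changelog could opt into compaction rather than time or size based deletion, with a config such as:
$> cat /data/TankTopics/user_profiles/config
log.cleanup.policy=cleanup
log.cleaner.min.cleanable.ratio=0.3
log.segment.bytes=256mb
Here the lower ratio trades more frequent compaction I/O for less duplication retained in the partition log.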
For sizes and durations, you can use the value(unit)[+value(unit)..] notation. For example, the following are valid durations:
- 1day
- 1week+2hours+1minute
- 1y
- 1hour+5minutes
- 80
and the following are valid sizes:
- 1tb
- 1gb+25bytes
- 1000
- 1mb
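Putting the notation to use, a topic config could combine units like so (hypothetical values):
$> cat /data/TankTopics/alerts/config
log.retention.secs=1week+2hours
log.roll.secs=1day
log.segment.bytes=1gb+512mb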