DataMap

A high-performance data processing pipeline for large-scale text datasets. (Note: README generated by Claude; seems ...okay. -mj)

Overview

DataMap is a Rust-based toolkit designed for efficient processing, filtering, and resharding of large text datasets, primarily in JSONL format. It provides a flexible pipeline architecture for text data transformations with various filters and modifiers.
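
Each input line is a standalone JSON document. In the sample record below, the url field is illustrative; "text" matches the text_field used in the example config later in this README:

{"text": "The quick brown fox jumped over the lazy dog.", "url": "https://example.com/fox"}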

Key features:

  • Multi-threaded processing with Rayon
  • Configurable processing pipeline via JSON/YAML configuration
  • Comprehensive set of text filters and modifiers
  • Data resharding capabilities
  • Utilities for S3/GCP/WEKA integration

Components

Rust Core (src/main.rs, src/map_fxn.rs)

The core functionality is implemented in Rust for high performance:

  1. Main Module (src/main.rs):

    • Command-line interface with subcommands
    • Pipeline execution logic
    • I/O and file operations
  2. Data Processors (src/map_fxn.rs):

    • Pipeline processor architecture (see the sketch after this list)
    • Text filters (length, language, URL, etc.)
    • Content modifiers (newline removal, ID generation, etc.)
    • Analytics processors (FastText annotation, etc.)
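
The exact trait in src/map_fxn.rs may differ, but as a minimal sketch of the processor architecture, assume each stage maps one JSON document to Some(doc) (keep, possibly modified) or None (drop):

use serde_json::{json, Value};

// Hypothetical interface: every pipeline stage implements one method.
trait DataProcessor {
    fn process(&self, doc: Value) -> Option<Value>;
}

// A filter in the spirit of text_len_filter: keep documents whose text
// length falls inside [lower_bound, upper_bound].
struct TextLenFilter {
    text_field: String,
    lower_bound: usize,
    upper_bound: usize,
}

impl DataProcessor for TextLenFilter {
    fn process(&self, doc: Value) -> Option<Value> {
        let len = doc.get(&self.text_field)?.as_str()?.len();
        (self.lower_bound..=self.upper_bound).contains(&len).then_some(doc)
    }
}

fn main() {
    let pipeline: Vec<Box<dyn DataProcessor>> = vec![Box::new(TextLenFilter {
        text_field: "text".into(),
        lower_bound: 5,
        upper_bound: 100,
    })];
    let doc = json!({"text": "hello world"});
    // Thread the document through every stage; any None short-circuits.
    let result = pipeline.iter().try_fold(doc, |d, p| p.process(d));
    println!("{:?}", result);
}

Under this view, filters are processors that never modify the document, and modifiers are processors that never return None.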

Python Utilities (utils/s5cmd_wrapper.py)

Python utilities for cloud storage operations:

  • S3/GCP/WEKA integration via s5cmd
  • Parallel file download/upload capabilities
  • Progress tracking

Usage

Data Mapping

Process data through a filtering/modification pipeline:

datamap map --input_dir ./data/input --output_dir ./data/output --config pipeline_config.yaml [--err_dir ./data/errors] [--threads 16]

Data Resharding

Reshard files into chunks capped by a maximum line count and/or byte size; the optional --subsample flag randomly keeps that fraction of lines:

datamap reshard --input_dir ./data/input --output_dir ./data/output --max_lines 10000 --max_size 100000000 [--subsample 0.1] [--threads 16]

Cloud Storage Integration

Upload files to or download files from cloud storage:

python utils/s5cmd_wrapper.py download --src s3://bucket/path --dst ./local/path [--part 0 --num-parts 4]
python utils/s5cmd_wrapper.py upload --src ./local/path --dst s3://bucket/path

Configuration

Pipelines are defined using YAML or JSON configuration files. Example config:

text_field: "text"
pipeline:
  - name: "text_len_filter"
    kwargs:
      lower_bound: 100
      upper_bound: 100000
  - name: "subsample"
    kwargs:
      subsample_rate: 0.8
  - name: "stop_word_filter"
    kwargs:
      min_stop_word: 3
  - name: "word_count_adder"
    kwargs:
      word_count_field: "word_count"
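
Each name appears to select a processor, with its kwargs forwarded to that processor's constructor. The struct and field names below are hypothetical rather than the repo's actual types, but loading such a file with serde_yaml would look roughly like this:

use serde::Deserialize;

// Hypothetical mirror of the config layout shown above.
#[derive(Debug, Deserialize)]
struct PipelineConfig {
    text_field: String,
    pipeline: Vec<ProcessorSpec>,
}

#[derive(Debug, Deserialize)]
struct ProcessorSpec {
    name: String,
    #[serde(default)]
    kwargs: serde_yaml::Value,
}

fn main() -> anyhow::Result<()> {
    let raw = std::fs::read_to_string("pipeline_config.yaml")?;
    let cfg: PipelineConfig = serde_yaml::from_str(&raw)?;
    for step in &cfg.pipeline {
        println!("step: {} kwargs: {:?}", step.name, step.kwargs);
    }
    Ok(())
}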

Available Processors

The toolkit includes many processors for various text transformation and filtering needs:

Filters

  • text_len_filter: Filter by text length
  • page_len_filter: Filter by page length, measured in words, sentences, etc.
  • word_len_filter: Filter by average word length
  • subsample: Randomly subsample documents
  • url_substring_filter: Filter documents by URL substring (domain, subdomain, etc.)
  • float_filter: Filter by float field values
  • symbol_ratio_filter: Filter by symbol density
  • bullet_filter: Filter by bullet point density
  • ellipsis_line_ratio_filter: Filter by ellipsis usage
  • alphabetic_word_ratio_filter: Filter by non-alphabetic word ratio
  • stop_word_filter: Filter out documents with too few stop words (see min_stop_word in the config above)
  • massive_web_repetition_filter: Filter by content repetition patterns
  • word_removal_ratio_filter: Filter by word removal ratio
  • madlad400_sentence_filter: Multi-criteria sentence filter from MADLAD-400

Modifiers

  • add_id: Add UUID to documents
  • newline_removal_modifier: Control consecutive newlines
  • ratio_line_modifier: Filter lines by uppercase or digit ratio
  • regex_line_modifier: Filter lines using regex
  • line_len_modifier: Filter lines by word count
  • substring_line_modifier: Filter or modify lines with banned substrings
  • word_count_adder: Add word count field
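
Modifiers slot into the same pipeline list as filters. In the YAML fragment below, max_consecutive is an illustrative kwarg name, not necessarily the repo's (check src/map_fxn.rs for each processor's actual parameters); word_count_field matches the example config above:

pipeline:
  - name: "newline_removal_modifier"
    kwargs:
      max_consecutive: 2        # illustrative kwarg name
  - name: "word_count_adder"
    kwargs:
      word_count_field: "word_count"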

Annotators

  • fasttext_annotator: Add language classification with FastText

Dependencies

Rust

  • rayon (parallel processing)
  • clap (command-line parsing)
  • serde_json/serde_yaml (JSONL data and config parsing)
  • anyhow (error handling)
  • dashmap (concurrent hashmap)
  • zstd (compression)

Python

  • boto3
  • click
  • tqdm

Installation

  1. Install Rust: https://www.rust-lang.org/tools/install
  2. Clone the repository:
    git clone https://github.com/allenai/datamap-rs.git
  3. Build the project:
    cargo build --release
  4. Install Python dependencies:
    pip install boto3 click tqdm
  5. Install s5cmd if using cloud storage utilities:
    # Instructions vary by platform
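
For example, assuming a recent s5cmd release (see https://github.com/peak/s5cmd for all options):

brew install s5cmd                             # macOS, via Homebrew
go install github.com/peak/s5cmd/v2@master     # with a Go toolchain installed

Prebuilt binaries are also published on the s5cmd releases page.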

License

[Insert your license information here]
