
Training System

Table of Contents

  • Overview
  • Quick Start
  • Directory Structure
  • Configuration
  • Training Pipeline
  • Advanced Features
  • Best Practices
  • Troubleshooting
  • Next Steps

Overview

LlamaHome's training system provides a comprehensive environment for training and fine-tuning large language models. The system focuses on efficient resource utilization, robust monitoring, and flexible configuration options.

Key Features

  • Hybrid Training Pipeline: Combines H2O's attention optimizations with llama-recipes' training features
  • Advanced Monitoring: Real-time metrics tracking and visualization
  • Distributed Training: Efficient multi-GPU and multi-node training support
  • Performance Optimization: Automatic memory management and gradient optimization
  • Flexible Configuration: Comprehensive YAML-based configuration system

Quick Start

Basic Training

# Start training with default configuration
make train DATA_PATH=.data/training/dataset.jsonl

# Resume from checkpoint
make train-resume CHECKPOINT=checkpoints/checkpoint-1000

# Evaluate model
make evaluate MODEL=models/fine-tuned/model DATA=.data/eval/test.jsonl

Distributed Training

# Single-node multi-GPU training
make train-distributed WORLD_SIZE=4 CONFIG=configs/distributed.toml

# Multi-node training
make train-multi-node \
    NUM_NODES=2 \
    NODE_RANK=0 \
    MASTER_ADDR=192.168.1.100 \
    CONFIG=configs/multi_node.toml

Monitoring and Visualization

# Start training with monitoring
make train MONITOR=true VIZ_PORT=8080

# View real-time metrics dashboard
open http://localhost:8080

# Export training visualizations
make export-viz OUTPUT=reports/training

Directory Structure

src/
├── training/              # Training implementation
│   ├── pipeline.py       # Training pipeline
│   ├── monitoring.py     # Metrics and visualization
│   ├── distributed.py    # Distributed training
│   ├── optimization.py   # Training optimizations
│   └── data.py          # Data management
.data/
├── training/             # Training data
│   ├── raw/             # Original datasets
│   └── processed/       # Preprocessed data
├── models/              # Model storage
│   ├── base/           # Base models
│   └── fine-tuned/     # Fine-tuned models
└── metrics/            # Training metrics
    ├── logs/          # Detailed logs
    ├── viz/           # Visualizations
    └── tensorboard/   # TensorBoard data

Configuration

Environment Variables

# Resource limits
MAX_GPU_MEMORY=0.9
MAX_CPU_THREADS=8

# Training settings
CUDA_VISIBLE_DEVICES=0,1
TORCH_DISTRIBUTED_DEBUG=INFO

# Logging
LOG_LEVEL=INFO
TENSORBOARD_DIR=runs
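
The resource limits above are plain environment variables; as a minimal sketch of one standard way they could be applied in PyTorch (the mapping below is an assumption for illustration, not LlamaHome's documented enforcement mechanism):

import os

import torch

# Hedged sketch: honor MAX_CPU_THREADS and MAX_GPU_MEMORY with standard PyTorch calls.
max_gpu_memory = float(os.environ.get("MAX_GPU_MEMORY", "0.9"))
max_cpu_threads = int(os.environ.get("MAX_CPU_THREADS", "8"))

torch.set_num_threads(max_cpu_threads)
if torch.cuda.is_available():
    for device in range(torch.cuda.device_count()):
        # Cap the fraction of this GPU's memory the process may allocate
        torch.cuda.set_per_process_memory_fraction(max_gpu_memory, device)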

Training Configuration

[training]
batch_size = 32
learning_rate = 1e-4
gradient_accumulation_steps = 4
max_grad_norm = 1.0

[cache]
memory_size = "4GB"
disk_size = "100GB"
policy = "lru"

[checkpoint]
save_steps = 1000
keep_last_n = 5
save_best = true
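
The configuration is plain TOML, so the values can also be read directly with the standard library when needed outside the built-in tooling. A minimal sketch, assuming a hypothetical .config/training_config.toml path (this document does not state where the file lives):

import tomllib  # standard library in Python 3.11+
from pathlib import Path

# Hypothetical location; adjust to wherever the TOML file is actually stored.
config_path = Path(".config/training_config.toml")
with config_path.open("rb") as f:
    config = tomllib.load(f)

batch_size = config["training"]["batch_size"]                    # 32
accum_steps = config["training"]["gradient_accumulation_steps"]  # 4
effective_batch = batch_size * accum_steps                       # 128 samples per optimizer step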

Training Pipeline

The training pipeline consists of several integrated components:

  1. Data Management

    • Efficient data loading and preprocessing
    • Dynamic batching and caching (a padding-collate sketch follows this list)
    • Distributed data sampling
  2. Model Optimization

    • Hybrid attention mechanism
    • Gradient accumulation and clipping
    • Memory-efficient training
    • Automatic mixed precision
  3. Monitoring System

    • Real-time metrics collection
    • Performance monitoring
    • Resource tracking
    • Interactive visualizations
  4. Distributed Training

    • Multi-GPU synchronization
    • Gradient aggregation
    • Checkpoint management
    • Process coordination
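
As a rough illustration of the dynamic-batching idea in the data-management component (not LlamaHome's actual implementation; the example["input_ids"] layout is an assumption about the dataset format):

import torch
from torch.nn.utils.rnn import pad_sequence

def collate_padded(batch, pad_token_id=0):
    # Pad each batch only to its own longest sequence rather than a fixed
    # global maximum, which is the core idea behind dynamic batching.
    input_ids = [torch.tensor(example["input_ids"]) for example in batch]
    lengths = torch.tensor([len(ids) for ids in input_ids])
    padded = pad_sequence(input_ids, batch_first=True, padding_value=pad_token_id)
    attention_mask = torch.zeros_like(padded)
    for i, length in enumerate(lengths):
        attention_mask[i, :length] = 1
    return {"input_ids": padded, "attention_mask": attention_mask, "lengths": lengths}

A function like this would be passed to a DataLoader as collate_fn so that padding cost tracks the actual batch contents.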

Advanced Features

Custom Training Configurations

from src.training.pipeline import TrainingPipeline
from src.core.config_handler import ConfigManager

# Initialize with custom configuration
config_manager = ConfigManager(config_path=".config")
pipeline = TrainingPipeline(
    model_name="llama",
    model_version="3.3-7b",
    config_manager=config_manager
)

# Start training with monitoring (train_data and val_data are datasets prepared earlier)
pipeline.train(
    train_dataset=train_data,
    val_dataset=val_data,
    monitor_performance=True
)

Distributed Training Setup

from src.training.distributed import DistributedTrainer

# Initialize distributed trainer
trainer = DistributedTrainer(config=config)

# Distribute model and data
model = trainer.distribute_model(model)
sampler = trainer.create_distributed_sampler(dataset)

# Train with synchronization
trainer.train(model, dataset, sampler)

Performance Monitoring and Visualization Setup

from src.training.monitoring import MetricsCollector, Visualizer

# Initialize monitoring
metrics = MetricsCollector()
visualizer = Visualizer(output_dir="visualizations")

# Log and visualize metrics
metrics.log_metric("train_loss", 0.5)
visualizer.plot_losses(
    train_loss=metrics.get_metric_history("train_loss"),
    val_loss=metrics.get_metric_history("val_loss")
)

Best Practices

1. Data Management

  • Use appropriate batch sizes for your hardware
  • Enable dynamic batching for variable length sequences
  • Implement proper data validation
  • Use caching for frequently accessed data

2. Training

  • Monitor GPU memory usage
  • Use gradient accumulation for large models
  • Enable automatic mixed precision (combined with accumulation in the sketch after this list)
  • Implement proper cleanup
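
Gradient accumulation and mixed precision are usually combined in one loop. A minimal PyTorch sketch of that standard pattern, reusing the accumulation and clipping values from the configuration above (placeholder arguments, not the pipeline's real code):

import torch

def train_with_accumulation(model, loader, optimizer,
                            accumulation_steps=4, max_grad_norm=1.0):
    # Hedged sketch; assumes the model returns an object with a .loss attribute.
    scaler = torch.cuda.amp.GradScaler()
    optimizer.zero_grad()
    for step, batch in enumerate(loader):
        with torch.autocast(device_type="cuda", dtype=torch.float16):
            # Divide so gradients average over the accumulation window
            loss = model(**batch).loss / accumulation_steps
        scaler.scale(loss).backward()
        if (step + 1) % accumulation_steps == 0:
            scaler.unscale_(optimizer)
            torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=max_grad_norm)
            scaler.step(optimizer)
            scaler.update()
            optimizer.zero_grad()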

3. Checkpointing

  • Save checkpoints regularly
  • Track best models based on metrics
  • Clean up old checkpoints (see the pruning sketch after this list)
  • Use safe file operations
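
A minimal sketch of the keep_last_n rotation policy from the [checkpoint] configuration, assuming checkpoints are written as checkpoint-<step> directories as in the Quick Start example (LlamaHome's real cleanup logic may differ):

import shutil
from pathlib import Path

def prune_checkpoints(checkpoint_dir: str, keep_last_n: int = 5) -> None:
    # Sort checkpoint-<step> directories by step number and drop all but the newest N.
    checkpoints = sorted(
        Path(checkpoint_dir).glob("checkpoint-*"),
        key=lambda p: int(p.name.rsplit("-", 1)[-1]),
    )
    for stale in checkpoints[:-keep_last_n]:
        shutil.rmtree(stale)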

4. Monitoring

  • Track essential metrics
  • Set up proper logging
  • Monitor resource usage
  • Export metrics regularly

Troubleshooting

Common Issues

  1. Out of Memory

    • Reduce batch size
    • Enable gradient accumulation
    • Use dynamic batching
    • Enable memory-efficient training
  2. Slow Training

    • Check data loading bottlenecks
    • Enable caching
    • Optimize batch size
    • Use multiple workers (see the loader sketch after this list)
  3. Checkpoint Issues

    • Verify disk space
    • Check file permissions
    • Enable safe file operations
    • Monitor I/O performance
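
For data-loading bottlenecks, the multi-worker advice typically means tuning the loader itself. A hedged PyTorch sketch with illustrative values (not LlamaHome defaults):

from torch.utils.data import DataLoader, Dataset

def build_loader(dataset: Dataset, batch_size: int = 32, num_workers: int = 8) -> DataLoader:
    # Illustrative settings: more workers overlap preprocessing with training,
    # pinned memory speeds host-to-GPU copies, persistent workers avoid
    # re-spawning processes every epoch.
    return DataLoader(
        dataset,
        batch_size=batch_size,
        num_workers=num_workers,
        pin_memory=True,
        prefetch_factor=2,
        persistent_workers=True,
    )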

Next Steps

  1. Advanced Features

    • Custom optimizers
    • Learning rate scheduling
    • Advanced monitoring
    • Custom augmentations
  2. Performance Optimization

    • Memory profiling
    • Training speed optimization
    • Cache efficiency
    • Resource utilization
  3. Integration

    • Custom datasets
    • New model architectures
    • External monitoring tools
    • Custom metrics