SummaFi is a financial news summarization system built as my BSc final project. It uses Facebook's BART model, fine-tuned on the CNN/DailyMail dataset, to deliver concise, accurate summaries of financial news articles. Key metrics achieved by the summarizer:
- ROUGE-1: 0.4223
- ROUGE-2: 0.1935
- ROUGE-L: 0.2889
- AI-powered summarization using a fine-tuned BART model (see the sketch after this list)
- Article extraction from URLs with newspaper3k
- Financial sentiment analysis using FinBERT
- Interactive web interface built with Gradio
- GDPR compliance with real-time, stateless processing
- Evaluation metrics using ROUGE scores
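For a feel of how these pieces fit together, here is a rough end-to-end sketch built from newspaper3k, the HuggingFace pipeline API, and the model names above. The real implementation lives in src/ and may differ; the URL is a placeholder and the generation lengths are only illustrative.

```python
# Rough sketch of the SummaFi flow; the actual code in src/ may differ.
from newspaper import Article
from transformers import pipeline

url = "https://example.com/some-financial-article"  # placeholder URL

# 1. Extract the article text from the URL (newspaper3k).
article = Article(url)
article.download()
article.parse()

# 2. Summarize with a BART checkpoint (swap in your fine-tuned model path).
summarizer = pipeline("summarization", model="facebook/bart-base")
summary = summarizer(
    article.text, truncation=True, max_length=142, min_length=56, do_sample=False
)[0]["summary_text"]

# 3. Score the financial sentiment of the summary with FinBERT.
sentiment = pipeline("text-classification", model="ProsusAI/finbert")
label = sentiment(summary)[0]

print(summary)
print(label["label"], round(label["score"], 3))
```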
- Base Model: facebook/bart-base
- Dataset: CNN/DailyMail (3.0.0)
- Sentiment Analysis: ProsusAI/finbert
- Frameworks: PyTorch, HuggingFace Transformers
- Web Interface: Gradio
- Experiment Tracking: Weights & Biases
- Testing Framework: pytest
- Python 3.8+ (Python 3.10 recommended)
- CUDA-capable GPU (optional but recommended)
- At least 16GB RAM
- Clone the repository:

      git clone https://github.com/formater/SummaFi.git
      cd SummaFi

- Create a virtual environment:

      python -m venv venv
      source venv/bin/activate  # Windows: venv\Scripts\activate

- Install dependencies:

      pip install --upgrade pip
      pip install -r requirements.txt

- Configure Weights & Biases (if applicable):

      wandb login
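With the dependencies in place, you can quickly check whether PyTorch sees a CUDA GPU (optional; CPU-only runs work too, just slower):

```python
# Optional: confirm that PyTorch detects a CUDA-capable GPU.
import torch

print("CUDA available:", torch.cuda.is_available())
if torch.cuda.is_available():
    print("Device:", torch.cuda.get_device_name(0))
```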
Fine-tune the model using:

    python main.py --mode train --config config/config.yaml
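Under the hood this is a standard HuggingFace fine-tuning run; the sketch below approximates it with Seq2SeqTrainer on CNN/DailyMail using the hyperparameters from config/config.yaml. It is an illustration of the setup, not the project's actual training module (which lives in src/training/).

```python
# Hedged sketch of BART fine-tuning on CNN/DailyMail; SummaFi's real
# training pipeline lives in src/training/ and may differ in detail.
from datasets import load_dataset
from transformers import (AutoTokenizer, BartForConditionalGeneration,
                          DataCollatorForSeq2Seq, Seq2SeqTrainer,
                          Seq2SeqTrainingArguments)

model_name = "facebook/bart-base"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = BartForConditionalGeneration.from_pretrained(model_name)

dataset = load_dataset("cnn_dailymail", "3.0.0")

def preprocess(batch):
    # Tokenize articles as inputs and reference highlights as labels.
    inputs = tokenizer(batch["article"], max_length=1024, truncation=True)
    labels = tokenizer(text_target=batch["highlights"], max_length=142, truncation=True)
    inputs["labels"] = labels["input_ids"]
    return inputs

tokenized = dataset.map(preprocess, batched=True,
                        remove_columns=dataset["train"].column_names)

args = Seq2SeqTrainingArguments(
    output_dir="outputs",
    per_device_train_batch_size=8,   # config: training.batch_size
    learning_rate=3e-5,              # config: training.learning_rate
    num_train_epochs=3,              # config: training.num_epochs
    warmup_steps=500,                # config: training.warmup_steps
    report_to="wandb",               # experiment tracking, as listed in the stack above
)

trainer = Seq2SeqTrainer(
    model=model,
    args=args,
    train_dataset=tokenized["train"],
    eval_dataset=tokenized["validation"],
    data_collator=DataCollatorForSeq2Seq(tokenizer, model=model),
)
trainer.train()
trainer.save_model("outputs/final_model")
tokenizer.save_pretrained("outputs/final_model")
```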
Due to GitHub file-size limits, the trained model cannot be uploaded to this repository. If you do not want to train your own model, you can download the trained checkpoint from: https://huggingface.co/formater/summarizer/tree/main
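If you go this route, the checkpoint can also be fetched programmatically with huggingface_hub; placing it under outputs/final_model (the path used by the evaluate and serve commands below) keeps those commands unchanged:

```python
# Fetch the published checkpoint instead of training locally.
from huggingface_hub import snapshot_download

snapshot_download(
    repo_id="formater/summarizer",
    local_dir="outputs/final_model",  # path expected by the evaluate/serve commands below
)
```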
Evaluate model performance:

    python main.py --mode evaluate --config config/config.yaml --model-path outputs/final_model
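Evaluation reports the ROUGE scores quoted at the top of this README; a bare-bones version of that computation with the evaluate library looks roughly like this (the project's actual logic is in src/evaluation/, and the small test-set slice here is only for illustration):

```python
# Hedged sketch of ROUGE scoring; the real evaluation code is in src/evaluation/.
import evaluate
from datasets import load_dataset
from transformers import pipeline

rouge = evaluate.load("rouge")
summarizer = pipeline("summarization", model="outputs/final_model")

# Small test-set slice purely for illustration; a full run uses the whole split.
test = load_dataset("cnn_dailymail", "3.0.0", split="test[:100]")
predictions = [
    summarizer(article, truncation=True, max_length=142, min_length=56)[0]["summary_text"]
    for article in test["article"]
]

scores = rouge.compute(predictions=predictions, references=test["highlights"])
print(scores)  # keys include rouge1, rouge2, rougeL
```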
Launch the interactive Gradio web interface:

    python main.py --mode serve --model-path outputs/final_model --port 7860
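The serve mode wraps the summarizer in Gradio; a stripped-down sketch of such an interface is shown below (the real UI in src/web/ includes more, e.g. the FinBERT sentiment output):

```python
# Minimal Gradio front-end sketch; the actual interface lives in src/web/.
import gradio as gr
from newspaper import Article
from transformers import pipeline

summarizer = pipeline("summarization", model="outputs/final_model")

def summarize_url(url: str) -> str:
    # Fetch and parse the article, then summarize it on the fly.
    article = Article(url)
    article.download()
    article.parse()
    result = summarizer(article.text, truncation=True, max_length=142, min_length=56)
    return result[0]["summary_text"]

demo = gr.Interface(
    fn=summarize_url,
    inputs=gr.Textbox(label="Article URL"),
    outputs=gr.Textbox(label="Summary"),
    title="SummaFi",
)
demo.launch(server_port=7860)  # nothing is stored server-side, in line with the GDPR notes below
```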
    summa_fi/
    ├── config/
    │   └── config.yaml        # Configuration file
    ├── src/
    │   ├── data/              # Data processing utilities
    │   ├── models/            # Model definition and handling
    │   ├── training/          # Training pipeline
    │   ├── evaluation/        # Evaluation utilities
    │   ├── utils/             # Helper functions
    │   └── web/               # Web interface implementation
    ├── tests/                 # Test suite
    ├── docs/                  # Documentation
    ├── requirements.txt       # Dependencies
    └── README.md              # This file
Configuration details can be adjusted in config/config.yaml:
    model:
      name: "facebook/bart-base"
      max_length: 1024
      min_length: 56
      length_penalty: 2.0

    training:
      batch_size: 8
      learning_rate: 3e-5
      num_epochs: 3
      warmup_steps: 500
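A sketch of how a script might read these values back, assuming PyYAML and the key names shown above:

```python
# Hedged sketch of loading config/config.yaml; assumes PyYAML is installed.
import yaml

with open("config/config.yaml") as f:
    cfg = yaml.safe_load(f)

model_cfg = cfg["model"]        # name, max_length, min_length, length_penalty
train_cfg = cfg["training"]     # batch_size, learning_rate, num_epochs, warmup_steps

print(model_cfg["name"], train_cfg["learning_rate"])
```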
    # Run all tests
    pytest

    # Run specific test files
    pytest tests/test_data_loader.py
    pytest tests/test_article_extractor.py
    pytest tests/test_sentiment_analyzer.py

    # Run tests by marker
    pytest -m gpu          # Run GPU-specific tests
    pytest -m integration  # Run integration tests

    # Run with coverage report
    pytest --cov=src --cov-report=html
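The gpu and integration selectors are ordinary pytest markers; a toy test file along these lines shows how they are applied (marker registration in pytest.ini and the summarizer fixture are assumptions, not part of this repository's actual tests):

```python
# Hypothetical test file illustrating the markers used above.
import pytest
import torch


@pytest.mark.gpu
def test_model_moves_to_cuda():
    # Skip gracefully on machines without a CUDA device.
    if not torch.cuda.is_available():
        pytest.skip("no CUDA device available")
    tensor = torch.ones(2, 2).to("cuda")
    assert tensor.is_cuda


@pytest.mark.integration
def test_end_to_end_summary_is_nonempty(summarizer):  # 'summarizer' fixture assumed to exist
    summary = summarizer("Markets rallied after the central bank held rates steady.")
    assert isinstance(summary, str) and summary.strip()
```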
- Unit Tests: testing individual components, such as:
  - Data loading and preprocessing
  - Model initialization and inference
  - URL validation and article extraction
- Integration Tests: ensuring components work together, including:
  - Complete training pipeline
  - End-to-end summarization process
  - Web interface functionality
- GPU Tests: testing hardware-specific features:
  - Model GPU utilization
  - Mixed precision training
  - Memory management
Please find detailed testing documentation at docs/testing.md.
- No personal data storage
- Real-time processing only
- No cookies or tracking
- Transparent data handling practices
- Articles are processed under fair use principles
- No article content is stored
- Source attribution provided
- Code adheres to PEP 8 style guidelines
- Comprehensive docstrings and type hints
- Minimum test coverage: 70%
- Fork the repository
- Create a feature branch
- Implement your changes
- Submit a pull request
This project is licensed under the MIT License. See the LICENSE file for details.
- Author: Dudás József
- Email: [email protected]
- GitHub: formater
- LinkedIn: Dudás József