Multimodal Text Processing System: An Integrated Approach

This repository implements a pipeline for processing long-form audio and integrating multiple NLP tasks, including speech-to-text transcription, text summarization, and text classification. The system utilizes state-of-the-art transformer models and deep learning techniques for efficient handling of multimodal data.

Project Overview

The pipeline consists of the following components:

Speech-to-Text Transcription:
- Uses OpenAI's Whisper base model for transcription.
- Processes long-form audio by splitting it into 30-second chunks with a 2-second overlap to ensure continuity in transcription.
- Supports long audio datasets like the TEDLIUM long-form dataset.
- Outputs transcriptions for each audio sample in text format to a designated directory..
- File: whisper_transcriber.py.
Text Summarization:
- Summarizes transcriptions using Facebook's BART large model (facebook/bart-large-cnn).
- Preprocesses transcription text by:
  - Removing filler words and repetitions.
  - Restoring punctuation using the DeepMultilingualPunctuation library.
  - Correcting grammar with LanguageTool.
- Splits long text into manageable chunks for summarization.
- Outputs summarized text for each transcription.
- File: batch_text_summarizer.py
Pipeline Orchestration:
- Automates the transcription and summarization process.
- Ensures the creation of all intermediate files (e.g., transcription logs, summarized texts).
- Verifies output integrity at each step to ensure pipeline robustness.
Post-Processing with Notebook:
- Processes summarized texts using a custom LSTM-CRF model for named entity recognition.
- Identifies and classifies key entities (people, organizations, locations) in the summaries.
- Generates detailed entity analysis reports with classification metrics.
- Outputs structured NER data for further downstream processing.
- Use the file (combined_summaries.txt) in demo folder as input for the Final_LSTMCRF.ipynb notebook.

Dataset

For the speech-to-text and summarization pipeline:

Used distil-whisper/tedlium-long-form dataset
19 total samples (8 validation, 11 test)
~11.5 minutes average duration per sample

For the LSTM-CRF NER model:

Used the CoNLL-2003 dataset
Training data: 14,041 sequences
Validation data: 3,250 sequences
Test data: 3,453 sequences
Contains labeled data for 4 entity types: PER, ORG, LOC, MISC

Then the trained LSTM-CRF model was applied to our 307 summarized sequences from the TEDLIUM transcriptions, which explains the performance difference between:

Original CoNLL test set performance (F1 scores 0.60-0.83)
Summarized text performance (F1 scores 0.08-0.29)

This significant drop in performance suggests the model had difficulty generalizing from the CoNLL news text it was trained on to our summarized transcription text.

Requirements

# Core dependencies
pip install librosa soundfile datasets torch transformers numpy tqdm nltk==3.5 deepmultilingualpunctuation language-tool-python

# For WER calculation
pip install jiwer

# For BERTScore calculation
pip install bert_score

Usage

Create and Activate Anaconda environment:

conda create -n pipeline python=3.9 -y
conda activate pipeline

Install

conda install -c huggingface datasets transformers -y
conda install pytorch torchvision torchaudio -c pytorch
pip install nltk==3.5 numpy tqdm soundfile librosa deepmultilingualpunctuation language-tool-python
conda install -c conda-forge python-dotenv

Run pipeline to transcribe speech to text and summarize transcriptions:

python run_pipeline.py

Run NER analysis on summaries:

# Open Jupyter notebook
jupyter notebook Final_LSTMCRF.ipynb

# Use combined_summaries.txt from demo/ folder as input
# Run classification using LSTM-CRF model trained on CoNLL dataset

Demo

To see a demo of the pipeline in action, watch the video here. To see a demo of the classification process in action, watch the video here.

Technical Details

Whisper Configuration

# Model settings
model = WhisperForConditionalGeneration.from_pretrained("openai/whisper-base")      # Load base model (~244MB)
processor = WhisperProcessor.from_pretrained("openai/whisper-base")                 # Load processor which:
                                                                                    # 1. Feature extractor: Converts audio to spectrograms (audio fingerprints)
                                                                                    # 2. Tokenizer: Converts between text and token IDs (e.g., "hello" → [123, 456])
                                                                                    # Both are needed to translate audio into text the model can understand

# Generation parameters
predicted_ids = model.generate(
    input_features,                # Processed audio input
    language="en",                # Force English language output
    num_beams=5,                  # Beam search with 5 parallel beams for better transcription quality 
    no_repeat_ngram_size=3        # Prevent repetition of 3-word phrases (e.g., "the the the" or "going to going to")
)

Chunking Strategy for Speech to Text

CHUNK_LENGTH_SEC = 30  # Standard Whisper input size
OVERLAP_SEC = 2       # Overlap for context continuity

Maintains context between chunks
Prevents word splitting at boundaries
Ensures smooth transcription flow

Chunking Overlap Rationale

The 2-second overlap is designed to capture complete phrases (4-6 words) at chunk boundaries, based on average English speaking rates of 2-3 words per second. This duration aligns with natural speech patterns including sentence transitions and pauses (0.5-3 seconds), ensuring smooth context preservation between consecutive 30-second chunks.

Text Summarization Configuration

# Model settings
tokenizer = AutoTokenizer.from_pretrained("facebook/bart-large-cnn")        # Load BART tokenizer
model = AutoModelForSeq2SeqLM.from_pretrained("facebook/bart-large-cnn")    # Load BART model
summarizer = pipeline("summarization", model=model, tokenizer=tokenizer)     # Initialize summarization pipeline

# Generation parameters
summary = summarizer(
    text,                          # Input text chunk
    max_length=150,                # Maximum length of summary
    min_length=50,                 # Minimum length of summary
    do_sample=True,                # Enable sampling for diverse outputs
    num_beams=4,                   # Beam search with 4 parallel beams
    length_penalty=2.0,            # Favor longer summaries
    early_stopping=True            # Stop when valid output is found
)

Chunking Strategy for Summarization

MAX_TOKENS = 900     # Reduced from BART's 1024 limit for safety
MIN_LENGTH = 50      # Minimum summary length per chunk

Splits text using sentence boundaries to maintain coherence
Tracks token count to prevent exceeding model limits
Preserves complete sentences within chunks
Performs two-pass summarization for very long texts

Preprocessing Pipeline

# Key preprocessing steps
text = remove_non_speech_artifacts(text)      # Remove [brackets], (parentheses)
text = remove_filler_words(text)              # Remove um, uh, you know, etc.
text = remove_repetitions(text)               # Remove immediate word/phrase repetitions
text = restore_punctuation(text)              # Add missing punctuation using deep learning
text = correct_grammar(text)                  # Fix grammatical errors with LanguageTool

Cleans transcription artifacts and noise
Enhances readability through punctuation restoration
Improves grammar and removes redundancies
Prepares text for optimal summarization quality

Two-Pass Summarization Rationale

The system implements a two-pass approach where long texts are first summarized in chunks, then combined and summarized again if needed. This ensures that even very long transcriptions (e.g., hour-long talks) can be effectively condensed while maintaining coherence and capturing key information across the entire text. The chunk size of 900 tokens provides a safety margin below BART's 1024 token limit while maximizing context available for each summary.

Custom LSTM-CRF Model Configuration

# Model Architecture Settings
EMBEDDING_DIM = 100     # Word embedding dimensions
HIDDEN_DIM = 128       # LSTM hidden layer size
BATCH_SIZE = 32        # Training batch size
MAX_LEN = 50          # Maximum sequence length
DROPOUT = 0.5         # Dropout rate for regularization

# Model Initialization
model = LSTM_CRF(
    vocab_size,           # Size of vocabulary from training data
    tagset_size,         # Number of unique NER tags
    EMBEDDING_DIM,       # Embedding layer dimensions
    HIDDEN_DIM,          # Hidden layer size
    padding_idx=PAD_IDX, # Index for padding tokens
    dropout=DROPOUT      # Dropout rate
)

Bidirectional LSTM Architecture

class BidirectionalCustomLSTM(nn.Module):
    # Forward LSTM processes sequence left-to-right
    # Backward LSTM processes sequence right-to-left
    # Outputs concatenated for richer context
    
    hidden_size = HIDDEN_DIM // 2  # Split for bidirectional
    outputs = torch.cat([forward_out, backward_out], dim=2)
    # Final output shape: (batch_size, seq_length, hidden_dim)

CRF Layer Configuration

# CRF Parameters
crf = CRF(tagset_size)  # Conditional Random Field layer
# Uses Viterbi algorithm for optimal tag sequence
predictions = crf.viterbi_decode(
    emissions,          # LSTM output scores
    mask=mask          # Mask for padding tokens
)

Data Processing Pipeline

# Preprocessing Steps
def preprocess_text(text):
    # Convert to lowercase for consistency
    text = text.lower()
    # Tokenize using spaCy
    doc = nlp(text)
    # Extract features and create masks
    tokens, pos_tags, ner_tags = extract_features(doc)
    # Handle padding and truncation
    return pad_sequences(tokens, maxlen=MAX_LEN)

Training Configuration

# Training Parameters
optimizer = optim.Adam(
    model.parameters(), 
    lr=0.001           # Learning rate
)
clip_value = 5.0      # Gradient clipping threshold

# Loss Calculation
loss = model(
    sentences,        # Input sequences
    tags,            # True NER tags
    mask             # Attention mask
)

Entity Recognition Features

Supports standard IOB (Inside, Outside, Beginning) tagging scheme
Recognizes four entity types: LOC, ORG, PER, MISC
Handles nested entities through layered tag structure
Maintains entity boundaries across sentence splits

Performance Optimization Techniques

Uses dynamic batching for varying sequence lengths
Implements attention masking for efficient computation
Employs gradient clipping to prevent exploding gradients
Utilizes early stopping based on validation metrics

Output Processing

def process_predictions(predictions, idx2tag):
    # Convert numeric predictions to tag names
    tags = [idx2tag[idx] for idx in predictions]
    # Group consecutive tags into entities
    entities = group_entities(tags)
    # Calculate confidence scores
    scores = calculate_confidence(emissions)
    return entities, scores

Evaluation Metrics

Precision, recall, and F1 score per entity type
Micro and macro-averaged metrics
Confusion matrix for error analysis
Token-level and entity-level accuracy measures

Metrics

Calculate WER:

python wer_calculator.py

Results Summary for WER

Word Error Rate (WER) is a metric that measures the accuracy of speech recognition systems by calculating the minimum number of word insertions, deletions, and substitutions needed to transform the predicted transcript into the reference transcript, divided by the number of words in the reference. A WER of 0.32 means that approximately 32% of words contain errors compared to the reference transcript, where a lower score indicates better performance (e.g., a WER of 0 would mean perfect transcription).

Performance metrics:

Average WER: 0.3240
Processing both validation and test splits
Total samples processed: 19
Consistent performance across different speakers
WER range: 0.2644 - 0.4016

WER Calculation

Implemented both using jiwer library and from scratch
Current average WER: 0.3240 (32.4%)
Best performing sample: 0.2644 (26.44%)
Worst performing sample: 0.4016 (40.16%)

Results Summary BERTScore

BERTScore is a text evaluation metric that leverages BERT's contextual embeddings to measure semantic similarity between generated text and a reference, allowing it to recognize synonyms and paraphrases rather than requiring exact word matches. Unlike traditional metrics, BERTScore provides more nuanced evaluation scores that better align with human judgment, making it particularly valuable for assessing the quality of text generation tasks like summarization and translation.

Performance metrics:

Average Precision: 0.8394 (83.94%)
Average Recall: 0.8020 (80.20%)
Average F1 Score: 0.8201 (82.01%)
Average Compression Ratio: 16.34x
Total samples processed: 19 (8 validation, 11 test)

BERTScore Analysis

Implemented using bert-score library
Strong semantic similarity maintained despite high compression
Best performing sample: Brian Cox (P: 0.9126, F1: 0.8546)
Worst performing sample: James Cameron (P: 0.8015, F1: 0.7882)
Compression ratios range from 2.91x to 24.27x, with most samples between 15-18x

Key Findings

Consistent F1 scores above 0.80 across most speakers
Higher precision than recall suggests accurate but concise summaries
Short talks (<5 minutes) tend to achieve higher similarity scores
Longest compression achieved on James Cameron talk (24.27x) while maintaining 0.79 F1 score
Exceptional performance on both technical and narrative content types

Results Summary LSTM-CRF NER using CONLL

The LSTM-CRF model combines a bidirectional LSTM network with a Conditional Random Field layer for named entity recognition in summarized text. This architecture allows the model to capture both word-level context and label dependencies, making it particularly effective for identifying and classifying named entities in summarized content.

Performance Metrics:

Average Precision: 0.8300 (83.00%)
Average Recall: 0.7900 (79.00%)
Average F1 Score: 0.8200 (82.00%)
Total Entities Processed: 9,109
Total Samples Analyzed: 19 (8 validation, 11 test)

Entity-wise Analysis

Location (LOC):
- Precision: 0.84, Recall: 0.82, F1: 0.83
- Total instances: 1,834
Organization (ORG):
- Precision: 0.72, Recall: 0.52, F1: 0.60
- Total instances: 1,339
Person (PER):
- Precision: 0.80, Recall: 0.69, F1: 0.74
- Total instances: 1,796
Miscellaneous (MISC):
- Precision: 0.81, Recall: 0.61, F1: 0.70
- Total instances: 919

Key Findings

Strongest performance in identifying location entities (F1: 0.83)
High precision across all entity types (>0.72)
Lower recall for organization entities suggests higher false negatives
Consistent performance across different text lengths
Effective handling of both technical and conversational content
Successfully maintains entity recognition quality in compressed summaries

These results demonstrate the model's robust performance in identifying named entities within summarized text, with particularly strong results for location and person entities. The balanced precision-recall trade-off indicates reliable entity recognition, though there's room for improvement in organization detection.

Results Summary: Combined Summaries NER Analysis

Document Overview

Total sequences analyzed: 307
Total entity instances detected: 36

Entity Performance Breakdown

Organizations (ORG):
- Most frequent entity type (27 instances)
- Precision: 0.71 (71%) - highest among all categories
- Recall: 0.19 (19%)
- F1-score: 0.29 (29%)
Locations (LOC):
- 9 instances identified
- Precision: 0.06 (6%)
- Recall: 0.11 (11%)
- F1-score: 0.08 (8%)
Person (PER) & Miscellaneous (MISC):
- No successful identifications
- Support: 0

Overall Performance

Micro-average metrics:
- Precision: 0.07
- Recall: 0.17
- F1-score: 0.10

Key Observations

Model shows strongest performance in identifying organizations
Limited success with location entities
Significant room for improvement across all categories
Performance substantially lower than expected, suggesting potential issues with:
- Model adaptation to summarized text format
- Entity recognition in condensed content
- Handling of context in shortened text

References

TEDLIUM Dataset: distil-whisper/tedlium-long-form
CoNLL-2003 Dataset: eriktks/conll2003
Whisper Model: openai/whisper-base

Name		Name	Last commit message	Last commit date
Latest commit History 66 Commits
data		data
demo		demo
iterations		iterations
presentation		presentation
.DS_Store		.DS_Store
Final_LSTMCRF.ipynb		Final_LSTMCRF.ipynb
Multimodal_Text_Processing_System___Project_Proposal.pdf		Multimodal_Text_Processing_System___Project_Proposal.pdf
README.md		README.md
batch_text_summarizer.py		batch_text_summarizer.py
concatenate_summaries.py		concatenate_summaries.py
evaluate_bert_scores.py		evaluate_bert_scores.py
run_pipeline.py		run_pipeline.py
wer_calculator.py		wer_calculator.py
whisper_transcriber.py		whisper_transcriber.py

purva-agarwal/mtps

Folders and files

Latest commit

History

Repository files navigation

Multimodal Text Processing System: An Integrated Approach

Project Overview

Dataset

Requirements

Usage

Demo

Technical Details

Whisper Configuration

Chunking Strategy for Speech to Text

Chunking Overlap Rationale

Text Summarization Configuration

Chunking Strategy for Summarization

Preprocessing Pipeline

Two-Pass Summarization Rationale

Custom LSTM-CRF Model Configuration

Bidirectional LSTM Architecture

CRF Layer Configuration

Data Processing Pipeline

Training Configuration

Entity Recognition Features

Performance Optimization Techniques

Output Processing

Evaluation Metrics

Metrics

Results Summary for WER

Performance metrics:

WER Calculation

Results Summary BERTScore

Performance metrics:

BERTScore Analysis

Key Findings

Results Summary LSTM-CRF NER using CONLL

Performance Metrics:

Entity-wise Analysis

Key Findings

Results Summary: Combined Summaries NER Analysis

Document Overview

Entity Performance Breakdown

Overall Performance

Key Observations

References

About

Resources

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages