add the project number 11 #2

Open · wants to merge 4 commits into `main`
`README.md`: 59 additions & 0 deletions
@@ -22,6 +22,7 @@ Below is an **enhanced** version of your README. All original content is retained.
- [8) End-to-End Data Lake (Generic Implementation)](#8-end-to-end-data-lake-generic-implementation)
- [9) Spark-Based Historical Tracking (Dimension Changes over Time)](#9-spark-based-historical-tracking-dimension-changes-over-time)
- [10) Basic Machine Learning Pipeline Integration](#10-basic-machine-learning-pipeline-integration)
- [11) Real-Time Streaming Data Pipeline with Kafka and Spark](#11-real-time-streaming-data-pipeline-with-kafka-and-spark)

---

@@ -508,6 +509,64 @@ Below is an **enhanced** version of the **General Guidance** section. The original content is retained.
5. **Simple Dashboard**
   - **Charts**: Churn rate by segment, predicted vs. actual churn.
   - **Use Case**: Marketing or retention teams filter high-risk customers for proactive promotions.

---
### 11) Real-Time Streaming Data Pipeline with Kafka and Spark

### Business Goal

**Objective**: Build a scalable real-time streaming pipeline to ingest, process, and store data, enabling near-instant insights.

**Key Insight**: Businesses can leverage real-time analytics to make data-driven decisions, enhance customer experiences, and respond to events proactively.


### Data & Dataset Description

**Data Source**:
[Random User API](https://randomuser.me/) for generating real-time sample data.

**Data Format**:
JSON payload containing:
- `id`: Unique identifier for the record.
- `name`: Randomly generated name.
- `email`: Randomly generated email address.
- `location`: A location associated with the user.

**Data Details**:
Records are generated in real time with randomized fields, mimicking diverse inputs and simulating production traffic. An illustrative record is shown below.
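
For illustration only, a single flattened record might look like this (the values are invented; the raw Random User API response nests these fields under a `results` array, so the ingest step is assumed to flatten them):

```json
{
  "id": "4f6a2c1e-9d3b-4a7e-8c5f-0b1d2e3f4a5b",
  "name": "Ada Wright",
  "email": "ada.wright@example.com",
  "location": "United Kingdom"
}
```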



### Expectations

#### **Ingest**
- Fetch data from the [Random User API](https://randomuser.me/) at regular intervals.
- Publish the fetched data to a Kafka topic named `random_names` (a producer sketch follows).
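
A minimal producer sketch for this step, assuming the `kafka-python` client, a broker on `localhost:9092`, and a five-second poll interval (the field flattening mirrors the payload described above):

```python
import json
import time

import requests
from kafka import KafkaProducer  # pip install kafka-python

producer = KafkaProducer(
    bootstrap_servers="localhost:9092",  # assumed broker address
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)


def fetch_user() -> dict:
    """Fetch one random user and flatten it to the pipeline's schema."""
    resp = requests.get("https://randomuser.me/api/", timeout=10)
    resp.raise_for_status()
    user = resp.json()["results"][0]
    return {
        "id": user["login"]["uuid"],
        "name": f'{user["name"]["first"]} {user["name"]["last"]}',
        "email": user["email"],
        "location": user["location"]["country"],
    }


if __name__ == "__main__":
    while True:
        producer.send("random_names", fetch_user())
        time.sleep(5)  # poll interval; tune to the traffic you want to simulate
```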

#### **Stream and Process**
- Use Spark Structured Streaming to consume data from the `random_names` topic.
- Perform transformations (see the sketch below):
  - Parse JSON records against the expected schema.
  - Filter out incomplete or malformed records.
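
A minimal Structured Streaming sketch for this step (assumes the `spark-sql-kafka-0-10` package is on the classpath and the broker address used above):

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, from_json
from pyspark.sql.types import StringType, StructField, StructType

# Schema mirrors the JSON payload described earlier.
schema = StructType([
    StructField("id", StringType()),
    StructField("name", StringType()),
    StructField("email", StringType()),
    StructField("location", StringType()),
])

spark = SparkSession.builder.appName("random-names-stream").getOrCreate()

raw = (
    spark.readStream.format("kafka")
    .option("kafka.bootstrap.servers", "localhost:9092")  # assumed broker
    .option("subscribe", "random_names")
    .option("startingOffsets", "latest")
    .load()
)

# Parse the Kafka value bytes as JSON and drop incomplete records.
parsed = (
    raw.selectExpr("CAST(value AS STRING) AS json")
    .select(from_json(col("json"), schema).alias("r"))
    .select("r.*")
    .filter(col("id").isNotNull() & col("email").isNotNull())
)
```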

#### **Store**
- Persist processed data into Apache Cassandra for structured storage and efficient querying.
- Use an appropriate schema with secondary indexes on columns such as `name` and `location` (one possible schema is sketched below).
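
One possible schema, created with the DataStax `cassandra-driver` (the `streaming` keyspace, table name, and single-node replication settings are assumptions for a demo), followed by a `foreachBatch` write of the `parsed` stream from the previous sketch through the Spark-Cassandra connector:

```python
from cassandra.cluster import Cluster  # pip install cassandra-driver

session = Cluster(["localhost"]).connect()
session.execute("""
    CREATE KEYSPACE IF NOT EXISTS streaming
    WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 1}
""")
session.execute("""
    CREATE TABLE IF NOT EXISTS streaming.random_names (
        id text PRIMARY KEY,
        name text,
        email text,
        location text
    )
""")
# Secondary indexes support the lookups mentioned above; use sparingly at scale.
session.execute("CREATE INDEX IF NOT EXISTS ON streaming.random_names (name)")
session.execute("CREATE INDEX IF NOT EXISTS ON streaming.random_names (location)")


# Persist the stream, assuming the spark-cassandra-connector is on the classpath.
def write_batch(batch_df, _epoch_id):
    (batch_df.write
     .format("org.apache.spark.sql.cassandra")
     .mode("append")
     .options(keyspace="streaming", table="random_names")
     .save())


query = (
    parsed.writeStream
    .foreachBatch(write_batch)
    .option("checkpointLocation", "/tmp/checkpoints/random_names")  # assumed path
    .start()
)
query.awaitTermination()
```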

#### **Orchestrate**
- Automate the pipeline using Apache Airflow (a DAG sketch follows):
  - Schedule data ingestion.
  - Trigger Spark jobs for processing.
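
A minimal Airflow 2.x DAG sketch (the DAG id, schedule, and job path are illustrative, and `produce_batch` is a hypothetical wrapper around the producer above):

```python
from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.python import PythonOperator
from airflow.providers.apache.spark.operators.spark_submit import SparkSubmitOperator


def produce_batch(**_context):
    """Hypothetical wrapper: run the Kafka producer for one ingestion window."""
    ...


with DAG(
    dag_id="random_names_pipeline",
    start_date=datetime(2024, 1, 1),
    schedule_interval="@hourly",  # ingestion cadence; adjust as needed
    catchup=False,
    default_args={"retries": 1, "retry_delay": timedelta(minutes=5)},
) as dag:
    ingest = PythonOperator(
        task_id="ingest_to_kafka",
        python_callable=produce_batch,
    )
    process = SparkSubmitOperator(
        task_id="spark_stream_job",
        application="/opt/jobs/stream_to_cassandra.py",  # assumed job location
        packages="org.apache.spark:spark-sql-kafka-0-10_2.12:3.5.0",
    )
    ingest >> process
```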

#### **Simple Dashboard**
- Visualize insights from Cassandra:
  - Message volume over time.
  - Distribution of users by location.
  - Patterns of invalid or malformed records.
- Use lightweight tools such as Plotly or Grafana (a Plotly sketch follows).
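
A lightweight Plotly sketch for the location distribution (keyspace and table names follow the storage sketch above; the full-table scan is acceptable only at demo scale):

```python
import pandas as pd
import plotly.express as px
from cassandra.cluster import Cluster

session = Cluster(["localhost"]).connect("streaming")
rows = session.execute("SELECT location FROM random_names")  # demo-scale scan
df = pd.DataFrame(list(rows), columns=["location"])

counts = df.groupby("location").size().reset_index(name="records")
fig = px.bar(counts, x="location", y="records", title="Records by location")
fig.show()
```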


### Use Case
Real-time monitoring of customer activity or operational metrics with rapid responses to anomalies. For instance, tracking a sudden influx of users from a specific location or identifying a rise in incomplete data records.


---
