Assessment for "Applied Data Science with Python" course taught at OTH Regensburg by Istvan Lengyel, Senior Lecturer at EIT, Napier, NZ.
- IMDb Webscraping Project
The application is deployed as a complete application stack. To make that work you need to have two things installed:
- Docker: container runtime to run containerized workloads (containers)
- docker-compose: lets you deploy a whole stack of containers with a single command
Second, you need to clone this repository:
git clone https://github.com/Mushroomator/IMDb-Analyser.git
Now move into the cloned repository:
cd IMDb-Analyser
and deploy the stack on your local machine:
docker-compose up
Note: On the first run this command will take some time, as all container images must be downloaded before the Docker engine can start the containers.
You can check whether the deployment was successful by running docker ps -a
which lists all containers on the system (including stopped ones).
As each of the deployed containers has a healthcheck configured to ensure its service runs as expected, this output also shows the current health status of every container in the STATUS column.
Wait a few seconds, run docker ps -a
again, and you should see the status of the containers go from starting to healthy. That is when everything is ready to go. You should get an output similar to the following table:
| CONTAINER ID | IMAGE | ... | STATUS | PORTS | NAMES |
|---|---|---|---|---|---|
| cc7b6b081254 | postgres:14.0-alpine | ... | Up 2 minutes (healthy) | 0.0.0.0:5432->5432/tcp, :::5432->5432/tcp | postgres-db |
| df3e6dd5c141 | dpage/pgadmin4:6.2 | ... | Up 2 minutes (healthy) | 0.0.0.0:80->80/tcp, :::80->80/tcp, 443/tcp | pgadmin |
| 7d0a99cef6b0 | ghcr.io/mushroomator/imdb-analyser | ... | Up 2 minutes (healthy) | 0.0.0.0:5000->5000/tcp, :::5000->5000/tcp | imdb-analyser |
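If you prefer to query the health status programmatically instead of reading the docker ps output, the following minimal Python sketch (not part of the project) shells out to docker inspect for the container names shown in the table above:

```python
# Minimal sketch: print the healthcheck status of each deployed container
# by shelling out to `docker inspect`.
import subprocess

CONTAINERS = ["postgres-db", "pgadmin", "imdb-analyser"]  # names from the table above

for name in CONTAINERS:
    result = subprocess.run(
        ["docker", "inspect", "--format", "{{.State.Health.Status}}", name],
        capture_output=True,
        text=True,
    )
    status = result.stdout.strip() if result.returncode == 0 else "not found"
    print(f"{name}: {status}")
```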
You may now access pgAdmin at http://localhost:80 and log in as the admin user [email protected]
with the password simplepw,
or access the web GUI for the application by visiting http://localhost:5000.
You can later stop the complete stack using the following command:
docker-compose down
Here is a list of the most important directories and their content within this repository:
- backend: Code for the backend
- charts: Source code for charts shown in project report
- docs: All files related to or required for the documentation of this project
- frontend: Code for the frontend
- pgAdmin: Configuration files for pgAdmin
- PostgreSQL: Configuration files and queries for the PostgreSQL database
There are three main pages within the application:
- An actor overview page: shows a list of all 50 actors. You may also search for a specific actor.
- An actor detail page: shows all details on an actor/actress, i.e. their awards, movies, ratings, rank, genres, an image and a short biography.
- A movie page: shows a list of all movies within the database. You can sort the table by any of the available columns by clicking on the respective column header.
When you first visit the application at http://localhost:5000 there won't be any data to display, as webscraping must first be triggered manually using the button at the top right.
When you click the button, you must confirm that you want to start webscraping.
After confirmation, a progress bar shows the current progress of the webscraping process. The process takes approximately 17 minutes.
When the process is finished, click Close
and refresh the page if necessary to show the results.
If the webscraping failed, wait a few minutes (to ensure imdb.com is not blocking your IP due to the high rate of requests) and restart the webscraping process.
You can delete all data in the database manually. To do so, select the button at the top right and confirm the deletion when prompted.
Deletion only takes a short amount of time, and you will be notified whether it was successful.
After the webscraping process is done, the data is written to the database, where it can be viewed using pgAdmin, and is also written to .csv files within the container at /imdb-analyser-project/backend/data
. You can copy those files out of the container to your host system using the docker cp command:
docker cp imdb-analyser:/imdb-analyser-project/backend/data/ path/to/data/on/your/host
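Once the files are on your host, you can explore them with pandas. This is a minimal sketch that assumes the exported files are plain .csv files; it loads every .csv found in the copied directory, so no specific file names are assumed:

```python
# Minimal sketch: load every CSV copied out of the container into a pandas DataFrame.
from pathlib import Path

import pandas as pd

# Same target path as used with `docker cp` above.
data_dir = Path("path/to/data/on/your/host")
frames = {csv_file.stem: pd.read_csv(csv_file) for csv_file in data_dir.glob("*.csv")}

for name, df in frames.items():
    print(f"{name}: {len(df)} rows, columns: {list(df.columns)}")
```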
Each container writes its own logs, which you can access with docker logs <container-name>
. With the -f
option, your terminal attaches to the log stream and you receive new log lines live as they are written.
The web application can be accessed at http://localhost:5000 with any browser installed on your system. The Docker image for the application is hosted on the GitHub Container Registry as ghcr.io/mushroomator/imdb-analyser.
React.js application written in TypeScript using the Chakra UI component library. The frontend takes care of routing using React Router and fetches the required data from the API when needed. Using the web GUI it is possible to start webscraping, delete database content, or simply view the results.
The backend was written in Python 3.9, with Flask as the web framework.
Persistence is provided by a PostgreSQL database running inside its own container. The container exposes the default port 5432
within the user-defined Docker network created for the stack, so the web app can access it via TCP. All data is persisted to the volume db-data
.
The official PostgreSQL docker image was used for this container.
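Because port 5432 is also published to the host (see the table above), the database can be reached from outside the stack, e.g. from a Python script. The following is a minimal sketch assuming psycopg2-binary is installed; the database name, user and password are placeholders and must be replaced with the credentials configured for your stack:

```python
# Minimal sketch: connect to the PostgreSQL container from the host and list
# the tables in the public schema. dbname/user/password are placeholders,
# not the actual credentials of this project.
import psycopg2  # pip install psycopg2-binary

conn = psycopg2.connect(
    host="localhost",
    port=5432,
    dbname="<database>",
    user="<user>",
    password="<password>",
)
with conn, conn.cursor() as cur:
    cur.execute(
        "SELECT table_name FROM information_schema.tables "
        "WHERE table_schema = 'public' ORDER BY table_name;"
    )
    for (table_name,) in cur.fetchall():
        print(table_name)
conn.close()
```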
You can access pgAdmin by visiting http://localhost:80 in your browser. Log in using the username [email protected]
and the password simplepw
. After login you will find the connection to the PostgreSQL database server already pre-configured. Click on the connection, re-enter the password, and inspect the database contents using the Query Tool. The official pgAdmin Docker image was used for this container.
The Python backend relies on ordered dictionaries, so Python >= 3.7 is needed for the application to work correctly. The provided Docker image uses Python 3.9, as this was also the version used during development.
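For illustration, the snippet below shows the language guarantee the backend relies on: since Python 3.7, a plain dict preserves insertion order (the titles and ratings used here are made-up example values):

```python
# Since Python 3.7 the insertion order of a plain dict is part of the language
# specification; this is the guarantee the backend relies on.
ratings = {}
ratings["Movie A"] = 9.2
ratings["Movie B"] = 8.7
ratings["Movie C"] = 8.1

# Iteration yields the keys in exactly the order they were inserted.
assert list(ratings) == ["Movie A", "Movie B", "Movie C"]
print(list(ratings.items()))
```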
If you want to develop this application further, or run and debug the app using an IDE, you can of course do so. In that case you need to clone the repository first:
git clone https://github.com/Mushroomator/IMDb-Analyser.git
Then install the frontend dependencies using npm
and the backend dependencies using pip
and requirements.txt:
cd IMDb-Analyser/frontend
npm install
cd ../backend
pip install -r requirements.txt
You should now be able to run the project locally (IDE, terminal etc.).
Copyright 2022 Thomas Pilz
Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with the License. You may obtain a copy of the License at
https://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License.