Model development

Duong Tieu Dong edited this page Jul 14, 2021 · 4 revisions

Approach

Digital image analysis was chosen as the method for detecting changes. There are two main ways to decide whether a change to a page is legitimate:

  • Anomaly detection using an Autoencoder or an Isolation Forest
  • Image classification with a Convolutional Neural Network (the approach we went with).
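The wiki does not include code for the anomaly-detection alternative, but as an illustration only, it can be sketched with scikit-learn's IsolationForest: train on features of clean pages, then flag pages whose features look unlike anything seen in training. The synthetic feature vectors below are stand-ins, not real website data.

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(0)
# Stand-in feature vectors: "clean page" features cluster near 0,
# "defaced page" features sit far away (purely synthetic).
clean_features = rng.normal(0.0, 1.0, size=(200, 16))
defaced_features = rng.normal(5.0, 1.0, size=(10, 16))

# Fit on clean pages only; predict() returns 1 for inliers, -1 for anomalies.
forest = IsolationForest(contamination=0.05, random_state=0)
forest.fit(clean_features)
labels = forest.predict(defaced_features)
```

An Autoencoder works analogously: reconstruct the input image and treat a high reconstruction error as an anomaly signal.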

The process of building the model can be split into 5 main steps:

  • Data collection
  • Data preprocessing
  • Model development
  • Model optimization
  • Model evaluation

Data collection

Normal website data

  • moz.com/top500
  • github.com/GSA/govt-urls

Defaced website data

Data preprocessing

The dataset is separated into two categories of images:

  • Clean (normal) websites - 6333 images
  • Defaced websites - 4815 images

These images are then downscaled to 250x250px and split into two parts: 80% are used for training and 20% for validation. After that, the data is augmented by rotating, flipping and cropping the images, and then normalized by rescaling pixel values from the [0, 255] RGB range down to the [0, 1] range that neural networks are familiar with.
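The steps above can be sketched in plain numpy. This is a minimal illustration, not the project's actual pipeline: the nearest-neighbour resize stands in for a proper image resizer, and the augmentation is limited to flips and rotations for brevity.

```python
import numpy as np

def preprocess(images, rng):
    """Downscale, augment, normalize, and split a list of HxWx3 uint8 images."""
    size = 250
    processed = []
    for img in images:
        h, w = img.shape[:2]
        # Nearest-neighbour downscale to 250x250 (stand-in for a real resizer).
        rows = np.arange(size) * h // size
        cols = np.arange(size) * w // size
        small = img[rows][:, cols]
        # Augmentation: random horizontal flip and random 90-degree rotation.
        if rng.random() < 0.5:
            small = small[:, ::-1]
        if rng.random() < 0.5:
            small = np.rot90(small)
        # Normalize from [0, 255] down to [0, 1].
        processed.append(small.astype(np.float32) / 255.0)
    data = np.stack(processed)
    # 80/20 train/validation split.
    n_train = int(0.8 * len(data))
    return data[:n_train], data[n_train:]
```

In practice a framework utility (e.g. a Keras image generator) would handle the resizing, augmentation and rescaling in one place; the sketch just makes each step explicit.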

Model optimization

Overfitting can be addressed with:

  • The addition of dropout layers
  • Data augmentation
  • The addition of batch normalization layers
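To make the first and third countermeasures concrete, here is a minimal numpy sketch of what a dropout layer and a batch-normalization layer compute; in the actual model these would be framework layers (e.g. Keras `Dropout` and `BatchNormalization`), not hand-written functions.

```python
import numpy as np

def dropout(x, rate, rng, training=True):
    # Inverted dropout: zero a random fraction of activations during
    # training and rescale the survivors so the expected value is unchanged.
    if not training or rate == 0.0:
        return x
    mask = rng.random(x.shape) >= rate
    return x * mask / (1.0 - rate)

def batch_norm(x, eps=1e-5):
    # Normalize each feature to zero mean and unit variance over the batch
    # (learnable scale/shift parameters omitted for brevity).
    mean = x.mean(axis=0)
    var = x.var(axis=0)
    return (x - mean) / np.sqrt(var + eps)
```

Dropout forces the network not to rely on any single activation, while batch normalization keeps activation statistics stable between layers; both reduce the gap between training and validation accuracy.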

Model evaluation