Deployment scripts for the database supporting the misinformation project.
See main project for project board and issues.
Ensure the Azure CLI is installed. This is a command-line client for administering Azure resources. If you have Python available, the easiest way to install it is with the Python package manager pip. If this doesn't work for you, try one of the alternative installation methods.
pip install azure-cli
There should be a KeyVault named `misinformation-admin` in the `misinformation-common` Azure subscription. All users who are authorised to deploy and administer the database server should be added as principals in the "Access policies" section of the KeyVault. The KeyVault should be set to enable access to Azure Resource Manager for template deployment in the "Advanced access policies" part of the "Access policies" section.
The KeyVault should contain the following sensitive secrets.

- A Secret named `database-admin` containing a strong password in its "value" field. This will only ever be accessed programmatically, so it can be as random and complex as you like. It is recommended to randomly generate a long (16+ character) password that uses lower case letters, upper case letters and numbers. If supported by the random generator, you should also include special characters (e.g. £$%^&).
- A Certificate named `misinformation-admin-sp`. This certificate should be self-signed, have a Subject of `CN=admin-sp.misinformation.turing.ac.uk`, a validity of 12 months and an auto-renewal trigger set to 90% of the validity period.
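Generating and storing the `database-admin` password can be scripted; a minimal sketch, assuming the KeyVault already exists (the `az` commands need an authenticated `az login` session, so they are shown commented out):

```shell
# Generate a random 20-character password (upper case, lower case, digits).
# Add special characters by hand if your generator/policy supports them.
password=$(LC_ALL=C tr -dc 'A-Za-z0-9' < /dev/urandom | head -c 20)

# Store it in the KeyVault and enable template-deployment access
# (both require an authenticated `az login` session):
#   az keyvault secret set --vault-name misinformation-admin \
#       --name database-admin --value "$password"
#   az keyvault update --name misinformation-admin \
#       --enabled-for-template-deployment true
```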
There should be a KeyVault named `misinformation-user` in the `misinformation-common` Azure subscription. All users who are authorised to deploy and administer the database server should be added as principals in the "Access policies" section of the KeyVault. The KeyVault should be set to enable access to Azure Resource Manager for template deployment in the "Advanced access policies" part of the "Access policies" section.
The KeyVault should contain the following sensitive secrets.

- A Secret named `database-crawler-user` containing a strong password in its "value" field. This will only ever be accessed programmatically, so it can be as random and complex as you like. It is recommended to randomly generate a long (16+ character) password that uses lower case letters, upper case letters and numbers. If supported by the random generator, you should also include special characters (e.g. £$%^&).
- A Secret named `database-annotation-user` containing a strong password, generated in the same way, in its "value" field.
- A Secret named `database-crawl-storage-read` containing an SAS token with read-only access to the https://misinformation.blob.core.windows.net/crawl Azure blob storage container in its "value" field. This SAS token should be bound to a stored access policy so that it can be revoked. You can create policies, and SAS tokens bound to them, by right-clicking on the storage container in Azure Storage Explorer.
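The stored access policy and the SAS token bound to it can also be created with the Azure CLI instead of Storage Explorer; a sketch, assuming access to the `misinformation` storage account (verify the exact flag names with `az storage container policy create --help` before running):

```shell
# Expiry one year from now, in the format the CLI expects
# (GNU date shown; on macOS use `date -u -v+1y '+%Y-%m-%dT%H:%MZ'`).
expiry=$(date -u -d '+1 year' '+%Y-%m-%dT%H:%MZ')

# Create a revocable stored access policy and a SAS token bound to it
# (requires an authenticated session with access to the storage account):
#   az storage container policy create --account-name misinformation \
#       --container-name crawl --name crawl-read --permission r --expiry "$expiry"
#   az storage container generate-sas --account-name misinformation \
#       --name crawl --policy-name crawl-read --output tsv
```

Revoking the token is then a matter of deleting or editing the `crawl-read` policy rather than rotating storage account keys.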
We use Azure SQL Data Studio to run SQL queries against the Misinformation database, including the queries that create and update tables within it.
Create the service principal used for deployment, authenticating with the certificate stored in the admin KeyVault:

az ad sp create-for-rbac --name misinformation-admin \
    --cert misinformation-admin-sp --keyvault misinformation-admin
TODO: Script creation of the KeyVault. This should securely generate a strong secret and add the necessary users (ideally using a security group).
- Ensure you are logged into the Azure CLI.
az login
- Ensure the active subscription is `misinformation`.

az account set --subscription misinformation
- Ensure the `misinformation-database` resource group exists in the `southcentralus` region.

az group create --name misinformation-database --location southcentralus
- Deploy the database using the template in this repository.
az group deployment create \
--resource-group misinformation-database \
--template-file database-server.json \
--parameters @database-server.parameters.json
If you get the error `No module named 'msrest.polling'`, then manually install the missing package using `pip install msrestazure`.
- Browse to the Misinformation database in the Azure portal.
- Click on the "Query editor" option in the menu sidebar and log in with the "SQL server authentication" option, using the user `dbadmin` and the password you created earlier and stored in the `database-admin` KeyVault secret.
- Run the following command once each for the crawler and annotation apps to create their database users, using `database-crawler-user` and `database-annotation-user` as the usernames and the corresponding KeyVault secrets created earlier as the passwords.

CREATE USER [user_name] WITH PASSWORD = 'STRONG_PASSWORD_GOES_HERE';
- Run the following commands to give the above users read and write access to the Misinformation database.
EXEC sp_addrolemember 'db_datareader', 'database-crawler-user';
EXEC sp_addrolemember 'db_datawriter', 'database-crawler-user';
EXEC sp_addrolemember 'db_datareader', 'database-annotation-user';
EXEC sp_addrolemember 'db_datawriter', 'database-annotation-user';
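A small helper can print the `CREATE USER` statement once per app account, ready to paste into the query editor; the placeholder password must be replaced with the matching KeyVault secret (retrievable with `az keyvault secret show --vault-name misinformation-user --name <user> --query value -o tsv`):

```shell
# Print a CREATE USER statement for one database user. The placeholder
# password must be replaced with the value of the matching KeyVault secret.
create_user_sql() {
    echo "CREATE USER [$1] WITH PASSWORD = 'STRONG_PASSWORD_GOES_HERE';"
}

create_user_sql database-crawler-user
create_user_sql database-annotation-user
```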
Add the `database-crawl-storage-read` SAS token created earlier to the database so it can access the Azure blob storage container that is currently used to store article data generated by the crawler. Replace `SAS_TOKEN_GOES_HERE` with the value of the `database-crawl-storage-read` secret stored in the project user KeyVault created earlier.
CREATE MASTER KEY
GO
CREATE DATABASE SCOPED CREDENTIAL [database-crawl-storage-read]
WITH IDENTITY = 'SHARED ACCESS SIGNATURE',
SECRET = 'SAS_TOKEN_GOES_HERE';
GO
CREATE EXTERNAL DATA SOURCE misinformation_crawl_container
WITH (
TYPE = BLOB_STORAGE,
LOCATION = 'https://misinformation.blob.core.windows.net/crawl',
CREDENTIAL = [database-crawl-storage-read]
);
TODO: Add instructions for enabling and using AAD SSO once Service Principal creation by AAD members works.
The following steps will ensure the required tables exist in the Misinformation database. None of the `create_*_table.sql` scripts will overwrite or delete any existing tables, so they are safe to re-run. Corresponding `delete_*_table.sql` scripts are provided to delete tables if this is required, though this will result in all the data stored in those tables being deleted.
Set up and save the connection details for the Misinformation database within Azure SQL Data Studio.
- Server: `misinformation.database.windows.net`
- User name: `dbadmin`
- Password: Use the command below to retrieve the database admin password from the `misinformation-admin` KeyVault. It will be displayed in the "value" field of the JSON output to the console.

az keyvault secret show --vault-name misinformation-admin \
    --name database-admin
- If you have not connected to the Misinformation database from your current IP address before, you will be prompted to log in with an Azure account and add a firewall rule to allow access from your current IP address. To do this, follow the prompts to authenticate the Azure SQL Data Studio app in the same way as when logging into the Azure CLI.
- Double click on the `misinformation` database.
- Click on the sheet-of-paper "Explorer" icon in the thin panel on the left of the application.
- Open the folder for this repository.
- Open the `create_raw_crawl_table.sql` file.
- Connect to the Misinformation database using the saved `misinformation.database.windows.net` connection listed above the connection details form.
- Run the `create_raw_crawl_table.sql` query.
- Open and run the following queries in the same way:
  - `create_raw_crawl_indexes.sql`
  - `create_articles_tables.sql`
  - `create_articles_indexes.sql`
  - `create_annotations_tables.sql`
  - `create_annotations_indexes.sql`
- Load the crawl data from the files on Azure blob storage.
  - Run `import_articles_data.sql` once for each `<site_name>_extracted.txt` file in the `crawl` container of the `misinformation` storage account. You will need to replace the filename immediately following the `BULK` keyword for each new file.
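A small loop can print the `<site_name>_extracted.txt` filename to substitute after the `BULK` keyword for each run; `site_a` and `site_b` below are placeholder site names (list the real files with `az storage blob list`):

```shell
# Collect the blob name to substitute after the BULK keyword for each site.
# "site_a" and "site_b" are placeholders; enumerate the real blobs with:
#   az storage blob list --account-name misinformation --container-name crawl
crawl_files=""
for site in site_a site_b; do
    crawl_files="$crawl_files ${site}_extracted.txt"
done
echo "$crawl_files"
```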
- Run `select_n_articles_per_site.sql`: Selects N articles for each site. The ordering of articles within each site grouping is fixed (determined by the auto-incremented ID column), regardless of the chosen N, so increasing N will always result in a strict superset of the previously selected articles and reducing N will always result in a strict subset of previously selected articles.
- `articles_per_site`: Lists all unique sites in the articles table, along with the number of articles for each site.