Deployment Instructions

See the following sections for deployment instructions for Scrapinghub and ScrapydWeb.

Scrapinghub Deployment

Create a free account and create a project.

We will use the shub command-line tool to deploy. You can find your API key and deploy number (the project ID) on your project's Code & Deploys page.

Go back to the root of scrapy-tutorial (the root of the Scrapy project) and use the following commands to deploy your project to Scrapinghub.

(venv) dami:scrapy-tutorial harrywang$ shub login
Enter your API key from https://app.scrapinghub.com/account/apikey
API key: xxxxx
Validating API key...
API key is OK, you are logged in now.
(venv) dami:scrapy-tutorial harrywang$ shub deploy 404937
Messagepack is not available, please ensure that msgpack-python library is properly installed.
Saving project 404937 as default target. You can deploy to it via 'shub deploy' from now on
Saved to /Users/harrywang/xxx/scrapy-tutorial/scrapinghub.yml.
Packing version b6ac860-master
Created setup.py at /Users/harrywang/xxx/scrapy-tutorial
Deploying to Scrapy Cloud project "404937"
{"status": "ok", "project": 4xxx, "version": "b6ac860-master", "spiders": 3}
Run your spiders at: https://app.scrapinghub.com/p/404937/

A Scrapinghub configuration file, scrapinghub.yml, is created. You need to edit it to specify:

  • Scrapy 1.7 running on Python 3
  • a requirements file for other packages

project: 404937

stacks:
    default: scrapy:1.7-py3

requirements:
  file: requirements.txt

Run shub deploy to deploy again.

We have three spiders in the project:

  • quotes_spider.py is the main spider
  • quotes_spider_v1.py is version 1 of the spider, which writes to files, etc.
  • authors_spider.py is the spider from the official tutorial that scrapes the author pages

You can see your current deployment on scrapinghub.com.

Then, you can run your spider from the project page.

Once the job is complete, you can check the results and download the items.

You can schedule periodic jobs if you upgrade your free plan.

ScrapydWeb Deployment

I found the repo https://github.com/my8100/scrapydweb and followed https://github.com/my8100/scrapyd-cluster-on-heroku to set up the server.

We need a custom deployment because our Scrapy project has specific package requirements, e.g., SQLAlchemy, MySQL, etc. If no special packages are needed, you can follow the Easy Setup below.

Custom Setup

Set up repo and Heroku account

Fork a copy of https://github.com/my8100/scrapyd-cluster-on-heroku to your account, e.g., https://github.com/harrywang/scrapyd-cluster-on-heroku

Create a free account at heroku.com and install the Heroku CLI: brew tap heroku/brew && brew install heroku

Clone the repo:

git clone https://github.com/harrywang/scrapyd-cluster-on-heroku
cd scrapyd-cluster-on-heroku/

Log in to Heroku:

scrapyd-cluster-on-heroku harrywang$ heroku login
heroku: Press any key to open up the browser to login or q to exit:
Opening browser to https://cli-auth.heroku.com/auth/browser/3ba7221b-9c2a-4355-ab3b-d2csda
Logging in... done
Logged in as [email protected]

Set up Scrapyd server/app

In this step, you should update the runtime.txt to specify the Python version and requirements.txt to include all packages your spider needs.

After changes, runtime.txt is:

python-3.6

requirements.txt is:

pip>=19.1
#Twisted==18.9.0
scrapy
scrapyd>=1.2.1
scrapy-redis
logparser>=0.8.2

mysqlclient>=1.4.4
SQLAlchemy>=1.3.6
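
Before committing, it can help to double-check which packages the file actually requests, since commented-out lines such as #Twisted==18.9.0 are ignored. The following is a minimal sketch using only the standard library; parse_requirements is a hypothetical helper written for this note, not part of pip or Heroku, and it only handles the simple "name" and "name<operator>version" lines used above.

```python
def parse_requirements(text):
    """Return (name, specifier) pairs, skipping blank lines and '#' comments."""
    pairs = []
    for raw in text.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments and commented-out pins
        if not line:
            continue
        # Split at the first version-operator character, if there is one.
        for i, ch in enumerate(line):
            if ch in "<>=!~":
                pairs.append((line[:i].strip(), line[i:].strip()))
                break
        else:
            pairs.append((line, ""))
    return pairs


REQUIREMENTS = """\
pip>=19.1
#Twisted==18.9.0
scrapy
scrapyd>=1.2.1
scrapy-redis
logparser>=0.8.2

mysqlclient>=1.4.4
SQLAlchemy>=1.3.6
"""

for name, spec in parse_requirements(REQUIREMENTS):
    print(f"{name:15} {spec or '(any version)'}")
```

Twisted does not appear in the output because its line is commented out.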

Set up the repo and commit the changes we just made:

cd scrapyd
git init
git status
git add .
git commit -a -m "first commit"
git status

Deploy Scrapyd app

heroku apps:create scrapyd-server1
heroku git:remote -a scrapyd-server1
git remote -v
git push heroku master
heroku logs --tail
# Press ctrl+c to stop logs outputting
# Visit https://scrapyd-server1.herokuapp.com

Add environment variables

Timezone

# python -c "import tzlocal; print(tzlocal.get_localzone())"
heroku config:set TZ=US/Eastern
# heroku config:get TZ

Redis account (optional, not used in this tutorial; see settings.py in scrapy_redis_demo_project.zip)

heroku config:set REDIS_HOST=your-redis-host
heroku config:set REDIS_PORT=your-redis-port
heroku config:set REDIS_PASSWORD=your-redis-password

Repeat this step if multiple Scrapyd servers are needed.

Set up ScrapydWeb server/app

Go to the scrapydweb subfolder and update runtime.txt, requirements.txt, and scrapydweb_settings_v10.py if needed.

Let's enable authentication by editing the following section of scrapydweb_settings_v10.py:

# The default is False, set it to True to enable basic auth for the web UI.
ENABLE_AUTH = True
if os.environ.get('ENABLE_AUTH', 'False') == 'True':
    ENABLE_AUTH = True
# In order to enable basic auth, both USERNAME and PASSWORD should be non-empty strings.
USERNAME = 'admin'
PASSWORD = 'scrapydweb'
USERNAME = os.environ.get('USERNAME', 'admin')
PASSWORD = os.environ.get('PASSWORD', 'scrapydweb')
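
To make the effect of these lines concrete, here is a minimal sketch of how the values resolve at startup. resolve_auth is a hypothetical helper written for this note and is not part of ScrapydWeb; it only mirrors the edited section above, where ENABLE_AUTH is hard-coded to True and the credentials fall back to admin / scrapydweb when the environment variables are unset.

```python
import os


def resolve_auth(environ=os.environ):
    """Mirror the edited settings: auth is on, credentials come from env vars."""
    enable_auth = True  # hard-coded to True in the edited section above
    username = environ.get("USERNAME", "admin")
    password = environ.get("PASSWORD", "scrapydweb")
    # Basic auth only takes effect when both values are non-empty strings.
    effective = enable_auth and bool(username) and bool(password)
    return effective, username, password


print(resolve_auth({}))  # no env vars set: falls back to the defaults
print(resolve_auth({"USERNAME": "harry", "PASSWORD": "s3cret"}))
```

Because the defaults are visible in the public repo, you may prefer to override them on Heroku, e.g. with heroku config:set USERNAME=... PASSWORD=..., rather than rely on admin / scrapydweb.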

Then set up the repo and commit the changes:

cd ..
cd scrapydweb
git init
git status
git add .
git commit -a -m "first commit"
git status

Deploy ScrapydWeb app

heroku apps:create scrapyd-web
heroku git:remote -a scrapyd-web
git remote -v
git push heroku master

Add environment variables

Timezone

heroku config:set TZ=US/Eastern

Scrapyd servers: you have to use the address of the Scrapyd server you just set up above (see scrapydweb_settings_vN.py in the scrapydweb directory)

heroku config:set SCRAPYD_SERVER_1=scrapyd-server1.herokuapp.com:80
# heroku config:set SCRAPYD_SERVER_2=svr-2.herokuapp.com:80#group1
# heroku config:set SCRAPYD_SERVER_3=svr-3.herokuapp.com:80#group1
# heroku config:set SCRAPYD_SERVER_4=svr-4.herokuapp.com:80#group2
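
The values above follow a host:port#group pattern, where the #group suffix is optional. The sketch below shows one way such strings could be split apart. It is an illustration of the format only, not ScrapydWeb's actual parsing code; parse_server and collect_servers are hypothetical names, and the example hostnames are placeholders.

```python
def parse_server(value):
    """Split 'host:port#group' into parts; ':port' and '#group' are optional."""
    address, _, group = value.partition("#")
    host, sep, port = address.rpartition(":")
    if not sep:  # no ':' in the address, so it is all host
        host, port = address, ""
    return {
        "host": host,
        "port": int(port) if port else None,
        "group": group or None,
    }


def collect_servers(environ):
    """Read SCRAPYD_SERVER_1, SCRAPYD_SERVER_2, ... until a number is missing."""
    servers = []
    index = 1
    while f"SCRAPYD_SERVER_{index}" in environ:
        servers.append(parse_server(environ[f"SCRAPYD_SERVER_{index}"]))
        index += 1
    return servers


example_env = {
    "SCRAPYD_SERVER_1": "scrapyd-server1.herokuapp.com:80",
    "SCRAPYD_SERVER_2": "svr-2.herokuapp.com:80#group1",
}
for server in collect_servers(example_env):
    print(server)
```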

Deploy the scrapy project

We need to package the project and upload it to the server.

First, install scrapyd-client using pip install git+https://github.com/scrapy/scrapyd-client (note: a plain pip install scrapyd-client does not work as of this writing; see https://stackoverflow.com/questions/45750739/scrapyd-client-command-not-found)

Change the deploy settings in scrapy.cfg:

[deploy]
url = http://scrapyd-server1.herokuapp.com
username = admin
password = scrapydweb
project = scrapy-tutorial
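
As a quick sanity check before deploying, the [deploy] section can be read back with Python's standard configparser. This is a minimal sketch, assuming the section looks exactly like the one above; read_deploy_target is a hypothetical helper, and the only Scrapyd-specific detail it relies on is that uploads go to the addversion.json endpoint under the configured url.

```python
import configparser

SCRAPY_CFG = """\
[deploy]
url = http://scrapyd-server1.herokuapp.com
username = admin
password = scrapydweb
project = scrapy-tutorial
"""


def read_deploy_target(cfg_text):
    """Parse the [deploy] section and derive the upload endpoint from its url."""
    parser = configparser.ConfigParser()
    parser.read_string(cfg_text)
    target = dict(parser["deploy"])
    # The packaged project is uploaded to <url>/addversion.json on the Scrapyd server.
    target["endpoint"] = target["url"].rstrip("/") + "/addversion.json"
    return target


target = read_deploy_target(SCRAPY_CFG)
print(target["project"], "->", target["endpoint"])
```

To check a real project, read the file from disk instead, e.g. read_deploy_target(open("scrapy.cfg").read()).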

Then, use scrapyd-deploy to package and deploy to the Scrapyd server:

(venv) dami:scrapy-tutorial harrywang$ scrapyd-deploy
/Users/harrywang/sandbox/scrapy-tutorial/venv/lib/python3.6/site-packages/scrapyd_client/deploy.py:23: ScrapyDeprecationWarning: Module `scrapy.utils.http` is deprecated, Please import from `w3lib.http` instead.
  from scrapy.utils.http import basic_auth_header
Packing version 1566253506
Deploying to project "scrapy-tutorial" in http://scrapyd-server1.herokuapp.com/addversion.json
Server response (200):
{"node_name": "9177f699-b645-4656-82d1-beef2898fdc1", "status": "ok", "project": "scrapy-tutorial", "version": "1566253506", "spiders": 3}
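
The deployment can also be checked from a script through Scrapyd's JSON API, for example by asking listspiders.json which spiders the server now knows for the project. The sketch below only builds the request with the standard library; scrapyd_request is a hypothetical helper, and it assumes the server accepts HTTP basic auth with the same username and password as in scrapy.cfg. The network call is left commented out.

```python
import base64
import urllib.parse
import urllib.request


def scrapyd_request(base_url, endpoint, params, username, password):
    """Build a GET request for a Scrapyd JSON endpoint with HTTP basic auth."""
    url = f"{base_url.rstrip('/')}/{endpoint}?{urllib.parse.urlencode(params)}"
    token = base64.b64encode(f"{username}:{password}".encode()).decode()
    return urllib.request.Request(url, headers={"Authorization": f"Basic {token}"})


request = scrapyd_request(
    "http://scrapyd-server1.herokuapp.com",
    "listspiders.json",
    {"project": "scrapy-tutorial"},
    "admin",
    "scrapydweb",
)
print(request.full_url)

# Uncomment to query the live server (requires network access):
# import json
# with urllib.request.urlopen(request, timeout=10) as response:
#     print(json.load(response))
```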

Go to https://scrapyd-web.herokuapp.com; you should see your project deployed.

Then run the spider from the ScrapydWeb UI.

Once the spider finishes, you can check the items in the Files menu.

You can specify Timer Tasks, for example a task that runs every 10 minutes. This feature is based on APScheduler; see its documentation to figure out how to set the values (this can be confusing).
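
One source of confusion is the cron-style step syntax. Assuming the 10-minute task is expressed with a minute value of */10 (an assumption, since the exact form values are not reproduced here), the sketch below shows which minutes of each hour that expression selects. expand_field is a hypothetical helper covering only the "*", "*/N", and comma-separated forms; it is not APScheduler code.

```python
def expand_field(expr, low, high):
    """Expand a cron-style field ('*', '*/N', or 'A,B,C') into matching values."""
    if expr == "*":
        return list(range(low, high + 1))
    if expr.startswith("*/"):
        step = int(expr[2:])
        # '*/N' means every N-th value, counting from the field's minimum.
        return list(range(low, high + 1, step))
    return sorted(int(part) for part in expr.split(","))


print(expand_field("*/10", 0, 59))  # minutes on which the task would fire
```

With */10 in the minute field and * in the hour field, the task would fire at minutes 0, 10, 20, 30, 40, and 50 of every hour.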

Easy Setup

Use the settings shown in the screenshots (no Redis settings); the app is at scrapyd-server1.herokuapp.com.

[Screenshots of the settings, 2019-08-19]

Then package the project and upload it to the server exactly as described in "Deploy the scrapy project" above: install scrapyd-client, update the [deploy] section of scrapy.cfg, run scrapyd-deploy, and go to https://scrapyd-web.herokuapp.com to see your project deployed.