Merlin 0.4.0 #129

stooit · 2020-04-09T21:37:49Z

No description provided.

Ensure that media types are added to output filenames.

… migrate generate command (#79)

* Set crawl and migrate limits using an optional runtime flag
* Documentation for -l flag for Crawler
* Documentation for -l limit flag on migrate generate command
* Refactor "-l/--limit" to "-n/--number"
* Revert "Refactor "-l/--limit" to "-n/--number"". This reverts commit 8b03d72.
* Make runWeb/RunXml consistent. Remove requirement for urls/url_file in XML parsing.
* Tweak docs.

Resolve issue where errors were only reported once.

* Adds support for inline links via linkit.
* Improved test coverage for linkit plugin.

* Multiple selectors for config map. Introduces multiple selectors and multiple target fields per mapping; work towards #40, allowing you to re-use mappings.
* phpcs
* Remove debug print
* Multiple selectors documentation
* Tests for various multiple selector and field modes

* Add crawler test.
* Add group testing.
  - Adds tear down to remove output files to remove chance of false-positives.
  - Adds group testing.
  - Adds additional pages to the test server.
* Use sys_get_temp_dir().
* Remove cleanup to debug CI
* Add sleep and php server from the right directory.
* Strip trailing slash if exists on domain for cache store dir
* WIP but functional spider-crawler cache. Work towards #74. Spider-crawler now caches results for much faster reruns. Note this introduces breaking changes in the Fetcher cache, which now stores and expects JSON data, not just the HTML content. This allows extra info to be stored along with the content and is needed for the spider-crawler cache.
* A missing ingredient
* Resolved URL count.
* Added no-cache runtime flag.
* Code standards cleanup.
* cache_enabled flag docs
* Fixes cache write before grouping logic
* Detect urls with duplicate content in crawler-spider
* Detect urls with duplicate content in crawler-spider
* crawler duplicate flag docs
* Remove newline
* Refactoring to allow common php server base for tests
* Crawler tests using common php server class
* phpdoc fixes
* Cache update to allow recursive removal of a domain's cached content
* Crawler tests based on common php server including caching and duplicate options
* clear output dirs on runs
* Stop server on start to reset if running
* Move sleep, increase
* Re-add html pages
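The duplicate-content detection listed above can be illustrated with a content hash. This is a minimal Python sketch under stated assumptions, not Merlin's actual implementation (Merlin itself is a PHP project): the function name `find_duplicates`, the use of SHA-256, and the byte-for-byte comparison of bodies are all hypothetical choices made for illustration.

```python
import hashlib


def find_duplicates(pages):
    """Group URLs whose response bodies are byte-for-byte identical.

    `pages` maps URL -> HTML body (str). Returns a dict mapping the first
    URL seen for each body to the list of later URLs with the same body.
    (Hypothetical helper; not part of Merlin.)
    """
    first_seen = {}   # content hash -> first URL with that content
    duplicates = {}   # first URL -> later URLs sharing its content
    for url, body in pages.items():
        digest = hashlib.sha256(body.encode("utf-8")).hexdigest()
        if digest in first_seen:
            duplicates.setdefault(first_seen[digest], []).append(url)
        else:
            first_seen[digest] = url
    return duplicates


pages = {
    "https://example.com/": "<h1>Home</h1>",
    "https://example.com/index.html": "<h1>Home</h1>",
    "https://example.com/about": "<h1>About</h1>",
}
print(find_duplicates(pages))
```

Hashing keeps memory use flat for large crawls because only the digest of each body is retained, not the body itself.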

* Support starting points for crawler
* Documentation for starting URLs

Provide tests for the urls option in crawler config [#83]

Expose a number of crawler configuration options so they can be controlled via the crawler configuration YAML file.

- Adds timeout
- Adds connect_timeout
- Adds verify
- Adds cookies

Update crawler configuration options.

* Ensure filenames are decoded.
* Fix phpcs.
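As an illustration of what decoding filenames involves, the sketch below percent-decodes the last path segment of a URL. It is a Python sketch under assumptions, not the code from this change; `decoded_filename` is a hypothetical helper name.

```python
import posixpath
from urllib.parse import unquote, urlsplit


def decoded_filename(url):
    """Return the last path segment of `url` with percent-escapes decoded.

    The path is split before decoding so that an encoded slash (%2F)
    inside a filename is not mistaken for a path separator.
    """
    path = urlsplit(url).path
    return unquote(posixpath.basename(path))


print(decoded_filename("https://example.com/files/annual%20report%202019.pdf"))
```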

* Use correct urls key for crawled urls.
* Fix tests.

Use guzzle to resolve relative urls.
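For readers unfamiliar with relative URL resolution, the behaviour can be sketched as follows. The change itself relies on Guzzle; this example swaps in Python's standard `urllib.parse.urljoin` purely for illustration, and the `resolve` wrapper and the example URLs are assumptions.

```python
from urllib.parse import urljoin


def resolve(base, reference):
    """Resolve `reference` (relative or absolute) against the page URL `base`."""
    return urljoin(base, reference)


base = "https://example.com/docs/guide/index.html"
for ref in ("page.html", "../images/logo.png", "/about", "https://other.example/x"):
    print(ref, "->", resolve(base, ref))
```

Delegating to a library resolver avoids hand-rolled string handling for cases such as `../` segments and root-relative paths.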

* Exclude external media assets by default.

* Added entity_type requirement on spider.

#116)

* Functionality and tests for providing urls_file as an array
* Documentation for multiple urls_files

…e changes (#120)

* Implementation of saving redirect source and destination results for crawler and fetcher
* Change from collection to arrays for queues #115
* Check for UTF8 encoded issues in html content prior to json for cache
* Added support for custom boolean yes/no values.
* Provides cache purge for a single URL or domain from migrate cmd line
* Change default fetcher to multicurl
* Update tests to account for redirects
* Rejigged redirect support. Raw headers and redirect info is now also cached so the redirect reports can be built from cache for crawler and fetcher.
* Slight change in status code behaviour
* Migrate reporting work towards #109 and improves #93
* Split duplicate log by entity_type.
* Added better xpath support for ordered type.
* Added html template for report
* Support for port in host name
* Reporting and Cache Tools documentation
* Split exclusion/inclusion lists by crawl and url output.
* Allow processors to alter UUID value.
* Alter output of include/exclude warnings.
* Improved options for reporting media can use list of files to merge or wildcard patterns.
* Updated inflateMappings to remove case of selector array, now handled by TypeBase. Fixes #110.
* Adds new url_options to strip script tags or other pattern from raw html using regex
* Ignore case when building menu uuid.
* Moves crawler redirect detection outside of cache enabled
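One item above strips script tags or other patterns from raw HTML using regex. A rough Python sketch of that idea follows; the pattern, the `strip_patterns` name, and its signature are assumptions for illustration and do not reflect Merlin's actual `url_options` format.

```python
import re

# Matches a <script> element including its contents, across newlines.
SCRIPT_TAG = re.compile(r"<script\b[^>]*>.*?</script\s*>", re.IGNORECASE | re.DOTALL)


def strip_patterns(html, patterns=(SCRIPT_TAG,)):
    """Remove every match of each compiled pattern from the raw HTML string."""
    for pattern in patterns:
        html = pattern.sub("", html)
    return html


raw = '<p>a</p><script type="text/javascript">\nalert(1)\n</script><p>b</p>'
print(strip_patterns(raw))
```

Regex-based stripping is a pre-processing convenience rather than a full HTML parser, so unusual markup may need a more specific pattern.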

* Prepare merlin for the wild.
  - Adds some niceness to the Composer file.
  - Updates the namespace to Merlin.
  - Renames the migrate executable to merlin.
* Update README.md
* Add licenses for dependencies.

steveworley and others added 30 commits September 4, 2019 13:28

Ensure that media types are added to output filenames.

35651cc

Use file not node.

c9ebfda

Merge pull request #80 from salsadigitalauorg/feature/media-type-file

ccad69c

Fixes #81 trailing slash (#82)

f7a4053

Fix docs so media appears in sidebar (#87)

d9091be

Ensure unique filenames for logging. (#89)

a5c5910

Adds support for inline links via linkit. (#91)

76bcda2

Provide starting point URL/s for the crawler [Feature/issue 83] (#94)

10ccf7f

Tests for urls option in crawler

c98ffc9

Write crawler tests for URL when provided as a string

c71fa14

Reduce delays between requests to shave seconds off test time

15ff76c

Force add gitignored files

23ebb44

Merge pull request #95 from salsadigitalauorg/feature/issue-83-tests

e036573

Update crawler configuration options.

8c4b99f

Merge pull request #96 from salsadigitalauorg/crawler-timeout

5834886

Added PR template. (#102)

90efae4

Fixed several punctuation issues and nl2br example. (#101)

f0c45b6

Added CI badge.

59f88f4

Support for separate parent menu selector. (#103)

37593c5

Ensure filenames are decoded. (#104)

2afd4ac

Use correct urls key for crawled urls. (#105)

a2fc83e

Use guzzle to resolve relative urls.

89f380b

Merge pull request #111 from salsadigitalauorg/feature/guzzle-resolve

3d8dd48

Allow cache_dir override in spider.

f250b88

Exclude external media assets by default. (#112)

19f2457

Added entity_type requirement on spider. (#114)

8d3b9a6

Output crawler logs specific to entity type.

7386144

stooit and others added 6 commits November 7, 2019 13:59

Update filepaths to reflect entity_type.

8e5a986

Feature/issue 113 - Support for arrayed urls_file (multiple URL files) (#116)

fcbe36a

Fixed error when menu parent could not be found.

2e0ddf2

Added support for global restrict by inclusion. (#117)

db54733

Prepare merlin for the wild. (#125)

defb702

stooit merged commit 1fe152d into master Apr 9, 2020
