
Use correct urls key for crawled urls. #105

Merged: 2 commits into develop from feature/urls-crawl-key on Oct 20, 2019
Conversation

stooit (Contributor) commented on Oct 17, 2019

Issue URL: None

Changed

  1. Ensure crawled URLs exist under the correct `urls` key in the output YAML so these files can be used immediately via `urls_file` in config (see the sketch below).
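For context, a minimal sketch of the intended round trip. The type id (`media`), the file name, and the URLs are illustrative rather than taken from the PR; the key change itself is in the diff below.

```yaml
# Example crawler output (crawled-urls-media.yml) after this change:
# the URL list now sits under "urls", the key the urls_file loader
# expects, rather than under the type id ("media:").
urls:
  - /about-us
  - /sites/default/files/annual-report.pdf
```

```yaml
# Pointing a config at the crawled file (layout illustrative).
urls_file: crawled-urls-media.yml
```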

```diff
@@ -140,13 +140,13 @@ public function crawled(

       if ($type->match($url_string, $response)) {
         // Only match on the first option.
-        $this->json->mergeRow("crawled-urls-{$type->getId()}", $type->getId(), [$return_url], true);
+        $this->json->mergeRow("crawled-urls-{$type->getId()}", 'urls', [$return_url], true);
         return;
       }
     }//end foreach

     // Add this to the default group if it doesn't match.
```
AlexSkrypnyk (Contributor) commented on the diff:
@SRowlands Is this comment still relevant?

stooit (Contributor, Author) replied:

@AlexSkrypnyk Yes, still relevant; it relates to the fact that there may be multiple matches and we return after the first.

It's probably self-evident and could be removed, but that's not really part of this PR.
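For anyone skimming the thread, a condensed sketch of the flow that comment describes, abridged from the hunk above. The `$types` collection and the enclosing method body are assumed context, not shown in the diff:

```php
foreach ($types as $type) {
  if ($type->match($url_string, $response)) {
    // First matching type wins: the URL is recorded under that
    // type's output file and we return immediately, so later types
    // that would also have matched are never consulted.
    $this->json->mergeRow("crawled-urls-{$type->getId()}", 'urls', [$return_url], true);
    return;
  }
}
// Reached only when no type matched: the URL falls through to the default group.
```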

stooit merged commit a2fc83e into develop on Oct 20, 2019.
stooit pushed a commit that referenced this pull request on Apr 9, 2020:
* Ensure that media types are added to output filenames.
* Use file not node.
* Origin/feature/issue 5: runtime flag for limiting number of pages for migrate generate command (#79)
* Set crawl and migrate limits using an optional runtime flag
* Documentation for -l flag for Crawler
* Documentation for -l limit flag on migrate generate command
* Refactor "-l/--limit" to "-n/--number"
* Make runWeb/RunXml consistent.
Remove requirement for urls/url_file in XML parsing.
* Tweak docs.
* Fixes #81 trailing slash (#82)
* Fix docs so media appears in sidebar (#87)
* Ensure unique filenames for logging. (#89)
Resolve issue where errors were only reported once.
* Adds support for inline links via linkit. (#91)
* Adds support for inline links via linkit.
* Improved test coverage for linkit plugin.
* Feature/issue 40 multiple selectors (#92)
* Multiple selectors for config map
Introduces multiple selectors and multiple target fields per mapping; works towards #40 by allowing you to re-use mappings
* Multiple selectors documentation
* Tests for various multiple selector and field modes
* Spidercache (#85)
* Add crawler test.
* Add group testing.
- Adds tear down to remove output files to remove chance of false-positives.
- Adds group testing.
- Adds additional pages to the test server.
* Use sys_get_temp_dir().
* Remove cleanup to debug CI
* Add sleep and php server from the right directory.
* Strip trailing slash, if present, from domain for cache store dir
* WIP but functional spider-crawler cache
Work towards #74. Spider-crawler now caches results for much faster reruns. Note this introduces breaking changes in the Fetcher cache, which now stores and expects JSON data, not just the HTML content. This allows extra info to be stored along with the content and is needed for the spider-crawler cache.
* Resolved URL count.
* Added no-cache runtime flag.
* Code standards cleanup.
* cache_enabled flag docs
* Fixes cache write before grouping logic
* Detect urls with duplicate content in crawler-spider
* crawler duplicate flag docs
* Refactoring to allow common php server base for tests
* Crawler tests using common php server class
* Cache update to allow recursive removal of a domain's cached content
* Crawler tests based on common php server including caching and duplicate options
* clear output dirs on runs
* Stop server on start to reset if running
* Provide starting point URL/s for the crawler [Feature/issue 83] (#94)
* Support starting points for crawler
* Documentation for starting URLs
* Tests for urls option in crawler
* Write crawler tests for URL when provided as a string
* Reduce delays between requests to shave seconds off test time
* Force add gitignored files
* Update crawler configuration options.
Expose a number of crawler configuration options so they can be controlled via the crawler configuration yaml file (see the sketch after this list).
- Adds timeout
- Adds connect_timeout
- Adds verify
- Adds cookies
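A minimal sketch of how those options might sit in the crawler yaml. The option names come from the commit bullets above and mirror Guzzle request options; the nesting and example values are assumptions, not taken from the repository:

```yaml
options:
  timeout: 30          # seconds before an in-flight request is abandoned
  connect_timeout: 10  # seconds to wait for the TCP connection
  verify: false        # disable TLS certificate verification
  cookies: true        # persist cookies across requests
```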

* Added PR template. (#102)
* Fixed several punctuation issues and nl2br example. (#101)
* Added CI badge.
* Support for separate parent menu selector. (#103)
* Ensure filenames are decoded. (#104)
* Use correct urls key for crawled urls. (#105)
* Use guzzle to resolve relative urls.
* Allow cache_dir override in spider.
* Exclude external media assets by default. (#112)
* Exclude external media assets by default.
* Added entity_type requirement on spider. (#114)
* Output crawler logs specific to entity type.
* Update filepaths to reflect entity_type.
* Feature/issue 113 - Support for arrayed urls_file (multiple URL files) (#116)
* Functionality and tests for providing urls_file as an array (illustrated below)
* Documentation for multiple urls_files
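Since it dovetails with the urls key fix in this PR, a minimal illustration of the arrayed form; the file names are hypothetical:

```yaml
# A single file, as before:
urls_file: crawled-urls-page.yml

# Or several crawled-URL files at once (per #116):
urls_file:
  - crawled-urls-page.yml
  - crawled-urls-media.yml
```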
* Fixed error when menu parent could not be found.
* Added support for global restrict by inclusion. (#117)
* Feature/issue 110 multiple selectors error reporting + 115 performance changes (#120)
* Implementation of saving redirect source and destination results for crawler and fetcher
* Change from collection to arrays for queues #115
* Check for UTF-8 encoding issues in HTML content before JSON-encoding it for the cache
* Added support for custom boolean yes/no values.
* Provides cache purge for a single URL or domain from migrate cmd line
* Change default fetcher to multicurl
* Update tests to account for redirects
* Rejigged redirect support. Raw headers and redirect info are now also cached so the redirect reports can be built from cache for crawler and fetcher.
* Slight change in status code behaviour
* Migrate reporting: work towards #109; also improves #93
* Split duplicate log by entity_type.
* Added better xpath support for ordered type.
* Added html template for report
* Support for port in host name
* Reporting and Cache Tools documentation
* Split exclusion/inclusion lists by crawl and url output.
* Allow processors to alter UUID value.
* Alter output of include/exclude warnings.
* Improved options for reporting: media can use a list of files to merge, or wildcard patterns.
* Updated inflateMappings to remove case of selector array, now handled by TypeBase.  Fixes #110.
* Adds new url_options to strip script tags or other patterns from raw HTML using regex
* Ignore case when building menu uuid.
* Moves crawler redirect detection outside of the cache_enabled check
* Prepare merlin for the wild. (#125)
* Prepare merlin for the wild.
- Adds some niceness to the Composer file.
- Updates the namespace to Merlin.
- Renames the migrate executable to merlin.
* Update README.md
* Add licenses for dependencies.

Co-authored-by: Steve Worley <[email protected]>
Co-authored-by: Nick Georgiou <[email protected]>
Co-authored-by: Andy Rowlands <[email protected]>
Co-authored-by: Alex Skrypnyk <[email protected]>
stooit deleted the feature/urls-crawl-key branch on April 27, 2020.