-
FYI, I converted the issue to a discussion since it’s more of a question than an issue. In general, questions like this are much easier to answer if you provide a link to the actual site you’re trying to scrape; otherwise it’s difficult to fully understand the scenario. I currently don’t have my laptop with me, so I can’t provide a more comprehensive answer right now. I’ll try to get back to this in the next couple of days.
-
I just tagged a new release of Roach which contains this PR, which makes it possible to have item processors respond only to certain types of items. In this case, I would suggest you have only a single spider and dispatch multiple requests for the different pages, each with its own parsing callback.

To answer your other question about passing data between different requests: you can store arbitrary meta data on a request, which you can then access in your parsing callback via $response->getRequest()->getMeta().

One thing you will have to work around is that item processors don't return anything to the spider, so you won't be able to get the model ID back after saving it to the database, for instance. Instead of relying on the database PK, I would suggest you generate a UUID in your spider and use that to reference models in the database.

So putting it all together, you could write something like this:

A custom Country item:

final class Country extends AbstractItem
{
    public function __construct(
        public readonly UuidInterface $countryID,
        public readonly string $name,
    ) {
    }
}

A custom City item:

final class City extends AbstractItem
{
    public function __construct(
        public readonly UuidInterface $countryID,
        public readonly string $name,
        public readonly int $population,
    ) {
    }
}

An item processor that only handles countries and saves them to the database:

final class SaveCountryToDatabaseProcessor extends CustomItemProcessor
{
    /**
     * @param Country $item
     */
    public function processItem(ItemInterface $item): ItemInterface
    {
        CountryModel::create([
            'uuid' => $item->countryID,
            'name' => $item->name,
        ]);

        // Hand the item back so the declared return type is satisfied
        return $item;
    }

    protected function getHandledItemClasses(): array
    {
        return [
            Country::class,
        ];
    }
}

A custom city processor that saves the city to the database and relates it to the country:

final class SaveCityToDatabaseProcessor extends CustomItemProcessor
{
    /**
     * @param City $item
     */
    public function processItem(ItemInterface $item): ItemInterface
    {
        CityModel::create([
            'name' => $item->name,
            'population' => $item->population,
            // Or you could query for the country's database ID
            // based on the UUID and save that if you prefer
            'country_id' => $item->countryID,
        ]);

        // Hand the item back so the declared return type is satisfied
        return $item;
    }

    protected function getHandledItemClasses(): array
    {
        return [
            City::class,
        ];
    }
}

And finally the spider class. From what I understood, you would be starting on the city page, so the first thing we do is extract the URL for the country and dispatch a new request. We also generate a UUID for the country, which we attach to the request's meta data by using withMeta(). In parseCountry, we parse the country and yield a Country item. Since we generated the UUID ourselves, we can also pass it along to the city we scraped so we can use it in the item processor.

class MySpider extends BasicSpider
{
    public function parse(Response $response): Generator
    {
        $countryURL = /* extract URL from page */;
        $countryRequest = new Request('GET', $countryURL, $this->parseCountry(...));
        $countryUUID = Uuid::uuid4();

        yield ParseResult::fromValue(
            $countryRequest->withMeta('countryID', $countryUUID),
        );

        $city = new City(
            $countryUUID,
            /* extract other data from response */
        );

        yield $this->item($city);
    }

    public function parseCountry(Response $response): Generator
    {
        $country = new Country(
            $response->getRequest()->getMeta('countryID'),
            /* extract other data from response */
        );

        yield $this->item($country);
    }
}

I hope this helped.
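To make the data flow a bit more concrete, here is a rough, framework-free sketch of the idea: generate the ID up front, attach it to a queued request, and let both items carry the same ID. Everything in it (FakeRequest, CityItem, CountryItem, parseCity, parseCountry, crawl) is a made-up stand-in for illustration and not part of Roach, and random_bytes only stands in for a real UUID library. The crawl loop is a deliberately simplified model of deferred requests, not how Roach actually schedules them.

```php
<?php

declare(strict_types=1);

// Hypothetical stand-in for a request: a URL, a parse callback and some meta data.
final class FakeRequest
{
    /** @param array<string, mixed> $meta */
    public function __construct(
        public readonly string $url,
        public readonly Closure $callback,
        public readonly array $meta = [],
    ) {
    }
}

// Hypothetical item types; both reference the country by the same generated ID.
final class CityItem
{
    public function __construct(
        public readonly string $countryID,
        public readonly string $name,
    ) {
    }
}

final class CountryItem
{
    public function __construct(
        public readonly string $countryID,
        public readonly string $name,
    ) {
    }
}

function parseCity(): Generator
{
    // Generate the shared ID ourselves instead of waiting for a database key.
    $countryID = bin2hex(random_bytes(16));

    yield new FakeRequest('https://example.com/country', parseCountry(...), ['countryID' => $countryID]);
    yield new CityItem($countryID, 'Example City');
}

function parseCountry(FakeRequest $request): Generator
{
    // Read the ID back from the meta data attached to the request.
    yield new CountryItem($request->meta['countryID'], 'Example Country');
}

/** Collects items in the order they are produced; requests are queued and run afterwards. */
function crawl(): array
{
    $items = [];
    $queue = [];

    foreach (parseCity() as $result) {
        if ($result instanceof FakeRequest) {
            $queue[] = $result; // deferred, not dispatched immediately
        } else {
            $items[] = $result;
        }
    }

    foreach ($queue as $request) {
        foreach (($request->callback)($request) as $item) {
            $items[] = $item;
        }
    }

    return $items;
}

$items = crawl();
```

In this simplified model the city item shows up before the country item, yet both carry the same ID, which is why referencing the country by a self-generated UUID works regardless of the order in which the rows are saved.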
-
One thing to keep in mind is that requests don't immediately get dispatched when yielding them from a spider. This means that the |
-
Hello,
I am taking a look at this package, using the Laravel integration, actually.
I have a specific need. I would like to scrape a page, but the page has three components: a country, a city, and the city's citizens.
The country and citizen components are provided via links on the main page, with the latter possibly having a few bits of data that the main citizen page might not provide.
I'd like to parse the city page, but I'm not quite sure how to handle processing everything.
I would like to be able to grab the country link, creating it or updating it in my database, and then passing its ID along to the city processing, so that the city can be inserted/updated to become a part of the country.
Finally, I'd like to be able to process any of the citizens, again inserting/updating them in my database, as needed, all with reference to the city ID to which they belong. (And, ideally, handling some of the "extra" bits of data that might exist on the main city page.)
I can't quite figure out if I should have three different spiders, with the city spider calling the country and citizen spiders...or one city spider with different parser methods...? 🤔 In any case, I can't figure out how to pass the country/city database IDs along...and in the case of a single spider, I can't figure out how to make the item processors process one component versus another.
Any help or suggestions? I took a look at https://github.com/ksassnowski/roach-example-project, which was quite helpful, in general, but I didn't see how it could help with these particular problems.
Thanks in advance for your help. 🤓