Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Indexation in scheduled mode process same products more than once #37916

Closed
1 of 5 tasks
gimanik opened this issue Aug 24, 2023 · 3 comments
Closed
1 of 5 tasks

Indexation in scheduled mode process same products more than once #37916

gimanik opened this issue Aug 24, 2023 · 3 comments

Comments

@gimanik
Copy link

gimanik commented Aug 24, 2023

Preconditions and environment

  • 2.4.5-p3 (but it's older problem and still exists on development branch)
  • Indexers in scheduled mode
  • Default batch size: 1000
  • In indexer change-log more than 1000 unique entity_id's to be indexed and lot of duplicated records (preferable 10 000+)

Steps to reproduce

Run indexer in mview mode, for example by running cron job indexer_update_all_views.
Example based on catalog_product_price index:
Current version_id = 40000
Last version_id = 66630
Index schedule_status (unique product_ids): idle (1093 in backlog)

Expected result

Only a total of 1093 products should be indexed, each product once.

Actual result

Some products are indexed more than once.
Magento\Framework\Mview\View uses Magento\Framework\Mview\View\ChangeLogBatchWalker to iterate through collection
However first batch is taken by select:
SELECT DISTINCT `catalog_product_price_cl`.`entity_id` FROM `catalog_product_price_cl` WHERE (version_id > 40000) AND (version_id <= 66630) GROUP BY `entity_id` LIMIT 1000
which returns 1000 IDs which are passed to indexer and indexed.
Second batch:
SELECT DISTINCT `catalog_product_price_cl`.`entity_id` FROM `catalog_product_price_cl` WHERE (version_id > 41000) AND (version_id <= 66630) GROUP BY `entity_id` LIMIT 1000
which also returns 1000 IDs which are passed to indexer and indexed.
This is because we take unique IDs from whole range, and we increase only vsFrom by batchSize (@see Magento\Framework\Mview\View)
In my case to index those unique 1093 products it needed to run 27 batches, each batch processed this amount of products:
1000
1000
1000
968
927
885
843
802
760
718
677
635
593
552
510
468
427
385
343
302
260
218
177
135
93
52
12

Which adds up to total of 14742 indexer executions, so the last 12 products were reindexed 27 times in each batch.

Additional information

Magento\Framework\Mview\View
Each batch query:
Current: $vsFrom to $currentVersionId
Should be: $vsFrom to ($vsFrom + $batchSize).
So instead of batch queries:

WHERE (version_id > 40000) AND (version_id <= 66630)
WHERE (version_id > 41000) AND (version_id <= 66630)
WHERE (version_id > 42000) AND (version_id <= 66630)
WHERE (version_id > 43000) AND (version_id <= 66630)
WHERE (version_id > 44000) AND (version_id <= 66630)
WHERE (version_id > 45000) AND (version_id <= 66630)
WHERE (version_id > 46000) AND (version_id <= 66630)
WHERE (version_id > 47000) AND (version_id <= 66630)
WHERE (version_id > 48000) AND (version_id <= 66630)
WHERE (version_id > 49000) AND (version_id <= 66630)
WHERE (version_id > 50000) AND (version_id <= 66630)
WHERE (version_id > 51000) AND (version_id <= 66630)
WHERE (version_id > 52000) AND (version_id <= 66630)
WHERE (version_id > 53000) AND (version_id <= 66630)
WHERE (version_id > 54000) AND (version_id <= 66630)
WHERE (version_id > 55000) AND (version_id <= 66630)
WHERE (version_id > 56000) AND (version_id <= 66630)
WHERE (version_id > 57000) AND (version_id <= 66630)
WHERE (version_id > 58000) AND (version_id <= 66630)
WHERE (version_id > 59000) AND (version_id <= 66630)
WHERE (version_id > 60000) AND (version_id <= 66630)
WHERE (version_id > 61000) AND (version_id <= 66630)
WHERE (version_id > 62000) AND (version_id <= 66630)
WHERE (version_id > 63000) AND (version_id <= 66630)
WHERE (version_id > 64000) AND (version_id <= 66630)
WHERE (version_id > 65000) AND (version_id <= 66630)
WHERE (version_id > 66000) AND (version_id <= 66630)

Each batch query should look like:

WHERE (version_id > 40000) AND (version_id <= 41000)
WHERE (version_id > 41000) AND (version_id <= 42000)
WHERE (version_id > 42000) AND (version_id <= 43000)
WHERE (version_id > 43000) AND (version_id <= 44000)
WHERE (version_id > 44000) AND (version_id <= 45000)
WHERE (version_id > 45000) AND (version_id <= 46000)
WHERE (version_id > 46000) AND (version_id <= 47000)
WHERE (version_id > 47000) AND (version_id <= 48000)
WHERE (version_id > 48000) AND (version_id <= 49000)
WHERE (version_id > 49000) AND (version_id <= 50000)
WHERE (version_id > 50000) AND (version_id <= 51000)
WHERE (version_id > 51000) AND (version_id <= 52000)
WHERE (version_id > 52000) AND (version_id <= 53000)
WHERE (version_id > 53000) AND (version_id <= 54000)
WHERE (version_id > 54000) AND (version_id <= 55000)
WHERE (version_id > 55000) AND (version_id <= 56000)
WHERE (version_id > 56000) AND (version_id <= 57000)
WHERE (version_id > 57000) AND (version_id <= 58000)
WHERE (version_id > 58000) AND (version_id <= 59000)
WHERE (version_id > 59000) AND (version_id <= 60000)
WHERE (version_id > 60000) AND (version_id <= 61000)
WHERE (version_id > 61000) AND (version_id <= 62000)
WHERE (version_id > 62000) AND (version_id <= 63000)
WHERE (version_id > 63000) AND (version_id <= 64000)
WHERE (version_id > 64000) AND (version_id <= 65000)
WHERE (version_id > 65000) AND (version_id <= 66000)
WHERE (version_id > 66000) AND (version_id <= 66630)

Release note

No response

Triage and priority

  • Severity: S0 - Affects critical data or functionality and leaves users without workaround.
  • Severity: S1 - Affects critical data or functionality and forces users to employ a workaround.
  • Severity: S2 - Affects non-critical data or functionality and forces users to employ a workaround.
  • Severity: S3 - Affects non-critical data or functionality and does not force users to employ a workaround.
  • Severity: S4 - Affects aesthetics, professional look and feel, “quality” or “usability”.
@m2-assistant
Copy link

m2-assistant bot commented Aug 24, 2023

Hi @gimanik. Thank you for your report.
To speed up processing of this issue, make sure that the issue is reproducible on the vanilla Magento instance following Steps to reproduce. To deploy vanilla Magento instance on our environment, Add a comment to the issue:


Join Magento Community Engineering Slack and ask your questions in #github channel.
⚠️ According to the Magento Contribution requirements, all issues must go through the Community Contributions Triage process. Community Contributions Triage is a public meeting.
🕙 You can find the schedule on the Magento Community Calendar page.
📞 The triage of issues happens in the queue order. If you want to speed up the delivery of your contribution, join the Community Contributions Triage session to discuss the appropriate ticket.

@gimanik gimanik changed the title Indexation in scheduled mode process same products Indexation in scheduled mode process same products more than once Aug 24, 2023
@hostep
Copy link
Contributor

hostep commented Aug 24, 2023

Hi @gimanik, this has already been reported before a couple of times:

The issue is not only that certain ID's are indexed multiple times, it's also that certain ID's are being missed, which is even worse.

A solution exists here: #36155, it just needs to get merged.

@gimanik
Copy link
Author

gimanik commented Aug 24, 2023

Ok great. I've found only #37398 which reports different problem - it's about entity_id ordering, which is also a problem.
But indeed #30012 is about same problem as mine.

@gimanik gimanik closed this as completed Aug 24, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

2 participants