
[Bug] Huge slowdown of dbt initialization for projects with many tests since v0.21 #4135

Closed
dvalchev opened this issue Oct 26, 2021 · 9 comments · Fixed by #4155
Labels: bug, performance, regression

Comments

@dvalchev

Is there an existing issue for this?

  • I have searched the existing issues

Current Behavior

This is a continuation of the closed #4012.
Since upgrading from v0.20 to v0.21, dbt startup time is severely increased when the project has a lot of schema tests. In one of our projects (343 models, 3882 tests), the startup time increased from 7 seconds to over 3 minutes! Using the timing recording, I managed to track the root cause down to #4022 (compilation.py:114(<listcomp>)). In addition, digraph.py:638(add_edges_from) also performs significantly worse.
With #4012, the first issue seems to be resolved and startup gets faster, but it is still impacted by digraph.py:638(add_edges_from).
Disabling all tests makes performance on par with 0.20.

Here are the details from testing each version. I used a simple dbt compile on a pre-parsed project to showcase the issue:
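For reference, the timing files named in the Namespace output below (written via dbt's --record-timing-info flag) are cProfile dumps, which is where pstats-style names like compilation.py:114(<listcomp>) come from. A minimal way to inspect one:

```python
import pstats

# Load the profile written via `dbt -r <file> compile` and show the
# slowest functions by cumulative time.
stats = pstats.Stats("timimng-compile-0.21.txt")
stats.sort_stats("cumulative").print_stats(15)
```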

0.20.2

  • dbt.log
2021-10-26 13:09:56.378448 (MainThread): Running with dbt=0.20.2
2021-10-26 13:09:57.385906 (MainThread): running dbt with arguments Namespace(cls=<class 'dbt.task.compile.CompileTask'>, debug=False, defer=None, exclude=None, full_refresh=False, log_cache_events=False, log_format='default', models=None, parse_only=False, partial_parse=None, profile=None, profiles_dir='C:\\Users\\dvalchev\\.dbt', project_dir=None, record_timing_info='..\\timimng-compile-0.20.txt', rpc_method='compile', selector_name=None, single_threaded=False, state=None, strict=False, target=None, test_new_parser=False, threads=15, use_cache=True, use_colors=None, use_experimental_parser=False, vars='{}', version_check=True, warn_error=False, which='compile', write_json=True)
2021-10-26 13:09:57.387901 (MainThread): Tracking: do not track
2021-10-26 13:09:59.117617 (MainThread): Partial parsing enabled: 0 files deleted, 0 files added, 0 files changed.
2021-10-26 13:09:59.117617 (MainThread): Partial parsing enabled, no changes found, skipping parsing
2021-10-26 13:10:01.176648 (MainThread): write_gpickle is deprecated and will be removed in 3.0.Use ``pickle.dump(G, path, protocol)``
2021-10-26 13:10:01.239455 (MainThread): Found 343 models, 3882 tests, 0 snapshots, 7 analyses, 387 macros, 0 operations, 1 seed file, 5 sources, 0 exposures
2021-10-26 13:10:04.374412 (MainThread): 
  • Timings
    [screenshot: profiler timing breakdown]

0.21

  • dbt.log
2021-10-26 12:44:30.354647 (MainThread): Running with dbt=0.21.0
2021-10-26 12:44:31.292324 (MainThread): running dbt with arguments Namespace(cls=<class 'dbt.task.compile.CompileTask'>, debug=False, defer=None, exclude=None, full_refresh=False, log_cache_events=False, log_format='default', parse_only=False, partial_parse=None, profile=None, profiles_dir='C:\\Users\\dvalchev\\.dbt', project_dir=None, record_timing_info='..\\timimng-compile-0.21.txt', rpc_method='compile', select=None, selector_name=None, single_threaded=False, state=None, strict=False, target=None, test_new_parser=False, threads=15, use_cache=True, use_colors=None, use_experimental_parser=False, vars='{}', version_check=True, warn_error=False, which='compile', write_json=True)
2021-10-26 12:44:31.292324 (MainThread): Tracking: do not track
2021-10-26 12:44:32.226488 (MainThread): Partial parsing enabled: 0 files deleted, 0 files added, 0 files changed.
2021-10-26 12:44:32.227487 (MainThread): Partial parsing enabled, no changes found, skipping parsing
2021-10-26 12:46:55.233418 (MainThread): write_gpickle is deprecated and will be removed in 3.0.Use ``pickle.dump(G, path, protocol)``
2021-10-26 12:46:55.329506 (MainThread): Found 343 models, 3882 tests, 0 snapshots, 7 analyses, 408 macros, 0 operations, 1 seed file, 5 sources, 0 exposures
2021-10-26 12:47:59.318122 (MainThread): 
  • Timings
    [screenshot: profiler timing breakdown]

1.0.0-b2

  • dbt.log
2021-10-26 13:15:11.414604 (MainThread): Running with dbt=1.0.0-b2
2021-10-26 13:15:12.295335 (MainThread): * Deprecation Warning: The `source-paths` config has been deprecated in favor of
`model-paths`. Please update your `dbt_project.yml` configuration to reflect
this change.
2021-10-26 13:15:12.341246 (MainThread): running dbt with arguments Namespace(cls=<class 'dbt.task.compile.CompileTask'>, debug=None, defer=None, exclude=None, fail_fast=None, full_refresh=False, log_cache_events=False, log_format=None, parse_only=False, partial_parse=None, printer_width=None, profile=None, profiles_dir='C:\\Users\\dvalchev\\.dbt', project_dir=None, record_timing_info='..\\timimng-compile-1.00.txt', rpc_method='compile', select=None, selector_name=None, send_anonymous_usage_stats=None, single_threaded=False, state=None, static_parser=None, target=None, threads=15, use_colors=None, use_experimental_parser=None, vars='{}', version_check=False, warn_error=None, which='compile', write_json=None)
2021-10-26 13:15:12.342210 (MainThread): Tracking: do not track
2021-10-26 13:15:13.291790 (MainThread): Partial parsing enabled: 0 files deleted, 0 files added, 0 files changed.
2021-10-26 13:15:13.291790 (MainThread): Partial parsing enabled, no changes found, skipping parsing
2021-10-26 13:15:16.127705 (MainThread): write_gpickle is deprecated and will be removed in 3.0.Use ``pickle.dump(G, path, protocol)``
2021-10-26 13:15:16.210482 (MainThread): Found 343 models, 3882 tests, 0 snapshots, 7 analyses, 408 macros, 0 operations, 1 seed file, 5 sources, 0 exposures
2021-10-26 13:16:20.580208 (MainThread): 
  • Timings
    [screenshot: profiler timing breakdown]

Expected Behavior

dbt startup performance is on par with 0.20

Steps To Reproduce

Execute dbt compile or dbt run on a project with a large number of tests

Relevant log output

see above

Environment

- OS: Windows 10
- Python: 3.8.7
- dbt: 0.21+

What database are you using dbt with?

snowflake

Additional Context

No response

dvalchev added the bug and triage labels on Oct 26, 2021
@jtcohen6
Contributor

jtcohen6 commented Oct 26, 2021

@dvalchev Thanks for the detailed report!

I'm glad to see that the improvement in #4022 cut down the graph compilation time significantly, though I'm unpleasantly surprised to see this line is still chewing up 54.1s:

new_graph = nx.algorithms.transitive_closure(self.graph)

When I run a sample project with 2k models and 6k tests on my machine locally, using 1.0.0-b2, and output detailed timing information, I see that method taking 5.5s—still much longer than I'd like, but an order of magnitude faster than the performance you're seeing.

In order to achieve the test-blocking behavior desirable in dbt build, we need to create graph edges between upstream tests and downstream models. The performance improvement in #4022 made the process of adding those edges (linking nodes) much, much faster. Yet we now need to operate over a much bigger graph, one with many more edges. In this regard, we're limited by the performance of networkx, and the technical capabilities of each machine running dbt.
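To make that concrete, here's a toy sketch (invented node names and counts, not dbt's actual code) of the extra edge set and the closure that has to be computed over it:

```python
import networkx as nx

# Toy DAG shaped like the project in this issue: a chain of model layers,
# each model carrying several tests that downstream models depend on.
graph = nx.DiGraph()
layers, tests_per_model = 10, 10
for i in range(layers - 1):
    graph.add_edge(f"model_{i}", f"model_{i + 1}")
for i in range(layers):
    for t in range(tests_per_model):
        graph.add_edge(f"model_{i}", f"test_{i}_{t}")
        if i + 1 < layers:
            # The edge that makes tests block downstream models in `dbt build`.
            graph.add_edge(f"test_{i}_{t}", f"model_{i + 1}")

closure = nx.transitive_closure(graph)
print(len(graph.edges()), "->", len(closure.edges()))  # edge count explodes
```

With test nodes wired into the DAG, the closure must materialize a reachability edge for every (test, downstream node) pair, which is where the blow-up comes from.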

I think there are a few paths we could pursue:

  1. Add the extra edges for the build task only. This would add complexity to our task architecture, but it would restore previous behavior for other tasks, such as dbt compile.
  2. Apply selection criteria before passing the graph into networkx methods. This wouldn't help performance for a full dbt build, but it would be much better in development, where you're likely running a handful of models and tests.
  3. Save the graph as an intermediate artifact (graph.gpickle), and reuse it if there are no file changes (known via partial parsing). We wouldn't attempt partial graph reconstruction, just full reuse or full invalidation. (A rough sketch follows at the end of this comment.)

I don't think we'd be able to do both (2) and (3); if selection criteria impacts the graph we build, then we can't reuse it in subsequent invocations with different selection criteria.

From my point of view, all options are on the table, including ones not listed here. We've been working hard on improving performance all year, and I'd like to see us carry that through into v1.0.
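As a rough sketch of option (3): dbt already writes a graph.gpickle artifact (hence the write_gpickle deprecation warnings in your logs), so the missing piece would just be the reuse-or-invalidate decision. The cache path and the files_changed flag here are illustrative, not dbt's real API:

```python
import os
import pickle

import networkx as nx

GRAPH_CACHE = "target/graph.gpickle"  # illustrative path

def load_or_build_graph(build_graph, files_changed: bool) -> nx.DiGraph:
    # Full reuse when partial parsing saw no changes; full rebuild otherwise.
    if not files_changed and os.path.exists(GRAPH_CACHE):
        with open(GRAPH_CACHE, "rb") as f:
            return pickle.load(f)
    graph = build_graph()  # the expensive linking + transitive closure
    with open(GRAPH_CACHE, "wb") as f:
        pickle.dump(graph, f, pickle.HIGHEST_PROTOCOL)
    return graph
```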

@dvalchev
Author

Thanks @jtcohen6 for the quick response!
I think option (2) would probably be a good trade-off. In our case, I expect the biggest negative impact to be during day-to-day development, so at minimum it would be better to scale the processing time with the selection size.
On that train of thought, (1) could also help minimize the development impact. We haven't adopted the build setup yet, but I expect it to be used mostly in automated pipeline executions (where the impact would be minimized) rather than in the manual run/test/selection scenarios of daily development.

@jtcohen6
Contributor

@dvalchev We think we've come up with a performance boost that doesn't require one of the trade-offs I mention above. If you get a moment, could you take it for a spin? #4155

I see a 5-10x speedup on my machine. Granted, that reduction is from 5.5s to 0.5-1s, so I'm interested to know what it looks like when starting from 54s.

@dvalchev
Author

@jtcohen6 I tested #4155, but unfortunately in my case I see a negative impact: graph.py:81(get_subset_graph) now went up to 66s, and on top of that, for whatever reason, I also get constant re-parsing of over 50% of the models.

Here is the back-to-back run before and after #4155:

Before

[screenshot: timing profile]

After

[screenshot: timing profile]

@iknox-fa
Contributor

iknox-fa commented Oct 29, 2021

Thanks for giving the new code a spin, @dvalchev, although I confess I'm a little surprised at your results. I have a couple of thoughts/questions that may help us proceed here.

1 - Can you try applying that patch to 1.0-b2? That's the version I've been developing against and would ensure we're working from the same baseline.

2 - Can you share a little more about the structure of your project? Differently shaped graphs can impact the algos we use in different ways.

3 - It strikes me that the % of the overall run time that get_subset_graph consumes has gotten smaller but the overall runtime got larger. Is there any chance that there was some resource contention going on during that second run?

4 - I sincerely hope this isn't the case, but the other obvious difference between what we're doing and what you're doing is the platform (we're all macOS at dbtlabs).

@dvalchev
Author

@iknox-fa, there you go:

1 - Can you try applying that patch to 1.0-b2? That's the version I've been developing against and would ensure we're working from the same baseline.

I tested directly against your leahwicz/perfImprovement branch and compared it with the regular 1.0.0-b2.

2 - Can you share a little more about the structure of your project? Differently shaped graphs can impact the algos we use in different ways.

The specifics of our project which I think cause the issue are as follows:

  • We have on average ~10 tests per model and this ratio is more or less consistent throughout the transformation steps:
Running with dbt=1.0.0-b2
Found 343 models, 3882 tests, 0 snapshots, 7 analyses, 408 macros, 0 operations, 1 seed file, 5 sources, 0 exposures
  • The transformation pipeline can go up to 10 layers deep and has quite a number of dependent models. Multiplied by the ~10 tests per model, I think this is where the graph complexity explodes. E.g.:
    [screenshot: example model lineage]

3 - It strikes me that the % of the overall run time that get_subset_graph consumes has gotten smaller but the overall runtime got larger. Is there any chance that there was some resource contention going on during that second run?

I deliberately ran the tests multiple times and tried to capture the best possible result to minimize any resource-contention impact during the test runs. As I mentioned earlier, the overall duration increase is caused by the additional re-parsing of more than 50% of the models. It seems that the changes in #4155 are, on their own, invalidating the parsing cache. Every run (with no changes to the models whatsoever) does this unnecessary re-parsing:

2021-10-28 20:59:07.341849 (MainThread): Running with dbt=1.0.0-b2
...
2021-10-28 20:59:08.797434 (MainThread): Partial parsing enabled: 0 files deleted, 0 files added, 194 files changed.
2021-10-28 20:59:08.807405 (MainThread): Partial parsing: updated file: XXXXXXXXXXXXXXXXXX
2021-10-28 20:59:08.808402 (MainThread): Partial parsing: updated file: YYYYYYYYYYYYYYYYYY
...
2021-10-28 20:59:08.882238 (MainThread): 1603: static parser failed on XXXXXXXXXXXXXXXXXX
2021-10-28 20:59:08.986919 (MainThread): 1602: parser fallback to jinja rendering on XXXXXXXXXXXXXXXXXX
2021-10-28 20:59:08.988950 (MainThread): 1603: static parser failed on YYYYYYYYYYYYYYYYYY
2021-10-28 20:59:09.002903 (MainThread): 1602: parser fallback to jinja YYYYYYYYYYYYYYYYYY
2021-10-28 20:59:09.003874 (MainThread): 1603: static parser failed on YYYYYYYYYYYYYYYYYY
...

4 - I sincerely hope this isn't the case, but the other obvious difference between what we're doing and what you're doing is the platform (we're all macOS at dbtlabs).

I have tested this on both Windows 10 and Red Hat Enterprise Linux 8.2 machines and the behavior is consistent, so I think we can rule out any environmental dependency.
Although I am fairly ignorant about graph processing, I think the problem in our case is caused by the graph's 'depth' rather than its 'width' (if those are the right terms), which creates the explosion of edges.
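To illustrate what I mean, a toy networkx comparison (invented graphs, same node count): the transitive closure of a deep chain grows quadratically, while a wide fan-out stays linear.

```python
import networkx as nx

n = 100

# 'Deep': a single chain of n models.
deep = nx.DiGraph([(f"m{i}", f"m{i + 1}") for i in range(n - 1)])

# 'Wide': one root model fanning out to n - 1 children.
wide = nx.DiGraph([("root", f"m{i}") for i in range(n - 1)])

print(len(nx.transitive_closure(deep).edges()))  # n*(n-1)/2 = 4950
print(len(nx.transitive_closure(wide).edges()))  # n-1 = 99
```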

@dvalchev
Author

@jtcohen6, while testing, I noticed another potential bug which seems to be a side effect of the graph processing changes. The lineage graph in the generated docs works only when focused on models with few dependencies. Anything beyond that results in something like:
[screenshot: lineage graph rendering incorrectly]

Dragging dots around results in:
[screenshot: lineage graph after dragging nodes]

Everything works fine up to 0.21 and is broken in 1.0.0-b2 (I haven't tested b1).
Shall we open a new issue for this?

@jtcohen6
Contributor

jtcohen6 commented Oct 29, 2021

@dvalchev That's very weird! The dbt-docs site doesn't use the networkx graph at all. The DAG visualization is powered completely by the parent_map + child_map in manifest.json, neither of which has changed.
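If you want to sanity-check that on your side, here's a minimal sketch reading those maps straight from the artifact (assuming the default target/ output directory):

```python
import json

# parent_map and child_map are plain {unique_id: [unique_id, ...]} dicts;
# the docs DAG is rendered from these alone, not from the networkx graph.
with open("target/manifest.json") as f:
    manifest = json.load(f)

node_id = next(iter(manifest["child_map"]))  # any node, just for illustration
print("parents: ", manifest["parent_map"].get(node_id, []))
print("children:", manifest["child_map"].get(node_id, []))
```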

I also don't know why the change in #4155 is causing dbt to detect 194 files changed every time it runs; that PR makes no changes whatsoever to project parsing. There are other changes in main since we released b2, notably #4101. Do you make significant use of env vars, such as in macros you use across many models, and are you regularly changing their values?

You're welcome to open a new issue. I just worry I'm going to have a hard time reproducing.

@dvalchev
Author

We don't use env vars directly in the models and macros; we only use them to pass initial values to the internal dbt vars in dbt_project.yml. We do use those internal vars extensively, however, both for setting table aliases and for incremental logic. Still, the re-parsing behavior makes little sense, as none of the env vars change between runs.
I will take a look at #4101, but I first need to figure out how to properly test before/after past PRs.
