[Bug] Unable to do partial parsing after disabling Cosmos caching. #1041
Hello @josefbits! I'm here to help you with any bugs, questions, or contributions you might have. Let me know how I can assist you! To resolve the issue with partial parsing in dbt after disabling Cosmos caching in Airflow, you need to ensure that the partial parse file and the manifest file are correctly copied and updated in the target dbt project directory. Here is a step-by-step approach:
Here is the relevant code to perform these steps:

```python
from pathlib import Path

from cosmos.dbt.project import get_partial_parse_path
from cosmos.cache import (
    _get_latest_partial_parse,
    _update_partial_parse_cache,
    _copy_partial_parse_to_project,
)

# Define the paths
dbt_project_path = Path("/path/to/your/dbt/project")
cache_dir = Path("/path/to/your/cache/directory")

# Step 1: Identify the latest partial parse file
latest_partial_parse_filepath = _get_latest_partial_parse(dbt_project_path, cache_dir)

if latest_partial_parse_filepath:
    # Step 2: Update the cache with the latest partial parse file
    _update_partial_parse_cache(latest_partial_parse_filepath, cache_dir)
    # Step 3: Copy the partial parse file to the project directory
    _copy_partial_parse_to_project(latest_partial_parse_filepath, dbt_project_path)
else:
    print("No partial parse file found.")
```

Ensure that the paths to your dbt project and cache directory are correctly specified. This code will help keep the partial parse file and manifest file in sync, thus resolving the issue with partial parsing in dbt [1].
@dosu @tatiana If I understand correctly, during execution Cosmos symlinks the project files from the dbt project path to the temporary directory. However, as per the `create_symlinks` function, it ignores the `target/` directory (along with `logs` and `profiles.yml`).

However, this will not work on most managed solutions using an object-storage back end, as the directory is read-only. Of course it can be synced to Airflow worker local disk, but this brings us full circle to caching and whether or not concurrent context updates to the manifests will be a problem (#1042). IMO the simplest solution would be to check for existing manifests, either in the dbt project `target/` directory or, when using `LoadMode.DBT_MANIFEST` or `AUTOMATIC`, the `manifest_path` root provided in the `ProjectConfig`. These could then be copied to the temporary directory using a similar helper function.
You're correct that Cosmos symlinks project files to a temporary directory, ignoring `target/`, `logs`, and `profiles.yml`. To address this, you can modify the process to check for existing manifests in the project's `target/` directory and copy them into the temporary directory alongside the symlinks.
Here’s a code snippet to illustrate this approach:

```python
import os
import shutil
from pathlib import Path


def copy_manifests_to_temp(project_path: Path, tmp_dir: Path) -> None:
    """Copy existing manifest files to the temporary directory."""
    target_dir = project_path / "target"
    manifest_files = ["manifest.json", "partial_parse.msgpack"]
    for manifest_file in manifest_files:
        source_path = target_dir / manifest_file
        if source_path.exists():
            shutil.copy(source_path, tmp_dir / manifest_file)


def create_symlinks_with_manifests(project_path: Path, tmp_dir: Path, ignore_dbt_packages: bool) -> None:
    """Create symlinks to the dbt project files and copy manifests to the temp directory."""
    ignore_paths = ["logs", "target", "profiles.yml"]
    if ignore_dbt_packages:
        ignore_paths.append("dbt_packages")
    for child_name in os.listdir(project_path):
        if child_name not in ignore_paths:
            os.symlink(project_path / child_name, tmp_dir / child_name)
    # Copy manifests to the temporary directory
    copy_manifests_to_temp(project_path, tmp_dir)


# Example usage
dbt_project_path = Path("/path/to/dbt/project")
tmp_dir = Path("/path/to/temp/directory")
create_symlinks_with_manifests(dbt_project_path, tmp_dir, ignore_dbt_packages=False)
```

This approach ensures that the necessary manifest files are available in the temporary directory, enabling partial parsing even when caching is disabled. This should help mitigate issues with concurrent context updates to the manifests in environments with multiple Airflow workers or managed solutions using object-storage backends [1][2].
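The comment above also suggests honoring the `manifest_path` configured on `ProjectConfig` when `target/` is unavailable. A hedged sketch of that fallback; the function name and fallback behavior are illustrative rather than Cosmos APIs, and it assumes `manifest_path` points directly at a `manifest.json` file:

```python
import shutil
from pathlib import Path
from typing import Optional


def copy_manifests_with_fallback(
    project_path: Path, tmp_dir: Path, manifest_path: Optional[Path] = None
) -> list:
    """Copy manifest artifacts into tmp_dir, preferring the project's target/
    directory. For manifest.json only, fall back to an explicitly configured
    manifest_path (as one would set for LoadMode.DBT_MANIFEST).
    Returns the names of the files that were copied."""
    copied = []
    target_dir = Path(project_path) / "target"
    for name in ("manifest.json", "partial_parse.msgpack"):
        source = target_dir / name
        if not source.exists() and name == "manifest.json" and manifest_path:
            # Fall back to the user-provided manifest file.
            source = Path(manifest_path)
        if source.exists():
            shutil.copy(source, Path(tmp_dir) / name)
            copied.append(name)
    return copied
```

This keeps the read-only object-storage case working: even when `target/` cannot be written on the Airflow host, a pre-generated manifest supplied via configuration can still reach the temporary directory.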
Partial parsing support was introduced in astronomer#800 and improved in astronomer#904 (caching). However, when the caching layer was introduced, we removed support for partial parsing when the cache is disabled. This PR solves the issue. Fix: astronomer#1041
Astronomer Cosmos Version
Other Astronomer Cosmos version (please specify below)
If "Other Astronomer Cosmos version" selected, which one?
1.4.3
dbt-core version
1.7.15
Versions of dbt adapters
dbt-postgres 1.7.15
LoadMode
DBT_LS_MANIFEST
ExecutionMode
VIRTUALENV
InvocationMode
DBT_RUNNER
airflow version
2.8.2
Operating System
Ubuntu 22.04.4 LTS
If you think it's a UI issue, what browsers are you seeing the problem on?
No response
Deployment
Docker-Compose
Deployment details
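For a Docker-Compose deployment like this one, the cache flag discussed below is typically set via the service environment. An illustrative compose excerpt; the service name and file layout are placeholders, not taken from this report:

```yaml
services:
  airflow-scheduler:
    environment:
      # Disabling Cosmos caching currently also disables partial parsing (this issue)
      AIRFLOW__COSMOS__ENABLE_CACHE: "False"
```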
What happened?
After setting the AIRFLOW__COSMOS__ENABLE_CACHE=False environment variable, dbt tasks are unable to run partial parsing, failing with the error "Unable to do partial parsing because saved manifest not found. Starting full parse." The dbt artifacts are pre-generated in CI, and the entire target/ directory is synced to the dbt project path on the Airflow host.

Relevant log output
How to reproduce
Set the AIRFLOW__COSMOS__ENABLE_CACHE=False environment variable in the Airflow configuration.

Anything else :)?
I have been able to reproduce this in our MWAA deployment as well. It's also running astronomer-cosmos v1.4.3 but on a different Airflow version, dbt-core version, and the dbt-teradata adapter so it doesn't seem to be environment specific.
Are you willing to submit PR?
Contact Details