OmegaConfigLoader chokes when .ipynb_checkpoints contains checkpoint of catalog.yml #2583

Closed
astrojuanlu opened this issue May 16, 2023 · 2 comments · Fixed by #2977
@astrojuanlu
Member

Description

As per title.

Context

I was migrating a project from TemplatedConfigLoader to OmegaConfigLoader, and got an error that was not previously happening.

Jupyter automatically creates checkpoints of files (both notebooks and plain text files) in an .ipynb_checkpoints directory:

$ find conf/base -name "catalog*.yml"
conf/base/.ipynb_checkpoints/catalog-checkpoint.yml
conf/base/catalog.yml
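
A minimal sketch of why the checkpoint copy gets picked up, assuming the loader matches files with recursive glob patterns such as `**/catalog*`. It uses `pathlib` on a temporary directory rather than Kedro itself, and `find_matches` is a hypothetical helper, not part of Kedro:

```python
import tempfile
from pathlib import Path


def find_matches(conf_dir: Path, patterns: list[str]) -> list[str]:
    """Return sorted relative paths of files under conf_dir matching any pattern."""
    matches = set()
    for pattern in patterns:
        for path in conf_dir.glob(pattern):
            if path.is_file():
                matches.add(path.relative_to(conf_dir).as_posix())
    return sorted(matches)


with tempfile.TemporaryDirectory() as tmp:
    # Recreate the layout from the `find` output above.
    conf_base = Path(tmp) / "conf" / "base"
    (conf_base / ".ipynb_checkpoints").mkdir(parents=True)
    (conf_base / "catalog.yml").write_text("example: 1\n")
    (conf_base / ".ipynb_checkpoints" / "catalog-checkpoint.yml").write_text("example: 1\n")

    # Assumed catalog-style patterns; pathlib's glob also descends into dot-directories.
    default_matches = find_matches(conf_base, ["catalog*", "**/catalog*"])
    print(default_matches)
```

Both files are returned, so a loader that merges every match sees the same keys twice.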

Steps to Reproduce

  1. Copy conf/base/catalog.yml to conf/base/.ipynb_checkpoints/catalog-checkpoint.yml
  2. Run kedro catalog list while using TemplatedConfigLoader: everything works.
  3. Switch to OmegaConfigLoader and run kedro catalog list again: it fails with the error below.

Expected Result

OmegaConfigLoader behaves like the other config loaders when a checkpoint copy of a catalog file is present, and the command succeeds:

$ kedro catalog list
DataSets in '__default__' pipeline:
  Datasets mentioned in pipeline:
    CSVDataSet:
    - openrepair-0_3-combined
    - openrepair-0_3-events-raw
    - openrepair-0_3-categories
    - openrepair-0_3
DataSets in 'data_processing' pipeline:
  Datasets mentioned in pipeline:
    CSVDataSet:
    - openrepair-0_3-combined
    - openrepair-0_3-events-raw
    - openrepair-0_3-categories
    - openrepair-0_3

Actual Result

╭─────────────────────────────── Traceback (most recent call last) ────────────────────────────────╮
│ /Users/juan_cano/.micromamba/envs/kpolars310/bin/kedro:8 in <module>                             │
│                                                                                                  │
│ /Users/juan_cano/.micromamba/envs/kpolars310/lib/python3.10/site-packages/kedro/framework/cli/cl │
│ i.py:211 in main                                                                                 │
│                                                                                                  │
│   208 │   """
│   209 │   _init_plugins()                                                                        │
│   210 │   cli_collection = KedroCLI(project_path=Path.cwd())                                     │
│ ❱ 211 │   cli_collection()                                                                       │
│   212                                                                                            │
│                                                                                                  │
│ /Users/juan_cano/.micromamba/envs/kpolars310/lib/python3.10/site-packages/click/core.py:1130 in  │
│ __call__                                                                                         │
│                                                                                                  │
│ /Users/juan_cano/.micromamba/envs/kpolars310/lib/python3.10/site-packages/kedro/framework/cli/cl │
│ i.py:139 in main                                                                                 │
│                                                                                                  │
│   136 │   │   )                                                                                  │
│   137 │   │                                                                                      │
│   138 │   │   try:                                                                               │
│ ❱ 139 │   │   │   super().main(                                                                  │
│   140 │   │   │   │   args=args,                                                                 │
│   141 │   │   │   │   prog_name=prog_name,                                                       │
│   142 │   │   │   │   complete_var=complete_var,                                                 │
│                                                                                                  │
│ /Users/juan_cano/.micromamba/envs/kpolars310/lib/python3.10/site-packages/click/core.py:1055 in  │
│ main                                                                                             │
│                                                                                                  │
│ /Users/juan_cano/.micromamba/envs/kpolars310/lib/python3.10/site-packages/click/core.py:1657 in  │
│ invoke                                                                                           │
│                                                                                                  │
│ /Users/juan_cano/.micromamba/envs/kpolars310/lib/python3.10/site-packages/click/core.py:1657 in  │
│ invoke                                                                                           │
│                                                                                                  │
│ /Users/juan_cano/.micromamba/envs/kpolars310/lib/python3.10/site-packages/click/core.py:1404 in  │
│ invoke                                                                                           │
│                                                                                                  │
│ /Users/juan_cano/.micromamba/envs/kpolars310/lib/python3.10/site-packages/click/core.py:760 in   │
│ invoke                                                                                           │
│                                                                                                  │
│ /Users/juan_cano/.micromamba/envs/kpolars310/lib/python3.10/site-packages/click/decorators.py:38 │
│ in new_func                                                                                      │
│                                                                                                  │
│ /Users/juan_cano/.micromamba/envs/kpolars310/lib/python3.10/site-packages/kedro/framework/cli/ca │
│ talog.py:56 in list_datasets                                                                     │
│                                                                                                  │
│    53 │                                                                                          │
│    54 │   session = _create_session(metadata.package_name, env=env)                              │
│    55 │   context = session.load_context()                                                       │
│ ❱  56 │   datasets_meta = context.catalog._data_sets  # pylint: disable=protected-access         │
│    57 │   catalog_ds = set(context.catalog.list())                                               │
│    58 │                                                                                          │
│    59 │   target_pipelines = pipeline or pipelines.keys()                                        │
│                                                                                                  │
│ /Users/juan_cano/.micromamba/envs/kpolars310/lib/python3.10/site-packages/kedro/framework/contex │
│ t/context.py:236 in catalog                                                                      │
│                                                                                                  │
│   233 │   │   │   KedroContextError: Incorrect ``DataCatalog`` registered for the project.       │
│   234 │   │                                                                                      │
│   235 │   │   """                                                                                │
│ ❱ 236 │   │   return self._get_catalog()                                                         │
│   237 │                                                                                          │
│   238 │   @property                                                                              │
│   239 │   def params(self) -> Dict[str, Any]:                                                    │
│                                                                                                  │
│ /Users/juan_cano/.micromamba/envs/kpolars310/lib/python3.10/site-packages/kedro/framework/contex │
│ t/context.py:279 in _get_catalog                                                                 │
│                                                                                                  │
│   276 │   │                                                                                      │
│   277 │   │   """
│   278 │   │   # '**/catalog*' reads modular pipeline configs                                     │
│ ❱ 279 │   │   conf_catalog = self.config_loader["catalog"]                                       │
│   280 │   │   # turn relative paths in conf_catalog into absolute paths                          │
│   281 │   │   # before initializing the catalog                                                  │
│   282 │   │   conf_catalog = _convert_paths_to_absolute_posix(                                   │
│                                                                                                  │
│ /Users/juan_cano/.micromamba/envs/kpolars310/lib/python3.10/site-packages/kedro/config/omegaconf │
│ _config.py:168 in __getitem__                                                                    │
│                                                                                                  │
│   165 │   │   │   base_path = str(Path(self.conf_source) / self.base_env)                        │
│   166 │   │   else:                                                                              │
│   167 │   │   │   base_path = str(Path(self._fs.ls("", detail=False)[-1]) / self.base_env)       │
│ ❱ 168 │   │   base_config = self.load_and_merge_dir_config(                                      │
│   169 │   │   │   base_path, patterns, read_environment_variables                                │
│   170 │   │   )                                                                                  │
│   171 │   │   config = base_config                                                               │
│                                                                                                  │
│ /Users/juan_cano/.micromamba/envs/kpolars310/lib/python3.10/site-packages/kedro/config/omegaconf │
│ _config.py:272 in load_and_merge_dir_config                                                      │
│                                                                                                  │
│   269 │   │   │   file: set(config.keys()) for file, config in config_per_file.items()           │
│   270 │   │   }                                                                                  │
│   271 │   │   aggregate_config = config_per_file.values()                                        │
│ ❱ 272 │   │   self._check_duplicates(seen_file_to_keys)                                          │
│   273 │   │                                                                                      │
│   274 │   │   if not aggregate_config:                                                           │
│   275 │   │   │   return {}                                                                      │
│                                                                                                  │
│ /Users/juan_cano/.micromamba/envs/kpolars310/lib/python3.10/site-packages/kedro/config/omegaconf │
│ _config.py:311 in _check_duplicates                                                              │
│                                                                                                  │
│   308 │   │                                                                                      │
│   309 │   │   if duplicates:                                                                     │
│   310 │   │   │   dup_str = "\n".join(duplicates)                                                │
│ ❱ 311 │   │   │   raise ValueError(f"{dup_str}")                                                 │
│   312 │                                                                                          │
│   313 │   @staticmethod                                                                          │
│   314 │   def _resolve_environment_variables(config: Dict[str, Any]) -> None:                    │
╰──────────────────────────────────────────────────────────────────────────────────────────────────╯
ValueError: Duplicate keys found in /Users/juan_cano/Projects/QuantumBlack 
Labs/talk-kedro-polars/conf/base/.ipynb_checkpoints/catalog-checkpoint.yml and /Users/juan_cano/Projects/QuantumBlack 
Labs/talk-kedro-polars/conf/base/catalog.yml: openrepair-0_3, openrepair-0_3-categories, openrepair-0_3-combined, 
openrepair-0_3-events-raw
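
A rough sketch of the kind of check that produces this message, assuming the loader collects the top-level keys of each matched file and rejects any overlap. `find_duplicate_keys` is a hypothetical helper written for illustration, not the code in `omegaconf_config.py`:

```python
from itertools import combinations


def find_duplicate_keys(keys_per_file: dict[str, set[str]]) -> list[str]:
    """Return one message per pair of files that share top-level keys."""
    messages = []
    for file_a, file_b in combinations(sorted(keys_per_file), 2):
        overlap = keys_per_file[file_a] & keys_per_file[file_b]
        if overlap:
            keys = ", ".join(sorted(overlap))
            messages.append(f"Duplicate keys found in {file_a} and {file_b}: {keys}")
    return messages


# The checkpoint is a copy of catalog.yml, so every key overlaps.
catalog_keys = {"openrepair-0_3", "openrepair-0_3-combined"}
messages = find_duplicate_keys(
    {
        "conf/base/catalog.yml": set(catalog_keys),
        "conf/base/.ipynb_checkpoints/catalog-checkpoint.yml": set(catalog_keys),
    }
)
print("\n".join(messages))
```

Under this assumption, any exact copy of a config file that matches the patterns will trigger the error, regardless of where it lives under the environment directory.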

Workaround

Changing the file patterns is an effective workaround:

# settings.py

CONFIG_LOADER_ARGS = {
    "config_patterns": {
        "catalog": ["catalog.yml", "**/catalog.yml"],
    }
}
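
To illustrate why the narrower patterns help, here is a small check with the standard-library `fnmatch` module. It only compares file names, which simplifies how the loader matches full paths, and `catalog_raw.yml` is a made-up file name added to show the trade-off:

```python
from fnmatch import fnmatch

# catalog_raw.yml is a hypothetical extra catalog file.
file_names = ["catalog.yml", "catalog-checkpoint.yml", "catalog_raw.yml"]


def names_matching(pattern: str) -> list[str]:
    """Return the file names that match a glob-style pattern."""
    return [name for name in file_names if fnmatch(name, pattern)]


broad = names_matching("catalog*")      # wildcard pattern
narrow = names_matching("catalog.yml")  # exact name used in the workaround

print(broad)
print(narrow)
```

The exact name skips the checkpoint copy, but it also skips any other file that relied on the wildcard, so projects that split the catalog across several files would need to list those names too.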

Your Environment

Include as many relevant details about the environment in which you experienced the bug:

  • Kedro version used (pip show kedro or kedro -V): 0.18.7
  • Python version used (python -V): 3.10.9
  • Operating system and version: macOS Ventura
@noklam
Contributor

noklam commented Aug 9, 2023

Reported by user

[Omegaconf]
Hi Team,
I’d like to use OmegaConf templating for the project, but when I set CONFIG_LOADER_CLASS = OmegaConfigLoader in settings.py and run kedro ipython, I get a ValueError: Duplicate keys found in my raw_layer.yml and .ipynb_checkpoints/raw_layer-checkpoint.yml.

settings.py
from .hooks import SparkHooks
from kedro.config import OmegaConfigLoader

CONFIG_LOADER_CLASS = OmegaConfigLoader

CONFIG_LOADER_ARGS = {
    "config_patterns": {
        "spark": ["spark*", "spark*/**"],
        "parameters": ["parameters*", "parameters*/**", "**/parameters*"],
        "catalog": ["catalog*", "catalog*/**", "**/catalog*"],
        "credentials": ["credentials*", "credentials*/**", "**/credentials*"],
        "logging": ["logging*", "logging*/**", "**/logging*"],
    }
}

HOOKS = (SparkHooks(),)
hooks.py
from kedro.framework.hooks import hook_impl
from pyspark import SparkConf
from pyspark.sql import SparkSession

class SparkHooks:
    @hook_impl
    def after_context_created(self, context) -> None:
        """Initialises a SparkSession using the config
        defined in project's conf folder.
        """

        # Load the spark configuration in spark.yaml using the config loader
        parameters = context.config_loader.get("spark*", "spark*/**")
        spark_conf = SparkConf().setAll(parameters.items())

        # Initialise the spark session
        spark_session_conf = (
            SparkSession.builder.appName(context.project_path.name)
            .enableHiveSupport()
            .config(conf=spark_conf)
        )
        _spark_session = spark_session_conf.getOrCreate()
        _spark_session.sparkContext.setLogLevel("WARN")

If I use the default ConfigLoader it works as expected. Why would OmegaConf read from ipynb checkpoints? How to switch that off?
Thank you!

@noklam
Contributor

noklam commented Aug 10, 2023

We should consider bumping the priority of this from Low to Medium or High. I will try to reproduce it. The user reports that he is using SageMaker and that this bug prevents him from working in a notebook.
