contributing support for running calrissian on AKS (Azure), EKS (AWS) and GKE (Google) #124

Open
pymonger opened this issue Oct 14, 2021 · 3 comments


@pymonger

Greetings,

I'm interested in using calrissian to run CWL workflows on the managed K8s services of the 3 major cloud vendors. I'm starting with Azure and am running into caveats (e.g. #123) that are related to the ReadWriteMany requirement of PersistentVolumes. I'm willing to work through these issues for each of the cloud vendors but would like to know the best approach to implementing the fixes so they can be contributed back to main.
Since calrissian builds on functionality in https://github.com/common-workflow-language/cwltool, some of the kludges I've implemented just to get it to work on Azure actually required me to update cwltool (e.g. common-workflow-language/cwltool#1544). That's probably not the right approach, so I'm looking for guidance on whether to proceed with making updates to cwltool or to find a way to build the capability into calrissian.

Thanks in advance.

@pymonger pymonger changed the title contributing support for running calrissian on AKS (Azure), EKS (AWS), and GKE (Google) contributing support for running calrissian on AKS (Azure), EKS (AWS) and GKE (Google) Oct 14, 2021
@fabricebrito
Collaborator

@pymonger can you share the current and expected behaviour? We'd be happy to help get Calrissian working on several KaaS providers.

@fabricebrito
Collaborator

fabricebrito commented Oct 19, 2021

@pymonger regarding https://github.com/pymonger/soamc-cwl-demo#google-kubernetes-engine and the associated cost, we use https://longhorn.io/ as it provides ReadWriteMany using the nodes' disks. I wonder if that works on GKE.

@pymonger
Author

@fabricebrito: without making these changes to cwltool:

https://github.com/common-workflow-language/cwltool/pull/1544/files

I would get the following error:

--------------------------------------------------------------------------------
apiVersion: v1
kind: Pod
metadata:
  labels: {}
  name: stage-in-cwl-pod-ydxduxah
spec:
  containers:
  - args:
    - curl -O https://s3-us-west-2.amazonaws.com/landsat-pds/L8/010/117/LC80101172015002LGN00/LC80101172015002LGN00_BQA.TIF
      > stdout_stage-in.txt 2> stderr_stage-in.txt
    command:
    - /bin/sh
    - -c
    env:
    - name: HOME
      value: /XiTfjy
    - name: TMPDIR
      value: /tmp
    image: curlimages/curl
    name: stage-in-cwl-container
    resources:
      requests:
        cpu: '1'
        memory: 1024Mi
    volumeMounts:
    - mountPath: /XiTfjy
      name: calrissian-tmpout
      readOnly: false
      subPath: sqh3fknm
    - mountPath: /tmp
      name: tmpdir
    workingDir: /XiTfjy
  initContainers: []
  restartPolicy: Never
  securityContext:
    runAsGroup: 0
    runAsUser: 1001
  volumes:
  - name: calrissian-input-data
    persistentVolumeClaim:
      claimName: calrissian-input-data
      readOnly: true
  - name: calrissian-tmpout
    persistentVolumeClaim:
      claimName: calrissian-tmpout
      readOnly: false
  - name: calrissian-output-data
    persistentVolumeClaim:
      claimName: calrissian-output-data
      readOnly: false
  - emptyDir: {}
    name: tmpdir
--------------------------------------------------------------------------------

Created k8s pod name stage-in-cwl-pod-ydxduxah with id f17fc3f2-b49b-4182-bfc1-379eaac5a691
PodMonitor adding stage-in-cwl-pod-ydxduxah
k8s pod 'stage-in-cwl-pod-ydxduxah' started
[stage-in-cwl-pod-ydxduxah] follow_logs start
[stage-in-cwl-pod-ydxduxah] follow_logs end
Handling terminated pod name stage-in-cwl-pod-ydxduxah with id f17fc3f2-b49b-4182-bfc1-379eaac5a691
handling completion with 0
PodMonitor removing stage-in-cwl-pod-ydxduxah
shutil.rmtree(/tmp/tjb__2wk, True)
shutil.rmtree(/tmp/4oavux2h, True)
DEBUG restore [ram: 1024, cores: 1] to available [ram: 14976.0, cores: 7.0]
DEBUG Finishing ThreadPoolExecutor.run_jobs: total_resources=[ram: 16000.0, cores: 8.0], available_resources=[ram: 16000.0, cores: 8.0]
DEBUG Moving /calrissian/tmpout/sqh3fknm/LC80101172015002LGN00_BQA.TIF to /calrissian/output-data/LC80101172015002LGN00_BQA.TIF
ERROR Unhandled error:
  [Errno 1] Operation not permitted
Traceback (most recent call last):
  File "/usr/local/lib/python3.7/shutil.py", line 566, in move
    os.rename(src, real_dst)
OSError: [Errno 18] Invalid cross-device link: '/calrissian/tmpout/sqh3fknm/LC80101172015002LGN00_BQA.TIF' -> '/calrissian/output-data/LC80101172015002LGN00_BQA.TIF'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/local/lib/python3.7/site-packages/cwltool/main.py", line 1248, in main
    tool, initialized_job_order_object, runtimeContext, logger=_logger
  File "/usr/local/lib/python3.7/site-packages/cwltool/executors.py", line 60, in __call__
    return self.execute(process, job_order_object, runtime_context, logger)
  File "/usr/local/lib/python3.7/site-packages/cwltool/executors.py", line 157, in execute
    path_mapper=runtime_context.path_mapper,
  File "/usr/local/lib/python3.7/site-packages/cwltool/process.py", line 401, in relocateOutputs
    stage_files(pm, stage_func=_relocate, symlink=False, fix_conflicts=True)
  File "/usr/local/lib/python3.7/site-packages/cwltool/process.py", line 297, in stage_files
    stage_func(entry.resolved, entry.target)
  File "/usr/local/lib/python3.7/site-packages/cwltool/process.py", line 374, in _relocate
    shutil.move(src, dst)
  File "/usr/local/lib/python3.7/shutil.py", line 580, in move
    copy_function(src, real_dst)
  File "/usr/local/lib/python3.7/shutil.py", line 267, in copy2
    copystat(src, dst, follow_symlinks=follow_symlinks)
  File "/usr/local/lib/python3.7/shutil.py", line 206, in copystat
    follow_symlinks=follow)
PermissionError: [Errno 1] Operation not permitted
Starting Cleanup
Finishing Cleanup

I filed this GitHub issue about it but closed it because I thought it would be straightforward to add an Azure StorageClass that supports ReadWriteMany:

#123

The issue is that the Azure StorageClass that supports it is based on AzureFile, which mounts volumes using CIFS and doesn't allow file attributes to be modified. That is why I get the above PermissionError:

https://docs.microsoft.com/en-us/answers/questions/89827/how-can-i-change-folder-or-file-permissions-when-m.html
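For anyone who wants to check whether a given mount has this limitation before running a workflow, here is a rough probe. It is only a sketch: `supports_copystat` is a hypothetical helper name, not part of cwltool or calrissian, and it assumes that the failing operation is `shutil.copystat`, as in the traceback above.

```python
import os
import shutil
import tempfile


def supports_copystat(directory):
    """Return True if shutil.copystat succeeds for a file created in `directory`.

    Hypothetical helper: it copies metadata from a local temp file onto a
    scratch file inside `directory` and reports whether that was allowed.
    """
    with tempfile.NamedTemporaryFile() as src:
        fd, dst = tempfile.mkstemp(dir=directory)
        os.close(fd)
        try:
            shutil.copystat(src.name, dst)
            return True
        except OSError:
            # PermissionError (a subclass of OSError) is what the traceback shows.
            return False
        finally:
            os.unlink(dst)
```

Calling it with a mount point such as `/calrissian/output-data` should indicate whether moves onto that volume are likely to hit the same error.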

So for the time being, I'm using my fork of cwltool (https://github.com/pymonger/cwltool/tree/handle-unsupported-file-ops) to work with calrissian to address the issue above.
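To illustrate the general idea (this is not necessarily what the fork or the cwltool PR does), a minimal sketch of a move that tolerates the metadata failure could look like the following. The function names are made up for illustration; the only library behaviour relied on is that `shutil.move` accepts a `copy_function` argument and uses it when `os.rename` fails across devices.

```python
import shutil


def copy_tolerating_metadata_errors(src, dst):
    """Copy file contents, then try to copy metadata and ignore refusals."""
    shutil.copyfile(src, dst)
    try:
        shutil.copystat(src, dst)
    except PermissionError:
        # Assumption: the destination (e.g. a CIFS mount) rejects
        # attribute changes; the file contents were already copied.
        pass
    return dst


def relocate(src, dst):
    """Move src to dst, falling back to the tolerant copy across devices."""
    return shutil.move(src, dst, copy_function=copy_tolerating_metadata_errors)
```

This only covers moving individual files. When a directory is moved across devices, `shutil.move` goes through `shutil.copytree`, which also copies directory metadata and would still need separate handling.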

With regard to GKE, thanks for the pointer to longhorn. I'll look into it. I was able to run my CWL workflows on GKE using the NFS solution described in the article below, but longhorn may be a better solution for operational use:

https://medium.com/@Sushil_Kumar/readwritemany-persistent-volumes-in-google-kubernetes-engine-a0b93e203180
