Skip to content

This issue was moved to a discussion.

You can continue the conversation there. Go to discussion →

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Why does the Webserver WORKER TIMEOUT #16505

Closed
LChloe opened this issue Jun 17, 2021 · 9 comments
Closed

Why does the Webserver WORKER TIMEOUT #16505

LChloe opened this issue Jun 17, 2021 · 9 comments
Labels
area:webserver Webserver related Issues kind:bug This is a clearly a bug

Comments

@LChloe
Copy link

LChloe commented Jun 17, 2021

Apache Airflow version:2.0.2
Kubernetes version:
	Client Version: version.Info{Major:"1", Minor:"16", GitVersion:"v1.16.4", GitCommit:"224be7bdce5a9dd0c2fd0d46b83865648e2fe0ba", GitTreeState:"clean", BuildDate:"2019-12-11T12:47:40Z", GoVersion:"go1.12.12", Compiler:"gc", Platform:"linux/amd64"}
	Server Version: version.Info{Major:"1", Minor:"16", GitVersion:"v1.16.4", GitCommit:"224be7bdce5a9dd0c2fd0d46b83865648e2fe0ba", GitTreeState:"clean", BuildDate:"2019-12-11T12:37:43Z", GoVersion:"go1.12.12", Compiler:"gc", Platform:"linux/amd64"} 
gunicorn 19.10.0

Output error log when the webserver pod is started:

[�[34m2021-06-17 09:35:00,378�[0m] {�[34mopentelemetry_tracing.py:�[0m29} INFO�[0m - This service is instrumented using OpenTelemetry.OpenTelemetry could not be imported; pleaseadd opentelemetry-api and opentelemetry-instrumentationpackages in order to get BigQuery Tracing data.�[0m
[�[34m2021-06-17 09:35:00,418�[0m] {�[34mproviders_manager.py:�[0m299} WARNING�[0m - Exception when importing 'airflow.providers.google.common.hooks.leveldb.LevelDBHook' from 'apache-airflow-providers-google' package: No module named 'airflow.providers.google.common.hooks.leveldb'�[0m
[�[34m2021-06-17 09:35:01,963�[0m] {�[34mproviders_manager.py:�[0m299} WARNING�[0m - Exception when importing 'airflow.providers.google.common.hooks.leveldb.LevelDBHook' from 'apache-airflow-providers-google' package: No module named 'airflow.providers.google.common.hooks.leveldb'�[0m
  ____________       _____________
 ____    |__( )_________  __/__  /________      __
____  /| |_  /__  ___/_  /_ __  /_  __ \_ | /| / /
___  ___ |  / _  /   _  __/ _  / / /_/ /_ |/ |/ /
 _/_/  |_/_/  /_/    /_/    /_/  \____/____/|__/
[�[34m2021-06-17 09:35:02,052�[0m] {�[34mdagbag.py:�[0m451} INFO�[0m - Filling up the DagBag from /dev/null�[0m
[�[34m2021-06-17 09:35:03,317�[0m] {�[34mproviders_manager.py:�[0m299} WARNING�[0m - Exception when importing 'airflow.providers.google.common.hooks.leveldb.LevelDBHook' from 'apache-airflow-providers-google' package: No module named 'airflow.providers.google.common.hooks.leveldb'�[0m
[�[34m2021-06-17 09:35:03,389�[0m] {�[34mproviders_manager.py:�[0m299} WARNING�[0m - Exception when importing 'airflow.providers.google.common.hooks.leveldb.LevelDBHook' from 'apache-airflow-providers-google' package: No module named 'airflow.providers.google.common.hooks.leveldb'�[0m
[�[34m2021-06-17 09:35:13,116�[0m] {�[34mopentelemetry_tracing.py:�[0m29} INFO�[0m - This service is instrumented using OpenTelemetry.OpenTelemetry could not be imported; pleaseadd opentelemetry-api and opentelemetry-instrumentationpackages in order to get BigQuery Tracing data.�[0m
[�[34m2021-06-17 09:35:13,118�[0m] {�[34mopentelemetry_tracing.py:�[0m29} INFO�[0m - This service is instrumented using OpenTelemetry.OpenTelemetry could not be imported; pleaseadd opentelemetry-api and opentelemetry-instrumentationpackages in order to get BigQuery Tracing data.�[0m
[�[34m2021-06-17 09:35:13,128�[0m] {�[34mproviders_manager.py:�[0m299} WARNING�[0m - Exception when importing 'airflow.providers.google.common.hooks.leveldb.LevelDBHook' from 'apache-airflow-providers-google' package: No module named 'airflow.providers.google.common.hooks.leveldb'�[0m
[�[34m2021-06-17 09:35:13,132�[0m] {�[34mproviders_manager.py:�[0m299} WARNING�[0m - Exception when importing 'airflow.providers.google.common.hooks.leveldb.LevelDBHook' from 'apache-airflow-providers-google' package: No module named 'airflow.providers.google.common.hooks.leveldb'�[0m
[�[34m2021-06-17 09:35:13,132�[0m] {�[34mopentelemetry_tracing.py:�[0m29} INFO�[0m - This service is instrumented using OpenTelemetry.OpenTelemetry could not be imported; pleaseadd opentelemetry-api and opentelemetry-instrumentationpackages in order to get BigQuery Tracing data.�[0m
[�[34m2021-06-17 09:35:13,194�[0m] {�[34mproviders_manager.py:�[0m299} WARNING�[0m - Exception when importing 'airflow.providers.google.common.hooks.leveldb.LevelDBHook' from 'apache-airflow-providers-google' package: No module named 'airflow.providers.google.common.hooks.leveldb'�[0m
[�[34m2021-06-17 09:35:13,403�[0m] {�[34mopentelemetry_tracing.py:�[0m29} INFO�[0m - This service is instrumented using OpenTelemetry.OpenTelemetry could not be imported; pleaseadd opentelemetry-api and opentelemetry-instrumentationpackages in order to get BigQuery Tracing data.�[0m
[�[34m2021-06-17 09:35:13,420�[0m] {�[34mproviders_manager.py:�[0m299} WARNING�[0m - Exception when importing 'airflow.providers.google.common.hooks.leveldb.LevelDBHook' from 'apache-airflow-providers-google' package: No module named 'airflow.providers.google.common.hooks.leveldb'�[0m
[�[34m2021-06-17 09:35:14,502�[0m] {�[34mproviders_manager.py:�[0m299} WARNING�[0m - Exception when importing 'airflow.providers.google.common.hooks.leveldb.LevelDBHook' from 'apache-airflow-providers-google' package: No module named 'airflow.providers.google.common.hooks.leveldb'�[0m
[�[34m2021-06-17 09:35:14,591�[0m] {�[34mproviders_manager.py:�[0m299} WARNING�[0m - Exception when importing 'airflow.providers.google.common.hooks.leveldb.LevelDBHook' from 'apache-airflow-providers-google' package: No module named 'airflow.providers.google.common.hooks.leveldb'�[0m
[�[34m2021-06-17 09:35:14,708�[0m] {�[34mproviders_manager.py:�[0m299} WARNING�[0m - Exception when importing 'airflow.providers.google.common.hooks.leveldb.LevelDBHook' from 'apache-airflow-providers-google' package: No module named 'airflow.providers.google.common.hooks.leveldb'�[0m
[�[34m2021-06-17 09:35:14,941�[0m] {�[34mproviders_manager.py:�[0m299} WARNING�[0m - Exception when importing 'airflow.providers.google.common.hooks.leveldb.LevelDBHook' from 'apache-airflow-providers-google' package: No module named 'airflow.providers.google.common.hooks.leveldb'�[0m
[�[34m2021-06-17 09:35:17,614�[0m] {�[34mproviders_manager.py:�[0m299} WARNING�[0m - Exception when importing 'airflow.providers.google.common.hooks.leveldb.LevelDBHook' from 'apache-airflow-providers-google' package: No module named 'airflow.providers.google.common.hooks.leveldb'�[0m
[�[34m2021-06-17 09:35:17,724�[0m] {�[34mproviders_manager.py:�[0m299} WARNING�[0m - Exception when importing 'airflow.providers.google.common.hooks.leveldb.LevelDBHook' from 'apache-airflow-providers-google' package: No module named 'airflow.providers.google.common.hooks.leveldb'�[0m 
.............
$ cat gunicorn_error.log
[2021-06-17 09:35:05 +0000] [58] [INFO] Starting gunicorn 19.10.0
[2021-06-17 09:35:05 +0000] [58] [INFO] Listening at: http://0.0.0.0:8080 (58)
[2021-06-17 09:35:05 +0000] [58] [INFO] Using worker: sync
[2021-06-17 09:35:05 +0000] [64] [INFO] Booting worker with pid: 64
[2021-06-17 09:35:05 +0000] [65] [INFO] Booting worker with pid: 65
[2021-06-17 09:35:05 +0000] [66] [INFO] Booting worker with pid: 66
[2021-06-17 09:35:05 +0000] [67] [INFO] Booting worker with pid: 67
[2021-06-17 09:35:35 +0000] [58] [INFO] Handling signal: ttin
[2021-06-17 09:35:35 +0000] [204] [INFO] Booting worker with pid: 204
[2021-06-17 09:35:42 +0000] [58] [INFO] Handling signal: ttou
[2021-06-17 09:40:22 +0000] [58] [CRITICAL] WORKER TIMEOUT (pid:64)
[2021-06-17 09:40:22 +0000] [58] [CRITICAL] WORKER TIMEOUT (pid:65)
[2021-06-17 09:40:22 +0000] [58] [CRITICAL] WORKER TIMEOUT (pid:67)
[2021-06-17 09:40:22 +0000] [65] [INFO] Worker exiting (pid: 65)
[2021-06-17 09:40:22 +0000] [64] [INFO] Worker exiting (pid: 64)
[2021-06-17 09:40:22 +0000] [67] [INFO] Worker exiting (pid: 67)
[2021-06-17 09:40:23 +0000] [250] [INFO] Booting worker with pid: 250
[2021-06-17 09:40:23 +0000] [251] [INFO] Booting worker with pid: 251
[2021-06-17 09:40:25 +0000] [58] [CRITICAL] WORKER TIMEOUT (pid:66)
[2021-06-17 09:40:25 +0000] [66] [INFO] Worker exiting (pid: 66)
[2021-06-17 09:40:26 +0000] [316] [INFO] Booting worker with pid: 316
[2021-06-17 09:40:36 +0000] [58] [INFO] Handling signal: ttin
[2021-06-17 09:40:36 +0000] [355] [INFO] Booting worker with pid: 355 

airflow.cfg:

  • web_server_master_timeout -> 300
  • web_server_worker_timeout -> 300

I generated the image using Airflow2.0.2 Dockerfile as the base image:

My Yaml:

apiVersion: extensions/v1beta1
kind: Deployment
metadata:
  creationTimestamp: null
  name: ops-airflow-docker-151358
  labels:
    name: airflow-151358
    runEnv: live
    io.kompose.service: airflow-151358
spec:
  replicas: 1
  revisionHistoryLimit: 3
  minReadySeconds: 10
  strategy:
    type: Recreate
  selector:
    matchLabels:
      name: airflow-151358
  template:
    metadata:
      creationTimestamp: null
      labels:
        name: airflow-151358
        runEnv: live
        io.kompose.service: airflow-151358
    spec:
      dnsPolicy: Default
      hostNetwork: false
      restartPolicy: Always
      volumes:
      - name: host-log-dir
        hostPath:
          path: /docker-logs/applogs/ops-airflow-docker
      - name: webserver-claim0
        hostPath:
          path: /home/airflow/dags
      - name: webserver-claim1
        hostPath:
          path: /home/airflow/logs
      - name: scheduler-claim0
        hostPath:
          path: /home/airflow/dags
      - name: scheduler-claim1
        hostPath:
          path: /home/airflow/logs
      - name: worker-claim0
        hostPath:
          path: /home/airflow/dags
      - name: worker-claim1
        hostPath:
          path: /home/airflow/logs
      containers:
      - name: webserver
        ports:
        - containerPort: 8080
        env:
        - name: DETONATOR_LOG_ON_CONSOLE
          value: "false"
        - name: DETONATOR_SIG_TERM_DELAY
          value: "10"
        - name: DETONATOR_COLORED_CONSOLE_OUTPUT
          value: "false"
        - name: DETONATOR_START_TIMEOUT
          value: "600"
        - name: DY_APP_LOG_DIR
          value: /home/www/logs/applogs/ops-airflow-docker
        - name: POD_NAME
          valueFrom:
            fieldRef:
              fieldPath: metadata.name
        - name: POD_IP
          valueFrom:
            fieldRef:
              fieldPath: status.podIP
        - name: NODE_IP
          valueFrom:
            fieldRef:
              fieldPath: status.hostIP
        - name: LOAD_EX
          value: '"y"'
        - name: FERNET_KEY
          value: aKSXskUMeVdRpOBOVsxWUFixZkRgMMCq6Z6_fdae6yo=
        - name: EXECUTOR
          value: Celery
        - name: containerPort
          value: "8080"
        resources:
          limits:
            cpu: "2"
            memory: 3Gi
          requests:
            cpu: "1"
            memory: 2Gi
        image: *******/ops-airflow-docker:20210617162117-master-e334b7b1
        volumeMounts:
        - name: host-log-dir
          mountPath: /home/www/logs/applogs/ops-airflow-docker
        - name: webserver-claim0
          mountPath: /opt/airflow/dags
        - name: webserver-claim1
          mountPath: /opt/airflow/logs
        args:
        - webserver
      - name: flower
        ports:
        - containerPort: 5555
        env:
        - name: DETONATOR_LOG_ON_CONSOLE
          value: "false"
        - name: DETONATOR_SIG_TERM_DELAY
          value: "10"
        - name: DETONATOR_COLORED_CONSOLE_OUTPUT
          value: "false"
        - name: DETONATOR_START_TIMEOUT
          value: "600"
        - name: DY_APP_LOG_DIR
          value: /home/www/logs/applogs/ops-airflow-docker
        - name: POD_NAME
          valueFrom:
            fieldRef:
              fieldPath: metadata.name
        - name: POD_IP
          valueFrom:
            fieldRef:
              fieldPath: status.podIP
        - name: NODE_IP
          valueFrom:
            fieldRef:
              fieldPath: status.hostIP
        - name: EXECUTOR
          value: Celery
        - name: FLOWER_PORT
          value: "5555"
        - name: containerPort
          value: "5555"
        resources:
          limits:
            cpu: "2"
            memory: 2Gi
          requests:
            cpu: "1"
            memory: 1Gi
        image: ***/ops-airflow-docker:20210617162117-master-e334b7b1
        volumeMounts:
        - name: host-log-dir
          mountPath: /home/www/logs/applogs/ops-airflow-docker
        args:
        - celery
        - flower
      - name: scheduler
        env:
        - name: DETONATOR_LOG_ON_CONSOLE
          value: "false"
        - name: DETONATOR_SIG_TERM_DELAY
          value: "10"
        - name: DETONATOR_COLORED_CONSOLE_OUTPUT
          value: "false"
        - name: DETONATOR_START_TIMEOUT
          value: "600"
        - name: DY_APP_LOG_DIR
          value: /home/www/logs/applogs/ops-airflow-docker
        - name: POD_NAME
          valueFrom:
            fieldRef:
              fieldPath: metadata.name
        - name: POD_IP
          valueFrom:
            fieldRef:
              fieldPath: status.podIP
        - name: NODE_IP
          valueFrom:
            fieldRef:
              fieldPath: status.hostIP
        - name: LOAD_EX
          value: '"y"'
        - name: FERNET_KEY
          value: aKSXskUMeVdRpOBOVsxWUFixZkRgMMCq6Z6_fdae6yo=
        - name: EXECUTOR
          value: Celery
        resources:
          limits:
            cpu: "2"
            memory: 2Gi
          requests:
            cpu: "1"
            memory: 1Gi
        image: ***/ops-airflow-docker:20210617162117-master-e334b7b1
        volumeMounts:
        - name: host-log-dir
          mountPath: /home/www/logs/applogs/ops-airflow-docker
        - name: scheduler-claim0
          mountPath: /opt/airflow/dags
        - name: scheduler-claim1
          mountPath: /opt/airflow/logs
        args:
        - scheduler
      - name: worker
        env:
        - name: DETONATOR_LOG_ON_CONSOLE
          value: "false"
        - name: DETONATOR_SIG_TERM_DELAY
          value: "10"
        - name: DETONATOR_COLORED_CONSOLE_OUTPUT
          value: "false"
        - name: DETONATOR_START_TIMEOUT
          value: "600"
        - name: DY_APP_LOG_DIR
          value: /home/www/logs/applogs/ops-airflow-docker
        - name: POD_NAME
          valueFrom:
            fieldRef:
              fieldPath: metadata.name
        - name: POD_IP
          valueFrom:
            fieldRef:
              fieldPath: status.podIP
        - name: NODE_IP
          valueFrom:
            fieldRef:
              fieldPath: status.hostIP
        - name: FERNET_KEY
          value: aKSXskUMeVdRpOBOVsxWUFixZkRgMMCq6Z6_fdae6yo=
        - name: EXECUTOR
          value: Celery
        resources:
          limits:
            cpu: "4"
            memory: 4Gi
          requests:
            cpu: "1"
            memory: 2Gi
        image: ***/ops-airflow-docker:20210617162117-master-e334b7b1
        volumeMounts:
        - name: host-log-dir
          mountPath: /home/www/logs/applogs/ops-airflow-docker
        - name: worker-claim0
          mountPath: /opt/airflow/dags
        - name: worker-claim1
          mountPath: /opt/airflow/logs
        args:
        - celery
        - worker
      imagePullSecrets:
      - name: dik-renge
      affinity:
        podAntiAffinity:
          preferredDuringSchedulingIgnoredDuringExecution:
          - weight: 100
            podAffinityTerm:
              labelSelector:
                matchExpressions:
                - key: name
                  operator: In
                  values:
                  - airflow-151358
              topologyKey: kubernetes.io/hostname
          requiredDuringSchedulingIgnoredDuringExecution:
          - labelSelector:
              matchExpressions:
              - key: name
                operator: In
                values:
                - airflow-151358
            topologyKey: kubernetes.io/hostname
        nodeAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            nodeSelectorTerms:
            - matchExpressions:
              - key: devops
                operator: In
                values:
                - enable
status: {}

I think it also caused me to access WebServer very slowly

I would like to know what causes the following error and how to solve it。

Thank you very much

@LChloe LChloe added the kind:bug This is a clearly a bug label Jun 17, 2021
@boring-cyborg
Copy link

boring-cyborg bot commented Jun 17, 2021

Thanks for opening your first issue here! Be sure to follow the issue template!

@turbaszek
Copy link
Member

@LChloe do you have installed all dependencies on webserver node? From logs it's visible that there are multiple missing. Is there anything in .err file generated by webserver?

@turbaszek turbaszek added the area:webserver Webserver related Issues label Jun 20, 2021
@LChloe
Copy link
Author

LChloe commented Jun 21, 2021

@LChloe do you have installed all dependencies on webserver node? From logs it's visible that there are multiple missing. Is there anything in .err file generated by webserver?

thank you for your reply.
I see the following two problems with the logs being launched:

  1. INFO�[0m - This service is instrumented using OpenTelemetry.OpenTelemetry could not be imported; pleaseadd opentelemetry-api and opentelemetry-instrumentationpackages in order to get BigQuery Tracing data
  2. WARNING�[0m - Exception when importing 'airflow.providers.google.common.hooks.leveldb.LevelDBHook' from 'apache-airflow-providers-google' package: No module named 'airflow.providers.google.common.hooks.leveldb

Question 1 is due to the version of Google-Cloud-BigQuery is 1.28.0. #13131
Question 2. I think he is not influential.#15451

Temporarily did not see some other error log output
I will try to modify the Google-Cloud-BigQuery version upgrade to 2.6.2 and update dependence on LevelDB

@fabianf92
Copy link

fabianf92 commented Jun 30, 2021

Hi, facing same issue on Airflow 2.0.1. Log is free from other error, but webserver shows
2021-06-30T07:37:57.599484978Z [2021-06-30 07:37:57 +0000] [21] [CRITICAL] WORKER TIMEOUT (pid:27) 2021-06-30T07:37:58.638426571Z [2021-06-30 07:37:58 +0000] [21] [WARNING] Worker with pid 27 was terminated due to signal 9
@LChloe any updates here?

@potiuk
Copy link
Member

potiuk commented Jun 30, 2021

I think you need to look in your Kubernetes logs. My guess is that some resources are missing and your gunicorn workers cannot start - but this is likely not an airflow problem, but a problem connected with your deployment. Examine your K8S logs. I heartily recommend K9S tool for that. it helps to analyse all-things-kubernetes, you will be able to easily browse logs generated by different part of the system and see the errors generated by kubernetes.

@fabianf92
Copy link

Hi @potiuk my webserver log doesn't contain any errors.
I am not running on Kubernetes, but on Azure App Service. Service is running fine most of the time.
After some time I am getting 2021-06-30T11:45:57.188369769Z [2021-06-30 11:45:57 +0000] [23] [CRITICAL] WORKER TIMEOUT (pid:30) 2021-06-30T11:45:58.201109676Z
Then the GUI is not reachable and I have to wait for 5 minutes until it recovered automatically.
I didn't face this issue in Airflow 2.0.0, but after migrating to 2.0.1.
Already updated the parameters web_server_master_timeout and web_server_worker_timeout to 300.

@potiuk
Copy link
Member

potiuk commented Jun 30, 2021

This is again, very hard to say without seeing azure service app logs. I am sure there must be some additional logs for the app service managing Airflow. Typically those kinds of problems occur with some resources not being available - connections, memory, disk, DNS querying service being slow, networking configuration, firewalls etc. All those might cause such result. Since there are no logs in Airflow about some errors or warnings, i think one of those might be the culprit.

I think the only way to be able to help is to take a look at the app service logs and try to see what kind of resources are problematic. There is no known problem in 2.0.1 that i am aware of that is an obvious candidate that could cause it.

Sorry I cannot help but some more digging from your side is needed. We need to be able to see the logs to be able to diagnose it. Unfortunately i know nothing about the Azure App Service to be able to tell you how to do it but there must be a way to review and diagnose all those various resources there as they are manged by the deployment service, not airflow.

@LChloe
Copy link
Author

LChloe commented Jun 30, 2021

Hi, facing same issue on Airflow 2.0.1. Log is free from other error, but webserver shows
2021-06-30T07:37:57.599484978Z [2021-06-30 07:37:57 +0000] [21] [CRITICAL] WORKER TIMEOUT (pid:27) 2021-06-30T07:37:58.638426571Z [2021-06-30 07:37:58 +0000] [21] [WARNING] Worker with pid 27 was terminated due to signal 9
@LChloe any updates here?
I am sorry that I can't help you, I have been busy with other work recently, so I have not dealt with this problem.

@LChloe
Copy link
Author

LChloe commented Jun 30, 2021

I think you need to look in your Kubernetes logs. My guess is that some resources are missing and your gunicorn workers cannot start - but this is likely not an airflow problem, but a problem connected with your deployment. Examine your K8S logs. I heartily recommend K9S tool for that. it helps to analyse all-things-kubernetes, you will be able to easily browse logs generated by different part of the system and see the errors generated by kubernetes.

Thank you for your suggestion, I think so, because using Docker Compose is normal. But I didn't see the K8S error log, I will try to use the K9S Tool you recommend.

@apache apache locked and limited conversation to collaborators Oct 29, 2021
@eladkal eladkal closed this as completed Oct 29, 2021

This issue was moved to a discussion.

You can continue the conversation there. Go to discussion →

Labels
area:webserver Webserver related Issues kind:bug This is a clearly a bug
Projects
None yet
Development

No branches or pull requests

5 participants