Operator handler shows connection reset by peer or sometimes just hangs and doesn't respond #204
Comments
@amolkavitkar Sorry for the delayed answer.

First of all, I see Python 3.5.2 mentioned. Kopf uses some features of Python 3.7, which did not exist earlier.

Second, can you please provide the Kopf version used, and a sample code with this issue reproduced (with the proprietary and unnecessary code removed)? Without the code, it is unclear what is happening and why, and what can cause the problems. In the stack trace, I see that you call the synchronous Kubernetes client (create_namespaced_service) from inside the handler.

Third, it is worth mentioning that the API requests for creation of the custom resources are, to my knowledge, non-blocking: k8s creates the resources and finishes the API request, and starts the handling in the operators after that; i.e., the operators cannot block the resource creation. If it is blocked, it is for some other reason (an assumption).

Also, some information on the scale of the operator & cluster would be helpful to reproduce it: how many handled resources do you have (just the order of magnitude: 1s, 10s, 100s, 1000s)? And how often do they change (times per second/minute/hour, also a rough approximation)? High-scale operators can have their own issues with connection- & thread-pools, which have to be taken into account. I would appreciate it if this information were provided here in the issue, so that I could reproduce it or guess what happens internally.

One of the mentioned effects, or the one that looks like it (operators not exiting properly), has been known for some time, is well described in #152, and is going to be fixed with asyncio/aiohttp for all HTTPS operations (drafted in #176). This happens with any sync HTTPS client libraries inside of an async app, so there is no good way to solve it now. The proper aiohttp way requires a few hops to implement prerequisite changes as separate PRs (in focus right now). Currently, it is "kind-of-solved" by Kubernetes using SIGKILL on the pod after it does not exit properly with SIGTERM.
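For illustration, here is a minimal sketch of the handler pattern visible in the traceback at the bottom of this issue (handlers.py, create_fn): a blocking call to the official Kubernetes client made from inside a Kopf creation handler. The resource group/version/plural and the service body are assumptions, not taken from the report:

import kopf
import kubernetes

@kopf.on.create('example.com', 'v1', 'examples')  # group/version/plural are assumed
def create_fn(namespace, spec, **kwargs):
    # Sync handlers are executed by Kopf in a thread-pool executor,
    # so this blocking HTTPS call occupies one worker thread.
    v1_client = kubernetes.client.CoreV1Api()
    svc_body = {
        "apiVersion": "v1",
        "kind": "Service",
        "metadata": {"name": "example-svc"},
        "spec": {"ports": [{"port": 80}]},  # stands in for SVC_NOSELECTOR_TEMPLATE
    }
    # If the underlying socket dies silently here, the thread hangs
    # until the connection is reset or a timeout fires (if any is set).
    v1_client.create_namespaced_service(namespace, svc_body)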
We've also been having the same problem with our deployments, using kopf v0.20, although we are not experiencing any exceptions or dropped connections. In our case the pod simply stops receiving updates from the Kubernetes API: nothing is sent from the API when a custom resource is created or updated, and logs are no longer emitted.

Upon further investigation we've discovered that the connection between the Kubernetes API and the operator is still in the 'ESTABLISHED' state, and there are no clear issues in the logs which would suggest that the API and the operator can no longer communicate. However, there is no longer any data being exchanged on the connection, as the octet counters for the kopf process remain steady (according to /proc/<pid>/net/netstat).

I'd like to get you some sample code so you can investigate this further. However, since we need this working ASAP, I've developed a liveness check based on my investigations, which checks that there is some kind of communication occurring between the API and the operator:

liveness-probe.sh -- to be added to the operator container:

#!/bin/sh
# Files storing previous data values for comparison
# You may want to change these file depending on frequency of /tmp deletion and frequency of liveness poll probing
LAST_DATA_FILE=/tmp/last_data
LAST_TIME_FILE=/tmp/last_time
# Process ID of kopf within container PID namespace (probably 1)
PID=1
get_data() {
# Field 8 of the IpExt values row is InOctets, i.e. bytes received by the host
CURRENT_DATA=`cat /proc/$PID/net/netstat | grep 'IpExt' | tail -n 1 | cut -d ' ' -f8`
}
record_data() {
echo $CURRENT_DATA > $LAST_DATA_FILE
}
record_time() {
date > $LAST_TIME_FILE
}
get_data
if [ ! -e $LAST_DATA_FILE ]; then
record_data
record_time
exit 0
fi
LAST_DATA=`cat $LAST_DATA_FILE`
LAST_TIME=`cat $LAST_TIME_FILE`
# provide some output for display in pod events
echo "previous = $LAST_DATA, current = $CURRENT_DATA, last received time = $LAST_TIME"
if [ ! $CURRENT_DATA -gt $LAST_DATA ]; then
# not greater implies it is equal (no new data); record the data but not the last-received time
record_data
exit 1
fi
record_data
record_time
# exit explicitly
exit 0

Snippet of the liveness probe:

livenessProbe:
  exec:
    command:
    - /bin/sh
    - /srv/liveness-probe.sh
  initialDelaySeconds: 60
  periodSeconds: 60
  failureThreshold: 4
  successThreshold: 1
  timeoutSeconds: 1

The code works well for us. These liveness probe values also work well, but they can mean there are 'unhealthy' events in the operator, since the period is shorter than the interval between the communications I've seen between the operator and the API. The period can be bumped to 4 minutes and the failure threshold reduced to 1 to clean up your events as well.
I also observed this strange behaviour once (accidentally): the operator didn't see the custom resource created. Surprisingly, it saw other resources and reacted to them. But I couldn't catch or reproduce this situation after the restart.

Meanwhile, there is a new pre-release. Among many other changes, this version fully switches Kopf to asyncio/aiohttp; previously, there were synchronous calls in the I/O layer. It is highly probable that the issue will not reappear, since the whole I/O machinery is now changed.

I'll be looking for this case more carefully. Please do so too, and report if it does reappear. I will keep this issue open for a few weeks, but then will close it if everything goes smoothly.

Also (though not directly related to the reported issue), there is now a liveness endpoint possible with the new version.
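A sketch of how such a liveness endpoint can be wired up, assuming the --liveness CLI option and the kopf.on.probe decorator of recent Kopf versions are available in the release you run (check the docs of your exact version):

import datetime

import kopf

@kopf.on.probe(id='now')
def get_current_timestamp(**kwargs):
    # Whatever is returned here is included in the liveness endpoint's response.
    return datetime.datetime.utcnow().isoformat()

The operator would then be started with the endpoint enabled, e.g. kopf run --liveness=http://0.0.0.0:8080/healthz handlers.py, and the pod's livenessProbe could be switched from the exec script above to an httpGet probe on port 8080, path /healthz.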
@nolar We still experience something very similar with the latest kopf release (minus the connection-reset-by-peer part).

Platform: Ubuntu 18.04

After some netstatting and tcpdumping, it seems that the TCP connection is there but the HTTP one is gone, and the controller does not notice. The current suspect is this line:

Line 204 in 51a3a70
While I understand why it may be preferable to do so for such watch requests, it seems that (bad?) networking conditions or blips can cause such hangups when no strict timeout is set. Our preliminary testing shows that setting a specific timeout in this place does not disturb normal controller operations, but in case of a lost connection the controller crashes and is restarted by K8s, usually alleviating the problem. I'll push a PR to make this timeout configurable (with the current default behaviour preserved). A rough sketch of the idea is below.
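For illustration only (this is not Kopf's actual code, and the URL and values are assumptions): an aiohttp watch-style request can be given a socket-read timeout, so that a silently dead connection raises an error instead of hanging forever:

import asyncio

import aiohttp

# Assumed in-cluster watch URL; the real one is built by Kopf internally.
WATCH_URL = 'https://kubernetes.default.svc/apis/example.com/v1/examples?watch=true'

async def watch_with_read_timeout() -> None:
    # total=None keeps the long-lived watch open indefinitely;
    # sock_read=300 aborts the request if no bytes arrive for 5 minutes.
    timeout = aiohttp.ClientTimeout(total=None, sock_read=300)
    async with aiohttp.ClientSession(timeout=timeout) as session:
        async with session.get(WATCH_URL) as response:
            async for line in response.content:
                print(line)  # each line is one watch event (JSON)

# asyncio.run(watch_with_read_timeout())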
I too saw a connection reset by peer today, which caused the thread to hang. Here was the output:
Expected Behavior
The pod should respond immediately
Actual Behavior
When the operator pod has been running for a long time and we send a create request for a custom resource, it just hangs and doesn't respond; we have to restart the pod, and then it comes back alive.
Sometimes we also get connection reset by peer in the middle of creating resources.
Steps to Reproduce the Problem
Note (sometimes it gives connection reset by peer)
Specifications
Kubernetes version (kubectl version): 1.15
Python version (python --version): 3.5.2
Python packages installed (pip freeze --all):

Connection reset by peer traceback:
Traceback (most recent call last):
File "/usr/local/lib/python3.7/site-packages/kopf/reactor/handling.py", line 387, in _execute
lifecycle=lifecycle, # just a default for the sub-handlers, not used directly.
File "/usr/local/lib/python3.7/site-packages/kopf/reactor/handling.py", line 478, in _call_handler
**kwargs,
File "/usr/local/lib/python3.7/site-packages/kopf/reactor/invocation.py", line 79, in invoke
result = await loop.run_in_executor(config.WorkersConfig.get_syn_executor(), real_fn)
File "/usr/local/lib/python3.7/concurrent/futures/thread.py", line 57, in run
result = self.fn(*self.args, **self.kwargs)
File "/handlers.py", line 159, in create_fn
obj = v1_client.create_namespaced_service(namespace, SVC_NOSELECTOR_TEMPLATE)
File "/usr/local/lib/python3.7/site-packages/kubernetes/client/apis/core_v1_api.py", line 6934, in create_namespaced_service
(data) = self.create_namespaced_service_with_http_info(namespace, body, **kwargs)
File "/usr/local/lib/python3.7/site-packages/kubernetes/client/apis/core_v1_api.py", line 7025, in create_namespaced_service_with_http_info
collection_formats=collection_formats)
File "/usr/local/lib/python3.7/site-packages/kubernetes/client/api_client.py", line 334, in call_api
_return_http_data_only, collection_formats, _preload_content, _request_timeout)
File "/usr/local/lib/python3.7/site-packages/kubernetes/client/api_client.py", line 168, in __call_api
_request_timeout=_request_timeout)
File "/usr/local/lib/python3.7/site-packages/kubernetes/client/api_client.py", line 377, in request
body=body)
File "/usr/local/lib/python3.7/site-packages/kubernetes/client/rest.py", line 266, in POST
body=body)
File "/usr/local/lib/python3.7/site-packages/kubernetes/client/rest.py", line 166, in request
headers=headers)
File "/usr/local/lib/python3.7/site-packages/urllib3/request.py", line 72, in request
**urlopen_kw)
File "/usr/local/lib/python3.7/site-packages/urllib3/request.py", line 150, in request_encode_body
return self.urlopen(method, url, **extra_kw)
File "/usr/local/lib/python3.7/site-packages/urllib3/poolmanager.py", line 326, in urlopen
response = conn.urlopen(method, u.request_uri, **kw)
File "/usr/local/lib/python3.7/site-packages/urllib3/connectionpool.py", line 641, in urlopen
_stacktrace=sys.exc_info()[2])
File "/usr/local/lib/python3.7/site-packages/urllib3/util/retry.py", line 368, in increment
raise six.reraise(type(error), error, _stacktrace)
File "/usr/local/lib/python3.7/site-packages/urllib3/packages/six.py", line 685, in reraise
raise value.with_traceback(tb)
File "/usr/local/lib/python3.7/site-packages/urllib3/connectionpool.py", line 603, in urlopen
chunked=chunked)
File "/usr/local/lib/python3.7/site-packages/urllib3/connectionpool.py", line 344, in _make_request
self._validate_conn(conn)
File "/usr/local/lib/python3.7/site-packages/urllib3/connectionpool.py", line 843, in validate_conn
conn.connect()
File "/usr/local/lib/python3.7/site-packages/urllib3/connection.py", line 370, in connect
ssl_context=context)
File "/usr/local/lib/python3.7/site-packages/urllib3/util/ssl.py", line 368, in ssl_wrap_socket
return context.wrap_socket(sock)
File "/usr/local/lib/python3.7/ssl.py", line 423, in wrap_socket
session=session
File "/usr/local/lib/python3.7/ssl.py", line 870, in _create
self.do_handshake()
File "/usr/local/lib/python3.7/ssl.py", line 1139, in do_handshake
self._sslobj.do_handshake()
urllib3.exceptions.ProtocolError: ('Connection aborted.', ConnectionResetError(104, 'Connection reset by peer'))