
In case of sync protocol such as OPCUA / Modbus OIBus becomes a single point of failure. How to mitigate that? #171

Open
marouanehassanioptimistik opened this issue May 15, 2019 · 7 comments

@marouanehassanioptimistik
Contributor

No description provided.

@marouanehassanioptimistik marouanehassanioptimistik changed the title In case of sync protocol OPCUA / Modbus OIBus becomes a single point of failure. How to mitigate that? In case of sync protocol such as OPCUA / Modbus OIBus becomes a single point of failure. How to mitigate that? May 15, 2019
@jfhenon jfhenon added this to the future milestone May 15, 2019
@kukukk
Contributor

kukukk commented May 22, 2019

This may not be accurate, but this is how I see it.

SPOF can be caused by 2 things:

  • Problem with the application
  • Problem with the hardware running the application

Problem with the application
It can be caused by a crash, but an update procedure can also cause an outage. You have to decide what kind of uptime you want to guarantee, and whether an outage for an app update is acceptable or not. We could think about a solution to monitor the application and send a notification when it's down, so the IT administrator gets notified about the problem and can intervene as soon as possible. Maybe offer some solution to integrate into company-wide monitoring applications. We could also think about a patching solution for upgrading the application, so it doesn't have to be stopped, uninstalled, reinstalled and restarted, which would reduce the outage.
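
To make the monitoring idea concrete, here is a minimal sketch of a health endpoint that a company-wide monitoring tool could poll; the port and the isSouthConnected() probe are hypothetical, not an existing OIBus API:

```ts
// Minimal health endpoint sketch (hypothetical, not an existing OIBus API).
import { createServer } from "node:http";

// Hypothetical probe: report whether data collection currently looks healthy.
function isSouthConnected(): boolean {
  return true; // placeholder for a real connector status check
}

const server = createServer((req, res) => {
  if (req.url === "/health") {
    const healthy = isSouthConnected();
    res.writeHead(healthy ? 200 : 503, { "Content-Type": "application/json" });
    res.end(JSON.stringify({ status: healthy ? "up" : "down", at: new Date().toISOString() }));
    return;
  }
  res.writeHead(404);
  res.end();
});

server.listen(2223, () => console.log("health endpoint listening on :2223"));
```

A monitoring system polling this URL would then alert the IT administrator as soon as the instance stops answering.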

Problem with the hardware running the application
To avoid this problem we should implement a redundancy solution, where 2 OIBus instances run on 2 different machines. They could continuously monitor each other, and when the primary instance is down, the secondary could replace it.
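
As a rough sketch of what the mutual monitoring could look like on the backup side, assuming the primary exposes a health endpoint like the one above (the URL, thresholds and hooks are all assumptions):

```ts
// Backup instance polling the primary's (hypothetical) health endpoint and
// taking over data collection after a few missed heartbeats.
const PRIMARY_HEALTH_URL = "http://primary-host:2223/health"; // assumption
const MAX_MISSES = 3;
const INTERVAL_MS = 5000;

let missedHeartbeats = 0;
let actingAsPrimary = false;

async function checkPrimary(): Promise<void> {
  try {
    const res = await fetch(PRIMARY_HEALTH_URL, { signal: AbortSignal.timeout(2000) });
    if (!res.ok) throw new Error(`status ${res.status}`);
    missedHeartbeats = 0;
    if (actingAsPrimary) {
      actingAsPrimary = false;
      console.log("primary is back, returning to standby");
      // stopDataCollection(); // hypothetical hook
    }
  } catch {
    missedHeartbeats += 1;
    if (missedHeartbeats >= MAX_MISSES && !actingAsPrimary) {
      actingAsPrimary = true;
      console.log("primary considered down, taking over data collection");
      // startDataCollection(); // hypothetical hook
    }
  }
}

setInterval(checkPrimary, INTERVAL_MS);
```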

@jfhenon
Collaborator

jfhenon commented Apr 26, 2020

osisoft doc: Hot, warm, and cold failover modes
The failover mode specifies how the backup interface instance handles connecting to a data source and adding points when failover occurs. The sooner the backup interface can take over data collection, the less data is lost. However, increasing the failover level also increases data source load and system resource usage.

To determine which mode to use, consider how much data you can afford to lose and how much workload your system can handle. Be prepared to experiment, and consult your data source documentation and vendor as needed.

UniInt provides three levels of failover: cold, warm and hot. Higher ("hotter") levels preserve more data in the event of failover, but impose increasing workload on the system.

Hot failover
Hot failover is the most resource-intensive mode. Both the primary and backup interface instances collect data. No data is lost during failover (unless both the primary and backup interface nodes fail together), but the data source carries a double workload.

Warm failover
In a warm failover configuration, the backup interface does not actively collect data. The backup interface loads the list of PI points and waits to collect data until the primary interface fails or stops collecting data for any reason. If the backup interface assumes the role of primary, it starts collecting data. Some data loss can occur in a warm failover configuration.

Cold failover
In cold failover, the backup instance does not connect with the data source or load the list of PI points until it becomes primary. This delay almost always causes some data loss but imposes no additional load on the data source. Cold failover is required for the following cases:
  • A data source can support only one client.
  • You are using redundant data sources and the backup data source cannot accept connections.
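
(Relating these modes to an OIBus-style South connector, a rough sketch of what a backup instance would do before takeover; the FailoverMode type and the connector interface are hypothetical, not OSIsoft or OIBus APIs.)

```ts
// Rough sketch: what a backup instance does per failover mode (hypothetical types).
type FailoverMode = "hot" | "warm" | "cold";

interface SouthConnector {
  connect(): Promise<void>;      // connect to the data source
  loadPoints(): Promise<void>;   // load the list of points to collect
  collect(): Promise<unknown[]>; // read data from the source
}

async function prepareBackup(mode: FailoverMode, south: SouthConnector): Promise<void> {
  switch (mode) {
    case "hot":
      // Connect, load points and start collecting in parallel with the primary:
      // no data loss on failover, but the source carries a double workload.
      await south.connect();
      await south.loadPoints();
      await south.collect();
      break;
    case "warm":
      // Connect and load points, but only start collecting after takeover.
      await south.connect();
      await south.loadPoints();
      break;
    case "cold":
      // Do nothing until takeover; required when the source accepts a single client.
      break;
  }
}
```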

@kukukk
Contributor

kukukk commented Apr 26, 2020

Since we rewrote OPCUA, it is no longer a SPOF. However, we still have Modbus and MQTT.

These failover modes require the redundancy solution I mentioned in the second point of my earlier comment (for hardware failure).

Do you have any specific requirements from clients?

@jfhenon
Collaborator

jfhenon commented Apr 26, 2020

No, I've started thinking about this and collecting ideas, but we haven't decided to work on it yet.

@kukukk
Contributor

kukukk commented Apr 26, 2020

If I remember correctly, the OPC HDA server at the client had a limitation on the number of clients (if you killed OIBus, you had to wait a few minutes to be able to connect to it again). So cold failover could be a real use case.

The failover will require continuous interaction between the master and backup instances. I think we can implement this interaction in such a way that it supports all 3 failover types:

  • hot failover: we call connect() and onScan() to get the data and check at the end of onScan whether the master is alive. If it is alive, do nothing with the data. If it is not alive, send the data.
  • warm failover: we call connect() and onScan(), but we check for master at the beginning of onScan. If master is alive, we do nothing. If it is not, we read the data.
  • cold failover: we only call connect() and subsequent onScan() if master is not alive.

It may require a small refactor of some South implementations to properly follow the same flow: connect to the target server in connect() and get the data in onScan().

It also requires synchronization between the master and the backup. I'm thinking about the lastCompletedAt value, but there may be other information too.
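
A rough sketch of that flow, assuming a hypothetical isMasterAlive() heartbeat check; all method names besides connect() and onScan() are illustrative placeholders:

```ts
// Sketch of a backup South handling all three failover modes around onScan(),
// following the flow described above.
type FailoverMode = "hot" | "warm" | "cold";

class BackupSouth {
  private connected = false;

  constructor(private mode: FailoverMode) {}

  async connect(): Promise<void> {
    // For hot/warm we connect up front; for cold only once the master is down.
    if (this.mode !== "cold") {
      await this.openSourceConnection();
      this.connected = true;
    }
  }

  async onScan(): Promise<void> {
    if (this.mode === "cold") {
      if (await this.isMasterAlive()) return; // stay idle while the master runs
      if (!this.connected) {
        await this.openSourceConnection();    // late connect, some data loss expected
        this.connected = true;
      }
      await this.sendToNorth(await this.readData());
      return;
    }

    if (this.mode === "warm") {
      if (await this.isMasterAlive()) return; // connected, but not reading yet
      await this.sendToNorth(await this.readData());
      return;
    }

    // hot: always read, only forward the data when the master is down
    const data = await this.readData();
    if (!(await this.isMasterAlive())) await this.sendToNorth(data);
  }

  // --- hypothetical placeholders -------------------------------------------
  private async openSourceConnection(): Promise<void> {}
  private async isMasterAlive(): Promise<boolean> { return true; }
  private async readData(): Promise<unknown[]> { return []; }
  private async sendToNorth(_data: unknown[]): Promise<void> {}
}
```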

@jfhenon jfhenon modified the milestones: future, 0.7 Jun 29, 2020
@kukukk
Contributor

kukukk commented Jul 30, 2020

Any decision regarding this issue?

@jfhenon
Collaborator

jfhenon commented Jul 30, 2020

Not yet. We're waiting for a customer case before committing to this. In the meantime, we should add some additional tests to the backend.

@jfhenon jfhenon added the enhancement New feature or request label Aug 12, 2020
@burgerni10 burgerni10 modified the milestones: 1.1.0, 1.2.0 Jan 20, 2021
@burgerni10 burgerni10 self-assigned this Jun 3, 2021
@burgerni10 burgerni10 added bug Something isn't working priority:medium labels Jun 3, 2021
@burgerni10 burgerni10 modified the milestones: 2.0.0, 2.2.0 Jan 6, 2022
@burgerni10 burgerni10 modified the milestones: 2.2.0, future Sep 19, 2022
@burgerni10 burgerni10 modified the milestones: future, v3.X Mar 17, 2023
@burgerni10 burgerni10 modified the milestones: v3.5, future Jul 31, 2024