
In case of sync protocol such as OPCUA / Modbus OIBus becomes a single point of failure. How to mitigate that? #171

Open
marouanehassanioptimistik opened this issue May 15, 2019 · 7 comments

@marouanehassanioptimistik
Contributor

No description provided.

@marouanehassanioptimistik marouanehassanioptimistik changed the title In case of sync protocol OPCUA / Modbus OIBus becomes a single point of failure. How to mitigate that? In case of sync protocol such as OPCUA / Modbus OIBus becomes a single point of failure. How to mitigate that? May 15, 2019
@jfhenon jfhenon added this to the future milestone May 15, 2019
@kukukk
Contributor

kukukk commented May 22, 2019

This may not be accurate, but this is how I see it.

SPOF can be caused by 2 things:

  • Problem with the application
  • Problem with the hardware running the application

Problem with the application
It can be caused by a crash, but an update procedure can also cause an outage. You have to decide what kind of uptime you want to guarantee, and whether an outage for an app update is acceptable or not. We could think about a solution to monitor the application and send a notification when it's down, so the IT administrator gets notified about the problem and can intervene as soon as possible. Maybe offer some solution to integrate into company-wide monitoring applications. We could also think about a patching solution for upgrading the application, so it doesn't have to be stopped, uninstalled, reinstalled and restarted, which would reduce the outage.
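
To make the monitoring idea concrete, here is a minimal sketch of a health endpoint that a company-wide monitoring tool could poll; the port and the isSouthConnected() probe are hypothetical, not an existing OIBus API:

```ts
// Minimal health endpoint sketch (hypothetical, not an existing OIBus API).
import { createServer } from "node:http";

// Hypothetical probe: report whether data collection currently looks healthy.
function isSouthConnected(): boolean {
  return true; // placeholder for a real connector status check
}

const server = createServer((req, res) => {
  if (req.url === "/health") {
    const healthy = isSouthConnected();
    res.writeHead(healthy ? 200 : 503, { "Content-Type": "application/json" });
    res.end(JSON.stringify({ status: healthy ? "up" : "down", at: new Date().toISOString() }));
    return;
  }
  res.writeHead(404);
  res.end();
});

server.listen(2223, () => console.log("health endpoint listening on :2223"));
```

A monitoring system polling this URL would then alert the IT administrator as soon as the instance stops answering.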

Problem with the hardware running the application
To avoid this problem we should implement a redundancy solution, where 2 OIBus instances run on 2 different machines. They could continuously monitor each other, and when the primary instance is down, the secondary could replace it.
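
As a rough sketch of what the mutual monitoring could look like on the backup side, assuming the primary exposes a health endpoint like the one above (the URL, thresholds and hooks are all assumptions):

```ts
// Backup instance polling the primary's (hypothetical) health endpoint and
// taking over data collection after a few missed heartbeats.
const PRIMARY_HEALTH_URL = "http://primary-host:2223/health"; // assumption
const MAX_MISSES = 3;
const INTERVAL_MS = 5000;

let missedHeartbeats = 0;
let actingAsPrimary = false;

async function checkPrimary(): Promise<void> {
  try {
    const res = await fetch(PRIMARY_HEALTH_URL, { signal: AbortSignal.timeout(2000) });
    if (!res.ok) throw new Error(`status ${res.status}`);
    missedHeartbeats = 0;
    if (actingAsPrimary) {
      actingAsPrimary = false;
      console.log("primary is back, returning to standby");
      // stopDataCollection(); // hypothetical hook
    }
  } catch {
    missedHeartbeats += 1;
    if (missedHeartbeats >= MAX_MISSES && !actingAsPrimary) {
      actingAsPrimary = true;
      console.log("primary considered down, taking over data collection");
      // startDataCollection(); // hypothetical hook
    }
  }
}

setInterval(checkPrimary, INTERVAL_MS);
```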

@jfhenon
Collaborator

jfhenon commented Apr 26, 2020

osisoft doc: Hot, warm, and cold failover modes
The failover mode specifies how the backup interface instance handles connecting to a data source and adding points when failover occurs. The sooner the backup interface can take over data collection, the less data is lost. However, increasing the failover level also increases data source load and system resource usage.

To determine which mode to use, consider how much data you can afford to lose and how much workload your system can handle. Be prepared to experiment, and consult your data source documentation and vendor as needed.

UniInt provides three levels of failover: cold, warm and hot. Higher ("hotter") levels preserve more data in the event of failover, but impose increasing workload on the system.

Hot failover
Hot failover is the most resource-intensive mode. Both the primary and backup interface instances collect data. No data is lost during failover (unless both the primary and backup interface nodes fail together), but the data source carries a double workload.

Warm failover
In a warm failover configuration, the backup interface does not actively collect data. The backup interface loads the list of PI points and waits to collect data until the primary interface fails or stops collecting data for any reason. If the backup interface assumes the role of primary, it starts collecting data. Some data loss can occur in a warm failover configuration.

Cold failover
In cold failover, the backup instance does not connect with the data source or load the list of PI points until it becomes primary. This delay almost always causes some data loss but imposes no additional load on the data source. Cold failover is required for the following cases:
  • A data source can support only one client.
  • You are using redundant data sources and the backup data source cannot accept connections.
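
(Relating these modes to an OIBus-style South connector, a rough sketch of what a backup instance would do before takeover; the FailoverMode type and the connector interface are hypothetical, not OSIsoft or OIBus APIs.)

```ts
// Rough sketch: what a backup instance does per failover mode (hypothetical types).
type FailoverMode = "hot" | "warm" | "cold";

interface SouthConnector {
  connect(): Promise<void>;      // connect to the data source
  loadPoints(): Promise<void>;   // load the list of points to collect
  collect(): Promise<unknown[]>; // read data from the source
}

async function prepareBackup(mode: FailoverMode, south: SouthConnector): Promise<void> {
  switch (mode) {
    case "hot":
      // Connect, load points and start collecting in parallel with the primary:
      // no data loss on failover, but the source carries a double workload.
      await south.connect();
      await south.loadPoints();
      await south.collect();
      break;
    case "warm":
      // Connect and load points, but only start collecting after takeover.
      await south.connect();
      await south.loadPoints();
      break;
    case "cold":
      // Do nothing until takeover; required when the source accepts a single client.
      break;
  }
}
```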

@kukukk
Contributor

kukukk commented Apr 26, 2020

Since we rewrote OPCUA, it is no longer a SPOF. However, we still have Modbus and MQTT.

These failover modes require the redundancy solution I mentioned in the second point of my earlier comment (for hardware failure).

Do you have any specific requirements from clients?

@jfhenon
Collaborator

jfhenon commented Apr 26, 2020

No, I've started thinking about this and collecting ideas, but we haven't decided to work on it yet.

@kukukk
Contributor

kukukk commented Apr 26, 2020

If I remember correctly, the OPC HDA server at the client had a limitation on the number of clients (if you killed OIBus, you had to wait a few minutes to be able to connect to it again). So cold failover could be a real use case.

The failover will require continuous interaction between the master and backup instances. I think we can implement this interaction in such a way that it supports all 3 failover types:

  • hot failover: we call connect() and onScan() to get the data and check at the end of onScan whether the master is alive. If it is alive, do nothing with the data. If it is not alive, send the data.
  • warm failover: we call connect() and onScan(), but we check for master at the beginning of onScan. If master is alive, we do nothing. If it is not, we read the data.
  • cold failover: we only call connect() and subsequent onScan() if master is not alive.

It may require a small refactor of some South implementations to properly follow the same flow: connect to the target server in connect() and get the data in onScan().

It also requires synchronization between the master and the backup. I'm thinking about the lastCompletedAt value, but there may be other information too.
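
A rough sketch of that flow, assuming a hypothetical isMasterAlive() heartbeat check; all method names besides connect() and onScan() are illustrative placeholders:

```ts
// Sketch of a backup South handling all three failover modes around onScan(),
// following the flow described above.
type FailoverMode = "hot" | "warm" | "cold";

class BackupSouth {
  private connected = false;

  constructor(private mode: FailoverMode) {}

  async connect(): Promise<void> {
    // For hot/warm we connect up front; for cold only once the master is down.
    if (this.mode !== "cold") {
      await this.openSourceConnection();
      this.connected = true;
    }
  }

  async onScan(): Promise<void> {
    if (this.mode === "cold") {
      if (await this.isMasterAlive()) return; // stay idle while the master runs
      if (!this.connected) {
        await this.openSourceConnection();    // late connect, some data loss expected
        this.connected = true;
      }
      await this.sendToNorth(await this.readData());
      return;
    }

    if (this.mode === "warm") {
      if (await this.isMasterAlive()) return; // connected, but not reading yet
      await this.sendToNorth(await this.readData());
      return;
    }

    // hot: always read, only forward the data when the master is down
    const data = await this.readData();
    if (!(await this.isMasterAlive())) await this.sendToNorth(data);
  }

  // --- hypothetical placeholders -------------------------------------------
  private async openSourceConnection(): Promise<void> {}
  private async isMasterAlive(): Promise<boolean> { return true; }
  private async readData(): Promise<unknown[]> { return []; }
  private async sendToNorth(_data: unknown[]): Promise<void> {}
}
```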

@jfhenon jfhenon modified the milestones: future, 0.7 Jun 29, 2020
@kukukk
Contributor

kukukk commented Jul 30, 2020

Any decision regarding this issue?

@jfhenon
Collaborator

jfhenon commented Jul 30, 2020

Not yet. We're waiting for a customer case before committing to this. In the meantime, we should add some additional tests to the backend.

@jfhenon jfhenon added the enhancement New feature or request label Aug 12, 2020
@burgerni10 burgerni10 modified the milestones: 1.1.0, 1.2.0 Jan 20, 2021
@burgerni10 burgerni10 self-assigned this Jun 3, 2021
@burgerni10 burgerni10 added bug Something isn't working priority:medium labels Jun 3, 2021
@burgerni10 burgerni10 modified the milestones: 2.0.0, 2.2.0 Jan 6, 2022
@burgerni10 burgerni10 modified the milestones: 2.2.0, future Sep 19, 2022
@burgerni10 burgerni10 modified the milestones: future, v3.X Mar 17, 2023
@burgerni10 burgerni10 modified the milestones: v3.5, future Jul 31, 2024