Skip to content
jonmeredith edited this page Sep 30, 2014 · 1 revision

Brief History

  • v1 written by Andy in a couple of days.

  • v2 written by Andy in a couple of week, patched up onsite at Comcast by me & Dizzy, really patched up by Andrew Thompson.

  • v3 written by me, CT, DP under extreme pressure for iOS6 release. Patched up by CMJ, Sparrow and others since.

Current Issues

Fullsync Issues

  • Fullsync time is proportional to data size, not size of differences.

  • The ring sizes are required to be the same on source and sink clusters, this limits deployment options.

  • Fullsync is unidirectional, the process needs to complete both ways.

  • Synchronization is vnode-based, so WAN bandwidth is used for Nx synchronization.

  • Keylist fullsync strategy is resource hungry and slow. Regardless of the delta, a full scan of all bkeys needs to be performed before it is determined.

  • AAE fullsync strategy is less resource hungry, but even slower as it doesn't pipeline requests so suffers from large RTTs.

  • Disabling fullsync at source does not prevent sink reconnecting and continuing.

  • [Kresten] Bloom based fullsync causes full partition scan for small differences.

  • [Kresten] Failure during the keylist/bloom building process makes no progress towards sync. Only once those two time consuming stages complete are differences exchanged.

  • [Kresten] There may be some unnecessary AAE tree locking?

Networking Issues

  • Connection manager, Service manager, Cluster manager all have code quality issues.

    • Code is nothing recognizable as OTP.

    • Supervision structure handles crashes poorly - better now, but still broken for proxy get.

    • Unestablished connection attempts do not get cancelled.

  • Proxy get code is hard to follow/read

    • Proxy get was created to proxy requests for RiakCS when the local cluster has the manifest, but not the blocks. It's design was overly influenced by the customer. They ran a system that could only connect out through a NAT, so it is set up to establish the connection in reverse and make proxy requests against the incoming connection. All of that leaks into the proxy get abstraction.
  • Replication connectivity problems are hard to debug

    • Variable bandwidth may be present at customer sites. Riot games had multi-homed data centers where two routes were high bandwidth low latency and the third was misconfigured and had lower bandwidth/higher latency. Connections were established round-robin so sometimes realtime would connect 'fast' and be fine, other times it would connect 'slow' and cause RTQ backup during peak load.

    • Connection failure reasons can be hard to diagnose.

      • Bad IP address/port - unrecoverable failure with no answer however many times you try.

      • Misconfigured port - connecting things to the wrong service, connecting a local cluster to itself, trying to connect to the PBC/HTTP ports rather than the cluster manager.

      • Flapping connections - looks like it's working, but it isn't working that well.

Realtime Issues

  • Realtime workload is not balanced.

    • v2 replication used to monitor how many incoming realtime connections there were on a node and when overconnected steer the connection to another node. v3 does not.

    • When connecting rt_sources to rt_sinks, the cluster manager (via a locator callback registered with the connection manager) dishes out a randomized list of connection IPs that it round-robins until they are refreshed every 10s. We currently resolve connection imbalance by either stop/starting realtime (no data loss, just a bit scary) or by using console snippets to override the locator function and steer

  • Nodes have single outbound connection for realtime. Replicating from a small cluster to a large one does not utilize all remote nodes.

  • Realtime worker pool should be shared between rtsinks (basho/riak_kv#254).

    • Each incoming rtsink creates it's own worker pool, so if a node is overconnected you cannot control the amount of put load coming from realtime.
  • Cascading realtime sends more data than needed.

    • v2/early v3 replication did not support realtime between A <-> B <-> C connected clusters. Data would replicate realtime from A to B, then B to C.

    • The project was planned in two phases, a dumb version that send the data on every path and detected when data came back in a loop, and an advanced one that use spanning trees to send on a single path only.

    • We have the spanning tree code written in replication, but also have spanning tree code as part of the cluster metadata project. Ideally we would have a single, tested implementation that can be used for both (if it makes sense).

  • Single RTQ per node can overload

    • Each node runs a single realtime queue process. We had cases in production systems where the realtime queue was overloaded and filled faster than we could trim it.

    • We added a drop mechanism to discard realtime queue messages before they arrived during overloads, but we should be able to tolerate more parallelism by supporting multiple queues.

  • Realtime heartbeat has been fragile

    • We had a customer issue where their kernel kept panicing, and TCP keepalive wasn't possible to configure low enough to detect the problem. They were very sensitive to realtime latency and wanted that connection to fail over before the RTQ filled and dropped.

    • A naive heartbeat system was added but to the wrong process which made RT status block, although that is fixed, heartbeat is still vulnerable to very large objects / slow connections taking too long to transmit the data.

  • Stats do not differentiate between re-sending and dropped messages. This is important to know if a fullsync needs to be run (basho/riak_kv#426)

  • Although puts will always be coordinated by a node that is responsible for a portion of the keyspace, no effort is made to route to rtsink nodes based on their ownership. Sending directly to a responsible node in the remote cluster would reduce network traffic.

General

  • No support for strong consistency.

    • Strongly consistent operations were added to Riak for 2.0, but are single-data center only. Without coordination it is impossible to replicate between sites. At the moment there is nothing to prevent users creating isolated SC buckets that we can never join back together

    • Customers may want global-strong consistency where several sites are involved in updating and the others are observers.

    • Customers may want proxied-strong consistency where a single site stores all data and requests are all proxied, possibly with a dirty-read option in local sites.

    • I'm sure @jtuple has many more thoughts than me on this.

  • Configuration is single cluster only

    • No support for bucket type/bucket properties, each site must be configured separately.

    • No support for security settings between clusters, authorization can always be shared even if authentication is offloaded. Each site must be configured separately.

  • Realtime puts and fullsync folding get the bucket properties for every object to decide whether to replicate or not.

  • MDC data stored in the ring, which we would like to deprecate.

  • Replication leader is a hack on gen_leader. As the fork we use does not support node addition/removal, we name the election after a hash of participants. Although we've patched it, would be good to replace it with something like riak_ensemble.

  • No way to know if objects have been sent (basho/riak_repl#308)