More and more enterprises are managing distributed infrastructures and applications that need to share data. This data sharing can be viewed as data flows that connect (and flow through) multiple applications. Applications are partly managed on-premise, and partly in (multiple) off-premise clouds. Cloud infrastructures need to elastically scale over multiple data centers and software defined networking (SDN) is providing more network flexibility and dynamism. With the advent of the Internet of Things (IoT) the need to share data between applications, sensors, infrastructure and people (specifically on the edge) will only increase. This raises fundamental questions on how we develop scalable distributed systems: How to manage the flow of events (data flows)? How to facilitate a frictionless integration of new components into the distributed systems and the various data flows in a scalable manner? What primitives do we need, to support the variety of protocols? A term that is often mentioned within this context is Reactive Programming, a programming paradigm focusing on data flows and the automated propagation of change. The reactive programming trend is partly fueled by event driven architectures and standards such as for example XMPP, RabbitMQ, MQTT, DDS.
One way to think about distributed systems (complementary to the reactive programming paradigm) is through the concept of a shared (distributed) data fabric (akin to the shared memory model concept). An example of such a shared data fabric is Tuple spaces, developed in the 1980’s. You can view the data fabric as a collection of (distributed) nodes that provides a uniform data layer to the applications. The data fabric would be a basic building block, on which you can build for example a messaging service by having applications (consumers) putting data in the fabric, and other applications (subscribers) getting the data from the fabric. Similarly such a data fabric can function as a cache, where a producer (for example a database) would put data into the fabric but associates this to a certain policy (e.g. remove after 1 hour, or remove if exceeding certain storage conditions). The concept of a data fabric enables applications to be developed and deployed independently from each other (zero-knowledge) as they only communicate via the data fabric publishing and subscribing to messages in an asynchronous and data driven way.
The goal of the fabric is to offer an infrastructure platform to develop and connect applications without applications having to (independently) implement sets of basic primitives like security, guaranteed delivery, routing of messages, data consistency, availability, etc… and free up time of the developer to focus on the core functionality of the application. This implies that the distributed data fabric is not only a simple data store or messaging bus, but has a set of primitives to support easier and more agile application development.
Such a fabric should be deployable on servers and other devices like for example routers and switches (potentially building on top of a Fog infrastructure). The fabric should be distributed and scalable: adding new nodes should re-balance the fabric. The fabric can span multiple storage media (in-memory, flash, SSD, HDD, …). Storage is transparent to the application (developer), and applications should be able to determine (as a policy) what level of storage they require for certain data. Policies are a fundamental aspect of the data fabric. Some other examples of policies are: (1) time (length) data should remain in the fabric, (2) what type of applications can access particular data in the fabric (security), (3) data locality, the fabric is distributed, but sometimes we know in advance that data produced by one application will be consumed by another that is relative close to the producer.
It is unlikely that there will be one protocol or transportation layer for all applications and infrastructures. The data fabric should therefore be capable to support multiple protocols and transportation layers, and support mappings of well-known data store standards (such as object-relational mapping)
The data fabric can be queried, to enable discovery and correlation of data by applications, and support widely used processing paradigms, such as map-reduce enabling applications to bring processing to the data nodes.
It is unrealistic to assume that there will be one data fabric. Instead there will be multiple data fabrics managed by multiple companies and entities (similar to the network). Data fabrics should therefore be connected with each other through gateways creating a “fabric of fabrics” were needed.
This distributed data fabric can be viewed as a set interconnected nodes. For large data fabrics (many nodes) it will not be possible to connect each node with all other nodes without sacrificing performance or scalability, instead a connection overlay and smart routing algorithms are needed (for example a distributed hash tables) to ensure scalability and performance of this distributed data fabric. The data fabric can be further optimized by coupling this fabric (and its logical connection overlay) to the underlying (virtual) network infrastructure and exploit this knowledge to further optimize the data fabric to power IoT, Cloud and SDN infrastructures.
Special thanks to Gary Berger and Roque Gagliano for their discussions and insights on this subject.