The size of a Graph is its number of edges, which is the number of “is” relations. The simplest way to store the graph is as a list of pairs of datums (datumtrons) with each pair representing an “is” relation. This means that the memory required to store a graph is proportional to the size of the graph i.e. to the number of “is” relations. Notice that the size is independent on the number of datums in the graph.
The “is” relation is transitive. This means that if we have “a is b” and “b is c” then we can conclude that “a is c”. This conclusion is implicit and cannot be made explicit. In other words, If we try to add “a is c” to the Datum Universe, it will not be accepted as shown in Figure 2.
In Figure 2a, we have "apple is red" and "apple1 is apple" and we cannot add "apple is red".
In Figure 2b: we have "apple1 is apple" and "apple1 is red". If we add "apple is red", then "apple1 is red" is removed.
In Figure 3c: we have "apple is red" and "apple1 is red". If we add "apple1 is apple", then "apple1 is red" is removed.
The Transitive Reduction property eliminates any redundant links and puts the DAG in a Transitive Reduction form. The Transitive Reduction form of a graph is unique and reduces the number of links to the minimum possible without affecting the information content in the graph [AHO]. We show that there is a further reduction in graph size by modifying the graph using an induction process. First, let’s recall the definition of a biclique (Figure 3) as a subgraph where there are two sets of vertices U and L, and every vertex in L is connected to every vertex in U. If the number of vertices are n and m respectively, then the biclique subgraph size is m*n.
|G biclique| = m * n
In the datum universe, an example of a biclique is apple is red, apple is sweet, strawberry is red, strawberry is sweet, cherry is red, cherry is sweet Where the U set is the upper set of datums {red, sweet} and the L set is the lower set of datums {apple, strawberry, cherry} as shown in Figure 3.
The induction process creates a new vertex (datum) and inserts it in the middle as in Figure 4. Even though the process increases the number of vertices by one, it actually reduces the size of the graph to m + n.
|G induction | = m + n
In our example, we create a new datum x and modify all the links as follows: x is red, x is sweet, apple is x, strawberry is x, cherry is x We have now reduced the subgraph size from 6 to 5 edges.
As the numbers of n and m increase, the savings in graph size becomes substantial. Therefore, the induction process decreases the amount of memory needed to store the datum universe content. Moreover, the new datum represents a class of all datums that are red and sweet. If a new red and sweet datum is added, it can be added as a value of the new class. This means that in the datum universe, the induction process is a classification as well as memory conservation process.
If we apply the induction process on the Datum Universe graph such that we eliminate all bicliques, we reach the minimal size of the graph. The resultant graph is equivalent to the original graph from a transitivity point of view. At this point, the graph becomes a Lattice since every two vertices have one Least Upper Bound element (LUB) and one Greatest Lower Bound element (GLB). The resultant lattice is not unique and depends on the order of induction. We elaborate on this process elsewhere.
Strings, numbers, dates and other primitive data types are also datums but they are represented in their native form as objects. Exactly one or zero objects may be attached to a datum. In the examples given above all labels are objects. For example, "apple", "strawberry", "red", "sweet" are all string objects. The only datum that doesn't have an object is the induced datum in Figure 4. Notice that a katum is represented by a solid circle with a label showing its object while a datum is represented by a hollow circle.
A datum that has an object attached is called Katum. Because of the object attached to it, a katum is more concrete to us than a datum which represents an abstract class.
The Datum Universe has 4 standard sets. These are the Class set (C), the attribute set (A), the Value set (V) and the Instance set (I). The Class set of an element e contains the elements that are directly above e. While the Attribute set of e contains only the katums that are directly above e -- ignoring any datums in the path. In Figure 5:
C(k0) = { k1, d1, d2 }
A(k0) = {k1, k4, k5, k6, k7}
The Value set and the Instance sets are the mirror images of the Class set and the Attribute set and are not shown.
The Value set of an element e is the set of datums or katums directly below e. The Instance set of e is the set of katums that are directly below e ignoring any datums in the path.
For more on these sets and other derived sets, see the two white papers.
The Datum Universe:
Datumtron In-Memory Graph Database API.
A quick tutorial to the Datumtron API. Providing code to query a converted MS-Northwind database and showing a datamining example.
What is the most fundamental element of knowledge?
A brief introduction to the Datum Universe Model which is the theory behind the Datumtron API.
Graph Properties
Biclique
A Biclique is a graph whose vertices can be partitioned into two subsets V1 and V2 such that no edge has both endpoints in the same subset, and every possible edge that could connect vertices in different subsets is part of the graph. That is, it is a bipartite graph (V1, V2, E) such that for every two vertices v1 ∈ V1 and v2 ∈ V2, v1v2 is an edge in E.
Lattice
A lattice is a Partial order in which every two-element set has a least upper bound element (LUB) and a greatest lower bound element (GLB).