I keep running into situations where I need to model temporal states in the graph structures I’m creating. The current situation I’m dealing with is updating GitDetective to allow querying source code references and filtering by some temporal predicate. This would allow for the creation of trend graphs showing when and to what degree a project on GitHub is used by other projects.
In order to achieve this I’ve been looking into how to store temporal data in graphs and to the best of my knowledge the general principles to follow are:
Below I will present the current scenario I’m working with and the solution I’ve come up with. I’m no expert when it comes to graphs so I’m hoping the Grakn developers could point me in the right direction.
Essentially I wish to store a particular Thing at a specific Time. A Date can have several Time entities attached to it and a Time entity can have many Thing entities attached to it. A Thing entity may be marked as either added or deleted. In my scenario a Date is some given day and a Time is a given hour in that day.
In addition to these constraints, a Thing may be inserted into a Time which precedes a previously inserted Thing, and querying the aggregated counts over a somewhat large Time range should be feasible and not correlated to the quantity of Thing entities which are stored. Ideally it should be correlated to the Time range only.
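To make the bucketing concrete, here is a minimal Python sketch (not Grakn code) of how a timestamp could be mapped onto a Date key and an hour, and how events arriving out of order still land in the right bucket. The `%d-%m-%Y` format for dayMonthYear and the helper names `bucket_keys` and `group_by_bucket` are assumptions made for illustration; they are not taken from the project.

```python
from collections import defaultdict
from datetime import datetime, timezone
from typing import Dict, Iterable, List, Tuple


def bucket_keys(ts: datetime) -> Tuple[str, int]:
    """Return (dayMonthYear, hour) for a timezone-aware timestamp, in UTC.

    The "%d-%m-%Y" string format is an assumption for this sketch.
    """
    ts = ts.astimezone(timezone.utc)
    return ts.strftime("%d-%m-%Y"), ts.hour


def group_by_bucket(
    events: Iterable[Tuple[datetime, str, bool]]
) -> Dict[Tuple[str, int], List[Tuple[str, bool]]]:
    """Group (timestamp, data, added) events by their (day, hour) bucket.

    Arrival order does not matter: an event with an earlier timestamp
    that arrives late is still filed under its own hour.
    """
    buckets: Dict[Tuple[str, int], List[Tuple[str, bool]]] = defaultdict(list)
    for ts, data, added in events:
        buckets[bucket_keys(ts)].append((data, added))
    return buckets
```

For example, an event stamped 09:10 that arrives after one stamped 15:30 on the same day is still grouped under hour 9 of that Date.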
define

#attributes
dayMonthYear sub attribute datatype string;
hour sub attribute datatype long;
date_index sub attribute datatype long;
time_index sub attribute datatype long;
data sub attribute datatype string;
added sub attribute datatype boolean;

#entities
Date sub entity
    has dayMonthYear
    plays has_time
    plays has_most_recent_thing;
Time sub entity
    has hour
    plays is_time
    plays has_thing
    plays has_most_recent_thing;
Thing sub entity
    has data
    has date_index
    has time_index
    plays is_thing
    plays is_most_recent_thing;

#relationships
time_relation sub relationship
    relates is_time, relates has_time;
is_time sub role;
has_time sub role;
thing_relation sub relationship
    has added
    relates is_thing, relates has_thing;
is_thing sub role;
has_thing sub role;
most_recent_thing_relation sub relationship
    relates is_most_recent_thing, relates has_most_recent_thing;
is_most_recent_thing sub role;
has_most_recent_thing sub role;
As described above, there are Date, Time, and Thing entities and their related attributes and relationships. In addition to these there are date_index and time_index attributes and a most_recent_thing_relation. The date_index represents the index at which a particular Thing was inserted under a given Date. The time_index is the same but for the Time entity. These indexes are added so that when asking a particular Time for the number of Thing nodes it has, it can produce that number without counting each Thing. It finds the correct time_index value through the most_recent_thing_relation on the Time entity. As the relationship’s name suggests, this would be the most recent Thing which was inserted into that Time entity. The date_index works the same way but instead attaches a Date to the most recent Thing in all of the Time entities which are attached to it. These indexes are incremented and decremented as a Thing is marked as added or deleted.
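The index bookkeeping described above can be sketched in plain Python as follows. This is an in-memory illustration of the idea only, not Grakn code: the classes mirror the schema's entity names, while `record`, `count_in_time`, and `count_in_date` are hypothetical helpers introduced for this example.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional


@dataclass
class Thing:
    data: str
    added: bool
    time_index: int  # running count of live Things in the owning Time
    date_index: int  # running count of live Things in the owning Date


@dataclass
class Time:
    hour: int
    most_recent: Optional[Thing] = None  # stands in for most_recent_thing_relation


@dataclass
class Date:
    day_month_year: str
    times: Dict[int, Time] = field(default_factory=dict)
    most_recent: Optional[Thing] = None  # stands in for most_recent_thing_relation


def record(date: Date, hour: int, data: str, added: bool) -> Thing:
    """Store a Thing marked as added or deleted and advance both indexes."""
    slot = date.times.setdefault(hour, Time(hour))
    step = 1 if added else -1
    # Start from the index carried by the previous most recent Thing, if any.
    prev_time = slot.most_recent.time_index if slot.most_recent else 0
    prev_date = date.most_recent.date_index if date.most_recent else 0
    thing = Thing(data, added, prev_time + step, prev_date + step)
    # Re-point both "most recent" links at the new Thing.
    slot.most_recent = thing
    date.most_recent = thing
    return thing


def count_in_time(slot: Time) -> int:
    """Read the hour's total from its most recent Thing, without counting."""
    return slot.most_recent.time_index if slot.most_recent else 0


def count_in_date(date: Date) -> int:
    """Read the day's total from its most recent Thing, without counting."""
    return date.most_recent.date_index if date.most_recent else 0
```

Each write touches one new Thing and two pointers, and each count is a single lookup, which is the property the indexes are meant to provide.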
In this way, when asking a particular Date for the number of Thing entities contained within it, it simply needs to follow the most_recent_thing_relation to find the Thing whose date_index represents the sum of all of the Thing entities contained within all the Time entities contained within that Date entity. The same goes for the Time entity when asking how many Thing entities exist at that level: simply follow the most_recent_thing_relation, and the Thing which is found will have a time_index which represents the sum of Thing entities for that Time.
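As a rough illustration of why the read cost should depend only on the width of the range, the following Python sketch sums precomputed totals over a range of hours, using a whole-day total whenever a full day fits inside the range. It assumes the per-hour and per-day totals have already been read from the most recent Thing of each Time and Date; the function `range_total` and its dictionary inputs are hypothetical, and the timestamps are assumed to be naive datetimes truncated to the hour.

```python
from datetime import date, datetime, timedelta
from typing import Dict


def range_total(
    hour_totals: Dict[datetime, int],
    day_totals: Dict[date, int],
    start: datetime,
    end: datetime,
) -> int:
    """Sum totals over the half-open range [start, end).

    The loop runs once per hour or per whole day in the range, so the
    work is independent of how many Things are stored.
    """
    total = 0
    cursor = start
    while cursor < end:
        next_day = cursor + timedelta(days=1)
        if cursor.hour == 0 and next_day <= end:
            # A whole day fits: use the Date-level total and skip 24 hours.
            total += day_totals.get(cursor.date(), 0)
            cursor = next_day
        else:
            # Partial day: fall back to the Time-level (hourly) total.
            total += hour_totals.get(cursor, 0)
            cursor += timedelta(hours=1)
    return total
```

With this shape, a range spanning several weeks costs a few dozen lookups (whole days plus the partial hours at either end), regardless of how many Thing entities each bucket holds.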
While testing this design I ran into this problem, so I haven’t been able to test the query speed, but I imagine it will be pretty good given that in no situation will it have to count all of the Thing entities which exist, but rather a select few to gather what the real total is.
My main concern is how well this performs on write and delete operations and whether there are any tricks I’m missing to speed it up. Even without the most_recent_thing_relation I’m only able to insert 50 things a second on a single thread and 200 things a second using multiple threads.
Here is the full project: https://github.com/BFergerson/grakn-calendar-test
I’m hoping I’m doing something wrong on the inserts and I’m curious if anyone has any thoughts on getting accumulated sums of entities by time efficiently in graph databases. Thanks.