Modeling time-based graphs in Grakn


#1

I keep running into situations where I need to model temporal states in the graph structures I’m creating. The current situation I’m dealing with is updating GitDetective to allow querying source code references and filtering by some temporal predicate. This would allow for the creation of trend graphs showing when and to what degree a project on GitHub is used by other projects.

In order to achieve this I’ve been looking into how to store temporal data in graphs and, to the best of my knowledge, the general principles to follow.

Below I will present the current scenario I’m working with and the solution I’ve come up with. I’m no expert when it comes to graphs so I’m hoping the Grakn developers could point me in the right direction.

Scenario

Essentially I wish to store a particular Thing at a specific Date and Time. A Date can have several Time entities attached to it and a Time entity can have many Thing entities attached to it. A Thing entity may be marked as either added or deleted. In my scenario a Date is some given day and a Time is a given hour in that day.

In addition to these constraints, a Thing may be inserted at a Date/Time which precedes a previously inserted Thing, and querying aggregated counts over a somewhat large Date/Time range should be feasible. Query time should not be correlated with the quantity of Thing entities stored; ideally it should be correlated with the Date/Time range only.

Solution

Schema:

define

#attributes
dayMonthYear sub attribute datatype string;
hour sub attribute datatype long;
date_index sub attribute datatype long;
time_index sub attribute datatype long;
data sub attribute datatype string;
added sub attribute datatype boolean;

#entities
Date sub entity
    has dayMonthYear
    plays has_time
    plays has_most_recent_thing;

Time sub entity
    has hour
    plays is_time
    plays has_thing
    plays has_most_recent_thing;

Thing sub entity
    has data
    has date_index
    has time_index
    plays is_thing
    plays is_most_recent_thing;

#relationships
time_relation sub relationship
    relates is_time, relates has_time;
is_time sub role;
has_time sub role;

thing_relation sub relationship
    has added
    relates is_thing, relates has_thing;
is_thing sub role;
has_thing sub role;

most_recent_thing_relation sub relationship
    relates is_most_recent_thing, relates has_most_recent_thing;
is_most_recent_thing sub role;
has_most_recent_thing sub role;

Logic:

As described above, there are Date, Time, and Thing entities and their related attributes and relationships. In addition to these there are date_index and time_index attributes and a most_recent_thing_relation relationship.

The date_index represents the index at which a particular Thing was inserted under a given Date; the time_index is the same but for the Time entity. These indexes are added so that when asking a particular Time for the number of Thing entities it holds, it can produce that number without counting each Thing. It finds the correct time_index value through the most_recent_thing_relation on the Time entity. As the relationship’s name suggests, this points to the most recent Thing which was inserted into that Time entity. The date_index works the same way, but instead attaches a Date to the most recent Thing across all of the Time entities attached to it. These indexes are incremented and decremented as a Thing is marked as added or deleted.
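
A sketch of the write path under this design (hypothetical hour value; the increment itself has to happen client-side, as this flavour of Graql has no arithmetic in insert):

# Step 1: read the current index off the most recent Thing for the target Time.
match
    $time isa Time has hour 13;
    (has_most_recent_thing: $time, is_most_recent_thing: $prev) isa most_recent_thing_relation;
    $prev has time_index $i;
get $i;

# Step 2: the client computes i + 1, inserts the new Thing with that index
# (date_index is maintained the same way), and re-points the
# most_recent_thing_relation from $prev to the new Thing.
insert $new isa Thing has data "..." has time_index <i + 1>;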

In this way, when asking a particular Date for the number of Thing entities contained within it, it simply needs to follow the most_recent_thing_relation to find the Thing whose date_index represents the sum of all the Thing entities contained within all the Time entities under that Date. The same applies to the Time entity when asking how many Thing entities exist at that level: simply follow the most_recent_thing_relation, and the Thing which is found will have a time_index representing the total number of Thing entities for that Time entity.
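
To illustrate, a read along these lines (hypothetical dayMonthYear value, assuming the schema above) resolves a Date’s running total without scanning its Thing entities:

# Hop from the Date to its most recent Thing and read date_index,
# rather than counting Thing entities.
match
    $d isa Date has dayMonthYear "01-10-2018";
    (has_most_recent_thing: $d, is_most_recent_thing: $t) isa most_recent_thing_relation;
    $t has date_index $count;
get $count;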

Testing

While testing this design I ran into this problem, so I haven’t been able to test the query speed, but I imagine it will be pretty good given that in no situation will it have to count all of the Thing entities which exist, only a select few, to gather the real total.

My main concern is how well this performs on write and delete operations, and whether there are any tricks I’m missing to speed them up. Even without the date_index, time_index, and most_recent_thing_relation, I’m only able to insert 50 things a second on a single thread and 200 things a second using multiple threads.

Here is the full project: https://github.com/BFergerson/grakn-calendar-test

I’m hoping I’m doing something wrong on the inserts and I’m curious if anyone has any thoughts on getting accumulated sums of entities by time efficiently in graph databases. Thanks.


#2

I wonder, in comparison with the indexing approach, if modelling the knowledge graph differently could lead to a more significant improvement in performance.

This is what I have in mind:

define
    hour sub attribute datatype long,
        plays at;

    day sub attribute datatype date,
        plays on;

    data sub attribute datatype string,
        plays added,
        plays deleted;

    addition sub relationship,
        relates on,
        relates at,
        relates added;

    deletion sub relationship,
        relates on,
        relates at,
        relates deleted;

I’m very curious to find out, based on this model, how the following queries would perform within your dataset.

insert $d isa day <value for day>;

insert $h isa hour <value for hour>;

insert $da isa data <value for data>;

match
    $d isa day <value for day>;
    $h isa hour <value for hour>;
    $da id <the id from the previous insert>;
insert (on: $d, at: $h, added: $da) isa addition;

match
    $d isa day <target day>;
    $h isa hour <target hour>;
    $adi (on: $d, at: $h) isa addition;
aggregate count $adi;

#3

@soroush, I had no idea you could attach relationships to attributes. I’m rethinking every Grakn schema I’ve ever created :sweat_smile:. It might explain a lot of the poor performance I’ve seen but I’m still trying to wrap my head around when an attribute is preferable to an entity.

As far as your model goes, I do not believe it’ll work for me, as the response time for getting the count of additions and deletions is highly dependent on the number of things stored.

With 10 things, querying maxes out at:

  • exact day and hour: ~450 per second
  • by day: ~550 per second
  • by week: ~400 per second

With 1,000 things, querying maxes out at:

  • exact day and hour: ~450 per second
  • by day: ~500 per second
  • by week: ~11 per second

With 100,000 things, querying maxes out at:

  • exact day and hour: ~28 per second
  • by day: ~14 per second
  • by week: ~0.02 per second

Code used: https://github.com/BFergerson/grakn-calendar-test/tree/alternate

I believe the insert speed of your model is a lot better though (around ~225 per second on a single thread). I would like to try your model with the Graphity indexes when this is fixed. It’ll slow the insert speed down, but ideally it won’t suffer from longer query times as more things are inserted.

@soroush, given I was able to achieve a constant and quick read time for the queries I’m interested in and an acceptable write time, does scaling this solution simply mean adding more Grakn servers? This is something I’ve seen Grakn developers say to newcomers who ask how Grakn scales: they say Grakn scales linearly. Does this mean that if a single Grakn server achieves ~50 writes per second and ~600 reads per second, adding another Grakn server could potentially increase the overall throughput to something like ~75 writes per second and ~900 reads per second, and so on ad infinitum?


#4

I had no idea you could attach relationships to attributes.

The documentation should have pointed this out. I’ll make sure to include this and other similar use cases in our docs :slight_smile:

I’m still trying to wrap my head around when an attribute is preferable to an entity.

An attribute is simply a piece of information that determines a property of a thing in the domain. Basically, anything that is to be assigned a value should be defined as an attribute. A person will never hold a value by itself; a name will. A slightly more complex example is language. If all you needed to store about a language was its name, then language could be defined as an attribute. If there were more values to be stored as properties of a language, then you’d define it as an entity and have it own those properties as its attributes (for example, name, popularity, …). Lastly, it may be worth mentioning that an attribute can also own attributes of its own.
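
To make this concrete, here is a minimal sketch of the two alternative definitions of language (labels are illustrative only):

define
    # language only ever holds a value, so it can be an attribute:
    language sub attribute datatype string;

define
    # language has properties of its own, so it becomes an entity
    # that owns those properties as attributes:
    name sub attribute datatype string;
    popularity sub attribute datatype long;
    language sub entity
        has name
        has popularity;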

Thanks for sharing the test results with me. Let us know how it goes once you extend the schema to include the indexes. Hopefully, @ganesh ( :wink: ) will find some time in the near future to resolve the issue you mentioned. Also, in case this may be of interest to you, I also ran a test on a dataset with 2 million data instances; the aggregate count for an exact day and hour took about 5 seconds.

With regards to your question about scalability, I believe @joshua could provide you with an answer.

Keep up the good work @BFergerson and I hope that you will continue to inspire the community with your projects and challenge them with your questions :slight_smile:


#5

Hi Brandon! For Grakn Core, we don’t offer a distributed/multi-server scaling option; the best you could probably do is add more CPU cores and try to multithread your work (we’re working on stability during multithreaded writes, so no guarantees quite yet!). The coming quarters will also bring read and write speed improvements for single-threaded workloads. One thing that should land in an upcoming release is a fix for the issue you nicely documented here.

Regarding distributed scaling, this is a KGMS offering; if you’re interested in it, you could ping @tomas.


#6

We took a quick look at some of the queries you are performing in the program. For instance, this query here performs a date comparison which requires full-featured indexing support.
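
For reference, a date comparison in Graql looks roughly like this (illustrative cut-off value, using the day attribute from the earlier schema):

match
    $d isa day;
    $d val > 2018-10-01;
get;

Evaluating the val > predicate over a range of values is exactly the kind of operation that benefits from that indexing support.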

Currently, that particular feature is already on the development roadmap for KGMS. In general, operations which use full-featured indexing, such as date comparison, order by, offset/limit, and contains, can be done much faster in KGMS.