Query speed on chained concepts in multi-node lab system


I continue to find uses for organizing data ontologically. It’s very exciting and in fact, I’m probably getting myself into trouble by continuing to ask more and more difficult questions of the knowledge graphs that I’m producing.

I have a lab system put together which consists of two physical nodes, both running Ubuntu 16.04 on bare metal: one with 8 CPU cores and 32 GB of RAM, one with 4 CPU cores and 16 GB of RAM. I have configured Cassandra to run on each of these machines. I then installed Grakn 1.1 on the larger system and configured it to use the two nodes as its record storage.

In the current challenge, I am tracking data from a log of website activity which I have loaded into my keyspace. All is well with the loading, and I can consistently retrieve data as well as mess around in the visualizer with no trouble.

HOWEVER! When I attempt to write a big nasty query to retrieve something very specific (say, a specific page event as related to the value of an inbound campaign), the Grakn server decides to go on vacation. Example query:

match
$pageevent isa pageevent,
    has process $pageevent_process,
    has category $pageevent_category;
$pageevent_process val contains 'Report';
$pageevent_category val contains 'Yield';

(consumer: $localuser, consumable: $pageevent) isa consumption;
$localuser isa localuser;

(consumable: $pagehit, consumer: $localuser) isa consumption;
$pagehit isa pagehit;

(describer: $utm-campaign, described: $pagehit) isa description;
$utm-campaign isa utm-campaign,
    has dimensionvalue $campaign_dimensionvalue;
$campaign_dimensionvalue val contains 'Performance';

limit 100;
offset 0;
get $pageevent;

My server in this example never comes back from its catatonia and I have to abort the request. Subsequent requests are fine, but unfortunately, beyond a certain level of complexity, my ontology database simply refuses to return results. This is a real bummer because my colleagues are starting to get very excited about some of the possibilities here.

So, what am I doing wrong here? What optimization have I failed to make? Is my query syntax incorrect? Do I need to configure Cassandra or my keyspace in a certain way? Is there a specific Java optimization I need to set in motion? Should I not be using OpenJDK 8, or Ubuntu, or Intel processors? I think I have some idea about my options (or lack thereof) for indexing and other situational optimizations, but maybe there's something I've forgotten.

How do I once again dazzle my colleagues with more semantic wizardry?? Please advise.



Hi Andy,

I’m glad you’re enjoying using Grakn!

I think it's possible the query is slow because it hits a current indexing limitation of ours: the 'contains' predicate used to search for substrings is not currently indexed. As a result, this query has no obvious "starting point", which can make it much slower, especially with large knowledge bases.

To work around this, you could try referring to an exact value (e.g. $pageevent_process val = 'An exact string';), since these look-ups are indexed.
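For instance, the two substring predicates at the top of your query could be rewritten as exact matches along these lines (the literal values here are hypothetical placeholders; substitute strings that actually occur in your data):

```graql
match
$pageevent isa pageevent,
    has process $pageevent_process,
    has category $pageevent_category;
# exact-value comparisons are indexed, unlike 'contains'
$pageevent_process val = 'Report';
$pageevent_category val = 'Yield';
get $pageevent;
```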




Just in case you are interested, we are planning to release our new offering, Grakn KGMS, which will have proper clustering support as well as a more efficient indexing mechanism.


Ahhhh, yes indeed. Results confirmed. The “contains” clause really murders performance. Thanks for the tip! Are there any issues with reusing a relation type multiple times in a single query? In my example I’m using my “consumer/consumable/consumption” relation in two different contexts. Could this cause issues as well?


I am _very interested indeed_ in the KGMS and I'm very much looking forward to making use of it when it becomes available.


Another follow-up question: While contains is a performance killer, what’s the word on regex for performance? Better, worse, the same? Any best practices here we should be aware of?


Regex and 'order by' operations exhibit the same performance limitations as the 'contains' operation. They would be much faster in Grakn KGMS, with its superior indexing capabilities.
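For reference, a regex predicate in Graql looks like the sketch below (attribute name taken from the earlier query; the pattern itself is just an illustration). Like 'contains', it cannot use the index, so it scans all attribute values:

```graql
match
$pageevent isa pageevent,
    has process $pageevent_process;
# regex predicate: unindexed, scans every value, same cost profile as 'contains'
$pageevent_process val /^Report.*/;
get $pageevent;
```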