Tiny graph with long query time


#1

Grakn version: 1.2.0
OS: Ubuntu 16.04
Node count: 299 (match $x isa entity; aggregate count;)
Edge count: 1483 (is there a simple universal query for this too?)

I’m seeing long query times (over a minute) on a very tiny graph (under 300 nodes). What’s most alarming is that the query only needs to traverse one specific type of node, of which there are only 2.

Db schema:

define

##########---------- Kythe Entities ----------##########
node is-abstract sub entity
    has name
    has create_date;
project sub node
    plays has_defines;
file sub node
    has subkind
    has file_location
    has qualified_name
    plays is_defines
    plays has_defines
    plays has_ref
    plays has_ref_call;
function sub node
    has subkind
    has qualified_name
    has commit_sha1
    plays is_defines
    plays is_ref
    plays is_ref_call
    plays has_ref
    plays has_ref_call;

##########---------- Kythe Attributes ----------##########
name sub attribute datatype string;
create_date sub attribute datatype string;
subkind sub attribute datatype string;
file_location sub attribute datatype string;
qualified_name sub attribute datatype string;
commit_sha1 sub attribute datatype string;

start_offset sub attribute datatype long;
end_offset sub attribute datatype long;

##########---------- Kythe Relationships ----------##########
defines sub relationship
    relates has_defines, relates is_defines
    has create_date
    has start_offset
    has end_offset;
has_defines sub role;
is_defines sub role;

ref sub relationship
    relates has_ref, relates is_ref
    has create_date
    has start_offset
    has end_offset;
has_ref sub role;
is_ref sub role;

ref_call sub ref
    relates has_ref_call, relates is_ref_call;
has_ref_call sub has_ref;
is_ref_call sub is_ref;

What’s actually loaded in the database:
Concept counts:

  • Node: 299
  • Project: 2
  • File: 32
  • Function: 265

Relationship counts:

  • Defines: 474
  • Ref: 1009

Essentially the database I’ve described above holds two GitHub projects (“weijiama/codeu-spring-2018” and “codeu-s2018-t26/codeu-spring-2018”) and the files and functions defined in those projects. They are very closely related projects so they share a lot of the same files/functions.

The query I’m running over this small dataset is:

match
$project isa project has name "weijiama/codeu-spring-2018"; 
$diffProject isa project has name $diffProjectName; $project != $diffProject;
get $diffProjectName;

I believe this query is pretty straightforward, so the long response time is weird. Here is how I read the query:

  • A project ($project) has a name “weijiama/codeu-spring-2018”
  • There is a different project ($diffProject) which has a name ($diffProjectName)
  • The project ($project) is not the same as the different project ($diffProject)
  • Return the name ($diffProjectName) of all the different projects ($diffProject)
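
In plain Python terms, that reading amounts to roughly the sketch below. It is only an illustration of the intended semantics, not of how Grakn evaluates the query; the in-memory list and the `id` values are hypothetical stand-ins for the two project entities.

```python
# Hypothetical in-memory stand-in for the two project entities.
# The "id" values are made up; only the names come from the dataset.
projects = [
    {"id": "p1", "name": "weijiama/codeu-spring-2018"},
    {"id": "p2", "name": "codeu-s2018-t26/codeu-spring-2018"},
]


def other_project_names(projects, known_name):
    """Return the names of all projects that are not the one named known_name."""
    return [
        diff["name"]
        # $project isa project has name "<known_name>"
        for project in projects
        if project["name"] == known_name
        # $diffProject isa project has name $diffProjectName; $project != $diffProject
        for diff in projects
        if diff["id"] != project["id"]
    ]


result = other_project_names(projects, "weijiama/codeu-spring-2018")
print(result)
```

With only two projects loaded, the only possible answer is the name of the other project.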

The above query is meant to simply return the other project, “codeu-s2018-t26/codeu-spring-2018” (since there are only 2 projects). How does it spend time traversing the graph if there is only one other place to go? This query is a subset of a larger query I’m having trouble getting to return, and I think a large part of the problem comes from the simple example above. I’m seeing times near 2 minutes just to return the one other possible project.

As always, link to re-create: https://github.com/BFergerson/grakn-simple-query-test


#2

Hey Brandon.

I have investigated the issue you posted.

The issue is that in this case the query planner can randomly produce a grossly suboptimal plan.

In general, plans will start with indexed lookups: attribute values, labels, ids. In this case, the produced plan will likely begin from the attribute with value weijiama/codeu-spring-2018, fetch an instance for the $project variable, and then turn its attention to the remaining parts of the query.

Now there are at least two possibilities:

Variant A:

Start with project label and do:

[type: project] -> project instance -> map $diffProject 

Variant B:

Start with name label and do:

[type: name] -> name instance -> check if attached to project -> map $diffProject

Currently, the planner treats these two variants as equivalent. However, even though there are only 2 instances of project in the data you posted, there are 175 unique instances of name and 299 concepts with a name attribute, which leads to unnecessary checking and hence the huge difference in query time.
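
To make the difference concrete, here is a toy Python sketch that counts how many candidates each variant has to examine. It is not Grakn's planner and says nothing about real traversal costs. The dataset is synthetic and only reproduces the counts quoted above (2 projects, 299 named concepts, 175 distinct names); the `name-<i>` values and the function names are made up for illustration.

```python
# Toy model only: NOT Grakn's planner, just a count of how many candidates
# each traversal order examines on synthetic data matching the thread's counts.

def build_toy_data():
    """Build 299 named concepts: 2 projects plus 297 others, 175 distinct names."""
    concepts = [
        {"type": "project", "name": "weijiama/codeu-spring-2018"},
        {"type": "project", "name": "codeu-s2018-t26/codeu-spring-2018"},
    ]
    # The remaining 297 concepts share 173 hypothetical names.
    for i in range(297):
        concepts.append({"type": "other", "name": f"name-{i % 173}"})
    return concepts


def variant_a(concepts, project):
    """Start from the project type: only project instances are examined."""
    examined, names = 0, []
    for candidate in (c for c in concepts if c["type"] == "project"):
        examined += 1
        if candidate is not project:  # $project != $diffProject
            names.append(candidate["name"])
    return names, examined


def variant_b(concepts, project):
    """Start from the name type: every owner of every name instance is examined."""
    owners_by_name = {}
    for c in concepts:
        owners_by_name.setdefault(c["name"], []).append(c)
    examined, names = 0, []
    for name, owners in owners_by_name.items():
        for owner in owners:
            examined += 1
            # Check whether this name is attached to a different project.
            if owner["type"] == "project" and owner is not project:
                names.append(name)
    return names, examined


concepts = build_toy_data()
# The indexed lookup on the attribute value resolves $project first.
project = next(c for c in concepts if c["name"] == "weijiama/codeu-spring-2018")

names_a, examined_a = variant_a(concepts, project)
names_b, examined_b = variant_b(concepts, project)
distinct_names = len({c["name"] for c in concepts})
print(names_a, examined_a)
print(names_b, examined_b, distinct_names)
```

Both variants return the same answer; they differ only in how many candidates are examined along the way.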

To remedy this and some other similar problems, we are planning to incorporate data statistics into our query planning algorithm in the coming weeks, so that more fruitful query paths can be prioritised.

Hope that helps
Kasper


#3

@kasper, thank you for the comprehensive response. Didn’t know the issue was random. I actually tried running the project above several times today and haven’t been able to reproduce the issue. Swear it was happening every time last night though. Did it occur for you?

Also, I’m having trouble understanding how to translate your suggested variants into Graql queries. They say to “start” with a project/name label, but I thought the ordering of individual statements in a Graql query didn’t matter. If the ordering of statements doesn’t matter, how do I start with a specific type?