Grakn version: 1.2.0
OS: Ubuntu 16.04
Node count: 299 (match $x isa entity; aggregate count;)
Edge count: 1483 (is there a simple universal query for this too?)
I’m seeing long query times (over a minute) on a very tiny graph (299 nodes). What’s most alarming is that the query only needs to traverse one specific type of node, of which there are only 2.
DB schema:
define
##########---------- Kythe Entities ----------##########
node is-abstract sub entity
    has name
    has create_date;
project sub node
    plays has_defines;
file sub node
    has subkind
    has file_location
    has qualified_name
    plays is_defines
    plays has_defines
    plays has_ref
    plays has_ref_call;
function sub node
    has subkind
    has qualified_name
    has commit_sha1
    plays is_defines
    plays is_ref
    plays is_ref_call
    plays has_ref
    plays has_ref_call;
##########---------- Kythe Attributes ----------##########
name sub attribute datatype string;
create_date sub attribute datatype string;
subkind sub attribute datatype string;
file_location sub attribute datatype string;
qualified_name sub attribute datatype string;
commit_sha1 sub attribute datatype string;
start_offset sub attribute datatype long;
end_offset sub attribute datatype long;
##########---------- Kythe Relationships ----------##########
defines sub relationship
    relates has_defines, relates is_defines
    has create_date
    has start_offset
    has end_offset;
has_defines sub role;
is_defines sub role;
ref sub relationship
    relates has_ref, relates is_ref
    has create_date
    has start_offset
    has end_offset;
has_ref sub role;
is_ref sub role;
ref_call sub ref
    relates has_ref_call, relates is_ref_call;
has_ref_call sub has_ref;
is_ref_call sub is_ref;
Here is what’s actually loaded in the database:
Concept counts:
- Node: 299
- Project: 2
- File: 32
- Function: 265
Relationship counts:
- Defines: 474
- Ref: 1009
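As a quick sanity check that these per-type counts agree with the node and edge totals at the top, here is a tiny Python sketch. It is plain arithmetic and nothing Grakn-specific; the dictionary names are made up for illustration, and it assumes the Ref count already includes any ref_call instances (since ref_call is a subtype of ref).

```python
# Per-type counts copied from the lists above (names are illustrative only).
concept_counts = {"project": 2, "file": 32, "function": 265}

# Assumption: the "ref" figure already includes ref_call instances,
# because ref_call is declared as a subtype of ref in the schema.
relationship_counts = {"defines": 474, "ref": 1009}

total_nodes = sum(concept_counts.values())
total_edges = sum(relationship_counts.values())

print("nodes:", total_nodes)  # nodes: 299
print("edges:", total_edges)  # edges: 1483
```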
Essentially, the database described above holds two GitHub projects (“weijiama/codeu-spring-2018” and “codeu-s2018-t26/codeu-spring-2018”) and the files and functions defined in those projects. The projects are very closely related, so they share a lot of the same files and functions.
The query I’m running over this small dataset is:
match
    $project isa project has name "weijiama/codeu-spring-2018";
    $diffProject isa project has name $diffProjectName;
    $project != $diffProject;
get $diffProjectName;
I believe this query is pretty straightforward, so the long response time is surprising. Here is how I read the query:
- A project ($project) has a name “weijiama/codeu-spring-2018”
- There is a different project ($diffProject) which has a name ($diffProjectName)
- The project ($project) is not the same as the different project ($diffProject)
- Return the name ($diffProjectName) of all the different projects ($diffProject)
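To make that reading concrete, here is a minimal Python sketch of the semantics I expect. It is only a model over a hypothetical in-memory list, not how Grakn actually plans or evaluates the query; the `id` values and the function name `other_project_names` are invented for illustration. The point is that even a brute-force evaluation over 2 projects only has to look at 4 (project, diffProject) pairs.

```python
# Hypothetical in-memory stand-in for the two project entities (not Grakn code).
projects = [
    {"id": 1, "name": "weijiama/codeu-spring-2018"},
    {"id": 2, "name": "codeu-s2018-t26/codeu-spring-2018"},
]


def other_project_names(projects, name):
    """Return the names of all projects that are a different entity
    from any project whose name equals `name`."""
    results = []
    for project in projects:
        # $project isa project has name "<name>";
        if project["name"] != name:
            continue
        for diff_project in projects:
            # $project != $diffProject; (compared by identity, not by name)
            if diff_project["id"] != project["id"]:
                # get $diffProjectName;
                results.append(diff_project["name"])
    return results


print(other_project_names(projects, "weijiama/codeu-spring-2018"))
# prints ['codeu-s2018-t26/codeu-spring-2018']
```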
The above query is meant simply to return the other project, “codeu-s2018-t26/codeu-spring-2018” (since there are only 2 projects). Why does it spend so much time traversing the graph if there is only one other place to go? This query is a subset of a larger query that I’m having trouble getting to return at all, and I think a large part of the problem comes from the simple example above. I’m seeing times near 2 minutes just to return the one other possible project.
As always, link to re-create: https://github.com/BFergerson/grakn-simple-query-test