Loading large files into Grakn [solved]


#1

Hey,

Just wanted to let you know that, after struggling with various issues, we’ve managed to generate a GQL script that inserts all our data - around 46,000 entities and relationships. However, Grakn is not only very slow, it stalls at some point and then crashes at around the 26,000th entity. We have concluded that we cannot really use it for our needs - it’s a pity, because it has so many nice features, including the visualisation app.

Do you plan any performance improvements in the future? I am not sure why it is so slow for such a small dataset (46k is nothing, really). But if you would like the software to be deployed and used in large-scale, production-level, real-world projects, I think performance should be Grakn’s number-one priority going forward.

Thank you
Angel


#2

Hey,

We are going to focus solely on usability and performance in the coming weeks. In the meantime, it would help us if you could share more detail on what exactly is slow and which kinds of queries are affected.


#3

hi @attodorov,

We have loaded much bigger datasets than that just fine in the past. Perhaps you can share the kinds of queries you run (and the code calling those queries) so we can see if there are any points we can optimise?

Nevertheless, as Kasper said, we are now solely focusing on usability and performance for the next release, and we will continue to do so.

Thanks for sharing this with us! It helps us improve Grakn!


#4

Hi guys,

Definitely - I’d be glad to share the actual schema and load script. If you find anything in them that could be hurting performance, please let me know. I hope it turns out to be something on my side, and not Grakn :slight_smile:

We have a couple of projects where we tried to use Grakn; one of them is about code analysis (similar to another topic on this discussion board), so I am attaching examples from it. It’s basically metadata about the Cassandra project’s source code - we’d like to load that into Grakn.

schema.gql is the schema and cassandra.gql is the load script. I am using the following command:

/usr/local/bin/graql console -k cassandratest -f cassandra.gql

Here are links to the files because I cannot attach them directly:

https://drive.google.com/file/d/1j4rCsPlUTuYc6T0qA9bzPSKOqNrYdaxS/view?usp=sharing
https://drive.google.com/file/d/1eNCgzlP61f4fhEikT7Hw4uVqU-fLUMs3/view?usp=sharing

Thank you
Angel


#5

hi @attodorov,

First of all, I think you’re missing the -b flag to do batch loading. Loading a very large file in the normal (non-batch) mode is not a good idea. We ran your file with batch loading and it was very fast - a bit over a minute to run the whole thing.
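For reference, this is a sketch of what the batch invocation would look like, reusing the console command and keyspace from your earlier post (exact flag placement may differ between versions):

/usr/local/bin/graql console -k cassandratest -f cassandra.gql -b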

Second of all, there are a lot of errors in the queries. We haven’t dug into all of them, but one recurring problem is that you refer to variables that are not declared in the same query. For example, $ent1 and $ent2 at line 56971 are used in an insert statement, while the accompanying match clause does not declare these variables.
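To illustrate the kind of fix required - a minimal sketch using hypothetical types (method, calls, caller, callee, identifier), not your actual schema - every variable used in the insert clause must be bound by the match clause of the same query:

# broken: $ent1 and $ent2 are never bound in this query's match clause
match $m isa method;
insert (caller: $ent1, callee: $ent2) isa calls;

# fixed: bind both variables in the match clause of the same query
match $ent1 isa method has identifier "A.foo"; $ent2 isa method has identifier "B.bar";
insert (caller: $ent1, callee: $ent2) isa calls;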


#6

hey,

Thanks for the feedback - my understanding was that I cannot use batch loading if my script has dependent queries. What would happen if the insert for a relationship is sent (and executed) before the insert for the entity it depends on? That is a valid use case if the parallel batch executors do not guarantee ordering.

Regarding the errors - yes, that’s quite possible. I haven’t been able to get the script to run past the 26,000th insert, so I never reached that point, but I can easily fix those.

Thanks
Angel


#7

FYI, another thing I noticed: when using -b (disregarding any potential errors related to dependent queries and such), my CPU stays at 100% (or rather 400%) after the script finishes executing. It has been like this for about 10 minutes - is this normal? I can see the data in the web interface, so maybe it is analysing something.

Thanks,
Angel


#8

Grakn does a lot of post-processing after batch inserting. I’m working on Grakn with source code as well - our schemas look somewhat alike :open_mouth: .

After batch-inserting large numbers of queries, Grakn becomes pretty much unusable until the post-processing is done: you can’t run ./grakn console and you can’t run ./grakn server stop.


#9

hi @attodorov,

  1. Regardless of whether you are batch loading or not, you cannot refer to variables from other queries. Every query (match-get, match-delete, match-insert, or insert) is independent of every other query.

  2. If you want to reference something you previously inserted, you need to do a match-insert, which involves a read. Just like in any database, this will be slower than a plain write.

  3. If you want to insert something that depends on reading other things (e.g. inserting relationships between entities that you assume to already exist), you need to make sure you load all the entities first. In this case, it would be a good idea to split your data loading into two files: load all entities, then load all relationships (see the sketch after this list).

  4. The batch loading API is asynchronous, so after you send all the data, Grakn could still have more work to do, such as post-processing. We are also working on reducing this post-processing run-time significantly in the next release.
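As a concrete sketch of point 3, assuming hypothetical code-analysis types (method, calls, caller, callee, identifier) rather than your actual schema:

# entities.gql - batch load this file first; plain inserts only
insert $m isa method has identifier "A.foo";
insert $m isa method has identifier "B.bar";

# relationships.gql - batch load once all entities are in; match-inserts only
match $x isa method has identifier "A.foo"; $y isa method has identifier "B.bar";
insert (caller: $x, callee: $y) isa calls;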


#10

OK, thanks - this makes sense. So basically I should insert all the entities first, then use match-insert queries to find those entities when creating the relationships - similar to what I am already partially doing in my existing script.

So the only remaining issue I have is performance-related: unfortunately, my CPU is still at 100% about half an hour after the batch load completed. I will leave it running to see how long it takes. I checked the web interface - the entities are there, but the relationships are not, so I assume that Grakn is working on the relationships in the background.

Another thing I noticed: if I run queries like “compute count in methods” in the web console, they take a couple of seconds to complete and the CPU sits at 100%. IMO this should be a very fast operation - maybe you could add some internal indices?

Thanks for all the help
Angel


#11

Yes @attodorov,

I just want to confirm that the fastest way for you to load the data is to put all the plain insert queries in one file and batch load it, then put all the match-insert queries in a second file and batch load that one after the first has finished.
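In other words, something along these lines, reusing the console syntax from earlier in the thread (entities.gql and relationships.gql are hypothetical file names):

# step 1: batch load all entities
/usr/local/bin/graql console -k cassandratest -f entities.gql -b

# step 2: once step 1 has finished, batch load all relationships
/usr/local/bin/graql console -k cassandratest -f relationships.gql -b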


#12

Hi @attodorov

compute ... queries are OLAP queries (see the Wikipedia article on OLAP): a type of query that requires touching ALL the instances within the given scope of types. Some databases may cache the count value, but in our case we still perform OLAP over all the instances, because this lets us compute counts over various combinations of types at once (which other databases can’t do).

The first time you run a Graql compute query, it may take extra time to set up the OLAP engine; subsequent runs won’t need to.
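For example - a sketch only, since the multi-type syntax may vary between versions; the first query is the one quoted above, and classes is a hypothetical second type:

compute count in methods;
compute count in methods, classes;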