Slow performances writing data [SOLVED]


#1

Hello,

I’m trying to import a big TSV using a simple ETL I wrote earlier this morning. The TSV is 1.4 GB and contains about 13 million lines, but only around 60000 of them are going to be imported.

The current ontology is something like this:

#
# Generic role for biological entities
#
biologicalEntity sub role;

#
# NCBI Gene
#
gene sub entity
    plays-role biologicalEntity
    has-resource geneId
    has-resource symbol
    has-resource locusTag
    has-resource chromosome
    has-resource mapLocation
    has-resource geneDescription
    has-resource geneType
    has-resource authorityFullName
    has-resource nomenclatureStatus;

geneId sub resource datatype long;
symbol sub resource datatype string;
locusTag sub resource datatype string;
chromosome sub resource datatype string;
mapLocation sub resource datatype string;
geneDescription sub resource datatype string;
geneType sub resource datatype string;
authorityFullName sub resource datatype string;
nomenclatureStatus sub resource datatype string;
otherDesignations sub resource datatype string;

and the main loop of the ETL is:

    while ((line = reader.readLine()) != null) {
    	String datavalue[] = line.split("\t");
    	
    	String taxId = datavalue[0];
    	if (!taxId.equals("9606"))
    		continue;

    	Instance gene = geneEntity.addEntity();
    	
    	Resource<Long> r_geneId = geneId.putResource(Long.valueOf(datavalue[1]));
    	gene.hasResource(r_geneId);
    	
    	if (!datavalue[2].equals("-")) {
        	Resource<String> r_symbol = symbol.putResource(datavalue[2]);
        	gene.hasResource(r_symbol);
    	}
    	
    	Resource<String> r_locusTag = locusTag.putResource(datavalue[3]);
    	gene.hasResource(r_locusTag);
    	
    	//String synonyms = datavalue[4];
    	//String dbXrefs = datavalue[5];

    	Resource<String> r_chromosome = chromosome.putResource(datavalue[6]);
    	gene.hasResource(r_chromosome);
    	
    	Resource<String> r_mapLocation = mapLocation.putResource(datavalue[7]);
    	gene.hasResource(r_mapLocation);
    	
    	Resource<String> r_description = geneDescription.putResource(datavalue[8]);
    	gene.hasResource(r_description);
    	
    	Resource<String> r_geneType = geneType.putResource(datavalue[9]);
    	gene.hasResource(r_geneType);
    	
    	if (!datavalue[10].equals("-")) {
        	Resource<String> r_symbol = symbol.putResource(datavalue[10]);
        	gene.hasResource(r_symbol);
    	}
    	
    	Resource<String> r_authorityFullName = authorityFullName.putResource(datavalue[11]);
    	gene.hasResource(r_authorityFullName);
    	
    	Resource<String> r_nomenclatureStatus = nomenclatureStatus.putResource(datavalue[12]);
    	gene.hasResource(r_nomenclatureStatus);
    	
    	Resource<String> r_otherDesignations = otherDesignations.putResource(datavalue[13]);
    	gene.hasResource(r_otherDesignations);

    	//String modificationDate = datavalue[14];

    	entryCounter++;
    }

If I’ve understood correctly how Grakn works, those plain 60000 “records” (with 11 columns each) will be converted behind the scenes into, at the very least, something like 60000 + (60000 * 11) = 720000 vertices and 60000 edges. Is that right?

After about 6 hours the import process is still running.

Am I doing something wrong, or is this expected?

FYI: the same import takes less than 1 minute with OrientDB or Neo4J.

TIA,
Antonio


#2

Hi again @xMAnton,

What version are you using, and when are you committing? I don’t see a commit in your sample code. It is best to commit small transactions frequently rather than build up one large transaction to commit.
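
Roughly something like this, reusing the variables from your snippet (just a sketch: the batch size of 500 is arbitrary, graph is whatever GraknGraph you opened, and depending on the version you may need to look your types up again after each commit):

    while ((line = reader.readLine()) != null) {
        // ... build the gene entity and attach its resources exactly as before ...

        entryCounter++;
        if (entryCounter % 500 == 0) {
            // commit a small transaction every few hundred entries
            // instead of building up one huge transaction
            graph.commit();
        }
    }
    // commit whatever is left over from the last partial batch
    graph.commit();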

We are still in the process of performance tuning, so things are not yet as fast as they could be, but 6 hours sounds far too slow.

Thanks for bringing this to our attention.


#3

Hi Filipe,

I’m on 0.9.0 and yes, you are right: there are no commits in that code block.
A final commit is expected at the end, because I’m using batch loading:

 GraknGraph graph = Grakn.factory(Grakn.DEFAULT_URI, "biograph").getGraphBatchLoading();

Maybe that “batch loading” is what pointed me in this direction?

Anyway, 6 hours seems like too much to me.

I’ve stopped it. I’m going to insert a commit inside the loop, just before its closing brace, and rerun.

Thank you.


#4

Ahhh, our documentation/API may have led you astray. The batch loading graph is not for loading large transactions; rather, it makes commits faster at the cost of disabling some safety checks. Those checks are not needed in your case, given the clean and single-threaded nature of your load, so you can use the batch graph without any trouble.

As to why things may be slow: as I said, we are still in the process of optimising loading performance, but the biggest issue is that the underlying database, Titan, caches a lot for each transaction, and with particularly large transactions garbage collection can kick in and slow things down quite severely.

Please keep us posted.


#6

Hello,

I just took a look at this problem.

I’m not sure if you sent us the entire ontology, but there were two errors in the snippet you posted above. First, you defined the biologicalEntity role without connecting it to a relation. Second, gene needs to be allowed to have the resource otherDesignations. I’ve made the appropriate changes here:

insert

#
# NCBI Gene
#
gene sub entity
    has-resource geneId
    has-resource symbol
    has-resource locusTag
    has-resource chromosome
    has-resource mapLocation
    has-resource geneDescription
    has-resource geneType
    has-resource authorityFullName
    has-resource otherDesignations
    has-resource nomenclatureStatus;

geneId sub resource datatype long;
symbol sub resource datatype string;
locusTag sub resource datatype string;
chromosome sub resource datatype string;
mapLocation sub resource datatype string;
geneDescription sub resource datatype string;
geneType sub resource datatype string;
authorityFullName sub resource datatype string;
nomenclatureStatus sub resource datatype string;
otherDesignations sub resource datatype string;

Because of these errors in the ontology, I suspect that many of the inserts were failing and that the validation errors were printing to the logs. That could also explain why the insertions were taking such a long time. You may not have been seeing these errors originally because the commit was outside of the loop, and validation is only performed when the data is committed to the graph.
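
For reference, if you do want to keep the biologicalEntity role, it has to be attached to a relation, along these lines (the relation and the second role below are placeholder names, purely for illustration):

biologicalEntity sub role;

# placeholder role and relation, just to show the shape
pathway sub role;
participation sub relation
    has-role biologicalEntity
    has-role pathway;

gene sub entity
    plays-role biologicalEntity;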

The other issue with the code you provided is that it loads in a single thread. I’ve taken that code and turned it into Graql statements which can be inserted using our Loader class. The Loader is an interface that inserts transactions across multiple threads, so as to optimise loading time without overwhelming the memory on your laptop. This is the code I mocked up:

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;

import ai.grakn.engine.loader.Loader;
import ai.grakn.engine.loader.LoaderImpl;
import ai.grakn.graql.Graql;
import ai.grakn.graql.Var;
import ch.qos.logback.classic.Level;
import ch.qos.logback.classic.Logger;
import org.slf4j.LoggerFactory;

import ai.grakn.exception.GraknValidationException;

import static ai.grakn.graql.Graql.var;

public class Main {
    @SuppressWarnings("unused")
    private static final org.slf4j.Logger LOG = LoggerFactory.getLogger(Main.class);

    private static String timeConversion(long seconds) {

        final int MINUTES_IN_AN_HOUR = 60;
        final int SECONDS_IN_A_MINUTE = 60;

        long minutes = seconds / SECONDS_IN_A_MINUTE;
        seconds -= minutes * SECONDS_IN_A_MINUTE;

        long hours = minutes / MINUTES_IN_AN_HOUR;
        minutes -= hours * MINUTES_IN_AN_HOUR;

        return hours + " hours " + minutes + " minutes " + seconds + " seconds";
    }

    public static void main(String[] args) throws IOException, GraknValidationException{
        disableInternalLogs();

        String homeDir = System.getProperty("user.home");
        String fileName = homeDir + "FILE_NAME";
        String line;
        int entryCounter = 0;
        //int edgeCounter = 0;
        long startTime = System.currentTimeMillis();

        BufferedReader reader = new BufferedReader(new FileReader(fileName));

        // skip first line
        reader.readLine();

        String geneEntity = "gene";
        String geneId = "geneId";
        String symbol = "symbol";
        String locusTag = "locusTag";
        String chromosome = "chromosome";
        String mapLocation = "mapLocation";
        String geneDescription = "geneDescription";
        String geneType = "geneType";
        String authorityFullName = "authorityFullName";
        String nomenclatureStatus = "nomenclatureStatus";
        String otherDesignations = "otherDesignations";

        // the Loader batches insert queries and executes them across multiple threads
        Loader loader = new LoaderImpl("biographdb");

        while ((line = reader.readLine()) != null) {
            String datavalue[] = line.split("\t");

            String taxId = datavalue[0];
            if (!taxId.equals("9606"))
                continue;

            Var gene = var().isa(geneEntity);

            gene.has(geneId, Long.valueOf(datavalue[1]));

            if (!datavalue[2].equals("-")) {
                gene.has(symbol, datavalue[2]);
            }

            gene.has(locusTag, datavalue[3]);

            //String synonyms = datavalue[4];
            //String dbXrefs = datavalue[5];

            gene.has(chromosome, datavalue[6]);
            gene.has(mapLocation, datavalue[7]);
            gene.has(geneDescription, datavalue[8]);
            gene.has(geneType, datavalue[9]);

            if (!datavalue[10].equals("-")) {
                gene.has(symbol, datavalue[10]);
            }

            gene.has(authorityFullName, datavalue[11]);
            gene.has(nomenclatureStatus, datavalue[12]);
            gene.has(otherDesignations, datavalue[13]);

            //String modificationDate = datavalue[14];

            loader.add(Graql.insert(gene));

            entryCounter++;

            if (entryCounter % 10000 == 0) {
            	System.out.print("."); System.out.flush();
            }
        }

        // send any remaining queued queries and wait for all batches to finish
        loader.flush();
        loader.waitToFinish();

        long stopTime = (System.currentTimeMillis()-startTime)/1000;
        System.out.println("\n\nCreated " + entryCounter + " entities and " + (entryCounter*11) + " resources in " + timeConversion(stopTime));

        reader.close();
    }

    public static void disableInternalLogs(){
        Logger logger = (Logger) org.slf4j.LoggerFactory.getLogger(org.slf4j.Logger.ROOT_LOGGER_NAME);
        logger.setLevel(Level.INFO);
    }
}

This is the output of the above code:

Created 59579 entities and 655369 resources in 0 hours 25 minutes 16 seconds

I went into Graql and confirmed the number of entities inserted was the same!

Keep in mind that to run this, you need to add a dependency on grakn-engine, which is where the Loaders are located. You may also need to add a grakn-engine.properties file.

I want to apologize for not having sufficient documentation of these batch loading features. We’ve been making a lot of changes very quickly and have fallen behind with our docs; we aim to fix this before release.

Just so that we can confirm what the issue was: would you be able to send us your grakn.log file? We want to make sure that we solve any bugs and give better failure information to the user.

Hope this helps! And definitely let us know if you have any other problems.

Alexandra


#7

THANK YOU, @alex!

Let us know if it works well on your side, @xMAnton! And sorry for the trouble! We’ll make sure to update our docs to describe this workflow more clearly! CC: @stichbury


#8

I would also like to add, @xMAnton, that although 25 minutes is still much longer than the 1 minute load time on Neo4j and OrientDB, this is far from our final write performance. We haven’t released any optimisation work yet; we are only just starting it and will release it in the coming month.

There are several key optimisations for both write and read operations that the guys are working on right now. And an important thing to note is that Grakn will also be able to scale write transactions horizontally!


#9

First, thanks for your support and responsiveness!

@alex

Alexandra, you were right: I left out some lines in my previous quick copy-and-paste of the ontology, because I had initially created it via the API. In fact, there were no validation errors after moving the commit inside the loop.

Using your suggested ontology (without the biologicalEntity role linked to gene) gives me “good” results:

Created 59839 entities and 658229 resources in 0 hours 18 minutes 26 seconds

However, I think that a better way to specify the concept of a biological entity is as follows:

biologicalEntity sub entity;

gene sub biologicalEntity
    ....

Times don’t change much:

Created 59839 entities and 658229 resources in 0 hours 18 minutes 56 seconds

Given the absence of any errors in my first long runs, I think the Loader is what really makes the difference now. I’ll definitely use it for all the next imports. Thank you very much for the hints.

@haikal I’ll keep you all informed of any future problems, and I am pretty sure that you will do your best.

Thanks a lot,
Antonio

