At GrapheneDB, one question users ask us quite often is how to import data. Sample datasets are good, but loading your own data is even better. This post will explain how to import data from a CSV file into Neo4j. After outlining the steps to take, we list some special considerations for GrapheneDB users.
One of the most important steps when evaluating a new technology for your stack is importing existing data. CSV is one of the most popular standards for data exchange and most of the popular database engines support exporting data in CSV format.
Starting with version 2.1, Neo4j includes a LOAD CSV Cypher clause for data import, which is a powerful ETL tool:
- It can load a CSV file from the local filesystem or from a remote URI (e.g. S3, Dropbox, GitHub)
- It can perform multiple operations in a single statement
- It can be combined with USING PERIODIC COMMIT to group the operations on multiple rows into transactions, making it possible to load large amounts of data
- Input data is mapped directly into a complex graph structure as outlined by the user
- It’s possible to manipulate or compute values at runtime
- It allows merging existing data (nodes, relationships, properties) rather than just adding it to the store
Have your graph data model ready
Before running the import process you will need to know how you want to map your data onto the graph. What are the nodes and relationships, and which properties will they have?
Tune cache and heap configuration
Make sure to increase the heap size generously, especially when importing large datasets, and also make sure the file buffer caches can fit the entire dataset.
You can estimate the size of your dataset on disk after the import by using the table in the official Neo4j docs.
Let’s assume we are going to store 100K nodes, 1M relationships, and one fixed-size property per node/relationship (e.g. an integer):
- Node store: 100,000 * 15B = 1.5 MB
- Relationship store: 1,000,000 * 34B = 34MB
- Property store: 1,100,000 * 41B = 45.1 MB
These are the minimum values you should use in your file buffer cache configuration.
Set up indexes and constraints
Indexes will make lookups faster during and after the load process. Make sure to include an index for every property used to locate nodes in MERGE queries.
An index can be created with the CREATE INDEX clause.
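For instance, assuming user nodes will be looked up by an email property (the User label and email property here are illustrative, not from a specific dataset), the index could look like this:

```cypher
// Index :User nodes on the email property to speed up MERGE lookups
CREATE INDEX ON :User(email)
```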
If a property must be unique, adding a constraint will also implicitly create an index. For example, if we want to make sure we don’t store any duplicate user nodes, we could use a constraint on the email property.
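A minimal sketch of such a constraint, again assuming a User label with an email property:

```cypher
// Enforce uniqueness of email across :User nodes;
// this also creates an index on :User(email) implicitly
CREATE CONSTRAINT ON (u:User) ASSERT u.email IS UNIQUE
```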
Loading and mapping data
The easiest way to load data from CSV is the LOAD CSV statement. It supports common options, such as accessing fields by column header or column index and configuring the field terminator character. Please refer to the official docs for further details.
To speed up the process, make sure to use USING PERIODIC COMMIT, which will group operations on multiple rows (by default 1000) into transactions and reduce the number of times Neo4j has to hit the disk to commit the changes.
Please note that values are read as strings, so make sure you do format conversion where appropriate, e.g. applying toInt() to columns holding integer numbers.
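Putting the pieces together, here is a sketch of what an import statement might look like, assuming a hypothetical users.csv with name, email and age columns hosted at an example URL:

```cypher
// Commit every 1000 rows instead of one large transaction
USING PERIODIC COMMIT 1000
LOAD CSV WITH HEADERS FROM "https://example.com/users.csv" AS row
// MERGE avoids duplicate nodes if the import is re-run;
// an index or constraint on :User(email) keeps this lookup fast
MERGE (u:User { email: row.email })
// CSV values arrive as strings, so convert where needed
SET u.name = row.name,
    u.age  = toInt(row.age)
```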
The load process can be run from the Neo4j shell, either interactively or by loading the Cypher code from a file using the -file option.
Alternatively, the code can be entered manually into the shell or the browser UI.
Considerations for GrapheneDB users
A few considerations when loading data into your GrapheneDB Neo4j instance:
- caches and heap can only be configured on the Standard plans and higher. They are fixed on the lower-end plans
- neo4j-shell does not support authentication and thus it can’t be used to load data into an instance hosted on GrapheneDB or otherwise secured with authentication credentials
- when running the command from the browser UI, bear in mind Neo4j won’t be able to access your local filesystem. You should provide a publicly available URL instead, e.g. a file hosted on AWS S3
- for larger datasets, we recommend running the import process locally and, once completed, performing a restore on your GrapheneDB instance
For a full tutorial, including tools to clean up CSV files, common pitfalls and more advanced tools like the super-fast batch importer, please refer to this comprehensive CSV import guide.
Please don’t hesitate to post any comments or contact our support team if you are having issues loading data into your GrapheneDB instance.