Neo4j recently introduced the concept of labels and their sidekick, schema indexes. Labels are a way of attaching one or more simple types to nodes, while schema indexes automatically index labelled nodes by one or more of their properties. Cypher then uses those indexes implicitly, both as secondary indexes and to infer the starting point(s) of a query.

In this blog post I would like to shed some light on how these new constructs work together. Some details will inevitably be specific to the current version of Neo4j and might change in the future, but I still think it’s an interesting exercise.

Before we start, though, I need to populate the graph with some data. I’m more into cartoons for toddlers than second-rate sci-fi, and therefore Peppa Pig shall be my universe.
So let’s create some labeled graph resources.

create (n:Person {first_name : "Peppa", last_name : "Pig"})
create (n:Person {first_name : "George", last_name : "Pig"})
create (n:Person {first_name : "Mummy", last_name : "Pig"})
create (n:Person {first_name : "Daddy", last_name : "Pig"})
create (n:Location {location : "The Pigs house"})
create (n:Person {name : "Bob the bat"})

The previous Cypher statements insert 4 nodes that represent the “Pig” family, which are labeled as “Person”, and a “Location” node. If you are paying attention, you might have noticed that I also added a dubious “Bob the bat” node, also labeled as “Person”. This node bears different properties from the other “Person” nodes and is intended to help me illustrate a point below.

At this point Cypher knows how to find nodes by their labels but nothing has been indexed yet. Let’s try a very simple query.

match p:Person
where p.first_name = "George"
return p.first_name, p.last_name

This will cause Cypher to scan through all the nodes labelled as “Person” and will fail in this case with a message similar to “The property ‘first_name’ does not exist on Node[6]”. The reason is that when “Bob the bat” is encountered, Cypher can’t find a “first_name” property on it.

We can fix the query by adding “!” to the “first_name” property in the “where” clause, which instructs Cypher to disregard any node that doesn’t have that property:

match p:Person
where p.first_name! = "George"
return p.first_name, p.last_name

This query will give the expected result but we can do better. Let’s create a schema index on “first_name”.

create index on :Person(first_name)

Now we can rerun the first query (the one without the “!”), which this time should return the expected result. The “Bob the bat” node will no longer cause us any trouble because the “George” node is returned by an index lookup on “first_name”. “Bob the bat” isn’t even in the index! Obviously, for a bigger graph, performance will also be significantly better.
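
To make the difference between the three behaviours a bit more tangible, here is a small toy model in Python. It is only a sketch of the semantics described in this post and not how Neo4j is implemented internally. The node ids (1 to 6, so that “Bob the bat” ends up as Node[6]), the function names and the MissingPropertyError class are all assumptions of mine, made up for illustration.

```python
# Toy model of label scans vs. schema index lookups.
# Everything here is hypothetical illustration code, not Neo4j's API.

class MissingPropertyError(Exception):
    """Raised when a strict property access hits a node lacking the property."""

# Assumed node ids 1..6, in the order the nodes were created in this post.
NODES = {
    1: {"labels": {"Person"}, "props": {"first_name": "Peppa", "last_name": "Pig"}},
    2: {"labels": {"Person"}, "props": {"first_name": "George", "last_name": "Pig"}},
    3: {"labels": {"Person"}, "props": {"first_name": "Mummy", "last_name": "Pig"}},
    4: {"labels": {"Person"}, "props": {"first_name": "Daddy", "last_name": "Pig"}},
    5: {"labels": {"Location"}, "props": {"location": "The Pigs house"}},
    6: {"labels": {"Person"}, "props": {"name": "Bob the bat"}},
}

def label_scan(nodes, label):
    """Return the ids of all nodes carrying the given label, in id order."""
    return [nid for nid, node in sorted(nodes.items()) if label in node["labels"]]

def strict_match(nodes, label, key, value):
    """Label scan with strict property access (the query without '!')."""
    hits = []
    for nid in label_scan(nodes, label):
        props = nodes[nid]["props"]
        if key not in props:
            raise MissingPropertyError(
                f"The property '{key}' does not exist on Node[{nid}]"
            )
        if props[key] == value:
            hits.append(nid)
    return hits

def tolerant_match(nodes, label, key, value):
    """Label scan that skips nodes lacking the property (the '!' variant)."""
    return [
        nid
        for nid in label_scan(nodes, label)
        if key in nodes[nid]["props"] and nodes[nid]["props"][key] == value
    ]

def build_index(nodes, label, key):
    """Build a value -> node ids index; nodes without the property are left out."""
    index = {}
    for nid in label_scan(nodes, label):
        props = nodes[nid]["props"]
        if key in props:
            index.setdefault(props[key], []).append(nid)
    return index

def indexed_match(index, value):
    """Look the value up directly, never touching unindexed nodes."""
    return index.get(value, [])
```

In this model, strict_match(NODES, "Person", "first_name", "George") raises as soon as the scan reaches “Bob the bat”, tolerant_match simply skips him, and indexed_match(build_index(NODES, "Person", "first_name"), "George") never sees him at all because he was never put in the index.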