Graph databases are a special type of database that focuses on relationships and flexibility. They belong to the NoSQL category, and thus differ from the relational SQL databases most commonly used. Neo4j is the best known graph database and was originally developed in Sweden. There are, of course, many database types and you cannot learn everything. But there are some reasons that make it more interesting to look into graph databases right now than before:
- Graph databases have matured.
There has been a lot of creativity in this segment, and graph databases are now a category that has matured and is ready for production use. Over the last 2-3 years, Neo4j has established itself and received a lot of positive attention.
- There are now more relevant use cases.
People want to search data in different ways than before. Examples: social networks, recommendations as a basis for new purchases, etc. There are also emerging scientific fields with a demand for flexible data structures: life sciences, forensic analysis, etc.
- The architectural drive.
As microservice architecture has become popular, the choice of data storage has become much more flexible. When you had to pick one data store that would cover all the needs in your system, the traditional relational database was often the obvious choice. Now that each microservice is free to pick the data store that is optimal for its usage, a graph database suddenly becomes a viable option.
With this as background, I have over the last year become more interested in looking into graph databases. So far that has resulted in a couple of talks at two conferences / events: SweTugg in Stockholm and nForum in Göteborg.
I have read a lot on graph database design and modeling and have prototyped some demo cases useful for learning.
This year Neo4j is conducting a global Graph Tour to showcase the technology for a broader audience. One of the ten stops is Stockholm. The conference was one day and was preceded by an optional workshop day on data modeling.
Graph modeling workshop
The workshop day was very intense with a lot of material to cover. To complete the exercises I had to spend some additional hours in my hotel room.
The main focus was on the way of thinking, as it is very different from traditional modeling done in relational databases. You need to understand how a graph database works in order to model the nodes and relationships in a fruitful way.
The major takeaway for me was that there seldom is a ”correct” way to model a graph database. It all depends on how you want to query the data. Often you need to experiment and adjust the model along the way. Fortunately it is quite easy to start with a prototype and try things out. There are good ways to profile the query plans, which give good insight for changes and remodeling.
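To make the idea of experimenting with a model concrete, here is a toy in-memory property graph in Python. This is not Neo4j and not its API; it is just a minimal sketch of the node/relationship mindset, and all names (Person, Company, WORKS_AT) are made up for illustration. In a real prototype you would of course do this directly in Neo4j with Cypher.

```python
# A toy property graph: nodes carry a label and properties,
# relationships are directed and typed. Purely illustrative --
# not Neo4j, just the modeling concept in a few lines.

class Graph:
    def __init__(self):
        self.nodes = {}   # node_id -> {"label": str, "props": dict}
        self.rels = []    # (start_id, rel_type, end_id)

    def add_node(self, node_id, label, **props):
        self.nodes[node_id] = {"label": label, "props": props}

    def add_rel(self, start, rel_type, end):
        self.rels.append((start, rel_type, end))

    def traverse(self, start, rel_type):
        """Follow outgoing relationships of one type from a single node."""
        return [end for (s, t, end) in self.rels
                if s == start and t == rel_type]

g = Graph()
g.add_node("alice", "Person", name="Alice")
g.add_node("acme", "Company", name="Acme")
g.add_rel("alice", "WORKS_AT", "acme")

print(g.traverse("alice", "WORKS_AT"))  # ['alice' works at] -> ['acme']
```

The point is how cheap it is to rearrange such a model: adding a new relationship type or node label does not require a schema migration, which is what makes the ”evolve as you learn” approach practical.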
There are however some common hints that you should be aware of:
- You will not find the optimal model right away. Start with the picture you have and evolve it as you learn. This is a paradigm shift, as we are all used to looking for the ”perfect data schema” before we continue with the business logic. Typically we spend a lot of time there and are hesitant to change it later. In graph modeling you expect change and welcome it.
- Avoid bulk querying data. Try to target some specific nodes of interest and ”traverse the graph” from there. That is: don’t use bulk joins as you would in traditional relational data modeling.
- Make a balanced choice of properties vs relationships. The rule of thumb is that if you will search on a piece of data, it is often better to model it as a relationship.
- Avoid dense nodes with too many relationships. There are patterns for splitting them up into supporting nodes that are easier to query.
- Use ”in-graph indexes” over traditional indexes when possible. The idea is that you try to model the data in a way that makes the graph structure itself act as its own ”index”.
- Allow redundant data when it is appropriate. Depending on your queries some minor redundancy can be accepted as it might greatly improve the query speed if you need to support queries that look at data from different angles.
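The ”traverse from specific nodes” and ”in-graph index” hints above can be sketched together. The example below is a hedged illustration, not a Neo4j feature: instead of scanning every order for a date property (a bulk query), Day nodes are linked in a chain and each order hangs off its day. A range query then anchors on one Day node and walks the chain. All names here (Day, the next link, orders) are invented for the sketch.

```python
# Illustrative "in-graph index": Day nodes linked in a chain,
# with orders attached to each day. A date-range query anchors
# on the first day and traverses, instead of scanning all orders.

days = {}   # date -> {"next": following date or None, "orders": [...]}

def add_day(date, next_date=None):
    days[date] = {"next": next_date, "orders": []}

def place_order(order_id, date):
    days[date]["orders"].append(order_id)

def orders_in_range(start_date, n_days):
    """Walk the day chain from an anchor node; never scan all orders."""
    result, current = [], start_date
    for _ in range(n_days):
        if current is None:
            break
        result.extend(days[current]["orders"])
        current = days[current]["next"]
    return result

add_day("2018-05-03")
add_day("2018-05-02", next_date="2018-05-03")
add_day("2018-05-01", next_date="2018-05-02")
place_order("o1", "2018-05-01")
place_order("o2", "2018-05-03")

print(orders_in_range("2018-05-01", 3))  # ['o1', 'o2']
```

Note the redundancy trade-off from the last hint: the day chain duplicates information that a date property already carries, but it lets range queries run proportionally to the days traversed rather than to the total number of orders.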
Graph Tour conference
Neo4j founder Emil Eifrem started the conference day with a keynote giving an overview of what has happened in the graph database market over the last 10 years. He focused on a couple of case studies that have shown the power of Neo4j: the Panama Papers, solving a tricky NASA problem, and helping cure cancer.
In 2014 Forrester Research presented a prediction that within three years 25% of the world's top enterprises would be using graph databases. At the time it was a stunning prediction that seemed quite unlikely. When measured in 2017, the outcome even exceeded the prediction: more than 50% of the top enterprises today use graph databases.
The next talk continued by describing the product development done in recent years. There has been a lot of focus on performance, clustering and stability. Another area has been improving developer tools and platforms to make modeling, optimizing and visualization much easier.
The major piece of news was probably that Neo4j is announcing an upcoming cloud platform. Previously you had to either host your own instances or rely on a third-party company providing PaaS. Now Neo4j will provide a native cloud offering that will allow global enterprise scale. For many companies this will probably be a very interesting option.
One talk gave a lot of practical suggestions on how to use features of the tooling to solve specific problems.
The last talk emphasized the fundamental difference it makes to process graph data in a product that is built from the bottom up to target graphs. As graph databases are becoming increasingly popular, many companies release products that are in fact add-ons or twists on current products and techniques. They look good on the surface, but when tested in real-world scenarios they simply don’t perform. Neo4j is one of the very few alternatives that is dedicated to thinking graph in a consistent way. This is especially true when you need to perform extensive graph analysis on your data. For simple read-write scenarios some competitors perform OK, but for deep queries, analysis and predictions, they simply can’t solve the problems. On the other hand, Neo4j never pretends to solve other data scenarios. But for the right use cases it is a very good tool for the job.
Three customers described how they are using Neo4j:
- Volvo Cars uses it for advanced analytics in production development.
- Previa uses it to extract relevant information from their huge historical data stores. This way they are able to predict upcoming health issues and even prevent them with the right form of intervention.
- The insurance company If uses it to investigate suspicious customer claims. By gathering data from ten internal systems they are able to find interesting patterns in ways that previously took days or were not even possible.
I am more persuaded than ever that graph databases are becoming a very interesting option for many projects. On the other hand, it is also obvious that there is a learning curve you should be aware of, and graph databases are not suitable for all use cases. In 2-3 years I expect that most developers will be as familiar with graph databases as they are now with document databases, key-value stores and column stores. It will become a natural tool that you should be familiar with and have in your toolbox.