I recently started work on a new project which, to avoid getting into too many details, is a social media application with similarities to Twitter, Facebook, and FourSquare – though it is not a clone of any of these!
This is currently a hobby project of mine that I think may have some future potential. As a result, I’m allowing myself the freedom to experiment with technologies outside of what I’d normally work with. On this specific project I plan to use Ruby on Rails 3 (currently in Beta) and deploy the final application to Heroku. (Side note: we really need something like Heroku in the ColdFusion world.)
Because this is a social application and many social applications make use of so-called NoSQL databases, I started researching these. My research began by talking to John Paul Ashenfelter at this year’s CF.objective(). John gave a talk entitled Say NO to SQL. Sadly, I missed this talk, but I sat down with John after the fact and talked about the various types of NoSQL databases and where they fit in.
It turns out that there are several types of NoSQL databases. These include column stores, key-value stores, document stores, and graph databases. For this article I’m going to ignore all of them except graph databases.
So, what exactly is a graph database?
Let’s start exploring this by looking at standard relational database systems. There are a number of well-established RDBMS such as MySQL, PostgreSQL, MSSQL, Oracle, etc. Chances are you’re familiar with at least one of these. These types of systems store data in tables that are made up of columns. Each record in the database provides values for these columns. For referential integrity, some columns may reference a column in another table via a foreign key.
These foreign keys are the only way to relate data in a relational database. And, for the most part, through normalization, these systems can model essentially any data.
The problem isn’t really in the modeling, however. The problem is in how you get data out of the database. Consider a situation where you’re modeling a social network. In this network I may have dozens of friends and you may have dozens of friends, and each of our friends may have their own dozens of friends. Invariably, some of these will be the same people.
Now, I ask you, how would you find out how I know you using SQL? How would you be able to figure out how I know Kevin Bacon in SQL? Furthermore, how would you do this in any efficient manner? The answer is: not easily.
Although you can model this data in a relational database, these systems are simply not optimized to query this type of information back out.
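To make the difficulty a little more concrete, here is a minimal sketch using Python’s built-in sqlite3 module. The table names, column names, and sample people are all made up for illustration. It answers only one narrow question: which mutual friend connects two people who are exactly two hops apart?

```python
import sqlite3

# In-memory database with a hypothetical schema (names are illustrative only).
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE people (id INTEGER PRIMARY KEY, name TEXT NOT NULL);
    CREATE TABLE friendships (
        person_id INTEGER NOT NULL REFERENCES people(id),
        friend_id INTEGER NOT NULL REFERENCES people(id)
    );
""")

conn.executemany(
    "INSERT INTO people (id, name) VALUES (?, ?)",
    [(1, "Doug"), (2, "Jim"), (3, "John")],
)

# Store every friendship in both directions so it can be followed either way.
pairs = [(1, 2), (2, 3)]
conn.executemany(
    "INSERT INTO friendships (person_id, friend_id) VALUES (?, ?)",
    pairs + [(b, a) for a, b in pairs],
)

# Two hops means joining the friendships table to itself once.
two_hops = conn.execute(
    """
    SELECT mutual.name
    FROM friendships AS f1
    JOIN friendships AS f2 ON f2.person_id = f1.friend_id
    JOIN people AS mutual ON mutual.id = f1.friend_id
    WHERE f1.person_id = ? AND f2.friend_id = ?
    """,
    (1, 3),
).fetchall()

print(two_hops)
```

Every additional degree of separation needs another self-join on the friendships table (or a recursive query), and you have to know or guess the depth in advance. That is the part that gets awkward quickly.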
There are, however, alternatives. You guessed it: graph databases.
A graph database is a system that stores data in “nodes” that are connected to other nodes via “edges”. In most graph databases nodes and edges can have associated properties. Most graph databases allow for traversals between related nodes.
This image shows how you might model the information used in a social network.
In the example above I have created five nodes to represent people in a social network. I’ve also created relationships between them. The relationships would be the “edges” referenced above. Note that each node and relationship has various properties. For example, you can see that I (Doug Hughes) am 32. You can also see that I know Joe Blow and Jim Bob. Of course, the graph database can also store different types of objects such as products, etc.
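As a rough sketch of the idea (not any particular graph database’s API), nodes and edges with properties could be represented in plain Python like this. The class names, the "KNOWS" label, and the "since" property are hypothetical. Only the names and my age come from the example above.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    name: str
    properties: dict = field(default_factory=dict)


@dataclass
class Edge:
    start: Node
    end: Node
    kind: str
    properties: dict = field(default_factory=dict)


doug = Node("Doug Hughes", {"age": 32})
joe = Node("Joe Blow")
jim = Node("Jim Bob")

edges = [
    Edge(doug, joe, "KNOWS"),
    Edge(doug, jim, "KNOWS", {"since": 2005}),  # "since" is a made-up property
]


def neighbors(node, edges):
    """Return the names of nodes connected to `node`, ignoring edge direction."""
    names = []
    for edge in edges:
        if edge.start is node:
            names.append(edge.end.name)
        elif edge.end is node:
            names.append(edge.start.name)
    return names


print(neighbors(doug, edges))
```

A real graph database stores and indexes this structure for you, but the mental model is the same: things, connections between things, and properties on both.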
A defining characteristic of a graph database is supposed to be the ability to traverse nodes quickly. So, using an API provided by the graph database system I can quickly find out how I know John Doe. The answer is through our mutual friendships with Jim Bob (or through Jim and Belva). This is also useful for situations where you want to find common themes. For example, Amazon has a feature that shows what other customers who purchased a specific product also purchased.
There are variations between graph databases as well. For example, some use directional relationships and others use bidirectional relationships. The difference is that a directional relationship may not necessarily be reciprocal. For example, on Twitter, I could follow you, but maybe you don’t follow me. Bidirectional relationships are more like Facebook where if I’m your friend, you’re my friend. The example above would be bidirectional.
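To illustrate what such a traversal does, here is a small breadth-first search in plain Python. This is only a sketch of the general technique, not the API of any graph database, and the friendship list is my reconstruction of the example network from the description above, so treat the exact edges as an assumption.

```python
from collections import deque

# Friendships reconstructed from the prose description (an assumption).
friends = {
    "Doug Hughes": ["Joe Blow", "Jim Bob"],
    "Joe Blow": ["Doug Hughes"],
    "Jim Bob": ["Doug Hughes", "John Doe", "Belva"],
    "Belva": ["Jim Bob", "John Doe"],
    "John Doe": ["Jim Bob", "Belva"],
}


def shortest_path(graph, start, goal):
    """Breadth-first search: return the shortest chain of people, or None."""
    queue = deque([[start]])
    visited = {start}
    while queue:
        path = queue.popleft()
        person = path[-1]
        if person == goal:
            return path
        for friend in graph.get(person, []):
            if friend not in visited:
                visited.add(friend)
                queue.append(path + [friend])
    return None


print(shortest_path(friends, "Doug Hughes", "John Doe"))
# → ['Doug Hughes', 'Jim Bob', 'John Doe']
```

Because the search expands one degree of separation at a time, the first path it finds is a shortest one, which is why it reports the route through Jim Bob rather than the longer one through Jim and Belva.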
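The difference between the two kinds of relationship can be sketched in a few lines of Python. The function and variable names here are hypothetical. A directional edge is stored as an ordered pair, while a bidirectional one is stored as an unordered pair.

```python
# Directional (Twitter-style): an ordered (follower, followee) pair.
follows = set()


def follow(follower, followee):
    follows.add((follower, followee))


# Bidirectional (Facebook-style): an unordered pair, so order cannot matter.
friendships = set()


def befriend(a, b):
    friendships.add(frozenset((a, b)))


follow("me", "you")
print(("me", "you") in follows)   # I follow you
print(("you", "me") in follows)   # but you do not follow me

befriend("me", "you")
print(frozenset(("you", "me")) in friendships)  # friendship holds both ways
```

With ordered pairs the reverse relationship has to be added explicitly. With unordered pairs there is no "reverse" to speak of.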
Because of the nature of graph databases, they are very fast for traversing nodes and finding related data. I’m not entirely sure at this point where they break down. I’ve read that they’re not as efficient for large-scale updates where you may be updating a lot of records at one time. Beyond this, your mileage may vary.
I’ve done a lot of reading up on different graph databases. The ones that stood out to me were these:
Neo4J appears to be the most widely used graph database and is the one I’ve spent the most time researching. It’s available through a very restrictive AGPL license or commercially. It strikes me as very expensive to license.
Neo4J is an embedded directional graphing database written in Java. The FOSS version provides a JAR that you download and make use of in your application. Alternatively, there is also a stand-alone version that exposes a RESTful API.
There are a number of language bindings available, most of which use the REST API. There are, however, native JRuby bindings. I’d be interested in this, except for the fact that Heroku doesn’t support JRuby.
It’s my interpretation that Neo4J still needs a little baking to really be a good solution. For example, the REST API has no security built in. Anyone who can connect to the port that is exposed can add, update, or delete information in the database.
Neo4J handles scaling and redundancy much like a typical RDBMS. Specifically, the paid version allows you to somehow replicate data to hot-spare servers. If you need to shard your data across multiple servers you must manage it manually within your application.
From everything I’ve read, Neo4J really seems to be the strongest graph database. On the negative side, I can’t tell whether the paid version differs at all from the FOSS version. Documentation is pretty good, but seems to be lacking in some areas (specifically related to high availability).
Oh, it’s also apparently blindingly fast. However, I can’t find any information on how the use of the REST API impacts performance.
If you’re interested in experimenting with Neo4J in ColdFusion, I suggest you check out Brian Panulla’s blog entry entitled Using Neo4j Graph Databases With ColdFusion.
InfiniteGraph describes itself as a distributed database for web-scale systems. Currently in public beta, it is slated for release in late July 2010 (any time now, really).
This system is written in Java and supports server-based, cloud-based, and embedded use. The main selling point of InfiniteGraph is that it can apparently achieve nearly linear performance scaling as you add servers.
I can’t find where I read this, but my memory is telling me that InfiniteGraph uses bidirectional relationships.
InfiniteGraph will be a closed source, proprietary, for-fee product when it is released. They do have programs for free usage, but they seem to tie you to a specific hosting provider. It even looks like you need to pay for developer licenses.
This bears watching, but I’m concerned about the licensing details and pricing. Furthermore, it might not be terribly easy to connect to from non-Java languages. A C# API is due in the next major release.
FlockDB is Twitter’s own graphing database. However, this is not quite what it seems to be. As the Twitter developer blog explains in the provided link, FlockDB was engineered as a specific solution to scalability problems Twitter was experiencing.
Behind the scenes FlockDB actually just uses MySQL. Additionally, despite the fact that FlockDB is a graph database, it’s not actually optimized for graph traversal. Instead, it’s very good at adjacency lists (who’s following whom). Flock also allows for horizontal scaling, though this appears to be somewhat manual.
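For anyone unfamiliar with the term, an adjacency list simply records, for each node, the set of nodes it is directly connected to. The sketch below shows the general idea in plain Python. It is not FlockDB’s API or schema, and the user names are invented.

```python
from collections import defaultdict

# Two adjacency lists: one per direction of the "follows" relationship.
following = defaultdict(set)  # user -> users they follow
followers = defaultdict(set)  # user -> users who follow them


def add_follow(follower, followee):
    """Record a follow in both lists so either direction is a direct lookup."""
    following[follower].add(followee)
    followers[followee].add(follower)


add_follow("alice", "bob")
add_follow("carol", "bob")
add_follow("bob", "alice")

print(sorted(followers["bob"]))   # who follows bob
print(sorted(following["bob"]))   # whom bob follows
```

Both questions are answered with a single lookup, which is exactly the one-hop case. Anything that requires walking several hops away from the starting user is a different problem, and that is the traversal work a general graph database is aimed at.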
In the end, I honestly haven’t done much reading on this tool since it didn’t really match what I was looking for in a graphing database.
There are a number of other graphing database engines available, but most of them are fairly specialized or are pretty esoteric. I’m not sure I would want to deploy a large scale system on any of the alternatives.
What am I Doing With My Social Application?
After doing quite a bit of research in this area and briefly experimenting with Neo4J, I’ve actually elected not to go the route of using a graphing database for my project. The reason I made this decision is that all of the graphing database implementations I looked at were either immature, lacking in documentation, or were difficult to talk to from my chosen language.
Furthermore, you may remember that my hosting platform of choice is Heroku. Heroku actually runs on Amazon’s EC2 service, which makes it easy for me to run my own EC2 servers to host my database server instances. However, in the end, I’ve decided to simply use PostgreSQL, which Heroku already supports.
I have to keep in mind that what I’m building right now is really just a hobby application. I can’t justify spending a ton of money experimenting with a database offering that isn’t really required at this point. If, in the future, I do reach a point where I need really, really fast access to related data, I may port over to using Neo4J. Only time will tell!