Mapping Water Supply Systems With a Graph Database in the Cloud

If you’re used to everything being in tables, rows and columns like I am, you’ve probably heard about this fancy thing called graph databases that does things a little bit differently.

I was curious, so I decided to try my hand at setting one up in the cloud, putting some data in it and doing some queries. Naturally, that means this is very much a blog post for newbies by another newbie.

Dunning-Kruger effectIt also means that this is the perfect opportunity to introduce my “Dunning-Kruger Zone” category. This is the place on my blog where I more than anything else write about stuff that I’ve only just tested out or areas where I consider myself completely green.

So why would you want to read about some newbie explaining things that he doesn’t know anything about?

Well, you might not. In that case, here’s a video of some guy hitting himself in the nuts with a nunchuck instead. Enjoy!

If you’re still here, the idea is that I can explain things to others in the same manner I explained them to myself while I was trying to learn. I’m hoping this will make it easier for other newbies to get started by providing just the right initial amount and level of understanding.

So if I accidentally write stupid or incorrect stuff, at least I can say I was being perfectly honest about it!

And yes, I realize the fact that pointing this out kind of defeats the whole idea of the Dunning-Kruger effect. Still, I thought “Dunning-Kruger Zone” was a bit more catchy than the “Dumbass Zone.”

The Associative Structure of Graph Databases

As I said initially, graph databases aren’t concerned with tables, rows and columns. They are instead concerned with the following 3 types of elements:

1. Nodes/Vertices

These are the objects in the graph, like a person, firm, product or whatever else you can imagine. You might think of them as the nouns in the graph.

Graph structure
Source: Neo4j

In the example above, we have nodes of different types like persons, locations, restaurants and cuisines. These are the node labels, or the different categories of nouns if you will.

In a relational database, a person node would instead be a row in a table called “Persons” – or “DimPersons,” if you’re old school and don’t mind obvious double entendres in your database.

Similarly, a restaurant node would be a row in a table called “Restaurants.” In a graph database, person and restaurant nodes are instead distinguished by their different labels.

In addition to labels, nodes have properties – which I will explain in a moment.

2. Edges

Edges are relationships between nodes. If nodes are the nouns, then edges are the verbs in the graph.

In relational databases, relationships between different objects are implicit from sharing common keys and are made explicit by joining tables. Even then, the actual characteristics of the relationship are not necessarily explicit in the data unless you store them in one or both of the tables or use join tables – which you usually always have to do with many-to-many relationships. Creating parent-child relationships and larger hierarchical structures also demands a little bit of trickery.

And while all of these things are of course very much solveable in relational databases, graph databases solve them more elegantly by always storing the relationships themselves as entities that exist between the nodes of both the same type and different types.

In the example above, the edges represent friendships between person nodes, as well as how these persons like different restaurants, both of which serve the same cuisine and are located in the same city.

And like nodes, edges can have properties, which is the final important element.

3. Properties

Properties are the adjectives in the graph. They describe the characteristics of both nodes and edges, just like a row in a relational database has columns to describe the characteristics of this record.

The main difference from a relational database here is of course the fact that you can store properties in the relationships themselves in a very natural way.

Not only can you explicitly say that a particular relationship is a friend type of relationship, but you might also say that these two person nodes have been friends for 7 years and that they met at soccer practice. This information is put in the edge between the nodes.

Maybe the person in the graph who likes the iSushi restaurant likes it because he had his first date with his wife at that place. That’s a characteristic of the relationship between him and the restaurant, and can be stored as a property of that edge.

Why Graph Databases?

So with these things in mind, why and when would you use a graph database instead of a relational database?

Well, in most cases you would probably not use a graph database as your primary storage unless you’re dealing with data that is in no way a natural fit for a classical schema and a much better fit for a graph-like schema. There’s a reason that relational databases have been around for so long and will likely continue to serve us well for a long time.

Relational database

However, if …

  • the actual relationships between entities in your data are just as interesting or more interesting than the entities themselves,
  • you have a lot of these relationships,
  • and you’re dealing with evolving schemas

… a graph database will likely be …

  • a lot faster for graph-like queries,
  • more scaleable,
  • and more flexible with regards to schemas.

Neo4j – the most popular graph database currently in use – has a nice explanation of all this on their website, where they also point out the striking irony in the fact that graph databases are much better at dealing with relationships than relational databases.

The prime example of a good fit for graph databases are social networks, because a social network is very much an extremely large and complex graph.

And before somebody start talking about “unstructured” data …

Graph databases are not for “unstructured” data. Graph databases are for data with a specific structure – a graph structure, in the same way that document databases are for data with a document structure and spatial databases are for data with a geometric structure.

Cloud is King

As the true BI/Data Science-millennial that I am, I always do everything in the cloud whenever possible.

I’ll be using the Neo4j graph database that I mentioned earlier. It’s open source, it’s been around for a little while, and it uses an SQL-like query language called Cypher which will make the SQL-savvy feel right at home.

It’s easy to get started with, and there’s a lot of accessible documentation on their websites. If you want very detailed instructions on setting up, configuring and querying your Neo4j DB, that’s where you should go. I’ll cover how you can get it running in the cloud.

Setting Up Your Own Graph Database in the Cloud

Microsoft Azure allows you to set up a virtual machine running Neo4j 2.0.1 on a Linux Ubuntu distribution with a few clicks.

This also sets up a storage account, an app service for accessing the database through a simple URL and a virtual network. Plug and play!

Graph database in the cloud; Neo4j VM in Microsoft Azure

There’s a guide for setting up the DB in Azure here, but it’s outdated and overly complicated because the new Azure portal does almost everything for you now.

The only thing you have to do is add an endpoint to the VM called “Neo4j” with port 7474 on TCP public and private. Just keep in mind that this makes your database publically available – albeit with a password-protected login – so you might not want to put your credit card information or naked pictures in there (but hey, I don’t judge!).

After this, the setup works straight out of the box and you can access your database through the URL for your app service and start playing with it. I highly recommend checking out the tutorials on the Neo4j website, as they do a great job of teaching the basics. They’re also accessible directly through the web GUI.

If you want to get more advanced, you can start using the Neo4j REST API and build your own stuff on top of the Neo4j DB.

Upgrading Neo4j in Azure

Unfortunately, 2.0.1 is an old version of Neo4j that doesn’t support the uploading of CSV files and some other new features, making it difficult to test it out with data you’ve got lying around.

This means you’ll probably want to upgrade the database, which for me was a royal pain in the ass – especially since I’ve never used Linux before in my whole life.

I’ll try to save you some pain by explaining what I had to do. First, I had to download Putty to access Ubuntu on the VM, which was fine.

Then came the actual updating of the Neo4j package. This would’ve usually been done by writing the following in Putty:

sudo apt-get update
sudo apt-get upgrade

The first command will give you an error for the Neo4j package, saying that …

The following signatures couldn't be verified because the public key is not available: NO_PUBKEY

… and then a key consisting of 16 letters and numbers. Write down this key.

Presumably, this error occurs because Neo4j have changed their key server and you need to get a new key, but being a complete Linux newbie I’m really just guessing.

Anyway, what you need to do is write the following:

sudo apt-key adv --recv-keys --keyserver https://debian.neo4j.org/neotechnology.gpg.keyXXXXXXXXXXXXXXXX

Replace the X’es with the key you wrote down earlier. After this, you should be able to update the repository and upgrade Neo4j by running …

sudo apt-get update

… and then …

sudo apt-get upgrade

Please note that I’ve tried to reproduce this process from my memory as accurately as possible, but I went back and forth with this so many times that I might’ve accidentally left something out or made some other mistakes.

I made this explanation because I was unable to find any information about upgrading Neo4j on the Azure Linux VM specifically, and it took a long time and a lot of trial and error to piece the solution together from different sources. All the other stuff on how to actually work with Neo4j is easy to find and understand.

If you’re still unable to upgrade the database with these simple steps, please drop a comment below and I’ll try to help out!

Importing, Querying and Visualizing Water Supply Systems

Hopefully, you now have your graph DB up and running. At this point, you can start either writing data to the database directly or import data that you’d like to see in a graph. Neo4j has a simple guide to importing CSV’s.

Since everyone is mostly using social networks or recommender systems for this, I decided to try something different and test it with water supply systems, which I felt would be a nice fit for a graph database.

The Norwegian Food Safety Authority has data on selected water supply systems publically available on their website. I imported these data sets into the graph DB and created relationships between the different entities. I won’t cover how to write queries here, because Neo4j already covers that extremely well on their website.

The result?

Water supply systems in graph database

The graph contains nodes with four different labels: water supply systems, water intake points, water transportation systems and water treatment plants.

  • One cluster of nodes represents a water supply system, connected to the large, blue water supply system node in the middle.
  • Each water supply system has one or several different water transportation systems that transport water within the system, and these are the yellow nodes.
  • Each water supply system also has one or several different water treatment plants to process water, indicated by the green nodes.
  • Finally, the red nodes are the water intake points, which are the water sources for the water supply system.

Water supply systems in graph database

The different nodes are connected to each other in different ways. For example:

  • Water transportation systems transport water to water treatment plants.
  • Water treatment plants process water from water intake points.
  • Water intake points are sources of water for water supply systems.

The properties of both nodes and edges are also accessible in the GUI.

watersupplysystems4

I’m not an expert on how water supply systems work, but it was still fun experimenting with creating relationships and visualizing the results.

I could’ve easily gone further and created node labels from many of the properties of all the different components, added nodes for things such as regions, provinces and municipalities. I could’ve also added much more information to the edges from the available data, like how much water was transported in a given year.

Wrapping up, I’m looking forward to experimenting more with graph databases in the coming months. They’re definitely not the singular future of data storage and the answer to all your relational database woes, but they’re a very good fit for data that actually has a natural graph-like structure.

In the future, we might stop talking about “unstructured” data and fix-it-all storage solutions and instead be more concerned with finding and using the appropriately structured storage for the natural structure in the data – rather than trying to fit it within a very particular schema or no schema at all.

The type of data and variety of use cases should dictate the type of database, not the other way around.

Anyway, I hope I’ve provided you with just enough material to get you started on your own DB – whether it be in the cloud or on the ground. Good luck!

Classifying “Bullshit” Quotes With Azure Machine Learning

Machine Learning has a lot of really useful applications which I’ve previously written about. It allows us to do things such as predictive maintenance, fraud detection, churn/upselling/cross-selling analytics, customer lifetime value estimation and segmentation, supply chain management, demand forecasting and a host of other stuff.

However, Machine Learning also enables us to do a lot of really quite stupid and funny things, which is one of the reasons I like it so much. One such thing would be the classification of quotes into “bullshit” and “non-bullshit” categories.

That’s right; I’m not above using my powers for evil!

logo

Sorry, I’m allergic to bullshit

Lately there’s been some research into the mechanisms of human receptivity to bullshit and capacity for analytical thinking and reflection, with the former apparently being linked to the latter.

The most recent study “On the reception and detection of pseudo-profound bullshit” by Pennycook et al. (2015) has been making the rounds in the tabloids, with journalists and writers happily generalising the rather specific findings of the researchers to question the mental capacity of those who enjoy sharing motivational and inspirational quotes on social media.

Acceptance Of Profound-Sounding "Bullsh*t" Linked To Lower Intelligence
Source: IFLScience

Obviously it’s not quite that simple, as the study deals specifically with so-called “pseudo-profound” bullshit – which is basically stuff that sounds profound in some way, but is in fact complete, meaningless nonsense. Real life examples of this, according to the researchers, are some quotes by New Age-guru Deepak Chopra, such as the following:

“Attention and intention are the mechanics of manifestation.”

You really can’t make this shit up, can you? 

(Well actually, you can, using Wisdom of Chopra, which generates a random quote based on text from Chopras Twitter account.)

Anyway, in order to help those who were slightly worried that their quote-sharing practices may or may not reflect a below average mental capacity, I decided to use Azure Machine Learning and R (and some Power Query) to create a classification service to determine the “bullshit probability” of quotes and statements.

Hey, maybe I am using my powers for good after all!

Building a bullshit classifier

So how did I do this? Well, my approach was to gather a limited sample of both bullshit and non-bullshit quotes and treat it as a binary classification problem (supervised learning), producing a probability of 0 to 1 of any quote being bullshit based on the contents of the text. This requires an initial labeling of quotes, with 0 denoting non-bullshit and 1 denoting bullshit.

Since Pennycook et al. (2015) specifically refer to tweets by Deepak Chopra as examples of pseudo-profound bullshit and use randomly generated statements based on said tweets for their studies, I decided it was an acceptably small leap of logic for me to use quotes by Deepak Chopra exported from Goodreads as labeled cases of bullshit in my dataset rather than tweets. I’m not on good terms with the Twitter API, and using import.io I was easily able to export the quotes into the format I needed.

import.io
You really should be using this tool.

For cases of non-bullshit, I decided to use quotes by evolutionary biologist Richard Dawkins. For defining a polar opposite of Deepak Chopra, he is the first one that comes to mind.

So in my dataset, bullshit is defined as stuff Deepak Chopra says – labeled 1 – and non-bullshit is defined as stuff Richard Dawkins says – labeled 0. For my classifier, every piece of text exists only on a scale from Richard Dawkins to Deepak Chopra, and is scored by the probability of that piece of text being bullshit, i.e. being somehow similar to something Chopra might say, or dissimilar in the sense that it is more similar to something Dawkins might say.

After having collected and labeled the data, I did some quick text preprocessing using Power Query and R, and made sure the cases of bullshit and non-bullshit were reasonably well balanced in the dataset.

I ended up with a current total of 3200 cases (each being approximately one sentence of variable length) after preprocessing, which is I would say is a very small sample for training such a text classifier and a very limited basis for generalisation. I’ll definitely be looking to expand the data set with more cases later, but still, this experiment was really just for learning and for fun.

Moving on, I used feature hashing to turn the text into feature columns, and principal component analysis to reduce the number of features to a reasonable amount for preserving their variance. I used only these text features for classification.

I ended up using a Two-Class Boosted Decision Tree (blue below) with the principal components as features for classifying, but I also had luck with using a Two-Class Locally-Deep Support Vector Machine (red below) with only raw hashing features.

True vs. false positives

I was happy to see that the model both cross-validates and performs quite all right, especially considering the very limited sample size and my rather frivolous labeling strategy for those samples.

There are definitely some patterns seperating the language of Chopra from that of Dawkins, and the model is able to distinguish between these with accuracy of about 0.65 on as of the time of writing.

It’s obviously going to be wildly inaccurate for statements that have nothing in common with the language in the quotes by either Chopra and Dawkins, but if a quote has some elements from one or both of those extremes, the classifier should in theory work reasonably well. Certainly better than a coin flip!

Bullshit-as-a-Service

So being quite satisfied, I decided to publish the model as prediction service using Azure. This was done with a few clicks, and lo and behold: Kjetils bullshit classifier as a web service.

Kjetils bullshit classifier

Input any quote you like, and the model scores its probability of being bullshit on a scale from Richard Dawkins (0.00) to Deepak Chopra (1.00). The classifier has obvious limitations, but it was a fun exercise nonetheless. Maybe I’ll improve it with a better sampling and labeling strategy in the future?

The cool part is that with Azure Machine Learning, you can get an idea for doing stuff like this and get it up and running in one form or another within a couple of evenings. It’s really quite ingenious.

Maybe you’ll even find a problem of some real business value to tackle, rather than classifying bullshit statements …

If you want to learn more about text classification, Microsoft has several ready-made examples of how to do this within Azure ML, and there are tutorials readily available. Machine Learning is becoming a natural part of the toolkit of any data analyst or data scientist, and Azure ML makes it very approachable for pretty much anyone with a certain amount of technical and analytical skills.

So what are you waiting for? Get modeling!