Vixen Blog Awards, Machine Learning Edition – Which Blogger Are You?

The Vixen Blog Awards, which took place a few months ago, is pretty much the highlight of the year for most Norwegians. Or so you would think, anyway.

Lately, it seems not a day goes by without one or several of our treasured fitness, lifestyle, fashion or furniture bloggers being front and center in some newspaper or magazine with hot off the press social commentary.

Common topics include how to oppose the use of palm oil while still using palm oil, making food with zero calories, “janteloven”, flashing of various body parts, and of course how to not care about the looks of said body parts because you’re pretty just the way you are.

The annual Vixen Blog Awards is an opportunity to reap the rewards from these endeavours from the last twelve months. It’s basically like the Oscars, but for blogs, and without the Leonardo DiCaprio-memes.

Bloggers in their natural habitat.

The truth is I actually think blogs are pretty cool (especially if they’re about food), and that blogging is an extremely interesting medium. And I do have my own blog, after all, so who am I to make fun?

I’m just a little jealous because I’ll never qualify for the Vixen Blog Awards myself. My chances of that pretty much went down the toilet when I stopped blogging about nutrition and exercise several years back.

Still, what if I wanted to get back into that kind of blogging? I can’t help but wonder which blogger I have the most in common with now and which blog award I’ve got the biggest chance of winning next year if I stop writing about tech stuff and start writing about everything else.

vixen blog awards 2015
The winners of the 2015 Vixen Blog Awards.

What, don’t tell me you’ve never thought about which “rosablogger” you are, or how many likes, shares and <3’s you would get if you started writing about mundane, everyday stuff on the internet. It’s right up there with the meaning of life!

Unfortunately, machine learning cannot tell us the meaning of life. Well, not yet anyway. It can, however, tell you which blogger you have the most in common with based on what you write, and by extension which blog award you’re likely to win next year if you were to start blogging today.

That’s right – it’s time to use our powers for evil again, and this time it’s multi-class classification on eleven different classes.

Dude, I don’t care about any of this rocket science stuff, I just want to put all my friends’ Facebook posts into a web service and make fun of them.

Fine, dude.

Go to the Vixen Blog Awards, Machine Learning Edition web service, enter any NORWEGIAN block of text of any size, and press the submit button. You might need to give the page 10-15 seconds to work, especially if the VM running the web service needs to boot up.

After this short wait, what you will get is a set of probabilities, each one representing the probability of your text being something a particular blogger would write, and by extension which blog award this piece of text is worthy of. Neat, huh?

Things you can try out:

  • Entire blog posts, paragraphs or excerpts
  • News articles
  • Facebook posts and tweets
  • Government reports
  • Legal texts
  • Bible passages
  • Random quotes

I put in a few of my recent Facebook posts and this is what I got:



I guess I’m still a fitness blogger whether I like it or not …

Like I said earlier, the model is for Norwegian text. It is not built for any other language, so don’t expect reliable or meaningful results on text written in English, Spanish or anything else that is not Norwegian.

My good sir, I care neither about any of this rocket science stuff nor any of this stupid stuff. I want to know how my business can make or save money on this stuff!

Fine, Uncle Scrooge. Scroll all the way to the very bottom of the post and look for the headline “An actual business use case.”

If you want to know what goes on behind the scenes and how you might do something like this yourself, read on. In the end I’ll also teach you how to manipulate the model and bend the results to your will so you can troll your friends!


So how does this work, and how do I create a Blog Classifier in Azure Machine Learning?

You need three things to create your own blog classifier the same way I did it:

If you want to operationalize your service in the cloud for others to access and use, you also need a Microsoft Azure subscription – either trial, MSDN or paid.

Obviously, you don’t have to do any of this with blogs. This is a generalized approach that you can use for any kind of text mining from web sources, be it tabloids, news magazines or scientific studies. 

You don’t have to use the data for classification either. How about predicting the number of clicks, likes, shares or comments that any block of text is likely to get, for example?

1. Gather training data

If you want to scrape data from the web or crawl sites for the information you need, has got you covered.

I used this awesome service to create API’s on top of the 11 winning blogs and extract more or less all their blog posts from 2015 in tabular format.

Basically, this means that I turn the web pages containing the blog posts into tables with columns containing the post title, post content, post timestamp, comments, likes etc. This is exactly what you want the data to look like for what we’re going to do.

I saved the tables as CSV’s – one file for each blog – and created connections to all the 11 files in Excel using Power Query.

Oh, and did I mention that rocks? (They’re really not paying me for this – I swear!)

2. Prepare training and test data

Using Power Query, I chose the columns I wanted from the CSV’s – most importantly the actual blog post contents, as well as the post titles and dates for traceability purposes. I then labeled the posts with the correct author based on the name of the source file (one file for each blogger, remember?) and appended them together into one giant query.

Finally, I split up the post contents column into multiple columns using periods as a delimiter, unpivoted the resulting columns to make a nice mix of sentences and paragraphs to use as training and testing samples for the model, dumped the results into Excel and saved them as a UTF-8 TSV (to keep those pesky Norwegian special characters).
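I did all of this in Power Query, but for those who prefer code, here is a rough Python sketch of the same idea: split every post on periods, keep the author label on each piece, and write the result out as tab-separated text. The function names and the two example posts are made up for illustration.

```python
import csv
import io


def posts_to_samples(posts):
    """Turn (author, post_text) pairs into (author, sentence) samples."""
    samples = []
    for author, text in posts:
        for piece in text.split("."):
            sentence = piece.strip()
            if sentence:  # skip empty fragments left by trailing periods
                samples.append((author, sentence))
    return samples


def samples_to_tsv(samples):
    """Serialize the samples as tab-separated text with a header row."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, delimiter="\t", lineterminator="\n")
    writer.writerow(["author", "text"])
    writer.writerows(samples)
    return buffer.getvalue()


# Hypothetical example posts, one per (made-up) blogger.
posts = [
    ("blogger_a", "I dag trente jeg bein. Det var tungt."),
    ("blogger_b", "Ny oppskrift på brød. Deigen må heve lenge."),
]
samples = posts_to_samples(posts)
tsv_text = samples_to_tsv(samples)
```

When saving `tsv_text` to disk, opening the file with `encoding="utf-8"` should keep the Norwegian characters intact.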

Using smaller blocks of text as the unit of analysis makes the classification problem a lot more difficult. However, it’s more appropriate, given that the algorithm will most likely be analyzing sentences and paragraphs rather than whole posts when it’s operationalized.

If I’d made the model with whole posts, I’m quite certain that it would’ve been freakishly accurate from the get-go. And where’s the fun in that?

3. Text preprocessing in Azure ML

I uploaded the TSV into Azure ML Studio to start creating the training experiment.

Remember that it’s very important to do all the text preprocessing in Azure ML for an experiment such as this if you’re going to operationalize the model online or in your business.

If you use Power Query, R or other tools locally for anything other than combining data sources and tweaking your source material to reflect the unit of analysis that your operationalized model will be working with in its “natural” environment, your model is not going to be very reliable.

Simply put: the “real” data, meaning the blocks of text your model will be analyzing in “real life,” needs to go through the exact same preprocessing that your training and testing data does. That means it should be done in Azure ML, unless you’d rather automate the preprocessing using another tool or service.

text preprocessing

First, you’ll probably have to replace any specific foreign characters with their internationally friendly alternatives. Because even though you can get these characters into Azure ML Studio using a UTF-8 encoded file, you lose them when you try to put them into an R data frame running on a computer with English settings – which is apparently the case with the computers that run R for Azure ML.

Then comes standard text preprocessing like the removal of special characters (like punctuation, parentheses, brackets, hyphens etc.), conversion to lower case, removal of stop words and word stemming.
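The actual preprocessing in my experiment runs as R inside Azure ML, but the steps are easy to illustrate in a few lines of Python. Treat this as a minimal sketch: the character replacements and the tiny stop word list are my own placeholders, and stemming is left out entirely.

```python
import re

# Hypothetical replacement table and a tiny illustrative stop word list.
CHAR_MAP = {"æ": "ae", "ø": "o", "å": "a"}
STOP_WORDS = {"og", "i", "det", "er", "en", "jeg", "pa"}


def preprocess(text, remove_stop_words=True):
    """Lowercase, transliterate, strip special characters and tokenize."""
    text = text.lower()
    for char, replacement in CHAR_MAP.items():
        text = text.replace(char, replacement)
    # Replace everything that is not a letter, digit or whitespace.
    text = re.sub(r"[^a-z0-9\s]", " ", text)
    tokens = text.split()
    if remove_stop_words:
        tokens = [token for token in tokens if token not in STOP_WORDS]
    return tokens


tokens = preprocess("Jeg er på Trening i dag, og det er GØY!")
```

The `remove_stop_words` switch is there because it can make sense to keep stop words for some features and drop them for others.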

Azure ML has example experiments featuring text preprocessing, as well as a readymade text preprocessing script for R which can easily be “hacked” to work with any language even with minimal knowledge of R. I’ll leave this challenge to you for now, but I might do a short writeup later on how to do this.

4. Feature extraction

Now comes the fun part – it’s time to turn the text into features for training our models. There are several ways both to look at text for this purpose and to actually turn it into features.

A simple and superficial way to look at text is as sequences of words. A sequence might consist of just one word – a unigram – or any other number of words, where a sequence of two words is a bigram, a sequence of three words is a trigram, a sequence of four words is a quadgram and so on.
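If that sounds abstract, here is a small Python sketch of how n-grams can be pulled out of a list of words. The function names are just mine, chosen for illustration.

```python
def ngrams(tokens, n):
    """Return all sequences of n consecutive tokens, joined by spaces."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def ngrams_up_to(tokens, max_n=3):
    """Return unigrams, bigrams, ... up to max_n-grams in one list."""
    result = []
    for n in range(1, max_n + 1):
        result.extend(ngrams(tokens, n))
    return result


words = ["parrots", "eat", "seeds", "daily"]
bigrams = ngrams(words, 2)
trigrams = ngrams(words, 3)
```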

We won’t be using anything longer than trigrams for this case, which I understand is pretty common for many problems. This might be because of scaling issues, but one might also intuitively assume that by going up to trigrams, you ensure that each word can be associated with both its preceding and subsequent word in a feature, and that trigrams therefore represent a natural threshold.

The easiest way to make features from these so-called n-grams is to turn them into hash keys of limited size, and then count their occurrences in the text. By specifying the hashing bit size, you control how much information is retained and by extension how many columns you end up with. This process is called feature hashing, or simply the “hashing trick”.
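To make the hashing trick a bit more concrete, here is a minimal Python sketch. It uses CRC32 from the standard library purely as a stand-in hash function, so don't read it as a description of what Azure ML does internally; the `bits` parameter plays the role of the hashing bit size.

```python
import zlib


def hash_features(terms, bits=10):
    """Count terms into a fixed-size vector of 2**bits buckets."""
    size = 2 ** bits
    vector = [0] * size
    for term in terms:
        # Stand-in hash: CRC32 of the UTF-8 bytes, folded into the vector size.
        index = zlib.crc32(term.encode("utf-8")) % size
        vector[index] += 1
    return vector


vector = hash_features(["parrots eat", "eat seeds", "parrots eat"], bits=4)
```

Different terms can land in the same bucket, and fewer bits means more collisions. That is the information loss you accept in exchange for a fixed number of columns.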

The resulting columns from feature hashing are then usually reduced into principal components, but they can also be used directly in the models.

feature hashing principal component analysis

The feature hashing approach has the advantage of being very fast (the PCA might take some time, though) while allowing us to keep a limited amount of information from word doubles and triples. It also lets the model work with previously unseen n-grams, and allows you to quickly establish a great benchmark for any text classification case early on.

The disadvantage is that we lose traceability and we’ll have a very hard time figuring out exactly which aspect of the text our models are using to make their predictions. We’ll also lose some information that might help us distinguish the least represented bloggers and fine-tune the model later on. You might choose to look at feature hashing + PCA as the quick and dirty way to do text classification.

Another way of creating features is to create a dictionary of the n-grams in the text, calculate their relative frequencies and use the unigrams, bigrams and trigrams as features directly in the models.
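As a rough Python illustration of the dictionary idea, the sketch below turns a list of terms into one relative frequency per dictionary entry. The dictionary and the terms are made up, and the function name is my own.

```python
from collections import Counter


def relative_frequencies(terms, dictionary):
    """Return one feature per dictionary entry: count divided by total terms."""
    total = len(terms)
    if total == 0:
        return [0.0] * len(dictionary)
    counts = Counter(terms)
    return [counts[term] / total for term in dictionary]


# Hypothetical dictionary and a preprocessed piece of text.
dictionary = ["brod", "deig", "ost"]
features = relative_frequencies(["brod", "deig", "brod", "mel"], dictionary)
```

Each position in `features` lines up with one dictionary term, which is what makes this approach so easy to trace.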

unigram bigram trigram dictionary

The dictionary approach has the advantage of being very traceable, and it provides a high level of detail. However, it does not scale well beyond single words and might produce hundreds of thousands of columns when you use it to create bigrams or trigrams – not to mention quadgrams – depending on the amount of text.

While some machine learning models can handle such a large amount of features, it’s hardly a very cost- and time-efficient way to do things.

My preferred, general approach to get the best of both worlds is the following:

  1. Generate as many hashing features as is feasible to run a PCA on, and reduce them to 10-20 principal components (look for when the standard deviation starts stabilizing). This allows me to keep a reasonable amount of information and variance from unigrams, bigrams and trigrams, as well as enabling the models to work with unseen n-grams and giving them some robust, dense features to use for training. It also lets me quickly establish a benchmark and compare algorithms on an even playing field early on, since training and testing some of the more complex models with a very sparse feature vector containing thousands of features takes a lot of time.
  2. Supplement these features with a deliberate selection of n-gram frequencies. This gives me some traceability and allows me to do some more meaningful exploratory analysis if I want to. This is the part that enables me to tell you that if the model tells you that you’re Funkygine – like I am, apparently – it might be because you’re writing about amino acids, “starting positions” or Amsterdam. To do this with feature hashing + PCA, you would need to do some heavy simulations.

In Azure ML, this whole shebang looks like this:

text preparation

Yeah, I love creating sexy workflows in the ML studio. Sue me.

Note that I actually preprocess the text slightly differently for creating the bigram and trigram dictionary. Specifically, this has to do with the treatment of stopwords, which I remove before creating bigram and trigram dictionaries.

Why? Well, I’m definitely interested in analyzing the frequency of single stopwords for the purpose of telling the bloggers apart. However, I’m less interested in analyzing those stopwords in combination with their immediately surrounding words – simply because this causes the bigram and trigram dictionaries to explode in size.

On another note: the pre-packaged R script bundle for text preprocessing in the studio is not configured to let you extract anything more than unigrams. I had to “hack” it to enable the extraction of bigrams and trigrams.

I might cover this in a later post, as it was a little bit more cumbersome to do than changing the language.

5. Feature selection

Like I mentioned earlier, the dictionary approach does not scale very well when it comes to the actual modeling, so I wanted to do some deliberate feature selection before releasing the hounds (the algorithms, obviously).

I dug up some research on selection of features for text classification, and ended up with the following approach:

1. Filter out the rarest terms during the dictionary creation process

I set a frequency threshold of 5 for unigrams, 6 for bigrams and 3 for trigrams. If a term does not appear at least that many times in the whole dataset, it is dropped. I chose these specific numbers because they gave me a manageable total dictionary of ~24 000 features to continue selecting from.
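In Python terms, the thresholding could look roughly like the sketch below. The thresholds are the ones from my experiment, but the function itself is only an illustration and not the R code that actually runs in the studio.

```python
from collections import Counter

# Minimum number of occurrences per n-gram length, as described above.
MIN_COUNTS = {1: 5, 2: 6, 3: 3}


def build_dictionary(documents, min_counts=MIN_COUNTS):
    """Count n-grams over all documents and drop the ones that are too rare.

    documents is a list of token lists; the result is a sorted list of terms.
    """
    counts = Counter()
    for tokens in documents:
        for n in min_counts:
            for i in range(len(tokens) - n + 1):
                counts[(n, " ".join(tokens[i:i + n]))] += 1
    return sorted(
        term for (n, term), count in counts.items() if count >= min_counts[n]
    )


# Five identical toy documents: every n-gram occurs exactly five times.
documents = [["parrots", "eat", "seeds"]] * 5
dictionary_terms = build_dictionary(documents)
```

In this toy example the unigrams and the trigram survive, while both bigrams fall just short of their threshold of 6.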

class distribution
The distribution of the classes in my dataset – sans Agnes Lovise, who has produced even less text than Hviit and didn’t make it onto the chart.

One drawback to this approach is that it is a bit biased in favor of the dominant classes in the dataset, meaning the bloggers who have produced the most text in 2015.

As seen above, my labels are quite imbalanced, and we’re more likely to lose some information relevant to the minority classes if we remove features based on unadjusted frequencies. The next step does not really reverse or mitigate this information loss in any way, but it does not fuel the bias in favor of the dominant classes and puts equal emphasis on all of them.

2. Do separate feature selections for each class

For a rather complex and quite imbalanced multi-class problem such as this, a simultaneous feature selection might ignore the terms that can be used to distinguish the least represented classes from the others.

Let’s say that Jenny Skavlan quite often writes “parrots eat seeds” in her blog posts. None of the other bloggers care much about parrots eating seeds, so writing something like this is pretty distinctive of Jenny Skavlan.

In practice, this means that a machine learning model can make good use of a feature counting the frequency of the trigram “parrots eat seeds” to classify any piece of text as being either Jenny Skavlan or not being Jenny Skavlan.

But the same feature is pretty useless for distinguishing Sophie Elise from Caroline Berg Eriksen, for example, because neither of them writes anything about parrots eating seeds at all. This wouldn’t necessarily be a problem, had it not been for the fact that Sophie Elise and Caroline Berg Eriksen – together with all the other bloggers – make up a much bigger part of the dataset than Jenny Skavlan does.

Parrots eating seeds.

So since Jenny Skavlan only has a pretty small amount of the text in the dataset, and the term in question is only really useful for distinguishing this small amount of text from a much larger amount of text, traditional feature selection on the entire dataset at once is not likely to place much importance on the “parrots eat seeds” feature.

The way to deal with this is to emulate the situation I described earlier, in which the feature “parrots eat seeds” is used to classify a piece of text as being either Jenny Skavlan or not Jenny Skavlan. This is a very different situation than using it to classify a piece of text as belonging to any of the 11 bloggers.

In practice, what I did was create 11 complete versions of the dataset, one for each blogger. I then replaced the labels to either represent that particular blogger (“true”) or any of the other bloggers (“false”), making it a binary classification problem for the purposes of feature selection.

I then used the chi-squared test, which is the generally preferred feature selection method for text classification, to pick the same number of features from each of the 11 datasets. I finally merged all the feature sets together, dropping duplicate features.

The result is a limited set of all my initial features selected on a class-by-class basis, balanced for distinguishing all of the 11 classes from each other and not just for the dominant ones.
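For the curious, here is a simplified Python sketch of the class-by-class idea. It assumes each sample has been reduced to the set of terms it contains, and scores presence versus absence with a plain 2x2 chi-squared statistic. That is a simplification of what the feature selection module in Azure ML works with, and all names and example data are hypothetical.

```python
def chi_squared(a, b, c, d):
    """Chi-squared statistic for a 2x2 table.

    a: target samples with the term,    b: other samples with the term,
    c: target samples without the term, d: other samples without the term.
    """
    n = a + b + c + d
    denominator = (a + b) * (c + d) * (a + c) * (b + d)
    if denominator == 0:
        return 0.0
    return n * (a * d - b * c) ** 2 / denominator


def select_features(samples, k):
    """Pick the top k terms per class (one-vs-rest) and merge them.

    samples is a list of (label, set_of_terms) pairs.
    """
    labels = sorted({label for label, _ in samples})
    vocabulary = sorted(set().union(*(terms for _, terms in samples)))
    selected = set()
    for target in labels:
        scores = []
        for term in vocabulary:
            a = sum(1 for l, t in samples if l == target and term in t)
            b = sum(1 for l, t in samples if l != target and term in t)
            c = sum(1 for l, t in samples if l == target and term not in t)
            d = sum(1 for l, t in samples if l != target and term not in t)
            scores.append((chi_squared(a, b, c, d), term))
        # Highest score first; ties are broken alphabetically.
        scores.sort(key=lambda pair: (-pair[0], pair[1]))
        selected.update(term for _, term in scores[:k])
    return sorted(selected)  # merging into a set drops duplicate features


samples = [
    ("jenny", {"parrots eat seeds", "haha"}),
    ("sophie", {"haha", "harstad"}),
    ("sophie", {"haha", "harstad"}),
    ("sophie", {"harstad"}),
    ("caroline", {"trening", "haha"}),
    ("caroline", {"trening"}),
]
chosen = select_features(samples, k=1)
```

Even with only one sample from "jenny", her trigram makes the cut, because it is scored against the "jenny or not jenny" question rather than against all classes at once.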

What the hell does that look like in Azure ML, you say? I’m so glad you asked!

feature selection

Of course, this could’ve been solved much more elegantly with a single R script module – probably with only a few lines of code – but where’s the fun in that?

When it comes to the number of features to keep for each category, there’s really no right or wrong answer. You should experiment with different amounts. It’s most likely going to be a decision based on some of the following factors:

  • How much time you have
  • How general you want your model to be
  • Which models you’re going to try

Another benefit of this type of feature selection is that it allows me to easily discover predictive features that shouldn’t be in the model, even for the least represented classes.

For example, for one of the blogs the web scraper had picked up some JavaScript error messages in the blog post contents. This would allow my models to “cheat” and more easily classify blocks of text from this blogger, simply because it would contain words such as “JavaScript” and “browser.” Removing these words from the dictionary solved this problem.

6. Model training and evaluation

Alright, now comes the REALLY fun part. It’s time to release the hounds.

release the hounds

For feature-rich text classification problems such as this where you probably want to take advantage of the large feature space, logistic regression (LR) and support vector machines (SVMs) might be the best way to go. Neural networks are also commonly used for text analysis, but to take full advantage of their capabilities in this case you’ll probably want to do feature preparation differently and use a specialized network architecture. Maybe I’ll try this next time!

Still, you shouldn’t make these decisions before you’ve tried. Therefore, I started out with a smaller number of features to test out boosted decision trees, decision forests and neural networks, together with logistic regression and SVMs.

Since I’ve spent so much energy on an unbiased feature selection, I’m going to use macro-averaged precision and recall as my chosen performance metrics. This should allow me to measure the benefits of my deliberate class-by-class feature selection directly.

If I didn’t place equal value on classifying each of the bloggers and cared more about getting the best represented classes right, I’d use micro-averaged performance metrics instead. The choice of performance metrics is always a “business” or subject matter decision, rather than a statistical or technical one.
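To show the difference between the two, here is a small Python sketch that computes both sets of metrics from a confusion matrix. It assumes rows are true classes and columns are predicted classes, and the toy matrix is made up.

```python
def precision_recall(confusion):
    """Macro- and micro-averaged precision and recall.

    confusion[i][j] is the number of samples of true class i predicted as j.
    """
    size = len(confusion)
    precisions, recalls = [], []
    total_correct = 0
    total = 0
    for k in range(size):
        true_positive = confusion[k][k]
        predicted = sum(confusion[i][k] for i in range(size))  # column sum
        actual = sum(confusion[k])  # row sum
        precisions.append(true_positive / predicted if predicted else 0.0)
        recalls.append(true_positive / actual if actual else 0.0)
        total_correct += true_positive
        total += actual
    micro = total_correct / total if total else 0.0
    return {
        "macro_precision": sum(precisions) / size,
        "macro_recall": sum(recalls) / size,
        # With exactly one label per sample, both micro averages equal accuracy.
        "micro_precision": micro,
        "micro_recall": micro,
    }


# Toy example: one dominant class (100 samples) and one small class (10).
metrics = precision_recall([[90, 10], [5, 5]])
```

In the toy example the micro averages look great because the big class dominates, while the macro averages expose how poorly the small class is doing.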


For computing the evaluation metrics, I use 10-fold cross-validation.
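If you haven't come across cross-validation before, the Python sketch below shows the basic mechanics: shuffle the samples, cut them into 10 folds, and let each fold take a turn as the test set. It is a plain, non-stratified split written for illustration, not a description of how Azure ML implements it.

```python
import random


def k_fold_indices(n_samples, k=10, seed=0):
    """Yield (train_indices, test_indices) pairs for k-fold cross-validation."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    # Deal the shuffled indices into k folds of (almost) equal size.
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        test = folds[i]
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test


splits = list(k_fold_indices(25, k=10))
```

The evaluation metrics are then computed on each test fold and averaged.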

Let’s get modeling!

~1 000 features

I got pretty decent results from most of the models even when using only 1 000 features “carefully” chosen on a class-by-class basis using the approach described in the last section. The results seemed to be a lot better with this method than they were with simultaneously selected features.

Below are the confusion matrix and macro-averaged evaluation metrics for Microsoft’s legendary boosted decision tree (BDT) model with features selected simultaneously for all classes.

boosted decision tree confusion matrix
BDT – simultaneous feature selection (~1 000)
Macro-averaged precision: 0.4788
Macro-averaged recall: 0.4271

Compare that to the results from the class-by-class feature selection.

boosted decision tree confusion matrix
BDT – class-by-class feature selection (~1 000)
Macro-averaged precision: 0.5250 (+0.0462 from feature selection)
Macro-averaged recall: 0.4631 (+0.0360 from feature selection)

I’d say that’s a pretty major difference for using exactly the same amount of features!

It should be noted that I only saw this improvement on models that produced decent results in the first place. Decision forests and jungles generally performed poorly for this problem, and feature selection didn’t seem to make much of a difference for these models.

Still, the results lead me to believe that this form of careful and deliberate feature selection can in fact be expected to produce more balanced results with a smaller amount of features for imbalanced text classification problems using the appropriate models. Not necessarily surprising, but still interesting to see that it actually works very well in practice.

~3 000 features

As I gradually increased the number of features, logistic regression started pulling ahead slightly while the others couldn’t quite keep up. Also, the models based on class-by-class feature selection continued to outperform the one based on simultaneous feature selection.

Below is a BDT with three times as many features as above.

boosted decision tree confusion matrix
BDT – class-by-class feature selection (~3 000)
Macro-averaged precision: 0.5780 (+0.0530 from more features)
Macro-averaged recall: 0.5103 (+0.0472 from more features)

Compare that to the results from logistic regression below.

logistic regression confusion matrix
LR – class-by-class feature selection (~3 000)
Macro-averaged precision: 0.6425 (+0.0645 compared to BDT)
Macro-averaged recall: 0.5251 (+0.0148 compared to BDT)

Now we’re talking!

Neural networks also continued to do fairly well, but they started becoming slow to train as the number of features increased into the thousands.

None of the other algorithms could even come close to the speed of logistic regression at this point, and when it also produces both the most accurate and the most balanced results, it’s pretty much a no-brainer.

The icing on the cake is that logistic regression has much better transparency than a neural network or any of the tree ensemble models. All you have to do if you want the reasoning behind a particular prediction is to look up the relevant feature weights in the formula!
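Here is a toy Python sketch of what that lookup amounts to for a multi-class logistic regression: every class has a weight per feature, the weighted sums are turned into probabilities, and the individual products tell you which terms pushed the prediction. The weights and biases below are invented for the example and are not taken from my trained model.

```python
import math


def predict_proba(features, weights, biases):
    """Softmax over one linear score per class."""
    scores = {
        label: biases[label]
        + sum(class_weights.get(term, 0.0) * value for term, value in features.items())
        for label, class_weights in weights.items()
    }
    max_score = max(scores.values())  # subtract the max for numerical stability
    exps = {label: math.exp(score - max_score) for label, score in scores.items()}
    total = sum(exps.values())
    return {label: value / total for label, value in exps.items()}


def explain(features, weights, label):
    """List each term's contribution to one class score, largest first."""
    contributions = {
        term: weights[label].get(term, 0.0) * value
        for term, value in features.items()
    }
    return sorted(contributions.items(), key=lambda item: -abs(item[1]))


# Hypothetical weights for two classes and a text containing two terms.
weights = {
    "funkygine": {"aminosyrer": 2.0, "haha": -0.5},
    "sophie_elise": {"haha": 1.0, "hvis": -1.0},
}
biases = {"funkygine": 0.0, "sophie_elise": 0.0}
features = {"aminosyrer": 1.0, "haha": 1.0}

probabilities = predict_proba(features, weights, biases)
reasons = explain(features, weights, "funkygine")
```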

~6 000+ features

I decided to double the number of features and, unsurprisingly, the model performed even better across the board.

logistic regression confusion matrix
LR – class-by-class feature selection (~6 000), final model
Macro-averaged precision: 0.6554 (+0.0129 from more features)
Macro-averaged recall: 0.5523 (+0.0272 from more features)

Of course, increasing the number of features to these amounts gradually reduces the usefulness of our deliberate feature selection, making it sound like we wasted a lot of time and energy in data preparation.

But here’s the kicker: in my case, a model based on 6 000 features selected on a class-by-class basis performs more or less exactly the same as a model based on 9 000 simultaneously selected features. And when using the same amount of features, the model based on features selected on a class-by-class basis ALWAYS seems to outperform the one with the simultaneously selected features.

Below is the model with 9 000 simultaneously selected features.

logistic regression confusion matrix
LR – simultaneous feature selection (~9 000)
Macro-averaged precision: 0.6475 (-0.0079 from feature selection)
Macro-averaged recall: 0.5550 (+0.0027 from feature selection)

If I have the choice between a 6 000 feature model and a 9 000 feature model with more or less the same performance, I’ll go with the first one.

I decided to stop at 6 000 features, although I probably could’ve kept increasing the number of features since LR is blazingly fast even with a huge feature space. Still, with 6 000 features I’ve got a 1:6 ratio of features to training samples, which does sound like it’s about to become a little excessive. More importantly, the improvements to my evaluation metrics from adding more features seemed to become marginal from this point onwards.

Why not use even fewer features since I went to such lengths to select the best ones? Well, in the end I figured that having many features would make the operationalized model more fun to play with, because you can expect it to take advantage of more of the terms you feed it and change its predicted probabilities accordingly. Since this is just for fun and learning and not exactly business critical, I’m not really that concerned with generalization.

So which “rosablogger” are you?

Being done with the model, I published it as a web service using Azure ML and Azure App Service for all to enjoy! 

And like I said at the start of my post: all you have to do is go to this website, paste or write any Norwegian block of text, and the model will score that text with probabilities for being written by each of the 11 bloggers and by extension which award you’re most likely to win next year.

What is actually happening to the text behind the scenes? It’s processed through the finalized predictive experiment workflow below.


Exploring the final model, there are some interesting takeaways that we can expect to manifest “in production.” Look out for these things – in Norwegian, even though I’ve translated everything here – when you’re trying it out!

  • Generally, the model still favors the best represented classes. It is prone to think you are either Sophie Elise or Casa Kaos, especially if it has little meaningful information to go on.
  • Distinguishing the writings of “man-blogger” Ørjan Burøe from the other, exclusively female bloggers is not as easy as you might think. In fact, the model believes he is Sophie Elise 18.1 % of the time …
  • Writing about kids makes you either Ørjan Burøe or Casa Kaos, and it makes you extremely unlikely to be Sophie Elise. However, writing about dads makes you very likely to be Ørjan Burøe and very unlikely to be Casa Kaos.
  • Another thing that makes you very unlikely to be Sophie Elise is using the words “if” and “of course” – “hvis” and “selvfølgelig” in Norwegian. Apparently, there is no if for the blogger of the year, and she takes nothing for granted.
  • Writing about “Nelia” and “Lars-Kristian” will most definitely make you Caroline Berg Eriksen.
  • Writing in English is likely to make you Camilla Pihl. But again; the model is not really built for English text at all, so don’t expect any reliable or meaningful results on anything other than Norwegian text.
  • Laughing (“haha”) makes you Sophie Elise, Caroline Berg Eriksen or Nette Nestea. Interestingly, it also makes you a lot less likely to be Camilla Pihl, Ørjan Burøe and Mat på bordet, and especially Casa Kaos. Those blogs are apparently no fun.
  • Hviit is difficult to classify with few strong predictors. However, you can try writing about styling, posters, design and Copenhagen, and avoid the word “dere” like the plague. Hviit apparently has a strict policy of not talking directly to her readers!
  • Mat på bordet is the easiest blog to classify, with a recall of 81.8 %. I guess writing about food and recipes does make for a pretty unique blog after all!
  • Writing about training will obviously make you Funkygine, but it might also make you Caroline Berg Eriksen. It is very unlikely to make you Camilla Pihl, who apparently writes even less about training than Mat på bordet.
  • Amino acids is another thing that’s likely to make you Funkygine, as is Amsterdam and “initial position,” whatever that is.
  • Jenny Skavlan is the hardest blogger to distinguish from the others, with a recall of only 32.4 %. However, you can try writing about Fretex and see what happens.
  • Using passive formulations such as “man” or writing about Harstad and someone named Robin is likely to make you Sophie Elise. Interestingly, passive voice makes you extremely unlikely to be Caroline Berg Eriksen. It makes you a little more likely to be Jenny Skavlan.
  • Misspelling “i dag” and “i kveld” as “idag” and “ikveld,” or misspelling “spesielt” as “spessielt,” also makes you Sophie Elise.
  • Writing about dough and bread will most definitely make you Mat på bordet, as does writing about cheese. Interestingly, writing about dough should also make you a lot less likely to be Casa Kaos.
  • Makeup stuff will make you Agnes Lovise, especially nail polish and eye shadow.

An actual business use case

So we’ve just used this methodology and technology for something that is pretty fun, but also pretty useless.

But let’s say that you’ve got thousands of documents or other blocks of text at work. Maybe it’s application forms, maybe it’s free text fields in feedback forms, maybe it’s something completely different. Or maybe it’s just old text files with information you’re unable to take advantage of because the documents are completely disorganized.

If you’re in the latter situation, a good place to start would be to simply categorize and tag these documents so that you can organize and use them. Instead of going through each and every document manually, you can:

  1. Select a small but presumably representative sample of documents. These will be your training documents.
  2. Manually classify, tag or otherwise label this sample with the information you’re interested in extracting from all your documents, for example categories, sentiment etc.
  3. Process all your documents – not just the training documents – in Azure Machine Learning like we did with the blog posts, and train models to predict your labels on your sample.
  4. Use your models to predict the labels for the rest of your documents that you didn’t manually label earlier.


Now, there are of course dedicated software suites which can do this and much, much more. And in some – perhaps many – cases, you’ll probably want to use those.

But what I think is so great about tools like Azure Machine Learning in combination with R is that you can learn to build solutions like this all by yourself and customize them exactly how you want.

By getting familiar with machine learning technologies and data science methodology, you’ll be much better equipped to deal with the next potential use case for machine learning when it pops up. This skillset is a lot more valuable than any particular software.

That’s all for now! Thanks for sticking with me through this monster blog post, and don’t forget to check back later for more ridiculously stupid use cases for Azure Machine Learning.

Mapping Water Supply Systems With a Graph Database in the Cloud

If you’re used to everything being in tables, rows and columns like I am, you’ve probably heard about this fancy thing called graph databases that does things a little bit differently.

I was curious, so I decided to try my hand at setting one up in the cloud, putting some data in it and doing some queries. Naturally, that means this is very much a blog post for newbies by another newbie.

Dunning-Kruger effect

It also means that this is the perfect opportunity to introduce my “Dunning-Kruger Zone” category. This is the place on my blog where I more than anything else write about stuff that I’ve only just tested out or areas where I consider myself completely green.

So why would you want to read about some newbie explaining things that he doesn’t know anything about?

Well, you might not. In that case, here’s a video of some guy hitting himself in the nuts with a nunchuck instead. Enjoy!

If you’re still here, the idea is that I can explain things to others in the same manner I explained them to myself while I was trying to learn. I’m hoping this will make it easier for other newbies to get started by providing just the right initial amount and level of understanding.

So if I accidentally write stupid or incorrect stuff, at least I can say I was being perfectly honest about it!

And yes, I realize the fact that pointing this out kind of defeats the whole idea of the Dunning-Kruger effect. Still, I thought “Dunning-Kruger Zone” was a bit more catchy than the “Dumbass Zone.”

The Associative Structure of Graph Databases

As I said initially, graph databases aren’t concerned with tables, rows and columns. They are instead concerned with the following 3 types of elements:

1. Nodes/Vertices

These are the objects in the graph, like a person, firm, product or whatever else you can imagine. You might think of them as the nouns in the graph.

Graph structure
Source: Neo4j

In the example above, we have nodes of different types like persons, locations, restaurants and cuisines. These are the node labels, or the different categories of nouns if you will.

In a relational database, a person node would instead be a row in a table called “Persons” – or “DimPersons,” if you’re old school and don’t mind obvious double entendres in your database.

Similarly, a restaurant node would be a row in a table called “Restaurants.” In a graph database, person and restaurant nodes are instead distinguished by their different labels.

In addition to labels, nodes have properties – which I will explain in a moment.

2. Edges

Edges are relationships between nodes. If nodes are the nouns, then edges are the verbs in the graph.

In relational databases, relationships between different objects are implicit from sharing common keys and are made explicit by joining tables. Even then, the actual characteristics of the relationship are not necessarily explicit in the data unless you store them in one or both of the tables or use join tables – which you almost always have to do with many-to-many relationships. Creating parent-child relationships and larger hierarchical structures also demands a little bit of trickery.

And while all of these things are of course very much solvable in relational databases, graph databases solve them more elegantly by always storing the relationships themselves as entities that exist between nodes, whether those nodes are of the same type or of different types.
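For comparison, here is a small sketch of what the relational version of a many-to-many "likes" relationship might look like, using Python's built-in sqlite3 module. The table names, column names and rows are my own assumptions for illustration, not something taken from the Neo4j example.

```python
import sqlite3

# In-memory database with hypothetical tables and rows.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE persons (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE restaurants (id INTEGER PRIMARY KEY, name TEXT);
    -- Join table: one row per "likes" relationship, with room for
    -- a characteristic of the relationship itself (the reason).
    CREATE TABLE person_likes_restaurant (
        person_id INTEGER REFERENCES persons(id),
        restaurant_id INTEGER REFERENCES restaurants(id),
        reason TEXT
    );
    INSERT INTO persons VALUES (1, 'Anna'), (2, 'Ben');
    INSERT INTO restaurants VALUES (1, 'iSushi'), (2, 'Ramen Bar');
    INSERT INTO person_likes_restaurant VALUES
        (1, 1, 'first date'), (2, 1, NULL), (2, 2, NULL);
""")

# The relationship only becomes explicit once the tables are joined.
rows = conn.execute("""
    SELECT p.name, r.name
    FROM persons AS p
    JOIN person_likes_restaurant AS l ON l.person_id = p.id
    JOIN restaurants AS r ON r.id = l.restaurant_id
    ORDER BY p.id, r.id
""").fetchall()
print(rows)
```

Two joins just to answer "who likes what" is perfectly manageable here, but it gives a feel for why relationship-heavy queries get verbose quickly in a relational schema.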

In the example above, the edges represent friendships between person nodes, as well as how these persons like different restaurants, both of which serve the same cuisine and are located in the same city.

And like nodes, edges can have properties, which is the final important element.

3. Properties

Properties are the adjectives in the graph. They describe the characteristics of both nodes and edges, just like a row in a relational database has columns to describe the characteristics of this record.

The main difference from a relational database here is of course the fact that you can store properties in the relationships themselves in a very natural way.

Not only can you explicitly say that a particular relationship is a friend type of relationship, but you might also say that these two person nodes have been friends for 7 years and that they met at soccer practice. This information is put in the edge between the nodes.

Maybe the person in the graph who likes the iSushi restaurant likes it because he had his first date with his wife at that place. That’s a characteristic of the relationship between him and the restaurant, and can be stored as a property of that edge.
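If it helps to see the three element types side by side, here is a minimal in-memory sketch in Python. The `Node` and `Edge` classes, the relationship type names and the people are hypothetical, and this is only meant to illustrate the data model, not how Neo4j stores anything internally.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    label: str                                       # the category of noun
    properties: dict = field(default_factory=dict)   # the adjectives


@dataclass
class Edge:
    start: Node
    end: Node
    rel_type: str                                    # the verb
    properties: dict = field(default_factory=dict)   # adjectives of the verb


# Hypothetical nodes.
anna = Node("Person", {"name": "Anna"})
ben = Node("Person", {"name": "Ben"})
isushi = Node("Restaurant", {"name": "iSushi"})

# Properties live on the relationships themselves.
edges = [
    Edge(anna, ben, "IS_FRIEND_OF", {"years": 7, "met_at": "soccer practice"}),
    Edge(ben, isushi, "LIKES", {"reason": "first date with his wife"}),
]


def outgoing(node, rel_type):
    """Return all edges of a given type that start at the given node."""
    return [e for e in edges if e.start is node and e.rel_type == rel_type]


liked = [
    (e.end.properties["name"], e.properties["reason"])
    for e in outgoing(ben, "LIKES")
]
print(liked)
```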

Why Graph Databases?

So with these things in mind, why and when would you use a graph database instead of a relational database?

Well, in most cases you would probably not use a graph database as your primary storage unless you’re dealing with data that is in no way a natural fit for a classical schema and a much better fit for a graph-like schema. There’s a reason that relational databases have been around for so long and will likely continue to serve us well for a long time.

Relational database

However, if …

  • the actual relationships between entities in your data are just as interesting or more interesting than the entities themselves,
  • you have a lot of these relationships,
  • and you’re dealing with evolving schemas

… a graph database will likely be …

  • a lot faster for graph-like queries,
  • more scalable,
  • and more flexible with regard to schemas.

Neo4j – the most popular graph database currently in use – has a nice explanation of all this on their website, where they also point out the striking irony in the fact that graph databases are much better at dealing with relationships than relational databases.

The prime example of a good fit for graph databases is social networks, because a social network is very much an extremely large and complex graph.

And before somebody starts talking about “unstructured” data …

Graph databases are not for “unstructured” data. Graph databases are for data with a specific structure – a graph structure, in the same way that document databases are for data with a document structure and spatial databases are for data with a geometric structure.

Cloud is King

As the true BI/Data Science-millennial that I am, I always do everything in the cloud whenever possible.

I’ll be using the Neo4j graph database that I mentioned earlier. It’s open source, it’s been around for a little while, and it uses an SQL-like query language called Cypher which will make the SQL-savvy feel right at home.

It’s easy to get started with, and there’s a lot of accessible documentation on their websites. If you want very detailed instructions on setting up, configuring and querying your Neo4j DB, that’s where you should go. I’ll cover how you can get it running in the cloud.

Setting Up Your Own Graph Database in the Cloud

Microsoft Azure allows you to set up a virtual machine running Neo4j 2.0.1 on a Linux Ubuntu distribution with a few clicks.

This also sets up a storage account, an app service for accessing the database through a simple URL and a virtual network. Plug and play!

Graph database in the cloud; Neo4j VM in Microsoft Azure

There’s a guide for setting up the DB in Azure here, but it’s outdated and overly complicated because the new Azure portal does almost everything for you now.

The only thing you have to do is add an endpoint to the VM called “Neo4j” with port 7474 on TCP public and private. Just keep in mind that this makes your database publicly available – albeit with a password-protected login – so you might not want to put your credit card information or naked pictures in there (but hey, I don’t judge!).

After this, the setup works straight out of the box and you can access your database through the URL for your app service and start playing with it. I highly recommend checking out the tutorials on the Neo4j website, as they do a great job of teaching the basics. They’re also accessible directly through the web GUI.

If you want to get more advanced, you can start using the Neo4j REST API and build your own stuff on top of the Neo4j DB.
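As a rough idea of what talking to the REST API could look like, here is a sketch that builds (but does not send) a request with Python's standard library. It assumes the transactional Cypher endpoint at `/db/data/transaction/commit`, which is what the Neo4j 2.x HTTP API exposes as far as I know; newer Neo4j versions use different paths. The host name, user name and password are placeholders, and whether your instance expects basic authentication depends on the version and how the login is set up.

```python
import base64
import json
import urllib.request

# Placeholder URL; replace with the URL of your own app service or VM.
BASE_URL = "http://my-neo4j-demo.example.com:7474"


def build_cypher_request(statement, username, password):
    """Build a POST request that would run one Cypher statement."""
    payload = {"statements": [{"statement": statement}]}
    token = base64.b64encode(
        f"{username}:{password}".encode("utf-8")
    ).decode("ascii")
    return urllib.request.Request(
        url=BASE_URL + "/db/data/transaction/commit",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "Accept": "application/json",
            # Assumption: the server accepts HTTP basic authentication.
            "Authorization": "Basic " + token,
        },
        method="POST",
    )


request = build_cypher_request(
    "MATCH (n) RETURN count(n) AS nodes", "neo4j", "secret"
)
print(request.get_method(), request.full_url)
# To actually send it against a running database:
# with urllib.request.urlopen(request) as response:
#     print(json.loads(response.read()))
```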

Upgrading Neo4j in Azure

Unfortunately, 2.0.1 is an old version of Neo4j that doesn’t support the uploading of CSV files and some other new features, making it difficult to test it out with data you’ve got lying around.

This means you’ll probably want to upgrade the database, which for me was a royal pain in the ass – especially since I’ve never used Linux before in my whole life.

I’ll try to save you some pain by explaining what I had to do. First, I had to download Putty to access Ubuntu on the VM, which was fine.

Then came the actual updating of the Neo4j package. This would’ve usually been done by writing the following in Putty:

sudo apt-get update
sudo apt-get upgrade

The first command will give you an error for the Neo4j package, saying that …

The following signatures couldn't be verified because the public key is not available: NO_PUBKEY

… and then a key consisting of 16 letters and numbers. Write down this key.

Presumably, this error occurs because Neo4j have changed their key server and you need to get a new key, but being a complete Linux newbie I’m really just guessing.

Anyway, what you need to do is write the following:

sudo apt-key adv --recv-keys --keyserver

Replace the X’es with the key you wrote down earlier. After this, you should be able to update the repository and upgrade Neo4j by running …

sudo apt-get update

… and then …

sudo apt-get upgrade

Please note that I’ve tried to reproduce this process from my memory as accurately as possible, but I went back and forth with this so many times that I might’ve accidentally left something out or made some other mistakes.

I made this explanation because I was unable to find any information about upgrading Neo4j on the Azure Linux VM specifically, and it took a long time and a lot of trial and error to piece the solution together from different sources. All the other stuff on how to actually work with Neo4j is easy to find and understand.

If you’re still unable to upgrade the database with these simple steps, please drop a comment below and I’ll try to help out!

Importing, Querying and Visualizing Water Supply Systems

Hopefully, you now have your graph DB up and running. At this point, you can either start writing data to the database directly or import data that you’d like to see in a graph. Neo4j has a simple guide to importing CSVs.

Since everyone is mostly using social networks or recommender systems for this, I decided to try something different and test it with water supply systems, which I felt would be a nice fit for a graph database.

The Norwegian Food Safety Authority has data on selected water supply systems publicly available on their website. I imported these data sets into the graph DB and created relationships between the different entities. I won’t cover how to write queries here, because Neo4j already covers that extremely well on their website.

The result?

Water supply systems in graph database

The graph contains nodes with four different labels: water supply systems, water intake points, water transportation systems and water treatment plants.

  • One cluster of nodes represents a water supply system, connected to the large, blue water supply system node in the middle.
  • Each water supply system has one or several different water transportation systems that transport water within the system, and these are the yellow nodes.
  • Each water supply system also has one or several different water treatment plants to process water, indicated by the green nodes.
  • Finally, the red nodes are the water intake points, which are the water sources for the water supply system.

Water supply systems in graph database

The different nodes are connected to each other in different ways. For example:

  • Water transportation systems transport water to water treatment plants.
  • Water treatment plants process water from water intake points.
  • Water intake points are sources of water for water supply systems.
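The relationships in the list above can be mimicked with a tiny in-memory graph in Python. The node ids, label names and relationship type names below are made up for illustration and are not taken from the actual data set or from my Neo4j database.

```python
from collections import defaultdict

# Hypothetical node ids mapped to their labels.
nodes = {
    "system_1": "WaterSupplySystem",
    "transport_1": "WaterTransportationSystem",
    "plant_1": "WaterTreatmentPlant",
    "intake_1": "WaterIntakePoint",
    "intake_2": "WaterIntakePoint",
}

# Directed edges as (start, relationship type, end).
edges = [
    ("transport_1", "TRANSPORTS_WATER_TO", "plant_1"),
    ("plant_1", "PROCESSES_WATER_FROM", "intake_1"),
    ("plant_1", "PROCESSES_WATER_FROM", "intake_2"),
    ("intake_1", "IS_SOURCE_FOR", "system_1"),
    ("intake_2", "IS_SOURCE_FOR", "system_1"),
]

# Adjacency list: for each node, the edges that leave it.
outgoing = defaultdict(list)
for start, rel_type, end in edges:
    outgoing[start].append((rel_type, end))


def reachable(start):
    """Return every node reachable from start by following edges."""
    seen, stack = set(), [start]
    while stack:
        current = stack.pop()
        for _, neighbour in outgoing[current]:
            if neighbour not in seen:
                seen.add(neighbour)
                stack.append(neighbour)
    return seen


# Which water supply systems does this transportation system end up in?
downstream = reachable("transport_1")
systems = sorted(n for n in downstream if nodes[n] == "WaterSupplySystem")
print(systems)
```

In a graph database, a traversal like this is what a single query does for you, without you having to build the adjacency list yourself.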

The properties of both nodes and edges are also accessible in the GUI.


I’m not an expert on how water supply systems work, but it was still fun experimenting with creating relationships and visualizing the results.

I could’ve easily gone further and created node labels from many of the properties of all the different components, and added nodes for things such as regions, provinces and municipalities. I could’ve also added much more information to the edges from the available data, like how much water was transported in a given year.

Wrapping up, I’m looking forward to experimenting more with graph databases in the coming months. They’re definitely not the singular future of data storage and the answer to all your relational database woes, but they’re a very good fit for data that actually has a natural graph-like structure.

In the future, we might stop talking about “unstructured” data and fix-it-all storage solutions and instead be more concerned with finding and using the appropriately structured storage for the natural structure in the data – rather than trying to fit it within a very particular schema or no schema at all.

The type of data and variety of use cases should dictate the type of database, not the other way around.

Anyway, I hope I’ve provided you with just enough material to get you started on your own DB – whether it be in the cloud or on the ground. Good luck!

Classifying “Bullshit” Quotes With Azure Machine Learning

Machine Learning has a lot of really useful applications which I’ve previously written about. It allows us to do things such as predictive maintenance, fraud detection, churn/upselling/cross-selling analytics, customer lifetime value estimation and segmentation, supply chain management, demand forecasting and a host of other stuff.

However, Machine Learning also enables us to do a lot of really quite stupid and funny things, which is one of the reasons I like it so much. One such thing would be the classification of quotes into “bullshit” and “non-bullshit” categories.

That’s right; I’m not above using my powers for evil!


Sorry, I’m allergic to bullshit

Lately there’s been some research into the mechanisms of human receptivity to bullshit and capacity for analytical thinking and reflection, with the former apparently being linked to the latter.

The most recent study “On the reception and detection of pseudo-profound bullshit” by Pennycook et al. (2015) has been making the rounds in the tabloids, with journalists and writers happily generalising the rather specific findings of the researchers to question the mental capacity of those who enjoy sharing motivational and inspirational quotes on social media.

Acceptance Of Profound-Sounding "Bullsh*t" Linked To Lower Intelligence
Source: IFLScience

Obviously it’s not quite that simple, as the study deals specifically with so-called “pseudo-profound” bullshit – which is basically stuff that sounds profound in some way, but is in fact complete, meaningless nonsense. Real life examples of this, according to the researchers, are some quotes by New Age-guru Deepak Chopra, such as the following:

“Attention and intention are the mechanics of manifestation.”

You really can’t make this shit up, can you? 

(Well actually, you can, using Wisdom of Chopra, which generates a random quote based on text from Chopra’s Twitter account.)

Anyway, in order to help those who were slightly worried that their quote-sharing practices may or may not reflect a below average mental capacity, I decided to use Azure Machine Learning and R (and some Power Query) to create a classification service to determine the “bullshit probability” of quotes and statements.

Hey, maybe I am using my powers for good after all!

Building a bullshit classifier

So how did I do this? Well, my approach was to gather a limited sample of both bullshit and non-bullshit quotes and treat it as a binary classification problem (supervised learning), producing a probability of 0 to 1 of any quote being bullshit based on the contents of the text. This requires an initial labeling of quotes, with 0 denoting non-bullshit and 1 denoting bullshit.

Since Pennycook et al. (2015) specifically refer to tweets by Deepak Chopra as examples of pseudo-profound bullshit and use randomly generated statements based on said tweets for their studies, I decided it was an acceptably small leap of logic for me to use quotes by Deepak Chopra exported from Goodreads as labeled cases of bullshit in my dataset rather than tweets. I’m not on good terms with the Twitter API, and using an export tool I was easily able to get the quotes into the format I needed.
You really should be using this tool.

For cases of non-bullshit, I decided to use quotes by evolutionary biologist Richard Dawkins. For defining a polar opposite of Deepak Chopra, he is the first one that comes to mind.

So in my dataset, bullshit is defined as stuff Deepak Chopra says – labeled 1 – and non-bullshit is defined as stuff Richard Dawkins says – labeled 0. For my classifier, every piece of text exists only on a scale from Richard Dawkins to Deepak Chopra, and is scored by the probability of that piece of text being bullshit, i.e. being somehow similar to something Chopra might say, or dissimilar in the sense that it is more similar to something Dawkins might say.

After having collected and labeled the data, I did some quick text preprocessing using Power Query and R, and made sure the cases of bullshit and non-bullshit were reasonably well balanced in the dataset.
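To illustrate the labeling and balancing idea, here is a small Python sketch. The quotes are placeholders rather than real quotes, and downsampling the larger class is just one simple way to balance a dataset; it is an assumption for this example, not necessarily what I did in Power Query and R.

```python
import random
from collections import Counter

# Placeholder texts standing in for the real quotes.
chopra_quotes = [f"chopra quote {i}" for i in range(5)]
dawkins_quotes = [f"dawkins quote {i}" for i in range(3)]

# Label bullshit as 1 and non-bullshit as 0.
dataset = [(q, 1) for q in chopra_quotes] + [(q, 0) for q in dawkins_quotes]


def balance(labeled, seed=0):
    """Downsample every class to the size of the smallest class."""
    by_label = {}
    for text, label in labeled:
        by_label.setdefault(label, []).append((text, label))
    smallest = min(len(items) for items in by_label.values())
    rng = random.Random(seed)  # fixed seed for a repeatable sample
    balanced_rows = []
    for items in by_label.values():
        balanced_rows.extend(rng.sample(items, smallest))
    return balanced_rows


balanced = balance(dataset)
class_counts = Counter(label for _, label in balanced)
print(class_counts)
```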

I ended up with a current total of 3200 cases (each being approximately one sentence of variable length) after preprocessing, which I would say is a very small sample for training such a text classifier and a very limited basis for generalisation. I’ll definitely be looking to expand the data set with more cases later, but still, this experiment was really just for learning and for fun.

Moving on, I used feature hashing to turn the text into feature columns, and principal component analysis to reduce the number of features to a reasonable amount while preserving as much of their variance as possible. I used only these text features for classification.
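If feature hashing sounds mysterious, the core idea fits in a few lines of Python. This sketch hashes single words with MD5 into a fixed number of buckets and counts the hits; the Feature Hashing module in Azure ML uses its own hash function and can work on n-grams, so treat the hash function, the bucket count and the tokenization here as assumptions for illustration only.

```python
import hashlib
import re


def hash_features(text, n_buckets=16):
    """Turn a text into a fixed-length vector of hashed word counts."""
    vector = [0] * n_buckets
    for token in re.findall(r"[a-z']+", text.lower()):
        digest = hashlib.md5(token.encode("utf-8")).digest()
        # Map the token to one of n_buckets columns; collisions are
        # possible and simply end up sharing a column.
        index = int.from_bytes(digest[:4], "big") % n_buckets
        vector[index] += 1
    return vector


quote = "Attention and intention are the mechanics of manifestation."
vector = hash_features(quote)
print(vector)
```

The nice thing is that every text ends up with the same number of columns no matter which words it contains, which is what makes it possible to feed the result to a classifier or to principal component analysis afterwards.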

I ended up using a Two-Class Boosted Decision Tree (blue below) with the principal components as features for classifying, but I also had luck with using a Two-Class Locally-Deep Support Vector Machine (red below) with only raw hashing features.

True vs. false positives

I was happy to see that the model both cross-validates and performs quite all right, especially considering the very limited sample size and my rather frivolous labeling strategy for those samples.

There are definitely some patterns separating the language of Chopra from that of Dawkins, and the model is able to distinguish between these with an accuracy of about 0.65 as of the time of writing.

It’s obviously going to be wildly inaccurate for statements that have nothing in common with the language in the quotes by either Chopra or Dawkins, but if a quote has some elements from one or both of those extremes, the classifier should in theory work reasonably well. Certainly better than a coin flip!
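For anyone wondering how numbers like accuracy and the true vs. false positive rates come about, here is a small Python sketch. The labels and scores are invented for the example and have nothing to do with the actual results of my model.

```python
def confusion_counts(labels, scores, threshold=0.5):
    """Count true/false positives and negatives at a score threshold."""
    tp = fp = tn = fn = 0
    for label, score in zip(labels, scores):
        predicted = 1 if score >= threshold else 0
        if predicted == 1 and label == 1:
            tp += 1
        elif predicted == 1 and label == 0:
            fp += 1
        elif predicted == 0 and label == 0:
            tn += 1
        else:
            fn += 1
    return tp, fp, tn, fn


# Invented true labels (1 = bullshit) and predicted probabilities.
labels = [1, 1, 1, 1, 0, 0, 0, 0]
scores = [0.9, 0.7, 0.6, 0.3, 0.8, 0.4, 0.2, 0.1]

tp, fp, tn, fn = confusion_counts(labels, scores)
accuracy = (tp + tn) / len(labels)
true_positive_rate = tp / (tp + fn)
false_positive_rate = fp / (fp + tn)
print(accuracy, true_positive_rate, false_positive_rate)
```

Repeating the calculation for many different thresholds and plotting the true positive rate against the false positive rate is what produces a curve like the one in the figure above.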


So being quite satisfied, I decided to publish the model as a prediction service using Azure. This was done with a few clicks, and lo and behold: Kjetil’s bullshit classifier as a web service.

Kjetil’s bullshit classifier

Input any quote you like, and the model scores its probability of being bullshit on a scale from Richard Dawkins (0.00) to Deepak Chopra (1.00). The classifier has obvious limitations, but it was a fun exercise nonetheless. Maybe I’ll improve it with a better sampling and labeling strategy in the future?

The cool part is that with Azure Machine Learning, you can get an idea for doing stuff like this and get it up and running in one form or another within a couple of evenings. It’s really quite ingenious.

Maybe you’ll even find a problem of some real business value to tackle, rather than classifying bullshit statements …

If you want to learn more about text classification, Microsoft has several ready-made examples of how to do this within Azure ML, and there are tutorials readily available. Machine Learning is becoming a natural part of the toolkit of any data analyst or data scientist, and Azure ML makes it very approachable for pretty much anyone with a certain amount of technical and analytical skills.

So what are you waiting for? Get modeling!

The Democratization of Data Science and Machine Learning

When I had the opportunity to work with Machine Learning for a client in the health sector, I started to realize exactly how Amazon had been cunningly manipulating me since I signed up for an account in 2010 to shop for some page-turning thrillers.

Not unexpectedly, this opened up the floodgates for a steady flow of emails, chock full of suggestions and recommendations for other books that Amazon was determined I needed to buy. This flow of emails has since been maintained by diligent clicking and shopping, because Amazon is 100 % correct. I simply must have these books.

In the exact same way, Netflix makes sure that you have no choice but to binge-watch the whole newest season of House of Cards when it is released. Netflix, through smart use of enormous amounts of data, machine learning and other advanced analytics techniques and technologies, knows better than anybody else – probably even better than yourself – exactly what you want to watch and how they’re going to get you to watch it.

Netflix tweet

This form of customer intelligence is now more available than ever, for the benefit of those who have sufficient amounts of data and want to know their customers better than they know themselves.

The concepts mentioned previously are used today in a wide variety of fields, for example optimization of stock portfolios and price prediction, picture and speech recognition, fraud detection, text mining and analytics, sentiment analysis of social media traffic, data driven customer segmentation and marketing, predictive maintenance, automatic and personalized web design, and recommendations for what to get at restaurants.

The list of use cases is steadily growing, and almost every week a new, innovative use case for machine learning makes its way into the news. This is a cake that all firms who wish to be data driven both could and should help themselves to, if only just to have a taste.

I’ll be back

Machine learning is nothing new, and algorithms that learn from data and build statistical models to explore, predict and recommend have existed for a long time.

In the 70’s and beginning of the 80’s, many computer engineers, scientists and statisticians were preoccupied with figuring out artificial intelligence. The emergence of machine learning is in many ways related to these endeavours, although the first algorithms are much older.

The field of machine learning didn’t really come to fruition until the 90’s, which is undeniably connected to the fact that at this point most of the researchers had watched the first two Terminator movies. In their fear of accidentally creating Skynet, they instead turned their attention to using machine learning to handle practical business problems such as prediction and cluster analysis.

Arnold Schwarzenegger as The Terminator

Not unlike Arnold Schwarzenegger himself, machine learning is currently experiencing a new renaissance, which is primarily driven by the increased access to computational power and cloud services. This also means that the threshold for firms both big and small to experiment with sophisticated modeling and analytics on small, medium-sized or big data is lower than ever.

This dramatically shortens the road from a vision about data driven decision making to operationalized predictive models implemented in any analytical tool, readily available to the decision makers and operative personnel.

And as if that wasn’t already enough, these systems and models can keep learning and improving as they’re being used in the business, making them increasingly more accurate and able to account for changing assumptions over time if they’re implemented, used and maintained correctly.

The Data Science ecosystem

Many who are familiar with the term “data mining” might think this sounds remarkably similar, and they would be very much right. If you ask two different experts about what machine learning and data mining are, you’re likely to get two different but overlapping answers.

Lately, we’ve started using the term “Data Science” as the name of an independent discipline consisting of concepts from analytics, machine learning, data mining, pattern recognition, data warehousing, graph theory and databases, visualization and prediction. To this we add a large ecosystem of associated tools and platforms, and define the whole shebang as “extraction of knowledge from data.”

The Data Science ecosystem
Source: Computerworld

I prefer to look at Data Science as a more down to earth and practical approach to Big Data. For while the Data Science field by no means discriminates against data of smaller sizes, the methods and techniques you use will (mostly) be the same whether you’re dealing with big or small data.

What Data Science does is deal with exactly what you’re going to do with all this data and how it connects to real business problems, and in the process remove some of the buzzing you usually hear when somebody says “big” and “data” in that order.

If you start talking about the Data Science method in a concrete and practical manner instead of going vaguely on about how all this Big Data must be useful somehow, you might experience that people actually listen to you instead of dismissing it as bullshit. And I find that machine learning especially, while admittedly being only a small part of Data Science, is an excellent approach to concretize how a tricky business problem can be solved with data, if appropriate.

This is perhaps also the biggest difference between Data Science and traditional Business Intelligence. The Data Science umbrella term gathers under it a selection of concepts, methods and tools that actually empower the very desirable business approach of “think big, act small.”

Data driven decisions and operations founded on sophisticated analytics will never become a reality unless a business is able and willing to start with something somewhere, and with the Data Science approach, the reward and potential is maintained while the risks are minimized.

Acute hospital admissions, weather and climate – a proof of concept

A Data Scientist will start by asking increasingly concrete questions.

Which problem are we trying to solve? Is it a classification problem or a regression problem? Will it be most appropriately modeled by decision trees or neural networks? Maybe it’s a job for a recommender system, or is pattern recognition, sequence modeling or association rules more appropriate?

And last, but by no means least: do we have the data we need, is the information sufficient to solve the problem in an appropriate manner, and how can we prepare the data to exploit the information maximally?

When I and a colleague were working on a proof of concept for a client in the health sector together with Microsoft, we used machine learning to analyze and model relationships between acute hospital admissions and external factors such as weather, climate and air quality.

Research, intuition and current business practices all suggest that these relationships are very much real, and our task was to explore, quantify and exploit them if they existed in the data. Since Microsoft has made machine learning readily available in their Azure cloud environment, it was natural for us to take it one step further and try to turn the relationships into predictive models.

Microsoft Azure Machine Learning
Source: CloudTimes

If sufficiently accurate, such models could be used to predict the number of acute admissions in hospitals, providing valuable input for emergency preparedness management and in turn ensuring patients better and more timely care.

This covers only a very tiny portion of the scope of opportunities for such technology, and there is a wide variety of related potential use cases for machine learning and Data Science within healthcare.

Throughout the process we had to ask ourselves many of the questions I previously mentioned, and as I mentioned early on I finally realized exactly how Amazon knows better what I like than I do myself.

Properly prepared, quality data usually doesn’t lie, and if the connections, interactions and relationships you’re interested in are actually present in these data, sufficiently sophisticated machine learning algorithms will almost always be able to find them with an appropriate amount of human help.

I need your clothes, your boots, and your motorcycle

Obviously you don’t need machine-made association rules and sophisticated market basket analysis to reveal that the combination of full MC gear, a pair of boots and a motorcycle will satisfy any terminator.

However, there might be relationships in your data that are far less obvious, exceedingly more complex and much more time-consuming to uncover manually. If you’re then able to represent in your data set an almost complete selection of data points which are sufficiently relevant for the problem you’re trying to solve, you’ll likely be able to produce models with frighteningly accurate results on real data that might be worth their digital weight in gold in business settings.

In our POC we managed to account for a great deal of the variation in the number of acute admissions based on hospital data, calendar data, weather data and air quality data. This shows that it is definitely possible to use such data to model and predict how many patients with, for example, airway related or skin related diagnoses are likely to show up in the emergency ward on a given day.

There are of course loads of other factors in addition to the weather that affect this number, and a machine learned model will never become more accurate than the data foundation it is based on.

Hospital admissions and weather correlations
Average correlations between groups of weather factors and admissions of patients aged 0-9 with respiratory diseases.
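As a reminder of what a correlation like this measures, here is a small Python sketch that computes the Pearson correlation coefficient by hand. The temperatures and admission counts are entirely made up for illustration and are not data from the proof of concept.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equally long lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Invented daily values: temperature in degrees Celsius and the number
# of admissions on the same days.
temperature = [-5, -2, 0, 3, 8, 12, 15]
admissions = [14, 12, 11, 9, 7, 6, 4]

r = pearson(temperature, admissions)
print(round(r, 2))
```

A value close to -1 means that admissions tend to drop as the temperature rises in this toy data, while a value near 0 would mean no linear relationship at all.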

Still, it is important to be aware that such models don’t always have to be very accurate to provide valuable information.

If the alternative is to flip a coin to decide between two options, the recommendations of a model that is right 51 % of the time will qualify as actionable insights. If you value your gut feeling at 60 %, this becomes the benchmark for the model to beat instead.

For example, when planning costly direct marketing, a predictive model which increases the response rate or conversion rate by only a few thousandths might be any marketer’s wet dream.
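A quick back-of-the-envelope calculation shows why a few thousandths can matter. All the numbers in this Python sketch (campaign size, response rates, revenue per response and cost per contact) are hypothetical and only there to illustrate the arithmetic.

```python
def campaign_profit(recipients, response_rate, revenue_per_response,
                    cost_per_contact):
    """Expected profit of a direct marketing campaign."""
    responses = recipients * response_rate
    return responses * revenue_per_response - recipients * cost_per_contact


# Hypothetical campaign: 100,000 recipients, 50.0 in revenue per
# response and 0.75 in cost per contact.
baseline = campaign_profit(100_000, 0.020, 50.0, 0.75)
with_model = campaign_profit(100_000, 0.023, 50.0, 0.75)
uplift = with_model - baseline
print(round(baseline), round(with_model), round(uplift))
```

In this made-up case, lifting the response rate from 2.0 % to 2.3 % takes the expected profit from 25,000 to 40,000, even though the model only changed the rate by three thousandths.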

Hasta la vista, baby

Many firms talk about running the business on data and making decisions based on quality intelligence and well-founded assumptions, but the road from coming up with a hypothesis or an idea to operationalizing a predictive model to use for decision making may seem daunting.

The good news – or the bad news, depending on how you see it – is that the biggest challenge is the same as before: data quality. Data profiling and smart data preparation are likely to take up 80 % of the time in a Data Science project.

It’s worth noting that we’re definitely not talking about cleaning up your entire data warehouse here; this is neither realistic nor desirable. We want to do just enough to provide sufficiently reliable answers to the business problems at hand, guided by an implicit or explicit cost-benefit analysis.

When all this is done, you can lean back, drink your coffee and let the machines churn the numbers, guiding the algorithms and acting like the high-tech puppet master you’ve always wanted to be. After some adjustments you’ve created an API or a data feed that you can expose to everything from QlikView and Tableau to good old Excel, and you’re ready to start predicting.

Data Scientist
Source: Edureka!

Now that machine learning is available in the cloud, large on-premise installations of expensive software are no longer necessary. Knowledge of tools such as R and Python is very desirable, but even this is not necessarily required to get results.

What is needed first and foremost is vision and willingness to experiment to potentially change the business, and people who actually know how to run Data Science initiatives.

This entails an understanding of how to turn business problems into data problems – and in turn answers – as well as how to communicate across the business, sources of domain and data expertise, and disciplines such as analytics and statistics.

Perhaps most importantly, it requires a good head for validating and evaluating the process and the results to separate the “maybe wrong” from the “perhaps right.” A real Data Scientist, unlike a Sith, never deals in absolutes.

An Introduction


Hello, and welcome to my blog! My name is Kjetil, and I work as a Business Intelligence consultant at Capgemini in Oslo, Norway. I’ve started this blog because I’m very enthusiastic about the things I do for a living, and I very much enjoy sharing and writing about things that I’m enthusiastic about.

I thought long and hard about what kind of name would be appropriate for the type of content I plan on putting out, and I realised that I didn’t want to restrict myself to things that only have to do with specific disciplines or technologies.

The common denominator for everything that I do is data. It’s not necessarily Business Intelligence or Data Science, it’s not necessarily machine learning, it’s not necessarily SQL or Microsoft Power BI or Azure or any other specific technology – but it always has to do with data in some shape or form.

Therefore, I landed on the rather unimaginative name Kjetil Åmdal’s Data Blog, which I realize is almost childish in its simplicity and literalness (especially in my mother tongue, in which it actually sounds completely asinine …), but at least does not really exclude me from writing about anything in particular.

I could even write about some of the other things I enjoy that are unrelated to my work, such as cooking and exercise, as long as I make it about data – which is more or less what I tend to do with everything anyway. My academic background is in business and economics, but I’m an IT guy at heart and I’m not afraid to get down and dirty with analytical tools and code.

I hope the aforementioned keywords and explanation set some expectations for what kind of content I plan to provide, what form it will take and what level it will be on. I make no promises about the diligence with which I will put out content, but you can be certain that if I’m doing something fun that I want to share and can think of some way to make it interesting to somebody other than myself, it will make its way onto the blog.

See ya!