Machine Learning has a lot of really useful applications which I’ve previously written about. It allows us to do things such as predictive maintenance, fraud detection, churn/upselling/cross-selling analytics, customer lifetime value estimation and segmentation, supply chain management, demand forecasting and a host of other stuff.
However, Machine Learning also enables us to do a lot of really quite stupid and funny things, which is one of the reasons I like it so much. One such thing would be the classification of quotes into “bullshit” and “non-bullshit” categories.
That’s right; I’m not above using my powers for evil!
Sorry, I’m allergic to bullshit
Lately there’s been some research into the mechanisms of human receptivity to bullshit and capacity for analytical thinking and reflection, with receptivity to bullshit apparently being negatively linked to analytical thinking.
The most recent study “On the reception and detection of pseudo-profound bullshit” by Pennycook et al. (2015) has been making the rounds in the tabloids, with journalists and writers happily generalising the rather specific findings of the researchers to question the mental capacity of those who enjoy sharing motivational and inspirational quotes on social media.
Obviously it’s not quite that simple, as the study deals specifically with so-called “pseudo-profound” bullshit – which is basically stuff that sounds profound in some way, but is in fact complete, meaningless nonsense. Real life examples of this, according to the researchers, are some quotes by New Age-guru Deepak Chopra, such as the following:
“Attention and intention are the mechanics of manifestation.”
You really can’t make this shit up, can you?
(Well actually, you can, using Wisdom of Chopra, which generates a random quote based on text from Chopra’s Twitter account.)
Anyway, in order to help those who were slightly worried that their quote-sharing practices may or may not reflect a below average mental capacity, I decided to use Azure Machine Learning and R (and some Power Query) to create a classification service to determine the “bullshit probability” of quotes and statements.
Hey, maybe I am using my powers for good after all!
Building a bullshit classifier
So how did I do this? Well, my approach was to gather a limited sample of both bullshit and non-bullshit quotes and treat it as a binary classification problem (supervised learning), producing a probability between 0 and 1 of any quote being bullshit based on the contents of the text. This requires an initial labeling of quotes, with 0 denoting non-bullshit and 1 denoting bullshit.
Since Pennycook et al. (2015) specifically refer to tweets by Deepak Chopra as examples of pseudo-profound bullshit and use randomly generated statements based on said tweets for their studies, I decided it was an acceptably small leap of logic for me to use quotes by Deepak Chopra exported from Goodreads as labeled cases of bullshit in my dataset rather than tweets. I’m not on good terms with the Twitter API, and using import.io I was easily able to export the quotes into the format I needed.
For cases of non-bullshit, I decided to use quotes by evolutionary biologist Richard Dawkins. When it comes to defining a polar opposite of Deepak Chopra, he is the first person who comes to mind.
So in my dataset, bullshit is defined as stuff Deepak Chopra says – labeled 1 – and non-bullshit is defined as stuff Richard Dawkins says – labeled 0. For my classifier, every piece of text exists only on a scale from Richard Dawkins to Deepak Chopra, and is scored by the probability of that piece of text being bullshit, i.e. being somehow similar to something Chopra might say, or dissimilar in the sense that it is more similar to something Dawkins might say.
After having collected and labeled the data, I did some quick text preprocessing using Power Query and R, and made sure the cases of bullshit and non-bullshit were reasonably well balanced in the dataset.
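To give you an idea of what that step looks like, here’s a rough sketch in Python with pandas (a stand-in for the Power Query and R I actually used — the inline quote lists are just illustrative placeholders, not the full Goodreads export):

```python
import re

import pandas as pd

# Placeholder quotes standing in for the Goodreads exports
chopra_quotes = [
    "Attention and intention are the mechanics of manifestation.",
    "The universe contains infinite potentiality.",
]
dawkins_quotes = [
    "We are going to die, and that makes us the lucky ones.",
    "The universe is indifferent to our existence.",
]

def clean(text: str) -> str:
    """Lowercase and strip everything except letters and whitespace."""
    return re.sub(r"[^a-z\s]", "", text.lower()).strip()

# Label the cases: 1 = bullshit (Chopra), 0 = non-bullshit (Dawkins)
df = pd.concat([
    pd.DataFrame({"text": chopra_quotes, "label": 1}),
    pd.DataFrame({"text": dawkins_quotes, "label": 0}),
], ignore_index=True)
df["text"] = df["text"].map(clean)

# Check that the two classes are reasonably well balanced
print(df["label"].value_counts())
```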
I ended up with a current total of 3200 cases (each being approximately one sentence of variable length) after preprocessing, which I would say is a very small sample for training such a text classifier and a very limited basis for generalisation. I’ll definitely be looking to expand the dataset with more cases later, but still, this experiment was really just for learning and for fun.
Moving on, I used feature hashing to turn the text into feature columns, and principal component analysis to reduce the number of features to a manageable amount while preserving most of their variance. I used only these text features for classification.
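In scikit-learn terms, those two steps look roughly like the sketch below. Note the substitutions: HashingVectorizer stands in for the Azure ML feature hashing module, and TruncatedSVD stands in for PCA (plain PCA would densify the sparse hashed matrix, so truncated SVD is the usual sparse-friendly substitute):

```python
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import HashingVectorizer

# A handful of illustrative (made-up) preprocessed quotes
texts = [
    "attention and intention are the mechanics of manifestation",
    "wholeness quiets infinite phenomena",
    "your consciousness unfolds cosmic awareness",
    "natural selection explains the appearance of design",
    "evidence is the only good reason to believe anything",
    "the universe is indifferent to our existence",
]

# Step 1: feature hashing turns each text into a fixed-width sparse vector
hasher = HashingVectorizer(n_features=2**10, alternate_sign=False)
X_hashed = hasher.transform(texts)       # sparse, shape (6, 1024)

# Step 2: reduce the hashed features to a few dense components
svd = TruncatedSVD(n_components=3, random_state=0)
X_reduced = svd.fit_transform(X_hashed)  # dense, shape (6, 3)
print(X_reduced.shape)
```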
I ended up using a Two-Class Boosted Decision Tree (blue below) with the principal components as features for classifying, but I also had luck with using a Two-Class Locally-Deep Support Vector Machine (red below) with only raw hashing features.
I was happy to see that the model both cross-validates and performs quite all right, especially considering the very limited sample size and my rather frivolous labeling strategy for those samples.
There are definitely some patterns separating the language of Chopra from that of Dawkins, and the model is able to distinguish between these with an accuracy of about 0.65 as of the time of writing.
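For a rough feel of the training and cross-validation step, here’s a hedged sketch substituting scikit-learn’s GradientBoostingClassifier for Azure ML’s Two-Class Boosted Decision Tree. The tiny inline dataset is made up purely for illustration; the real experiment used the ~3200 labeled quotes:

```python
from sklearn.decomposition import TruncatedSVD
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.feature_extraction.text import HashingVectorizer
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

texts = [
    "attention and intention are the mechanics of manifestation",
    "your desire transcends cosmic possibilities",
    "wholeness is the womb of infinite potential",
    "perception heals subtle creativity",
    "intuition unfolds universal awareness",
    "the soul projects quantum belonging",
    "natural selection explains the appearance of design",
    "evidence is the only good reason to believe anything",
    "we are going to die and that makes us the lucky ones",
    "science works whether you believe in it or not",
    "dna neither knows nor cares it just is",
    "the universe is indifferent to our existence",
]
labels = [1] * 6 + [0] * 6  # 1 = bullshit (Chopra), 0 = non-bullshit (Dawkins)

# Hashing -> dimensionality reduction -> boosted trees, as one pipeline
model = make_pipeline(
    HashingVectorizer(n_features=2**10, alternate_sign=False),
    TruncatedSVD(n_components=2, random_state=0),
    GradientBoostingClassifier(random_state=0),
)

# Cross-validated accuracy (only 2 folds here because the toy set is tiny)
scores = cross_val_score(model, texts, labels, cv=2, scoring="accuracy")
print(scores.mean())

# Fit on everything and score a new quote's "bullshit probability"
model.fit(texts, labels)
proba = model.predict_proba(["the cosmos awakens infinite intention"])[0][1]
print(proba)
```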
It’s obviously going to be wildly inaccurate for statements that have nothing in common with the language in the quotes by either Chopra or Dawkins, but if a quote has some elements from one or both of those extremes, the classifier should in theory work reasonably well. Certainly better than a coin flip!
So being quite satisfied, I decided to publish the model as a prediction service using Azure. This was done with a few clicks, and lo and behold: Kjetil’s bullshit classifier as a web service.
Input any quote you like, and the model scores its probability of being bullshit on a scale from Richard Dawkins (0.00) to Deepak Chopra (1.00). The classifier has obvious limitations, but it was a fun exercise nonetheless. Maybe I’ll improve it with a better sampling and labeling strategy in the future?
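For the curious: calling a published Azure ML Studio web service essentially amounts to POSTing a JSON payload to the endpoint with your API key. Here’s a rough sketch of what that might look like — the URL and key are placeholders, and the exact input column names depend on how the experiment is set up:

```python
import json

API_URL = "https://example.services.azureml.net/score"  # placeholder endpoint
API_KEY = "your-api-key-here"                           # placeholder key

# Request body in the classic Azure ML Studio request-response shape
payload = {
    "Inputs": {
        "input1": {
            "ColumnNames": ["text"],
            "Values": [
                ["Attention and intention are the mechanics of manifestation."],
            ],
        }
    },
    "GlobalParameters": {},
}

headers = {
    "Content-Type": "application/json",
    "Authorization": "Bearer " + API_KEY,
}

body = json.dumps(payload)
# An actual call would then be something like:
#   import requests
#   response = requests.post(API_URL, data=body, headers=headers)
#   print(response.json())  # includes the scored probability
```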
The cool part is that with Azure Machine Learning, you can get an idea for doing stuff like this and get it up and running in one form or another within a couple of evenings. It’s really quite ingenious.
Maybe you’ll even find a problem of some real business value to tackle, rather than classifying bullshit statements …
If you want to learn more about text classification, Microsoft has several ready-made examples of how to do this within Azure ML, and there are tutorials readily available. Machine Learning is becoming a natural part of the toolkit of any data analyst or data scientist, and Azure ML makes it very approachable for pretty much anyone with a reasonable amount of technical and analytical skill.
So what are you waiting for? Get modeling!