Vector embeddings part 1: Word2vec with Gensim

Since the advent of neural networks, vector embeddings for text processing have gained traction in both scientific and applied text classification problems, for example in text sentiment analysis. Using (pre-trained) embeddings has become a de facto standard for attaining a high rating in scientific sentiment analysis contests such as SemEval. However, vector embeddings are finding their way into many other fascinating applications as well.

In the current post, I want to provide an intuitive, easy-to-understand introduction to the subject. I will touch upon both vector embeddings in general, and applied to text analysis in particular. Furthermore, I’ll provide some quite simple code with Python and Gensim for generating vector embeddings from text. In further posts, I will apply vector embeddings to transforming categorical to continuous variables and generating vector definitions in N-dimensional space for entities.

I’ll cycle through the following points:

  • What are vector embeddings?
  • How do you use vector embeddings?
  • Why do you need vector embeddings?
  • How do you generate vector embeddings?
  • How does the neural network optimize the vector matrix?
  • A practical example (with code): word vectorization for tweets.

What are vector embeddings?

In a nutshell, a vector embedding for a certain item (a word, a person, or any other object you can think of) is a mathematical definition of that specific item. This definition consists of a list of float numbers, which is called a vector. The numbers are typically small; in the examples in this post they all lie between -1 and 1.

A vector is a specific location in N-dimensional space, relative to the zero point (the origin) of that space. So, a vector embedding for an item means that the item is represented by a vector, embedded in a space with as many dimensions as there are numbers in the vector. Every position in the vector indicates the location of the item on a certain axis: the first number represents the X-axis, the second represents the Y-axis, and so forth.

To keep it simple, let’s say the vector embedding for the word ‘bird’ equals [0.6, -0.2, 0.6]. Therefore, ‘bird’ is represented in a specific spot in 3-dimensional space. Simple, right?

Like the following image:

However, it gets more complicated when you start adding dimensions. As a rule of thumb, 50 dimensions is a usual vector length for representing things like grocery items, or people, or countries. For language-related tasks, somewhere in the range of 200-300 dimensions is more common. Our human brains are just not equipped for imagining anything beyond 3 to 4 dimensions, so we’ll have to trust the math.

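Lastly, vector embeddings are often rescaled (normalized) to a length of 1, which means that the squares of their numbers add up to 1. This is a convention rather than a hard constraint, and the toy vectors in this post are not normalized.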

How do you use vector embeddings?

Of course, a list of float numbers on its own does not mean anything. It only becomes interesting once you start factoring in that you could encode multiple items in the same high-dimensional space. Naturally, items that are closer to each other can be thought of as more similar or compatible. Consequently, items that are far away from each other are dissimilar. Items that are neither far away nor close by (think of it like a 90 degree angle, just in high-dimensional space) do not really have any discernible relationship to each other. A mathematical term for this is ‘orthogonal’.

The measure that is used for checking whether two vectors are similar or close is called cosine similarity. Its core ingredient is the dot product between the two vectors. It goes like this: given two vectors [A, B, C] and [D, E, F], the numbers in each position are multiplied by each other and the products are added up, so the dot product equals A times D plus B times E plus C times F. The outcome is a single number, which is called a scalar in linear algebra. Cosine similarity is this dot product divided by the product of the lengths of the two vectors; for vectors of length 1, the two are identical. To keep the arithmetic simple, I’ll use the plain dot product in the examples in this section.

The trick here is that when two vectors have values with the same sign in a given position, the product for that position is positive, and when the signs differ, it is negative. Therefore, the dot product between two vectors increases as they point in more similar directions.

To test this out, let’s add two more words:

‘bird’                   = [0.6, -0.2, 0.6]

‘wing’                 = [0.7, -0.3, 0.6]

‘lawnmower’    = [-0.2, 0.8, 0.4]

Now what we have here, is actually our first vector matrix, which is just a fancy name for a set of vectors.

Intuitively, ‘bird’ and ‘wing’ should be quite similar, so have a high dot product. Let’s try it out:

Dot([0.7, -0.3, 0.6], [0.6, -0.2, 0.6]) = 0.42 + 0.06 + 0.36 = 0.84

By contrast, ‘bird’ and ‘lawnmower’ shouldn’t really be related in any way, so the dot product should be close to 0:

Dot([0.6, -0.2, 0.6], [-0.2, 0.8, 0.4]) = -0.12 + -0.16 + 0.24 = -0.04
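If you want to check this arithmetic yourself, here is a minimal sketch in plain Python. It uses the toy vectors from this post (illustrative values, not real embeddings), and the helper names dot and cosine_similarity are my own choices rather than part of any library. Besides the plain dot product, it also computes the cosine similarity, which divides the dot product by the lengths of both vectors.

```python
import math

# Toy vectors from the text (illustrative values, not trained embeddings)
vectors = {
    "bird": [0.6, -0.2, 0.6],
    "wing": [0.7, -0.3, 0.6],
    "lawnmower": [-0.2, 0.8, 0.4],
}

def dot(a, b):
    """Multiply position by position and add up the products."""
    return sum(x * y for x, y in zip(a, b))

def cosine_similarity(a, b):
    """Dot product divided by the product of the two vector lengths."""
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))

bird_wing_dot = dot(vectors["bird"], vectors["wing"])
bird_mower_dot = dot(vectors["bird"], vectors["lawnmower"])
bird_wing_cos = cosine_similarity(vectors["bird"], vectors["wing"])
bird_mower_cos = cosine_similarity(vectors["bird"], vectors["lawnmower"])

print(round(bird_wing_dot, 2), round(bird_mower_dot, 2))
print(round(bird_wing_cos, 3), round(bird_mower_cos, 3))
```

Both measures tell the same story here: ‘bird’ and ‘wing’ score high, while ‘bird’ and ‘lawnmower’ end up close to 0.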

When you start adding lots of words to your vector matrix, it will start becoming a word cloud in N-dimensional space, in which certain clusters of words can be recognized. Furthermore, vector embeddings do not only represent bilateral relationships between items, but also capture more complicated relationships. In our language use, we have concepts of how words relate to each other: we understand that a leg is to a human what a paw is to a cat. If our model is correctly trained on sufficient data, these types of relationships should be recoverable from the vectors.

Why do you need vector embeddings?

In any language-related machine learning task, your results will be best when using an algorithm that actually speaks the language that you’re aiming at analysing. In other words, you might want to use some technique that somehow captures the meanings of the words and sentences you’re processing.

Now, all ML algorithms require numerical input. So, a requirement for an algorithm to capture language meaning is that its input should be numerical: every word should be represented by either a number or a series of numbers.

Generally, for transforming categorical entities such as words to numerical representations, one would apply one-hot encoding: for each unique entity (the number of unique entities is called the cardinality of the variable), a separate column is generated, filled with 1s and 0s, like this:

Country   Country_Greece   Country_Chile   Country_Italy
Greece    1                0               0
Chile     0                1               0
Italy     0                0               1
Chile     0                1               0

Essentially, one-hot encoding already transforms entities to vectors; each entity is represented by a list of numbers. However, this transformation has a number of drawbacks:

  • If you apply one-hot encoding to a text corpus, you’ll need to create a massive number of extra columns. That’s both very inefficient and complicated to analyse and interpret. Text datasets often contain thousands of unique words. In other words, the created vector is too large to be useful.
  • Due to the Boolean nature of this transformation, each entity is equally similar to all other entities. No possibility exists for somehow putting more similar items close to each other. Yet, we also want to capture meanings of words. Therefore, we need a certain way of representing whether two words are closely related or not.
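To make the second drawback concrete, here is a minimal sketch in plain Python that one-hot encodes the country column from the table and then takes the dot product between every pair of different countries. The helper names one_hot and dot are my own. Every pair comes out at exactly 0, so no country is any closer to another than to the rest.

```python
countries = ["Greece", "Chile", "Italy", "Chile"]

# One column per unique value, in order of first appearance
categories = list(dict.fromkeys(countries))

def one_hot(value):
    """Return a vector with a single 1 in the column that belongs to `value`."""
    return [1 if value == c else 0 for c in categories]

def dot(a, b):
    """Multiply position by position and add up the products."""
    return sum(x * y for x, y in zip(a, b))

encoded = [one_hot(c) for c in countries]
greece, chile, italy = one_hot("Greece"), one_hot("Chile"), one_hot("Italy")

print(encoded)
# Every pair of different countries is orthogonal: the dot product is 0
print(dot(greece, chile), dot(greece, italy), dot(chile, italy))
```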

Embedding entities as vectors solves both these problems. In summary, the method lets you encode every word in your dataset as a short list of float numbers, typically small values such as the ones between -1 and 1 used in this post. If two words are similar, their respective lists resemble each other more.

How do you generate vector embeddings?

As I described before, similarity is essential to understanding vector embeddings. In the case of actually generating the embeddings from scratch, we’ll need a measure for similarity. Usually, in text analysis, we derive that from word co-occurrence in a text corpus. Essentially, words that often appear close together within sentences are theorized to be quite similar. The opposite is true for dissimilar words; they are rarely close together in text.

This is where neural networks come in: you use the neural network to either make two vectors more similar to each other each time the underlying entities occur together, or to make two vectors for items more dissimilar given that they don’t occur together.

Firstly, we need to transform our dataset to a usable format. Since the neural network will require some number to be predicted, we need a binary output variable. All similar word combinations will be accompanied by a 1, and all dissimilar combinations will be accompanied by a 0, like this:

First word   Second word   Binary output
‘bird’       ‘wing’        1
‘bird’       ‘lawnmower’   0
‘wing’       ‘lawnmower’   0

These word combinations, accompanied by the output variable, are fed to the neural network. Essentially, the neural network predicts whether a certain combination of words should be accompanied by a 1 or a 0, and if it’s wrong compared to the real value, subsequently adjusts the weights in its internal vector matrix to make a more correct prediction next time that specific combination comes around.
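A table like this can be built from raw sentences in a few lines. The sketch below is a simplification under my own assumptions, not what Gensim does internally: the function name make_training_pairs, the window size and the two example sentences are all made up for illustration. Words inside the same window get a 1, and for every such pair one randomly drawn word that never shares a window with the first word gets a 0.

```python
import random

def make_training_pairs(sentences, window=3, seed=0):
    """Build (word, other_word, label) rows from tokenized sentences.

    label 1: the two words occur within `window` positions of each other.
    label 0: `other_word` was drawn at random from the words that never
             share a window with `word` (a simplified negative sample).
    """
    rng = random.Random(seed)
    vocab = sorted({w for s in sentences for w in s})

    positives = set()
    for s in sentences:
        for i, word in enumerate(s):
            lo, hi = max(0, i - window), min(len(s), i + window + 1)
            for j in range(lo, hi):
                if j != i and s[j] != word:
                    positives.add((word, s[j]))

    rows = [(a, b, 1) for a, b in sorted(positives)]
    for a, _b in sorted(positives):
        candidates = [w for w in vocab if w != a and (a, w) not in positives]
        if candidates:
            rows.append((a, rng.choice(candidates), 0))
    return rows

# Hypothetical mini-corpus, already tokenized
sentences = [
    ["the", "bird", "spreads", "its", "wing"],
    ["the", "lawnmower", "cuts", "the", "grass"],
]
pairs = make_training_pairs(sentences)
print(pairs[:5])
```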

How does the neural network optimize the vector matrix?

I’ll try to keep it simple. The neural network is trained by feeding it lots of bilateral word combinations, accompanied by 0s and 1s. Once a certain word combination enters the model, the two respective word vectors are retrieved from the vector matrix. At first, the numbers in the vectors are randomly generated, close to 0.

Subsequently, the dot product for those two vectors is calculated. A sigmoid transformation is applied to the dot product, which squeezes the outcome into a value between 0 and 1. This value can be read as the predicted probability that the combination is a true one: values close to 0 predict a non-combination, values close to 1 predict a true combination.

The learning of the model, as with all neural networks, happens through backpropagation; after feeding a word combination to the network, the model prediction is compared with the actual outcome value. The difference between the two (the error) is propagated back to the embedding layer, where the embeddings are adjusted accordingly. If two words occur together frequently as a true combination, their vector embeddings will be adjusted to resemble each other more. The opposite happens for false combinations; those embeddings are adjusted so that they resemble each other less.

After training, we end up with an embedding matrix containing optimized embeddings for all words in the vocabulary. We discard the sigmoid layer so we can examine dot products between different words.
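The whole loop fits in a few lines of plain Python. Treat the following as a toy sketch under my own assumptions, not as Gensim’s implementation: it keeps a single vector per word (word2vec itself keeps separate input and output vectors), uses plain gradient descent on the log loss, and the function name train, the learning rate and the number of epochs are arbitrary picks. The training rows are the three combinations from the table.

```python
import math
import random

def sigmoid(x):
    """Squeeze any number into the range between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-x))

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def train(pairs, dim=3, epochs=200, lr=0.1, seed=0):
    rng = random.Random(seed)
    words = sorted({w for a, b, _label in pairs for w in (a, b)})
    # Start with small random numbers close to 0
    emb = {w: [rng.uniform(-0.1, 0.1) for _ in range(dim)] for w in words}
    for _ in range(epochs):
        for a, b, label in pairs:
            va, vb = emb[a], emb[b]
            # Prediction minus true label: negative pulls the vectors together,
            # positive pushes them apart
            error = sigmoid(dot(va, vb)) - label
            # Gradient of the log loss for each vector is error * the other vector
            emb[a] = [x - lr * error * y for x, y in zip(va, vb)]
            emb[b] = [y - lr * error * x for x, y in zip(va, vb)]
    return emb

pairs = [("bird", "wing", 1), ("bird", "lawnmower", 0), ("wing", "lawnmower", 0)]
emb = train(pairs)

print(sigmoid(dot(emb["bird"], emb["wing"])))
print(sigmoid(dot(emb["bird"], emb["lawnmower"])))
```

After training, the prediction for the true combination should sit close to 1 and the predictions for the false combinations close to 0.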

A practical example (with code): word vectorization for tweets.

So, neural network vector embeddings can be challenging to understand. However, there are some libraries available that do all the heavy lifting in terms of programming; I’ll be using Gensim.

Some library imports:

import pandas as pd
import numpy as np
import csv
from gensim.models import word2vec
import nltk
import re
import collections

Firstly, we need some textual data. I chose a set of 1.6 million tweets that was provided with a sentiment analysis competition on Kaggle. The data were provided in .csv format. Let’s load them. I’ll be keeping only a subset, for quick training.

##open kaggle file. We only need the tweet text, which is at index 5 (the sixth column) of every line
with open('sentiment140/kaggle_set.csv', 'r') as f:
    r = csv.reader(f)
    tweets = [line[5] for line in r]
tweets = tweets[:400000]

Here’s what they look like.

print(tweets[:10])

#["@switchfoot http://twitpic.com/2y1zl - Awww, that's a bummer.  You shoulda got David Carr of Third Day to do it. ;D", "is upset that he can't update his Facebook by texting it... and might cry as a result  School today also. Blah!", '@Kenichan I dived many times for the ball. Managed to save 50%  The rest go out of bounds', 'my whole body feels itchy and like its on fire ', "@nationwideclass no, it's not behaving at all. i'm mad. why am i here? because I can't see you all over there. ", '@Kwesidei not the whole crew ', 'Need a hug ', "@LOLTrish hey  long time no see! Yes.. Rains a bit ,only a bit  LOL , I'm fine thanks , how's you ?", "@Tatiana_K nope they didn't have it ", '@twittera que me muera ? ']

It appears that some data cleaning is in order.

##clean the data. We'll use nltk tokenization, which is not perfect. 
##Therefore I remove some of the things nltk does not recognize with regex
emoticon_str = r"[:=;X][oO\-]?[D\)\]pP\(/\\O]"
tweets_clean = []

counter = collections.Counter()
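for t in tweets:
    ##lowercase first; note that upper-case emoticon letters such as 'D' then no longer
    ##match the pattern, which is why a stray 'd' survives in the output
    t = t.lower()
    ##remove all emoticons
    t = re.sub(emoticon_str, '', t)
    ##remove username mentions, they usually don't mean anything
    t = re.sub(r'(?:@[\w_]+)', '', t)
    ##remove urls
    t = re.sub(r"http\S+", '', t)
    ##remove all punctuation
    t = re.sub(r'[^\w\s]','',t)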
    ##tokenize the remainder
    words = nltk.word_tokenize(t)
    tweets_clean.append(words)

That’s better:

print(tweets_clean[:10])
##[['awww', 'thats', 'a', 'bummer', 'you', 'shoulda', 'got', 'david', 'carr', 'of', 'third', 'day', 'to', 'do', 'it', 'd'], ['is', 'upset', 'that', 'he', 'cant', 'update', 'his', 'facebook', 'by', 'texting', 'it', 'and', 'might', 'cry', 'as', 'a', 'result', 'school', 'today', 'also', 'blah'], ['i', 'dived', 'many', 'times', 'for', 'the', 'ball', 'managed', 'to', 'save', '50', 'the', 'rest', 'go', 'out', 'of', 'bounds'], ['my', 'whole', 'body', 'feels', 'itchy', 'and', 'like', 'its', 'on', 'fire'], ['no', 'its', 'not', 'behaving', 'at', 'all', 'im', 'mad', 'why', 'am', 'i', 'here', 'because', 'i', 'cant', 'see', 'you', 'all', 'over', 'there'], ['not', 'the', 'whole', 'crew'], ['need', 'a', 'hug'], ['hey', 'long', 'time', 'no', 'see', 'yes', 'rains', 'a', 'bit', 'only', 'a', 'bit', 'lol', 'im', 'fine', 'thanks', 'hows', 'you'], ['nope', 'they', 'didnt', 'have', 'it'], ['que', 'me', 'muera']]

Now building the actual model is super simple:

##now for the word2vec
##list of lists input works fine, you could train in batches as well if your set is too large for memory. 
##argument names follow Gensim 4.x; Gensim 3.x called them iter= and size=
model = word2vec.Word2Vec(tweets_clean, epochs=5, min_count=30, vector_size=300, workers=1)

And check out the results. The embeddings generate some cool results; you can clearly see that the most similar words to a certain word are closely related to it in meaning:

print(model.wv.similarity('yellow','blue'))
print(model.wv.similarity('yellow','car'))
print(model.wv.similarity('bird','wing'))

#0.725584
#0.25973
#0.484771

print(model.wv.most_similar('car'))
print(model.wv.most_similar('bird'))
print(model.wv.most_similar('face'))

#[('truck', 0.7357473373413086), ('bike', 0.6849836707115173), ('apt', 0.6751911640167236), ('flat', 0.666846752166748), ('room', 0.6251213550567627), ('garage', 0.6184542775154114), ('van', 0.6047226190567017), ('house', 0.5993092060089111), ('license', 0.5938838720321655), ('passport', 0.5903675556182861)]
#[('squirrel', 0.7508804202079773), ('spider', 0.7459014654159546), ('cat', 0.7174966931343079), ('kitten', 0.7117133140563965), ('nest', 0.7008006572723389), ('giant', 0.6956182718276978), ('frog', 0.6906625032424927), ('rabbit', 0.685787558555603), ('mouse', 0.6779413223266602), ('hamster', 0.6754754781723022)]
#[('mouth', 0.6230158805847168), ('lip', 0.5884073972702026), ('butt', 0.5822890996932983), ('finger', 0.579236626625061), ('smile', 0.5790724754333496), ('eye', 0.5769941806793213), ('cheek', 0.5746228098869324), ('skin', 0.5726771354675293), ('arms', 0.5703110694885254), ('neck', 0.5571259260177612)]

That’s it for today! I’ll be elaborating more on how to generate vector embeddings by defining neural networks with TensorFlow and Keras, in a future post.

P = NP: A Million Dollar Problem

In 2000, the Clay Mathematics Institute offered one million dollars each for solving seven important, long-standing math problems. Of these, only one has been solved since (the Poincaré conjecture). I will discuss another one, the P = NP problem. This is arguably one of the most relevant and famous problems in modern computer science. If someone showed that P = NP and the proof came with a practical algorithm, it could in principle help cure cancer, obliterate most existing computer privacy, beat everyone at puzzle games like Tetris, help reconstruct the full history of species from DNA, and would most likely create general mayhem on earth.

I stumbled upon this problem while reading The Master Algorithm by Pedro Domingos. He provides a very intuitive description of the problem: ‘P and NP are the two most important classes of problems in computer science. [..] A problem is in P if we can solve it efficiently, and it’s in NP if we can efficiently check its solution. The famous P = NP question is whether every efficiently checkable problem is also efficiently solvable’.

Now if we want to solve the problem and win the lottery, we need to dig a little deeper than that. As Domingos writes, the two most important classes of problems in computer science are P and NP. We designate any problem that can be solved in polynomial time to be in P (polynomial). With NP (non-deterministic polynomial) we mean any problem for which a proposed solution can be checked in polynomial time. Every problem in P is therefore also in NP; the interesting NP problems are those for which the best known solving methods take exponential time or worse. The catch here is that the hardest problems in NP can all be translated into one another in polynomial time. If you find a way to efficiently solve one of them, you solve all of them, because at their core, they are all the same. This is called NP-completeness.

For understanding what the aforementioned sentences mean, we need to grasp exactly what computer scientists mean when they talk about time. For solving problems, they do not use the normal clock time that we know, but express time as the number of computations needed, relative to the number of elements in the problem, namely N. In this sense, time is linear only for very simple problems. For example, let’s say you need to find the highest number in a list (or select for any other condition). Then you need to iterate through the entire list only once. This is a linear problem, in the sense that every added element (to the length of the list) adds only one extra computation/extra time unit. In the graph, it’s the O(n) line.
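As a small illustration of linear time, here is a sketch that finds the highest number in a list and counts how many elements it had to look at. The function name find_max and the step counter are my own additions for the sake of the example. Doubling the length of the list doubles the number of steps, which is exactly what O(n) means.

```python
def find_max(values):
    """Return the largest value and the number of elements inspected."""
    best = values[0]
    steps = 1
    for v in values[1:]:
        steps += 1
        if v > best:
            best = v
    return best, steps

small = list(range(1000))
large = list(range(2000))

print(find_max([3, 9, 2]))
print(find_max(small)[1], find_max(large)[1])
```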

However, many problems in computer science are in polynomial time. This means that the number of computations needed to solve the problem is the number of elements raised to some power (for example N^2 or N^3; O(n^2) in Figure 1). One notable problem in P is that of exactly figuring out what the maximal profit is in the case of multiple cost functions, called linear programming. For example, a car manufacturer has one factory, a certain number of employees and a certain amount of raw materials. The company can choose between building two car models that both have a different required amount of raw materials, labor, and retail price. Within the boundaries of available resources, you’d compute all possible combinations of amounts for both car models (so compute the full solution space), and go with the most profitable, or optimal, solution. For more info, check this page.

However, as far as we know, the hard problems in NP need more than that amount of computing to be solved. Their solution space is way larger than that of polynomial problems. The official definition of NP is a problem that can be solved in polynomial time by a non-deterministic Turing machine (Cook, 2000).

An ‘easier’ way to look at the hard NP problems is this: with the best known methods, they require at the very least a number of computations that is equal to some number raised to the power N, sometimes even a factorial of N. So for example if N = 20, at least 2^20 computations are needed (Figure 1: O(2^n)). This means that the number of computations increases so fast with N, that we can only compute exact solutions for very small problems. To illustrate this, I’ll use an example from MIT: If a computer takes a second to perform a computation with 100 elements in the case the algorithm is in linear time (N = number of computations), that same algorithm will take ~3 hours if the number of computations is equal to N^3 (polynomial time), and will take 300 quintillion years if the number of computations is equal to 2^N (exponential time). This example should provide some insight into the timescale which is required for solving NP problems with large Ns. As a result, we can currently only solve very simple NP problems with only a few components.
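The arithmetic behind that example is easy to redo. The sketch below assumes, as the example does, that a computer handles 100 computations per second; the variable names are mine. The cubic case comes out just under three hours, and the exponential case lands in the hundreds of quintillions of years, the same order of magnitude as the quoted figure.

```python
SECONDS_PER_HOUR = 3600
SECONDS_PER_YEAR = 365.25 * 24 * 3600

n = 100
# Assumption taken from the example: N computations take one second
ops_per_second = n

linear_seconds = n / ops_per_second
cubic_hours = n ** 3 / ops_per_second / SECONDS_PER_HOUR
exponential_years = 2 ** n / ops_per_second / SECONDS_PER_YEAR

print(linear_seconds)                 # 1.0 second
print(round(cubic_hours, 1))          # a bit under 3 hours
print(f"{exponential_years:.1e}")     # on the order of 10^20 years
```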

However, NP problems are usually easily checkable in polynomial time. For example, computing the optimal solution for a game of Tetris requires a ridiculously large amount of computations, but it is very easy to see that one has solved Tetris. Same for Sudoku; solving is difficult, checking the solution for mistakes is easy.
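To show how cheap checking is, here is a minimal Sudoku checker in plain Python. The function name is_valid_sudoku is my own, and the completed grid is generated with a simple shifting pattern purely for illustration. The checker looks at 27 groups of 9 cells, which is a tiny, fixed amount of work compared with searching for a solution.

```python
def is_valid_sudoku(grid):
    """Check a completed 9x9 grid: every row, column and 3x3 box holds 1..9 once."""
    target = set(range(1, 10))
    rows = [set(row) for row in grid]
    cols = [set(grid[r][c] for r in range(9)) for c in range(9)]
    boxes = [
        set(grid[r][c] for r in range(br, br + 3) for c in range(bc, bc + 3))
        for br in (0, 3, 6)
        for bc in (0, 3, 6)
    ]
    return all(group == target for group in rows + cols + boxes)

# A valid completed grid built from a shifting pattern (illustration only)
solved = [[(3 * (r % 3) + r // 3 + c) % 9 + 1 for c in range(9)] for r in range(9)]

# Swap two cells in the first row: the row stays fine, the columns do not
broken = [row[:] for row in solved]
broken[0][0], broken[0][1] = broken[0][1], broken[0][0]

print(is_valid_sudoku(solved), is_valid_sudoku(broken))
```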

Another way of looking at many NP problems is that, with current methods, one needs to compute all possible combinations of the included elements to find the optimal solution. This is way worse even than 2^N; in this case the number of possible combinations is (N-1)!, which is, in the case of 9 elements, 8 * 7 * 6 * 5 * 4 * 3 * 2 * 1=40320 (~1 hour), and in the case of 100 elements, too large to compute on my laptop. The point here is that the time increase for each added element is so large, that only very simple computations can be made with current methods.

An example of an NP problem is that of the travelling salesman. Let’s say that the salesman needs to visit a number of cities (N), that all have different distances from each other, and he wants to visit them in the most efficient route possible. In that case, he needs to compute the distance to be travelled for all possible travel itineraries, which equals (N-1)!. That is, even for 10 cities, 362880 possibilities. In the case of the following picture, with 50 cities, that is approximately 6.08 * 10^62 possibilities.
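For a handful of cities, trying every itinerary is still feasible. The following brute-force sketch fixes the starting city and tries all orderings of the remaining ones, which gives (N-1)! round trips. The city names and coordinates are hypothetical, and the function names are my own.

```python
import itertools
import math

def tour_length(route, coords):
    """Length of the round trip that visits `route` in order and returns home."""
    legs = zip(route, route[1:] + route[:1])
    return sum(math.dist(coords[a], coords[b]) for a, b in legs)

def brute_force_tsp(coords):
    """Try every ordering of the cities after the first one: (N-1)! routes."""
    cities = list(coords)
    start, rest = cities[0], cities[1:]
    best_route, best_length, checked = None, float("inf"), 0
    for perm in itertools.permutations(rest):
        route = [start] + list(perm)
        length = tour_length(route, coords)
        checked += 1
        if length < best_length:
            best_route, best_length = route, length
    return best_route, best_length, checked

# Hypothetical cities on the corners of a square with sides of length 1
coords = {"A": (0, 0), "B": (0, 1), "C": (1, 1), "D": (1, 0)}
best_route, best_length, checked = brute_force_tsp(coords)

print(best_route, best_length, checked)
print(math.factorial(9))  # number of routes for 10 cities
```

With 4 cities the loop runs 6 times; every added city multiplies that number by the new city count minus one, which is why this approach stops being usable very quickly.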

The aforementioned example is an example of a related class of problems, namely NP-hard problems. These are problems that are at least as hard as the hardest problems in NP, and they do not need to be in NP themselves: for some of them, no way is known to even check a solution in polynomial time. In the aforementioned example, this implies that, even if you were able to have a computer put out a solution for optimal travel, there is no known quick way to be sure that the computer got it right, because confirming that no shorter route exists is as hard as finding one. Another example of an NP-hard problem is chess; if you have a computer find the optimal move to make, how would you even know that the computer got it right?

Here, I’ll recap NP-completeness: even though problems such as the travelling salesman problem, Sudoku, cracking encryption and reverse engineering gene sequences seem not even remotely similar, they are closely related at heart. They all consist of finding a perfect solution for which, with current methods, iterating through a huge number of possible solutions is required. For the NP-complete ones among them, solving one efficiently means solving them all. To show that P = NP, you’ll need to prove that problems that can currently only be solved in exponential or factorial time can be solved perfectly in polynomial time as well (all relative to N). In terms that we can all understand, this implies that for receiving the million dollars, you need to prove that you can find the perfect combination of elements for a problem, without going through all possible combinations.

Now the aforementioned does not mean that scientists have not found solutions to any of the NP problems in this essay. For example, if you look at the map of the United States, it is likely that you could draw a line between all the cities that is going to be very close to the optimal solution. This is one way scientists approach NP problems: through heuristics. These are decision models that do not require computing all possible solutions, but that use a set of rules to approximate the solution. For example in the travelling salesman problem, one could set the rule that the computer/algorithm needs to move to the closest next city. This often results in a reasonable, though not necessarily optimal, solution. However, for things like cracking 50-digit passwords and reconstructing species from gene codes, this will not get us very far.
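The closest-next-city rule takes only a few lines. This is a minimal sketch of that heuristic; the function name nearest_neighbour_route and the city coordinates are made up for illustration, and ties are broken by city name so that the outcome is repeatable. It needs roughly N^2 distance computations instead of (N-1)! routes, at the price of no guarantee that the route is the shortest one.

```python
import math

def nearest_neighbour_route(coords, start):
    """Greedy heuristic: always travel to the closest city not yet visited."""
    unvisited = set(coords) - {start}
    route = [start]
    while unvisited:
        here = coords[route[-1]]
        # Closest unvisited city; the name breaks ties deterministically
        nxt = min(unvisited, key=lambda c: (math.dist(here, coords[c]), c))
        route.append(nxt)
        unvisited.remove(nxt)
    return route

# Hypothetical cities along a straight road
coords = {"A": (0, 0), "B": (1, 0), "C": (3, 0), "D": (6, 0)}
route = nearest_neighbour_route(coords, "A")
print(route)
```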

It is not very likely that mathematicians will ever find a proof that P = NP. For example, in a 2002 study, MIT researchers found that 61 computer scientists thought that P probably is not equal to NP, and 9 thought that it is (6). However, some of those told the researcher that they just said that to be controversial. So far, time has been on the side of the majority; no solution has arisen in the 16 years since. Thus, the general consensus is that P is not equal to NP, but no proof has been found for this statement either.

Then why does anyone still look for the solution to P = NP? Firstly, many breakthroughs in computer science come from someone searching for a solution and accidentally finding ways to make algorithms more efficient or powerful. Secondly, looking makes one understand one of the major limitations of computer science and how to deal with it. Thirdly, it provides insight into some major problems in modern society. And lastly, it’s just interesting. So have a go!

I refer to this video for another quite intuitive explanation of P=NP.

Multivariate Outlier Detection with Isolation Forests

Recently, I was struggling with a high-dimensional dataset that had the following structure: I found a very small number of outliers, all easily identifiable in scatterplots. However, one group of cases happened to be quite isolated, at a large distance from more common cases, on a few variables. Therefore, when I tried to remove outliers that were at three, four, or even five standard deviations from the mean, I would also delete this group.

Fortunately, I ran across a multivariate outlier detection method called isolation forest, presented in this paper by Liu et al. (2012).

This unsupervised machine learning algorithm almost perfectly left in the patterns while picking off outliers, which in this case were all just faulty data points. I’ve used isolation forests on every outlier detection problem since. In this blog post, I’ll explain what an isolation forest does in layman’s terms, and I’ll include some Python / scikit-learn code for you to apply to your own analyses.

Outliers

First, some outlier theory. Univariate outliers are all cases in one’s data that are quite far from the mean in terms of standard deviation on a certain variable. Don’t confuse them with influential cases. Influential cases may be outliers, and vice versa, but they’re not identical. Multivariate outliers, which we are discussing in this post, are essentially cases that display a unique or divergent pattern on variables.

Outliers may have two causes:

  • There may be mistakes present in your data. Maybe someone filled in a faulty number or just typed 999; maybe the application generating your data contains some weird logic. You definitely want to remove these cases before analysis.
  • Some cases are quite abnormal in your data, but are valid. In this case, it’s up to the data scientist to remove them or not. If you’re for example doing a regression, those outliers may strongly influence your results (Cook’s distance), and you’ll remove them from the data. However, a clustering algorithm will just put the abnormal group of cases in a separate cluster.

What I find isolation forests do well is that they first pick off the false or bad cases, and only when those are all identified start on the valid, abnormal cases. This is because valid cases, however abnormal, are often still grouped together, whereas bad cases are truly unique. This is not true for all analyses; if a default is 999 for example, there may be many cases with that value on some variable. It’s up to the data scientist to be vigilant in those cases.

Isolation Forests

As the name suggests, isolation forests are based on random forests. I’ll dig into decision trees and random forests some other time, but here’s what you need to know: decision trees split data into classes in order to minimize prediction error. On each iteration, the tree gets to make one split on one of the included variables to remove the most entropy, or degree of uncertainty. For example, a decision tree could first split cases into younger and older people, when predicting SES. Subsequently, it could split the younger group into people with and without college degrees to remove entropy, and so on.

Enter random forests: large, powerful ensembles of trees, in which the individual quality of each tree is diminished because it is built on a random sample of cases and variables, but with low prediction error because the errors of individual trees tend to cancel out when their predictions are averaged or put to a vote.

An isolation forest is based on the following principles (according to Liu et al.): outliers are the minority and have abnormal behaviour on variables, compared to normal cases. Therefore, given a decision tree whose sole purpose is to identify a certain data point, fewer dataset splits should be required for isolating an outlier than for isolating a common data point. This is illustrated in the following plot:

Based on this, essentially what an isolation forest does is construct a large number of random trees, each on a random sample of the data. In each tree, each split is based on selecting a random variable, and a random value on that variable. Subsequently, data points are ranked on how few splits it took, on average across the trees, to isolate them. Given that the model’s instructions were to identify X% as outliers, the top X% of cases on rank score are returned.
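The isolation idea can be tried out without any library. The sketch below is a simplification under my own assumptions and is not how scikit-learn implements it: instead of building shared trees on subsamples, it simply repeats random splits for one target point until that point is alone, and averages the number of splits over 100 repetitions. The function names and the simulated data are made up for illustration.

```python
import random

def isolation_depth(target, data, rng, max_tries=200):
    """Number of random splits needed until `target` is alone in its partition."""
    subset = list(data)
    depth = 0
    for _ in range(max_tries):
        if len(subset) <= 1:
            break
        feature = rng.randrange(len(target))
        lo = min(p[feature] for p in subset)
        hi = max(p[feature] for p in subset)
        if lo == hi:
            continue  # this variable cannot separate the remaining points
        split = rng.uniform(lo, hi)
        # Keep only the side of the split that contains the target
        if target[feature] < split:
            subset = [p for p in subset if p[feature] < split]
        else:
            subset = [p for p in subset if p[feature] >= split]
        depth += 1
    return depth

def average_depth(target, data, n_trees=100, seed=0):
    rng = random.Random(seed)
    return sum(isolation_depth(target, data, rng) for _ in range(n_trees)) / n_trees

# Simulated data: a cloud of common cases plus one far-away point
rng = random.Random(42)
inliers = [(rng.gauss(50, 5), rng.gauss(50, 5)) for _ in range(200)]
outlier = (90.0, 90.0)
data = inliers + [outlier]

# The common case closest to the centre of the cloud
central = min(inliers, key=lambda p: (p[0] - 50) ** 2 + (p[1] - 50) ** 2)

avg_outlier = average_depth(outlier, data)
avg_central = average_depth(central, data)
print(avg_outlier, avg_central)
```

The far-away point should need clearly fewer splits on average than the point in the middle of the cloud, which is exactly the signal an isolation forest ranks on.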

Isolation forests perform well because they deliberately target outliers, instead of defining abnormal cases based on normal case behaviour in the data. They are also quite efficient; I’ve easily applied them on datasets containing millions of cases.

Now for the practical bit. Let’s generate some data. I generate a large sample of definite inliers, then some valid outliers, then some bad cases.

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.ensemble import IsolationForest

##first, build a simulated 2-dimensional dataset
##(draw scalars rather than 1-element arrays, so the DataFrame columns are plain floats)
inliers = [[np.random.normal(50, 5), np.random.normal(50, 5)] for i in range(900)]
outliers1 = [[np.random.normal(70, 1), np.random.normal(70, 1)] for i in range(50)]
outliers2 = [[np.random.normal(50, 20), np.random.normal(50, 20)] for i in range(50)]

#merge and reshape the data points
df = inliers + outliers1 + outliers2
v1, v2 = [i[0] for i in df],[i[1] for i in df]

df = pd.DataFrame({'v1': v1, 'v2': v2})
##shuffle the rows
df = df.sample(frac=1).reset_index(drop=True)

##and look at the data
plt.figure(figsize = (20,10))
plt.scatter(df['v1'], df['v2'])
plt.show()

Subsequently, we’ll train an isolation forest to identify outliers and examine the results. 5% of the set were generated as outliers. However, some of these are likely to have ended up as inliers. Therefore I set the outlier identification threshold to 4%.

##apply an Isolation forest
outlier_detect = IsolationForest(n_estimators=100, max_samples=1000, contamination=.04, max_features=df.shape[1])
outlier_detect.fit(df)
outliers_predicted = outlier_detect.predict(df)

#check the results; predict() returns 1 for inliers and -1 for outliers
df['outlier'] = outliers_predicted
plt.figure(figsize = (20,10))
plt.scatter(df['v1'], df['v2'], c=df['outlier'])
plt.show()

As you see, the isolation forest nicely separates bad cases from actual data patterns, with barely any errors. If I now wanted to remove the rightmost cluster as well, I could just increase my removal threshold. The next plot is the result with the threshold set to 15%. One can clearly see that it now starts picking at cases in the smaller cluster and in the periphery of the larger cluster. Keep in mind that this algorithm will always incur slight error, due to random generation of splits.

Hyperparameter tuning

This is how to tune the following parameters for optimal performance:

  • n_estimators: The number of decision trees in the forest. According to Liu et al., 100 should be sufficient in most cases.
  • max_samples: Given large datasets, you might want to train on random subsets of cases to decrease training time. This parameter lets you determine subset size.
  • contamination: The proportion of the data you want to identify as outlier. As demonstrated before, this parameter requires some trial and error combined with scatterplot visualisation, given no prior knowledge.
  • max_features: The number of variables that each tree draws to define outliers on. Should be set to the number of variables that you have in almost any situation. This feature allows one to iterate over variables, and do univariate outlier detection on each variable without specifying a standard deviation threshold.

That’s all you need to know for applying isolation forests to multivariate outlier detection! Please refer to this blog and to the scikit-learn documentation if you use any information written here in your own work.