Vector embeddings part 2: Country Embeddings with TensorFlow and Keras

Following this previous post on neural network vector embeddings for text, I wanted to experiment some more with creating embeddings for items in a dataset. As you may remember, vector embeddings are lists of numbers that represent the location of objects in N-dimensional space. Essentially, embeddings encode categorical data as continuous data. Objects that are more similar or compatible are placed closer to each other in vector space, while dissimilar objects end up further apart. This (cosine) similarity is rooted in co-occurrence in the data: if two items appear together often, they are placed closer together. It’s a bit like a sky full of stars that each have their own unique location, except that in this type of analysis the sky has between 50 and 300 dimensions.

As I described in my previous post, one could capture the meanings of words in relation to each other by generating embeddings from words that often co-occur. This is the most prevalent application of embeddings. Yet, there are other fascinating applications of vector embeddings. For example, one could embed sales data on which items are frequently bought together, in order to make purchase recommendations to customers. Even more interesting: encoding people and the networks between them, based on how often they interact on social media.

Credit where credit is due: inspiration for this post partially came from Towards Data Science and Deep Learning with Keras by Gulli & Pal (2017).

Summary for impatient readers

In the current post, I’ll describe a pet project I’ve been working on recently: recreating a world map of countries from purely relational data on those countries. Here’s a summary of what I did. Firstly, I gathered some data in which some countries co-occur and others don’t. Subsequently, I generated 50-dimensional embeddings for each country with a neural network. Lastly, I applied TSNE dimension reduction down to two dimensions, and plotted those two dimensions to see whether they in any way resemble either a world map, or the geographical / diplomatic relations between countries. Stick around until the end of the post to see whether I succeeded!

These are my hypotheses:

  • I expect to see basic grouping of countries based on their respective continents and specific geographical locations. For example, the Netherlands might be wedged between Germany and Belgium, within the Europe cluster.
  • Countries that have strong diplomatic bonds should be close together. For example, I expect the UK, Ireland and Australia to flock together.

And these are all the techniques I employed in this project:

  • Wikipedia API
  • Neural network embeddings with Keras and Tensorflow
  • t-distributed stochastic neighbor embedding (TSNE) from scikit-learn, for dimension reduction
  • Checking bilateral combinations with cosine similarity / dot products
  • 2D scatterplot with Bokeh

Countries

Of course, if I want to work with countries, I first need a comprehensive list of them. After some quick Googling, I found a United Nations .xlsx file: of all the online country lists I came across, this one seemed the most reliable.

That file is not directly usable in Python. So, I did some manual data massaging, added the respective continent to each country, and saved it all in a .csv. This is what I ended up with:

So, let’s load that info into some Python lists:

##get the imports out of the way
import wikipedia
import numpy as np
import pandas as pd
import csv
import itertools
from collections import defaultdict
import random
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE

from keras.layers import dot, Input, Dropout
from keras.layers.core import Dense, Reshape
from keras.layers.embeddings import Embedding
from keras.models import Model
from scipy.spatial.distance import cdist

from bokeh.plotting import figure, show, output_file, save
from bokeh.models import ColumnDataSource, HoverTool, CategoricalColorMapper
from bokeh.palettes import d3
from bokeh.io import show, output_notebook
output_notebook()

##load country and continent names from csv file
countries, continents = [],[]
with open('C:/Users/teun/Documents/Blog/countries.txt', 'rt') as file: 
    reader = csv.reader(file, delimiter = ',')
    for line in reader: 
        countries.append(line[0])
        continents.append(line[1])
        
print(countries[:10])
print(continents[:10])
print(len(countries))

##['Algeria', 'Egypt', 'Libya', 'Morocco', 'Tunisia', 'Western Sahara', 'Burundi', 'Comoros', 'Djibouti', 'Eritrea']
##['Africa', 'Africa', 'Africa', 'Africa', 'Africa', 'Africa', 'Africa', 'Africa', 'Africa', 'Africa']
##187

Apparently, this UN-based list contains 187 countries and territories.

Wikipedia Data

Yet, we need more data. As I mentioned before, we need a dataset in which co-occurrence of countries is represented. At first, I wanted to gather lots of news articles and extract country co-occurrence from those. I tried the New York Times API for that. I found that country combinations are quite sparse; countries are not mentioned that often. Moreover, the NYT API has quite restrictive rate limiting, rendering my data acquisition super slow.

However, I got the inspiration for better data from Will Koehrsen at Towards Data Science; Wikipedia provides a foolproof API for gathering full articles, so that web scraping is not necessary. Basically, I acquired the full Wikipedia text for every country:

##acquire Wikipedia text for each country
pages = []
for country in countries: 
    wiki = wikipedia.page(country)
    pages.append(wiki.content)

print(pages[0][:400])
##Algeria ( (listen); Arabic: الجزائر‎ al-Jazā'ir, Algerian Arabic الدزاير al-dzāyīr; French: Algérie), officially the People's Democratic Republic of Algeria (Arabic: الجمهورية الجزائرية الديمقراطية الشعبية‎), is a country in the Maghreb region of North Africa. The capital and most populous city is Algiers, located in the far north of the country on the Mediterranean coast. With an area of 2,381,74

Then, I extracted the country names that are mentioned in each article:

##create a list of lists with countries that occur together within articles 
combinations = []
for page in pages: 
    combination = []
    for country in countries: 
        if country in page: 
            combination.append(country)
    combinations.append(combination)

print(combinations[1])
##['Egypt', 'Libya', 'Eritrea', 'Ethiopia', 'Somalia', 'Sudan', 'South Africa', 'Niger', 'Nigeria', 'China', 'Bangladesh', 'India', 'Iran', 'Armenia', 'Cyprus', 'Iraq', 'Israel', 'Jordan', 'Kuwait', 'Qatar', 'Saudi Arabia', 'Syria', 'Turkey', 'United Arab Emirates', 'Yemen', 'Argentina', 'Albania', 'Greece', 'Russia', 'France', 'Italy', 'Malta', 'Canada', 'United States']

Generating true and false bilateral combinations

Now, for embeddings, we need bilateral combinations. I’ll assume that all possible bilateral combinations within a Wikipedia article are valid. Generating them is quite simple:

##generate two-way combinations
def generate_true_combinations(sets): 
    true_combinations = []
    for country_set in sets: 
        for combination in itertools.combinations(country_set, 2):
            true_combinations.append(list(combination))
    return true_combinations

true_combinations = generate_true_combinations(combinations)
print(true_combinations[:10])

##[['Algeria', 'Egypt'], ['Algeria', 'Libya'], ['Algeria', 'Morocco'], ['Algeria', 'Tunisia'], ['Algeria', 'Western Sahara'], ['Algeria', 'Mali'], ['Algeria', 'Mauritania'], ['Algeria', 'Niger'], ['Algeria', 'Nigeria'], ['Algeria', 'Jordan']]

Yet, we also need some “false” combinations. I’ll explain; we will train our neural network by feeding it country combinations, upon which the network will place those closer together in vector space. However, we also need to feed the network bilateral combinations that did not occur in the data. We can generate those randomly from our true combination data:

##generate false two-way combinations that are randomly sampled from the full true set
def generate_false_combinations(true_combinations, desired_amount):
    false_combinations = []
    counter = 0
    ##this list of unique combinations is required for checking whether a randomly generated combination exists or not. 
    ##first, create sets of tuples; sets only work with immutable objects such as tuples
    combinations_unique = set(tuple(c) for c in true_combinations)
    ##and transform back to list of lists 
    combinations_unique = [list(c) for c in combinations_unique]
    
    #generate bidirectional unique combinations dict for performance improvement
    true_dict = defaultdict(list)
    
    for c in combinations_unique: 
        true_dict[c[0]].append(c[1]) 
        true_dict[c[1]].append(c[0])
        
    ##generate units0 and units1 lists, which contain all items in respectively position 0 and 1 for each combination in the full set
    units0 = [item[0] for item in true_combinations]
    units1 = [item[1] for item in true_combinations]
    
    while counter < desired_amount:
            
        #random items need to be generated from the full true set, to account for prevalence. 
        c = [random.choice(units0), random.choice(units1)]
            
        #check that c did not exist in the true combinations: 
        if c[0] != c[1]: 
            if c[0] not in true_dict[c[1]]: 
                false_combinations.append(c)
                counter += 1        

    return false_combinations
    
false_combinations = generate_false_combinations(true_combinations, len(true_combinations))

print(false_combinations[:10]) 

##[['New Caledonia', 'Bolivia'], ['Myanmar', 'Swaziland'], ['Sudan', 'Trinidad and Tobago'], ['Armenia', 'Macau'], ['Guyana', 'Jordan'], ['Slovakia', 'Ghana'], ['Moldova', 'Ecuador'], ['Sierra Leone', 'Czech Republic'], ['Mongolia', 'Czech Republic'], ['Bangladesh', 'Latvia']]

Now that we have equal length lists of both types of country combinations, we can merge them, add true / false labels, and randomize the order:

##merge true and false combinations, add binary number to be predicted, randomize order
def merge_combinations(true_combinations, false_combinations):
            
    ##create true and false labels
    labels_true = [1 for i in range(len(true_combinations))]
    labels_false = [0 for i in range(len(false_combinations))]
    
    ##combine true and false combinations
    combinations = true_combinations + false_combinations
    labels = labels_true + labels_false
    
    ##shuffle sequence after zipping the lists
    z = list(zip(combinations, labels))
    random.shuffle(z)
    
    ##unzip and return
    combinations, labels = zip(*z)
    return combinations, labels

##generate shuffled training set of true/false combinations, with labels
combinations, labels = merge_combinations(true_combinations, false_combinations)

##further data prep
print('there are ', len(combinations), 'country combinations in the set')
##there are  132489 country combinations in the set

##split combinations into lists first and second country
country1, country2 = zip(*combinations)

There is one last step to be taken before training the neural network. All our countries are still represented by their names. However, neural networks only take in numerical data. Therefore, I generate a lookup dictionary in which every country is represented by an index, and subsequently transform every country in the dataset to the corresponding index number:

##neural networks use numerical data as input, so assign an index number to every country
##and generate a country-to-index lookup dict, plus a reverse lookup dict
country_to_index = {}
index_to_country = {}

for idx, country in enumerate(countries): 
    country_to_index[country] = idx
    index_to_country[idx] = country

##and transform all the countries in our set to their respective indices
country1 = [country_to_index[i] for i in country1]
country2 = [country_to_index[i] for i in country2]

print(country1[:10])

##keras uses numpy arrays as input, so transform training lists to those
country1 = np.array(country1, dtype="int32")
country2 = np.array(country2, dtype="int32")
labels = np.array(labels, dtype = "int32")

Training the model

In my previous post, I used word2vec modelling with Gensim to generate embeddings. This time around though, I wanted to know more about how the neural network that generates embeddings actually works.

I’ll provide a quick rundown. All true and false combinations are run through the model a number of times. The model has two inputs, into which the respective countries of a combination are fed for each training iteration. Subsequently, the model looks up the appropriate embedding for each country. Then, it computes the dot product between those embeddings: each number in one embedding is multiplied by the corresponding number in the other embedding, and the sum of all those products is taken. That sum is passed through a sigmoid, so a sum well above 0 yields a prediction close to 1 (true combination), and a sum well below 0 yields a prediction close to 0 (false combination). Lastly, the predicted output is compared to the actual label accompanying the country combination. This is where backpropagation takes place; both embeddings are adjusted so that the next time this combination comes around, the prediction will be slightly better.

Here’s the code:

#specify neural network parameters
epochs = 100
batch_size = 10000
embed_size = 50

##keras API model specification
input_country1 = Input(name = 'input_country1', shape = [1])
input_country2 = Input(name = 'input_country2', shape = [1])

##embedding layers for both the first and second country
model_country1 = Embedding(input_dim = len(countries), output_dim = embed_size)(input_country1)
model_country2 = Embedding(input_dim = len(countries), output_dim = embed_size)(input_country2)

#reshape those 
model_country1 = Reshape((embed_size, 1))(model_country1)
model_country2 = Reshape((embed_size, 1))(model_country2)

#merge embeddings with dot product
dot_ = dot([model_country1, model_country2], axes=1, normalize = False)
dot_ = Reshape(target_shape = [1])(dot_)

##predict true/false with dot product/cosine similarity
output = Dense(1,activation = "sigmoid")(dot_)

##compile
model = Model(inputs=[input_country1, input_country2], outputs=output)
model.compile(loss = "binary_crossentropy", optimizer = "adam", metrics = ['accuracy'])
model.summary()

##train model
history = model.fit([country1, country2], labels, batch_size=batch_size, epochs=epochs, verbose = 1, validation_split=0.1)

##assess model performance
plt.plot(history.history['acc'])
plt.plot(history.history['loss'])
plt.show()

At around 99% validation set accuracy, I was happy with the performance and stopped tweaking. Because the validation split only serves to monitor training and barely influenced my choices here, I didn’t bother with a separate test set (do use one if you’re building this type of model professionally!). Accuracy and loss curves:

Checking out the results

Now, our model is quite proficient at predicting 0s and 1s. Yet, what we actually want are the embeddings. Let’s retrieve them:

##retrieve embeddings
weights = model.layers[2].get_weights()[0]
print(weights[0][:10])

##[-0.44104806 -0.14912751 -0.6652928   0.37164855 -0.37634224  0.08819574
##  0.48560727  0.29437146  0.25467378 -0.6605289 ]

Then, we need some functions for computing dot products between countries. First, we normalize the embeddings and define a function for the cosine similarity between two countries:

##normalize the embeddings so that the dot product between two embeddings becomes the cosine similarity.
##normalizing just means dividing each vector by its length: the square root of the sum of squared components
norms = np.linalg.norm(weights, axis = 1)
norms = norms.reshape((-1, 1))
weights = weights / norms

##compute most/least similar indices with dot product
def sim_dot(c1, c2, weights=weights, country_to_index=country_to_index): 
    
    ##retrieve indices, then embeddings
    weights1 = weights[country_to_index[c1]]
    weights2 = weights[country_to_index[c2]]
    
    ##compute dot product
    sim = np.dot(weights1,weights2)
    
    return sim

print(sim_dot('Netherlands','Belgium'))
##0.895393
print(sim_dot('United Kingdom','Australia'))
##0.967548
print(sim_dot('Belgium','Togo'))
##0.239391
print(sim_dot('United States','Russia'))
##0.104014

Next, a function for retrieving the highest (or lowest) ranked countries for a certain country:

##compute most or least similar countries to a certain country
def top_similar(country, top_K, how, weights=weights, country_to_index=country_to_index, countries=countries):
    ##retrieve all similarities
    sims = [sim_dot(country, c2,  weights=weights, country_to_index=country_to_index) for c2 in countries]
    sims = zip(sims, countries)
    
    ##order descending
    sims_ordered = sorted(sims, reverse = True)
    
    ##return only the highest k ones, or the lowest k ones, dependent on the specified parameter
    if how == 0: 
        sims_ordered = sims_ordered[1:top_K + 1] 
    elif how == 1: 
        sims_ordered = sims_ordered[-top_K:]
    
    return sims_ordered

print(top_similar('Australia',10, 0))
##[(0.97386092, 'Canada'), (0.97085083, 'Spain'), (0.96754754, 'United Kingdom'), (0.95988286, 'Japan'), (0.95359182, 'Switzerland'), (0.93311918, 'Portugal'), (0.92666471, 'Italy'), (0.91799873, 'New Zealand'), (0.90552145, 'Ireland'), (0.89424378, 'Netherlands')]
print(top_similar('Netherlands',10,0))
##[(0.94961381, 'Portugal'), (0.94443583, 'Sweden'), (0.94182992, 'Ireland'), (0.93346846, 'Switzerland'), (0.91344929, 'Spain'), (0.89897841, 'Denmark'), (0.89634496, 'Luxembourg'), (0.89585704, 'New Zealand'), (0.89539278, 'Belgium'), (0.89475739, 'Austria')]
print(top_similar('Australia',10, 1))
##[(0.11294153, 'Grenada'), (0.11038196, 'Malawi'), (0.1062203, 'United States'), (0.06896022, 'Barbados'), (0.060145646, 'Maldives'), (0.024445064, 'Burkina Faso'), (0.00057541206, 'Mauritius'), (-0.029210068, 'Saint Lucia'), (-0.12130371, 'United States Virgin Islands'), (-0.18051396, 'Georgia (country)')]
print(top_similar('Netherlands',10, 1))
##[(0.20300971, 'Indonesia'), (0.18801141, 'India'), (0.1720216, 'China'), (0.16769037, 'Burkina Faso'), (0.15285541, 'Palestinian territories'), (0.14719915, 'United States Virgin Islands'), (0.12747405, 'United States'), (0.11045419, 'Maldives'), (0.030738026, 'Georgia (country)'), (0.0073090047, 'Mauritius')]

You can clearly see in the results that this model has captured actual relationships between countries, without any access to geographical information!

Visualization in 2 dimensions

In order to make the embeddings comprehensible for our dimensionally-challenged human peanut brains, I’ll apply dimension reduction with TSNE, a popular ML method. Now, dimension reduction is not ideal; there are a few drawbacks. Firstly, although it captures quite a lot, dimension reduction will of course never represent the full richness of 50-dimensional representations in 2 dimensions. Secondly, training TSNE is a handful: underfit the model, and it will not capture any relevant information; overfit the model, and it will start separating individual items from their group, because that apparently decreases the error. You can explore the behaviour of TSNE with the TensorFlow Embedding Projector. Still, some fiddling with the parameters usually produces quite a nice result. Here we go:

## TSNE visualisation; given a large amount of dimensions, summarize those to two (or three) dimensions
##TSNE uses as input a list of lists / np ndarray with each list as one vector embedding
tsne = TSNE(n_components=2, metric = 'cosine', verbose=1, perplexity=20, n_iter=400).fit_transform(weights)

##extract 2-dimensional embeddings from tsne
x = [i[0] for i in tsne]
y = [i[1] for i in tsne]

#transform data to pandas
viz_df = pd.DataFrame({'x': x, 'y': y, 'country' :countries, 'continent': continents})

##initialize a Bokeh dataset
source = ColumnDataSource(viz_df)

##define empty plot
p = figure(title = 'TSNE-visualization of country-embeddings',plot_width = 950, plot_height = 950)

##define continent colors
length = len(viz_df['continent'].unique())
palette = d3['Category10'][length]
color_map = CategoricalColorMapper(factors=['Africa', 'Asia', 'Caribbean/Central America', 'South America',
                                            'Oceania', 'Europe', 'North America'], palette=palette)

##fill the plot with colored country dots
p.scatter(x='x', y='y', size = 10, source = source,legend='continent', color={'field': 'continent', 'transform': color_map})

##add a hovertool so we can see which dot represents which country
hover = HoverTool(tooltips = [('Country:', '@country'),('Continent:','@continent')])
p.add_tools(hover)

output_file("countryembeddings.html")
save(p)

Here’s the plot. You can clearly see some of the continents grouped together! The image is currently static. I’ll try to embed a dynamic HTML file here shortly.

Results

Throughout this post, we’ve seen some quite fascinating things. Time to compare them to the hypotheses:

  • I expect to see basic grouping of countries based on the continent that they are in, maybe even their specific geographical location. For example, the Netherlands should be wedged between Germany and Belgium, within the Europe cluster.

This one was confirmed. Both when computing dot products and when looking at the visualization, one can clearly see that countries which are geographically close together flock together in 50-dimensional space as well. The continents are distinctly recognizable in the 2D image. It is even possible to tell Eastern and Western Europe apart!

Of course, it wasn’t exactly on point. Apparently, New Zealand and Australia are located in Europe 😉. However, this is to be expected from an analysis in which half of the data were randomly generated!

  • Countries that have strong diplomatic bonds should be close together. For example, I expect the UK, Ireland and Australia to flock together.

True! As I said before, this even caused New Zealand and Australia to be located in Europe! Do note that the embeddings reflect diplomatic bonds only to the extent that Wikipedia actually mentions them.

Some more thoughts on vector embeddings: I find the technique a fascinating and useful tool for data science applications. I’ve used it in professional settings as well, usually in order to provide secondary neural networks with ready-made object representations. You could build both the embeddings and the other neural network layers required for a classification problem within one model. Yet, I find it convenient to have full control over embedding quality before I start on the secondary prediction.

Especially interesting to me is the manner in which neural network embeddings generate dense, rich, continuous representations from sparse categorical data. In my analysis, only a small subset of all possible bilateral country combinations was represented, which is quite sparse. Yet, we ended up with a dense representation that relates every country to every other country.

To exemplify this: let’s say that Norway and Sweden are almost identical with regard to all the other countries, yet only Sweden has a relationship with Germany in our data. As a result, the relationship between Norway and Germany will be approximated from all the relations these countries have with other countries, such as Sweden. In other words, we are filling in / guessing lots of information that is not explicitly available to us, enriching our data in the process.

That’s it for today. Stay tuned for my next hobby project; reinforcement learning with Keras and PyGame!

Vector embeddings part 1: Word2vec with Gensim

Since the advent of neural networks, vector embeddings for text processing have gained traction in both scientific and applied text classification problems, for example in text sentiment analysis. Using (pre-trained) embeddings has become a de facto standard for attaining a high rating in scientific sentiment analysis contests such as SemEval. However, vector embeddings are finding their way into many other fascinating applications as well.

In the current post, I want to provide an intuitive, easy-to-understand introduction to the subject. I will touch upon both vector embeddings in general, and their application to text analysis in particular. Furthermore, I’ll provide some quite simple code with Python and Gensim for generating vector embeddings from text. In further posts, I will apply vector embeddings to transforming categorical variables to continuous ones, and to generating vector definitions in N-dimensional space for entities.

I’ll cycle through the following points:

  • What are vector embeddings?
  • How do you use vector embeddings?
  • Why do you need vector embeddings?
  • How do you generate vector embeddings?
  • How does the neural network optimize the vector matrix?
  • A practical example (with code): word vectorization for tweets.

What are vector embeddings?

In a nutshell, a vector embedding for a certain item (a word, or a person, or any other object you can think of) is a mathematical definition for that specific item. This definition consists of a list of float numbers between -1 and 1, which is called a vector.

A vector is a specific location in N-dimensional space, relative to the zero point of that space. So, a vector embedding for an item means that the item is represented by a vector, embedded in a space with as many dimensions as there are numbers in the vector. Every position in the vector indicates the location of the item on a certain axis: the first number represents the X-axis, the second represents the Y-axis, and so forth.

To keep it simple, let’s say the vector embedding for the word ‘bird’ equals [0.6, -0.2, 0.6]. Therefore, ‘bird’ is represented in a specific spot in 3-dimensional space. Simple, right?

Like the following image:

However, it gets more complicated when you start adding dimensions. As a rule of thumb, 50 dimensions is a usual vector length for representing things like grocery items, or people, or countries. For language-related tasks, somewhere in the range of 200-300 dimensions is more common. Our human brains are just not equipped for imagining anything beyond 3 to 4 dimensions, so we’ll have to trust the math.

Lastly, a constraint that is often put on vector embeddings is that they are normalized to unit length, meaning that the squares of their components add up to 1.

How do you use vector embeddings?

Of course, a list of float numbers on its own does not mean anything. It only becomes interesting once you start factoring in that you can encode multiple items in the same high-dimensional space. Naturally, items that are closer to each other can be thought of as more similar or compatible. Consequently, items that are far away are dissimilar. Items that are neither far away nor close by (think of it like a 90 degree angle, just in high-dimensional space) do not really have any discernible relationship to each other. The mathematical term for this is ‘orthogonal’.

The measure that is used for checking whether two vectors are similar / close is called cosine similarity. For normalized vectors, it can be computed as the dot product between the two vectors. It goes like this: given two vectors [A, B, C] and [D, E, F], the numbers in each position are multiplied pairwise and the products are summed, so the dot product equals A times D, plus B times E, plus C times F. The outcome is a single number, which is called a scalar in linear algebra.

The trick here is that when two vectors have similar values in the same positions within the -1 to 1 space, the products for those positions are large and positive. Therefore, the dot product between two vectors increases as they resemble each other more.

To test this out, let’s add two more words:

‘bird’                   = [0.6, -0.2, 0.6]

‘wing’                 = [0.7, -0.3, 0.6]

‘lawnmower’    = [-0.2, 0.8, 0.4]

Now what we have here is actually our first vector matrix, which is just a fancy name for a set of vectors.

Intuitively, ‘bird’ and ‘wing’ should be quite similar, so have a high dot product. Let’s try it out:

Dot([0.7, -0.3, 0.6], [0.6, -0.2, 0.6]) = 0.42 + 0.06 + 0.36 = 0.84

In contrast, ‘bird’ and ‘lawnmower’ shouldn’t really be related in any way, so the dot product should be close to 0:

Dot([0.6, -0.2, 0.6], [-0.2, 0.8, 0.4]) = -0.12 - 0.16 + 0.24 = -0.04
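If you want to verify these numbers yourself, here is a tiny NumPy snippet; the vectors are just the toy examples above, not real embeddings:

##verify the toy dot products with numpy
import numpy as np

bird = np.array([0.6, -0.2, 0.6])
wing = np.array([0.7, -0.3, 0.6])
lawnmower = np.array([-0.2, 0.8, 0.4])

print(np.dot(bird, wing))        ##roughly 0.84
print(np.dot(bird, lawnmower))   ##roughly -0.04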

When you start adding lots of words to your vector matrix, it will start becoming a word cloud in N-dimensional space, in which certain clusters of words can be recognized. Furthermore, vector embeddings do not only represent bilateral relationships between items, but also capture more complicated relationships. In our language use, we have concepts of how words relate to each other: we understand that a leg is to a human what a paw is to a cat. If our model is trained correctly on sufficient data, these types of relationships should be captured as well.

Why do you need vector embeddings?

In any language-related machine learning task, your results will be best when using an algorithm that actually ‘speaks’ the language you’re aiming to analyse. In other words, you might want to use some technique that somehow captures the meanings of the words and sentences you’re processing.

Now, all ML algorithms require numerical input. So, a requirement for an algorithm that captures language meaning is that it works numerically: every word should be represented by either a number or a series of numbers.

Generally, for transforming categorical entities such as words to numerical representations, one would apply one-hot encoding: for each unique entity (the number of unique entities is called the cardinality of the variable), a separate column is generated, filled with 1s and 0s, like this:

Country Country_Greece Country_Chile Country_Italy
Greece 1 0 0
Chile 0 1 0
Italy 0 0 1
Chile 0 1 0

Essentially, one-hot encoding already transforms entities to vectors; each entity is represented by a list of numbers. However, this transformation has a number of drawbacks;

  • If you apply one-hot encoding to a text corpus, you’ll need to create a massive amount of extra columns. That’s both very inefficient and complicated to analyse and interpret. Text datasets often contain thousands of unique words. In other words; the created vector is too large to be useful.
  • Due to the Boolean nature of this transformation, each entity is equally similar to every other entity; there is no way of placing more similar items closer to each other. Yet, we also want to capture the meanings of words. Therefore, we need some way of representing whether two words are closely related or not.

Embedding entities as vectors solves both these problems. Summarized, the method lets you encode every word in your dataset as a short list of float numbers, which are all between -1 and 1. The more similar two words are, the more their respective lists resemble each other.

How do you generate vector embeddings?

As I described before, similarity is essential to understanding vector embeddings. In the case of actually generating the embeddings from scratch, we’ll need a measure of similarity. Usually, in text analysis, we derive that from word co-occurrence in a text corpus. Essentially, words that often appear close together within sentences are theorized to be quite similar. The opposite is true for dissimilar words: they rarely appear close together in text.

This is where neural networks come in: you use a neural network to make two vectors more similar to each other each time the underlying entities occur together, and to make them more dissimilar when the entities don’t occur together.

Firstly, we need to transform our dataset into a usable format. Since the neural network requires some number to be predicted, we need a binary output variable: all similar word combinations are accompanied by a 1, and all dissimilar combinations by a 0, like this:

First word Second word Binary output
‘bird’ ‘wing’ 1
‘bird’ ‘lawnmower’ 0
‘wing’ ‘lawnmower’ 0

These word combinations, accompanied by the output variable, are fed to the neural network. Essentially, the neural network predicts whether a certain combination of words should be accompanied by a 1 or a 0, and if it’s wrong compared to the real value, subsequently adjusts the weights in its internal vector matrix to make a more correct prediction next time that specific combination comes around.

How does the neural network optimize the vector matrix?

I’ll try to keep it simple. The neural network is trained by feeding it lots of bilateral word combinations, accompanied by 0s and 1s. Once a certain word combination enters the model, the two respective word vectors are retrieved from the vector matrix. At first, the numbers in the vectors are randomly generated, close to 0.

Subsequently, the dot product of those two specific vectors is calculated and passed through a sigmoid transformation, which squashes it into a value between 0 and 1. Based on that value, the model predicts either a 0 (non-combination) or a 1 (actual combination).

The learning of the model, as with all neural networks, happens through backpropagation; after feeding a word combination to the network, the model prediction is compared with the actual outcome value. If those don’t match, this is propagated back to the embedding layer, where the embeddings are adjusted accordingly. If two words occur together frequently as a true combination, their vector embeddings will be adjusted to resemble each other more. The opposite happens for false combinations; those embeddings are adjusted so that they resemble each other less.

After training, we end up with an embedding matrix containing optimized embeddings for all words in the vocabulary. We discard the sigmoid layer, so that we can examine the dot products between different words directly.
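To make the mechanics concrete, here is a bare-bones NumPy sketch of a single training step on one word pair. Everything in it (vocabulary size, learning rate, the example indices) is made up for illustration; libraries such as Gensim do this for you, with many additional optimizations:

##minimal sketch of one embedding update step (illustrative only)
import numpy as np

vocab_size, embed_size, learning_rate = 1000, 50, 0.05
embeddings = np.random.uniform(-0.01, 0.01, (vocab_size, embed_size))

def train_pair(idx1, idx2, label):
    ##label is 1 for a true combination, 0 for a false one
    v1, v2 = embeddings[idx1].copy(), embeddings[idx2].copy()
    prediction = 1 / (1 + np.exp(-np.dot(v1, v2)))   ##sigmoid of the dot product
    error = prediction - label
    ##backpropagation: nudge both vectors towards (label 1) or away from (label 0) each other
    embeddings[idx1] -= learning_rate * error * v2
    embeddings[idx2] -= learning_rate * error * v1

train_pair(3, 17, 1)   ##e.g. 'bird' and 'wing' occurred together
train_pair(3, 42, 0)   ##e.g. 'bird' and 'lawnmower' did not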

A practical example (with code): word vectorization for tweets.

So, neural network vector embeddings can be challenging to understand. However, there are some libraries available that do all the heavy lifting in terms of programming; I’ll be using Gensim.

Some library imports:

import pandas as pd
import numpy as np
import csv
from gensim.models import word2vec
import nltk
import re
import collections

Firstly, we need some textual data. I chose a set of 1.6 million tweets that was provided for a sentiment analysis competition on Kaggle. The data come in .csv format; let’s load them. I’ll keep only a subset, for quick training.

##open the kaggle file. We only need the tweet text, which is the last field (index 5) of every line
with open('sentiment140/kaggle_set.csv', 'r') as f:
    r = csv.reader(f)
    tweets = [line[5] for line in r]
tweets = tweets[:400000]

Here’s what they look like.

print(tweets[:10])

#["@switchfoot http://twitpic.com/2y1zl - Awww, that's a bummer.  You shoulda got David Carr of Third Day to do it. ;D", "is upset that he can't update his Facebook by texting it... and might cry as a result  School today also. Blah!", '@Kenichan I dived many times for the ball. Managed to save 50%  The rest go out of bounds', 'my whole body feels itchy and like its on fire ', "@nationwideclass no, it's not behaving at all. i'm mad. why am i here? because I can't see you all over there. ", '@Kwesidei not the whole crew ', 'Need a hug ', "@LOLTrish hey  long time no see! Yes.. Rains a bit ,only a bit  LOL , I'm fine thanks , how's you ?", "@Tatiana_K nope they didn't have it ", '@twittera que me muera ? ']

It appears that some data cleaning is in order.

##clean the data. We'll use nltk tokenization, which is not perfect. 
##Therefore I remove some of the things nltk does not recognize with regex
emoticon_str = r"[:=;X][oO\-]?[D\)\]pP\(/\\O]"
tweets_clean = []

counter = collections.Counter()
for t in tweets:
    t = t.lower()
    ##remove all emoticons
    t = re.sub(emoticon_str, '', t)
    ##remove username mentions, they usually don't mean anything
    t = re.sub(r'(?:@[\w_]+)', '', t)
    ##remove urls
    t = re.sub(r"http\S+", '', t)
    ##remove all punctuation
    t = re.sub(r'[^\w\s]','',t)
    ##tokenize the remainder
    words = nltk.word_tokenize(t)
    tweets_clean.append(words)

That’s better:

print(tweets_clean[:10])
##[['awww', 'thats', 'a', 'bummer', 'you', 'shoulda', 'got', 'david', 'carr', 'of', 'third', 'day', 'to', 'do', 'it', 'd'], ['is', 'upset', 'that', 'he', 'cant', 'update', 'his', 'facebook', 'by', 'texting', 'it', 'and', 'might', 'cry', 'as', 'a', 'result', 'school', 'today', 'also', 'blah'], ['i', 'dived', 'many', 'times', 'for', 'the', 'ball', 'managed', 'to', 'save', '50', 'the', 'rest', 'go', 'out', 'of', 'bounds'], ['my', 'whole', 'body', 'feels', 'itchy', 'and', 'like', 'its', 'on', 'fire'], ['no', 'its', 'not', 'behaving', 'at', 'all', 'im', 'mad', 'why', 'am', 'i', 'here', 'because', 'i', 'cant', 'see', 'you', 'all', 'over', 'there'], ['not', 'the', 'whole', 'crew'], ['need', 'a', 'hug'], ['hey', 'long', 'time', 'no', 'see', 'yes', 'rains', 'a', 'bit', 'only', 'a', 'bit', 'lol', 'im', 'fine', 'thanks', 'hows', 'you'], ['nope', 'they', 'didnt', 'have', 'it'], ['que', 'me', 'muera']]

Now building the actual model is super simple:

##now for the word2vec
##list of lists input works fine, you could train in batches as well if your set is too large for memory. 
model = word2vec.Word2Vec(tweets_clean, iter=5, min_count=30, size=300, workers=1)

And now, let’s check out the results. The embeddings look quite good: you can clearly see that the most similar words to a certain word are closely related in semantic meaning:

print(model.wv.similarity('yellow','blue'))
print(model.wv.similarity('yellow','car'))
print(model.wv.similarity('bird','wing'))

#0.725584
#0.25973
#0.484771

print(model.wv.most_similar('car'))
print(model.wv.most_similar('bird'))
print(model.wv.most_similar('face'))

#[('truck', 0.7357473373413086), ('bike', 0.6849836707115173), ('apt', 0.6751911640167236), ('flat', 0.666846752166748), ('room', 0.6251213550567627), ('garage', 0.6184542775154114), ('van', 0.6047226190567017), ('house', 0.5993092060089111), ('license', 0.5938838720321655), ('passport', 0.5903675556182861)]
#[('squirrel', 0.7508804202079773), ('spider', 0.7459014654159546), ('cat', 0.7174966931343079), ('kitten', 0.7117133140563965), ('nest', 0.7008006572723389), ('giant', 0.6956182718276978), ('frog', 0.6906625032424927), ('rabbit', 0.685787558555603), ('mouse', 0.6779413223266602), ('hamster', 0.6754754781723022)]
#[('mouth', 0.6230158805847168), ('lip', 0.5884073972702026), ('butt', 0.5822890996932983), ('finger', 0.579236626625061), ('smile', 0.5790724754333496), ('eye', 0.5769941806793213), ('cheek', 0.5746228098869324), ('skin', 0.5726771354675293), ('arms', 0.5703110694885254), ('neck', 0.5571259260177612)]

That’s it for today! I’ll be elaborating more on how to generate vector embeddings by defining neural networks with Tensorflow and Keras, in a future post.

How to convert Categorical Data to Continuous

One of the main aspects of preparing your dataset for statistics, machine learning, or other advanced analyses, is understanding exactly which data types you’re dealing with, and subsequently transforming them into the data types your analysis requires.

In this post, I’ve gathered a number of common and less common methods from machine learning and statistics. These are, to my mind, adjacent fields that often deal with exactly the same problems, just with different terminology. Therefore, regardless of your field, pick the most suitable method. The list probably does not encompass all available transformations, but I try to touch upon at least the common techniques.

Because the topic is enormous, I’ll try to provide intuitive, concise descriptions for each transformation. I intend to elaborate on some of the more advanced methods (e.g. vector embeddings) in future posts.

For the sake of clarity: in this article, I’ll use the word categorical as a synonym for nominal and ordinal variables, and I’ll use the term continuous as a synonym for interval and ratio variables. Other common terms are discrete and numerical, respectively. Furthermore, a variable, or column, contains some characteristic of cases, or rows. These characteristics differ between cases; otherwise you would have no reason to keep the variable in your data. All these different values are called levels throughout the article. The number of unique levels in a variable is called its cardinality.

In this post, I first describe data types. Subsequently, I touch upon the following transformations:

  • One-hot encoding
  • Binary encoding, or dummy variables
  • Contrasts
  • Transformation to dichotomous
  • Counting levels
  • Ranking based on count
  • Vector embeddings
  • Ignoring transformation altogether

First though, we should go through all data types available.

Data Types

Different sorts of information are often split into the following hierarchy.

  • Nominal data: A nominal variable contains different levels that are not more or less than each other; there is no hierarchy present. For example, one could think of car brands; there’s no clear hierarchy of which brand is better than the other.
  • Ordinal data: This data type contains different levels, in which a clear hierarchy is established. For these types of variables, it’s often difficult to define whether the distances between different levels are equally large. For example, car X may be better than car Y, and car Z is better than both. However, is car Z better than car Y by the same amount as car Y is better than car X?

Often, it’s quite difficult to distinguish between both aforementioned data types. For example, I mentioned that different car brands contain no hierarchy. Your opinion on the other hand may be that some brands are better than others, and that car brand is therefore an ordinal variable. Within statistics, nominal and ordinal are usually treated equally; for an analysis, they mean the same thing.

  • Interval data: An interval variable contains different levels, between which a very clear hierarchy can be established. Also, the distances between those levels are equal. For example, one’s net worth. You may be worth millions, or you may be millions in debt. There is always a clear concept that the difference between owning $1000 and $2000 is exactly the same as the difference between $10000 and $11000.
  • Ratio data: This data type is exactly the same as the previous one, bar one exception. A ratio variable has an absolute starting point, in other words, a true zero. One’s salary is an example of this. One may earn anything between $0.01 and millions, but the general agreement is that a salary cannot be negative; you cannot earn less than $0 over a period of time.

As before, the difference between ratio and interval is only a slight nuance. Therefore, they’re usually treated as being equal.

Within statistics and machine learning, almost every algorithm expects variables to be either interval or ratio. Therefore, in many situations, one might want to change the data type of one or more variables from categorical to continuous.

In order to get to a mathematical formula that predicts / explains some output variable, the assumption of equal distances between levels needs to be met. Now, it may seem that there are analyses which can take categorical variables as input. One of those is a method widely used in the social sciences, named analysis of variance (ANOVA). This analysis requires categorical variables as input, and continuous variables as output. However, in the background, it transforms all categorical inputs to continuous ones with one-hot / dummy coding. Also, some analyses do exist that use both categorical inputs and outputs, such as the chi-square test of independence. Yet, even chi-square transforms your categorical levels into counts of how often they occur, which is in essence continuous information.

Therefore, you might want to take full control of the data types in your set. Now that you understand data types, here are the most common methods for transforming categorical variables (at least the ones that I know of).

One-hot encoding

One-hot encoding, and very similarly creating dummy variables, may be the most widespread method for categorical to continuous transformation. As I mentioned before, some analyses even do it automatically without you noticing.

What essentially happens is this: your categorical variable contains K levels, so K new variables are created. On each of those new variables, cases that have the corresponding level are set to 1, and all other cases are set to 0. For example:

Person ID Hair_colour Red Black Blonde
1 Red 1 0 0
2 Red 1 0 0
3 Blonde 0 0 1
4 Black 0 1 0
5 Black 0 1 0

As mentioned before, the Hair colour variable with three levels is split into three binary variables that each encode a specific colour.

Note. Dummy encoding is common in statistics, and slightly different from one-hot encoding; K – 1 new variables are created, and one level is set to 0 on all of those. Like this:

Person ID Hair_colour Red Black
1 Red 1 0
2 Red 1 0
3 Blonde 0 0
4 Black 0 1
5 Black 0 1
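In pandas, both variants are a single call away. Here is a small sketch using the hair colour example; note that get_dummies with drop_first=True drops the alphabetically first level, which is not necessarily the baseline you would pick yourself:

##one-hot and dummy encoding with pandas
import pandas as pd

df = pd.DataFrame({'Hair_colour': ['Red', 'Red', 'Blonde', 'Black', 'Black']})

print(pd.get_dummies(df['Hair_colour']))                    ##one-hot: K columns
print(pd.get_dummies(df['Hair_colour'], drop_first=True))   ##dummy coding: K - 1 columns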

One-hot encoding is a very elegant transformation; it captures all available information efficiently. Nevertheless, it’s not suited to all situations. It works well when a variable contains a relatively small number of levels, say hair colour or blood type. When the number of levels is large, on the other hand, one-hot encoding becomes an unmanageable proliferation of variables. For a variable that contains items bought in a supermarket, dummy variables are useless: a supermarket may sell thousands of different products, resulting in thousands of variables in your set. You lose both oversight of your data and efficiency of analysis in the process.

Yet maybe the largest drawback of one-hot encoding is the lack of information on relationships between levels. With supermarket items, we know that apples and pears are very similar, but that detergent is not like those items. Nevertheless, one-hot encoding makes every item equally (dis)similar to every other item.

Binary encoding

This transformation is quite similar to one-hot encoding, but it is better suited to variables with high cardinality. In simple terms, binary encoding transforms the integer assigned to a certain level into its binary representation. For example, the number 2 could also be written as 010. Unlike one-hot encoding, the distances between levels are no longer all equal: they depend on which binary codes the levels happen to receive, and carry no meaning in themselves. Whether that is acceptable depends on your data. Let’s see:

Car_color Binary_1 Binary_2 Binary_3
Blue 0 0 0
Red 0 0 1
Grey 0 1 0
Black 0 1 1
Yellow 1 0 0
Purple 1 0 1
White 1 1 0

In this case, the cardinality determines the number of columns required. As you can see in the example above, three binary places support up to eight levels (000 through 111); beyond that, you would need four places.
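Here is a quick hand-rolled sketch of this encoding with pandas (libraries such as category_encoders also offer a BinaryEncoder). Note that the integer assigned to each level is arbitrary, so the resulting codes won’t match the table above exactly:

##binary encoding by hand: map each level to an integer, then to its binary digits
import pandas as pd

colours = pd.Series(['Blue', 'Red', 'Grey', 'Black', 'Yellow', 'Purple', 'White'])
codes = colours.astype('category').cat.codes          ##an integer 0..K-1 per level
n_bits = int(codes.max()).bit_length()                ##binary places needed for this cardinality

binary = pd.DataFrame(
    [[int(digit) for digit in format(int(code), '0{}b'.format(n_bits))] for code in codes],
    columns=['Binary_{}'.format(i + 1) for i in range(n_bits)])
print(pd.concat([colours.rename('Car_color'), binary], axis=1))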

Binary encoding works better than one-hot encoding on variables with high cardinality. Nevertheless, it has some drawbacks; like one-hot encoding, this technique does not capture relationships between levels in any way.

There are many more methods adjacent to one-hot and binary encoding, such as feature hashing, but for an introduction these two suffice.

Contrasts

Within linear regression analysis, it is common to have an independent variable with more than two levels predicting a continuous dependent variable. One should never enter a categorical variable with more than two levels into a linear regression as if it were numeric, because you cannot be certain that those levels are equally distant from each other (the condition for an interval variable); you don’t even know whether the levels are in a meaningful order. It would be like trying to fit a straight line through a random ordering of car colors.

Therefore, it’s possible to transform the independent variable into a set of contrasts: you define which levels should be compared to each other, through dichotomous comparisons. You do this based on your hypotheses: which comparisons are relevant for your specific analysis? Because the number of contrasts is always K – 1, you need to be selective about which comparisons to make.

Treatment contrasts

Within experimental research, you often want to compare some treatment groups to your baseline, or control group. Let’s say, you’re assessing the effectiveness of two drug treatments A and B for symptoms of depression. Your baseline group of patients receives a placebo, and your two experimental groups receive the drug treatments, respectively. Your variable has three levels, therefore you get to specify two contrasts. You’d specify the following dichotomous comparisons, i.e. treatment contrasts:

  • Compare the severity of depression for treatment A against the baseline group;
  • Compare the severity of depression for treatment B against the baseline group.

Usually, regression models let you specify the following matrix for this. Each contrast embodies a dichotomous comparison.

Variable Contrast_1 Contrast_2
Baseline 0 0
Treatment A 1 0
Treatment B 0 1

As you can see, it’s very similar to dummy coding; the level that serves as the baseline is never set to 1 in the contrasts.

Helmert contrasts

Another common design is the Helmert contrast. Again, choosing this one depends on which question you’re trying to answer. The basic idea is this: the mean of every level is compared to the mean of the previous levels.

This is best demonstrated with an example. Say you are testing how drug treatment X affects depression symptoms. You take four measurements (denoted as M1 through M4) over the course of the treatment. You want to know whether patients experience fewer symptoms at each measurement. Four levels, so three contrasts to be defined. Helmert contrasting dictates the following comparisons:

  • Compare the mean of M2 to that of M1;
  • Compare the mean of M3 to the grand mean of M1 and M2;
  • Compare the mean of M4 to the grand mean of M1, M2, and M3.

This is captured in the following matrix, with each comparison embodied in a contrast, respectively:

Variable Contrast_1 Contrast_2 Contrast_3
M1 -1 -1 -1
M2 1 -1 -1
M3 0 2 -1
M4 0 0 3
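To show how such a matrix ends up in a regression, here is a small sketch that builds a design matrix from the Helmert codes above by hand; in practice, packages like patsy or R construct these for you, and the measurement column below is made up:

##build a regression design matrix from the Helmert contrast matrix above
import numpy as np
import pandas as pd

levels = ['M1', 'M2', 'M3', 'M4']
helmert = np.array([[-1, -1, -1],
                    [ 1, -1, -1],
                    [ 0,  2, -1],
                    [ 0,  0,  3]])

measurements = pd.Series(['M1', 'M2', 'M2', 'M3', 'M4'])
one_hot = pd.get_dummies(measurements)[levels].values   ##K indicator columns per row
design = one_hot.dot(helmert)                           ##each row receives its level's contrast codes
print(pd.DataFrame(design, columns=['Contrast_1', 'Contrast_2', 'Contrast_3']))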

There are many more contrast designs such as sum contrasts and orthogonal contrasts, but for an intuitive understanding of the method, I’ll stick to these two. Using contrasts can be quite advantageous; the method allows you to exactly specify which questions should be answered, and therefore forces you to instil prior knowledge into the model. This completely opposes the usual expectation-free manner in which inputs and outputs are linked in many machine learning applications. Nevertheless, contrasts are quite a niche tool for regression, and are difficult to define once your research question spans more than four or five dichotomous comparisons.

Transformation to dichotomous

Variables that contain only two levels can be used as continuous, even if they hold categories. This is because only two levels need to be compared; the distance between levels only matters once you have three levels or more. For example, if your flight class variable only has the levels 0 (economy) and 1 (business), you may enter it into a linear regression without transformation.

Alternatively, you can aggregate a categorical variable that has more than two levels down to a binary one. Let’s say your data contains a variable whose levels are car brands. You want to transform that variable to continuous, and you notice that the people in your dataset only drive German or Japanese cars. Then you can encode the brand someone drives as either 0 (German) or 1 (Japanese). Another case in which transformation to binary may be beneficial is when your variable has only three or four levels; you may merge the most similar ones, so that only two levels remain.
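As a sketch, collapsing a (made-up) car brand variable into such a binary column could look like this:

##aggregate a multi-level categorical variable to a binary one
import pandas as pd

brands = pd.Series(['BMW', 'Toyota', 'Audi', 'Honda', 'Volkswagen'])
origin = {'BMW': 0, 'Audi': 0, 'Volkswagen': 0, 'Toyota': 1, 'Honda': 1}   ##0 = German, 1 = Japanese
print(brands.map(origin))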

The advantage of applying this transformation is that it’s very fast. Both generating the variable and running analyses with it is very efficient. However, you might lose a lot of information; what if you’re predicting reliability, and some German brands turn out to be reliable, and others don’t? That information will be lost with aggregation. Nonetheless, this method may be beneficial in some situations.

Counts

Just counting the number of occurrences of each level in the data is actually a method that is used quite often in statistics, for example in the chi-square test of independence I mentioned earlier.

Say, you have supermarket data on individual sales of products, and you want to know what each customer has bought over a period of time. You might structure the data like this, eliminating any need for dummy or binary coding:

Customer Toilet paper Detergent Chilli sauce
Joe 20 14 50
Anne 13 16 17
Pedro 0 40 20
Jen 50 60 3
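Starting from transaction-level data (one row per product sold), a table like the one above is a single aggregation away. The column names and rows below are an assumption of what such a sales set might look like:

##aggregate transaction-level sales to counts per customer and product
import pandas as pd

sales = pd.DataFrame({
    'customer': ['Joe', 'Joe', 'Anne', 'Pedro', 'Joe', 'Anne'],
    'product':  ['Toilet paper', 'Chilli sauce', 'Detergent',
                 'Detergent', 'Toilet paper', 'Chilli sauce']})

print(pd.crosstab(sales['customer'], sales['product']))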

The major drawback here is that aggregating your data into a denser format always leads to information loss. For example, the sequence in which customers bought the products cannot be analysed from this set. Nevertheless, counting often suffices in situations where aggregation is inevitable and the cardinality of the variable is quite low.

Ranking

The technique of ranking your levels is a slightly more advanced method than just counting them.

This method rests upon the following premise: levels of a categorical variable do not have any numerical meaning by themselves. However, if we are able to order them in some meaningful manner, they do gain numerical value. How often each level occurs in the variable is one such numerical property. Therefore, if we rank levels by how often they occur, our transformation should work; naturally, the higher the count, the higher the rank given to the level. This method should, to some extent, satisfy the condition for continuous variables that levels which are closer to each other are more similar.

In the example below, the color blonde occurs the least often, and receives rank 1. Black hair occurs 3 times, and is assigned rank 2. Red occurs most times, and is therefore ranked 3.

Person_ID Hair_color Rank_score
1 Blonde 1
2 Red 3
3 Black 2
4 Red 3
5 Red 3
6 Black 2
7 Black 2
8 Red 3
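With pandas, this count-based rank can be attached in a couple of lines. Here is a sketch with the hair colour example; a dense ranking assigns tied counts the same rank, which is exactly the situation discussed next:

##rank levels by how often they occur in the variable
import pandas as pd

hair = pd.Series(['Blonde', 'Red', 'Black', 'Red', 'Red', 'Black', 'Black', 'Red'])
counts = hair.map(hair.value_counts())             ##occurrence count of each row's level
print(counts.rank(method='dense').astype(int))     ##1 = rarest level, up to K = most frequent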

However, applying the above may lead to a situation in which at least two levels occur equally often in the variable. You can’t just assign them the same rank, as that would make those levels indistinguishable. We need a method to differentiate between them, because otherwise we don’t satisfy the condition that there should be K ranks for K levels. This paper actually has a clever mathematical solution for that.

Of course, the obvious disadvantage here is having a large number of levels with the same count. Yet, your analysis may actually benefit from ranking your variables, as the authors of the previously mentioned paper point out.

Vector embeddings

Simply put, every level of your variable receives a vector, or list of numbers, of length X (usually between 50 and 300). That vector represents the location of the specific level in X-dimensional space. Moreover, levels that are more similar to each other are closer together in embedding space.

Think of it like GPS coordinates of cities on a map; cities that are closer to each other have more similar coordinates. And because two cities close together are likely in the same country, they may resemble each other very much. Nevertheless, vector embeddings are often computed in up to 300 dimensions, instead of two-dimensional maps.

Vector embeddings are most commonly used for transforming text into a usable mathematical representation. A large text corpus often contains thousands of unique words, rendering most transformation techniques useless. Moreover, you might want to capture some of the rich semantic meanings of words that are present in text.

For that purpose, you can use neural networks. What basically happens under the hood, is this; each unique word is assigned a vector. Your neural network model is then taught to predict whether two random words occur close to each other or not. As your network becomes proficient at this, it pushes co-occurring words closer to each other in vector space, and words that are never close, away from each other.

This results in an embedding space that actually captures the meanings of words in relation to each other, in a mathematical way. Of course, our three-dimensionally limited brains can’t picture much of it.

Not only text is suitable for embeddings. Any data in which combinations of certain levels of the variable occur, is sufficient for generating embeddings. For example, if you’re working with product sales information from a supermarket, your data contains patterns on which items are bought together often, and which aren’t. Based on the aforementioned product combinations, you can produce vectors that group similar products together. These vector embeddings actually capture some type of mathematical meaning of a product. For example, it’s very unlikely that two brands of milk are often bought together. You either choose one or the other. However, those two milk products will probably be encoded as very similar vectors. This is because although they do not occur together and therefore have no direct relationship, their relationships to other products are very likely to be similar. If you buy milk, you might buy flour, regardless of the brand of milk. Thus, the different milk products will be close in vector space.

Creating vector embeddings is, in my opinion, the most elegant way to transform categorical information to continuous. Not only does it encode each available level fully, the embeddings also contain information on how the different levels of your variable are related to each other.

On the other hand, vector embeddings are only applicable in limited situations. You need some information on which combinations do or do not occur between levels. This is most easily found in text, but as I mentioned before, sales data also works, as do Wikipedia links or social media mentions. You'll need to be creative in how you acquire combination information if you'd like to use vector embeddings.

Or, just ignore the transformation

The last, and probably least correct, option available is to just avoid any transformation and use your categorical variable as if it were continuous. In this case, you'd map all the different levels of your variable to the numbers 1 to K, in arbitrary order, basically ignoring any ranking or clustering within the variable. Although I do not have any experience with this, and it goes against basic assumptions of most advanced analyses, I think that in some situations it may actually work well.
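
As a minimal sketch, this is essentially what pandas' factorize does. The hair colour values are just an example, and factorize numbers the levels 0 to K-1 in order of first appearance rather than 1 to K, but the idea is the same:

import pandas as pd

hair = pd.Series(['Blonde', 'Red', 'Black', 'Red', 'Red', 'Black'])
codes, levels = pd.factorize(hair)
print(codes)    ##[0 1 2 1 1 2]: one integer per level, with no meaningful order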

For example, if you’re trying to apply linear regression with an independent variable with three levels, and you have some knowledge on the hierarchy within those levels (so an ordinal variable) the regression function may fit well.

In my opinion, the most important thing to remember here is that skipping the transformation should be a well-founded decision; without a solid indication that it will actually improve your analysis, don't do it.

Conclusion

So there you go: all the commonly available methods for transforming categorical data to continuous. Spending some time finding the right one for your feature engineering or statistical analysis may mean a large increase in performance. Now go apply them!

Multivariate Outlier Detection with Isolation Forests

Recently, I was struggling with a high-dimensional dataset in which I found a very small number of outliers, all easily identifiable in scatterplots. However, one group of cases happened to be quite isolated as well, at a large distance from the more common cases on a few variables. Therefore, whenever I tried to remove outliers that were three, four, or even five standard deviations from the mean, I would also delete this group.

Fortunately, I ran across a multivariate outlier detection method called isolation forest, presented in this paper by Liu et al. (2012).

This unsupervised machine learning algorithm almost perfectly preserved the valid patterns while picking off the outliers, which in this case were all just faulty data points. I've used isolation forests on every outlier detection problem since. In this blog post, I'll explain what an isolation forest does in layman's terms, and I'll include some Python / scikit-learn code for you to apply to your own analyses.

Outliers

First, some outlier theory. Univariate outliers are cases that lie far from the mean, in terms of standard deviations, on a single variable. Don't confuse them with influential cases: influential cases may be outliers, and vice versa, but the two are not identical. Multivariate outliers, which we are discussing in this post, are essentially cases that display a unique or divergent pattern across multiple variables.

Outliers may have two causes:

  • There may be mistakes present in your data. Maybe someone filled in a faulty number or just typed 999; maybe the application generating your data contains some weird logic. You definitely want to remove these cases before analysis.
  • Some cases are quite abnormal in your data, but are valid. Here, it's up to the data scientist to remove them or not. If you're doing a regression, for example, those outliers may strongly influence your results (see Cook's distance), and you'll probably want to remove them from the data. However, a clustering algorithm will just put the abnormal group of cases in a separate cluster.

What I find isolation forests do well is that they first pick off the false or bad cases, and only once those are all identified do they start on the valid, abnormal cases. This is because valid cases, however abnormal, are often still grouped together, whereas bad cases are truly unique. This does not hold for every dataset; if a default value is 999, for example, there may be many cases with that value on some variable. It's up to the data scientist to be vigilant in those situations.

Isolation Forests

As the name suggests, isolation forests are based on random forests. I'll dig into decision trees and random forests some other time, but here's what you need to know: decision trees split data into classes in order to minimize prediction error. On each iteration, the tree gets to make one split on one of the included variables, choosing the split that removes the most entropy, or degree of uncertainty. For example, when predicting SES, a decision tree could first split cases into younger and older people. Subsequently, it could split the younger group into people with and without college degrees to remove more entropy, and so on.

Enter random forests: large, powerful ensembles of such trees, in which the quality of each individual tree is diminished by randomness in the splits, but the ensemble as a whole achieves low prediction error by combining the trees' predictions into a final decision.

An isolation forest is based on the following principles (according to Liu et al.): outliers are the minority, and they behave abnormally on variables compared to normal cases. Therefore, given a decision tree whose sole purpose is to isolate a certain data point, fewer splits should be required to isolate an outlier than to isolate a common data point. This is illustrated in the following plot:

Based on this, what an isolation forest essentially does is construct a collection of trees in which each split selects a random variable and a random split value on that variable, and record how many splits it takes to isolate each data point. Data points are then ranked by how few splits were needed, on average, to isolate them. Given that the model's instructions were to identify X% of cases as outliers, the top X% on that ranking are returned.
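
To make that intuition concrete, here is a rough, hypothetical sketch of isolating a single point with random splits. It is not the actual algorithm from the paper or scikit-learn's implementation, just the core idea:

import numpy as np

##rough sketch: how many random splits does it take before a point is alone in its region?
def isolation_depth(point, data, max_depth=50):
    depth = 0
    while len(data) > 1 and depth < max_depth:
        feature = np.random.randint(data.shape[1])
        lo, hi = data[:, feature].min(), data[:, feature].max()
        if lo == hi:
            break
        split = np.random.uniform(lo, hi)
        ##keep only the side of the split that contains our point
        if point[feature] < split:
            data = data[data[:, feature] < split]
        else:
            data = data[data[:, feature] >= split]
        depth += 1
    return depth

##an outlier should, on average, be isolated in fewer splits than a point in a dense region
data = np.random.normal(50, 5, size=(1000, 2))
print(np.mean([isolation_depth(np.array([90.0, 90.0]), data) for _ in range(100)]))  ##far-away point
print(np.mean([isolation_depth(data[0], data) for _ in range(100)]))                 ##point from the dense cluster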

Isolation forests perform well because they deliberately target outliers, instead of defining abnormal cases based on normal case behaviour in the data. They are also quite efficient; I’ve easily applied them on datasets containing millions of cases.

Now for the practical bit. Let’s generate some data. I generate a large sample of definite inliers, then some valid outliers, then some bad cases.

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.ensemble import IsolationForest

##first, build a simulated 2-dimensional dataset
##(drawing scalars, without the size argument, keeps the DataFrame columns as plain floats)
inliers = [[np.random.normal(50, 5), np.random.normal(50, 5)] for i in range(900)]
outliers1 = [[np.random.normal(70, 1), np.random.normal(70, 1)] for i in range(50)]
outliers2 = [[np.random.normal(50, 20), np.random.normal(50, 20)] for i in range(50)]

##merge the data points and split them into two columns
points = inliers + outliers1 + outliers2
v1, v2 = [p[0] for p in points], [p[1] for p in points]

df = pd.DataFrame({'v1': v1, 'v2': v2})
df = df.sample(frac=1).reset_index(drop=True)    ##shuffle the rows

##and look at the data
plt.figure(figsize = (20,10))
plt.scatter(df['v1'], df['v2'])
plt.show()

Subsequently, we’ll train an isolation forest to identify outliers and examine the results. 5% of the set was generated as faulty, scattered cases. However, some of these are likely to have ended up among the inliers. Therefore, I set the outlier identification threshold to 4%.

##apply an isolation forest
outlier_detect = IsolationForest(n_estimators=100, max_samples=1000, contamination=.04, max_features=df.shape[1])
outlier_detect.fit(df)
outliers_predicted = outlier_detect.predict(df)    ##returns 1 for inliers and -1 for outliers

##check the results by colouring the points with the predicted labels
df['outlier'] = outliers_predicted
plt.figure(figsize = (20,10))
plt.scatter(df['v1'], df['v2'], c=df['outlier'])
plt.show()

As you can see, the isolation forest nicely separates the bad cases from the actual data patterns, with barely any errors. If I now wanted to remove the rightmost cluster as well, I could simply increase my removal threshold. The next plot shows the result with the threshold set to 15%. One can clearly see that it now starts picking off cases in the smaller cluster and in the periphery of the larger cluster. Keep in mind that this algorithm will always incur some error, due to the random generation of splits.

Hyperparameter tuning

This is how to tune the following parameters for optimal performance:

  • n_estimators: The number of decision trees in the forest. According to Liu et al., 100 should be sufficient in most cases.
  • max_samples: Given large datasets, you might want to train on random subsets of cases to decrease training time. This parameter lets you determine subset size.
  • contamination: The proportion of the data you want to identify as outliers. As demonstrated before, given no prior knowledge, this parameter requires some trial and error combined with scatterplot visualisation; see the sketch after this list.
  • max_features: The number of variables used to define outliers. In almost any situation, this should be set to the number of variables in your data. It also means you can run the forest on one variable at a time to do univariate outlier detection, without having to specify a standard deviation threshold.
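
As a minimal sketch of that trial-and-error loop for contamination, reusing the df from the example above and simply eyeballing each plot:

##try a few contamination values and inspect the resulting labels visually
for contamination in [0.02, 0.04, 0.10, 0.15]:
    forest = IsolationForest(n_estimators=100, contamination=contamination)
    labels = forest.fit_predict(df[['v1', 'v2']])
    plt.figure(figsize=(20, 10))
    plt.title('contamination = {}'.format(contamination))
    plt.scatter(df['v1'], df['v2'], c=labels)
    plt.show()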

That’s all you need to know to apply isolation forests to multivariate outlier detection! Please refer to this blog and the scikit-learn documentation if you use any of the information written here in your own work.