Vector embeddings part 2: Country Embeddings with TensorFlow and Keras

Following this previous post on neural network vector embeddings for text, I wanted to experiment some more with creating embeddings for items in a dataset. As you may remember, vector embeddings are lists of numbers that represent the location of objects in N-dimensional space. Essentially, embeddings encode categorical data as continuous vectors. Objects that are more similar or compatible are placed closer to each other in vector space, while dissimilar objects end up further apart. That (cosine) similarity is rooted in co-occurrence in the data: if two items appear together often, they are placed closer together. It's kind of like a sky with stars that all have their unique location, except that the sky has between 50 and 300 dimensions in this type of analysis.
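To make the notion of similarity concrete, here is a tiny numpy sketch with two made-up 3-dimensional embeddings; the cosine similarity is simply the dot product of the two vectors divided by the product of their lengths:

##toy example of cosine similarity between two made-up 3-dimensional embeddings
import numpy as np

a = np.array([0.9, 0.1, 0.3])
b = np.array([0.8, 0.2, 0.4])

cosine_similarity = np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))
print(cosine_similarity)
##roughly 0.98; these two vectors point in almost the same direction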

As I described in my previous post, one can capture the meanings of words in relation to each other by generating embeddings from words that often co-occur. This is the most prevalent application of embeddings, yet there are other fascinating ones. For example, one could embed sales data on which items are frequently bought together, in order to make purchase recommendations to customers. Even more interesting: encoding people and the networks between them, based on how often they interact on social media.

Credit where credit is due: inspiration for this post partially came from Towards Data Science and from Deep Learning with Keras by Gulli & Pal (2017).

Summary for impatient readers

In the current post, I'll describe a pet project I've been working on recently: recreating a world map of countries from purely relational data on those countries. Here's a summary of what I did: First, I gathered data in which some countries co-occur and others don't. Then, I generated 50-dimensional embeddings for each country with a neural network. Lastly, I applied TSNE dimension reduction down to two dimensions and plotted those, to see whether they in any way resemble a world map, or geographical / diplomatic relations between countries. Stick around until the end of the post to see whether I succeeded!

These are my hypotheses:

  • I expect to see basic grouping of countries based on their respective continents and specific geographical locations. For example: the Netherlands might be wedged between Germany and Belgium, within the Europe cluster.
  • Countries that have strong diplomatic bonds should be close together. For example, I expect Commonwealth countries such as the UK and Australia, along with close partners like Ireland, to flock together.

And these are all the techniques I employed in this project:

  • Wikipedia API
  • Neural network embeddings with Keras and Tensorflow
  • t-distributed stochastic neighbor embeddings (TSNE) from scikit-learn, for dimension reduction
  • Check bilateral combinations with cosine similarity / dot products
  • 2D scatterplot with Bokeh

Countries

Of course, if I want to work with countries, I first need a comprehensive list of them. Here's a United Nations .xlsx file that I found after some quick Googling; of all the online country lists I came across, this one seemed the most reliable.

That file is not directly usable in Python. So, I did some manual data massaging, added the respective continent to each country, and saved everything as a .csv. This is what I ended up with:
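The result is a simple two-column, comma-separated file with a country name and its continent on every line; the first few rows look roughly like this (matching the lists printed further below):

Algeria,Africa
Egypt,Africa
Libya,Africa
Morocco,Africa
Tunisia,Africa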

So, let’s load that info into some Python lists:

##get the imports out of the way
import wikipedia
import numpy as np
import pandas as pd
import csv
import itertools
from collections import defaultdict
import random
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE

from keras.layers import dot, Input, Dropout
from keras.layers.core import Dense, Reshape
from keras.layers.embeddings import Embedding
from keras.models import Model
from scipy.spatial.distance import cdist

from bokeh.plotting import figure, show, output_file, save
from bokeh.models import ColumnDataSource, HoverTool, CategoricalColorMapper
from bokeh.palettes import d3
from bokeh.io import show, output_notebook
output_notebook()

##load country and continent names from csv file
countries, continents = [],[]
with open('C:/Users/teun/Documents/Blog/countries.txt', 'rt') as file: 
    reader = csv.reader(file, delimiter = ',')
    for line in reader: 
        countries.append(line[0])
        continents.append(line[1])
        
print(countries[:10])
print(continents[:10])
print(len(countries))

##['Algeria', 'Egypt', 'Libya', 'Morocco', 'Tunisia', 'Western Sahara', 'Burundi', 'Comoros', 'Djibouti', 'Eritrea']
##['Africa', 'Africa', 'Africa', 'Africa', 'Africa', 'Africa', 'Africa', 'Africa', 'Africa', 'Africa']
##187

Apparently, this UN list contains 187 countries and territories.

Wikipedia Data

Yet, we need more data. As I mentioned before, we need a dataset in which the co-occurrence of countries is represented. At first, I wanted to gather lots of news articles and extract country co-occurrences from those, so I tried the New York Times API. However, country combinations turned out to be quite sparse there; countries are simply not mentioned that often. Moreover, the NYT API has quite restrictive rate limiting, which made my data acquisition painfully slow.

However, I got the inspiration for better data from Will Koehrsen at Towards Data Science; Wikipedia provides a foolproof API for gathering full articles, so that web scraping is not necessary. Basically, I acquired the full Wikipedia text for every country:

##acquire Wikipedia text for each country
pages = []
for country in countries: 
    wiki = wikipedia.page(country)
    pages.append(wiki.content)

print(pages[0][:400])
##Algeria ( (listen); Arabic: الجزائر‎ al-Jazā'ir, Algerian Arabic الدزاير al-dzāyīr; French: Algérie), officially the People's Democratic Republic of Algeria (Arabic: الجمهورية الجزائرية الديمقراطية الشعبية‎), is a country in the Maghreb region of North Africa. The capital and most populous city is Algiers, located in the far north of the country on the Mediterranean coast. With an area of 2,381,74

Then, for each article, I filtered out the country names that it mentions:

##create a list of lists with countries that occur together within articles 
combinations = []
for page in pages: 
    combination = []
    for country in countries: 
        if country in page: 
            combination.append(country)
    combinations.append(combination)

print(combinations[1])
##['Egypt', 'Libya', 'Eritrea', 'Ethiopia', 'Somalia', 'Sudan', 'South Africa', 'Niger', 'Nigeria', 'China', 'Bangladesh', 'India', 'Iran', 'Armenia', 'Cyprus', 'Iraq', 'Israel', 'Jordan', 'Kuwait', 'Qatar', 'Saudi Arabia', 'Syria', 'Turkey', 'United Arab Emirates', 'Yemen', 'Argentina', 'Albania', 'Greece', 'Russia', 'France', 'Italy', 'Malta', 'Canada', 'United States']

Generating true and false bilateral combinations

Now, for the embeddings, we need bilateral combinations. I'll assume that all possible bilateral combinations of countries within a Wikipedia article are valid. Generating them is quite simple:

##generate two-way combinations
def generate_true_combinations(sets): 
    true_combinations = []
    for countries_in_article in sets: 
        for combination in itertools.combinations(countries_in_article, 2):
            true_combinations.append(list(combination))
    return true_combinations

true_combinations = generate_true_combinations(combinations)
print(true_combinations[:10])

##[['Algeria', 'Egypt'], ['Algeria', 'Libya'], ['Algeria', 'Morocco'], ['Algeria', 'Tunisia'], ['Algeria', 'Western Sahara'], ['Algeria', 'Mali'], ['Algeria', 'Mauritania'], ['Algeria', 'Niger'], ['Algeria', 'Nigeria'], ['Algeria', 'Jordan']]

Yet, we also need some "false" combinations. I'll explain: we train our neural network by feeding it true country combinations, which it will place closer together in vector space. However, we also need to feed the network bilateral combinations that did not occur in the data, so it learns to push unrelated countries apart. We can generate those randomly from our true combination data:

##generate false two-way combinations that are randomly sampled from the full true set
def generate_false_combinations (true_combinations, desired_amount):
    false_combinations = []
    counter = 0
    ##this list of unique combinations is required for checking whether a randomly generated combination exists or not. 
    ##first, create sets of tuples; sets only work with immutable objects such as tuples
    combinations_unique = set(tuple(c) for c in true_combinations)
    ##and transform back to list of lists 
    combinations_unique = [list(c) for c in combinations_unique]
    
    #generate bidirectional unique combinations dict for performance improvement
    true_dict = defaultdict(list)
    
    for c in combinations_unique: 

        true_dict[c[0]].append(c[1]) 
        true_dict[c[1]].append(c[0])
        
    ##generate units0 and units1 lists, which contain all items in respectively position 0 and 1 for each combination in the full set
    units0 = [item[0] for item in true_combinations]
    units1 = [item[1] for item in true_combinations]
    
    while counter <= desired_amount:
            
        #random items need to be generated from the full true set, to account for prevalence. 
        c = [random.choice(units0), random.choice(units1)]
            
        #check if c did not exist in true combinations: 
        if c[0] != c[1]: 
            if c[1] not in true_dict[c[0]]: 
                false_combinations.append(c)
                counter += 1        

    return false_combinations
    
false_combinations = generate_false_combinations(true_combinations, len(true_combinations))

print(false_combinations[:10]) 

##[['New Caledonia', 'Bolivia'], ['Myanmar', 'Swaziland'], ['Sudan', 'Trinidad and Tobago'], ['Armenia', 'Macau'], ['Guyana', 'Jordan'], ['Slovakia', 'Ghana'], ['Moldova', 'Ecuador'], ['Sierra Leone', 'Czech Republic'], ['Mongolia', 'Czech Republic'], ['Bangladesh', 'Latvia']]

Now that we have equal length lists of both types of country combinations, we can merge them, add true / false labels, and randomize the order:

##merge true and false combinations, add binary number to be predicted, randomize order
def merge_combinations(true_combinations, false_combinations):
            
    ##create true and false labels
    labels_true = [1 for i in range(len(true_combinations))]
    labels_false = [0 for i in range(len(false_combinations))]
    
    ##combine true and false combinations
    combinations = true_combinations + false_combinations
    labels = labels_true + labels_false
    
    ##shuffle sequence after zipping the lists
    z = list(zip(combinations, labels))
    random.shuffle(z)
    
    ##unzip and return
    combinations, labels = zip(*z)
    return combinations, labels

##generate shuffled training set of true/false combinations, with labels
combinations, labels = merge_combinations(true_combinations, false_combinations)

##further data prep
print('there are ', len(combinations), 'country combinations in the set')
##there are  132489 country combinations in the set

##split combinations into lists first and second country
country1, country2 = zip(*combinations)

There is one last step to be taken before training the neural network. All our countries are still represented by their names, but neural networks only take numerical data as input. Therefore, I generate a lookup dictionary in which every country is represented by an index, and subsequently transform every country in the dataset to its corresponding index number:

##neural networks use numerical data as input, so assign an index number to every country
##and generate a country-to-index lookup dict, plus the reverse index-to-country lookup dict
country_to_index = {country: index for index, country in enumerate(countries)}
index_to_country = {index: country for index, country in enumerate(countries)}

##and transform all the countries in our set to their respective indices
country1 = [country_to_index[i] for i in country1]
country2 = [country_to_index[i] for i in country2]

print(country1[:10])

##keras uses numpy arrays as input, so transform the training lists to those
country1 = np.array(country1, dtype="int32")
country2 = np.array(country2, dtype="int32")
labels = np.array(labels, dtype = "int32")

Training the model

In my previous post, I used word2vec modelling with Gensim to generate embeddings. This time around though, I wanted to know more about how the neural network that generates embeddings actually works.

I'll provide a quick rundown. All true and false combinations are run through the model a number of times. The model has two inputs, into which the two countries of each combination are fed during training. The model then looks up the embedding for each of the two countries. Next, it computes the dot product between those embeddings: each number in one embedding is multiplied by the corresponding number in the other embedding, and all those products are summed. That dot product goes through a dense layer with a sigmoid activation, which squashes it into a prediction between 0 and 1: close to 1 means "true combination", close to 0 means "false combination". Lastly, the prediction is compared to the actual label accompanying the country combination. This is where backpropagation takes place; both embeddings are adjusted so that the next time this combination comes around, the prediction will be slightly better.
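To make that concrete, here is a toy numpy sketch of the dot-product-plus-sigmoid step, with made-up 3-dimensional embeddings and the dense layer's weight and bias left out for simplicity:

##toy illustration of the core of the model, for a single country pair (made-up numbers)
import numpy as np

embedding_a = np.array([0.2, -0.5, 0.7])   ##hypothetical embedding of the first country
embedding_b = np.array([0.3, -0.4, 0.6])   ##hypothetical embedding of the second country

dot_product = np.sum(embedding_a * embedding_b)   ##element-wise multiplication, then summing
prediction = 1 / (1 + np.exp(-dot_product))       ##sigmoid squashes the dot product into (0, 1)

print(dot_product, prediction)
##the dot product is 0.68, which the sigmoid turns into a prediction of about 0.66

During training, the label for this pair would be 1 or 0, and backpropagation would nudge both embeddings to push the prediction towards that label.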

Here’s the code:

#specify neural network parameters
epochs = 100
batch_size = 10000
embed_size = 50

##keras API model specification
input_country1 = Input(name = 'input_country1', shape = [1])
input_country2 = Input(name = 'input_country2', shape = [1])

##embedding layers for both the first and the second country
model_country1 = Embedding(input_dim = len(countries), output_dim = embed_size)(input_country1)
model_country2 = Embedding(input_dim = len(countries), output_dim = embed_size)(input_country2)

#reshape those 
model_country1 = Reshape((embed_size, 1))(model_country1)
model_country2 = Reshape((embed_size, 1))(model_country2)

#merge embeddings with dot product
dot_ = dot([model_country1, model_country2], axes=1, normalize = False)
dot_ = Reshape(target_shape = [1])(dot_)

##predict true/false from the dot product
output = Dense(1,activation = "sigmoid")(dot_)

##compile
model = Model(inputs=[input_country1, input_country2], outputs=output)
model.compile(loss = "binary_crossentropy", optimizer = "adam", metrics = ['accuracy'])
model.summary()

##train model
history = model.fit([country1, country2], labels, batch_size=batch_size, epochs=epochs, verbose = 1, validation_split=0.1)

##assess model performance
plt.plot(history.history['acc'])
plt.plot(history.history['loss'])
plt.show()

At around 99% validation-set accuracy, I was happy with the performance and stopped tweaking. Because the Keras validation split is only used for monitoring and barely influences the model, I didn't bother with a separate test set (do use one if you're applying this type of model professionally!). Accuracy and loss curves:

Checking out the results

Now, our model is quite proficient at predicting 0s and 1s. Yet, what we actually want are the embeddings themselves. Let's retrieve them (printing the first ten values of the first country's embedding):

##retrieve embeddings
weights = model.layers[2].get_weights()[0]
print(weights[0][:10])

##[-0.44104806 -0.14912751 -0.6652928   0.37164855 -0.37634224  0.08819574
##  0.48560727  0.29437146  0.25467378 -0.6605289 ]

Then, we need some functions for computing similarities between countries. First, let's normalize the embeddings so that the dot product between two of them equals their cosine similarity, and then define a function that computes that similarity for two given countries:

##normalize the embeddings so that the dot product between two embeddings becomes the cosine similarity.
##Normalize just means divide each vector by the square root of the sum of squared components.
square_root = np.linalg.norm(weights, axis = 1)
square_root = square_root.reshape((-1, 1))
weights = weights / square_root

##compute most/least similar indices with dot product
def sim_dot(c1, c2, weights=weights, country_to_index=country_to_index): 
    
    ##retrieve indices, then embeddings
    weights1 = weights[country_to_index[c1]]
    weights2 = weights[country_to_index[c2]]
    
    ##compute dot product
    sim = np.dot(weights1,weights2)
    
    return sim

print(sim_dot('Netherlands','Belgium'))
##0.895393
print(sim_dot('United Kingdom','Australia'))
##0.967548
print(sim_dot('Belgium','Togo'))
##0.239391
print(sim_dot('United States','Russia'))
##0.104014

Subsequently, a function for retrieving the most (or least) similar countries to a given country:

##compute most or least similar countries to a certain country
def top_similar(country, top_K, how, weights=weights, country_to_index=country_to_index, countries=countries):
    ##retrieve all similarities
    sims = [sim_dot(country, c2,  weights=weights, country_to_index=country_to_index) for c2 in countries]
    sims = zip(sims, countries)
    
    ##order descending
    sims_ordered = sorted(sims, reverse = True)
    
    ##return only the highest k ones, or the lowest k ones, dependent on the specified parameter
    if how == 0: 
        sims_ordered = sims_ordered[1:top_K + 1] 
    elif how == 1: 
        sims_ordered = sims_ordered[-top_K:]
    
    return sims_ordered

print(top_similar('Australia',10, 0))
##[(0.97386092, 'Canada'), (0.97085083, 'Spain'), (0.96754754, 'United Kingdom'), (0.95988286, 'Japan'), (0.95359182, 'Switzerland'), (0.93311918, 'Portugal'), (0.92666471, 'Italy'), (0.91799873, 'New Zealand'), (0.90552145, 'Ireland'), (0.89424378, 'Netherlands')]
print(top_similar('Netherlands',10,0))
##[(0.94961381, 'Portugal'), (0.94443583, 'Sweden'), (0.94182992, 'Ireland'), (0.93346846, 'Switzerland'), (0.91344929, 'Spain'), (0.89897841, 'Denmark'), (0.89634496, 'Luxembourg'), (0.89585704, 'New Zealand'), (0.89539278, 'Belgium'), (0.89475739, 'Austria')]
print(top_similar('Australia',10, 1))
##[(0.11294153, 'Grenada'), (0.11038196, 'Malawi'), (0.1062203, 'United States'), (0.06896022, 'Barbados'), (0.060145646, 'Maldives'), (0.024445064, 'Burkina Faso'), (0.00057541206, 'Mauritius'), (-0.029210068, 'Saint Lucia'), (-0.12130371, 'United States Virgin Islands'), (-0.18051396, 'Georgia (country)')]
print(top_similar('Netherlands',10, 1))
##[(0.20300971, 'Indonesia'), (0.18801141, 'India'), (0.1720216, 'China'), (0.16769037, 'Burkina Faso'), (0.15285541, 'Palestinian territories'), (0.14719915, 'United States Virgin Islands'), (0.12747405, 'United States'), (0.11045419, 'Maldives'), (0.030738026, 'Georgia (country)'), (0.0073090047, 'Mauritius')]

You can clearly see in the results that this model has captured actual relationships between countries, without any access to geographical information!

Visualization in 2 dimensions

In order to make the embeddings comprehensible for our dimensionally-challenged human peanut brains, I'll apply dimension reduction with TSNE, a popular ML method. Dimension reduction is not ideal, though; there are a few drawbacks. Firstly, although it captures quite a lot, a two-dimensional projection will never be able to represent the full richness of the 50-dimensional embeddings. Secondly, training TSNE is a handful: underfit the model and it will not capture any relevant structure; overfit it and it will start separating individual items from the group, because that apparently decreases the error. You can find out more about the behaviour of TSNE with the TensorFlow Embedding Projector. Still, some fiddling with the parameters usually produces quite a nice result. Here we go:

## TSNE visualisation; given a large amount of dimensions, summarize those to two (or three) dimensions
##TSNE uses as input a list of lists / np ndarray with each list as one vector embedding
tsne = TSNE(n_components=2, metric = 'cosine', verbose=1, perplexity=20, n_iter=400).fit_transform(weights)

##extract 2-dimensional embeddings from tsne
x = [i[0] for i in tsne]
y = [i[1] for i in tsne]

#transform data to pandas
viz_df = pd.DataFrame({'x': x, 'y': y, 'country' :countries, 'continent': continents})

##initialize a Bokeh dataset
source = ColumnDataSource(viz_df)

##define empty plot
p = figure(title = 'TSNE-visualization of country-embeddings',plot_width = 950, plot_height = 950)

##define continent colors
length = len(viz_df['continent'].unique())
palette = d3['Category10'][length]
color_map = CategoricalColorMapper(factors=['Africa', 'Asia', 'Caribbean/Central America', 'South America', 'Oceania',
 'Europe', 'North America'], palette=palette)

##fill the plot with colored country dots
p.scatter(x='x', y='y', size = 10, source = source,legend='continent', color={'field': 'continent', 'transform': color_map})

##add a hovertool so we can see which dot represents which country
hover = HoverTool(tooltips = [('Country:', '@country'),('Continent:','@continent')])
p.add_tools(hover)

output_file("countryembeddings.html")
save(p)

Here’s the plot. You can clearly see some of the continents grouped together! The image is currently static. I’ll try to embed a dynamic HTML file here shortly.

Results

Throughout this post, we’ve seen some quite fascinating things. Time to compare them to the hypotheses:

  • I expect to see basic grouping of countries based on the continent that they are in, maybe even specific geographical location. For example: the Netherlands should be wedged between Germany and Belgium, within the Europe cluster.

I found this one confirmed. Both when computing dot products and when looking at the visualization, one can clearly see that countries which are geographically close also flock together in 50-dimensional space. The continents are distinctly recognizable in the 2D image; it is even possible to tell Eastern and Western Europe apart!

Of course, it wasn't exactly on point. Apparently, New Zealand and Australia are located in Europe 😉. However, this is to be expected from an analysis in which half of the data were randomly generated!

  • Countries that have strong diplomatic bonds should be close together. For example, I expect Commonwealth countries such as the UK and Australia, along with close partners like Ireland, to flock together.

True! As I said before, this even caused New Zealand and Australia to be located in Europe! That said, the embeddings reflect diplomatic bonds only to the extent that Wikipedia actually mentions them.

Some more thoughts on vector embeddings: I find the technique a fascinating and useful tool for data science applications. I've used it in professional settings as well, usually to provide secondary neural networks with ready-made object representations. You could also build both the embeddings and the other neural network layers required for a classification problem within a single model (a rough sketch follows below). Yet, I find it convenient to have full control over embedding quality before I start on the secondary prediction task.
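As a rough sketch of what I mean (with hypothetical item counts, class counts and layer sizes, nothing taken from the project above), such a combined model could look like this in Keras:

##rough sketch: embeddings and a classifier combined in a single Keras model
##(num_items, num_classes and the layer sizes are hypothetical placeholders)
from keras.layers import Input, Embedding, Flatten, Dense
from keras.models import Model

num_items = 1000     ##hypothetical number of categorical items to embed
num_classes = 5      ##hypothetical number of target classes
embed_size = 50

item_input = Input(shape=[1], name='item')
item_embedding = Embedding(input_dim=num_items, output_dim=embed_size)(item_input)
item_embedding = Flatten()(item_embedding)

##the embeddings are trained jointly with the classification layers below
hidden = Dense(32, activation='relu')(item_embedding)
output = Dense(num_classes, activation='softmax')(hidden)

classifier = Model(inputs=item_input, outputs=output)
classifier.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])

The embedding layer here is just another trainable layer, so its weights end up being shaped by the classification loss rather than by a dedicated similarity task.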

Specifically interesting to me is the way in which neural network embeddings generate dense, rich, continuous representations from sparse categorical data. In my analysis, only a small subset of all possible bilateral country combinations was actually represented, which is quite sparse. Yet, we ended up with dense embeddings that encode relationships between all countries.

To exemplify this: let's say that Norway and Sweden are almost identical with regard to all the other countries, yet only Sweden has a relationship to Germany in our data. As a result, the relationship between Norway and Germany will be approximated from all the relationships these countries have with other countries, such as Sweden. In other words, we are filling in / guessing lots of information that is not explicitly available to us, enriching our data in the process.

That’s it for today. Stay tuned for my next hobby project; reinforcement learning with Keras and PyGame!