One of the main aspects of preparing your dataset for statistics, machine learning, or other advanced analyses is understanding exactly which data types you’re dealing with, and subsequently transforming them to the data types your analysis requires.
In this post, I’ve gathered a number of common and less common methods from machine learning and statistics. These are, to my mind, adjacent fields that often deal with exactly the same problems, just with different terminology. Therefore, regardless of your field, pick the most suitable method. The list probably does not encompass all available transformations. Yet, I try to touch upon at least all of the common techniques.
Because the topic is enormous, I’ll try to provide intuitive, concise descriptions for each transformation. I intend to elaborate on some of the more advanced methods (e.g. vector embeddings) in future posts.
For the sake of clarity: in this article, I’ll use the word categorical as a synonym for nominal and ordinal variables, and I’ll use the term continuous as a synonym for ratio and interval variables. Other common denominations are discrete and numerical, respectively. Furthermore, a variable, or column, contains some characteristic of cases, or rows. That characteristic differs between cases, otherwise you would have no reason for maintaining the variable in your data. The different values it can take are called levels throughout the article. The number of unique levels in a variable is called its cardinality.
In this post, I first describe data types. Subsequently, I touch upon the following transformations:
- One-hot encoding
- Binary encoding, or dummy variables
- Transformation to dichotomous
- Counting levels
- Ranking based on count
- Vector embeddings
- Ignoring transformation altogether
First though, we should go through all data types available.
Different sorts of information are often split into the following hierarchy.
- Nominal data: A nominal variable contains different levels that are not more or less than each other; there is no hierarchy present. For example, one could think of car brands; there’s no clear hierarchy of which brand is better than the other.
- Ordinal data: This data type contains different levels, in which a clear hierarchy is established. For these types of variables, it’s often difficult to define whether the distances between different levels are equally large. For example, car X may be better than car Y, and car Z is better than both. However, is car Z better than car Y by the same amount as car Y is better than car X?
Often, it’s quite difficult to distinguish between both aforementioned data types. For example, I mentioned that different car brands contain no hierarchy. Your opinion on the other hand may be that some brands are better than others, and that car brand is therefore an ordinal variable. Within statistics, nominal and ordinal are usually treated equally; for an analysis, they mean the same thing.
- Interval data: An interval variable contains different levels, between which a very clear hierarchy can be established. Also, the distances between those levels are equal. For example, one’s net worth. You may be worth millions, or you may be millions in debt. There is always a clear concept that the difference between owning $1000 and $2000 is exactly the same as the difference between $10000 and $11000.
- Ratio data: This data type is exactly the same as the previous one, bar one exception. A ratio variable has an absolute starting point, in other words, a zero. One’s salary is an example of this. One may earn anything between $0.01 and millions, but the general agreement is that a salary cannot be negative; you cannot earn less than $0 over a period of time.
As before, the difference between ratio and interval is only a slight nuance. Therefore, they’re usually treated as being equal.
Within statistics and machine learning, in almost any algorithm, variables are expected to be either in interval or ratio. Therefore, in many situations, one might want to change the datatype of one or more variables from categorical to continuous.
In order to arrive at a mathematical formula that predicts or explains some output variable, the assumption of equal distances between levels needs to be met. Now it may seem that there are analyses which can take categorical variables as input. One of those is a method widely used in the social sciences, named analysis of variance (ANOVA). This analysis requires categorical variables as input, and continuous variables as output. However, in the background, it transforms all categorical inputs to continuous with one-hot encoding. Also, some analyses do exist that use both categorical inputs and outputs, such as the chi-square test of independence. Yet, even chi-square transforms your categorical levels to counts of how often they occur, which is in essence continuous information.
Therefore, you might want to take full control of the data types in your set. Now that you understand data types, here are the most common methods for transforming categorical variables (at least the ones that I know of).
One-hot encoding, and very similarly creating dummy variables, may be the most widespread method for categorical to continuous transformation. As I mentioned before, some analyses even do it automatically without you noticing.
What essentially happens is this: your categorical variable contains K levels. Therefore, K new variables are created. On each of those new variables, cases that have the corresponding level are set to 1, and all other cases are set to 0. For example:
As mentioned before, the Hair colour variable with three levels is split into three binary dummy variables, that all encode a specific colour.
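As a rough illustration of the one-hot idea, here is a minimal sketch in plain Python. The function name `one_hot` and the sample values are my own assumptions for illustration; only the three hair colours come from the text.

```python
def one_hot(values):
    """Return (levels, rows): one 0/1 column per distinct level."""
    levels = sorted(set(values))  # fixed, reproducible column order
    rows = [[1 if value == level else 0 for level in levels] for value in values]
    return levels, rows


# Hypothetical sample of four cases.
hair = ["black", "red", "blonde", "red"]
levels, encoded = one_hot(hair)

print(levels)  # ['black', 'blonde', 'red']
for value, row in zip(hair, encoded):
    print(value, row)  # e.g. black [1, 0, 0]
```

Every row contains exactly one 1, which is where the name one-hot comes from.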
Note. Dummy encoding is common in statistics, and slightly different from one-hot encoding; K – 1 new variables are created, and one level is set to 0 on all of those. Like this:
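The K – 1 variant can be sketched in the same way. In this hypothetical example, the function name `dummy_code` and the choice of black as the reference level are assumptions; any level could serve as the reference.

```python
def dummy_code(values, reference):
    """Return (levels, rows) with K - 1 columns; the reference level is all zeros."""
    levels = [level for level in sorted(set(values)) if level != reference]
    rows = [[1 if value == level else 0 for level in levels] for value in values]
    return levels, rows


# Hypothetical sample; "black" is arbitrarily chosen as the reference level.
hair = ["black", "red", "blonde", "red"]
dummy_levels, dummies = dummy_code(hair, reference="black")

print(dummy_levels)  # ['blonde', 'red']
for value, row in zip(hair, dummies):
    print(value, row)  # black [0, 0]; red [0, 1]; blonde [1, 0]
```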
One-hot encoding is a very elegant transformation; it captures all available information efficiently. Nevertheless, it’s not suited to all situations. It works well when a variable contains a relatively small number of levels; say, hair colour or blood type. When the number of levels on the other hand is large, one-hot encoding becomes an unmanageable sprawl of variables. Given a variable that contains items bought in a supermarket, dummy variables are useless. A supermarket may sell thousands of different products, resulting in thousands of variables in your set. You lose both oversight of your data and efficiency of analysis in the process.
Yet maybe the largest drawback to one-hot encoding is the lack of information on relationships between levels. With supermarket items, we know that apples and pears are very similar, but detergent is not like those items. Nevertheless, items are encoded in such a way that they are all equally similar to each other.
Binary encoding is quite similar to one-hot encoding. However, it is better suited to variables with high cardinality. In simple terms, binary encoding numbers the levels and then writes each level’s number in its binary representation, with one new variable per binary digit. For example, the number 2 could also be written as 010. Unlike one-hot encoding, the resulting codes are not all equally distant from each other: levels whose numbers happen to share digits end up closer together, even though that closeness is arbitrary. This may or may not matter, depending on your data. Let’s see:
In this case, the cardinality determines the number of columns required. As you see in the previous example, binary with three places supports up to seven levels when numbering starts at 1 (001 through 111); thereafter you’d go to four places.
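The following sketch shows one way binary encoding could be implemented. It assumes levels are numbered from 1 in alphabetical order; the function name `binary_codes` and the product names are hypothetical.

```python
def binary_codes(levels):
    """Map each level to a list of bits; levels are numbered from 1."""
    ordered = sorted(set(levels))
    width = len(ordered).bit_length()  # columns needed for the largest number
    return {
        level: [int(bit) for bit in format(number, f"0{width}b")]
        for number, level in enumerate(ordered, start=1)
    }


# Hypothetical variable with five levels, so three binary columns suffice.
products = ["apple", "bread", "cheese", "detergent", "eggs"]
codes = binary_codes(products)

for level, bits in codes.items():
    print(level, bits)  # e.g. bread [0, 1, 0]
```

With this numbering, seven levels still fit in three columns, and an eighth level requires a fourth.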
Binary encoding works better than one-hot encoding on variables with high cardinality. Nevertheless, it has some drawbacks; like one-hot encoding, this technique does not capture relationships between levels in any way.
For one-hot encoding and binary transformation, many adjacent methods, such as hashing, could be discussed. Yet, for an introduction, these two suffice.
Within linear regression analysis, it is common to have a categorical independent variable with more than two levels, predicting a continuous dependent variable. One should never enter such a variable into a linear regression as if it were a single continuous variable, because you are not certain that those levels are equally distant from each other (the condition for an interval variable). You don’t even know whether the levels are in the right order. It would be like trying to fit a straight line through a random order of car colors.
Therefore, it’s possible to transform the independent variable into a set of contrasts; you define which levels should be linearly compared to each other, with dichotomous tests. You do this based on your hypotheses; which comparisons are relevant for your specific analysis? Because the number of contrasts is always K – 1, you need to be selective regarding which comparisons should be made.
Within experimental research, you often want to compare some treatment groups to your baseline, or control group. Let’s say, you’re assessing the effectiveness of two drug treatments A and B for symptoms of depression. Your baseline group of patients receives a placebo, and your two experimental groups receive the drug treatments, respectively. Your variable has three levels, therefore you get to specify two contrasts. You’d specify the following dichotomous comparisons, i.e. treatment contrasts:
- Compare the severity of depression for treatment A against the baseline group;
- Compare the severity of depression for treatment B against the baseline group.
Usually, regression models let you specify the following matrix for this. Each contrast embodies a dichotomous comparison.
As you can see, it’s very similar to dummy coding, and the level that serves as the baseline is never set to 1 in the contrasts.
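To make the treatment contrasts for the drug example concrete, here is a small sketch that builds such a matrix. The function name `treatment_contrasts` and the group labels are assumptions for illustration, not the interface of any particular regression package.

```python
def treatment_contrasts(levels, baseline):
    """Return (contrast names, matrix); each non-baseline level gets its own column."""
    others = [level for level in levels if level != baseline]
    matrix = {
        level: [1 if level == other else 0 for other in others]
        for level in levels
    }
    return others, matrix


groups = ["placebo", "treatment A", "treatment B"]
contrast_names, contrasts = treatment_contrasts(groups, baseline="placebo")

print(contrast_names)  # ['treatment A', 'treatment B']
for level, row in contrasts.items():
    print(level, row)  # placebo [0, 0]; treatment A [1, 0]; treatment B [0, 1]
```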
Another common one is the Helmert contrast. Again, choosing for this one depends on which question you’re trying to answer. The basic functionality of the contrasts is this; the mean of every level is compared to the mean of the previous levels.
This is best demonstrated with an example; say, you are testing how drug treatment X affects depression symptoms. You take four measurements (denoted as M1 through M4) over the entire treatment. You want to know whether patients experience fewer symptoms at each measurement. Four levels, so three contrasts to be defined. Helmert contrasting dictates the following comparisons:
- Compare the mean of M2 to that of M1;
- Compare the mean of M3 to the grand mean of M1 and M2;
- Compare the mean of M4 to the grand mean of M1, M2, and M3.
This is captured in the following matrix, with each comparison embodied in a contrast, respectively:
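As a sketch of what such a matrix can look like, the code below generates Helmert contrasts in one common unscaled convention: in contrast j, all earlier levels receive -1, the level being compared receives j, and later levels receive 0. The function name is hypothetical, and other software may scale or order the contrasts differently.

```python
def helmert_contrasts(k):
    """Return a k x (k - 1) matrix; row = level, column = contrast."""
    matrix = []
    for row in range(k):
        codes = []
        for col in range(1, k):
            if row < col:
                codes.append(-1)   # earlier levels, pooled together
            elif row == col:
                codes.append(col)  # the level compared against them
            else:
                codes.append(0)    # later levels are not involved
        matrix.append(codes)
    return matrix


measurements = ["M1", "M2", "M3", "M4"]
for name, row in zip(measurements, helmert_contrasts(len(measurements))):
    print(name, row)
# M1 [-1, -1, -1]
# M2 [1, -1, -1]
# M3 [0, 2, -1]
# M4 [0, 0, 3]
```

Each column sums to zero, so every contrast compares one measurement with the average of the measurements before it.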
There are many more contrast designs such as sum contrasts and orthogonal contrasts, but for an intuitive understanding of the method, I’ll stick to these two. Using contrasts can be quite advantageous; the method allows you to exactly specify which questions should be answered, and therefore forces you to instil prior knowledge into the model. This completely opposes the usual expectation-free manner in which inputs and outputs are linked in many machine learning applications. Nevertheless, contrasts are quite a niche tool for regression, and are difficult to define once your research question spans more than four or five dichotomous comparisons.
Transformation to dichotomous
Variables that contain only two levels can be used as continuous, even if they contain categories. This is because only two levels need to be compared. Distance between levels only matters once you have three levels or more. For example, if your flight class variable is coded as 0 (economy) or 1 (business), you may enter it into a linear regression without transformation.
Alternatively, you can aggregate a categorical variable that has more than two levels, to binary. Let’s say that your data contains a variable with levels that are car brands. You want to transform that variable to continuous, and notice that the people in your dataset only drive German or Japanese cars. Then, you’d be able to encode the car brand someone drives as either 0 (German) or 1 (Japanese). Another case in which transformation to binary may be beneficial, is when you only have three or four levels in your variable. You may merge the most similar ones, so that only two levels remain.
The advantage of applying this transformation is that it’s very fast. Both generating the variable and running analyses with it are very efficient. However, you might lose a lot of information; what if you’re predicting reliability, and some German brands turn out to be reliable, and others not? That information will be lost with aggregation. Nonetheless, this method may be beneficial in some situations.
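The aggregation to two levels described above can be sketched as a simple lookup. The brand list and the names `COUNTRY_OF_BRAND` and `to_dichotomous` are assumptions chosen purely for illustration.

```python
# Hypothetical lookup from brand to country of origin.
COUNTRY_OF_BRAND = {
    "BMW": "German",
    "Audi": "German",
    "Toyota": "Japanese",
    "Honda": "Japanese",
}
CODE = {"German": 0, "Japanese": 1}


def to_dichotomous(brands):
    """Collapse each brand to 0 (German) or 1 (Japanese)."""
    return [CODE[COUNTRY_OF_BRAND[brand]] for brand in brands]


cars = ["BMW", "Toyota", "Audi", "Honda"]
print(to_dichotomous(cars))  # [0, 1, 0, 1]
```

Note that BMW and Audi become indistinguishable after this step, which is exactly the information loss mentioned above.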
Just counting the number of occurrences of each level in the data is actually a method that is used quite often in statistics, for example in the chi-square test of independence I previously mentioned.
Say, you have supermarket data on individual sales of products, and you want to know what each customer has bought over a period of time. You might structure the data like this, eliminating any need for dummy or binary coding:
| Customer | Toilet paper | Detergent | Chilli sauce |
The major drawback here is that aggregating your data to a denser format always leads to information loss. For example, the sequence in which customers bought the products cannot be analysed from this set. Nevertheless, counting often suffices in situations where aggregation is inevitable, and the cardinality of the categorical variable is quite low.
The technique of ranking your levels is a slightly more advanced method than just counting them.
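A count table of this shape can be built from individual sales records roughly as follows. The sales records are invented for illustration; only the three product names come from the table header.

```python
from collections import Counter, defaultdict

# Hypothetical sales records: one (customer, product) pair per purchase.
sales = [
    ("customer 1", "Toilet paper"),
    ("customer 1", "Detergent"),
    ("customer 1", "Toilet paper"),
    ("customer 2", "Chilli sauce"),
    ("customer 2", "Toilet paper"),
]
products = ["Toilet paper", "Detergent", "Chilli sauce"]

# Count how often each customer bought each product.
counts = defaultdict(Counter)
for customer, product in sales:
    counts[customer][product] += 1

# One row per customer, one column per product.
table = {c: [counts[c][p] for p in products] for c in sorted(counts)}
for customer, row in table.items():
    print(customer, row)  # customer 1 [2, 1, 0]; customer 2 [1, 0, 1]
```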
This method rests upon the following premise: levels of a categorical variable do not have any numerical meaning. However, if we are able to order them in some meaningful manner, they may gain numerical value. How often each level occurs in the categorical variable is one such numerical representation for that level. Therefore, if we rank levels by how often they occur, our transformation should work. Naturally, the higher the count, the higher the rank that should be given to the level. This method should to some extent satisfy the condition for continuous variables that levels which are closer to each other are more similar.
In the example below, the color blonde occurs the least often, and receives rank 1. Black hair occurs 3 times, and is assigned rank 2. Red occurs most times, and is therefore ranked 3.
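Ranking by count can be sketched as below. Black occurs three times as in the text; the counts for red and blonde are assumptions. Ties are simply broken by order of first appearance here, which is not the solution from the paper mentioned in this section.

```python
from collections import Counter

# Hypothetical sample: red x4, black x3 (as in the text), blonde x1.
hair = ["red", "black", "red", "blonde", "black", "red", "black", "red"]

counts = Counter(hair)
# Sort levels from least to most frequent, then number them from 1.
ordered = sorted(counts, key=lambda level: counts[level])
ranks = {level: rank for rank, level in enumerate(ordered, start=1)}

encoded = [ranks[value] for value in hair]
print(ranks)    # {'blonde': 1, 'black': 2, 'red': 3}
print(encoded)  # [3, 2, 3, 1, 2, 3, 2, 3]
```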
However, applying the aforementioned may lead to the situation in which at least two levels occur equally often in the variable. You can’t just assign them the same rank, because that would mean that those levels are indistinguishable. We need a method to differentiate between them, because otherwise we don’t satisfy the condition that there should be K ranks for K levels. This paper actually has a clever mathematical solution for that.
Of course, the obvious disadvantage here is having a large number of levels with the same count. Yet, your analysis may actually benefit from ranking your variables, as the authors of the previously mentioned paper point out.
Simply put, a vector embedding assigns every level of your variable a vector, or list of numbers, of length X. That vector represents the location of a specific level in X-dimensional space, where X is usually between 50 and 300. Moreover, levels that are more similar to each other are closer together in embedding space.
Think of it like GPS coordinates of cities on a map; cities that are closer to each other have more similar coordinates. And because two cities close together are likely in the same country, they may resemble each other very much. Nevertheless, vector embeddings are often computed in up to 300 dimensions, instead of two-dimensional maps.
Vector embeddings are most commonly used for transforming text to a usable mathematical representation. A large text corpus often contains thousands of unique words, rendering most transformation techniques useless. Moreover, you might want to capture some of the rich semantic meanings of words that are present in text format.
For that purpose, you can use neural networks. What basically happens under the hood, is this; each unique word is assigned a vector. Your neural network model is then taught to predict whether two random words occur close to each other or not. As your network becomes proficient at this, it pushes co-occurring words closer to each other in vector space, and words that are never close, away from each other.
This results in an embedding space that captures much of the meaning of words in relation to each other, in a mathematical way. Of course, our brains, limited to three dimensions, won’t be able to picture any of it.
Not only text is suitable for embeddings. Any data in which combinations of certain levels of the variable occur, is sufficient for generating embeddings. For example, if you’re working with product sales information from a supermarket, your data contains patterns on which items are bought together often, and which aren’t. Based on the aforementioned product combinations, you can produce vectors that group similar products together. These vector embeddings actually capture some type of mathematical meaning of a product. For example, it’s very unlikely that two brands of milk are often bought together. You either choose one or the other. However, those two milk products will probably be encoded as very similar vectors. This is because although they do not occur together and therefore have no direct relationship, their relationships to other products are very likely to be similar. If you buy milk, you might buy flour, regardless of the brand of milk. Thus, the different milk products will be close in vector space.
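The milk example can be illustrated with a deliberately simplified stand-in: plain co-occurrence counts instead of a trained neural network. This is not how learned embeddings are actually produced, but it shows why two products that are never bought together can still end up with similar vectors. The baskets and all names below are hypothetical.

```python
from itertools import permutations
from math import sqrt

# Hypothetical shopping baskets; the two milk brands never appear together.
baskets = [
    ["milk A", "flour", "eggs"],
    ["milk B", "flour", "eggs"],
    ["milk A", "flour"],
    ["milk B", "eggs"],
    ["detergent", "sponge"],
    ["detergent", "sponge", "eggs"],
]

items = sorted({item for basket in baskets for item in basket})
position = {item: i for i, item in enumerate(items)}

# Each item's vector counts how often it shares a basket with every other item.
vectors = {item: [0] * len(items) for item in items}
for basket in baskets:
    for a, b in permutations(set(basket), 2):
        vectors[a][position[b]] += 1


def cosine(u, v):
    """Cosine similarity: 1.0 means same direction, 0.0 means nothing shared."""
    dot = sum(x * y for x, y in zip(u, v))
    norm = sqrt(sum(x * x for x in u)) * sqrt(sum(y * y for y in v))
    return dot / norm if norm else 0.0


print(cosine(vectors["milk A"], vectors["milk B"]))     # close to 0.8
print(cosine(vectors["milk A"], vectors["detergent"]))  # close to 0.2
```

The two milk brands score high because they share neighbours (flour and eggs), not because they co-occur with each other.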
Creating vector embeddings is, in my opinion, the most elegant way to transform some categorical information to continuous. This is because it not only encodes each available level fully, but vector embeddings also contain information on how different levels of your variable are related to each other.
On the other hand, vector embeddings are only applicable in very limited situations. You need some information on which combinations do or do not occur between levels. This is most easily found in text, but as I mentioned before, sales data also works, as do Wikipedia links, or social media mentions. You’ll need to be creative in how to acquire combination information if you’d like to use vector embeddings.
Or, just ignore the transformation
The last, and probably very incorrect option available, is to just avoid any transformation and use your categorical variable as continuous. In this case, you’d transform all the different levels in your variable to the numbers 1 to K, in random order. You basically ignore any ranking or clustering within the variable. Although I do not have any experience with this, and it goes against basic assumptions of any advanced analysis, I think that in some situations, it may actually work well.
For example, if you’re trying to apply linear regression with an independent variable with three levels, and you have some knowledge on the hierarchy within those levels (so an ordinal variable) the regression function may fit well.
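Both variants, numbering the levels in random order or in a known ordinal order, can be sketched as follows. The function name `integer_codes`, its parameters, and the size variable are hypothetical.

```python
import random


def integer_codes(values, order=None, seed=0):
    """Map levels to 1..K, in the given order or, if none is given, a random one."""
    if order is not None:
        ordered = list(order)
    else:
        ordered = sorted(set(values))
        random.Random(seed).shuffle(ordered)  # arbitrary but reproducible
    return {level: number for number, level in enumerate(ordered, start=1)}


# Hypothetical ordinal variable with three levels.
sizes = ["small", "large", "medium", "small"]

ordinal = integer_codes(sizes, order=["small", "medium", "large"])
print([ordinal[s] for s in sizes])  # [1, 3, 2, 1]

arbitrary = integer_codes(sizes)    # numbers 1 to 3 in a random order
print(arbitrary)
```

Only the first variant uses knowledge about the hierarchy; the second assigns numbers that carry no meaning at all.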
In my opinion, the most important thing to remember here is that, given that you’re going to avoid any transformation, this should be a well-founded decision; without a solid indication that it may actually improve your analysis, don’t do it.
So there you go, all common available methods for transforming categorical data to continuous. Spending some time on finding the right one for your feature engineering or statistical analysis, may actually mean a large increase in performance. Now go apply them!