This article describes how to do feature engineering in Displayr.
Feature engineering refers to the process of manipulating predictor variables (features) with the goal of improving a predictive model. In this post I outline some of the key tools and processes for feature engineering in Displayr.
A Displayr Document
Switching between categorical and numeric treatment of predictor variables
Perhaps the most fundamental form of feature engineering when building a predictive model is the decision about whether to treat a particular predictor as being categorical or numeric. In Displayr, the way that a variable is treated in a model is determined by its structure. Displayr has 15 different structures, but the two key ones of relevance in most predictive models are Numeric and Mutually exclusive categories (nominal), where Mutually exclusive categories (nominal) means that the data is treated as being categorical.
The structure of a variable is changed by:
- Selecting the variable in the Data Sets Tree (bottom-left)
- Changing Object Inspector > Properties > INPUTS > Structure.
Sometimes a variable will be grouped into a variable set with other variables. It can be split by:
- Selecting Toolbar > Split.
Creating a new numeric variable
There are many tools in Displayr for creating new variables. The most flexible tool is to:
- Select Anything > Data > Variables > New > Custom Code > R - Numeric
which allows you to create a new variable using the R language. For example, to create a new variable which is the natural logarithm of an existing variable, called Tenure, type log(Tenure). See Feature Engineering for Numeric Variables for examples of the code to do things like winsorize, cap, normalize, and calculate polynomials.
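As a rough illustration of the transformations mentioned above, the following is a minimal sketch in base R. The Tenure vector is stand-in data defined here only so that the code runs on its own (in Displayr, Tenure would refer to an existing variable), and the cutoffs and percentiles are arbitrary assumptions rather than recommended values.

```r
# Stand-in data: in Displayr, Tenure would be an existing variable in the
# data set rather than a vector defined in the code.
Tenure <- c(1, 3, 5, 8, 12, 24, 36, 60, 72, 240)

# Natural logarithm (the example given in the text)
log.tenure <- log(Tenure)

# Cap: limit values to a fixed maximum (the cutoff of 72 is arbitrary)
capped <- pmin(Tenure, 72)

# Winsorize: pull values outside the 5th and 95th percentiles in to
# those percentiles
limits <- unname(quantile(Tenure, probs = c(0.05, 0.95), na.rm = TRUE))
winsorized <- pmin(pmax(Tenure, limits[1]), limits[2])

# Normalize: rescale to mean 0 and standard deviation 1
normalized <- (Tenure - mean(Tenure, na.rm = TRUE)) / sd(Tenure, na.rm = TRUE)

# Polynomial term: the square of the variable
squared <- Tenure^2
```

Each new variable would typically use just one of these expressions, for example pmin(Tenure, 72) on its own.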
Creating a new categorical variable
Categorical variables are created as follows:
- Start by creating a numeric variable: Anything > Data > Variables > New > Custom Code > R - Numeric and enter code in the R CODE box.
- Change the type to categorical with Object Inspector > Properties > INPUTS > Structure: Mutually exclusive categories (nominal).
- Labels and values can be modified by clicking on the various options in Object Inspector > Properties > DATA VALUES.
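For the first step, the R code needs to return numeric codes that can later be treated as categories. One possible sketch, assuming a numeric variable called Tenure and some hypothetical band boundaries, uses cut() from base R. The Tenure vector is defined here only to make the example self-contained.

```r
# Stand-in data: in Displayr, Tenure would be an existing variable.
Tenure <- c(2, 7, 13, 30, 61, NA)

# Hypothetical band boundaries, in the same units as Tenure
breaks <- c(0, 6, 12, 24, Inf)

# cut() assigns each value to an interval (lower bound exclusive, upper
# bound inclusive); as.integer() turns the interval into a code 1, 2, 3, ...
# Missing values stay missing.
tenure.band <- as.integer(cut(Tenure, breaks = breaks))
```

After changing the structure to Mutually exclusive categories (nominal), the codes 1 to 4 can be given descriptive labels in DATA VALUES.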
Missing value settings
To modify which values of a variable are treated as missing, select the variable and then press Object Inspector > Properties > DATA VALUES > Missing values.
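If you prefer to handle missing values in code when creating a new variable, a minimal sketch along the following lines may help. The codes -1 and 999 are hypothetical placeholders for whatever values represent missing data in your file, and Tenure is stand-in data.

```r
# Stand-in data: in Displayr, Tenure would be an existing variable.
Tenure <- c(4, 12, 999, 30, -1, 18)

# Hypothetical codes that stand for "missing" in the raw data
missing.codes <- c(-1, 999)

# Replace the missing codes with NA and keep all other values unchanged
tenure.clean <- ifelse(Tenure %in% missing.codes, NA, Tenure)
```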
Merging categories of categorical variables
Categories of categorical variables can be merged by dragging and dropping:
- Drag the variable from the Data Sets Tree onto the page. This will create a table.
- Click on the table and then click on one of the categories you wish to merge. When three grey lines appear to the right, click on them and drag the category onto another category to merge the two. Alternatively, hold Control or Shift to select multiple categories and merge them using Toolbar > Combine > As One Category.
Reordering categories of categorical variables
Categories can be reordered by clicking on them (see the previous section) and dragging them.
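Merging and reordering can also be expressed in R code, which may be useful when working in a custom code variable. The sketch below uses base R factors. The variable plan and its category names are made up for illustration, and this is not a description of what Displayr does internally when you drag and drop.

```r
# Hypothetical categorical data
plan <- factor(c("Basic", "Premium", "Standard", "Basic", "Plus", "Premium"))

# Merge: giving two levels the same label combines them into one category
merged <- plan
levels(merged)[levels(merged) %in% c("Plus", "Premium")] <- "Plus or Premium"

# Reorder: list the levels in the order they should appear
reordered <- factor(merged,
                    levels = c("Standard", "Basic", "Plus or Premium"))
```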
Displayr contains a large number of tools for feature extraction. For example:
- Principal components analysis (PCA), for extracting dimensions from numeric variables: Anything > Advanced Analysis > Dimension Reduction > Principal Components Analysis. Once the analysis has been run, select the output and then click Anything > Advanced Analysis > Dimension Reduction > Save Variable(s), which will add the variables to the data set.
- t-SNE, which is a highly nonlinear dimension reduction technique: Anything > Advanced Analysis > Dimension Reduction > t-SNE. Once the analysis has been run, select the output and then click Anything > Advanced Analysis > Dimension Reduction > Save Variable(s) > Components/Dimensions.
- Multiple correspondence analysis, for extracting dimensions from categorical variables: Anything > Advanced Analysis > Dimension Reduction > Multiple Correspondence Analysis. Once the analysis has been run, select the output and then click Anything > Advanced Analysis > Dimension Reduction > Save Variable(s).
- The various cluster analysis and latent class analysis tools in Anything > Advanced Analysis > Cluster.
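To give a sense of what extracting components involves, here is a minimal sketch of PCA using prcomp() from base R on simulated data. It is an illustration only: the data and variable names are invented, and the menu item above may use a different implementation and different defaults (for example, rotation).

```r
set.seed(1)

# Simulated numeric predictors; x2 is constructed to correlate with x1
n <- 100
x1 <- rnorm(n)
predictors <- data.frame(x1 = x1,
                         x2 = x1 + rnorm(n, sd = 0.5),
                         x3 = rnorm(n))

# Standardize the variables and extract principal components
pca <- prcomp(predictors, center = TRUE, scale. = TRUE)

# Component scores for the first two components, one row per case.
# These play the role of the saved variables used as new features.
components <- pca$x[, 1:2]
```

The columns of components are uncorrelated with each other, which is one reason they are sometimes used in place of the original, correlated predictors.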
You can do anything...
Displayr supports all the main R packages, so it can perform any feature engineering that you require. If you cannot figure out how to do something, please contact us.