How to Do Feature Engineering in Displayr

Feature engineering refers to the process of manipulating predictor variables (features) with the goal of improving a predictive model. In this post, we outline some of the key tools and processes for feature engineering in Displayr.

Requirements

Familiarity with the Structure and Value Attributes of Variable Sets.
A Displayr Document

Switching between categorical and numeric treatment of predictor variables

Perhaps the most fundamental form of feature engineering when building a predictive model is deciding whether to treat a particular predictor as categorical or numeric. In Displayr, the way that a variable is treated in a model is determined by its structure. Displayr has 15 different structures, but the two most relevant to predictive models are Numeric and Mutually exclusive categories (nominal); the latter means that the data is treated as categorical.
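To make the difference concrete, here is a minimal sketch in plain R, outside of any Displayr menu. The variables rating and outcome are made-up illustration data, not part of any Displayr data set. It fits the same predictor twice, once as a number and once as a set of categories, and compares how many coefficients each model estimates.

```r
# Hypothetical illustration data: a 1-5 rating and a numeric outcome.
set.seed(1)
rating  <- rep(1:5, each = 20)
outcome <- 2 + 0.5 * rating + rnorm(length(rating))

# Numeric treatment: a single slope, i.e. a linear effect of rating.
fit_numeric <- lm(outcome ~ rating)

# Categorical treatment: a separate coefficient for each category
# other than the first, which acts as the reference category.
fit_nominal <- lm(outcome ~ factor(rating))

length(coef(fit_numeric))  # intercept plus one slope
length(coef(fit_nominal))  # intercept plus four category coefficients
```

The categorical version is more flexible but uses more parameters, which is the trade-off behind choosing a structure.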

The structure of a variable is changed by:

Selecting the variable in the Data Sources tree.
Changing Data > Attributes > Structure in Properties.

Sometimes a variable will be grouped into a variable set with other variables. It can be split by right-clicking and selecting Split.

Creating a new numeric variable

There are many tools in Displayr for creating new variables. The most flexible approach is to hover over the Data Sources tree and select + > Custom Code > R > Numeric.

This allows you to create a new variable using the R language. For example, to create a new variable which is the natural logarithm of an existing variable, called Tenure, type log(Tenure) in the R Code editor. See Feature Engineering for Numeric Variables for examples of the code to do things like winsorize, cap, normalize, and calculate polynomials.
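The transformations mentioned above can be sketched in base R as follows. This is an illustration under assumptions rather than the code from the linked article: the Tenure values are invented so the snippet runs on its own, and the cap of 60 and the 5th/95th percentile limits are arbitrary choices. In Displayr, each expression would typically be entered on its own as the R code for a separate variable, with Tenure referring to the existing variable.

```r
# Hypothetical values; in Displayr, Tenure would be the existing variable.
Tenure <- c(1, 3, 6, 12, 24, 48, 72, 120)

# Natural logarithm, as described in the text.
log_tenure <- log(Tenure)

# Cap: limit values at a chosen maximum (60 is an arbitrary example).
capped <- pmin(Tenure, 60)

# Winsorize: pull values beyond the 5th and 95th percentiles back to them.
limits <- unname(quantile(Tenure, c(0.05, 0.95), na.rm = TRUE))
winsorized <- pmin(pmax(Tenure, limits[1]), limits[2])

# Normalize: rescale to mean 0 and standard deviation 1.
normalized <- (Tenure - mean(Tenure, na.rm = TRUE)) / sd(Tenure, na.rm = TRUE)

# Polynomial: a squared term.
tenure_squared <- Tenure^2
```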

Creating a new categorical variable

Categorical variables are created as follows:

Create a categorical variable by hovering over the Data Sources tree and clicking + > Custom Code > R > Nominal.
Enter the R code in the R Code editor.
Modify labels and values by clicking the Values & Labels button in Properties.
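As an example of the kind of R code that could be entered in the second step, the sketch below bins a numeric variable into three categories with base R's cut(). It assumes a variable called Tenure measured in months; the values, break points, and labels are all invented for illustration.

```r
# Hypothetical values (assumed to be months); in Displayr, Tenure would be
# the existing variable.
Tenure <- c(1, 3, 6, 12, 24, 48, 72, 120)

# Bin into three bands. With the default right = TRUE, each band includes
# its upper break point, so 12 falls in the first band.
tenure_band <- cut(Tenure,
                   breaks = c(0, 12, 36, Inf),
                   labels = c("Up to 1 year", "1 to 3 years", "More than 3 years"))

table(tenure_band)
```

The result is an R factor, which is the natural R representation of a categorical variable.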

Missing value settings

To modify which values of a variable are treated as missing, select the variable and then, in Properties, click Data > Attributes > Values & Labels, Values, or Categories, depending on the variable's structure.
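If you are already creating a variable with R code, the same effect can be approximated in the code itself by recoding special values to NA. The sketch below is an alternative to the Properties route described above, not a description of what that setting does internally. The Income variable and the codes 99 and -1 are hypothetical.

```r
# Hypothetical data where 99 means "Don't know" and -1 means "Refused".
Income <- c(35, 52, 99, 41, -1, 67)

# Recode the special codes to NA so they are excluded from calculations.
income_clean <- ifelse(Income %in% c(99, -1), NA, Income)

# Average of the remaining valid values.
mean(income_clean, na.rm = TRUE)
```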

Merging categories of categorical variables

To merge categories of a categorical variable:

Drag the variable from the Data Sources tree onto the page. This will create a summary table.
Click on one of the categories you wish to merge. When six black dots appear, click on them and drag the category onto another category to merge them. Alternatively, you can use control or shift to select multiple categories, right-click, and select Combine.
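For variables created with R code, categories can also be merged in the code. The following is a minimal base R sketch, offered as an alternative to the drag-and-drop steps above. The region variable and its categories are invented for illustration.

```r
# Hypothetical categorical variable.
region <- factor(c("North", "South", "East", "West", "North", "East"))

# Assigning a named list to levels() maps each new category name
# to the existing categories it should absorb.
merged <- region
levels(merged) <- list("North/South" = c("North", "South"),
                       "East/West"   = c("East", "West"))

table(merged)
```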

Reordering categories of categorical variables

To reorder categories of a categorical variable:

Drag the variable from the Data Sources tree onto the page. This will create a summary table.
Click on the category you wish to move. When six black dots appear to the right, you can click on them, drag the category into position, and drop when the hover text reads Move.
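The equivalent operation in R code is to rebuild the factor with an explicit level order. This sketch uses a made-up satisfaction variable. In R's regression functions the first level acts as the reference category by default, which is one reason category order can matter for a model.

```r
# Hypothetical categorical variable; R orders levels alphabetically by default.
satisfaction <- factor(c("Low", "High", "Medium", "Low", "High"))
levels(satisfaction)

# Rebuild the factor with the levels in a meaningful order.
# The data values themselves are unchanged.
reordered <- factor(satisfaction, levels = c("Low", "Medium", "High"))
levels(reordered)
```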

Feature extraction

Displayr contains a large number of tools for feature extraction. For example:

Principal components analysis (PCA), for extracting dimensions from numeric variables: from either the toolbar > Anything or the + menu in the Report tree, select Advanced Analysis > Dimension Reduction > Principal Components Analysis. Once the analysis has been run, select it, and from Properties, click Save Variable(s) > Components/Dimensions, which will add the variables to the data set.
t-SNE, which is a highly nonlinear dimension reduction technique: from either the toolbar > Anything or the + menu in the Report tree, select Advanced Analysis > Dimension Reduction > t-SNE. Once the analysis has been run, select the output and then click Anything > Advanced Analysis > Dimension Reduction > Save Variable(s) > Components/Dimensions.
Multiple correspondence analysis, for extracting dimensions from categorical variables: from either the toolbar > Anything or the + menu in the Report tree, select Advanced Analysis > Dimension Reduction > Multiple Correspondence Analysis. Once the analysis has been run, select it, and from Properties, click Save Variable(s) > Components/Dimensions, which will add the variables to the data set.
Cluster analysis and latent class analysis: from either the toolbar > Anything or the + menu in the Report tree, select Advanced Analysis > Cluster.
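To show what extracting components means in practice, here is a small sketch using base R's prcomp(). It is not necessarily the routine that the Displayr menu item runs. The ratings data frame and its columns q1 to q4 are simulated purely for illustration.

```r
# Simulated data: q1 and q2 share a common underlying dimension,
# q3 and q4 are independent noise.
set.seed(42)
n <- 100
latent <- rnorm(n)
ratings <- data.frame(
  q1 = latent + rnorm(n, sd = 0.5),
  q2 = latent + rnorm(n, sd = 0.5),
  q3 = rnorm(n),
  q4 = rnorm(n)
)

# Standardize the variables and extract principal components.
pca <- prcomp(ratings, center = TRUE, scale. = TRUE)

# Proportion of variance explained by each component.
summary(pca)$importance

# Component scores for the first two components: one row per case.
# These are the new features that could be used as predictors.
components <- pca$x[, 1:2]
```

The component scores are uncorrelated with each other by construction, which is part of their appeal as predictors when the original variables are highly correlated.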

You can do anything...

Displayr supports all the main R packages, so it can perform any feature engineering that you require. If you cannot figure out how to do something, please contact us.

How to Do Principal Component Analysis in Displayr

How to Create a Principal Component Analysis Biplot

How to Create a Dimension Reduction Scatterplot

How to Create a Component Plot from a Principal Component Analysis

How to Create a Goodness of Fit Plot from a Dimension Reduction Output

How to Create a Scree Plot from a Principal Component Analysis

How to Do Multidimensional Scaling

How to Create a Distance Matrix

How to Perform t-SNE in Displayr