Feature engineering refers to the process of manipulating predictor variables (features) with the goal of improving a predictive model. In this post, we outline some of the key tools and processes for feature engineering in Displayr.
Requirements
- Familiarity with the Structure and Value Attributes of Variable Sets.
- A Displayr document.
Please note these steps require a Displayr license.
Switching between categorical and numeric treatment of predictor variables
Perhaps the most fundamental feature engineering decision when building a predictive model is whether to treat a particular predictor as categorical or numeric. In Displayr, the way a variable is treated in a model is determined by its structure. Displayr has 15 different structures, but the two most relevant to predictive models are Numeric and Mutually exclusive categories (nominal). The latter means that the data is treated as categorical.
The structure of a variable is changed by:
- Selecting the variable in the Data Sources tree.
- Changing Data > Properties > Structure in the object inspector.
Sometimes a variable will be grouped into a variable set with other variables. In that case, split the variable set first by right-clicking it and selecting Split.
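To see why this choice matters, the following is a minimal sketch in plain R, not Displayr's own modeling code. It assumes a hypothetical data frame `d` with a predictor `plan` coded 1 to 3 and an outcome `spend`. It fits the same linear model twice, once with the predictor treated as numeric and once as categorical.

```r
# Hypothetical data: a predictor coded 1-3 and a numeric outcome.
d <- data.frame(
  plan  = c(1, 2, 3, 1, 2, 3, 1, 2),
  spend = c(10, 14, 30, 11, 15, 29, 9, 16)
)

# Numeric treatment: the model estimates a single slope for plan.
fit_numeric <- lm(spend ~ plan, data = d)

# Categorical treatment: the model estimates a separate effect for
# each category other than the reference category.
fit_categorical <- lm(spend ~ factor(plan), data = d)

n_coef_numeric     <- length(coef(fit_numeric))      # intercept + 1 slope
n_coef_categorical <- length(coef(fit_categorical))  # intercept + 2 category effects
```

The numeric version assumes each step up in `plan` changes `spend` by the same amount. The categorical version drops that assumption, at the cost of estimating more parameters.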
Creating a new numeric variable
There are many tools in Displayr for creating new variables. The most flexible approach is to:
- Select Anything > Data > Variables > New > Custom Code > R - Numeric
This allows you to create a new variable using the R language. For example, to create a new variable which is the natural logarithm of an existing variable, called Tenure, type log(Tenure) in the R Code editor. See Feature Engineering for Numeric Variables for examples of the code to do things like winsorize, cap, normalize, and calculate polynomials.
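The transformations mentioned above can be sketched in R as follows. This is an illustration only: the `Tenure` vector is hypothetical stand-in data so that the code runs outside Displayr, and the cap of 60 and the 5th/95th percentile limits are arbitrary choices. In Displayr you would presumably refer to the existing variable by name and put one transformation in each new R - Numeric variable.

```r
# Hypothetical stand-in for a Displayr variable called Tenure.
Tenure <- c(1, 3, 5, 8, 12, 24, 36, 120)

# Natural logarithm (the example from the text).
log_tenure <- log(Tenure)

# Cap: replace values above an arbitrary ceiling of 60 with 60.
capped <- pmin(Tenure, 60)

# Winsorize: pull values outside the 5th and 95th percentiles
# back to those percentiles.
limits <- unname(quantile(Tenure, probs = c(0.05, 0.95), na.rm = TRUE))
winsorized <- pmin(pmax(Tenure, limits[1]), limits[2])

# Normalize (standardize): subtract the mean, divide by the standard deviation.
standardized <- (Tenure - mean(Tenure, na.rm = TRUE)) / sd(Tenure, na.rm = TRUE)

# Polynomial term: the square of Tenure, for use alongside Tenure itself.
tenure_squared <- Tenure^2
```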
Creating a new categorical variable
Categorical variables are created as follows:
- Create a numeric variable by going to Anything > Data > Variables > New > Custom Code > R - Numeric and entering code in the R Code box.
- Change the structure to categorical from the object inspector > Data > Properties > Structure > Nominal: Mutually exclusive categories.
- Labels and values can be modified by clicking on the Labels and/or Values buttons in the object inspector.
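As an example of the code for the first step, the sketch below bins a numeric variable into three numeric codes, which could then be given the categorical structure and labeled. The `Tenure` values and the cut points (6 and 24) are hypothetical.

```r
# Hypothetical stand-in for a Displayr variable called Tenure.
Tenure <- c(1, 3, 5, 8, 12, 24, 36, 120)

# Assign a numeric code to each band:
#   1 = 6 or less, 2 = more than 6 and up to 24, 3 = more than 24.
# labels = FALSE makes cut() return integer codes rather than a factor.
tenure_band <- as.numeric(
  cut(Tenure, breaks = c(-Inf, 6, 24, Inf), labels = FALSE)
)
```

Returning numeric codes keeps the result compatible with an R - Numeric variable. The labels for codes 1, 2 and 3 would then be set with the Labels button described above.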
Missing value settings
To modify which values of a variable are treated as missing, select the variable and then from the object inspector, click Data > Properties > Missing values.
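If you prefer to handle missing values in code when creating a new variable, a rough R equivalent is to recode the relevant values to `NA`. The sketch below assumes a hypothetical `Income` variable in which -99 and 9999999 are codes for "don't know" and "refused". It does not describe what the Missing values setting does internally.

```r
# Hypothetical stand-in data, with -99 and 9999999 used as missing codes.
Income <- c(52000, 61000, -99, 48000, 9999999, 75000)

# Values to be treated as missing (an assumption for this example).
missing_codes <- c(-99, 9999999)

# Replace the missing codes with NA and leave other values unchanged.
income_clean <- ifelse(Income %in% missing_codes, NA, Income)

# Summaries then need na.rm = TRUE to skip the missing values.
mean_income <- mean(income_clean, na.rm = TRUE)
```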
Merging categories of categorical variables
To merge categories of a categorical variable:
- Drag the variable from the Data Sources tree onto the page. This will create a summary table.
- Click on one of the categories you wish to merge. When six black dots appear to the right, you can click on them and drag the category onto another category to merge them. Alternatively, you can use control or shift to select multiple categories, right-click, and select Combine.
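Merging can also be done in code when creating a new variable. The following is a minimal sketch in plain R, using a hypothetical `region` factor. It is an alternative to the drag-and-drop steps above, not a description of what they do internally.

```r
# Hypothetical categorical variable with four categories.
region <- factor(c("North", "South", "East", "West", "North", "East"))

# Copy the variable, then map the old categories onto new, merged ones.
merged <- region
levels(merged) <- list(
  "North/South" = c("North", "South"),
  "East/West"   = c("East", "West")
)

# Count the cases in each merged category.
merged_counts <- table(merged)
```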
Reordering categories of categorical variables
To reorder categories of a categorical variable:
- Drag the variable from the Data Sources tree onto the page. This will create a summary table.
- Click on the category you wish to move. When six black dots appear to the right, you can click on them, drag the category into position, and drop when the hover text reads Move.
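For comparison, this is how the same idea looks in plain R, using a hypothetical `satisfaction` factor. R orders the categories of a new factor alphabetically by default, so a meaningful order has to be set explicitly.

```r
# Hypothetical categorical variable. By default the levels are sorted
# alphabetically: "High", "Low", "Medium".
satisfaction <- factor(c("Low", "High", "Medium", "Low", "High"))

# Rebuild the factor with the levels in the desired order.
# The data values are unchanged; only the category order differs.
reordered <- factor(satisfaction, levels = c("Low", "Medium", "High"))
```

In R's modeling functions, the first level of an unordered factor is the reference category by default, so the order can affect how model coefficients are interpreted.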
Feature extraction
Displayr contains a large number of tools for feature extraction. For example:
- Principal components analysis (PCA), for extracting dimensions from numeric variables: Anything > Advanced Analysis > Dimension Reduction > Principal Components Analysis. Once the analysis has been run, select it, and from the object inspector, click Save Variable(s) > Components/Dimensions, which will add the variables to the data set.
- t-SNE, which is a highly nonlinear dimension reduction technique: Anything > Advanced Analysis > Dimension Reduction > t-SNE. Once the analysis has been run, select the output and then click Anything > Advanced Analysis > Dimension Reduction > Save Variable(s) > Components/Dimensions.
- Multiple correspondence analysis, for extracting dimensions from categorical variables: Anything > Advanced Analysis > Dimension Reduction > Multiple Correspondence Analysis. Once the analysis has been run, select it, and from the object inspector, click Save Variable(s) > Components/Dimensions, which will add the variables to the data set.
- The various cluster analysis and latent class analysis tools can be found in Anything > Advanced Analysis > Cluster.
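To illustrate what saving components as variables means, here is a minimal sketch of PCA using base R's `prcomp`. The data is simulated and the variable names are hypothetical. Displayr's Principal Components Analysis may use different defaults (for example, rotation), so this is not a reproduction of its output.

```r
# Simulated data: x1 and x2 are strongly correlated, x3 is unrelated.
set.seed(1)
n  <- 100
x1 <- rnorm(n)
d  <- data.frame(
  x1 = x1,
  x2 = x1 + rnorm(n, sd = 0.3),
  x3 = rnorm(n)
)

# PCA on centered and scaled variables.
pca <- prcomp(d, center = TRUE, scale. = TRUE)

# Component scores: one row per case, one column per component.
# The first two columns could be used as new predictor variables.
scores <- pca$x[, 1:2]

# Proportion of variance explained by each component.
variance_explained <- pca$sdev^2 / sum(pca$sdev^2)
```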
You can do anything...
Displayr supports all the main R packages, so feature engineering that is not covered by the tools above can be done with custom R code. If you cannot figure out how to do something, please contact us.
Next
How to Do Principal Component Analysis in Displayr
How to Create a Principal Component Analysis Biplot
How to Create a Dimension Reduction Scatterplot
How to Create a Component Plot from a Principal Component Analysis
How to Create a Goodness of Fit Plot from a Dimension Reduction Output
How to Create a Scree Plot from a Principal Component Analysis
How to Do Multidimensional Scaling