Introduction
This article describes how to pull data from a Wikipedia table using the rvest R package.
In this example, we will pull the latest life expectancy figures for each US state/territory and visualize them on a geographic map.
Requirements
- A Wikipedia table.
- A Displayr document.
Method
1. Select Calculation > Custom Code.
2. Paste the code below into the R CODE field of the object inspector.
library(rvest)
# Reading in the table from Wikipedia
page = read_html("https://en.wikipedia.org/wiki/List_of_U.S._states_by_life_expectancy")
# Select the first element on the page that has the CSS class "wikitable"
my.table = html_node(page, ".wikitable")
# Convert the html table element into a data frame
my.table = html_table(my.table, fill = TRUE)
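# Extract and tidy a single column from the table and name each value by state/territory.
# The column positions (2 and 4) depend on the table layout on the Wikipedia page.
# [[ ]] returns a plain vector whether html_table gives a data frame or a tibble.
x = as.numeric(gsub("\\[.*", "", my.table[[4]]))
names(x) = gsub("\\[.*", "", my.table[[2]])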
# Excluding non-states and averages from the table
life.expectancy = x[!names(x) %in% c("United States", "Northern Mariana Islands", "Guam", "American Samoa", "Puerto Rico", "U.S. Virgin Islands")]
In this example, read_html downloads the web page and html_node selects the first element with the "wikitable" class. html_table then converts that element into a data frame, from which we extract the column holding the latest figures and name each value after its state/territory. The gsub calls strip any footnote references, e.g. [1], and the final line removes the non-state rows and the national average from the result.
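The tidying steps can also be tried without a web connection. The following is a minimal sketch on a small hand-made data frame: the name toy.table and all of its values are made up for illustration and are not actual Wikipedia figures. It only shows how the gsub pattern and the exclusion list behave.

```r
# Hypothetical rows imitating scraped table cells (all values are made up)
toy.table = data.frame(rank  = c("1", "2", "3"),
                       state = c("Hawaii[a]", "Guam", "United States"),
                       other = c("x", "y", "z"),
                       value = c("80.7[1]", "79.9", "77.5[2]"),
                       stringsAsFactors = FALSE)
# Remove everything from the first "[" onward, then convert to numeric
x = as.numeric(gsub("\\[.*", "", toy.table[[4]]))
# Apply the same clean-up to the names column
names(x) = gsub("\\[.*", "", toy.table[[2]])
# Drop the rows that are not states
toy.result = x[!names(x) %in% c("United States", "Guam")]
print(toy.result)
```

Only the Hawaii entry should remain, with its footnote markers removed from both the name and the value.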
3. To create the geographic map, go to Visualization > Geographic Maps > Geographic Map.
4. Under Inputs > DATA SOURCE > Output in 'Pages', choose the life expectancy R output created in step 2.
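If some states appear blank on the map, one possible cause is a name that did not come through the scrape cleanly. The sketch below shows one way to check this, assuming the map should cover the 50 states plus the District of Columbia. The vector example.values is a hypothetical stand-in with made-up values; in practice the same check could be run on the life.expectancy output.

```r
# Hypothetical stand-in for the scraped output (made-up values);
# "Hawai'i" is a deliberately misspelled name to show what the check catches
example.values = c("Hawaii" = 80.7, "District of Columbia" = 78.0, "Hawai'i" = 80.7)
# state.name is a built-in R dataset containing the 50 state names
expected.names = c(state.name, "District of Columbia")
# Names in the data that are not recognized state names
unmatched = setdiff(names(example.values), expected.names)
# Expected names that have no value in the data
missing.states = setdiff(expected.names, names(example.values))
# Values that failed the numeric conversion
failed.conversion = names(example.values)[is.na(example.values)]
print(unmatched)
```

An empty unmatched vector and an empty missing.states vector would suggest that every expected state has exactly one correctly named value.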
See Also
How to Create a Geographic Map
How to Set Initial Zoom and Position of Geographic Maps