This article describes how to pull data from a Wikipedia table using the rvest R package.
In this example, we will pull the latest life expectancy figures per US state/territory and visualize them as a geographic map.
Requirements
- A Wikipedia table.
- A Displayr document.
Method
1. Select Calculation > Custom Code from the toolbar.
2. Click on the page to place the object.
3. Paste the code below into the R Code editor.
library(rvest)
# Reading in the table from Wikipedia
page = read_html("https://en.wikipedia.org/wiki/List_of_U.S._states_by_life_expectancy")
# Select the first element on the page that has the CSS class "wikitable"
my.table = html_node(page, ".wikitable")
# Convert the html table element into a data frame
my.table = html_table(my.table, fill = TRUE)
# Extract and tidy a single column from the table, then name each value by state/territory.
# [[ ]] returns a plain vector whether html_table gives a data frame or a tibble.
x = as.numeric(gsub("\\[.*","",my.table[[4]]))
names(x) = gsub("\\[.*","",my.table[[2]])
# Excluding non-states and averages from the table
life.expectancy = x[!names(x) %in% c("United States", "Northern Mariana Islands", "Guam", "American Samoa", "Puerto Rico", "U.S. Virgin Islands")]
In this example, read_html downloads the web page and html_node selects the first element with the "wikitable" class, which html_table then converts into a data frame. The next lines take the latest figures from the relevant table column and name each value by its state/territory, stripping any footnote references, e.g. [1], along the way. Finally, the non-states and the national average are removed from the result.
4. To create the geographic map, go to Visualization > Geographic Maps > Geographic Map from the toolbar.
5. Under Data > Data Source > Data, choose the life expectancy R output.
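The tidying logic in the R code above can be tried without a network connection. The following is a minimal sketch in base R, assuming a small stand-in table: the sample.table object, its column names, and all of its values are invented placeholders for illustration, not real life expectancy figures.

```r
# Hypothetical stand-in for the scraped table (values are made-up placeholders)
sample.table = data.frame(
    rank = c("1", "2", "3", "4"),
    state = c("Hawaii[a]", "Puerto Rico", "United States", "Texas"),
    previous = c("80.0", "79.0", "78.0", "77.0"),
    latest = c("80.5[1]", "79.5", "78.5[2]", "77.5"),
    stringsAsFactors = FALSE
)

# Remove everything from the first "[" onwards, then convert to numeric
x = as.numeric(gsub("\\[.*", "", sample.table[[4]]))

# Name each value by its state/territory, with footnote markers removed
names(x) = gsub("\\[.*", "", sample.table[[2]])

# Drop rows that are not states (shortened exclusion list for this sample)
excluded = c("United States", "Puerto Rico")
life.expectancy = x[!names(x) %in% excluded]

print(life.expectancy)
```

With this sample, life.expectancy is a named numeric vector containing only Hawaii and Texas. If the Wikipedia table layout changes, the column positions (4 and 2 here) are the part most likely to need adjusting.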