I recently authored a book chapter titled “Twitter”, which will be appearing in the Handbook on Geographies of the Internet edited by Barney Warf. For this chapter I created a map of 10,000 tweets in Europe symbolized by language. James Cheshire’s work on Twitter Tongues was the initial inspiration for this approach.
Most of my visualizations with spatial data are optimized for the web using R
openlayers, but since this work will be appearing
in print it required a different approach. During my dissertation I used the R
package cartography, but based on the number of points required for this map I
decided ggplot2 would be a better choice.
I’ve created many non-spatial visualizations with
ggplot2 before, but to my
surprise the experience of visualizing spatial data with it was not much
different. It took a while to figure out a few elements – especially the legend
– but other than that it was pretty straightforward. One unexpected twist,
however, was the requirement that the map not use colored points for the
languages since the book will be printed in black and white. But thanks to an
editorial assistant’s suggestion, her patience with my stubbornness, and
ggplot2’s numerous plotting options, this was resolved quite easily.
I collected this data over the course of several days and then sampled 10,000
points from the resulting dataset. For this, I use the Python module
and the NoSQL system ElasticSearch. After extracting the tweets, the remainder
of the analysis/mapping was completed in R.
First, I attach packages and load the data. I use the rnaturalearth
package for country borders and create a vector of the language codes of
interest. Some languages are excluded due to sparseness. Others, like Finnish, are
excluded due to the ill effect of bot accounts producing unnatural (and
unsightly) patterns on the map.
library(sf)
library(here)
library(dplyr)
library(ggplot2)
library(rnaturalearth)
library(rnaturalearthdata)
library(magrittr)

## get world polygon for background
world <- ne_countries(scale = "medium", returnclass = "sf")

## create variable of languages of interest
langs <- c("es", "pt", "de", "it", "fr", "nl", "en", "ru", "tr", "sv")

## read tweets, filter by those with the languages of interest
tweets <- st_read(here("data/europe-tweets-10k.geojson"),
                  stringsAsFactors = FALSE) %>%
  filter(lang %in% langs)
Filtering by the desired languages leaves 9,313 tweets in the dataset, so
about 93% use one of the ten languages of interest. Setting the argument
stringsAsFactors = FALSE saves a step later, since it keeps
lang as a
character instead of a factor for plotting as a categorical variable.
Next, since the variable
tweets is an
sf object, the lat/lng data is not
present outside of the
geometry column. In this case, it’s easiest to get them
as separate variables to use as the x/y locations for mapping later.
## get x/y as variables
lng <- tweets %>% st_coordinates() %>% extract(, 1)
lat <- tweets %>% st_coordinates() %>% extract(, 2)
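To see what st_coordinates() hands back, here is a minimal sketch on a made-up three-point sf object. The toy data frame and its values are assumptions for illustration, not the tweet data. The result is a plain matrix with X and Y columns, which is why taking columns 1 and 2 gives the longitude and latitude.

```r
library(sf)

## toy stand-in for the tweets (hypothetical values, not real data)
toy <- data.frame(
  lang = c("es", "de", "fr"),
  lng  = c(-3.70, 13.40, 2.35),
  lat  = c(40.42, 52.52, 48.86),
  stringsAsFactors = FALSE
)

## build an sf object with point geometry in WGS84
toy_sf <- st_as_sf(toy, coords = c("lng", "lat"), crs = 4326)

## st_coordinates() returns a matrix with columns "X" and "Y"
xy <- st_coordinates(toy_sf)

## base R indexing, equivalent to magrittr's extract(, 1) and extract(, 2)
toy_lng <- xy[, 1]
toy_lat <- xy[, 2]
```

Plain bracket indexing is used here only so the sketch does not depend on magrittr; the piped version does the same thing.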
The next chunk will show the construction of the map object in its entirety. Below, I’ll explain each element individually.
## create map object
map.color <- ggplot(data = world) +
  theme(text = element_text(family = "Source Code Pro"),
        legend.text = element_text(size = 16),
        legend.title = element_text(size = 20)) +
  geom_sf(fill = "white") +
  coord_sf(xlim = c(-24, 50), ylim = c(30, max(lat)), expand = FALSE) +
  geom_point(data = tweets,
             aes(x = lng, y = lat, color = tweets$lang),
             size = 3, alpha = 1/2) +
  xlab("") + ylab("") +
  scale_color_discrete(name = "Language",
                       labels = c("German", "English", "Spanish", "French",
                                  "Italian", "Dutch", "Portuguese", "Russian",
                                  "Swedish", "Turkish"))

## display map
map.color
ggplot(data = world): this tells ggplot2 to use the
world polygons as the primary plotting object.
theme(text = ...): I have some weird locale issue on my home computer (I think that’s the issue at least) that prevents the default font from rendering on plots, so I specify the font explicitly here. Plus, Source Code Pro looks awesome anyway. I use the other arguments to adjust the legend text size up a bit.
geom_sf(fill = "white"): this is what adds the
world polygons to the map. Here,
world is inferred since it is the
data argument passed to the ggplot() call.
coord_sf(xlim = ...): this is used to set the view. Originally I was using the minimum and maximum of lng and
lat, but this resulted in a bit too much empty space. So I set three out of the four limits manually.
geom_point(data = ...): The points are added to the map with this function. By setting the color aesthetic to
tweets$lang, it will automatically symbolize the points by this variable.
xlab(""): Remove the “lng” label from the x axis.
ylab(""): Remove the “lat” label from the y axis.
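scale_color_discrete(name = ...): Create the legend with manually defined labels. The label order does not need to match the order of the
langs vector defined earlier. ggplot2 orders the values of a character variable alphabetically on a discrete scale, so the labels must follow the sorted language codes (de, en, es, fr, it, nl, pt, ru, sv, tr), as the labels in the code do. The unique values in the
tweets$lang column can be easily retrieved with a call such as unique(tweets$lang), which returns them in order of appearance rather than in sorted order:
##  "es" "de" "en" "it" "nl" "sv" "fr" "tr" "pt" "ru"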
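Since label order is easy to get wrong, here is a small base R sketch of one way to derive the labels from the codes rather than typing them in order by hand. The lookup vector lang_names is my own illustrative helper, not part of the original code, and the sketch relies on ggplot2 ordering a character variable alphabetically on a discrete scale.

```r
## codes in order of appearance, as printed above
codes <- c("es", "de", "en", "it", "nl", "sv", "fr", "tr", "pt", "ru")

## hypothetical lookup table from language code to display name
lang_names <- c(de = "German", en = "English", es = "Spanish",
                fr = "French", it = "Italian", nl = "Dutch",
                pt = "Portuguese", ru = "Russian", sv = "Swedish",
                tr = "Turkish")

## a discrete scale on a character column uses the sorted unique values
scale_order <- sort(unique(codes))

## labels lined up with that order, suitable for the labels argument
legend_labels <- unname(lang_names[scale_order])
legend_labels
```

The resulting vector could be passed as labels = legend_labels, which keeps the legend correct even if the set of languages changes.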
I’m not crazy about the color scheme, but after trying various options from
wesanderson it was
clear that I was not going to be satisfied without defining my own colors
manually, so I decided to accept
ggplot2’s defaults and move on.
One confusing thing about
ggplot2 is that you can include some items in the
ggplot chain, and they will neither have a visible effect nor trigger an error. For example, I
originally tried to use
guides(fill=guide_legend(title="New Legend Title")) to
create the legend title. Because the points are mapped to color rather than fill, that call had nothing to act on, and I needed to set the title in scale_color_discrete()
instead. In fact, that line remained in the code until I decided to write this post.
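The silent failure is easier to see on a toy plot. In this sketch (the data frame df is made up for illustration), the points are mapped to color, so a guide for fill has nothing to act on, while the same call on color retitles the legend.

```r
library(ggplot2)

## made-up data, only for illustration
df <- data.frame(x = c(1, 2, 3), y = c(3, 1, 2),
                 lang = c("de", "en", "es"),
                 stringsAsFactors = FALSE)

base <- ggplot(df, aes(x = x, y = y, color = lang)) +
  geom_point(size = 3)

## nothing is mapped to fill, so this guide has nothing to retitle
p_fill <- base + guides(fill = guide_legend(title = "Language"))

## retitles the legend, because the points use the color aesthetic
p_color <- base + guides(color = guide_legend(title = "Language"))
```

Printing p_fill and p_color one after the other shows the difference. labs(color = "Language") is another way to set the same title.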
Back to the drawing (coding) board
After sending this map off to the publisher feeling very accomplished, the editorial assistant responded with concerns that the map would not reproduce well in grayscale. Knowing this would be the case – not that I expected it to be printed any other way – I stated why it was not possible to produce this map without color: (1) using a 10 class grayscale color scheme is not really feasible and (2) the density of the points would ruin any possibility of distinguishing languages from one another. For example, if Turkish tweets were light gray and German tweets were dark gray, three overlapping Turkish tweets would not be distinguishable from one German tweet. My suggestion was to take the map as is, knowing that people could find the color figure online.
Feeling very confident that I was right and that the publisher would listen to
me, I was frustrated to hear that the map was not acceptable. The editorial
assistant then suggested that I use unique icons for the tweets instead, and I
remembered back to my grad school days when I tried darn near every
shape ggplot2 offers just for the fun of it. For the record, there are 26 (numbered 0 through 25),
with 15 of those (0 through 14) being hollow or line-based – more than enough for my ten languages.
I then altered the code a bit to produce this grayscale friendly map. Below the code I highlight the major changes.
map.gray <- ggplot(data = world) +
  theme(text = element_text(family = "Source Code Pro"),
        legend.text = element_text(size = 16),
        legend.title = element_text(size = 20)) +
  geom_sf(fill = "white") +
  coord_sf(xlim = c(-24, 50), ylim = c(30, max(lat)), expand = FALSE) +
  geom_point(data = tweets,
             aes(x = lng, y = lat, shape = tweets$lang),
             size = 2, alpha = 2/5) +
  xlab("") + ylab("") +
  scale_shape_manual(name = "Language", values = 1:10,
                     labels = c("German", "English", "Spanish", "French",
                                "Italian", "Dutch", "Portuguese", "Russian",
                                "Swedish", "Turkish"))

## display map
map.gray
geom_point(..., shape = tweets$lang): Here, I symbolize the tweets by language using
shape rather than color – a pretty simple switch. I also adjusted the
alpha value to make the points more transparent and altered the
size parameter as well.
scale_shape_manual(...): Since the map is no longer in color, the
scale_color_discrete function will no longer suffice. Other arguments remain the same, with one addition: the values argument, which assigns shapes 1 through 10 to the ten languages.
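To preview which symbols values = 1:10 assigns, a small standalone sketch like the following can help. It builds a plot with one made-up point per language code (the data frame demo is hypothetical), so each shape appears exactly once alongside its legend entry.

```r
library(ggplot2)

## one hypothetical point per language code, in the order ggplot2 sorts them
demo <- data.frame(
  lang = c("de", "en", "es", "fr", "it", "nl", "pt", "ru", "sv", "tr"),
  x = 1:10,
  y = 1,
  stringsAsFactors = FALSE
)

## shapes 1 through 10 are assigned to the sorted codes in turn
shape.demo <- ggplot(demo, aes(x = x, y = y, shape = lang)) +
  geom_point(size = 4) +
  scale_shape_manual(name = "Language", values = 1:10)
```

Printing shape.demo draws the preview. Swapping in a different vector for values is a quick way to audition other symbol sets before rendering the full map.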
After creating the map,
ggsave can be used to save the last plot.
## save
ggsave(filename = here("img/europe-tweets-gray.jpg"),
       dpi = 400, width = 18.04, height = 20)
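For a sense of what these settings produce: ggsave() treats width and height as inches by default, so the pixel dimensions are simply inches times dpi. A quick check of the arithmetic, using only the values from the call above:

```r
## ggsave() settings used above
dpi <- 400
width_in <- 18.04
height_in <- 20

## pixels = inches * dots per inch
width_px <- width_in * dpi
height_px <- height_in * dpi

c(width = width_px, height = height_px)
```

That works out to an image of roughly 7216 by 8000 pixels.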
In the end, I feel like the grayscale map offers a good balance between
depicting the data in a meaningful way and dealing with realistic limitations of
printing. I had to apologize to the editorial assistant for being a bit
stubborn; her suggestion was a good one, and it forced me to learn something new
and cool about
ggplot2. In fact, the grayscale map shows some patterns that are
not really discernible with the color version, and in the end both suffer from
an overabundance of points in some locations. That said, if I had to choose
between one or the other, color is better when it’s available. Had this been
produced for a web page though, I would have used a different approach