I recently authored a book chapter titled “Twitter”, which will be appearing in the Handbook on Geographies of the Internet edited by Barney Warf. For this chapter I created a map of 10,000 tweets in Europe symbolized by language. James Cheshire’s work on Twitter Tongues was the initial inspiration for this approach.
Most of my visualizations with spatial data are optimized for the web using R packages like leaflet or openlayers, but since this work will be appearing in print it required a different approach. During my dissertation I used the R package cartography, but based on the number of points required for this map I decided ggplot2 would be a better choice.
I’ve created many non-spatial visualizations with ggplot2 before, but to my surprise the experience of visualizing spatial data with it was not much different. It took a while to figure out a few elements, especially the legend, but other than that it was pretty straightforward. One unexpected twist, however, was the requirement that the map not use colored points for the languages, since the book will be printed in black and white. But thanks to an editorial assistant’s suggestion, her patience with my stubbornness, and ggplot2’s numerous plotting options, this was resolved quite easily.
I collected this data over the course of several days and then sampled 10,000 points from the resulting dataset. For this, I used the Python module tweepy and the NoSQL system ElasticSearch. After extracting the tweets, the remainder of the analysis and mapping was completed in R.
First, I attach packages and load the data. I use the rnaturalearthdata package for country borders and create a vector of the language codes of interest. Some languages are excluded due to sparseness. Others, like Finnish, are excluded because bot accounts produce unnatural (and unsightly) patterns on the map.
library(sf)
library(here)
library(dplyr)
library(ggplot2)
library(rnaturalearth)
library(rnaturalearthdata)
library(magrittr)
## get world polygon for background
world <- ne_countries(scale = "medium", returnclass = "sf")
## create variable of languages of interest
langs <- c("es", "pt", "de", "it", "fr", "nl", "en", "ru", "tr", "sv")
## read tweets, filter by those with the languages of interest
tweets <- st_read(here("data/europe-tweets-10k.geojson"),
                  stringsAsFactors = FALSE) %>%
  filter(lang %in% langs)
Filtering by the desired languages leaves 9,313 tweets in the dataset, so about 93% use one of the ten languages of interest. Setting the argument stringsAsFactors to FALSE saves a step later, since it keeps lang as a character vector rather than a factor when plotting by this categorical variable.
Next, since the variable tweets is an sf object, the lat/lng data is not present outside of the geometry column. In this case, it’s easiest to extract the coordinates as separate variables to use as the x/y locations for mapping later.
## get x/y as variables
lng <- tweets %>%
  st_coordinates() %>%
  extract(, 1)
lat <- tweets %>%
  st_coordinates() %>%
  extract(, 2)
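As an aside, the same values can be pulled with a single call to st_coordinates(), which for point geometries returns a matrix with columns named X and Y. The sketch below is only an illustration: toy_pts and its three coordinates are made up and stand in for the tweet data.

```r
library(sf)

## toy points standing in for the tweets (hypothetical values)
toy_pts <- st_as_sf(
  data.frame(lang = c("es", "de", "en"),
             lng  = c(-3.70, 13.40, -0.13),
             lat  = c(40.42, 52.52, 51.51)),
  coords = c("lng", "lat"),
  crs = 4326
)

## one call returns a matrix with columns "X" (lng) and "Y" (lat)
xy <- st_coordinates(toy_pts)
toy_lng <- xy[, "X"]
toy_lat <- xy[, "Y"]
```

Indexing by column name rather than position makes it a little harder to swap longitude and latitude by accident.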
The next chunk will show the construction of the map object in its entirety. Below, I’ll explain each element individually.
## create map object
map.color <- ggplot(data = world) +
  theme(text = element_text(family = "Source Code Pro"),
        legend.text = element_text(size = 16),
        legend.title = element_text(size = 20)) +
  geom_sf(fill = "white") +
  coord_sf(xlim = c(-24, 50),
           ylim = c(30, max(lat)),
           expand = FALSE) +
  geom_point(data = tweets,
             aes(x = lng,
                 y = lat,
                 color = tweets$lang),
             size = 3,
             alpha = 1/2) +
  xlab("") +
  ylab("") +
  scale_color_discrete(name = "Language",
                       labels = c("German", "English", "Spanish",
                                  "French", "Italian", "Dutch",
                                  "Portuguese", "Russian", "Swedish",
                                  "Turkish"))
## display map
map.color
ggplot(data = world): this tells ggplot2 to use the world polygons as the primary plotting object.
theme(text = ...): I have some weird locale issue on my home computer (I think that’s the issue at least) that prevents the default font from rendering on plots, so I specify the font explicitly here. Plus, Source Code Pro looks awesome anyway. I use the other arguments to adjust the legend text sizes up a bit.
geom_sf(fill = "white"): this is what adds the world polygons to the map. Here, world is inferred since it is the data argument passed to the ggplot function.
coord_sf(xlim = ...): this is used to set the view. Originally I was using min/max of lng/lat, but this resulted in a bit too much empty space, so I set three out of the four limits manually.
geom_point(data = ...): the points are added to the map with this function. Setting the color aesthetic to tweets$lang automatically symbolizes the points by this variable.
xlab(""): remove the “lng” label from the x axis.
ylab(""): remove the “lat” label from the y axis.
scale_color_discrete(name = ...): create the legend with manually defined labels. The order of the labels does not need to match the order of the langs vector defined earlier. It needs to match the order in which ggplot2 arranges the values of tweets$lang, which for a character variable is alphabetical by code (de, en, es, and so on). The codes present in the column can be listed with
unique(tweets$lang)
## [1] "es" "de" "en" "it" "nl" "sv" "fr" "tr" "pt" "ru"
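To double-check the label order, the codes can be sorted and looked up in a small table of display names. This is a minimal sketch in base R: lang_names and legend_labels are names introduced here for illustration, and the codes are copied from the output above.

```r
## codes in order of appearance, copied from the output above
codes <- c("es", "de", "en", "it", "nl", "sv", "fr", "tr", "pt", "ru")

## lookup table from language code to display name (illustrative helper)
lang_names <- c(de = "German", en = "English", es = "Spanish",
                fr = "French", it = "Italian", nl = "Dutch",
                pt = "Portuguese", ru = "Russian", sv = "Swedish",
                tr = "Turkish")

## sort the codes alphabetically, then look up the labels in that order
legend_labels <- unname(lang_names[sort(codes)])
legend_labels
```

The result runs from "German" to "Turkish" in the same order as the labels passed to scale_color_discrete in the map code.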
I’m not crazy about the color scheme, but after trying various options from packages like randomcoloR, viridis, RColorBrewer, and wesanderson it was clear that I was not going to be satisfied without defining my own colors manually, so I decided to accept ggplot2’s defaults and move on.
One confusing thing about ggplot2 is that you can include some items in the ggplot chain, and they will neither have any effect nor trigger an error. For example, I originally tried to use guides(fill=guide_legend(title="New Legend Title")) to create the legend title. Since the points are mapped to color rather than fill, that call had nothing to act on, and I needed to set the title in scale_color_discrete instead. In fact, that line remained in the code until I decided to write this post!
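The difference is easy to reproduce on a tiny example. The sketch below uses a made-up three-row data frame (toy) rather than the tweets; the point is only that a guide must be named after the aesthetic that is actually mapped.

```r
library(ggplot2)

## toy data standing in for the tweets (hypothetical values)
toy <- data.frame(lng  = c(-3.70, 13.40, -0.13),
                  lat  = c(40.42, 52.52, 51.51),
                  lang = c("es", "de", "en"))

p <- ggplot(toy, aes(x = lng, y = lat, color = lang)) +
  geom_point()

## no fill aesthetic is mapped, so this leaves the legend title as "lang"
p_fill <- p + guides(fill = guide_legend(title = "Language"))

## the points are mapped to color, so this retitles the legend
p_color <- p + guides(color = guide_legend(title = "Language"))
```

Printing p_fill and p_color side by side shows the legend title changing only in the second plot.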
Back to the drawing (coding) board
After sending this map off to the publisher feeling very accomplished, I heard back from the editorial assistant, who was concerned that the map would not reproduce well in grayscale. Knowing this would be the case (not that I expected it to be printed any other way), I stated why it was not possible to produce this map without color: (1) a 10-class grayscale color scheme is not really feasible and (2) the density of the points would ruin any possibility of distinguishing languages from one another. For example, if Turkish tweets were light gray and German tweets were dark gray, three overlapping Turkish tweets would not be distinguishable from one German tweet. My suggestion was to take the map as is, knowing that people could find the color figure online.
Feeling very confident that I was right and that the publisher would listen to me, I was frustrated to hear that the map was not acceptable. The editorial assistant then suggested that I use unique icons for the tweets instead, and I remembered back to my grad school days when I tried darn near every shape option ggplot2 offers just for the fun of it. For the record, there are 26 (numbered 0 through 25), with 15 of those (0 through 14) being hollow or line-based, which is more than enough for my ten languages.
I then altered the code a bit to produce this grayscale-friendly map. Below the code I highlight the major changes.
map.gray <- ggplot(data = world) +
  theme(text = element_text(family = "Source Code Pro"),
        legend.text = element_text(size = 16),
        legend.title = element_text(size = 20)) +
  geom_sf(fill = "white") +
  coord_sf(xlim = c(-24, 50),
           ylim = c(30, max(lat)),
           expand = FALSE) +
  geom_point(data = tweets,
             aes(x = lng,
                 y = lat,
                 shape = tweets$lang),
             size = 2,
             alpha = 2/5) +
  xlab("") +
  ylab("") +
  scale_shape_manual(name = "Language",
                     values = 1:10,
                     labels = c("German", "English", "Spanish",
                                "French", "Italian", "Dutch",
                                "Portuguese", "Russian", "Swedish",
                                "Turkish"))
## display map
map.gray
geom_point(..., shape = tweets$lang): here, I symbolize the tweets by language using shape rather than color, a pretty simple switch. I also lowered the alpha value to make the points more transparent and reduced the size parameter as well.
scale_shape_manual(...): since the map is no longer in color, the scale_color_discrete function will no longer suffice. The other arguments remain the same, with one addition: the values argument, which assigns shapes 1 through 10 to the ten languages.
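If it matters which language gets which shape, the values argument of scale_shape_manual also accepts a named vector, in which case shapes are matched to the data by name rather than by position. The following is a minimal sketch on made-up data: toy and lang_shapes are hypothetical and not part of the map code above.

```r
library(ggplot2)

## toy data standing in for the tweets (hypothetical values)
toy <- data.frame(lng  = c(-3.70, 13.40, -0.13),
                  lat  = c(40.42, 52.52, 51.51),
                  lang = c("es", "de", "en"))

## named values are matched to the language codes by name
lang_shapes <- c(de = 1, en = 2, es = 3)

p_shape <- ggplot(toy, aes(x = lng, y = lat, shape = lang)) +
  geom_point(size = 2) +
  scale_shape_manual(name = "Language",
                     values = lang_shapes,
                     ## labels follow the alphabetical order of the codes
                     labels = c("German", "English", "Spanish"))
```

Shapes 1 through 3 are all hollow, so the toy plot stays readable in grayscale just like the full map.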
After creating the map, ggsave can be used to save the last plot.
## save
ggsave(filename = here("img/europe-tweets-gray.jpg"), dpi = 400,
       width = 18.04,
       height = 20)
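Since ggsave measures width and height in inches by default, the pixel size of the saved image follows from multiplying each dimension by the dpi. A quick check of that arithmetic, using the values from the call above (the variable names are just for illustration):

```r
## dimensions passed to ggsave(); its default unit is inches
width_in  <- 18.04
height_in <- 20
dpi       <- 400

## pixel dimensions of the saved raster image
width_px  <- round(width_in * dpi)
height_px <- round(height_in * dpi)
c(width = width_px, height = height_px)
```

That works out to 7216 by 8000 pixels, which leaves plenty of resolution for print.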
Conclusion
In the end, I feel like the grayscale map offers a good balance between depicting the data in a meaningful way and dealing with the realistic limitations of printing. I had to apologize to the editorial assistant for being a bit stubborn; her suggestion was a good one, and it forced me to learn something new and cool about ggplot2. In fact, the grayscale map shows some patterns that are not really discernible with the color version, and in the end both suffer from an overabundance of points in some locations. That said, if I had to choose between one or the other, color is better when it’s available. Had this been produced for a web page though, I would have used a different approach altogether.