Mapping 10,000 points with ggplot2

I recently authored a book chapter titled “Twitter”, which will be appearing in the Handbook on Geographies of the Internet edited by Barney Warf. For this chapter I created a map of 10,000 tweets in Europe symbolized by language. James Cheshire’s work on Twitter Tongues was the initial inspiration for this approach.

Most of my visualizations with spatial data are optimized for the web using R packages like leaflet or openlayers, but since this work will be appearing in print it required a different approach. During my dissertation I used the R package cartography, but based on the number of points required for this map I decided ggplot2 would be a better choice.

I’ve created many non-spatial visualizations with ggplot2 before, but to my surprise the experience of visualizing spatial data with it was not much different. It took a while to figure out a few elements – especially the legend – but other than that it was pretty straightforward. One unexpected twist, however, was the requirement that the map not use colored points for the languages, since the book will be printed in black and white. But thanks to an editorial assistant’s suggestion, her patience with my stubbornness, and ggplot2’s numerous plotting options, this was resolved quite easily.

I collected this data over the course of several days and then sampled 10,000 points from the resulting dataset. For this, I used the Python module tweepy and the NoSQL system Elasticsearch. After extracting the tweets, the remainder of the analysis/mapping was completed in R.

First, I attach packages and load the data. I use the rnaturalearthdata package for country borders and create a vector of the language codes of interest. Some languages are excluded due to sparseness. Others, like Finnish, are excluded due to the ill effect of bot accounts producing unnatural (and unsightly) patterns on the map.

library(sf)
library(here)
library(dplyr)
library(ggplot2)
library(rnaturalearth)
library(rnaturalearthdata)
library(magrittr)

## get world polygon for background
world <- ne_countries(scale = "medium", returnclass = "sf")

## create variable of languages of interest
langs <- c("es", "pt", "de", "it", "fr", "nl", "en", "ru", "tr", "sv")

## read tweets, filter by those with the languages of interest
tweets <- st_read(here("data/europe-tweets-10k.geojson"),
                  stringsAsFactors = FALSE) %>%
  filter(lang %in% langs)

Filtering by the desired languages leaves 9,313 tweets in the dataset, so about 93% of the sample uses one of the ten languages of interest. Setting the argument stringsAsFactors to FALSE saves a step later, since lang then arrives as a character column rather than a factor when it is used as the categorical variable for plotting.
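As a small illustration of the filtering step and the percentage quoted above, here is a base R sketch. The toy data frame and its values are made up for the example; only the langs vector and the 9,313 / 10,000 figures come from the post. The subsetting line is the base R equivalent of the filter() call.

```r
## toy stand-in for the tweets data (made-up values)
toy <- data.frame(id = 1:6,
                  lang = c("es", "fi", "en", "und", "de", "en"),
                  stringsAsFactors = FALSE)

## languages of interest, as defined in the post
langs <- c("es", "pt", "de", "it", "fr", "nl", "en", "ru", "tr", "sv")

## keep rows whose language code is in the vector of interest
kept <- toy[toy$lang %in% langs, ]

## share of rows retained in the toy data, as a percentage
share <- 100 * nrow(kept) / nrow(toy)

## the same arithmetic for the figures quoted in the text
pct_kept <- round(100 * 9313 / 10000)
pct_kept
```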

Next, since the variable tweets is an sf object, the lat/lng values are not available as columns outside of the geometry column. In this case, it’s easiest to get them as separate variables to use as the x/y locations for mapping later.

## get x/y as variables
lng <- tweets %>%
  st_coordinates() %>%
  extract(,1)

lat <- tweets %>%
  st_coordinates() %>%
  extract(,2)
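An alternative sketch of the same step, calling st_coordinates() once and selecting columns by name rather than by position. The three points below are made-up stand-ins for the tweets object; for point geometries st_coordinates() returns a matrix with columns named X and Y.

```r
library(sf)

## toy stand-in for the tweets object: three points in WGS84 (made-up data)
pts <- st_as_sf(
  data.frame(lang = c("es", "de", "en"),
             lng  = c(-3.70, 13.40, -0.13),
             lat  = c(40.42, 52.52, 51.51)),
  coords = c("lng", "lat"),
  crs = 4326
)

## extract the coordinate matrix once, then pull each column by name
xy  <- st_coordinates(pts)
lng <- unname(xy[, "X"])
lat <- unname(xy[, "Y"])
```

Selecting by name makes it harder to swap longitude and latitude by accident.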

The next chunk will show the construction of the map object in its entirety. Below, I’ll explain each element individually.

## create map object
map.color <- ggplot(data = world) +
  theme(text = element_text(family = "Source Code Pro"),
        legend.text = element_text(size = 16),
        legend.title = element_text(size = 20)) +
  geom_sf(fill = "white") +
  coord_sf(xlim = c(-24, 50),
           ylim = c(30, max(lat)),
           expand = FALSE) +
  geom_point(data = tweets,
             aes(x = lng,
                 y = lat,
                 color = tweets$lang),
             size = 3,
             alpha = 1/2) +
  xlab("") +
  ylab("") +
  scale_color_discrete(name = "Language",
                       labels = c("German", "English", "Spanish",
                                  "French", "Italian", "Dutch",
                                  "Portuguese", "Russian", "Swedish",
                                  "Turkish"))

## display map
map.color

ggplot(data = world): this tells ggplot2 to use the world polygons as the primary plotting object.
theme(text = ...): I have some weird locale issue on my home computer (I think that’s the issue at least) that prevents the default font from rendering on plots, so I specify explicitly here. Plus, Source Code Pro looks awesome anyway. I use the other arguments to adjust the size up a bit.
geom_sf(fill = "white"): this is what adds the world polygons to the map. Here, world is inferred since it is the data argument passed to the ggplot function.
coord_sf(xlim = ...): this is used to set the view. Originally I was using min/max of lng/lat, but this resulted in a bit too much empty space. So I set three out of the four manually.
geom_point(data = ...): The points are added to the map with this function. By setting the color argument to tweets$lang, it will automatically symbolize the points by this variable.
xlab(""): Remove the “lng” label from the x axis.
ylab(""): Remove the “lat” label from the y axis.
scale_color_discrete(name = ...): Create the legend with manually defined labels. The order of the labels need not match the order of the langs vector defined earlier; it must match the order in which ggplot2 arranges the values of the tweets$lang column, which for a character column is alphabetical and can be easily retrieved with

sort(unique(tweets$lang))

##  [1] "de" "en" "es" "fr" "it" "nl" "pt" "ru" "sv" "tr"
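One way to avoid ordering the legend labels by hand is to look them up by language code. This is only a sketch: the lang_names lookup table is a hypothetical helper, not something from the original code. It relies on the fact that ggplot2 arranges the values of a character variable alphabetically on a discrete scale.

```r
## language codes in arbitrary (order-of-appearance) order
codes <- c("es", "de", "en", "it", "nl", "sv", "fr", "tr", "pt", "ru")

## hypothetical lookup table from language code to display name
lang_names <- c(de = "German", en = "English", es = "Spanish",
                fr = "French", it = "Italian", nl = "Dutch",
                pt = "Portuguese", ru = "Russian", sv = "Swedish",
                tr = "Turkish")

## sort the codes the way the discrete scale will, then look up the labels
legend_labels <- unname(lang_names[sort(codes)])
legend_labels
```

The resulting vector could then be passed as the labels argument, so that a change in the set of languages cannot silently mislabel the legend.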

I’m not crazy about the color scheme, but after trying various options from packages like randomcoloR, viridis, RColorBrewer, and wesanderson it was clear that I was not going to be satisfied without defining my own colors manually, so I decided to accept ggplot2’s defaults and move on.

One confusing thing about ggplot2 is that you can include some items in the ggplot chain that neither take effect nor trigger an error. For example, I originally tried to use guides(fill=guide_legend(title="New Legend Title")) to create the legend title, but since the points are mapped to color rather than fill, that call had nothing to act on and I needed to set the title in scale_color_discrete instead. In fact, that line remained in the code until I decided to write this post!
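A minimal sketch of that behavior, using a made-up three-row data frame rather than the tweets. The guide has to name the aesthetic that is actually mapped (color here); labs() is a shorter route to the same legend title.

```r
library(ggplot2)

## toy data standing in for the tweets (made-up values)
toy <- data.frame(x = c(1, 2, 3),
                  y = c(3, 1, 2),
                  lang = c("de", "en", "es"))

p <- ggplot(toy, aes(x = x, y = y, color = lang)) +
  geom_point()

## no fill aesthetic is mapped, so this guide has nothing to act on
p_ignored <- p + guides(fill = guide_legend(title = "Language"))

## naming the aesthetic that is actually mapped changes the legend title
p_titled <- p + guides(color = guide_legend(title = "Language"))

## labs() sets the same title with less typing
p_labs <- p + labs(color = "Language")
```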

Back to the drawing (coding) board

After sending this map off to the publisher feeling very accomplished, the editorial assistant responded with concerns that the map would not reproduce well in grayscale. Knowing this would be the case – not that I expected it to be printed any other way – I stated why it was not possible to produce this map without color: (1) using a 10-class grayscale color scheme is not really feasible and (2) the density of the points would ruin any possibility of distinguishing languages from one another. For example, if Turkish tweets were light gray and German tweets were dark gray, three overlapping Turkish tweets would not be distinguishable from one German tweet. My suggestion was to take the map as is, knowing that people could find the color figure online.
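The overlap problem can be made concrete with a little arithmetic. The sketch below assumes standard alpha blending, where each semi-transparent point covers a fraction alpha of whatever is beneath it; the stacked_opacity helper is hypothetical and is not part of the original analysis.

```r
## combined opacity of n identical points stacked on one spot,
## each drawn with the given alpha (assumes standard alpha blending)
stacked_opacity <- function(alpha, n) {
  1 - (1 - alpha)^n
}

stacked_opacity(1/2, 1)  # one point at alpha = 1/2: 0.5
stacked_opacity(1/2, 3)  # three overlapping points: 0.875
```

With alpha = 1/2, as in the color map, three stacked points are already far darker than a single one, which is why shades of gray alone cannot separate "which language" from "how many tweets".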

Feeling very confident that I was right and that the publisher would listen to me, I was frustrated to hear that the map was not acceptable. The editorial assistant then suggested that I use unique icons for the tweets instead, and I remembered back to my grad school days when I tried darn near every shape option ggplot2 offers just for the fun of it. For the record, there are 26 (numbered 0 through 25), with 15 of those (0 through 14) being hollow – more than enough for my ten languages.

I then altered the code a bit to produce this grayscale friendly map. Below the code I highlight the major changes.

map.gray <- ggplot(data = world) +
  theme(text = element_text(family = "Source Code Pro"),
        legend.text = element_text(size = 16),
        legend.title = element_text(size = 20)) +
  geom_sf(fill = "white") +
  coord_sf(xlim = c(-24, 50),
           ylim = c(30, max(lat)),
           expand = FALSE) +
  geom_point(data = tweets,
             aes(x = lng,
                 y = lat,
                 shape = tweets$lang),
             size = 2,
             alpha = 2/5) +
  xlab("") +
  ylab("") +
  scale_shape_manual(name = "Language",
                     values = 1:10,
                     labels = c("German", "English", "Spanish",
                                "French", "Italian", "Dutch",
                                "Portuguese", "Russian", "Swedish",
                                "Turkish"))

## display map
map.gray

geom_point(..., shape = tweets$lang): Here, I symbolize the tweets by language using shape rather than color – a pretty simple switch. I also adjusted the alpha value to make the points more transparent and altered the size parameter as well.
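scale_shape_manual(...): Since the languages are now mapped to shape rather than color, scale_color_discrete no longer applies and scale_shape_manual takes its place. The name and labels arguments remain the same, with one addition: the values argument, which assigns shape codes 1 through 10 to the ten languages.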
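To see which shape code each language ends up with, the assignment can be reproduced in base R. This is a sketch: the shape_codes name is hypothetical. Unnamed values in a manual scale are assigned in the order of the scale's values, which for a character column is alphabetical.

```r
## languages of interest, as defined in the post
langs <- c("es", "pt", "de", "it", "fr", "nl", "en", "ru", "tr", "sv")

## pair shape codes 1 through 10 with the alphabetically sorted codes
shape_codes <- setNames(1:10, sort(langs))
shape_codes
```

Passing a named vector like this as values would make the pairing explicit, since scale_shape_manual matches named values to the data by name.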

After creating the map, ggsave can be used to save the last plot.

## save
ggsave(filename = here("img/europe-tweets-gray.jpg"), dpi = 400,
       width = 18.04,
       height = 20)
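For a sense of how large the saved file is, the pixel dimensions follow from the width, height, and dpi values in that call. ggsave() interprets width and height in inches unless the units argument says otherwise, so the arithmetic is a simple multiplication:

```r
## values from the ggsave() call; width and height are in inches by default
width_in  <- 18.04
height_in <- 20
dpi       <- 400

## pixel dimensions of the saved raster image
px <- round(c(width = width_in, height = height_in) * dpi)
px
```

That works out to an image of 7216 by 8000 pixels.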

Conclusion

In the end, I feel like the grayscale map offers a good balance between depicting the data in a meaningful way and dealing with realistic limitations of printing. I had to apologize to the editorial assistant for being a bit stubborn; her suggestion was a good one, and it forced me to learn something new and cool about ggplot2. In fact, the grayscale map shows some patterns that are not really discernible with the color version, and in the end both suffer from an overabundance of points in some locations. That said, if I had to choose between one or the other, color is better when it’s available. Had this been produced for a web page though, I would have used a different approach altogether.
