Challenges of teaching spatial analysis with Python

The landscape of geospatial software is ever-changing. There’s no shortage of powerful programs, and due to the wide array of potential employment opportunities for geographers, it would be nice to give students broad exposure to as many as possible. These include ArcGIS Pro, ArcGIS Collector, Survey123, QGIS, PostGIS, Python, R, Leaflet, OpenLayers, Mapbox, and Carto, to name a few. The reality, however, is that at a four-year university we simply don’t have time to teach everything; we have to make decisions. Otherwise educators can only go “an inch deep and a mile wide,” and students gain significant exposure to software but lack depth in many principal concepts. One increasingly important concept is that of scripting, i.e., using a programming language to automate a task, extend an existing method, or create a reproducible workflow.

Over the past three years, I’ve incorporated scripting in four different geospatial courses: GIS 1, GIS 2, GIS 3, and Quantitative Methods for Geography, namely with Python and R. Having the background of teaching multiple languages in several different courses gives me considerable experience to draw upon, though I should point out that I’ve always used one or the other over a semester; I never mix the two in the same course. Every semester incurs unique challenges, regardless of the subject content or language used. This piece, however, is about the challenges of teaching with Python specifically, not with teaching scripting altogether.

Within spatial analysis and data analysis writ large, there is often a debate about whether it is better to learn Python or R. On this topic, I wholeheartedly agree with Matt Dancho’s assertion: “The 2 most popular data science languages - Python and R - are often pitted as rivals. This couldn’t be further from the truth. Data scientists that learn to use the strengths of both languages are valuable because they have NO LIMITS” (emphasis theirs). 1 Though I personally use R more in research, I still use Python regularly, and knowing both has been immensely valuable. When students ask me which they should learn, I recommend both. If they only have time to learn one of the two, I recommend Python since it’s more commonly used.

I’m writing this post for two main reasons. First, I want to describe the problems I’ve encountered in teaching with Python as I am genuinely interested in their solutions. It’s a powerful language that is valuable for students to know. There are plenty of others more skilled than me both in using Python and in teaching with it, so if there are solutions to the challenges I describe below, I want to know about them! That said, I believe at least a few of the problems below do not have clean solutions, and I think it’s worth elucidating those. Here, I concur with Yihui Xie in his assessment of Joel Grus’ critique of Jupyter Notebooks: ‘criticizing a popular tool is not “an unhelpful way to spend time.”’ 2.

Second, I want to suggest an alternative approach which includes R for spatial analysis without excluding Python. I do this by comparing Python’s challenging facets to how they would be tackled with R, though R has problems of its own. This post has some overlap with the main points brought out by Holger K. von Jouanne-Diedrich 3, but I have several unique points, and I extend some of the common distinctions of the two languages particularly to cases involving spatial data. Further, von Jouanne-Diedrich goes full bore and campaigns for R over Python entirely; his points are not just geared toward the classroom. I don’t go this far as I think individual organization needs are too varied to simply pick one language over the other, so my summary here applies only to a pedagogical context.

My points also apply specifically to geography students and those who are new to programming languages. While some content may be relevant to other disciplines or to intermediate students, my focus is notably niche. When I include code examples, I use Python’s geopandas, shapely, and fiona, as arcpy has unique workflow shortcomings – ones that are resolved in other libraries within the same language – so much so that I believe it would make for an unfair comparison with R. The reasons for using Python libraries other than arcpy are a topic for another discussion entirely.

The post is organized into three main sections. In the first section, I delve into the drawbacks of using Python for teaching spatial analysis using examples where appropriate. Many of these points bleed together, and Point 1 is pervasive throughout the piece. In the second section, I discuss some shortcomings of R and take on its primary criticism: that employers appear to want Python. In the third section I provide some concluding thoughts and discuss how Python could make me a believer. Hopefully this post serves to promote constructive discussion.

Challenges of teaching spatial analysis with Python

1. Data analysis vs. software engineering

Yihui Xie states this: “I feel a major difference between the R culture and Python culture is that Python users seem to create code more often, whereas R users often use code. There seems to be a strong atmosphere of software engineering in the Python world: in the beginning was the custom class (with methods). For R users, in the beginning was the data.” 4.

The chief task of most academic geographers is analyzing data, not creating software. In essence, I think it’s more prudent to focus on teaching undergraduate geography students scripting – i.e. using code to answer a question, create a map, automate a task, or communicate a concept – rather than creating or customizing underlying software that makes spatial analysis possible.

Certainly Python can be used for data analysis and R can be used for software engineering, but Python lends itself to data analysis less effectively than some other languages. Much of the post will center on this idea. The terms “programming” and “scripting” are often equated, and to many people they mean the same thing. Here however, I make a conscious differentiation. “Scripting” will refer primarily to data analysis whereas “programming” will refer to software engineering.

2. Lack of an expressive toolkit for spatial operations

This is, perhaps, the biggest weakness Python has, and I think it’s an important one. Much of the technical content we teach geography students is about successive spatial operations, and the modus operandi – beyond simply the tool itself – is critical to student comprehension. For example, consider a situation in which one needs to complete several operations:

  • Read spatial dataset from a web source
  • Transform the crs
  • Filter points that fulfill a certain condition
  • Select columns of interest
  • Create a buffer around the points

Compare the two code samples below. The first is written in Python:

import geopandas as gpd

points = gpd.read_file("https://raw.githubusercontent.com/mhaffner/data/master/meuse.geojson")

points_trans = points.to_crs(23032)

points_trans_filt = points_trans[points_trans["lead"] > 150]

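# keep the geometry column; without it geopandas returns a plain DataFrame with no buffer()
points_trans_filt_sel = points_trans_filt[["lead", "cadmium", "elev", "dist", "geometry"]]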

points_buf = points_trans_filt_sel.buffer(100)

This example is not terribly unreadable, but it incurs a cluttered workspace due to the need to create intermediate variables for every operation. This is not remotely a performance problem – in fact, the Python example above probably runs a little faster than the R example below – but it does create a readability problem. In longer chains of analysis, the number of intermediate variables becomes unwieldy, and the best practice of creating short but meaningful variable names becomes impossible. Plus, at the end, I chose to shorten the variable name to be more manageable: points_buf over points_trans_filt_sel_buf, but another user inspecting the code may wonder about this new variable at first glance – is the buffer indeed conducted on the untransformed data? Or did the naming convention change at some point in the code?

One could argue that using comments solves the problem of vague code, but I’d argue that the existence of regular expressions is enough to counter that. Maybe one could come up with better variable names or use names like v1, v2, and v3 instead of the longer examples I’ve used above. Yet extra short variable names make it much more difficult to spot mistakes, and they provide less information about the process at hand. Of course, close inspection of the code could sort any of these issues out, but what if such nuisances could be avoided?

Now consider an analogous example written in R:

library(sf)
library(dplyr)

pts_buffer <- st_read("https://raw.githubusercontent.com/mhaffner/data/master/meuse.geojson") %>%
  st_transform(23032) %>%
  filter(lead > 150) %>%
  select(lead, cadmium, elev, dist) %>%
  st_buffer(100)

Here, only one variable is created as the process is “chained” together using R’s pipe operator: %>%. To be fair it’s an ugly symbol that takes a little getting used to, but I advise students to read the symbol as “send to” (e.g., read the data, send to a coordinate reference system transformation, send to a filter), and they quickly grasp its meaning. Like R’s assignment operator, the pipe also looks like an arrow, so in a way these operations appear like a text-based model that students can visualize.

This operator has not always been a part of R but has proliferated in recent years due to its close association with the popular tidyverse. Aside from the pipe, other “tidy” functions are more readable as well:

  • Selecting variables of interest using select vs. [[
  • Subsetting by a certain criterion using filter(column > value) vs. dataset[dataset.column > value]

Before the tidyverse, R’s subsetting and selection procedures looked a lot like pandas’; then again, it would be more accurate to phrase this the other way around considering pandas data structures were essentially created to mimic a data structure native to R, the data frame.
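
To make the “send to” reading concrete in Python terms, here is a minimal sketch that uses only the standard library. The pipe helper and the sample readings are hypothetical; neither is part of Python, pandas, or geopandas. The sketch only illustrates how a chain hands each result to the next step without naming the intermediate values.

```python
from functools import reduce


def pipe(value, *steps):
    """Send `value` through each step in order (hypothetical helper)."""
    return reduce(lambda result, step: step(result), steps, value)


# Made-up records standing in for rows of a spatial dataset
readings = [
    {"site": "a", "lead": 120, "elev": 7.9},
    {"site": "b", "lead": 310, "elev": 6.4},
    {"site": "c", "lead": 180, "elev": 8.1},
]


def keep_high_lead(rows):
    # filter step: keep rows above the threshold
    return [r for r in rows if r["lead"] > 150]


def select_site_and_lead(rows):
    # select step: keep only the fields of interest
    return [{"site": r["site"], "lead": r["lead"]} for r in rows]


def order_by_lead(rows):
    # ordering step: smallest lead value first
    return sorted(rows, key=lambda r: r["lead"])


high_lead = pipe(readings, keep_high_lead, select_site_and_lead, order_by_lead)
print(high_lead)
```

Only one result variable is created, but each step has to be defined as a function first, so the chain is available in Python at the cost of extra scaffolding.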

3. More complex syntax

At a certain level, the choice for Python or R’s syntax is merely a matter of preference. I think DataCamp effectively describes the barrier of entry and development progression for the two languages in this: (1) R is a bit easier to learn for those new to scripting/programming while Python is easier for those with more experience. (2) At the same time, it is more difficult to achieve advanced proficiency with R and easier to get there with Python 5. With undergraduate geography students, the goal is to get them to basic proficiency. While advanced proficiency in a number of languages could be beneficial career-wise, it would either take a serious personal time investment or courses outside of our department to get there.

All this said, basic operations usually require more lines of code in Python than in R. For example, consider a situation where one would want to achieve the following in as few lines as (practically) possible using commonly accepted procedures:

  1. Read a geojson file from the web
  2. Create a histogram of one field
  3. Create a simple map of the data

Compare the two code samples below. Again, the first is written in Python and the second in R.

import geopandas as gpd
import matplotlib.pyplot as plt

meuse = gpd.read_file("https://raw.githubusercontent.com/mhaffner/data/master/meuse.geojson")

plt.hist(meuse['lead'])
plt.show()

meuse.plot()
plt.show()

Versus:

library(sf)

meuse <- st_read("https://raw.githubusercontent.com/mhaffner/data/master/meuse.geojson")

hist(meuse$lead)

plot(meuse$geometry)

The difference in lines between the two examples is 7 vs. 4 which admittedly is not a ton. Over the course of large projects, I honestly don’t know if R code is any more succinct than that of Python – and certainly not 75% more if it is – but it ought to be kept in mind that we’re working with students who have had no exposure to scripting. Keeping examples as simple as possible is nice since at this stage, scripting is more of an effort in pattern recognition rather than knowledge implementation.

The Python example has several elements that may confuse students:

  • The use of as to create an alias for geopandas and matplotlib.pyplot in order to reduce the amount of typing: while this is not imperative, most internet examples of these packages (and others) use these aliases. What’s more confusing is that instead of import matplotlib.pyplot as plt, another commonly used syntax is from matplotlib import pyplot as plt.
  • The discrepancy between how the modules are imported: one contains a period and the other does not.
  • The need to use plt.show() after calling plt.hist() and meuse.plot(): to be fair, inside of a Jupyter notebook these lines aren’t needed if the user calls %matplotlib inline, but this is another layer of complexity that requires explanation.
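
The two import spellings mentioned in the first bullet end up binding the same module. A minimal sketch of this, using the standard library’s os.path in place of matplotlib.pyplot so that it runs without any extra packages (the alias names osp and osp2 are arbitrary):

```python
import sys

import os.path as osp          # dotted form: import package.module as alias
from os import path as osp2    # from form: from package import module as alias

# Both aliases are bound to the same module object
print(osp is osp2)

# The alias is only a local name; the module is registered under its full name
print(sys.modules["os.path"] is osp)

# Either alias can be used interchangeably afterwards
print(osp.join("data", "meuse.geojson") == osp2.join("data", "meuse.geojson"))
```

Knowing that the forms are equivalent does not remove the confusion for a beginner, but it does mean that examples found online can be followed with either spelling.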

Further, consider another example where a user may want to retrieve all of the full (as opposed to relative) file paths in a directory:

import os

def listdir_fullpath(d):
    return [os.path.join(d, f) for f in os.listdir(d)]

file_names = listdir_fullpath("/home/user/Downloads")

Versus:

file_names <- list.files("/home/user/Downloads", full.names = TRUE)

The Python example above does several things:

  • Imports the os module
  • Creates a function for retrieving full file paths using list comprehension, since os.listdir() returns only the file names rather than the full paths
  • Uses the newly created function to retrieve the file names

It’s good practice to create functions for code which may be used often, and list comprehension is a powerful construct, but these concepts warrant explanation on their own. The R example on the other hand, simply uses the list.files() function and an argument, full.names = TRUE, to retrieve the full paths.
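
For comparison, the standard library’s pathlib module can return full paths without a custom function, although it still requires an import and introduces Path objects that need their own explanation. A minimal sketch; the temporary directory and the two file names are placeholders so that the example runs on any machine:

```python
import tempfile
from pathlib import Path

# Placeholder directory with two empty files, standing in for a Downloads folder
with tempfile.TemporaryDirectory() as tmp:
    for name in ("roads.shp", "roads.dbf"):
        (Path(tmp) / name).touch()

    # iterdir() yields one Path per entry, each prefixed with the directory
    file_names = sorted(str(p) for p in Path(tmp).iterdir())

print(file_names)
```

With a real folder, the call would be Path("/home/user/Downloads").iterdir(), which lists files and subdirectories alike.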

With some practice with Python, these minor inconveniences become moot. But with geography and GIS students, these cases demonstrate how the language becomes a barrier to learning geospatial concepts. Beyond these simple examples, I could provide other similar comparisons demonstrating the extra syntactical burden in Python with the creation of web maps (in folium vs. leaflet), type conversion (in pandas vs. R’s native data structures), and in building web applications (in shiny vs. flask).

4. Method vs. function discrepancies

A major challenge for newcomers to Python is determining when to use a function and when to use a method like a property of an object. Essentially, methods are special types of functions but they’re called differently. For example, consider the following code:

import seaborn as sns

# load an example dataset
iris = sns.load_dataset('iris')

# view the first few rows
iris.head()

# get a column mean
iris["sepal_length"].mean()

This excerpt first imports the seaborn module. It then loads the “iris” dataset. The next line of code utilizes the head() method which retrieves the first few rows. The following line uses the mean() method. This is simple enough.

However, if the user came across the statistics module, they would find that mean() needs to be called in a different way:

from statistics import mean

mean(iris["sepal_length"])

While a student likely wouldn’t feel the need to search online for how to compute the mean if they are using pandas data frames – they would probably just try the method – this discrepancy is common as there is no standardization. R trades this problem for namespace collisions and a degree of ambiguity in where functions come from, but I think differentiating functions vs. methods is far worse for students.
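
The same split shows up with nothing but built-in types, which is a quick way to demonstrate it without pandas or seaborn. A minimal sketch with made-up values:

```python
from statistics import mean

values = [5.1, 4.9, 4.7]

# Function calls: the data is passed in as an argument
print(mean(values))
print(len(values))
print(sorted(values))        # returns a new, sorted list

# Method calls: the operation is looked up on the object itself
print(values.count(4.9))
values.sort()                # sorts in place and returns None
print(values)

# A method is a function attached to a type, so both spellings work
print(str.upper("meuse") == "meuse".upper())
```

Nothing in the names tells a newcomer that sorted() is a function while sort() is a method, or that one returns a new list and the other modifies the existing one.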

5. Configuration challenges

Configuring a Python IDE for classroom instruction is a major challenge for a few reasons outlined as sub-issues below. Most of these are related to the fact that students won’t have access to the same C: drive from session to session on our lab computers. These issues create significant systems administration overhead which takes time to resolve. Our university staff work incredibly hard at getting software to work for us, but it’s a legitimate bottleneck for me to have to make technical requests – e.g., change the default version of Python used by Spyder, allow write access for pip, etc. – and these issues can grind curriculum development to a halt.

Configuring R for instruction is simple. One must simply install R, install RStudio, and allow for package installation in a place where students will have persistent access and suitable permissions. This can be a little tricky on lab computers where students won’t have access to the same C: drive everyday, but one of the default locations where R installs packages is in the students’ OneDrive folder. This works perfectly without any modification. I can supply our systems administrator with a list of packages I would like installed by default which will be consistent across all computers, and students can install packages to their OneDrive individually. Next time, if they log on to a different computer, they will still have access to the system packages but also the packages in their “synced” OneDrive folder.

5a. Multiple Python versions installed on a system

From a research perspective, having multiple Python versions available is a good thing: some modules work only with specific Python versions, and having them available is great, especially for the ease of creating virtual environments. That said, the system’s default Python is not necessarily the one I want students using. On our lab computers the default is 2.7, and various IDEs will use different default versions. ArcMap and ArcGIS Pro install their own separate versions of Python (2.6 or 2.7 versus 3.x, respectively), and ArcGIS Pro’s version uses the popular Anaconda distribution. Anaconda uses the conda package manager, which is slower than the standard tool pip and should not be used along with pip, as the two can conflict with one another. Some experimental packages I like to use in demonstrations and in-class activities are not available in the conda-forge repository, which precludes their use with Anaconda.

I’m afraid of modifying the system default version of Python to one I prefer because of how it might affect ArcMap, ArcGIS Pro, and other programs our various faculty are using. I wouldn’t dare do this during the semester, and even if I made this request during the summer or winter break I suspect issues may not creep up until during the semester. I also suspect system updates may tamper with or reset configuration settings of pip’s default Python version, default install location, and various IDE settings, requiring manual changes frequently. Most of these version issues are mitigated by using virtual environments and other computing best practices, but covering this information in class would take away from other important content, not to mention be a deterrent to enrollment for our classes.

It’s possible to install multiple versions of R on one computer, but this is uncommon and not necessary for most purposes. It’s worth noting as well that even if all of these systems administration issues are mitigated, students have a much more difficult time configuring Python on their own computers as opposed to R. I’ll return to this point later.
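
When several interpreters coexist on one machine, a first diagnostic step is to ask the running session which one it is. A minimal sketch using only the standard library; describe_interpreter is a hypothetical helper name, and the virtual environment check assumes environments created with venv or a recent virtualenv:

```python
import sys


def describe_interpreter():
    """Return basic facts about the Python process running this code."""
    return {
        # Full path to the interpreter that is actually executing
        "executable": sys.executable,
        "version": "{}.{}.{}".format(*sys.version_info[:3]),
        # Inside a venv-style environment, prefix differs from base_prefix
        "in_virtualenv": sys.prefix != getattr(sys, "base_prefix", sys.prefix),
    }


for key, value in describe_interpreter().items():
    print(f"{key}: {value}")
```

Running this in each IDE on a lab computer would show whether they all point to the same installation, which is often the first thing to establish when packages appear to be missing.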

5b. Difficulty in installing external packages

Depending on whether Python was installed with Anaconda or not, packages are installed with conda or pip. Either way, this is completed at the command line, rather than in a Python shell. The command line is generally intimidating for students, not to mention programming in general. R packages on the other hand are installed within R at the R console, and this small difference adds an extra layer of complexity with Python. Further, installing experimental packages from GitHub with pip is not as straightforward as R’s devtools either.
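
Installation can also be triggered from inside a Python session by running pip as a module of the current interpreter, which avoids guessing which pip the command line would pick up. A minimal sketch: pip_install_command and pip_install are hypothetical helper names, and the approach assumes that pip is available and that the user has write access to the environment, which may not hold on locked-down lab computers.

```python
import subprocess
import sys


def pip_install_command(package):
    """Build a pip command bound to the interpreter running this code."""
    return [sys.executable, "-m", "pip", "install", package]


def pip_install(package):
    """Run the command; raises CalledProcessError if pip fails."""
    subprocess.check_call(pip_install_command(package))


# Show the command without running it; "geopandas" is only an example name
print(pip_install_command("geopandas"))
```

This keeps students out of the terminal, but it is still more ceremony than a single call at the R console, and it does nothing for a conda-managed installation.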

There is actually a push button tool inside of ArcGIS Pro that allows for installation of Python packages (used in tandem with running Jupyter inside of ArcGIS Pro), but ArcGIS Pro uses the Anaconda distribution, and as stated previously, some of the packages that I use are not available through conda repositories, so this workflow is a no-go. Either way, as explained in the next point, I feel strongly that students new to programming and scripting should be using an IDE.

One might question the need for students to install external libraries that are not on the system by default, especially since I can (and do) ask our sys admins to install many packages up front. Also, most of the students I’m working with are new to scripting, a point I emphasize a lot in this piece. It’s true that most beginners won’t do this, but gifted students can and will. I argue that the inability or extra difficulty in installing external packages severely inhibits exploration and introduces continuity problems if another IDE/framework is used for Python instruction in the future. No matter which framework and setup is used for Python development, students have a much more difficult time configuring it on their own. I’ve seen students in their first semester of using R lounging in our hallways and labs using a dark-themed version of RStudio and installing GitHub packages with devtools on their own laptops. I’ve never heard of or seen my students at any level of scripting using Python this way, even those who have taken a dedicated Python course in the Computer Science Department.

5c. Varied IDE usage

The problem with configuring an IDE (or editor for that matter) is related directly to the previous two points. There is no shortage of Python IDEs in the world, but those that are relatively easy to use out of the box – namely Spyder, PyCharm, and PyScripter – have given the students and myself major headaches (for various reasons) when attempting to use them for instruction. Many of these issues are related to having multiple Python versions installed on the system and the challenge of installing packages.

At the beginning of my GIS 3 class this semester, I decided to use PyCharm after much painful experimentation with various IDEs and web services. The default version of Python selected by PyCharm is a version of Python 3.8 that comes installed with ArcGIS Pro, likely set as a system variable. So I have students create a project using a different version of Python that will allow them to install packages – also 3.8 but not installed with Anaconda – and create a virtual environment on our department’s network drive so they will have access to it when working from home or on a different lab computer. This only has to be completed once with the push of a button, so I’m able to skirt around the explanation of virtual environments. I would never recommend using a virtual environment across a network drive as a day-to-day workflow, but I thought it would be enough to tide us over for the semester.

If this seems overly complicated and fragile, I would have to agree. But this is literally the only combo that I could get to work on our lab computers that would allow students to:

  1. Use an IDE that allows for single line code execution
  2. Install packages with pip
  3. Avoid having to activate virtual environments manually from the command line

After a successful first day of instruction with PyCharm, during the subsequent class period everyone except me working from the teaching station had errors associated with the Python interpreter after loading their project. After having previously scrapped Kaggle kernels, I am back to using it for instruction again, which is better than nothing but forfeits two of three of the points just mentioned. For example, I recently discovered a package called geoplotlib and wanted to potentially demonstrate it in class, but it’s not available on Kaggle and therefore the students can’t use it. Kaggle says that they’re willing to consider library additions, but I doubt they’ll add anything that isn’t extremely stable. Even if they did add it, this is not a suitable long term workflow.

Sadly, when I first taught GIS 2 as a newly minted assistant professor, I was primarily using Python and PyCharm – it was the first IDE I tested for instruction – but I decided to scrap Python altogether in large part due to the difficulties in configuring it and the clear lack of student enthusiasm. Two and a half years later, after evaluating as many other options as I feasibly could, I went back to the same workflow that I left due to its complexity, only to scrap it again.

In addition to Kaggle and PyCharm I’ve also physically tested:

  • Spyder (which I believe is nearly on par with RStudio but had to nix due to inexplicable errors related to setting the default Python version, along with other shortcomings related to the previous two points)
  • repl.it
  • CoCalc
  • PyScripter
  • DataLore
  • Google Colab
  • Local Jupyter Notebooks

None of these were able to fulfill the three requirements of (1) single line code execution, (2) ability to install packages with pip, and (3) avoiding activation of virtual environments from the command line, even if I had budged on the IDE requirement. All of these problems are avoided with RStudio.

Within the R community, the vast majority of users rely on RStudio as their IDE of choice. It is widely accepted, painless to install locally, and relatively intuitive to use. I think that the Python community is a bit more fractured on its choice IDE. In fact, the most commonly used platform for Python scripting (i.e. data analysis) is not an IDE at all: it’s the Jupyter notebook. This software is a bit less intuitive for students as it is tied to a background “kernel,” there is no variable explorer, and single lines of code cannot be sequentially executed without being in a cell of their own. I feel strongly that new programmers should be using an IDE due to the availability of these helpful features.

It may appear as though my recommendation, RStudio, is strongly influenced by my own preferences. That’s less true on this point than others in this post. As an Emacs user I haven’t touched RStudio for my own work in years – not to mention Python-specific IDEs since I use Emacs for Python development as well – but I do think RStudio is a more user-friendly choice for students. Paradoxically, RStudio 1.4 is an excellent Python IDE, so if other issues could be mitigated, it may be a useful, low barrier of entry tool for Python instruction.

6. A general purpose programming language?

One of the main arguments for using Python over other languages is that Python is a “general purpose” programming language. While Python probably has broader applications and usage by industry, for much of the work that students do in the classroom, R has far wider applicability. Much of this has been enabled by the framework “RMarkdown” which allows users to weave markdown and R code together. In fact, it’s so generalized that one can even use it to integrate markdown with Python and render a document with an R process. I’ve done this before with a Python case study and the output looked far better than anything I can produce with Jupyter Notebooks!

Personally, I’m currently using R, RMarkdown, and Shiny to

  • Serve out all course materials (syllabuses, assignments, etc.)
  • Create class presentations with interactive charts and web maps
  • Create web applications for teaching concepts
  • Create web applications for spatial data collection
  • Maintain my personal website
  • Produce literate programming reports

Python modules can accomplish some of these tasks, but the barrier of entry for these in R is so remarkably low that students with no experience with R can accomplish some by the end of a semester.

7. Fewer libraries for accessing spatial data APIs

In all of my geospatial classes, extracting US census TIGER files is a common task. Suppose there is a need to download a road dataset for Eau Claire County. This can be done cleanly in Python by

  • Finding the file path on the census’ FTP server (must be done manually outside of Python)
  • Downloading the .zip file
  • Extracting the .zip file
  • Reading the resulting shapefile with geopandas
  • Cleaning up intermediate files created in the process

import urllib.request  # provides urlretrieve, used for the download below
import zipfile
import geopandas as gpd
import os
import glob

# download roads
url = "https://www2.census.gov/geo/tiger/TIGER2019/ROADS/tl_2019_55035_roads.zip"

urllib.request.urlretrieve(url, "tl_2019_55035_roads.zip")

# unzip them
with zipfile.ZipFile("tl_2019_55035_roads.zip") as zip_ref:
    zip_ref.extractall()

# read roads as a geodataframe
roads = gpd.read_file("tl_2019_55035_roads.shp")

# remove intermediate files
for f in glob.glob("tl_2019_55035_roads*"):
    os.remove(f)

I typically followed a very similar approach until I discovered R’s tigris package:

library(tigris)

ec.roads <- roads("WI", "Eau Claire")

Much of the tedious work displayed in the Python example is done behind the scenes with R’s tigris. Unlike the previous code comparisons, this discrepancy is not inherent to the language itself. Someone could come along and implement the tigris of Python; nothing about the language prevents it. It’s simply an issue of what is currently available. I could probably implement this in Python if I had the time and motivation.
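
One small piece of such a wrapper would be building the download URL so that it need not be looked up by hand. A minimal sketch: tiger_roads_url is a hypothetical function, not part of tigris or any existing Python package, and it assumes that the naming pattern of the one URL used above carries over to other counties and years, which is not verified here.

```python
def tiger_roads_url(state_fips, county_fips, year=2019):
    """Build the download URL for a county roads file (hypothetical helper).

    Assumes the file naming pattern seen in the single example URL holds
    for other counties and years; this is an assumption, not a documented rule.
    """
    geoid = f"{state_fips}{county_fips}"
    return (
        f"https://www2.census.gov/geo/tiger/TIGER{year}/ROADS/"
        f"tl_{year}_{geoid}_roads.zip"
    )


# Wisconsin is state 55 and Eau Claire County is 035, as in the URL above
print(tiger_roads_url("55", "035"))
```

A full equivalent would also need a lookup from names such as "WI" and "Eau Claire" to these codes, plus the download, extraction, and cleanup steps shown earlier.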

But aside from tigris, at least two other R packages do not have a Python equivalent and require far more manual labor if R is avoided: tidycensus, which is used for retrieving US census data (along with its spatial data counterpart) and osmdata, which is used for extracting spatial data from OpenStreetMap using the Overpass API. There are probably others.

8. Lacking geo-visualization capabilities

Both Python and R have a wealth of visualization libraries. Python has matplotlib, seaborn, and pygal, and R has ggplot2, among others. When it comes to static visualizations, some of the choice in toolset is simply a matter of personal preference, but I think it’s generally accepted that R produces more aesthetically pleasing non-spatial visualizations (see DataCamp’s comparison of R and Python 6). When it comes to mapping, I think there’s much less debate. I’ve seen ugly maps made in Python, and I’ve seen plenty of ugly maps made in R, many of them made by me. I’ve never seen a beautiful map made in Python, but I have seen some made in R, occasionally with surprisingly simple syntax. Several examples come to mind:

Cartographic work by Yan Holtz 7

Cartographic work by James Cheshire 8

Cartographic work by Yan Holtz 9

Cartographic work by James Cheshire 10

There are also several packages for creating maps in R, each with different strengths:

  • ggplot2: a general purpose plotting library which makes syntax consistent with that of non-spatial visualizations
  • cartography: good for vector data, choropleth maps, and proportional symbol maps
  • tmap: becoming the new standard for visualizations in R, this package contains many wrappers around other popular spatial data packages to make mapping more user friendly

9. Community disconnect

This is a minor and somewhat subjective point but one worth mentioning. When our students encounter other people “in the wild” – searching for help on the internet, going to conferences, attending technical workshops for a general audience – they are far less likely to encounter other Python users “like them” as opposed to R users. As mentioned at the beginning of this piece, Python has a stronger culture of software engineering while R has a stronger culture of data analysis. Most users of R are domain scientists who use scripting as a means to an end, rather than software engineers who view code as the end. Of course, plenty of people use Python for scientific computing, but the proportion of people who do so is much smaller.

It’s also worth noting that many companies develop Python code and contribute to open source projects, but none is as contributory to the language as RStudio is for R. RStudio, a public benefit corporation which makes the popular IDE by the same name, has a vested interest in creating easy-to-use software for domain scientists who are not computer scientists by trade. The company even has a branch dedicated to creating educational resources for R. It’s not that other organizations don’t have educational branches; it’s simply that few languages have an organization dedicated to creating material for students just like ours, and the bulk of educational content RStudio produces is geared toward R more generally, rather than their specific products.

10. Students don’t get it

From my experience teaching with many software programs in a variety of geospatial courses, students simply don’t understand Python well, even those who have taken a 3-hour Python-specific course in the Computer Science department. Beyond this, they don’t enjoy Python much either. The influence of my expertise on the level of student engagement can’t be ruled out, but I believe the challenges associated with Python described throughout this piece play a greater role. When teaching with Python, I spend more time talking about idiosyncrasies of the language and less time teaching spatial analysis. I don’t mind teaching about programming languages, but I care about spatial content much more. I don’t think these problems justify eliminating scripting from a geospatial curriculum altogether. Rather, either the Python problems ought to be rectified or another language/framework should play a greater role in the curriculum.

Shortcomings of R

R is certainly not a perfect language or educational tool, and there are legitimate drawbacks that are worth weighing. I outline several in this section. The first issue is the one most commonly referenced, and I tackle it in depth, largely because I feel it is less important than people make it out to be. The issues with R that follow are more minor, but admittedly some do not have clear resolutions.

1. Employers generally want Python

It’s true that entry level job ads mention Python more commonly than any other language; I’m more than willing to concede that. However, I don’t accept the premise that the role of a university education is to train students for jobs, and even if it were, I certainly don’t think the target should be entry level positions. On the other hand, there is a reality that students might view their education that way, and they expect their education will be relevant for the workplace. Further, part of what drives students in selecting a major is their prospect of employment after graduation. I can understand from an employer’s perspective that, all things being equal, they would rather hire someone who is familiar with their toolset than someone familiar with another ecosystem. Python appears to be the scripting language of choice, but I question how often entry-level employees use scripting in their day-to-day work.

I’d also question why, from a technical perspective, employers would prefer Python experience over R, besides the fact that Python has a reputation of being more tightly integrated with ArcGIS. This is in fact true, but from all of my conversations with potential employers and with students who are in the work force, I have yet to hear of a use case in which Python was unquestionably the right choice as opposed to another language. Again, I think there are many cases where Python is the right choice! But I would speculate that employers prefer Python because of legacy and perceived needs rather than specific use cases, just like the data science job ads I see that reference things like “spacial” analysis and “geospatial informatics system”: they have a general idea of what they are looking for but don’t understand the depth of the topic enough to know how that requirement fits directly into the position. Maybe I’m not giving employers enough credit, but on more than one occasion I’ve heard employers adamantly lobby for Python and then admit to having little to no programming experience themselves. I’d wager that what employers really want is exposure to the language (Python) and scripting experience more generally.

In essence, I honestly believe the vast majority of students who use Python in my classes as opposed to a language like R will be less equipped for workplace success – especially in the long-term – because of the reasons described in the previous sections. They will have a lower ceiling for their specialized field of interest because they will have spent more time learning a language up-front and less time mastering higher level spatial concepts through the language. They’ll be more equipped for software engineering and less equipped for spatial data analysis. I don’t think that’s a wise tradeoff. I can’t be convinced that a student highly skilled in R with no exposure to Python is ill-equipped for the modern workplace, even though I’d advocate for exposure to both languages even if employers’ desires were not a factor. The challenge is convincing employers that students are well-equipped for their organization even if they’ve spent more time in tools not listed explicitly on a job advertisement. For this reason, I think exposure to Python is crucially important, particularly in lower level courses like GIS 1 or Intro. to GIS – at least enough for students to feel confident putting it on their resumes.

Satisfying the employer requirement of Python experience could easily be done in a four week segment of a GIS 1 course or in a standalone one hour course. Would this simply be a “box checking” exercise? Perhaps, but at present, we introduce Python in GIS 1, and I teach with it in GIS 2 and GIS 3, and due to the challenges described here I feel it is little more than a box checking exercise currently. I’ve heard of geospatial educators at other institutions weighing whether or not to teach arcpy, with the only real upside being that it satisfies what employers claim they want. I think there is a place for “teaching to the job” and that this needs to be done to some extent, but teaching with subpar workflows over the course of multiple semesters to satisfy an entry level job requirement is a high price to pay.

Beyond all this, I think the employer desire for Python is largely an artifact of the jobs available based on how the academy approaches GIS – the choice to emphasize database management vs. geostatistics, for instance – rather than a necessity of employment with a geospatial skillset in itself. If this is the primary drawback to R, I don’t think it should be a serious deterrent to its use for geospatial education. Python and R can be taught in the same curriculum, and students can be at least as competitive for entry level jobs with a curriculum that includes Python but focuses more on R as they can with a curriculum centered exclusively around Python.

2. Namespace collisions and functions that appear “out of nowhere”

This is, no doubt, a legitimate concern in R: a user loads several packages and uses their functions:

library(sf)
library(tigris)

wi.places <- places("WI")

plot(wi.places$geometry)

But where did each of these functions come from? Did places come from sf, tigris, or one of the base R packages loaded by default? A user can always find this out through various techniques, and experienced users will simply know out of…well, experience, but it’s not made explicit in the code. Even more confusing is plot: it is part of base R, so it can be called without loading anything, but it won’t work properly with spatial objects (e.g., sf) unless sf is loaded first. There is a subtle difference between “loading” and “attaching” a package that is not intuitive for beginners.

The real killer is a situation where a base R function, like filter, is used without loading the intended package, e.g., dplyr:

#library(dplyr)

filter(mtcars, mpg > 20)

This will throw an error message, but not the one a user expects: R silently falls back to the filter function in the stats package, which is attached by default, so the error says nothing about dplyr being missing. A user can specify the package explicitly, e.g., dplyr::filter rather than filter. The double colon :: should always be used if there is any question about two packages sharing the same function name, but in practice – and in examples students will find online – it’s often omitted.

Python has far less ambiguity with namespace issues. There are fewer “base” functions that are called frequently, and the user typically references the module name (or alias) explicitly within the code. In my experience, I used to commonly see from module import *, making all functions of a module available implicitly, but this is far less common now.
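To make the contrast concrete, here is a minimal sketch using only Python's standard library. The math and cmath modules are chosen purely for illustration (nothing about them is spatial), and describe_sqrt is a hypothetical helper. Both modules define a function named sqrt, which is the same kind of overlap as two R packages sharing a function name.

```python
import cmath
import math

# Explicit prefixes: the code itself says where each sqrt comes from.
real_root = math.sqrt(16)
complex_root = cmath.sqrt(-16)

# Unqualified imports: the later import silently replaces the earlier
# binding, much like one attached R package masking another's function.
from cmath import sqrt
from math import sqrt  # "sqrt" now refers to math.sqrt

shadowed_is_math = sqrt is math.sqrt


def describe_sqrt(value):
    """Hypothetical helper: call the unqualified sqrt and report failures."""
    try:
        return sqrt(value)
    except ValueError:
        return "math.sqrt cannot handle negative input"
```

With the prefixed calls there is never a question of which function ran. With the unqualified name, the reader has to trace the import order to find out, which is the situation R users face by default.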

3. Use of a period in variable and function names

This is the number one complaint I hear from those learning R if they have experience in another language. The period, “.”, is an acceptable and commonly used character in both variable and function names. For those with prior experience in Python, JavaScript, and other languages, this takes some getting used to. Like many other problems, this is a non-issue for those encountering scripting for the first time.

4. Lack of a unified model interface for spatial analysis

I envy Python’s scikit-learn on this point. For problems like interpolation and clustering, sklearn has a consistent syntax for implementing functions and data structures for output. These translate directly to more general machine learning problems. While R has caret for machine learning, it has a smattering of different object types and data structures for various spatial analysis operations outside of the formal Simple Features standard. I believe this will improve over time with the adoption and proliferation of sf, but this problem can be frustrating for novices and experienced users alike.
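As a rough sketch of what that consistency looks like, the example below runs a clustering task and a simple interpolation task on the same made-up coordinates. The data are hypothetical, and using KNeighborsRegressor with weights="distance" as a stand-in for inverse-distance interpolation is my simplification for illustration, not a recommendation of method. It assumes numpy and scikit-learn are installed.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.neighbors import KNeighborsRegressor

# Hypothetical sample sites: x/y coordinates and a measured value at each.
coords = np.array([
    [0.0, 0.0], [1.0, 0.0], [0.0, 1.0],
    [10.0, 10.0], [11.0, 10.0], [10.0, 11.0],
])
values = np.array([1.0, 2.0, 3.0, 10.0, 11.0, 12.0])

# Clustering: construct the estimator, fit it, read results off the object.
clusterer = KMeans(n_clusters=2, n_init=10, random_state=0)
clusterer.fit(coords)
cluster_labels = clusterer.labels_

# Interpolation stand-in: same construct/fit pattern, then predict at
# unsampled locations using inverse-distance-weighted nearest neighbours.
interpolator = KNeighborsRegressor(n_neighbors=3, weights="distance")
interpolator.fit(coords, values)
new_sites = np.array([[0.5, 0.5], [10.5, 10.5]])
estimates = interpolator.predict(new_sites)
```

Both tasks take a plain array in and hand a plain array back, and both follow the same construct-then-fit sequence, so what a student learns on one problem carries over to the next.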

5. Need for production grade machine learning code

This is a non-issue for new programmers/scripters, but if a geospatial department wanted to gear its curriculum toward production grade machine learning, Python would clearly be the right choice.

Conclusion and how Python could make me a believer

When I started writing this post, I expected it to be 1500 - 2000 words rather than pushing 8000. I suppose this is the culmination of three years of being immersed in geospatial education as a person who uses scripting for all spatial analysis research and views it as a crucial tool for students. For the record, I love using both Python and R and will continue using both in research no matter how my courses are taught. I simply believe in using the right tool for the job. Though I’m open to other approaches, at this point it should be clear that I believe R is the tool currently most suitable for teaching geospatial concepts. It might not always be that way. In fact, a feasible scenario of the future is one in which Julia becomes a more prevalent scripting language for data analysis and geospatial education.

All this said, there is hope for Python to make me a believer, and I would less begrudgingly use it in the classroom if several problems were effectively addressed. Namely, if:

  • Students took a general, introductory Python scripting course concurrently with or as a prerequisite to upper level geospatial courses, even if it was just one credit
  • A better IDE emerged (RStudio 1.4 shows promise here)
  • A solution to the configuration challenges was sorted out, including package installation (I would suspect )
  • Python developed a tidyverse equivalent
  • Python spatial data APIs caught up to those of R

It would not even be necessary for all of these to be fully solved, just several. A tidyverse equivalent alone is a tall order, for instance, and in recognition of that I’d be willing to use Python without it.

I would love to hear feedback about how these issues could be addressed or other strategies for teaching scripting in a geospatial curriculum. Please reach out if you have any suggestions. I would be delighted to keep the conversation going.

Footnotes