
Avoid arcpy. Use geopandas.

Overview

When someone wants to begin using Python for spatial analysis or automating tasks involving spatial data, I generally see two camps: (a) the data scientist with significant programming experience but a lack of experience working with spatial data and (b) the GIS practitioner with a wealth of knowledge about spatial data but little experience with Python. When it comes to choosing the correct toolkit in Python, both may choose errant paths, although the latter is more likely to do so than the former.

I’m writing this post because I see a lot of confusion surrounding approaches to Python for spatial analysis, namely arcpy. Generally people assume it should be used for everything spatial; I argue it should be avoided at all costs. That said, there are legitimate use cases for arcpy, though these are few and far between. This post is organized into several sections: (1) a description of what arcpy is and what it should be used for; (2) shortcomings of arcpy and a comparison of common spatial analysis tasks between it and geopandas; (3) legitimate use cases of arcpy.

What is arcpy, really?

Arcpy is a Python library used for working with spatial data within the ESRI ecosystem. ESRI is the company that creates proprietary GIS software such as ArcGIS Pro. Previously, the ecosystem was more fractured into programs like ArcMap, ArcCatalog, and ArcScene, but more recently such standalone programs have become more integrated through ArcGIS Pro. Arcpy is likely the correct tool for two categories of tasks:

  • Operations in which writing to an ESRI geodatabase (or creating a geodatabase feature class) is an absolute necessity
  • Automating ESRI license management operations and other license-restricted tasks unrelated to analysis

I’m going to ignore the second category of operations above, as it is beyond my expertise (and interests, for that matter), and focus on the first category. If the end product is a geodatabase feature class – i.e., data needs updating and the storage format is non-negotiable – then arcpy may be unavoidable. For other types of tasks, there are other options. In some cases it may still make sense to use arcpy for the sake of consistency with other organizational operations, but that is simply a matter of preference. Quite commonly, though, the need to write to a geodatabase is grossly overstated.

For example, consider a situation in which the end result of a workflow is a report or a publication. Here, data is not the final product; it is only an intermediate step toward an end goal. Even if the data is housed in an ESRI geodatabase, it can be read with geopandas – or any software that uses the Geospatial Data Abstraction Library (GDAL) as a backend – and any intermediate data can be stored as a GeoJSON file, a shapefile, a PostgreSQL/PostGIS table, or simply in memory.

In essence, using arcpy for analytical tasks is like running a headless version of ArcMap. This differs markedly from the vast majority of Python workflows. Those with programming experience will find this approach clunky and awkward. Those without programming experience are likely to assume that scripting their solution is more work than it’s worth. Basically, arcpy was created to mimic the workflow of clicking buttons, when the modus operandi of scripting should be drastically different. ArcMap – and much of the functionality in ArcGIS Pro – is set up to be as user-friendly as possible for beginners, with the downside that it severely restricts power users. Scripting is a powerful approach, and it should not be relegated to a subpar workflow.

Much of the awkwardness of arcpy can be avoided with the more pythonic approach of geopandas1, and this can be demonstrated with side-by-side code examples of each.

Code examples

In the examples below, I’ll provide two comparisons of tasks:

  1. A simple buffer analysis

  2. An attribute table modification

It may be tempting to think I’m simply picking and choosing examples which make arcpy look inferior. On the contrary, I intentionally picked these examples so that the arcpy solutions would not be unwieldy. If more complex needs arise – such as ingesting data from the internet, using spatial data APIs, or combining data from disparate sources – the arcpy examples would be far more cumbersome.

Example 1: Buffer analysis

This example will involve several operations:

  • Read a spatial dataset from disk
  • Transform the data into a different CRS
  • Select certain features
  • Create a buffer
  • Save the output

In geopandas, the workflow would look something like this:

import geopandas as gpd

# read data from disk
crashes = gpd.read_file('/home/matt/Downloads/Crashes_Involving_Cyclists.geojson')

# transform x/y data
crashes_proj = crashes.to_crs(32119)

# select crashes with more than one driver
crashes_mult = crashes_proj[crashes_proj['drivers'] > 1]

# buffer the selected crashes
crashes_buf = crashes_mult.buffer(500)

# save
crashes_buf.to_file('/home/matt/Downloads/crashes_buffer.shp')

In arcpy, it would look like this:

import arcpy
from arcpy import env

# set workspace for the sake of intermediate file creation
env.workspace = 'C:/Users/haffnerm/Downloads/'

# read data from disk and transform
arcpy.management.Project('Crashes_Involving_Cyclists.shp',
                         'crashes_proj.shp',
                         arcpy.SpatialReference(32119))

# select crashes with more than one driver
arcpy.Select_analysis('crashes_proj.shp',
                      'crashes_mult.shp',
                      'drivers > 1')

# buffer
arcpy.Buffer_analysis('crashes_mult.shp',
                      'crashes_buf.shp',
                      '500 meters')

These two examples differ fundamentally in several ways. First, in the geopandas example, intermediate operations are handled in memory, whereas in the arcpy example everything is written to disk. Not only is this latter approach much slower, it creates many unnecessary intermediate files that take up space on the file system. Though it is possible to save data to a “memory” workspace with arcpy, this is not standard practice, and virtually none of the provided examples demonstrate it. Even if the discrepancy between disk and memory were not an issue, the following problem – which is more severe – would persist.
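To make the in-memory point concrete, the entire pipeline from Example 1 can be chained into a single expression, with nothing written to disk until the final (optional) save. A sketch with a small constructed dataset standing in for the original GeoJSON read (assumes geopandas is installed):

```python
import geopandas as gpd
from shapely.geometry import Point

# Toy stand-in for gpd.read_file(...): three crash points in WGS84
crashes = gpd.GeoDataFrame(
    {'drivers': [1, 2, 3]},
    geometry=[Point(-79.0, 35.8), Point(-79.1, 35.9), Point(-79.2, 36.0)],
    crs='EPSG:4326',
)

# project -> select -> buffer, all in memory, no intermediate files
crashes_buf = (
    crashes
    .to_crs(32119)         # NC State Plane (meters)
    .query('drivers > 1')
    .buffer(500)
)
```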

Since there are no intermediate Python objects created with arcpy, every facet of the data must be extracted using arcpy calls rather than Python methods or properties. This is absolutely debilitating. While the examples above do not appear radically different, the workflows leading to these end results differ markedly. For example, to retrieve the number of rows, number of columns, field names, and CRS information from a geopandas object, one can simply use:

# number of rows
crashes_buf.shape[0]

# number of columns
crashes_buf.shape[1]

# field names
crashes.columns

# crs info
crashes.crs

Versus arcpy:

# number of rows
int(arcpy.GetCount_management('crashes_buf.shp').getOutput(0))

# number of columns
len(arcpy.ListFields('crashes_buf.shp'))

# field names
[f.name for f in arcpy.ListFields('crashes_buf.shp')]

# crs info
arcpy.Describe('crashes_buf.shp').spatialReference

In these arcpy examples, there is far more typing and a much more roundabout way of extracting essential qualities of the data. Further, what if a user wanted to simply count the number of crashes where the number of drivers is greater than 1 without incorporating this information into an end script? Or greater than 2? Or greater than 3?

With geopandas, these intermediate “queries” could be accomplished with this:

# greater than 1
sum(crashes['drivers'] > 1)

# greater than 2
sum(crashes['drivers'] > 2)

# greater than 3
sum(crashes['drivers'] > 3)

With arcpy, they would look more like this:

# greater than 1
arcpy.Select_analysis('Crashes_Involving_Cyclists.shp',
                      'crashes_greater_1.shp',
                      'drivers > 1')

int(arcpy.GetCount_management('crashes_greater_1.shp').getOutput(0))

# greater than 2
arcpy.Select_analysis('Crashes_Involving_Cyclists.shp',
                      'crashes_greater_2.shp',
                      'drivers > 2')

int(arcpy.GetCount_management('crashes_greater_2.shp').getOutput(0))

# greater than 3
arcpy.Select_analysis('Crashes_Involving_Cyclists.shp',
                      'crashes_greater_3.shp',
                      'drivers > 3')

int(arcpy.GetCount_management('crashes_greater_3.shp').getOutput(0))

Over the course of any session longer than a minute or two, the constant arcpy calls become unwieldy. Combine them with the need to use other Python objects (e.g., non-spatial data) and the workflow is crippling. Scripting is an interactive and iterative process which should involve substantial amounts of exploration and querying of intermediate datasets. This is incredibly difficult to achieve with arcpy.
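This kind of interactive querying collapses to a one-liner once the attribute table is an ordinary dataframe. A minimal sketch, with a toy pandas frame standing in for the crashes layer:

```python
import pandas as pd

# Toy stand-in for the crashes attribute table
crashes = pd.DataFrame({'drivers': [1, 1, 2, 2, 3, 4]})

# Count crashes above each threshold without writing a single file
counts = {k: int((crashes['drivers'] > k).sum()) for k in (1, 2, 3)}
# counts -> {1: 4, 2: 2, 3: 1}
```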

Example 2: Attribute table modification

Pandas and geopandas are relatively recent arrivals (both emerged within the last ten years), and these libraries have made data analysis much simpler. The pandas dataframe was designed to mimic R-like data structures so that end users could employ vectorized operations and avoid loops. For example, borrowing our example dataset from earlier, suppose a user wanted to create a field called injury_total which adds the columns type_a_injury, type_b_injury, and type_c_injury. In geopandas, this would be accomplished as follows:

import geopandas as gpd

# read data from disk
crashes = gpd.read_file('/home/matt/Downloads/Crashes_Involving_Cyclists.geojson')

# create new column
crashes['injury_total'] = crashes['type_a_injury'] + crashes['type_b_injury'] + crashes['type_c_injury']

This is straightforward enough. The + operator simply adds the values together, and a new column is created “on-the-fly”. With arcpy, however, there is a larger separation between creating a field and computing it, and computing fields is much more cumbersome, requiring loops and cursors. Further, a cryptic error related to field-name length appears along the way.

import arcpy
from arcpy import env

# set workspace for the sake of intermediate file creation
env.workspace = 'C:/Users/haffnerm/Downloads/'

# create variable for shapefile name
fc = 'Crashes_Involving_Cyclists.shp'

# create column
arcpy.AddField_management(fc, 'inj_total', "LONG")

fields = ['inj_total', 'type_a_inj', 'type_b_inj', 'type_c_inj']

# use an update cursor to update the field values
with arcpy.da.UpdateCursor(fc, fields) as cursor:
    for row in cursor:
        row[0] = row[1] + row[2] + row[3]
        cursor.updateRow(row)

If one tries to create a field titled injury_total, an error message will appear stating that the field name must be no longer than 10 characters. This is understandable, since it is a limitation of the shapefile format itself. But arcpy will also truncate existing field names, such as type_a_injury into type_a_inj, which can be problematic when working with disparate systems where field names are not expected to change (I ran into this exact issue in writing these scripts). It should be noted that the 10-character limit is not a problem for geodatabase feature classes (their limit is 64), but the discrepancy between field-length limits in arcpy – which does not exist in geopandas – is confusing to new users.

Most of the students I work with are new to programming, and I wouldn’t dare have them use cursors in a course where they are first exposed to scripting. The amount of time it would take to cover loops and cursors in a class where students have no prior programming experience would simply be too great. But aside from this, using a loop is much slower than the previous vectorized geopandas approach; that is partially why pandas was developed.
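The vectorized-versus-loop contrast can be shown with plain pandas alone. A sketch with toy injury columns (names already shortened, as the shapefile example forces):

```python
import pandas as pd

# Toy stand-in for the crashes attribute table
crashes = pd.DataFrame({
    'type_a_inj': [1, 0, 2],
    'type_b_inj': [0, 1, 1],
    'type_c_inj': [2, 0, 0],
})

# Vectorized: one expression, no cursor, no explicit loop
crashes['inj_total'] = crashes[['type_a_inj', 'type_b_inj', 'type_c_inj']].sum(axis=1)

# Cursor-style equivalent: row by row, wordier and slower on large tables
loop_totals = [a + b + c for a, b, c in zip(crashes['type_a_inj'],
                                            crashes['type_b_inj'],
                                            crashes['type_c_inj'])]
# both give [3, 1, 3]
```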

Other shortcomings and caveats

These two examples demonstrate the ease of use of geopandas compared to arcpy, but there are other shortcomings which I did not demonstrate but which are still worth mentioning. For instance, the primary file format that I use for sharing data with students is GeoJSON – it’s plain text, portable since it’s a single file, and widely used. But ArcGIS products cannot read this format directly (it must first be converted), so it’s out of the question with arcpy. Aside from this, downloading other data from the internet (e.g., .csv files) and using it directly in scripts is seamless with geopandas but awkward and time-consuming with arcpy.
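The CSV-ingestion point deserves a quick sketch: pd.read_csv accepts a URL directly, and the resulting frame can then be promoted to a GeoDataFrame. Below, an in-memory string stands in for the downloaded file so the example is self-contained (the URL and column names are hypothetical):

```python
import io

import pandas as pd

# Stand-in for a downloaded file; with a live URL this would simply be
# pd.read_csv('https://example.com/crashes.csv')
raw = io.StringIO("lon,lat,drivers\n-79.0,35.8,1\n-79.1,35.9,2\n")
df = pd.read_csv(raw)

# From here, promoting to a spatial frame is one line (assuming geopandas):
# gdf = gpd.GeoDataFrame(df,
#                        geometry=gpd.points_from_xy(df['lon'], df['lat']),
#                        crs='EPSG:4326')
```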

The push toward using ESRI geodatabases is obvious from ESRI’s perspective: that is, the desire to create technological lock-in and encourage people to use their products. Private companies make decisions in their own best interest; we shouldn’t expect otherwise. But because of this, it should not be assumed that a company like ESRI simply makes workflow decisions in the best interest of end users.

A major downside not demonstrated in this piece is the inability to visualize data and outputs when using arcpy. With geopandas, there are matplotlib, seaborn, and a wealth of other data visualization tools available. One could argue that it would be easy to keep ArcMap or ArcGIS Pro open to visualize data alongside arcpy, but this misses the mark on a couple of levels. The end goal of scripting is automation, and if a point-and-click program is perpetually used as a companion to scripting, much of the advantage will ultimately be lost. Novice scripters who take this approach will soon wonder why anything should be scripted at all, and they will be tempted to simply conduct their analyses with the mouse.
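Keeping visualization inside the script looks like this in its most minimal form – a scatter of toy coordinates via matplotlib (with a real layer, geopandas’ crashes.plot() works the same way). The headless backend and output path are assumptions for running without a display:

```python
import os
import tempfile

import matplotlib
matplotlib.use('Agg')  # headless backend: render without a display
import matplotlib.pyplot as plt

# Toy coordinates standing in for crash locations
xs = [-79.0, -79.1, -79.2]
ys = [35.8, 35.9, 36.0]

fig, ax = plt.subplots()
ax.scatter(xs, ys)
ax.set_title('Crash locations (toy data)')

out = os.path.join(tempfile.mkdtemp(), 'crashes.png')
fig.savefig(out)
plt.close(fig)
```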

The ArcGIS API for Python does seem to improve on some shortcomings mentioned in this piece (notably visualization and mapping), but some of its procedures are very similar to those of geopandas. If that is the case, it is puzzling why someone would choose the ArcGIS API for Python given that its development happens behind closed doors, making the software difficult to extend. My prediction is that the ArcGIS API for Python will never be able to compete with geopandas for this reason. Facebook and Google, having seen how they were left out of the party with respect to Hadoop, notably open-sourced PyTorch and TensorFlow, respectively, recognizing the necessity of community contributions for such large projects.

Conclusion

It’s not surprising to me when I encounter someone with arcpy experience who sees little value in scripting. The amount of time it takes to become proficient in the techniques needed to do something as simple as creating or updating a field in an attribute table is simply too steep a barrier. My first exposure to scripting with spatial data was in a GIS Programming course in grad school where we exclusively used arcpy. Admittedly, geopandas was not mature at the time, so there may not have been another Python-centric option. Either way, the course made very little sense to me. Looking back, I’m shocked that it did not drive me away from scripting entirely. I ended up using Python heavily in my master’s thesis; I needed to make roughly 80 million calculations and scripting was the only option. Admittedly, I used ArcMap for the spatial operations and Python for the computational portions.

As I mentioned in the overview, if a person needs to write to an ESRI geodatabase, then arcpy may be the right choice. But it should be noted that it would be easy to conduct the analysis with geopandas, save the data in some intermediate file format, and then write to a geodatabase with arcpy – and this multi-tool workflow would probably be simpler than using arcpy throughout. The workflows where arcpy is beneficial or required are niche. A friend of mine who works at the Wisconsin DNR once told me that the best use case for arcpy is to get spatial data out of the ESRI ecosystem and into something usable. I think that’s acceptable too.

I came across a fascinating Twitter thread on the topic of teaching arcpy started by a faculty member who I deeply respect from another institution. He wanted to scrap it from the curriculum in favor of other approaches but was hesitant to do so based on arcpy’s prevalence in job ads. The post received many replies from others in the same boat, but the general consensus was that arcpy should be kept for exactly the reason he mentioned.

Herein lies a great example of a positive feedback loop. Employers seem to want arcpy, not due to specific use cases but because it is perceived to be useful. University faculty are increasingly having to “teach to the job” rather than educate, so they design curricula around what employers think they want, regardless of whether or not that makes sense. Students become equipped with arcpy – and sometimes that is the only scripting approach they learn – but once they get a job they don’t use it in the workplace because it’s too cumbersome.2 When those employees later play a role in hiring, they add arcpy to the list of desired skills, hoping somebody else has figured it out. These employers assume universities are teaching with it because it is listed on their entry-level job ads. They are probably correct, since these documents are increasingly viewed as golden tablets of curriculum design. So the process repeats and bad workflows perpetuate.

If it is necessary to satisfy the perceived needs of employers for entry-level jobs, then perhaps a quick lesson on arcpy would be beneficial, but only for use cases which make sense. A geospatial curriculum centered around subpar workflows like arcpy will do nothing more than drive students away, further exacerbate geography students’ programming anxiety, and potentially cause serious problems for the discipline of geography. While this last point might sound needlessly alarmist, the post is long enough and I’ll expand on it another day.

Footnotes


  1. And as I’ve argued elsewhere, most of the awkwardness of geopandas is avoided by using R, but that’s a conversation I’ve covered in another post.↩︎

  2. I have a former student in mind who is in this exact boat in his current job.↩︎