Data Visualization
A Practical Introduction
Second Edition
Kieran Healy
March 2026
This is the website for the second edition of Data Visualization: A Practical Introduction, forthcoming from Princeton University Press. It’s not yet available for pre-order.
Preface
You should look at your data. Graphs and charts let you explore and learn about the structure of the information you collect. Good data visualizations also make it easier to communicate your ideas and findings to other people. Beyond that, producing effective plots from your own data is the best way to develop a good eye for reading and understanding graphs—good and bad—made by others, whether presented in research articles, business slide decks, public policy advocacy, or media reports. This book teaches you how to do it.
My main goal is to introduce you to both the ideas and the methods of data visualization in a sensible, comprehensible, reproducible way. Some classic works on visualizing data, such as The Visual Display of Quantitative Information (Tufte 1983), present numerous examples of good and bad work together with some general taste-based rules of thumb for constructing and assessing graphs. In what has now become a large and thriving field of research, more recent work provides excellent discussions of the cognitive underpinnings of successful and unsuccessful graphics, again providing many compelling and illuminating examples (Ware 2008). Other books provide good advice about how to graph data under different circumstances (Few 2009; Munzner 2014; Cairo 2013), but choose not to teach the reader about the tools used to produce the graphics they show. This may be because the software used is some (proprietary, costly) point-and-click application that requires a fully visual introduction of its own, such as Tableau, Microsoft Excel, or SPSS. Or perhaps the necessary software is freely available, but showing how to use it is not what the book is about (Cleveland 1994). Conversely, there are excellent cookbooks that provide code “recipes” for many kinds of plot (Chang 2013). But for that reason they do not take the time to introduce the beginner to the principles behind the output they produce. Finally, we also have thorough introductions to particular software tools and packages, including the one we will use in this book (Wickham, Çetinkaya-Rundel, and Grolemund 2023). These can sometimes be hard for beginners to digest, as they may presuppose a background that the reader does not have.
Each of the books I have just cited is well worth your time. When teaching people how to make graphics with data, however, I have repeatedly found the need for an introduction that motivates and explains why you are doing something but that does not skip the necessary details of how to produce the images you see on the page. And so this book has two main aims. First, I want you to get to the point where you can reproduce almost every figure in the text for yourself. Second, I want you to understand why the code is written the way it is, such that when you look at data of your own you can feel confident about your ability to get from a rough picture in your head to a high-quality graphic on your screen or page.
What’s New in This Edition?
This is the second edition of this book. Every chapter has been revised and updated to reflect the continuing development of R and ggplot2. The text was written using R version 4.5 and ggplot version 4. Supporting packages were also used with their latest versions at the time of writing. Since the first edition, base R has acquired a pipe operator, |>, and so the book uses that in preference to the magrittr pipe, %>%. Version 4 of ggplot introduced a number of significant updates to the theming system that are reflected in this book, as well as numerous other smaller changes that have also been incorporated. The orienting discussion about R and RStudio presumes the use of Quarto rather than RMarkdown for notebook-style work, though this does not affect any of the plotting code. Some datasets from the first edition have been updated and some new ones have been introduced. These are included in the accompanying socviz package.
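As a minimal sketch of what the native pipe does, consider the short calculation below. The vector x and the particular chain of functions are made up for illustration and are not taken from the book. The piped version reads from left to right rather than from the inside out:

```r
# A small made-up vector, used only for illustration.
x <- c(1, 1, 4, 1, 1, 4, 1)

# Nested calls are read from the inside out: sum, then sqrt, then round.
nested <- round(sqrt(sum(x)), 2)

# With the base R pipe (R 4.1 or later), x |> f() means f(x),
# so the same steps read from left to right.
piped <- x |>
  sum() |>
  sqrt() |>
  round(2)

# Both approaches give the same answer.
identical(nested, piped)
```

In a simple case like this one the magrittr pipe, %>%, would behave the same way once magrittr or the tidyverse is loaded.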
Some chapters have been revised to the point of being almost wholly rewritten. The biggest changes are in the second half of the book. Chapter 6, Work with Models, and especially Chapter 7, Draw Maps, are mostly new, reflecting the availability of packages for working with models (marginaleffects) and geographic data (sf and related packages). These are either newly available or have fully matured since the first edition was written. Chapter 8, Refine your Plots, has been extensively revised, and so has the Appendix.
Data visualization with ggplot is a very good entry point to the wider world of writing and working with code of all kinds. You can move quickly from a standing start to confidently producing graphics that would be hard to make on many other platforms. Graphs are also more rewarding to make than tables. However, any analysis is going to spend a lot of time working with rectangular tables of data. If you can be confident about how to split, aggregate, and reshape those rectangles, everything becomes easier. The ability to create the table you want, and to fluidly switch between different representations of it, is an important skill when you want to draw a good graph. So this edition has a little more emphasis, especially in Chapters 5 and 6, on the data-wrangling tools that R and the tidyverse provide to help us do that preparatory work. Throughout the book I have tried to preserve and extend what readers and reviewers found most useful in the first edition. For me that has meant keeping in mind the “curse of knowledge”. Once you know something, it is very hard to remember what it is like not to know it. Having taught the material in this book many times, and gotten reports from many people who have learned R from it, I have tried to keep things clear and accessible.
There are curses of ignorance, too. Perhaps you have a robot to help you write your code now. Large Language Models (LLMs) and coding agents are now part of the workflow of code generation and evaluation. They can do a great deal; so much so that it might seem superfluous to spend any time with the iterative, write-try-redo approach to visualization that this book presents. Can’t the robot write all the code instead? Not quite. It’s not that I believe repeatedly doing repetitive and error-prone tasks yourself is a virtue. To the contrary, that’s what computers are for. This book is full of examples where we end up automating something in order not to worry about it. But I also want you, the reader, to learn how to do good graphical work in a reproducible way. That means having a keen eye for quality and a good nose for error. Cultivating those senses requires practice and a vocabulary to express them. It seems faintly absurd to have to say so explicitly, but whatever tools you use, your work will be better if you know what you are doing and understand why you are doing it. This book teaches you ggplot specifically, but it is not trying to lock you in to a particular framework. It’s just that the way you acquire a general skill or a wide-ranging taste is by first learning some more specific version of those things, and then practising them. Automation can come later. In the words of the author Ann Leckie, you don’t learn how to do something by not doing it. For that reason, this book remains a hands-on introduction.
What You Will Learn
This book introduces the principles and practice of looking at and presenting data using R and ggplot. R is a powerful, widely used, and freely available programming language for data analysis. You may be interested in exploring ggplot after having used R before, or be entirely new to both R and ggplot and just want to graph your data. I do not assume you have any prior knowledge of R.
After installing the software we need, we begin with an overview of some basic principles of visualization. We focus not just on the aesthetic aspects of good plots, but on how their effectiveness is rooted in the way we perceive properties like length, absolute and relative size, orientation, shape, and color. We then learn how to produce and refine plots using ggplot2, a powerful, versatile, and widely-used visualization library for R (Wickham 2016). The ggplot2 library implements a “grammar of graphics” (Wilkinson 2005). This approach gives us a coherent way to produce visualizations by expressing relationships between the attributes of data and their graphical representation.
Through a series of worked examples, you will learn how to build plots piece by piece, beginning with scatterplots and summaries of single variables, then moving on to more complex graphics. Topics covered include plotting continuous and categorical variables; layering information on graphics; faceting grouped data to produce effective “small multiple” plots; transforming data to easily produce visual summaries on the graph such as trend lines, linear fits, error ranges, and boxplots; and creating maps, along with some alternatives to maps worth considering when presenting country- or state-level data. We will also cover cases where we are not working directly with a dataset, but rather with estimates from a statistical model. From there, we will explore the process of refining plots to accomplish common tasks such as highlighting key features of the data, labeling particular items of interest, annotating plots, and changing their overall appearance. Finally, we will examine some strategies for presenting graphical results in different formats and to different sorts of audiences.
If you follow the text and examples in this book, then by the end you will:
- understand the basic principles behind effective data visualization;
- have a practical sense for why some graphs and figures work well, while others may fail to inform or actively mislead;
- know how to create a wide range of plots in R using ggplot2; and
- know how to refine plots for effective presentation.
Learning how to visualize data effectively is more than just knowing how to write code that produces figures from data. This book will teach you how to do that. But it will also teach you how to think about the information you want to show, and how to consider the audience you are showing it to—including the most common case, when the audience is yourself.
This book is not a comprehensive guide to R, or even a comprehensive survey of everything ggplot can do. Nor is it a cookbook containing just examples of specific things people commonly want to do with ggplot. (Both these sorts of books already exist.) Neither is it a rigid set of rules, or a sequence of beautifully finished examples that you can admire but not reproduce. My goal is to get you quickly up and running in R, making plots in a well-informed way, with a solid grasp of the core sequence of steps (taking your data, specifying the relationship between variables and visible elements, and building up images layer by layer) that is at the heart of what ggplot does.
Learning ggplot does mean learning a little bit about how R works, and also understanding how ggplot connects to other tools in the R language. As you work your way through the book, you will gradually learn more about some very useful idioms, functions, and techniques for manipulating data in R. In particular you will learn about some of the tools provided by the tidyverse library that ggplot belongs to. Similarly, although this is not a cookbook, once you get past Chapter 1, Look at Data, you will be able to see and understand the code used to produce almost every figure in the book. In most cases you will also see these figures built up piece by piece, a step at a time. If you use the book as it is designed, by the end you will have the makings of a version of the book itself, containing code you have written out and annotated yourself. And though we do not go into great depth on the topic of rules or principles of visualization, the discussion in Chapter 1 and its application throughout the book gives you more to think about than just a list of graph types. By the end of the book you should be able to look at a figure and see it in terms of ggplot’s grammar, understanding how the various layers, shapes, and data are pieced together to make a finished plot.
The Right Frame of Mind
It can be a little disorienting to learn about any programming language, mostly because at the beginning there seem to be so many pieces to fit together in order for things to work properly. It can seem like you have to learn everything before you can do anything. The language has some possibly unfamiliar concepts that define how it works, like “object”, or “function”, or “class”. The syntactic rules for writing code are annoyingly picky. Error messages seem obscure; help pages are terse; other people seem to have had not quite the same issue as you. There is also a wider environment of supporting applications and tools that are good to know about, but involve new concepts of their own—editors that highlight what you write; applications that help you organize your code and its output; ways of writing your code that let you keep track of what you have done. It can all seem a bit confusing. You might be tempted to hand the entire burden off to a coding agent. Resist this temptation when you’re starting out.
It’s worth learning to understand what you are doing. Beginning with graphics is more rewarding than some of the other places you might begin, because you will be able to see the results of your efforts very quickly. As you build your confidence and ability in this area, you will gradually see the other tools as things that help you sort out some issue, or solve a problem that’s stopping you from making the picture you want. That makes them easier to learn. As you acquire them piecemeal—perhaps initially using them without completely understanding what is happening—you will begin to see how they fit together, and be more confident of your own ability to do what you need to do.
Free tools for coding have been around for a long time, but in recent years what you might call the “ecology of assistance” has gotten much better, both in terms of help from people and help from robots. There are more resources available for learning the various pieces, and more of them are oriented to the way writing code actually happens most of the time—which is to say, iteratively, in an error-prone fashion, and taking account of problems other people have run into and solved before.
How to Use This Book
This book can be used in any one of several ways. At a minimum, you can sit down and read it for a general overview of good practices in data visualization, together with many worked examples of graphics from their beginnings to a properly finished state. Even if you do not sit down and work through the code, you will get a good sense of how to think about visualization and a better understanding of the process through which good graphics are produced.
More usefully, if you set things up as described in Chapter 2, Get Started, and then work through the examples, then you will end up with a data visualization book of your own. If you approach the book this way, then by the end you will be comfortable using ggplot in particular and also be ready to learn more about the R language in general.
This book can also be used to teach with, either as the main focus of a course on data visualization or as a supplement to undergraduate or graduate courses in statistics or data analysis. My aim has been to make the “hidden tasks” of coding and polishing graphs more accessible and explicit. I want to make sure you are not left with the “How to Draw an Owl in Three Steps” problem common to many tutorials. You know the one. The first two steps are shown clearly enough. Sketch a few bird-shaped ovals. Make a line for a branch. But the final step, an owl such as John James Audubon might have drawn, is presented as a simple extension for readers to figure out for themselves.
If you have never used R or ggplot, you should start at the beginning of the book and work your way through to the end. If you know about R already but only want to learn the core of ggplot, then after installing the software described below, focus on Chapters 3 through 5. Chapter 6, Work with Models, necessarily incorporates some material on statistical modeling that the book cannot develop fully. This is not a statistics text. So, for example, I show generally how to fit and work with various kinds of model in that chapter, but I do not go through the important details of fitting, selecting, and fully understanding different approaches. I provide references in the text to other books that have this material as their main focus.
Each chapter ends with a section suggesting where to go next (apart from continuing to read the book). Sometimes I suggest other books or websites to explore. I also ask questions or pose some challenges that extend the material covered in the chapter, encouraging you to use the concepts and skills you have learned.
Conventions
This book alternates between regular text (like this), samples of code that you can type and run yourself, and the output of that code. In the main text, references to objects or other things that exist in the R language or in your R project (like tables of data, variables, functions, and so on) will also appear in a monospaced or “typewriter” typeface. Code you can type directly into R at the console will be in gray boxes, and also monospaced. Like this:
my_numbers <- c(1, 1, 4, 1, 1, 4, 1)
If you type that line of code into R’s console it will create a thing called my_numbers. Doing this doesn’t produce any output, however. When we write code that also produces output at the console, we will first see the code (in a gray box) and then the output in a monospaced font against a white background. Here we add two numbers and see the result:
4 + 1
[1] 5
In this book, and also at the R console, if what you did results in a series of things being displayed at the console (like the individual observations from a variable, or the elements of a list, and so on), each line of your output will be prefaced by a number in square brackets at the beginning. It looks like this: [1]. This is not part of the output itself, but just a counter or index keeping track of how many items have been printed out so far. In the case of adding 4 + 1 we got just one, or [1], thing back—the number five. If there are more elements returned as the result of some instruction or command, the counter will keep track of that on each line. In this next bit of code we will tell R to show us the lower-case letters of the alphabet:
letters
 [1] "a" "b" "c" "d" "e" "f" "g" "h" "i" "j" "k" "l" "m" "n" "o" "p" "q" "r" "s"
[20] "t" "u" "v" "w" "x" "y" "z"
You can see the counter incrementing on each line as it keeps count of how many letters have been printed. Like much of technical computing, this feature of console output traces its origins back to the days of teletype machines mechanically printing out results on sheets of paper.
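To see the counter at work on something other than the alphabet, here is a small sketch. The object name thirty is just an illustrative choice. Where each line wraps depends on how wide your console is, so the printed output is not shown here.

```r
# Thirty numbers: 10, 20, ..., 300. Long enough to wrap across
# several lines in a console of ordinary width.
thirty <- seq(from = 10, to = 300, by = 10)

# Printing the object shows a bracketed index at the start of each line,
# giving the position of the first element printed on that line.
thirty

# The same square-bracket notation picks out an element by position.
# In the letters output above, the second line began with [20] and "t".
letters[20]

# The twentieth element of our own vector.
thirty[20]
```

Try making your console window narrower or wider and printing thirty again. The numbers in brackets should change to match wherever the lines now break.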
Before You Begin
The book is designed for you to follow along in an active way, writing out the examples and experimenting with the code as you go. You will be able to reproduce almost all of the plots in the text. You will need to install some software first. Here is what to do:
- Get the most recent version of R. R is free and available for Windows, Mac, and Linux operating systems. Download the version of R compatible with your operating system. If you are running Windows or macOS, you should choose one of the precompiled binary distributions (i.e., ready-to-run applications) linked at the top of the R Project’s webpage.
- Once R is installed, download and install RStudio. RStudio is an “Integrated Development Environment”, or IDE. It is a front-end that makes R easier to work with. RStudio is also free, and available for Windows, Mac, and Linux platforms. RStudio is not required to use R. You can use any plain-text editor to write R code and send it to the R interpreter yourself. Or you can use other IDEs or programming-focused text editors such as VS Code, Sublime Text, BBEdit, Zed, Neovim, or Emacs. These all have anywhere from minimal to quite good support for writing R. A notable alternative to RStudio is Positron, an offshoot of VS Code specifically designed for “data science” work broadly conceived. Still in development, it has strong support for R and Python and is made by the same people who make RStudio. RStudio remains the most widely used IDE for R, however, so we use that in this book.
- Install the tidyverse and several other add-on packages for R. These packages provide useful functionality that we will take advantage of throughout the book. You can learn more about the tidyverse family of packages at its website.
To install the tidyverse, make sure you have an Internet connection and then launch RStudio. Type the following lines of code at R’s command prompt, located in the window named “Console”, and hit return. In the code below, the <- arrow is made up of two keystrokes, first < and then the short dash or minus symbol, -. In RStudio you can also type the arrow with a keyboard shortcut: hold option and then press - on a Mac, or alt and then - on Windows.
my_packages <- c("tidyverse", "broom", "gapminder",
                 "geomtextpath", "ggrepel", "gridExtra",
                 "here", "marginaleffects", "maps",
                 "mapdata", "MASS", "patchwork",
                 "quantreg", "rlang", "scales",
                 "sf", "socviz", "survey", "srvyr")
install.packages(my_packages,
                 repos = "https://cran.rstudio.com")
RStudio should then download and install these packages for you. It may take a little while to download everything. Once the installation process has finished, we can get started.
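If you want to check that everything arrived, one possible approach is to compare the packages you asked for with the ones R reports as installed. This is only a sketch: it uses base R's installed.packages() and setdiff() functions rather than anything specific to this book, and the names installed and missing_packages are made up for illustration.

```r
# The same package list used in the installation step.
my_packages <- c("tidyverse", "broom", "gapminder",
                 "geomtextpath", "ggrepel", "gridExtra",
                 "here", "marginaleffects", "maps",
                 "mapdata", "MASS", "patchwork",
                 "quantreg", "rlang", "scales",
                 "sf", "socviz", "survey", "srvyr")

# Names of every package R can currently find in your library.
installed <- rownames(installed.packages())

# Any requested packages that are not in that list.
missing_packages <- setdiff(my_packages, installed)

# An empty result, shown as character(0), means nothing is missing.
missing_packages
```

If any names are printed, you can pass missing_packages to install.packages() to try installing just those packages again.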