Number of PhDs by Field

An Incomplete Data Exploration

By Harshvardhan in R statistics economics thoughts

February 20, 2022

Yesterday I was talking to one of my friends about his plans post PhD. “I want to go for pure sciences and abstract mathematics, but there are hardly any positions in academia on these topics,” he said. It got me thinking about how many PhD students graduate every year and whether the demand (in academia or in industry) falls short of that. But I didn’t even know how many PhDs are awarded each year, let alone how many graduates find employment.

While searching for a dataset for my Text Mining class project, I discovered this dataset on the number of PhDs by field. So, let’s explore!

library(tidyverse)
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.1 ──
## ✓ ggplot2 3.3.5          ✓ purrr   0.3.4     
## ✓ tibble  3.1.6          ✓ dplyr   1.0.8.9000
## ✓ tidyr   1.2.0          ✓ stringr 1.4.0     
## ✓ readr   2.1.2          ✓ forcats 0.5.1
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()
library(garlic)
library(DT)
theme_set(theme_linedraw())
# Loading dataset from their repository
phds = readr::read_csv("https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2019/2019-02-19/phd_by_field.csv")
## Rows: 3370 Columns: 5
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (3): broad_field, major_field, field
## dbl (2): year, n_phds
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
phds
## # A tibble: 3,370 × 5
##    broad_field   major_field                                 field   year n_phds
##    <chr>         <chr>                                       <chr>  <dbl>  <dbl>
##  1 Life sciences Agricultural sciences and natural resources Agric…  2008    111
##  2 Life sciences Agricultural sciences and natural resources Agric…  2008     28
##  3 Life sciences Agricultural sciences and natural resources Agric…  2008      3
##  4 Life sciences Agricultural sciences and natural resources Agron…  2008     68
##  5 Life sciences Agricultural sciences and natural resources Anima…  2008     41
##  6 Life sciences Agricultural sciences and natural resources Anima…  2008     18
##  7 Life sciences Agricultural sciences and natural resources Anima…  2008     77
##  8 Life sciences Agricultural sciences and natural resources Envir…  2008    182
##  9 Life sciences Agricultural sciences and natural resources Fishi…  2008     52
## 10 Life sciences Agricultural sciences and natural resources Food …  2008     96
## # … with 3,360 more rows

There are many records, organised in three levels of granularity: broad field, major field and field. There are 337 fields, and we have records for each of them from 2008 to 2017. Let’s see how many people are from which field.
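A quick way to check the granularity is to count distinct values at each level. The sketch below runs on a tiny made-up tibble (`toy`, with invented counts and partly invented names) that has the same five columns as `phds`; swapping `toy` for `phds` should give the real counts. For the real data the arithmetic already lines up: 3,370 rows over 10 years is 337 fields.

```r
library(dplyr)

# Made-up miniature of the dataset: same column names, invented values.
toy <- tibble(
  broad_field = c(rep("Life sciences", 4), rep("Engineering", 2)),
  major_field = c(rep("Major A", 2), rep("Major B", 2), rep("Other engineering", 2)),
  field       = c(rep("Field A1", 2), rep("Field B1", 2), rep("Field E1", 2)),
  year        = rep(c(2008, 2009), times = 3),
  n_phds      = c(68, 70, 800, 810, 100, NA)
)

# Distinct values at each of the three levels, plus the year range.
levels_overview <- toy %>%
  summarise(
    broad_fields = n_distinct(broad_field),
    major_fields = n_distinct(major_field),
    fields       = n_distinct(field),
    first_year   = min(year),
    last_year    = max(year)
  )

# How many major fields and fields sit under each broad field.
nesting <- toy %>%
  group_by(broad_field) %>%
  summarise(
    major_fields = n_distinct(major_field),
    fields       = n_distinct(field)
  )

levels_overview
nesting
```

A broad field whose `major_fields` count is 1 only has detail at the `field` level. With the structure checked, the totals by broad field come next.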

phds %>%
   group_by(broad_field) %>%
   summarise(n_phds = sum(n_phds, na.rm = TRUE)) %>%
   arrange(desc(n_phds)) %>%
   datatable(colnames = c("Broad Field", "Number of PhDs"),
             rownames = FALSE,
             caption = "Number of PhDs by their broad fields. Life sciences lead the way.") %>%
   formatRound("n_phds", digits = 0)

Life sciences has the most graduates. Engineering has the fewest, even fewer than the mysterious Other. Surprisingly, social sciences, humanities and education all rank higher than mathematics and computer science, and by a wide margin. The number of graduates in “humanities and social science” subjects is four times the number of PhDs in “hard sciences” like engineering and maths. No wonder there is such a shortage of people in the tech world.
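A ratio like “four times” depends on which broad fields go into each group. Here is a minimal sketch of the calculation, assuming a table with one total per broad field. The totals in `toy_totals` are invented, the two humanities and social science names are placeholders, and `field_ratio()` is a hypothetical helper, so the number it prints says nothing about the real data.

```r
library(dplyr)

# Invented totals per broad field; only the technique carries over.
toy_totals <- tibble(
  broad_field = c("Humanities and arts", "Psychology and social sciences",
                  "Engineering", "Mathematics and computer sciences"),
  n_phds      = c(300, 450, 150, 100)
)

# Assumed grouping; which fields count as "hard sciences" is a judgment call.
soft <- c("Humanities and arts", "Psychology and social sciences")
hard <- c("Engineering", "Mathematics and computer sciences")

# Hypothetical helper: ratio of total PhDs in one set of broad fields to another.
field_ratio <- function(df, numerator, denominator) {
  num <- sum(df$n_phds[df$broad_field %in% numerator], na.rm = TRUE)
  den <- sum(df$n_phds[df$broad_field %in% denominator], na.rm = TRUE)
  num / den
}

toy_ratio <- field_ratio(toy_totals, soft, hard)
toy_ratio
```

To apply it to the real data, the summarised table from the chunk above (before `datatable()`) could be passed in place of `toy_totals`.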

Life sciences is such a broad, encompassing field. Let’s explore what it covers.

phds %>%
   filter(broad_field == "Life sciences") %>%
   group_by(major_field) %>%
   summarise(n_phds = sum(n_phds, na.rm = TRUE)) %>%
   arrange(desc(n_phds)) %>%
   datatable(colnames = c("Major Field", "Number of PhDs"),
             rownames = FALSE,
             caption = "Number of PhDs by their major fields. Biology, excluding health sciences, leads the way.") %>%
   formatRound("n_phds", digits = 0)

Biological and biomedical sciences has the most graduates. There are so few PhDs in geosciences. With climate change becoming a major issue, I wonder why the field isn’t picking up faster.

Let me explore the fields in engineering too.

phds %>% 
  filter(broad_field == "Engineering") %>% 
  group_by(major_field) %>% 
  summarise(n_phds = sum(n_phds, na.rm = TRUE)) %>%
  arrange(desc(n_phds))
## # A tibble: 1 × 2
##   major_field       n_phds
##   <chr>              <dbl>
## 1 Other engineering  18139

Oh, so no information. The information is nested in another column, I guess. I’ll have to group by field.

phds %>%
  filter(broad_field == "Engineering") %>%
  group_by(field) %>%
  summarise(n_phds = sum(n_phds, na.rm = TRUE)) %>%
  arrange(desc(n_phds)) %>%
  datatable(colnames = c("Field", "Number of PhDs")) %>%
  formatRound("n_phds", digits = 0)

Computer engineering PhDs are the most popular, with about twice as many as the next field on the list. Environmental engineering is the second most popular. That’s impressive. Let’s visualise the counts.

phds %>% 
  filter(broad_field == "Engineering") %>% 
  group_by(field) %>% 
  summarise(n_phds = sum(n_phds, na.rm = TRUE)) %>%
  ggplot(aes(reorder(field, n_phds), n_phds)) +
  geom_col() +
  coord_flip() +
  labs(y = "Number of PhDs", x = "Field (Engineering only)")

The data gives me an opportunity to see how the numbers changed as computer engineering rose in popularity. I’ve heard numerous times that its popularity has increased over the years.

# ggrepel for text labels
library(ggrepel)

phds %>%
   filter(broad_field == "Engineering") %>%
   mutate(label = if_else(year == max(year), field, NA_character_)) %>%
   ggplot(aes(x = year, y = n_phds, colour = field)) +
   geom_line() +
   scale_x_continuous(breaks = seq(from = 2008, to = 2017, by = 1)) +
   geom_label_repel(aes(label = label),
                    nudge_x = 1,
                    na.rm = TRUE) +
   labs(x = "Year", y = "Number of PhDs") +
   theme(legend.position = "none")
## Warning: Removed 20 row(s) containing missing values (geom_path).
## Warning: ggrepel: 10 unlabeled data points (too many overlaps). Consider
## increasing max.overlaps
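The first warning means some rows have a missing `n_phds`. A small sketch of how I might locate them and why they matter for totals, using a made-up series (`toy`, with invented field names and counts) rather than the real data:

```r
library(dplyr)

# Made-up series with one missing year for one field.
toy <- tibble(
  field  = rep(c("Field A", "Field B"), each = 3),
  year   = rep(2015:2017, times = 2),
  n_phds = c(10, 12, 14, 20, NA, 24)
)

# Count missing values per field.
missing_by_field <- toy %>%
  group_by(field) %>%
  summarise(n_missing = sum(is.na(n_phds)))

# sum() returns NA as soon as one value is missing, unless na.rm = TRUE.
totals <- toy %>%
  group_by(field) %>%
  summarise(
    total_strict = sum(n_phds),
    total_na_rm  = sum(n_phds, na.rm = TRUE)
  )

missing_by_field
totals
```

Any summary that calls `sum(n_phds)` without `na.rm = TRUE` returns `NA` for a field with a missing year, and a later `filter()` on that total drops the field, because `filter()` keeps only rows where the condition is `TRUE`. The second warning is about label crowding; it suggests raising `max.overlaps` in `geom_label_repel()`.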

phds_top_engineering = phds %>% 
  filter(broad_field == "Engineering") %>% 
  group_by(field) %>% 
  summarise(n_phds = sum(n_phds)) %>% 
  filter(n_phds > 100) %>% 
  slice_max(order_by = n_phds, n = 6)

phds_top_engineering
## # A tibble: 6 × 2
##   field                                            n_phds
##   <chr>                                             <dbl>
## 1 Computer engineering                               4030
## 2 Environmental, environmental health engineeringl   2001
## 3 Engineering, other                                 1488
## 4 Nuclear engineering                                1166
## 5 Operations research (engineering)                   985
## 6 Systems engineering                                 924
phds %>% 
  filter(field %in% phds_top_engineering$field) %>% 
  ggplot(aes(x = year, y = n_phds, fill = field)) +
  geom_bar(stat = "identity") + 
  scale_x_continuous(labels = scales::label_number(accuracy = 1)) +
  scale_fill_manual(values = MetBrewer::met.brewer("Hokusai1", 6)) +
  facet_wrap( ~ field) +
  labs(x = "Year", y = "Number of PhDs", fill = "Field")

Computer engineering has been popular throughout the period. I didn’t expect that.
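One way to put a number on a trend like this is the percentage change between the first and last year for each field. The sketch below uses invented counts (`toy`) purely to show the calculation; the column names `first_n`, `last_n` and `pct_change` are my own.

```r
library(dplyr)

# Invented counts for two made-up fields.
toy <- tibble(
  field  = rep(c("Field A", "Field B"), each = 3),
  year   = rep(c(2008, 2012, 2017), times = 2),
  n_phds = c(100, 120, 150, 80, 80, 60)
)

# Percentage change from the earliest to the latest year, per field.
change <- toy %>%
  group_by(field) %>%
  arrange(year, .by_group = TRUE) %>%
  summarise(
    first_n    = first(n_phds),
    last_n     = last(n_phds),
    pct_change = 100 * (last_n - first_n) / first_n
  )

change
```

On the real data, the same pipeline could start from `phds %>% filter(field %in% phds_top_engineering$field)`.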

But wait, wasn’t there a computer science entry in broad_field? What was that? It was called Mathematics and computer sciences.

phds %>%
   filter(broad_field == "Mathematics and computer sciences") %>%
   group_by(major_field) %>%
   summarise(n_phds = sum(n_phds, na.rm = TRUE)) %>%
   arrange(desc(n_phds)) %>%
   datatable(colnames = c("Major Field", "Number of PhDs"),
             rownames = FALSE,
             caption = "Mathematics and computer sciences has two fields.") %>%
   formatRound("n_phds", digits = 0)
phds %>%
   filter(broad_field == "Mathematics and computer sciences") %>%
   filter(n_phds >= 300) %>% 
   mutate(label = if_else(year == max(year), field, NA_character_)) %>%
   ggplot(aes(x = year, y = n_phds, colour = field)) +
   geom_line() +
   scale_x_continuous(breaks = seq(from = 2008, to = 2017, by = 1)) +
   geom_label_repel(aes(label = label),
                    nudge_x = 1,
                    na.rm = TRUE) +
   labs(x = "Year", y = "Number of PhDs") +
   theme(legend.position = "none")

Computer engineering averaged around 400 PhDs a year; computer science averaged around 1,500. I think this is the “computer science” of general parlance.
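The figure of roughly 400 a year for computer engineering can be checked against the ten-year totals in the `phds_top_engineering` output above. This is a minimal sketch that assumes one record per field per year from 2008 to 2017; the computer science total is not printed in this post, so it is left out.

```r
# Ten-year totals copied from the phds_top_engineering output above.
totals <- c(
  "Computer engineering" = 4030,
  "Nuclear engineering"  = 1166,
  "Systems engineering"  = 924
)

# Assumption: one record per year from 2008 to 2017, i.e. 10 years.
n_years <- length(2008:2017)

# Average number of PhDs per year for each field.
yearly_average <- totals / n_years
yearly_average
```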


This exploration is incomplete. I couldn’t finish it in time, but I’ll get back to it someday.

Today I found this wonderful visualisation on Twitter that I want to replicate for the number of PhDs by field.

library(tweetrmd)
tweet_screenshot("https://twitter.com/jenjentro/status/1512997114896269312?t=nWQqyQa3tHQVNSHPakh2TA")

Her code is available on GitHub.

# Loading packages
library(tidytuesdayR)
library(tidylog)
## 
## Attaching package: 'tidylog'
## The following objects are masked from 'package:dplyr':
## 
##     add_count, add_tally, anti_join, count, distinct, distinct_all,
##     distinct_at, distinct_if, filter, filter_all, filter_at, filter_if,
##     full_join, group_by, group_by_all, group_by_at, group_by_if,
##     inner_join, left_join, mutate, mutate_all, mutate_at, mutate_if,
##     relocate, rename, rename_all, rename_at, rename_if, rename_with,
##     right_join, sample_frac, sample_n, select, select_all, select_at,
##     select_if, semi_join, slice, slice_head, slice_max, slice_min,
##     slice_sample, slice_tail, summarise, summarise_all, summarise_at,
##     summarise_if, summarize, summarize_all, summarize_at, summarize_if,
##     tally, top_frac, top_n, transmute, transmute_all, transmute_at,
##     transmute_if, ungroup
## The following objects are masked from 'package:tidyr':
## 
##     drop_na, fill, gather, pivot_longer, pivot_wider, replace_na,
##     spread, uncount
## The following object is masked from 'package:stats':
## 
##     filter
library(showtext)
## Loading required package: sysfonts
## Loading required package: showtextdb