🚇 MTAccessibility

1 Preface

This notebook is a companion to the MTAccessibility web dashboard which displays a subset of the project’s findings. The notebook and dashboard, taken together, are a submission to the 2024 MTA Open Data Challenge, and so between the two, there is a bit of overlap. The main difference is that in this notebook, we show the data transformations that led to the results, and a few other ancillary insights gained during exploratory data analysis which ultimately led to the creation of the dashboard (which itself focuses on only one finding related to Senior and Fair Fare ridership).

2 Introduction

New York City’s public transit system is a lifeline for millions, including seniors and low-income residents who rely on it for daily commutes, errands, and access to essential services. Understanding how these populations use the subway system is vital for ensuring that transit services remain accessible, equitable, and responsive to their needs.

This accessibility analysis aims to shed light on ridership patterns, particularly among seniors and Fair Fare riders (low-income individuals who qualify for reduced fare programs). By studying the spatial patterns of subway ridership relative to the demographics of surrounding neighborhoods, we can uncover critical insights into how well the MTA serves vulnerable populations and where opportunities for strategic intervention might exist. These insights will enable the MTA to allocate resources more effectively, improve accessibility where it is most needed, and ensure that no community is left behind in the provision of transit services.

The findings of this study could have significant implications for policy and planning. For instance, above-average ridership in certain areas might point to unmet transportation needs, while below-average ridership could reveal opportunities to enhance alternative modes of transit or improve service in underutilized areas.

3 Data and Methods

For this analysis, we used three key datasets:

MTA Subway Hourly Ridership: Beginning February 2022: This dataset provides ridership counts by fare group (seniors/disability and Fair Fare riders). It gives us a clear picture of how these populations use the subway system.
MTA Subway Entrances and Exits: 2024: This station-level dataset provides information about station accessibility, including elevators and escalators, which is crucial for evaluating station suitability for vulnerable populations.
Census Tract Data from the 2022 American Community Survey (ACS): Census tract data is used to estimate the proportion of seniors (age 65+) and low-income individuals (those below the poverty line), serving as proxies for the two ridership groups of interest.

The goal of this study was to assess the relationship between the demographic characteristics of census tracts and the ridership patterns of seniors and low-income populations at nearby subway stations. We assigned each census tract to its nearest subway station, then used linear regression to model the relationship between census data and station data in terms of senior and Fair Fare population proportion versus ridership proportion. Census tracts were then classified as having average, above average, or below average ridership based on how their ridership patterns compared to ridership at all other census tracts.

4 Exploratory Data Analysis

Load some libraries for this analysis.

Code

library(tidyverse)

── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.4     ✔ readr     2.1.5
✔ forcats   1.0.0     ✔ stringr   1.5.1
✔ ggplot2   3.5.1     ✔ tibble    3.2.1
✔ lubridate 1.9.3     ✔ tidyr     1.3.1
✔ purrr     1.0.2     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

Code

library(sf)

Linking to GEOS 3.11.0, GDAL 3.5.3, PROJ 9.1.0; sf_use_s2() is TRUE

Code

library(mapview)
library(glue)
library(ggdist)
library(tidycensus)
library(htmltools)
library(janitor)


Attaching package: 'janitor'

The following objects are masked from 'package:stats':

    chisq.test, fisher.test

Code

library(data.table)


Attaching package: 'data.table'

The following objects are masked from 'package:lubridate':

    hour, isoweek, mday, minute, month, quarter, second, wday, week,
    yday, year

The following objects are masked from 'package:dplyr':

    between, first, last

The following object is masked from 'package:purrr':

    transpose

Code

library(patchwork)

knitr::opts_chunk$set(message = FALSE, warning = FALSE)

4.1 Cleaning

Our analysis is built around two main sources of Open MTA data, accessed 2024-10-13 from the NYC Open data Portal:

Hourly ridership data from February 2022 to October 2024. These represent entries to a station and contain payment information.
MTA station geolocation. We will use the station complex geolocation in this analysis, but a future study could analyze all entrances and exits in this dataset to determine how physical accessibility impacts different communities.

Let’s do some light exploratory data analysis (EDA) to get familiar with the data.

Code

# Data accessed 2024-10-13.
# Read stations data frame and convert to spatial object. 
sta <- read_csv("data/MTA_Subway_Entrances_and_Exits__2024_20241013.csv") %>% 
  select(-entrance_georeference) %>% 
  clean_names() %>% 
  mutate(entrance_type2 = ifelse(entrance_type == "Elevator", "Elevator", "Other")) %>% 
  st_as_sf(coords = c("entrance_longitude", "entrance_latitude"), crs = 4329) 

# hourly ridership
hr <- fread("data/MTA_Subway_Hourly_Ridership__Beginning_February_2022_20241013.csv")

Let’s view the stations on a map. Stations with elevators appear relatively even distributed across space.

Code

sta %>% 
  mutate(entrance_type2 = factor(entrance_type2, levels = c("Elevator", "Other"))) %>% 
  mapview(zcol = "entrance_type2", layer.name = "Entrance Type", burst = TRUE)

Next, we filter the hourly ridership data to one calendar year, from 2023-01-01 to 2023-12-31, and aggregate all ridership groups to one of the following four groups: full fare, fair fare, student, and senior/disability. Then, we summarize the ridership count per fare group, day of week, hour of the day, and borough.

We observe that full fare riders comprise around 80-90% of riders and they are more like to ride during rush hour, and at night. Seniors and students don’t ride as much at night. Students interestingly are less likely to ride on the weekend.

Code

group_full_fare <- c("Metrocard - Full Fare", "Metrocard - Unlimited 7-Day", 
                     "Metrocard - Unlimited 30-Day", "OMNY - Full Fare", 
                     "Metrocard - Other", "OMNY - Other")
group_fair_fare <- c("Metrocard - Fair Fare", "OMNY - Fair Fare")
group_student <- c("Metrocard - Students", "OMNY - Students")
group_senior <- c("OMNY - Seniors & Disability", "Metrocard - Seniors & Disability")

# by borough
hrb <- hr %>% 
  mutate(
    transit_timestamp = mdy_hms(transit_timestamp),
    hour = hour(transit_timestamp), 
    dow  = lubridate::wday(transit_timestamp, label = TRUE),
    fare_group = case_when(
      fare_class_category %in% group_senior ~ "Seniors & Disability",
      fare_class_category %in% group_full_fare ~ "Full Fare",
      fare_class_category %in% group_student ~ "Students",
      fare_class_category %in% group_fair_fare ~ "Fair Fare",
      TRUE ~ "Other" # optional to catch anything not covered by the case_when
    )
  ) %>% 
  filter(
    transit_timestamp >= ymd_hms("2023-01-01 00:00:00") & 
    transit_timestamp <= ymd_hms("2023-12-31 24:00:00")) %>% 
  group_by(fare_group, dow, hour, borough) %>% 
  # because we're grouping by the day of week, and because, ignoring leap years,
  # there are 52 Mondays in a year (and so on), we divide the annual ridership
  # by 52 to arrive at an average weekly ridership
  summarize(ridership = sum(ridership)/52) %>% 
  ungroup()

hrb %>% 
  group_by(dow, hour, borough) %>% 
  mutate(total_ridership = sum(ridership), 
         ridership_proportion = ridership / total_ridership) %>% 
  ungroup() %>% 
  ggplot(aes(hour, ridership_proportion, color = borough)) +
  geom_line() +
  scale_x_continuous(
    limits = c(0, 25),
    breaks = seq(0, 24, by = 6),  # Breaks every 6 hours
    labels = sprintf("%02d:00", seq(0, 24, by = 6))  # Format labels as "HH:00"
  ) +
  rcartocolor::scale_color_carto_d(palette = "Bold") +
  labs(
    title = "Average daily ridership, per hour, per bouroughs, across fare groups",
    x = "",
    y = "",
    color = ""
  ) +
  facet_grid(fare_group~dow, scales = "free") +
  theme_minimal(base_size = 9) +
  theme(
    panel.grid.minor = element_blank(), 
    panel.grid.major.y = element_blank(),
    axis.text.x = element_text(angle = 90, vjust = 0.5, hjust = 1),
    legend.position = "bottom"
  )

4.2 Hourly Ridership Curves

By borough average stats are interesting, but let’s go one spatial level deeper and recalculate ridership per station complex. This will not go into our map, but it is a level of grouped summary that we can visualize for each station, and average ridership proportions per station, per fare group, per hour of the day will reveal the flow dynamics of urban commuters.

Code

# by station ridership
hrs <- hr %>% 
  mutate(
    transit_timestamp = mdy_hms(transit_timestamp),
    hour = hour(transit_timestamp),
    dow  = lubridate::wday(transit_timestamp, label = TRUE),
     fare_group = case_when(
      fare_class_category %in% group_senior ~ "Seniors & Disability",
      fare_class_category %in% group_full_fare ~ "Full Fare",
      fare_class_category %in% group_student ~ "Students",
      fare_class_category %in% group_fair_fare ~ "Fair Fare",
      TRUE ~ "Other" # optional to catch anything not covered by the case_when
    )
  ) %>% 
  filter(
    transit_timestamp >= ymd_hms("2023-01-01 00:00:00") & 
    transit_timestamp <= ymd_hms("2023-12-31 24:00:00")
  ) %>% 
  # Calculate the average annual ridership count per day of week and hour of day
  group_by(fare_group, dow, hour, station_complex, station_complex_id) %>% 
  summarize(ridership = mean(ridership)) %>% 
  ungroup()

# For each fare group, calculate the total ridership and ridership proportion
hrs <- hrs %>% 
  group_by(dow, hour, station_complex, station_complex_id) %>% 
  mutate(total_ridership = sum(ridership), 
         ridership_proportion = ridership / total_ridership) %>% 
  ungroup() %>%
  mutate(fare_group = factor(
    fare_group,
    levels = c("Seniors & Disability", "Fair Fare", "Full Fare", "Students")
  ))

sta_mapgl <- hr %>% 
  group_by(station_complex_id) %>% 
  slice(1) %>% 
  ungroup() %>% 
  st_as_sf(coords = c("longitude", "latitude"), crs = 4269) %>% 
  select(station_complex_id, station_complex)

# Write all plots to a PDF
plots <- vector("list", nrow(sta_mapgl))

for(i in seq_along(sta_mapgl$station_complex)){
  station = sta_mapgl$station_complex[i]
  plots[[i]] = hrs %>% 
    filter(station_complex == station) %>%
    mutate(dow_upscale = ifelse(dow %in% c("Sat","Sun"),"weekend","weekday")) %>% 
    ggplot(aes(hour, ridership, group = dow, color = dow_upscale)) +
    geom_line(alpha = 0.7, key_glyph = "timeseries") +
    scale_color_manual(values = c("#7570b3", "#1b9e77")) +
    scale_x_continuous(
      limits = c(0, 25),
      breaks = seq(0, 24, by = 6),  # Breaks every 6 hours
      labels = sprintf("%02d:00", seq(0, 24, by = 6))  # Format labels as "HH:00"
    ) +
    facet_wrap(~fare_group, ncol = 1, scales = "free") +
    labs(
      title = NULL,
      subtitle = glue("{station}\nAverage hourly ridership"),
      y = NULL,
      x = NULL,
      color = "",
      caption = "Data from 2023-01-01 to 2023-12-31"
    ) +
    theme_minimal() +
    theme(
      panel.grid.minor = element_blank(), 
      legend.position = "bottom"
    )
}

# Ran this once, but don't need to do it for the report
# pdf("out/station_hourly_avg_per_rider.pdf", height = 4, width = 6)
# plots
# dev.off()

Below, we show three stations with different curves, but as a supplement to this project, we provide a PDF showing the average daily ridership curves at all 428 station complexes. Some stations seek their peak in the morning, some at morning and night, and some at night.

Morning peaking stations likely reflect the inflow of workers into the dense urban core, whereas evening peaking stations might reflect the opposite: the outflow of workers away from urban core. The examples below shown below seem to support this hypothesis. Stations with a more balanced morning and evening peak might be explained by a balance of residential and commercial surroundings that both supply and receive commuters.

Code

sta_newkirk <- "Newkirk Av-Little Haiti (2,5)"
plots[[which(sta_mapgl$station_complex == sta_newkirk)]]

Code

sta_westchester <- "Westchester Sq-E Tremont Av (6)"
plots[[which(sta_mapgl$station_complex == sta_westchester)]]

Code

sta_47 <- "47-50 Sts-Rockefeller Ctr (B,D,F,M)"
plots[[which(sta_mapgl$station_complex == sta_47)]]

A more informative plot would show the location of stations alongside the hourly curves, as demonstrated in this PDF and below.

Code

bnyc <- st_read("data/Borough Boundaries.geojson") %>% 
  st_transform(crs = 4269) %>% 
  rmapshaper::ms_simplify(keep_shapes = TRUE)

Reading layer `Borough Boundaries' from data source 
  `/Users/richpauloo/Documents/GitHub/mta/data/Borough Boundaries.geojson' 
  using driver `GeoJSON'
Simple feature collection with 5 features and 4 fields
Geometry type: MULTIPOLYGON
Dimension:     XY
Bounding box:  xmin: -74.25559 ymin: 40.49613 xmax: -73.70001 ymax: 40.91553
Geodetic CRS:  WGS 84

Code

crs <- "+proj=omerc +lonc=90 +lat_0=40 +gamma=143 +alpha=57"

plots <- vector("list", nrow(sta_mapgl))
for(i in seq_along(sta_mapgl$station_complex)){
  station = sta_mapgl$station_complex[i]
  p1 = hrs %>% 
    filter(station_complex == station) %>%
    mutate(dow_upscale = ifelse(dow %in% c("Sat","Sun"),"weekend","weekday")) %>% 
    ggplot(aes(hour, ridership, group = dow, color = dow_upscale)) +
    geom_line(alpha = 0.7, key_glyph = "timeseries") +
    scale_color_manual(values = c("#984ea3", "#377eb8")) +
    scale_x_continuous(
      limits = c(0, 25),
      breaks = seq(0, 24, by = 6),  # Breaks every 6 hours
      labels = sprintf("%02d:00", seq(0, 24, by = 6))  # Format labels as "HH:00"
    ) +
    facet_wrap(~fare_group, ncol = 1, scales = "free") +
    labs(
      title = " ",
      subtitle = " ",
      y = NULL,
      x = NULL,
      color = ""
    ) +
    theme_minimal() +
    theme(
      panel.grid.minor = element_blank(), 
      legend.position = "bottom"
    )
  
  p2 = ggplot() + 
    geom_sf(data = bnyc, fill = "grey90") + 
    geom_sf(data = sta_mapgl, color = "#4daf4a", cex = 0.8, alpha = 0.5, pch =20) +
    geom_sf(data = filter(sta_mapgl, station_complex == station), 
            color = "#e41a1c", cex = 2, alpha = 0.8) +
    coord_sf(crs = crs) +
    labs(
      title = glue("{station} Average hourly ridership"),
      subtitle = "Data from 2023-01-01 to 2023-12-31"
    ) +
    theme_void()
  
  plots[[i]] = p2 + p1 + plot_layout(widths = c(1, 2))
  
}

Code

sta_newkirk <- "Newkirk Av-Little Haiti (2,5)"
plots[[which(sta_mapgl$station_complex == sta_newkirk)]]

Code

sta_westchester <- "Westchester Sq-E Tremont Av (6)"
plots[[which(sta_mapgl$station_complex == sta_westchester)]]

Code

sta_47 <- "47-50 Sts-Rockefeller Ctr (B,D,F,M)"
plots[[which(sta_mapgl$station_complex == sta_47)]]

Finally, we calculate a third grouped summary from our hourly data (the annual ridership count per station complex), which we will compare against Census tract level data. From this summary level, we can visualize the total ridership count per station. Let’s look at the top 25 stations. Just over 40 million riders entered Times Square in 2023.

Code

# this is for the map
hrs_annual <- hr %>% 
  mutate(
    transit_timestamp = mdy_hms(transit_timestamp),
    hour = hour(transit_timestamp),
    dow  = lubridate::wday(transit_timestamp, label = TRUE),
     fare_group = case_when(
      fare_class_category %in% group_senior ~ "Seniors & Disability",
      fare_class_category %in% group_full_fare ~ "Full Fare",
      fare_class_category %in% group_student ~ "Students",
      fare_class_category %in% group_fair_fare ~ "Fair Fare",
      TRUE ~ "Other" # optional to catch anything not covered by the case_when
    )
  ) %>% 
  filter(
    transit_timestamp >= ymd_hms("2023-01-01 00:00:00") & 
    transit_timestamp <= ymd_hms("2023-12-31 24:00:00")
  ) %>% 
  # per fare group and station complex, what's the annual ridership count
  group_by(fare_group, station_complex, station_complex_id) %>% 
  summarize(ridership = sum(ridership, na.rm = TRUE)) %>% 
  ungroup()

# now calculate the proportion of ridership at each station per fare group
hrs_annual <- hrs_annual %>% 
  group_by(station_complex, station_complex_id) %>% 
  mutate(total_ridership = sum(ridership), 
         ridership_proportion = ridership / total_ridership) %>% 
  ungroup() 

# And visualize the top 25 stations in YC by total annual ridership
hrs_annual %>% 
  mutate(fare_group = factor(
    fare_group, levels = c("Fair Fare", "Seniors & Disability", "Students", "Full Fare"))
  ) %>% 
  arrange(desc(total_ridership)) %>%
  slice(1:100) %>% # 4 fare groups, so this takes the top 25 stations
  # filter(station_complex %in% unique(hr$station_complex)[1:30]) %>%
  ggplot(aes(ridership, fct_reorder(station_complex, total_ridership), fill = fare_group)) +
  geom_col() +
  rcartocolor::scale_fill_carto_d(
    palette = "Bold", 
    guide = guide_legend(reverse = TRUE)
  ) +
  scale_x_continuous(labels = scales::label_comma()) +
  labs(
    title = "Total annual ridership (2023) for the top 25 stations in NYC",
    x = "Annual Ridership Count",
    y = "",
    fill = ""
  ) +
  theme_minimal(base_size = 9) +
  theme(
    panel.grid.minor = element_blank(), 
    panel.grid.major.y = element_blank(),
    legend.position = "bottom"
  )

Now we join ridership data to the station spatial data.

Code

sta_complex_sf <- hr %>% 
  group_by(station_complex_id) %>% 
  slice(1) %>% 
  ungroup() %>% 
  st_as_sf(coords = c("longitude", "latitude"), crs = 4269) %>% 
  select(station_complex_id)

# Who takes transit (full fare, seniors/disability, students, fair fare) by station complex
sta_ridership <- hrs_annual %>% 
  left_join(sta_complex_sf) 
  
# data frame for seniors and fare fare
sta_seniors <- sta_ridership %>% 
  filter(fare_group == "Seniors & Disability") %>%
  st_as_sf() %>% 
  rename(
    seniors_prop_sta = ridership_proportion,
    seniors_ridership_sta = ridership
  )

sta_fair <- sta_ridership %>% 
  filter(fare_group == "Fair Fare") %>%
  st_as_sf() %>% 
  rename(
    low_income_prop_sta = ridership_proportion,
    low_income_ridership_sta = ridership
  )

We can visualize the ridership proportion of Seniors/Disability and Fair Fare populations per station and see that seniors¹ and Fair Fare riders tend to use stations outside of denser urban metros. This trend is particularity evident for Fair Fare riders, where ridership proportions in Harlem and Queens can be 3-5 times the ridership proportion in downtown Manhattan.

4.3 Census Tract Variables

Now it’s time to pull Census Tract data². Critically, we are interested in total population, senior population, and population below the below the poverty line, which we will assume is a proxy for the Fair Fare riders ³. From these variables we can calculate the proportion of potential senior and fair fare riders. From the Census tract spatial boundaries, we can find the nearest MTA Station.

Code

variables <- c(
  median_age       = "B01002_001", 
  median_income    = "B19013_001",
  total_population = "B17021_001", 
  below_poverty    = "B17021_002"
)

# Get the most recent ACS data with spatial polygons for New York City
nyc_acs <- get_acs(
  geography = "tract",                   # Get data at the census tract level
  variables = variables,                 # Age and income variables
  state = "NY",                          # State of New York
  county = c("Bronx", "Kings", "New York", "Queens", "Richmond"),  # NYC boroughs
  year = 2022,                           # Most recent ACS year
  survey = "acs5",                       # 5-year ACS survey
  geometry = TRUE                        # Include spatial data (polygons)
)

# Define the breaks and custom labels
breaks <- seq(0, 1, by = 0.2)
labels <- c("0-0.2", "0.2-0.4", "0.4-0.6", "0.6-0.8", "0.8-1")

nyc <- nyc_acs %>% 
  pivot_wider(names_from = variable, values_from = c(estimate, moe)) %>% 
  mutate(
    estimate_below_poverty_prop = estimate_below_poverty/estimate_total_population,
    estimate_below_poverty_prop_bin = cut(estimate_below_poverty_prop, 
                                          breaks = breaks, 
                                          labels = labels, 
                                          include.lowest = TRUE)
  )

Let’s visualize some of these variables:

Now we calculate the population and proportion of seniors (65 years of age or greater).

Code

# add population seniors

# Variables for men and women 65+
variables <- c(
  males_65_to_66      = "B01001_020",
  males_67_to_69      = "B01001_021",
  males_70_to_74      = "B01001_022",
  males_75_to_79      = "B01001_023",
  males_80_to_84      = "B01001_024",
  males_85_and_over   = "B01001_025",
  females_65_to_66    = "B01001_044",
  females_67_to_69    = "B01001_045",
  females_70_to_74    = "B01001_046",
  females_75_to_79    = "B01001_047",
  females_80_to_84    = "B01001_048",
  females_85_and_over = "B01001_049"
)

# Get the data for New York City
nyc_seniors_combined <- get_acs(
  geography = "tract",
  variables = variables,
  state = "NY",
  county = c("Bronx", "Kings", "New York", "Queens", "Richmond"),  # NYC boroughs
  year = 2022,
  survey = "acs5"
)

# Summarize the total senior population (65+) by GEOID
nyc_seniors_combined_totals <- nyc_seniors_combined %>%
  group_by(GEOID) %>%
  summarize(seniors_population_acs = sum(estimate, na.rm = TRUE))

# join back to nyc
nyc <- nyc %>% 
  left_join(nyc_seniors_combined_totals) %>% 
  mutate(seniors_prop_acs = seniors_population_acs/estimate_total_population)

Next, we find the nearest MTA station to each Census Tract and join this back to our main dataframe of Census Tract data.

Code

# Find the index of the nearest subway station for each polygon in 'nyc'
nearest_station_index <- st_nearest_feature(nyc, sta_complex_sf)

# Use the index to add the closest subway station information to the 'nyc' polygons
nyc <- nyc %>%
  mutate(station_complex_id = sta_complex_sf$station_complex_id[nearest_station_index])

# add seniors proportion from station data
nyc <- nyc %>% 
  left_join(
    sta_seniors %>% 
      select(-fare_group) %>% 
      st_drop_geometry()
  ) 

# add low income proportion from station data
nyc <- nyc %>% 
  rename(low_income_prop_acs = estimate_below_poverty_prop) %>%
  left_join(
    sta_fair %>% 
      select(-fare_group) %>% 
      st_drop_geometry()
  )

4.4 Senior and Fair Fare Ridership

Next, we rank every Census Tract in terms of its actual Senior and Fair Fare ridership proportion relative to what would theoretically be expected based on the population of these individuals in the Census Tract. We can think of this as a “station utilization” metric for these populations and use the results as a springboard for hypothesis generation and interventions.

To proceed, we assume a roughly linear relationship between the proportion of Seniors in a Census Tract and the proportion of Senior ridership at that tract’s closest MTA Station⁴. We make the same assumption for Fair Fare riders.

It follows then, that outliers from this linear trend–which we can quantify with the standard error of the least squares regression fit–represent deviations from “average ridership,” relative to all other Census Tracts. In other words, we ask: compared to all other census tracts, is Senior and Fair Fare Ridership average, above average, or below average? Put yet another way, if the station ridership proportion ( $y$ ) falls within the standard error ( $\sigma_{\hat{y}}$ ) envelope of the least squares fit ( $\hat{y}$ ), which is the average trend, then the station has average ridership, and if the ridership proportion falls outside of this envelope, it has above or below ridership.

$\begin{align*} \text{Average Ridership:} & \quad \hat{y} - \sigma_{\hat{y}} \leq y \leq \hat{y} + \sigma_{\hat{y}} \\ \text{Below Average Ridership:} & \quad y < \hat{y} - \sigma_{\hat{y}} \\ \text{Above Average Ridership:} & \quad y > \hat{y} + \sigma_{\hat{y}} \end{align*}$

Next, we fit the models and show the graphical interpretation of the approach.

Code

# cleaning for linear model
nyc <- nyc %>% 
  # rm 4 rows where total population is 0, so seniors prop is infinite
  filter(estimate_total_population > 0 & seniors_prop_acs < 0.9) 

nyc <- nyc %>% 
    mutate(
    residual_senior = residuals(lm(seniors_prop_sta ~ seniors_prop_acs, data = nyc)),
    residual_low_income = residuals(lm(low_income_prop_sta ~ low_income_prop_acs, data = nyc)) 
  ) %>% 
  # Get the standard error from the model for residual_senior and residual_low_income
  mutate(
    se_senior = summary(lm(seniors_prop_sta ~ seniors_prop_acs, data = nyc))$sigma,
    se_low_income = summary(lm(low_income_prop_sta ~ low_income_prop_acs, data = nyc))$sigma
  ) %>%
  # Create categories based on residuals using standard error (SE) thresholds
  mutate(
    seniors_category = case_when(
      # residual_senior > 2 * se_senior ~ "Highly Overutilized",
      residual_senior > se_senior ~ "Above Average",
      residual_senior >= -se_senior & residual_senior <= se_senior ~ "Average",  
      # residual_senior < -2 * se_senior ~ "Highly Underutilized",  
      residual_senior < -se_senior ~ "Below Average"
    ),
    low_income_category = case_when(
      # residual_low_income > 2 * se_low_income ~ "Highly Overutilized",
      residual_low_income > se_low_income ~ "Above Average",
      residual_low_income >= -se_low_income & residual_low_income <= se_low_income ~ "Average",
      # residual_low_income < -2 * se_low_income ~ "Highly Underutilized",
      residual_low_income < -se_low_income ~ "Below Average"
    )
  )


lvls <- c("Above Average", "Average", "Below Average")
nyc <- nyc %>% 
  # Convert utilization categories to an ordered factor
  mutate(
    seniors_category = factor(seniors_category, 
                              levels = lvls, 
                              ordered = TRUE),
    low_income_category = factor(low_income_category, 
                                 levels = lvls, 
                                 ordered = TRUE)
  )

# Fair Fare
p1 <- nyc %>% 
  ggplot(aes(low_income_prop_acs, low_income_prop_sta)) +
  geom_point(aes(color = low_income_category), alpha = 0.3) + 
  geom_smooth(method = "lm", se = FALSE) +
  labs(
    x = "Proportion Below Poverty (Census)",
    y = "Proportion Fair Fare Ridership (MTA)",
    color = ""
  ) +
  rcartocolor::scale_color_carto_d(palette = "Bold") +
  theme_minimal() +
  theme(
    panel.grid.minor = element_blank(), 
    legend.position = "bottom"
  )

p2 <- nyc %>% 
  ggplot(aes(low_income_prop_acs, low_income_prop_sta)) +
  geom_point(aes(color = residual_low_income)) + 
  scale_color_gradient2(low = "red", mid = "white", high = "blue", midpoint = 0) +
  geom_smooth(method = "lm", se = FALSE) +
  labs(
    x = "Proportion Below Poverty (Census)",
    y = "Proportion Fair Fare Ridership (MTA)",
    color = "Residual"
  ) +
  theme_minimal() +
  theme(
    panel.grid.minor = element_blank(), 
    legend.position = "bottom"
  )

# Seniors
p3 <- nyc %>% 
  ggplot(aes(seniors_prop_acs, seniors_prop_sta)) +
  geom_point(aes(color = seniors_category), alpha = 0.3) + 
  geom_smooth(method = "lm", se = FALSE) +
  labs(
    x = "Proportion Senior (Census)",
    y = "Proportion Senior Ridership (MTA)",
    color = ""
  ) +
  rcartocolor::scale_color_carto_d(palette = "Vivid") +
  theme_minimal() +
  theme(
    panel.grid.minor = element_blank(), 
    legend.position = "bottom"
  )

p4 <- nyc %>% 
  ggplot(aes(seniors_prop_acs, seniors_prop_sta)) +
  geom_point(aes(color = residual_senior)) + 
  scale_color_gradient2(low = "red", mid = "white", high = "blue", midpoint = 0) +
  geom_smooth(method = "lm", se = FALSE) +
  labs(
    x = "Proportion Senior (Census)",
    y = "Proportion Senior Ridership (MTA)",
    color = "Residual"
  ) +
  theme_minimal() +
  theme(
    panel.grid.minor = element_blank(), 
    legend.position = "bottom"
  )

Code

p1 + p2

Code

p3 + p4

Finally, we can bring this all together in three maps, one for Seniors, one for Fair Fare riders, and one that shows the intersection of the two. The code is below, but I won’t render the maps so this file doesn’t get too large. To view the maps, please refer back to the project webpage.

Code

cat_key <- c("Above Average" = "🔵", "Average" = "⚪", "Below Average" = "🔴")

nyc <- nyc %>% 
  mutate(
    NAME = str_replace(NAME, "; New York$", ""),
    tooltip_senior = glue(
      "<div style='color: black;'>
        <b>{NAME}</b><br>
        Closest Ⓜ️ station: <b>{station_complex}</b><br><hr>
        {cat_key[seniors_category]} Relative to all other tracts, 🧓🏼 senior ridership is <b>{seniors_category}</b>.<br><br>
        There are <b>{seniors_population_acs} seniors</b> in this tract<br>
        (<b>{round(seniors_prop_acs*100, 0)}%</b> of tract population).<br><br>
        Seniors make up <b>{round(seniors_prop_sta*100, 0)}%</b> of daily riders at
        the closest station, <b>{station_complex}</b>.
      </div>"
    ),
    tooltip_low_income = glue(
      "<div style='color: black;'>
        <b>{NAME}</b><br>
        Closest Ⓜ️ station: <b>{station_complex}</b><br><hr>
        {cat_key[low_income_category]} Relative to all other tracts, 🎟 fair fare ridership is <b>{low_income_category}</b>.<br><br>
        There are <b>{estimate_below_poverty} people below poverty</b> in this tract<br>
        (<b>{round(low_income_prop_acs*100, 0)}%</b> of tract population).<br><br>
        People below poverty make up <b>{round(low_income_prop_sta*100, 0)}%</b> of daily riders at
        the closest station, <b>{station_complex}</b>.
      </div>"
    ),
    tooltip_unified = glue(
      "<div style='color: black;'>
        <b><i>{NAME}</i></b><br>
        Closest Ⓜ️ station: <b>{station_complex}</b><br><hr>
        {cat_key[seniors_category]} Relative to all other tracts, 🧓🏼 senior ridership is <b>{seniors_category}</b>.<br><br>
        There are <b>{seniors_population_acs} seniors</b> in this tract<br>
        (<b>{round(seniors_prop_acs*100, 0)}%</b> of tract population).<br><br>
        Seniors make up <b>{round(seniors_prop_sta*100, 0)}%</b> of daily riders at
        the closest station.<br><hr>
        {cat_key[low_income_category]} Relative to all other tracts, 🎟️ fair fare ridership is <b>{low_income_category}</b>.<br><br>
        There are <b>{estimate_below_poverty} people below poverty</b> in this tract
        (<b>{round(low_income_prop_acs*100, 0)}%</b> of tract population).<br><br>
        People below poverty make up <b>{round(low_income_prop_sta*100, 0)}%</b> of daily riders at
        the closest station.
      </div>"
    )
  )


# write data for dashboard - uncomment for now because it's done
# write_rds(sta_mapgl, "out/sta_mapgl.rds")
# write_rds(nyc, "out/nyc.rds")

Code

# Just showing the code for one of the maps, and the maps for both
obj_bounds <- nyc %>% 
  st_convex_hull() %>% 
  st_union() %>% 
  st_as_sf() %>% 
  st_centroid() %>% 
  st_buffer(13500)

sta_mapgl <- read_rds("out/sta_mapgl.rds") %>% 
  mutate(popup_station = glue::glue("<b>{station_complex}</b>"))

maplibre(style = carto_style("dark-matter-no-labels"), 
         bounds = obj_bounds) %>%
  add_fill_layer(
    id = "Senior Ridership",
    source = nyc,
    fill_color = match_expr(
      "seniors_category",
      values = c("Above Average", "Average", "Below Average"),
      stops  = c("#4575B4", "#F7F7F7", "#D73027")
    ),
    tooltip = "tooltip_senior",
    # tooltip = "tooltip_unified",
    fill_opacity = 0.5
  ) %>%
  add_circle_layer(
    id = "MTA stations",
    source = sta_mapgl,
    circle_radius = 3.5,
    circle_color = "blue",
    circle_stroke_color = "#ffffff",
    circle_stroke_width = 0.5,
    circle_opacity = 0.8,
    # 5MB --> 58MB to include the plot popups
    # popup = "popup",
    popup = "popup_station",
    hover_options = list(
      circle_radius = 12,
      circle_color = "lightblue"
    )
  ) %>%
  add_categorical_legend(
    legend_title = "Senior Ridership",
    values = c("Above Average", "Average", "Below Average"),
    colors = c("#4575B4", "lightgrey", "#D73027")
  ) %>%
  add_fullscreen_control(position = "top-right") %>%
  add_navigation_control()

5 Results

In this section, we focus only on the results from the Senior and Fair Fare Ridership section of the Exploratory Data Analysis. This analysis revealed interesting patterns in ridership across different census tracts:

Above average senior ridership was concentrated in areas where seniors may need to travel farther for work or essential services. These areas appear to be correlated with low-income tracts, suggesting that seniors in these neighborhoods face greater travel burdens and rely more heavily on public transit.
Above average Fair Fare ridership also indicated areas with high levels of economic hardship, where residents have few alternatives to public transportation and may travel longer distances for work.
Below average ridership seems to be correlated with relatively more affluent or well-connected areas where residents might have other transportation options, such as walking or private vehicles. These tracts may have more localized services, reducing the need for frequent subway use.
Zones of above average ridership for seniors and Fair Fare riders, such as south Brooklyn, might represent areas with the greatest need for interventions and transportation assistance services to meet an increased reliance on transit.

5.1 Implications for MTA

Resource Allocation: The MTA can use these findings to target accessibility improvements in stations serving areas with high proportions of seniors and low-income residents. Stations with above-average ridership but inadequate accessibility (e.g., lack of elevators or escalators) might be prioritized for upgrades.
Service Adjustments: In areas with below-average ridership, the MTA might explore enhancing bus routes or other services that better meet the local population’s needs, especially for residents who might prefer surface transportation over the subway. We might also seek to understand why low ridership is observed in these populations.
Policy Applications: The study highlights the need for continued investment in programs like Fair Fares, which enable low-income riders to access affordable transit. Expanding this program to more residents could help reduce travel-based inequality, especially in areas with high populations of qualifying individuals but below average ridership.

5.2 Future Studies

Spatial autocorrelation: A future study could apply techniques like Moran’s I to assess the spatial correlation of above-average senior ridership and low-income tracts, quantifying the degree of overlap between these two groups, to test the hypothesis that low-income seniors are more likely to travel than high-income seniors.
Impact of accessibility upgrades: After implementing targeted accessibility improvements, future analyses could use changes in ridership patterns to evaluate the effectiveness of those interventions.
Multi-modal transit: Expanding the model to include other forms of transit (e.g., buses, paratransit) could provide a more comprehensive view of how seniors and low-income riders navigate the city.
Elevator access: This study neglected elevator access, but a future study might prioritize stations for elevator upgrades.
Aggregating by different spatial units: For the purposes of this study, and because transit occurs between boroughs, all five boroughs are grouped into the regression. However, there may be merit in a future study that groups each borough independently to examine trends that might emerge at that spatial scale. In the same sense, for future analysis, data could be grouped at different spatial scales, including by neighborhood or community district, subway line corridors, fare zones, accessibility zones (e.g., stations with elevators or ADA compliance), transit-oriented development areas, economic development zones, land use zones (commercial, industrial, or residential), and proximity to major transit hubs like Penn Station or Grand Central, or by distance from key downtown metro areas. Analyses at these scales might help identify underserved areas with unique transit needs.

This analysis lays very preliminary groundwork for a more nuanced understanding of transit accessibility in New York City. With these insights, the MTA can make informed decisions to better serve its most vulnerable populations, ultimately enhancing the quality of life for seniors, low-income residents, and the broader community.

Footnotes

Note that in the Senior ridership proportion map, there’s one outlying station in Queens with a ~0.21 senior ridership that skews the color legend and obscures the general trend in senior ridership, so we re-level it to the next highest value, of 0.11 so we can better visualize general, regional trends.↩︎
A subsequent analysis might refine the spatial unit of consideration to the Census block group level, of which there are ~13, 000 blocks in NYC.↩︎
We find this assumption reasonable for the limited scope of this analysis because the 2024 Fair Fare program income qualification ceilings are slightly above the poverty threshold (on the order of ~20%).↩︎
As stated before in the Assumptions, because there is not a clear and simple way to disaggregate the senior population from the disabled population in the Census, and because Seniors likely make up the majority of Senior/Disability riders in this fare category, for the purposes of this broad analysis, we assume that Senior/Disability riders are Seniors and henceforth, refer to this category only as “Seniors.”↩︎