R has a reputation as a statistics tool — the thing you reach for when you need a regression or a publication-quality plot. That framing undersells it. The part of R that earns its place in a practical engineering workflow is its I/O story: the breadth of data sources it can connect to, and the consistency of the analysis layer once data arrives in a data frame, regardless of where it came from.

Exploratory data analysis rarely lives in one system. The data you need is spread across a production Postgres database, an Elasticsearch cluster holding your search logs, a Redis instance accumulating time-series events, and a folder of CSV exports from a third-party API. Pulling all of that into one place for a single analysis session — without standing up infrastructure, without writing a pipeline — is exactly what R is good at.


The DBI Interface

Before the individual connectors: the DBI package defines a consistent interface for all SQL-speaking databases. You connect with a backend-specific driver; from there, the query and retrieval API is identical regardless of the underlying database.

library(DBI)

# Generic pattern — same functions for SQLite, Postgres, MySQL, Redshift
con <- dbConnect(Driver, ...)

dbListTables(con)                          # list tables
dbGetQuery(con, "SELECT ...")              # query and return a data frame
dbExecute(con, "UPDATE ...")               # execute without returning rows
dbWriteTable(con, "staging", my_df)        # write a data frame to a table
dbDisconnect(con)

The driver slot is where backends differ. Everything else is portable.
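One portable piece worth calling out is parameterised queries: DBI's `params` argument keeps user input out of the SQL string across backends. A minimal sketch against SQLite (the table and values are invented for illustration); note the placeholder token varies by backend ("?" for SQLite and MySQL, "$1" for Postgres):

```r
library(DBI)
library(RSQLite)

con <- dbConnect(RSQLite::SQLite(), ":memory:")
dbWriteTable(con, "events",
  data.frame(event_type = c("click", "view", "click"), value = c(1, 2, 3)))

# The value is bound by the driver, not pasted into the SQL string
df <- dbGetQuery(con,
  "SELECT count(*) AS n FROM events WHERE event_type = ?",
  params = list("click"))

dbDisconnect(con)
```

The same `params` argument works with dbExecute() for writes, which is where injection bugs usually live.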


SQLite

RSQLite is the zero-infrastructure option — no server, no credentials, no networking. The database is a file. For local analysis of exported datasets, prototyping a schema, or working offline, it covers most ground that Postgres would cover.

library(DBI)
library(RSQLite)

con <- dbConnect(RSQLite::SQLite(), "data/events.db")

# Query directly into a data frame
df <- dbGetQuery(con, "
  SELECT
    date(timestamp, 'start of day') AS day,
    event_type,
    count(*)                        AS n
  FROM events
  WHERE timestamp >= '2023-01-01'
  GROUP BY 1, 2
  ORDER BY 1
")

dbDisconnect(con)

SQLite’s in-memory mode is useful for intermediate work — load a large CSV into an in-memory database, index it, then query it with SQL rather than filtering a data frame in R:

con <- dbConnect(RSQLite::SQLite(), ":memory:")
dbWriteTable(con, "events", read.csv("events.csv"))
dbExecute(con, "CREATE INDEX idx_events_type ON events(event_type)")

df <- dbGetQuery(con, "SELECT event_type, count(*) AS n FROM events GROUP BY 1")

One caveat: read.csv still loads the entire file into R before the copy into SQLite, so the in-memory pattern buys you SQL and indexing, not memory relief. For CSVs that read.csv cannot hold, point dbConnect() at a file on disk instead of ":memory:" and load the data in chunks.
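A file-backed database plus readr's chunked reader keeps peak memory at one chunk, however large the file. A sketch under assumed toy data (the paths and the tiny CSV are illustrative stand-ins for a file too large to read whole):

```r
library(DBI)
library(RSQLite)
library(readr)

# Toy file standing in for a CSV too large to hold in RAM
write.csv(data.frame(id = 1:10, value = 1:10 * 1.5),
          "events.csv", row.names = FALSE)

# File-backed database: the table lives on disk, not in R's memory
con <- dbConnect(RSQLite::SQLite(), "events.db")

# Each chunk is appended to the table and then discarded
read_csv_chunked(
  "events.csv",
  SideEffectChunkCallback$new(function(chunk, pos) {
    dbWriteTable(con, "events", as.data.frame(chunk), append = TRUE)
  }),
  chunk_size = 4
)

n <- dbGetQuery(con, "SELECT count(*) AS n FROM events")$n
dbDisconnect(con)
```

From there the indexing and querying works exactly as in the in-memory example above.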


Postgres

RPostgres is the modern Postgres driver, built on libpq and significantly faster than the older RPostgreSQL package for large result sets.

library(DBI)
library(RPostgres)

con <- dbConnect(RPostgres::Postgres(),
  host     = Sys.getenv("PG_HOST"),
  port     = 5432,
  dbname   = Sys.getenv("PG_DB"),
  user     = Sys.getenv("PG_USER"),
  password = Sys.getenv("PG_PASS")
)

# Pull an aggregation from a production database
retention <- dbGetQuery(con, "
  SELECT
    cohort_month,
    months_since_signup,
    count(DISTINCT user_id)::float /
      first_value(count(DISTINCT user_id)) OVER (
        PARTITION BY cohort_month ORDER BY months_since_signup
      ) AS retention_rate
  FROM user_activity
  GROUP BY 1, 2
  ORDER BY 1, 2
")

dbDisconnect(con)

Use Sys.getenv() for credentials rather than hardcoding. A .Renviron file in the project root handles local secrets; CI environments inject them as environment variables.
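A minimal .Renviron sketch (every value here is a placeholder):

```
# .Renviron: one VAR=value pair per line; keep it out of version control
PG_HOST=db.example.internal
PG_DB=analytics
PG_USER=readonly_user
PG_PASS=not-a-real-password
```

R reads the file once at startup, so restart the session after editing it.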


dbplyr — SQL via dplyr

For exploration where you want to iterate on a query without rewriting SQL repeatedly, dbplyr translates dplyr verbs into SQL and runs them on the database. The data stays server-side until you call collect().

library(dplyr)
library(dbplyr)

users  <- tbl(con, "users")
events <- tbl(con, "events")

# Joins, filters and aggregations run as SQL — nothing pulled into R yet
summary <- events %>%
  left_join(users, by = "user_id") %>%
  filter(plan == "pro", created_at >= "2023-01-01") %>%
  group_by(user_id, week = sql("date_trunc('week', created_at)")) %>%
  summarise(sessions = n(), .groups = "drop") %>%
  arrange(desc(sessions))

# show_query() prints the generated SQL — useful for understanding what's happening
show_query(summary)

# collect() executes and pulls the result into a local data frame
df <- collect(summary)

show_query() is worth running when the result looks wrong. dbplyr’s SQL generation is good but not perfect — complex window functions and non-standard SQL occasionally require dropping to a raw dbGetQuery() call.
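Dropping to raw SQL doesn't mean abandoning dplyr entirely: tbl() accepts a sql() query as its source, and verbs compose on top of the subquery. A self-contained sketch against SQLite (the table and numbers are invented; against the Postgres connection above, the raw SQL could carry the date_trunc logic the same way):

```r
library(DBI)
library(RSQLite)
library(dplyr)
library(dbplyr)

con <- dbConnect(RSQLite::SQLite(), ":memory:")
dbWriteTable(con, "events",
  data.frame(user_id = c(1, 1, 2), n = c(5, 7, 20)))

# tbl() over raw SQL: the subquery becomes the base table for dplyr verbs
per_user <- tbl(con, sql(
  "SELECT user_id, sum(n) AS sessions FROM events GROUP BY user_id"))

top <- per_user %>%
  filter(sessions > 15) %>%
  collect()

dbDisconnect(con)
```

dbplyr wraps the raw query as a subquery, so the filter still executes server-side.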


Elasticsearch

Elasticsearch doesn’t fit the SQL model, but R’s elastic package (rOpenSci) wraps the REST API cleanly. The pattern for EDA is typically: run an aggregation in Elasticsearch — which is fast and doesn’t require transferring millions of documents — then pull the bucketed result into R for visualisation and further analysis.

library(elastic)

es <- connect(host = "localhost", port = 9200)

# Check connectivity
ping(es)

# Run a date histogram aggregation — stays in ES, returns bucket counts
res <- Search(es,
  index = "search-logs",
  body  = '{
    "size": 0,
    "query": {
      "range": {
        "timestamp": { "gte": "2023-01-01", "lt": "2023-04-01" }
      }
    },
    "aggs": {
      "by_day": {
        "date_histogram": {
          "field":             "timestamp",
          "calendar_interval": "day"
        },
        "aggs": {
          "zero_results": {
            "filter": { "term": { "result_count": 0 } }
          }
        }
      }
    }
  }'
)

# Extract the buckets into a data frame
buckets <- res$aggregations$by_day$buckets

df <- data.frame(
  # key_as_string is ISO 8601; as.Date() makes the column plot-ready
  date         = as.Date(sapply(buckets, `[[`, "key_as_string")),
  total        = sapply(buckets, `[[`, "doc_count"),
  zero_results = sapply(buckets, function(b) b$zero_results$doc_count),
  stringsAsFactors = FALSE
)

df$zero_pct <- df$zero_results / df$total

This pattern — aggregate in ES, analyse in R — keeps the heavy lifting where ES is fast and brings only the summary data across the wire. Fetching raw documents at scale belongs in a pipeline, not an interactive session.

For fetching a manageable number of documents, the scroll API handles pagination. Open the scroll context with time_scroll on the initial Search, then pass the returned scroll id to scroll() until a page comes back empty:

# Open a scroll context — size is the page size, time_scroll keeps it alive
res <- Search(es,
  index       = "search-logs",
  body        = '{ "query": { "term": { "result_count": 0 } } }',
  time_scroll = "1m",
  size        = 500
)

# Accumulate pages until the scroll is exhausted
hits <- res$hits$hits
while (length(res$hits$hits) > 0) {
  res  <- scroll(es, res$`_scroll_id`, time_scroll = "1m")
  hits <- c(hits, res$hits$hits)
}

# Bind successive pages into one data frame
docs <- lapply(hits, function(h) as.data.frame(h$`_source`))
df   <- do.call(rbind, docs)


Redis

redux binds libhiredis and exposes the full Redis command set. For EDA purposes, the most common patterns are reading time-series data stored in sorted sets and sampling key-value stores.

library(redux)

r <- hiredis()

# Read a sorted set (e.g. events stored with timestamp as score)
raw <- r$ZRANGEBYSCORE("events:2023", "-inf", "+inf", "WITHSCORES")

# Sorted set returns alternating member/score pairs
members <- raw[seq(1, length(raw), 2)]
scores  <- as.numeric(raw[seq(2, length(raw), 2)])

df <- data.frame(
  event     = unlist(members),
  timestamp = as.POSIXct(scores, origin = "1970-01-01"),
  stringsAsFactors = FALSE
)

# Read multiple keys matching a pattern (SCAN, not KEYS — safe on production)
cursor  <- "0"
pattern <- "session:*"
keys    <- character(0)

repeat {
  res    <- r$SCAN(cursor, MATCH = pattern, COUNT = 100)
  cursor <- res[[1]]
  keys   <- c(keys, unlist(res[[2]]))
  if (cursor == "0") break
}
# (redux's scan_find() helper wraps this same loop)

# MGET takes a vector of keys and fetches the values in one round trip
values <- r$MGET(keys)

The SCAN instead of KEYS pattern matters for production Redis. KEYS blocks the server for the duration of the scan; SCAN is iterative and safe.


Google Analytics

googleAnalyticsR wraps the GA4 Data API and handles OAuth. The setup requires a Google Cloud project and OAuth credentials — worth the one-time cost if you’re doing recurring analysis against GA data.

library(googleAnalyticsR)

# First run opens browser for OAuth — subsequent runs use cached token
ga_auth()

# GA4 property ID (not the old UA view ID)
property_id <- 123456789

df <- ga_data(
  property_id,
  date_range  = c("2023-01-01", "2023-03-31"),
  metrics     = c("sessions", "bounceRate", "averageSessionDuration"),
  dimensions  = c("date", "deviceCategory", "sessionSourceMedium"),
  limit       = 10000
)

GA4’s data model is more flexible than Universal Analytics but requires understanding the new dimension/metric naming. ga_meta("data") returns the full list of available fields for a property.


CSV and Flat Files

For large CSVs, data.table::fread() outperforms base R’s read.csv and readr::read_csv significantly — particularly for files above a few hundred megabytes.

library(data.table)

# fread detects separator, handles quoted fields, uses multiple threads
dt <- fread("exports/events-2023.csv",
  select  = c("user_id", "event_type", "timestamp", "value"),
  nThread = 4
)

# data.table syntax for grouping without copying to a new data frame
summary <- dt[
  event_type %in% c("purchase", "refund"),
  .(total = sum(value), count = .N),
  by = .(event_type, month = substr(timestamp, 1, 7))
][order(month)]

For genuinely large files — tens of gigabytes — load into a file-backed SQLite database (an in-memory one would need the data to fit in RAM), or use arrow::read_parquet() if the source can be exported in Parquet format. Parquet is column-oriented and compresses well; a 10GB CSV often becomes a 500MB Parquet file that reads in seconds.
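Beyond single-file reads, arrow's dataset interface runs dplyr verbs inside the Arrow engine and materialises only the result, so the data never needs to fit in R's memory. A self-contained sketch (the file and columns are invented toy data):

```r
library(arrow)
library(dplyr)

# Toy Parquet file standing in for a large export
write_parquet(
  data.frame(event_type = c("purchase", "refund", "purchase"),
             value      = c(10, 5, 20)),
  "events.parquet"
)

# open_dataset() scans lazily; filter and summarise are pushed into Arrow,
# and only the aggregated result becomes an R data frame at collect()
totals <- open_dataset("events.parquet") %>%
  filter(event_type == "purchase") %>%
  summarise(total = sum(value), n = n()) %>%
  collect()
```

The same call works on a directory of Parquet files treated as one partitioned dataset.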


The Common Thread

The point worth holding onto: once data from any of these sources arrives in a data frame, the analysis layer is identical. A retention calculation on Postgres data, a zero-results trend from Elasticsearch, and a channel attribution from Google Analytics all become data frames before the visualisation step. ggplot2, dplyr, and base R functions see no difference between them.

library(ggplot2)

# The source — Postgres, ES, Redis, GA, CSV — is already forgotten by here
ggplot(df, aes(x = date, y = zero_pct)) +
  geom_line(colour = "#1098CD") +
  geom_smooth(method = "loess", se = FALSE, colour = "#06394D", linetype = "dashed") +
  scale_y_continuous(labels = scales::percent) +
  labs(
    title    = "Zero-result search rate",
    subtitle = "Daily rate with LOESS trend",
    x        = NULL,
    y        = "% searches returning no results"
  ) +
  theme_minimal()

This is the core value proposition for R in EDA: the connection layer is wide, the transformation layer is expressive, and the output layer — plots, tables, R Markdown documents — is built-in rather than assembled from separate tools.


When Not to Use This

R is an interactive analysis tool, not a pipeline runtime. Scheduled jobs, production data transformations, and anything that needs to run reliably in an automated context belong in Python, dbt, or a proper orchestration tool. Using R for those cases is possible but fighting against the grain.

The same caution applies to very large result sets. Pulling a million rows from Postgres into a data frame for a group-by is the wrong approach — push the aggregation to the database and pull the summary. R is fast enough for the analysis layer; it is not a query engine.
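The same group-by both ways, as a self-contained sketch against SQLite with toy data; the second shape is the one that scales when the table has millions of rows:

```r
library(DBI)
library(RSQLite)

con <- dbConnect(RSQLite::SQLite(), ":memory:")
dbWriteTable(con, "events",
  data.frame(user_id = c(1, 1, 2, 2, 2, 3),
             value   = c(1, 2, 3, 4, 5, 6)))

# Wrong shape: every row crosses the wire, then R aggregates
rows      <- dbGetQuery(con, "SELECT user_id, value FROM events")
local_agg <- aggregate(value ~ user_id, rows, sum)

# Right shape: the database aggregates; only the summary crosses the wire
pushed_agg <- dbGetQuery(con,
  "SELECT user_id, sum(value) AS value FROM events GROUP BY user_id")

dbDisconnect(con)
```

Both produce identical summaries; the difference is how many rows travelled to get there.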

For one-off exploration, ad-hoc questions, and the initial investigation before you know what you’re looking for, the connector breadth and the unified data frame model make it worth keeping in the toolkit alongside Python.