Who is this for?
Data professionals looking to become more familiar with concepts across data engineering, analytics engineering and the modern data stack.
If you’d like to learn more about the history of the modern data stack and how it has evolved in the past 20 years, I recommend this post:
The Modern Data Stack: Past, Present, and Future
Why this post?
If you’ve read some of my earlier posts, you’ve probably noticed that my focus is on analyzing data, more aligned with the data scientist role.
Those posts assume the data is already perfect, so that all you need to do is:
import pandas as pd
df = pd.read_csv("really_clean_nonexistent_data.csv")
In reality, that’s rarely the case. Real-world data is messy. It’s thanks to data engineers that the data gets cleaned, organized, and made accessible to data scientists and others across the company.
As access to information becomes easier than ever, I believe we, as data professionals, should push ourselves to understand more of the entire analytics stack — not just our slice of it. That’s what motivated me to write this post.
Agenda
Project Overview
Data Stack
Let’s get started!
Project Overview
The project consists of transforming publicly available New York City subway data into insights.
Why should you care? You will see how the data flows from raw files to insights. Along the way, different tools, packages, and libraries will be introduced, and you will begin to understand what problems they solve.
Data Stack
This is the project’s data stack. An entire blog post could be written for each part of this stack, but this post will give a brief introduction to each one of them.
Meltano: an EL tool for data engineers
It’s an open-source platform for managing extract and load (EL) jobs.
Extractors and loaders are the key components of an EL job.
Extractors are responsible for retrieving the data from different sources such as APIs, databases, and CSV or Excel files.
Loaders take the extracted data and load it into a destination. In this project, Meltano loads the data into Parquet files, which are later brought into a DuckDB database.
Meltano’s Hub shows the list of extractors and loaders they currently support, and they have great documentation on how to use them.
So far the data flows: Raw Subway Data (CSV) → Meltano → Parquet file
More precisely:
Raw NYC subway data is a CSV file
I extract it and load it into Parquet files using Meltano
Check out the data-pipeline/meltano.yml file to see the extractor and loader I used.
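Before moving on, it can be useful to sanity-check the loader’s output by reading the Parquet file back with pandas. This is a minimal sketch; the file path below is hypothetical, since the real one depends on how the loader is configured in meltano.yml:

import pandas as pd

# hypothetical output path -- check your loader's config for the real one
df = pd.read_parquet("output/subway_data.parquet")

print(df.shape)   # number of rows and columns loaded
print(df.head())  # peek at the first few records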
dbt: the transformation tool
dbt is an open-source command line tool that handles the T in ELT (Extract, Load, Transform).
Simply put, once the data is stored in the data warehouse, dbt shines at transforming the data. This transformation is done using SQL.
For example, say we want to know the busiest stations in each borough.
From raw_subway_data, we can create a query (a model) that calculates this, and its output will be stored as a view or table inside the data warehouse.
-- busiest_stations.sql
-- rider count per station in each borough
select
    borough,
    station_complex,
    count(*) as rider_count
from {{ ref('raw_subway_data') }}
group by
    borough,
    station_complex
-- surface the busiest stations first
order by rider_count desc
This model transforms the data and can now be used for visualization or other purposes.
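Models are usually built from the command line with dbt run. If you prefer to stay in Python, recent versions of dbt-core (1.5+) also expose a programmatic runner; this is a minimal sketch, assuming you invoke it from the dbt project directory with a valid profiles.yml:

from dbt.cli.main import dbtRunner

# programmatic equivalent of running `dbt run`
dbt = dbtRunner()
result = dbt.invoke(["run"])
print(result.success)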
In the project, I created four models.
The first model, raw_subway_data.sql, stores the raw data from the Parquet file in the DuckDB database. The other models are built on top of it inside DuckDB.
You can find all the models in data_transform/models.
You can specify where your models will be stored in the profiles.yml configuration file.
data_transform:
  target: dev
  outputs:
    dev:
      type: duckdb
      path: /Users/juanherrera/Desktop/nyc-subway-analytics/data/subway_data.duckdb
      schema: main
DuckDB:
DuckDB is a column-oriented relational database management system that performs well on large analytical workloads.
You write SQL to query your data efficiently, allowing you to perform complex analytical queries.
All the data models are stored in DuckDB; this is the project’s database: subway_data.duckdb
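To give you a feel for how that looks in practice, here is a minimal sketch using DuckDB’s Python client; it assumes you run it from the directory containing subway_data.duckdb and that the busiest_stations model from earlier has been materialized:

import duckdb

# open the project's database file; read-only access is enough for querying
con = duckdb.connect("subway_data.duckdb", read_only=True)

# dbt models can be queried like any other table or view
top_stations = con.sql("select * from busiest_stations limit 10").df()
print(top_stations)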
Streamlit:
Streamlit is an open-source Python framework for building data apps. Once our data models are stored in our DuckDB database, we can efficiently query it and start populating our app.
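Here is a minimal sketch of what such an app could look like, assuming it runs from the directory containing subway_data.duckdb and reuses the busiest_stations model from above; the file name app.py and the chart choice are mine, not the project’s:

import duckdb
import streamlit as st

st.title("NYC Subway Analytics")

# read the busiest_stations model straight out of the DuckDB database
con = duckdb.connect("subway_data.duckdb", read_only=True)
df = con.sql(
    "select station_complex, rider_count from busiest_stations "
    "order by rider_count desc limit 10"
).df()

# one bar per station, sized by its rider count
st.bar_chart(df, x="station_complex", y="rider_count")

You would then launch it with streamlit run app.py.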
Thank you for reading!