A couple weeks ago I read two posts by Dr. Drang. He was documenting the creation of a table for a previous blog post using Pandas, a data analysis package for Python. Working with data is my day job now, so it was interesting to follow his process and the updates he made in the follow-up post. Of course I got nerd-sniped, and just had to work out how I’d approach the problem with my own preferred tools.
This will be the most technical piece I’ve written here, so if wrangling code and crunching numbers sounds like a good time, read on.[1]
The Problem
The original post—and the table in question—was looking at states’ percentage of the Electoral College vote compared to their population as a percentage of the US total. He started with a CSV containing data for each state. The header and first ten rows look like this:
State | Abbrev | Population | Electors |
---|---|---|---|
Alabama | AL | 5108468 | 9 |
Alaska | AK | 733406 | 3 |
Arizona | AZ | 7431344 | 11 |
Arkansas | AR | 3067732 | 6 |
California | CA | 38965193 | 54 |
Colorado | CO | 5877610 | 10 |
Connecticut | CT | 3617176 | 7 |
Delaware | DE | 1031890 | 3 |
District of Columbia | DC | 678972 | 3 |
Florida | FL | 22610726 | 30 |
From that he calculated this table for his post:
Electors | States | Pop Pct | EC Pct |
---|---|---|---|
3 | AK, DE, DC, ND, SD, VT, WY | 1.61% | 3.90% |
4 | HI, ID, ME, MT, NH, RI, WV | 3.04% | 5.20% |
5 | NE, NM | 1.22% | 1.86% |
6 | AR, IA, KS, MS, NV, UT | 5.60% | 6.69% |
7 | CT, OK | 2.29% | 2.60% |
8 | KY, LA, OR | 3.98% | 4.46% |
9 | AL, SC | 3.13% | 3.35% |
10 | CO, MD, MN, MO, WI | 8.93% | 9.29% |
11 | AZ, IN, MA, TN | 8.49% | 8.18% |
12 | WA | 2.33% | 2.23% |
13 | VA | 2.60% | 2.42% |
14 | NJ | 2.77% | 2.60% |
15 | MI | 3.00% | 2.79% |
16 | GA, NC | 6.53% | 5.95% |
17 | OH | 3.52% | 3.16% |
19 | IL, PA | 7.62% | 7.06% |
28 | NY | 5.84% | 5.20% |
30 | FL | 6.75% | 5.58% |
40 | TX | 9.11% | 7.43% |
54 | CA | 11.63% | 10.04% |
Both Dr. Drang and I write our posts in Markdown, so the plaintext version of the table looks like this:
| Electors | States | Pop Pct | EC Pct |
|:----------:|:--------------------------:|----------:|---------:|
| 3 | AK, DE, DC, ND, SD, VT, WY | 1.61% | 3.90% |
| 4 | HI, ID, ME, MT, NH, RI, WV | 3.04% | 5.20% |
| 5 | NE, NM | 1.22% | 1.86% |
| 6 | AR, IA, KS, MS, NV, UT | 5.60% | 6.69% |
| 7 | CT, OK | 2.29% | 2.60% |
| 8 | KY, LA, OR | 3.98% | 4.46% |
| 9 | AL, SC | 3.13% | 3.35% |
| 10 | CO, MD, MN, MO, WI | 8.93% | 9.29% |
| 11 | AZ, IN, MA, TN | 8.49% | 8.18% |
| 12 | WA | 2.33% | 2.23% |
| 13 | VA | 2.60% | 2.42% |
| 14 | NJ | 2.77% | 2.60% |
| 15 | MI | 3.00% | 2.79% |
| 16 | GA, NC | 6.53% | 5.95% |
| 17 | OH | 3.52% | 3.16% |
| 19 | IL, PA | 7.62% | 7.06% |
| 28 | NY | 5.84% | 5.20% |
| 30 | FL | 6.75% | 5.58% |
| 40 | TX | 9.11% | 7.43% |
| 54 | CA | 11.63% | 10.04% |
The Tools
My go-to language for working with data is SQL. I turn to Python[2] for things that involve more scripting (like loops or complex functions). But for most data analysis needs, I find SQL better suited to the job.
From personal projects to one-off data transformation/analysis tasks at work, I keep finding more and more uses for DuckDB—in its own words, “a fast in-process analytical database”. DuckDB can import (and export) a variety of file and database formats or even query them directly. It can also be used from within Python, which allows for workflows combining DuckDB and Pandas.
The Solution
I worked through this a couple different ways. The first time through was more piece-by-piece, and then I condensed that down to a one-shot query, which is what I used in a (very short) Python script to generate the final Markdown table.
I started by importing the CSV to a new table in the database:[3]
create table states as from "states.csv";
I queried that table to get something like Dr. Drang’s initial summary table.
with totals as (
-- Sum population and electors for all states
select sum(population) as total_pop,
sum(electors) as total_electors
from states
)
select electors,
string_agg(abbrev, ', ') as states,
count(*) as num_states,
sum(population) as population,
sum(population/total_pop) as pop_pct,
sum(electors) as electors_sum,
sum(electors/total_electors) as ec_pct
from states,
totals
group by electors;
This is where the meat of the aggregation happens.
In the first part, I sum the population
and electors
columns for the whole states
table. This gives me a single-row result set which I use along with the states
table in the second part.
The group by electors
clause tells the query what to group, and all the other columns are aggregate functions for all rows (or states) with that number of electors. string_agg
does the thing Dr. Drang wrote a custom lambda function for, combining the column values with a given separator—in this case, a comma and space.
Here’s DuckDB’s output:
┌──────────┬────────────────────────────┬────────────┬────────────┬──────────────────────┬──────────────┬──────────────────────┐
│ Electors │ states │ num_states │ population │ pop_pct │ electors_sum │ ec_pct │
│ int64 │ varchar │ int64 │ int128 │ double │ int128 │ double │
├──────────┼────────────────────────────┼────────────┼────────────┼──────────────────────┼──────────────┼──────────────────────┤
│ 3 │ AK, DE, DC, ND, SD, VT, WY │ 7 │ 5379033 │ 0.016060895111876104 │ 21 │ 0.039033457249070626 │
│ 4 │ HI, ID, ME, MT, NH, RI, WV │ 7 │ 10196485 │ 0.030445003050700387 │ 28 │ 0.052044609665427496 │
│ 5 │ NE, NM │ 2 │ 4092750 │ 0.012220268674524016 │ 10 │ 0.01858736059479554 │
│ 6 │ AR, IA, KS, MS, NV, UT │ 6 │ 18766882 │ 0.05603477862637313 │ 36 │ 0.06691449814126393 │
│ 7 │ CT, OK │ 2 │ 7671000 │ 0.02290432618710494 │ 14 │ 0.026022304832713755 │
│ 8 │ KY, LA, OR │ 3 │ 13333261 │ 0.03981089285383978 │ 24 │ 0.04460966542750929 │
│ 9 │ AL, SC │ 2 │ 10482023 │ 0.031297571880163765 │ 18 │ 0.03345724907063197 │
│ 10 │ CO, MD, MN, MO, WI │ 5 │ 29902889 │ 0.08928503762127392 │ 50 │ 0.0929368029739777 │
│ 11 │ AZ, IN, MA, TN │ 4 │ 28421431 │ 0.08486165119649276 │ 44 │ 0.08178438661710037 │
│ 12 │ WA │ 1 │ 7812880 │ 0.023327956196155443 │ 12 │ 0.022304832713754646 │
│ 13 │ VA │ 1 │ 8715698 │ 0.026023620119971076 │ 13 │ 0.024163568773234202 │
│ 14 │ NJ │ 1 │ 9290841 │ 0.02774090116236843 │ 14 │ 0.026022304832713755 │
│ 15 │ MI │ 1 │ 10037261 │ 0.029969586751284978 │ 15 │ 0.027881040892193308 │
│ 16 │ GA, NC │ 2 │ 21864718 │ 0.06528440008617711 │ 32 │ 0.05947955390334572 │
│ 17 │ OH │ 1 │ 11785935 │ 0.03519083557033198 │ 17 │ 0.031598513011152414 │
│ 19 │ IL, PA │ 2 │ 25511372 │ 0.076172700530384 │ 38 │ 0.07063197026022305 │
│ 28 │ NY │ 1 │ 19571216 │ 0.05843638575704434 │ 28 │ 0.05204460966542751 │
│ 30 │ FL │ 1 │ 22610726 │ 0.06751185551183085 │ 30 │ 0.055762081784386616 │
│ 40 │ TX │ 1 │ 30503301 │ 0.09107776768184646 │ 40 │ 0.07434944237918216 │
│ 54 │ CA │ 1 │ 38965193 │ 0.11634356543025655 │ 54 │ 0.10037174721189591 │
├──────────┴────────────────────────────┴────────────┴────────────┴──────────────────────┴──────────────┴──────────────────────┤
│ 20 rows 7 columns │
└──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┘
Finally, I rewrote the query to cut out the import step and select just the columns for the final table. For some reason, when doing it this way the table was no longer sorted by the electors column, even though it had been in the two-step query before. That can be remedied with order by electors
:
with totals as (
-- Sum population and electors for all states
select sum(population) as total_pop,
sum(electors) as total_electors
from "states.csv"
)
select electors as "Electors",
string_agg(abbrev, ', ') as "States",
sum(population/total_pop) as "Pop Pct",
sum(electors/total_electors) as "EC Pct"
from "states.csv",
totals
group by electors
order by electors;
This gets closer, but it’s not Markdown yet:
┌──────────┬────────────────────────────┬──────────────────────┬──────────────────────┐
│ Electors │ States │ Pop Pct │ EC Pct │
│ int64 │ varchar │ double │ double │
├──────────┼────────────────────────────┼──────────────────────┼──────────────────────┤
│ 3 │ AK, DE, DC, ND, SD, VT, WY │ 0.016060895111876104 │ 0.039033457249070626 │
│ 4 │ HI, ID, ME, MT, NH, RI, WV │ 0.030445003050700387 │ 0.052044609665427496 │
│ 5 │ NE, NM │ 0.012220268674524016 │ 0.01858736059479554 │
│ 6 │ AR, IA, KS, MS, NV, UT │ 0.05603477862637313 │ 0.06691449814126393 │
│ 7 │ CT, OK │ 0.02290432618710494 │ 0.026022304832713755 │
│ 8 │ KY, LA, OR │ 0.03981089285383978 │ 0.04460966542750929 │
│ 9 │ AL, SC │ 0.031297571880163765 │ 0.03345724907063197 │
│ 10 │ CO, MD, MN, MO, WI │ 0.08928503762127392 │ 0.0929368029739777 │
│ 11 │ AZ, IN, MA, TN │ 0.08486165119649276 │ 0.08178438661710037 │
│ 12 │ WA │ 0.023327956196155443 │ 0.022304832713754646 │
│ 13 │ VA │ 0.026023620119971076 │ 0.024163568773234202 │
│ 14 │ NJ │ 0.02774090116236843 │ 0.026022304832713755 │
│ 15 │ MI │ 0.029969586751284978 │ 0.027881040892193308 │
│ 16 │ GA, NC │ 0.06528440008617711 │ 0.05947955390334572 │
│ 17 │ OH │ 0.03519083557033198 │ 0.031598513011152414 │
│ 19 │ IL, PA │ 0.076172700530384 │ 0.07063197026022305 │
│ 28 │ NY │ 0.05843638575704434 │ 0.05204460966542751 │
│ 30 │ FL │ 0.06751185551183085 │ 0.055762081784386616 │
│ 40 │ TX │ 0.09107776768184646 │ 0.07434944237918216 │
│ 54 │ CA │ 0.11634356543025655 │ 0.10037174721189591 │
├──────────┴────────────────────────────┴──────────────────────┴──────────────────────┤
│ 20 rows 4 columns │
└─────────────────────────────────────────────────────────────────────────────────────┘
To get it into a Markdown table, we can run this query from Python, convert it to a Pandas dataframe, and run that through the to_markdown
function.
import duckdb
import pandas
df = duckdb.sql("""
with totals as (
select sum(population) as total_pop,
sum(electors) as total_electors
from 'states.csv'
)
select electors as 'Electors',
string_agg(abbrev, ', ') as 'States',
sum(population/total_pop) as 'Pop Pct',
sum(electors/total_electors) as 'EC Pct'
from 'states.csv',
totals
group by electors
order by electors;
""").df()
print(df.to_markdown(
index=False,
floatfmt='.2%',
colalign=['center', 'center', 'right', 'right']
))
I’ve made two tweaks to the to_markdown
call from Dr. Drang’s code. It was adding an index column by default, so I’ve disabled that. Second, since I’ve already renamed the columns in SQL, the headers
parameter is no longer needed.
This code combines everything: reading from the CSV, transforming it to the new table, and converting the results (via Pandas) to Markdown. So all we have to do is run this script in the same working directory as states.csv
.
| Electors | States | Pop Pct | EC Pct |
|:----------:|:--------------------------:|----------:|---------:|
| 3 | AK, DE, DC, ND, SD, VT, WY | 1.61% | 3.90% |
| 4 | HI, ID, ME, MT, NH, RI, WV | 3.04% | 5.20% |
| 5 | NE, NM | 1.22% | 1.86% |
| 6 | AR, IA, KS, MS, NV, UT | 5.60% | 6.69% |
| 7 | CT, OK | 2.29% | 2.60% |
| 8 | KY, LA, OR | 3.98% | 4.46% |
| 9 | AL, SC | 3.13% | 3.35% |
| 10 | CO, MD, MN, MO, WI | 8.93% | 9.29% |
| 11 | AZ, IN, MA, TN | 8.49% | 8.18% |
| 12 | WA | 2.33% | 2.23% |
| 13 | VA | 2.60% | 2.42% |
| 14 | NJ | 2.77% | 2.60% |
| 15 | MI | 3.00% | 2.79% |
| 16 | GA, NC | 6.53% | 5.95% |
| 17 | OH | 3.52% | 3.16% |
| 19 | IL, PA | 7.62% | 7.06% |
| 28 | NY | 5.84% | 5.20% |
| 30 | FL | 6.75% | 5.58% |
| 40 | TX | 9.11% | 7.43% |
| 54 | CA | 11.63% | 10.04% |
Success!
Joke’s on me: writing this post and updating my website to handle all these tables and code blocks nicely ended up being way more involved than solving the original problem. ↩︎
Some Pandas, but mostly Snowpark Python, which is Pandas-inspired but native to Snowflake, the data warehouse we use at work. ↩︎
To pick this apart a bit: I could read the contents of the file with
select * from "states.csv"
, which gives me every column (and row) of the data. DuckDB has a convenient syntax for thoseselect *
queries—namely, you can drop theselect *
and just typefrom "states.csv"
. Here I’m taking it a step further and creating a table from the results of that query. ↩︎