Tabular data analysis in Haskell. Read CSV, Parquet, and JSON files, transform columns with a typed expression DSL, and optionally lock down your entire schema at the type level for compile-time safety.
The library ships three API layers — all operating on the same underlying DataFrame type at runtime:
- Untyped (
import qualified DataFrame as D) — string-based column names, great for exploration and scripting. - Typed (
import qualified DataFrame.Typed as T) — phantom-type schema tracking with compile-time column validation. - Monadic API — write your transformation as a self contained pipeline.
This README is a runnable scripths notebook. Every Haskell block runs top-to-bottom in one shared session against the datasets in
./data. Reproduce every output below withscripths docs/base_scripts/base_readme.md -o README.mdfrom the repo root.
- Concise, declarative, composable data pipelines using the
|>pipe operator. - Choose your level of type safety: keep it lightweight for quick analysis, or lock it down for production pipelines.
- High performance from Haskell's optimizing compiler and an efficient columnar memory model with bitmap-backed nullability.
- Designed for interactivity: a custom REPL, IHaskell notebook support, terminal and web plotting, and helpful error messages.
cabal update
cabal install dataframeTo use as a dependency in a project:
build-depends: base >= 4, dataframe
Works with GHC 9.4 through 9.12. A custom REPL with all imports pre-loaded is available after installing:
dataframeGroup sales by product and compute totals. The first block carries the
scripths cabal directives and the imports shared by the rest of the document;
you can also drop the same code into an Example.hs and run it with
cabal run Example.hs after adding a #!/usr/bin/env cabal header.
-- cabal: build-depends: dataframe, text
-- cabal: default-extensions: OverloadedStrings, TypeApplications, TemplateHaskell, DataKinds, TypeFamilies, FlexibleInstances, FlexibleContexts, ScopedTypeVariables, DeriveGeneric, UndecidableInstances
import qualified DataFrame as D
import qualified DataFrame.Functions as F
import qualified DataFrame.Typed as DT
import DataFrame.Operators
import Data.Text (Text)
import Data.Int (Int64)
sales = D.fromNamedColumns
[ ("product", D.fromList [1, 1, 2, 2, 3, 3 :: Int])
, ("amount", D.fromList [100, 120, 50, 20, 40, 30 :: Int])
]
-- Group by product and compute totals
sales
|> D.groupBy ["product"]
|> D.aggregate [ F.sum (F.col @Int "amount") `as` "total"
, F.count (F.col @Int "amount") `as` "orders"
]
|> D.toMarkdown'
product
Inttotal
Intorders
Int1 220 2 3 70 2 2 70 2
Reading from files works the same way:
fileDf <- D.readCsv "./data/housing.csv"
fileDf <- D.readParquet "./data/mtcars.parquet"
-- Hugging Face datasets (needs network access):
-- fileDf <- D.readParquet "hf://datasets/scikit-learn/iris/default/train/0000.parquet"
D.dimensions fileDf(32,12)
The dataframe REPL comes with all imports pre-loaded. Here's a typical exploration session (each block runs as a cell):
df <- D.readCsv "./data/housing.csv"
D.dimensions df(20640,10)
D.describeColumns df |> D.toMarkdown'
Column Name
Text# Non-null Values
Int# Null Values
IntType
Texttotal_bedrooms 20433 207 Maybe Double ocean_proximity 20640 0 Text median_house_value 20640 0 Double median_income 20640 0 Double households 20640 0 Double population 20640 0 Double total_rooms 20640 0 Double housing_median_age 20640 0 Double latitude 20640 0 Double longitude 20640 0 Double
The :declareColumns macro ($(D.declareColumns df) outside the REPL) generates typed column references from a dataframe, so you can use column names directly in expressions instead of writing F.col @Double "median_income" every time:
$(D.declareColumns df)df |> D.groupBy ["ocean_proximity"]
|> D.aggregate [F.mean median_house_value `as` "avg_value"]
|> D.toMarkdown'
ocean_proximity
Textavg_value
DoubleNEAR BAY 259212.31179039303 NEAR OCEAN 249433.97742663656 INLAND 124805.39200122119 <1H OCEAN 240084.28546409807 ISLAND 380440.0
Create new columns from existing ones:
df |> D.derive "rooms_per_household" (total_rooms / households) |> D.take 3 |> D.toMarkdown'
longitude
Doublelatitude
Doublehousing_median_age
Doubletotal_rooms
Doubletotal_bedrooms
Maybe Doublepopulation
Doublehouseholds
Doublemedian_income
Doublemedian_house_value
Doubleocean_proximity
Textrooms_per_household
Double-122.23 37.88 41.0 880.0 Just 129.0 322.0 126.0 8.3252 452600.0 NEAR BAY 6.984126984126984 -122.22 37.86 21.0 7099.0 Just 1106.0 2401.0 1138.0 8.3014 358500.0 NEAR BAY 6.238137082601054 -122.24 37.85 52.0 1467.0 Just 190.0 496.0 177.0 7.2574 352100.0 NEAR BAY 8.288135593220339
Type mismatches are caught as compile errors — adding a Double column to a Text column won't silently produce garbage:
dataframe> df |> D.derive "nonsense" (latitude + ocean_proximity)
<interactive>:14:47: error: [GHC-83865]
• Couldn't match type 'Text' with 'Double'
Expected: Expr Double
Actual: Expr Text
• In the second argument of '(+)', namely 'ocean_proximity'
In the second argument of 'derive', namely
'(latitude + ocean_proximity)'
For scripts and projects, Template Haskell can generate column bindings at compile time.
declareColumnsFromCsvFile (in DataFrame.TH, also re-exported from DataFrame)
reads your CSV at compile time and generates typed Expr bindings for every column:
-- Reads housing.csv at compile time and generates:
-- latitude :: Expr Double
-- total_rooms :: Expr Double
-- ocean_proximity :: Expr Text
-- ... one binding per column
$(D.declareColumnsFromCsvFile "./data/housing.csv")
df <- D.readCsv "./data/housing.csv"
df |> D.derive "rooms_per_household" (total_rooms / households)
|> D.filterWhere (median_income .>. 5)
|> D.groupBy ["ocean_proximity"]
|> D.aggregate [F.mean median_house_value `as` "avg_value"]
|> D.toMarkdown'
ocean_proximity
Textavg_value
DoubleNEAR BAY 361441.9354304636 NEAR OCEAN 380041.63071895426 INLAND 234817.86695906433 <1H OCEAN 333411.75125531096
Compare this to the manual version which requires spelling out every column name and type:
-- Without TH — every column needs its name and type spelled out
df |> D.derive "rooms_per_household"
(F.col @Double "total_rooms" / F.col @Double "households")
|> D.filterWhere (F.col @Double "median_income" .>. F.lit 5)
|> D.take 5
|> D.toMarkdown'
longitude
Doublelatitude
Doublehousing_median_age
Doubletotal_rooms
Doubletotal_bedrooms
Maybe Doublepopulation
Doublehouseholds
Doublemedian_income
Doublemedian_house_value
Doubleocean_proximity
Textrooms_per_household
Double-122.23 37.88 41.0 880.0 Just 129.0 322.0 126.0 8.3252 452600.0 NEAR BAY 6.984126984126984 -122.22 37.86 21.0 7099.0 Just 1106.0 2401.0 1138.0 8.3014 358500.0 NEAR BAY 6.238137082601054 -122.24 37.85 52.0 1467.0 Just 190.0 496.0 177.0 7.2574 352100.0 NEAR BAY 8.288135593220339 -122.25 37.85 52.0 1274.0 Just 235.0 558.0 219.0 5.6431000000000004 341300.0 NEAR BAY 5.8173515981735155 -122.29 37.82 49.0 135.0 Just 29.0 86.0 23.0 6.1183 75000.0 NEAR BAY 5.869565217391305
deriveSchemaFromCsvFile generates a type synonym for use with the typed API — instead of manually writing out every column name and type:
-- Generates:
-- type HousingSchema = '[ DT.Column "longitude" Double
-- , DT.Column "latitude" Double
-- , DT.Column "total_rooms" Double
-- , ...
-- ]
$(DT.deriveSchemaFromCsvFile "HousingSchema" "./data/housing.csv")When the canonical row shape lives in your code as a Haskell record,
deriveSchemaFromType produces both the typed schema and a HasSchema
instance that converts between [Order] and a DataFrame (or
TypedDataFrame OrderSchema) at runtime:
data Order = Order
{ orderId :: Int64
, region :: Text
, amount :: Double
} deriving (Show, Eq)
$(DT.deriveSchemaFromType ''Order)
-- expands to:
-- type OrderSchema =
-- '[DT.Column "order_id" Int64, DT.Column "region" Text, DT.Column "amount" Double]
-- instance DT.HasSchema Order where
-- type Schema Order = OrderSchema
-- toColumns = ...
-- fromColumns = ...
xs :: [Order]
xs = [Order 1 "us" 10.0, Order 2 "eu" 20.5]
-- Untyped: [Order] -> DataFrame
ordersDf :: D.DataFrame
ordersDf = D.fromRecords xs
ordersDf |> D.toMarkdown'
order_id
Int64region
Textamount
Double1 us 10.0 2 eu 20.5
The runtime-checked round-trip back to records:
D.toRecords ordersDf :: Either Text [Order]Right [Order {orderId = 1, region = "us", amount = 10.0},Order {orderId = 2, region = "eu", amount = 20.5}]
And the typed bridge — [Order] to TypedDataFrame OrderSchema and back:
DT.thaw (DT.fromRecordsTyped xs :: DT.TypedDataFrame OrderSchema) |> D.toMarkdown'
order_id
Int64region
Textamount
Double1 us 10.0 2 eu 20.5
Field names are translated camelCase → snake_case by default; override
the translation with deriveSchemaFromTypeWith defaultSchemaOptions{nameTransform = id} (or any String -> String).
If all you need is a runtime Schema to drive readCsvWithSchema (no
typed-dataframe machinery), there's a companion splice in
DataFrame.Internal.Schema (re-exported from DataFrame):
$(D.deriveSchema ''Order)
-- emits:
-- orderSchema :: Schema
-- orderSchema = makeSchema [("order_id", schemaType @Int64), ...]
-- orderOrderId :: Expr Int64
-- orderOrderId = col "order_id"
-- orderRegion :: Expr Text
-- orderRegion = col "region"
-- orderAmount :: Expr Double
-- orderAmount = col "amount"
orders :: IO D.DataFrame
orders = do
raw <- D.readCsvWithSchema orderSchema "./data/orders.csv"
pure (D.filter orderAmount (> 100) raw)Each record field gets a typed accessor named <lower-first TyConName><UpperFirst FieldName>,
so data Order { customerId :: Int } yields orderCustomerId :: Expr Int = col "customer_id".
That's the same shape as $(D.declareColumns df) produces from a runtime
DataFrame, but driven off the ADT instead of an existing frame.
If you'd rather not depend on Template Haskell, the same schema is
available via GHC.Generics (shown here on an equivalent record):
import GHC.Generics (Generic)
import DataFrame.Typed (Schema)
data OrderG = OrderG
{ orderGId :: Int64
, regionG :: Text
, amountG :: Double
} deriving (Generic)
type OrderGSchema = DT.SchemaOf OrderG
instance DT.HasSchema OrderG where
type Schema OrderG = OrderGSchema
toColumns = DT.genericToColumns
fromColumns = DT.genericFromColumnsWhen you want compile-time guarantees that column names exist and types match, wrap your DataFrame in a TypedDataFrame:
type EmployeeSchema =
'[ DT.Column "name" Text
, DT.Column "department" Text
, DT.Column "salary" Double
]
employees <- D.readCsv "./data/employees.csv"
case DT.freeze @EmployeeSchema employees of
Nothing -> "Schema mismatch!"
Just tdf -> tdf
|> DT.derive @"bonus" (DT.col @"salary" * DT.lit 0.1)
|> DT.filterWhere (DT.col @"salary" DT..>. DT.lit 50000)
|> DT.select @'["name", "bonus"]
|> DT.thaw
|> D.toMarkdown'
name
Textbonus
DoubleAlice 8500.0 Carol 12000.0 Dave 5200.0 Frank 6700.0
DT.freeze validates the runtime DataFrame against your schema once at the boundary. After that, every column access is checked at compile time:
-- Typo in column name -> compile error
tdf |> DT.filterWhere (DT.col @"slary" DT..>. DT.lit 50000)
-- error: Column "slary" not found in schema
-- Wrong type -> compile error
tdf |> DT.filterWhere (DT.col @"name" DT..>. DT.lit 50000)
-- error: Couldn't match type 'Text' with 'Double'
filterAllJust goes further — it strips Maybe from every column in the schema type, so downstream code can't accidentally treat cleaned columns as nullable:
type ScoreSchema = '[ DT.Column "name" Text, DT.Column "score" (Maybe Double) ]
scoresDf = D.fromNamedColumns
[ ("name", D.fromList ["a", "b", "c" :: Text])
, ("score", D.fromList [Just 1.0, Nothing, Just 3.0 :: Maybe Double])
]
Just stdf = DT.freeze @ScoreSchema scoresDf
-- filterAllJust drops the null row and changes the column type from
-- (Maybe Double) to Double, so `scaled` can multiply it directly.
DT.thaw (DT.filterAllJust stdf |> DT.derive @"scaled" (DT.col @"score" * DT.lit 100)) |> D.toMarkdown'
name
Textscore
Doublescaled
Doublea 1.0 100.0 c 3.0 300.0
I/O: CSV, TSV, Parquet (Snappy, ZSTD, Gzip), JSON. Read Parquet from HTTP URLs and Hugging Face datasets (hf:// URIs). Column projection and predicate pushdown for Parquet reads.
Operations: filter, select, derive, groupBy, aggregate, joins (inner, left, right, full outer), sort, sample, stratified sample, distinct, k-fold splits.
Expressions: typed column references (F.col @Double "x"), arithmetic, comparisons, logical operators, nullable-aware three-valued logic (.==, .&&), string matching (like, regex), casting, and user-defined functions via lift/lift2.
Statistics: mean, median, mode, variance, standard deviation, percentiles, inter-quartile range, correlation, skewness, frequency tables, imputation.
Plotting: terminal plots (histogram, scatter, line, bar, box, pie, heatmap, stacked bar, correlation matrix) and interactive HTML plots.
Lazy engine: streaming query execution for files that don't fit in memory. Rule-based optimizer with filter fusion, predicate pushdown, and dead column elimination. Pull-based executor with configurable batch sizes.
Interop: Arrow C Data Interface for zero-copy round-trips with Python and Polars.
ML: decision trees (TAO algorithm), feature synthesis, k-fold cross-validation, stratified sampling.
Notebooks: IHaskell integration with pre-built Binder examples.
For files too large to fit in memory, DataFrame.Lazy provides a streaming query engine. Declare a schema, build a query plan with the same familiar operations, and runDataFrame runs it through an optimizer before streaming results batch-by-batch:
import qualified DataFrame.Lazy as L
import DataFrame.Internal.Schema (schemaType, makeSchema)
housingSchema = makeSchema
[ ("longitude", schemaType @Double)
, ("latitude", schemaType @Double)
, ("housing_median_age", schemaType @Double)
, ("total_rooms", schemaType @Double)
, ("total_bedrooms", schemaType @(Maybe Double))
, ("population", schemaType @Double)
, ("households", schemaType @Double)
, ("median_income", schemaType @Double)
, ("median_house_value", schemaType @Double)
, ("ocean_proximity", schemaType @Text)
]
lazyResult <- L.runDataFrame $
L.scanCsv housingSchema "./data/housing.csv"
|> L.filter (F.col @Double "median_income" .>. F.lit 5)
|> L.derive "value_per_income"
(F.col @Double "median_house_value" / F.col @Double "median_income")
|> L.select ["ocean_proximity", "median_house_value", "value_per_income"]
|> L.take 1000
D.take 10 lazyResult |> D.toMarkdown'
ocean_proximity
Textmedian_house_value
Doublevalue_per_income
DoubleNEAR BAY 452600.0 54365.06029885168 NEAR BAY 358500.0 43185.48678536151 NEAR BAY 352100.0 48515.997464656764 NEAR BAY 341300.0 60480.94132657581 NEAR BAY 75000.0 12258.307046074891 NEAR BAY 262500.0 51554.49064163246 NEAR BAY 327600.0 55908.25312308007 NEAR BAY 347600.0 65748.65703260951 NEAR BAY 366100.0 61467.42780389524 NEAR BAY 373600.0 58895.860264211624
The optimizer pushes the filter into the scan, drops unreferenced columns before reading, and stops pulling batches once 1000 rows have been collected.