DataFrame

DataFrame

Tabular data analysis in Haskell. Read CSV, Parquet, and JSON files, transform columns with a typed expression DSL, and optionally lock down your entire schema at the type level for compile-time safety.

The library ships three API layers — all operating on the same underlying DataFrame type at runtime:

Untyped (import qualified DataFrame as D) — string-based column names, great for exploration and scripting.
Typed (import qualified DataFrame.Typed as T) — phantom-type schema tracking with compile-time column validation.
Monadic API — write your transformation as a self contained pipeline.

This README is a runnable scripths notebook. Every Haskell block runs top-to-bottom in one shared session against the datasets in ./data. Reproduce every output below with scripths docs/base_scripts/base_readme.md -o README.md from the repo root.

Why this library?

Concise, declarative, composable data pipelines using the |> pipe operator.
Choose your level of type safety: keep it lightweight for quick analysis, or lock it down for production pipelines.
High performance from Haskell's optimizing compiler and an efficient columnar memory model with bitmap-backed nullability.
Designed for interactivity: a custom REPL, IHaskell notebook support, terminal and web plotting, and helpful error messages.

Install

cabal update
cabal install dataframe

To use as a dependency in a project:

build-depends: base >= 4, dataframe

Works with GHC 9.4 through 9.12. A custom REPL with all imports pre-loaded is available after installing:

dataframe

Quick Start

Group sales by product and compute totals. The first block carries the scripths cabal directives and the imports shared by the rest of the document; you can also drop the same code into an Example.hs and run it with cabal run Example.hs after adding a #!/usr/bin/env cabal header.

-- cabal: build-depends: dataframe, text
-- cabal: default-extensions: OverloadedStrings, TypeApplications, TemplateHaskell, DataKinds, TypeFamilies, FlexibleInstances, FlexibleContexts, ScopedTypeVariables, DeriveGeneric, UndecidableInstances
import qualified DataFrame as D
import qualified DataFrame.Functions as F
import qualified DataFrame.Typed as DT
import DataFrame.Operators
import Data.Text (Text)
import Data.Int (Int64)

sales = D.fromNamedColumns
    [ ("product", D.fromList [1, 1, 2, 2, 3, 3 :: Int])
    , ("amount",  D.fromList [100, 120, 50, 20, 40, 30 :: Int])
    ]

-- Group by product and compute totals
sales
    |> D.groupBy ["product"]
    |> D.aggregate [ F.sum (F.col @Int "amount") `as` "total"
                   , F.count (F.col @Int "amount") `as` "orders"
                   ]
    |> D.toMarkdown'

product
Int total
Int orders
Int

1 220 2

3 70 2

2 70 2

Reading from files works the same way:

fileDf <- D.readCsv "./data/housing.csv"
fileDf <- D.readParquet "./data/mtcars.parquet"

-- Hugging Face datasets (needs network access):
-- fileDf <- D.readParquet "hf://datasets/scikit-learn/iris/default/train/0000.parquet"

D.dimensions fileDf

(32,12)

Interactive REPL

The dataframe REPL comes with all imports pre-loaded. Here's a typical exploration session (each block runs as a cell):

df <- D.readCsv "./data/housing.csv"
D.dimensions df

(20640,10)

D.describeColumns df |> D.toMarkdown'

Column Name
Text # Non-null Values
Int # Null Values
Int Type
Text

total_bedrooms 20433 207 Maybe Double

ocean_proximity 20640 0 Text

median_house_value 20640 0 Double

median_income 20640 0 Double

households 20640 0 Double

population 20640 0 Double

total_rooms 20640 0 Double

housing_median_age 20640 0 Double

latitude 20640 0 Double

longitude 20640 0 Double

The :declareColumns macro ($(D.declareColumns df) outside the REPL) generates typed column references from a dataframe, so you can use column names directly in expressions instead of writing F.col @Double "median_income" every time:

$(D.declareColumns df)

df |> D.groupBy ["ocean_proximity"]
   |> D.aggregate [F.mean median_house_value `as` "avg_value"]
   |> D.toMarkdown'

ocean_proximity
Text avg_value
Double

NEAR BAY 259212.31179039303

NEAR OCEAN 249433.97742663656

INLAND 124805.39200122119

<1H OCEAN 240084.28546409807

ISLAND 380440.0

Create new columns from existing ones:

df |> D.derive "rooms_per_household" (total_rooms / households) |> D.take 3 |> D.toMarkdown'

longitude
Double latitude
Double housing_median_age
Double total_rooms
Double total_bedrooms
Maybe Double population
Double households
Double median_income
Double median_house_value
Double ocean_proximity
Text rooms_per_household
Double

-122.23 37.88 41.0 880.0 Just 129.0 322.0 126.0 8.3252 452600.0 NEAR BAY 6.984126984126984

-122.22 37.86 21.0 7099.0 Just 1106.0 2401.0 1138.0 8.3014 358500.0 NEAR BAY 6.238137082601054

-122.24 37.85 52.0 1467.0 Just 190.0 496.0 177.0 7.2574 352100.0 NEAR BAY 8.288135593220339

Type mismatches are caught as compile errors — adding a Double column to a Text column won't silently produce garbage:

dataframe> df |> D.derive "nonsense" (latitude + ocean_proximity)

<interactive>:14:47: error: [GHC-83865]
    • Couldn't match type 'Text' with 'Double'
        Expected: Expr Double
          Actual: Expr Text
    • In the second argument of '(+)', namely 'ocean_proximity'
      In the second argument of 'derive', namely
        '(latitude + ocean_proximity)'

Template Haskell

For scripts and projects, Template Haskell can generate column bindings at compile time.

Generate column references from a CSV

declareColumnsFromCsvFile (in DataFrame.TH, also re-exported from DataFrame) reads your CSV at compile time and generates typed Expr bindings for every column:

-- Reads housing.csv at compile time and generates:
--   latitude :: Expr Double
--   total_rooms :: Expr Double
--   ocean_proximity :: Expr Text
--   ... one binding per column
$(D.declareColumnsFromCsvFile "./data/housing.csv")

df <- D.readCsv "./data/housing.csv"
df |> D.derive "rooms_per_household" (total_rooms / households)
   |> D.filterWhere (median_income .>. 5)
   |> D.groupBy ["ocean_proximity"]
   |> D.aggregate [F.mean median_house_value `as` "avg_value"]
   |> D.toMarkdown'

ocean_proximity
Text avg_value
Double

NEAR BAY 361441.9354304636

NEAR OCEAN 380041.63071895426

INLAND 234817.86695906433

<1H OCEAN 333411.75125531096

Compare this to the manual version which requires spelling out every column name and type:

-- Without TH — every column needs its name and type spelled out
df |> D.derive "rooms_per_household"
        (F.col @Double "total_rooms" / F.col @Double "households")
   |> D.filterWhere (F.col @Double "median_income" .>. F.lit 5)
   |> D.take 5
   |> D.toMarkdown'

longitude
Double latitude
Double housing_median_age
Double total_rooms
Double total_bedrooms
Maybe Double population
Double households
Double median_income
Double median_house_value
Double ocean_proximity
Text rooms_per_household
Double

-122.23 37.88 41.0 880.0 Just 129.0 322.0 126.0 8.3252 452600.0 NEAR BAY 6.984126984126984

-122.22 37.86 21.0 7099.0 Just 1106.0 2401.0 1138.0 8.3014 358500.0 NEAR BAY 6.238137082601054

-122.24 37.85 52.0 1467.0 Just 190.0 496.0 177.0 7.2574 352100.0 NEAR BAY 8.288135593220339

-122.25 37.85 52.0 1274.0 Just 235.0 558.0 219.0 5.6431000000000004 341300.0 NEAR BAY 5.8173515981735155

-122.29 37.82 49.0 135.0 Just 29.0 86.0 23.0 6.1183 75000.0 NEAR BAY 5.869565217391305

Generate a schema type from a CSV

deriveSchemaFromCsvFile generates a type synonym for use with the typed API — instead of manually writing out every column name and type:

-- Generates:
-- type HousingSchema = '[ DT.Column "longitude" Double
--                       , DT.Column "latitude" Double
--                       , DT.Column "total_rooms" Double
--                       , ...
--                       ]
$(DT.deriveSchemaFromCsvFile "HousingSchema" "./data/housing.csv")

Generate a schema (and a row bridge) from a record ADT

When the canonical row shape lives in your code as a Haskell record, deriveSchemaFromType produces both the typed schema and a HasSchema instance that converts between [Order] and a DataFrame (or TypedDataFrame OrderSchema) at runtime:

data Order = Order
    { orderId :: Int64
    , region  :: Text
    , amount  :: Double
    } deriving (Show, Eq)

$(DT.deriveSchemaFromType ''Order)
-- expands to:
--   type OrderSchema =
--     '[DT.Column "order_id" Int64, DT.Column "region" Text, DT.Column "amount" Double]
--   instance DT.HasSchema Order where
--     type Schema Order = OrderSchema
--     toColumns   = ...
--     fromColumns = ...

xs :: [Order]
xs = [Order 1 "us" 10.0, Order 2 "eu" 20.5]

-- Untyped: [Order] -> DataFrame
ordersDf :: D.DataFrame
ordersDf = D.fromRecords xs

ordersDf |> D.toMarkdown'

order_id
Int64 region
Text amount
Double

1 us 10.0

2 eu 20.5

The runtime-checked round-trip back to records:

D.toRecords ordersDf :: Either Text [Order]

Right [Order {orderId = 1, region = "us", amount = 10.0},Order {orderId = 2, region = "eu", amount = 20.5}]

And the typed bridge — [Order] to TypedDataFrame OrderSchema and back:

DT.thaw (DT.fromRecordsTyped xs :: DT.TypedDataFrame OrderSchema) |> D.toMarkdown'

order_id
Int64 region
Text amount
Double

1 us 10.0

2 eu 20.5

Field names are translated camelCase → snake_case by default; override the translation with deriveSchemaFromTypeWith defaultSchemaOptions{nameTransform = id} (or any String -> String).

If all you need is a runtime Schema to drive readCsvWithSchema (no typed-dataframe machinery), there's a companion splice in DataFrame.Internal.Schema (re-exported from DataFrame):

$(D.deriveSchema ''Order)
-- emits:
--   orderSchema     :: Schema
--   orderSchema     = makeSchema [("order_id", schemaType @Int64), ...]
--   orderOrderId    :: Expr Int64
--   orderOrderId    = col "order_id"
--   orderRegion     :: Expr Text
--   orderRegion     = col "region"
--   orderAmount     :: Expr Double
--   orderAmount     = col "amount"

orders :: IO D.DataFrame
orders = do
    raw <- D.readCsvWithSchema orderSchema "./data/orders.csv"
    pure (D.filter orderAmount (> 100) raw)

Each record field gets a typed accessor named <lower-first TyConName><UpperFirst FieldName>, so data Order { customerId :: Int } yields orderCustomerId :: Expr Int = col "customer_id". That's the same shape as $(D.declareColumns df) produces from a runtime DataFrame, but driven off the ADT instead of an existing frame.

If you'd rather not depend on Template Haskell, the same schema is available via GHC.Generics (shown here on an equivalent record):

import GHC.Generics (Generic)
import DataFrame.Typed (Schema)

data OrderG = OrderG
    { orderGId :: Int64
    , regionG  :: Text
    , amountG  :: Double
    } deriving (Generic)

type OrderGSchema = DT.SchemaOf OrderG

instance DT.HasSchema OrderG where
    type Schema OrderG = OrderGSchema
    toColumns   = DT.genericToColumns
    fromColumns = DT.genericFromColumns

Typed API

When you want compile-time guarantees that column names exist and types match, wrap your DataFrame in a TypedDataFrame:

type EmployeeSchema =
    '[ DT.Column "name"       Text
     , DT.Column "department" Text
     , DT.Column "salary"     Double
     ]

employees <- D.readCsv "./data/employees.csv"
case DT.freeze @EmployeeSchema employees of
    Nothing  -> "Schema mismatch!"
    Just tdf -> tdf
        |> DT.derive @"bonus" (DT.col @"salary" * DT.lit 0.1)
        |> DT.filterWhere (DT.col @"salary" DT..>. DT.lit 50000)
        |> DT.select @'["name", "bonus"]
        |> DT.thaw
        |> D.toMarkdown'

name
Text bonus
Double

Alice 8500.0

Carol 12000.0

Dave 5200.0

Frank 6700.0

DT.freeze validates the runtime DataFrame against your schema once at the boundary. After that, every column access is checked at compile time:

-- Typo in column name -> compile error
tdf |> DT.filterWhere (DT.col @"slary" DT..>. DT.lit 50000)
-- error: Column "slary" not found in schema

-- Wrong type -> compile error
tdf |> DT.filterWhere (DT.col @"name" DT..>. DT.lit 50000)
-- error: Couldn't match type 'Text' with 'Double'

filterAllJust goes further — it strips Maybe from every column in the schema type, so downstream code can't accidentally treat cleaned columns as nullable:

type ScoreSchema = '[ DT.Column "name" Text, DT.Column "score" (Maybe Double) ]

scoresDf = D.fromNamedColumns
    [ ("name",  D.fromList ["a", "b", "c" :: Text])
    , ("score", D.fromList [Just 1.0, Nothing, Just 3.0 :: Maybe Double])
    ]

Just stdf = DT.freeze @ScoreSchema scoresDf

-- filterAllJust drops the null row and changes the column type from
-- (Maybe Double) to Double, so `scaled` can multiply it directly.
DT.thaw (DT.filterAllJust stdf |> DT.derive @"scaled" (DT.col @"score" * DT.lit 100)) |> D.toMarkdown'

name
Text score
Double scaled
Double

a 1.0 100.0

c 3.0 300.0

Features

I/O: CSV, TSV, Parquet (Snappy, ZSTD, Gzip), JSON. Read Parquet from HTTP URLs and Hugging Face datasets (hf:// URIs). Column projection and predicate pushdown for Parquet reads.

Operations: filter, select, derive, groupBy, aggregate, joins (inner, left, right, full outer), sort, sample, stratified sample, distinct, k-fold splits.

Expressions: typed column references (F.col @Double "x"), arithmetic, comparisons, logical operators, nullable-aware three-valued logic (.==, .&&), string matching (like, regex), casting, and user-defined functions via lift/lift2.

Statistics: mean, median, mode, variance, standard deviation, percentiles, inter-quartile range, correlation, skewness, frequency tables, imputation.

Plotting: terminal plots (histogram, scatter, line, bar, box, pie, heatmap, stacked bar, correlation matrix) and interactive HTML plots.

Lazy engine: streaming query execution for files that don't fit in memory. Rule-based optimizer with filter fusion, predicate pushdown, and dead column elimination. Pull-based executor with configurable batch sizes.

Interop: Arrow C Data Interface for zero-copy round-trips with Python and Polars.

ML: decision trees (TAO algorithm), feature synthesis, k-fold cross-validation, stratified sampling.

Notebooks: IHaskell integration with pre-built Binder examples.

Lazy Queries

For files too large to fit in memory, DataFrame.Lazy provides a streaming query engine. Declare a schema, build a query plan with the same familiar operations, and runDataFrame runs it through an optimizer before streaming results batch-by-batch:

import qualified DataFrame.Lazy as L
import DataFrame.Internal.Schema (schemaType, makeSchema)

housingSchema = makeSchema
    [ ("longitude",          schemaType @Double)
    , ("latitude",           schemaType @Double)
    , ("housing_median_age", schemaType @Double)
    , ("total_rooms",        schemaType @Double)
    , ("total_bedrooms",     schemaType @(Maybe Double))
    , ("population",         schemaType @Double)
    , ("households",         schemaType @Double)
    , ("median_income",      schemaType @Double)
    , ("median_house_value", schemaType @Double)
    , ("ocean_proximity",    schemaType @Text)
    ]

lazyResult <- L.runDataFrame $
    L.scanCsv housingSchema "./data/housing.csv"
    |> L.filter  (F.col @Double "median_income" .>. F.lit 5)
    |> L.derive  "value_per_income"
                 (F.col @Double "median_house_value" / F.col @Double "median_income")
    |> L.select  ["ocean_proximity", "median_house_value", "value_per_income"]
    |> L.take 1000

D.take 10 lazyResult |> D.toMarkdown'

ocean_proximity
Text median_house_value
Double value_per_income
Double

NEAR BAY 452600.0 54365.06029885168

NEAR BAY 358500.0 43185.48678536151

NEAR BAY 352100.0 48515.997464656764

NEAR BAY 341300.0 60480.94132657581

NEAR BAY 75000.0 12258.307046074891

NEAR BAY 262500.0 51554.49064163246

NEAR BAY 327600.0 55908.25312308007

NEAR BAY 347600.0 65748.65703260951

NEAR BAY 366100.0 61467.42780389524

NEAR BAY 373600.0 58895.860264211624

The optimizer pushes the filter into the scan, drops unreferenced columns before reading, and stops pulling batches once 1000 rows have been collected.

Documentation

User guide: https://dataframe.readthedocs.io/en/latest/
API reference: https://hackage.haskell.org/package/dataframe/docs/DataFrame.html
Coming from pandas, Polars, dplyr, or Frames?
Cookbook (SQL-style patterns)
Tutorials
Discord: https://discord.gg/8u8SCWfrNC

Name		Name	Last commit message	Last commit date
Latest commit History 1,197 Commits
.github		.github
app		app
benchmark		benchmark
cbits		cbits
data		data
dataframe-arrow		dataframe-arrow
dataframe-core		dataframe-core
dataframe-csv-th		dataframe-csv-th
dataframe-csv		dataframe-csv
dataframe-fastcsv		dataframe-fastcsv
dataframe-fusion		dataframe-fusion
dataframe-hasktorch		dataframe-hasktorch
dataframe-json		dataframe-json
dataframe-lazy		dataframe-lazy
dataframe-learn		dataframe-learn
dataframe-operations		dataframe-operations
dataframe-parquet-th		dataframe-parquet-th
dataframe-parquet		dataframe-parquet
dataframe-parsing		dataframe-parsing
dataframe-persistent		dataframe-persistent
dataframe-th		dataframe-th
dataframe-viz		dataframe-viz
docs		docs
examples		examples
ffi/DataFrame		ffi/DataFrame
python		python
scripts		scripts
src		src
static		static
tests		tests
.fourmolu-ignore		.fourmolu-ignore
.ghci		.ghci
.gitignore		.gitignore
.hlint.yaml		.hlint.yaml
.readthedocs.yaml		.readthedocs.yaml
CHANGELOG.md		CHANGELOG.md
CODE_OF_CONDUCT.md		CODE_OF_CONDUCT.md
CONTRIBUTING.md		CONTRIBUTING.md
LICENSE		LICENSE
README.md		README.md
SECURITY.md		SECURITY.md
cabal.project		cabal.project
cabal.project.bare		cabal.project.bare
cabal.project.ci		cabal.project.ci
dataframe.cabal		dataframe.cabal
dataframe.ghci		dataframe.ghci
flake.nix		flake.nix
fourmolu.yaml		fourmolu.yaml
test_coverage.md		test_coverage.md

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

DataFrame

Why this library?

Install

Quick Start

Interactive REPL

Template Haskell

Generate column references from a CSV

Generate a schema type from a CSV

Generate a schema (and a row bridge) from a record ADT

Typed API

Features

Lazy Queries

Documentation

About

Uh oh!

Releases

Packages

Uh oh!

Uh oh!

Contributors

Uh oh!

Languages

Column Name Text	# Non-null Values Int	# Null Values Int	Type Text
total_bedrooms	20433	207	Maybe Double
ocean_proximity	20640	0	Text
median_house_value	20640	0	Double
median_income	20640	0	Double
households	20640	0	Double
population	20640	0	Double
total_rooms	20640	0	Double
housing_median_age	20640	0	Double
latitude	20640	0	Double
longitude	20640	0	Double

ocean_proximity Text	avg_value Double
NEAR BAY	259212.31179039303
NEAR OCEAN	249433.97742663656
INLAND	124805.39200122119
<1H OCEAN	240084.28546409807
ISLAND	380440.0

longitude Double	latitude Double	housing_median_age Double	total_rooms Double	total_bedrooms Maybe Double	population Double	households Double	median_income Double	median_house_value Double	ocean_proximity Text	rooms_per_household Double
-122.23	37.88	41.0	880.0	Just 129.0	322.0	126.0	8.3252	452600.0	NEAR BAY	6.984126984126984
-122.22	37.86	21.0	7099.0	Just 1106.0	2401.0	1138.0	8.3014	358500.0	NEAR BAY	6.238137082601054
-122.24	37.85	52.0	1467.0	Just 190.0	496.0	177.0	7.2574	352100.0	NEAR BAY	8.288135593220339

ocean_proximity Text	avg_value Double
NEAR BAY	361441.9354304636
NEAR OCEAN	380041.63071895426
INLAND	234817.86695906433
<1H OCEAN	333411.75125531096

ocean_proximity Text	median_house_value Double	value_per_income Double
NEAR BAY	452600.0	54365.06029885168
NEAR BAY	358500.0	43185.48678536151
NEAR BAY	352100.0	48515.997464656764
NEAR BAY	341300.0	60480.94132657581
NEAR BAY	75000.0	12258.307046074891
NEAR BAY	262500.0	51554.49064163246
NEAR BAY	327600.0	55908.25312308007
NEAR BAY	347600.0	65748.65703260951
NEAR BAY	366100.0	61467.42780389524
NEAR BAY	373600.0	58895.860264211624

Folders and files

Latest commit

History

Repository files navigation

DataFrame

Why this library?

Install

Quick Start

Interactive REPL

Template Haskell

Generate column references from a CSV

Generate a schema type from a CSV

Generate a schema (and a row bridge) from a record ADT

Typed API

Features

Lazy Queries

Documentation

About

Topics

Resources

License

Code of conduct

Contributing

Security policy

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Uh oh!

Contributors

Uh oh!

Languages

Packages