Ruby Polars

🔥 Blazingly fast DataFrames for Ruby, powered by Polars

Installation

Add this line to your application’s Gemfile:

gem "polars-df"

Getting Started

This library follows the Polars Python API.

Polars.scan_csv("iris.csv")
  .filter(Polars.col("sepal_length") > 5)
  .group_by("species")
  .agg(Polars.all.sum)
  .collect

You can follow Polars tutorials and convert the code to Ruby in many cases. Feel free to open an issue if you run into problems.

Reference

Examples

Creating DataFrames

From a CSV

Polars.read_csv("file.csv")

# or lazily with
Polars.scan_csv("file.csv")

From Parquet

Polars.read_parquet("file.parquet")

# or lazily with
Polars.scan_parquet("file.parquet")

From Active Record

Polars.read_database(User.all)
# or
Polars.read_database("SELECT * FROM users")

From JSON

Polars.read_json("file.json")
# or
Polars.read_ndjson("file.ndjson")

# or lazily with
Polars.scan_ndjson("file.ndjson")

From Feather / Arrow IPC

Polars.read_ipc("file.arrow")

# or lazily with
Polars.scan_ipc("file.arrow")

From Avro

Polars.read_avro("file.avro")

From Delta Lake (requires deltalake-rb) [experimental]

Polars.read_delta("./table")

# or lazily with
Polars.scan_delta("./table")

From a hash

Polars::DataFrame.new({
  a: [1, 2, 3],
  b: ["one", "two", "three"]
})

From an array of hashes

Polars::DataFrame.new([
  {a: 1, b: "one"},
  {a: 2, b: "two"},
  {a: 3, b: "three"}
])

From an array of series

Polars::DataFrame.new([
  Polars::Series.new("a", [1, 2, 3]),
  Polars::Series.new("b", ["one", "two", "three"])
])

Attributes

Get number of rows

df.height

Get column names

df.columns

Check if a column exists

df.include?(name)

Selecting Data

Select a column

df["a"]

Select multiple columns

df[["a", "b"]]

Select first rows

df.head

Select last rows

df.tail

Filtering

Filter on a condition

df[Polars.col("a") == 2]
df[Polars.col("a") != 2]
df[Polars.col("a") > 2]
df[Polars.col("a") >= 2]
df[Polars.col("a") < 2]
df[Polars.col("a") <= 2]

And, or, and exclusive or

df[(Polars.col("a") > 1) & (Polars.col("b") == "two")] # and
df[(Polars.col("a") > 1) | (Polars.col("b") == "two")] # or
df[(Polars.col("a") > 1) ^ (Polars.col("b") == "two")] # xor

Operations

Basic operations

df["a"] + 5
df["a"] - 5
df["a"] * 5
df["a"] / 5
df["a"] % 5
df["a"] ** 2
df["a"].sqrt
df["a"].abs

Rounding

df["a"].round(2)
df["a"].ceil
df["a"].floor

Logarithm

df["a"].log # natural log
df["a"].log(10)

Exponentiation

df["a"].exp

Trigonometric functions

df["a"].sin
df["a"].cos
df["a"].tan
df["a"].asin
df["a"].acos
df["a"].atan

Hyperbolic functions

df["a"].sinh
df["a"].cosh
df["a"].tanh
df["a"].asinh
df["a"].acosh
df["a"].atanh

Summary statistics

df["a"].sum
df["a"].mean
df["a"].median
df["a"].quantile(0.90)
df["a"].min
df["a"].max
df["a"].std
df["a"].var

Grouping

Group

df.group_by("a").count

Works with all summary statistics

df.group_by("a").max

Multiple groups

df.group_by(["a", "b"]).count

Combining Data Frames

Add rows

df.vstack(other_df)

Add columns

df.hstack(other_df)

Inner join

df.join(other_df, on: "a")

Left join

df.join(other_df, on: "a", how: "left")

Encoding

One-hot encoding

df.to_dummies

Conversion

Array of hashes

df.rows(named: true)

Hash of series

df.to_h

CSV

df.to_csv
# or
df.write_csv("file.csv")

Parquet

df.write_parquet("file.parquet")

JSON

df.write_json("file.json")
# or
df.write_ndjson("file.ndjson")

Feather / Arrow IPC

df.write_ipc("file.arrow")

Avro

df.write_avro("file.avro")

Delta Lake [experimental]

df.write_delta("./table")

Numo array

df.to_numo

Types

You can specify column types when creating a data frame

Polars::DataFrame.new(data, schema: {"a" => Polars::Int32, "b" => Polars::Float32})

Supported types are:

  • boolean - Boolean
  • float - Float64, Float32
  • integer - Int64, Int32, Int16, Int8
  • unsigned integer - UInt64, UInt32, UInt16, UInt8
  • string - String, Binary, Categorical
  • temporal - Date, Datetime, Time, Duration
  • nested - List, Struct, Array
  • other - Object, Null

Get column types

df.schema

For a specific column

df["a"].dtype

Cast a column

df["a"].cast(Polars::Int32)

Visualization

Add Vega to your application’s Gemfile:

gem "vega"

And use:

df.plot("a", "b")

Specify the chart type (line, pie, column, bar, area, or scatter)

df.plot("a", "b", type: "pie")

Group data

df.group_by("c").plot("a", "b")

Stacked columns or bars

df.group_by("c").plot("a", "b", stacked: true)

History

View the changelog

Contributing

Everyone is encouraged to help improve this project.

To get started with development:

git clone https://github.com/ankane/ruby-polars.git
cd ruby-polars
bundle install
bundle exec rake compile
bundle exec rake test
bundle exec rake test:docs