Merge pull request #168 from codedthinking/0.5-dev
0.5 dev
gergelyattilakiss authored Jul 21, 2024
2 parents 26aae12 + 3cee9a2 commit f9e64f8
Showing 19 changed files with 488 additions and 247 deletions.
3 changes: 2 additions & 1 deletion Project.toml
@@ -1,10 +1,11 @@
name = "Kezdi"
uuid = "48308a23-c29e-446c-b4c0-d9446a767439"
authors = ["Miklos Koren <[email protected]>", "Gergely Attila Kiss <[email protected]>"]
version = "0.4.8"
version = "0.5.0"

[deps]
CSV = "336ed68f-0bac-5ca0-87d4-7b16caf5d00b"
Crayons = "a8cc5b0e-0ffa-5ad4-8c14-923d3ee1735f"
DataFrames = "a93c6f00-e57d-5684-b7b6-d8193f3e46c0"
Dates = "ade2ca70-3891-5945-98fb-dc099432e06a"
Expronicon = "6b7a57c9-7cc1-4fdf-b7f5-e857abae3636"
16 changes: 6 additions & 10 deletions README.md
@@ -11,7 +11,7 @@ It imports and reexports [CSV](https://csv.juliadata.org/stable/), [DataFrames](

## Getting started

> `Kezdi.jl` is currently in beta. We have more than 300 unit tests and a large code coverage. [![Coverage](https://codecov.io/gh/codedthinking/Kezdi.jl/branch/main/graph/badge.svg)](https://codecov.io/gh/codedthinking/Kezdi.jl) The package, however, is not guaranteed to be bug-free. If you encounter any issues, please report them as a [GitHub issue](https://github.com/codedthinking/Kezdi.jl/issues/new).
> `Kezdi.jl` is currently in beta. We have more than 400 unit tests and a large code coverage. [![Coverage](https://codecov.io/gh/codedthinking/Kezdi.jl/branch/main/graph/badge.svg)](https://codecov.io/gh/codedthinking/Kezdi.jl) The package, however, is not guaranteed to be bug-free. If you encounter any issues, please report them as a [GitHub issue](https://github.com/codedthinking/Kezdi.jl/issues/new).
>
> If you would like to receive updates on the package, please star the repository on GitHub and sign up for [email notifications here](https://relentless-producer-1210.ck.page/62d7ebb237).
@@ -87,20 +87,17 @@ end
The function can operate on individual elements,
```julia
get_make(text) = split(text, " ")[1]
@generate Make = Main.get_make(Model)
@generate Make = get_make(Model)
```
or on the entire column:
```julia
function geometric_mean(x::AbstractVector)
function geometric_mean(x::Vector)
n = length(x)
return exp(sum(log.(x)) / n)
end
@collapse geom_NPG = Main.geometric_mean(MPG), by(Cylinders)
@collapse geom_NPG = geometric_mean(MPG), by(Cylinders)
```

!!! tip "Note: `Main.` prefix"
If you define a function in your own code, you need to prefix the function name with `Main.` to use it in other commands. To make use of [Automatic vectorization](@ref), make sure to give the function a vector argument type.

## Commands
[See the full documentation](https://codedthinking.github.io/Kezdi.jl/dev/).

@@ -146,10 +143,9 @@ All functions are automatically vectorized, so there is no need to use the `.` o
@generate logHP = log(Horsepower)
```

If you want to turn off automatic vectorization, use the convenience function [`DNV`](@ref) ("do not vectorize").

If you want to turn off automatic vectorization, use the `~` notation,
```julia
@generate logHP = DNV(log(Horsepower))
@generate logHP = ~log(Horsepower)
```

The exception is when the function operates on Vectors, in which case Kezdi.jl understands you want to apply the function to the entire column.
1 change: 1 addition & 0 deletions docs/make.jl
@@ -14,6 +14,7 @@ makedocs(;
pages=[
"Home" => "index.md",
"Reference" => "reference.md",
"Contributing" => "developing.md",
],
)

78 changes: 78 additions & 0 deletions docs/src/developing.md
@@ -0,0 +1,78 @@
!!! warning "Kezdi.jl is in beta"
`Kezdi.jl` is currently in beta. We have more than 400 unit tests and a large code coverage. [![Coverage](https://codecov.io/gh/codedthinking/Kezdi.jl/branch/main/graph/badge.svg)](https://codecov.io/gh/codedthinking/Kezdi.jl) The package, however, is not guaranteed to be bug-free. If you encounter any issues, please report them as a [GitHub issue](https://github.com/codedthinking/Kezdi.jl/issues/new).

If you would like to receive updates on the package, please star the repository on GitHub and sign up for [email notifications here](https://relentless-producer-1210.ck.page/62d7ebb237).

!!! warning "This is the developer documentation"
You are currently reading the documentation for developers. It explains the internal workings of Kezdi.jl. If you do not want to change or contribute to Kezdi.jl functionality, head over to the [end-user documentation]().

## Developer documentation
### How Kezdi.jl works: High-level overview
At its core, Kezdi.jl is a *transpiler*: it takes commands with Stata-like syntax entered by the user and translates them to regular Julia function calls to DataFrames and other modules. Most of the code deals with steps in this transpiling process:

1. scanning and parsing the Julia expressions to be evaluated as Kezdi.jl syntax
2. generating Julia code common for all commands, like subsetting dataframe rows whenever `@if` is used
3. generating Julia code specific to the command called

As an example, let us see how the following Kezdi.jl command becomes Julia code:
```julia
@replace distance = 5 @if distance < 5
```

Julia first expands the `@replace` macro, that is, it calls the macro `replace` defined in `src/macros.jl`. The macro definition is almost identical for all Kezdi.jl macros:
```julia
macro replace(exprs...)
    :replace |> parse(exprs) |> rewrite
end
```
The function `parse`, defined in `src/parse.jl`, consumes all the Julia tokens to the right of `@replace` and returns a `Command` struct with the command name, arguments, and the expression to be evaluated, including any options or `@if` clauses. In this case, we get
```julia
Kezdi.Command(:replace, (:(distance = 5),), :(distance < 5), ())
```
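
To make the shape of this value concrete, here is a minimal sketch in plain Julia. `SketchCommand` and its field names are hypothetical stand-ins chosen for illustration; the actual `Command` struct is defined in `src/structs.jl` and may differ.

```julia
# Hypothetical stand-in for Kezdi's `Command` struct; the field names are assumed.
struct SketchCommand
    name::Symbol
    arguments::Tuple
    condition::Union{Expr, Nothing}
    options::Tuple
end

cmd = SketchCommand(:replace, (:(distance = 5),), :(distance < 5), ())

# The parsed pieces are ordinary Julia expressions that later steps can take apart.
assignment = cmd.arguments[1]
target, value = assignment.args      # left- and right-hand side of `distance = 5`
operator = cmd.condition.args[1]     # the comparison used in the `@if` clause
```

Because arguments and conditions are stored as unevaluated expressions, later steps can inspect and rewrite them before any data is touched.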

This `Command` is then passed on to the `rewrite` function, defined in `src/codegen.jl`. This function dispatches on the first argument of the `Command` struct and first calls the function `generate_command` (also defined in `src/codegen.jl`). This function returns a `GeneratedCommand`. (All structs are defined in `src/structs.jl`.) The `GeneratedCommand` struct contains the

1. name of the DataFrame with which the command was called
2. name of the DataFrame to operate on
    - This will be different from the first if there is an `@if` clause or a `by` option
3. a Julia `quote` block that will run before the command
    - This usually deals with error checking, subsetting rows and grouping the DataFrame
4. name of a function to be run after the command has finished
5. arguments and
6. options as parsed

This `GeneratedCommand` is then consumed by the `rewrite` function, which implements the actual functionality of the command. In this case, the function does runtime error checking (like making sure that the column `distance` exists in the DataFrame), type checking and promotion, and the actual replacement of values. The function returns a Julia `quote` block that will be evaluated by Julia in its next step of compilation (code lowering).

Removing `LineNumberNode`s for brevity, the final Julia code will look like this:
```julia
!("distance" in names(getdf())) && ArgumentError("Column \"distance\" does not exist in $(names(getdf()))") |> throw
begin
    getdf() isa AbstractDataFrame || error("Kezdi.jl commands can only operate on a global DataFrame set by setdf()")
    local var"##386" = copy(getdf())
    local var"##387" = view(var"##386", falses(nrow(var"##386")) .| Missings.replace((var"##386").distance .< 5, false), :)
    function var"##389"(x)
        begin
        end
        x
    end
end
eltype_RHS = if 5 isa AbstractVector
    eltype(5)
else
    typeof(5)
end
eltype_LHS = eltype(var"##386"[.!(falses(nrow(var"##386")) .| Missings.replace((var"##386").distance .< 5, false)), "distance"])
if eltype_RHS != eltype_LHS
    local var"##390" = Vector{promote_type(eltype_LHS, eltype_RHS)}(undef, nrow(var"##386"))
    var"##390"[falses(nrow(var"##386")) .| Missings.replace((var"##386").distance .< 5, false)] .= 5
    var"##390"[.!(falses(nrow(var"##386")) .| Missings.replace((var"##386").distance .< 5, false))] .= var"##386"[.!(falses(nrow(var"##386")) .| Missings.replace((var"##386").distance .< 5, false)), "distance"]
    var"##386"[!, "distance"] = var"##390"
else
    var"##387"[!, "distance"] .= 5
end
(var"##386" |> var"##389") |> setdf
```
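
The type-promotion branch in the generated code above is easier to follow on plain vectors. The following is a minimal sketch of the same idea without DataFrames; the variable names and the value `5.5` are made up for illustration and do not come from Kezdi.jl.

```julia
distance = [-1, 7, 3]        # existing column, Vector{Int64}
new_value = 5.5              # replacement value with a different element type
mask = distance .< 5         # rows selected by the condition

# Allocate a column wide enough to hold both old and new values.
T = promote_type(eltype(distance), typeof(new_value))
promoted = Vector{T}(undef, length(distance))
promoted[mask] .= new_value               # replaced rows
promoted[.!mask] .= distance[.!mask]      # untouched rows keep their old values
```

When the element types already agree, no new vector is needed and the selected rows can be overwritten in place, which corresponds to the `else` branch above.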

Note that macro hygiene dictates the use of temporary variables like `var"##386"` and `var"##387"` to avoid name clashes with the user's code. This is a little hard to debug as a developer, but the generated code will typically not be seen by the end user.
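
As an illustration of where such names come from, here is a small sketch using `gensym` from Base Julia. The macro `@double_into_tmp` is hypothetical and not part of Kezdi.jl; it only shows how a generated temporary stays separate from a user variable of a similar name.

```julia
macro double_into_tmp(ex)
    tmp = gensym()                 # unique symbol that cannot clash with user names
    quote
        local $tmp = $(esc(ex))    # evaluate the user's expression in the caller's scope
        $tmp * 2
    end
end

function demo()
    tmp = 10                       # user variable named `tmp`
    result = @double_into_tmp tmp + 1
    return tmp, result
end

demo()
```

The user's `tmp` is left unchanged because the macro stores its intermediate value under a generated name.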

## Style guide
70 changes: 59 additions & 11 deletions docs/src/index.md
@@ -9,7 +9,7 @@ It imports and reexports [CSV](https://csv.juliadata.org/stable/), [DataFrames](

## Getting started
!!! warning "Kezdi.jl is in beta"
`Kezdi.jl` is currently in beta. We have more than 300 unit tests and a large code coverage. [![Coverage](https://codecov.io/gh/codedthinking/Kezdi.jl/branch/main/graph/badge.svg)](https://codecov.io/gh/codedthinking/Kezdi.jl) The package, however, is not guaranteed to be bug-free. If you encounter any issues, please report them as a [GitHub issue](https://github.com/codedthinking/Kezdi.jl/issues/new).
`Kezdi.jl` is currently in beta. We have more than 400 unit tests and a large code coverage. [![Coverage](https://codecov.io/gh/codedthinking/Kezdi.jl/branch/main/graph/badge.svg)](https://codecov.io/gh/codedthinking/Kezdi.jl) The package, however, is not guaranteed to be bug-free. If you encounter any issues, please report them as a [GitHub issue](https://github.com/codedthinking/Kezdi.jl/issues/new).

If you would like to receive updates on the package, please star the repository on GitHub and sign up for [email notifications here](https://relentless-producer-1210.ck.page/62d7ebb237).

@@ -24,13 +24,11 @@ using Pkg; Pkg.add("Kezdi")
Every Kezdi.jl command is a macro that begins with `@`. These commands operate on a global `DataFrame` that is set using the `setdf` function. Alternatively, commands can be executed within a `@with` block that sets the `DataFrame` for the duration of the block.

### Example
```@setup mtcars
```@repl mtcars
using Kezdi
using RDatasets
df = dataset("datasets", "mtcars")
```
```@repl mtcars
setdf(df)
@rename HP Horsepower
@@ -104,6 +102,10 @@ end
setdf
```

```@docs
@use
```

```@docs
getdf
```
@@ -124,6 +126,10 @@ getdf
@tail
```

```@docs
@clear
```

### Filtering columns and rows
```@docs
@keep
@@ -221,17 +227,39 @@ Column names of the data frame can be used directly in the commands without the
!!! danger "Reserved words cannot be used as variable names"
Julia reserved words, like `begin`, `export`, `function` and standard types like `String`, `Int`, `Float64`, etc., cannot be used as variable names in Kezdi.jl. If you have a column with a reserved word, rename it *before* passing it to Kezdi.jl.

If you want to avoid variable name substitution, you currently have two workarounds. One is to refer to the fully qualified name of the variable, including the module. The other is to define a constant function.

```julia
df = DataFrame(x = 1:2, y = 3:4)
x = 5
y() = 6
@with df begin
    @generate x1 = x
    @generate x2 = Main.x
    @generate y1 = y
    @generate y2 = y()
end
```
results in
```julia
2×6 DataFrame
 Row │ x      y      x1     x2     y1     y2
     │ Int64  Int64  Int64  Int64  Int64  Int64
─────┼──────────────────────────────────────────
   1 │     1      3      1      5      3      6
   2 │     2      4      2      5      4      6
```

### Automatic vectorization
All functions are automatically vectorized, so there is no need to use the `.` operator to broadcast functions over elements of a column.

```julia
@generate logHP = log(Horsepower)
```

If you want to turn off automatic vectorization, use the convenience function [`DNV`](@ref) ("do not vectorize").

If you want to turn off automatic vectorization, use the `~` symbol:
```julia
@generate logHP = DNV(log(Horsepower))
@generate logHP = ~log(Horsepower)
```

The exception is when the function operates on Vectors, in which case Kezdi.jl understands you want to apply the function to the entire column.
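
In plain Julia terms, the distinction is roughly the one between a broadcast call and a call on the whole vector. The sketch below does not use Kezdi.jl; the horsepower values are made up, and `geometric_mean` mirrors the README example.

```julia
horsepower = [110.0, 93.0, 175.0]

# A scalar function is applied element by element, as if written with the dot syntax.
log_hp = log.(horsepower)

# A function declared on a Vector receives the whole column at once.
geometric_mean(x::Vector) = exp(sum(log.(x)) / length(x))
gm = geometric_mean(horsepower)
```

The first call yields one value per row, the second a single summary value for the column.
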
@@ -328,6 +356,26 @@ julia> @with DataFrame(x = [1, 2, missing, 4]) begin
```
returns `[1, 2]`.

### Use `cond` instead of ternary operators
Ternary operators like `x ? y : z` are not vectorized in Julia. Instead, use the `cond` function, which provides the exact same functionality.

```julia
@with DataFrame(x = [1, 2, 3, 4]) begin
    @generate y = cond(x <= 2, 1, 0)
end
```

Note that you can achieve the same result with the more readable code
```julia
@with DataFrame(x = [1, 2, 3, 4]) begin
    @generate y = 1 @if x <= 2
    @replace y = 0 @if x > 2
end
```

!!! warning "`cond` may not work as you expect with missing values"
Because `cond` is vectorized and vectorized functions ignore missing values, this may lead to unexpected behavior. Use `@replace @if` instead.
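
For comparison, this is how the same column could be computed in plain Julia without Kezdi.jl. The sketch uses Base's `ifelse` as an analogue; it is not a statement about how `cond` is implemented.

```julia
x = [1, 2, 3, 4]

# A ternary needs a single Bool, so it has to be applied one element at a time.
y_ternary = [xi <= 2 ? 1 : 0 for xi in x]

# `ifelse` is an ordinary function and can be broadcast over the column.
y_ifelse = ifelse.(x .<= 2, 1, 0)
```

Both expressions produce the same vector; the broadcast form is the one that matches column-wise evaluation.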

### Row-count variables
The variable `_n` refers to the row number in the data frame, `_N` denotes the total number of rows. These can be used in `@if` conditions, as well.

@@ -359,7 +407,7 @@ Unlike Stata, where `egen` and `collapse` have different syntax, Kezdi.jl uses t
```

### Different function names
To maintain compatibility with Julia, we had to rename some functions. For example, `count` is called `rowcount`, `missing` is called `ismissing` in Kezdi.jl.
To maintain compatibility with Julia, we had to rename some functions. For example, `count` is called `rowcount`, `missing` is called `ismissing`, `max` is `maximum`, and `min` is `minimum` in Kezdi.jl.

### Missing values
In Julia, the result of any operation involving a missing value is `missing`. The only exception is the `ismissing` function, which returns `true` if the value is missing and `false` otherwise. You *cannot* check for missing values with `== missing`.
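
A short sketch in plain Julia (no Kezdi.jl needed) shows why `== missing` cannot be used as a test: the comparison itself evaluates to `missing` rather than to `true` or `false`.

```julia
x = missing

sum_result = x + 1             # arithmetic propagates the missing value
comparison = (x == missing)    # the comparison is itself missing, not true
check = ismissing(x)           # the reliable way to test for missingness
fallback = coalesce(x, 0)      # replace a missing value with a default
```
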
@@ -402,15 +450,15 @@ rowcount
```

```@docs
DNV
keep_only_values
```

```@docs
keep_only_values
ismissing
```

```@docs
ismissing
cond
```

## Acknowledgements
5 changes: 3 additions & 2 deletions src/Kezdi.jl
@@ -2,14 +2,15 @@
Kezdi.jl is a Julia package for data manipulation and analysis. It is inspired by Stata, but it is written in Julia, which makes it faster and more flexible. It is designed to be used in the Julia REPL, but it can also be used in Jupyter notebooks or in scripts.
"""
module Kezdi
export @generate, @replace, @egen, @collapse, @keep, @drop, @summarize, @regress, use, @use, @tabulate, @count, @sort, @order, getdf, setdf, @list, @head, @tail, @names, @rename
export @generate, @replace, @egen, @collapse, @keep, @drop, @summarize, @regress, @use, @tabulate, @count, @sort, @order, @list, @head, @tail, @names, @rename, @clear, @describe

export display_and_return, keep_only_values, rowcount, distinct
export getdf, setdf, display_and_return, keep_only_values, rowcount, distinct, cond

using Reexport
using Logging
using InteractiveUtils
using ReadStatTables
using Crayons

@reexport using FreqTables: freqtable
@reexport using FixedEffectModels