---
title: "Saving data with Arrow"
engine: julia
julia:
exeflags: ["--project", "--threads=auto"]
---
# The Arrow storage format
The [Arrow storage format](https://arrow.apache.org) provides a language-agnostic storage and memory specification for columnar data tables, which just means "something that looks like a data frame in R". That is, an Arrow table is an ordered, named collection of columns, all of the same length.
The columns can be of different types, including numeric values, character strings, and factor-like representations, called *DictEncoded*.
An Arrow file can be read or written from R, Python, Julia and many other languages. Somewhat confusingly, in R and Python the name `feather`, which refers to an earlier version of the storage format, still appears in some function names such as `read_feather`.
Internally, the SMLP2024 package uses Arrow to store all of its datasets.
# The Emotikon data
The [SMLP2021 repository](https://github.com/RePsychling/SMLP2021) contains a version of the data from @Fuehner2021 in `notebooks/data/fggk21.arrow`.
After that file was created there were changes to the master RDS file on the [osf.io site](https://osf.io/2d8rj/) for the project.
We will recreate the Arrow file here, then split it into two separate tables: one with a row for each child in the study and one with a row for each test result.
The `Arrow` package for Julia does not export any function names, which means that the function to read an Arrow file must be called as `Arrow.Table`.
It returns a *column table*, as described in the [Tables package](https://github.com/JuliaData/Tables.jl).
This is like a read-only data frame, which can be easily converted to a full-fledged `DataFrame` if desired.
This arrangement allows the `Arrow` package to avoid depending on the heavy-weight `DataFrames` package while still making it easy to produce a `DataFrame` when one is wanted.
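As a small, self-contained illustration (using a throwaway table and a temporary file, not the study data), an `Arrow.Table` behaves like a named collection of columns and can be wrapped in a `DataFrame` on demand:
```{julia}
#| eval: false
using Arrow, DataFrames

# write a tiny column table to a temporary file; Arrow.write returns the path
path = Arrow.write(tempname() * ".arrow", (x = 1:3, y = ["a", "b", "c"]))
coltbl = Arrow.Table(path)  # a read-only column table
coltbl.x                    # columns are accessed by name
DataFrame(coltbl)           # convert to a full DataFrame only when needed
```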
Load the packages to be used.
```{julia}
#| code-fold: true
#| output: false
using AlgebraOfGraphics
using Arrow
using CairoMakie
using Chain
using DataFrameMacros
using DataFrames
using Downloads
using KernelDensity
using RCall # access R from within Julia
using StatsBase
CairoMakie.activate!(; type="svg")
using AlgebraOfGraphics: density
```
# Downloading and importing the RDS file
This is similar to some of the code shown by Julius Krumbiegel on Monday.
In the Data directory of the Emotikon project on osf.io, the URL for the RDS data file is found to be <https://osf.io/xawdb/>. Note that we want version 2 of this file.
```{julia}
fn = Downloads.download("https://osf.io/xawdb/download?version=2");
```
```{julia}
dfrm = rcopy(reval("""readRDS("$(fn)")"""))
```
Now write this data frame as an Arrow file and read it back in.
```{julia}
arrowfn = joinpath("data", "fggk21.arrow")
Arrow.write(arrowfn, dfrm; compress=:lz4)
tbl = Arrow.Table(arrowfn)
```
```{julia}
filesize(arrowfn)
```
```{julia}
df = DataFrame(tbl)
```
# Avoiding needless repetition
One of the principles of relational database design is that information should not be repeated needlessly.
Each row of `df` is determined by a combination of `Child` and `Test`, together producing a `score`, which can be converted to a `zScore`.
The other columns in the table, `Cohort`, `School`, `age`, and `Sex`, are properties of the `Child`.
Storing these values redundantly in the full table takes up space but, more importantly, allows for inconsistency.
As it stands, a given `Child` could be recorded as being in one `Cohort` for the `Run` test and in another `Cohort` for the `S20_r` test and nothing about the table would detect this as being an error.
The approach used in relational databases is to store the information for `score` in one table that contains only `Child`, `Test` and `score`, and to store the information about each `Child`, including `Cohort`, `School`, `age` and `Sex`, in another table.
These tables can then be combined into the table used for analysis by *joining* them together.
The maintainers of the `DataFrames` package have put in a lot of work over the past few years to make joins quite efficient in Julia.
Thus the processing penalty of reassembling the big table from the smaller tables is minimal.
The main advantage of building the analysis table from smaller tables that are joined together is that the information in the analysis table is consistent by design.
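As a toy illustration (the values below are made up, not from the study), joining a per-child table onto a per-score table fills in the child-level properties consistently on every row for that child:
```{julia}
#| eval: false
child = DataFrame(Child = ["c1", "c2"], Sex = ["female", "male"])
score = DataFrame(
  Child = ["c1", "c1", "c2"],
  Test = ["Run", "S20_r", "Run"],
  score = [9.3, 4.5, 8.7],
)
leftjoin(score, child; on = :Child)  # child-level columns are repeated by the join
```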
# Creating the smaller tables
```{julia}
Child = unique(select(df, :Child, :School, :Cohort, :Sex, :age))
```
```{julia}
length(unique(Child.Child)) # should be 108295
```
```{julia}
filesize(
Arrow.write("./data/fggk21_Child.arrow", Child; compress=:lz4)
)
```
```{julia}
filesize(
  Arrow.write(
    "./data/fggk21_Score.arrow",
    select(df, :Child, :Test, :score);
    compress=:lz4,
  ),
)
```
::: {.callout-note}
A careful examination of the file sizes versus that of `./data/fggk21.arrow` will show that, because of compression, the separate tables combined take up more space than the original. Compression algorithms are often more successful when applied to larger files.
:::
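To see this for the files written above, compare the size of the original file with the combined size of the two component files (a quick check, not part of the original analysis):
```{julia}
(
  original = filesize("./data/fggk21.arrow"),
  combined = filesize("./data/fggk21_Child.arrow") + filesize("./data/fggk21_Score.arrow")
)
```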
Now read the Arrow tables in and reassemble the original table.
```{julia}
Score = DataFrame(Arrow.Table("./data/fggk21_Score.arrow"))
```
At this point we can create the z-score column by standardizing the scores for each `Test`.
The code to do this follows Julius's presentation on Monday.
```{julia}
@transform!(groupby(Score, :Test), :zScore = @bycol zscore(:score))
```
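The same per-`Test` standardization can also be written without `DataFrameMacros`, using the plain `DataFrames` mini-language (shown only as an alternative spelling, not run here):
```{julia}
#| eval: false
# equivalent: apply zscore to :score within each Test group, storing the result as :zScore
transform!(groupby(Score, :Test), :score => zscore => :zScore)
```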
```{julia}
Child = DataFrame(Arrow.Table("./data/fggk21_Child.arrow"))
```
```{julia}
df1 = disallowmissing!(leftjoin(Score, Child; on=:Child))
```
::: {.callout-note}
The call to `disallowmissing!` is there because the join creates columns that allow for missing values, even though we know the result should not contain any. The call will fail if, for some reason, missing values were in fact created.
:::
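A minimal illustration with toy tables (not the study data) of why the join widens the element types and how `disallowmissing!` narrows them again:
```{julia}
#| eval: false
a = DataFrame(k = [1, 2], x = [10, 20])
b = DataFrame(k = [1, 2], y = ["p", "q"])
j = leftjoin(a, b; on = :k)
eltype(j.y)                    # Union{Missing, String}, even though no value is missing
eltype(disallowmissing!(j).y)  # String
```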
# Discovering patterns in the data
One of the motivations for creating the `Child` table was to be able to bin the ages according to the age of each child, not the age of each `Child-Test` combination.
Not all children have all 5 test results.
We can check the number of results per child by grouping on `:Child` and evaluating the number of rows in each group.
```{julia}
nobsChild = combine(groupby(Score, :Child), nrow => :ntest)
```
Now create a table of the number of children with 1, 2, ..., 5 test scores.
```{julia}
combine(groupby(nobsChild, :ntest), nrow)
```
A natural question at this point is whether there is something distinctive about the children who have only a few test scores.
For example, are they from only a few schools?
One approach to examining properties like these is to add the number of observations for each child to the `Child` table.
Later we can group that table by `:ntest` to look at the properties of the children according to the number of tests they completed.
```{julia}
gdf = groupby(
  disallowmissing!(leftjoin(Child, nobsChild; on=:Child)), :ntest
)
```
Are the sexes represented more-or-less equally?
```{julia}
combine(groupby(first(gdf), :Sex), nrow => :nchild)
```
```{julia}
combine(groupby(last(gdf), :Sex), nrow => :nchild)
```
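The same comparison can be made for all groups at once by tabulating `Sex` against `:ntest` on the parent table of `gdf` (a sketch of one possible follow-up, not part of the original analysis):
```{julia}
sex_by_ntest = combine(groupby(parent(gdf), [:ntest, :Sex]), nrow => :nchild)
unstack(sex_by_ntest, :ntest, :Sex, :nchild)  # one row per ntest, one column per Sex
```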
# Reading Arrow files in other languages
There are Arrow implementations for R (the `arrow` package) and for Python (`pyarrow`).
```{python}
#| eval: false
from pyarrow.feather import read_table
read_table("./data/fggk21.arrow")
```
```{r}
#| eval: false
library(arrow)
read_ipc_file("./data/fggk21.arrow")
```
# References
::: {#refs}
:::