Merge branch 'master' into dev
MegamindHenry committed Jun 20, 2021
2 parents b1c26ae + 5cac3d2 commit 38526d4
Showing 9 changed files with 303 additions and 25 deletions.
2 changes: 1 addition & 1 deletion Project.toml
@@ -1,7 +1,7 @@
name = "JudiLing"
uuid = "b43a184b-0e9d-488b-813a-80fd5dbc9fd8"
authors = ["Xuefeng Luo"]
version = "0.4.10"
version = "0.5.2"

[deps]
BenchmarkTools = "6e4b80f9-dd63-53aa-95a3-0cdb28fa8baf"
90 changes: 90 additions & 0 deletions README_dev.md
@@ -0,0 +1,90 @@
# JudiLing README for future developers

Author: Xuefeng Luo
Email: [email protected]
Date: June 04, 2021


## Project Structure

```bash
├── docs
│ ├── build: not uploaded to GitHub, auto-generated by Documenter.jl
│ │ └── ...
│ ├── src: the source code of documentation
│ │ ├── man: manual pages
│ │ │ └── ...: all manual pages
│ │ └── index.md: the index page
│ ├── make.jl: the script to generate documentation from src
│ ├── Project.toml
│ ├── Manifest.toml
│ └── pdf_make.jl: the script to generate LaTeX-format documentation
├── examples: example scripts
│ └── ...
├── src: the source code
│ ├── JudiLing.jl: the entry-point module script
│ └── ...: all other source code scripts
├── test: the test source code
│ ├── data: data for all tests
│ ├── runtests.jl: the entry-point test script
│ └── ...: all other test source code scripts
├── thesis
│ └── (thesis).pdf
├── .gitignore
├── LICENSE
├── Manifest.toml
├── Project.toml
├── README_dev.md
└── README.md
```

## Test, Dev and Production environments
In total, there are three environments: the production environment, the development environment and the under-construction environment.
The production environment is tagged with version numbers. It should be thoroughly tested, carefully maintained and bug-free (well, at least as close as we can get). Users normally install it through `add JudiLing` in pkg mode.
The development environment is the master branch on GitHub. It may contain bugs, but it is ready to be tested by users.
The under-construction environment is any other branch that has not been merged into master yet. These branches serve as checkpoints during development.

## Development

### deploy dev environment locally

Usually, directly cloning JudiLing from GitHub and writing code there is not convenient, because the clone is hard to add to a Julia environment. A better workflow is to:

1. create a working dir, for example `judiling_dev`
2. add JudiLing in the environment, `add JudiLing`
3. clone the package locally with `develop --local JudiLing`
4. to update the package, go to `dev/JudiLing` and run `git pull`
5. write code in `dev/JudiLing` and test it through scripts in `judiling_dev`

Please see more details in:

https://docs.julialang.org/en/v1/stdlib/Pkg/#Pkg
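The steps above can be sketched with the `Pkg` API instead of pkg mode. This is a sketch under an assumption: the `shared = false` keyword of `Pkg.develop` is what mirrors pkg-mode `develop --local`; check the Pkg documentation linked above before relying on it.

```julia
using Pkg

# 1. create and activate a working environment, e.g. judiling_dev
Pkg.activate("judiling_dev")

# 2. add the registered package
Pkg.add("JudiLing")

# 3. clone the package into ./dev (pkg-mode `develop --local JudiLing`)
Pkg.develop("JudiLing"; shared = false)
```

After this, steps 4 and 5 are just `git pull` inside `dev/JudiLing` and editing the code there while testing with scripts in `judiling_dev`.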


### new functions
To develop new functions, please make sure to write the corresponding docs and tests for those functions.
To test the package, run `test JudiLing` in pkg mode ("]").
To build the docs, run `make.jl` in the docs dir and verify that the docs look good. On the production side, docs are auto-generated by a GitHub Action after CI. See `.github/workflows/ci.yml` for more details.
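A minimal sketch of the local docs build, assuming the `docs/Project.toml` shown in the project structure provides the doc dependencies:

```julia
# run from the repository root
using Pkg
Pkg.activate("docs")                  # use docs/Project.toml
Pkg.instantiate()                     # install Documenter.jl and friends
include(joinpath("docs", "make.jl"))  # writes docs/build/
```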

### update new version
1. update the version number in `Project.toml` (please make sure it is the `Project.toml` located in the root dir of JudiLing, not somewhere else)
2. test the functions and docs
3. push to GitHub and wait for the CI script to complete
4. if CI does not pass, fix the bugs and repeat steps 2 and 3
5. if it passes, post a new comment under the issue "register" with "@JuliaRegistrator register", then wait for the Julia registry checks to pass
6. if the registry checks do not pass (this rarely happens), follow the instructions in the Julia registry thread.

### test_combo function
The `test_combo` function is a huge monster that takes almost 50 parameters. Its workflow is:

1. prepare datasets in four different modes
2. create C matrices
3. create S matrices
4. calculate/learn F matrices
5. calculate/learn G matrices
6. predict Shat
7. predict Chat
8. learn_paths
9. build_paths
10. evaluate
11. output
2 changes: 2 additions & 0 deletions docs/src/index.md
@@ -548,6 +548,8 @@ In general, `test_combo` function will perform the following operations:
- evaluate results
- save outputs

You can download the datasets you need for the following demos: [french.csv](https://osf.io/b3mju/download), [estonian_train.csv](https://osf.io/3xvp4/download) and [estonian_val.csv](https://osf.io/zqt2c/download).

### Split mode
The `test_combo` function provides four split modes. `:train_only` gives the opportunity to evaluate the model with only training data or partial training data. `data_path` is the path to the CSV file and `data_output_dir` is the directory for storing the training and validation datasets for future analysis.
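As an illustration only, a `:train_only` call might look like the sketch below. Apart from `data_path` and `data_output_dir`, everything here (including passing the split mode as the first positional argument) is an assumption, so check the full `test_combo` reference before copying it.

```julia
using JudiLing

# hypothetical sketch, not a verified signature
JudiLing.test_combo(
    :train_only,                                       # split mode (assumption)
    data_path = joinpath(@__DIR__, "french.csv"),      # CSV from the demo downloads
    data_output_dir = joinpath(@__DIR__, "data_out"),  # where train/val splits are stored
)
```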

27 changes: 23 additions & 4 deletions src/find_path.jl
@@ -82,6 +82,8 @@ word, which n-grams are best supported for a given position in the sequence of n
- `sparse_ratio::Float64=0.05`: the ratio to decide whether a matrix is sparse
- `if_pca::Bool=false`: turn on to enable pca mode
- `pca_eval_M::Matrix=nothing`: pass original F for pca mode
- `activation::Function=nothing`: the activation function you want to pass
- `ignore_nan::Bool=true`: whether to ignore NaN when comparing correlations, otherwise NaN will be selected as the max correlation value
- `check_threshold_stat::Bool=false`: if true, return a threshold and tolerance proportion for each timestep
- `verbose::Bool=false`: if true, more information is printed
@@ -233,6 +235,8 @@ function learn_paths(
sparse_ratio = 0.05,
if_pca = false,
pca_eval_M = nothing,
activation = nothing,
ignore_nan = true,
check_threshold_stat = false,
verbose = false
)
@@ -304,9 +308,11 @@ function learn_paths(
if is_truly_sparse(Ythat_val, verbose = verbose)
Ythat_val = sparse(Ythat_val)
end
# Ythat = sparse(Ythat)
# verbose && println("Sparsity of Ythat: $(length(Ythat.nzval)/Ythat.m/Ythat.n)")

# apply activation to Ythat
if !isnothing(activation)
Ythat_val = activation.(Ythat_val)
end
# collect supports for gold path each timestep
if check_gold_path && !isnothing(gold_ind)
for j = 1:size(data_val, 1)
@@ -458,7 +464,7 @@ function learn_paths(

verbose && println("Evaluating paths...")
res =
eval_can(res, S_val, F_train, i2f, max_can, if_pca, pca_eval_M, verbose)
eval_can(res, S_val, F_train, i2f, max_can, if_pca, pca_eval_M, ignore_nan, verbose)

# initialize gpi
if check_gold_path
@@ -520,6 +526,8 @@ for users who are very new to JudiLing and the learn_paths function.
- `is_tolerant::Bool=false`: if true, select a specified number (given by `max_tolerance`) of n-grams whose supports are below threshold but above a second tolerance threshold to be added to the path
- `tolerance::Float64=(-1000.0)`: the value set for the second threshold (in tolerant mode) such that if the support for an n-gram is in between this value and the threshold and the max_tolerance number has not been reached, then allow this n-gram to be added to the path
- `max_tolerance::Int64=4`: maximum number of n-grams allowed in a path
- `activation::Function=nothing`: the activation function you want to pass
- `ignore_nan::Bool=true`: whether to ignore NaN when comparing correlations, otherwise NaN will be selected as the max correlation value
- `verbose::Bool=false`: if true, more information is printed
# Examples
@@ -539,6 +547,8 @@ function learn_paths(
is_tolerant = false,
tolerance = (-1000.0),
max_tolerance = 3,
activation = nothing,
ignore_nan = true,
verbose = true)

max_t = JudiLing.cal_max_timestep(data, cue_obj.target_col,
@@ -568,6 +578,8 @@ function learn_paths(
sep_token = cue_obj.sep_token,
keep_sep = cue_obj.keep_sep,
target_col = cue_obj.target_col,
activation = activation,
ignore_nan = ignore_nan,
verbose = verbose,
)
end
@@ -675,6 +687,7 @@ function build_paths(
start_end_token = "#",
if_pca = false,
pca_eval_M = nothing,
ignore_nan = true,
verbose = false,
)
# initialize queues for storing paths
@@ -775,7 +788,7 @@ end
end

verbose && println("Evaluating paths...")
eval_can(res, S_val, F_train, i2f, max_can, if_pca, pca_eval_M, verbose)
eval_can(res, S_val, F_train, i2f, max_can, if_pca, pca_eval_M, ignore_nan, verbose)
end

"""
@@ -793,6 +806,7 @@ function eval_can(
max_can,
if_pca,
pca_eval_M,
ignore_nan = true,
verbose = false,
)

@@ -819,6 +833,11 @@ push!(res, Result_Path_Info_Struct(ci, n, Scor))
push!(res, Result_Path_Info_Struct(ci, n, Scor))
end
end

if ignore_nan
res = filter(x -> !isnan(x.support), res)
end

# we collect only top x candidates from the top
res_l[i] = collect(Iterators.take(
sort!(res, by = x -> x.support, rev = true),
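The effect of the new `ignore_nan` filter in `eval_can` can be illustrated standalone. `PathInfo` below is a minimal stand-in (an assumption) for `Result_Path_Info_Struct`, which only needs a `support` field for this purpose:

```julia
# Minimal stand-in for Result_Path_Info_Struct (assumption: only `support` matters).
struct PathInfo
    support::Float64
end

res = [PathInfo(0.9), PathInfo(NaN), PathInfo(0.5)]

# Without the filter, rev-sorting would rank NaN first, because Julia's
# isless treats NaN as larger than any other Float64.
res = filter(x -> !isnan(x.support), res)
sort!(res, by = x -> x.support, rev = true)

supports = [r.support for r in res]  # [0.9, 0.5]
```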
136 changes: 132 additions & 4 deletions src/output.jl
@@ -12,6 +12,12 @@ that is optionally returned as second output result.
"""
function write2df end

"""
Write comprehension evaluation into a CSV file, including target and predicted
ids and identifiers and their correlations.
"""
function write_comprehension_eval end

"""
write2csv(res, data, cue_obj_train, cue_obj_val, filename)
@@ -489,10 +495,10 @@ Save S matrix into a csv file.
# Examples
```julia
JudiLing.save_S_matrix(S, joinpath(@__DIR__, "L.csv"), latin, :Word)
JudiLing.save_S_matrix(S, joinpath(@__DIR__, "S.csv"), latin, :Word)
```
"""
function save_S_matrix(S, filename, data, target_col; sep=" ")
function save_S_matrix(S, filename, data, target_col; sep = " ")

S_df = DataFrame(S)
insertcols!(S_df, 1, :col_names => data[:,target_col])
@@ -514,14 +520,136 @@ Load S matrix from a csv file.
# Examples
```julia
JudiLing.load_S_matrix(joinpath(@__DIR__, "L.csv"))
JudiLing.load_S_matrix(joinpath(@__DIR__, "S.csv"))
```
"""
function load_S_matrix(filename; header = false, sep=" ")
function load_S_matrix(filename; header = false, sep = " ")

S_df = DataFrame(CSV.File(filename, header = header, delim = sep))
words = S_df[:, 1]
S = Array(select(S_df, Not(1)))

S, words
end

"""
write_comprehension_eval(SChat, SC, data, target_col, filename)
Write comprehension evaluation into a CSV file, including target and predicted
ids and identifiers and their correlations.
# Obligatory Arguments
- `SChat::Matrix`: the Shat/Chat matrix
- `SC::Matrix`: the S/C matrix
- `data::DataFrame`: the data
- `target_col::Symbol`: the name of target column
- `filename::String`: the filename/filepath
# Optional Arguments
- `k`: top k candidates
- `root_dir::String="."`: dir path for project root dir
- `output_dir::String="."`: output dir inside root dir
# Examples
```julia
JudiLing.write_comprehension_eval(Chat, cue_obj.C, latin, :Word, "output.csv",
k=10, root_dir=@__DIR__, output_dir="out")
```
"""
function write_comprehension_eval(
SChat,
SC,
data,
target_col,
filename;
k = 10,
root_dir = ".",
output_dir = "."
)

output_path = joinpath(root_dir, output_dir)
mkpath(output_path)
total = size(SChat, 1)

io = open(joinpath(output_path, filename), "w")

write(
io,
"\"target_utterance\",\"target_identifier\",\"predicted_utterance\",\"predicted_identifier\",\"support\"\n",
)

rSC = cor(
convert(Matrix{Float64}, SChat),
convert(Matrix{Float64}, SC),
dims = 2,
)

for i = 1:total
p = sortperm(rSC[i, :], rev = true)
p = p[1:k]
for j in p
write(
io,
"\"$i\",\"$(data[i,target_col])\",\"$j\",\"$(data[j,target_col])\",\"$(rSC[i,j])\"\n",
)
end
end

close(io)
end

"""
write_comprehension_eval(SChat, SC, SC_rest, data, data_rest, target_col, filename)
Write comprehension evaluation into a CSV file for both training and validation
datasets, including target and predicted ids and identifiers and their
correlations.
# Obligatory Arguments
- `SChat::Matrix`: the Shat/Chat matrix
- `SC::Matrix`: the S/C matrix
- `SC_rest::Matrix`: the rest S/C matrix
- `data::DataFrame`: the data
- `data_rest::DataFrame`: the rest data
- `target_col::Symbol`: the name of target column
- `filename::String`: the filename/filepath
# Optional Arguments
- `k`: top k candidates
- `root_dir::String="."`: dir path for project root dir
- `output_dir::String="."`: output dir inside root dir
# Examples
```julia
JudiLing.write_comprehension_eval(Shat_val, S_val, S_train, latin_val, latin_train,
:Word, "all_output.csv", k=10, root_dir=@__DIR__, output_dir="out")
```
"""
function write_comprehension_eval(
SChat,
SC,
SC_rest,
data,
data_rest,
target_col,
filename;
k = 10,
root_dir = ".",
output_dir = "."
)

data_combined = copy(data)
append!(data_combined, data_rest)

write_comprehension_eval(
SChat,
vcat(SC, SC_rest),
data_combined,
target_col,
filename,
k = k,
root_dir = root_dir,
output_dir = output_dir
)

end
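The core of `write_comprehension_eval` — row-wise correlations between the predicted and target matrices, followed by top-k selection — can be sketched in isolation (toy matrices, not real S/Shat data):

```julia
using Statistics

# Toy matrices: rows are items, columns are semantic dimensions.
SChat = [1.0 2.0 3.0;
         3.0 2.0 1.0]
SC    = [1.0 2.0 3.0;
         3.0 2.0 1.0]

# dims = 2 treats rows as the units to correlate:
# rSC[i, j] is the correlation of predicted row i with target row j.
rSC = cor(SChat, SC, dims = 2)

k = 2
p = sortperm(rSC[1, :], rev = true)[1:k]  # top-k candidates for item 1
```

Here row 1 of `SChat` correlates perfectly with target row 1 and negatively with row 2, so `p == [1, 2]`.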
