Merge branch 'master' into dev
MegamindHenry committed Jun 20, 2021
2 parents b1c26ae + 5cac3d2 commit 38526d4
Showing 9 changed files with 303 additions and 25 deletions.
2 changes: 1 addition & 1 deletion Project.toml
@@ -1,7 +1,7 @@
name = "JudiLing"
uuid = "b43a184b-0e9d-488b-813a-80fd5dbc9fd8"
authors = ["Xuefeng Luo"]
version = "0.4.10"
version = "0.5.2"

[deps]
BenchmarkTools = "6e4b80f9-dd63-53aa-95a3-0cdb28fa8baf"
90 changes: 90 additions & 0 deletions README_dev.md
@@ -0,0 +1,90 @@
# JudiLing README for future developers

Author: Xuefeng Luo
Email: [email protected]
Date: June 04, 2021


## Project Structure

```bash
├── docs
│ ├── build: not uploaded to GitHub, auto-generated by Documenter.jl
│ │ └── ...
│ ├── src: the source code of documentation
│ │ ├── man: manual pages
│ │ │ └── ...: all manual pages
│ │ └── index.md: the index page
│ ├── make.jl: the script to generate documentation from src
│ ├── Project.toml
│ ├── Manifest.toml
│ └── pdf_make.jl: the script to generate LaTeX-format documentation
├── examples: example scripts
│ └── ...
├── src: the source code
│ ├── JudiLing.jl: the entry-point module script
│ └── ...: all other source code scripts
├── test: the test source code
│ ├── data: data for all tests
│ ├── runtests.jl: the entry-point test script
│ └── ...: all other test source code scripts
├── thesis
│ └── (thesis).pdf
├── .gitignore
├── LICENSE
├── Manifest.toml
├── Project.toml
├── README_dev.md
└── README.md
```

## Test, Dev and Production environments
In total, there are three environments: the production environment, the development environment and the under-construction environment.
The production environment is tagged with version numbers. It should be thoroughly tested, carefully maintained and bug-free (well, at least as close as we can get). Users normally install it through `add JudiLing` in pkg mode.
The development environment is the master branch on GitHub. It may contain bugs, but it is ready to be tested by users.
The under-construction environment is any other branch that has not been merged into master yet. These branches serve as checkpoints during development.

## Development

### deploy dev environment locally

Usually, directly cloning JudiLing from GitHub and writing code there is not convenient, because the clone is hard to add to a Julia environment. A better workflow is to:

1. create a working dir, for example `judiling_dev`
2. add JudiLing in the environment, `add JudiLing`
3. clone the package locally with `develop --local JudiLing`
4. to update the package, go to `dev/JudiLing` and run `git pull`
5. write code in `dev/JudiLing` and test it through scripts in `judiling_dev`

Please see more details in:

https://docs.julialang.org/en/v1/stdlib/Pkg/#Pkg
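The steps above can be sketched with the `Pkg` API instead of pkg mode. This is a sketch under an assumption: the `shared = false` keyword of `Pkg.develop` is what mirrors pkg-mode `develop --local`; check the Pkg documentation linked above before relying on it.

```julia
using Pkg

# 1. create and activate a working environment, e.g. judiling_dev
Pkg.activate("judiling_dev")

# 2. add the registered package
Pkg.add("JudiLing")

# 3. clone the package into ./dev (pkg-mode `develop --local JudiLing`)
Pkg.develop("JudiLing"; shared = false)
```

After this, steps 4 and 5 are just `git pull` inside `dev/JudiLing` and editing the code there while testing with scripts in `judiling_dev`.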


### new functions
To develop new functions, please make sure to write the corresponding docs and tests for those functions.
To test the package, run `test JudiLing` in pkg mode ("]").
To build the docs, run `make.jl` in the docs dir and verify that the docs look good. On the production side, docs are auto-generated by a GitHub Action after CI. See `.github/workflows/ci.yml` for more details.
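A minimal sketch of the local docs build, assuming the `docs/Project.toml` shown in the project structure provides the doc dependencies:

```julia
# run from the repository root
using Pkg
Pkg.activate("docs")                  # use docs/Project.toml
Pkg.instantiate()                     # install Documenter.jl and friends
include(joinpath("docs", "make.jl"))  # writes docs/build/
```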

### update new version
1. update the version number in `Project.toml` (please make sure it is the `Project.toml` located in the root dir of JudiLing, not somewhere else)
2. test the functions and docs
3. push to GitHub and wait for the CI script to complete
4. if CI does not pass, fix the bugs and repeat steps 2 and 3
5. if it passes, post a new comment under the issue "register" with "@JuliaRegistrator register", then wait for the Julia registry checks to pass
6. if the registry checks do not pass (this rarely happens), follow the instructions in the Julia registry thread.

### test_combo function
The `test_combo` function is a huge monster that takes almost 50 parameters. Its workflow is:

1. prepare datasets in four different modes
2. create C matrices
3. create S matrices
4. calculate/learn F matrices
5. calculate/learn G matrices
6. predict Shat
7. predict Chat
8. learn_paths
9. build_paths
10. evaluate
11. output
2 changes: 2 additions & 0 deletions docs/src/index.md
@@ -548,6 +548,8 @@ In general, `test_combo` function will perform the following operations:
- evaluate results
- save outputs

You can download the datasets you need for the following demos: [french.csv](https://osf.io/b3mju/download), [estonian_train.csv](https://osf.io/3xvp4/download) and [estonian_val.csv](https://osf.io/zqt2c/download).

### Split mode
The `test_combo` function provides four split modes. `:train_only` gives the opportunity to evaluate the model with only training data or partial training data. `data_path` is the path to the CSV file and `data_output_dir` is the directory for storing the training and validation datasets for future analysis.
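As an illustration only, a `:train_only` call might look like the sketch below. Apart from `data_path` and `data_output_dir`, everything here (including passing the split mode as the first positional argument) is an assumption, so check the full `test_combo` reference before copying it.

```julia
using JudiLing

# hypothetical sketch, not a verified signature
JudiLing.test_combo(
    :train_only,                                       # split mode (assumption)
    data_path = joinpath(@__DIR__, "french.csv"),      # CSV from the demo downloads
    data_output_dir = joinpath(@__DIR__, "data_out"),  # where train/val splits are stored
)
```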

27 changes: 23 additions & 4 deletions src/find_path.jl
@@ -82,6 +82,8 @@ word, which n-grams are best supported for a given position in the sequence of n
- `sparse_ratio::Float64=0.05`: the ratio to decide whether a matrix is sparse
- `if_pca::Bool=false`: turn on to enable pca mode
- `pca_eval_M::Matrix=nothing`: pass original F for pca mode
- `activation::Function=nothing`: the activation function you want to pass
- `ignore_nan::Bool=true`: whether to ignore NaN when comparing correlations, otherwise NaN will be selected as the max correlation value
- `check_threshold_stat::Bool=false`: if true, return a threshold and tolerance proportion for each timestep
- `verbose::Bool=false`: if true, more information is printed
@@ -233,6 +235,8 @@ function learn_paths(
sparse_ratio = 0.05,
if_pca = false,
pca_eval_M = nothing,
activation = nothing,
ignore_nan = true,
check_threshold_stat = false,
verbose = false
)
@@ -304,9 +308,11 @@ function learn_paths(
if is_truly_sparse(Ythat_val, verbose = verbose)
Ythat_val = sparse(Ythat_val)
end
# Ythat = sparse(Ythat)
# verbose && println("Sparsity of Ythat: $(length(Ythat.nzval)/Ythat.m/Ythat.n)")

# apply activation to Ythat
if !isnothing(activation)
Ythat_val = activation.(Ythat_val)
end
# collect supports for gold path each timestep
if check_gold_path && !isnothing(gold_ind)
for j = 1:size(data_val, 1)
@@ -458,7 +464,7 @@ function learn_paths(

verbose && println("Evaluating paths...")
res =
eval_can(res, S_val, F_train, i2f, max_can, if_pca, pca_eval_M, verbose)
eval_can(res, S_val, F_train, i2f, max_can, if_pca, pca_eval_M, ignore_nan, verbose)

# initialize gpi
if check_gold_path
@@ -520,6 +526,8 @@ for users who are very new to JudiLing and the learn_paths function.
- `is_tolerant::Bool=false`: if true, select a specified number (given by `max_tolerance`) of n-grams whose supports are below threshold but above a second tolerance threshold to be added to the path
- `tolerance::Float64=(-1000.0)`: the value set for the second threshold (in tolerant mode) such that if the support for an n-gram is in between this value and the threshold and the max_tolerance number has not been reached, then allow this n-gram to be added to the path
- `max_tolerance::Int64=4`: maximum number of n-grams allowed in a path
- `activation::Function=nothing`: the activation function you want to pass
- `ignore_nan::Bool=true`: whether to ignore NaN when comparing correlations, otherwise NaN will be selected as the max correlation value
- `verbose::Bool=false`: if true, more information is printed
# Examples
@@ -539,6 +547,8 @@ function learn_paths(
is_tolerant = false,
tolerance = (-1000.0),
max_tolerance = 3,
activation = nothing,
ignore_nan = true,
verbose = true)

max_t = JudiLing.cal_max_timestep(data, cue_obj.target_col,
@@ -568,6 +578,8 @@ function learn_paths(
sep_token = cue_obj.sep_token,
keep_sep = cue_obj.keep_sep,
target_col = cue_obj.target_col,
activation = activation,
ignore_nan = ignore_nan,
verbose = verbose,
)
end
@@ -675,6 +687,7 @@ function build_paths(
start_end_token = "#",
if_pca = false,
pca_eval_M = nothing,
ignore_nan = true,
verbose = false,
)
# initialize queues for storing paths
@@ -775,7 +788,7 @@ end
end

verbose && println("Evaluating paths...")
eval_can(res, S_val, F_train, i2f, max_can, if_pca, pca_eval_M, verbose)
eval_can(res, S_val, F_train, i2f, max_can, if_pca, pca_eval_M, ignore_nan, verbose)
end

"""
@@ -793,6 +806,7 @@ function eval_can(
max_can,
if_pca,
pca_eval_M,
ignore_nan = true,
verbose = false,
)

@@ -819,6 +833,11 @@ push!(res, Result_Path_Info_Struct(ci, n, Scor))
push!(res, Result_Path_Info_Struct(ci, n, Scor))
end
end

if ignore_nan
res = filter(x -> !isnan(x.support), res)
end

# we collect only top x candidates from the top
res_l[i] = collect(Iterators.take(
sort!(res, by = x -> x.support, rev = true),
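The effect of the new `ignore_nan` filter in `eval_can` can be illustrated standalone. `PathInfo` below is a minimal stand-in (an assumption) for `Result_Path_Info_Struct`, which only needs a `support` field for this purpose:

```julia
# Minimal stand-in for Result_Path_Info_Struct (assumption: only `support` matters).
struct PathInfo
    support::Float64
end

res = [PathInfo(0.9), PathInfo(NaN), PathInfo(0.5)]

# Without the filter, rev-sorting would rank NaN first, because Julia's
# isless treats NaN as larger than any other Float64.
res = filter(x -> !isnan(x.support), res)
sort!(res, by = x -> x.support, rev = true)

supports = [r.support for r in res]  # [0.9, 0.5]
```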
136 changes: 132 additions & 4 deletions src/output.jl
@@ -12,6 +12,12 @@ that is optionally returned as second output result.
"""
function write2df end

"""
Write comprehension evaluation into a CSV file, including target and predicted
ids and identifiers and their correlations.
"""
function write_comprehension_eval end

"""
write2csv(res, data, cue_obj_train, cue_obj_val, filename)
@@ -489,10 +495,10 @@ Save S matrix into a csv file.
# Examples
```julia
JudiLing.save_S_matrix(S, joinpath(@__DIR__, "L.csv"), latin, :Word)
JudiLing.save_S_matrix(S, joinpath(@__DIR__, "S.csv"), latin, :Word)
```
"""
function save_S_matrix(S, filename, data, target_col; sep=" ")
function save_S_matrix(S, filename, data, target_col; sep = " ")

S_df = DataFrame(S)
insertcols!(S_df, 1, :col_names => data[:,target_col])
@@ -514,14 +520,136 @@ Load S matrix from a csv file.
# Examples
```julia
JudiLing.load_S_matrix(joinpath(@__DIR__, "L.csv"))
JudiLing.load_S_matrix(joinpath(@__DIR__, "S.csv"))
```
"""
function load_S_matrix(filename; header = false, sep=" ")
function load_S_matrix(filename; header = false, sep = " ")

S_df = DataFrame(CSV.File(filename, header = header, delim = sep))
words = S_df[:, 1]
S = Array(select(S_df, Not(1)))

S, words
end

"""
write_comprehension_eval(SChat, SC, data, target_col, filename)
Write comprehension evaluation into a CSV file, including target and predicted
ids and identifiers and their correlations.
# Obligatory Arguments
- `SChat::Matrix`: the Shat/Chat matrix
- `SC::Matrix`: the S/C matrix
- `data::DataFrame`: the data
- `target_col::Symbol`: the name of target column
- `filename::String`: the filename/filepath
# Optional Arguments
- `k`: top k candidates
- `root_dir::String="."`: dir path for project root dir
- `output_dir::String="."`: output dir inside root dir
# Examples
```julia
JudiLing.write_comprehension_eval(Chat, cue_obj.C, latin, :Word, "output.csv",
k=10, root_dir=@__DIR__, output_dir="out")
```
"""
function write_comprehension_eval(
SChat,
SC,
data,
target_col,
filename;
k = 10,
root_dir = ".",
output_dir = "."
)

output_path = joinpath(root_dir, output_dir)
mkpath(output_path)
total = size(SChat, 1)

io = open(joinpath(output_path, filename), "w")

write(
io,
"\"target_utterance\",\"target_identifier\",\"predicted_utterance\",\"predicted_identifier\",\"support\"\n",
)

rSC = cor(
convert(Matrix{Float64}, SChat),
convert(Matrix{Float64}, SC),
dims = 2,
)

for i = 1:total
p = sortperm(rSC[i, :], rev = true)
p = p[1:k]
for j in p
write(
io,
"\"$i\",\"$(data[i,target_col])\",\"$j\",\"$(data[j,target_col])\",\"$(rSC[i,j])\"\n",
)
end
end

close(io)
end

"""
write_comprehension_eval(SChat, SC, SC_rest, data, data_rest, target_col, filename)
Write comprehension evaluation into a CSV file for both training and validation
datasets, including target and predicted ids and identifiers and their
correlations.
# Obligatory Arguments
- `SChat::Matrix`: the Shat/Chat matrix
- `SC::Matrix`: the S/C matrix
- `SC_rest::Matrix`: the rest S/C matrix
- `data::DataFrame`: the data
- `data_rest::DataFrame`: the rest data
- `target_col::Symbol`: the name of target column
- `filename::String`: the filename/filepath
# Optional Arguments
- `k`: top k candidates
- `root_dir::String="."`: dir path for project root dir
- `output_dir::String="."`: output dir inside root dir
# Examples
```julia
JudiLing.write_comprehension_eval(Shat_val, S_val, S_train, latin_val, latin_train,
:Word, "all_output.csv", k=10, root_dir=@__DIR__, output_dir="out")
```
"""
function write_comprehension_eval(
SChat,
SC,
SC_rest,
data,
data_rest,
target_col,
filename;
k = 10,
root_dir = ".",
output_dir = "."
)

data_combined = copy(data)
append!(data_combined, data_rest)

write_comprehension_eval(
SChat,
vcat(SC, SC_rest),
data_combined,
target_col,
filename,
k = k,
root_dir = root_dir,
output_dir = output_dir
)

end
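The core of `write_comprehension_eval` — row-wise correlations between the predicted and target matrices, followed by top-k selection — can be sketched in isolation (toy matrices, not real S/Shat data):

```julia
using Statistics

# Toy matrices: rows are items, columns are semantic dimensions.
SChat = [1.0 2.0 3.0;
         3.0 2.0 1.0]
SC    = [1.0 2.0 3.0;
         3.0 2.0 1.0]

# dims = 2 treats rows as the units to correlate:
# rSC[i, j] is the correlation of predicted row i with target row j.
rSC = cor(SChat, SC, dims = 2)

k = 2
p = sortperm(rSC[1, :], rev = true)[1:k]  # top-k candidates for item 1
```

Here row 1 of `SChat` correlates perfectly with target row 1 and negatively with row 2, so `p == [1, 2]`.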
