This is the code used to produce the results in the paper, published at NoDaLiDa 2021.
This repo contains submodules with the source code for structbilty (bilstm-aux) and MaChAmp (mtp). To clone all three repos (this one plus the two submodules), use the following command:
$ git clone --recurse-submodules https://github.com/kris927b/JobStack.git
Or clone first and initialize the submodules afterwards from inside the repo:
$ git clone https://github.com/kris927b/JobStack.git
$ cd JobStack
$ git submodule update --init --recursive
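To check that both submodules were actually fetched, you can inspect `git submodule status`, which prefixes submodules that have not been initialized with `-`. The helper below is a minimal sketch of that check; the function name `uninitialized_submodules` is illustrative and not part of this repo.

```shell
# Reads `git submodule status` output on stdin and prints the path of every
# submodule that is not initialized (those lines start with "-").
# Hypothetical helper, not shipped with this repo.
uninitialized_submodules() {
    awk '/^-/ { print $2 }'
}

# Usage, from inside the cloned repository:
#   git submodule status | uninitialized_submodules
```

If the helper prints nothing, all submodules are initialized; otherwise rerun `git submodule update --init --recursive`.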
You can get the data set produced as part of the paper by filling out this form: https://forms.gle/ZnTSdCn2AmTYimys9.
We provide a Docker image to run the code in an isolated environment with all the right dependencies.
To run the Docker image you need the following:
- nvidia driver (418+),
- Docker (19.03+),
- nvidia-container-toolkit.
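Before starting the image, it can help to confirm that the tools are on your `PATH` and to print their versions for comparison with the minimums above. This is only a sketch: the helper `missing_commands` is hypothetical, and it does not verify that nvidia-container-toolkit is installed.

```shell
# Print every command from the argument list that is not found on PATH.
# Hypothetical helper, not shipped with this repo.
missing_commands() {
    for cmd in "$@"; do
        command -v "$cmd" >/dev/null 2>&1 || printf '%s\n' "$cmd"
    done
}

missing=$(missing_commands docker nvidia-smi)
if [ -n "$missing" ]; then
    echo "Not found on PATH:" $missing
else
    # Compare these versions against the requirements listed above.
    docker --version
    nvidia-smi --query-gpu=driver_version --format=csv,noheader \
        || echo "nvidia-smi could not query the driver"
fi
```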
To run the experiments for Bilty...
To run Bilty with static transformer embeddings, you first have to add the embeddings to the data files. To do so, run either of the following commands, depending on your preferred BERT model.
MTL_SET = JobStack | conll | i2b2
# BERT Base
$ bash scripts/bilty.embeddings.bert.sh $MTL_SET $PATH_TO_DATA
# BERT Overflow
$ bash scripts/bilty.embeddings.overflow.sh $MTL_SET $PATH_TO_DATA
Where $MTL_SET is the name of the data set you want to convert, i.e. JobStack, conll or i2b2.
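If you need embeddings for all three data sets, the command can be wrapped in a loop. The sketch below assumes the data lives in `data` unless `PATH_TO_DATA` is set; the `run` and `create_embeddings` helpers and the `DRY_RUN` switch are illustrative and not part of this repo. With `DRY_RUN=1` (the default here) the commands are only printed, so you can inspect them first.

```shell
PATH_TO_DATA=${PATH_TO_DATA:-data}   # assumed default location
DRY_RUN=${DRY_RUN:-1}                # 1: print commands, 0: execute them

# Print the command in dry-run mode, execute it otherwise.
run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# $1: bert | overflow (selects scripts/bilty.embeddings.$1.sh)
create_embeddings() {
    for MTL_SET in JobStack conll i2b2; do
        run bash "scripts/bilty.embeddings.$1.sh" "$MTL_SET" "$PATH_TO_DATA"
    done
}

create_embeddings bert
```

Set `DRY_RUN=0` to actually run the scripts, and pass `overflow` instead of `bert` for BERT Overflow.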
To train and test Bilty with only JobStack, there are two scenarios.
To train embeddings from scratch, with or without transformer embeddings, use one of these three commands:
### Only JobStack ###
CRF = crf | nocrf
# No transformer embeddings
$ bash scripts/bilty.finetune.vanilla.sh $CRF $PATH_TO_DATA $PATH_TO_MODELS $PATH_TO_PREDS $PATH_TO_LOGS
# BERT Base embeddings
$ bash scripts/bilty.finetune.bert.sh $CRF $PATH_TO_DATA $PATH_TO_MODELS $PATH_TO_PREDS $PATH_TO_LOGS
# BERT Overflow embeddings
$ bash scripts/bilty.finetune.overflow.sh $CRF $PATH_TO_DATA $PATH_TO_MODELS $PATH_TO_PREDS $PATH_TO_LOGS
The last three paths (models, preds and logs) specify where the script saves its output.
Use the CRF parameter to run either with or without a CRF layer.
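The five arguments are easy to mix up, so here is a small sketch that validates the CRF value, creates the three output folders and prints the resulting command. The helper `bilty_finetune_cmd` and the `models`/`preds`/`logs` layout under one output root are assumptions for illustration; the scripts may create these folders themselves.

```shell
# Build the command line for one Bilty fine-tuning script.
# $1: script variant (e.g. vanilla, bert, overflow)
# $2: crf | nocrf
# $3: data folder
# $4: output root (hypothetical layout: models/, preds/, logs/ below it)
bilty_finetune_cmd() {
    variant=$1 crf=$2 data=$3 out=$4
    case $crf in
        crf|nocrf) ;;
        *) echo "CRF must be 'crf' or 'nocrf', got '$crf'" >&2; return 1 ;;
    esac
    mkdir -p "$out/models" "$out/preds" "$out/logs"
    echo bash "scripts/bilty.finetune.$variant.sh" "$crf" "$data" \
        "$out/models" "$out/preds" "$out/logs"
}

# Prints the command for the vanilla setup with a CRF layer.
bilty_finetune_cmd vanilla crf data output
```

The function only prints the command; copy it into your shell to run it. Paths containing spaces would need extra quoting.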
To train Bilty with embeddings initialized from the pre-trained GloVe embeddings, use the following command:
### Only JobStack ###
# Glove embeddings
$ bash scripts/bilty.finetune.glove.sh $CRF $PATH_TO_DATA $PATH_TO_MODELS $PATH_TO_PREDS $PATH_TO_LOGS
NB: During training, the embeddings are updated, just like the vanilla embeddings used above.
To train and test Bilty with Multi Task Learning, use either of the following two commands, one for BERTBase and one for BERTOverflow embeddings.
### Multi Task Learning ###
MTL_SET = conll | i2b2 | both
# BERT Base
$ bash scripts/bilty.MTL.bert.sh $MTL_SET $PATH_TO_DATA $PATH_TO_MODELS $PATH_TO_PREDS $PATH_TO_LOGS
# BERT Overflow
$ bash scripts/bilty.MTL.overflow.sh $MTL_SET $PATH_TO_DATA $PATH_TO_MODELS $PATH_TO_PREDS $PATH_TO_LOGS
NB: Unlike when you created the embedding files, MTL_SET can here be conll, i2b2 or both.
To run the experiments using MaChAmp, use any of the following six commands.
To train and test MaChAmp on only JobStack, use one of these two commands. The first uses BERTBase as its transformer model, the second BERTOverflow. Both fine-tune the transformer model during training. To use a CRF, specify it in the first parameter.
### Only JobStack ###
CRF = crf | nocrf
# BERT Base
$ bash scripts/mtp.finetune.bert.sh $CRF $PATH_TO_DATA $PATH_TO_CONFIG $PATH_TO_LOGS
# BERT Overflow
$ bash scripts/mtp.finetune.overflow.sh $CRF $PATH_TO_DATA $PATH_TO_CONFIG $PATH_TO_LOGS
To train and test MaChAmp using Multi Task Learning, use either of the two following commands.
Again, one uses BERTBase and the other BERTOverflow.
To select which data sets to use alongside JobStack, pass "conll", "i2b2" or "both" in place of MTL_DATA_SET.
### Multi Task Learning ###
MTL_DATA_SET = conll | i2b2 | both
# BERT Base
$ bash scripts/mtp.MTL.bert.sh $MTL_DATA_SET $PATH_TO_DATA $PATH_TO_CONFIG $PATH_TO_LOGS
# BERT Overflow
$ bash scripts/mtp.MTL.overflow.sh $MTL_DATA_SET $PATH_TO_DATA $PATH_TO_CONFIG $PATH_TO_LOGS
The last experiment is masked language modeling. To perform this experiment you have to get the data by filling out the form; this is also described in data/README.md.
Apart from that, run the experiments as shown below.
### Masked Language Modeling ###
# BERT Base
$ bash scripts/mtp.mlm.bert.sh $PATH_TO_DATA $PATH_TO_CONFIG $PATH_TO_LOGS
# BERT Overflow
$ bash scripts/mtp.mlm.overflow.sh $PATH_TO_DATA $PATH_TO_CONFIG $PATH_TO_LOGS
The data folder pointed to by $PATH_TO_DATA needs to have the structure described in data/README.md.
If you are using this code or the accompanying data, please cite the paper:
@inproceedings{jensen-et-al-2021,
title = "De-identification of Privacy-related Entities in Job Postings",
author = "Jensen, {Kristian N{\o}rgaard} and Mike Zhang and Barbara Plank",
year = "2021",
month = mar,
day = "22",
language = "English",
booktitle = "Proceedings of the 23rd Nordic Conference on Computational Linguistics",
publisher = "Association for Computational Linguistics",
address = "United States",
note = "NoDaLiDa 2021 ; Conference date: 31-05-2021",
}