
pangoling

[Badges: Codecov test coverage · Lifecycle: experimental · R-CMD-check · Project Status: WIP (initial development in progress; no stable release yet) · DOI · Status at rOpenSci Software Peer Review]

pangoling^1 is an R package for estimating the log-probabilities of words in a given context using transformer models. The package provides an interface to pre-trained transformer models (such as GPT-2 or BERT) for obtaining word probabilities. These log-probabilities are often used as predictors in psycholinguistic studies, so the package can be useful for researchers in psycholinguistics who want to leverage transformer models in their work.

The package is mostly a wrapper around the Python package transformers that processes data in a format convenient for psycholinguistic research.

Important! Limitations and bias

The training data of the most popular models (such as GPT-2) haven't been released, so they cannot be inspected. It is clear that the data contain a lot of unfiltered content from the internet, which is far from neutral. See, for example, the scope discussed in the OpenAI team's model card for GPT-2 (the same caveats apply to many other models) and the limitations and bias section of GPT-2 on the Hugging Face website.

Installation

There is still no released version of pangoling. The package is in the early stages of development and will probably be subject to changes. To install the latest version from GitHub, use:

# install.packages("remotes") # if needed
remotes::install_github("bnicenboim/pangoling")

The install_py_pangoling() function facilitates the installation of the Python packages needed to use pangoling within an R environment, relying on the reticulate package to manage Python environments. This only needs to be done once:

install_py_pangoling()
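
If the Python dependencies were installed correctly, the transformers module should be importable from R. A minimal sanity check (not part of pangoling's documented workflow; it just calls reticulate directly) could be:

# Check that the Python module backing pangoling can be imported;
# reticulate::py_module_available() returns TRUE or FALSE.
reticulate::py_module_available("transformers")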

Example

This is a basic example which shows you how to get log-probabilities of words in a dataset:

library(pangoling)
library(tidytable) #fast alternative to dplyr

Given a (toy) dataset where sentences are organized with one word or short phrase in each row:

sentences <- c("The apple doesn't fall far from the tree.", 
               "Don't judge a book by its cover.")
(df_sent <- strsplit(x = sentences, split = " ") |> 
  map_dfr(.f =  ~ data.frame(word = .x), .id = "sent_n"))
#> # A tidytable: 15 × 2
#>    sent_n word   
#>     <int> <chr>  
#>  1      1 The    
#>  2      1 apple  
#>  3      1 doesn't
#>  4      1 fall   
#>  5      1 far    
#>  6      1 from   
#>  7      1 the    
#>  8      1 tree.  
#>  9      2 Don't  
#> 10      2 judge  
#> 11      2 a      
#> 12      2 book   
#> 13      2 by     
#> 14      2 its    
#> 15      2 cover.

One can get the log-transformed probability of each word based on GPT-2 as follows:

df_sent <- df_sent |>
  mutate(lp = causal_lp(word, by = sent_n))
#> Processing using causal model ''...
#> Processing a batch of size 1 with 10 tokens.
#> Processing a batch of size 1 with 9 tokens.
#> Text id: 1
#> `The apple doesn't fall far from the tree.`
#> Text id: 2
#> `Don't judge a book by its cover.`
df_sent
#> # A tidytable: 15 × 3
#>    sent_n word         lp
#>     <int> <chr>     <dbl>
#>  1      1 The      NA    
#>  2      1 apple   -10.9  
#>  3      1 doesn't  -5.50 
#>  4      1 fall     -3.60 
#>  5      1 far      -2.91 
#>  6      1 from     -0.745
#>  7      1 the      -0.207
#>  8      1 tree.    -1.58 
#>  9      2 Don't    NA    
#> 10      2 judge    -6.27 
#> 11      2 a        -2.33 
#> 12      2 book     -1.97 
#> 13      2 by       -0.409
#> 14      2 its      -0.257
#> 15      2 cover.   -1.38
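
Since the log-probability of a sentence (given its initial word) is the sum of the log-probabilities of its words given their preceding context, a common follow-up step is to aggregate the word-level values. A small sketch using the already loaded tidytable package (not taken from the package documentation; na.rm = TRUE drops the sentence-initial words, which have no preceding context and thus an NA log-probability):

# Sum the word-level log-probabilities within each sentence;
# na.rm = TRUE skips the sentence-initial NA values.
df_sent |>
  summarise(lp_sentence = sum(lp, na.rm = TRUE), .by = sent_n)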

How to cite

Nicenboim B (2023). pangoling: Access to language model predictions in R. R package version 0.0.0.9010, DOI: 10.5281/zenodo.7637526, https://github.com/bnicenboim/pangoling.

How to contribute

See the Contributing guidelines.

Code of conduct

Please note that this package is released with a Contributor Code of Conduct. By contributing to this project, you agree to abide by its terms.

See also

Another R package that acts as a wrapper for transformers is text. However, text is more general, and its focus is on Natural Language Processing and Machine Learning.

Footnotes

  1. The logo of the package was created with Stable Diffusion and the R package hexSticker.
