v0.12.0

@percevalw percevalw released this 21 May 23:27
· 145 commits to master since this release

Changelog

Added

  • The eds.transformer component now accepts prompts (passed to its preprocess method, see breaking change below) to add before each window of text to embed.
  • LazyCollection.map / map_batches now support generator functions as arguments.
  • Window stride can now be disabled during training (i.e., stride = window) in the eds.transformer component by setting training_stride = False
  • Added a new eds.ner_overlap_scorer to evaluate matches between two lists of entities, counting a match as true when the Dice overlap is above a given threshold
  • edsnlp.load now accepts EDS-NLP models from the Hugging Face Hub 🤗!
  • New python -m edsnlp.package command to package a model for the Hugging Face Hub or PyPI-like registries
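
To make the Dice-overlap criterion used by the new scorer concrete, here is a minimal pure-Python sketch. It does not use the eds.ner_overlap_scorer implementation: the function names, the greedy one-to-one matching and the inclusive `>=` comparison are assumptions made for illustration only.

```python
from typing import List, Tuple

Span = Tuple[int, int]  # half-open (start, end) offsets


def dice_overlap(a: Span, b: Span) -> float:
    """Dice coefficient between two spans: 2 * |a ∩ b| / (|a| + |b|)."""
    inter = max(0, min(a[1], b[1]) - max(a[0], b[0]))
    total = (a[1] - a[0]) + (b[1] - b[0])
    return 2 * inter / total if total else 0.0


def count_true_positives(pred: List[Span], gold: List[Span], threshold: float) -> int:
    """Greedily match each predicted span to at most one unused gold span.

    Assumption: a pair counts as a match when its Dice overlap reaches the
    threshold; the real scorer may match spans and compare differently.
    """
    used = set()
    tp = 0
    for p in pred:
        for i, g in enumerate(gold):
            if i not in used and dice_overlap(p, g) >= threshold:
                used.add(i)
                tp += 1
                break
    return tp


pred = [(0, 10), (20, 30)]
gold = [(5, 15), (40, 50)]
print(dice_overlap(pred[0], gold[0]))  # 0.5
print(count_true_positives(pred, gold, threshold=0.5))  # 1
```

With a higher threshold such as 0.8, the partially overlapping pair above would no longer count as a match.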

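The effect of disabling the stride (stride = window) can be illustrated with a small standalone sketch. This is not how eds.transformer computes its windows; the `windows` helper below is hypothetical and only shows how overlapping and non-overlapping windows cover a sequence of tokens.

```python
from typing import List, Tuple


def windows(n_tokens: int, window: int, stride: int) -> List[Tuple[int, int]]:
    """Return half-open (start, end) windows covering n_tokens tokens.

    Hypothetical helper: consecutive windows start `stride` tokens apart,
    so stride < window yields overlapping windows.
    """
    spans = []
    start = 0
    while True:
        end = min(start + window, n_tokens)
        spans.append((start, end))
        if end >= n_tokens:
            break
        start += stride
    return spans


# Overlapping windows (stride < window)
print(windows(256, window=128, stride=96))
# [(0, 128), (96, 224), (192, 256)]

# Stride disabled (stride = window): fewer windows, no overlap
print(windows(256, window=128, stride=128))
# [(0, 128), (128, 256)]
```

In this toy setting, disabling the stride reduces the number of windows to embed, at the cost of losing the shared context between neighbouring windows.
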
Changed

  • Trainable embedding components now all use foldedtensor to return embeddings, instead of returning a tensor of floats and a mask tensor.
  • 💥 TorchComponent __call__ no longer applies the end-to-end method; it now calls the forward method directly, like any torch module.
  • The trainable eds.span_qualifier component has been renamed to eds.span_classifier to reflect its general purpose (it doesn't only predict qualifiers, but any attribute of a span using its context or not).
  • The omop converter now takes the note_datetime field into account by default when building a document
  • span._.date.to_datetime() and span._.date.to_duration() now automatically take the note_datetime into account
  • nlp.vocab is no longer serialized when saving a model, as it may contain sensitive information and can be recomputed during inference anyway
  • 💥 Major breaking change in trainable components, moving towards a more "task-centric" design:
    • the eds.transformer component is no longer responsible for deciding which spans of text ("contexts") should be embedded. These contexts are now passed via the preprocess method, which now accepts more arguments than just the docs to process.
    • similarly, the eds.span_pooler is no longer responsible for deciding which spans to pool, and instead pools all spans passed to it in the preprocess method.

Consequently, the eds.transformer and eds.span_pooler no longer accept their span_getter argument, and the eds.ner_crf, eds.span_classifier, eds.span_linker and eds.span_qualifier components now accept a context_getter argument instead, as well as a span_getter argument for the latter two. This refactoring can be summarized as follows:

- eds.transformer.span_getter
+ eds.ner_crf.context_getter
+ eds.span_classifier.context_getter
+ eds.span_linker.context_getter

- eds.span_pooler.span_getter
+ eds.span_qualifier.span_getter
+ eds.span_linker.span_getter

and as an example for the eds.span_linker component:

nlp.add_pipe(
    eds.span_linker(
        metric="cosine",
        probability_mode="sigmoid",
+       span_getter="ents",
+       # context_getter="ents",  -> by default, same as span_getter
        embedding=eds.span_pooler(
            hidden_size=128,
-           span_getter="ents",
            embedding=eds.transformer(
-               span_getter="ents",
                model="prajjwal1/bert-tiny",
                window=128,
                stride=96,
            ),
        ),
    ),
    name="linker",
)
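
The "task-centric" idea can also be sketched independently of EDS-NLP. In the toy code below, every class and function (ToyEmbedder, ToyTask, sentence_contexts) is hypothetical and does not mirror the library's actual signatures; it only illustrates the division of responsibilities, where the task component selects the contexts and hands them to the embedding component's preprocess method.

```python
from typing import Callable, List, Sequence, Tuple

Span = Tuple[int, int]  # half-open (start, end) token offsets


class ToyEmbedder:
    """Hypothetical embedding component: it has no span_getter and only
    processes the contexts it receives."""

    def preprocess(self, tokens: Sequence[str], contexts: List[Span]) -> List[List[str]]:
        return [list(tokens[start:end]) for start, end in contexts]


class ToyTask:
    """Hypothetical task component: it owns the context selection."""

    def __init__(
        self,
        embedding: ToyEmbedder,
        context_getter: Callable[[Sequence[str]], List[Span]],
    ):
        self.embedding = embedding
        self.context_getter = context_getter

    def preprocess(self, tokens: Sequence[str]) -> List[List[str]]:
        # The task decides what to embed, then delegates to the embedder.
        contexts = self.context_getter(tokens)
        return self.embedding.preprocess(tokens, contexts=contexts)


def sentence_contexts(tokens: Sequence[str]) -> List[Span]:
    """Toy context getter: one context per period-terminated sentence."""
    contexts, start = [], 0
    for i, tok in enumerate(tokens):
        if tok == ".":
            contexts.append((start, i + 1))
            start = i + 1
    if start < len(tokens):
        contexts.append((start, len(tokens)))
    return contexts


tokens = "Patient has fever . No cough .".split()
task = ToyTask(ToyEmbedder(), context_getter=sentence_contexts)
batch = task.preprocess(tokens)
print(batch)
```

Swapping sentence_contexts for another getter changes what is embedded without touching the embedder, which is the point of moving the getter arguments onto the task components.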

Fixed

  • edsnlp.data.read_json now correctly reads the files from the directory passed as an argument, and not from the parent directory.
  • Overrode spaCy's Doc, Span and Token pickling utils to allow recursively storing Doc, Span and Token objects in the extension values (in particular, span._.date.doc)
  • Removed pendulum dependency, solving various pickling, multiprocessing and missing attributes errors

Full Changelog: v0.11.2...v0.12.0