Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

chore: bump version to 0.15.0 #355

Merged
merged 2 commits into from
Dec 13, 2024
Merged

chore: bump version to 0.15.0 #355

merged 2 commits into from
Dec 13, 2024

Conversation

percevalw
Copy link
Member

Changelog

Added

  • edsnlp.data.read_parquet now accept a work_unit="fragment" option to split tasks between workers by parquet fragment instead of row. When this is enabled, workers do not read every fragment while skipping 1 in n rows, but read all rows of 1/n fragments, which should be faster.
  • Accept no validation data in edsnlp.train script
  • Log the training config at the beginning of the trainings
  • Support a specific model output dir path for trainings (output_model_dir), and whether to save the model or not (save_model)
  • Specify whether to log the validation results or not (logger=False)
  • Added support for the CoNLL format with edsnlp.data.read_conll and with a specific eds.conll_dict2doc converter
  • Added a Trainable Biaffine Dependency Parser (eds.biaffine_dep_parser) component and metrics
  • New eds.extractive_qa component to perform extractive question answering using questions as prompts to tag entities instead of a list of predefined labels as in eds.ner_crf.

Fixed

  • Fix join_thread missing attribute in SimpleQueue when cleaning a multiprocessing executor
  • Support huggingface transformers that do not set cls_token_id and sep_token_id (we now also look for these tokens in the special_tokens_map and vocab mappings)
  • Fix changing scorers dict size issue when evaluating during training
  • Seed random states (instead of using random.RandomState()) when shuffling in data readers : this is important for
    1. reproducibility
    2. in multiprocessing mode, ensure that the same data is shuffled in the same way in all workers
  • Bubble BaseComponent instantiation errors correctly
  • Improved support for multi-gpu gradient accumulation (only sync the gradients at the end of the accumulation), now controled by the optiona sub_batch_size argument of TrainingData.
  • Support again edsnlp without pytorch installed
  • We now test that edsnlp works without pytorch installed
  • Fix units and scales, ie 1l = 1dm3, 1ml = 1cm3

@percevalw percevalw force-pushed the v0.15.0 branch 2 times, most recently from 6f6310d to dbbfd69 Compare December 13, 2024 18:07
Copy link

Coverage Report

NameStmtsMiss∆ MissCover
edsnlp/pipes/misc/quantities/quantities.py

Was already missing at lines 147-149

     def __getitem__(self, item: int):
-         assert isinstance(item, int)
-         return [self][item]
Was already missing at lines 160-163
     def __eq__(self, other: Any):
-         if isinstance(other, SimpleQuantity):
-             return self.convert_to(other.unit) == other.value
-         return False
Was already missing at line 166
         if other.unit == self.unit:
-             return SimpleQuantity(self.value + other.value, self.unit, self.registry)
         return SimpleQuantity(
Was already missing at line 193
             return self.convert_to(other_unit)
-         except KeyError:
             raise AttributeError(f"Unit {other_unit} not found")
Was already missing at line 198
     def verify(cls, ent):
-         return True
New missing coverage at line 264 !
     def __lt__(self, other: Union[SimpleQuantity, "RangeQuantity"]):
-         return max(self.convert_to(other.unit)) < min((part.value for part in other))
New missing coverage at line 275 !
             return self.convert_to(other.unit) == other.value
-         return False
New missing coverage at line 289 !
     def verify(cls, ent):
-         return True
New missing coverage at line 888 !
         if snippet.end != last and doclike.doc[last: snippet.end].text.strip() == "":
-             pseudo.append("w")
         pseudo = "".join(pseudo)
New missing coverage at line 1069 !
                             if start_line is None:
-                                 continue
New missing coverage at lines 1100-1102 !
                         unit_norm = self.unit_followers[unit_before.label_]
-                 except (KeyError, AttributeError, IndexError):
-                     pass
New missing coverage at line 1145 !
             ):
-                 ent = doc[unit_text.start: number.end]
             else:
New missing coverage at lines 1152-1154 !
                 dims = self.unit_registry.parse_unit(unit_norm)[0]
-             except KeyError:
-                 continue
New missing coverage at lines 1260-1262 !
                     last._.set(last.label_, new_value)
-                 except (AttributeError, TypeError):
-                     merged.append(ent)
             else:

44020095.45%
TOTAL10943212098.06%
Files without new missing coverage
NameStmtsMiss∆ MissCover
edsnlp/utils/torch.py

Was already missing at line 102

 def load_pruned_obj(obj, _):
-     return obj
Was already missing at line 118
     def save_align_devices_hook(pickler, obj):
-         pickler.save_reduce(load_align_devices_hook, (obj.__dict__,), obj=obj)
Was already missing at lines 121-128
     def load_align_devices_hook(state):
-         state["execution_device"] = MAP_LOCATION
  ...
-     AlignDevicesHook = None
Was already missing at line 143
             if torch.Tensor in copyreg.dispatch_table:
-                 old_dispatch[torch.Tensor] = copyreg.dispatch_table[torch.Tensor]
             copyreg.pickle(torch.Tensor, reduce_empty)

839089.16%
edsnlp/utils/span_getters.py

Was already missing at lines 52-55

         else:
-             for span in candidates:
-                 if span.label_ in span_filter:
-                     yield span
Was already missing at lines 59-61
     if span_getter is None:
-         yield doc[:], None
-         return
     if callable(span_getter):
Was already missing at lines 62-64
     if callable(span_getter):
-         yield from span_getter(doc)
-         return
     for key, span_filter in span_getter.items():
Was already missing at line 66
         if key == "*":
-             candidates = (
                 (span, group) for group in doc.spans.values() for span in group
Was already missing at lines 75-78
         else:
-             for span, group in candidates:
-                 if span.label_ in span_filter:
-                     yield span, group
Was already missing at line 82
     if callable(span_setter):
-         span_setter(doc, matches)
     else:
Was already missing at line 124
             elif isinstance(v, str):
-                 new_value[k] = [v]
             elif isinstance(v, list) and all(isinstance(i, str) for i in v):
Was already missing at line 162
             elif isinstance(v, str):
-                 new_value[k] = [v]
             elif isinstance(v, list) and all(isinstance(i, str) for i in v):

14914090.60%
edsnlp/utils/resources.py

Was already missing at line 33

     if not verbs:
-         return conjugated_verbs

241095.83%
edsnlp/utils/numbers.py

Was already missing at line 34

     else:
-         string = s
     string = string.lower().strip()
Was already missing at lines 38-41
         return int(string)
-     except ValueError:
-         parsed = DIGITS_MAPPINGS.get(string, None)
-         return parsed

164075.00%
edsnlp/utils/filter.py

Was already missing at line 206

     if isinstance(label, int):
-         return [span for span in spans if span.label == label]
     else:

741098.65%
edsnlp/utils/bindings.py

Was already missing at line 23

         return "." + path
-     return path

651098.46%
edsnlp/training/trainer.py

Was already missing at line 79

     if result is None:
-         result = {}
     if isinstance(x, dict):

2391099.58%
edsnlp/processing/spark.py

Was already missing at line 50

         getActiveSession = SparkSession.getActiveSession
-     except AttributeError:

471097.87%
edsnlp/processing/multiprocessing.py

Was already missing at lines 386-391

                 self.on_stop()
-         except BaseException as e:
  ...
-             self.main_control_queue.put(e)
         finally:
Was already missing at lines 395-397
                     pass
-             except StopSignal:
-                 pass
             for name, queue in self.consumer_queues(stage):
Was already missing at line 535
                     while schedule[task_idx] is None:
-                         task_idx = (task_idx + 1) % len(schedule)
Was already missing at lines 599-601
             if isinstance(docs, StreamSentinel):
-                 self.active_batches[stage].append([None, None, None, docs])
-                 continue
             batch_id = str(hash(tuple(id(x) for x in docs)))[-8:] + "-" + self.uid
Was already missing at lines 1113-1119
                 if out[0].kind == requires_sentinel:
-                     missing_sentinels -= 1
  ...
-                         missing_sentinels = len(self.cpu_worker_names)
                 continue

62414097.76%
edsnlp/processing/deprecated_pipe.py

Was already missing at lines 207-209

         def converter(doc):
-             res = results_extractor(doc)
-             return (
                 [{"note_id": doc._.note_id, **row} for row in res]

572096.49%
edsnlp/pipes/trainable/span_linker/span_linker.py

Was already missing at lines 402-404

             if self.reference_mode == "synonym":
-                 embeds = embeds.to(new_lin.weight)
-                 new_lin.weight.data = embeds
             else:

1732098.84%
edsnlp/pipes/trainable/span_classifier/span_classifier.py

Was already missing at line 345

         if not all(keep_bindings):
-             logger.warning(
                 "Some attributes have no labels or values and have been removed:"

1591099.37%
edsnlp/pipes/trainable/ner_crf/ner_crf.py

Was already missing at line 254

         if self.labels is not None and not self.infer_span_setter:
-             return
Was already missing at lines 262-264
             if callable(self.target_span_getter):
-                 for span in get_spans(doc, self.target_span_getter):
-                     inferred_labels.add(span.label_)
             else:

1583098.10%
edsnlp/pipes/trainable/layers/crf.py

Was already missing at line 21

     # out: 2 * N * O
-     return (log_A.unsqueeze(-1) + log_B.unsqueeze(-3)).logsumexp(-2)
Was already missing at line 29
     # out: 2 * N * O
-     return (log_A.unsqueeze(-1) + log_B.unsqueeze(-3)).max(-2)
Was already missing at line 98
         if learnable_transitions:
-             self.transitions = torch.nn.Parameter(
                 torch.zeros_like(forbidden_transitions, dtype=torch.float)
Was already missing at line 108
         if learnable_transitions and with_start_end_transitions:
-             self.start_transitions = torch.nn.Parameter(
                 torch.zeros(num_tags, dtype=torch.float)
Was already missing at line 117
         if learnable_transitions and with_start_end_transitions:
-             self.end_transitions = torch.nn.Parameter(
                 torch.zeros(num_tags, dtype=torch.float)

1375096.35%
edsnlp/pipes/trainable/embeddings/transformer/transformer.py

Was already missing at line 165

         if quantization is not None:
-             kwargs["quantization_config"] = quantization
Was already missing at line 185
         if self.cls_token_id is None:
-             [self.cls_token_id] = self.tokenizer.convert_tokens_to_ids(
                 [self.tokenizer.special_tokens_map["bos_token"]]
Was already missing at line 189
         if self.sep_token_id is None:
-             [self.sep_token_id] = self.tokenizer.convert_tokens_to_ids(
                 [self.tokenizer.special_tokens_map["eos_token"]]

1663098.19%
edsnlp/pipes/qualifiers/reported_speech/reported_speech.py

Was already missing at lines 24-28

         return "REPORTED"
-     elif token._.rspeech is False:
-         return "DIRECT"
-     else:
-         return None

993096.97%
edsnlp/pipes/qualifiers/negation/negation.py

Was already missing at line 28

     else:
-         return None

991098.99%
edsnlp/pipes/qualifiers/hypothesis/hypothesis.py

Was already missing at line 27

     else:
-         return None

961098.96%
edsnlp/pipes/qualifiers/history/history.py

Was already missing at lines 26-32

 def history_getter(token: Union[Token, Span]) -> Optional[str]:
-     if token._.history is True:
-         return "ATCD"
-     elif token._.history is False:
-         return "CURRENT"
-     else:
-         return None
Was already missing at lines 337-343
                 )
-             except ValueError:
  ...
-                 note_datetime = None
Was already missing at lines 352-358
                 )
-             except ValueError:
  ...
-                 birth_datetime = None
Was already missing at lines 424-427
                         )
-                     except ValueError as e:
-                         absolute_date = None
-                         logger.warning(
                             "In doc {}, the following date {} raises this error: {}. "

17714092.09%
edsnlp/pipes/qualifiers/family/family.py

Was already missing at line 27

     else:
-         return None

811098.77%
edsnlp/pipes/qualifiers/base.py

Was already missing at line 178

     def __call__(self, doc: Doc) -> Doc:
-         results = self.process(doc)
         raise NotImplementedError(f"{type(results)} should be used to tag the document")

501098.00%
edsnlp/pipes/ner/tnm/model.py

Was already missing at line 147

     def __str__(self):
-         return self.norm()
Was already missing at line 171
             )
-             exclude_unset = skip_defaults

1122098.21%
edsnlp/pipes/ner/scores/sofa/sofa.py

Was already missing at line 32

             if not assigned:
-                 continue
             if assigned.get("method_max") is not None:
Was already missing at line 40
             else:
-                 method = "Non précisée"

252092.00%
edsnlp/pipes/ner/scores/elston_ellis/patterns.py

Was already missing at line 26

         if x <= 5:
-             return 1
Was already missing at lines 32-36
         else:
-             return 3
- 
-     except ValueError:
-         return None

214080.95%
edsnlp/pipes/ner/scores/charlson/patterns.py

Was already missing at lines 21-23

             return int(extracted_score)
-     except ValueError:
-         return None

132084.62%
edsnlp/pipes/ner/scores/base_score.py

Was already missing at line 154

             if value is None:
-                 continue
             normalized_value = self.score_normalization(value)

471097.87%
edsnlp/pipes/ner/disorders/solid_tumor/solid_tumor.py

Was already missing at lines 130-136

         for span in spans:
-             span.label_ = "solid_tumor"
  ...
-             yield span

376083.78%
edsnlp/pipes/ner/disorders/peripheral_vascular_disease/peripheral_vascular_disease.py

Was already missing at line 107

                 if "peripheral" not in span._.assigned.keys():
-                     continue

151093.33%
edsnlp/pipes/ner/disorders/diabetes/diabetes.py

Was already missing at line 131

                 # Mostly FP
-                 continue
Was already missing at line 134
             elif self.has_far_complications(span):
-                 span._.status = 2
Was already missing at line 146
         if next(iter(self.complication_matcher(context)), None) is not None:
-             return True
         return False

313090.32%
edsnlp/pipes/ner/disorders/connective_tissue_disease/connective_tissue_disease.py

Was already missing at line 103

                 # Huge change of FP / Title section
-                 continue

141092.86%
edsnlp/pipes/ner/disorders/ckd/ckd.py

Was already missing at lines 120-123

             dfg_value = float(dfg_span.text.replace(",", ".").strip())
-         except ValueError:
-             logger.trace(f"DFG value couldn't be extracted from {dfg_span.text}")
-             return False

293089.66%
edsnlp/pipes/ner/disorders/cerebrovascular_accident/cerebrovascular_accident.py

Was already missing at lines 111-113

             if span._.source == "ischemia":
-                 if "brain" not in span._.assigned.keys():
-                     continue

172088.24%
edsnlp/pipes/ner/adicap/models.py

Was already missing at line 15

     def norm(self) -> str:
-         return self.code
Was already missing at line 18
     def __str__(self):
-         return self.norm()

162087.50%
edsnlp/pipes/misc/split/split.py

Was already missing at lines 175-177

         if max_length <= 0 and self.regex is None:
-             yield doc
-             return

702097.14%
edsnlp/pipes/misc/sections/sections.py

Was already missing at line 126

         if sections is None:
-             sections = patterns.sections
         sections = dict(sections)

451097.78%
edsnlp/pipes/misc/dates/models.py

Was already missing at line 156

                     else:
-                         d["month"] = note_datetime.month
                 if self.day is None:
Was already missing at lines 160-166
             else:
-                 if self.year is None:
  ...
-                     d["day"] = default_day
Was already missing at lines 174-176
                 return dt
-             except ValueError:
-                 return None
Was already missing at line 192
         else:
-             return None
Was already missing at line 208
         if self.second:
-             norm += f"{self.second:02}s"

19911094.47%
edsnlp/pipes/misc/dates/dates.py

Was already missing at line 249

         if isinstance(absolute, str):
-             absolute = [absolute]
         if isinstance(relative, str):
Was already missing at line 251
         if isinstance(relative, str):
-             relative = [relative]
         if isinstance(duration, str):
Was already missing at line 253
         if isinstance(duration, str):
-             relative = [duration]
         if isinstance(false_positive, str):
Was already missing at lines 357-366
             if self.merge_mode == "align":
-                 alignments = align_spans(matches, spans, sort_by_overlap=True)
  ...
-                         matches.append(span)
Was already missing at lines 462-464
                 if v1.mode == Mode.DURATION:
-                     m1 = Bound.FROM if v2.bound == Bound.UNTIL else Bound.UNTIL
-                     m2 = v2.mode or Bound.FROM
                 elif v2.mode == Mode.DURATION:

15314090.85%
edsnlp/pipes/misc/consultation_dates/consultation_dates.py

Was already missing at line 131

         else:
-             self.date_matcher = None
Was already missing at line 134
         if not consultation_mention:
-             consultation_mention = []
         elif consultation_mention is True:

482095.83%
edsnlp/pipes/core/normalizer/__init__.py

Was already missing at line 7

 def excluded_or_space_getter(t):
-     return t.is_space or t.tag_ == "EXCLUDED"

51080.00%
edsnlp/pipes/core/endlines/endlines.py

Was already missing at lines 156-160

         if end_lines_model is None:
-             path = build_path(__file__, "base_model.pkl")
- 
-             with open(path, "rb") as inp:
-                 self.model = pickle.load(inp)
         elif isinstance(end_lines_model, str):
Was already missing at lines 163-165
                 self.model = pickle.load(inp)
-         elif isinstance(end_lines_model, EndLinesModel):
-             self.model = end_lines_model
         else:
Was already missing at line 196
         ):
-             return "ENUMERATION"
Was already missing at line 283
         if np.isnan(sigma):
-             sigma = 1

877091.95%
edsnlp/pipes/core/contextual_matcher/models.py

Was already missing at lines 28-32

     if isinstance(v, list):
-         assert (
-             len(v) == 2
-         ), "`window` should be a tuple/list of two integer, or a single integer"
-         v = tuple(v)
     if isinstance(v, int):

1382098.55%
edsnlp/pipes/core/contextual_matcher/contextual_matcher.py

Was already missing at line 94

             )
-             label = label_name
         if label is None:
Was already missing at line 343
                 if assigned is None:
-                     continue
                 if replace_entity:

1432098.60%
edsnlp/patch_spacy.py

Was already missing at lines 67-69

             # if module is reloaded.
-             existing_func = registry.factories.get(internal_name)
-             if not util.is_same_func(factory_func, existing_func):
                 raise ValueError(

312093.55%
edsnlp/package.py

Was already missing at lines 475-477

             version = version or pyproject["project"]["version"]
-         except (KeyError, TypeError):
-             version = "0.1.0"
         name = name or pyproject["project"]["name"]
Was already missing at line 481
         else:
-             main_package = None
         model_package = snake_case(name.lower())

2073098.55%
edsnlp/metrics/span_attributes.py

Was already missing at lines 56-58

         )
-         assert attributes is None
-         attributes = kwargs.pop("qualifiers")
     if attributes is None:

712097.18%
edsnlp/matchers/simstring.py

Was already missing at line 280

     if custom:
-         attr = attr[1:].lower()
Was already missing at line 295
             if custom:
-                 token_text = getattr(token._, attr)
             else:

1462098.63%
edsnlp/language.py

Was already missing at line 103

             if last != begin:
-                 logger.warning(
                     "Missed some characters during"

511098.04%
edsnlp/data/standoff.py

Was already missing at line 38

     def __init__(self, ann_file, line):
-         super().__init__(f"File {ann_file}, unrecognized Brat line {line}")
Was already missing at line 192
                         )
-                 except Exception:
                     raise Exception(

1862098.92%
edsnlp/data/polars.py

Was already missing at line 36

         if hasattr(data, "collect"):
-             data = data.collect()
         assert isinstance(data, pl.DataFrame)

551098.18%
edsnlp/data/json.py

Was already missing at line 81

                 return records
-         except Exception as e:
             raise Exception(f"Cannot read {file}: {e}")

1121099.11%
edsnlp/data/converters.py

Was already missing at line 142

     if "tokenizer" in CONTEXT[0]:
-         return CONTEXT[0]["tokenizer"]
     if _DEFAULT_TOKENIZER is None:
Was already missing at line 414
                 elif key == "XPOS":
-                     word.tag_ = value
                 elif key == "FEATS":
Was already missing at line 718
     if isinstance(converter, type):
-         return converter(**kwargs), {}
     return converter, validate_kwargs(converter, kwargs)

2343098.72%
edsnlp/data/conll.py

Was already missing at lines 81-83

             )
-         except StopIteration:
-             cols = DEFAULT_COLUMNS
             warnings.warn(
Was already missing at lines 92-96
         if not line:
-             if doc["words"]:
-                 yield doc
-                 doc = {"words": []}
-             continue
         if line.startswith("#"):

766092.11%
edsnlp/core/torch_component.py

Was already missing at line 392

             if hasattr(self, "compiled"):
-                 res = self.compiled(batch)
             else:
Was already missing at line 438
         """
-         return self.preprocess(doc)
Was already missing at line 463
         if object.__repr__(self) in exclude:
-             return
         exclude.add(object.__repr__(self))

1873098.40%
edsnlp/core/stream.py

Was already missing at lines 190-192

                 if isinstance(batch, StreamSentinel):
-                     yield batch
-                     continue
                 results = []
Was already missing at lines 999-1001
                 elif op.batch_fn is None:
-                     batch_size = op.size
-                     batch_fn = batchify
                 else:

3544098.87%
edsnlp/core/pipeline.py

Was already missing at line 605

             if name in exclude:
-                 continue
             if name not in components:
Was already missing at lines 716-719
         """
-         res = Stream.ensure_stream(docs)
-         res = res.map(functools.partial(self.preprocess, supervision=supervision))
-         return res

4424099.10%
edsnlp/connectors/omop.py

Was already missing at line 69

         if not isinstance(row.ents, list):
-             continue
Was already missing at line 87
             else:
-                 doc.spans[span.label_].append(span)
Was already missing at line 127
     if df.note_id.isna().any():
-         df["note_id"] = range(len(df))
Was already missing at line 171
         if i > 0:
-             df.term_modifiers += ";"
         df.term_modifiers += ext + "=" + df[ext].astype(str)

844095.24%

273 files skipped due to complete coverage.

Coverage success: total of 98.06% is above 98.06% 🎉

@percevalw percevalw merged commit f14900d into master Dec 13, 2024
11 of 14 checks passed
@percevalw percevalw deleted the v0.15.0 branch December 13, 2024 18:54
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

1 participant