Text regions not showing in label interface #6579
Comments
Hello, I think the problem is with special characters, because the image shows that this part did get highlighted. Thank you,
@heidi-humansignal I checked your suggestion and tried adjusting the prediction region offsets to exclude those boundary special characters, but the problem persists. Additionally, there are missing regions without any special characters (only newlines). Edit: it is actually a double newline or a newline + whitespace, as explained below. The first missing region has no special character at the boundary; the second missing region also has no special character at the boundary.
I found the issue and managed to bypass it: any text region label will stop showing in the interface after a newline followed by whitespace, so "\n\n" or "\n\s" is no good. What caused my confusion is that ending a region with double newlines will display the region correctly up until (and excluding) those newlines. All I had to do for my sentence-splitter pre-annotation use case is split on the newlines + whitespace and make new regions:

```python
import re

newline_labelstudio_patt = re.compile(r"(\n+\s*)")


def split_newline_labelstudio(sentences: list[dict]) -> list[dict]:
    """Split sentences containing newlines so their regions keep displaying in Label Studio.

    Label Studio stops showing annotation regions after a newline followed by whitespace.
    This function splits regions at newline boundaries so the original text is displayed
    without newline interruptions in annotation.

    Args:
        sentences (list[dict]): List of sentences, each as a dictionary with keys:
            - "text" (str): Sentence text.
            - "start" (int): Start offset of the sentence.
            - "end" (int): End offset of the sentence.

    Returns:
        list[dict]: List of sentence dictionaries with adjusted start and end offsets.

    Example:
        >>> sentences = [{"text": "Hello\nworld", "start": 0, "end": 11}]
        >>> split_newline_labelstudio(sentences)
        [
            {"text": "Hello", "start": 0, "end": 5},
            {"text": "\n", "start": 5, "end": 6},
            {"text": "world", "start": 6, "end": 11}
        ]
    """
    new_sentences = []
    for sent in sentences:
        text = sent["text"]
        start_offset = sent["start"]
        current_offset = start_offset
        # re.split with a capturing group keeps the newline+whitespace separators
        # as their own segments, so character offsets stay 1:1 with the source text.
        new_sents = list(newline_labelstudio_patt.split(text))
        if len(new_sents) > 1:
            for segment in new_sents:
                if not segment:
                    # re.split can yield empty strings when the text starts or ends
                    # with the separator; skip them to avoid zero-length regions.
                    continue
                end_offset = current_offset + len(segment)
                split_new = {
                    "text": segment,
                    "start": current_offset,
                    "end": end_offset,
                }
                new_sentences.append(split_new)
                current_offset = end_offset
        else:  # No split needed.
            new_sentences.append(sent)
    return new_sentences
```

Now this still seems like a bug: there are valid use cases where you want to support newlines + whitespace in annotation regions. I am just lucky that in my corpus the fix often coincides with the desired boundaries.
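To feed the split regions back into Label Studio as pre-annotations, a rough sketch of the conversion is below; the `from_name="label"`, `to_name="text"`, and `"Sentence"` label are assumptions for a generic `<Labels>`/`<Text>` labeling config, not values taken from this thread:

```python
def to_task(text: str, sentences: list[dict]) -> dict:
    """Build a Label Studio task dict with one pre-annotated region per split sentence."""
    results = []
    for sent in split_newline_labelstudio(sentences):
        results.append(
            {
                "from_name": "label",  # assumed <Labels> tag name
                "to_name": "text",     # assumed <Text> tag name
                "type": "labels",
                "value": {
                    "start": sent["start"],
                    "end": sent["end"],
                    "text": sent["text"],  # optional, handy for debugging offsets
                    "labels": ["Sentence"],  # assumed label value
                },
            }
        )
    return {"data": {"text": text}, "predictions": [{"result": results}]}
```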
Yes, this seems like a bug. I'll create a ticket for the eng team. Thank you,
@GillesJ thanks for this detailed report; I am also experiencing this on Label Studio. From the evidence in #6579 (comment), I see a simpler pattern:
While your solution in #6579 (comment) works, it breaks up large regions into many smaller ones, which can make annotation harder in my use case. I suggest a workaround of removing trailing whitespace to address case 3 above (this also affects the non-problematic case 2):

```python
# remove any trailing whitespace, including newlines
new_segment = segment.rstrip()

# recompute the prediction with the trimmed text
new_prediction = {
    "text": new_segment,
    "start": current_offset,
    "end": current_offset + len(new_segment),
}
```

Does this also work for you?
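For reference, a minimal sketch of that idea applied to the same sentence dicts used above (the function name and structure are mine, not from the thread): each region's text is trimmed of trailing whitespace and its end offset is shrunk accordingly, so no region gets split.

```python
def trim_trailing_whitespace(sentences: list[dict]) -> list[dict]:
    """Shrink each region so it ends before any trailing whitespace, including newlines."""
    trimmed = []
    for sent in sentences:
        new_text = sent["text"].rstrip()
        trimmed.append(
            {
                "text": new_text,
                "start": sent["start"],
                # the end offset moves left by the number of stripped characters
                "end": sent["start"] + len(new_text),
            }
        )
    return trimmed
```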
@atreyasha I am sure this works, but removing characters is not an option for my use case. I need to maintain 1:1 character values with a production NLP preprocessing pipeline, since reversibility is a requirement, so removing newlines is not an option.
Describe the bug
I use a sentence splitter to divide the plaintext I want to annotate into regions for choice annotation (my corpus is multilingual, so I need language-specific sentence splitting for good results). For nearly every document there are intermittent regions that are not displayed and are not selectable.
The only consistently shared property of missing regions is that they contain newlines `\n` and/or punctuation. However, similar regions containing newlines internally or at boundaries are displayed correctly.

Edit: the issue occurs with one or more newlines `\n` followed by a whitespace character; the labeled region stops displaying after a newline-plus-whitespace `"\n+\s"`. More details in #6579 (comment).
Region offsets are computed on `data.text`, and newlines `\n` count as 1 char of length. I suspected the cause was double newlines `\n\n` at region boundaries or inside regions, but this does not seem to be the case, because many regions that have multiple newlines internally or at boundaries are displayed correctly.

To Reproduce
Steps to reproduce the behavior:

The `value.text` key is not present in my production files but is included here for illustrative purposes (the problem still occurs with `value.text` omitted from the task JSON). I attached an example of a full JSON task file with many missing regions: example-missing-regions.json

Here is the full example with `value.text` on the regions for debugging: example-with-text-missing-regions.json
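Since the attached files are not reproduced here, the sketch below builds a minimal, hypothetical task whose single region spans a newline followed by whitespace, which is the pattern identified as problematic in this thread; it reuses the same assumed `<Labels>`/`<Text>` config names as the earlier sketch.

```python
text = "First sentence.\n  Second part of the same region after newline + whitespace."

task = {
    "data": {"text": text},
    "predictions": [
        {
            "result": [
                {
                    "from_name": "label",  # assumed <Labels> tag name
                    "to_name": "text",     # assumed <Text> tag name
                    "type": "labels",
                    "value": {
                        "start": 0,
                        "end": len(text),  # region spans the "\n  " boundary
                        "text": text,
                        "labels": ["Sentence"],  # assumed label value
                    },
                }
            ]
        }
    ],
}
```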
Expected behavior
Text regions are highlighted and shown correctly.
Environment (please complete the following information):