Imageserver stops working on stage when a file can't be found (not always the same file) #2148

Open
andrewjbtw opened this issue May 2, 2024 · 2 comments

Comments

@andrewjbtw

Occasionally, the imageserver stops serving images in the stage environment. This generally shows up in sul-embed as network "failed to fetch" errors. When this happens, the failures appear to be across the board rather than affecting only a subset of SDR items.

There are two imageserver nodes in stage and you can view their status at these links:
http://sul-imageserver-stage-a.stanford.edu/health
http://sul-imageserver-stage-b.stanford.edu/health

When the imageserver is having a problem, one or both of those health checks will have the "color" of "RED". There will also be a message like:

“/stacks/dm/057/nt/0476/asawa.jp2 (No such file or directory) (dm/057/nt/0476/asawa.jp2 -> edu.illinois.library.cantaloupe.source.FilesystemSource)”
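For tracking recurrences, a small parser for the health response could pull out the status and the missing path. This is only a sketch: it assumes the /health body is JSON with "color" and "message" keys (the thread shows those values but not the raw response), and `parse_health` is a hypothetical helper name, not part of any existing tool.

```python
import json
import re

# Assumption: the message has the form "<path> (No such file or directory) ...",
# as in the example above. Paths containing spaces would not match.
MISSING_FILE = re.compile(r"(?P<path>/\S+) \(No such file or directory\)")


def parse_health(body: str) -> dict:
    """Summarize one health response body (assumed JSON shape)."""
    data = json.loads(body)
    color = str(data.get("color", "")).upper()
    message = data.get("message") or ""
    match = MISSING_FILE.search(message)
    return {
        "healthy": color == "GREEN",
        "color": color,
        "missing_path": match.group("path") if match else None,
    }


# Sample body built from the message quoted in this issue.
sample = json.dumps({
    "color": "RED",
    "message": (
        "/stacks/dm/057/nt/0476/asawa.jp2 (No such file or directory) "
        "(dm/057/nt/0476/asawa.jp2 -> "
        "edu.illinois.library.cantaloupe.source.FilesystemSource)"
    ),
})
print(parse_health(sample))
```

The body itself could be fetched from each of the two stage URLs with any HTTP client. A RED status may come back with a non-2xx status code, so the client should be prepared to read the body from an error response.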

In each of the instances where I've investigated this issue, the file that's reported missing in the error message is a file that appeared to have been deleted properly. By "deleted properly" I mean that I've checked Argo and the item history shows that someone intentionally removed the file, which happens when someone changes the file's "shelve" status to "no". These have not been cases where the file was deleted outside of SDR processes, like someone going to the filesystem and just deleting it.

To resolve this error, I've put the file back at the path indicated in the message. After that, the healthcheck turns back to green. The imageserver doesn't seem to care whether the file is the same file as before, just that a file exists at the indicated path. You could probably just run touch /path/to/missing/file to clear the check.
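The touch workaround could be scripted roughly as follows. This is an untested sketch of the idea: `restore_placeholder` is a hypothetical helper, and it assumes that an empty file at the reported path is enough to satisfy the health check, which has not been confirmed here (so far the real file has always been put back).

```python
from pathlib import Path


def restore_placeholder(path: str) -> bool:
    """Create an empty file at `path` if nothing exists there.

    Returns True if a placeholder was created, False if a file
    was already present (in which case it is left untouched).
    """
    target = Path(path)
    if target.exists():
        return False
    # The parent directories may have been removed along with the file.
    target.parent.mkdir(parents=True, exist_ok=True)
    target.touch()
    return True
```

Calling it with the path from the health message, for example `restore_placeholder("/stacks/dm/057/nt/0476/asawa.jp2")`, would need to run as a user with write access to the stacks filesystem.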

The odd thing is that once the error is cleared, we've found that you can then delete the same file and the error will not come back.

It is not clear what specifically generates this error, but since it apparently takes the whole server down, we would benefit from figuring out what's going on. I should note that I have never seen the issue in production, only stage.

Frequency of occurrence

The first time I remember being aware enough of this issue to monitor it was 2023-09 (see related Slack thread).

This happened again on 2024-05-01. The imageserver reported a file missing from an item that I had made dark. I reaccessioned the item to shelve the file and then the healthcheck turned green. Deleting the file again later did not trigger a recurrence of the error.

I had not been tracking occurrences, so those are the only two that I can identify with certain time frames.

Based on standup discussion this morning (2024-05-02) we decided we should create an issue, if only to have a place to track recurrences.

@andrewjbtw
Author

Also, I'm filing this issue in sul-embed because it's not clear whether we have a more specific repo for it.

@justinlittman
Contributor

It looks like Cantaloupe's health check strategy is to check the file of the most recent image that it processed: https://github.com/cantaloupe-project/cantaloupe/blob/develop/src/main/java/edu/illinois/library/cantaloupe/status/HealthChecker.java

If that file is no longer there (which it might not be, e.g., if it was made dark), then this is considered a health check failure.
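To make the behavior concrete, here is a toy model of that strategy. It is not Cantaloupe's code: `LastImageHealthCheck` and `simulate` are made-up names, and the last step (a different image being served before the file is deleted again) is only one possible explanation for why the error does not recur. That part is not confirmed anywhere in this thread.

```python
import os
import tempfile


class LastImageHealthCheck:
    """Toy model: remember the last source file served and re-check it."""

    def __init__(self):
        self.last_path = None

    def record(self, path):
        # Called whenever an image is served from `path`.
        self.last_path = path

    def check(self):
        # Healthy if nothing has been served yet or the last file still exists.
        if self.last_path is None or os.path.exists(self.last_path):
            return "GREEN"
        return "RED"


def simulate():
    statuses = []
    with tempfile.TemporaryDirectory() as stacks:
        a = os.path.join(stacks, "a.jp2")
        b = os.path.join(stacks, "b.jp2")
        for path in (a, b):
            open(path, "w").close()

        hc = LastImageHealthCheck()
        hc.record(a)                 # a.jp2 is the last image served
        os.remove(a)                 # item made dark, file unshelved
        statuses.append(hc.check())  # RED: remembered file is gone

        open(a, "w").close()         # file put back at the same path
        statuses.append(hc.check())  # GREEN again

        hc.record(b)                 # assumption: another image is served
        os.remove(a)                 # deleting a.jp2 a second time
        statuses.append(hc.check())  # GREEN: the check now looks at b.jp2
    return statuses


print(simulate())  # ['RED', 'GREEN', 'GREEN']
```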

An alternative is to use the more basic health check option (https://cantaloupe-project.github.io/manual/5.0/endpoints.html#Health%20Check), which is probably fine, especially for stage.
