Coral stick becomes unresponsive #38
Comments
OK, I notice that this has happened again and the white LED on the stick is blinking. According to the docs:
In the logs:
I then hammered it with 240 requests without issue. Perhaps use a timeout on the request.
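A minimal sketch of what a client-side timeout could look like, using only the Python standard library. The function name `post_image`, the `/predict` path in the usage note and the 10-second default are assumptions for illustration, not the project's actual client code.

```python
import urllib.request


def post_image(url, image_bytes, timeout_s=10.0):
    """POST raw image bytes to the server.

    Returns the response body, or None if the server is unreachable,
    answers with an HTTP error, or does not respond within timeout_s.
    """
    request = urllib.request.Request(
        url,
        data=image_bytes,
        headers={"Content-Type": "application/octet-stream"},
        method="POST",
    )
    try:
        # The timeout applies to connecting and to each blocking read,
        # so a hung server no longer blocks the caller forever.
        with urllib.request.urlopen(request, timeout=timeout_s) as response:
            return response.read()
    except OSError:
        # URLError, HTTPError, socket timeouts and connection resets
        # are all subclasses of OSError.
        return None
```

A caller would use something like `post_image("http://localhost:5000/predict", data)` (hypothetical URL) and treat `None` as "server not responding", so the calling application keeps running even when the stick hangs.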
I am speaking with Manoj at Google about this issue, but there is no resolution yet. The actions he suggested, my answers in bold:
Update from Manoj, 21/8: "We have filed a bug with our development team regarding this issue and hope to get a response soon."
Just a bit more info from my experience. When my Coral USB TPU gets into this state, I don't need to unplug/replug it. I can just restart the container (and in effect, the Flask process) that has the device open. The replacement process can then pick up and continue processing as usual. In a way I'm glad this isn't a unique problem, so very likely not a hardware problem, or a specific artifact of running in a Docker container.
I guess one workaround is to periodically restart the app, but hopefully Google can find and fix the bug so this isn't required.
Latest advice is to try a 5 V @ 3 A power supply.
Hmm.. I'm using the Coral TPU on an Intel i5 NUC "clone" in a USB 3 port, not a Raspberry Pi. I'll check the power supply voltage on one of the USB ports to see if it looks out of spec.
As a workaround, I've built a healthcheck in the Dockerfile to detect when the container running the Flask server stops responding. I had expected Docker to automagically restart my container for me, but that's not the case, either starting it directly with

I wonder if the problem is some sort of concurrency issue with multiple requests landing at the same time? I have 4 cameras I'm using and grabbing frames from them every 10 to 30 seconds or so. I've not investigated the Coral API and associated Python libraries to see if they're safe in that regard. Of course, it could be that Flask is serially processing requests and this can't happen.
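For anyone wanting to try the same workaround, here is a rough sketch of a healthcheck probe in Python. It is not the commenter's actual healthcheck; the names `is_healthy` and `exit_code`, the default URL `http://localhost:5000/` and the timeouts are assumptions. Docker's `HEALTHCHECK` instruction treats exit code 0 as healthy and 1 as unhealthy, which is what `exit_code` returns.

```python
import http.client
import urllib.request


def is_healthy(url, timeout_s=5.0):
    """Return True only if the server answers with a 2xx status in time."""
    try:
        with urllib.request.urlopen(url, timeout=timeout_s) as response:
            return 200 <= response.status < 300
    except (OSError, http.client.HTTPException):
        # Timeouts, refused connections, HTTP error statuses and
        # malformed responses all count as "not healthy".
        return False


def exit_code(url="http://localhost:5000/", timeout_s=5.0):
    """Map the probe result to a Docker healthcheck exit code (0 or 1)."""
    return 0 if is_healthy(url, timeout_s) else 1
```

To use it as the healthcheck command, a small script would end with `sys.exit(exit_code())`. As noted above, plain Docker only marks the container unhealthy, so something else still has to act on that status and restart it.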
I ran a script that hammered the server and it didn't fall over. It appears to fail when running for > 12 hours, regardless of load. In my case I am only doing about an image an hour. I guess we could remove Flask from the equation by just having a script that periodically performs an inference and seeing when/if that fails.
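A sketch of what such a Flask-free soak test might look like. The `run_inference` argument is a placeholder for whatever function performs a single inference on the stick; nothing here calls the real Coral API, and the names `run_with_timeout` and `soak_test` are made up for this example. Each inference runs in a daemon thread so that a hang can be detected with a timeout instead of blocking the script.

```python
import threading
import time
from datetime import datetime


def run_with_timeout(func, timeout_s):
    """Run func() in a daemon thread; return (finished, error)."""
    outcome = {}

    def worker():
        try:
            func()
        except Exception as exc:  # record the error instead of losing it
            outcome["error"] = exc

    thread = threading.Thread(target=worker, daemon=True)
    thread.start()
    thread.join(timeout_s)
    return (not thread.is_alive(), outcome.get("error"))


def soak_test(run_inference, interval_s, timeout_s, max_runs):
    """Call run_inference periodically; describe the first failure, or return None."""
    started = time.monotonic()
    for run in range(1, max_runs + 1):
        finished, error = run_with_timeout(run_inference, timeout_s)
        if not finished or error is not None:
            return {
                "run": run,
                "reason": "hang" if not finished else f"error: {error!r}",
                "uptime_s": time.monotonic() - started,
                "wall_clock": datetime.now().isoformat(timespec="seconds"),
            }
        time.sleep(interval_s)
    return None
```

With `interval_s=3600` this would roughly match the "one image an hour" load described above, and the returned `uptime_s` would show whether the failure really lands around the 12-hour mark.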
@lmamakos are you still experiencing this issue?
I believe so, though I haven't checked recently. I built a health-check for the container and it gets restarted when it hangs.. so out of sight, out of mind. I will look at the log when I return from my business travel later today and see what it's been up to. I'll get a view of the current state of things, then look at updating the Home Assistant component, moving to the latest Home Assistant release, and watching it going forward as well.
As a reminder, I have the Coral USB stick plugged into an Intel i5 in a NUC-like system and not a Raspberry Pi. I'll put a voltmeter on the USB port to check, but it seems unlikely that power is a problem in my case. There's only an SSD in this system (no spinning rust, and no spinning fans either), so I'd expect there would be plenty of headroom in the power supply. I have the Coral USB stick plugged in using their provided USB cable.
Just took a look at the auto-restart log:
So a failure/hang about daily, more or less.
I connected my 20000 count Fluke voltmeter to the USB interface on my NUC (via a chopped-off USB cable) and, over a few hours of activity, measured a low voltage of 5.107 V and a high voltage of 5.132 V, with it normally hanging out around 5.113 V or so. Curiously, as the system is loaded, the USB voltage actually bumps up a hundredth of a volt or two, probably because the switching regulator in there is having to push harder on the CPU core voltage or something.

I ran a ZFS "scrub" operation, which tends to bang on the SATA and PCI-e I/O (for the SATA and M.2 storage that I have), as well as at least one of the cores as it does checksum verification of all the allocated disk blocks during this time. I wanted to try to load the system a little bit. During the course of a few hours, the Coral stick didn't hang, so a little inconclusive. There seems to be plenty of margin in the power supply, but I've not captured the voltages when it enters this hang state.

I'll have to try this again; I'm a little uncomfortable leaving the test cable plugged in unattended, what with the bare wires and curious cats. My Fluke meter samples about 3 times per second, which also might not be frequent enough to capture a brief voltage sag. I don't have any sort of real data logger among my bag of tricks... even triggering my oscilloscope on a low voltage isn't really useful if there's no timestamp I can compare with the observed hangs.

Or it might just be haunted? In the US, Halloween is coming up later this month; perhaps the additional ghosts and spirits will mix things up a bit..
Well, I assume Google are aware of this issue and its root causes, since they suggested the power supply fix. Therefore they will hopefully fix the issue in due course.
I just ran into this issue with a Raspberry Pi 4. I'm hitting it with both a 2A power supply and the CanaKit 3.5A power supply. It works fine on my desktop, though. I'm thinking about trying out an externally powered USB hub to see if that helps. Are other people still experiencing this issue? It hangs pretty fast for me (<10 minutes), and restarting the process temporarily fixes it.
Using
sudo systemctl start coral.service
the app appears to die after 12 hours: no errors in the logs, it just doesn't respond to requests.