Expose Model Predictions from diskprediction_local module #13
Comments
Thanks for bringing this up! Indeed, we plan to add disk failure prediction to the devices dashboard. We are currently considering whether to implement it on the server side, rather than sending the data from the client side.
@yaarith I think it's already implemented on the client side, i.e. the code is built into ceph - see the PR that @chauhankaranraj linked above.
Of course, the disk failure prediction is already implemented on the client side and built into Ceph. What I meant was to generate the prediction on the server side, instead of sending the prediction result along with the device's telemetry. Thinking long term, this way we can:
Hope it makes more sense now :-)
@yaarith OK, then we're thinking along the same lines. Can you point us in the right direction as to where we would plug in the code for running the prediction server-side?
@durandom Yes, it's going to be in this repo. The devices PR will soon be available here. Regarding the prediction itself, we basically wish to have:
A possible scenario is that a user is willing to tolerate a disk with up to 5 read errors. We don't have enough data yet from our device telemetry for model training at this scale, so Backblaze's data can help with training (at least for hard disks).
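As a rough illustration of that scenario (everything here is hypothetical; `forecast_read_errors` stands in for a yet-to-be-built forecasting model, i.e. "Model 1"):

```python
# Hypothetical sketch of the tolerance scenario above; nothing here is
# real project code. forecast_read_errors() stands in for a forecasting
# model that would be trained on Backblaze data.

def forecast_read_errors(smart_history: list, horizon_days: int) -> float:
    """Placeholder: predict the number of read errors expected
    within the next `horizon_days` days."""
    raise NotImplementedError

def device_within_tolerance(smart_history: list,
                            horizon_days: int = 30,
                            max_read_errors: int = 5) -> bool:
    """True if the device is predicted to stay within the user's
    tolerance (e.g. up to 5 read errors)."""
    return forecast_read_errors(smart_history, horizon_days) <= max_read_errors
```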
Thanks for the feedback @yaarith, I like these ideas.
Could this just be a Python script? Also, just to clarify: is this something that the ceph team is planning to work on, or should the AIOps team work on this?
Currently, we don't have a forecasting model like you've described. For the next steps, I could start looking into building one using Backblaze data.
For this, I think we have two options.
What do you think?
Sounds good, @chauhankaranraj!
Sure, it could be a Python script. A couple of questions, please:
For example:
Sounds good! Please let me know if you have any questions.
There are several issues with the current device telemetry data, for instance:
I don't expect the telemetry data to be too useful for this purpose at this point, so Backblaze's data comes to the rescue in this case too :-) That said, we might be able to use telemetry for SSD and NVMe training sometime in the future.
For the "AI Magic" part, IIUC the existing upstream model already does what we want for "Model 2" - i.e. given 6-12 days of SMART data from a device, predict device health as For the integration bit (wrapping a cmd line tool around the model), if it unclear what the model takes as input or what it produces as output, I'm happy to go through that :)
The upstream model requires at least 6 days of data to generate a prediction. If there's >6 days of data, then it won't throw an error, but won't really make any good use of that extra data either.
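For context, a rough sketch of that 6-12 day behavior (an assumption for illustration, not the actual upstream code; the real logic lives in module.py as described below):

```python
# Rough sketch (an assumption, not the actual upstream code) of the
# 6-12 day window enforced before the models are called.

MIN_DAYS = 6
MAX_DAYS = 12

def select_window(daily_smart_records: list) -> list:
    """Keep only the most recent 6-12 days of SMART data."""
    if len(daily_smart_records) < MIN_DAYS:
        # upstream errors out below 6 days of data
        raise ValueError(f"need at least {MIN_DAYS} days of SMART data")
    # data older than the most recent MAX_DAYS days is ignored
    return daily_smart_records[-MAX_DAYS:]
```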
It doesn't save state across runs, and doesn't consume the entire history either, just the most recent 6 days. This design choice was made because module.py, the script that initializes and calls the models, sends 6-12 days of SMART data to the models. It throws an error if there's <6 days (this line), and ignores data that is >12 days (this line). This behavior already existed upstream; I'm not sure if it was contributed by ProphetStor or someone else in the ceph community. So I didn't modify this file, but instead adapted the models to take the kind of input (6-12 days of SMART data) that it already provides. Let me know if something doesn't make sense or if this information failed to answer any of your questions :) For the next steps, does the following sound reasonable:
Hey @yaarith, what do you think of these suggestions? Could I get an ack or nack please? 😃
Hi @chauhankaranraj, I have a simplified version of the integration piece ready: predict_device.py is a command-line tool which receives a device_id (note it's not the uuid, but an internal database serial id) and currently returns its (very simple) failure prediction.
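For reference, a minimal sketch of what such a CLI could look like (this is illustrative only; the actual predict_device.py in the repo is authoritative):

```python
#!/usr/bin/env python3
"""Hypothetical sketch of a predict_device.py-style CLI.

The prediction logic below is a stand-in for the model
injected via model.py, as described in this thread.
"""
import argparse

def predict(device_id: int) -> str:
    # Placeholder: the real implementation fetches the device's SMART
    # history from the database and runs the failure-prediction model.
    return "good"

def main() -> None:
    parser = argparse.ArgumentParser(description="Predict device failure")
    parser.add_argument("device_id", type=int,
                        help="internal database serial id (not the uuid)")
    args = parser.parse_args()
    print(predict(args.device_id))

if __name__ == "__main__":
    main()
```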
In model.py I added a placeholder for the actual model you wrote. Can you please inject its code there? There's a sample Grafana dashboard (currently private) to present these prediction results: After you integrate your code we can display the results of that model.
In this model (the existing one, or "Model 2") we wish to get finer granularity (instead of “good”, “medium”, “bad”) or a more fine-grained time prediction, if possible. Regarding Model 1, is it possible to tell:
The infrastructure is ready from our side, we'll be happy to see your code injected :-)
Great, definitely Backblaze's data.
This is great! Thanks a lot Yaarit, much appreciated! 🙏
Will do :)
I'm not up to date with developments in smartmontools, so you'd likely know better. But it seems to me that the JSONs in
Getting a finer time-to-failure granularity is a bit tricky for the existing model, since it's a classification model (vs. a regression model). If we want that, then we'd have to modify the training setup and train an entirely new model instead. That said, it is possible to get finer insight into failure using the existing model. In addition to the prediction, we could also show the confidence in it, e.g. "model is X% confident that this device is bad".
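A quick sketch of that idea, assuming a scikit-learn-style classifier (the upstream model's actual interface may differ):

```python
# Sketch only: surfacing confidence alongside the "good"/"medium"/"bad"
# class prediction, assuming a scikit-learn-style classifier.
import numpy as np

def predict_with_confidence(clf, features: np.ndarray):
    """Return the predicted class and the model's confidence in it."""
    probs = clf.predict_proba(features.reshape(1, -1))[0]
    best = int(np.argmax(probs))
    return clf.classes_[best], float(probs[best])

# e.g. ("bad", 0.87) -> "model is 87% confident that this device is bad"
```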
I think 6 data points, one from each day in the past. Having more data could be helpful, but if we want to leave the door open for integrating "Model 1" upstream, then we need to ensure compatibility with the existing codebase. That is, make sure it works when there's only 6 days of data. So atm generating the outcome using 6 days of data sounds more appealing to me.
AFAICT it won't need to save state :)
The JSON structure of the SMART attributes in the sample files is different, since it's generated from a database table which holds the attributes; this allows for more efficient fetching and processing. The input format is a dictionary where the keys are the timestamps of the SMART metrics scraping (e.g. “2020-07-20 00:07:47”). These timestamp keys are sorted (ascending). Each timestamp holds an “attr” key, which is a dictionary. The keys in this “attr” dictionary (also sorted, ascending) are the SMART attribute ids ("1", "3", etc.), and each value is another dictionary which holds 2 keys: the attribute's name and its raw value. For example:
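(The sample below is illustrative: attribute names and values are made up, and the exact inner key names are an assumption based on the description above.)

```python
# Illustrative input sample following the structure described above.
sample_input = {
    "2020-07-20 00:07:47": {
        "attr": {
            "1": {"name": "Raw_Read_Error_Rate", "raw_value": 0},
            "5": {"name": "Reallocated_Sector_Ct", "raw_value": 0},
        }
    },
    "2020-07-21 00:07:52": {
        "attr": {
            "1": {"name": "Raw_Read_Error_Rate", "raw_value": 0},
            "5": {"name": "Reallocated_Sector_Ct", "raw_value": 2},
        }
    },
}
```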
Regarding the normalized values:

It would be great if it's possible to reach a finer granularity on both the time and confidence axes. I have the user in mind, and we wish to deliver a better assessment of the disk's health state.
I wouldn’t worry about the compatibility adjustments, since we care more about having a more accurate prediction :-)
In case we choose to use significantly more data points (beyond 6-12), is it possible to keep state between runs?
Gotcha, thanks for the clarification! :)
Yes, the model uses both raw and normalized SMART values.
Ah okay, understood. Would it make sense to add these optimizations and finer predictions incrementally (do I hear a "release early, release often")? That is, start with surfacing the existing coarse-grained model predictions ("Model 2"), then build "Model 1" and integrate that, and then work on making it finer grained and more accurate. My rationale is that we don't know for sure yet which type of model (Model 1 or Model 2) the users would find more helpful and which one they would want to see improved, so it might be a good idea to focus our optimization efforts on where they're needed the most. Does that sound reasonable?
Hmm, I think saving state within the models might be difficult. Is there a specific reason we'd want to do this, instead of just saving state to the database?
I'll add them to the sample files.
Sounds good!
I meant keeping state between runs (when applying a model) in the database :-)
So the Backblaze dataset contains both raw and normalized values for each SMART metric, i.e. it has two columns per SMART metric, one containing the raw value and the other containing the normalized one. So the model takes as input both the raw and normalized values for a given set of SMART metrics; that is, it treats them as separate features. Does that answer your question? P.S. For a glimpse of what that dataframe looks like, you could check out the last cell of this notebook.
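A rough sketch of that layout, using the Backblaze column-naming convention (the values are made up and the actual training code may differ):

```python
# Raw and normalized SMART values as separate feature columns,
# mirroring the Backblaze CSV layout (values are made up).
import pandas as pd

df = pd.DataFrame({
    "smart_5_raw":          [0, 2, 8],       # Reallocated_Sector_Ct, raw
    "smart_5_normalized":   [100, 100, 98],
    "smart_187_raw":        [0, 0, 1],       # Reported_Uncorrect, raw
    "smart_187_normalized": [100, 100, 99],
})

# Both raw and normalized columns are fed to the model as-is:
features = df[["smart_5_raw", "smart_5_normalized",
               "smart_187_raw", "smart_187_normalized"]].to_numpy()
```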
I wonder how the model makes use of the normalized values, i.e. which insights are derived from them compared with the raw ones. Do they have weights?
@yaarith and @chauhankaranraj, sorry to jump into the middle of your discussion. I just learned from Rick that they stopped offering the cloud-based disk health prediction service, see ceph/ceph#36557 (comment). So based on your discussion, it seems we are interested in making the disk prediction on the server side of ceph-telemetry. To me, I see it as our own cloud-based disk health prediction service. Am I right? If that's the case, does this imply that instead of removing the
@tchaikov thanks for the update! Yes, sounds good :-)
@chauhankaranraj thanks! Here are the updated sample files with the normalized SMART values:
tysm @yaarith, much appreciated 😄 Is there a way we can get the values stored in the
Hi @chauhankaranraj, I pushed the changes here: Since capacity is the same per device, I put it in a key per entire input sample ('capacity_bytes'), and not per scraping date. Let me know if you have any questions.
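Illustratively, the updated sample shape might look like this (values are made up; inner key names are an assumption as before):

```python
# 'capacity_bytes' sits at the top level of the input sample,
# alongside the per-timestamp entries, since it is constant per device.
sample_input = {
    "capacity_bytes": 4000787030016,
    "2020-07-20 00:07:47": {
        "attr": {
            "1": {"name": "Raw_Read_Error_Rate", "raw_value": 0},
        }
    },
}
```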
Hi team,
I'm a data scientist on the AICoE team. Some time back, we added failure prediction models to the diskprediction_local module in ceph upstream (PR). Would it be possible to expose predictions from these models on the "devices" Grafana dashboard in some way? Basically, we want to ensure these predictions make their way to users / SMEs / the open source community, so that they can provide us feedback. This feedback would be incredibly valuable for improving the existing models, and also for better understanding whether the kind of output we're providing is useful, or there is something better that we can provide. Let's have a discussion here :)
cc @yaarith @MichaelClifford @durandom