[Question] tau_2 is dependent on tau_1 #15
Comments
Hi alabamagan. Thank you for investigating this; it looks interesting. The way tau_2 is designed, it can never exceed tau_1 (so you are right, tau_2 is dependent on tau_1); it can only take values equal to or lower than tau_1. This means that if tau_1 takes the value 1.0, it is entirely possible for tau_2 to take the value 1.0 as well. In that case the feature has been selected across all K models (tau_1 = 1.0) and all weights have the same sign, either positive or negative (which results in tau_2 = 1.0). If, say, tau_1 = 0.7, then tau_2 cannot take a value higher than 0.7, since the weights are non-zero in only 70% of the models, and only those weights have a sign, either positive or negative.

The reason tau_2 was introduced is that, for some datasets, we saw that the weights of a feature were often non-zero, giving the first impression that the feature is important because it is selected often. We then discovered that the weights of that feature had both positive and negative signs (in addition to being relatively small).

If you want to select a feature only if all its non-zero weights have the same sign, you need to set tau_2 to the same value as tau_1. If you want to be less strict about this, you can set tau_2 to a lower value. In our article on arXiv we describe that all three criteria, tau_1, tau_2 and tau_3, must be fulfilled to select a feature. In general, though, the user is free to choose which of tau_1, tau_2 and tau_3 are used to identify and select features. A user who doesn't care about tau_2 and tau_3 can set them to zero, which practically eliminates them as criteria.

But just to be sure I understand your plots correctly:
I am not sure if this is the intended behavior, but judging from the article on arXiv and from the code, tau_2 seems to depend on tau_1, i.e., on the rate at which the feature is selected. This would prioritize tau_1 over tau_2, and tau_2 would be scaled according to how often the feature is selected during feature selection.
I ran a simulation of tau_2 for a single feature x at different selection rates and plotted the positive rate of x against tau_2 (right):
It clearly shows that tau_2 scales strongly with the frequency at which the feature is selected by the models. If the feature is not selected in 100% of the models, the range of tau_2 is no longer 0 to 1.
Since this might be problematic, I added a scaling factor with respect to the observed zero rate (the rate at which the feature is not selected across the K runs) (left), and this brings the range back to 0 to 1.
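For readers following along, here is a minimal sketch of the effect described above. It is not the library's code: it assumes that tau_1 is the fraction of the K models with a non-zero weight and that tau_2 is the absolute value of the mean sign over all K models, which matches the behavior discussed in this thread. The function names, the example weights, and the rescaling by `1 - zero_rate` (which equals tau_1) are hypothetical and only one plausible form of the scaling factor.

```python
import numpy as np


def tau_1(weights):
    """Fraction of the K models in which the feature has a non-zero weight."""
    weights = np.asarray(weights, dtype=float)
    return np.count_nonzero(weights) / weights.size


def tau_2(weights):
    """Absolute mean sign over all K models (assumed definition).

    Zero weights contribute a sign of 0, so this value can never exceed tau_1.
    """
    weights = np.asarray(weights, dtype=float)
    return abs(np.sign(weights).sum()) / weights.size


def tau_2_rescaled(weights):
    """Hypothetical rescaling: tau_2 divided by (1 - zero_rate), i.e. by tau_1.

    This measures sign consistency among the non-zero weights only,
    so its range is 0 to 1 regardless of how often the feature is selected.
    """
    t1 = tau_1(weights)
    return tau_2(weights) / t1 if t1 > 0 else 0.0


# K = 10 models: 6 positive weights, 1 negative weight, 3 zeros.
mixed = np.array([0.5, 0.2, 0.3, 0.1, 0.4, 0.6, -0.2, 0.0, 0.0, 0.0])
# K = 10 models: 7 positive weights, 3 zeros (all non-zero weights agree in sign).
same_sign = np.array([0.5, 0.2, 0.3, 0.1, 0.4, 0.6, 0.2, 0.0, 0.0, 0.0])

for name, w in [("mixed", mixed), ("same_sign", same_sign)]:
    print(name, tau_1(w), tau_2(w), round(tau_2_rescaled(w), 3))
```

In the `same_sign` case the unscaled tau_2 is capped at tau_1 even though every non-zero weight agrees in sign, while the rescaled value reaches 1.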
I just want to ask whether this is the intended behavior and how it could affect the results of feature selection.