Add Commit Networks #263

Merged
merged 16 commits into from
Aug 28, 2024

Conversation

Leo-Send
Contributor

@Leo-Send Leo-Send commented May 21, 2024

Prerequisites

  • I adhere to the coding conventions (described here) in my code.
  • I have updated the copyright headers of the files I have modified.
  • I have written appropriate commit messages, i.e., I have recorded the goal, the need, the needed changes, and the location of my code modifications for each commit. This includes also, e.g., referencing to relevant issues.
  • I have put signed-off tags in all commits.
  • I have updated the changelog file NEWS.md appropriately.
  • I have checked whether I need to adjust the showcase file showcase.R with respect to my changes.
  • The pull request is opened against the branch dev.

Description

Add a new type of network to coronet: Commit Network
The commit network uses commits as vertices. The edges are based on commit interactions or cochange.
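
To make the cochange relation concrete, here is a minimal base-R sketch of how commits that touch the same artifact could be turned into an edge list. It does not use coronet's API: the toy data, the column names `hash` and `file`, and the `artifact` edge attribute are assumptions chosen for illustration only.

```r
## Toy commit data: one row per (commit, file) pair. All values are made up.
commits = data.frame(
    hash = c("c1", "c1", "c2", "c3", "c3"),
    file = c("a.R", "b.R", "a.R", "b.R", "c.R"),
    stringsAsFactors = FALSE
)

## Group commit hashes by the artifact (here: file) they touch.
commits.by.file = split(commits$hash, commits$file)

## Connect every pair of distinct commits that touched the same file.
edge.list = do.call(rbind, lapply(names(commits.by.file), function(file) {
    hashes = unique(commits.by.file[[file]])
    if (length(hashes) < 2) {
        return(NULL)
    }
    pairs = t(combn(hashes, 2))
    data.frame(from = pairs[, 1], to = pairs[, 2], artifact = file,
               stringsAsFactors = FALSE)
}))

## The vertices of the network are the commits themselves.
vertices = unique(commits$hash)
print(edge.list)
```

In this toy data, `c1` and `c2` both touch `a.R`, and `c1` and `c3` both touch `b.R`, so the edge list has two rows; `c.R` is touched by a single commit and contributes no edge.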

Changelog

Added

  • Add commit network as a new type of network. It uses commits as vertices and connects them either via cochange or commit interactions. This includes adding new config parameters and a function for adding vertex attributes to a commit network (PR #263, ab73271, ab73271, cd9a930).

Changed/Improved

Fixed

@Leo-Send
Contributor Author

Currently the part in the showcase is still missing, I will look into where and what to add there.

@Leo-Send
Contributor Author

Currently the networks contain very little information, because most information typically is

  • in the commit data
  • shown on the edges of the network

Because the commits are the vertices of the network in this case, that information is not portrayed anywhere. This means that the date, author information, and commit-id have to be looked up separately from the network using the commit hash.

Should I include more vertex attributes that also contain this information or some other functionality to more easily retrieve this data?

@Leo-Send
Contributor Author

In the network construction for the commit cochange network, respect.temporal.order is currently always set to 'TRUE'. This matches the construction of the artifact cochange network, that's why I did the same. Should that be the case?

@Leo-Send
Contributor Author

I am currently removing columns from the 'commit.net.data' in 'get.commit.network.cochange' because they contain commit information that should not be on the edges in the commit network. The function 'construct.edge.list.from.key.value.list' returns columns that are not wanted/needed for the commit network construction.

The question is, should I remove the lines as I currently do, or should I add a boolean parameter to the function, similar to the existing 'artifact.edges' parameter?


codecov bot commented May 29, 2024

Codecov Report

Attention: Patch coverage is 95.56650% with 9 lines in your changes missing coverage. Please review.

Project coverage is 80.89%. Comparing base (74ebe0b) to head (5842073).
Report is 17 commits behind head on dev.

Files Patch % Lines
util-networks.R 95.08% 9 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##              dev     #263      +/-   ##
==========================================
+ Coverage   80.24%   80.89%   +0.65%     
==========================================
  Files          16       16              
  Lines        4905     5036     +131     
==========================================
+ Hits         3936     4074     +138     
+ Misses        969      962       -7     


@Leo-Send Leo-Send marked this pull request as ready for review June 19, 2024 14:18
@bockthom
Collaborator

@Leo-Send Could you please rebase your branch onto the current dev branch? Otherwise the CI pipeline will not work (for two reasons: on the one hand, there are conflicting changes in util-networks-covariates.R; on the other hand, we have made some changes to the CI pipeline anyway).

Contributor

@hechtlC hechtlC left a comment

Looks good overall, thank you @Leo-Send!
I have a couple of points that need addressing, but it's nothing too bad.

util-networks-covariates.R Outdated Show resolved Hide resolved
util-networks-covariates.R Outdated Show resolved Hide resolved
util-networks-covariates.R Outdated Show resolved Hide resolved
util-networks.R Outdated Show resolved Hide resolved
util-networks.R Show resolved Hide resolved
util-networks.R Outdated Show resolved Hide resolved
util-networks.R Outdated Show resolved Hide resolved
util-networks.R Outdated Show resolved Hide resolved
util-networks.R Outdated Show resolved Hide resolved
util-networks.R Outdated Show resolved Hide resolved
Leo-Send added 11 commits July 24, 2024 15:30
'get.commit.network' will delegate calls to corresponding methods,
depending on 'commit.relation' config parameter in NetworkConf

Signed-off-by: Leo Sendelbach <[email protected]>
functions 'get.commit.network.cochange' and
'get.commit.network.commit.interaction' are called in
'get.commit.network'. Also add 'group.commits.by.data.column', a helper
function used in constructing the cochange commit network.

Signed-off-by: Leo Sendelbach <[email protected]>
Also add first test for commit-interaction based commit network and
fixed a minor error in network creation

Signed-off-by: Leo Sendelbach <[email protected]>
Initializing vertex kind to 'TYPE.COMMIT' in the correct position

Signed-off-by: Leo Sendelbach <[email protected]>
Tests for each artifact type, parameterized for directed attribute

Signed-off-by: Leo Sendelbach <s8lesendqstud.uni-saarland.de>
Commit Network now also built when calling function 'get.networks'.

Signed-off-by: Leo Sendelbach <[email protected]>
show how to construct commit network in showcase. Also fixed bug that
resulted in showcase crashing.

Signed-off-by: Leo Sendelbach <[email protected]>
In this process, also refactor 'construct.edge.list.from.key.value.list'
method. Some more comments might be necessary.

Signed-off-by: Leo Sendelbach <[email protected]>
New function allows adding vertex attributes from commit data to commit
network vertices

Signed-off-by: Leo Sendelbach <[email protected]>
'add.vertex.attribute.commit.network' is now used in showcase. Also
minor changes to documentation and performance improvement in cochange
commit network creation.

Signed-off-by: Leo Sendelbach <[email protected]>
attribute 'date' added to cochange commit network edges, attribute
artifact.type added to all networks based on commit interactions

Signed-off-by: Leo Sendelbach <[email protected]>
@Leo-Send
Contributor Author

I have identified the issue with the showcase: since we manually add the column 'artifact.type' to data frames when constructing commit-interaction-based networks, it breaks when the existing data is empty. The empty data was not an issue for the showcase before; there were just warnings, but since we do not do anything with the networks anyway, everything worked. Now, I see two ways to fix this problem:

  1. Adding commit interaction data in the data path that the showcase uses (tests/codeface-data/results/testing/test_feature/feature)
  2. Checking in the network construction if the data is empty before adding a column

Which would you prefer @hechtlC ?

@bockthom
Collaborator

bockthom commented Jul 24, 2024

I don't know what @hechtlC prefers, but here a few thoughts from my side on your two suggestions:

2. Checking in the network construction if the data is empty before adding a column

I don't fully understand what this column is for. Is it just used for network construction or is it also available to the end user? If it is available for the end user, I would like to make sure that the expected columns are there, independent of empty data or not. If it is just used internally for network construction and does not have any effect on the resulting network, then a check for empty data before adding a column might make sense (but I am not sure).

1. Adding commit interaction data in the data path that the showcase uses (tests/codeface-data/results/testing/test_feature/feature)

Regarding adding commit interaction data: I am not sure. Is it just an issue of the non-existent file, or of the non-existent data (as it would also be in an empty file)? Missing data should not lead to failing network construction. So, even when adding more test data would solve the problem, I don't think that this should be considered as a fix for the underlying problem.

@Leo-Send
Contributor Author

Leo-Send commented Jul 24, 2024

I know what you mean and will add the column to the empty data frames. I will then change the already existing part that is used for setting the correct value to fill this new column.
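
For readers unfamiliar with why an empty data frame breaks here, the following base-R sketch shows the failure and one way to add the column regardless of the number of rows. The column name `commit.hash` and the helper `add.artifact.type` are hypothetical; only the column name 'artifact.type' and the value 'CommitInteraction' are taken from this PR.

```r
## Empty commit-interaction data with a hypothetical column layout.
empty.data = data.frame(commit.hash = character(0), stringsAsFactors = FALSE)

## Assigning a length-one value to a zero-row data frame fails in base R
## ("replacement has 1 row, data has 0").
result = tryCatch({
    broken = empty.data
    broken$artifact.type = "CommitInteraction"
    "ok"
}, error = function(e) "error")

## Repeating the value once per row works for empty and non-empty data alike.
add.artifact.type = function(df, value = "CommitInteraction") {
    df[["artifact.type"]] = rep(value, nrow(df))
    return(df)
}

fixed = add.artifact.type(empty.data)
print(colnames(fixed))
```

With this approach the expected column exists even when there is no data, which matches the requirement that end users can rely on the column being present.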

On another note:
After a makeshift-fix for the showcase by adding the test data in the required places, the showcase still doesn't run and I do not understand why:

ERROR::The specified parameter 'project.data' of class [try-error] inherits from the wrong class. When calling 'NetworkBuilder$new' the parameter must inherit from one of the following classes: ProjectData

This is the error message. It occurs in line 230 in my code, in the lapply there (see below), even after uncommenting all changes I made to the showcase. When I manually check the type of cf.data, R says it doesn't know the type because of infinite recursion... Any ideas what I can do there?

my.networks = lapply(cf.data, function(range.data) {
    y = NetworkBuilder$new(project.data = range.data, network.conf = net.conf)
    return(y$get.author.network())
})

EDIT:
After rolling back to the state after the last commit that I pulled today, it works. So it has to be something I added that interacts with this splitting in some way that I do not understand.

@Leo-Send
Contributor Author

For some reason, adding the line

proj.conf$update.value("commit.interactions", TRUE)

breaks the splitting, meaning that the content of 'cf.data' after

mybins = c("2012-07-10 15:58:00", "2012-07-15 16:02:00", "2012-07-20 16:04:00", "2012-07-25 16:06:30")
cf.data = split.data.time.based(x.data, bins = mybins)

is total gibberish. This seems to be a problem in the splitting - I believe commit interaction data should be excluded from any splitting as it is not annotated with the relevant data for splitting anyway.

@bockthom
Collaborator

bockthom commented Jul 24, 2024

For some reason, adding the line

proj.conf$update.value("commit.interactions", TRUE)

breaks the splitting, meaning that the content of 'cf.data' after

mybins = c("2012-07-10 15:58:00", "2012-07-15 16:02:00", "2012-07-20 16:04:00", "2012-07-25 16:06:30")
cf.data = split.data.time.based(x.data, bins = mybins)

is total gibberish. This seems to be a problem in the splitting - I believe commit interaction data should be excluded from any splitting as it is not annotated with the relevant data for splitting anyway.

I agree with @Leo-Send: commit-interaction data should not be split. Therefore, commit-interaction data should be considered as additional data sources that are ignored by splitting and added to each range data object as a whole, such as pasta data or author data. According to the following line, commit-interaction data are correctly categorized as additional data sources:

coronet/util-data.R

Lines 1895 to 1902 in 74ebe0b

"only.additional" = list(
"authors" = "authors",
"commit.messages" = "commit.messages",
"synchronicity" = "synchronicity",
"pasta" = "pasta",
"commit.interactions" = "commit.interactions",
"custom.event.timestamps" = "custom.event.timestamps"
)

So, there might be some other location where commit-interaction data are considered to be in the wrong category... Maybe there is a typo somewhere (using the plural version instead of the singular, or something similar, so that the data source is not recognized correctly...)

Is this something that has worked correctly before your rebase today? Then we might find the problem in the previous PR(s) that have been merged on dev in the meanwhile. Otherwise, the problem could be located anywhere else in the data or splitting module...

If you could trace down the problem until where splitting breaks, we could have a closer look at that particular part of the code. Our expert for splitting is @MaLoefUDS, maybe he also has an idea on what's going on there?

Added linebreaks, fixed spelling, removed cbind

Signed-off-by: Leo Sendelbach <[email protected]>
@Leo-Send
Contributor Author

Leo-Send commented Jul 25, 2024

I am even more confused now - checking older commits, I found that adding this line to the showcase has broken the splitting ever since commit interactions were added. But I added the commit network using commit interactions to the showcase a month ago - and I am very sure that I tried it and that the CI also worked (apart from the old R version that is now removed). Since I rebased and force-pushed, I do not know how to access the CI run reports from old commits, so I cannot confirm it currently.

EDIT:
I checked my commit message from the commit that modified the showcase (f9b3293) and I did mention fixing something in the showcase - so I definitely ran it and this error did not occur back then.

EDIT 2:
I know why it did not happen before - because there was no data to begin with, which I now added locally to test something else in the showcase. I will look into how to exclude commit interaction data from splitting.

@bockthom
Collaborator

I will look into how to exclude commit interaction data from splitting.

As already said yesterday: Commit-interaction data are already in the correct category that is excluded from splitting, as the following snippet shows:

coronet/util-data.R

Lines 1895 to 1902 in 74ebe0b

"only.additional" = list(
"authors" = "authors",
"commit.messages" = "commit.messages",
"synchronicity" = "synchronicity",
"pasta" = "pasta",
"commit.interactions" = "commit.interactions",
"custom.event.timestamps" = "custom.event.timestamps"
)

So, either there is some typo somewhere in the used spelling variants of "commit interaction(s)", or there is something else going on that is totally weird. Again, if you can provide the exact line at which it breaks, we can have a look why this line is executed at all.

@Leo-Send
Contributor Author

Leo-Send commented Jul 25, 2024

Okay, so I found some weird stuff going on. As I am not an expert on how splitting works, please correct me if I am misunderstanding something.

For this bug to occur in the showcase, you need to copy the 'commit.interactions.yaml' from the testing/test_proximity/proximity folder to the testing/test_feature/feature folder and its 002-v2-v3 subfolder.

If you want to replicate my 'testing without commit interactions in showcase', comment out line 78 and change the value in line 69 to "cochange".

  1. When testing without commit interactions in showcase and setting 'browser()' statements in the function 'split.data.by.time.or.bins', I found that the 'Re-arranging data' part around line 1025 sets the content of 'data.split' to NULL. It does this regardless of the commit interactions. Is this intended?

  2. The only difference between the construction of the RangeData objects in the two runs with and without commit interaction data is the content of 'additional.data.sources'. I am currently looking into the possibility that there is an infinite recursion somewhere.

EDIT:
There is indeed an infinite recursion, although I do not understand why. The setting (or getting) of commit interactions calls 'update.commit.interactions' which in turn calls 'get.commits' which calls 'set.commits' which again calls 'update.commit.interactions'. This SHOULD be prevented by caching (as it does normally... In my tests for commit interactions there is no infinite recursion and I also call 'get.commit.interactions' on a new projectData object). The only difference I can spot is that of RangeData vs ProjectData, although I am unsure if that can have anything to do with it.

Collaborator

@bockthom bockthom left a comment

Thanks for this PR @Leo-Send. I have reviewed your changes (except for the tests and showcase) and found a couple of minor inconsistencies or cases in which I am not sure about whether the current implementation is consistent or not. Please see my detailed comments below. Please don't see every comment as a need to change something - in some cases I just would like you to check similar cases in the existing implementation before deciding on whether we should change something or not.

In addition, could you please rebase the last two commits?

  • The second sentence of the commit message of "Add 'artifact.type' to commit interaction data" contains the word "until" twice; and I also don't understand the sentence. Could you please rephrase that?
  • Fix crash in showcase and add update tests: I guess this commit will not be present in the final PR any more, but I would like to mention it though: First, the "add update tests" part sounds broken. Second, the commit message does neither describe what's going on there, nor what the problem is that is fixed by this commit. "Crash in showcase" could be everything, but the problem is not the showcase, but something else in coronet that causes the showcase to crash ;) But I guess this will change anyway after we have figured out the actual problem there.

util-data.R Outdated Show resolved Hide resolved
util-data.R Show resolved Hide resolved
util-networks-covariates.R Outdated Show resolved Hide resolved
util-networks-covariates.R Outdated Show resolved Hide resolved
util-networks-covariates.R Show resolved Hide resolved
util-networks.R Show resolved Hide resolved
util-networks.R Outdated Show resolved Hide resolved
tests/test-read.R Outdated Show resolved Hide resolved
tests/test-read.R Outdated Show resolved Hide resolved
showcase.R Outdated Show resolved Hide resolved
@bockthom
Collaborator

There is indeed an infinite recursion, although I do not understand why. The setting (or getting) of commit interactions calls 'update.commit.interactions' which in turn calls 'get.commits' which calls 'set.commits' which again calls 'update.commit.interactions'. This SHOULD be prevented by caching (as it does normally... In my tests for commit interactions there is no infinite recursion and I also call 'get.commit.interactions' on a new projectData object). The only difference I can spot is that of RangeData vs ProjectData, although I am unsure if that can have anything to do with it.

In general, it should not make a difference whether you use a RangeData or a ProjectData object. But maybe there is some inconsistency regarding caching additional data sources. I suggest to check how the different additional data sources are handled? Maybe there is a different behavior because commit-interaction data are linked to the commit data, which is not the case for author data, for example. Maybe the connection between commit data and commit-interaction data leads to this infinite loop? Maybe you could also check what's different to how PaStA data are handled (but, if I am not mistaken, the relationship between PaStA and commit data is the other way round, which might not trigger the problem...)

Unfortunately, I have no further ideas.

@maxloeffler
Contributor

Hey, sorry for the late response. I have some nice insight on the issue regarding the splitting in showcase.R that "breaks" which @Leo-Send found here:

When testing without commit interactions in showcase and setting 'browser()' statements in the function 'split.data.by.time.or.bins', I found that the 'Re-arranging data' part around line 1025 sets the content of 'data.split' to NULL. It does this regardless of the commit interactions. Is this intended?

First of all, this has nothing to do with the selected data source or with whether commit interactions are present (as Leo already mentioned). Interestingly, this broken application has been in showcase.R for a really long time. I guess nobody really cared because the only use of the split data is plotting it once, which does not break directly.

The reason for the break is that for some data sources, some of their elements are before the predefined bins by which we split. When we split by given bins, we use the base::findInterval method to separate all data elements into the corresponding bins they belong, i.e., it returns a vector of integers, of which each integer describes the index of the bin in which we should place an item (e.g., 1 1 1 2 2 3 3 3 3 3 6). Then we split according to this vector and rename all splits according to the datestring of the bin.

However, the findInterval function returns 0 for data points that do not lie in any bin (i.e., before the first bin). Consequently, when trying to find the name for said split, we index into the vector of bin labels with 0, resulting in character(0). This in the end leads to the breakage observed. This might actually require a fix from our side. I believe splitting should not break like this when some elements are not in any bin.
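
The behavior described above can be reproduced with base R alone. The sketch below reuses the bin boundaries from the showcase snippet quoted earlier in this thread; the three data timestamps and the variable names are made up for illustration and are not taken from coronet's splitting code.

```r
## Bin boundaries from the showcase snippet; the data timestamps are made up.
bins = as.POSIXct(c("2012-07-10 15:58:00", "2012-07-15 16:02:00",
                    "2012-07-20 16:04:00", "2012-07-25 16:06:30"), tz = "UTC")
dates = as.POSIXct(c("2012-07-01 12:00:00", "2012-07-12 09:00:00",
                     "2012-07-22 18:30:00"), tz = "UTC")

## findInterval works on numeric input, so convert the timestamps explicitly.
## The first date lies before the first bin and is therefore mapped to 0.
bin.index = findInterval(as.numeric(dates), as.numeric(bins))
print(bin.index)

## Indexing the labels with 0 silently yields an empty character vector.
bin.labels = format(bins)
label.of.first = bin.labels[bin.index[1]]
print(length(label.of.first))
```

Because indexing with 0 neither fails nor warns, the problem only surfaces later, when the empty label is used as the name of a split.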

@bockthom
Collaborator

bockthom commented Jul 30, 2024

Good catch @MaLoefUDS! Thanks for looking into this! I agree with you that splitting should not break when there are elements that are not part of a bin. I guess that this can happen in two different ways: Prior to the first bin, and after the last bin. Do we already handle the case "after the last bin" correctly? If so, we could try to handle "prior to the first bin" similarly. If we don't handle either of them up until now, I suggest to open a new issue for that and discuss potential solutions there, not to mess up this PR more than we currently already do...

@Leo-Send: Independent of the splitting bug that @MaLoefUDS has identified, the infinite recursion seems to be a different problem that also needs to be fixed... Is there a way to fix this problem independent of the splitting bug, or do the two problems depend on each other? That is, does the infinite recursion occur depending on the splitting bug, or does it occur independently?

@Leo-Send
Contributor Author

Leo-Send commented Jul 31, 2024

I found the reason for the infinite recursion, and it does not seem to be directly related to what @MaLoefUDS found, although I believe both originate from the problem that we do not verify the bins when splitting.

This infinite recursion is supposed to be stopped by checks whether the data source is already cached: the method set.commits first sets the field for storing commits and then calls update.commit.interactions, and thus does not get called in update.commit.interactions again, as the commits are cached. This does not, however, work if the commit data is empty. This specific call to splitting in the showcase results in one RangeData object having empty commit data, thus leading to this recursion.

I see two ways to handle this:

  1. disallow bins which would result in empty commit data. I believe this should be done if it is generally unwanted to ever have empty commit data.
  2. remove the potential for this circular recursion entirely, by checking who the caller of the function was. This would of course only handle this problem in commit interactions and not any other problem that might arise when commit data is empty.
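
The call chain and the caller check from option 2 can be sketched in a few lines of base R. This is a toy model, not coronet's implementation: `make.toy.data`, the `guard.caller` flag, the explicit `caller` argument (standing in for inspecting the call stack), the assumed cache check that treats empty data as "not cached", and the cut-off after 50 updates are all assumptions made for illustration.

```r
## Toy object mimicking the get/set/update cycle described above.
make.toy.data = function(commits, guard.caller) {
    self = new.env()
    self$cache = NULL
    self$updates = 0

    self$set.commits = function(data) {
        self$cache = data
        self$update.commit.interactions(caller = "set.commits")
    }
    self$get.commits = function() {
        ## Assumed cache check: empty data looks like "not cached yet".
        if (is.null(self$cache) || nrow(self$cache) == 0) {
            self$set.commits(commits)
        }
        return(self$cache)
    }
    self$update.commit.interactions = function(caller = "") {
        self$updates = self$updates + 1
        ## Safety net for this demo so the unguarded case terminates.
        if (self$updates > 50) {
            stop("runaway recursion")
        }
        ## Guard: the setter has just stored the commits, so do not fetch again.
        if (guard.caller && caller == "set.commits") {
            return(invisible(NULL))
        }
        self$get.commits()
        return(invisible(NULL))
    }
    return(self)
}

## Returns the number of updates, or "recursion" if the cycle did not stop.
run = function(commits, guard.caller) {
    obj = make.toy.data(commits, guard.caller)
    tryCatch({
        obj$get.commits()
        obj$updates
    }, error = function(e) "recursion")
}

non.empty = data.frame(hash = "c1", stringsAsFactors = FALSE)
empty = non.empty[0, , drop = FALSE]

print(run(non.empty, guard.caller = FALSE))
print(run(empty, guard.caller = FALSE))
print(run(empty, guard.caller = TRUE))
```

In this model, non-empty data terminates after a single update because the cache check succeeds, empty data without the guard keeps cycling until the safety net stops it, and the guard ends the cycle for empty data as well.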

@maxloeffler
Contributor

Regarding fixing this problem of broken splitting which may or may not be the sole cause of Leo's problems, I have two possible fixes in mind.

But first off: Currently we do not have any problems with data elements that lie beyond the last bin. The bins we provide are always just the first date bound. The next following bin is then the last date bound, while also being the first date of its own new bin. In the current implementation, the last bin then holds all elements from this point on until infinity. Obviously, we cannot simply include all elements before the first bin in the first bin as well, as this would lead to unintuitive behavior (i.e., what would it mean to provide only one bin when splitting?).

Here are my ideas for fixes:

  1. Ignore all elements before the first bin and disregard them. We can simply filter the bin assignment vector and filter out all 0s.
  2. Create an artificial bin before the first actual bin with some made up name (?) that holds all elements before the first bin. This is also easy to implement as we can just rename the bin of character(0) to whatever.
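
Both ideas can be sketched on a bin assignment vector as returned by findInterval. The labels, the values, and in particular the name "before-first-bin" are made up for illustration; this is not how coronet's splitting is implemented.

```r
## Hypothetical bin labels, data values, and their bin assignment (0 = before first bin).
bin.labels = c("bin-1", "bin-2", "bin-3")
values = c("early", "a", "b", "c")
bin.index = c(0L, 1L, 1L, 3L)

## Option 1: drop everything that falls before the first bin.
keep = bin.index > 0
split.dropped = split(values[keep], bin.labels[bin.index[keep]])

## Option 2: collect those elements in an artificial leading bin.
## Shifting the index by one makes 0 point at the artificial label.
labels.with.pre = c("before-first-bin", bin.labels)
split.kept = split(values, labels.with.pre[bin.index + 1])

print(names(split.dropped))
print(names(split.kept))
```

Option 1 silently loses the element "early", whereas option 2 keeps it under a name that is not a real date range, so downstream code would need to know about that special label.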

I don't have any preference on this, since I am not sure which idiomatic values of coronet are at stake 😅

@bockthom
Collaborator

Now it is getting confusing here. We discuss two different problems and both of you, @Leo-Send and @MaLoefUDS, have suggested two potential solutions for one of the two problems...


Let's start with the infinite recursion spotted by @Leo-Send:

I see two ways to handle this:

1. disallow bins which would result in empty commit data. I believe this should be done if it is generally unwanted to ever have empty commit data.

2. remove the potential for this circular recursion entirely, by checking who the caller of the function was. This would of course only handle this problem in commit interactions and not any other problem that might arise when commit data is empty.

Option 1. is not a valid solution. There always can be time ranges without any commit and we still want to keep this range.

Option 2. sounds promising. When choosing option 2, we also need to check whether there are other cases which lead to a similar problem. Either we are able to find a common solution that handles all these cases, or we need to handle each of them separately.

Maybe you can think about that until our meeting, and then let's discuss the potential options in our meeting.


Regarding the problems occurring during splitting @MaLoefUDS:

We need to figure out in which situations the elements before the first bin are actually needed. This is the key question to this problem, and I don't have an answer to this question. If there are no such situations, we could potentially remove them. Otherwise we need to find a way to keep them. But first we need to figure out whether there is any use case in which the elements before the first bin are relevant.

As this problem seems to be independent of @Leo-Send's problem (as the infinite recursions can occur on any empty commit data), let's continue the discussion on splitting in a separate issue.

@bockthom
Collaborator

I've just transferred the comments that discuss the splitting problem into a new issue #267. Please continue discussing the splitting problem there.

Let's stick to the changes of this PR and the discussion of the infinite recursion here.

Networks based on commit interaction data now correctly have an edge
attribute called 'artifact.type'.
Value of column 'artifact.type' in commit interaction data is
'CommitInteraction' until potentially overwritten in
artifact network construction

Signed-off-by: Leo Sendelbach <[email protected]>
Add check for calling function in the beginning of
'update.commit.interactions'. Also contains minor fixes to address PR
comments and updates tests to reflect changes made in previous commit.

Signed-off-by: Leo Sendelbach <[email protected]>
Include this PR's changelog in the NEWS.md
Add constant for commit interaction artifact type
Move check for avoiding infinite recursion to the correct position and
add commentary

Signed-off-by: Leo Sendelbach <[email protected]>
Collaborator

@bockthom bockthom left a comment

I had a look at README, NEWS, and the most recent changes @Leo-Send.
Please see my comments below - there is one line in which we should make use of the new constant in the code; and there are some missing items in the README.
Everything else are just styling issues (missing spaces etc.).

util-read.R Outdated Show resolved Hide resolved
README.md Outdated Show resolved Hide resolved
README.md Outdated Show resolved Hide resolved
README.md Outdated Show resolved Hide resolved
README.md Outdated Show resolved Hide resolved
README.md Show resolved Hide resolved
NEWS.md Outdated Show resolved Hide resolved
NEWS.md Outdated Show resolved Hide resolved
NEWS.md Outdated Show resolved Hide resolved
NEWS.md Outdated Show resolved Hide resolved
Minor changes in response to reviews. Also added a use for constant
`ARTIFACT.COMMIT.INTERACTION` that was previously overlooked.

Signed-off-by: Leo Sendelbach <[email protected]>
@hechtlC hechtlC merged commit 55dc0cc into se-sic:dev Aug 28, 2024
9 checks passed
4 participants