Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Cumulative ranges #96

Merged
merged 26 commits into from
Feb 27, 2018
Merged

Cumulative ranges #96

merged 26 commits into from
Feb 27, 2018

Conversation

clhunsen
Copy link
Collaborator

With the list of patches in this PR, we add the possibility to construct cumulative ranges as proposed by @bockthom and @ecklbarb in issue #61.

To construct ranges properly, we introduce a bunch of functions construct.*.ranges in the misc module to construct ranges from a start date and an end date. This includes overlapping ranges and cumulative ranges. Additionally, the aggregate.ranges function can be used to generate aggregated ranges from a list of existing ranges.
For splitting, we add the function split.*.time.based.by.ranges, which is able to split data or networks by a list of ranges (rather than bins as in split.*.time.based).

Existing functions are (partly) refactored to use the new functions. The test suite has been extended accordingly.

Important: The time periods for splitting change slightly, as we changed completely to lubridate for all range and bin constructions. In detail, "1 month" now means "30 days, 10 hours, and 30 minutes" (which is the average month) and "1 year" means "365 days and 6 hours" (as the average year).
We made this change as it improves the bin and range construction more reliable, comprehensible, and more uniform. Although, this change make the ranges less readable, but that is an easy prize to pay.

For better overview with respect to upcoming changes regarding #61, the
range-construction-related functions are all clustered in the misc
module now.

Signed-off-by: Claus Hunsen <[email protected]>
With this patch, it is now possible to obtain also raw ranges, i.e.,
tuples of revisions in comparison to formatted range strings.

Signed-off-by: Claus Hunsen <[email protected]>
When passing POSIXct object to 'construct.ranges' that denote midnight,
the time was omitted by mistake. This is now corrected by calling
'get.date.string' now.

Signed-off-by: Claus Hunsen <[email protected]>
As an request from issue #61, we add the function
'construct.overlapping.ranges' in this patch, with which reliably
overlapping ranges can be constructed. the user needs to pass a start
date, an end date, a time period for the ranges, and an overlap time
period. The ranges can be requested in raw format as for
'construct.ranges' if needed.

Signed-off-by: Claus Hunsen <[email protected]>
As an request from issue #61, we add the function
'construct.cumulative.ranges' in this patch, with which reliably
cumulative ranges can be constructed. The user needs to pass a start
date, an end date, and a time period for the ranges. Each range will
then start at the start date, but end at different dates, in a
cumulative manner.
The ranges can be requested in raw format as for 'construct.ranges' if
needed.

Signed-off-by: Claus Hunsen <[email protected]>
As an request from issue #61, we add the function 'aggregate.ranges' in
this patch, with which ranges can be aggregated on various levels
reliably. The functionality is already internally used in
'split.data.by.networks'.

Fix documentation for 'split.data.by.networks'.

Signed-off-by: Claus Hunsen <[email protected]>
In the function 'split.data.by.networks', we can now use the newly added
function 'aggregate.ranges' to implement the functionality.

Signed-off-by: Claus Hunsen <[email protected]>
In this patch, we add the function 'construct.consecutive.ranges', with
which consecutive ranges can be constructed (with no overlap). The user
needs to pass a start date, an end date, and a time period for the
ranges.
The ranges can be requested in raw format as for 'construct.ranges' if
needed.

Signed-off-by: Claus Hunsen <[email protected]>
Using lubridate functionality, the function 'generate.date.sequence'
generates a date sequence based on the given parameters.

Signed-off-by: Claus Hunsen <[email protected]>
In the range construction, the last range's ending bound is *exclusive*,
thus, we need to add a second (1s) to the end date 'end'.

The documentation is improved to this end and fixed regarding a copy-
paste mistake.

Signed-off-by: Claus Hunsen <[email protected]>
For interoperability and reliability, we now use the function
'generate.date.sequence' to generate the bin bounds that are needed to
split any data in a time-based manner in 'split.get.bins.time.based'.

As, with lubridate, a year is now exactly 1 year and 6 hours to mitigate
leap years, the tests are adapted accordingly. Same thing holds for
months, which are 30 days, 10 hours, and 30 minutes long, on overage.

Signed-off-by: Claus Hunsen <[email protected]>
With the functions 'split.data.time.based.by.ranges' and
'split.network.time.based.by.ranges', it is now possible to split data
and networks, respectively, by a list of given ranges. The ranges are
applied

This fixes #61.

Signed-off-by: Claus Hunsen <[email protected]>
To increase code re-use and code maintainability, the function
'split.data.time.based.by.ranges' is now used to implement the
functionality of 'split.data.by.networks'.

Signed-off-by: Claus Hunsen <[email protected]>
Rearrange tests and introduce section headers.

Signed-off-by: Claus Hunsen <[email protected]>
Update the test suite for the splitting functionality with tests for
splitting by ranges.

Signed-off-by: Claus Hunsen <[email protected]>
We just forgot to handle the end date properly when generating
overlapping ranges.

Signed-off-by: Claus Hunsen <[email protected]>
The overlap for overlapping ranges must not exceed the time period that
each range should span.

Signed-off-by: Claus Hunsen <[email protected]>
We just forgot to handle the project end date properly when aggregating
ranges with aggregation level "complete".

In de752f8, the same fix is applied for
the construction of overlapping ranges.

Signed-off-by: Claus Hunsen <[email protected]>
As the project end date may come from the timestamp data in a
ProjectData instance, but the end date in any range is *exclusive*, we
need to add one second (1s) when using the aggregation level "complete".
The function was implemented correctly already, but the test was not
adapted to it.

Signed-off-by: Claus Hunsen <[email protected]>
When passing exclicit bins to the functions 'split.*.time.based', the
bins are taken as is. I.e., the last bin and, thus, the last range bound
is *exclusive*.

Signed-off-by: Claus Hunsen <[email protected]>
Before now, the function 'generate.date.sequence' was not usable with
date-time strings. Now, all given parameters are parsed as POSIXct
objects first, before using them in the function.

Signed-off-by: Claus Hunsen <[email protected]>
Signed-off-by: Claus Hunsen <[email protected]>
Copy link
Collaborator

@bockthom bockthom left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In general, the PR looks good! Thanks for your effort in implementing all that stuff.
Nevertheless, there are some issues regarding documentation and also regarding inclusiveness resp. exclusiveness of last ranges' ends (see individual comments).

util-misc.R Outdated
#'
#' @param start The start time as string or POSIXct object
#' @param end The start time as string or POSIXct object
#' @param time.period The time period describing the length of time between dates, a character
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The parameter time.period in the documentation does not exist, as the corresponding parameter is called by here.

util-split.R Outdated
@@ -410,6 +410,36 @@ split.data.by.networks = function(list.of.networks, project.data,
return(net.to.range.list)
}

#' Split the given data to the given ranges and return the resulting list.
#'
#' Note: You may want to use any function \code{generate.*.ranges} to obtain
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The functions are called construct.*.ranges and not generate.*.ranges.

util-split.R Outdated
@@ -675,6 +705,36 @@ split.network.activity.based = function(network, number.edges = 5000, number.win
return(networks)
}

#' Split the given network to the given ranges and return the resulting list.
#'
#' Note: You may want to use any function \code{generate.*.ranges} to obtain
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same as above.

util-misc.R Outdated
@@ -155,14 +155,17 @@ get.date.string = function(input) {
## / / / / / / / / / / / / / / / / / / / / / / / / / / / / / / / / / / / / /
## Range construction and handling -----------------------------------------

#' Construct the range strings.
#' Construct ranges from the given list/vector of revisions. If \code{raw} is *not*
#' \code{FALSE}, the ranges are construction in the format "rev[n]-rev[n+1]".
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The documentation here is hard to understand and -- as I understand it -- is wrong.
Instead of If \code{raw} is *not* \code{FALSE}, I would suggest If \code{raw} is \code{TRUE},.
So, if raw is TRUE, what happens then? I guess we do not get the specified format then...

util-misc.R Outdated
#' @param project.end the project end time as string or POSIXct object
#' @param aggregation.level One of \code{"range"}, \code{"cumulative"}, \code{"all.ranges"},
#' \code{"project.cumulative"}, \code{"project.all.ranges"}, and
#' \code{"complete"}. S7ee above for more details.
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

S7ee ;)

#' @param bins the date objects defining the start of ranges (the last date defines the end of the last range).
#' If set, the 'time.period' parameter is ignored.
#' @param bins the date objects defining the start of ranges (the last date defines the end of the last range, in an
#' *exclusive* manner). If set, the 'time.period' parameter is ignored.
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@clhunsen I am not enthusiastic about the new behaviour here.
The only case in which I currently use this function is to use bins extracted from previous splittings and apply it to other data resp. network splittings. In the end, I expect that both results use exactly the same ranges. As I cannot think of a use case in which one would deliberately use the end of the last range in an exclusive manner, I would suggest the following solution:

When using bins in those functions, automatically modify the last one (i.e., add 1 second) to achieve the expected ranges.
(If one would really need the last range in an exclusive manner, we should state in the documentation how to achieve that.)

As we have inclusive ranges everywhere, I would like to have that also here.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am not enthusiastic about the new behaviour here.

This is not new behaviour, I just fixed the documentation on that (afc96f4). The last bin was always used to be the end of the last range. And as such, it is exclusive.

As we have inclusive ranges everywhere, I would like to have that also here.

As stated above, the ranges/intervals have always been constructed like this: [range start, range end), i.e., including range start, but excluding range end. Has never been different.

The only case in which I currently use this function is to use bins extracted from previous splittings and apply it to other data resp. network splittings. In the end, I expect that both results use exactly the same ranges.

So, the bins reported as an attribute on a splitting result do respect that. This means, when you split a data object in its entirety, the last range is always like this: [some date in the middle, last activity in the data + 1 second) and, thus, the last bin reported in the attribute is last activity in the data + 1 second.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, thanks for the clarification. I just had a look at the documentation changes here, which seemed to describe a new behaviour due to the changes in several commits before. As the behaviour has not changed, everything is fine here. Sorry for my unwarranted complaint.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No worry. 😉

util-misc.R Outdated
## Range construction and handling -----------------------------------------

#' Construct ranges from the given list/vector of revisions. If \code{raw} is *not*
#' \code{FALSE}, the ranges are construction in the format "rev[n]-rev[n+1]".
Copy link
Collaborator

@bockthom bockthom Feb 26, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As the location of this code snippet has changed during the commits of the PR, see the corresponding comment here:
#96 (comment)

This patch introduces only some minor fixes and inprovements to the
documentation for the changes in PR #96.

Props to @bockthom for being that precise.

Signed-off-by: Claus Hunsen <[email protected]>
@bockthom bockthom merged commit e9183bd into se-sic:dev Feb 27, 2018
@clhunsen clhunsen deleted the cumulative-ranges branch February 27, 2018 08:13
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants