Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Inefficient ProxyDatabase slicing #20

Open
DominikStiller opened this issue Sep 12, 2024 · 2 comments
Open

Inefficient ProxyDatabase slicing #20

DominikStiller opened this issue Sep 12, 2024 · 2 comments

Comments

@DominikStiller
Copy link
Contributor

DominikStiller commented Sep 12, 2024

Operations on a ProxyDatabase that loop over all records slow down/do not have constant iteration time (as measured by tqdm's it/s), which is especially noticeable for large databases. Take for example slicing:

cfr/cfr/proxy.py

Lines 1730 to 1746 in a99c13b

def slice(self, timespan):
''' Slice the records in the proxy database.
Args:
timespan (tuple or list):
The list of time points for slicing, whose length must be even.
When there are n time points, the output Series includes n/2 segments.
For example, if timespan = [a, b], then the sliced output includes one segment [a, b];
if timespan = [a, b, c, d], then the sliced output includes segment [a, b] and segment [c, d].
'''
new = ProxyDatabase()
for pid, pobj in tqdm(self.records.items(), total=self.nrec, desc='Slicing ProxyRecord'):
spobj = pobj.slice(timespan=timespan)
new += spobj
new.refresh()
return new

The operation new += spobj calls .refresh() every time, which itself iterates over all proxies. Therefore, it is O(n^2), while it could easily be O(n) if they were added to a dictionary first, then converted to a ProxyDatabase at the end. Is there a reason this is not being done?

@DominikStiller DominikStiller changed the title Slow ProxyDatabase slicing Inefficient ProxyDatabase slicing Sep 12, 2024
@fzhu2e
Copy link
Owner

fzhu2e commented Sep 12, 2024

Hi Dominik, thanks for pointing this out! I'm sure the performance of the codebase is not optimal and lots of improvements could be done. I developed most of the infrastructure from scratch in a relatively short time period, and the runtime speed was usually not the priority as long as it's not significantly slowing down the major tasks. Now that this project is open to the community, you are mostly welcome to implement those improvements and submit PRs. Once there is sufficient upgrade of the codebase in terms of features and performance, we may submit another manuscript to report the updates.

@DominikStiller
Copy link
Contributor Author

Thanks for clarifying, I'll submit a PR for this issue!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants