
Collect file extension stats in gh-pages #12

Status: Open. abitrolly wants to merge 19 commits into base: main.
Conversation

abitrolly

No description provided.

@abitrolly (Author)

Counting lines of code takes 3 seconds and calculating file extensions takes 3 minutes.


It appears that the `git ls-files | xargs -n 1 basename > 2` command is very slow: 3m14s, compared to `git ls-files > 1`, which takes only 2.2s.
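As a sanity check on that bottleneck, one likely fix (a sketch on my part, not something in this PR) is to strip directory prefixes with a single long-running `sed` process instead of spawning `basename` once per path:

```shell
# Strip everything up to the last '/' in one sed process,
# instead of `xargs -n 1 basename`, which forks once per path.
# `basenames.txt` is a hypothetical output file name.
git ls-files | sed 's|.*/||' > basenames.txt
```

Since `sed` reads the whole stream in one process, the per-file fork/exec overhead disappears.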

@abitrolly (Author)

The `git ls-files` output for the Firefox repo is almost 22 million characters (note that `wc -c` counts bytes, not files), and `xargs -n 1` runs an external `basename` process once per file. Spawning a process per item looks like a major bottleneck for all shell pipelines.

✗ git ls-files | wc -c
21857117
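For the extension statistics themselves, a single `awk` pass over the file list avoids the per-file process overhead entirely. A hedged sketch (the output file name is hypothetical, and this is not the command used in the PR):

```shell
# Count file extensions in one pass: strip directories with sed, then
# tally the text after the last '.' in each remaining basename.
# Files without a dot (NF == 1) are skipped.
git ls-files \
  | sed 's|.*/||' \
  | awk -F. 'NF > 1 { count[$NF]++ } END { for (e in count) print count[e], e }' \
  | sort -rn > extension_stats.txt
```

The whole pipeline runs a fixed number of processes regardless of how many files the repository has.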


on:
  push:
    branches:
      - main
  pull_request:
  workflow_dispatch:
@4e6 (Owner)


Why do we need it?

@abitrolly (Author)


For testing workflow in branches other than main.


- name: Gather commits by day and file extension statistics
  run: |
    ./01stats.sh gecko-dev build
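For context, the commits-by-day half of such a step can be done with a single `git log` invocation. This is only an illustrative sketch, not the actual contents of `01stats.sh`, and the output file name is hypothetical:

```shell
# Tally commits per calendar day: print each commit's author date in
# YYYY-MM-DD form, then count duplicates.
git -C gecko-dev log --date=short --pretty=format:%ad \
  | sort | uniq -c > commits_by_day.txt
```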
@4e6 (Owner)


What do we get out of this build step? Also, I see that it takes a pretty long time to complete.

@abitrolly (Author)


The goal is to move the stats collection commands out of the GitHub Actions YAML, so that they can be run standalone.

As I continued experimenting in my main branch after opening this PR, more things started to creep in. The command that takes the most time is `git fetch --unshallow`, added in the last commit to start playing with historical data.

@4e6 (Owner) commented Jun 2, 2022

Hi! Thanks for your contribution. Just curious, what's the motivation behind these changes?

@abitrolly (Author)

@4e6 well, this PR is far from finished. The final goal was to get a dataset for diagrams of Firefox oxidization over time (like in #10). Because I didn't know the codebase, I started with the code that seemed easiest to get working, like counting file extensions over time.

Because a full git checkout alone takes a whole 15 minutes, going commit by commit probably won't be feasible in one CI run, so the plan is to collect the data month by month over multiple CI runs.

Maybe it will be faster than restoring `gecko-dev` history with `git fetch --unshallow`, which takes about 15 minutes.
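One way to sketch the month-by-month idea, assuming the remote supports the `deepen-since` protocol capability (git's `upload-pack` does by default), is to deepen a shallow clone incrementally with `--shallow-since`. The date below is just a placeholder:

```shell
# Deepen an existing shallow clone to include all commits since a given
# date, instead of fetching the entire history with --unshallow at once.
git fetch --shallow-since="2022-05-01" origin main
```

Each CI run could move the date back by one month, spreading the history download across runs.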
@abitrolly (Author)

Did a complete `gecko-dev` checkout through the action, and it took 10 minutes, while the stats script itself took only 12s. That's still too slow. Maybe the initial checkout can be cached.
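Besides caching, another option worth trying (an assumption on my part, not something tested in this PR) is a blobless partial clone, which transfers the full commit history up front but defers downloading file contents until checkout:

```shell
# Fetch full history but no blobs; file contents are downloaded lazily
# at checkout. Often much faster than a full clone for large repos.
git clone --filter=blob:none https://github.com/mozilla/gecko-dev
```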


https://github.com/abitrolly/firefox-lang-stats/runs/6713143940

@abitrolly (Author) commented Jun 2, 2022

Opened issue actions/checkout#818 to track possible solutions.
