Download data for all packages on PyPI then slice into 8k files
hugovk committed Nov 30, 2024
1 parent 00a50ac commit e571c49
Showing 13 changed files with 24,086 additions and 23,988 deletions.
4 changes: 4 additions & 0 deletions .gitignore
@@ -99,3 +99,7 @@ ENV/

# mypy
.mypy_cache/

# Big unzipped files
top-pypi-packages-30-days-all.csv
top-pypi-packages-30-days-all.json
18 changes: 16 additions & 2 deletions .pre-commit-config.yaml
@@ -1,25 +1,39 @@
repos:
- repo: https://github.com/astral-sh/ruff-pre-commit
rev: v0.8.0
hooks:
- id: ruff
args: [--exit-non-zero-on-fix]

- repo: https://github.com/psf/black-pre-commit-mirror
rev: 24.10.0
hooks:
- id: black

- repo: https://github.com/pre-commit/pre-commit-hooks
rev: v5.0.0
hooks:
- id: check-added-large-files
exclude: top-pypi-packages-30-days-all.*
- id: check-case-conflict
- id: check-merge-conflict
- id: check-json
- id: check-toml
- id: check-yaml
- id: debug-statements
- id: end-of-file-fixer
- id: forbid-submodules
- id: trailing-whitespace

- repo: https://github.com/python-jsonschema/check-jsonschema
rev: 0.29.3
rev: 0.29.4
hooks:
- id: check-github-workflows


- repo: meta
hooks:
- id: check-hooks-apply
- id: check-useless-excludes

ci:
autoupdate_schedule: quarterly
29 changes: 29 additions & 0 deletions .ruff.toml
@@ -0,0 +1,29 @@
fix = true

lint.select = [
"C4", # flake8-comprehensions
"E", # pycodestyle
"EM", # flake8-errmsg
"F", # pyflakes
"I", # isort
"ICN", # flake8-import-conventions
"ISC", # flake8-implicit-str-concat
"LOG", # flake8-logging
"PGH", # pygrep-hooks
"PT", # flake8-pytest-style
"PYI", # flake8-pyi
"RUF022", # unsorted-dunder-all
"RUF100", # unused noqa (yesqa)
"S", # flake8-bandit
"UP", # pyupgrade
"W", # pycodestyle
"YTT", # flake8-2020
]
lint.ignore = [
"E203", # Whitespace before ':'
"E221", # Multiple spaces before operator
"E226", # Missing whitespace around arithmetic operator
"E241", # Multiple spaces after ','
"UP038", # Makes code slower and more verbose
]
lint.isort.required-imports = [ "from __future__ import annotations" ]
4 changes: 2 additions & 2 deletions README.md
@@ -18,12 +18,12 @@ Old versions can be found in [releases](https://github.com/hugovk/top-pypi-packa

From cron, it runs pypinfo to dump JSON and commit back to this repo.

-### Install jq
+### Install jq and zip

For example on Ubuntu 22.04:

```bash
-sudo apt-get install jq
+sudo apt-get install jq zip
```

### Install and set up pypinfo
4 changes: 4 additions & 0 deletions build.sh
@@ -12,6 +12,10 @@ git pull origin main
# Generate the files
bash generate.sh

# Remove big unzipped file
rm top-pypi-packages-30-days-all.csv
rm top-pypi-packages-30-days-all.json

# Make output directory, don't fail if it exists
# mkdir -p build

16 changes: 14 additions & 2 deletions generate.sh
@@ -15,8 +15,20 @@ python3 -m pip install -U pypinfo
python3 -m pip --version
/home/botuser/.local/bin/pypinfo --version

+# Check if zip is installed
+if ! command -v zip &> /dev/null
+then
+    echo "zip could not be found, consider: apt install zip"
+    exit 1
+fi
+
# Generate and minify for 30 days
-/home/botuser/.local/bin/pypinfo --all --json --indent 0 --limit 8000 --days 30 "" project > top-pypi-packages-30-days.json
+/home/botuser/.local/bin/pypinfo --all --json --indent 0 --limit 10000000 --days 30 "" project > top-pypi-packages-30-days-all.json
+python3 trim.py > top-pypi-packages-30-days.json
jq -c . < top-pypi-packages-30-days.json > top-pypi-packages-30-days.min.json
+echo 'download_count,project' > top-pypi-packages-30-days-all.csv
echo 'download_count,project' > top-pypi-packages-30-days.csv
-jq -r '.rows[] | [.download_count, .project] | @csv' top-pypi-packages-30-days.json >> top-pypi-packages-30-days.csv
+jq -r '.rows[] | [.download_count, .project] | @csv' top-pypi-packages-30-days-all.json >> top-pypi-packages-30-days-all.csv
+jq -r '.rows[] | [.download_count, .project] | @csv' top-pypi-packages-30-days.json >> top-pypi-packages-30-days.csv
+zip top-pypi-packages-30-days-all.csv.zip top-pypi-packages-30-days-all.csv
+zip top-pypi-packages-30-days-all.json.zip top-pypi-packages-30-days-all.json
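
The contents of `trim.py` are not shown in this view of the commit, so the following is only a sketch of the "download everything, then slice to 8k" idea from the commit title. It assumes the pypinfo JSON has a top-level `rows` list whose entries carry `download_count` and `project` (the two fields the `jq` filters above read) and that the rows are already ordered by download count. The function names `trim_rows` and `rows_to_csv` and the sample data are hypothetical.

```python
from __future__ import annotations

import csv
import io
import json


def trim_rows(data: dict, limit: int = 8000) -> dict:
    """Return a copy of `data` that keeps only the first `limit` rows."""
    trimmed = dict(data)  # shallow copy, so other top-level keys are preserved
    trimmed["rows"] = data["rows"][:limit]
    return trimmed


def rows_to_csv(data: dict) -> str:
    """Render the rows as CSV, quoting strings much as jq's @csv does."""
    out = io.StringIO()
    out.write("download_count,project\n")  # same header the script echoes
    writer = csv.writer(out, quoting=csv.QUOTE_NONNUMERIC, lineterminator="\n")
    for row in data["rows"]:
        writer.writerow([row["download_count"], row["project"]])
    return out.getvalue()


# Tiny made-up sample; these are not real download counts.
sample = {
    "rows": [
        {"download_count": 300, "project": "alpha"},
        {"download_count": 200, "project": "beta"},
        {"download_count": 100, "project": "gamma"},
    ]
}
top_two = trim_rows(sample, limit=2)
print(json.dumps(top_two, indent=0))
print(rows_to_csv(top_two), end="")
```

Slicing after a single large query means the full and the 8,000-row outputs come from the same snapshot, which is presumably why the script now runs pypinfo once with a very high `--limit` instead of querying twice.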
3 changes: 2 additions & 1 deletion index.html
@@ -129,7 +129,8 @@ <h2 id="changelog">Changelog</h2>
<li>2021-07: Fetch data for 5,000 packages over only 30 days (<a href="https://github.com/hugovk/top-pypi-packages/pull/20">#20</a>)</li>
<li>2021-09: Fetch data for 8,000 packages (<a href="https://github.com/hugovk/top-pypi-packages/pull/30">#30</a>)</li>
<li>2024-05: Provide data in CSV in addition to JSON (<a href="https://github.com/hugovk/top-pypi-packages/issues/31">#31</a>)</li>
-<li>2024-11: Fetch data for all installers, not only pip (<a href="https://github.com/hugovk/top-pypi-packages/issues/39">#39</a>)</li>
+<li>2024-11: Fetch data for all PyPI packages (<a href="https://github.com/hugovk/top-pypi-packages/issues/41">#41</a>)
+and for installers, not only pip (<a href="https://github.com/hugovk/top-pypi-packages/issues/39">#39</a>)</li>
</ul>
</div>
<div class="col-sm-6">
Binary file added top-pypi-packages-30-days-all.csv.zip
Binary file not shown.
Binary file added top-pypi-packages-30-days-all.json.zip
Binary file not shown.