Commit
Showing 10 changed files with 1,081 additions and 0 deletions.
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,11 @@ | ||
Copyright 2023+ Hubert Tournier | ||
|
||
Redistribution and use in source and binary forms, with or without modification, are permitted provided that the following conditions are met: | ||
|
||
1. Redistributions of source code must retain the above copyright notice, this list of conditions and the following disclaimer. | ||
|
||
2. Redistributions in binary form must reproduce the above copyright notice, this list of conditions and the following disclaimer in the documentation and/or other materials provided with the distribution. | ||
|
||
3. Neither the name of the copyright holder nor the names of its contributors may be used to endorse or promote products derived from this software without specific prior written permission. | ||
|
||
THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT HOLDER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,60 @@ | ||
NAME=adsv | ||
SOURCES=src/${NAME}/__init__.py src/${NAME}/main.py src/${NAME}/dsv_library.py src/${NAME}/file_utilities.py src/${NAME}/string_utilities.py | ||
|
||
# Default action is to show this help message: | ||
.help: | ||
	@echo "Possible targets:" | ||
	@echo "  check-code      Verify PEP 8 compliance (lint)" | ||
	@echo "  check-security  Check for security issues (audit)" | ||
	@echo "  check-unused    Find unused code" | ||
	@echo "  check-version   Find required Python version" | ||
	@echo "  check-sloc      Count Source Lines of Code" | ||
	@echo "  checks          Run all the previous checks" | ||
	@echo "  format          Format code" | ||
	@echo "  package         Build package" | ||
	@echo "  upload-test     Upload the package to TestPyPI" | ||
	@echo "  upload          Upload the package to PyPI" | ||
	@echo "  distclean       Remove all generated files" | ||
|
||
check-code: /usr/local/bin/pylint | ||
	-pylint ${SOURCES} | ||
|
||
lint: check-code | ||
|
||
check-security: /usr/local/bin/bandit | ||
	-bandit -r ${SOURCES} | ||
|
||
audit: check-security | ||
|
||
check-unused: /usr/local/bin/vulture | ||
	-vulture --sort-by-size ${SOURCES} | ||
|
||
check-version: /usr/local/bin/vermin | ||
	-vermin ${SOURCES} | ||
|
||
check-sloc: /usr/local/bin/pygount | ||
	-pygount --format=summary . | ||
|
||
checks: check-code check-security check-unused check-version check-sloc | ||
|
||
format: /usr/local/bin/black | ||
	black ${SOURCES} | ||
|
||
love: | ||
	@echo "Not war!" | ||
|
||
man/${NAME}.1.gz: man/${NAME}.1 | ||
	@gzip -k9c man/${NAME}.1 > man/${NAME}.1.gz | ||
|
||
package: man/${NAME}.1.gz | ||
	python -m build | ||
|
||
upload-test: | ||
	python -m twine upload --repository testpypi dist/* | ||
|
||
upload: | ||
	python -m twine upload dist/* | ||
|
||
distclean: | ||
	rm -rf build dist src/*.egg-info man/${NAME}.1.gz | ||
|
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,165 @@ | ||
.Dd January 22, 2023 | ||
.Dt ADSV 1 | ||
.Os | ||
.Sh NAME | ||
.Nm adsv | ||
.Nd Analyze delimiter-separated values files | ||
.Sh SYNOPSIS | ||
.Nm | ||
.Op Fl d|--delimiter Ar CHAR | ||
.Op Fl e|--encoding Ar STRING | ||
.Op Fl f|--fields Ar LIST | ||
.Op Fl F|--flatten | ||
.Op Fl h|--hide Ar INT | ||
.Op Fl m|--min Ar INT | ||
.Op Fl M|--max Ar INT | ||
.Op Fl t|--top Ar INT | ||
.Op Fl -debug | ||
.Op Fl -help|-? | ||
.Op Fl -version | ||
.Op Fl - | ||
.Ar filename | ||
.Op Ar ... | ||
.Sh DESCRIPTION | ||
The | ||
.Nm | ||
utility analyzes | ||
.Em delimiter-separated values | ||
files, such as Comma-Separated Values (.csv) or Tab-Separated Values (.tsv) files, | ||
and either prints information about their structure and the data in each of their fields, | ||
or prints a selection of fields in the order requested. | ||
.Pp | ||
The information gathered is: | ||
.Bl -bullet | ||
.It | ||
for the file: | ||
.Bl -bullet | ||
.It | ||
the character set encoding | ||
.It | ||
the CSV dialect (the characters used for delimiting, quoting, escaping and line termination, and whether double quoting is used) | ||
.It | ||
whether a header line is present | ||
.It | ||
the number of lines and fields | ||
.El | ||
.It | ||
for each field: | ||
.Bl -bullet | ||
.It | ||
its number and header | ||
.It | ||
the number of distinct values | ||
.It | ||
the type of the values (strings, integers, floating-point numbers, complex numbers, or dates and times in any format) | ||
.It | ||
the values by descending count | ||
.It | ||
the range of values in ascending order, using the detected type (useful for numbers and dates) | ||
.El | ||
.El | ||
.Pp | ||
When analyzing a DSV dataset, this provides a quick, automated way to get an overview of the contents and to explore any oddities. | ||
.Pp | ||
There are options: | ||
.Bl -bullet | ||
.It | ||
to control and limit what is printed ( | ||
.Fl h|--hide , | ||
.Fl m|--min , | ||
.Fl M|--max | ||
and | ||
.Fl t|--top | ||
), | ||
.It | ||
to avoid (or correct) the detection of the character set encoding and delimiter ( | ||
.Fl d|--delimiter , | ||
.Fl e|--encoding | ||
): | ||
.Bl -bullet | ||
.It | ||
character set detection can take a long time with big files, so if you know that the file is encoded in "Windows-1252" or "utf-8", it is quicker to specify the encoding explicitly. | ||
.El | ||
.El | ||
.Pp | ||
If you use the | ||
.Fl f|--fields | ||
option, you'll skip printing the file analysis, and instead print the selected fields in the order requested, | ||
using the detected delimiting, quoting, escaping and line terminating characters. | ||
.Pp | ||
If you encounter multi-line fields and want to "flatten" them to single lines, you can use the | ||
.Fl F|--flatten | ||
option for that. | ||
.Ss OPTIONS | ||
.Op Fl d|--delimiter Ar CHAR | ||
Specify delimiter to be CHAR | ||
.Pp | ||
.Op Fl e|--encoding Ar STRING | ||
Specify charset encoding to be STRING (because detecting encoding can take a long time!) | ||
.Pp | ||
.Op Fl f|--fields Ar LIST | ||
Extract the values of the fields in LIST, in the given order (e.g. 6,2-4,1, with fields numbered from 1) | ||
.Pp | ||
.Op Fl F|--flatten | ||
Flatten multi-line fields into single lines | ||
.Pp | ||
.Op Fl h|--hide Ar INT | ||
Hide the display of distinct values above INT % (default is 20%) | ||
.Pp | ||
.Op Fl m|--min Ar INT | ||
Only display distinct values whose count >= INT (default is to display all distinct values) | ||
.Pp | ||
.Op Fl M|--max Ar INT | ||
Only display INT lines of distinct values (default is to display all distinct values, within the hide limit) | ||
.Pp | ||
.Op Fl t|--top Ar INT | ||
Only display the top/bottom INT lines of values (default is to display the 5 bottom and top lines) | ||
.Pp | ||
.Op Fl -debug | ||
Enable debug mode | ||
.Pp | ||
.Op Fl -help|-? | ||
Print usage and this help message and exit | ||
.Pp | ||
.Op Fl -version | ||
Print version and exit | ||
.Pp | ||
.Op Fl - | ||
Options processing terminator | ||
.Sh ENVIRONMENT | ||
The | ||
.Ev ADSV_DEBUG | ||
environment variable can also be set to any value to enable debug mode. | ||
.Sh EXIT STATUS | ||
.Ex -std adsv | ||
.Sh SEE ALSO | ||
.Xr cut 1 , | ||
.Xr file 1 | ||
.Sh STANDARDS | ||
The | ||
.Nm | ||
utility is not a standard UNIX command. | ||
.Pp | ||
This implementation tries to follow the PEP 8 style guide for Python code. | ||
.Pp | ||
The DSV dialects that can be handled are those compatible with | ||
.Em RFC 4180: Common Format and MIME Type for Comma-Separated Values (CSV) Files . | ||
.Sh PORTABILITY | ||
Tested OK under Windows. | ||
.Sh HISTORY | ||
This implementation was made for the | ||
.Lk https://github.com/HubTou/PNU PNU project | ||
.Pp | ||
I do this kind of analysis with each dataset I have to work with. | ||
Last time I did that, I decided that it was about time to fully automate the process, | ||
especially as I was working with fields containing multi-line values. | ||
.Sh LICENSE | ||
It is available under the 3-clause BSD license. | ||
.Sh AUTHORS | ||
.An Hubert Tournier | ||
.Sh CAVEATS | ||
Using "Sep=X" as a first line in order to set the X character as a delimiter is not supported. | ||
.Pp | ||
Comment lines inside the data are not supported either (for example, in | ||
.Pa /etc/passwd | ||
files under Unix), but they are not part of any recognized DSV dialect anyway. |
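The field-selection behaviour described in the man page above (the `-f|--fields` option, with a LIST such as `6,2-4,1` and output written with the detected dialect) can be sketched as follows. This is only an illustration under stated assumptions: the actual `adsv` source files are not shown in this part of the commit, `parse_field_list` and `select_fields` are hypothetical helper names, the standard library's `csv.Sniffer` stands in for whatever dialect detection the tool really uses, and rejecting descending ranges is a guess rather than documented behaviour.

```python
import csv
import io


def parse_field_list(spec):
    """Expand a LIST such as "6,2-4,1" into 1-based field numbers, in order."""
    fields = []
    for part in spec.split(","):
        part = part.strip()
        if "-" in part:
            # "2-4" is an inclusive range of field numbers.
            first, last = (int(bound) for bound in part.split("-", 1))
            if first > last:
                # Assumption: descending ranges are treated as errors here.
                raise ValueError(f"invalid range: {part!r}")
            fields.extend(range(first, last + 1))
        else:
            fields.append(int(part))
    if any(number < 1 for number in fields):
        raise ValueError("fields are numbered from 1")
    return fields


def select_fields(row, fields):
    """Return the requested fields of one row, in the requested order."""
    return [row[number - 1] for number in fields]


# Hypothetical sample data using ";" as the delimiter.
sample = "name;age;city\nAlice;30;Paris\nBob;25;Lyon\n"

# Let the standard library guess the dialect (a stand-in for adsv's detection).
dialect = csv.Sniffer().sniff(sample, delimiters=";,\t|")

# Re-emit fields 3 and 1, in that order, using the detected dialect.
fields = parse_field_list("3,1")
output = io.StringIO()
writer = csv.writer(output, dialect)
for row in csv.reader(io.StringIO(sample), dialect):
    writer.writerow(select_fields(row, fields))

print(output.getvalue(), end="")
```

Keeping the parsing of LIST separate from the row selection makes it easy to validate the option once, before any data is read.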
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,6 @@ | ||
[build-system] | ||
requires = [ | ||
"setuptools>=42", | ||
"wheel" | ||
] | ||
build-backend = "setuptools.build_meta" |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,56 @@ | ||
[metadata] | ||
name = pnu_adsv | ||
description = Analyze delimiter-separated values files | ||
long_description = file: README.md | ||
long_description_content_type = text/markdown | ||
version = 1.0.0 | ||
license = BSD 3-Clause License | ||
license_files = License | ||
author = Hubert Tournier | ||
author_email = [email protected] | ||
url = https://github.com/HubTou/adsv/ | ||
project_urls = | ||
    Bug Tracker = https://github.com/HubTou/adsv/issues | ||
keywords = pnu-project | ||
classifiers = | ||
    Development Status :: 5 - Production/Stable | ||
    Environment :: Console | ||
    Intended Audience :: Developers | ||
    Intended Audience :: End Users/Desktop | ||
    License :: OSI Approved :: BSD License | ||
    Natural Language :: English | ||
    Operating System :: OS Independent | ||
    Operating System :: POSIX :: BSD :: FreeBSD | ||
    Operating System :: Microsoft :: Windows | ||
    Programming Language :: Python :: 3 | ||
    Programming Language :: Python :: 3.6 | ||
    Programming Language :: Python :: 3.7 | ||
    Programming Language :: Python :: 3.8 | ||
    Programming Language :: Python :: 3.9 | ||
    Programming Language :: Python :: 3.10 | ||
    Programming Language :: Python :: 3.11 | ||
    Topic :: Software Development :: Libraries :: Python Modules | ||
    Topic :: System | ||
    Topic :: Utilities | ||
|
||
[options] | ||
package_dir = | ||
    = src | ||
packages = find: | ||
python_requires = >=3.6 | ||
install_requires = | ||
    pnu-libpnu | ||
    chardet | ||
    python-dateutil | ||
|
||
[options.packages.find] | ||
where = src | ||
|
||
[options.entry_points] | ||
console_scripts = | ||
    adsv = adsv:main | ||
|
||
[options.data_files] | ||
man/man1 = | ||
    man/adsv.1.gz | ||
|
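The `console_scripts` entry `adsv = adsv:main` in the configuration above is an object reference of the form `module:attribute`. As a rough sketch of how such a reference is typically resolved (import the module, then follow the attribute path), the hypothetical helper `resolve_object_reference` below mimics that lookup. It is demonstrated with standard-library references only, since the `adsv` package may not be installed where this runs.

```python
import importlib


def resolve_object_reference(reference):
    """Resolve a "module:attribute" reference to the object it names."""
    module_name, _, attribute_path = reference.partition(":")
    target = importlib.import_module(module_name)
    # The attribute part may be dotted, e.g. "package.module:Class.method".
    for name in filter(None, attribute_path.split(".")):
        target = getattr(target, name)
    return target


# Standard-library references used purely for illustration.
dumps = resolve_object_reference("json:dumps")
print(dumps({"tool": "adsv"}))
```

Under this reading, the installed `adsv` command would call the `main` attribute of the `adsv` package, which the package's `__init__.py` presumably re-exports through `from .main import *`; the `main.py` file itself is not shown in this part of the commit.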
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,2 @@ | ||
"""Wrapper for the source code files""" | ||
from .main import * |