Add files via upload
HubTou authored Jan 22, 2023
1 parent 03bf012 commit c3b73a8
Showing 10 changed files with 1,081 additions and 0 deletions.
11 changes: 11 additions & 0 deletions License
@@ -0,0 +1,11 @@
Copyright 2023+ Hubert Tournier

Redistribution and use in source and binary forms, with or without modification, are permitted provided that the following conditions are met:

1. Redistributions of source code must retain the above copyright notice, this list of conditions and the following disclaimer.

2. Redistributions in binary form must reproduce the above copyright notice, this list of conditions and the following disclaimer in the documentation and/or other materials provided with the distribution.

3. Neither the name of the copyright holder nor the names of its contributors may be used to endorse or promote products derived from this software without specific prior written permission.

THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT HOLDER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
60 changes: 60 additions & 0 deletions Makefile
@@ -0,0 +1,60 @@
NAME=adsv
SOURCES=src/${NAME}/__init__.py src/${NAME}/main.py src/${NAME}/dsv_library.py src/${NAME}/file_utilities.py src/${NAME}/string_utilities.py

# Default action is to show this help message:
.help:
	@echo "Possible targets:"
	@echo " check-code Verify PEP 8 compliance (lint)"
	@echo " check-security Verify security issues (audit)"
	@echo " check-unused Find unused code"
	@echo " check-version Find required Python version"
	@echo " check-sloc Count Source Lines of Code"
	@echo " checks Make all the previous tests"
	@echo " format Format code"
	@echo " package Build package"
	@echo " upload-test Upload the package to TestPyPi"
	@echo " upload Upload the package to PyPi"
	@echo " distclean Remove all generated files"

check-code: /usr/local/bin/pylint
	-pylint ${SOURCES}

lint: check-code

check-security: /usr/local/bin/bandit
	-bandit -r ${SOURCES}

audit: check-security

check-unused: /usr/local/bin/vulture
	-vulture --sort-by-size ${SOURCES}

check-version: /usr/local/bin/vermin
	-vermin ${SOURCES}

check-sloc: /usr/local/bin/pygount
	-pygount --format=summary .

checks: check-code check-security check-unused check-version check-sloc

format: /usr/local/bin/black
	black ${SOURCES}

love:
	@echo "Not war!"

man/${NAME}.1.gz: man/${NAME}.1
	@gzip -k9c man/${NAME}.1 > man/${NAME}.1.gz

package: man/${NAME}.1.gz
	python -m build

upload-test:
	python -m twine upload --repository testpypi dist/*

upload:
	python -m twine upload dist/*

distclean:
	rm -rf build dist src/*.egg-info man/${NAME}.1.gz

165 changes: 165 additions & 0 deletions man/adsv.1
@@ -0,0 +1,165 @@
.Dd January 22, 2023
.Dt ADSV 1
.Os
.Sh NAME
.Nm adsv
.Nd Analyze delimiter-separated values files
.Sh SYNOPSIS
.Nm
.Op Fl d|--delimiter Ar CHAR
.Op Fl e|--encoding Ar STRING
.Op Fl f|--fields Ar LIST
.Op Fl F|--flatten
.Op Fl h|--hide Ar INT
.Op Fl m|--min Ar INT
.Op Fl M|--max Ar INT
.Op Fl t|--top Ar INT
.Op Fl -debug
.Op Fl -help|-?
.Op Fl -version
.Op Fl -
.Ar filename
.Op Ar ...
.Sh DESCRIPTION
The
.Nm
utility analyzes
.Em delimiter-separated values
files, such as Comma-Separated Values .csv or Tab-Separated Values .tsv files,
and either prints information about their structure and the data in each of their fields,
or prints a selection of fields in the order requested.
.Pp
The information gathered is:
.Bl -bullet
.It
for the file:
.Bl -bullet
.It
the character set encoding
.It
the CSV dialect (the characters used for delimiting, quoting, escaping and line termination, plus whether double quoting is used)
.It
whether a header line is present
.It
the number of lines and fields
.El
.It
for each field:
.Bl -bullet
.It
its number and header
.It
the number of distinct values
.It
the type of the values (strings, integers, floating-point numbers, complex numbers, or dates and times in any format)
.It
the values by descending count
.It
the range of values in ascending order, using the detected type (useful for numbers and dates)
.El
.El
.Pp
When analyzing a DSV dataset, this provides a quick, automated way to get an overview of its contents and to explore any oddities.
.Pp
There are options:
.Bl -bullet
.It
to control and limit what is printed (
.Fl h|--hide ,
.Fl m|--min ,
.Fl M|--max
and
.Fl t|--top
),
.It
to bypass (or override) the automatic detection of the character set encoding and delimiter (
.Fl d|--delimiter ,
.Fl e|--encoding
):
.Bl -bullet
.It
character set detection can take a long time on big files, so if you already know that the file uses the "Windows-1252" or "utf-8" encoding, it is quicker to specify it.
.El
.El
.Pp
If you use the
.Fl f|--fields
option, the file analysis is skipped and the selected fields are printed instead, in the order requested,
using the detected delimiting, quoting, escaping and line terminating characters.
.Pp
If you encounter multi-line fields and want to "flatten" them to single lines, you can use the
.Fl F|--flatten
option.
.Ss OPTIONS
.Op Fl d|--delimiter Ar CHAR
Specify delimiter to be CHAR
.Pp
.Op Fl e|--encoding Ar STRING
Specify charset encoding to be STRING (because detecting encoding can take a long time!)
.Pp
.Op Fl f|--fields Ar LIST
Extract the values of the LISTed fields in the given order (e.g. 6,2-4,1, with fields numbered from 1)
.Pp
.Op Fl F|--flatten
Flatten multi-line fields into single lines
.Pp
.Op Fl h|--hide Ar INT
Hide the display of distinct values above INT % (default is 20%)
.Pp
.Op Fl m|--min Ar INT
Only display distinct values whose count >= INT (default is to display all distinct values)
.Pp
.Op Fl M|--max Ar INT
Only display INT lines of distinct values (default is to display all distinct values, within the hide limit)
.Pp
.Op Fl t|--top Ar INT
Only display the top/bottom INT lines of values (default is to display the 5 bottom and top lines)
.Pp
.Op Fl -debug
Enable debug mode
.Pp
.Op Fl -help|-?
Print usage and this help message and exit
.Pp
.Op Fl -version
Print version and exit
.Pp
.Op Fl -
Options processing terminator
.Sh ENVIRONMENT
The
.Ev ADSV_DEBUG
environment variable can also be set to any value to enable debug mode.
.Sh EXIT STATUS
.Ex -std adsv
.Sh SEE ALSO
.Xr cut 1 ,
.Xr file 1
.Sh STANDARDS
The
.Nm
utility is not a standard UNIX command.
.Pp
This implementation tries to follow the PEP 8 style guide for Python code.
.Pp
The DSV dialects that can be handled are those compatible with
.Em RFC 4180: Common Format and MIME Type for Comma-Separated Values (CSV) Files .
.Sh PORTABILITY
Tested OK under Windows.
.Sh HISTORY
This implementation was made for the
.Lk https://github.com/HubTou/PNU PNU project
.Pp
I do this kind of analysis with each dataset I have to work with.
The last time I did so, I decided that it was about time to fully automate the process,
especially as I was working with fields containing multi-line values.
.Sh LICENSE
It is available under the 3-clause BSD license.
.Sh AUTHORS
.An Hubert Tournier
.Sh CAVEATS
Using "Sep=X" as a first line in order to set the X character as a delimiter is not supported.
.Pp
Commented lines inside the data are not supported either (for example, with
.Pa /etc/passwd
files under Unix), but they are not part of any recognized DSV dialect anyway.
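
Note on the field list syntax: the man page above describes the -f|--fields option as taking a list such as 6,2-4,1, with fields numbered from 1. A minimal Python sketch of how such a list could be expanded and applied to a row follows. The function names parse_field_list and select_fields are hypothetical and are not taken from the adsv sources; the actual implementation may differ.

```python
def parse_field_list(spec):
    """Expand a 1-based field list such as "6,2-4,1" into 0-based indexes.

    Hypothetical helper for illustration only, not adsv's own code.
    """
    indexes = []
    for part in spec.split(","):
        part = part.strip()
        if "-" in part:
            # A range such as "2-4" covers both of its bounds
            first, last = part.split("-", 1)
            start, end = int(first), int(last)
            if start < 1 or end < start:
                raise ValueError(f"invalid field range: {part!r}")
            indexes.extend(range(start - 1, end))
        else:
            number = int(part)
            if number < 1:
                raise ValueError(f"invalid field number: {part!r}")
            indexes.append(number - 1)
    return indexes


def select_fields(row, indexes):
    """Return the requested fields of one row, in the requested order."""
    return [row[i] for i in indexes]


if __name__ == "__main__":
    wanted = parse_field_list("6,2-4,1")
    print(select_fields(["a", "b", "c", "d", "e", "f"], wanted))
```

Keeping the parsing separate from the selection means the list is validated once, then reused for every row of the file.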
6 changes: 6 additions & 0 deletions pyproject.toml
@@ -0,0 +1,6 @@
[build-system]
requires = [
"setuptools>=42",
"wheel"
]
build-backend = "setuptools.build_meta"
56 changes: 56 additions & 0 deletions setup.cfg
@@ -0,0 +1,56 @@
[metadata]
name = pnu_adsv
description = Analyze delimiter-separated values files
long_description = file: README.md
long_description_content_type = text/markdown
version = 1.0.0
license = BSD 3-Clause License
license_files = License
author = Hubert Tournier
author_email = [email protected]
url = https://github.com/HubTou/adsv/
project_urls =
    Bug Tracker = https://github.com/HubTou/adsv/issues
keywords = pnu-project
classifiers =
    Development Status :: 5 - Production/Stable
    Environment :: Console
    Intended Audience :: Developers
    Intended Audience :: End Users/Desktop
    License :: OSI Approved :: BSD License
    Natural Language :: English
    Operating System :: OS Independent
    Operating System :: POSIX :: BSD :: FreeBSD
    Operating System :: Microsoft :: Windows
    Programming Language :: Python :: 3
    Programming Language :: Python :: 3.6
    Programming Language :: Python :: 3.7
    Programming Language :: Python :: 3.8
    Programming Language :: Python :: 3.9
    Programming Language :: Python :: 3.10
    Programming Language :: Python :: 3.11
    Topic :: Software Development :: Libraries :: Python Modules
    Topic :: System
    Topic :: Utilities

[options]
package_dir =
    = src
packages = find:
python_requires = >=3.6
install_requires =
    pnu-libpnu
    chardet
    python-dateutil

[options.packages.find]
where = src

[options.entry_points]
console_scripts =
    adsv = adsv:main

[options.data_files]
man/man1 =
    man/adsv.1.gz

2 changes: 2 additions & 0 deletions src/adsv/__init__.py
@@ -0,0 +1,2 @@
"""Wrapper for the source code files"""
from .main import *
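
Note on the analysis itself: the contents of main.py and dsv_library.py are not shown on this page, so the following is only a rough sketch of the kind of analysis the man page describes (detecting the dialect, then listing each field's values by descending count). It uses only the standard library's csv.Sniffer and collections.Counter. The analyze function, its parameters and the sample data are assumptions made for illustration and may not match what adsv actually does; in particular, encoding detection (which the package delegates to chardet) is left out.

```python
import csv
import io
from collections import Counter


def analyze(text, top=5, has_header=True):
    """Sniff the dialect of DSV text and count the values of each field.

    Hypothetical sketch, not adsv's own code. Returns the detected
    delimiter and, for each field, its values by descending count.
    """
    # csv.Sniffer guesses the delimiter and quoting rules from a sample
    dialect = csv.Sniffer().sniff(text)
    rows = list(csv.reader(io.StringIO(text), dialect))
    if has_header:
        rows = rows[1:]  # assumed header line, excluded from the counts
    width = max((len(row) for row in rows), default=0)
    counters = [Counter() for _ in range(width)]
    for row in rows:
        for number, value in enumerate(row):
            counters[number][value] += 1
    return dialect.delimiter, [c.most_common(top) for c in counters]


if __name__ == "__main__":
    sample = "name;city\nalice;Paris\nbob;Lyon\ncarol;Paris\n"
    delimiter, fields = analyze(sample)
    print("delimiter:", repr(delimiter))
    for number, values in enumerate(fields, start=1):
        print("field", number, values)
```

The sniffer is a heuristic and can fail or guess wrongly on small or unusual samples, which is presumably why the tool offers the -d|--delimiter option to set the delimiter by hand.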