Skip to content

JeMunos/data_hacks

 
 

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

31 Commits
 
 
 
 
 
 
 
 

Repository files navigation

data_hacks

Command line utilities for data analysis

Installing: pip install data_hacks

Installing from github pip install -e git://github.com/bitly/data_hacks.git#egg=data_hacks

Installing from source python setup.py install

data_hacks are friendly. Ask them for usage information with --help

histogram.py

A utility that parses input data points and outputs a text histogram

Example:

$ cat /tmp/data | histogram.py
# NumSamples = 29; Max = 10.00; Min = 1.00
# Mean = 4.379310; Variance = 5.131986; SD = 2.265389
# each * represents a count of 1
    1.0000 -     1.9000 [     1]: *
    1.9000 -     2.8000 [     5]: *****
    2.8000 -     3.7000 [     8]: ********
    3.7000 -     4.6000 [     3]: ***
    4.6000 -     5.5000 [     4]: ****
    5.5000 -     6.4000 [     2]: **
    6.4000 -     7.3000 [     3]: ***
    7.3000 -     8.2000 [     1]: *
    8.2000 -     9.1000 [     1]: *
    9.1000 -    10.0000 [     1]: *

ninety_five_percent.py

A utility script that takes a stream of decimal values and outputs the 95% time.

This is useful for finding the 95% response time from access logs.

Example (assuming response time is the last column in your access log):

$ cat access.log | awk '{print $NF}' | ninety_five_percent.py

sample.py

Filter a stream to a random sub-sample of the stream

Example:

$ cat access.log | sample.py 10% | post_process.py

run_for.py

Pass through data for a specified amount of time

Example:

$ tail -f access.log | run_for.py 10s | post_process.py

bar_chart.py

Generate an ascii bar chart for input data (this is like a visualization of uniq -c)

$ cat data | bar_chart.py --sort-keys
# each * represents a count of 2
19:0 [     1] 
19:1 [    24] ************
19:2 [     3] *
19:3 [     9] ****
19:4 [     5] **
19:5 [    41] ********************
20:0 [   115] *********************************************************
20:1 [   181] ******************************************************************************************
20:2 [   136] ********************************************************************
20:3 [   155] *****************************************************************************
20:4 [   150] ***************************************************************************
20:5 [    79] ***************************************
21:0 [    64] ********************************
21:1 [     8] ****

bar_chart.py also supports ingesting aggregated values. Simply provide a two column input of keyvalue:

$ cat data | uniq -c | bar_chart.py --sort-keys --agg-values

This is very convenient if you pull data out, say Hadoop or MySQL already aggregated.

About

Command line utilities for data analysis

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published