GitHub - davidhoward/BLB: An ASP specializer for statistical bootstrapping

forked from richardxia/BLB

Thank you for using the UC Berkeley BLB Specializer. For more information on this
specializer and the ASP project, please visit http://www.sejits.org. This document
describes the basic steps necessary to install and use the UC Berkeley BLB
Specializer. For more detailed information on using this specializer, please see
https://github.com/davidhoward/BLB/wiki/Home .

1. Prerequisites
2. Installation and Setup
3. Verifying your installation
4. Running your first specialized program

1. PREREQUISITES

The UC Berkeley BLB Specializer relies upon several other software packages in order
to function. Please verify that these are installed and working correctly on your machine
before attempting to install this specializer.

0) A Unix-based operating system
See sejits.org for information about using ASP on a non-Unix platform.

i) Python 2.6.x
ASP SEJITS is designed to work with the 2.6 series of Python releases. This specializer
and the libraries it relies upon may not function with other versions of Python. See
http://www.python.org for information and to download Python.

ii) The ASP Framework
This package contains functions common to all ASP specializers, and is needed by this
specializer to function. See sejits.org for information on installing the ASP framework.

iii) GNU Scientific Library
Not every specialized operation uses the GNU Scientific Library (GSL), but it must be
installed for the specializer to function properly. This specializer was tested against
GSL 1.15. For information and to get GSL, please visit http://www.gnu.org/s/gsl.

iv) Spark Computing Cluster (Optional for Distributed Version of Specializer)
See spark-project.org for information on this open source cluster computing system
(developed in UC Berkeley's AMP Research Laboratory) and how to install it.

2. INSTALLATION AND SETUP

Preparing the UC Berkeley BLB Specializer for use involves three steps: acquiring the
source files, setting the appropriate environment variables, and writing the
configuration file. The distributed version of the specializer additionally requires a
running Spark computing cluster. See DISTRIBUTED VERSION SETUP at the end of this
section for details on this and on what to do once the Spark cluster is up.

ACQUIRING THE SOURCE

The most recent release build of this specializer is available at
https://github.com/shoaibkamil/asp/tree/master/specializers/BLB

This is a git repository, and may be cloned as such.

SETTING ENVIRONMENT VARIABLES

For easy use of this specializer, please ensure that the directory containing the compiled GSL
binary is on your LD_LIBRARY_PATH, and that the root directory of your specializer
installation (the one containing this document) is on your PYTHONPATH.
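
To check that both variables are set as described, a small script along the following
lines may help. It is only a sketch: the helper path_contains and the two directory
names (GSL_LIB_DIR and BLB_ROOT) are assumptions made for illustration and are not part
of the specializer, so substitute the locations used on your machine.

```python
import os


def path_contains(variable_value, directory):
    """Return True if `directory` is one entry of a path-style variable value."""
    target = os.path.normpath(directory)
    entries = [entry for entry in variable_value.split(os.pathsep) if entry]
    return any(os.path.normpath(entry) == target for entry in entries)


# Hypothetical locations; replace with your GSL library and BLB directories.
GSL_LIB_DIR = "/usr/local/lib"
BLB_ROOT = os.path.expanduser("~/BLB")

if __name__ == "__main__":
    checks = (("LD_LIBRARY_PATH", GSL_LIB_DIR), ("PYTHONPATH", BLB_ROOT))
    for name, directory in checks:
        found = path_contains(os.environ.get(name, ""), directory)
        print("%s contains %s: %s" % (name, directory, found))
```

If either line reports False, add the directory to the corresponding variable in your
shell startup file and start a new shell before running the specializer.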

WRITING THE CONFIGURATION FILE

The UC Berkeley BLB Specializer requires some information about your system to function
properly. You should create a file in your root BLB directory named blb_setup.py with the
following contents:

- gslroot = "/path/to/gsl/header_files"
The specializer needs to know where to find the GSL header files.
- cache_dir = "/path/to/cgen/cache"
The specializer caches modules it has already compiled in this directory, so that it does
not have to re-compile for cases it has already seen.

An example of a properly formatted blb_setup.py is located in the file blb_setup_EXAMPLE
in the root BLB directory.
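
As an illustration, the two settings and a quick sanity check could look like the sketch
below. Both paths are the placeholders used above, not real locations. The function
find_setup_problems is a hypothetical helper, not part of the specializer, and is best
kept in a separate script so that blb_setup.py holds only the two assignments. Whether
the specializer creates cache_dir on demand is not stated here, so the check
conservatively reports a missing cache directory as a problem.

```python
import os

# The two settings that belong in blb_setup.py (placeholder paths).
gslroot = "/path/to/gsl/header_files"  # directory holding the GSL header files
cache_dir = "/path/to/cgen/cache"      # where compiled modules are cached


def find_setup_problems(gslroot, cache_dir):
    """Return a list of problem descriptions; an empty list means both paths look usable."""
    problems = []
    if not os.path.isdir(gslroot):
        problems.append("gslroot is not a directory: %s" % gslroot)
    if not os.path.isdir(cache_dir):
        problems.append("cache_dir is not a directory: %s" % cache_dir)
    elif not os.access(cache_dir, os.W_OK):
        problems.append("cache_dir is not writable: %s" % cache_dir)
    return problems


if __name__ == "__main__":
    for problem in find_setup_problems(gslroot, cache_dir):
        print(problem)
```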

DISTRIBUTED VERSION SETUP

To set up the distributed version, first bring up a Spark cluster. On the master
node of the cluster, run the setup script located here:
https://github.com/davidhoward/BLB/blob/master/blb_spark_setup.sh
It is then advisable to adjust the level of parallelism for your cluster. To do this,
open the file ~/sejits/asp/specializers/blb/blb_core_parallel.scala and go to line 88. There,
you can set the level of parallelism that best suits your cluster (likely the number
of cores in the cluster).

3. VERIFYING YOUR INSTALLATION

To verify a proper installation of the distributed version, open the BLB folder (~/BLB) and
run ./run_dist_tests.sh

If this works, congrats! You have just run the BLB (bag of little bootstraps) algorithm on
your cluster. The default behavior is to compute a classifier's estimated accuracy on a subset
of the Enron email corpus.

4. RUNNING YOUR FIRST SPECIALIZED PROGRAM

For the distributed version, if you have gotten this far you have already run the
test BLB SEJITS program while verifying your installation. To customize this test and perform
alternative calculations, open the file ~/BLB/tests/spark_test.py on your cluster and edit
the three input functions to the BLB algorithm (compute_estimate, reduce_bootstraps, and
average). An easy change would be to calculate the standard deviation instead of the mean
in the reduce_bootstraps function. A commented-out standard deviation function is already
provided in that file.
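
To show how the three functions fit together, here is a minimal, purely local sketch of
the bag of little bootstraps in plain Python. It is not the code in spark_test.py, and
the signatures used there may differ. The driver function blb, its parameters
(num_subsamples, num_bootstraps, gamma, seed), and the choice of the sample mean as the
estimator are all assumptions made for illustration. Here reduce_bootstraps computes the
standard deviation of the bootstrap estimates, which is the easy change described above.

```python
import math
import random


def compute_estimate(resample):
    """Estimator applied to one bootstrap resample (here: the sample mean)."""
    return sum(resample) / float(len(resample))


def reduce_bootstraps(estimates):
    """Summarize one subsample's estimates (here: their standard deviation)."""
    mean = sum(estimates) / float(len(estimates))
    variance = sum((e - mean) ** 2 for e in estimates) / float(len(estimates))
    return math.sqrt(variance)


def average(subsample_results):
    """Combine the per-subsample summaries into the final result."""
    return sum(subsample_results) / float(len(subsample_results))


def blb(data, num_subsamples=5, num_bootstraps=20, gamma=0.7, seed=0):
    """Hypothetical local driver; the real specializer runs this on a cluster."""
    rng = random.Random(seed)
    n = len(data)
    b = max(1, int(n ** gamma))  # subsample size b = n^gamma
    results = []
    for _ in range(num_subsamples):
        subsample = rng.sample(data, b)  # drawn without replacement
        estimates = []
        for _ in range(num_bootstraps):
            # Resample n points with replacement from the b-point subsample.
            resample = [rng.choice(subsample) for _ in range(n)]
            estimates.append(compute_estimate(resample))
        results.append(reduce_bootstraps(estimates))
    return average(results)


if __name__ == "__main__":
    data = [float(i) for i in range(200)]
    print("Estimated standard error of the mean: %f" % blb(data))
```

Swapping the body of reduce_bootstraps for a mean, a percentile, or another summary
changes what quantity the algorithm reports without touching the rest of the sketch.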

Beyond this customization, you can create your own distributed specializer instance, as is
done at the bottom of spark_test.py. For more information, see the wiki for this repo,
specifically the page titled "Using the Specializer."