GitHub - bwkimmel/sortable: Sortable.com Coding Challenge Implementation

bwkimmel / sortable Public

Notifications You must be signed in to change notification settings
Fork 0
Star 2

Sortable.com Coding Challenge Implementation

Notifications

Name		Name	Last commit message	Last commit date
Latest commit History 71 Commits
.settings		.settings
src/ca/eandb/sortable		src/ca/eandb/sortable
.classpath		.classpath
.gitignore		.gitignore
.project		.project
AUTHORS		AUTHORS
COPYING		COPYING
README		README
build.xml		build.xml
ivy.xml		ivy.xml

Repository files navigation

===============================================================================
README - Sortable Challenge Implementation
===============================================================================
Author: Brad Kimmel <[email protected]>
Date  : March 2012
-------------------------------------------------------------------------------

Outline
=======

This project is an implementation for a programming challenge issued by
Sortable, Inc.  Details regarding this challenge may be found at:

    http://sortable.com/blog/coding-challenge/

The program takes two input files: a list of products in JSON format, one per
line, and a collection of listings in JSON format, one per line.  The challenge
is to examine the listings and match them to at most one product.  One result
file is produced which indicates which listings match which products.



Running
=======

To run this program against the default data provided as part of the challenge
specification, simply run:

    ant

on the command line.  This will download dependencies, compile the application,
download and untar the default data, and run the application on the test data.
The results may be found in

    output/results.txt

To run this program against custom data, enter the following on the command
line:

    ant run -Dproducts=<product_file> -Dlistings=<listings_file> \
        -Dresults=<results_file>

Alternatively, any of the parameters (the ones beginning with -D) may be
omitted and the user will be prompted for them.

To group the results by listing instead of by product, add the following
command line parameter to either of the above commands:

    -Dca.eandb.sortable.groupByListing=true

If this is set, the matching listings will be reprinted with the corresponding
product_name, model, and family fields added.

To output a list of the unmatched listings instead, add the following command
line parameter to either of the above commands:

    -Dca.eandb.sortable.printMisses=true



Technical Summary
=================

This program works by storing information about each product in two tries.  A
trie is a tree that stores a collection of strings.  Each node in the trie
denotes one character of input, and the path to the root from a given node
denotes a prefix of a string stored in the trie.  For further details, see:

    http://en.wikipedia.org/wiki/Trie

Two tries are created from the products listing, one for the "manufacturer"
field, and one for the model name.  Each trie contains all of the words (and
concatenations of consecutive words) from the corresponding fields in JSON
file.  Some of these strings are excluded from the trie to reduce the chance
of false positives.  For details, see comments in:

    ca.eandb.sortable.ProductTrieBuilder

Additionally, the fields are first massaged to eliminate differences due to
case, accents, certain word boundaries (or lack thereof), etc.  See comments
in:

    ca.eandb.sortable.StringUtil.normalize(String)

for details.  The nodes in the trie corresponding to the end of each string
are associated with the product.  After processing all products, any particular
node may be associated with zero or more products.

The tries act as deterministic finite automata (DFAs) used to match against
strings in the listing.  Since the set of products is likely to change less
often than the set of listings, we could build these tries and store them for
later use against new listings.

The manufacturer trie and the model trie are used to match against the
"manufacturer" and "title" fields, respectively, for each listing.  For the
field in the listing, we search for each word (and concatenations of
consecutive words) in the corresponding trie.  Each matching node in the trie
yields a set of possible products to be further considered.  After processing
all words (and concatenations of consecutive words) in the listing, we have
to resolve the sets corresponding to each matching node to a minimal matching
set.  In basic terms, this involves computing the intersection of these
matching sets, but there are additional details to handle some odd cases that
might occur.  For full details of the matching process, see comments in:

    ca.eandb.sortable.json.JSONListingReader

For the model number, if there are still multiple matches after the resolution
step described above, we only consider matches which are "maximal".  A maximal
match is one for which no suffix could be appended to yield a longer match for
the same product.  This allows us to handle the case where there are two
products which all strings which match one product also match the other.  For
example, consider the following two products:

    (a) Pentax WG-1
    (b) Pentax WG-1 GPS

Without further consideration, no listings would match (a) because the algorithm
would think that the correct result could be (a) or (b), and therefore would
report no match.  However, since "WG-1" is a maximal match for (a) but not for
(b), we accept (a) and reject (b) if "GPS" is not also present.