This repository has been archived by the owner on Oct 30, 2018. It is now read-only.

Fixed some typos, wrapped lines, minor edits for clarity. #78

Open
wants to merge 3 commits into
base: master
49 changes: 34 additions & 15 deletions README.md
@@ -1,58 +1,77 @@
#Goose - Article Extractor

##Intro

##Intro

Goose was originally an article extractor written in Java that has most recently (aug2011) converted to a scala project. It's mission is to take any news article or article type web page and not only extract what is the main body of the article but also all meta data and most probable image candidate.
Goose was originally an article extractor written in Java that has been
converted to a Scala project. Its mission is to take a news article
or article-type web page and extract the main body of the article, all
metadata, and most probable image candidate.

The extraction goal is to try and get the purest extraction from the beginning of the article for servicing flipboard/pulse type applications that need to show the first snippet of a web article along with an image.
The extraction goal is the purest extraction from the beginning of the
article for servicing flipboard/pulse type applications that need to
show the first snippet of a web article along with an image.

Goose will try to extract the following information:

- Main text of an article
- Main image of article
- Any Youtube/Vimeo movies embedded in article
- Any YouTube/Vimeo movies embedded in article
- Meta Description
- Meta tags
- Publish Date


The wiki has the full details on how to use Goose [https://github.com/jiminoc/goose/wiki](https://github.com/jiminoc/goose/wiki)
The wiki has the full details on how to use Goose
[https://github.com/jiminoc/goose/wiki](https://github.com/jiminoc/goose/wiki)

Goose was open sourced by Gravity.com in 2011

Lead Programmer: Jim Plush (Gravity.com)

Contributers: Robbie Coleman (Gravity.com)
Contributors: Robbie Coleman (Gravity.com)


Try it out online!
http://jimplush.com/blog/goose
[Try it out online!](http://jimplush.com/blog/goose)


##Licensing
If you find Goose useful or have issues please drop me a line, I'd love to hear how you're using it or what features should be improved

Goose is licensed by Gravity.com under the Apache 2.0 license, see the LICENSE file for more details
If you find Goose useful or have issues, please drop me a line, I'd love
to hear how you're using it or what features should be improved.

Goose is licensed by Gravity.com under the Apache 2.0 license, see the
LICENSE file for more details.


##Take it for a spin

To use goose from the command line:

cd into the goose directory
mvn compile
MAVEN_OPTS="-Xms256m -Xmx2000m"; mvn exec:java -Dexec.mainClass=com.gravity.goose.TalkToMeGoose -Dexec.args="http://techcrunch.com/2011/05/13/native-apps-or-web-apps-particle-code-wants-you-to-do-both/" -e -q > ~/Desktop/gooseresult.txt
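
A note on the last command, offered as a sketch rather than a tested fix: in a POSIX shell, `VAR=value; command` assigns a shell variable and then runs the command as a separate statement, so the command sees the value only if `VAR` was already exported. The example below illustrates the difference with a hypothetical `DEMO_OPTS` variable standing in for `MAVEN_OPTS` and `sh -c` standing in for `mvn`; neither name comes from Goose.

```shell
# DEMO_OPTS is a hypothetical stand-in for MAVEN_OPTS; `sh -c` stands in for mvn.

# Form used above: assign, then run a separate command.
# The child process does not inherit the unexported variable.
separate_form() {
  unset DEMO_OPTS
  DEMO_OPTS="-Xms256m -Xmx2000m"; sh -c 'echo "[${DEMO_OPTS}]"'
}

# Prefix form: the assignment is placed in the environment
# of that one command only.
prefix_form() {
  unset DEMO_OPTS
  DEMO_OPTS="-Xms256m -Xmx2000m" sh -c 'echo "[${DEMO_OPTS}]"'
}

separate_form   # prints []
prefix_form     # prints [-Xms256m -Xmx2000m]
```

If the heap settings do not seem to take effect, dropping the `;` (or running `export MAVEN_OPTS=...` first) is the likely remedy, assuming `MAVEN_OPTS` is not already exported in your shell.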


##Regarding the port from JAVA to Scala
##Regarding the port from Java to Scala

Here are some of the reasons for the port to Scala:

- Gravity has moved more towards Scala development internally so maintenance started to become an issue
- Gravity has moved more towards Scala development internally so
maintenance started to become an issue
- There wasn't enough contribution to warrant keeping it in Java
- The packages were all namespaced under a person's name and not the company's name
- The packages were all namespaced under a person's name and not the
company's name
- Scala is more fun


##Issues
It was a pretty fast Java to Scala port so lots of the nicities of the Scala language aren't in the codebase yet, but those will come over the coming months as we re-write alot of the internal methods to be more Scalesque.
We made sure it was still nice and operable from Java as well so if you're using goose from java you still should be able to use it with a few changes to the method signatures.

The Java to Scala port was done quickly, so many niceties of the
Scala language aren't in the codebase yet, but those will come over the
coming months as we re-write alot of the internal methods to be more
Scala-esque.

We made sure it was still nice and operable from Java as well, so you
should still be able to use goose from java with a few changes to the
method signatures.
26 changes: 26 additions & 0 deletions build.sbt
@@ -0,0 +1,26 @@
name := "Goose"

version := "2.1.22"

organization := "GravityLabs"

organizationHomepage := Some(url("http://gravity.com/"))

homepage := Some(url("https://github.com/GravityLabs/goose"))

description := "Extracts text, metadata, and key image from web articles."

licenses += "Apache2" -> url("http://www.apache.org/licenses/")

// scalacOptions ++= Seq("-unchecked", "-deprecation")

libraryDependencies ++= Seq(
  "junit" % "junit" % "4.8.1" % "test",
  "org.slf4j" % "slf4j-api" % "1.6.1" % "compile",
  "org.slf4j" % "slf4j-log4j12" % "1.6.1" % "test",
  "org.slf4j" % "slf4j-simple" % "1.6.1",
  "org.jsoup" % "jsoup" % "1.5.2",
  "commons-io" % "commons-io" % "2.0.1",
  "org.apache.httpcomponents" % "httpclient" % "4.1.2",
  "commons-lang" % "commons-lang" % "2.6"
)
36 changes: 21 additions & 15 deletions src/main/scala/com/gravity/goose/TalkToMeGoose.scala
@@ -7,21 +7,27 @@ package com.gravity.goose
*/
object TalkToMeGoose {
/**
* you can use this method if you want to run goose from the command line to extract html from a bashscript
* or to just test it's functionality
* you can run it like so
* cd into the goose root
* mvn compile
* MAVEN_OPTS="-Xms256m -Xmx2000m"; mvn exec:java -Dexec.mainClass=com.gravity.goose.TalkToMeGoose -Dexec.args="http://techcrunch.com/2011/05/13/native-apps-or-web-apps-particle-code-wants-you-to-do-both/" -e -q > ~/Desktop/gooseresult.txt
*
* Some top gun love:
* Officer: [in the midst of the MIG battle] Both Catapults are broken, sir.
* Stinger: How long will it take?
* Officer: It'll take ten minutes.
* Stinger: Bullshit ten minutes! This thing will be over in two minutes! Get on it!
*
* @param args
*/
* You can use this method to run goose from the command line
* to extract html from a bash script, or to just test its functionality:
*
* cd into the goose root
* mvn compile
* MAVEN_OPTS="-Xms256m -Xmx2000m"; mvn exec:java -Dexec.mainClass=com.gravity.goose.TalkToMeGoose -Dexec.args="http://techcrunch.com/2011/05/13/native-apps-or-web-apps-particle-code-wants-you-to-do-both/" -e -q > ~/Desktop/gooseresult.txt
*
* or if using sbt:
*
* cd into the goose root
* sbt
* > run http://www.thestar.com/news/insight/2013/04/26/spotting_tiny_gnatcatcher_can_put_a_spring_in_your_step.html
*
* Some top gun love:
* Officer: [in the midst of the MIG battle] Both Catapults are broken, sir.
* Stinger: How long will it take?
* Officer: It'll take ten minutes.
* Stinger: Bullshit ten minutes! This thing will be over in two minutes! Get on it!
*
* @param args
*/
def main(args: Array[String]) {
try {
val url: String = args(0)