- Separate stable version to its own branch.
- From Readability: strip identification and presentational attributes from each node.
- Improve lazy load image extractor.
- Mark large blocks around main content as content as well.
- From Readability: exclude nodes whose roles indicate that they are not content.
- From Readability: skip byline, empty divs and unlikely elements.
- From Readability: convert an anchor that uses a JavaScript URL and contains only a single text node into an ordinary text node.
- From Readability: convert `<font>` elements to `<span>`.
- Make sure figure captions don't contain `<noscript>` tags.
- Mark `<acronym>` and `<tt>` as inline elements. At this point the port process has finished, so I tagged it as v1.0.0.
- From Readability: check if a node is probably invisible by using its class name and `aria-hidden` attribute.
- From Readability: exclude form and input elements.
- Simplify function for getting display style.
- Fix fatal error in doc builder which caused missing contents.
- Add initial test files.
- Improve lazy-loaded image replacer in image extractor.
- Port `LogUtil` from `LogUtil.java`
- Fix pagination finder `PrevNextFinder` ignoring page numbers in URL queries.
- Merge pagination finder. Pagination links to the previous and next partial pages are now accessible via `Result.PaginationInfo`.
- Improve the page number pagination finder to also find page numbers on web pages whose page numbers are not all consecutive (like on ArsTechnica).
- Move all models from the `internal/model` directory (which can't be imported by the package's users) to the `data` directory.
- Port `PagingLinksFinder` from `PagingLinksFinder.java`
- Restructure models by moving the distiller `Result` out of the internal directory and removing unused data fields.
- Port `PageParameterParser` from `PageParameterParser.java`
- Fix panic when generating image output.
- Implement test for `testutil.TextDocumentBuilder` following `javatest/TextDocumentConstructionTest`.
- Implement test for `webdoc.TextDocument` following `javatest/TextDocumentStatisticsTest`.
- Port `PathComponentPagePattern` from `PathComponentPagePattern.java`
- Port `PageParameterDetector` from `PageParameterDetector.java`
- Port `PageLinkInfo` from `PageLinkInfo.java`
- Port `PageParamInfo` from `PageParamInfo.java`
- Port `MonotonicPageInfosGroups` from `MonotonicPageInfosGroups.java`
- Port `PagePattern` interface from `PageParameterDetector.java`
- Port `QueryParamPagePattern` from `QueryParamPagePattern.java`
- Port `ContentExtractor` from `ContentExtractor.java`
- Remove `JsTestEntryGenerator` from `javatest/JsTestEntryGenerator.java` because it's only used in Java to prepare the unit tests.
- Port all `ImageScorer` in `webdocuments/filters/images/`
- Port `LeadImageFinder` from `webdocuments/filters/LeadImageFinder.java`
- Port `NestedElementRetainer` from `webdocuments/filters/NestedElementRetainer.java`
- Port `RelevantElements` from `webdocuments/filters/RelevantElements.java`
- Port `TestWebDocumentBuilder` from `javatest/webdocument/TestWebDocumentBuilder.java`
- Port `NumWordsRulesClassifier` from `filters/english/NumWordsRulesClassifier.java`
- Port `TerminatingBlocksFinder` from `filters/english/TerminatingBlocksFinder.java`
- Port `BlockProximityFusion` from `filters/heuristics/BlockProximityFusion.java`
- Port `DocumentTitleMatchClassifier` from `filters/heuristics/DocumentTitleMatchClassifier.java`
- Port `ExpandTitleToContentFilter` from `filters/heuristics/ExpandTitleToContentFilter.java`
- Port `HeadingFusion` from `filters/heuristics/HeadingFusion.java`
- Port `KeepLargestBlockFilter` from `filters/heuristics/KeepLargestBlockFilter.java`
- Port `LargeBlockSameTagLevelToContentFilter` from `filters/heuristics/LargeBlockSameTagLevelToContentFilter.java`
- Port `ListAtEndFilter` from `filters/heuristics/ListAtEndFilter.java`
- Port `SimilarSiblingContentExpansion` from `filters/heuristics/SimilarSiblingContentExpansion.java`
- Port `BoilerplateBlockFilter` from `filters/simple/BoilerplateBlockFilter.java`
- Port `LabelToBoilerplateFilter` from `filters/simple/LabelToBoilerplateFilter.java`
- Port `TestTextBlockBuilder` from `javatest/TestTextBlockBuilder.java`
- Port `TestTextDocumentBuilder` from `javatest/TestTextDocumentBuilder.java`
- Port `TextDocumentTestUtil` from `javatest/document/TextDocumentTestUtil.java`
- Port `TestWebTextBuilder` from `javatest/webdocument/TestWebTextBuilder.java`
- Port `ArticleExtractor` from `extractors/ArticleExtractor.java`
- Remove `filters/simple/MarkEverythingBoilerplateFilter.java` since it's not used anywhere.
- Remove `filters/simple/MarkEverythingContentFilter.java` and `filters/simple/MinWordsFilter.java` since they are only used in `KeepEverythingExtractor.java` and `KeepEverythingWithMinKWordsExtractor.java`, which we already removed on 8 October.
- Port `DomConverter` from `webdocument/DomConverter.java`
- Port `FakeWebDocumentBuilder` from `javatest/webdocument/FakeWebDocumentBuilder.java`
- Replace `alecthomas/assert` with `stretchr/testify/assert`. Nothing is wrong with the former, but the latter is better since it prints the log as raw text instead of a formatted one. This might be useful if we later decide to set up CI for testing.
- Port `WebDocument` from `webdocument/WebDocument.java`
- Port `WebDocumentBuilder` from `webdocument/WebDocumentBuilder.java`
- Port `EmbedExtractor` from `extractors/embed/EmbedExtractor.java`
- Port `ImageExtractor` from `extractors/embed/ImageExtractor.java`
- Port `TwitterExtractor` from `extractors/embed/TwitterExtractor.java`
- Port `VimeoExtractor` from `extractors/embed/VimeoExtractor.java`
- Port `YouTubeExtractor` from `extractors/embed/YouTubeExtractor.java`
- Remove `JavaScript.java` because the functions inside it are already available in the Go standard library.
- Remove `GwtOverlayProtoTest.java` because it's only a test model for Protobuf, which we don't use.
- Remove `KeepEverythingExtractor.java` and `KeepEverythingWithMinKWordsExtractor.java` because they are not used anywhere.
- Port `WebVideo` from `webdocument/WebVideo.java`
- Port `TextBlock` from `document/TextBlock.java`
- Port `TextDocument` from `document/TextDocument.java` and `document/TextDocumentStatistics.java`
- Add initial MIT license.
- Port `WebTag` from `webdocument/WebTag.java`
- Port `WebText` from `webdocument/WebText.java`
- Port `WebEmbed` from `webdocument/WebEmbed.java`
- Port `WebImage` from `webdocument/WebImage.java`
- Port `WebTable` from `webdocument/WebTable.java`
- Port `WebFigure` from `webdocument/WebFigure.java`
- Port `WebTextBuilder` from `webdocument/WebTextBuilder.java`
- Port `ElementAction` from `webdocument/ElementAction.java`
- Port `DomWalker` from `DomWalker.java`
- Remove `NodeListExpander` since it produces the same result as `TreeCloneBuilder`, which we have already ported (even their unit tests are similar).
- Remove `NodeTree` since it's only used in `NodeListExpander`. Besides that, it also requires us to compute the stylesheet, which is impossible to implement right now.
- Remove `OrderedNodeMatcher` since it's only used in `NodeListExpander` and `TreeCloneBuilder`, and our implementation of `TreeCloneBuilder` doesn't require it.
- Port `TableClassifier` from `TableClassifier.java`
- Remove `DomDistillerEntry` since it's useless for our case.
- Remove `Assert` because we already use the `testify` package, which provides assertion utilities.
- Remove `JsTestCase`, `JsTestEntry`, `JsTestSuitBase` and `DomDistillerJsTestCase` because they are only used in Java to prepare the unit tests.
- Port `CreateDivTree` from `TestUtil.java`
- Port `BuildTreeClone` from `TreeCloneBuilder.java`
- Port `SchemaOrgParser` and `SchemaOrgParserAccessor` from `SchemaOrg.java`
- Port `MarkupParser` from `MarkupParser.java`
- Port `getDocumentTitle` from `DocumentTitleGetter.java`
- Port `IEReadingViewParser` from `IEReadingViewParser.java`
- Porting process started
- Port `WordCounter` interface from `StringUtil.java`
- Port `OpenGraphParser` and `OpenGraphParserAccessor` from `OpenGraphParser.java`