Merge pull request #11 from nbulaj/html_adapters
Html adapters (Nokogiri || Oga)
nbulaj authored Dec 8, 2017
2 parents ada3d5a + f619182 commit ba9e663
Showing 32 changed files with 444 additions and 131 deletions.
1 change: 1 addition & 0 deletions .gitignore
Original file line number Diff line number Diff line change
@@ -15,6 +15,7 @@ pickle-email-*.html
Gemfile.lock
*.gem
certs
gemfiles/*.gemfile.lock

# TODO Comment out this rule if you are OK with secrets being uploaded to the repo
config/initializers/secret_token.rb
2 changes: 1 addition & 1 deletion .rubocop.yml
@@ -1,7 +1,7 @@
LineLength:
  Max: 120
AllCops:
  TargetRubyVersion: 2.4
  TargetRubyVersion: 2.1
  Exclude:
    - 'spec/**/*'
    - 'bin/*'
9 changes: 9 additions & 0 deletions .travis.yml
@@ -2,6 +2,12 @@ language: ruby
before_install: gem install bundler
bundler_args: --without yard guard benchmarks
script: "rake spec"
env:
  global:
    - "JRUBY_OPTS='$JRUBY_OPTS --debug'"
gemfile:
  - gemfiles/oga.gemfile
  - gemfiles/nokogiri.gemfile
rvm:
  - 2.0
  - 2.1
@@ -12,3 +18,6 @@ rvm:
matrix:
  allow_failures:
    - rvm: ruby-head
  exclude:
    - rvm: 2.0
      gemfile: gemfiles/nokogiri.gemfile # Nokogiri doesn't support Ruby 2.0
5 changes: 4 additions & 1 deletion Gemfile
@@ -2,7 +2,10 @@ source 'https://rubygems.org'

gemspec

gem 'nokogiri', '~> 1.8'
gem 'oga', '~> 2.0'

group :test do
  gem 'coveralls', require: false
  gem 'evil-proxy'
  gem 'evil-proxy', '~> 0.2'
end
68 changes: 58 additions & 10 deletions README.md
@@ -5,17 +5,19 @@
[![Code Climate](https://codeclimate.com/github/nbulaj/proxy_fetcher/badges/gpa.svg)](https://codeclimate.com/github/nbulaj/proxy_fetcher)
[![License](http://img.shields.io/badge/license-MIT-brightgreen.svg)](#license)

This gem can help your Ruby application to make HTTP(S) requests from proxy by fetching and validating actual
This gem can help your Ruby application make HTTP(S) requests through a proxy by fetching and validating up-to-date
proxy lists from multiple providers.

It gives you a `Manager` class that can load proxy lists, validate them and return random or specific proxies. Take a look
at the documentation below to find all the gem features.
It gives you a special `Manager` class that can load proxy lists, validate them and return random or specific proxies.
It also has a `Client` class that encapsulates all the logic for sending HTTP requests through proxies.
Take a look at the documentation below to find all the gem features.

This gem can also be used with any other programming language (Go / Python / etc.) as a standalone solution for downloading and
validating proxy lists from different providers. [Check out the usage examples](#standalone) below.

## Table of Contents

- [Dependencies](#dependencies)
- [Installation](#installation)
- [Example of usage](#example-of-usage)
- [In Ruby application](#in-ruby-application)
@@ -28,12 +30,24 @@ validating proxy lists from the different providers. [Checkout examples](#standa
- [Contributing](#contributing)
- [License](#license)

## Dependencies

The ProxyFetcher gem itself requires only Ruby `>= 2.0.0`.

However, it needs an adapter to parse HTML. If you do not specify one, it uses the
default, [Nokogiri](https://github.com/sparklemotion/nokogiri). That works for any Ruby on Rails project
(because Rails applications use it by default).

If you want a different adapter (for example, your Ruby application uses [Oga](https://gitlab.com/yorickpeterse/oga)),
you need to add its dependencies to your project manually and configure ProxyFetcher to use that adapter. You can also
implement your own adapter if your use case requires it. See the [Configuration](#configuration) section for more details.

## Installation

If using bundler, first add 'proxy_fetcher' to your Gemfile:

```ruby
gem 'proxy_fetcher', '~> 0.5'
gem 'proxy_fetcher', '~> 0.6'
```

or if you want to use the latest version (from `master` branch), then:
@@ -234,7 +248,25 @@ Btw, if you need support of JavaScript or some other features, you need to imple

## Configuration

To change open/read timeout for `cleanup!` and `connectable?` methods you need to change `ProxyFetcher.config`:
ProxyFetcher is a very flexible gem. You can configure the most important parts of the library and plug in your own solutions.

Default configuration looks as follows:

```ruby
ProxyFetcher.configure do |config|
  config.user_agent = ProxyFetcher::Configuration::DEFAULT_USER_AGENT
  config.pool_size = 10
  config.timeout = 3
  config.http_client = ProxyFetcher::HTTPClient
  config.proxy_validator = ProxyFetcher::ProxyValidator
  config.providers = ProxyFetcher::Configuration.registered_providers
  config.adapter = ProxyFetcher::Configuration::DEFAULT_ADAPTER # :nokogiri by default
end
```

You can change any of the options above. Let's look at this deeper.

To change the open/read timeout for the `cleanup!` and `connectable?` methods, change the `timeout` option:

```ruby
ProxyFetcher.configure do |config|
@@ -245,18 +277,19 @@ manager = ProxyFetcher::Manager.new
manager.cleanup!
```

Also you can set your custom User-Agent:
You can also set a custom User-Agent string:

```ruby
ProxyFetcher.configure do |config|
  config.user_agent = 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/60.0.3112.113 Safari/537.36'
end
```

ProxyFetcher uses simple Ruby solution for dealing with HTTP(S) requests - `net/http` library from the stdlib. If you wanna add, for example, your custom provider that
was developed as a Single Page Application (SPA) with some JavaScript, then you will need something like [selenium-webdriver](https://github.com/SeleniumHQ/selenium/tree/master/rb)
to properly load the content of the website. For those and other cases you can write your own class for fetching HTML content by the URL and setup it
in the ProxyFetcher config:
ProxyFetcher uses the standard Ruby solution for HTTP(S) requests: the `net/http` library from the Ruby standard library.
If you want to add, for example, a custom provider that was built as a Single Page Application (SPA) with JavaScript,
you will need something like [selenium-webdriver](https://github.com/SeleniumHQ/selenium/tree/master/rb) to properly
load the content of the website. For those and other cases you can write your own class for fetching HTML content by
URL and set it up in the ProxyFetcher config:

```ruby
class MyHTTPClient
@@ -300,6 +333,21 @@ manager.validate!
#=> [ ... ]
```

By default, the ProxyFetcher gem uses [Nokogiri](https://github.com/sparklemotion/nokogiri) for parsing HTML. If you want
to use [Oga](https://gitlab.com/yorickpeterse/oga) instead, then you need to add `gem 'oga'` to your Gemfile and configure
ProxyFetcher as follows:

```ruby
ProxyFetcher.config.adapter = :oga
```

You can also write your own HTML parser implementation and use it. Take a look at the [abstract class and implementations](lib/proxy_fetcher/document),
then configure it as:

```ruby
ProxyFetcher.config.adapter = MyHTMLParserClass
```
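
The README does not spell out the adapter contract. Judging from `AbstractAdapter` and `Document` in this commit, an adapter class appears to need class-level `setup!` and `parse(data)` methods, instance-level `xpath` and `css` methods, and a nested `Node` class. The sketch below illustrates that shape without depending on the gem. `SketchAbstractAdapter`, `SketchNode` and `TableCellAdapter` are hypothetical stand-ins, and the regex-based `css` is a toy that only understands bare tag names.

```ruby
# Stand-in for ProxyFetcher::Document::AbstractAdapter (hypothetical, so the
# sketch runs without the gem installed).
class SketchAbstractAdapter
  attr_reader :document

  def initialize(document)
    @document = document
  end

  # Resolves the Node class nested inside the concrete adapter.
  def proxy_node
    self.class.const_get('Node')
  end

  def self.setup!
    install_requirements!
  end
end

# Stand-in for ProxyFetcher::Document::Node (hypothetical).
class SketchNode
  attr_reader :node

  def initialize(node)
    @node = node
  end
end

# Hypothetical custom adapter: a toy "parser" that finds elements by tag name.
class TableCellAdapter < SketchAbstractAdapter
  def self.install_requirements!
    # A real adapter would `require` its parsing library here.
  end

  def self.parse(data)
    new(data.to_s)
  end

  # Returns the inner text of every <tag>...</tag> pair; no real CSS support.
  def css(tag)
    name = Regexp.escape(tag)
    document.scan(%r{<#{name}\b[^>]*>(.*?)</#{name}>}m).flatten
  end

  def xpath(_selector)
    raise NotImplementedError, 'the toy backend has no XPath support'
  end

  class Node < SketchNode
    def content
      node.strip
    end
  end
end

TableCellAdapter.setup!
adapter = TableCellAdapter.parse('<tr><td> 10.0.0.1 </td><td>8080</td></tr>')
cells = adapter.css('td').map { |raw| adapter.proxy_node.new(raw) }
puts cells.map(&:content).inspect
```

A real adapter would presumably inherit from `ProxyFetcher::Document::AbstractAdapter` and delegate to an actual HTML parser; the stand-ins only make the required methods visible.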

### Proxy validation speed

There are some tricks to increase proxy list validation performance.
11 changes: 11 additions & 0 deletions gemfiles/nokogiri.gemfile
@@ -0,0 +1,11 @@
source 'https://rubygems.org'

gemspec path: '../'

gem 'nokogiri', '~> 1.8'

group :test do
  gem 'coveralls', require: false
  gem 'evil-proxy', '~> 0.2'
  gem 'rspec-rails', '~> 3.6'
end
11 changes: 11 additions & 0 deletions gemfiles/oga.gemfile
@@ -0,0 +1,11 @@
source 'https://rubygems.org'

gemspec path: '../'

gem 'oga', '~> 2.0'

group :test do
  gem 'coveralls', require: false
  gem 'evil-proxy', '~> 0.2'
  gem 'rspec-rails', '~> 3.6'
end
18 changes: 15 additions & 3 deletions lib/proxy_fetcher.rb
@@ -1,7 +1,5 @@
require 'uri'
require 'net/https'
require 'nokogiri'
require 'thread'

require File.dirname(__FILE__) + '/proxy_fetcher/exceptions'
require File.dirname(__FILE__) + '/proxy_fetcher/configuration'
@@ -10,12 +8,18 @@
require File.dirname(__FILE__) + '/proxy_fetcher/manager'

require File.dirname(__FILE__) + '/proxy_fetcher/utils/http_client'
require File.dirname(__FILE__) + '/proxy_fetcher/utils/html'
require File.dirname(__FILE__) + '/proxy_fetcher/utils/proxy_validator'
require File.dirname(__FILE__) + '/proxy_fetcher/client/client'
require File.dirname(__FILE__) + '/proxy_fetcher/client/request'
require File.dirname(__FILE__) + '/proxy_fetcher/client/proxies_registry'

require File.dirname(__FILE__) + '/proxy_fetcher/document'
require File.dirname(__FILE__) + '/proxy_fetcher/document/adapters'
require File.dirname(__FILE__) + '/proxy_fetcher/document/node'
require File.dirname(__FILE__) + '/proxy_fetcher/document/adapters/abstract_adapter'
require File.dirname(__FILE__) + '/proxy_fetcher/document/adapters/nokogiri_adapter'
require File.dirname(__FILE__) + '/proxy_fetcher/document/adapters/oga_adapter'

module ProxyFetcher
  module Providers
    require File.dirname(__FILE__) + '/proxy_fetcher/providers/base'
@@ -36,5 +40,13 @@ def config
    def configure
      yield config
    end

    private

    def configure_adapter!
      config.adapter = Configuration::DEFAULT_ADAPTER if config.adapter.nil?
    end
  end

  configure_adapter!
end
11 changes: 9 additions & 2 deletions lib/proxy_fetcher/configuration.rb
@@ -1,11 +1,13 @@
module ProxyFetcher
  class Configuration
    attr_accessor :providers, :timeout, :pool_size, :user_agent
    attr_accessor :http_client, :proxy_validator
    attr_accessor :timeout, :pool_size, :user_agent
    attr_reader :adapter, :http_client, :proxy_validator, :providers

    # rubocop:disable Metrics/LineLength
    DEFAULT_USER_AGENT = 'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/60.0.3112 Safari/537.36'.freeze

    DEFAULT_ADAPTER = :nokogiri

    class << self
      def providers_registry
        @registry ||= ProvidersRegistry.new
@@ -35,6 +37,11 @@ def reset!
      self.providers = self.class.registered_providers
    end

    def adapter=(name_or_class)
      @adapter = ProxyFetcher::Document::Adapters.lookup(name_or_class)
      @adapter.setup!
    end

    def providers=(value)
      @providers = Array(value)
    end
23 changes: 23 additions & 0 deletions lib/proxy_fetcher/document.rb
@@ -0,0 +1,23 @@
module ProxyFetcher
  class Document
    class << self
      def parse(data)
        new(ProxyFetcher.config.adapter.parse(data))
      end
    end

    attr_reader :backend

    def initialize(backend)
      @backend = backend
    end

    def xpath(*args)
      backend.xpath(*args).map { |node| backend.proxy_node.new(node) }
    end

    def css(*args)
      backend.css(*args).map { |node| backend.proxy_node.new(node) }
    end
  end
end
24 changes: 24 additions & 0 deletions lib/proxy_fetcher/document/adapters.rb
@@ -0,0 +1,24 @@
module ProxyFetcher
  class Document
    class Adapters
      ADAPTER = 'Adapter'.freeze
      private_constant :ADAPTER

      class << self
        def lookup(name_or_class)
          raise Exceptions::BlankAdapter if name_or_class.nil? || name_or_class.to_s.empty?

          case name_or_class
          when Symbol, String
            adapter_name = name_or_class.to_s.capitalize << ADAPTER
            ProxyFetcher::Document.const_get(adapter_name)
          else
            name_or_class
          end
        rescue NameError
          raise Exceptions::UnknownAdapter, name_or_class
        end
      end
    end
  end
end
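
To make the lookup behaviour concrete, here is a standalone sketch of the same name-to-constant resolution. Everything in `LookupSketch` is a hypothetical stand-in (the empty adapter classes and both error classes included), so it runs without the gem. One consequence worth noting: `String#capitalize` upcases only the first character and downcases the rest, so a multi-word name such as `:my_parser` becomes `My_parserAdapter` and will not resolve. Such adapters would presumably have to be passed as a class.

```ruby
module LookupSketch
  class BlankAdapter < StandardError; end
  class UnknownAdapter < StandardError; end

  # Empty placeholders standing in for the real adapter classes.
  class NokogiriAdapter; end
  class OgaAdapter; end

  SUFFIX = 'Adapter'.freeze

  def self.lookup(name_or_class)
    raise BlankAdapter if name_or_class.nil? || name_or_class.to_s.empty?

    case name_or_class
    when Symbol, String
      # :oga -> "Oga" + "Adapter" -> LookupSketch::OgaAdapter
      const_get(name_or_class.to_s.capitalize + SUFFIX)
    else
      # Anything else is assumed to already be an adapter class.
      name_or_class
    end
  rescue NameError
    raise UnknownAdapter, "unknown adapter: #{name_or_class}"
  end
end

puts LookupSketch.lookup(:oga)
puts LookupSketch.lookup('nokogiri')

begin
  LookupSketch.lookup(:my_parser)
rescue LookupSketch::UnknownAdapter => error
  puts error.message
end
```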
31 changes: 31 additions & 0 deletions lib/proxy_fetcher/document/adapters/abstract_adapter.rb
@@ -0,0 +1,31 @@
module ProxyFetcher
  class Document
    class AbstractAdapter
      attr_reader :document

      def initialize(document)
        @document = document
      end

      # You can override this method in your own adapter class
      def xpath(selector)
        document.xpath(selector)
      end

      # You can override this method in your own adapter class
      def css(selector)
        document.css(selector)
      end

      def proxy_node
        self.class.const_get('Node')
      end

      def self.setup!(*args)
        install_requirements!(*args)
      rescue LoadError => error
        raise Exceptions::AdapterSetupError.new(name, error.message)
      end
    end
  end
end
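
The `setup!` hook defers each `require` until an adapter is actually selected and converts a missing gem into a library-specific error. The sketch below isolates that pattern. `AdapterSetupError`, `BaseAdapter`, `StdlibAdapter` and `MissingGemAdapter` are hypothetical names, the error message format is an assumption, and the missing gem name is deliberately made up. The explicit `rescue LoadError` matters because `LoadError` is a `ScriptError`, which a bare `rescue` would not catch.

```ruby
# Hypothetical error class; the message format is an assumption.
class AdapterSetupError < StandardError
  def initialize(adapter_name, reason)
    super("can't set up '#{adapter_name}' adapter: #{reason}")
  end
end

class BaseAdapter
  # Loads the backing library lazily and wraps a missing dependency.
  def self.setup!
    install_requirements!
  rescue LoadError => error
    raise AdapterSetupError.new(name, error.message)
  end
end

class StdlibAdapter < BaseAdapter
  def self.install_requirements!
    require 'set' # ships with Ruby, so this always loads
  end
end

class MissingGemAdapter < BaseAdapter
  def self.install_requirements!
    require 'a_gem_that_is_not_installed_for_this_demo' # made-up name
  end
end

StdlibAdapter.setup!

begin
  MissingGemAdapter.setup!
rescue AdapterSetupError => error
  puts error.message
end
```

Because `Configuration#adapter=` calls `setup!` immediately, a missing parser gem should surface when the adapter is configured rather than on the first parse.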
35 changes: 35 additions & 0 deletions lib/proxy_fetcher/document/adapters/nokogiri_adapter.rb
@@ -0,0 +1,35 @@
module ProxyFetcher
  class Document
    class NokogiriAdapter < AbstractAdapter
      def self.install_requirements!
        require 'nokogiri'
      end

      def self.parse(data)
        new(::Nokogiri::HTML(data))
      end

      class Node < ProxyFetcher::Document::Node
        def at_xpath(*args)
          self.class.new(node.at_xpath(*args))
        end

        def at_css(*args)
          self.class.new(node.at_css(*args))
        end

        def attr(*args)
          clear(node.attr(*args))
        end

        def content
          clear(node.content)
        end

        def html
          node.inner_html
        end
      end
    end
  end
end