Incorrect tokenization in HTML::Parser [rt.cpan.org #83570] #14

Open
oalders opened this issue Aug 24, 2020 · 0 comments

oalders commented Aug 24, 2020

Migrated from rt.cpan.org#83570 (status was 'open')

Requestors:

Attachments:

From [email protected] on 2013-02-23 17:43:32:

Hi Gisle,



First, thank you for all of your huge contributions to Perl over the years!



I've discovered a site (http://www.scotts.com/) that has HTML that HTML-Parser does not tokenize correctly.



Envs (tried on two machines, same results):

*         HTML::Parser (3.65 and 3.69)

*         Perl 5.14.2, and 5.10.1

*         'full_uname' => 'Linux 449876-app3.blosm.com 2.6.18-238.37.1.el5 #1 SMP Fri Apr 6 13:47:10 EDT 2012 x86_64 x86_64 x86_64 GNU/Linux',

*         'os_distro' => 'Red Hat Enterprise Linux Server release 5.9 (Tikanga) Kernel \\r on an \\m',

*         'full_uname' => 'Linux idx02 2.6.43.5-2.fc15.x86_64 #1 SMP Tue May 8 11:09:22 UTC 2012 x86_64 x86_64 x86_64 GNU/Linux',

*         'os_distro' => 'Fedora release 15',



I'm attaching a representative page. The page came from:
http://www.scotts.com/smg/templates/index.jsp?pageUrl=orthoLanding

The problem seems to occur around the HTML:
                <noscript>
                                <iframe height="0" width="0" style="display:none; visibility:hidden;"
                                                src="//www.googletagmanager.com/ns.html?id=GTM-PVLS"
                                                />
                </noscript>
                <script>

I've added some debugging to the HTML::TokeParser::get_tag sub so it looks like:
use Data::Dumper;
sub get_tag
{
    my $self = shift;
    my $token;
    while (1) {
        $token = $self->get_token || return undef;

        # Debugging aid: dump every token before it is filtered.
        warn "Checking token: [".Dumper($token)."]";

        my $type = shift @$token;
        # Only start ("S") and end ("E") tokens are tags.
        next unless $type eq "S" || $type eq "E";
        # Prefix end-tag names with "/".
        substr($token->[0], 0, 0) = "/" if $type eq "E";
        return $token unless @_;
        for (@_) {
            return $token if $token->[0] eq $_;
        }
    }
}

I've tried versions 3.65 and 3.69 of HTML::Parser; both produce the same results, shown in the "output" attachment. On line 290 of the output you can see that almost the entire page after the iframe is tokenized as one big text blob.
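
For anyone without the attachments, a reduced reproduction might look like the sketch below. It feeds the snippet quoted above to HTML::TokeParser and prints one line per token, so a single oversized text token is easy to spot. The trailing `<script></script>` and `<p>` elements and the helper name `summarize_tokens` are assumptions added for illustration; they are not part of the original page or report.

```perl
use strict;
use warnings;
use HTML::TokeParser;

# Reduced version of the markup quoted in the report. The closing
# </script> and the <p> element are hypothetical stand-ins for the
# rest of the page.
my $html = <<'HTML';
<noscript>
  <iframe height="0" width="0" style="display:none; visibility:hidden;"
          src="//www.googletagmanager.com/ns.html?id=GTM-PVLS"
          />
</noscript>
<script></script>
<p>rest of the page</p>
HTML

# Collect a one-line summary of every token the parser emits.
sub summarize_tokens {
    my ($doc) = @_;
    my $p = HTML::TokeParser->new(\$doc)
        or die "cannot create parser";
    my @summary;
    while (my $token = $p->get_token) {
        # $type is "S", "E", "T", ...; $value is the tag name or the text.
        my ($type, $value) = @$token;
        $value =~ s/\s+/ /g;
        push @summary, sprintf("%s: %s", $type, substr($value, 0, 40));
    }
    return @summary;
}

print "$_\n" for summarize_tokens($html);
```

If the behavior described in this report is present, the tokens after the iframe start tag should collapse into a single "T" entry rather than separate tags.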

Thanks again,

-Carl


Carl Eklof
CTO @ Blosm Inc.
blosm.com<http://blosm.com/>
424.888.4BEE



From [email protected] on 2015-01-04 01:23:37:

I've been seeing this with some code I'm working on too.

To summarize this very simply, it seems like HTML::TokeParser does something weird when a tag contains a self-closing slash. If the tag is written as "<hr/>" then the parser thinks the tag name is "hr/". If it's written as "<hr />" then we end up with a "/" attribute.
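
A small script along these lines should make the two cases easy to compare. It is only a sketch: the helper name `first_tag` is made up for illustration, and the exact output may depend on the installed HTML::Parser version and its defaults.

```perl
use strict;
use warnings;
use HTML::TokeParser;

# Return the first tag found in $html as [name, [sorted attribute names]].
# "first_tag" is a hypothetical helper, not part of HTML::TokeParser.
sub first_tag {
    my ($html) = @_;
    my $p   = HTML::TokeParser->new(\$html) or die "cannot create parser";
    my $tag = $p->get_tag or return;
    # For a start tag, get_tag returns [$name, \%attr, \@attrseq, $text].
    my ($name, $attr) = @$tag;
    return [ $name, [ sort keys %$attr ] ];
}

for my $html ('<hr/>', '<hr />') {
    my $tag = first_tag($html);
    printf "%-8s name=[%s] attrs=[%s]\n",
        $html, $tag->[0], join(',', @{ $tag->[1] });
}
```

With the default settings described here, the first line would be expected to show the name "hr/" and the second the name "hr" with a "/" attribute.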

From [email protected] on 2015-01-04 16:00:15:

I cloned the repo with the intention of fixing this, but when I looked through the test cases I realized that this behavior is actually tested for.

Gisle, what's up with this? It's not documented, AFAICT, and it really doesn't make much sense.

From [email protected] on 2016-01-19 00:12:40:

On Sun Jan 04 11:00:15 2015, DROLSKY wrote:
> I cloned the repo with the intention of fixing this, but when I looked
> through the test cases I realized that this behavior is actually
> tested for.
>
> Gisle, what's up with this? It's not documented, AFAICT, and it really
> doesn't make much sense.

Perhaps it was just based on my understanding of the status of this syntax,
drawn from this advice in the XHTML spec:


C.2. Empty Elements
-------------------

Include a space before the trailing / and > of empty elements, e.g. <br />, <hr
/> and <img src="karen.jpg" alt="Karen" />. Also, use the minimized tag syntax
for empty elements, e.g. <br />, as the alternative syntax <br></br> allowed by
XML gives uncertain results in many existing user agents.


From [email protected] on 2016-01-19 00:21:34:

http://www.w3.org/TR/html5/syntax.html#tag-name-state seems clear on allowing
this, so feel free to change the tests


From [email protected] on 2016-01-19 17:34:19:

Just turning on the "empty_element_tags" option might make the parser behave
the way you expect. It might be that we should just switch the default for this
option.
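
As a rough sketch of that suggestion, the option can be set on an HTML::TokeParser object before the first token is read, since HTML::TokeParser inherits from HTML::Parser. The helper name `tag_names` and the sample markup are assumptions for illustration only.

```perl
use strict;
use warnings;
use HTML::TokeParser;

# List the tag names seen in $html with empty_element_tags on or off.
# "tag_names" is a hypothetical helper, not part of HTML::TokeParser.
sub tag_names {
    my ($html, $empty_element_tags) = @_;
    my $p = HTML::TokeParser->new(\$html) or die "cannot create parser";
    # Must be set before the first get_tag/get_token call to take effect
    # for the whole document.
    $p->empty_element_tags($empty_element_tags);
    my @names;
    while (my $tag = $p->get_tag) {
        push @names, $tag->[0];    # end tags are reported as "/name"
    }
    return @names;
}

my $html = '<p>before<hr/>after</p>';
print "off: @{[ tag_names($html, 0) ]}\n";
print "on:  @{[ tag_names($html, 1) ]}\n";
```

With the option on, the HTML::Parser documentation says an empty element tag produces a start event followed by an artificial end event, so "<hr/>" should show up as "hr" and "/hr" instead of "hr/".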

