
#22 $keepHTML = false does not work with all tags

open
markdownify (8)
5
2013-04-10
2013-04-10
Anonymous
No

Only element types that appear in a predefined list of tags seem to be stripped successfully. In particular, some elements that are new in HTML5 remain in the output even when $keepHTML is false.

A simple test case follows (also attached):

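<?php
require_once 'markdownify.php'; // assumed path to the library; adjust as needed

// Known tag: <div> is stripped as expected.
$markdownifier = new Markdownify(MDFY_LINKS_EACH_PARAGRAPH, MDFY_BODYWIDTH, /* $keepHTML */ false);
echo $markdownifier->parseString('<div>some text</div>');
echo PHP_EOL;

// HTML5 tag: <article> is left in the output (the bug).
$markdownifier = new Markdownify(MDFY_LINKS_EACH_PARAGRAPH, MDFY_BODYWIDTH, /* $keepHTML */ false);
echo $markdownifier->parseString('<article>some text</article>');
echo PHP_EOL;
?>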

This will need to be run from the CLI or by viewing source in a web browser to see the literal output. You'll notice that the second test outputs "<article>" when it should not.

Discussion

  • Anonymous

    Anonymous - 2013-04-10

    I did a bit more digging, and this bug is in the parser.

    I ran the following code against the parser directly:
    <?php
    function getNodeInfo($parser) {
        return array(
            'nodeType' => $parser->nodeType,
            'tagName' => $parser->tagName,
            'node' => $parser->node,
        );
    }

    function debugParser($html) {
        $parser = new parseHTML;
        $parser->html = $html;

        $nodes = array();
        while ($parser->nextNode()) {
            $nodes[] = getNodeInfo($parser);
        }

        echo $html.PHP_EOL;
        var_export($nodes);
        echo PHP_EOL;
        echo PHP_EOL;
    }

    debugParser('<div>some text</div>');
    debugParser('<article>some text</article>');
    ?>

    The output shows that the second case was not parsed correctly:
    <div>some text</div>
    array (
      0 =>
      array (
        'nodeType' => 'tag',
        'tagName' => 'div',
        'node' => '<div>',
      ),
      1 =>
      array (
        'nodeType' => 'text',
        'tagName' => NULL,
        'node' => 'some text',
      ),
      2 =>
      array (
        'nodeType' => 'tag',
        'tagName' => 'div',
        'node' => '</div>',
      ),
    )

    <article>some text</article>
    array (
      0 =>
      array (
        'nodeType' => 'text',
        'tagName' => '',
        'node' => '&lt;article>some text',
      ),
      1 =>
      array (
        'nodeType' => 'text',
        'tagName' => '',
        'node' => '&lt;/article>',
      ),
    )
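
    This output is consistent with a tokenizer that only recognises tag names from a fixed list and escapes everything else as literal text. Below is a minimal, self-contained sketch of that idea. It is not the parseHTML code: sketchTokenize() and $known are hypothetical names used only for illustration, and the real parser may differ in how it groups text nodes.

    ```php
    <?php
    // Hypothetical whitelist tokenizer; NOT the actual parseHTML implementation.
    function sketchTokenize($html, array $knownTags) {
        $nodes = array();
        // Split the input into tag-like chunks and the text between them.
        $parts = preg_split('/(<\/?[a-zA-Z][a-zA-Z0-9]*[^>]*>)/', $html, -1,
            PREG_SPLIT_DELIM_CAPTURE | PREG_SPLIT_NO_EMPTY);
        foreach ($parts as $part) {
            if (preg_match('/^<\/?([a-zA-Z][a-zA-Z0-9]*)/', $part, $m)
                && in_array(strtolower($m[1]), $knownTags, true)) {
                // Tag name is on the list: emit a tag node.
                $nodes[] = array('nodeType' => 'tag', 'tagName' => strtolower($m[1]), 'node' => $part);
            } else {
                // Plain text, or a tag name that is not on the list:
                // escape a leading "<" and emit it as a text node.
                $escaped = preg_replace('/^</', '&lt;', $part);
                $nodes[] = array('nodeType' => 'text', 'tagName' => '', 'node' => $escaped);
            }
        }
        return $nodes;
    }

    // Assumed whitelist for the sketch; it deliberately lacks HTML5 elements.
    $known = array('div', 'p', 'span');
    $divNodes = sketchTokenize('<div>some text</div>', $known);
    $articleNodes = sketchTokenize('<article>some text</article>', $known);
    ```

    In this sketch, every node produced for the <article> input has nodeType 'text', so a later "strip unwanted tags" step never sees a tag to strip.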

     
  • Anonymous

    Anonymous - 2013-04-10

    I was able to work around the issue for my use case by adding some tags to the arrays in the parseHTML class. For posterity I've attached a patch with my changes (this is undertested, but it seems to work for me).

    If Markdownify ever undergoes active development again, I'd recommend ditching the custom parser in favor of something more established/robust. This StackOverflow answer discusses many options: http://stackoverflow.com/a/3577662/3625
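
    As one illustration of the "established parser" route, PHP's bundled DOM extension can load a fragment and return its text whether or not it knows the element name. This is only a sketch under stated assumptions: plainTextOf() is a hypothetical helper, it assumes the DOM extension is enabled and the input is non-empty, and it extracts plain text rather than producing Markdown.

    ```php
    <?php
    // Hypothetical helper: removes all markup using PHP's DOM extension.
    function plainTextOf($html) {
        // libxml may report HTML5 element names as parse errors;
        // collect them internally instead of emitting warnings.
        $previous = libxml_use_internal_errors(true);

        $doc = new DOMDocument();
        $doc->loadHTML($html);

        // Discard collected errors and restore the previous setting.
        libxml_clear_errors();
        libxml_use_internal_errors($previous);

        // textContent concatenates the text of all descendant nodes.
        return trim($doc->documentElement->textContent);
    }

    echo plainTextOf('<div>some text</div>'), PHP_EOL;
    echo plainTextOf('<article>some text</article>'), PHP_EOL;
    ```

    Because the DOM tree is built without consulting a tag whitelist in user code, <div> and <article> are handled the same way here.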

     
