It seems that only element types included in a predefined list of tags are successfully stripped. In particular, I noticed that some elements that are new in HTML5 remain in the output even when $keepHTML is false.
A simple test case follows (also attached):
<?php
$markdownifier = new Markdownify(MDFY_LINKS_EACH_PARAGRAPH, MDFY_BODYWIDTH, /* $keepHTML */ false);
echo $markdownifier->parseString('<div>some text</div>');
echo PHP_EOL;
$markdownifier = new Markdownify(MDFY_LINKS_EACH_PARAGRAPH, MDFY_BODYWIDTH, /* $keepHTML */ false);
echo $markdownifier->parseString('<article>some text</article>');
echo PHP_EOL;
?>
To see the literal output, run this from the CLI or view the page source in a web browser. You'll notice that the second test outputs "<article>" when it should not.
A simple test case.
I did a bit more digging and this bug is in the parser.
When running the following code:
<?php
function getNodeInfo($parser) {
    return array(
        'nodeType' => $parser->nodeType,
        'tagName' => $parser->tagName,
        'node' => $parser->node,
    );
}
function debugParser($html) {
    $parser = new parseHTML;
    $parser->html = $html;
    $nodes = array();
    while ($parser->nextNode()) {
        $nodes[] = getNodeInfo($parser);
    }
    echo $html.PHP_EOL;
    var_export($nodes);
    echo PHP_EOL;
    echo PHP_EOL;
}
debugParser('<div>some text</div>');
debugParser('<article>some text</article>');
?>
The output shows that the second case was not parsed correctly:
<div>some text</div>
array (
  0 =>
  array (
    'nodeType' => 'tag',
    'tagName' => 'div',
    'node' => '<div>',
  ),
  1 =>
  array (
    'nodeType' => 'text',
    'tagName' => NULL,
    'node' => 'some text',
  ),
  2 =>
  array (
    'nodeType' => 'tag',
    'tagName' => 'div',
    'node' => '</div>',
  ),
)

<article>some text</article>
array (
  0 =>
  array (
    'nodeType' => 'text',
    'tagName' => '',
    'node' => '<article>some text',
  ),
  1 =>
  array (
    'nodeType' => 'text',
    'tagName' => '',
    'node' => '</article>',
  ),
)
I was able to work around the issue for my use case by adding some tags to the arrays in the parseHTML class. For posterity I've attached a patch with my changes (it is only lightly tested, but it seems to work for me).
If Markdownify ever undergoes active development again, I'd recommend ditching the custom parser in favor of something more established/robust. This StackOverflow answer discusses many options: http://stackoverflow.com/a/3577662/3625
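To illustrate what a more established parser would buy here, below is a rough sketch using PHP's bundled DOM extension rather than anything from Markdownify. The helper name listElementNames is made up for this example, and the sketch assumes ext-dom (libxml) is available. It only checks whether the parser recognizes the tags as elements; it does not convert anything to Markdown.

```php
<?php
// Hypothetical helper (not part of Markdownify): return the names of all
// elements that PHP's DOM extension finds in an HTML fragment.
function listElementNames($html) {
    // libxml may warn about tags it does not know (such as <article>);
    // collect those warnings internally instead of printing them.
    $previous = libxml_use_internal_errors(true);

    $doc = new DOMDocument();
    // Wrap the fragment so that a <body> element is guaranteed to exist.
    $doc->loadHTML('<html><body>' . $html . '</body></html>');

    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $names = array();
    $body = $doc->getElementsByTagName('body')->item(0);
    foreach ($body->getElementsByTagName('*') as $element) {
        $names[] = $element->nodeName;
    }
    return $names;
}

// Same two inputs as in the test case from the original report.
echo implode(',', listElementNames('<div>some text</div>')) . PHP_EOL;
echo implode(',', listElementNames('<article>some text</article>')) . PHP_EOL;
```

As far as I can tell, both inputs come back as a single element here, so the HTML5 tag is not mistaken for text the way it is in the parseHTML output above.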
Patch to make HTML parser able to handle more elements.