-
Notifications
You must be signed in to change notification settings - Fork 857
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
HTML support #16
Comments
+1 |
+1 ! :) |
I think this is more than a "nice to have" - it's kind of a basic feature of markdown. See http://daringfireball.net/projects/markdown/syntax#html. |
It's basic but my personally view is that its a really bad idea to mix markdown and HTML. If someone writes the code to do this I'll happily accept it - I'm just not going write it myself. |
Well if markdown had something for styling (if only defining CSS classes..), I wouldn't need it. Does anyone have ideas? |
We've got a (Maruku dialiect)[http://maruku.rubyforge.org/proposal.html] that supports their metadata proposal so you can do:
to add id's and classes |
My main issue with the lack of HTML support is that it makes tables much harder. I know there are various implementations/proposals for a table syntax in markdown, but it does not seem like markdown-js supports any. |
+1 - i'd use this for tables and other non-markdown supported tags (video). |
+1 required for any table support (including CSS for styling the tables globally via |
Found @cadorn fork which includes HTML inline and works great! |
Can test that a bit more thoroughly and let me know if it works and then we'll get it merged in. |
I've used it for extensive table creation, and inline |
This is a duplicate of issue 11. I don't think cadorn's fork should be merged in in its current state; although it looks like a good solution for applications like writing blog posts you host on your own server, it's only applicable in cases where you completely trust the source of the Markdown, and as such, it would open XSS security holes in applications that are currently using markdown-js to render input across trust boundaries. I'm pasting here the comment that I made on his commit: So, while on one hand I really want this feature for my application of markdown-js, on the other hand I really want a way to filter the HTML to keep out things like the following:
I think I'd also be a little happier with something other than I think another thing that we run into trouble with is entity handling. I ought to be able to write What this adds up to is that we probably need to run all the strings that are the contents of paragraphs, list items, or headers, through a more or less actual HTML parser that can be supplied with whitelists of tags, attributes, and URL schemes, so that it can successfully pass through the subset of well-formed HTML that's right for the application in question. In modern browsers, we could actually use DOMParser, but in Node we might have to use our own. It probably doesn't have to be quite as robust as a browser's parser, since many applications (and basically all applications that use some arbitrary subset of HTML) will give the user a chance to preview and fix their Markdown, so if it barfs on overlapping span-level tags (as GitHub sort of does: I'm not proposing that you should do all this work for me; I was just checking out the network to see if someone had already done it. It looks like you're the one that's come closest. Would it be useful to you if I did what I'm describing? Would it remove the need for the code in this commit for you? (Edited: fixed a typo.) |
@kragen - I am all for a more intelligent HTML parser the way you describe it. I am using this lib to render documentation for my internal projects so there was no need to check the HTML. Should be pretty easy to hook into what I have started. |
(I've bolded some phrases below to facilitate skimming; hope it's not too annoying when you're reading straight through.) Okay, well, I guess I'm committed to implementing this, then. Here's what I'm thinking about how to do it. Is this a good way to do it? I'd really appreciate comments before I go haring off without the benefit of other people's advice and experience.
I assert that this is a bug, because it robs JsonML of any way to represent SGML entity references. I propose that JsonML strings should be treated as PCDATA — allowed to contain entities but not tags — rather than CDATA (plain text) or HTML (text with entities and tags).
I'm hoping I can use an existing pure-JS HTML parser — say, jsdom's, or kn_htmlsafe_htmlSanitize, or NodeHtmlParser — rather than hacking one together from scratch. (As a fallback, I could write a very simple parser for the tags-and-attributes subset of XHTML.) I'm a little worried about the performance implications of this; markdown-js is already a little slower than Showdown, and this could make the matter worse. Does anybody have recommendations here? (In the case where it's running in a modern browser, we could use
Other variations:
This has the disadvantage that it would make some kinds of processing on the intermediate representation harder — for example, in yamemex, I want to support Twitter-style #hashtags, and that will be easier to do if I can tell which hash The advantage is that it would probably make the intermediate processing run faster and take less memory, and it expands the HTML parsers that can be used beyond just those that build a parse tree, which is slow; HTML parsers that simply produce sanitized HTML could be used. Also, the intermediate representation would be simpler, since it wouldn't have HTML tag names in it. This would involve changing the intermediate "JsonML" representation to have HTML rather than CDATA or PCDATA contents — so (My project using markdown-js for, ultimately, social bookmarking.) So, what do other people think? The above represents a few hours of me thinking about the problem, but I anticipate that implementing it will take at least a few days of work, so I'd really appreciate help in thinking this through before I jump in. |
I guess I should elaborate a little bit on the kinds of use cases/threat models I'm thinking of here:
I believe yamemex is in category 2, because I excerpt the pages I bookmark with it all the time. markdown-js is currently safe for this case because it escapes all HTML, but I want it to let through safe HTML. I think that almost any server-based web application that renders Markdown taken from client requests is in category 2, if not category 3, and I think (though I don't know) that many markdown-js applications do that. |
I would pull in something like: Don't parse on output. I think it should make it into the JsonML structure and be sanitized by then. Keep defaults safe and write minimal code using third party tested libraries where possible. |
Oh hey! That looks nice! Thanks for finding that! I wonder how much of Caja I'd have to pull in to get it to work. Doesn't look like that much. In general I'm not that enthusiastic about the quality of random third-party "tested" libraries in JS, but Caja is an exception; the project leads are programmers who use JavaScript, not "JavaScript programmers", and they are good ones. So if the code can be made to do the job (which is still an "if"), that looks like a better option than the alternatives I suggested earlier. Maybe if I dig in I'll change my mind. Apache 2.0 license should be okay, right, Ash? |
Good work, your long comment
Apache 2 is compatible with BSD right? For preference anyway I'd prefer if you just require another lib/module than pull the source in directly. If thats much of a pain to achieve then a subdir under lib/caja/ works too. Above all else it seems you've got it well thought out. So long as there is some docs on how it behaves and it's not too tightly tied to only working in one way I'm more than happy to accept a pull request. Bonus points for having tests - I'm happy if these only run under node so long as the code itself is portable to browsers. |
I'll pitch in on this one. I started using this library yesterday and wrote my own small modifications to markdown.js to handle the two problems presented in this issue (HTML tags and character entities). This was before checking the issues page finding this recently-discussed topic. :) My use case is, for now, limited to my own usage (just like cadorn), but it'd be great with something more reliable than what I currently have. My main problem was actually block-level HTML, that I didn't want to be wrapped in a As for the suggestion of pulling in Caja, I think it sounds like a great idea! Might be good to make it optional though, since, well, it is an additional dependency. Perhaps let it be an option which defaults to on, so that people who don't need the feature don't need to have Caja installed (or can remove it if it has to be bundled in lib/). Anyway, great to see that this is being worked on/that I'm not the only one who want HTML support. |
Hey, I just realized I never responded to the comments above. Didn't do anything this weekend, or yesterday. Everything is compatible with BSD. I'm trying to get the regression tests running and fix some smaller bugs first — see #26 if curious. |
Why not…? That is what has always seemed most sensible to me – for the reason @kragen mentioned, that it treats all tags the same whatever their provenance. After all Markdown is by intent a shorthand for the most common of HTML. Whether you write |
@ap This library is great because it has the JsonML intermediate layer. I send the JSON to the client and have the client convert if from JSON to HTML. I think HTML should be sanitized as it enters JsonML. The conversion from JsonML to HTML should be a simple dumb transformation so alternative output formats can be easily targeted. What exactly are the specific problems with this approach (the comments above are too verbose to follow). |
My favoured approach would be to have the HTML parsed and converted into JsonML for two reasons. The first is as cadorn mentioned. The second is if you parse it into JsonML it should be easier to sanitize/limit the tags that are allowed. @ap's comment sounds like violent agreement - i.e. you both want the HTML parsed? |
Hmm. Basically I consider a Markdown implementation incomplete unless all constructs that can be written using Markdown shorthand can also be written explicitly using the equivalent HTML – i.e. So if one is filtered, the other should be too, and if not, then neither should be. Sanitising at the output stage (after the Markdown has become HTML) has the advantage that a) this equivalence just falls out of the implementation directly with zero further effort and b) the sanitiser is not coupled to the Markdown processor. If you want to sanitise using an intermediate representation that differentiates between Markdown shorthand and explicit HTML then I guess you’d need to use a mapping table or function from Markdown to HTML so that the sanitiser can use it to treat Markdown shorthand syntax as if it were the implied HTML. That would work. The obvious disadvantage is that you are then effectively converting the Markdown to HTML twice, once for the sanitiser and once for output. However, if sanitising happens server-side and the output conversion on the client, then that may be worthwhile anyhow. (It would save you the re-parsing using a completely decoupled HTML parser. And maybe the mapping during the sanitisation stage is cheap enough that it is negligible anyway.) As for targeting alternative output formats, that is essentially a question of converting HTML to the output format in question. Again, by design and intent, Markdown is HTML, just an alternative form that supplies shorthand syntax for a chosen subset of tags. You can convert either all of HTML to the output format or only a defined subset, take your pick – but you convert HTML either way. (E.g. you could pick the subset that only covers the tags which have Markdown shorthands, and that’s fine. Note what this way of looking at it implies: that Does this help? |
What if we convert all known HTML tags (that correspond to markdown syntax) to markdown on input and leave the remaining HTML nodes as HTML (after sanitizing them). This would give a JsonML graph with markdown and HTML nodes allowing for Markdown syntax within HTML content. We can then have a special tag with some options to optionally warp a chunk of HTML to customize how it is to be handled (markdown in content on pure HTML). On the JsonML -> HTML side any HTML nodes just get dumped. This may be the best solution but harder to implement using a third party library unless you get more into the guts of it. |
That would work, I think. One thing though, things like OTOH if you build a hard-coded Markdown-based list of allowed tags into the sanitiser you can get that effect even with a HTML→Markdown mapping. (Which then means you cannot avoid running the sanitiser by simply dropping all explicit HTML tags, because these hard-coded tags must still be allowed through even if nothing else is. But that’s neither here nor there since Markdown = HTML anyway, so either you don’t sanitise at all or you sanitise both forms…) |
@ap attributes are possible, certainly at the JsonML level since the Maruku dialect support this via: |
I don’t mean whether it is possible to parse them, I mean how they are treated by the sanitiser. If the sanitiser is configured to disallow everything, it should still allow Then if the sanitiser implements equivalence of HTML and Markdown by first mapping HTML to Markdown where there is equivalent Markdown syntax, as @cadorn proposed, then edge cases like this which only half map to Markdown are likely not to work quite like they would with output sanitisation – unless care is taken to support them explicitly. A good test suite is probably of the essence to ensure that the intent of the explicit support is preserved in the future, though! The separate output HTML filter stage has the advantage that this will all just work as desired by definition, implicitly – it’s robust in a way inline sanitisation is not, albeit, of course, at a performance penalty that we are trying to avoid here. |
This lib is for Markdown -> JsonML -> HTML conversion. I have no problem with not being able to go backwards from the resulting HTML to Markdown + HTML if the source HTML used non-standard tags. Warning can be thrown if this happens during the Markdown + HTML -> HTML conversion. If someone wants bi-directional conversion certain rules must be followed which are too restrictive for many cases. I want to write website content in Markdown + HTML and want the conversion with HTML and inline Markdown in HTML to JsonML without sanitation as I have control over the source. In this case I want all HTML attributes to come through. I also want the public to edit markdown + HTML for comments etc... in which sanitation is a must. I am not going to discuss the same point back and forth any more as I think @ashb and I are on the same page for the overall approach. We just need to work out the details and get coding. |
No problem. I was referring to this bit from you:
This is workable, and will mostly fulfil the criterion I was talking about (that if In the scenario where you sanitise by parsing the output, the sanitiser would certainly be configured to allow Now if it works the way you suggested, then you will map Does that help? |
I think we need to hard-wire the Markdown <-> HTML tag mappings anyway to make any of this work. Looks like we just need a HTML -> JsonML parser and a sanitizer that works on JsonML. It should not be too difficult to modify a good/portable/purejs HTML parser to do that for us. @ap So are we on the same page now? |
Ahhh. Nice. That addresses the issue I was talking about then, excellent. Yes, I believe we’re on the same page. |
Agreed. But this also seems like a lot of work if you wan't to deal with less than well formed HMTL - I would be happy for badly formed HTML to just fall back to being parsed as markdown (i.e you'd see literal < in the output etc. etc.) Thoughts? |
Maybe it’s possible to tie in an existing HTML5 parser? Otherwise just showing syntactically bad tags as literal text is fine with me. (Maybe do that by default with the option of adding a parser so that people can pay the cost only if they want it.) |
This list may be a good resource to ask for a HTML to JsonML converter or suggestion about which HTML parser to use: https://groups.google.com/group/js-tools Do we have a spec for JsonML? |
The grammar for JsonML is list on the website (http://www.jsonml.org/) if that's what you're looking for |
And in terms of which node names we use, we kinda just made them up. See... https://github.com/evilstreak/markdown-js/blob/master/lib/markdown.js#L1470-1559 |
Why are you guys overcomplicating this? Stick an option in there to allow inline html and leave it at that. Default it to false if you want to. Trying to make the decision for the developer that you need to protect them from scenarios (i.e. cross-site scripting) that are outside the scope of translating markdown to html just causes the library to become bloated, less maintainable and annoys all the people who are expecting it to behave as per the original markdown specification. I suggest you read http://daringfireball.net/projects/markdown/syntax#html - nowhere does it specify that you should escape HTML tags. If you're going to make it support less than the markdown specification at a minimum, or behave contrary to how markdown should behave, then you should call it something other than markdown and remove the hold on the "markdown" identifier in the npm registry, as there are a huge number of developers out there who see this library as the "preferred" library for markdown in node.js (or otherwise) and then start using it only to discover that you don't support the proper markdown specification. |
Even a simple 'allow inline HTML' flag needs some level of HTML parsing to know when to switch back to parsing Markdown again:
I'm personally against putting inline HTML in my markdown as it just feels wrong to me which is why I haven't written the code do to this yet. If someone submits a pull request that achieves even simple inline HMTL and has some tests I'm more than happy to merge it in. |
You have not attained Markdown nature yet, Ash. :-) |
Just so you are all aware: replying with "+1" and nothing else makes me less likely to want to work on this. It's going to happen at some point but you aren't helping. I'm going to delete those comments because they just add noise. |
@ashb any word on this bug? it's been a couple years so just curious if this will be implemented or if you've decided not to.. |
It'd be nice to have an update on this issue. I ran into it myself but side-stepped it for now by escaping HTML on the way into the Markdown parser. So > would end up at &gt; and I'd replace them on the content that comes back out. Not the nicest route to take but didn't want to muck around with the module itself (for what I was working on). Ignore my workaround, it didn't allow for code snippets :-( |
Yeah I'm just noticing the entity substitution also. Not the biggest deal since a lot of browsers know what it means, and render it accordingly, but still, it'd be nice. |
For those asking for updates, there have been a number of pull requests, the most recent of which is #98 |
I wouldn't hold your breath it doesn't look like the maintainer is planning on doing anything at all. |
From the threads I get the impression that this functionality is not really near to the heart of the maintainer, but (s)he hasn’t explicitly said he’ll refuse pull requests… The linked pull request is still open… I asked for a comment on what is blocking the pull request so that we know if there is a way to help out? cheers, |
Guys, there are better alternatives nowdays anyway: |
That depends on what you are looking for; for a project we needed to extend Markdown with a new dialect, and this was much easier to do in markdown-js then in marked, for example. I’d still be really happy with an HTML supporting markdown.js |
I agree with the other comments, this parser should not be aware of things like XSS that's the developer problem and should be handled by other parts of the application ( that's obvious ) Moving to marked too |
+1 for not escaping html in markdown parser. If this not going to be fixed, please say so at the top of the README. I just wasted 2 days playing with this library and need to rewrite my parser now. |
would be really nice, and bring it closer to the discount implementation
The text was updated successfully, but these errors were encountered: