Proposal: markdown parsing for react-native-render-html #3983

jsamr · 2021-07-12T16:43:19Z

This is a follow-up on a request from @roryabraham. I have investigated the options to integrate Markdown parsing a little further, and here are my findings:

Markdown parsing is hard. While XML or HTML markups are not context sensible, Markdown is. So implementing my own tokenizer is a no-go for a production-ready deliverable (although I might try the challenge as a hobby at some point).
Changing the structure of the DOM would bring too many breaking changes, so using remarkjs is a no-go.
The solution would be intermediary: use an open-source, well tested markdown tokenizer and plug that in htmlparser2 Tokenizer class to emit a DOM.

Chosing a Tokenizer

I've forked this benchmark to add micromark which I found very well structured and solid (via remark-html), and below are my findings (Intel i7-8809G, 32GB of RAM, Nodejs 14.16.0).

Average Ops per second

Minmax parse time

Average Throughput

Conclusion

Markdown-it is the clear winner, since there is no official web assembly support in React Native. Other plus:

Great ecosystem with many plugins and GFM support including emojis;
Safe by default;
Great maintenance metrics (5 open issues).

Implementation Plan

Get inspiration from MarkdownIt.Renderer:
consume a token tree from MarkdownIt.parse and invoke corresponding htmlparser2 callbacks while walking the tree.

I'll also need some help to assess which features you want to enable for Expensify.cash.

Package Design

I need to think of a new package design since I don't want @native-html/core to depend directly on markdown-it.

Testing Strategy

The parser will be tested against the official commonmark-spec repository.

The text was updated successfully, but these errors were encountered:

MelvinBot · 2021-07-12T16:43:22Z

Triggered auto assignment to @johncschuster (AutoAssignerTriage), see https://stackoverflow.com/c/expensify/questions/4749 for more details.

quinthar · 2021-07-12T17:51:57Z

To help me understand, can you confirm:

Are you sure wasm isn't supported? @robertjchen is planning on using it for encryption, I think.
Can you benchmark this against our current custom markdown>html parser?
I thought the idea was translate directly from markdown to DOM, skipping HTML entirely. But this seems like it is just benchmarking various open source markdown>html converters. Can you restate the goal?

Thanks!

roryabraham · 2021-07-12T17:59:54Z

Nice, markdown-it looks like it has a very active community, which we love to see! I'm also excited about the rule customization and plugin architecture it offers.

I'll also need some help to assess which features you want to enable for Expensify.cash.

Let me know what questions you have and how I can help! I am envisioning a phased scope-of-work that might look something like this:

You work on a POC of the MD integration in react-native-render-html
Once you've got a working POC, you work on benchmarking MD parsing v.s. HTML parsing in react-native-render-html so we can get a clear idea of the performance implications there.
Next, assuming the benchmarks are promising, we work together on implementing direct MD rendering into Expensify.cash.
1. There's a good chance we'll need an internal engineer to work on this phase with you, so we can locally test any necessary API changes.
2. I think the most challenging aspect of that integration will be that we'll need to be careful to maintain backwards-compatibility so that any necessary API changes don't break older versions of the application.
Once we have a working implementation of MD parsing via react-native-render-html in Expensify.cash, we'll do benchmark testing of Expensify.cash before (with our current markdown <-> HTML converter) and analyze the performance implications before we decide to use it in production.

Also, as @quinthar has suggested above, if you could benchmark markdown-it, etc... against our current markdown -> HTML converter (which lives in our expensify-common repo) that would likely be a great first step before we dive into the phased implementation outlined above.

I thought the idea was translate directly from markdown to DOM, skipping HTML entirely.

This is my understanding as well.

jsamr · 2021-07-12T18:47:51Z

@quinthar

Are you sure wasm isn't supported? @robertjchen is planning on using it for encryption, I think.

My understanding is that WASM is not officially supported at the moment and there is an open ticket for Hermes engine, but you can use react-native-wasm to execute WASM in a WebView. Hence I'd rather wait for official support before considering markdown-wasm.

I thought the idea was translate directly from markdown to DOM, skipping HTML entirely. But this seems like it is just benchmarking various open source markdown>html converters. Can you restate the goal?

Yes, the goal is still to translate markdown to DOM! The benchmark is a quick glimpse at each parser capability, assuming that the "whichever-kind-of-syntax-tree-structure to HTML part" performance would be sensibly the same for every engine. I could rewrite the benchmark for accuracy, but honestly I looked for the fastest-to-implement path. To develop the statement in the OP, the plan is to use markdown-it tokenizer and plug it in a htmlparser2 Tokenizer derived class, and then pass this class reference to htmlparser2 Parser which in turns emits events consumed by domhandler to build up a DOM.

Can you benchmark this against our current custom markdown>html parser?

The thing is, all the tested markdown parsers claim compliance with CommonMark, and I doubt this ad-hoc ExpensiMark parser does. So to benchmark your parser, I would need to tailor tests scenarios and make sure the parsers are working in equal conditions. I'd be happy to do that in a preliminary "0" step!

@roryabraham

That sounds like a solid plan. As react-native-render-html development model is through consulting and sponsoring, would you go along with the idea of financing the initial exploratory steps? And by the way, if ExpensiMark turns out to be faster because it's tailored for your needs, I could also consider a POC based on it (however, its reliance on regex to identify tokens makes me doubt it will).

MelvinBot · 2021-07-12T18:49:14Z

Triggered auto assignment to @tgolen (Engineering), see https://stackoverflow.com/c/expensify/questions/4319 for more details.

robertjchen · 2021-07-12T18:59:25Z

My understanding is that WASM is not officially supported at the moment and there is an open ticket for Hermes engine, but you can use react-native-wasm to execute WASM in a WebView

To clarify, this is somewhat true- we won't be able to use pure WASM in React without the WebView workaround, but we can still compile markdown-wasm to the highly optimized asm.js JS subset (now confusingly under the umbrella of WebAssembly™️ ).

Theoretically this might net us performance results between markdown-it and pure WASM markdown-wasm. Just another option to consider 👍

jsamr · 2021-07-12T21:17:25Z

@robertjchen Thanks for pointing that out! I must confess that I'm not proficient in WASM nor asm.js. After reading markdown-wasm in surface, I realize that it's basically a wrapper around md4c, compiled in wasm. For our usecase, we would need to get the parse tree generated by md4c, and access the tree in memory with some JS bindings. I don't know if WASM offers this capability! Nonetheless, JSI (JavaScript Interface) offers this feature with C/C++ so that would probably be a better fit! However I am not a C/C++ developer (although I'd love to get my hands on), but I find the prospect very interesting. The Graal would be a transient render engine written in C++ with JSI bindings, but that is clearly a huge endeavor and obviously out of scope 🤣

jsamr · 2021-07-13T23:58:59Z

@roryabraham @quinthar

Can you benchmark this against our current custom markdown>html parser?

I changed my mind and thought that if it does perform worse, that will be flagrant whether or not the same set of syntax rules are covered. I managed to incorporate ExpensiMark in the benchmark (although in a hacky way because it uses ES modules , I just copied the required sources).

As I had anticipated, ExpensiMark turns out to be pretty slow in comparison to other parsers (I'm still dumbfounded by the bad performance of remark). This is probably because of its regex-based parsing, which requires vastly more input traversals than a automaton "finite state machine" based parser.

quinthar · 2021-07-14T00:08:36Z

I would expect that over time we're going to want our own parser, to do nonstandard things -- we want to own our own grammar. But this is a huge testament to the power of WASM and we should absolutely be aiming for that. Can we just port ExpensiMark to C++ and compile to WASM? No need to use regexp.

jsamr · 2021-07-16T18:09:32Z

@quinthar If you can get your hands on C/C++, I'm not sure I see any advantage of using WASM over JSI. AFAIK WASM is still interpreted, while C/C++ will be run in an architecture-specific binary. Plus, JSI is the future standard for performant low-level react-native modules... An important thing to consider if you want to follow that scheme (porting ExpensiMark to C++ based on md4c), is converting back HTML to markdown. But there are good and fast C++ HTML parsers and that part should be much easier.

No hurry, but please let me know if you're still interested in markdown being supported directly in RNRH, most likely with markdown-it. mardown-it allows you to create new syntax rules via plugins so you shouldn't feel limited in that regard.

Hopefully at some point the whole transient render engine will be ported to C++ where we could use md4c for Markdown. In any case we would make sure you can reasonably easily plug-in your own parser implementation to generate a TRT and render that in RNRH.

MelvinBot · 2021-11-29T19:00:58Z

@jsamr, this Monthly task hasn't been acted upon in 6 weeks; closing.

If you disagree, feel encouraged to reopen it -- but pick your least important issue to close instead.

jsamr added the AutoAssignerTriage Auto assign issues for triage to an available triage team member label Jul 12, 2021

MelvinBot assigned johncschuster Jul 12, 2021

MelvinBot removed the AutoAssignerTriage Auto assign issues for triage to an available triage team member label Jul 12, 2021

johncschuster added the Engineering label Jul 12, 2021

MelvinBot assigned tgolen Jul 12, 2021

johncschuster unassigned tgolen and johncschuster Jul 12, 2021

kadiealexander added the n6-hold label Sep 16, 2021

MelvinBot added the Monthly KSv2 label Sep 16, 2021

mallenexpensify removed the n6-hold label Nov 16, 2021

MelvinBot added the Overdue label Nov 16, 2021

MelvinBot closed this as completed Nov 29, 2021

MelvinBot added Not a priority and removed Overdue labels Nov 29, 2021

JalalMitali mentioned this issue Feb 20, 2022

[HOLD for payment 2022-05-20] [$2000] When copy&paste multiple messages line breaks aren't maintained #7352

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Proposal: markdown parsing for react-native-render-html #3983

Proposal: markdown parsing for react-native-render-html #3983

jsamr commented Jul 12, 2021

MelvinBot commented Jul 12, 2021

quinthar commented Jul 12, 2021

roryabraham commented Jul 12, 2021

jsamr commented Jul 12, 2021

MelvinBot commented Jul 12, 2021

robertjchen commented Jul 12, 2021

jsamr commented Jul 12, 2021

jsamr commented Jul 13, 2021

quinthar commented Jul 14, 2021

jsamr commented Jul 16, 2021

MelvinBot commented Nov 29, 2021

Proposal: markdown parsing for react-native-render-html #3983

Proposal: markdown parsing for react-native-render-html #3983

Comments

jsamr commented Jul 12, 2021

Chosing a Tokenizer

Average Ops per second

Minmax parse time

Average Throughput

Conclusion

Implementation Plan

Package Design

Testing Strategy

MelvinBot commented Jul 12, 2021

quinthar commented Jul 12, 2021

roryabraham commented Jul 12, 2021

jsamr commented Jul 12, 2021

MelvinBot commented Jul 12, 2021

robertjchen commented Jul 12, 2021

jsamr commented Jul 12, 2021

jsamr commented Jul 13, 2021

quinthar commented Jul 14, 2021

jsamr commented Jul 16, 2021

MelvinBot commented Nov 29, 2021