
Orca integration for static image export #1105

Closed
jonmmease opened this issue Aug 10, 2018 · 7 comments
jonmmease commented Aug 10, 2018

Overview

This is a design proposal for integrating orca into plotly.py to support the programmatic export of high-quality static images.

Related issues:

Background

The programmatic export of static raster and vector images from JavaScript-based data visualization libraries is a notoriously complicated problem. One common solution is to combine selenium with a driver for a headless web browser like phantomjs or headless firefox/chrome. This approach is used by Bokeh and Altair, for example. One challenge with this approach is that it requires the installation of dependencies that are not managed by a Python-environment-friendly package manager like conda (although phantomjs is available through conda, its development has been suspended and it does not support WebGL). This presents challenges in terms of portability and reproducibility.

The plotly.js team has taken a different approach with the Orca project. Orca is a standalone Electron application that can run as a command line image export tool, or it can run in a server mode and respond to image export requests interactively. Orca is the backbone of the plot.ly image export service, and it was open sourced earlier this year.

Because Orca can be built into a standalone executable that does not depend on a system web browser, it is possible to package Orca as a conda package, and we've had recent success towards this goal.

This issue is for the discussion of how to build the best plotly.py image export experience on top of Orca.

Goals

  • Users shouldn't need to be aware of how complicated static image export is. At most it should require a single additional conda installation command.
  • It should be as easy and reliable to use as matplotlib's image export.
  • Nothing should flash on the screen or dock or taskbar during export.
  • For raster formats, it should support png, jpg, and webp with configurable resolution.
  • For vector formats it should support svg, pdf, and eps.
  • It should be possible to save images directly to the local filesystem, or to a writable file object.
  • It should be possible to return a byte string containing the image data without specifying filenames (and ideally without actually writing anything to temp files).
  • It should be fast enough to support use as an interactive plotting backend (See New module proposal: plotly.io #1098)
  • It should provide really helpful error messages if the orca executable isn't found.

Potential Approaches

1. Use command-line interface with figure as arg

The current Python instructions in the Orca README suggest the following usage:

from subprocess import call
import json
import plotly

fig = {"data": [{"y": [1,2,1]}]}
call(['orca', 'graph', json.dumps(fig, cls=plotly.utils.PlotlyJSONEncoder)])

Here the figure is serialized to a JSON string and passed as a command line argument to orca. This is nice because it avoids the need to create a temporary file. Unfortunately, there's a limit to how large the command line arguments can be, and large figures cross that boundary, resulting in an exception.

2. Use command-line interface with figure as tmp file

An alternative that doesn't run into this scaling problem is to first write the figure to a temporary file and then call orca with the path to the file. Furthermore, if a collection of figures needs to be converted at once, the paths can all be passed to orca together and orca will convert them in a batch mode. This is much faster on average because the orca executable only has to start up and shut down once per batch, rather than once per figure.
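Sketched minimally, the temp-file batch flow could look like this (the function names are hypothetical; the example only builds the command list and assumes the real caller would hand it to `subprocess.run` with orca on the PATH):

```python
import json
import tempfile

def write_figure_tempfiles(figures):
    """Serialize each figure dict to its own temp JSON file; return the paths."""
    paths = []
    for fig in figures:
        with tempfile.NamedTemporaryFile(
                mode="w", suffix=".json", delete=False) as f:
            json.dump(fig, f)
            paths.append(f.name)
    return paths

def build_orca_command(paths, output_dir):
    """Build one orca invocation covering the whole batch, so the ~2s
    startup cost is paid once rather than once per figure."""
    return ["orca", "graph", *paths, "-d", output_dir]
```

`subprocess.run(build_orca_command(paths, "orca-outputs/"), check=True)` would then perform the actual export.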

3. Use orca in server mode

Another approach would be to launch orca as a subprocess in server mode. The Python library would send individual image export requests to the server on an agreed upon port. The server would respond with the byte string of the converted image. This approach has several advantages, but also some increased complexity.

3.1 Advantages

Response time: Launching orca, whether as a command-line program or in server mode, takes roughly 2 seconds. However, requests to an already running server process are much faster; I've seen round-trip request-to-response times of under 50ms. 2 seconds is acceptable in the context of exporting figures to images on the filesystem, but it is not acceptable for interactive use as a static backend. 50ms feels as fast as matplotlib.

No temp files: This approach doesn't involve any temporary files, and it makes it much simpler to support the non-file image use cases, like returning a byte string or a PIL.Image.Image object to the user.

3.2 Complications

There are some additional complications to this approach. First, the long-running server process would need to be managed by the Python library. It's too resource-intensive to run all the time by default, so the user would need to start it explicitly, or we would need to start it the first time an export is requested.

Then there's the question of whether to leave the server process running indefinitely, or to implement some kind of timeout that shuts the process down after a (configurable) period of inactivity.

Finally, the communication between the Python process and the server requires an open local port, so there's the potential for restrictive firewalls to be a problem. (But, on the other hand, this is also true of the Jupyter Notebook and most applications that interact with an ipython kernel.)
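For concreteness, here is a sketch of what an export request to a locally running orca server might look like. The port, endpoint, and payload field names here are assumptions made for illustration, not orca's documented protocol:

```python
import json
import urllib.request

def build_export_request(figure, fmt="png", scale=2, port=9091):
    """Build an HTTP POST request for a locally running orca server.
    The port, URL path, and payload field names are illustrative
    assumptions; consult the orca server docs for the real protocol."""
    payload = json.dumps({"figure": figure, "format": fmt, "scale": scale})
    return urllib.request.Request(
        url="http://localhost:%d/" % port,
        data=payload.encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )

# urllib.request.urlopen(req).read() would then return the raw image bytes.
```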

What's next

Next we're going to work on testing and releasing conda packages for orca version 1.1.0.

Method 2 above (temp files) is probably the least risky approach, but I really want the advantages that come with Method 3 (server process), so I'd like to give this a shot first. I've already developed a prototype of the server mode approach, with automatic startup and timeout shutdown, and I have it working on OS X, Linux, and Windows. So far I've found it to be very reliable, and the responsiveness is really exciting.

So, I'm quite hopeful that we'll be able to build a solid user experience on top of the server mode. But I would like to hear some other perspectives here.

@chriddyp @jackparmer @cldougl @nicolaskruchten @etpinard @Kully

@nicolaskruchten
Contributor

Thanks for the writeup! I'll think about things a bit more deeply but one idea that came up while reading was what about baking something into the Jupyter ecosystem? Like a special kernel or something whose lifecycle could be managed by Jupyter?

@etpinard
Contributor

A potential variant of 2 (Use CLI with figure as tmp file), as first written down in plotly/orca#110 (comment), would be a "batch" mode where multiple temp JSON files are saved and then exported at once using

orca graph fig.json fig1.json fig2.json

# which can also be saved in a directory e.g.
orca graph fig.json fig1.json fig2.json -d orca-outputs/

which I think could be useful for folks writing scripts (not so much for Jupyter Notebook users I guess).

@jonmmease
Contributor Author

@etpinard Yeah, if we go with 2, I was picturing some kind of batch context manager API.

from plotly.io import save_image, batch_image

fig = ...
save_image(fig, 'out.png')  # <- Write temp file and run orca immediately

with batch_image(parallel_limit=4):
    for i in range(100):
        fig = ...
        save_image(fig, 'out%d.png' % i)  # <- write temp file here

# <- Run orca in batch on all temp files when the context manager exits.
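The deferred-export mechanics behind such a context manager could be sketched like this (a toy illustration; `_export_now` and `EXPORTED` stand in for the real temp-file write and batch orca invocation):

```python
import contextlib

# save_image() normally exports immediately, but inside batch_image()
# it only queues the job; the whole queue is flushed as one batch on exit.
_batch_queue = None
EXPORTED = []  # stand-in for actual orca invocations, for illustration

def _export_now(jobs):
    # Placeholder: real code would write temp files and run orca graph
    # once over all of them.
    EXPORTED.append([path for _, path in jobs])

def save_image(fig, path):
    if _batch_queue is not None:
        _batch_queue.append((fig, path))  # defer: just record the job
    else:
        _export_now([(fig, path)])        # immediate single export

@contextlib.contextmanager
def batch_image():
    global _batch_queue
    _batch_queue = []
    try:
        yield
    finally:
        jobs, _batch_queue = _batch_queue, None
        _export_now(jobs)                 # one orca run for all queued jobs
```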

@malmaud
Contributor

malmaud commented Aug 14, 2018

Great write-up! I'd vote for Method 3.

I have scientific code that saves a static figure on every iteration of an algorithm, something like once every few seconds. A 2-second overhead to start orca on each call would be a non-starter in that context. But having to manually use a batching context manager, with its associated complications, also feels bad given how easy imsave is to use at the moment.

I think you could even consider just keeping the Orca server running indefinitely after the first API call that requires it, at least as the default. I'm worried about the negative user experience if someone is using Plotly in an interactive REPL and experiences seemingly random 2-second delays as Orca is restarted after being shut down in the background.

In the (I think quite rare) case that the Orca server's memory usage is really problematic, the API could offer a function to shut it down manually.

@jonmmease
Contributor Author

Thanks for chiming in @malmaud! Yeah, I do want to get as close as possible to the imsave experience with this.

I'm torn regarding whether or not to auto-shutdown the orca server. In my latest testing on OS X, an orca server process that hasn't done any work yet consumes around 120MB of RAM across three processes (and there's nothing we can do to shrink this further, given it's built on Electron). After saving large images the memory usage increases, and it doesn't always drop back to baseline right away (garbage collection is up to Electron).

My first cut at this is going to autostart the server on first use, and then shut it down after a certain amount of inactivity if a configurable timeout property is set. If the timeout is None then it will not be shut down automatically. I will also provide functions to manually start and shut down the server.
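The start-on-first-use plus inactivity-timeout policy can be sketched with a resettable timer (a hypothetical manager class; a real implementation would spawn and terminate the orca subprocess where the comments indicate):

```python
import threading

class OrcaServerManager:
    """Illustrative sketch: start the server lazily on first use, and
    (re)arm a shutdown timer after each use; timeout=None disables
    auto-shutdown entirely."""

    def __init__(self, timeout=None):
        self.timeout = timeout
        self.running = False
        self._timer = None

    def ensure_started(self):
        if not self.running:
            self.running = True  # real code would spawn the orca subprocess
        if self.timeout is not None:
            if self._timer is not None:
                self._timer.cancel()  # reset the inactivity clock
            self._timer = threading.Timer(self.timeout, self.shutdown)
            self._timer.daemon = True
            self._timer.start()

    def shutdown(self):
        if self._timer is not None:
            self._timer.cancel()
        self.running = False  # real code would terminate the subprocess
```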

So, if all goes well, users will be able to configure the auto shutdown behavior. The remaining question is, what should the default behavior be?

If it shuts down automatically, there may be some confusion as to why some image save calls slow down. If it doesn't shut down, there may be some concern that a long-running process is using a bunch of RAM while not doing anything.

Sounds like @malmaud would prefer not shutting the server down at all by default (maybe with the option of turning on the timeout behavior if a user wants it). Anyone else care to share a preference one way or the other?

@jackparmer
Contributor

jackparmer commented Aug 15, 2018

Sounds like @malmaud would prefer not shutting the server down at all by default (maybe with the option of turning on the timeout behavior if a user wants it).

+1 from me. My guess is that we'll get more community/support questions about the image generation time lag than RAM usage, just judging by the savvy of an average beginning Python user / data scientist. I think we should also aim to replicate the imsave experience as faithfully as possible.

@jonmmease
Contributor Author

jonmmease commented Aug 29, 2018

Merged in #1120

Thanks for the discussion everyone!
