
Datasette serve should accept paths/URLs to CSVs and other file formats #123

simonw opened this issue Nov 19, 2017 · 9 comments


simonw commented Nov 19, 2017

This would remove the csvs-to-sqlite step which I end up using for almost everything.

I'm hesitant to introduce pandas as a required dependency though, since it requires compiling numpy. We could build it so this option is only available if you have pandas installed.


simonw commented Dec 10, 2017

I'm going to keep this separate in csvs-to-sqlite.

@simonw simonw closed this as completed Dec 10, 2017
@simonw simonw added the wontfix label Dec 10, 2017

simonw commented Mar 15, 2019

I'm reopening this one as part of #417.

Further experience with Python's CSV standard library module has convinced me that pandas is not a required dependency for this. My sqlite-utils package can do most of the work here with very few dependencies.
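The stdlib-only approach can be sketched in a few lines using csv and sqlite3 (the function, table, and column names here are illustrative, not sqlite-utils' actual API):

```python
import csv
import io
import sqlite3

def load_csv_into_sqlite(conn, table, fp):
    """Create `table` from an open CSV file object using only the stdlib."""
    reader = csv.reader(fp)
    headers = next(reader)  # first row becomes the column names
    cols = ", ".join('"{}"'.format(h) for h in headers)
    placeholders = ", ".join("?" for _ in headers)
    conn.execute('CREATE TABLE "{}" ({})'.format(table, cols))
    conn.executemany(
        'INSERT INTO "{}" VALUES ({})'.format(table, placeholders), reader
    )
    conn.commit()

conn = sqlite3.connect(":memory:")
load_csv_into_sqlite(conn, "people", io.StringIO("name,age\nCleo,7\nPancakes,5\n"))
rows = conn.execute("SELECT name, age FROM people").fetchall()
```

Everything stays as TEXT here; type detection for numeric columns is one of the things sqlite-utils layers on top.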


simonw commented Mar 15, 2019

How would Datasette accepting URLs work?

I want to support not just SQLite files and CSVs but other extensible formats (geojson, Atom, shapefiles etc) as well.

So datasette serve needs to be able to take filepaths or URLs to a variety of different content types.

If it's a URL, we can use the first 200 downloaded bytes to decide which type of file it is. This is likely more reliable than hoping the web server provided the correct content-type.
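That first-bytes sniffing could look something like this sketch (the magic-byte checks and format names are illustrative assumptions, not Datasette's actual detection code):

```python
import csv

def sniff_format(first_bytes):
    """Guess a format from the first few hundred bytes of a download."""
    # SQLite databases start with a fixed 16-byte magic header
    if first_bytes.startswith(b"SQLite format 3\x00"):
        return "db"
    stripped = first_bytes.lstrip()
    if stripped[:1] in (b"{", b"["):
        return "json"
    if stripped.startswith(b"<?xml") or stripped.startswith(b"<feed"):
        return "atom"
    try:
        # Fall back to the stdlib's CSV dialect sniffer
        csv.Sniffer().sniff(first_bytes.decode("utf-8"))
        return "csv"
    except (UnicodeDecodeError, csv.Error):
        return None
```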

Also: let's have a threshold for downloading to disk. We will start downloading to a temp file (location controlled by an environment variable) if either the content length header is above that threshold OR we hit that much data cached in memory already and don't know how much more is still to come.
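The spill-to-disk behavior described here maps closely onto the stdlib's tempfile.SpooledTemporaryFile, which buffers in memory until a size threshold and then rolls over to a temp file. A sketch, with an illustrative threshold and a made-up environment variable name:

```python
import os
import tempfile

# Threshold and environment variable names are illustrative, not Datasette's.
THRESHOLD = int(os.environ.get("DATASETTE_DOWNLOAD_THRESHOLD", 100 * 1024 * 1024))
TEMP_DIR = os.environ.get("DATASETTE_TEMP_DIR")  # None means the system default

def buffer_response(chunks, content_length=None):
    """Buffer downloaded chunks in memory, spilling to disk past THRESHOLD."""
    buf = tempfile.SpooledTemporaryFile(max_size=THRESHOLD, dir=TEMP_DIR)
    if content_length is not None and content_length > THRESHOLD:
        # Content-Length already exceeds the threshold: go straight to disk
        buf.rollover()
    for chunk in chunks:
        buf.write(chunk)
    buf.seek(0)
    return buf
```

SpooledTemporaryFile handles the "don't know how much more is still to come" case automatically: writes past max_size trigger the rollover.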

There needs to be a command line option for saying "grab from this URL but force treat it as CSV" - same thing for files on disk.

datasette mydb.db --type=db http://blah/blah --type=csv

If you provide fewer --type options than URLs, the default behavior is used for all of the subsequent URLs.
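That pairing of --type options with paths/URLs could be sketched as (the function name is hypothetical):

```python
def pair_types(sources, types):
    """Pair each source with a --type option, in order.

    Sources without a corresponding --type get None, meaning
    fall back to the default behavior (auto-detection).
    """
    padded = list(types) + [None] * max(0, len(sources) - len(types))
    return list(zip(sources, padded))
```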

Auto-detection could be tricky. Probably do this with a plugin hook.

https://github.com/h2non/filetype.py is interesting, but it deals with images, video etc., so it's not right for this purpose.

I think we need our own simple content sniffing code via a plugin hook.

What if two plugin type hooks can both potentially handle a sniffed file? The CLI can quit with an error saying the content is ambiguous and that you need to specify a --type, picking from the following list.
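A sketch of that ambiguity check, assuming plugins register sniffer callables under a type name (all names here are hypothetical):

```python
class AmbiguousTypeError(Exception):
    """Raised when more than one registered sniffer claims the content."""

def detect_type(first_bytes, sniffers):
    """Run every registered sniffer over the sniffed bytes.

    Returns the single matching type name, None if nothing matched,
    and raises if the content is ambiguous.
    """
    matches = [name for name, sniff in sniffers.items() if sniff(first_bytes)]
    if not matches:
        return None
    if len(matches) > 1:
        raise AmbiguousTypeError(
            "Content is ambiguous, specify --type from: " + ", ".join(sorted(matches))
        )
    return matches[0]
```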

@simonw simonw changed the title Datasette should accept CSV file paths directly Datasette serve should accept paths/URLs to CSVs and other file formats Mar 15, 2019

obra commented Sep 24, 2020

As a half-measure, I'd get value out of being able to upload a CSV and have Datasette run csvs-to-sqlite on it.


simonw commented Sep 24, 2020

@obra there's a plugin for that! https://github.com/simonw/datasette-upload-csvs


obra commented Sep 24, 2020 via email

@jsancho-gpl

datasette-connectors provides an API for making connectors for any file-based database. For example, datasette-pytables is a connector for HDF5 files, so it's now possible to use this type of file with Datasette.

It'd be nice if Datasette could provide that API directly, for other file formats and for URLs too.


RayBB commented Jul 18, 2021

I also love the idea of this feature, and I wonder if it could work without having to download the whole database into memory at once when it's a rather large db. Obviously this could be slower, but it could support many use cases.

My comment is partially inspired by this post about streaming sqlite dbs from github pages or such
https://news.ycombinator.com/item?id=27016630


simonw commented Jul 19, 2021

I've been thinking more about this one today too. An extension of this (touched on in #417, Datasette Library) would be to support pointing Datasette at a directory and having it automatically load any CSV files it finds anywhere in that folder or its descendants - either loading them fully, or providing a UI that allows users to select a file to open it in Datasette.

For larger files I think the right thing to do is import them into an on-disk SQLite database, which is limited only by available disk space. For smaller files loading them into an in-memory database should work fine.
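The directory scan and the size-based choice of database could be sketched like this (the 10 MB cutoff and the function names are illustrative assumptions):

```python
import pathlib
import sqlite3

# Size cutoff for keeping a CSV's data purely in memory (illustrative).
IN_MEMORY_CUTOFF = 10 * 1024 * 1024

def find_csvs(root):
    """Recursively find CSV files under a directory and its descendants."""
    return sorted(pathlib.Path(root).rglob("*.csv"))

def connection_for(csv_path):
    """Small files get an in-memory database; large ones a .db file on disk,
    limited only by available disk space."""
    if csv_path.stat().st_size < IN_MEMORY_CUTOFF:
        return sqlite3.connect(":memory:")
    return sqlite3.connect(str(csv_path.with_suffix(".db")))
```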
