How do I get the id of each book? #105

Hello,

I have been looking for ways to get the ID of each book in an intuitive way. Getting the ID from the webpage of each book doesn't seem to work: when I run `text = strip_headers(load_etext(17384)).strip()`, it says the book doesn't exist.

One way would be to look at catalogs such as http://www.gutenberg.org/dirs/GUTINDEX.1996. However, these indices are not complete, and there are too many files.

Ideally, I would like a way to search with some keywords, get a list of books, and then use the title or identifier to get the text out.

Comments
Hi @iamyihwa and thanks for reaching out. Did you take a look at the metadata lookup documentation at https://github.com/c-w/gutenberg#looking-up-meta-data?
Hi @c-w, thanks for your reply. I have just tried using the functions that were in the link you sent.

```python
from gutenberg.query import get_metadata
```

```
AttributeError                            Traceback (most recent call last)
~/anaconda3/envs/fastai/lib/python3.6/site-packages/gutenberg/acquire/metadata.py in _add_namespaces(graph)
~/anaconda3/envs/fastai/lib/python3.6/site-packages/rdflib/graph.py in bind(self, prefix, namespace, override)
~/anaconda3/envs/fastai/lib/python3.6/site-packages/rdflib/graph.py in _get_namespace_manager(self)
~/anaconda3/envs/fastai/lib/python3.6/site-packages/rdflib/namespace.py in __init__(self, graph)
~/anaconda3/envs/fastai/lib/python3.6/site-packages/rdflib/namespace.py in bind(self, prefix, namespace, override, replace)
~/anaconda3/envs/fastai/lib/python3.6/site-packages/rdflib/plugins/sleepycat.py in namespace(self, prefix)

AttributeError: 'Sleepycat' object has no attribute '_Sleepycat__namespace'

During handling of the above exception, another exception occurred:

InvalidCacheException                     Traceback (most recent call last)
~/anaconda3/envs/fastai/lib/python3.6/site-packages/gutenberg/query/api.py in get_etexts(feature_name, value)
~/anaconda3/envs/fastai/lib/python3.6/site-packages/gutenberg/query/extractors.py in get_etexts(cls, requested_value)
~/anaconda3/envs/fastai/lib/python3.6/site-packages/gutenberg/query/api.py in _metadata(cls)
~/anaconda3/envs/fastai/lib/python3.6/site-packages/gutenberg/acquire/metadata.py in load_metadata(refresh_cache)
~/anaconda3/envs/fastai/lib/python3.6/site-packages/gutenberg/acquire/metadata.py in open(self)

InvalidCacheException: The cache is invalid or not created
```
Did you ensure to create the metadata cache before running the query?

```python
from gutenberg.acquire import get_metadata_cache

cache = get_metadata_cache()
cache.populate()
```

This should only need to be done once since the results are cached on disk. If this doesn't work for you (due to the BerkeleyDB setup on your machine), you can also try using the SQLite cache, which works everywhere but is somewhat slower:

```python
from gutenberg.acquire import set_metadata_cache
from gutenberg.acquire.metadata import SqliteMetadataCache

cache = SqliteMetadataCache('/my/custom/location/cache.sqlite')
cache.populate()
set_metadata_cache(cache)
```

There's more documentation on this here: https://github.com/c-w/gutenberg#looking-up-meta-data
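Once the cache has been populated, metadata lookups should work. For instance, a minimal check (the book ID 2701 and its expected title are illustrative assumptions, not taken from this thread):

```python
from gutenberg.query import get_metadata

# look up the title for a known book ID (2701 is assumed here to be Moby Dick)
print(get_metadata('title', 2701))
# e.g. frozenset({'Moby Dick; Or, The Whale'})
```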
Hi @c-w, thanks, it worked with the cache trick! I get something back that says it is a 'frozenset'... any clues?
Hi @iamyihwa. The `frozenset` is just a standard immutable Python set holding the matching values for your query, so you can iterate over it, check membership, or convert it with `list(...)`.

In order to do a fuzzy search on the titles, like finding all the books where the title contains "math", you might be able to use or adapt this snippet:

```python
from gutenberg.acquire import get_metadata_cache
from gutenberg.query.api import MetadataExtractor

# define search parameters
search_term = 'math'
search_field = 'title'

# get a reference to the metadata graph
cache = get_metadata_cache()
cache.open()
graph = cache.graph

# execute the search
extractor = MetadataExtractor.get(search_field)
results = ((extractor._uri_to_etext(etext), value.toPython())
           for (etext, value) in graph[:extractor.predicate():]
           if search_term.lower() in value.toPython().lower())

# print the first result of the search: (25387, 'Mathematical Essays and Recreations')
result = next(results)
print(result)
```
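If you want all matches rather than just the first one, you can keep consuming the same generator (a minimal continuation of the snippet above):

```python
# the generator yields (book ID, title) pairs; iterate to get the rest
for etext, title in results:
    print(etext, title)
```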
Thanks @c-w!! What I would like to do eventually is to get some domain-specific texts, train a classifier on them, and later use that classifier to determine the domain of unseen text. I see that sorting by popularity could be one option for this; if you know of any other way, that would be nice!
Hi @c-w, I have just tried the function, however with the index that I get, I cannot use it to retrieve the text. I want to get the text of the book 'Four Lectures on Mathematics', which has the index 29788. However, I get an error. What am I doing wrong? Could you have a look?
In order to have meaningful search relevance, I'd suggest doing a rough filtering of the documents using the Gutenberg library and then ingesting the documents' full text into a real search engine like Elasticsearch or Azure Search. That way you'll get nice disambiguation. If that approach is too heavy, you can also adjust the query condition in the text-search snippet that I sent earlier.

Downloading book 29788 fails since it doesn't offer a textual download. I've updated the error message to make this clearer. You can check the available formats for a book like this:

```python
from gutenberg.query import get_metadata

print(get_metadata('formaturi', 29788))
# frozenset({
#     'http://www.gutenberg.org/files/29788/29788-t/29788-t.tex',
#     'http://www.gutenberg.org/files/29788/29788-pdf.pdf',
#     'http://www.gutenberg.org/files/29788/29788-pdf.zip',
#     'http://www.gutenberg.org/ebooks/29788.rdf',
#     'http://www.gutenberg.org/files/29788/29788-t.zip'
# })
```

In order to download one of these non-textual formats, you can use this snippet:

```python
from gutenberg.acquire.text import _etextno_to_uri_subdirectory
from gutenberg.acquire.text import _GUTENBERG_MIRROR

text = 29788
extension = '-pdf.pdf'

# build the download URL for the requested format
url = '{mirror}/{path}/{text}{extension}'.format(
    mirror=_GUTENBERG_MIRROR,
    path=_etextno_to_uri_subdirectory(text),
    text=text,
    extension=extension)
```
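To actually fetch the file once `url` has been formatted, something like the following should work (a sketch using only the Python standard library; the local file name is an arbitrary choice):

```python
import urllib.request

# download the formatted URL and save it locally (file name is arbitrary)
urllib.request.urlretrieve(url, '29788-pdf.pdf')
```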
The `load_etext` function currently throws an exception when there is no textual download candidate available for a given book. However, some users might want to use Gutenberg to download non-textual versions of books. All available formats of a book can already be looked up via the formaturi metadata extractor, so this change exposes a method to enable a client to format the download URL for an arbitrary extension. See #105
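Putting those two pieces together, a client could guard against the exception before calling `load_etext`, roughly like this (a minimal sketch, assuming textual formats can be recognized by a `.txt` suffix among the `formaturi` values; the helper name is made up for illustration):

```python
from gutenberg.acquire import load_etext
from gutenberg.query import get_metadata

def load_etext_if_textual(etextno):
    """Return the book's text, or None when no textual download exists."""
    format_uris = get_metadata('formaturi', etextno)
    if not any(uri.endswith('.txt') for uri in format_uris):
        return None
    return load_etext(etextno)
```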
Thanks @c-w for the quick feedback and for the ways to work through this! I have also tested the new function; however, `_format_download_uri_for_extension` didn't work even after the update.
As I mentioned, the new method was just published to master, but we haven't made a new PyPI release yet. This means that you'll have to install the package from GitHub rather than from PyPI.
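(The exact install command was cut off in the original comment; a typical way to install straight from the repository would be `pip install git+https://github.com/c-w/gutenberg.git`.)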
@iamyihwa Closing this issue since all of your questions seem to have been addressed. Feel free to reopen if you have any additional questions. |
@c-w Thanks for the support. Yes, right now all the doubts and problems have been solved. Thanks a lot again for all the help! I will surely get back when I have more issues.