Merge pull request #6 from ti-oluwa/dev

Dev
ti-oluwa · Jun 9, 2024 · 870d275
2 parents bf5fd40 + 77fcfe2
commit 870d275
Showing 6 changed files with 206 additions and 36 deletions.
diff --git a/README.md b/README.md
@@ -1,12 +1,14 @@
 # Schlumberger Petroleum Glossary
 
-Search the Schlumberger Petroleum Glossary using Selenium.
+Browse the Schlumberger Petroleum Glossary in Python.
 
-> **For optimum performance, it is advisable to use the module with the Chrome browser and a fast and stable internet connection.**
+**For optimum performance, use the Chrome browser and a fast and stable internet connection.**
+
+> This package is intended for research or instructional use only.
 
 ## Installation
 
-* The package can be installed using pip as follows:
+* Install using pip:
 
 ```bash
 pip install slb-glossary
@@ -15,15 +17,15 @@ pip install slb-glossary
 ## Dependencies
 
 * [selenium](https://pypi.org/project/selenium/)
-* [openpyxl](https://pypi.org/project/openpyxl/)
+* [openpyxl](https://pypi.org/project/openpyxl/) (for exporting search results to Excel)
 
 ## Quick Start
 
 ```python
 import slb_glossary as slb
 
 # Create a glossary object
-glossary = slb.Glossary(slb.Browser.CHROME. open_browser=True)
+glossary = slb.Glossary(slb.Browser.CHROME, open_browser=True)
 
 # Search for a term
 results = glossary.search("porosity")
@@ -33,19 +35,182 @@ for result in results:
     print(result.asdict())
 ```
 
-<!-- ## Usage
+## Usage
+
+**Please note that this is just a brief overview of the module. The module is heavily documented and you are encouraged to read the docstrings for more information on the various methods and classes.**
+
+> "topics" used in the context of this documentation refers to the subjects or topics in the glossary.
+
+### Instantiate a glossary object
+
+Import the module:
+
+```python
+import slb_glossary as slb
+```
+
+To use the glossary, you need to create a `Glossary` object. The `Glossary` class takes a few arguments:
+    - `browser`: The browser to use. It can be any of the values in the `Browser` enum.
+    **Ensure the selected browser is installed on your machine.**
+    - `open_browser`: A boolean indicating whether to open the browser when searching the glossary or not.
+    If this is True, a browser window is opened when you search for a term. This can be useful for monitoring
+    and debugging the search process. If you don't need to see the browser window, set this to False.
+    Setting it to False is analogous to running the browser in headless mode. The default value is False.
+    - `page_load_timeout`: The maximum time to wait for a page to load before raising an exception.
+    - `implicit_wait_time`: The maximum time to wait for an element to be found before raising an exception.
+    - `language`: The language to use when searching the glossary. This can be any of the values in the `Language` enum.
+    Presently, only English and Spanish are supported. The default value is `Language.ENGLISH`.
+
+```python
+glossary = slb.Glossary(slb.Browser.CHROME, open_browser=True)
+```
+
+### Get all topics/subjects available in the glossary
+
+When you initialize a glossary, the available topics are automatically fetched and stored in the `topics` attribute.
+
+```python
+topics = glossary.topics
+print(topics)
+```
+
+This returns a mapping of each topic to the number of terms under it in the glossary:
+
+```python
+{
+    "Drilling": 452,
+    "Geology": 518,
+    ...
+}
+```
+
+Use `glossary.topics_list` if you only need a list of the topics in the glossary. `glossary.size` returns the total number of terms in the glossary.
 
-### Searching for a term
+If you need to refetch all topics, call `glossary.get_topics()`. Read the method's docstring for more info on its use.
 
-To begin, create a `Glossary` object and call the `search` method with the term you want to search for.
+### Get a topic match
+
+Do you have a topic in mind and are not sure if it is in the glossary? Use the `get_topic_match` method to get a topic match. It returns a single topic that best matches the input topic.
 
 ```python
-from slb_glossary import Glossary
+topic = glossary.get_topic_match("drill")
+print(topic)
+
+# Output: Drilling
+```
+
+### Search for a term
+
+Use the `search` method to search for a term in the glossary:
 
-glossary = Glossary()
+```python
 results = glossary.search("porosity")
-``` -->
+```
+
+This returns a list of [`SearchResult`](#search-results)s for "porosity". You can also pass some optional arguments to the `search` method:
+    - `under_topic`: Streamline search to a specific topic
+    - `start_letter`: Limit the search to terms starting with the given letter(s)
+    - `max_results`: Limit the number of results returned.
+
+### Search for a term under a specific topic
+
+```python
+results = glossary.get_terms_on(topic="Well workover")
+```
+
+The `get_terms_on` method returns a list of `SearchResult`s for all terms under the specified topic.
+The difference between `search` and `get_terms_on` is that `search` searches the entire glossary while `get_terms_on` searches only under the specified topic. Hence, `search` results can contain terms from different topics.
+
+The topic passed need not be an exact match to what is in the glossary. The glossary will choose the closest match to the provided topic that is available in the glossary.
+
+### Search results
+
+Search results are returned as `SearchResult` objects. Each `SearchResult` object has the following attributes:
+    - `term`: The term being searched for
+    - `definition`: The definition of the term
+    - `grammatical_label`: The grammatical label of the term. Basically the part of speech of the term
+    - `topic`: The topic under which the term is found
+    - `url`: The URL to the term in the glossary
+
+To get the search results as a dictionary, use the `asdict` method.
+
+```python
+results = glossary.search("oblique fault")
+for result in results:
+    print(result.asdict())
+```
+
+You could also convert search results to tuples using the `astuple` method.
+
+```python
+results = glossary.search("oblique fault")
+for result in results:
+    print(result.astuple())
+```
+
+### Other methods
+
+Some other methods available in the `Glossary` class are:
+    - `get_search_url`: Returns the correct glossary url for the given parameters.
+    - `get_terms_urls`: Returns the URLs of all terms gotten using the given parameters.
+    - `get_results_from_url`: Extracts search results from a given URL. Returns a list of `SearchResult`s.
+
+### Save/export search results to a file
+
+A convenient way to save search results to a file is to use the `saver` attribute of the glossary object.
+
+```python
+results = glossary.search("gas lift")
+glossary.saver.save(results, "./gas_lift.txt")
+```
+
+The `save` method takes a list of `SearchResult`s and the filename or file path to save the results to. The file save format is determined by the file extension. By default, the supported file formats are 'xlsx', 'txt', 'csv' and 'json'.
+You can also check `glossary.saver.supported_file_types`.
+
+### Customizing how results are saved
+
+By default, the `Glossary` class uses a `Saver` class to save search results. This base `Saver` class only supports a few file formats, which should be sufficient. However, if you need to save in an unsupported format, you can subclass the `Saver` class thus:
+
+```python
+from typing import List
+import slb_glossary as slb
+from slb_glossary.glossary import SearchResult
+
+class FooSaver(slb.Saver):
+    @staticmethod
+    def save_as_xyz(results: List[SearchResult], filename: str):
+        # Validate filename or path
+        # Your implementation goes here
+        ...
+```
+
+Read the docstrings of the `Saver` class to get a good grasp of how to do this. Also, you may read the `slb_glossary.saver` module to get an idea of how you would implement your custom save method.
+
+There are two ways you can use your custom saver class.
+
+1. Create a `Glossary` subclass:
+
+```python
+import slb_glossary as slb
+
+class FooGlossary(slb.Glossary):
+    saver_class = FooSaver
+    ...
+
+glossary = FooGlossary(...)
+glossary.saver.save(...)
+```
+
+2. Instantiate a saver directly:
+
+```python
+saver = FooSaver()
+saver.save(...)
+```
 
 ## Contributing
 
 Contributions are welcome. Please fork the repository and submit a pull request.
+
+## Credits
+
+This project was inspired by the 2023/24/25 PetroBowl Team of the Federal University of Petroleum Resources, Effurun, Delta State, Nigeria. It aided the team's preparation for the PetroQuiz and PetroBowl competitions organized by the Society of Petroleum Engineers (SPE).
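
Review note on the README's "Customizing how results are saved" section: the README says the save format is determined by the file extension and shows a custom `save_as_xyz` method. The sketch below illustrates one way such extension-based dispatch could work. It is an assumption for illustration only: `MiniSaver` and `XyzSaver` are hypothetical stand-ins, and the package's real `Saver` may be implemented differently.

```python
import os
from typing import List, Sequence


class MiniSaver:
    """Hypothetical stand-in for a saver; not the package's Saver class."""

    @staticmethod
    def save_as_txt(rows: Sequence[tuple], filename: str) -> None:
        # Write one row per line, values separated by " | "
        with open(filename, "w", encoding="utf-8") as file:
            for row in rows:
                file.write(" | ".join(str(value) for value in row) + "\n")

    @property
    def supported_file_types(self) -> List[str]:
        # Every method named save_as_<ext> contributes one supported extension
        prefix = "save_as_"
        return sorted(
            name[len(prefix):]
            for name in dir(type(self))
            if name.startswith(prefix)
        )

    def save(self, rows: Sequence[tuple], filename: str) -> None:
        # Pick the handler from the file extension, e.g. "out.txt" -> save_as_txt
        ext = os.path.splitext(filename)[1].lstrip(".").lower()
        handler = getattr(self, f"save_as_{ext}", None)
        if handler is None:
            raise ValueError(f"Unsupported file type: {ext!r}")
        handler(rows, filename)


class XyzSaver(MiniSaver):
    @staticmethod
    def save_as_xyz(rows: Sequence[tuple], filename: str) -> None:
        # Made-up "xyz" format: tab-separated values, one row per line
        with open(filename, "w", encoding="utf-8") as file:
            for row in rows:
                file.write("\t".join(str(value) for value in row) + "\n")
```

With this design, adding a format only requires adding a `save_as_<ext>` method in a subclass; the dispatch in `save` stays untouched.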
diff --git a/dist/slb_glossary-0.0.1b0-py3-none-any.whl b/dist/slb_glossary-0.0.1b0-py3-none-any.whl
diff --git a/dist/slb_glossary-0.0.1b0.tar.gz b/dist/slb_glossary-0.0.1b0.tar.gz
diff --git a/slb_glossary/__init__.py b/slb_glossary/__init__.py
@@ -2,10 +2,10 @@
 Search the Schlumberger Oilfield Glossary programmatically using Selenium.
 https://glossary.slb.com/
 
-This package is meant for educational/instructional use only and may not be 
+This package is meant for research/instructional use only and may not be 
 suitable for production.
 
-#### Internet Connection Required!!!
+#### Stable Internet Connection Required!!!
 
 @Author: Daniel T. Afolayan (ti-oluwa)
 """

diff --git a/slb_glossary/glossary.py b/slb_glossary/glossary.py
@@ -66,8 +66,10 @@ class SearchResult:
     """Basically the part of speech of the term"""
     topic: Optional[str]
     """The topic the term is related to or the topic under which the definition was found"""
+    url: Optional[str]
+    """The url of the page containing the definition of the term in the glossary"""
 
-    def astuple(self) -> Tuple[str, Optional[str], Optional[str]]:
+    def astuple(self) -> Tuple[str, Optional[str], Optional[str], Optional[str]]:
         """Return the search result as a tuple"""
         return astuple(self)
 
@@ -330,11 +332,11 @@ def size(self) -> int:
 
 
     def _get_element_by_css_selector(self, 
-            css_selector: str,
-            *,  
-            root: Optional[Union[WebDriver, WebElement]] = None, 
-            max_retry: int = 3
-        ) -> WebElement | None:
+        css_selector: str,
+        *,  
+        root: Optional[Union[WebDriver, WebElement]] = None, 
+        max_retry: int = 3
+    ) -> WebElement | None:
         """
         Get the first element with the given css selector
 
@@ -348,19 +350,22 @@ def _get_element_by_css_selector(self,
         while tries < max_retry:
             try:
                 return root.find_element(by=By.CSS_SELECTOR, value=css_selector)
-            except (StaleElementReferenceException, NoSuchElementException) as exc:
+            except (
+                StaleElementReferenceException, 
+                NoSuchElementException
+            ) as exc:
                 time.sleep(1)
                 tries += 1
                 if tries == max_retry:
                     raise exc
 
 
     def _get_elements_by_css_selector(self, 
-            css_selector: str,
-            *,  
-            root: Optional[Union[WebDriver, WebElement]] = None, 
-            max_retry: int = 3
-        ) -> List[WebElement] | None:
+        css_selector: str,
+        *,  
+        root: Optional[Union[WebDriver, WebElement]] = None, 
+        max_retry: int = 3
+    ) -> List[WebElement] | None:
         """
         Get all the elements with the given css selector
 
@@ -374,7 +379,10 @@ def _get_elements_by_css_selector(self,
         while tries < max_retry:
             try:
                 return root.find_elements(by=By.CSS_SELECTOR, value=css_selector)
-            except (StaleElementReferenceException, NoSuchElementException) as exc:
+            except (
+                StaleElementReferenceException, 
+                NoSuchElementException
+            ) as exc:
                 time.sleep(1)
                 tries += 1
                 if tries == max_retry:
@@ -677,16 +685,13 @@ def get_results_from_url(self, url: str, *, under_topic: Optional[str] = None) -
             grammatical_label = _full_grammatical_label(grammatical_label_abbreviation)
 
             if under_topic and under_topic.lower() in term_definition_sub.lower():
-                result = SearchResult(term_name, term_definition, grammatical_label, under_topic)
+                result = SearchResult(term_name, term_definition, grammatical_label, under_topic, url)
                 results.append(result)
                 return results
             else:
                 topic = term_definition_sub.split('.')[-1].strip().removesuffix(']').removeprefix('[')
-                result = SearchResult(term_name, term_definition, grammatical_label, topic)
+                result = SearchResult(term_name, term_definition, grammatical_label, topic, url)
                 results.append(result)
-
-        if results == []:
-            results.append(SearchResult(term_name, None, None, None))
         return results
 
 
@@ -727,7 +732,7 @@ def search(
         Search the glossary for terms matching the given query and return their definitions
 
         :param query: The search query
-        :param under_topic: What topics should the definitions extracted be related to.
+        :param under_topic: The topic the extracted definitions should be related to. Restricts the search to this topic.
         
         NOTE: It is advisable to use a topic that is available on the glossary website.
         If topic is not available it uses the nearest match for topics available on the slb glossary website. If no match is found,

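
Review note on the retry loops in `_get_element_by_css_selector` and `_get_elements_by_css_selector`: both hunks repeat the same try, sleep, count, re-raise pattern. A minimal sketch of how that pattern could be factored into one helper is shown below. `call_with_retry` is a hypothetical name, not part of the package, and unlike the loops in the diff it does not sleep before the final re-raise.

```python
import time
from typing import Callable, Tuple, Type, TypeVar

T = TypeVar("T")


def call_with_retry(
    func: Callable[[], T],
    *,
    exceptions: Tuple[Type[BaseException], ...],
    max_retry: int = 3,
    delay: float = 1.0,
) -> T:
    """Call `func` until it succeeds or `max_retry` attempts have failed."""
    if max_retry < 1:
        raise ValueError("max_retry must be at least 1")
    for attempt in range(1, max_retry + 1):
        try:
            return func()
        except exceptions:
            # Out of attempts: propagate the last exception unchanged
            if attempt == max_retry:
                raise
            time.sleep(delay)
    raise AssertionError("unreachable")
```

In the Selenium case, the call might look like `call_with_retry(lambda: root.find_element(by=By.CSS_SELECTOR, value=css_selector), exceptions=(StaleElementReferenceException, NoSuchElementException))`, using the same names that appear in the diff.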
diff --git a/slb_glossary/saver.py b/slb_glossary/saver.py
@@ -74,7 +74,7 @@ def save_as_xlsx(results: List[SearchResult], filename: str) -> None:
             import openpyxl
         except ImportError:
             raise ImportError(
-                '"openpyxl" is required to save to xlsx files. Run `pip install openpyxl` in yut terminal to install it'
+                '"openpyxl" is required to save to xlsx files. Run `pip install openpyxl` in your terminal to install it'
             )
 
         name, ext = os.path.splitext(filename)
@@ -84,7 +84,7 @@ def save_as_xlsx(results: List[SearchResult], filename: str) -> None:
         wb = openpyxl.Workbook()
         ws = wb.active
         ws.title = name.title()
-        ws.append(('Term', 'Definition', 'Grammatical Label', 'Topic')) # Add a header row
+        ws.append(('Term', 'Definition', 'Grammatical Label', 'Topic', "URL")) # Add a header row
         for result in results:
             ws.append(result.astuple())
 
@@ -109,7 +109,7 @@ def save_as_csv(results: List[SearchResult], filename: str) -> None:
             writer = csv.writer(file, delimiter=',', quotechar='"', quoting=csv.QUOTE_MINIMAL)
             writer.writerow((name.title(),)) # Add a title row
             file.write('\n')
-            writer.writerow(('Term', 'Definition', 'Grammatical Label', 'Topic')) # Add a header row
+            writer.writerow(('Term', 'Definition', 'Grammatical Label', 'Topic', "URL")) # Add a header row
             file.write('\n')
             for result in results:
                 writer.writerow(result.astuple())
@@ -154,7 +154,7 @@ def save_as_txt(results: List[SearchResult], filename: str) -> None:
             for i, result in enumerate(results, start=1):
                 file.write(
                     f"({i}). {result.term} ({result.topic or ''}) - {result.grammatical_label}:\n"
-                    f"{result.definition or ""}\r\n"
+                    f"{result.definition or ''}.\nReference: {result.url}\r\n"
                 )
         return None
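
Review note on the new `URL` column: the header rows and `astuple()` must stay in the same five-field order for the exported columns to line up. The sketch below checks that idea in isolation using Python's `csv` module, which expects a one-character `delimiter`. `Result` is a stand-in dataclass that mirrors the fields shown in the `SearchResult` hunk, `results_to_csv` is a hypothetical helper, and the URL in the usage line is made up.

```python
import csv
import io
from dataclasses import astuple, dataclass
from typing import Iterable, Optional


@dataclass
class Result:
    """Stand-in mirroring the SearchResult fields shown in the diff."""
    term: str
    definition: Optional[str]
    grammatical_label: Optional[str]
    topic: Optional[str]
    url: Optional[str]


HEADER = ("Term", "Definition", "Grammatical Label", "Topic", "URL")


def results_to_csv(results: Iterable[Result]) -> str:
    """Serialize results to CSV text with a header row."""
    buffer = io.StringIO(newline="")
    # csv.writer requires a single-character delimiter
    writer = csv.writer(
        buffer, delimiter=",", quotechar='"', quoting=csv.QUOTE_MINIMAL
    )
    writer.writerow(HEADER)
    for result in results:
        # astuple() yields the fields in declaration order, matching HEADER
        writer.writerow(astuple(result))
    return buffer.getvalue()
```

For example, `results_to_csv([Result("porosity", "Pore volume, as a fraction", "noun", "Geology", "https://example.invalid/porosity")])` produces a header line followed by one row, with the definition quoted because it contains a comma.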