_Author: [Benjamin Brünau](mailto:[email protected])_

## TL;DR

Elasticsearch, a distributed search and analytics engine, is a powerful tool for full-text search and data analysis.
Built on Apache Lucene and written in Java, it has gained popularity for its flexibility, scalability, and ease of use.
This article provides a broad overview of the components and background of Elasticsearch as well as a more in-depth look at the techniques it employs for efficient text searching.

## Introduction

Elasticsearch, first introduced in 2010 by Shay Banon, whose work led to the founding of the company Elastic, is a distributed search and analytics engine
designed for handling large amounts of unstructured data. Primarily used for full-text search, it employs a combination of indexing and searching to deliver relevant results efficiently.
Elasticsearch is often used to add sophisticated full-text search functionality on top of other database systems that do not provide it themselves (the data is then usually synchronized between both systems).

## What is Elasticsearch?

### Overview

### Elasticsearch Components: The ELK Stack

Elasticsearch itself is responsible for the search and analytics part, Logstash for data processing and ingestion, and Kibana for visualizing
and managing stored data. Other services like Beats are often integrated for various functionalities, e.g. collection of data.
!!! example "Elastic Stack" | ||
|
||
Logstash is usually used to ingest data from various sources into Elasticsearch (optionally parsing it beforehand). | ||
Beats are Agents, attached to for example Applications to collect logs or other metrics. Kibana utilizes the powerful search engine of Elasticsearch to then visualize the data. | ||
![ELK Stack - Data Flow](./assets/elasticsearch-elk-stack-data-flow.png) | ||
|

In 2021, Elastic moved Elasticsearch and Kibana away from the Apache 2.0 license to a dual license consisting of the Elastic License and
of the Server-Side Public License. This shift was driven by Elastic's dissatisfaction with Amazon offering Elasticsearch
as a service. In response, an open-source fork named OpenSearch emerged, supported by AWS, RedHat, SAP, and others.
!!! info "[Licensing Situation now](https://www.elastic.co/de/pricing/faq/licensing)" | ||
|
||
While no longer being open-source, Elasticsearch is still "source-available". Elasticsearch can still be used and modified at will. | ||
It is just not allowed to offer Elasticsearch as a Service (Software as a Service - SaaS) to potential customers, like Amazon did in the past on AWS. | ||
|
||
While no longer being open-source, Elasticsearch is still "source-available". Elasticsearch can still be used and modified at will. | ||
It is just not allowed to offer Elasticsearch as a Service (Software as a Service - SaaS) to potential customers, like Amazon did in the past on AWS. | ||
|
||

## Interlude: Full Text Search

With Full-Text Search, the whole content of something is searched (e.g. a whole book) and not (only) its metadata (author, title, abstract).
It is therefore all about searching unstructured data, for example tweets.
When searching for a specific query inside a document or a small set of documents, the whole content can be scanned. This is what usually happens when you press _Ctrl + F_ in the browser or editor of your choice,
but it is also how CLI tools like `grep` work on Unix systems.

Once the number of documents becomes larger, it gets increasingly inefficient to scan all the documents and their content.
The amount of effort, and therefore time, needed to answer a query is no longer sustainable.
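
A minimal sketch of such a linear scan (plain Python, toy data) makes the problem visible: every query has to touch the entire content of every document, so query time grows with the total amount of text rather than with the size of the result.

```python
# Naive full-text search: every query scans the whole corpus.
documents = {
    1: "Elasticsearch is built on Apache Lucene",
    2: "Lucene is a search library written in Java",
    3: "Full-text search is about unstructured data",
}

def scan_search(docs: dict[int, str], term: str) -> list[int]:
    # O(total corpus size) for every single query
    return [doc_id for doc_id, text in docs.items() if term.lower() in text.lower()]

print(scan_search(documents, "lucene"))  # -> [1, 2]
```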

When ingesting documents, Elasticsearch also builds and updates an index, in this case an _Inverted Index_.

To make it clearer why an inverted index is used and why it is so efficient for full-text search, I will explain the difference between a _Forward Index_ and an _Inverted Index_.

### Different Index Types & Elasticsearch's Inverted Index

A _Forward Index_ stores, for each document, the keywords it contains, mapping the ID of that document to its keywords.
Querying the index means that the entry of every document has to be searched for the query term.
An example of such an index is the table of contents of a book: it lets you jump to the right chapter, but you still have to search the whole chapter
for the term you are looking for.

An _Inverted Index_, on the other hand, maps each keyword onto the document IDs that contain it, so only the "keys" of the index need to be searched.
An example is the index at the end of a book, which lists all the pages on which a keyword appears.

Generally, a _Forward Index_ is fast to build but slow to search, while an _Inverted Index_ is slower when indexing documents but much faster to search.
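
The following sketch (plain Python, hypothetical data) contrasts the two lookup patterns: the forward index has to visit every document for each query, while the inverted index answers with a single dictionary lookup.

```python
from collections import defaultdict

documents = {
    1: "the quick brown fox",
    2: "the lazy brown dog",
}

# Forward index: document ID -> terms contained in that document
forward_index = {doc_id: set(text.split()) for doc_id, text in documents.items()}

# Inverted index: term -> IDs of the documents containing it
inverted_index = defaultdict(set)
for doc_id, text in documents.items():
    for term in text.split():
        inverted_index[term].add(doc_id)

# Searching the forward index visits every document ...
hits_forward = [doc_id for doc_id, terms in forward_index.items() if "brown" in terms]

# ... while the inverted index needs a single lookup.
hits_inverted = inverted_index["brown"]

print(hits_forward, hits_inverted)  # [1, 2] {1, 2}
```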

!!! example "_Forward Index_ and _Inverted Index_"

    ![Example: Forward Index and Inverted Index](./assets/elasticsearch-index-example.png)

The _Inverted Index_ utilized by Elasticsearch not only stores, for each unique keyword, the documents in which it appears, but also the positions inside those documents.
Before the index is built, an analysis process is run on the input data by an _analyzer_, so that searching the index yields accurate and flexible results instead of only exact matches.
Indexing is done continuously, making documents available for searching directly after ingestion.
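
Extending the earlier sketch, positions can be recorded alongside the document IDs (again a toy illustration, not Lucene's actual data structures):

```python
from collections import defaultdict

# term -> {document ID -> [positions of the term inside that document]}
positional_index = defaultdict(lambda: defaultdict(list))

for doc_id, text in {1: "the quick brown fox", 2: "the lazy brown dog"}.items():
    for position, term in enumerate(text.split()):
        positional_index[term][doc_id].append(position)

print(dict(positional_index["brown"]))  # -> {1: [2], 2: [2]}
```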
!!! info "Elasticsearch Field Types" | ||
|
||
All of the mentioned processes are only applied for indexing so called _full text_ fields of the saved JSON documents. | ||
|
@@ -107,20 +99,20 @@ Indexing is done continuously, making documents available for searching directly | |
|
||

### Text Analysis & Processing Techniques

To enhance full-text search, Elasticsearch employs [natural language processing techniques](/lectures/preprocessing/) during the analysis phase.
Tokenization breaks strings into words, and normalization ensures consistent representation, handling variations like capitalization and synonyms.
Elasticsearch provides a couple of different built-in [_analyzers_](https://www.elastic.co/guide/en/elasticsearch/reference/8.12/analysis-overview.html)
next to the commonly used _standard analyzer_, as well as the possibility to create your own _custom analyzer_.

Text analysis in Elasticsearch usually involves two steps:

1. **Tokenization**: splitting up text into tokens and indexing each word
2. **Normalization**: capitalization variants, synonyms and word stems are indexed as a single term

Tokenization enables the terms in a query string to be looked up individually, but it does not cover similar tokens (e.g. upper- and lowercase variants, word stems or synonyms), which makes a normalization step necessary.
To make a query match the analyzed and indexed keywords, the same analysis steps are applied to the query string, as in the sketch below.
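
As a rough illustration (a hand-rolled pipeline, not Elasticsearch's actual analyzer implementation), tokenization and normalization might look like this, applied identically at index time and at query time:

```python
# Toy analyzer: tokenize, lowercase, map synonyms, strip a plural "s" as a crude stemmer.
SYNONYMS = {"quick": "fast"}  # hypothetical synonym mapping

def analyze(text: str) -> list[str]:
    tokens = text.split()                                # 1. tokenization
    tokens = [t.lower().strip(".,!?") for t in tokens]   # 2. normalization: case, punctuation
    tokens = [SYNONYMS.get(t, t) for t in tokens]        #    synonyms
    tokens = [t[:-1] if t.endswith("s") and len(t) > 3 else t for t in tokens]  # crude stemming
    return tokens

# Document and query go through the same pipeline, so "Dogs" matches "dog".
print(analyze("The Quick Dogs jumped."))  # -> ['the', 'fast', 'dog', 'jumped']
print(analyze("dogs"))                    # -> ['dog']
```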

While this makes it possible to fetch accurate results that match a search term, the result set could sometimes be hundreds of documents. It would be cumbersome to search these results for
the most relevant documents ourselves.
Elasticsearch applies similarity scoring on search results to solve this problem.

It still has a couple of shortcomings; for example, the length of a document is not taken into account.
Elasticsearch therefore utilizes the **BM25** algorithm, which is based on **TF-IDF**. While the **IDF** part of the **BM25** algorithm is similar (rare words lead to a higher score), it also
addresses the length of a document: the score is lower for longer documents (based on the amount of words that do not match the query).
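
A simplified sketch of the BM25 score for a single query term (using k1 = 1.2 and b = 0.75, which are also Elasticsearch's default parameters, and a plain IDF variant rather than Lucene's exact formula):

```python
import math

def bm25_term_score(tf: float, doc_len: int, avg_doc_len: float,
                    n_docs: int, docs_with_term: int,
                    k1: float = 1.2, b: float = 0.75) -> float:
    # IDF: the rarer the term across the corpus, the higher the score
    idf = math.log(1 + (n_docs - docs_with_term + 0.5) / (docs_with_term + 0.5))
    # TF saturation plus document-length normalization:
    # documents longer than average are penalized via doc_len / avg_doc_len
    tf_component = (tf * (k1 + 1)) / (tf + k1 * (1 - b + b * doc_len / avg_doc_len))
    return idf * tf_component

# Same term frequency, but the shorter document scores higher:
print(bm25_term_score(tf=2, doc_len=50,  avg_doc_len=100, n_docs=1000, docs_with_term=10))
print(bm25_term_score(tf=2, doc_len=300, avg_doc_len=100, n_docs=1000, docs_with_term=10))
```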

### Scalability and Distribution

Elasticsearch's popularity stems from its scalability and distribution capabilities. Running on clusters, it automatically distributes data to nodes,
utilizing shards (each node gets a part of the inverted index, a shard) to enable parallel processing of search queries. This makes it well-suited for handling large datasets efficiently.

![Elasticsearch as a distributed system](./assets/elasticsearch-distributed-system.png)
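
The number of primary shards is fixed per index at creation time. As a sketch (index name and local endpoint are hypothetical; the settings keys are Elasticsearch's standard ones), an index with three primary shards, each replicated once, could be created like this:

```python
import requests

# Create a hypothetical index with 3 primary shards and 1 replica per shard,
# assuming an Elasticsearch node listening on localhost:9200.
resp = requests.put(
    "http://localhost:9200/my-index",
    json={"settings": {"number_of_shards": 3, "number_of_replicas": 1}},
)
print(resp.json())
```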

### Advanced Features and Use Cases - Vector Embeddings & Semantic Search

Elasticsearch can also store documents as vector embeddings.
This is mostly used for k-nearest neighbor search, which returns the _k_ nearest documents to a query vector.
The embeddings can be generated before ingesting data into Elasticsearch, or their generation can be delegated to an NLP model inside of Elasticsearch, which has to be added by the user
beforehand.

Elasticsearch also offers its own built-in, domain-free **ELSER** model (Elastic Learned Sparse Encoder), a paid service that does not need to be trained on a customer's data beforehand.

The storage of data as vector representations in Elasticsearch enables advanced searches, making it suitable for applications like recommendation engines and multimedia content searches.
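
As a hedged sketch of what such a query can look like (index name, field name and query vector are hypothetical; the top-level `knn` search option is available in Elasticsearch 8.x), a k-nearest-neighbor search could be issued like this:

```python
import requests

# Find the 5 documents whose "embedding" field is closest to the query vector.
# In practice the query vector comes from the same embedding model used at ingest time.
query = {
    "knn": {
        "field": "embedding",
        "query_vector": [0.12, -0.45, 0.33],
        "k": 5,
        "num_candidates": 50,
    }
}
resp = requests.post("http://localhost:9200/my-index/_search", json=query)
print(resp.json())
```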

## References

- [BM25 Algorithm 1](https://www.elastic.co/de/blog/practical-bm25-part-1-how-shards-affect-relevance-scoring-in-elasticsearch)
- [BM25 Algorithm 2](https://www.elastic.co/de/blog/practical-bm25-part-2-the-bm25-algorithm-and-its-variables)
- [OpenSearch Project](https://opensearch.org/)
- [Apache Lucene](https://lucene.apache.org/)

# Gensim

_Author: [Fabian Renz](mailto:[email protected])_

## TL;DR

# Hugging Face

_Author: [Luis Nothvogel](mailto:[email protected])_

## TL;DR

Hugging Face has emerged as a pivotal player in the AI and machine learning arena, specializing in natural language processing (NLP). This article delves into its core offerings, including model hosting, Spaces, datasets, pricing, and the Transformers API. Hugging Face is not only a repository for cutting-edge models but also a platform for collaboration and innovation in AI.

## Model Hosting on Hugging Face

Hugging Face has made a name for itself in model hosting. It offers a vast repository of pre-trained models. For example, a GPT-2 text-generation pipeline can be used like this:

```python
from transformers import pipeline, set_seed

# Example of using a pre-trained model
generator = pipeline('text-generation', model='gpt2')
set_seed(42)
generated_texts = generator("The student worked on", max_length=30, num_return_sequences=2)
print(generated_texts)
```

This outputs the following:

```python
[{'generated_text': 'The student worked on his paper, which you can read about here. You can get an ebook with that part, or an audiobook with some of'}, {'generated_text': 'The student worked on this particular task by making the same basic task in his head again and again, without the help of some external helper, even when'}]
```

Spaces are an innovative feature of Hugging Face, offering a collaborative environment for building and sharing machine learning demo apps.

The Hugging Face ecosystem includes a wide range of datasets, catering to different NLP tasks. The Datasets library simplifies the process of loading and processing data, ensuring efficiency and consistency in model training. According to Hugging Face, they host over 75k datasets.

[Wikipedia dataset](https://huggingface.co/datasets/wikimedia/wikipedia)

```python
from datasets import load_dataset

# Example of loading a dataset
ds = load_dataset("wikimedia/wikipedia", "20231101.en")
```

## Transformers API: Transform Text Effortlessly

The Transformers API is a testament to Hugging Face's innovation. This API simplifies the process of text transformation, making it accessible even to those with limited programming skills. It supports a variety of NLP tasks and can be integrated into various applications.

For example, a new byte-pair-encoding tokenizer can be created with the `tokenizers` library:

```python
from tokenizers import Tokenizer
from tokenizers.models import BPE

# A BPE tokenizer with an explicit unknown-token symbol
tokenizer = Tokenizer(BPE(unk_token="[UNK]"))
```

Hugging Face Inference plays a crucial role in turning trained language models into productive applications. The platform provides an intuitive and powerful infrastructure for inferencing models, which means that developers can easily access pre-trained models to generate real-time predictions for a wide range of NLP tasks. Thanks to its efficient implementation and support for hardware acceleration technologies, Hugging Face Inference enables the seamless integration of language models into applications ranging from chatbots to machine translation and sentiment analysis.

The Inference API URL is always defined like this:

```python
ENDPOINT = "https://api-inference.huggingface.co/models/<MODEL_ID>"
```

Example in Python with gpt2:

```python
import requests

API_URL = "https://api-inference.huggingface.co/models/gpt2"
# A minimal request sketch; assumes a valid access token in place of <API_TOKEN>
headers = {"Authorization": "Bearer <API_TOKEN>"}

response = requests.post(API_URL, headers=headers, json={"inputs": "The student worked on"})
print(response.json())
```