Run pre-commit hooks on all files
pkeilbach authored Feb 26, 2024
1 parent 7b975fd commit 1cd8c07
Showing 5 changed files with 65 additions and 69 deletions.
39 changes: 14 additions & 25 deletions docs/presentations/articles/elasticsearch.md
_Author: [Benjamin Brünau](mailto:[email protected])_

## TL;DR

Elasticsearch, a distributed search and analytics engine, is a powerful tool for full-text search and data analysis.
Built on Apache Lucene and written in Java, it has gained popularity for its flexibility, scalability, and ease of use.
This article provides both a broad overview of the components and background of Elasticsearch and a more in-depth view of the techniques employed for efficient text searching.


## Introduction

Elasticsearch, first introduced in 2010 by Shay Banon (whose work led to the company Elastic), is a distributed search and analytics engine
designed for handling large amounts of unstructured data. Primarily used for full-text search, it employs a combination of indexing and searching to deliver relevant results efficiently.
Elasticsearch is often used to add sophisticated full-text search on top of other database systems that lack it (the data is then usually synchronized between the two systems).

## What is Elasticsearch?

### Overview

### Elasticsearch Components: The ELK Stack
The ELK Stack consists of Elasticsearch, Logstash, and Kibana: Elasticsearch is
responsible for the search and analytics part, Logstash for data processing and ingestion, and Kibana for visualizing
and managing stored data. Other services like Beats are often integrated for various functionalities, e.g. the collection of data.

!!! example "Elastic Stack"

    Logstash is usually used to ingest data from various sources into Elasticsearch (optionally parsing it beforehand).
    Beats are agents attached to, for example, applications to collect logs or other metrics. Kibana utilizes the powerful search engine of Elasticsearch to visualize the data.

    ![ELK Stack - Data Flow](./assets/elasticsearch-elk-stack-data-flow.png)

In 2021, Elastic changed the license of Elasticsearch from Apache 2.0 to a dual license that includes terms
of the Server-Side Public License. This shift was driven by Elastic's dissatisfaction with Amazon offering Elasticsearch
as a service. In response, an open-source fork named OpenSearch emerged, supported by AWS, RedHat, SAP, and others.

!!! info "[Licensing Situation now](https://www.elastic.co/de/pricing/faq/licensing)"

    While no longer open-source, Elasticsearch is still "source-available": it can still be used and modified at will.
    It is just not permitted to offer Elasticsearch as a service (Software as a Service, SaaS) to customers, as Amazon did in the past on AWS.

## Interlude: Full Text Search

With full-text search, the whole content of something is searched (e.g. a whole book) and not (only) its metadata (author, title, abstract).
It is therefore all about searching unstructured data, for example tweets.
When searching for a specific query inside a document or a small set of documents, the whole content can be scanned. This is usually done when using _Ctrl + F_ in the browser or editor of your choice,
but also by CLI tools like `grep` on Unix systems.


As the number of documents grows, it becomes increasingly inefficient to scan all of them and their content:
the amount of effort, and therefore time, needed to answer a query is no longer sustainable.

When ingesting documents, Elasticsearch also builds and updates an index, in this case an _Inverted Index_.

To make it clearer why an inverted index is used and why it is so efficient for full-text search, I will explain the difference between a _Forward Index_ and an _Inverted Index_.

### Different Index Types & Elasticsearch's Inverted Index

A _Forward Index_ stores for each document the keywords it contains, mapping the ID of that document to the keywords.
Querying the index means that the entry for each document has to be searched for the query's search term.
An example of such an index is the table of contents of a book: it lets you jump to a chapter through its entry in the list, but you would still need to search the whole chapter
for the term you are looking for.

An _Inverted Index_, on the other hand, maps each keyword onto the IDs of the documents that contain it, so only the "keys" of the index need to be searched.
An example is the index at the end of a book, which lists all the pages where a keyword appears.

Generally, a _Forward Index_ is fast to build but slow to search, while an _Inverted Index_ is slower when indexing documents but much faster when searching them.


!!! example "_Forward Index_ and _Inverted Index_"

    ![Example: Forward Index and Inverted Index](./assets/elasticsearch-index-example.png)
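
A minimal sketch of both index types in Python (with deliberately naive whitespace tokenization; Elasticsearch's analysis chain, described below, is far more sophisticated):

```python
from collections import defaultdict

documents = {
    1: "the quick brown fox",
    2: "the lazy dog",
    3: "quick foxes leap",
}

# Forward index: document ID -> the keywords it contains.
forward_index = {doc_id: set(text.split()) for doc_id, text in documents.items()}

# Inverted index: keyword -> IDs of the documents containing it.
inverted_index = defaultdict(set)
for doc_id, text in documents.items():
    for token in text.split():
        inverted_index[token].add(doc_id)

# Searching the forward index means scanning every document's entry ...
print([d for d, words in forward_index.items() if "quick" in words])  # [1, 3]
# ... while the inverted index answers with a single key lookup.
print(sorted(inverted_index["quick"]))  # [1, 3]
```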


The _Inverted Index_ utilized by Elasticsearch not only stores, for each unique keyword, the documents in which it appears, but also the positions inside those documents.
Before building the index, an analysis process is run on the input data by an _analyzer_, so that searching the index yields accurate and flexible results rather than only exact matches.
Indexing is done continuously, making documents available for searching directly after ingestion.


!!! info "Elasticsearch Field Types"

    All of the mentioned processes are only applied when indexing so-called _full text_ fields of the stored JSON documents.

### Text Analysis & Processing Techniques

To enhance full-text search, Elasticsearch employs [natural language processing techniques](/lectures/preprocessing/) during the analysis phase.
Tokenization breaks strings into words, and normalization ensures consistent representation, handling variations like capitalization and synonyms.
Elasticsearch provides a couple of different built-in [_analyzers_](https://www.elastic.co/guide/en/elasticsearch/reference/8.12/analysis-overview.html)
next to the commonly used _standard analyzer_, as well as the possibility to create your own _custom analyzer_.

Text analysis in Elasticsearch usually involves two steps:

1. **Tokenization**: splitting up text into tokens and indexing each word
2. **Normalization**: capitalization, synonyms, and word stems are indexed as a single term

Tokenization enables the terms in a query string to be looked up individually, but not similar tokens (e.g. upper- and lowercase forms, word stems, or synonyms), which makes a normalization step necessary.
To make a query match the analyzed and indexed keywords, the same analysis steps are applied to the query string.
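
How an analyzer transforms a string can be inspected through Elasticsearch's `_analyze` endpoint; a minimal sketch, assuming a local instance on the default port 9200:

```python
import requests

# Ask the standard analyzer how it would tokenize and normalize a string.
response = requests.post(
    "http://localhost:9200/_analyze",
    json={"analyzer": "standard", "text": "The QUICK brown Foxes!"},
)
print([token["token"] for token in response.json()["tokens"]])
# ['the', 'quick', 'brown', 'foxes'] (lowercased, punctuation stripped)
```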

While this makes it possible to fetch accurate results that match a search term, there could sometimes be hundreds of matching documents. It is cumbersome to search these results for
the most relevant documents ourselves.
Elasticsearch applies similarity scoring to search results to solve this problem.

**TF-IDF** still has a couple of shortcomings; for example, the length of a document is not taken into account.
Elasticsearch therefore utilizes the **BM25** algorithm, which is based on **TF-IDF**. While the **IDF** part of **BM25** is similar (rare words lead to a higher score), it also
addresses the length of a document: the score is lower for longer documents (based on the number of words that do not match the query).
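
To make the scoring intuition concrete, here is a toy sketch of the BM25 formula in Python (a simplified illustration; Elasticsearch's actual implementation lives in Lucene):

```python
import math

# Toy sketch of the BM25 scoring idea. k1 and b are the usual free
# parameters: k1 controls term-frequency saturation, b controls how
# strongly longer documents are penalized relative to the average length.
def bm25_score(query_terms, doc, corpus, k1=1.2, b=0.75):
    n_docs = len(corpus)
    avgdl = sum(len(d) for d in corpus) / n_docs
    score = 0.0
    for term in query_terms:
        df = sum(term in d for d in corpus)  # number of documents containing the term
        idf = math.log(1 + (n_docs - df + 0.5) / (df + 0.5))  # rare terms score higher
        tf = doc.count(term)  # term frequency in this document
        score += idf * (tf * (k1 + 1)) / (tf + k1 * (1 - b + b * len(doc) / avgdl))
    return score

corpus = [
    "the quick brown fox".split(),
    "the lazy dog sleeps".split(),
    "quick quick fox".split(),
]
print(bm25_score(["quick", "fox"], corpus[2], corpus))
```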


### Scalability and Distribution

Elasticsearch's popularity stems from its scalability and distribution capabilities. Running on clusters, it automatically distributes data to nodes,
utilizing shards (each node gets a part of the inverted index, a shard) to enable parallel processing of search queries. This makes it well-suited for handling large datasets efficiently.


![Elasticsearch as a distributed systems](./assets/elasticsearch-distributed-system.png)
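
As a small illustration, the number of shards (and replicas) can be set when an index is created; a sketch assuming a local cluster, with illustrative values:

```python
import requests

# Create an index whose inverted index is split into 3 shards,
# each replicated once across the cluster (values are illustrative).
requests.put(
    "http://localhost:9200/articles",
    json={"settings": {"number_of_shards": 3, "number_of_replicas": 1}},
)
```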

### Advanced Features and Use Cases - Vector Embeddings & Semantic Search

Elasticsearch can also store data as vector embeddings.
This is mostly used for k-nearest-neighbor search, which returns the _k_ nearest documents (i.e. the most similar vectors) for a given query vector.
The embeddings can be generated before ingesting data into Elasticsearch, or delegated to an NLP model inside Elasticsearch, which has to be added by the user beforehand.


Elasticsearch also offers its own built-in, domain-free **ELSER** model (Elastic Learned Sparse Encoder), a paid service that does not need to be trained on a customer's data beforehand.

The storage of data as vector representations in Elasticsearch enables advanced searches, making it suitable for applications like recommendation engines and multimedia content searches.
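
A sketch of such a k-nearest-neighbor query against the `_search` API (the index name `articles`, the `embedding` field, and the tiny 3-dimensional vector are illustrative assumptions):

```python
import requests

# kNN search against a hypothetical "articles" index whose mapping
# contains a dense_vector field called "embedding".
query = {
    "knn": {
        "field": "embedding",
        "query_vector": [0.12, -0.53, 0.77],  # embedding of the search text
        "k": 5,                               # number of nearest neighbors to return
        "num_candidates": 50,                 # candidates to consider per shard
    },
    "_source": ["title"],
}
response = requests.post("http://localhost:9200/articles/_search", json=query)
for hit in response.json()["hits"]["hits"]:
    print(hit["_score"], hit["_source"]["title"])
```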

## References

- [BM25 Algorithm 1](https://www.elastic.co/de/blog/practical-bm25-part-1-how-shards-affect-relevance-scoring-in-elasticsearch)
- [BM25 Algorithm 2](https://www.elastic.co/de/blog/practical-bm25-part-2-the-bm25-algorithm-and-its-variables)
- [OpenSearch Project](https://opensearch.org/)
- [Apache Lucene](https://lucene.apache.org/)
1 change: 0 additions & 1 deletion docs/presentations/articles/gensim.md
# Gensim

_Author: [Fabian Renz](mailto:[email protected])_

## TL;DR
23 changes: 13 additions & 10 deletions docs/presentations/articles/hugging_face.md
# Hugging Face

_Author: [Luis Nothvogel](mailto:[email protected])_

## TL;DR

Hugging Face has emerged as a pivotal player in the AI and machine learning arena, specializing in natural language processing (NLP). This article delves into its core offerings, including model hosting, spaces, datasets, pricing, and the Transformers API. Hugging Face is not only a repository for cutting-edge models but also a platform for collaboration and innovation in AI.

## Model Hosting on Hugging Face

Hugging Face has made a name for itself in model hosting. It offers a vast repository of pre-trained models covering a wide range of NLP tasks, which can be tried out directly with the `transformers` library:

```python
from transformers import pipeline, set_seed

# Example of using a pre-trained model
generator = pipeline('text-generation', model='gpt2')
set_seed(42)
generated_texts = generator("The student worked on", max_length=30, num_return_sequences=2)
print(generated_texts)
```

This outputs the following:

```python
[{'generated_text': 'The student worked on his paper, which you can read about here. You can get an ebook with that part, or an audiobook with some of'}, {'generated_text': 'The student worked on this particular task by making the same basic task in his head again and again, without the help of some external helper, even when'}]
```

Spaces are an innovative feature of Hugging Face, offering a collaborative environment for hosting and sharing machine learning demo applications.

The Hugging Face ecosystem includes a wide range of datasets, catering to different NLP tasks. The Datasets library simplifies the process of loading and processing data, ensuring efficiency and consistency in model training. According to Hugging Face, it hosts over 75k datasets.

[Wikipedia reference](https://huggingface.co/datasets/wikimedia/wikipedia)

```python
from datasets import load_dataset

# Example of loading a dataset
ds = load_dataset("wikimedia/wikipedia", "20231101.en")
```


## Transformers API: Transform Text Effortlessly

The Transformers API is a testament to Hugging Face's innovation. This API simplifies the process of text transformation, making it accessible even to those with limited programming skills. It supports a variety of NLP tasks and can be integrated into various applications.
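
As a small illustration, a task-specific pipeline can be created in one line (a sketch; the library picks a default model for the task if none is named):

```python
from transformers import pipeline

# Downloads a default sentiment-analysis model on first use.
classifier = pipeline("sentiment-analysis")
print(classifier("Hugging Face makes NLP remarkably accessible."))
# e.g. [{'label': 'POSITIVE', 'score': 0.999...}]
```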

## Hugging Face Inference


Hugging Face Inference plays a crucial role in turning trained language models into productive applications. The platform provides an intuitive and powerful infrastructure for inferencing models, which means that developers can easily access pre-trained models to generate real-time predictions for a wide range of NLP tasks. Thanks to its efficient implementation and support for hardware acceleration technologies, Hugging Face Inference enables the seamless integration of language models into applications ranging from chatbots to machine translation and sentiment analysis.

The Inference API URL is always defined like this:

```python
ENDPOINT = "https://api-inference.huggingface.co/models/<MODEL_ID>"
```

Example in Python with gpt2 (the request helper below is a minimal completion of the truncated snippet; the API token is a placeholder you must supply):

```python
import requests

API_URL = "https://api-inference.huggingface.co/models/gpt2"
headers = {"Authorization": "Bearer <API_TOKEN>"}  # your Hugging Face access token

def query(payload):
    response = requests.post(API_URL, headers=headers, json=payload)
    return response.json()

print(query({"inputs": "The student worked on"}))
```