Merge branch 'restructure-directories' of github.com:szarnyasg/duckdb-web into restructure-directories
szarnyasg committed Feb 27, 2025
2 parents b71bcf5 + 55ecd47 commit f7b1f35
Showing 44 changed files with 126 additions and 97 deletions.
1 change: 1 addition & 0 deletions _includes/list_of_community_extensions.md
Original file line number Diff line number Diff line change
Expand Up @@ -24,6 +24,7 @@
| [magic]({% link community_extensions/extensions/magic.md %}) | [<span class=github>GitHub</span>](https://github.com/carlopi/duckdb_magic) | libmagic/file utilities ported to DuckDB |
| [netquack]({% link community_extensions/extensions/netquack.md %}) | [<span class=github>GitHub</span>](https://github.com/hatamiarash7/duckdb-netquack) | DuckDB extension for parsing, extracting, and analyzing domains, URIs, and paths with ease. |
| [open_prompt]({% link community_extensions/extensions/open_prompt.md %}) | [<span class=github>GitHub</span>](https://github.com/quackscience/duckdb-extension-openprompt) | Interact with LLMs with a simple DuckDB Extension |
| [pcap_reader]({% link community_extensions/extensions/pcap_reader.md %}) | [<span class=github>GitHub</span>](https://github.com/quackscience/duckdb-extension-pcap) | Read PCAP files from DuckDB |
| [pivot_table]({% link community_extensions/extensions/pivot_table.md %}) | [<span class=github>GitHub</span>](https://github.com/Alex-Monahan/pivot_table) | Provides a spreadsheet-style pivot_table function |
| [prql]({% link community_extensions/extensions/prql.md %}) | [<span class=github>GitHub</span>](https://github.com/ywelsch/duckdb-prql) | Support for PRQL, the Pipelined Relational Query Language |
| [psql]({% link community_extensions/extensions/psql.md %}) | [<span class=github>GitHub</span>](https://github.com/ywelsch/duckdb-psql) | Support for PSQL, a piped SQL dialect for DuckDB |
Expand Down
17 changes: 13 additions & 4 deletions _posts/2021-01-25-full-text-search.md
Original file line number Diff line number Diff line change
Expand Up @@ -16,10 +16,9 @@ Alright, enough about the "why", let's get to the "how".

## Preparing the Data

The TREC 2004 Robust Retrieval Track has 250 "topics" (search queries) over TREC disks 4 and 5. The data consist of many text files stored in SGML format, along with a corresponding DTD (document type definition) file. This format is rarely used anymore, but it is similar to XML. We will use OpenSP's command line tool `osx` to convert it to XML. Because there are many files, I wrote a bash script:
The TREC 2004 Robust Retrieval Track has 250 "topics" (search queries) over TREC disks 4 and 5. The data consist of many text files stored in SGML format, along with a corresponding DTD (document type definition) file. This format is rarely used anymore, but it is similar to XML. We will use OpenSP's command line tool `osx` to convert it to XML. Because there are many files, I wrote a Bash script:

```text
#!/bin/bash
```bash
mkdir -p latimes/xml
for i in $(seq -w 1 9); do
cat dtds/la.dtd latimes-$i | osx > latimes/xml/latimes-$i.xml
Expand Down Expand Up @@ -66,6 +65,7 @@ documents_df = pd.DataFrame([x for sublist in list_of_dict_lists for x in sublis
```

Now that we have a dataframe, we can register it in DuckDB.

```python
# create database connection and register the dataframe
con = duckdb.connect(database='db/trec04_05.db', read_only=False)
Expand All @@ -75,11 +75,17 @@ con.register('documents_df', documents_df)
con.execute("CREATE TABLE documents AS (SELECT * FROM documents_df)")
con.close()
```

This is the end of my preparation script, so I closed the database connection.

## Building the Search Engine

We can now build the inverted index and the retrieval model using a `PRAGMA` statement. The extension is [documented here]({% link docs/stable/extensions/full_text_search.md %}). We create an index table on table `documents` or `main.documents` that we created with our script. The column that identifies our documents is called `docno`, and we wish to create an inverted index on the fields supplied. I supplied all fields by using the '\*' shortcut.
We can now build the inverted index and the retrieval model using a `PRAGMA` statement.
The extension is [documented here]({% link docs/stable/extensions/full_text_search.md %}).
We create an index table on table `documents` or `main.documents` that we created with our script.
The column that identifies our documents is called `docno`, and we wish to create an inverted index on the fields supplied.
I supplied all fields by using the '\*' shortcut.

```python
con = duckdb.connect(database='db/trec04_05.db', read_only=False)
con.execute("PRAGMA create_fts_index('documents', 'docno', '*', stopwords='english')")
Expand All @@ -90,6 +96,7 @@ Under the hood, a parameterized SQL script is called. The schema `fts_main_docum
## Running the Benchmark

The data is now fully prepared. Now we want to run the queries in the benchmark, one by one. We load the topics file as follows:

```python
# the 'topics' file is not structured nicely, so we need to parse some of it using regex
def after_tag(s, tag):
Expand All @@ -106,9 +113,11 @@ with open('../../trec/topics', 'r') as f:
title = after_tag(str(top), 'title')
topic_dict[num] = title
```

This gives us a dictionary that has query number as keys, and query strings as values, e.g. `301 -> 'International Organized Crime'`.
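
As a rough illustration of this parsing step, the sketch below builds such a dictionary from a small inline sample using Python's `re` module. The `parse_topics` helper, the `SAMPLE_TOPICS` text and its layout (`<top>` blocks containing `<num> Number:` and `<title>` lines), the made-up topic 999, and the choice of integer keys are all assumptions made for this example; the post's own `after_tag` helper is not reproduced here.

```python
import re

# Hypothetical sample in an assumed TREC-style topic layout.
# Topic 301's title comes from the text; topic 999 is invented for illustration.
SAMPLE_TOPICS = """<top>
<num> Number: 301
<title> International Organized Crime
</top>
<top>
<num> Number: 999
<title> Hypothetical Example Query
</top>
"""


def parse_topics(text):
    """Map each topic number to its title string."""
    topics = {}
    # Each topic is wrapped in <top> ... </top>; DOTALL lets '.' span newlines.
    for block in re.findall(r"<top>(.*?)</top>", text, flags=re.DOTALL):
        num = re.search(r"<num>\s*Number:\s*(\d+)", block)
        # Without DOTALL, '.+' stops at the end of the title line.
        title = re.search(r"<title>\s*(.+)", block)
        if num and title:
            topics[int(num.group(1))] = title.group(1).strip()
    return topics


topic_dict = parse_topics(SAMPLE_TOPICS)
```

Blocks that lack either tag are skipped rather than raising an error, which may or may not be the right choice for a real topics file.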

We want to store the results in a specific format, so that they can be evaluated by [trec eval](https://github.com/usnistgov/trec_eval.git):

```python
# create a prepared statement to make querying our document collection easier
con.execute("""
Expand Down
7 changes: 6 additions & 1 deletion _posts/2022-10-12-modern-data-stack-in-a-box.md
Original file line number Diff line number Diff line change
Expand Up @@ -50,6 +50,7 @@ Given that the NBA season is starting soon, a monte carlo type simulation of the
## Building the Environment

The detailed steps to build the project can be found in the repo, but the high-level steps will be repeated here. As a note, Windows Subsystem for Linux (WSL) was chosen to support Apache Superset, but the other components of this stack can run directly on any operating system. Thankfully, using Linux on Windows has become very straightforward.

1. Install Ubuntu 20.04 on WSL.
1. Upgrade your packages (`sudo apt update`).
1. Install python.
Expand All @@ -60,7 +61,8 @@ The detailed steps to build the project can be found in the repo, but the high-l

## Meltano as a Wrapper for Pipeline Plugins

In this example, [Meltano](https://meltano.com/) pulls together multiple bits and pieces to allow the pipeline to be run with a single statement. The first part is the tap (extractor) which is '[tap-spreadsheets-anywhere](https://hub.meltano.com/extractors/tap-spreadsheets-anywhere/)'. This tap allows us to get flat data files from various sources. It should be noted that DuckDB can consume directly from flat files (locally and over the network), or SQLite and PostgreSQL databases. However, this tap was chosen to provide a clear example of getting static data into your database that can easily be configured in the meltano.yml file. Meltano also becomes more beneficial as the complexity of your data sources increases.
In this example, [Meltano](https://meltano.com/) pulls together multiple bits and pieces to allow the pipeline to be run with a single statement. The first part is the tap (extractor) which is '[tap-spreadsheets-anywhere](https://hub.meltano.com/extractors/tap-spreadsheets-anywhere/)'. This tap allows us to get flat data files from various sources. It should be noted that DuckDB can consume directly from flat files (locally and over the network), or SQLite and PostgreSQL databases. However, this tap was chosen to provide a clear example of getting static data into your database that can easily be configured in the meltano.yml file. Meltano also becomes more beneficial as the complexity of your data sources increases.

```yaml
plugins:
extractors:
Expand All @@ -71,6 +73,7 @@ plugins:
```

The next bit is the target (loader), '[target-duckdb](https://github.com/jwills/target-duckdb)'. This target can take data from any Meltano tap and load it into DuckDB. Part of the beauty of this approach is that you don't have to mess with all the extra complexity that comes with a typical database. DuckDB can be dropped in and is ready to go with zero configuration or ongoing maintenance. Furthermore, because the components and the data are co-located, networking is not a consideration and further reduces complexity.

```yaml
loaders:
- name: target-duckdb
Expand All @@ -82,6 +85,7 @@ The next bit is the target (loader), '[target-duckdb](https://github.com/jwills/
```

Next is the transformer: '[dbt-duckdb](https://github.com/jwills/dbt-duckdb)'. dbt enables transformations using a combination of SQL and Jinja templating for approachable SQL-based analytics engineering. The dbt adapter for DuckDB now supports parallel execution across threads, which makes the MDS-in-a-box run even faster. Since the bulk of the work is happening inside of dbt, this portion will be described in detail later in the post.

```yaml
transformers:
- name: dbt-duckdb
Expand All @@ -92,6 +96,7 @@ Next is the transformer: '[dbt-duckdb](https://github.com/jwills/dbt-duckdb)'. d
```

Lastly, [Apache Superset](https://superset.apache.org/) is included as a [Meltano utility](https://hub.meltano.com/utilities/superset/) to enable some data querying and visualization. Superset leverages DuckDB's SQLAlchemy driver, [duckdb_engine](https://github.com/Mause/duckdb_engine), so it can query DuckDB directly as well.

```yaml
utilities:
- name: superset
Expand Down
1 change: 1 addition & 0 deletions _posts/2023-08-23-even-friendlier-sql.md
Original file line number Diff line number Diff line change
Expand Up @@ -544,6 +544,7 @@ SELECT 'First Contact';
| 2 |

However, if a `UNION` type is used, each individual row retains its original data type. A `UNION` is defined using key-value pairs with the key as a name and the value as the data type. This also allows the specific data types to be pulled out as individual columns:

```sql
CREATE TABLE movies (
movie UNION(num INTEGER, name VARCHAR)
Expand Down
2 changes: 2 additions & 0 deletions _posts/2023-10-27-csv-sniffer.md
Original file line number Diff line number Diff line change
Expand Up @@ -82,13 +82,15 @@ A,B,C
```

Here the sniffer would detect that with the delimiter set to `,` the first row has one column, the second has two, but the remaining rows have 3 columns. Hence, if `null_padding` is set to false, it would still select `,` as a delimiter candidate, by assuming the top rows are dirty notes. (Believe me, CSV notes are a thing!) This results in the following table:

```csv
A,B,C
1, 2, 3
4, 5, 6
```
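
The column-count reasoning described above can be sketched in a few lines of Python. This is a simplified illustration, not DuckDB's sniffer: the helper names (`column_counts`, `drop_leading_notes`, `pad_rows`) are hypothetical, the sample input is reconstructed from the description in the text, and the naive `str.split` ignores quoting and escaping.

```python
# Sample input reconstructed from the description in the text (an assumption).
SAMPLE_LINES = [
    "I like my csv files to have notes to make dialect detection harder",
    "I also like commas like this one : ,",
    "A,B,C",
    "1,2,3",
    "4,5,6",
]


def column_counts(lines, delimiter):
    """Number of fields per line under a naive split (no quoting or escaping)."""
    return [len(line.split(delimiter)) for line in lines]


def drop_leading_notes(lines, delimiter):
    """Keep the longest trailing run of lines that share the last line's column count."""
    counts = column_counts(lines, delimiter)
    start = len(lines)
    while start > 0 and counts[start - 1] == counts[-1]:
        start -= 1
    return lines[start:]


def pad_rows(lines, delimiter):
    """Accept every line and pad short rows with None up to the widest row."""
    rows = [line.split(delimiter) for line in lines]
    width = max((len(row) for row in rows), default=0)
    return [row + [None] * (width - len(row)) for row in rows]


counts = column_counts(SAMPLE_LINES, ",")      # one count per input line
kept = drop_leading_notes(SAMPLE_LINES, ",")   # roughly the null_padding = false case
padded = pad_rows(SAMPLE_LINES, ",")           # roughly the null_padding = true case
```

In this toy version, `drop_leading_notes` mimics treating the inconsistent top rows as notes, while `pad_rows` mimics accepting every line and filling the missing columns.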

If `null_padding` is set to true, all lines would be accepted, resulting in the following table:

```csv
'I like my csv files to have notes to make dialect detection harder', None, None
'I also like commas like this one : ', None, None
Expand Down
1 change: 1 addition & 0 deletions _posts/2024-03-29-external-aggregation.md
Original file line number Diff line number Diff line change
Expand Up @@ -194,6 +194,7 @@ The source code for the H2O.ai benchmark can be found [here](https://github.com/
You can download the file yourself from <https://blobs.duckdb.org/data/G1_1e9_2e0_0_0.csv.zst> (18.8 GB compressed).

We use the following queries from the benchmark to load the data:

```sql
SET preserve_insertion_order = false;
CREATE TABLE y (
Expand Down
2 changes: 1 addition & 1 deletion community_extensions/extensions/avro.md
Original file line number Diff line number Diff line change
Expand Up @@ -30,7 +30,7 @@ docs:
extension_star_count: 23
extension_star_count_pretty: 23
extension_download_count: 10514
extension_download_count: 10528
extension_download_count_pretty: 10.5k
image: '/images/community_extensions/social_preview/preview_community_extension_avro.png'
layout: community_extension_doc
Expand Down
4 changes: 2 additions & 2 deletions community_extensions/extensions/bigquery.md
Original file line number Diff line number Diff line change
Expand Up @@ -53,8 +53,8 @@ docs:
extension_star_count: 94
extension_star_count_pretty: 94
extension_download_count: 4412
extension_download_count_pretty: 4.4k
extension_download_count: 6085
extension_download_count_pretty: 6.1k
image: '/images/community_extensions/social_preview/preview_community_extension_bigquery.png'
layout: community_extension_doc
---
Expand Down
4 changes: 2 additions & 2 deletions community_extensions/extensions/blockduck.md
Original file line number Diff line number Diff line change
Expand Up @@ -25,8 +25,8 @@ docs:

extension_star_count: 1
extension_star_count_pretty: 1
extension_download_count: 410
extension_download_count_pretty: 410
extension_download_count: 413
extension_download_count_pretty: 413
image: '/images/community_extensions/social_preview/preview_community_extension_blockduck.png'
layout: community_extension_doc
---
Expand Down
4 changes: 2 additions & 2 deletions community_extensions/extensions/capi_quack.md
Original file line number Diff line number Diff line change
Expand Up @@ -27,8 +27,8 @@ docs:
extension_star_count: 9
extension_star_count_pretty: 9
extension_download_count: 411
extension_download_count_pretty: 411
extension_download_count: 415
extension_download_count_pretty: 415
image: '/images/community_extensions/social_preview/preview_community_extension_capi_quack.png'
layout: community_extension_doc
---
Expand Down
4 changes: 2 additions & 2 deletions community_extensions/extensions/chsql.md
Original file line number Diff line number Diff line change
Expand Up @@ -95,8 +95,8 @@ docs:
extension_star_count: 47
extension_star_count_pretty: 47
extension_download_count: 446
extension_download_count_pretty: 446
extension_download_count: 528
extension_download_count_pretty: 528
image: '/images/community_extensions/social_preview/preview_community_extension_chsql.png'
layout: community_extension_doc
---
Expand Down
4 changes: 2 additions & 2 deletions community_extensions/extensions/chsql_native.md
Original file line number Diff line number Diff line change
Expand Up @@ -120,8 +120,8 @@ docs:
extension_star_count: 7
extension_star_count_pretty: 7
extension_download_count: 319
extension_download_count_pretty: 319
extension_download_count: 362
extension_download_count_pretty: 362
image: '/images/community_extensions/social_preview/preview_community_extension_chsql_native.png'
layout: community_extension_doc
---
Expand Down
4 changes: 2 additions & 2 deletions community_extensions/extensions/cronjob.md
Original file line number Diff line number Diff line change
Expand Up @@ -59,8 +59,8 @@ docs:
extension_star_count: 29
extension_star_count_pretty: 29
extension_download_count: 408
extension_download_count_pretty: 408
extension_download_count: 414
extension_download_count_pretty: 414
image: '/images/community_extensions/social_preview/preview_community_extension_cronjob.png'
layout: community_extension_doc
---
Expand Down
4 changes: 2 additions & 2 deletions community_extensions/extensions/crypto.md
Original file line number Diff line number Diff line change
Expand Up @@ -64,8 +64,8 @@ repo:

extension_star_count: 12
extension_star_count_pretty: 12
extension_download_count: 452
extension_download_count_pretty: 452
extension_download_count: 453
extension_download_count_pretty: 453
image: '/images/community_extensions/social_preview/preview_community_extension_crypto.png'
layout: community_extension_doc
---
Expand Down
4 changes: 2 additions & 2 deletions community_extensions/extensions/datasketches.md
Original file line number Diff line number Diff line change
Expand Up @@ -85,8 +85,8 @@ repo:

extension_star_count: 14
extension_star_count_pretty: 14
extension_download_count: 508
extension_download_count_pretty: 508
extension_download_count: 514
extension_download_count_pretty: 514
image: '/images/community_extensions/social_preview/preview_community_extension_datasketches.png'
layout: community_extension_doc
---
Expand Down
8 changes: 4 additions & 4 deletions community_extensions/extensions/duckpgq.md
Original file line number Diff line number Diff line change
Expand Up @@ -60,10 +60,10 @@ docs:
*Disclaimer:* As this extension is part of an ongoing research project by the Database Architectures group at CWI, some features may still be under development. We appreciate your understanding and patience as we continue to improve it.
extension_star_count: 153
extension_star_count_pretty: 153
extension_download_count: 2995
extension_download_count_pretty: 3.0k
extension_star_count: 154
extension_star_count_pretty: 154
extension_download_count: 3543
extension_download_count_pretty: 3.5k
image: '/images/community_extensions/social_preview/preview_community_extension_duckpgq.png'
layout: community_extension_doc
---
Expand Down
4 changes: 2 additions & 2 deletions community_extensions/extensions/evalexpr_rhai.md
Original file line number Diff line number Diff line change
Expand Up @@ -98,8 +98,8 @@ repo:

extension_star_count: 14
extension_star_count_pretty: 14
extension_download_count: 411
extension_download_count_pretty: 411
extension_download_count: 412
extension_download_count_pretty: 412
image: '/images/community_extensions/social_preview/preview_community_extension_evalexpr_rhai.png'
layout: community_extension_doc
---
Expand Down
4 changes: 2 additions & 2 deletions community_extensions/extensions/flockmtl.md
Original file line number Diff line number Diff line change
Expand Up @@ -55,8 +55,8 @@ docs:
extension_star_count: 96
extension_star_count_pretty: 96
extension_download_count: 412
extension_download_count_pretty: 412
extension_download_count: 417
extension_download_count_pretty: 417
image: '/images/community_extensions/social_preview/preview_community_extension_flockmtl.png'
layout: community_extension_doc
---
Expand Down
4 changes: 2 additions & 2 deletions community_extensions/extensions/fuzzycomplete.md
Original file line number Diff line number Diff line change
Expand Up @@ -185,8 +185,8 @@ repo:

extension_star_count: 11
extension_star_count_pretty: 11
extension_download_count: 400
extension_download_count_pretty: 400
extension_download_count: 403
extension_download_count_pretty: 403
image: '/images/community_extensions/social_preview/preview_community_extension_fuzzycomplete.png'
layout: community_extension_doc
---
Expand Down
4 changes: 2 additions & 2 deletions community_extensions/extensions/geography.md
Original file line number Diff line number Diff line change
Expand Up @@ -31,8 +31,8 @@ docs:
extension_star_count: 18
extension_star_count_pretty: 18
extension_download_count: 189
extension_download_count_pretty: 189
extension_download_count: 245
extension_download_count_pretty: 245
image: '/images/community_extensions/social_preview/preview_community_extension_geography.png'
layout: community_extension_doc
---
Expand Down
4 changes: 2 additions & 2 deletions community_extensions/extensions/h3.md
Original file line number Diff line number Diff line change
Expand Up @@ -28,8 +28,8 @@ docs:
extension_star_count: 186
extension_star_count_pretty: 186
extension_download_count: 15060
extension_download_count_pretty: 15.1k
extension_download_count: 11917
extension_download_count_pretty: 11.9k
image: '/images/community_extensions/social_preview/preview_community_extension_h3.png'
layout: community_extension_doc
---
Expand Down
4 changes: 2 additions & 2 deletions community_extensions/extensions/hdf5.md
Original file line number Diff line number Diff line change
Expand Up @@ -29,8 +29,8 @@ docs:
extension_star_count: 2
extension_star_count_pretty: 2
extension_download_count: 326
extension_download_count_pretty: 326
extension_download_count: 339
extension_download_count_pretty: 339
image: '/images/community_extensions/social_preview/preview_community_extension_hdf5.png'
layout: community_extension_doc
---
Expand Down
4 changes: 2 additions & 2 deletions community_extensions/extensions/hostfs.md
Original file line number Diff line number Diff line change
Expand Up @@ -60,8 +60,8 @@ docs:
For more information, please see the [HostFS documentation](https://github.com/gropaul/hostFS).
extension_star_count: 13
extension_star_count_pretty: 13
extension_download_count: 415
extension_download_count_pretty: 415
extension_download_count: 428
extension_download_count_pretty: 428
image: '/images/community_extensions/social_preview/preview_community_extension_hostfs.png'
layout: community_extension_doc
---
Expand Down
4 changes: 2 additions & 2 deletions community_extensions/extensions/http_client.md
Original file line number Diff line number Diff line change
Expand Up @@ -99,8 +99,8 @@ docs:
extension_star_count: 51
extension_star_count_pretty: 51
extension_download_count: 601
extension_download_count_pretty: 601
extension_download_count: 610
extension_download_count_pretty: 610
image: '/images/community_extensions/social_preview/preview_community_extension_http_client.png'
layout: community_extension_doc
---
Expand Down
4 changes: 2 additions & 2 deletions community_extensions/extensions/httpserver.md
Original file line number Diff line number Diff line change
Expand Up @@ -76,8 +76,8 @@ docs:
extension_star_count: 156
extension_star_count_pretty: 156
extension_download_count: 452
extension_download_count_pretty: 452
extension_download_count: 547
extension_download_count_pretty: 547
image: '/images/community_extensions/social_preview/preview_community_extension_httpserver.png'
layout: community_extension_doc
---
Expand Down
4 changes: 2 additions & 2 deletions community_extensions/extensions/lindel.md
Original file line number Diff line number Diff line change
Expand Up @@ -121,8 +121,8 @@ repo:

extension_star_count: 39
extension_star_count_pretty: 39
extension_download_count: 426
extension_download_count_pretty: 426
extension_download_count: 423
extension_download_count_pretty: 423
image: '/images/community_extensions/social_preview/preview_community_extension_lindel.png'
layout: community_extension_doc
---
Expand Down
4 changes: 2 additions & 2 deletions community_extensions/extensions/netquack.md
Original file line number Diff line number Diff line change
Expand Up @@ -29,8 +29,8 @@ docs:
extension_star_count: 4
extension_star_count_pretty: 4
extension_download_count: 315
extension_download_count_pretty: 315
extension_download_count: 324
extension_download_count_pretty: 324
image: '/images/community_extensions/social_preview/preview_community_extension_netquack.png'
layout: community_extension_doc
---
Expand Down
4 changes: 2 additions & 2 deletions community_extensions/extensions/open_prompt.md
Original file line number Diff line number Diff line change
Expand Up @@ -106,8 +106,8 @@ docs:
extension_star_count: 42
extension_star_count_pretty: 42
extension_download_count: 404
extension_download_count_pretty: 404
extension_download_count: 408
extension_download_count_pretty: 408
image: '/images/community_extensions/social_preview/preview_community_extension_open_prompt.png'
layout: community_extension_doc
---
Expand Down