Merge branch '4paradigm:main' into docs-editiontwo
Elliezza authored Jan 16, 2024
2 parents 9324c7b + 930d33b commit b449e35
Showing 118 changed files with 1,357 additions and 260 deletions.
4 changes: 2 additions & 2 deletions docs/en/deploy/conf.md
@@ -187,8 +187,8 @@
#--max_traverse_cnt=0
# max table traverse unique key number(batch query), default: 0
#--max_traverse_key_cnt=0
-# max result size in byte (default: 2MB)
-#--scan_max_bytes_size=2097152
+# max result size in byte (default: 0 unlimited)
+#--scan_max_bytes_size=0
# loadtable
# The number of data bars to submit a task to the thread pool when loading
File renamed without changes.
4 changes: 2 additions & 2 deletions docs/en/openmldb_sql/dql/SELECT_STATEMENT.md
@@ -138,7 +138,7 @@ TableAsName

```{warning}
The `SELECT` running in online mode or the stand-alone version may not obtain complete data.
-Because a query may perform a large number of scans on multiple tablets, for stability, the largest number of bytes to scan is limited, namely `scan_max_bytes_size`.
+The maximum number of bytes to scan is controlled by `scan_max_bytes_size`, which is unlimited by default. If you set `scan_max_bytes_size` to a specific value, the `SELECT` statement scans only the data within that size. If the results are truncated, the message `reach the max byte ...` is recorded in the tablet's log, but no error is returned.
-If the select results are truncated, the message of `reach the max byte ...` will be recorded in the tablet's log, but there will be no error.
+Even if `scan_max_bytes_size` is unlimited, the `SELECT` statement may still fail, e.g. with client errors such as `body_size=xxx from xx:xxxx is too large` or `Fail to parse response from xx:xxxx by baidu_std at client-side`. We do not recommend using `SELECT` in online mode or the stand-alone version. To get the row count of an online table, use `SELECT COUNT(*) FROM table_name;`.
```
18 changes: 18 additions & 0 deletions docs/en/openmldb_sql/index.rst
@@ -0,0 +1,18 @@
=============================
OpenMLDB SQL
=============================


.. toctree::
   :maxdepth: 1

   sql_difference
   language_structure/index
   data_types/index
   functions_and_operators/index
   dql/index
   dml/index
   ddl/index
   deployment_manage/index
   task_manage/index
   udf_develop_guide
266 changes: 266 additions & 0 deletions docs/en/openmldb_sql/sql_difference.md

Large diffs are not rendered by default.

230 changes: 230 additions & 0 deletions docs/en/openmldb_sql/udf_develop_guide.md
@@ -0,0 +1,230 @@
# UDF Development Guideline
## Background
Although OpenMLDB provides over a hundred built-in functions for data scientists to perform data analysis and feature extraction, there are scenarios where these functions might not fully meet the requirements. To facilitate users in quickly and flexibly implementing specific feature computation needs, we have introduced support for user-defined functions (UDFs) based on C++ development. Additionally, we enable the loading of dynamically generated user-defined function libraries.

```{seealso}
Users can also extend OpenMLDB's computation function library using the method of developing built-in functions. However, developing built-in functions requires modifying the source code and recompiling. If users wish to contribute extended functions to the OpenMLDB codebase, they can refer to [Built-in Function Develop Guide](./built_in_function_develop_guide.md).
```

## Development Procedures
### Develop UDF functions
#### Naming Convention of C++ UDFs
- The name of a C++ UDF should follow the [snake_case](https://en.wikipedia.org/wiki/Snake_case) style.
- The name should clearly express the function's purpose.
- The name of a function should not be the same as the name of a built-in function or other custom functions. The list of all built-in functions can be seen [here](../openmldb_sql/udfs_8h.md).

#### C++ Type and SQL Type Correlation
The parameter types of a C++ UDF must be BOOL, a numeric type (SMALLINT, INT, BIGINT, FLOAT, DOUBLE), TIMESTAMP, DATE, or STRING.
The C/C++ types corresponding to the SQL types are shown below:

| SQL Type | C/C++ Type |
|:----------|:------------|
| BOOL | `bool` |
| SMALLINT | `int16_t` |
| INT | `int32_t` |
| BIGINT | `int64_t` |
| FLOAT | `float` |
| DOUBLE | `double` |
| STRING | `StringRef` |
| TIMESTAMP | `Timestamp` |
| DATE | `Date` |


#### Parameters and Return Values

**Return Value**:

* If the output type of the UDF is a basic type and `return_nullable` is set to false, the result is returned as the function's return value.
* If the output type of the UDF is a basic type and `return_nullable` is set to true, the result is returned through a function parameter.
* If the output type of the UDF is STRING, TIMESTAMP or DATE, the result is returned through the **last parameter** of the function.

**Parameters**:

* If the parameter is a basic type, it is passed by value.
* If the parameter is STRING, TIMESTAMP or DATE, it is passed by pointer.
* The first parameter must be `UDFContext* ctx`. The definition of [UDFContext](../../../include/udf/openmldb_udf.h) is:

```c++
struct UDFContext {
    ByteMemoryPool* pool;  // Used for memory allocation.
    void* ptr;             // Used for the storage of temporary variables for aggregate functions.
};
```
**Function Declaration**:
* Functions must be declared with `extern "C"`.
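
The parameter and return-value rules above can be sketched as three small functions. This is a minimal illustration, not code from the OpenMLDB repository: the `sketch` namespace and its `ByteMemoryPool`, `UDFContext` and `StringRef` types are hypothetical stand-ins so the snippet compiles on its own (a real UDF includes `udf/openmldb_udf.h` and uses the `::openmldb::base` types), and the function names `add_one`, `byte_len` and `repeat_twice` are invented for the example.

```c++
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <memory>
#include <vector>

// Hypothetical stand-ins for the types in "udf/openmldb_udf.h".
namespace sketch {
struct ByteMemoryPool {
    std::vector<std::unique_ptr<char[]>> chunks;
    // Memory stays alive until the pool itself is destroyed.
    char* Alloc(std::size_t size) {
        chunks.emplace_back(new char[size]);
        return chunks.back().get();
    }
};
struct UDFContext {
    ByteMemoryPool* pool;
    void* ptr;
};
struct StringRef {
    uint32_t size_;
    const char* data_;
};
}  // namespace sketch

// Basic type in (by value), basic non-nullable type out: plain return value.
extern "C"
int64_t add_one(sketch::UDFContext* ctx, int64_t x) {
    return x + 1;
}

// STRING in (by pointer), basic non-nullable type out: still a plain return value.
extern "C"
int32_t byte_len(sketch::UDFContext* ctx, sketch::StringRef* input) {
    return static_cast<int32_t>(input->size_);
}

// STRING out: the result is written through the last parameter.
extern "C"
void repeat_twice(sketch::UDFContext* ctx, sketch::StringRef* input, sketch::StringRef* output) {
    uint32_t size = input->size_ * 2;
    char* buffer = ctx->pool->Alloc(size);  // output memory comes from the pool
    std::memcpy(buffer, input->data_, input->size_);
    std::memcpy(buffer + input->size_, input->data_, input->size_);
    output->size_ = size;
    output->data_ = buffer;
}
```

In all three cases `ctx` is the first parameter, even when the function does not use it.
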
#### Memory Management
- In scalar functions, `new` and `malloc` must not be used to allocate space for input and output parameters. Temporary space may be allocated with `new` or `malloc` inside the function, but it must be freed before the function returns.
- In aggregate functions, space may be allocated with `new` or `malloc` in the `init` function, but it must be released in the `output` function. If the final return value is a string, it must be stored in space allocated from the memory pool.
- If dynamic memory allocation is required, OpenMLDB provides a memory management interface. OpenMLDB releases this memory automatically once the function has finished executing.
```c++
char *buffer = ctx->pool->Alloc(size);
```
- A single allocation cannot exceed 2 MB.
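
As a rough sketch of these rules, the scalar function below keeps its temporary data in a `std::string` (released automatically when the function returns) and places the output in pool memory. The `sketch` types are hypothetical stand-ins for those in `udf/openmldb_udf.h`, `reverse_bytes` is an invented name, and `kMaxAllocBytes` is an assumption derived from the 2M figure above. Returning an empty string for oversized input is a choice made for this example, not documented OpenMLDB behavior.

```c++
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <memory>
#include <string>
#include <vector>

// Hypothetical stand-ins for the types in "udf/openmldb_udf.h".
namespace sketch {
struct ByteMemoryPool {
    std::vector<std::unique_ptr<char[]>> chunks;
    char* Alloc(std::size_t size) {
        chunks.emplace_back(new char[size]);
        return chunks.back().get();
    }
};
struct UDFContext {
    ByteMemoryPool* pool;
    void* ptr;
};
struct StringRef {
    uint32_t size_;
    const char* data_;
};
}  // namespace sketch

// Assumed per-allocation limit, taken from the "2M" figure in the text.
static const uint32_t kMaxAllocBytes = 2 * 1024 * 1024;

// Reverses the bytes of the input string.
extern "C"
void reverse_bytes(sketch::UDFContext* ctx, sketch::StringRef* input, sketch::StringRef* output) {
    output->size_ = 0;
    output->data_ = nullptr;
    if (input->size_ == 0 || input->size_ > kMaxAllocBytes) {
        return;  // sketch-only choice: fall back to an empty string
    }
    // Temporary space: owned by std::string and freed when the function returns.
    std::string scratch(input->data_, input->size_);
    std::reverse(scratch.begin(), scratch.end());
    // Output space: must come from the pool, never from new/malloc.
    char* buffer = ctx->pool->Alloc(scratch.size());
    std::memcpy(buffer, scratch.data(), scratch.size());
    output->size_ = static_cast<uint32_t>(scratch.size());
    output->data_ = buffer;
}
```
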

**Note**:
- If the parameters are declared as nullable, then all parameters are nullable, and each input parameter will have an additional `is_null` parameter.
- If the return value is declared as nullable, it will be returned through parameters, and an additional `is_null` parameter will indicate whether the return value is null.

For instance, a scalar UDF `sum` with two parameters, where both the inputs and the return value are nullable, is declared as follows:
```c++
extern "C"
void sum(::openmldb::base::UDFContext* ctx, int64_t input1, bool is_null1, int64_t input2, bool is_null2, int64_t* output, bool* is_null);
```
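
A possible body for such a nullable function is sketched below. The name `nullable_add` is invented (it avoids clashing with the built-in `sum`), `sketch::UDFContext` is a hypothetical stand-in for the real context type, and the "NULL in, NULL out" behavior is an assumption chosen for the example rather than something the guide prescribes.

```c++
#include <cstdint>

// Hypothetical stand-in for ::openmldb::base::UDFContext; the pool is unused here.
namespace sketch {
struct UDFContext {
    void* pool;
    void* ptr;
};
}  // namespace sketch

// Each input carries its own null flag; the result and its null flag are
// written through the trailing pointer parameters.
extern "C"
void nullable_add(sketch::UDFContext* ctx, int64_t input1, bool is_null1,
                  int64_t input2, bool is_null2, int64_t* output, bool* is_null) {
    if (is_null1 || is_null2) {
        *is_null = true;  // sketch-only choice: any NULL input yields NULL
        return;
    }
    *output = input1 + input2;
    *is_null = false;
}
```
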
#### Scalar Function Implementation
Scalar functions process individual data rows and return a single value, such as abs, sin, cos, date, year.
The process is as follows:
- The header file `udf/openmldb_udf.h` should be included.
- Develop the logic of the function.
```c++
#include <cstring>  // memcpy
#include "udf/openmldb_udf.h"  // must include this header file
// Develop a UDF that returns the first 2 characters of a given string.
extern "C"
void cut2(::openmldb::base::UDFContext* ctx, ::openmldb::base::StringRef* input, ::openmldb::base::StringRef* output) {
    if (input == nullptr || output == nullptr) {
        return;
    }
    uint32_t size = input->size_ <= 2 ? input->size_ : 2;
    // use ctx->pool for memory allocation
    char *buffer = ctx->pool->Alloc(size);
    memcpy(buffer, input->data_, size);
    output->size_ = size;
    output->data_ = buffer;
}
```


#### Aggregation Function Implementation

Aggregate functions process a dataset (such as a column of data) and perform computations, returning a single value, such as sum, avg, max, min, count.
The process is as follows:
- The header file `udf/openmldb_udf.h` should be included.
- Develop the logic of the function.

To develop an aggregate function, you need to implement the following three C++ functions:

- init function: Performs initialization tasks such as allocating space for intermediate variables. Function naming format: `aggregate_function_name_init`.

- update function: Implements the logic for processing each row of the input field. Function naming format: `aggregate_function_name_update`.

- output function: Processes the final aggregated value and returns the result. Function naming format: `aggregate_function_name_output`.

**Note**: The init and update functions must return the `UDFContext*`.

```c++
#include "udf/openmldb_udf.h"  // must include this header file
// implementation of aggregation function special_sum
extern "C"
::openmldb::base::UDFContext* special_sum_init(::openmldb::base::UDFContext* ctx) {
    // allocate space for intermediate variables and assign to 'ptr' in UDFContext.
    ctx->ptr = ctx->pool->Alloc(sizeof(int64_t));
    // init the value
    *(reinterpret_cast<int64_t*>(ctx->ptr)) = 10;
    // return pointer of UDFContext, cannot be omitted
    return ctx;
}

extern "C"
::openmldb::base::UDFContext* special_sum_update(::openmldb::base::UDFContext* ctx, int64_t input) {
    // get the value from ptr in UDFContext
    int64_t cur = *(reinterpret_cast<int64_t*>(ctx->ptr));
    cur += input;
    // write back with the same type used in init (int64_t, not int)
    *(reinterpret_cast<int64_t*>(ctx->ptr)) = cur;
    // return the pointer of UDFContext, cannot be omitted
    return ctx;
}

// get the aggregation result from ptr in UDFContext and return
extern "C"
int64_t special_sum_output(::openmldb::base::UDFContext* ctx) {
    return *(reinterpret_cast<int64_t*>(ctx->ptr)) + 5;
}

```
For more UDF implementation examples, see [here](../../../src/examples/test_udf.cc).
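
The sketch below combines two rules from this guide in one aggregate: state allocated with `new` in the init function is released in the output function, and nullable inputs and outputs are passed with extra `is_null` parameters. It is an illustration under stated assumptions: `sketch::UDFContext` is a hypothetical stand-in for the real context type, the semantics (return the third non-null value, or NULL if fewer than three were seen) are chosen for the example, and the update/output signatures are what the nullable rules above suggest rather than a confirmed OpenMLDB API.

```c++
#include <cstdint>
#include <vector>

// Hypothetical stand-in for ::openmldb::base::UDFContext; the pool is unused here.
namespace sketch {
struct UDFContext {
    void* pool;
    void* ptr;
};
}  // namespace sketch

extern "C"
sketch::UDFContext* third_init(sketch::UDFContext* ctx) {
    // State allocated with new here must be released in third_output.
    ctx->ptr = new std::vector<int64_t>();
    return ctx;
}

extern "C"
sketch::UDFContext* third_update(sketch::UDFContext* ctx, int64_t input, bool is_null) {
    auto* seen = static_cast<std::vector<int64_t>*>(ctx->ptr);
    // Skip NULL rows and stop collecting once three values are stored.
    if (!is_null && seen->size() < 3) {
        seen->push_back(input);
    }
    return ctx;
}

extern "C"
void third_output(sketch::UDFContext* ctx, int64_t* output, bool* is_null) {
    auto* seen = static_cast<std::vector<int64_t>*>(ctx->ptr);
    if (seen->size() == 3) {
        *output = (*seen)[2];
        *is_null = false;
    } else {
        *is_null = true;  // fewer than three non-null values were seen
    }
    delete seen;  // release the state allocated in third_init
    ctx->ptr = nullptr;
}
```
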
### Compile Dynamic Library
- Copy the `include` directory (`https://github.com/4paradigm/OpenMLDB/tree/main/include`) to a certain path (like `/work/OpenMLDB/`) for later compiling.
- Run the compiling command. `-I` specifies the path of the `include` directory. `-o` specifies the name of the dynamic library.
```shell
g++ -shared -o libtest_udf.so examples/test_udf.cc -I /work/OpenMLDB/include -std=c++11 -fPIC
```

### Copy Dynamic Library
The compiled dynamic libraries should be copied into the `udf` directories for both TaskManager and tablets. Please create a new `udf` directory if it does not exist.
- The `udf` directory of a tablet is `path_to_tablet/udf`.
- The `udf` directory of TaskManager is `path_to_taskmanager/taskmanager/bin/udf`.

For example, if the deployment paths of a tablet and TaskManager are both `/work/openmldb`, the structure of the directory is shown below:

```
/work/openmldb/
├── bin
├── conf
├── taskmanager
│   ├── bin
│   │   ├── taskmanager.sh
│   │   └── udf
│   │       └── libtest_udf.so
│   ├── conf
│   └── lib
├── tools
└── udf
   └── libtest_udf.so
```

```{note}
- For multiple tablets, the library needs to be copied to every tablet.
- Dynamic libraries should not be deleted before the execution of `DROP FUNCTION`.
```


### Register, Drop and Show the Functions
For registering, please use [CREATE FUNCTION](../openmldb_sql/ddl/CREATE_FUNCTION.md).

Register a scalar function:
```sql
CREATE FUNCTION cut2(x STRING) RETURNS STRING OPTIONS (FILE='libtest_udf.so');
```
Register an aggregation function:
```sql
CREATE AGGREGATE FUNCTION special_sum(x BIGINT) RETURNS BIGINT OPTIONS (FILE='libtest_udf.so');
```
Register an aggregation function whose input and return values can be null:
```sql
CREATE AGGREGATE FUNCTION third(x BIGINT) RETURNS BIGINT OPTIONS (FILE='libtest_udf.so', ARG_NULLABLE=true, RETURN_NULLABLE=true);
```

**Note**:
- The types of parameters and return values must be consistent with the implementation in the code.
- `FILE` specifies the file name of the dynamic library. It is not necessary to include a path.
- A UDF works on a single type only. Create separate functions to support multiple types.


After successful registration, the function can be used.
```sql
SELECT cut2(c1) FROM t1;
```

You can view registered functions through `SHOW FUNCTIONS`.
```sql
SHOW FUNCTIONS;
```

Use `DROP FUNCTION` to delete a registered function.
```sql
DROP FUNCTION cut2;
```
2 changes: 1 addition & 1 deletion docs/en/quickstart/concepts/modes.md
@@ -59,7 +59,7 @@ The main features of the online preview mode are:
- Online preview mode is mainly used for previewing limited data. Selecting and viewing data directly through SELECT in OpenMLDB CLI or SDKs may result in data truncation. If the data volume is large, it is recommended to use an [export tool](../../tutorial/data_export.html) to view the complete data.
- SELECT statements in online preview mode currently do not support more complex queries such as `LAST JOIN` and `ORDER BY`. Refer to [SELECT](../../openmldb_sql/dql/SELECT_STATEMENT.html).
- The server in the online preview mode executes SQL statements on a single thread. For large data processing, it may be slow and may trigger a timeout. To increase the timeout period, the `--request_timeout` can be configured on the client.
-- To prevent impact on online services, online preview mode limits the maximum number of accessed records and the number of different keys. This can be configured using `--max_traverse_cnt` and `--max_traverse_key_cnt`. Similarly, the maximum result size can be set using `--scan_max_bytes_size`. For detailed configuration, refer to the [configuration file](../../deploy/conf.md).
+- To prevent impact on online services, you can limit the maximum number of accessed records and the number of different keys in online preview mode. This can be configured using `--max_traverse_cnt` and `--max_traverse_key_cnt`. Similarly, the maximum result size can be set using `--scan_max_bytes_size`. For detailed configuration, refer to the [configuration file](../../deploy/conf.md).

The command for setting online preview mode in OpenMLDB CLI: `SET @@execute_mode='online'`

35 changes: 0 additions & 35 deletions docs/zh/app_ecosystem/feature_platform/concept.md

This file was deleted.
