Merge branch '4paradigm:main' into docs-editiontwo

4paradigm · Jan 16, 2024 · b449e35 · b449e35
2 parents 9324c7b + 930d33b
commit b449e35
Showing 118 changed files with 1,357 additions and 260 deletions.
diff --git a/docs/en/deploy/conf.md b/docs/en/deploy/conf.md
@@ -187,8 +187,8 @@
 #--max_traverse_cnt=0
 # max table traverse unique key number(batch query), default: 0
 #--max_traverse_key_cnt=0
-# max result size in byte (default: 2MB)
-#--scan_max_bytes_size=2097152
+# max result size in byte (default: 0 unlimited)
+#--scan_max_bytes_size=0
 
 # loadtable
 # The number of data bars to submit a task to the thread pool when loading

diff --git a/...nce/sql/data_types/date_and_time_types.md → ...ldb_sql/data_types/date_and_time_types.md b/...nce/sql/data_types/date_and_time_types.md → ...ldb_sql/data_types/date_and_time_types.md
diff --git a/docs/en/reference/sql/data_types/index.rst → docs/en/openmldb_sql/data_types/index.rst b/docs/en/reference/sql/data_types/index.rst → docs/en/openmldb_sql/data_types/index.rst
diff --git a/...reference/sql/data_types/numeric_types.md → .../openmldb_sql/data_types/numeric_types.md b/...reference/sql/data_types/numeric_types.md → .../openmldb_sql/data_types/numeric_types.md
diff --git a/.../reference/sql/data_types/string_types.md → ...n/openmldb_sql/data_types/string_types.md b/.../reference/sql/data_types/string_types.md → ...n/openmldb_sql/data_types/string_types.md
diff --git a/docs/en/openmldb_sql/dql/SELECT_STATEMENT.md b/docs/en/openmldb_sql/dql/SELECT_STATEMENT.md
@@ -138,7 +138,7 @@ TableAsName
 
 ```{warning}
 The `SELECT` running in online mode or the stand-alone version may not obtain complete data.
-Because a query may perform a large number of scans on multiple tablets, for stability, the largest number of bytes to scan is limited, namely `scan_max_bytes_size`.
+The maximum number of bytes a query may scan is controlled by `scan_max_bytes_size`, which is unlimited by default. If you set `scan_max_bytes_size` to a specific value, the `SELECT` statement only scans data up to that size. If the results are truncated, the message `reach the max byte ...` is recorded in the tablet's log, but no error is returned.
 
-If the select results are truncated, the message of `reach the max byte ...` will be recorded in the tablet's log, but there will be no error.
+Even if `scan_max_bytes_size` is unlimited, the `SELECT` statement may still fail, e.g. with client errors such as `body_size=xxx from xx:xxxx is too large` or `Fail to parse response from xx:xxxx by baidu_std at client-side`. We do not recommend using `SELECT` in online mode or the stand-alone version. If you want to get the row count of an online table, use `SELECT COUNT(*) FROM table_name;`.
 ```
diff --git a/docs/en/openmldb_sql/index.rst b/docs/en/openmldb_sql/index.rst
@@ -0,0 +1,18 @@
+=============================
+OpenMLDB SQL
+=============================
+
+
+.. toctree::
+    :maxdepth: 1
+
+    sql_difference
+    language_structure/index 
+    data_types/index
+    functions_and_operators/index
+    dql/index
+    dml/index
+    ddl/index
+    deployment_manage/index
+    task_manage/index
+    udf_develop_guide
diff --git a/docs/en/openmldb_sql/sql_difference.md b/docs/en/openmldb_sql/sql_difference.md
diff --git a/docs/en/openmldb_sql/udf_develop_guide.md b/docs/en/openmldb_sql/udf_develop_guide.md
@@ -0,0 +1,230 @@
+# UDF Development Guideline
+## Background
+Although OpenMLDB provides over a hundred built-in functions for data scientists to perform data analysis and feature extraction, there are scenarios where these functions might not fully meet the requirements. To facilitate users in quickly and flexibly implementing specific feature computation needs, we have introduced support for user-defined functions (UDFs) based on C++ development. Additionally, we enable the loading of dynamically generated user-defined function libraries.
+
+```{seealso}
+Users can also extend OpenMLDB's computation function library using the method of developing built-in functions. However, developing built-in functions requires modifying the source code and recompiling. If users wish to contribute extended functions to the OpenMLDB codebase, they can refer to [Built-in Function Develop Guide](./built_in_function_develop_guide.md).
+```
+
+## Development Procedures
+### Develop UDF functions
+#### Naming Convention of C++ Built-in Function
+- The names of C++ built-in functions should follow the [snake_case](https://en.wikipedia.org/wiki/Snake_case) style.
+- The name should clearly express the function's purpose.
+- The name of a function should not be the same as the name of a built-in function or other custom functions. The list of all built-in functions can be seen [here](../openmldb_sql/udfs_8h.md).
+
+#### C++ Type and SQL Type Correlation
+The types of the built-in C++ functions' parameters should be BOOL, NUMBER, TIMESTAMP, DATE, or STRING.
+The SQL types corresponding to C++ types are shown as follows:
+
+| SQL Type  | C/C++ Type  |
+|:----------|:------------|
+| BOOL      | `bool`      |
+| SMALLINT  | `int16_t`   |
+| INT       | `int32_t`   |
+| BIGINT    | `int64_t`   |
+| FLOAT     | `float`     |
+| DOUBLE    | `double`    |
+| STRING    | `StringRef` |
+| TIMESTAMP | `Timestamp` |
+| DATE      | `Date`      |
+
+
+#### Parameters and Return Values
+
+**Return Value**:
+
+* If the output type of the UDF is a basic type and `return_nullable` is set to false, it is returned as the function's return value.
+* If the output type of the UDF is a basic type and `return_nullable` is set to true, it is returned through a function parameter.
+* If the output type of the UDF is STRING, TIMESTAMP or DATE, it is returned through the **last parameter** of the function.
+
+**Parameters**: 
+
+* If the parameter is a basic type, it will be passed by value. 
+* If the parameter type is STRING, TIMESTAMP or DATE, it will be passed by pointer. 
+* The first parameter must be `UDFContext* ctx`. The definition of [UDFContext](../../../include/udf/openmldb_udf.h) is:
+
+```c++
+    struct UDFContext {
+        ByteMemoryPool* pool;  // Used for memory allocation.
+        void* ptr;             // Used for the storage of temporary variables for aggregate functions.
+    };
+```
+
+**Function Declaration**:
+  
+* The functions must be declared with `extern "C"`.
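As a minimal sketch of the simplest case above (basic type in, basic type out, nothing nullable), the function below takes its argument by value and returns the result directly. The `sketch::UDFContext` struct is a hypothetical stand-in so the example compiles without the OpenMLDB headers, and `add_one` is an illustrative name, not a function from the OpenMLDB codebase.

```cpp
#include <cstdint>

// Hypothetical stand-in for the UDFContext declared in udf/openmldb_udf.h,
// reduced so that this sketch compiles on its own.
namespace sketch {
struct UDFContext {
    void* pool;  // the real field is a ByteMemoryPool*; unused here
    void* ptr;   // scratch pointer used by aggregate functions; unused here
};
}  // namespace sketch

// BIGINT in, BIGINT out, nothing nullable: the argument arrives by value
// and the result is the ordinary C++ return value.
extern "C"
int64_t add_one(sketch::UDFContext* ctx, int64_t input) {
    (void)ctx;  // the context is still the first parameter, even if unused
    return input + 1;
}
```

In a real UDF the stand-in struct would be dropped in favour of `::openmldb::base::UDFContext` from the header.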
+
+#### Memory Management
+
+- In scalar functions, the use of 'new' and 'malloc' to allocate space for input and output parameters is not allowed. However, temporary space allocation using 'new' and 'malloc' is permissible within the function, and the allocated space must be freed before the function returns.
+
+- In aggregate functions, space allocation using 'new' or 'malloc' can be performed in the 'init' function but must be released in the 'output' function. The final return value, if it is a string, needs to be stored in the space allocated by mempool.
+
+- If dynamic memory allocation is required, OpenMLDB provides memory management interfaces. Upon function execution completion, OpenMLDB will automatically release the memory.
+```c++
+char *buffer = ctx->pool->Alloc(size);
+```
+- The maximum size allocated at once cannot exceed 2M.
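One way to respect the per-allocation limit is to check the requested size before calling the pool. The sketch below assumes "2M" means 2 MiB; `kMaxAllocBytes` and `checked_alloc` are hypothetical names, and the `sketch::ByteMemoryPool` class is only a stand-in for the pool that OpenMLDB supplies through `ctx->pool`.

```cpp
#include <cstddef>
#include <memory>
#include <vector>

namespace sketch {
// Hypothetical stand-in for OpenMLDB's pool: it owns every chunk it hands
// out and frees them all when the pool is destroyed.
class ByteMemoryPool {
 public:
    char* Alloc(std::size_t size) {
        chunks_.emplace_back(new char[size == 0 ? 1 : size]);
        return chunks_.back().get();
    }
 private:
    std::vector<std::unique_ptr<char[]>> chunks_;
};

struct UDFContext {
    ByteMemoryPool* pool;
    void* ptr;
};

// Assumption: the documented "2M" limit is read as 2 MiB.
constexpr std::size_t kMaxAllocBytes = 2 * 1024 * 1024;

// Returns nullptr instead of asking the pool for an oversized chunk,
// so the caller can bail out early.
inline char* checked_alloc(UDFContext* ctx, std::size_t size) {
    if (size > kMaxAllocBytes) {
        return nullptr;
    }
    return ctx->pool->Alloc(size);
}
}  // namespace sketch
```

A UDF using such a helper would return early (or mark its result as null, if declared nullable) when `checked_alloc` yields `nullptr`.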
+
+**Note**:
+- If the parameters are declared as nullable, then all parameters are nullable, and each input parameter will have an additional `is_null` parameter.
+- If the return value is declared as nullable, it will be returned through parameters, and an additional `is_null` parameter will indicate whether the return value is null.
+
+For instance, a scalar UDF `sum` that takes two parameters, where both the inputs and the return value are nullable, is declared as follows:
+```c++
+extern "C"
+void sum(::openmldb::base::UDFContext* ctx, int64_t input1, bool is_null1, int64_t input2, bool is_null2, int64_t* output, bool* is_null);
+```
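A body for a function of this shape might look like the sketch below. It uses distinct names (`is_null1`, `is_null2`) for the per-argument flags, since C++ does not allow two parameters with the same name. The rule "any NULL input gives a NULL result" is an assumption chosen for illustration, `nullable_sum` is a hypothetical name, and `sketch::UDFContext` is a stand-in for the real struct.

```cpp
#include <cstdint>

// Hypothetical stand-in for the UDFContext from udf/openmldb_udf.h.
namespace sketch {
struct UDFContext {
    void* pool;
    void* ptr;
};
}  // namespace sketch

// Nullable inputs: each value is followed by its own is_null flag.
// Nullable return: the value and its is_null flag are written through
// the trailing pointer parameters, and the function itself returns void.
extern "C"
void nullable_sum(sketch::UDFContext* ctx,
                  int64_t input1, bool is_null1,
                  int64_t input2, bool is_null2,
                  int64_t* output, bool* is_null) {
    (void)ctx;
    if (is_null1 || is_null2) {
        *is_null = true;  // assumption: NULL in, NULL out
        return;
    }
    *output = input1 + input2;
    *is_null = false;
}
```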
+#### Scalar Function Implementation
+
+Scalar functions process individual data rows and return a single value, such as abs, sin, cos, date, year.
+The process is as follows:
+- The header file `udf/openmldb_udf.h` should be included.
+- Develop the logic of the function.
+
+```c++
+#include <cstring>  // memcpy
+#include "udf/openmldb_udf.h"  // must include this header file
+ 
+// Develop a UDF that slices the first 2 characters of a given string. 
+extern "C"
+void cut2(::openmldb::base::UDFContext* ctx, ::openmldb::base::StringRef* input, ::openmldb::base::StringRef* output) {
+    if (input == nullptr || output == nullptr) {
+        return;
+    }
+    uint32_t size = input->size_ <= 2 ? input->size_ : 2;
+    //use ctx->pool for memory allocation
+    char *buffer = ctx->pool->Alloc(size);
+    memcpy(buffer, input->data_, size);
+    output->size_ = size;
+    output->data_ = buffer;
+}
+```
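Before registering a function like this, it can be exercised locally. The sketch below repeats the `cut2` logic against hypothetical stand-in types (`sketch::StringRef`, `sketch::ByteMemoryPool`, `sketch::UDFContext`) that mirror only the members used in the example above; `call_cut2` is an illustrative harness, not part of OpenMLDB. Note that the output is read as a pointer plus a length, since nothing in the UDF writes a terminating NUL.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <memory>
#include <string>
#include <vector>

namespace sketch {
// Stand-ins mirroring only the members the UDF touches.
struct StringRef {
    uint32_t size_;
    const char* data_;
};

class ByteMemoryPool {
 public:
    char* Alloc(std::size_t size) {
        chunks_.emplace_back(new char[size == 0 ? 1 : size]);
        return chunks_.back().get();
    }
 private:
    std::vector<std::unique_ptr<char[]>> chunks_;
};

struct UDFContext {
    ByteMemoryPool* pool;
    void* ptr;
};
}  // namespace sketch

// Same logic as the documented cut2, written against the stand-in types.
extern "C"
void cut2(sketch::UDFContext* ctx, sketch::StringRef* input, sketch::StringRef* output) {
    if (input == nullptr || output == nullptr) {
        return;
    }
    uint32_t size = input->size_ <= 2 ? input->size_ : 2;
    char* buffer = ctx->pool->Alloc(size);
    memcpy(buffer, input->data_, size);
    output->size_ = size;
    output->data_ = buffer;
}

// Hypothetical harness: wraps a std::string in a StringRef, calls the UDF,
// and copies the result out before the pool is destroyed.
std::string call_cut2(const std::string& text) {
    sketch::ByteMemoryPool pool;
    sketch::UDFContext ctx{&pool, nullptr};
    sketch::StringRef in{static_cast<uint32_t>(text.size()), text.data()};
    sketch::StringRef out{0, nullptr};
    cut2(&ctx, &in, &out);
    return std::string(out.data_, out.size_);
}
```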
+
+
+#### Aggregation Function Implementation
+
+Aggregate functions process a dataset (such as a column of data) and perform computations, returning a single value, such as sum, avg, max, min, count.
+The process is as follows:
+- The header file `udf/openmldb_udf.h` should be included.
+- Develop the logic of the function.
+
+To develop an aggregate function, you need to implement the following three C++ methods:
+
+- init function: Perform initialization tasks such as allocating space for intermediate variables. Function naming format: 'aggregate_function_name_init'.
+
+- update function: Implement the logic for processing each row of the respective field in the update function. Function naming format: 'aggregate_function_name_update'.
+
+- output function: Process the final aggregated value and return the result. Function naming format: 'aggregate_function_name_output'.
+
+**Note**: The init and update functions must return `UDFContext*`.
+
+```c++
+#include "udf/openmldb_udf.h"  //must include this header file
+// implementation of aggregation function special_sum
+extern "C"
+::openmldb::base::UDFContext* special_sum_init(::openmldb::base::UDFContext* ctx) {
+    // allocate space for intermediate variables and assign to 'ptr' in UDFContext.
+    ctx->ptr = ctx->pool->Alloc(sizeof(int64_t));
+    // init the value
+    *(reinterpret_cast<int64_t*>(ctx->ptr)) = 10;
+    // return pointer of UDFContext, cannot be omitted
+    return ctx;
+}
+
+extern "C"
+::openmldb::base::UDFContext* special_sum_update(::openmldb::base::UDFContext* ctx, int64_t input) {
+    // get the value from ptr in UDFContext
+    int64_t cur = *(reinterpret_cast<int64_t*>(ctx->ptr));
+    cur += input;
+    *(reinterpret_cast<int64_t*>(ctx->ptr)) = cur;
+    // return the pointer of UDFContext, cannot be omitted
+    return ctx;
+}
+
+// get the aggregation result from ptr in UDFContext and return it
+extern "C"
+int64_t special_sum_output(::openmldb::base::UDFContext* ctx) {
+    return *(reinterpret_cast<int64_t*>(ctx->ptr)) + 5;
+}
+
+```
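To make the init/update/output lifecycle concrete, the sketch below drives the three `special_sum` functions by hand: init once, update once per row, output once at the end. This calling order is an assumption based on the descriptions above, not a statement about OpenMLDB's internal scheduler. The `sketch::` types are hypothetical stand-ins for the real header, and `run_special_sum` is an illustrative driver.

```cpp
#include <cstddef>
#include <cstdint>
#include <memory>
#include <vector>

namespace sketch {
class ByteMemoryPool {
 public:
    char* Alloc(std::size_t size) {
        // new char[n] is suitably aligned for any object that fits in n bytes.
        chunks_.emplace_back(new char[size == 0 ? 1 : size]);
        return chunks_.back().get();
    }
 private:
    std::vector<std::unique_ptr<char[]>> chunks_;
};

struct UDFContext {
    ByteMemoryPool* pool;
    void* ptr;
};
}  // namespace sketch

extern "C"
sketch::UDFContext* special_sum_init(sketch::UDFContext* ctx) {
    ctx->ptr = ctx->pool->Alloc(sizeof(int64_t));
    *(reinterpret_cast<int64_t*>(ctx->ptr)) = 10;  // running total starts at 10
    return ctx;
}

extern "C"
sketch::UDFContext* special_sum_update(sketch::UDFContext* ctx, int64_t input) {
    *(reinterpret_cast<int64_t*>(ctx->ptr)) += input;
    return ctx;
}

extern "C"
int64_t special_sum_output(sketch::UDFContext* ctx) {
    return *(reinterpret_cast<int64_t*>(ctx->ptr)) + 5;
}

// Hypothetical driver: init once, update per row, output once.
int64_t run_special_sum(const std::vector<int64_t>& rows) {
    sketch::ByteMemoryPool pool;
    sketch::UDFContext ctx{&pool, nullptr};
    sketch::UDFContext* state = special_sum_init(&ctx);
    for (int64_t row : rows) {
        state = special_sum_update(state, row);
    }
    return special_sum_output(state);
}
```

For the rows 1, 2, 3 the running total goes 10, 11, 13, 16, and the output step adds 5, giving 21.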
+
+
+For more UDF implementations, see [here](../../../src/examples/test_udf.cc).
+
+
+### Compile Dynamic Library 
+
+- Copy the `include` directory (`https://github.com/4paradigm/OpenMLDB/tree/main/include`) to a certain path (like `/work/OpenMLDB/`) for later compiling. 
+- Run the compiling command. `-I` specifies the path of the `include` directory. `-o` specifies the name of the dynamic library.
+
+```shell
+g++ -shared -o libtest_udf.so examples/test_udf.cc -I /work/OpenMLDB/include -std=c++11 -fPIC
+```
+
+### Copy Dynamic Library
+The compiled dynamic libraries should be copied into the `udf` directories for both TaskManager and tablets. Please create a new `udf` directory if it does not exist. 
+- The `udf` directory of a tablet is `path_to_tablet/udf`.
+- The `udf` directory of TaskManager is `path_to_taskmanager/taskmanager/bin/udf`. 
+
+For example, if the deployment paths of a tablet and TaskManager are both `/work/openmldb`, the structure of the directory is shown below:
+
+```
+    /work/openmldb/
+    ├── bin
+    ├── conf
+    ├── taskmanager
+    │   ├── bin
+    │   │   ├── taskmanager.sh
+    │   │   └── udf
+    │   │       └── libtest_udf.so
+    │   ├── conf
+    │   └── lib
+    ├── tools
+    └── udf
+        └── libtest_udf.so
+```
+
+```{note}
+- For multiple tablets, the library needs to be copied to every tablet. 
+- Dynamic libraries should not be deleted before the execution of `DROP FUNCTION`.
+```
+
+
+### Register, Drop and Show the Functions
+For registering, please use [CREATE FUNCTION](../openmldb_sql/ddl/CREATE_FUNCTION.md).
+
+Register a scalar function:
+```sql
+CREATE FUNCTION cut2(x STRING) RETURNS STRING OPTIONS (FILE='libtest_udf.so');
+```
+Register an aggregation function:
+```sql
+CREATE AGGREGATE FUNCTION special_sum(x BIGINT) RETURNS BIGINT OPTIONS (FILE='libtest_udf.so');
+```
+Register an aggregation function whose input and return values support NULL:
+```sql
+CREATE AGGREGATE FUNCTION third(x BIGINT) RETURNS BIGINT OPTIONS (FILE='libtest_udf.so', ARG_NULLABLE=true, RETURN_NULLABLE=true);
+```
+
+**Note**:
+- The types of parameters and return values must be consistent with the implementation of the code.
+- `FILE` specifies the file name of the dynamic library. It is not necessary to include a path.
+- A UDF works on a single type. To support multiple types, create one function per type.
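Because each registered function is bound to one signature, one way to cover several types without duplicating logic is to keep the logic in an internal template and export one thin `extern "C"` wrapper per type. The sketch below is illustrative only: `clamp_non_negative`, `non_negative_int` and `non_negative_bigint` are hypothetical names, and `sketch::UDFContext` is a stand-in for the real struct.

```cpp
#include <cstdint>

// Hypothetical stand-in for the UDFContext from udf/openmldb_udf.h.
namespace sketch {
struct UDFContext {
    void* pool;
    void* ptr;
};
}  // namespace sketch

namespace {
// Shared logic, invisible outside this file: negative values become zero.
template <typename T>
T clamp_non_negative(T value) {
    return value < 0 ? T(0) : value;
}
}  // namespace

// One exported symbol per SQL type. Each would need its own registration.
extern "C"
int32_t non_negative_int(sketch::UDFContext* ctx, int32_t input) {
    (void)ctx;
    return clamp_non_negative<int32_t>(input);
}

extern "C"
int64_t non_negative_bigint(sketch::UDFContext* ctx, int64_t input) {
    (void)ctx;
    return clamp_non_negative<int64_t>(input);
}
```

Under this layout the INT variant and the BIGINT variant would each be registered with a separate `CREATE FUNCTION` statement whose parameter and return types match the corresponding wrapper.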
+
+
+After successful registration, the function can be used.
+```sql
+SELECT cut2(c1) FROM t1;
+```
+
+You can view registered functions through `SHOW FUNCTIONS`.
+```sql
+SHOW FUNCTIONS;
+```
+
+Use `DROP FUNCTION` to delete a registered function.
+```sql
+DROP FUNCTION cut2;
+```
diff --git a/docs/en/quickstart/concepts/modes.md b/docs/en/quickstart/concepts/modes.md
@@ -59,7 +59,7 @@ The main features of the online preview mode are:
 - Online preview mode is mainly used for previewing limited data. Selecting and viewing data directly through SELECT in OpenMLDB CLI or SDKs may result in data truncation. If the data volume is large, it is recommended to use an [export tool](../../tutorial/data_export.html) to view the complete data.
 - SELECT statements in online preview mode currently do not support more complex queries such as `LAST JOIN` and `ORDER BY`. Refer to [SELECT](../../openmldb_sql/dql/SELECT_STATEMENT.html).
 - The server in the online preview mode executes SQL statements on a single thread. For large data processing, it may be slow and may trigger a timeout. To increase the timeout period, the `--request_timeout` can be configured on the client.
-- To prevent impact on online services, online preview mode limits the maximum number of accessed records and the number of different keys. This can be configured using `--max_traverse_cnt` and `--max_traverse_key_cnt`. Similarly, the maximum result size can be set using `--scan_max_bytes_size`. For detailed configuration, refer to the [configuration file](../../deploy/conf.md).
+- To prevent impact on online services, you can limit the maximum number of accessed records and the number of different keys in online preview mode. This can be configured using `--max_traverse_cnt` and `--max_traverse_key_cnt`. Similarly, the maximum result size can be set using `--scan_max_bytes_size`. For detailed configuration, refer to the [configuration file](../../deploy/conf.md).
 
 The command for setting online preview mode in OpenMLDB CLI: `SET @@execute_mode='online'`
 

diff --git a/docs/zh/app_ecosystem/feature_platform/concept.md b/docs/zh/app_ecosystem/feature_platform/concept.md