diff --git a/docs/en/developer/built_in_function_develop_guide.md b/docs/en/developer/built_in_function_develop_guide.md index 97d00076f87..3041a789267 100644 --- a/docs/en/developer/built_in_function_develop_guide.md +++ b/docs/en/developer/built_in_function_develop_guide.md @@ -6,17 +6,20 @@ OpenMLDB contains hundreds of built-in functions that help data scientists extra OpenMLDB classifies functions as aggregate or scalar depending on the input data values and result values. -- An *aggregate function* receives **a set of** values for each argument (such as the values of a column) and returns a single-value result for the set of input values. - - A *scalar function* receives **a single value** for each argument and returns a single value result. A scalar function can be classified into several groups: - -- - Mathematical function + - Mathematical function - Logical function - Date & Time function - String function - Conversion function -This article is a hands-on guide for built-in scalar function development in OpenMLDB. We will not dive into aggregate function development in detail. We truly welcome developers who want to join our community and help extend our functions. +- An *aggregate function* receives **a set of** values for each argument (such as the values of a column) and returns a single-value result for the set of input values. + +This article serves as an introductory guide to developing SQL built-in functions, aiming to guide developers in quickly grasping the basic methods of developing custom functions. + +First, we will provide a detailed overview of the development steps, classification, and examples of scalar function development. This will enable developers to understand the basic development and registration patterns of custom functions. + +Subsequently, we will transition to the details of developing complex aggregate functions. 
We sincerely welcome more developers to join our community and assist us in expanding and developing the built-in function collection. ## 2. Develop a Built-In SQL Function @@ -34,13 +37,39 @@ Developers need to **take care of the following** rules when developing a functi #### 2.1.1 Code Location -Developers can declare function in [hybridse/src/udf/udf.h](https://github.com/4paradigm/OpenMLDB/blob/main/hybridse/src/udf/udf.h) and implement it in [hybridse/src/udf/udf.cc](https://github.com/4paradigm/OpenMLDB/blob/main/hybridse/src/udf/udf.cc) within namespace `hybridse::udf::v1`. +Developers can declare function in [hybridse/src/udf/udf.h](https://github.com/4paradigm/OpenMLDB/blob/main/hybridse/src/udf/udf.h) and implement it in [hybridse/src/udf/udf.cc](https://github.com/4paradigm/OpenMLDB/blob/main/hybridse/src/udf/udf.cc). +If the function is complex, developers can declare and implement in separate `.h` and `.cc` files in [hybridse/src/udf/](https://github.com/4paradigm/OpenMLDB/blob/main/hybridse/src/udf/). + +The functions are usually within namespace `hybridse::udf::v1`. + +- ```c++ + # hybridse/src/udf/udf.h + namespace hybridse { + namespace udf { + namespace v1 { + // declare built-in function + } // namespace v1 + } // namespace udf + } // namespace hybridse + ``` + +- ```c++ + # hybridse/src/udf/udf.cc + namespace hybridse { + namespace udf { + namespace v1 { + // implement built-in function + } // namespace v1 + } // namespace udf + } // namespace hybridse + ``` #### 2.1.2 C++ Function Naming Rules - Function names are all lowercase, with underscores between words. Check [snake_case](https://en.wikipedia.org/wiki/Snake_case) for more details. - Function names should be clear and readable. Use names that describe the purpose or intent of the function. +(c_vs_sql)= #### 2.1.3 C++ and SQL Data Type C++ built-in functions can use limited data types, including BOOL, Numeric, String, Timestamp and Date. 
The correspondence between the SQL data type and the C++ data type is shown as follows: @@ -64,7 +93,7 @@ C++ built-in functions can use limited data types, including BOOL, Numeric, Stri - SQL function parameters and C++ function parameters have the same position order. -- C++ function parameter types should match the SQL types. Check [2.1.3 C++ and SQL Data Type](#2.1.3-C++ and SQL Data Type) for more details. +- C++ function parameter types should match the SQL types. Check [2.1.3 C++ and SQL Data Type](c_vs_sql) for more details. - SQL function return type: @@ -89,12 +118,17 @@ C++ built-in functions can use limited data types, including BOOL, Numeric, Stri void func_output_nullable_date(int64_t, codec::Date*, bool*); ``` - - Notice that return types have greater impact on built-in function developing behaviours. We will cover the details in a later section [3. Built-in Function Development Template](#3.-Built-in Function Development Template). + - Notice that return types have greater impact on built-in function developing behaviours. We will cover the details in a later section [3.2 Scalar Function Development Classification](sfunc_category). + +- Handling Nullable Parameters: + - Generally, OpenMLDB adopts a uniform approach to handling NULL parameters for all built-in scalar functions. That is, if any input parameter is NULL, the function will directly return NULL. + - However, for scalar functions or aggregate functions that require special handling of NULL parameters, you can configure the parameter as `Nullable`. In the C++ function, you will then use the corresponding C++ type of ArgType and `bool*` to express this parameter. For more details, refer to [3.2.4 Nullable SQL Function Parameters](arg_nullable). #### 2.1.5 Memory Management - Operator `new` operator or method `malloc` are forbidden in C++ built-in function implementation. 
-- Developers must call provided memory management APIs in order to archive space allocation for output parameters: +- In C++ built-in aggregate functions, `new` or `malloc` may be used to allocate memory during initialization, provided that the allocated space is released when the `output` function produces the final result. +- Developers must call the provided memory management APIs to allocate space for UDF output parameters: - `hybridse::udf::v1::AllocManagedStringBuf(size)` to allocate space. OpenMLDB `ByteMemoryPool` will assign continous space to the function and will release it when safe. - If allocated size < 0, allocation will fail. `AllocManagedStringBuf` return null pointer. - If allocated size exceed the MAX_ALLOC_SIZE which is 2048, the allocation will fail. `AllocManagedStringBuf` return null pointer. @@ -147,33 +181,16 @@ OpenMLDB `DefaultUdfLibrary` stores and manages the global built-in SQL functio - The SQL function name does not have to be the same as the C++ function name, since the SQL function name will be linked to the C++ function via the registry. - SQL function names are case-insensitive. For instance, given register name "aaa_bb", the users can access it by calling `AAA_BB()`, `Aaa_Bb()`, `aAa_bb()` in SQL. -#### 2.2.3 Register and Configure Function - -`DefaultUdfLibrary::RegisterExternal` create an instance of `ExternalFuncRegistryHelper` with a name. The name will be the function's registered name. +#### 2.2.3 Register Function Interface -```c++ -ExternalFuncRegistryHelper helper = RegisterExternal("register_func_name"); -// ... 
ignore function configuration details -``` +- Registration of scalar functions: + - For a scalar function with a single input type: `RegisterExternal("register_func_name")` + - For a generic function that supports multiple types: `RegisterExternalTemplate("register_func_name")` +- Registration of aggregate functions: + - For an aggregate function with a single input type: `RegisterUdaf("register_func_name")` + - For a generic function that supports multiple types: `RegisterUdafTemplate("register_func_name")` - `ExternalFuncRegistryHelper` provides a set of APIs to help developers to configure the functions and register it into the *default library*. - -```c++ -RegisterExternal("register_func_name") - .args(built_in_fn_pointer) - .return_by_arg(bool_value) - .returns - .doc(documentation) -``` - -- `args`: Configure argument types. -- `built_in_fn_pointer`: Built-in function pointer. -- `returns`: Configure return type. Notice that when function result is Nullable, we should configure ***return type*** as ***returns>*** explicitly. -- `return_by_arg()` : Configure whether return value will be store in parameters or not. - - When **return_by_arg(false)** , result will be return directly. OpenMLDB configure `return_by_arg(false) ` by default. - - When **return_by_arg(true)**, the result will be stored and returned by parameters. - - if the return type is ***non-nullable***, the result will be stored and returned via the last parameter. - - if the return type is **nullable**, the ***result value*** will be stored in the second-to-last parameter and the ***null flag*** will be stored in the last parameter. if ***null flag*** is true, function result is **null**, otherwise, function result is obtained from second-to-last parameter. +The specific interface definitions are described in the following sections. 
#### 2.2.4 Documenting Function @@ -187,12 +204,12 @@ Function docstrings should contain the following information: - **@since** command to specify the production version when the function was added to OpenMLDB. The version can be obtained from the project's [CMakeList.txt](https://github.com/4paradigm/OpenMLDB/blob/main/CMakeLists.txt): ` ${OPENMLDB_VERSION_MAJOR}.${OPENMLDB_VERSION_MINOR}.${OPENMLDB_VERSION_BUG}` ```c++ -RegisterExternal("register_func_name") +RegisterExternal("function_name") //... .doc(R"( - @brief a brief summary of the my_function's purpose and behavior + @brief a brief summary of the my_function's purpose and behavior - @param param1 a brief description of param1 + @param param1 a brief description of param1 Example: @@ -203,15 +220,117 @@ RegisterExternal("register_func_name") @since 0.4.0)"); ``` -#### 2.2.5 RegisterAlias +#### 2.2.5 Register Alias Sometimes, we don't have to implement and register a function when it is an alias to another function that already exists in the default library. We can simply use api `RegisterAlias("alias_func", "original_func")` to link the current register function name with an existing registered name. ```c++ +// substring() is already registered in the default library +RegisterAlias("substr", "substring"); +``` + +## 2.3 Function Unit Test + +Once a function is developed and registered, the developer should add unit tests to verify that it behaves as expected. + +#### 2.3.1 Add Unit Tests + +Generally, developers can test scalar functions by adding `TEST_F` cases to [src/codegen/udf_ir_builder_test.cc](https://github.com/4paradigm/OpenMLDB/blob/main/hybridse/src/codegen/udf_ir_builder_test.cc), and test aggregate functions by adding `TEST_F` cases to [src/udf/udaf_test.cc](https://github.com/4paradigm/OpenMLDB/blob/main/hybridse/src/udf/udaf_test.cc). OpenMLDB provides `CheckUdf` so that the developer can perform function checking easily. 
+ +```c++ +CheckUdf<return_type, arg_type,...>("function_name", expect_result, arg_value,...); +``` + +For each function signature, we at least have to: + +- Add a unit test with a normal result +- If a parameter is ***nullable***, add a unit test with NULL input that produces a normal result +- Add a unit test with a null result if the result is **nullable** + +**Example**: +- Add unit test in [hybridse/src/codegen/udf_ir_builder_test.cc](https://github.com/4paradigm/OpenMLDB/blob/main/hybridse/src/codegen/udf_ir_builder_test.cc): + ```c++ + // month(timestamp) normal check + TEST_F(UdfIRBuilderTest, month_timestamp_udf_test) { + Timestamp time(1589958000000L); + CheckUdf<int32_t, Timestamp>("month", 5, time); + } + + // date(timestamp) normal check + TEST_F(UdfIRBuilderTest, timestamp_to_date_test_0) { + CheckUdf<Nullable<Date>, Nullable<Timestamp>>( + "date", Date(2020, 05, 20), Timestamp(1589958000000L)); + } + // date(timestamp) null check + TEST_F(UdfIRBuilderTest, timestamp_to_date_test_null_0) { + CheckUdf<Nullable<Date>, Nullable<Timestamp>>("date", nullptr, nullptr); + } + ``` + +- Add unit test in [hybridse/src/udf/udaf_test.cc](https://github.com/4paradigm/OpenMLDB/blob/main/hybridse/src/udf/udaf_test.cc): + ```c++ + // avg udaf test + TEST_F(UdafTest, avg_test) { + CheckUdf<double, ListRef<int32_t>>("avg", 2.5, MakeList<int32_t>({1, 2, 3, 4})); + } + ``` + +(compile_ut)= +#### 2.3.2 Compile and Test + +- Compile `udf_ir_builder_test` and test + ```bash + # Compile udf_ir_builder_test, default output path is build/hybridse/src/codegen/udf_ir_builder_test + make OPENMLDB_BUILD_TARGET=udf_ir_builder_test TESTING_ENABLE=ON + + # Run test, note that environment variable SQL_CASE_BASE_DIR needs to be specified as OpenMLDB project path + SQL_CASE_BASE_DIR=${OPENMLDB_DIR} ./build/hybridse/src/codegen/udf_ir_builder_test + ``` +- Compile `udaf_test` and test + ```bash + # Compile udaf_test, default output path is build/hybridse/src/udf/udaf_test + make OPENMLDB_BUILD_TARGET=udaf_test TESTING_ENABLE=ON + + # Run test, note that environment variable SQL_CASE_BASE_DIR needs to be specified as OpenMLDB 
project path + SQL_CASE_BASE_DIR=${OPENMLDB_DIR} ./build/hybridse/src/udf/udaf_test + ``` + +If testing is to be done through SDK or command line, `OpenMLDB` needs to be recompiled. For compilation, refer to [compile.md](../deploy/compile.md). + +## 3. Scalar Function Development +### 3.1 Registration and Interface Configuration + +#### 3.1.1 Registration of Scalar Function Supporting Single Data Type + +The `DefaultUdfLibrary` provides the `RegisterExternal` interface to facilitate the registration of built-in scalar functions and initialize the registration name of the function. This method requires specifying a data type and only supports declared data types. + + +```c++ +RegisterExternal("register_func_name") + .args<arg_type, ...>(static_cast<return_type (*)(arg_type, ...)>(v1::func_ptr)) + .return_by_arg(bool_value) + .returns<return_type> +``` + +The configuration of a function generally includes: function pointer configuration, parameter type configuration, and return value configuration. + +- Configuring the C++ function pointer: `func_ptr`. Use `static_cast` to convert it to an explicit function pointer type, for code readability and compile-time safety. +- Configuring parameter types: `args<arg_type, ...>`. +- Configuring return value type: `returns<return_type>`. Typically, it is not necessary to explicitly specify the return type. However, if the function result is nullable, you need to explicitly configure the ***return type*** as ***returns<Nullable<return_type>>***. +- Configuring the return method: `return_by_arg()`. + - When **return_by_arg(false)**, the result is directly returned through the `return` statement. OpenMLDB defaults to `return_by_arg(false)`. + - When **return_by_arg(true)**, the result is returned through parameters: + - If the return type is ***non-nullable***, the function result is returned through the last parameter. + - If the return type is ***nullable***, the function result value is returned through the second-to-last parameter, and the ***null flag*** is returned through the last parameter. 
If the ***null flag*** is ***true***, the function result is ***null***; otherwise, the function result is retrieved from the second-to-last parameter. + +The following code demonstrates an example of registering the built-in single-row function `substring`. You can find the code in [hybridse/src/udf/default_udf_library.cc](https://github.com/4paradigm/OpenMLDB/blob/main/hybridse/src/udf/default_udf_library.cc). +```c++ +// void sub_string(StringRef *str, int32_t from, StringRef *output); + RegisterExternal("substring") .args( - static_cast(udf::v1::sub_string)) + static_cast(udf::v1::sub_string)) .return_by_arg(true) .doc(R"( @brief Return a substring `len` characters long from string str, starting at position `pos`. @@ -235,81 +354,62 @@ RegisterExternal("substring") @param len length of substring. If len is less than 1, the result is the empty string. @since 0.1.0)"); - -// substring() is registered into default library already -RegisterAlias("substr", "substring"); ``` -## 2.3 Function Unit Test +#### 3.1.2 Registration of Built-In Functions Supporting Generic Templates +We also provide the `RegisterExternalTemplate` interface to support the registration of generic built-in single-row functions, allowing simultaneous support for multiple data types. -Once a function is registered/developed, the developer should add some related unit tests to make sure everything is going well. - -#### 2.3.1 Add Unit Tests +```c++ +RegisterExternalTemplate("register_func_name") + .args_in() + .return_by_arg(bool_value) +``` -Generally, developers can add `TEST_F` cases to [src/codegen/udf_ir_builder_test.cc](https://github.com/4paradigm/OpenMLDB/blob/main/hybridse/src/codegen/udf_ir_builder_test.cc). +The configuration of a function generally includes: function template configuration, supported parameter types configuration, and return method configuration. 
-OpenMLDB provides `CheckUdf` in [src/codegen/udf_ir_builder_test.cc](https://github.com/4paradigm/OpenMLDB/blob/main/hybridse/src/codegen/udf_ir_builder_test.cc) so that the developer can perform function checking easily. +- Configuring the function template: `TemplateClass`. +- Configuring supported parameter types: `args_in`. +- Configuring the return method: `return_by_arg()` + - When **return_by_arg(false)**, the result is directly returned through the `return` statement. OpenMLDB defaults to `return_by_arg(false)`. + - When **return_by_arg(true)**, the result is returned through parameters. +The following code shows the code example of registering `abs` scalar function (code can be found at [hybridse/src/udf/default_udf_library.cc](https://github.com/4paradigm/OpenMLDB/blob/main/hybridse/src/udf/default_udf_library.cc)). ```c++ -CheckUdf("function_name", expect_result, arg_value,...); -``` +RegisterExternalTemplate("abs") + .doc(R"( + @brief Return the absolute value of expr. -For each function signature, we at least have to: + Example: -- Add a unit test with a normal result -- Add a unit test with a null result if the result is **nullable** - -**Example**: + @code{.sql} -```c++ -// month(timestamp) normal check -TEST_F(UdfIRBuilderTest, month_timestamp_udf_test) { - Timestamp time(1589958000000L); - CheckUdf("month", 5, time); -} + SELECT ABS(-32); + -- output 32 -// date(timestamp) normal check -TEST_F(UdfIRBuilderTest, timestamp_to_date_test_0) { - CheckUdf, Nullable>( - "date", codec::Date(2020, 05, 20), codec::Timestamp(1589958000000L)); -} -// date(timestamp) null check -TEST_F(UdfIRBuilderTest, timestamp_to_date_test_null_0) { - CheckUdf, Nullable>("date", nullptr, nullptr); -} -``` + @endcode -#### 2.3.2 Compile and Test + @param expr -```bash -cd ./hybridse -mkdir -p build -cd build -cmake .. 
-DCMAKE_BUILD_TYPE=Release -DTESTING_ENABLE=ON -make udf_ir_builder_test -j4 -SQL_CASE_BASE_DIR=${OPENMLDB_DIR} ./src/codegen/udf_ir_builder_test + @since 0.1.0)") + .args_in(); ``` +Development of generic template built-in scalar functions is similar to that of single data type built-in scalar functions. In this document, we won't delve into detailed discussions on generic template functions. The remaining content in this chapter primarily focuses on the development of single data type built-in scalar functions. - -## 3. Built-in Function Development Template +(sfunc_category)= +## 3.2 Built-in Scalar Function Development Template We classified built-in function into 3 types based on its return type: - SQL functions return **BOOL** or Numeric types, e.g., **SMALLINT**, **INT**, **BIGINT**, **FLOAT**, **DOUBLE** - -- SQL functions return **STRING**, **TIMESTAMP** or **DATE** - - - ```c++ - // SQL: STRING FUNC_STR(INT) - void func_output_str(int32_t, codec::StringRef*); - ``` - +- SQL functions return **STRING**, **TIMESTAMP**, **DATE**, **ArrayRef** - SQL functions return ***Nullable*** type Return types have a greater impact on the built-in function's behaviour. We will cover the details of the three types of SQL functions in the following sections. -### 3.1 SQL Functions Return **BOOL** or Numeric Types +(return_bool)= + +### 3.2.1 SQL Functions Return **BOOL** or Numeric Types If an SQL function returns a BOOL or Numeric type (e.g., **BOOL**, **SMALLINT**, **INT**, **BIGINT**, **FLOAT**, **DOUBLE**), then the C++ function should be designed to return the corresponding C++ type(`bool`, `int16_t`, `int32_t`, `int64_t`, `float`, `double`). 
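
A minimal standalone sketch of this convention, assuming a hypothetical SQL function `INT month_from_millis(BIGINT)`. It is not code from the OpenMLDB tree: the function name, the plain `int64_t` millisecond argument (standing in for a timestamp type), and the choice of UTC are all assumptions made for illustration. The point is only that a numeric SQL result maps to a direct C++ return value.

```cpp
#include <cassert>
#include <cstdint>
#include <ctime>

// Hypothetical example: SQL `INT month_from_millis(BIGINT)`.
// A numeric SQL result is returned directly, so no output pointer is needed.
int32_t month_from_millis(int64_t ts_ms) {
    assert(ts_ms >= 0);  // sketch-only precondition: instants at or after the Unix epoch
    // Convert milliseconds since the epoch to whole seconds.
    std::time_t secs = static_cast<std::time_t>(ts_ms / 1000);
    // Assumption: interpret the instant in UTC; a real implementation may apply a time zone.
    const std::tm* utc = std::gmtime(&secs);
    if (utc == nullptr) {
        return 0;  // sketch-only fallback for instants that cannot be represented
    }
    return static_cast<int32_t>(utc->tm_mon + 1);  // tm_mon is 0-based
}
```

Because the result is returned by value, such a function would be registered without `return_by_arg(true)`.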
@@ -351,7 +451,7 @@ RegisterExternal("my_func") )"); ``` -### 3.2 SQL Functions Return **STRING**, **TIMESTAMP** or **DATE** +### 3.2.2 SQL Functions Return **STRING**, **TIMESTAMP** or **DATE** If an SQL function returns **STRING**, **TIMESTAMP** or **DATE**, then the C++ function result should be returned in the parameter with the corresponding C++ pointer type (`codec::StringRef*`, `codec::Timestamp*`, `codec::Date*`). @@ -359,26 +459,31 @@ Thus the C++ function can be declared and implemented as follows: ```c++ # hybridse/src/udf/udf.h -namespace udf { - namespace v1 { - void func(Arg1 arg1, Arg2 arg2, ..., Ret* result); - } // namespace v1 -} // namespace udf +namespace hybridse { + namespace udf { + namespace v1 { + void func(Arg1 arg1, Arg2 arg2, ..., Ret* result); + } + } +} + ``` ```c++ # hybridse/src/udf/udf.cc -namespace udf { - namespace v1 { - void func(Arg1 arg1, Arg2 arg2, ..., Ret* ret) { - // ... - // *ret = result value +namespace hybridse { + namespace udf { + namespace v1 { + void func(Arg1 arg1, Arg2 arg2, ..., Ret* ret) { + // ... + // *ret = result value + } } - } // namespace v1 -} // namespace udf + } +} ``` -Configure and register the function into `DefaultUdfLibary` in[hybridse/src/udf/default_udf_library.cc](https://github.com/4paradigm/OpenMLDB/blob/main/hybridse/src/udf/default_udf_library.cc): +Configure and register the function into `DefaultUdfLibrary` in [hybridse/src/udf/default_udf_library.cc](https://github.com/4paradigm/OpenMLDB/blob/main/hybridse/src/udf/default_udf_library.cc). Note that when the function returns its result through a parameter, `return_by_arg(true)` must be configured. ```c++ # hybridse/src/udf/default_udf_library.cc @@ -390,7 +495,7 @@ RegisterExternal("my_func") )"); ``` -### 3.3 SQL Functions Return ***Nullable*** type +### 3.2.3 SQL Functions Return ***Nullable*** type If an SQL function return type is ***Nullable***, then we need one more `bool*` parameter to return a `is_null` flag. 
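
A minimal standalone sketch of this calling convention, under stated assumptions: `DateSketch` is a hypothetical stand-in for `codec::Date`, `date_from_millis` is an invented name, the input is a plain `int64_t` of milliseconds, and the "negative input yields NULL" rule and the use of UTC are chosen only for illustration. The sketch shows the shape of the signature: the result value goes to the second-to-last parameter and the null flag to the last one.

```cpp
#include <cassert>
#include <cstdint>
#include <ctime>

// Hypothetical stand-in for codec::Date, used only in this sketch.
struct DateSketch {
    int32_t year;
    int32_t month;
    int32_t day;
};

// Hypothetical example: SQL `DATE date_from_millis(BIGINT)` with a nullable result.
// The value is written to `output`; `is_null` reports whether the result is NULL.
void date_from_millis(int64_t ts_ms, DateSketch* output, bool* is_null) {
    assert(output != nullptr && is_null != nullptr);
    if (ts_ms < 0) {
        *is_null = true;  // sketch-only rule: negative timestamps yield NULL
        return;
    }
    std::time_t secs = static_cast<std::time_t>(ts_ms / 1000);
    const std::tm* utc = std::gmtime(&secs);  // assumption: UTC, no time-zone offset
    if (utc == nullptr) {
        *is_null = true;  // conversion failed, so the result is NULL
        return;
    }
    output->year = utc->tm_year + 1900;  // tm_year counts from 1900
    output->month = utc->tm_mon + 1;     // tm_mon is 0-based
    output->day = utc->tm_mday;
    *is_null = false;
}
```

The caller reads `output` only when `is_null` is false, which is why a function of this shape is registered with `return_by_arg(true)` and a `Nullable` return type.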
@@ -439,7 +544,9 @@ RegisterExternal("my_func") )"); ``` -### 3.4 SQL Functions Handle Nullable Argument +(arg_nullable)= + +### 3.2.4 SQL Functions Handle Nullable Argument Generally, OpenMLDB will return a ***NULL*** for a function when any one of its argurements is ***NULL***. @@ -488,11 +595,11 @@ RegisterExternal("my_func") )"); ``` -## 4. SQL Functions Development Examples +## 3.3. SQL Functions Development Examples -### 4.1 SQL Functions Return **BOOL** or Numeric Types: `INT Month(TIMESTAMP)` Function +### 3.3.1 SQL Functions Return **BOOL** or Numeric Types: `INT Month(TIMESTAMP)` Function -`INT Month(TIMESTAMP)` function returns the month for a given `timestamp`. Check [3.1 SQL functions return **BOOL** or Numeric types](#3.1-SQL functions return **BOOL** or Numeric types) for more details. +The `INT Month(TIMESTAMP)` function returns the month for a given `timestamp`. Check [3.2.1 SQL functions return **BOOL** or Numeric types](return_bool) for more details. #### Step 1: Declare and Implement C++ Functions @@ -557,7 +664,7 @@ namespace udf { #### Step3: Function Unit Test -[Add unit tests](Add Unit Tests) in [src/codegen/udf_ir_builder_test.cc](https://github.com/4paradigm/OpenMLDB/blob/main/hybridse/src/codegen/udf_ir_builder_test.cc). Then [compile and test it](2.3.2 Compile and test). +Add a `TEST_F` unit test in [src/codegen/udf_ir_builder_test.cc](https://github.com/4paradigm/OpenMLDB/blob/main/hybridse/src/codegen/udf_ir_builder_test.cc). Then [Compile and Test](compile_ut). ```c++ // month(timestamp) normal check @@ -567,7 +674,7 @@ TEST_F(UdfIRBuilderTest, month_timestamp_udf_test) { } ``` -Now, the `udf::v1:month` has been registered into the default library with the name `month`. As a result, we can call `month` in an SQL query while ignoring upper and lower cases. +Recompile `OpenMLDB` upon completing the development. Now, the `udf::v1::month` has been registered into the default library with the name `month`. 
As a result, we can call `month` in an SQL query while ignoring upper and lower cases. ```SQL select MONTH(TIMESTAMP(1590115420000)) as m1, month(timestamp(1590115420000)) as m2; @@ -578,9 +685,9 @@ select MONTH(TIMESTAMP(1590115420000)) as m1, month(timestamp(1590115420000)) as ---- ---- ``` -### 4.2 SQL Functions Return **STRING**, **TIMESTAMP** or **DATE** - `STRING String(BOOL)` +### 3.3.2 SQL Functions Return **STRING**, **TIMESTAMP** or **DATE** - `STRING String(BOOL)` -The `STRING String(BOOL)` function accepts a BOOL type input and converts it to an output of type STRING. Check [3.2 SQL functions return **STRING**, **TIMESTAMP** or **DATE**](#3.2-SQL functions return **STRING**, **TIMESTAMP** or **DATE**) for more details. +The `STRING String(BOOL)` function accepts a **BOOL** type input and converts it to an output of type **STRING**. Check [3.2.2 SQL functions return **STRING**, **TIMESTAMP** or **DATE**](#322-sql-functions-return-string-timestamp-or-date) for more details. #### Step 1: Declare and Implement C++ Functions @@ -659,7 +766,7 @@ namespace hybridse { #### Step3: Function Unit Test -[Add unit tests](Add Unit Tests) in [src/codegen/udf_ir_builder_test.cc](https://github.com/4paradigm/OpenMLDB/blob/main/hybridse/src/codegen/udf_ir_builder_test.cc). Then [compile and test it](2.3.2 Compile and test). +Add `TEST_F` unit tests in [src/codegen/udf_ir_builder_test.cc](https://github.com/4paradigm/OpenMLDB/blob/main/hybridse/src/codegen/udf_ir_builder_test.cc). Then [Compile and Test](compile_ut). ```c++ // string(bool) normal check @@ -670,7 +777,7 @@ TEST_F(UdfIRBuilderTest, bool_to_string_test) { } ``` -Now, the `udf::v1:bool_to_string()` function has been registered into the default library with the name `string`. As a result, we can call `string` in an SQL query while ignoring upper and lower cases. +Recompile `OpenMLDB` upon completing the development. 
Now, the `udf::v1::bool_to_string()` function has been registered into the default library with the name `string`. As a result, we can call `String()` in an SQL query while ignoring upper and lower cases. ```SQL select STRING(true) as str_true, string(false) as str_false; @@ -682,14 +789,13 @@ select STRING(true) as str_true, string(false) as str_false; ``` +### 3.3.3 SQL Functions Return ***Nullable*** Type - `DATE Date(TIMESTAMP)` -### 4.3 SQL Functions Return ***Nullable*** Type - `DATE Date(TIMESTAMP)` - -`DATE Date(TIMESTAMP)()` function converts **TIMESTAMP** type to **DATE** type. Check [3.3 SQL functions return ***Nullable*** type](#3.3-SQL functions return ***Nullable*** type) and [3.2 SQL functions return **STRING**, **TIMESTAMP** or **DATE**](#3.2-SQL functions return **STRING**, **TIMESTAMP** or **DATE**) for more details. +The `DATE Date(TIMESTAMP)` function converts a **TIMESTAMP** value to the **DATE** type. Check [3.2.3 SQL functions return ***Nullable*** type](#323-sql-functions-return-nullable-type) and [3.2.2 SQL functions return **STRING**, **TIMESTAMP** or **DATE**](#322-sql-functions-return-string-timestamp-or-date) for more details. #### Step 1: Declare and Implement Built-In Functions -We implement a function `timestamp_to_date`to convert `timestamp` to the date type. The input is `timestamp` and the output is nullable `date` which is returned by arguments `codec::Date *output` and `bool *is_null`. +Because the `date` type in OpenMLDB is a structured type, the function does not return its result directly; the result is stored in an output parameter instead. In addition, because date conversion may fail, the result is marked as ***nullable***, so an additional parameter, ***is_null***, indicates whether the result is null. 
Declare the `timestamp_to_date()` function in [hybridse/src/udf/udf.h](https://github.com/4paradigm/OpenMLDB/blob/main/hybridse/src/udf/udf.h): @@ -731,9 +837,10 @@ namespace hybridse { #### Step 2: Register Built-In Function into Default Library -The following example registers the built-in function ` v1::timestamp_to_date` into the default library with the name `"date"`. +The configuration of the function name and function parameters is similar to that of regular functions. However, there are additional considerations for configuring the return value type: -Given the result is a nullable date type, we configure **return_by_arg** as ***true*** and return type as `Nullable`. +- Since the function result is returned through a parameter, configure `return_by_arg(true)`. +- Since the function result may be null, configure `.returns<Nullable<Date>>`. `DATE Date(TIMESTAMP)` is Date&Time function, developer should configure and register within `DefaultUdfLibrary::InitTimeAndDateUdf()` in [hybridse/src/udf/default_udf_library.cc](https://github.com/4paradigm/OpenMLDB/blob/main/hybridse/src/udf/default_udf_library.cc). @@ -763,7 +870,7 @@ namespace hybridse { #### Step3: Function Unit Test -[Add unit tests](Add Unit Tests) in [src/codegen/udf_ir_builder_test.cc](https://github.com/4paradigm/OpenMLDB/blob/main/hybridse/src/codegen/udf_ir_builder_test.cc). Then [compile and test it](2.3.2 Compile and test). +Add `TEST_F` unit tests in [src/codegen/udf_ir_builder_test.cc](https://github.com/4paradigm/OpenMLDB/blob/main/hybridse/src/codegen/udf_ir_builder_test.cc). Then [Compile and Test](compile_ut). ```c++ // date(timestamp) normal check @@ -777,7 +884,7 @@ TEST_F(UdfIRBuilderTest, timestamp_to_date_test_null_0) { } ``` -Now, the `udf::v1:timestamp_to_date` has been registered into the default library with the name `date`. As a result, we can call `date()` in an SQL query. +Recompile `OpenMLDB` upon completing the development. 
Now, the `udf::v1::timestamp_to_date` has been registered into the default library with the name `date`. As a result, we can call `date()` in an SQL query while ignoring upper and lower cases. ```SQL select date(timestamp(1590115420000)) as dt; @@ -788,14 +895,148 @@ select date(timestamp(1590115420000)) as dt; ------------ ``` +## 4. Aggregation Function Development + +### 4.1. Registration and Configuration of Interface + +#### 4.1.1 Registration of Aggregation Functions Supporting a Single Data Type + +The `DefaultUdfLibrary` provides the `RegisterUdaf` interface to facilitate the registration of built-in aggregation functions and initialize the function's registration name. This method requires specifying a data type and only supports declared data types. + + +```c++ +RegisterUdaf("register_func_name") + .templates<OUT, ST, IN, ...>() + .init("init_func_name", init_func_ptr, return_by_arg=false) + .update("update_func_name", update_func_ptr, return_by_arg=false) + .output("output_func_name", output_func_ptr, return_by_arg=false) +``` + + +Unlike the registration of scalar functions, aggregation functions require the registration of three functions: `init`, `update`, and `output`, which correspond to the initialization of the aggregation function, the update of intermediate states, and the output of the final result. +The configuration for these functions is as follows: + +- Configure parameter types: + - OUT: Output parameter type. + - ST: Intermediate state type. + - IN, ...: Input parameter types. +- Configure the `init` function pointer: `init_func_ptr`, with a function signature of `ST* Init()`. +- Configure the `update` function pointer: `update_func_ptr`, + - If the input is non-nullable, the function signature is `ST* Update(ST* state, IN val1, ...)`. 
+ - If it is necessary to check whether the input is **Nullable**, this parameter can be configured as `Nullable<IN>`, and an additional `bool` parameter is added after the corresponding parameter in the function to indicate whether the parameter value is null. + The function signature is: `ST* Update(ST* state, IN val1, bool val1_is_null, ...)`. +- Configure the output function pointer: `output_func_ptr`. + When the function's return value may be null, an additional `bool*` parameter is required to store whether the result is null + (refer to [3.2.3 SQL Functions Return ***Nullable*** type](#323-sql-functions-return-nullable-type)). + + +The following code demonstrates an example of adding a new aggregation function `second`. The `second` function returns the second non-null element in the aggregated data. For the sake of demonstration, the example supports only the `int32_t` data type: +```c++ +struct Second { + static std::vector<int32_t>* Init() { + auto list = new std::vector<int32_t>(); + return list; + } + + static std::vector<int32_t>* Update(std::vector<int32_t>* state, int32_t val, bool is_null) { + if (!is_null) { + state->push_back(val); + } + return state; + } + + static void Output(std::vector<int32_t>* state, int32_t* ret, bool* is_null) { + if (state->size() > 1) { + *ret = state->at(1); + *is_null = false; + } else { + *is_null = true; + } + delete state; + } +}; + +RegisterUdaf("second") + .templates<Nullable<int32_t>, Opaque<std::vector<int32_t>>, Nullable<int32_t>>() + .init("second_init", Second::Init) + .update("second_update", Second::Update) + .output("second_output", reinterpret_cast<void*>(Second::Output), true) + .doc(R"( + @brief Get the second non-null value of all values. + + @param value Specify value column to aggregate on. 
+ + Example: + + |value| + |--| + |1| + |2| + |3| + |4| + @code{.sql} + SELECT second(value) OVER w; + -- output 2 + @endcode + @since 0.5.0 + )"); +``` + +#### 4.1.2 Registration of Aggregation Functions Supporting Generics +We also provide the `RegisterUdafTemplate` interface for registering an aggregation function that supports generics. + +```c++ +RegisterUdafTemplate("register_func_name") + .args_in() +``` + +- Configure the aggregation function template: `TemplateClass`. +- Configure all supported parameter types: `args_in`. + +The following code demonstrates an example of registering the `distinct_count` aggregation function. You can find the code in [hybridse/src/udf/default_udf_library.cc](https://github.com/4paradigm/OpenMLDB/blob/main/hybridse/src/udf/default_udf_library.cc). +```c++ +RegisterUdafTemplate("distinct_count") + .doc(R"( + @brief Compute number of distinct values. + + @param value Specify value column to aggregate on. + + Example: + + |value| + |--| + |0| + |0| + |2| + |2| + |4| + @code{.sql} + SELECT distinct_count(value) OVER w; + -- output 3 + @endcode + @since 0.1.0 + )") + .args_in(); +``` + +## 5. Example Code Reference + +### 5.1. Scalar Function Example Code Reference +For more scalar function example code, you can refer to: +[hybridse/src/udf/udf.cc](https://github.com/4paradigm/OpenMLDB/blob/main/hybridse/src/udf/udf.cc) + +### 5.2. Aggregation Function Example Code Reference +For more aggregation function example code, you can refer to: +[hybridse/src/udf/default_udf_library.cc](https://github.com/4paradigm/OpenMLDB/blob/main/hybridse/src/udf/default_udf_library.cc) -## 5. Document Management -Documents for all built-in functions can be found in [Built-in Functions](http://4paradigm.github.io/OpenMLDB/zh/main/reference/sql/udfs_8h.html). It is a markdown file automatically generated from source, so please do not edit it directly. +## 6. 
Documentation Management -- If you are adding a document for a new function, please refer to [2.2.4 Documenting Function](#224-documenting-function). -- If you are trying to revise a document of an existing function, you can find source code in the files of `hybridse/src/udf/default_udf_library.cc` or `hybridse/src/udf/default_defs/*_def.cc` . -There is a daily workflow that automatically converts the source code to a readable format, which are the contents inside the `docs/*/reference/sql/functions_and_operators` directory. The document website will also be updated accordingly. If you are interested in this process, you can refer to the source directory [udf_doxygen](https://github.com/4paradigm/OpenMLDB/tree/main/hybridse/tools/documentation/udf_doxygen). +Documentation for all built-in functions can be found in [Built-in Functions](../openmldb_sql/udfs_8h.md). It is a markdown file automatically generated from source, so please do not edit it directly. -- If you need to document newly added functions, please refer to section [2.2.4 Documenting Function](#224-documenting-function), which explains that the documentation for built-in functions is managed in the C++ source code. A series of steps then generates more readable documentation, which appears in the `docs/*/openmldb_sql/` directory on the website. +- If you need to modify the documentation for an existing function, you can locate the corresponding documentation in the file `hybridse/src/udf/default_udf_library.cc` or `hybridse/src/udf/default_defs/*_def.cc` and make the necessary changes. +In the OpenMLDB project, a GitHub Workflow task runs daily to update the relevant documentation. Therefore, modifications to the documentation for built-in functions only require changing the content in the corresponding source code locations as described above. The `docs` directory and the content on the website will be periodically updated accordingly.
For details on the documentation generation process, you can check [udf_doxygen](https://github.com/4paradigm/OpenMLDB/tree/main/hybridse/tools/documentation/udf_doxygen). diff --git a/docs/en/developer/contributing.md b/docs/en/developer/contributing.md index a8112053565..86c0baafdcb 100644 --- a/docs/en/developer/contributing.md +++ b/docs/en/developer/contributing.md @@ -1,3 +1,26 @@ # Contributing Please refer to [Contribution Guideline](https://github.com/4paradigm/OpenMLDB/blob/main/CONTRIBUTING.md) +## Pull Request (PR) Guidelines + +When submitting a PR, please pay attention to the following points: +- PR Title: Please adhere to the [commit format](https://github.com/4paradigm/rfcs/blob/main/style-guide/commit-convention.md#conventional-commits-reference) for the PR title. **Note that this refers to the PR title, not the commits within the PR**. +```{note} +If the title does not meet the standard, `pr-linter / pr-name-lint (pull_request)` will fail with a status of `x`. +``` +- PR Checks: There are various checks in a PR, and only `codecov/patch` and `codecov/project` may not pass. Other checks should pass. If other checks do not pass and you cannot fix them or believe they should not be fixed, you can leave a comment in the PR. + +- PR Description: Please explain the intent of the PR in the first comment of the PR. We provide a PR comment template, and while you are not required to follow it, ensure that there is sufficient explanation. + +- PR Files Changed: Pay attention to the `files changed` in the PR. Do not include code changes outside the scope of the PR intent. You can generally eliminate unnecessary diffs by using `git merge origin/main` followed by `git push` to the PR branch. If you need assistance, leave a comment in the PR. +```{note} +If you are not modifying the code based on the main branch, when the PR intends to merge into the main branch, the `files changed` will include unnecessary code. 
For example, if the main branch is at commit 10, and you start from commit 9 of the old main, add new_commit1, and then add new_commit2 on top of new_commit1, you actually only want to submit new_commit2, but the PR will include new_commit1 and new_commit2. +In this case, just use `git merge origin/main` and `git push` to the PR branch to only include the changes. +``` +```{seealso} +If you want the branch code to be cleaner, you can avoid using `git merge` and use `git rebase -i origin/main` instead. It will add your changes one by one on top of the main branch. However, it will change the commit history, and you need `git push -f` to override the branch. +``` + +## Compilation Guidelines + +For compilation details, refer to the [Compilation Documentation](../deploy/compile.md). To avoid the impact of operating systems and tool versions, we recommend compiling OpenMLDB in a compilation image. Since compiling the entire OpenMLDB requires significant space, we recommend using `OPENMLDB_BUILD_TARGET` to specify only the parts you need. \ No newline at end of file diff --git a/docs/en/developer/index.rst b/docs/en/developer/index.rst index 755fa3873f9..d36c4913923 100644 --- a/docs/en/developer/index.rst +++ b/docs/en/developer/index.rst @@ -10,4 +10,3 @@ Developers built_in_function_develop_guide sdk_develop python_dev - udf_develop_guide diff --git a/docs/en/developer/python_dev.md b/docs/en/developer/python_dev.md index 1f18ede390c..43cf75f3f2f 100644 --- a/docs/en/developer/python_dev.md +++ b/docs/en/developer/python_dev.md @@ -2,9 +2,19 @@ There are two modules in `python/`: Python SDK and an OpenMLDB diagnostic tool. -## SDK Testing Methods +## SDK + +The Python SDK itself does not depend on the pytest and tox libraries used for testing. If you want to use the tests in the tests directory for testing, you can download the testing dependencies using the following method. 
+ +``` +pip install 'openmldb[test]' +pip install 'dist/....whl[test]' +``` + +### Testing Method + +Run the command `make SQL_PYSDK_ENABLE=ON OPENMLDB_BUILD_TARGET=cp_python_sdk_so` under the root directory and make sure the library in `python/openmldb_sdk/openmldb/native/` was the latest native library. Testing typically requires connecting to an OpenMLDB cluster. If you haven't started a cluster yet, or if you've made code changes to the service components, you'll also need to compile the TARGET openmldb and start a onebox cluster. You can refer to the launch section of `steps/test_python.sh` for guidance. -Run the command `make SQL_PYSDK_ENABLE=ON OPENMLDB_BUILD_TARGET=cp_python_sdk_so` under the root directory and make sure the library in `python/openmldb_sdk/openmldb/native/` was the latest native library. 1. Package installation test: Install the compiled `whl`, then run `pytest tests/`. You can use the script `steps/test_python.sh` directly. 2. Dynamic test: Make sure there isn't OpenMLDB in `pip` or the compiled `whl`. Run `pytest test/` in `python/openmldb_sdk`, thereby you can easily debug. @@ -32,6 +42,11 @@ If the python log messages are required in all tests(even successful tests), ple pytest -so log_cli=true --log-cli-level=DEBUG tests/ ``` +You can also use the module mode for running tests, which is suitable for actual runtime testing. +``` +python -m diagnostic_tool.diagnose ... +``` + ## Conda If you use conda, `pytest` may found the wrong python, then get errors like `ModuleNotFoundError: No module named 'IPython'`. Please use `python -m pytest`. diff --git a/docs/en/developer/sdk_develop.md b/docs/en/developer/sdk_develop.md index 00c5edf7725..20246500520 100644 --- a/docs/en/developer/sdk_develop.md +++ b/docs/en/developer/sdk_develop.md @@ -9,22 +9,19 @@ The OpenMLDB SDK can be divided into several layers, as shown in the figure. 
The The bottom layer is the SDK core layer, which is implemented as [SQLClusterRouter](https://github.com/4paradigm/OpenMLDB/blob/b6f122798f567adf2bb7766e2c3b81b633ebd231/src/sdk/sql_cluster_router.h#L110). It is the core layer of **client**. All operations on OpenMLDB clusters can be done by using the methods of `SQLClusterRouter` after proper configuration. Three core methods of this layer that developers may need to use are: - 1. [ExecuteSQL](https://github.com/4paradigm/OpenMLDB/blob/b6f122798f567adf2bb7766e2c3b81b633ebd231/src/sdk/sql_cluster_router.h#L160) supports the execution of all SQL commands, including DDL, DML and DQL. 2. [ExecuteSQLParameterized](https://github.com/4paradigm/OpenMLDB/blob/b6f122798f567adf2bb7766e2c3b81b633ebd231/src/sdk/sql_cluster_router.h#L166)supports parameterized SQL. 3. [ExecuteSQLRequest](https://github.com/4paradigm/OpenMLDB/blob/b6f122798f567adf2bb7766e2c3b81b633ebd231/src/sdk/sql_cluster_router.h#L156)is the special methods for the OpenMLDB specific execution mode: [Online Request mode](../tutorial/modes.md#4-the-online-request-mode). - +Other methods, such as CreateDB, DropDB, DropTable, have not been removed promptly due to historical reasons. Developers don't need to be concerned about them. ### Wrapper Layer -Due to the complexity of the implementation of the SDK Layer, we didn't develop the Java and Python SDKs from scratch, but to use Java and Python to call the **SDK Layer**. Specifically, we made a wrapper layer using Swig. +Due to the complexity of the implementation of the SDK Layer, we didn't develop the Java and Python SDKs from scratch, but to use Java and Python to call the **SDK Layer**. Specifically, we made a wrapper layer using swig. Java Wrapper is implemented as [SqlClusterExecutor](https://github.com/4paradigm/OpenMLDB/blob/main/java/openmldb-jdbc/src/main/java/com/_4paradigm/openmldb/sdk/impl/SqlClusterExecutor.java). 
It is a simple wrapper of `sql_router_sdk`, including the conversion of input types, the encapsulation of returned results, the encapsulation of returned errors. Python Wrapper is implemented as [OpenMLDBSdk](https://github.com/4paradigm/OpenMLDB/blob/main/python/openmldb/sdk/sdk.py). Like the Java Wrapper, it is a simple wrapper as well. - - ### User Layer Although the Wrapper Layer can be used directly, it is not convenient enough. So, we develop another layer, the User Layer of the Java/Python SDK. @@ -36,7 +33,8 @@ The Python User Layer supports the `sqlalchemy`. See [sqlalchemy_openmldb](https We want an easier to use C++ SDK which doesn't need a Wrapper Layer. Therefore, in theory, developers only need to design and implement the user layer, which calls the SDK layer. -However, in consideration of code reuse, the SDK Layer code may be changed to some extent, or the core SDK code structure may be adjusted (for example, exposing part of the SDK Layer header file, etc.). + +However, in consideration of code reuse, the SDK Layer code may be changed to some extent, or the core SDK code structure may be adjusted (for example, exposing part of the SDK Layer header file, etc.). ## Details of SDK Layer @@ -48,7 +46,6 @@ The first two methods are using two options, which create a server connecting Cl ``` These two methods, which do not expose the metadata related DBSDK, are suitable for ordinary users. The underlayers of Java and Python SDK also use these two approaches. 
- Another way is to create based on DBSDK: ``` explicit SQLClusterRouter(DBSDK* sdk); @@ -85,4 +82,18 @@ If you only want to run JAVA testing, try the commands below: ``` mvn test -pl openmldb-jdbc -Dtest="SQLRouterSmokeTest" mvn test -pl openmldb-jdbc -Dtest="SQLRouterSmokeTest#AnyMethod" -``` \ No newline at end of file +``` + +### batchjob test + +batchjob tests can be done using the following method: +``` +$SPARK_HOME/bin/spark-submit --master local --class com._4paradigm.openmldb.batchjob.ImportOfflineData --conf spark.hadoop.hive.metastore.uris=thrift://localhost:9083 --conf spark.openmldb.zk.root.path=/openmldb --conf spark.openmldb.zk.cluster=127.0.0.1:2181 openmldb-batchjob/target/openmldb-batchjob-0.6.5-SNAPSHOT.jar load_data.txt true +``` + +Alternatively, you can copy the compiled openmldb-batchjob JAR file to the `lib` directory of the task manager in the OpenMLDB cluster. Then, you can use the client or Taskmanager Client to send commands for testing. + +When using Hive as a data source, make sure the metastore service is available. For local testing, you can start the metastore service in the Hive directory with the default address being `thrift://localhost:9083`. +``` +bin/hive --service metastore +``` diff --git a/docs/en/developer/udf_develop_guide.md b/docs/en/developer/udf_develop_guide.md deleted file mode 100644 index 4c5aff6d2e1..00000000000 --- a/docs/en/developer/udf_develop_guide.md +++ /dev/null @@ -1,216 +0,0 @@ -# UDF Function Development Guideline -## 1. Background -Although there are already hundreds of built-in functions, they can not satisfy the needs in some cases. In the past, this could only be done by developing new built-in functions. Built-in function development requires a relatively long cycle because it needs to recompile binary files and users have to wait for new version release. 
-In order to help users to quickly develop computing functions that are not provided by OpenMLDB, we develop the mechanism of user dynamic registration function. OpenMLDB will load the compiled library contains user defined function when executing `Create Function` statement. - -SQL functions can be categorised into scalar functions and aggregate functions. An introduction to scalar functions and aggregate functions can be seen [here](./built_in_function_develop_guide.md). -## 2. Development Procedures -### 2.1 Develop UDF functions -#### 2.1.1 Naming Specification of C++ Built-in Function -- The naming of C++ built-in function should follow the [snake_case](https://en.wikipedia.org/wiki/Snake_case) style. -- The name should clearly express the function's purpose. -- The name of a function should not be the same as the name of a built-in function or other custom functions. The list of all built-in functions can be seen [here](../reference/sql/udfs_8h.md). - -#### 2.1.2 -The types of the built-in C++ functions' parameters should be BOOL, NUMBER, TIMESTAMP, DATE, or STRING. -The SQL types corresponding to C++ types are shown as follows: - -| SQL Type | C/C++ Type | -|:----------|:------------| -| BOOL | `bool` | -| SMALLINT | `int16_t` | -| INT | `int32_t` | -| BIGINT | `int64_t` | -| FLOAT | `float` | -| DOUBLE | `double` | -| STRING | `StringRef` | -| TIMESTAMP | `Timestamp` | -| DATE | `Date` | - - -#### 2.1.3 Parameters and Return Values - -**Return Value**: - -* If the output type of the UDF is a basic type and not support null, it will be processed as a return value. -* If the output type of the UDF is a basic type and support null, it will be processed as function parameter. -* If the output type of the UDF is STRING, TIMESTAMP or DATE, it will return through the last parameter of the function. - -**Parameters**: - -* If the parameter is a basic type, it will be passed by value. 
-* If the parameter type is STRING, TIMESTAMP or DATE, it will be passed by pointer. -* The first parameter must be `UDFContext* ctx`. The definition of [UDFContext](../../../include/udf/openmldb_udf.h) is: - -```c++ - struct UDFContext { - ByteMemoryPool* pool; // Used for memory allocation. - void* ptr; // Used for the storage of temporary variables for aggregate functions. - }; -``` - -**Note**: -- If an input value is nullable, an additional `is_null` parameter is added to label whether it is null. -- If the return value is nullable, it should be returned by argument, with another `is_null` parameter added. - -For instance, declare a UDF function whose inputs are nullable and whose return value is nullable. -```c++ -extern "C" -void sum(::openmldb::base::UDFContext* ctx, int64_t input1, bool is_null1, int64_t input2, bool is_null2, int64_t* output, bool* is_null_out); -``` - -**Function Declaration**: - -* The functions must be declared with extern "C". - -#### 2.1.4 Memory Management - -- It is not allowed to use the `new` operator or the `malloc` function to allocate memory for input and output arguments in UDF functions. -- If you use the `new` operator or the `malloc` function to allocate memory for UDFContext::ptr in UDAF init functions, it needs to be freed manually in the output function. -- If you need to request additional memory space dynamically, please use the memory management interface provided by OpenMLDB. OpenMLDB will automatically free the memory space after the function is executed. - -```c++ - char *buffer = ctx->pool->Alloc(size); -``` - -- The maximum size of the space allocated at a time cannot exceed 2M bytes. - - -#### 2.1.5 Implement the UDF Function -- The header file `udf/openmldb_udf.h` should be included. -- Develop the logic of the function. - -```c++ -#include "udf/openmldb_udf.h" // The header file - -// Develop a UDF which slices the first 2 characters of a given string.
-extern "C" -void cut2(::openmldb::base::UDFContext* ctx, ::openmldb::base::StringRef* input, ::openmldb::base::StringRef* output) { - if (input == nullptr || output == nullptr) { - return; - } - uint32_t size = input->size_ <= 2 ? input->size_ : 2; - // To allocate memory in UDF functions, please use ctx->pool. - char *buffer = ctx->pool->Alloc(size); - memcpy(buffer, input->data_, size); - output->size_ = size; - output->data_ = buffer; -} -``` - - -#### 2.1.6 Implement the UDAF Function -- The header file `udf/openmldb_udf.h` should be included. -- Develop the logic of the function. - -Three functions need to be developed, as below: -- init function. Do initialization work in this function, such as allocating memory or initializing variables. The function name should be "xxx_init" -- update function. Update the aggregate value. The function name should be "xxx_update" -- output function. Extract the aggregate value and return it. The function name should be "xxx_output" - -**Note**: The init and update functions should return `UDFContext*`. - -```c++ -#include "udf/openmldb_udf.h" - -extern "C" -::openmldb::base::UDFContext* special_sum_init(::openmldb::base::UDFContext* ctx) { - // allocate memory from the memory pool - ctx->ptr = ctx->pool->Alloc(sizeof(int64_t)); - // init the value - *(reinterpret_cast<int64_t*>(ctx->ptr)) = 10; - // return the pointer of UDFContext - return ctx; -} - -extern "C" -::openmldb::base::UDFContext* special_sum_update(::openmldb::base::UDFContext* ctx, int64_t input) { - // get the value from ptr in UDFContext - int64_t cur = *(reinterpret_cast<int64_t*>(ctx->ptr)); - cur += input; - *(reinterpret_cast<int64_t*>(ctx->ptr)) = cur; - // return the pointer of UDFContext - return ctx; -} - -// get the result from ptr in UDFContext and return -extern "C" -int64_t special_sum_output(::openmldb::base::UDFContext* ctx) { - return *(reinterpret_cast<int64_t*>(ctx->ptr)) + 5; -} - -``` - - -For more UDF implementation, see [here](../../../src/examples/test_udf.cc).
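The init/update/output contract described above can be tried outside OpenMLDB with a small stand-in harness. The following is only a sketch under stated assumptions: `ByteMemoryPool` and `UDFContext` here are simplified mock types rather than the real definitions in `udf/openmldb_udf.h`, `run_special_sum` is a hypothetical driver, and the calling order (init once, update once per row, output once) is an assumption about how the engine drives a UDAF.

```c++
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <memory>
#include <vector>

// Mock stand-ins (assumption): the real types live in udf/openmldb_udf.h
// and are richer than what is shown here.
struct ByteMemoryPool {
    std::vector<std::unique_ptr<char[]>> blocks;
    // Hand out a block that stays alive as long as the pool does.
    char* Alloc(std::size_t size) {
        blocks.push_back(std::make_unique<char[]>(size));
        return blocks.back().get();
    }
};

struct UDFContext {
    ByteMemoryPool* pool;  // used for memory allocation
    void* ptr;             // aggregate state
};

// Same logic as the special_sum example; extern "C" is omitted because
// nothing is loaded from a dynamic library in this sketch.
UDFContext* special_sum_init(UDFContext* ctx) {
    ctx->ptr = ctx->pool->Alloc(sizeof(int64_t));
    *(reinterpret_cast<int64_t*>(ctx->ptr)) = 10;  // initial state
    return ctx;
}

UDFContext* special_sum_update(UDFContext* ctx, int64_t input) {
    *(reinterpret_cast<int64_t*>(ctx->ptr)) += input;  // fold one row in
    return ctx;
}

int64_t special_sum_output(UDFContext* ctx) {
    return *(reinterpret_cast<int64_t*>(ctx->ptr)) + 5;
}

// Hypothetical driver: init once, update per row, output once.
int64_t run_special_sum(const std::vector<int64_t>& rows) {
    ByteMemoryPool pool;
    UDFContext ctx{&pool, nullptr};
    UDFContext* state = special_sum_init(&ctx);
    assert(state != nullptr);
    for (int64_t value : rows) {
        state = special_sum_update(state, value);
    }
    return special_sum_output(state);
}
```

With this harness, `run_special_sum({1, 2, 3})` evaluates to 21: 10 from init, 6 from the three rows, and 5 added by the output function. Because the state is taken from the pool rather than from `new` or `malloc`, the output function has nothing to free, which matches the memory-management rules listed earlier.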
- - -### 2.2 Compile the Dynamic Library - -- Copy the `include` directory (`https://github.com/4paradigm/OpenMLDB/tree/main/include`) to a certain path (like `/work/OpenMLDB/`) for later compiling. -- Run the compiling command. `-I` specifies the path of `include` directory. `-o` specifies the name of the dynamic library. - -```shell -g++ -shared -o libtest_udf.so examples/test_udf.cc -I /work/OpenMLDB/include -std=c++17 -fPIC -``` - -### 2.3 Copy the Dynamic Library -The compiled dynamic libraries should be copied into the `udf` directories for both TaskManager and tablets. Please create a new `udf` directory if it does not exist. -- The `udf` directory of a tablet is `path_to_tablet/udf`. -- The `udf` directory of TaskManager is `path_to_taskmanager/taskmanager/bin/udf`. - -For example, if the deployment paths of a tablet and TaskManager are both `/work/openmldb`, the structure of the directory is shown below: - -``` - /work/openmldb/ - ├── bin - ├── conf - ├── taskmanager - │   ├── bin - │   │   ├── taskmanager.sh - │   │   └── udf - │   │   └── libtest_udf.so - │   ├── conf - │   └── lib - ├── tools - └── udf -    └── libtest_udf.so -``` - -```{note} -- Note that, for multiple tablets, the library needs to be copied to every one. -- Moreover, dynamic libraries should not be deleted before the execution of `DROP FUNCTION`. -``` - - -### 2.4 Register, Drop and Show the Functions -For registering, please use [CREATE FUNCTION](../reference/sql/ddl/CREATE_FUNCTION.md). -```sql -CREATE FUNCTION cut2(x STRING) RETURNS STRING OPTIONS (FILE='libtest_udf.so'); -``` - -Create an udaf function that input value and return value support null. -```sql -CREATE AGGREGATE FUNCTION third(x BIGINT) RETURNS BIGINT OPTIONS (FILE='libtest_udf.so', ARG_NULLABLE=true, RETURN_NULLABLE=true); -``` - -```{note} -- The types of parameters and return values must be consistent with the implementation of the code. -- `FILE` specifies the file name of the dynamic library. 
It is not necessary to include a path. -- A UDF function can only work on one type. Please create multiple functions for multiple types. -``` - -After successful registration, the function can be used. -```sql -SELECT cut2(c1) FROM t1; -``` - -You can view registered functions through `SHOW FUNCTIONS`. -```sql -SHOW FUNCTIONS; -``` - -Please use the `DROP FUNCTION` to delete a registered function. -```sql -DROP FUNCTION cut2; -``` diff --git a/docs/en/integration/deploy_integration/index.rst b/docs/en/integration/deploy_integration/index.rst index 15bff333619..edc057efc88 100644 --- a/docs/en/integration/deploy_integration/index.rst +++ b/docs/en/integration/deploy_integration/index.rst @@ -1,5 +1,5 @@ ============================= -dispatch +Dispatch ============================= .. toctree:: diff --git a/docs/en/integration/index.rst b/docs/en/integration/index.rst index 023bd3c9ab9..074131cf88a 100644 --- a/docs/en/integration/index.rst +++ b/docs/en/integration/index.rst @@ -1,5 +1,5 @@ ============================= -Upstream and downstream ecology +Upstream and Downstream Ecology ============================= .. toctree:: diff --git a/docs/en/integration/online_datasources/index.rst b/docs/en/integration/online_datasources/index.rst index 7b2232ef05b..a84d1d406b3 100644 --- a/docs/en/integration/online_datasources/index.rst +++ b/docs/en/integration/online_datasources/index.rst @@ -1,5 +1,5 @@ ============================= -online data source +Online Data Source ============================= .. toctree::