support MiniCPM-V-2.6 #8967

Merged · 74 commits · Aug 16, 2024
Changes from 1 commit
7a49a6f
init
tc-mb May 23, 2024
c536fa6
rename
tc-mb May 23, 2024
2b91903
add run android for termux in readme
tc-mb May 23, 2024
0480d5f
add android readme
tc-mb May 23, 2024
ec1cea7
add instructions in readme
tc-mb May 23, 2024
a491f45
change name in readme
tc-mb May 23, 2024
7573b63
Update README.md
iceflame89 May 23, 2024
94dcaba
fixed line
harvestingmoon May 23, 2024
b31f51f
Merge pull request #1 from harvestingmoon/minicpm-v2.5
tc-mb May 24, 2024
629420e
add result in readme
tc-mb May 24, 2024
b48708a
random pos_embed
tc-mb May 26, 2024
d9fbc1d
add positions index
tc-mb May 26, 2024
18fe620
change for ollama
tc-mb May 26, 2024
2997a68
change for ollama
tc-mb May 26, 2024
8541e99
better pos_embed in clip
tc-mb May 26, 2024
d8974b8
support ollama
tc-mb May 27, 2024
e73a0c7
updata cmakelist
tc-mb May 28, 2024
6366d62
updata cmakelist
tc-mb May 28, 2024
056d178
rename wrapper
tc-mb May 28, 2024
3c306f1
clear code
tc-mb May 28, 2024
9495504
replace and organize code
tc-mb May 28, 2024
b37ab0b
add link
tc-mb May 28, 2024
8767ce2
Merge branch 'prepare-PR-of-minicpm-v2.5' into prepare-PR
tc-mb May 28, 2024
8bd47ce
Merge pull request #7 from OpenBMB/prepare-PR
tc-mb May 28, 2024
28d4a7f
Merge pull request #8 from OpenBMB/master
tc-mb May 28, 2024
02eb445
sync master
tc-mb May 28, 2024
07f48f9
fix warnings
tc-mb May 28, 2024
c38d152
fix warnings
tc-mb May 28, 2024
88f5e6a
fix bug in bicubic resize when need resize iamge smaller
tc-mb May 30, 2024
a913ca4
receive review comments and modify
tc-mb May 31, 2024
a95a6d9
receive review comments and modify
tc-mb Jun 2, 2024
c390dd4
Merge branch 'ggerganov:master' into prepare-PR-of-minicpm-v2.5
tc-mb Jun 4, 2024
efe4c61
put all code into llava dir
tc-mb Jun 4, 2024
ee5b850
Merge pull request #11 from OpenBMB/pr_add_all_in_llava
tc-mb Jun 4, 2024
77beb4d
Merge branch 'prepare-PR-of-minicpm-v2.5' into master
tc-mb Jun 24, 2024
cb8cfb9
Merge pull request #15 from OpenBMB/master
tc-mb Jun 24, 2024
8f03505
fix quality problem in pr code
tc-mb Jun 25, 2024
e68c8bc
change n_layer
tc-mb Jun 25, 2024
4c67d7c
add space in "-1"
tc-mb Jun 25, 2024
977941d
imitate reshape bug of python code
tc-mb Jul 4, 2024
3e6348b
fix bug in clip
tc-mb Jul 7, 2024
c5b6851
fix issues for merging
tc-mb Jul 17, 2024
5959b14
fix llama-minicpmv-cli in cmake file
tc-mb Jul 19, 2024
292a469
change pr readme
tc-mb Jul 20, 2024
be8b5b2
fix code review
tc-mb Jul 22, 2024
4c75583
remove in line 33 directory in the /cmakelists.txt (not in example, i…
tc-mb Jul 22, 2024
62fa15b
fix cmakefile
tc-mb Jul 23, 2024
dad4abe
add warn
tc-mb Jul 23, 2024
3642be9
fix KEY_HAS_MINICPMV_PROJ
tc-mb Jul 23, 2024
fcde997
remove load_image_size into clip_ctx
tc-mb Jul 23, 2024
6fd0937
remove the extern "C", MINICPMV_API
tc-mb Jul 23, 2024
107e1ed
fix uhd code for review comment
tc-mb Jul 25, 2024
72b9629
delete minicpmv-wrapper in pr
tc-mb Jul 25, 2024
f3d400d
remove uhd_image_embed
tc-mb Jul 26, 2024
65f7455
Modify 2 notes
tc-mb Jul 26, 2024
6da5130
support minicpmv2.6
tc-mb Aug 2, 2024
77c580d
modify convert script of minicpmv
tc-mb Aug 2, 2024
ea0c828
modify convert
tc-mb Aug 10, 2024
fc1c860
Merge branch 'prepare-PR-of-minicpm-v2.6' into master
tc-mb Aug 10, 2024
ce0d1a6
Merge pull request #24 from OpenBMB/master
tc-mb Aug 10, 2024
6cad864
modify convert
tc-mb Aug 10, 2024
fe39ecc
add readme
tc-mb Aug 10, 2024
bffbe1c
add resampler of v2.6
tc-mb Aug 10, 2024
28d6a0f
modify clip
tc-mb Aug 10, 2024
4a87d1d
modify readme
tc-mb Aug 10, 2024
32b47f6
fix type-check
tc-mb Aug 10, 2024
662d4c1
fix type-check
tc-mb Aug 12, 2024
a945b3c
fix type-check
tc-mb Aug 12, 2024
89d378c
fix type-check
tc-mb Aug 12, 2024
1ec79f0
modify convert script and readme
tc-mb Aug 12, 2024
1123376
fix convert script and readme
tc-mb Aug 12, 2024
f30c5e1
fix convert
tc-mb Aug 12, 2024
47eb0a5
fix num in convert
tc-mb Aug 12, 2024
1ca3f06
fix type-check
tc-mb Aug 13, 2024
add link
tc-mb committed May 28, 2024
commit b37ab0b1e5142a92c080072363c14a6a05a694ea
20 changes: 6 additions & 14 deletions examples/minicpmv/clip.cpp
@@ -1,7 +1,3 @@
-// NOTE: This is modified from clip.cpp only for LLaVA,
-// so there might be still unnecessary artifacts hanging around
-// I'll gradually clean and extend it
-// Note: Even when using identical normalized image inputs (see normalize_image_u8_to_f32()) we have a significant difference in resulting embeddings compared to pytorch
#include "clip.h"
#include "common.h"
#include "log.h"
@@ -1664,6 +1660,9 @@ bool clip_image_batch_encode(clip_ctx * ctx, const int n_threads, const clip_ima
}

{
+// inspired from siglip:
+// -> https://huggingface.co/HuggingFaceM4/siglip-so400m-14-980-flash-attn2-navit
+// -> https://huggingface.co/HuggingFaceM4/siglip-so400m-14-980-flash-attn2-navit/blob/d66538faeba44480d0bfaa42145eef26f9423199/modeling_siglip.py#L316
struct ggml_tensor * positions = ggml_graph_get_tensor(gf, "positions");

int* positions_data = (int*)malloc(ggml_nbytes(positions));
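// (The fill loop for positions_data is collapsed at this hunk boundary.
// Editor's sketch of the presumed siglip-style fill, i.e. sequential patch
// position ids; this is an assumption, not code from the commit.)
// for (int i = 0; i < num_positions; i++) {
//     positions_data[i] = i;
// }
// ggml_backend_tensor_set(positions, positions_data, 0, ggml_nbytes(positions));
// free(positions_data);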
@@ -1675,6 +1674,9 @@ bool clip_image_batch_encode(clip_ctx * ctx, const int n_threads, const clip_ima
}

{
+// inspired from resampler of Qwen-VL:
+// -> https://huggingface.co/Qwen/Qwen-VL/tree/main
+// -> https://huggingface.co/Qwen/Qwen-VL/blob/0547ed36a86561e2e42fecec8fd0c4f6953e33c4/visual.py#L23
struct ggml_tensor * pos_embed = ggml_graph_get_tensor(gf, "pos_embed");
int pos_w = image_size_width/patch_size;
int pos_h = image_size_height/patch_size;
@@ -1692,16 +1694,6 @@ bool clip_image_batch_encode(clip_ctx * ctx, const int n_threads, const clip_ima
free(pos_embed_data);
}

-// {
-// struct ggml_tensor * patches = ggml_graph_get_tensor(gf, "patches");
-// int* patches_data = (int*)malloc(ggml_nbytes(patches));
-// for (int i = 0; i < num_patches; i++) {
-// patches_data[i] = i + 1;
-// }
-// ggml_backend_tensor_set(patches, patches_data, 0, ggml_nbytes(patches));
-// free(patches_data);
-// }

if (ggml_backend_is_cpu(ctx->backend)) {
ggml_backend_cpu_set_n_threads(ctx->backend, n_threads);
}
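The pos_embed hunk above points at the Qwen-VL resampler. As a reference for what gets packed into that tensor, here is a self-contained C++ sketch of a 2D sin/cos position-embedding table over a pos_w x pos_h patch grid; with image_size 448 and patch_size 14, pos_w = pos_h = 32, matching the pos_w/pos_h computation in the hunk. The function name and memory layout follow the common get_2d_sincos_pos_embed reference code and are the editor's assumption, not part of this PR.

// editor's sketch: 2D sin/cos position embeddings, modeled on the
// get_2d_sincos_pos_embed helper used by Qwen-VL's resampler; names and
// layout are assumptions (embed_dim must be divisible by 4 here)
#include <cmath>
#include <vector>

// Returns a (pos_h * pos_w) x embed_dim table, row-major by grid position.
// Half of embed_dim encodes y and half encodes x; within each half the first
// quarter holds sines and the second quarter cosines.
static std::vector<float> sincos_pos_embed_2d(int embed_dim, int pos_w, int pos_h) {
    const int half  = embed_dim / 2;  // dims per axis
    const int freqs = half / 2;       // frequency count per axis
    std::vector<float> table((size_t)pos_h * pos_w * embed_dim);
    for (int y = 0; y < pos_h; ++y) {
        for (int x = 0; x < pos_w; ++x) {
            float * row = table.data() + ((size_t)y * pos_w + x) * embed_dim;
            for (int i = 0; i < freqs; ++i) {
                const float omega = std::pow(10000.0f, -(float)i / freqs);
                row[i]                = std::sin(y * omega); // y, sin
                row[freqs + i]        = std::cos(y * omega); // y, cos
                row[half + i]         = std::sin(x * omega); // x, sin
                row[half + freqs + i] = std::cos(x * omega); // x, cos
            }
        }
    }
    return table;
}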
44 changes: 24 additions & 20 deletions examples/minicpmv/minicpmv.cpp
@@ -31,9 +31,9 @@ struct clip_image_grid_shape {
int second;
};

-static bool encode_image_with_clip(clip_ctx * ctx_clip, int n_threads, const clip_image_u8 * img, float * image_embd, int * n_img_pos) {
-// std::vector<clip_image_f32*> img_res_v; // format VectN x H x W x RGB (N x 336 x 336 x 3), so interleaved RGB - different to the python implementation which is N x 3 x 336 x 336
-
+static bool encode_image_with_clip_uhd(clip_ctx * ctx_clip, int n_threads, const clip_image_u8 * img, float * image_embd, int * n_img_pos) {
+// std::vector<clip_image_f32*> img_res_v;
+// format VectN x H x W x RGB (N x 448 x 448 x 3)
clip_image_f32 * img_res_v = clip_image_f32_init();
std::pair<int, int> load_image_size;
load_image_size.first = img->nx;
@@ -46,7 +46,7 @@ static bool encode_image_with_clip(clip_ctx * ctx_clip, int n_threads, const cli
LOG_TEE("\n%s: mm_patch_merge_type is %s.\n", __func__, mm_patch_merge_type);

*n_img_pos = clip_n_patches(ctx_clip);
-bool encoded = clip_image_encode(ctx_clip, n_threads, img_res_v, image_embd, load_image_size); // image_embd shape is 576 x 4096
+bool encoded = clip_image_encode(ctx_clip, n_threads, img_res_v, image_embd, load_image_size); // image_embd shape is 96 x 4096
if (!encoded) {
LOG_TEE("Unable to encode image\n");
return false;
@@ -61,7 +61,7 @@ static bool encode_image_with_clip(clip_ctx * ctx_clip, int n_threads, const cli
}

bool llava_validate_embed_size(const llama_context * ctx_llama, const clip_ctx * ctx_clip) {
// make sure that the correct mmproj was used, i.e., compare apples to apples
// make sure that the correct mmproj was used, i.e., compare apples to apples
int n_llama_embd = llama_n_embd(llama_get_model(ctx_llama));
auto n_image_embd = clip_n_mmproj_embd(ctx_clip);
if (n_image_embd != n_llama_embd) {
@@ -72,14 +72,14 @@ bool llava_validate_embed_size(const llama_context * ctx_llama, const clip_ctx *
}

bool llava_image_embed_make_with_clip_img(clip_ctx * ctx_clip, int n_threads, const clip_image_u8 * img, float ** image_embd_out, int * n_img_pos_out) {
-float * image_embd = (float *)malloc(clip_embd_nbytes(ctx_clip)*6); // TODO: base on gridsize/llava model
+float * image_embd = (float *)malloc(clip_embd_nbytes(ctx_clip)*6);
if (!image_embd) {
LOG_TEE("Unable to allocate memory for image embeddings\n");
return false;
}

int n_img_pos;
-if (!encode_image_with_clip(ctx_clip, n_threads, img, image_embd, &n_img_pos)) {
+if (!encode_image_with_clip_uhd(ctx_clip, n_threads, img, image_embd, &n_img_pos)) {
LOG_TEE("%s: cannot encode image, aborting\n", __func__);
free(image_embd);
return false;
@@ -112,7 +112,7 @@ int ensure_divide(int length, int patch_size) {
return std::max(static_cast<int>(std::round(static_cast<float>(length) / patch_size) * patch_size), patch_size);
}

-std::pair<int, int> find_best_resize(std::pair<int, int> original_size, int scale_resolution, int patch_size, bool allow_upscale = false) {
+std::pair<int, int> uhd_find_best_resize(std::pair<int, int> original_size, int scale_resolution, int patch_size, bool allow_upscale = false) {
int width = original_size.first;
int height = original_size.second;
if ((width * height > scale_resolution * scale_resolution) || allow_upscale) {
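The remainder of uhd_find_best_resize is collapsed in this view. In outline it scales the image so its area is roughly scale_resolution squared while preserving aspect ratio, then snaps both sides to multiples of patch_size via ensure_divide. A sketch of that logic follows, assumed from the LLaVA-UHD slice code cited further down rather than copied from this diff.

// editor's sketch (assumed behavior; the body is collapsed above): resize so
// the area is about scale_resolution^2, keeping aspect ratio, then round each
// side to a multiple of patch_size
#include <algorithm>
#include <cmath>
#include <utility>

static int ensure_divide_sketch(int length, int patch_size) {
    return std::max((int)std::round((float)length / patch_size) * patch_size, patch_size);
}

static std::pair<int, int> find_best_resize_sketch(std::pair<int, int> original_size,
                                                   int scale_resolution, int patch_size,
                                                   bool allow_upscale = false) {
    int width  = original_size.first;
    int height = original_size.second;
    if (width * height > scale_resolution * scale_resolution || allow_upscale) {
        float r = (float)width / height;                  // aspect ratio
        height  = (int)(scale_resolution / std::sqrt(r)); // area ~= scale_resolution^2
        width   = (int)(height * r);
    }
    return { ensure_divide_sketch(width,  patch_size),
             ensure_divide_sketch(height, patch_size) };
}
// e.g. a 1920x1080 input with scale_resolution=448, patch_size=14
// maps to roughly 602x336 (both multiples of 14)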
@@ -129,7 +129,7 @@ inline float clip(float x, float lower, float upper) {
return std::max(lower, std::min(x, upper));
}

-std::pair<int, int> get_refine_size(std::pair<int, int> original_size, std::pair<int, int> grid, int scale_resolution, int patch_size, bool allow_upscale = false) {
+std::pair<int, int> uhd_get_refine_size(std::pair<int, int> original_size, std::pair<int, int> grid, int scale_resolution, int patch_size, bool allow_upscale = false) {
int width, height;
std::tie(width, height) = original_size;
int grid_x, grid_y;
@@ -142,7 +142,7 @@ std::pair<int, int> get_refine_size(std::pair<int, int> original_size, std::pair
int grid_height = refine_height / grid_y;

// auto best_grid_size = find_best_resize(std::make_tuple(grid_width, grid_height), scale_resolution, patch_size, allow_upscale); (old line)
-auto best_grid_size = find_best_resize(std::make_pair(grid_width, grid_height), scale_resolution, patch_size, allow_upscale); // (new line) => fixes conversion for make_tuple to make_pair
+auto best_grid_size = uhd_find_best_resize(std::make_pair(grid_width, grid_height), scale_resolution, patch_size, allow_upscale); // (new line) => fixes conversion for make_tuple to make_pair
int best_grid_width, best_grid_height;
std::tie(best_grid_width, best_grid_height) = best_grid_size;

@@ -214,7 +214,7 @@ static bool bicubic_resize(const clip_image_u8 &img, clip_image_u8 &dst, int tar
return true;
}

-std::vector<std::vector<clip_image_u8 *>> slice_image(const clip_image_u8 * img, const int max_slice_nums=9, const int scale_resolution=448, const int patch_size=14, const bool never_split=false) {
+// inspired from LLaVA-UHD:
+// -> https://arxiv.org/pdf/2403.11703
+// -> https://github.com/thunlp/LLaVA-UHD
+// -> https://github.com/thunlp/LLaVA-UHD/blob/302301bc2175f7e717fb8548516188e89f649753/llava_uhd/train/llava-uhd/slice_logic.py#L118
+std::vector<std::vector<clip_image_u8 *>> uhd_slice_image(const clip_image_u8 * img, const int max_slice_nums=9, const int scale_resolution=448, const int patch_size=14, const bool never_split=false) {
const std::pair<int, int> original_size={img->nx,img->ny};
const int original_width = img->nx;
const int original_height = img->ny;
@@ -227,7 +231,7 @@ std::vector<std::vector<clip_image_u8 *>> slice_image(const clip_image_u8 * img,
images.push_back(std::vector<clip_image_u8 *>());

if(multiple <= 1){
-auto best_size = find_best_resize(original_size, scale_resolution, patch_size, true);
+auto best_size = uhd_find_best_resize(original_size, scale_resolution, patch_size, true);
clip_image_u8 *source_image = clip_image_u8_init();
bicubic_resize(*img, *source_image, best_size.first, best_size.second);
// source_image = image.resize(best_size, Image.Resampling.BICUBIC)
@@ -243,7 +247,7 @@ std::vector<std::vector<clip_image_u8 *>> slice_image(const clip_image_u8 * img,
candidate_split_grids_nums.push_back(i);
}

-auto best_size = find_best_resize(original_size, scale_resolution, patch_size);
+auto best_size = uhd_find_best_resize(original_size, scale_resolution, patch_size);
clip_image_u8 *source_image = clip_image_u8_init();
bicubic_resize(*img, *source_image, best_size.first, best_size.second);
// source_image = image.copy().resize(best_resize, Image.Resampling.BICUBIC)
@@ -273,7 +277,7 @@ std::vector<std::vector<clip_image_u8 *>> slice_image(const clip_image_u8 * img,
}
LOG_TEE("%s: image_size: %d %d; best_grid: %d %d\n", __func__, img->nx, img->ny, best_grid.first, best_grid.second);

-auto refine_size = get_refine_size(original_size, best_grid, scale_resolution, patch_size, true);
+auto refine_size = uhd_get_refine_size(original_size, best_grid, scale_resolution, patch_size, true);
clip_image_u8 *refine_image = clip_image_u8_init();
bicubic_resize(*img, *refine_image, refine_size.first, refine_size.second);
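Several hunks of uhd_slice_image are collapsed above. Following the LLaVA-UHD slice logic the new comments cite, the slice count comes from the image area divided by scale_resolution squared, capped at max_slice_nums, and the grid is the factorization whose shape is closest to the image's aspect ratio in log space. An editor's sketch of that selection, assumed rather than copied from the PR:

// editor's sketch (assumed from the cited LLaVA-UHD slice logic): pick the
// slice count from the area ratio, then the grid whose shape best matches
// the image's log aspect ratio
#include <algorithm>
#include <cmath>
#include <limits>
#include <utility>

static std::pair<int, int> choose_grid_sketch(int w, int h, int scale_resolution, int max_slice_nums) {
    const float area_ratio = (float)(w * h) / (scale_resolution * scale_resolution);
    const int multiple = std::min((int)std::ceil(area_ratio), max_slice_nums);
    if (multiple <= 1) return {1, 1}; // no slicing, single resized image
    const float log_ratio = std::log((float)w / h);
    std::pair<int, int> best_grid{1, 1};
    float min_error = std::numeric_limits<float>::infinity();
    // consider grids for multiple-1, multiple, multiple+1 slices
    for (int n = std::max(1, multiple - 1); n <= multiple + 1; ++n) {
        for (int gx = 1; gx <= n; ++gx) {
            if (n % gx != 0) continue;
            const int gy = n / gx;
            const float err = std::fabs(log_ratio - std::log((float)gx / gy));
            if (err < min_error) { min_error = err; best_grid = {gx, gy}; }
        }
    }
    return best_grid;
}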

@@ -307,8 +311,8 @@ std::vector<std::vector<clip_image_u8 *>> slice_image(const clip_image_u8 * img,
return images;
}

-std::vector<std::vector<struct llava_image_embed *>> llava_image_embed_make_with_bytes_slice(struct clip_ctx * ctx_clip, int n_threads, const clip_image_u8 * img) {
-std::vector<std::vector<clip_image_u8 *>> imgs = slice_image(img);
+std::vector<std::vector<struct llava_image_embed *>> llava_image_embed_make_with_bytes_uhd(struct clip_ctx * ctx_clip, int n_threads, const clip_image_u8 * img) {
+std::vector<std::vector<clip_image_u8 *>> imgs = uhd_slice_image(img);
for (size_t i = 0; i < imgs.size(); ++i){
for (size_t j = 0; j < imgs[i].size(); ++j) {
LOG_TEE("%s: %d %d\n", __func__,imgs[i][j]->nx,imgs[i][j]->ny);
@@ -370,7 +374,7 @@ static bool load_file_to_bytes(const char* path, unsigned char** bytesOut, long
}

bool llava_image_embed_make_with_clip_img_ollama(clip_ctx * ctx_clip, int n_threads, const clip_image_u8 * img, float ** image_embd_out, int * n_img_pos_out) {
-auto image_embed_slices = llava_image_embed_make_with_bytes_slice(ctx_clip, n_threads, img);
+auto image_embed_slices = llava_image_embed_make_with_bytes_uhd(ctx_clip, n_threads, img);
if (!image_embed_slices[0][0]){
LOG_TEE("%s: failed to embeding image\n", __func__);
return false;
@@ -412,7 +416,7 @@ bool llava_image_embed_make_with_clip_img_ollama(clip_ctx * ctx_clip, int n_thre
return true;
}

-std::vector<std::vector<struct llava_image_embed *>> llava_image_embed_make_with_filename_slice(struct clip_ctx * ctx_clip, int n_threads, const char * image_path) {
+std::vector<std::vector<struct llava_image_embed *>> llava_image_embed_make_with_filename_uhd(struct clip_ctx * ctx_clip, int n_threads, const char * image_path) {
unsigned char* image_bytes;
long image_bytes_length;
auto loaded = load_file_to_bytes(image_path, &image_bytes, &image_bytes_length);
@@ -427,14 +431,14 @@ std::vector<std::vector<struct llava_image_embed *>> llava_image_embed_make_with
return std::vector<std::vector<struct llava_image_embed *>>();
}

-std::vector<std::vector<struct llava_image_embed *>> embeds = llava_image_embed_make_with_bytes_slice(ctx_clip, n_threads, img);
+std::vector<std::vector<struct llava_image_embed *>> embeds = llava_image_embed_make_with_bytes_uhd(ctx_clip, n_threads, img);

clip_image_u8_free(img);
free(image_bytes);
return embeds;
}

-void llava_image_embed_free_slice(std::vector<std::vector<struct llava_image_embed *>> embed) {
+void llava_image_embed_free_uhd(std::vector<std::vector<struct llava_image_embed *>> embed) {
for (size_t i = 0; i < embed.size(); ++i){
for (size_t j = 0; j < embed[i].size(); ++j){
free(embed[i][j]->embed);
6 changes: 3 additions & 3 deletions examples/minicpmv/minicpmv.h
@@ -34,11 +34,11 @@ MINICPMV_API bool llava_validate_embed_size(const struct llama_context * ctx_lla
MINICPMV_API bool llava_image_embed_make_with_clip_img(struct clip_ctx * ctx_clip, int n_threads, const struct clip_image_u8 * img, float ** image_embd_out, int * n_img_pos_out);

/** build an image embed from image file bytes */
-MINICPMV_API std::vector<std::vector<struct llava_image_embed *>> llava_image_embed_make_with_bytes_slice(struct clip_ctx * ctx_clip, int n_threads, const unsigned char * image_bytes, int image_bytes_length);
+MINICPMV_API std::vector<std::vector<struct llava_image_embed *>> llava_image_embed_make_with_bytes_uhd(struct clip_ctx * ctx_clip, int n_threads, const unsigned char * image_bytes, int image_bytes_length);
/** build an image embed from a path to an image filename */
MINICPMV_API bool llava_image_embed_make_with_clip_img_ollama(struct clip_ctx * ctx_clip, int n_threads, const struct clip_image_u8 * img, float ** image_embd_out, int * n_img_pos_out);
-MINICPMV_API std::vector<std::vector<struct llava_image_embed *>> llava_image_embed_make_with_filename_slice(struct clip_ctx * ctx_clip, int n_threads, const char * image_path);
-MINICPMV_API void llava_image_embed_free_slice(std::vector<std::vector<struct llava_image_embed *>> embed);
+MINICPMV_API std::vector<std::vector<struct llava_image_embed *>> llava_image_embed_make_with_filename_uhd(struct clip_ctx * ctx_clip, int n_threads, const char * image_path);
+MINICPMV_API void llava_image_embed_free_uhd(std::vector<std::vector<struct llava_image_embed *>> embed);
/** free an embedding made with llava_image_embed_make_* */

/** write the image represented by embed into the llama context with batch size n_batch, starting at context pos n_past. on completion, n_past points to the next position in the context after the image embed. */
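Read together, the renamed declarations above suggest the following call pattern for the UHD path. This is an editor's usage sketch against minicpmv.h as shown; the clip_ctx setup and the llama-side evaluation are elided, and the helper name is illustrative.

// editor's usage sketch for the *_uhd API declared above; assumes a
// clip_ctx obtained elsewhere (e.g. via clip_model_load) and omits error checks
#include "minicpmv.h"
#include <cstddef>

static void embed_one_image(struct clip_ctx * ctx_clip, int n_threads, const char * path) {
    // slice the image LLaVA-UHD style and embed every slice
    auto embeds = llava_image_embed_make_with_filename_uhd(ctx_clip, n_threads, path);
    for (std::size_t i = 0; i < embeds.size(); ++i) {
        for (std::size_t j = 0; j < embeds[i].size(); ++j) {
            // embeds[i][j]->embed holds one slice's embedding
            // (feed it into the llama context here)
        }
    }
    llava_image_embed_free_uhd(embeds); // frees each slice's buffer and wrapper
}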