Skip to content

Commit

Permalink
Merge pull request #110 from ninehills/create-pull-request/patch
Browse files Browse the repository at this point in the history
Changes by create-pull-request action
  • Loading branch information
ninehills authored Dec 20, 2023
2 parents 1858e2c + d68fb69 commit 68ca346
Show file tree
Hide file tree
Showing 21 changed files with 127 additions and 30 deletions.
23 changes: 12 additions & 11 deletions README.md
Original file line number Diff line number Diff line change
@@ -1,21 +1,22 @@
# 九原山
## Posts
- #109 [Gemini Pro Vision 作为 表格 OCR 解决方案的简单测试](articles/109.md) 2023-12-20 `blog`
- #107 [大语言模型(LLM)推理性能优化以及推理框架、后端的评测](articles/107.md) 2023-12-19 `blog`
- #104 [Embedding 模型在 RAG 场景下的评估和微调](articles/104.md) 2023-11-23 `blog`
- #104 [Embedding 模型在 RAG 场景下的评估和微调](articles/104.md) 2023-11-03 `blog`
- #100 [实现基于 Github Issues 的博客](articles/100.md) 2023-06-28 `blog`
- #97 [大语言模型(LLM)学习路径和资料汇总](articles/97.md) 2023-12-14 `blog`
- #97 [大语言模型(LLM)学习路径和资料汇总](articles/97.md) 2023-06-27 `blog`
- #96 [中文模型 C-Eval 评测结果简单小评测](articles/96.md) 2023-06-27 `blog`
- #95 [大语言模型(LLM)后训练数据准备相关笔记](articles/95.md) 2023-11-21 `blog`
- #94 [值得关注的对中文支持较好的开源模型](articles/94.md) 2023-07-13 `blog`
- #92 [大语言模型(LLM)微调技术笔记](articles/92.md) 2023-10-31 `blog`
- #88 [小工具 p2pfile 可以快速的用于内网大文件分发](articles/88.md) 2022-11-14 `blog`
- #95 [大语言模型(LLM)后训练数据准备相关笔记](articles/95.md) 2023-06-26 `blog`
- #94 [值得关注的对中文支持较好的开源模型](articles/94.md) 2023-06-21 `blog`
- #92 [大语言模型(LLM)微调技术笔记](articles/92.md) 2023-05-12 `blog`
- #88 [小工具 p2pfile 可以快速的用于内网大文件分发](articles/88.md) 2022-01-29 `blog`
- #78 [《植物大战僵尸》PC/Mac版存档修改](articles/78.md) 2020-05-03 `blog`
- #77 [Kubernetes 基于 Namespace 的物理队列实现](articles/77.md) 2020-05-07 `blog`
- #77 [Kubernetes 基于 Namespace 的物理队列实现](articles/77.md) 2020-04-10 `blog`
- #76 [SRE 技术简报 20200310](articles/76.md) 2020-03-20 `blog`
- #75 [游戏 《天命奇御》](articles/75.md) 2020-01-14 `blog`
- #74 [SRE 技术简报 20200114](articles/74.md) 2020-01-15 `blog`
- #74 [SRE 技术简报 20200114](articles/74.md) 2020-01-14 `blog`
- #73 [SRE 技术简报 20191222](articles/73.md) 2019-12-22 `blog`
- #72 [SRE 技术简报 20191127](articles/72.md) 2019-11-27 `blog`
- #63 [SREcon18 Americas 我的推荐清单](articles/63.md) 2018-06-06 `blog` `done`
- #62 [[MIT 6.824 分布式系统课程] Lab2 Raft 心得](articles/62.md) 2019-02-15 `blog` `done`
- #3 [解决 Mac Docker.qcow2 文件过大的问题](articles/3.md) 2017-11-13 `blog` `done`
- #63 [SREcon18 Americas 我的推荐清单](articles/63.md) 2018-06-02 `blog` `done`
- #62 [[MIT 6.824 分布式系统课程] Lab2 Raft 心得](articles/62.md) 2018-02-28 `blog` `done`
- #3 [解决 Mac Docker.qcow2 文件过大的问题](articles/3.md) 2017-07-13 `blog` `done`
1 change: 0 additions & 1 deletion articles/100.md
Original file line number Diff line number Diff line change
Expand Up @@ -3,7 +3,6 @@
> Author: **ninehills**
> Labels: **blog**
> Created: **2023-06-28T03:31:43Z**
> Updated: **2023-06-28T03:31:43Z**
> Link and comments: <https://github.com/ninehills/blog/issues/100>

Expand Down
1 change: 0 additions & 1 deletion articles/104.md
Original file line number Diff line number Diff line change
Expand Up @@ -3,7 +3,6 @@
> Author: **ninehills**
> Labels: **blog**
> Created: **2023-11-03T16:32:17Z**
> Updated: **2023-11-23T05:36:02Z**
> Link and comments: <https://github.com/ninehills/blog/issues/104>

Expand Down
1 change: 0 additions & 1 deletion articles/107.md
Original file line number Diff line number Diff line change
Expand Up @@ -3,7 +3,6 @@
> Author: **ninehills**
> Labels: **blog**
> Created: **2023-12-19T03:10:54Z**
> Updated: **2023-12-19T03:10:58Z**
> Link and comments: <https://github.com/ninehills/blog/issues/107>

Expand Down
115 changes: 115 additions & 0 deletions articles/109.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,115 @@
# Gemini Pro Vision 作为 表格 OCR 解决方案的简单测试

> Author: **ninehills**
> Labels: **blog**
> Created: **2023-12-20T06:33:27Z**
> Link and comments: <https://github.com/ninehills/blog/issues/109>

随着 Google 发布了 Gemini Pro Vision,我在思考能否将其用来解决 RAG 中经典的问题,扫描 PDF 中的表格识别和转换为 Markdown 格式。

## 结论先行

1. 使用 Few-shot 可以达到比较完美的识别效果。
2. 英文和数据的识别较好。
3. 中文 OCR 识别能力较弱,无法使用。

## Prompt

我们使用 ai.google.dev,编写 Structured Prompt 如下:

Model: Gemini Pro Vision
Temperature: 0

Prompt 正文:

```
Please perform OCR on the image and convert it to Markdown table.
- Correctly convert tables and recheck the header columns of the tables.
- Use ```markdown``` code block to wrap the output.
```

## Examples

为了提高识别效果,尤其是解决在多行换行识别逻辑的问题,需要增加示例。

Input:

![image](https://github.com/ninehills/blog/assets/270298/7ec9c58b-4a33-4417-bd33-020720a41a75)

Output:

```markdown
表 4-1 几种常用的遥感卫星及其遥感器参数

| 遥感卫星 | 轨道高度/km | 轨道倾角/° | 周期/天 | 波段/μm | 空间分辨率/m | 扫幅带宽/km |
| --- | --- | --- | --- | --- | --- | --- |
| Quick Bird | 450 | 97.2 | 1~3.5 | 0.45~0.9<br/>0.45~0.52<br/>0.52~0.60<br/>0.63~0.69<br/>0.76~0.90 | 0.61~0.72<br/>2.44~2.88<br/>2.44~2.88<br/>2.44~2.88<br/>2.44~2.88 | 16.5 |
| 中巴 CBERS-1 | 778 | 98.7 | 26 | 0.50~1.10<br/>1.55~1.75<br/>2.08~2.35<br/>10.4~12.5 | 77.8<br/>77.8<br/>77.8<br/>77.8 | |
| | | | | CCD 相机<br/>0.45~0.52<br/>0.52~0.59<br/>0.63~0.69<br/>0.77~0.89<br/>0.51~0.73 | 19.5 | |
| | | | | 广角成像仪<br/>0.63~0.69<br/>0.77~0.89 | 256 | |
| Landsat TM | 705 | 98.2 | 16 | 0.45~0.52<br/>0.52~0.60<br/>0.63~0.69<br/>0.76~0.90<br/>1.55~1.75<br/>10.4~12.4<br/>2.05~2.35 | 30<br/>(1~5,7波段) | 185 |
```

## 运行结果

### 运行 1

Input:

![image](https://github.com/ninehills/blog/assets/270298/0e94befa-7f5b-4e7c-a36e-e201d885d74a)

Output:(下划线是github markdown 语法的问题)

表 4-1 续

| 遥感卫星 | 轨道高度/km | 轨道倾角/° | 周期/天 | 波段/μm | 空间分辨率/m | 扫幅带宽/km |
| --- | --- | --- | --- | --- | --- | --- |
| SPOT | 832 | 98.2 | 26 | 0.50~0.59<br/>0.61~0.68<br/>0.79~0.89<br/>0.51~0.73 | 20<br/>20<br/>20<br/>10 | 60 |
| NOAA | 833 | 98.9 | 0.5 | 0.58~0.68<br/>0.72~1.10<br/>3.55~3.93<br/>10.3~11.3<br/>11.5~12.5 | 1 100 | 2 400 |
| IKONOS | 681 | 98.1 | 14 | 0.45~0.9<br/>0.45~0.52<br/>0.52~0.60<br/>0.63~0.69<br/>0.76~0.90 | 0.82<br/>4<br/>4<br/>4<br/>4 | 1 |

### 运行 2
Input:
![image](https://github.com/ninehills/blog/assets/270298/10bc43c1-7c29-414c-859c-7aea4d641115)

Output:

表 5-1 浙江省森林防火数据库文件
字段解释
| 字段序号 | 字段名称 |
| --- | --- |
| 8 | 名称、地址、经度(度,分,秒)、纬度(度,分,秒)、区域面积、县区、单位、负责人姓名和电话 |
| 10 | 名称、地址、值班电话、成员姓名、工作单位、行政级别、指挥权限、办公室电话、住宅电话、传真号码、办公电话 |
| 14 | 名称、地址、负责人、电话、经度(度,分,秒)、纬度(度,分,秒)、海拔高度、森林类型、郁闭度、火灾隐患、主要林木、二号工具、柴油、割灌机、GPS定位仪 |
| 10 | 名称、乡镇(乡镇、村)、村庄、经度(度,分,秒)、纬度(度,分,秒)、海拔高度、森林类型、郁闭度、火灾隐患、主要林木、二号工具、柴油、割灌机、GPS定位仪 |
| 14 | 名称、地址、负责人、电话、经度(度,分,秒)、纬度(度,分,秒)、海拔高度、森林类型、郁闭度、火灾隐患、主要林木、二号工具、柴油、割灌机、GPS定位仪 |
| 8 | 无线电台呼号(经建委)、注册号码、台站名称、台站所在地、经度(度,分,秒)、纬度(度,分,秒)、数量、配置情况、通讯设施状况、处置建议 |
| 30 | 编号、火灾编号、发生时间、扑救开始时间、火场面积、森林类型、火场级别、火灾等级、主要原因、过火面积、受灾林木株数、毁坏林木蓄积、受害林地面积、火场火势、火灾扑救力量、扑救方法、车辆(台)、人员(人)、消防器材、扑救经过、火场天气、火灾损失、年份和健康状况、处理建议 |

### 运行 3
Input:
![image](https://github.com/ninehills/blog/assets/270298/0c97a1eb-6b89-4b6d-a4d2-3a488cf7b75a)

Output:


表1-2 我国各省林业站、点遥感区域面积表
| 站名 | 遥感面积/万hm2 | 航线 | 航线长度/km | 备注 |
| --- | --- | --- | --- | --- |
| 东北林业航空中心 | 190.90 | 3 | 1024 | |
| 吉林省森林消防站 | 227.35 | 1 | 544 | 系统合同线 |
| 贵州省森林消防站 | 491.50 | 4 | 1558 | 不含重要面积 |
| 加格达奇森林消防站 | 512.75 | 3 | 1276 | 同上 |
| 漠河森林消防站 | 620.25 | 3 | 1604 | |
| 广西壮族自治区林业厅 | | | | 广西规划航线 |
| 海南省林业厅 | 430.25 | 7 | 2811 | 不含重要面积 |
| 乌兰察布特种林场 | 365.25 | 4 | 1399 | 同上 |
| 乌鲁木齐电气林场 | 314.75 | 3 | 1030 | |
| 克拉玛依林场 | 754.50 | 8 | 3108 | |
| 长白山森林公园 | | 2 | 543 | |
| 江西省林业厅 | 502.00 | 3 | 1282 | |
| 广西贺州林场 | 648.50 | 5 | 2111 | |
| 东方红林业农场 | | 1 | 238 | |
| 伊春森工林场 | 378.50 | 5 | 1635 | 同上 |
1 change: 0 additions & 1 deletion articles/3.md
Original file line number Diff line number Diff line change
Expand Up @@ -3,7 +3,6 @@
> Author: **ninehills**
> Labels: **blog done**
> Created: **2017-07-13T02:21:52Z**
> Updated: **2017-11-13T09:46:35Z**
> Link and comments: <https://github.com/ninehills/blog/issues/3>

Expand Down
1 change: 0 additions & 1 deletion articles/62.md
Original file line number Diff line number Diff line change
Expand Up @@ -3,7 +3,6 @@
> Author: **ninehills**
> Labels: **blog done**
> Created: **2018-02-28T09:36:26Z**
> Updated: **2019-02-15T06:44:39Z**
> Link and comments: <https://github.com/ninehills/blog/issues/62>

Expand Down
1 change: 0 additions & 1 deletion articles/63.md
Original file line number Diff line number Diff line change
Expand Up @@ -3,7 +3,6 @@
> Author: **ninehills**
> Labels: **blog done**
> Created: **2018-06-02T04:54:54Z**
> Updated: **2018-06-06T13:21:11Z**
> Link and comments: <https://github.com/ninehills/blog/issues/63>

Expand Down
1 change: 0 additions & 1 deletion articles/72.md
Original file line number Diff line number Diff line change
Expand Up @@ -3,7 +3,6 @@
> Author: **ninehills**
> Labels: **blog**
> Created: **2019-11-27T06:39:00Z**
> Updated: **2019-11-27T06:39:00Z**
> Link and comments: <https://github.com/ninehills/blog/issues/72>

Expand Down
1 change: 0 additions & 1 deletion articles/73.md
Original file line number Diff line number Diff line change
Expand Up @@ -3,7 +3,6 @@
> Author: **ninehills**
> Labels: **blog**
> Created: **2019-12-22T05:18:04Z**
> Updated: **2019-12-22T05:19:33Z**
> Link and comments: <https://github.com/ninehills/blog/issues/73>

Expand Down
1 change: 0 additions & 1 deletion articles/74.md
Original file line number Diff line number Diff line change
Expand Up @@ -3,7 +3,6 @@
> Author: **ninehills**
> Labels: **blog**
> Created: **2020-01-14T03:49:09Z**
> Updated: **2020-01-15T04:38:09Z**
> Link and comments: <https://github.com/ninehills/blog/issues/74>

Expand Down
1 change: 0 additions & 1 deletion articles/75.md
Original file line number Diff line number Diff line change
Expand Up @@ -3,7 +3,6 @@
> Author: **ninehills**
> Labels: **blog**
> Created: **2020-01-14T04:01:38Z**
> Updated: **2020-01-14T04:01:38Z**
> Link and comments: <https://github.com/ninehills/blog/issues/75>

Expand Down
1 change: 0 additions & 1 deletion articles/76.md
Original file line number Diff line number Diff line change
Expand Up @@ -3,7 +3,6 @@
> Author: **ninehills**
> Labels: **blog**
> Created: **2020-03-20T02:35:48Z**
> Updated: **2020-03-20T02:35:48Z**
> Link and comments: <https://github.com/ninehills/blog/issues/76>

Expand Down
1 change: 0 additions & 1 deletion articles/77.md
Original file line number Diff line number Diff line change
Expand Up @@ -3,7 +3,6 @@
> Author: **ninehills**
> Labels: **blog**
> Created: **2020-04-10T04:22:00Z**
> Updated: **2020-05-07T08:55:29Z**
> Link and comments: <https://github.com/ninehills/blog/issues/77>

Expand Down
1 change: 0 additions & 1 deletion articles/78.md
Original file line number Diff line number Diff line change
Expand Up @@ -3,7 +3,6 @@
> Author: **ninehills**
> Labels: **blog**
> Created: **2020-05-03T13:33:35Z**
> Updated: **2020-05-03T13:50:24Z**
> Link and comments: <https://github.com/ninehills/blog/issues/78>

Expand Down
1 change: 0 additions & 1 deletion articles/88.md
Original file line number Diff line number Diff line change
Expand Up @@ -3,7 +3,6 @@
> Author: **ninehills**
> Labels: **blog**
> Created: **2022-01-29T14:22:52Z**
> Updated: **2022-11-14T14:26:20Z**
> Link and comments: <https://github.com/ninehills/blog/issues/88>

Expand Down
1 change: 0 additions & 1 deletion articles/92.md
Original file line number Diff line number Diff line change
Expand Up @@ -3,7 +3,6 @@
> Author: **ninehills**
> Labels: **blog**
> Created: **2023-05-12T06:49:26Z**
> Updated: **2023-10-31T08:34:11Z**
> Link and comments: <https://github.com/ninehills/blog/issues/92>

Expand Down
1 change: 0 additions & 1 deletion articles/94.md
Original file line number Diff line number Diff line change
Expand Up @@ -3,7 +3,6 @@
> Author: **ninehills**
> Labels: **blog**
> Created: **2023-06-21T06:15:49Z**
> Updated: **2023-07-13T04:24:34Z**
> Link and comments: <https://github.com/ninehills/blog/issues/94>

Expand Down
1 change: 0 additions & 1 deletion articles/95.md
Original file line number Diff line number Diff line change
Expand Up @@ -3,7 +3,6 @@
> Author: **ninehills**
> Labels: **blog**
> Created: **2023-06-26T14:10:46Z**
> Updated: **2023-11-21T10:04:08Z**
> Link and comments: <https://github.com/ninehills/blog/issues/95>

Expand Down
1 change: 0 additions & 1 deletion articles/96.md
Original file line number Diff line number Diff line change
Expand Up @@ -3,7 +3,6 @@
> Author: **ninehills**
> Labels: **blog**
> Created: **2023-06-27T06:30:11Z**
> Updated: **2023-06-27T14:18:25Z**
> Link and comments: <https://github.com/ninehills/blog/issues/96>

Expand Down
1 change: 0 additions & 1 deletion articles/97.md
Original file line number Diff line number Diff line change
Expand Up @@ -3,7 +3,6 @@
> Author: **ninehills**
> Labels: **blog**
> Created: **2023-06-27T13:42:33Z**
> Updated: **2023-12-14T14:23:05Z**
> Link and comments: <https://github.com/ninehills/blog/issues/97>

Expand Down

0 comments on commit 68ca346

Please sign in to comment.