Commit
1 parent 7511fa3 · commit 3b27049 · Showing 22 changed files with 291 additions and 55 deletions.
168 changes: 140 additions & 28 deletions in ...ing/LLM/2024-08-14-emebedding_finetune.md → ...Learning/LLM/2024-08-14-rag_emebedding.md
Large diffs are not rendered by default.
@@ -0,0 +1,29 @@

---

layout: post
title: rerank fine-tuning
category: Architecture
tags: MachineLearning
keywords: llm embedding

---

<script type="text/javascript" src="http://cdn.mathjax.org/mathjax/latest/MathJax.js?config=default"></script>

* TOC
{:toc}

## Introduction (incomplete)

## Why rerank

The problem is that the relevance of the chunks recalled by Elasticsearch retrieval is inconsistent: in most cases the top-scoring chunk is fine, but the relevance of chunks 2 through 5 is fairly random, and that is what hurts the final answer the most. Looking into Elasticsearch's similarity search (we first assumed it was brute-force search), ES 8 actually uses HNSW (Hierarchical Navigable Small World graphs) for vector retrieval. HNSW can find approximate nearest neighbors among millions of data points within a few milliseconds, but it achieves that speed by accepting some randomness. In practice, the biggest consequence is that the returned top-k results are not exactly the ones we want; at the very least, those k chunks are not ranked strictly from most to least relevant as we would expect.
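
For concreteness, here is a minimal sketch of what such an approximate kNN query can look like against ES 8 from Python. The index name `chunks`, the vector field `embedding`, and the `embed()` helper are illustrative assumptions, not details from the original setup:

```python
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")  # assumed local cluster

def embed(query: str) -> list[float]:
    # Placeholder: in practice, the same embedding model that indexed the chunks.
    return [0.0] * 768

# ES 8 approximate kNN search over a dense_vector field backed by HNSW.
# HNSW inspects `num_candidates` entries per shard and returns the best `k`,
# so raising num_candidates trades latency for recall quality.
resp = es.search(
    index="chunks",                          # assumed index name
    knn={
        "field": "embedding",                # assumed dense_vector field
        "query_vector": embed("why do we need rerank?"),
        "k": 10,
        "num_candidates": 100,
    },
    source=["text"],
)
top_chunks = [hit["_source"]["text"] for hit in resp["hits"]["hits"]]
```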

This randomness in retrieval is probably why the first round of recall in RAG is often unsatisfying. There is no real way around it: with an index of millions or tens of millions of entries, you trade some precision for speed. What we can do is enlarge top_k, say from 10 to 30, and then rerank those candidates with a more precise model that scores each query-chunk pair individually and produces the final ordering.
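
A minimal sketch of this retrieve-then-rerank flow, assuming a cross-encoder reranker such as `BAAI/bge-reranker-base` via `sentence_transformers` (an illustrative choice, not necessarily the model used here):

```python
from sentence_transformers import CrossEncoder

# Assumed model choice; any cross-encoder style reranker works the same way.
reranker = CrossEncoder("BAAI/bge-reranker-base")

def rerank(query: str, candidates: list[str], final_k: int = 5) -> list[str]:
    # Score every (query, chunk) pair individually, then keep the best chunks
    # in a now-trustworthy order.
    scores = reranker.predict([(query, chunk) for chunk in candidates])
    ranked = sorted(zip(candidates, scores), key=lambda pair: pair[1], reverse=True)
    return [chunk for chunk, _ in ranked[:final_k]]

# Stage 1 (recall): fetch a generous top_k, e.g. 30, with the approximate kNN
# query sketched above. Stage 2 (rerank): narrow it down to the final few chunks.
candidates = ["chunk about rerank", "chunk about embeddings", "an unrelated chunk"]
print(rerank("why do we need rerank?", candidates, final_k=2))
```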

## Fine-tuning

The fine-tuning dataset takes the form [query, set of positive passages, set of negative passages]. The embedding model and the reranker model are fine-tuned on the same type of dataset: the semantic relevance task is treated as binary classification, with BCE (binary cross-entropy) as the loss function.
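
A rough sketch of what that looks like in practice, with an illustrative `{query, pos, neg}` example and a stand-in scoring function; only the pair flattening and the BCE loss are the point here:

```python
import torch
import torch.nn as nn

# One training example in the common {query, pos, neg} layout; field names and
# texts here are illustrative, not taken from the post.
example = {
    "query": "why is rerank needed after HNSW recall?",
    "pos": ["HNSW trades exactness for speed, so the top-k order is noisy ..."],
    "neg": ["BERT uses WordPiece tokenization ...",
            "Elasticsearch clusters are made of shards ..."],
}

# Flatten into (query, passage, label) pairs: 1.0 for positives, 0.0 for negatives.
pairs = [(example["query"], p, 1.0) for p in example["pos"]] + \
        [(example["query"], n, 0.0) for n in example["neg"]]

def score_pairs(batch):
    # Stand-in for the reranker (or embedding-similarity) forward pass that
    # produces one relevance logit per (query, passage) pair.
    return torch.randn(len(batch), requires_grad=True)

logits = score_pairs(pairs)
labels = torch.tensor([label for _, _, label in pairs])
loss = nn.BCEWithLogitsLoss()(logits, labels)  # relevance as binary classification
loss.backward()
print(loss.item())
```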

https://zhuanlan.zhihu.com/p/704562748 (not yet read in detail)
File renamed without changes.
@@ -0,0 +1,63 @@

---

layout: post
title: bert
category: Architecture
tags: MachineLearning
keywords: gcn

---

## Introduction (incomplete)

* TOC
{:toc}

BERT is a deep, bidirectional, pre-trained language-understanding model that uses Transformer blocks as its feature extractor. Pre-training on massive corpora gives it rich local and global feature representations of an input sequence.

[Paper](https://arxiv.org/abs/1810.04805v1)

The name BERT stands for Bidirectional Encoder Representations from Transformers. Unlike recent language representation models, BERT is designed to pre-train deep bidirectional representations (bidirectional in contrast to GPT's left-to-right conditioning) by jointly conditioning on both left and right context in all layers. As a result, the pre-trained BERT representations can be fine-tuned with just one additional output layer to create state-of-the-art models for a wide range of tasks, such as question answering and language inference, without substantial task-specific architecture modifications. PS: ELMo is RNN-based and still needs changes to the model architecture when applied to downstream tasks; with a trained BERT, adding one extra output layer is enough to adapt it to a wide range of tasks.

## Model architecture

BERT's basic building block is the Transformer encoder, and it encodes input the same way the Transformer does: a fixed-length token sequence is fed in at the bottom and flows upward, with every layer applying self-attention and then passing the result through a feed-forward network before handing it to the next encoder layer.

![](/public/upload/machine/bert_model.jpg)

Model input

![](/public/upload/machine/bert_input.jpg)

The first token of the input is [CLS], whose meaning is simple: Classification.

Model output

![](/public/upload/machine/bert_output.jpg)

Every position outputs a vector of hidden size (768 in BERT-base). For text classification we focus on the output at the first position (the [CLS] classification token), which BERT is trained to make representative of the whole sequence. That vector can then be fed into a classifier of our choice; the paper points out that a single-layer neural network already works well as the classifier. The example distinguishes only spam from non-spam; with more labels you simply increase the number of output neurons and use softmax as the final activation.

![](/public/upload/machine/bert_classify.jpg)
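
A minimal sketch of that wiring with Hugging Face `transformers`, assuming `bert-base-uncased` as the checkpoint; the linear head is untrained, so it only shows where the [CLS] vector comes from and how a softmax classifier sits on top:

```python
import torch
import torch.nn as nn
from transformers import BertModel, BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
bert = BertModel.from_pretrained("bert-base-uncased")
classifier = nn.Linear(bert.config.hidden_size, 2)  # e.g. spam vs. not-spam

enc = tokenizer("win a free prize, click now!!!", return_tensors="pt")
with torch.no_grad():
    hidden = bert(**enc).last_hidden_state   # shape (1, seq_len, 768)

cls_vec = hidden[:, 0, :]                    # output at the [CLS] position
probs = torch.softmax(classifier(cls_vec), dim=-1)
print(probs)  # head is untrained here, so the numbers are meaningless until fine-tuned
```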

## Training

![](/public/upload/machine/bert_masked.jpg)

PS: during training the model itself knows which word it masked, so no labeled data is needed and the objective is effectively self-supervised.
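
A simplified sketch of the masked-language-model objective, using `bert-base-uncased` as an assumed checkpoint (real BERT pre-training also replaces some selected tokens with random tokens or keeps them unchanged; that detail is omitted here):

```python
import random
import torch
from transformers import BertForMaskedLM, BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertForMaskedLM.from_pretrained("bert-base-uncased")

enc = tokenizer("rerank models score each query passage pair individually",
                return_tensors="pt")
input_ids = enc["input_ids"].clone()

# Pick ~15% of the non-special positions, remember the original ids as labels,
# and replace those tokens with [MASK].
labels = torch.full_like(input_ids, -100)            # -100 = ignored by the loss
positions = list(range(1, input_ids.shape[1] - 1))   # skip [CLS] and [SEP]
for pos in random.sample(positions, k=max(1, int(0.15 * len(positions)))):
    labels[0, pos] = input_ids[0, pos]
    input_ids[0, pos] = tokenizer.mask_token_id

out = model(input_ids=input_ids, attention_mask=enc["attention_mask"], labels=labels)
print(out.loss)  # cross-entropy computed only over the masked positions
```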

## Applications

The BERT paper describes several NLP tasks that BERT can handle:
1. Short-text similarity
![](/public/upload/machine/bert_similarity.jpg)
2. Text classification
3. QA bots
4. Semantic annotation
5. Feature extraction ==> the embedding model in RAG

PS: take the last layer's output, either the embedding at [CLS] or several token embeddings, add an FFNN + softmax on top, and both binary and multi-class classification tasks are covered.

## Miscellaneous

WordPiece tokenization splits words into sub-word pieces. Because those pieces are reused across many words, the vocabulary stays small, around 30k entries, whereas English alone has far more than 30k distinct word forms.
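
A quick illustration with the `bert-base-uncased` tokenizer; the exact split is whatever the trained WordPiece vocabulary happens to give:

```python
from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")

# Rare or long words are split into reusable sub-word pieces (the "##" prefix
# marks a continuation), so ~30k vocabulary entries cover far more surface forms.
print(tokenizer.tokenize("embeddings"))  # e.g. ['em', '##bed', '##ding', '##s']
print(len(tokenizer.vocab))              # roughly 30k entries
```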