Could you please provide a data sample for reference? #5

Open
kingpingyue opened this issue Mar 5, 2024 · 6 comments

Comments

@kingpingyue

Could you please share a data sample for reference? I'd like to understand the data processing part.

@jiahe7ay
Owner

jiahe7ay commented Mar 5, 2024

"text": xxxxx<Im_end>xxxx (maximum length 512). The im_end token separates two texts, and I pack each sample as close to the maximum length as possible.

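As a rough illustration of the packing described above, here is a minimal sketch, not the repository's actual code. It assumes the texts are already tokenized into lists of IDs, that a separator ID is appended after each text, and that the resulting stream is cut into fixed-size chunks (so a text may be split across two samples). The function name `pack_token_ids` and the separator ID `0` are hypothetical.

```python
from typing import Iterable, List


def pack_token_ids(docs: Iterable[List[int]], sep_id: int, max_len: int = 512) -> List[List[int]]:
    """Join tokenized texts with a separator ID, then cut into chunks of at most max_len."""
    stream: List[int] = []
    for ids in docs:
        stream.extend(ids)
        stream.append(sep_id)  # marks the boundary between two texts
    # Every chunk except possibly the last is filled to exactly max_len.
    return [stream[i:i + max_len] for i in range(0, len(stream), max_len)]


# Toy example: three "texts" as token IDs, with a tiny max_len for readability.
docs = [[11, 12, 13], [21, 22], [31, 32, 33, 34]]
chunks = pack_token_ids(docs, sep_id=0, max_len=5)
print(chunks)
```

With `max_len=512` the same logic would yield samples filled to 512 tokens, with the separator marking where one text ends and the next begins.
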
@kingpingyue
Author

I mean, for example, given an article, how do I turn it into data that can be used to train the model? I couldn't quite follow the code.

@kingpingyue
Author

input_ids = [np.array(item) for item in outputs["input_ids"]]

I don't understand why this line is needed.

@kingpingyue
Author

Why convert to np.array?

@jiahe7ay
Owner

If the vocabulary size is less than 65535, the token IDs are stored as uint16 to save disk space; otherwise they are stored as uint32.
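
To make the space saving concrete, here is a small sketch under stated assumptions. The `outputs` dict is a stand-in for tokenizer output, `pick_dtype` is a hypothetical helper, and the dtype is passed directly to `np.array`, which may differ from where the repository actually applies it. The threshold follows the rule quoted above; uint16 can represent IDs from 0 to 65535.

```python
import numpy as np


def pick_dtype(vocab_size: int):
    """Choose the smallest unsigned integer type, using the threshold from the thread."""
    return np.uint16 if vocab_size < 65535 else np.uint32


# Stand-in for tokenizer output (hypothetical values).
outputs = {"input_ids": [[1, 5, 9], [2, 7]]}

dtype = pick_dtype(vocab_size=32000)
input_ids = [np.array(item, dtype=dtype) for item in outputs["input_ids"]]

# Compare the storage cost against 64-bit integers.
bytes_u16 = sum(a.nbytes for a in input_ids)
bytes_i64 = sum(np.array(item, dtype=np.int64).nbytes for item in outputs["input_ids"])
print(bytes_u16, bytes_i64)
```

Each token takes 2 bytes as uint16 versus 8 bytes as int64, so the same five tokens occupy a quarter of the space.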

@kingpingyue
Author

Oh, I see. So it's really similar to input_batch = [] followed by input_batch.append(input_ids), except that specifying the data type saves disk space.
