
Update embed-jobs-api.mdx (#231)
* Update embed-jobs-api.mdx

Update code examples for text-embedding part

Signed-off-by: Max Shkutnyk <[email protected]>

* update code examples

* fix code example alignment

---------

Signed-off-by: Max Shkutnyk <[email protected]>
Co-authored-by: Max Shkutnyk <[email protected]>
Co-authored-by: trentfowlercohere <[email protected]>
3 people authored Nov 1, 2024
1 parent a8cdf86 commit 11660ac
21 changes: 10 additions & 11 deletions fern/pages/v2/text-embeddings/embed-jobs-api.mdx
@@ -29,7 +29,7 @@ The Embed Jobs API works in conjunction with the Embed API; in production use-ca
![](../../../assets/images/0826a69-image.png)
### Constructing a Dataset for Embed Jobs

- To create a dataset for Embed Jobs, you will need to specify the `embedding_types`, and you need to set `dataset_type` as `embed-input`. The schema of the file looks like: `text:string`.
+ To create a dataset for Embed Jobs, you will need to set dataset `type` as `embed-input`. The schema of the file looks like: `text:string`.

The Embed Jobs and Dataset APIs respect metadata through two fields: `keep_fields` and `optional_fields`. During the `create dataset` step, you can specify either `keep_fields` or `optional_fields`, each a list of strings corresponding to the metadata fields you’d like to preserve. `keep_fields` is more restrictive, since validation will fail if a listed field is missing from an entry. `optional_fields`, by contrast, will skip missing fields and allow validation to pass.
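To make the schema concrete, here is a hypothetical sketch of building such a JSONL file. The metadata field names (`wiki_id`, `url`, `views`, `title`, `langs`) mirror the sample dataset used in the snippets below; the row values are invented for illustration.

```python PYTHON
# Hypothetical embed-input file: every row needs a `text` field; any other
# fields are metadata that keep_fields / optional_fields can preserve.
import json

rows = [
    {
        "text": "Ottawa is the capital city of Canada.",
        "wiki_id": 101,
        "url": "https://en.wikipedia.org/wiki/Ottawa",
        "views": 5000,
        "title": "Ottawa",
        "langs": 12,
    },
    # `langs` is omitted below: validation still passes if it is listed in
    # optional_fields, but would fail if it were listed in keep_fields.
    {
        "text": "The CN Tower is a landmark in Toronto.",
        "wiki_id": 102,
        "url": "https://en.wikipedia.org/wiki/CN_Tower",
        "views": 7000,
        "title": "CN Tower",
    },
]

with open("embed_jobs_sample_data.jsonl", "w") as f:
    for row in rows:
        f.write(json.dumps(row) + "\n")
```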

@@ -66,10 +66,9 @@ ds=co.datasets.create(
    name='sample_file',
    # insert your file path here - you can upload it on the right - we accept .csv and jsonl files
    data=open('embed_jobs_sample_data.jsonl', 'rb'),
-   keep_fields=['wiki_id','url','views','title']
-   optional_fields=['langs']
-   dataset_type="embed-input",
-   embedding_types=['float']
+   keep_fields=['wiki_id','url','views','title'],
+   optional_fields=['langs'],
+   type="embed-input"
)

# wait for the dataset to finish validation
@@ -89,7 +88,7 @@ co = cohere.ClientV2(api_key="<YOUR API KEY>")
input_dataset=co.datasets.create(
    name='your_file_name',
    data=open('/content/your_file_path', 'rb'),
-   dataset_type="embed-input"
+   type="embed-input"
)

# block on server-side validation
@@ -115,15 +114,15 @@ If your dataset hits a validation error, please refer to the dataset validation
Your dataset is now ready to be embedded. Here's a code snippet illustrating what that looks like:

```python PYTHON
- embed_job = co.embed_jobs.create(
+ embed_job_response = co.embed_jobs.create(
      dataset_id=input_dataset.id,
      input_type='search_document',
      model='embed-english-v3.0',
      embedding_types=['float'],
      truncate='END')

# block until the job is complete
- co.wait(embed_job)
+ embed_job = co.wait(embed_job_response)
```

Since we’d like to search over these embeddings and we can think of them as constituting our knowledge base, we set `input_type='search_document'`.
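For contrast, a query against this knowledge base would be embedded with `input_type='search_query'` so that it lands in the same vector space as the documents. A minimal sketch using the same v2 client (the query text is invented for illustration):

```python PYTHON
# Embed a user query with input_type='search_query' to pair with documents
# that were embedded with input_type='search_document'.
query_response = co.embed(
    texts=["What is the capital of Canada?"],
    model="embed-english-v3.0",
    input_type="search_query",
    embedding_types=["float"],
)
query_embedding = query_response.embeddings.float_[0]
```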
@@ -133,14 +132,14 @@ Since we’d like to search over these embeddings and we can think of them as co
The output of embed jobs is a dataset object which you can download or pipe directly to a database of your choice:

```python PYTHON
- output_dataset=co.datasets.get(id=embed_job.output.id)
+ output_dataset=co.datasets.get(id=embed_job.output_dataset_id)
# pass the dataset object explicitly when saving it to disk
co.utils.save_dataset(dataset=output_dataset, filepath='/content/embed_job_output.csv', format="csv")
```

Alternatively if you would like to pass the dataset into a downstream function you can do the following:

```python PYTHON
- output_dataset=co.datasets.get(id=embed_job.output.id)
+ output_dataset=co.datasets.get(id=embed_job.output_dataset_id)
results=[]
for record in output_dataset:
    results.append(record)
```
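From here, `results` can be piped into a vector database or searched directly. As a rough sketch only, assuming each record exposes its vector under `record['embeddings']['float']` (check the actual record schema of your output dataset), a brute-force cosine-similarity search might look like:

```python PYTHON
# Assumption: each output record stores its embedding at
# record["embeddings"]["float"]; adjust to the real record schema.
import numpy as np

doc_vectors = np.array([r["embeddings"]["float"] for r in results])

def top_k(query_embedding, k=3):
    # rank documents by cosine similarity to the query embedding
    q = np.asarray(query_embedding)
    sims = doc_vectors @ q
    sims = sims / (np.linalg.norm(doc_vectors, axis=1) * np.linalg.norm(q))
    return np.argsort(-sims)[:k]
```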
