From f0debaad32da3ff2a1a6451f635dd32a8416d140 Mon Sep 17 00:00:00 2001
From: Arthur Galuza
Date: Thu, 14 Oct 2021 16:35:29 +0300
Subject: [PATCH] =?UTF-8?q?=F0=9F=8E=89=20Document=20stream=20caching=20(#?=
 =?UTF-8?q?6846)?=
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

* Upd docs: add caching section

* Upd docs

* Update docs/connector-development/cdk-python/http-streams.md

Co-authored-by: Eugene Kulak

* Remove link to details

* Update docs/connector-development/cdk-python/http-streams.md

Co-authored-by: Sherif A. Nada

* Update docs/connector-development/cdk-python/http-streams.md

Co-authored-by: Sherif A. Nada

* Add demo snippet

Co-authored-by: Eugene Kulak
Co-authored-by: Sherif A. Nada
---
 .../cdk-python/http-streams.md | 28 +++++++++++++++++++
 1 file changed, 28 insertions(+)

diff --git a/docs/connector-development/cdk-python/http-streams.md b/docs/connector-development/cdk-python/http-streams.md
index e088417cd3338..b8c74f4dd27e9 100644
--- a/docs/connector-development/cdk-python/http-streams.md
+++ b/docs/connector-development/cdk-python/http-streams.md
@@ -5,6 +5,7 @@ The CDK offers base classes that greatly simplify writing HTTP API-based connect
 * Authentication \(basic auth, Oauth2, or any custom auth method\)
 * Pagination
 * Handling rate limiting with static or dynamic backoff timing
+* Caching
 
 All these features have sane off-the-shelf defaults but are completely customizable depending on your use case. They can also be combined with other stream features described in the [full refresh streams](full-refresh-stream.md) and [incremental streams](incremental-stream.md) sections.
 
@@ -48,3 +49,30 @@ Note that Airbyte will always attempt to make as many requests as possible and o
 
 When implementing [stream slicing](incremental-stream.md#streamstream_slices) in an `HTTPStream` each Slice is equivalent to a HTTP request; the stream will make one request per element returned by the `stream_slices` function.
 The current slice being read is passed into every other method in `HttpStream` e.g: `request_params`, `request_headers`, `path`, etc.. to be injected into a request. This allows you to dynamically determine the output of the `request_params`, `path`, and other functions to read the input slice and return the appropriate value.
+## Caching
+It's possible to cache data from a stream onto a temporary file on disk.
+
+This is especially useful when dealing with streams that depend on the results of another stream e.g: `/employees/{id}/details`. In this case, we can use caching to write the data of the parent stream to a file to use this data when the child stream synchronizes, rather than performing a full HTTP request again.
+
+The caching mechanism works as follows: If the request is made for the first time, the returned value will be written to disk (all requests made by the `read_records` method will be written to the cache file). When the same request is made again, instead of making another HTTP request, the result will instead be read from disk. It is checked whether the required request is in the cache file, and if so, the data from this file is returned. However, if the check for the request's existence in the cache file fails, a new request will be made, and its result will be added to the cache file.
+
+Caching can be enabled by overriding the `use_cache` property of the `HttpStream` class to return `True`.
+
+The caching mechanism is related to parent streams. For child streams, there is an `HttpSubStream` class inheriting from `HttpStream` and overriding the `stream_slices` method that returns a generator of all parent entries.
+
+To use caching in the parent/child relationship, perform the following steps:
+1. Turn on parent stream caching by overriding the `use_cache` property.
+2. Inherit child stream class from `HttpSubStream` class.
+
+#### Example
+```python
+class Employees(HttpStream):
+    ...
+
+    @property
+    def use_cache(self) -> bool:
+        return True
+
+class EmployeeDetails(HttpSubStream):
+    ...
+```
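
The caching behaviour described in the patch (write the response to disk on the first request, read it back from disk on a repeat) can be illustrated with a minimal standard-library sketch. This is not the CDK's implementation: `FileBackedRequestCache`, `fake_request`, and the JSON file layout are hypothetical names and choices made only for this illustration, and a plain function stands in for the real HTTP call.

```python
import json
import os
import tempfile
from typing import Callable, Dict, List


class FileBackedRequestCache:
    """Hypothetical helper: stores responses in a JSON file keyed by request URL."""

    def __init__(self, cache_path: str) -> None:
        self.cache_path = cache_path

    def _load(self) -> Dict[str, dict]:
        # A missing file simply means nothing has been cached yet.
        if not os.path.exists(self.cache_path):
            return {}
        with open(self.cache_path, "r", encoding="utf-8") as f:
            return json.load(f)

    def fetch(self, url: str, do_request: Callable[[str], dict]) -> dict:
        cached = self._load()
        if url in cached:
            # Cache hit: return the stored response without a new request.
            return cached[url]
        # Cache miss: perform the request and add its result to the cache file.
        response = do_request(url)
        cached[url] = response
        with open(self.cache_path, "w", encoding="utf-8") as f:
            json.dump(cached, f)
        return response


calls: List[str] = []


def fake_request(url: str) -> dict:
    """Stand-in for an HTTP call; records every URL that is actually requested."""
    calls.append(url)
    return {"url": url, "employees": [{"id": 1}, {"id": 2}]}


with tempfile.TemporaryDirectory() as tmp_dir:
    cache = FileBackedRequestCache(os.path.join(tmp_dir, "cache.json"))
    first = cache.fetch("/employees", fake_request)   # performs the request
    second = cache.fetch("/employees", fake_request)  # served from the cache file
```

In this sketch the second `fetch` never reaches `fake_request`, which mirrors how a child stream could reuse the parent stream's records without repeating the parent's HTTP requests.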