
Cecilia/itensor stage2 #64

Open
ceciliapeng2011 wants to merge 5 commits into in_place_dynamic from cecilia/itensor_stage2

Conversation

ceciliapeng2011

Details:

  • item1
  • ...

Tickets:

  • ticket-id

@ceciliapeng2011 ceciliapeng2011 marked this pull request as draft June 28, 2023 02:48
Comment on lines 18 to 23
// WA: the resize stage might not run because there is no shape change,
// but the underlying actual memory manager has changed.
bool validated = (_previous != m_pMngr);
if (validated && m_pMngr->getSize() < m_Size) {
    m_pMngr->resize(m_Size);
}
Author

@maxnick Here is the WA for #62.
It doesn't sound like a good solution either, as it assumes something happens in the resizing stage and may cause an extra resize in some situations.

Author

done

Comment on lines 889 to 918
// Determine a group with outputs.
size_t isOutGrp = 0;
int64_t outBoxId = -1;
for (auto& box : group) {
    if (std::any_of(edge_clusters[box.id].begin(),
                    edge_clusters[box.id].end(),
                    [box](const ov::intel_cpu::EdgePtr edge) {
                        return edge->getChild()->getType() == Type::Output;
                    })) {
        isOutGrp++;
        outBoxId = box.id;
    }
}
if (isOutGrp) {
    IE_ASSERT(isOutGrp == 1);  // reuse_io_tensors false
    grpMemMngr = std::make_shared<OutputMemoryMngr>(grpMemMngr);
    DEBUG_LOG(grpMemMngr, " ", this);

    // Store the output memory managers so that the infer requests can access them.
    for (auto& edge : edge_clusters[outBoxId]) {
        const auto child = edge->getChild();
        if (child->getType() == Type::Output) {
            for (auto& output : outputNodesMap) {
                if (output.second == child)
                    outputNodesMemMngrMap[output.first] = grpMemMngr;
            }
        }
    }
}
Author

@maxnick Indeed, this is not a good place to add OutputMemoryMngr. I just realized that there might be situations where the graph input has a dynamic shape but the output has a static shape (e.g. some interpolation ops).
This needs further discussion as a follow-up to #62.

Author

There could also be corner cases where more than one output shares the same memory, e.g. when each output comes from a Split.

Owner

> There could also be corner cases where more than one output shares the same memory, e.g. when each output comes from a Split.

That is why I recommended checking the Edge::Status::NeedAllocation status. If the edge status is different, it means the edge shares memory with other edges, and in that case we can't set the proxy memory manager.

> I just realized that there might be situations where the graph input has a dynamic shape but the output has a static shape (e.g. some interpolation ops).

To avoid affecting static tensors, this step may be performed after the static tensors are initialized (L831).
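
A minimal sketch of the suggested check (the getStatus() accessor name is an assumption here, not taken from this PR):

// Only wrap the group manager for an output edge that owns its allocation;
// edges with a different status share memory with other edges, so wrapping
// them would also affect the sharing edges.
if (edge->getStatus() == Edge::Status::NeedAllocation &&
    edge->getChild()->getType() == Type::Output) {
    grpMemMngr = std::make_shared<OutputMemoryMngr>(grpMemMngr);
}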

Author

done

namespace ov {
namespace intel_cpu {

class OutputMemoryMngr : public IMemoryMngrObserver {
Owner

Let's rename it to ProxyMemoryMngr so that its purpose is not bound only to outputs.

Author

done


using namespace ov::intel_cpu;

void OutputMemoryMngr::setTensor(std::shared_ptr<Tensor> tensor) {
Owner

We do not need this method at all. The class is a simple proxy that allows replacing the underlying memory manager without changing the proxy memory manager reference, which is shared between the partitioned memory managers and the memory objects.
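
For illustration, a minimal sketch of such a proxy, assuming simplified method names (the real IMemoryMngr interface has more methods; this is not the final implementation):

class ProxyMemoryMngrSketch {
public:
    explicit ProxyMemoryMngrSketch(MemoryMngrPtr mngr) : m_pMngr(mngr) {}

    // Consumers always hold the proxy; all calls are delegated to whichever
    // manager is currently installed.
    void* getRawPtr() const { return m_pMngr->getRawPtr(); }
    bool resize(size_t size) { return m_pMngr->resize(size); }

    // Swap the underlying manager; every holder of the proxy reference
    // (partitioned memory managers, memory objects) sees the new memory.
    void setManager(MemoryMngrPtr mngr) { m_pMngr = mngr; }

private:
    MemoryMngrPtr m_pMngr;
};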

Author

done

data = make_blob_with_precision(desc);
data->allocate();

auto mem_ptr = create_memory(InferenceEngine::details::convertPrecision(outputNode->second->get_input_element_type(0)), dims);
Owner

We have to allow creating the memory from the actual shape. The Memory object may have a dynamic shape.

Author
@ceciliapeng2011 ceciliapeng2011 Jun 29, 2023

We probably have to defer this until API 2.0, as we need to wrap this tensor into a TensorMemoryBlob, whose constructor calls the tensor's get_shape and get_strides. These calls would throw an exception if the shape is dynamic.
We simply use dims=0 here as a WA.

data = make_blob_with_precision(desc);
data->allocate();

auto mem_ptr = create_memory(InferenceEngine::details::convertPrecision(outputNode->second->get_input_element_type(0)), dims);
const auto &tensor_ptr = std::make_shared<Tensor>(mem_ptr);
Owner

As we discussed before, the data blob should be instantiated from the tensor. We should use the Blob wrapper over the Tensor object, which is already available in the core part.

Author

The wrapper is not usable for now. We have discussed this in the email thread. It is only temporary. We need a fix, or API 2.0.

Owner

As I remember, it can be easily fixed to unblock these changes. Then it will be removed after moving to the API 2.0 entities.

Author
@ceciliapeng2011 ceciliapeng2011 Jun 30, 2023

Done, though not as easy as thought.

CPUTensor is a wrapper over the memory object, but it is also wrapped by a TensorMemoryBlob object. The problem is that when the memory object is resized, the change is not reflected in the TensorMemoryBlob. So at the Graph::PullOutputData stage, set_shape is called on the CPUTensor again, because the blob's TensorDesc still has the old (empty) dims.

The second problem is that the buffer() and rwmap() methods of Blob return a LockedMemory instance instead of a void*, and we also have to keep the blob and the tensor it wraps in sync, so a NutshellAllocator is introduced to blindly lock memory without checks.

Comment on lines 272 to 273
auto inferrequest = std::dynamic_pointer_cast<InferRequest>(this->shared_from_this());
OPENVINO_ASSERT(inferrequest, "should be a InferRequest instance");
Owner

This is not how we handle polymorphism.
If something should be called only on the InferRequest level, then it shall be implemented on that level!

Author

Done. It was a leftover from debugging.


if (canBeInPlace) {
    auto tt = std::get<0>(outputsTensor2BlobMap[it.first]);
    outputMemMngr->setTensor(tt);
Owner

As we discussed, we must not set the Tensor. We have to use the memory manager of the memory object underlying tt!

Author

done

outputMemMngr->setTensor(tt);
DEBUG_LOG("setTensor ", tt, " graph ", graph, " inferrequest ", this);
} else {
outputMemMngr->setTensor(nullptr);
Owner

Here we simply set a new instance of the memory manager.

Author
@ceciliapeng2011 ceciliapeng2011 Jun 29, 2023

There is a memory allocation at the graph init stage for the output edges. In situations where zero-copy is not feasible, we could reuse that memory manager instead of creating a new instance frequently, as in the sketch below. What do you think?

const MemoryMngrPtr m_pOrigMngr; in the proxy manager is for this purpose.
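
A sketch of this proposal at the call site (names taken from this PR; treating nullptr as "fall back to m_pOrigMngr" is the assumption of this variant):

if (canBeInPlace) {
    // zero-copy: expose the user tensor's memory manager through the proxy
    outputMemMngr->setManager(memptr->getMemoryMngr());
} else {
    // zero-copy not feasible: reuse the manager allocated at graph init
    // (m_pOrigMngr inside the proxy) instead of creating a new one every request
    outputMemMngr->setManager(nullptr);
}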

@@ -246,6 +248,8 @@ class Graph {
std::map<std::string, NodePtr> inputNodesMap;
std::map<std::string, NodePtr> outputNodesMap;

std::map<std::string, MemoryMngrPtr> outputNodesMemMngrMap;
Owner

Use the specific memory manager proxy type to avoid a dynamic downcast.

Author

done


@@ -67,6 +67,8 @@ class IMemoryMngr {
* @return status whether the object has control over underlying memory buffer
*/
virtual bool hasExtBuffer() const noexcept = 0;

virtual size_t getSize() const noexcept = 0;
Owner

This call is not necessary. The proxy may keep the last requested size internally and call resize every time a new memory manager is set, to ensure that enough memory is allocated.
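
A minimal sketch of that alternative (member names assumed), which removes the need for getSize() on the interface:

bool ProxyMemoryMngr::resize(size_t size) {
    if (size > m_Size) m_Size = size;   // remember the largest size requested so far
    return m_pMngr->resize(size);       // delegate to the currently installed manager
}

void ProxyMemoryMngr::setManager(MemoryMngrPtr mngr) {
    m_pMngr = mngr;
    m_pMngr->resize(m_Size);            // ensure the newly set manager has enough memory
}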

Author

done

@ceciliapeng2011 ceciliapeng2011 force-pushed the cecilia/itensor_stage2 branch from 5cabcc0 to abce82d on June 29, 2023 02:13
Comment on lines 19 to 25
{
    std::lock_guard<std::mutex> guard(m_lock);
    if (m_validated) {
        m_pMngr->resize(m_Size);
        m_validated = false;
    }
}
Owner

We can avoid this complexity with an unconditional resize call inside the setManager method.
It may cost a possible extra malloc (though that is rather a speculative conclusion), but I think we can accept this possibility for the sake of lower code complexity. If we really face any perf issues caused by this call, we can refine it later. Also, technically getRawPtr is a more frequent call than setManager, so having synchronization here may have a more significant perf impact.

Author

done


private:
    // We keep the original MemMngr, as we may fall back to copying the output.
    const MemoryMngrPtr m_pOrigMngr;
Owner

I think it is better to extract the responsibility of setting a default (fallback) memory manager for the proxy out to the user. Explicit is better than implicit in this case. And anyway, the decision is still up to the caller, since the latter must set nullptr to force the proxy to switch to the fallback memory manager.

Author

I'll leave this to your decision.

@@ -732,6 +732,9 @@ class Node {
std::unordered_map<std::string, MemoryPtr> privateWeightCache;

CPU_DEBUG_CAP_ENABLE(friend class Verbose);

public:
bool forceUpdateShape = false;
Owner

Why did you bring back this flag? It looks like an unnecessary extra responsibility that may be resolved on the proxy memory manager level by storing the size from the previous run.

Author

removed

outputMemMngr->setManager(memptr->getMemoryMngr());
DEBUG_LOG("setManager ", memptr->getMemoryMngr(), " graph ", graph, " inferrequest ", this);
} else {
outputMemMngr->setManager(nullptr);
Owner

Here we can just change the call signature to:

   outputMemMngr->setManager(std::make_shared<DnnlMemoryMngr>(make_unique<MemoryManagerWithReuse>()));

}
}
}
IE_ASSERT(outputNodesMemMngrMap.size() <= outputNodesMap.size());
Owner

Please add a meaningful error message for easier debugging in the future.

Author

unnecessary. removed.

if (!isDynamic && !externalPtr.count(name) &&
data->getTensorDesc() == MemoryDescUtils::convertToTensorDesc(output->second->getParentEdgesAtPort(0)[0]->getMemory().getDesc())) {
if (!externalPtr.count(name) &&
(isDynamic ||(data->getTensorDesc() == MemoryDescUtils::convertToTensorDesc(output->second->getParentEdgesAtPort(0)[0]->getMemory().getDesc()) && !isDynamic))) { // TODO: handle desc incompatible if isDynamic.
Owner

Suggested change
(isDynamic ||(data->getTensorDesc() == MemoryDescUtils::convertToTensorDesc(output->second->getParentEdgesAtPort(0)[0]->getMemory().getDesc()) && !isDynamic))) { // TODO: handle desc incompatible if isDynamic.
(isDynamic || (data->getTensorDesc() == MemoryDescUtils::convertToTensorDesc(output->second->getParentEdgesAtPort(0)[0]->getMemory().getDesc())) { // TODO: handle desc incompatible if isDynamic.

We do not need additional !isDynamic check since || operator is used.

Author
@ceciliapeng2011 ceciliapeng2011 Jun 29, 2023

Yes, right. Done.

But there is a question: do we need the tensor desc compatibility check for the dynamic case?

@ceciliapeng2011 ceciliapeng2011 force-pushed the cecilia/itensor_stage2 branch from af162b0 to 931020e on June 30, 2023 03:55
@ceciliapeng2011 ceciliapeng2011 requested a review from maxnick June 30, 2023 03:57
@ceciliapeng2011 ceciliapeng2011 marked this pull request as ready for review June 30, 2023 03:58
@maxnick maxnick force-pushed the in_place_dynamic branch 2 times, most recently from 97b8873 to 8c87601 on July 10, 2023 16:03
fix

Update src/plugins/intel_cpu/src/cpu_tensor.cpp

Co-authored-by: Maksim Kutakov <[email protected]>

Update src/plugins/intel_cpu/src/cpu_tensor.cpp

Co-authored-by: Maksim Kutakov <[email protected]>

remove unit test of check empty tensor data().

add mock IMemory for CPUTensor unit test.

fix function template inherit issue

unit test with mock memory

update_strides fix

fix

fix

CR fix and add unit test.

trivial fix

CR fix. code cleanup.

CR fix

fix afer rebase
…put memory to graph via ITensor implementation.

fix

throw exception

CR fix

remove irrelated headers.

CR fix: instance ProxyMemManager for output edge which NeedAllocation

stage tensor_to_blob

tensor_to_blob wrap to use api 1.0

fix
@ceciliapeng2011 ceciliapeng2011 force-pushed the cecilia/itensor_stage2 branch from 931020e to 8fb04bc on July 11, 2023 01:25