Cecilia/itensor stage2 #64
base: in_place_dynamic
Conversation
// WA: the resize stage may not trigger since there is no shape change,
// but the underlying memory manager has changed.
bool validated = (_previous != m_pMngr);
if (validated && m_pMngr->getSize() < m_Size) {
    m_pMngr->resize(m_Size);
}
done
src/plugins/intel_cpu/src/graph.cpp
Outdated
// determine a group with outputs.
size_t isOutGrp = 0;
int64_t outBoxId = -1;
for (auto& box : group) {
    if (std::any_of(edge_clusters[box.id].begin(),
                    edge_clusters[box.id].end(),
                    [box](const ov::intel_cpu::EdgePtr edge) {
                        return edge->getChild()->getType() == Type::Output;
                    })) {
        isOutGrp++;
        outBoxId = box.id;
    }
}
if (isOutGrp) {
    IE_ASSERT(isOutGrp == 1);  // reuse_io_tensors false
    grpMemMngr = std::make_shared<OutputMemoryMngr>(grpMemMngr);
    DEBUG_LOG(grpMemMngr, " ", this);

    // Store the output memory managers
    // so that the infer requests can access them.
    for (auto& edge : edge_clusters[outBoxId]) {
        const auto child = edge->getChild();
        if (child->getType() == Type::Output) {
            for (auto& output : outputNodesMap) {
                if (output.second == child)
                    outputNodesMemMngrMap[output.first] = grpMemMngr;
            }
        }
    }
}
Also, there could be corner cases where more than one output shares the same memory, e.g. each output is an output of a Split.
Also, there could be corner cases where more than one output shares the same memory, e.g. each output is an output of a Split.

That is why I recommended checking the Edge::Status::NeedAllocation status. If the edge status is different, it means the edge shares memory with other edges, and in that case we can't set the proxy memory manager.
I just realized that there might be situations where the graph input has a dynamic shape but the output has a static shape (e.g. some interpolation ops).
To avoid affecting static tensors, this step may be performed after initializing static tensors, L831.
done
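A rough sketch of the recommended guard (the Edge status values appear in the intel_cpu plugin sources; the surrounding loop and the outputName variable are assumptions for illustration):

// Install the proxy only on edges that own their allocation
// (Edge::Status::NeedAllocation). Any other status means the edge
// shares memory with another edge, so swapping its manager through
// a proxy could corrupt the sharing peers.
if (edge->getStatus() == Edge::Status::NeedAllocation) {
    outputNodesMemMngrMap[outputName] = grpMemMngr;
}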
namespace ov {
namespace intel_cpu {

class OutputMemoryMngr : public IMemoryMngrObserver {
Let's rename it to ProxyMemoryMngr so that its purpose is not bound only to outputs.
done
using namespace ov::intel_cpu;

void OutputMemoryMngr::setTensor(std::shared_ptr<Tensor> tensor) {
We do not need this method at all. The class is a simple proxy, which allows replacing the underlying memory manager without changing the proxy memory manager reference, which is shared between partitioned memory managers and memory objects.
done
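To illustrate the proxy idea, here is a minimal sketch only; the stand-in IMemoryMngr interface below is assumed for self-containment (the plugin's real interface has more members):

#include <cstddef>
#include <memory>

// Minimal stand-in for the plugin's memory manager interface.
struct IMemoryMngr {
    virtual ~IMemoryMngr() = default;
    virtual void* getRawPtr() const noexcept = 0;
    virtual bool resize(size_t size) = 0;
};
using MemoryMngrPtr = std::shared_ptr<IMemoryMngr>;

// The proxy forwards every call to the current target manager, so the
// proxy pointer itself (shared between partitioned memory managers and
// Memory objects) never has to change when the target is replaced.
class ProxyMemoryMngr : public IMemoryMngr {
public:
    explicit ProxyMemoryMngr(MemoryMngrPtr fallback)
        : m_pOrigMngr(std::move(fallback)), m_pMngr(m_pOrigMngr) {}

    void* getRawPtr() const noexcept override { return m_pMngr->getRawPtr(); }
    bool resize(size_t size) override {
        m_size = size;  // remember the last requested size
        return m_pMngr->resize(size);
    }

    // Swap the underlying manager; nullptr restores the fallback.
    void setManager(MemoryMngrPtr mngr) {
        m_pMngr = mngr ? std::move(mngr) : m_pOrigMngr;
    }

private:
    const MemoryMngrPtr m_pOrigMngr;  // fallback when zero-copy is impossible
    MemoryMngrPtr m_pMngr;            // current target
    size_t m_size = 0;                // last requested size
};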
data = make_blob_with_precision(desc);
data->allocate();

auto mem_ptr = create_memory(InferenceEngine::details::convertPrecision(outputNode->second->get_input_element_type(0)), dims);
We have to allow creating the memory from the actual shape. The Memory object may have a dynamic shape.
We probably have to defer it until API 2.0, as we need to wrap this tensor into TensorMemoryBlob, whose constructor calls the tensor's get_shape and get_strides. These calls would throw an exception if the shape is dynamic.
We simply use dims=0 here as a WA.
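As a purely illustrative sketch of that WA (the prec and rank variables and the exact call sites are assumptions, not taken from the patch):

#include <blob_factory.hpp>  // make_blob_with_precision

// For a dynamic output, build the placeholder blob with all dims set
// to 0; it carries no meaningful allocation yet, and the actual shape
// is assigned later once inference produces it.
InferenceEngine::SizeVector dims(rank, 0);  // 'rank' assumed known from the node
InferenceEngine::TensorDesc desc(prec,      // 'prec' is the output precision
                                 dims,
                                 InferenceEngine::TensorDesc::getLayoutByDims(dims));
auto data = make_blob_with_precision(desc);
data->allocate();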
data = make_blob_with_precision(desc);
data->allocate();

auto mem_ptr = create_memory(InferenceEngine::details::convertPrecision(outputNode->second->get_input_element_type(0)), dims);
const auto &tensor_ptr = std::make_shared<Tensor>(mem_ptr);
As we discussed before, the data blob should be instantiated from the tensor. We should use the Blob wrapper over the Tensor object which is already available in the core part.
The wrapper is not usable for now. We discussed this in the email thread. It is only temporary; we need a fix, or API 2.0.
As I remember, it can be easily fixed to unblock these changes. Then it will be removed after moving to API 2.0 entities.
done, though not as easy as thought.
CPUTensor is a wrapper over the memory object, but it is also wrapped by a TensorMemoryBlob object. The problem is that when the memory object is resized, the change is not reflected in the TensorMemoryBlob. So in the Graph::PullOutputData stage, set_shape is called on the CPUTensor again, as the blob's TensorDesc still has the old (empty) dims.
The second problem is that since all buffer()/rwmap() methods of Blob return a LockedMemory instance instead of void*, and we also have to keep the blob and the tensor wrapped inside it in sync, a NutshellAllocator is introduced to blindly lock memory without checks.
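A rough reconstruction of what such a "blind" allocator might look like, based only on the description above (assuming InferenceEngine's IAllocator interface; the actual NutshellAllocator in the patch may differ):

#include <ie_allocator.hpp>  // InferenceEngine::IAllocator, LockOp

// Pass-through allocator: it never owns memory. lock() just hands the
// stored handle back as the data pointer, so a Blob can expose an
// externally managed Tensor buffer through LockedMemory.
class NutshellAllocator : public InferenceEngine::IAllocator {
public:
    void* lock(void* handle, InferenceEngine::LockOp = InferenceEngine::LOCK_FOR_WRITE) noexcept override {
        return handle;  // no validity or bounds check, by design
    }
    void unlock(void*) noexcept override {}
    // alloc()/free() are no-ops: the buffer lifetime is controlled by
    // the wrapped Tensor, not by the Blob.
    void* alloc(size_t) noexcept override { return nullptr; }
    bool free(void*) noexcept override { return true; }
};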
auto inferrequest = std::dynamic_pointer_cast<InferRequest>(this->shared_from_this());
OPENVINO_ASSERT(inferrequest, "should be an InferRequest instance");
This is not how we handle polymorphism.
If something should be called only at the InferRequest level, then it shall be implemented on that level!!!
done, it was a leftover from debugging.
if (canBeInPlace) {
    auto tt = std::get<0>(outputsTensor2BlobMap[it.first]);
    outputMemMngr->setTensor(tt);
As we discussed, we must not set the Tensor. We have to use the memory manager of the memory object underlying tt!
done
    outputMemMngr->setTensor(tt);
    DEBUG_LOG("setTensor ", tt, " graph ", graph, " inferrequest ", this);
} else {
    outputMemMngr->setTensor(nullptr);
Here we simply set a new instance of the memory manager.
There is a memory allocation at the graph init stage for output edges. In situations where zero-copy is not feasible, we could reuse that memory manager instead of creating a new instance every time. What do you think?
The `const MemoryMngrPtr m_pOrigMngr;` member in the proxy manager is for this purpose.
src/plugins/intel_cpu/src/graph.h
Outdated
@@ -246,6 +248,8 @@ class Graph {
     std::map<std::string, NodePtr> inputNodesMap;
     std::map<std::string, NodePtr> outputNodesMap;

+    std::map<std::string, MemoryMngrPtr> outputNodesMemMngrMap;
Use a specific memory manager proxy type to avoid a dynamic downcast.
done
@@ -67,6 +67,8 @@ class IMemoryMngr {
      * @return status whether the object has control over underlying memory buffer
      */
     virtual bool hasExtBuffer() const noexcept = 0;

+    virtual size_t getSize() const noexcept = 0;
This call is not necessary. The proxy may keep the last requested size internally and call resize every time a new memory manager is set, to ensure that enough memory is allocated.
done
Force-pushed from 5cabcc0 to abce82d
{
    std::lock_guard<std::mutex> guard(m_lock);
    if (m_validated) {
        m_pMngr->resize(m_Size);
        m_validated = false;
    }
}
We can avoid this complexity with an unconditional resize call inside the setManager method.
It may cost a possible extra malloc (though that is rather a speculative conclusion), but I think we can sacrifice this possibility for the lower code complexity. If we really face any perf issues caused by this call, we can refine it later. Also, technically getRawPtr is a more frequent call than setManager, so having synchronization here may have a more significant perf impact.
done
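Continuing the ProxyMemoryMngr sketch from earlier, the suggested simplification would replace its setManager with roughly the following, so that getRawPtr needs neither the flag nor the mutex (member names follow the snippet above; a sketch, not the actual patch):

// Unconditionally resize whenever the target changes: this may cost an
// occasional extra malloc, but it guarantees the new target holds at
// least the last requested size without any per-getRawPtr synchronization.
void setManager(MemoryMngrPtr mngr) {
    m_pMngr = mngr ? std::move(mngr) : m_pOrigMngr;
    m_pMngr->resize(m_size);
}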
private:
    // We keep the original MemMngr as we may fall back to copying the output.
    const MemoryMngrPtr m_pOrigMngr;
I think it is better to extract the responsibility of setting a default (fallback) memory manager for the proxy to the user. Explicit is better than implicit in this case. And anyway, the decision is still on the caller, since the latter must set nullptr to force the proxy to switch to the fallback memory manager.
leave this to your decision.
src/plugins/intel_cpu/src/node.h
Outdated
@@ -732,6 +732,9 @@ class Node {
     std::unordered_map<std::string, MemoryPtr> privateWeightCache;

     CPU_DEBUG_CAP_ENABLE(friend class Verbose);

+public:
+    bool forceUpdateShape = false;
Why did you bring back this flag? It looks like an unnecessary extra responsibility, one that may be resolved at the proxy memory manager level by storing the size from the previous run.
removed
    outputMemMngr->setManager(memptr->getMemoryMngr());
    DEBUG_LOG("setManager ", memptr->getMemoryMngr(), " graph ", graph, " inferrequest ", this);
} else {
    outputMemMngr->setManager(nullptr);
Here we can just change the call to:
outputMemMngr->setManager(std::make_shared<DnnlMemoryMngr>(make_unique<MemoryManagerWithReuse>()));
src/plugins/intel_cpu/src/graph.cpp
Outdated
        }
    }
}
IE_ASSERT(outputNodesMemMngrMap.size() <= outputNodesMap.size());
Please add a meaningful error message for easier debugging in the future.
unnecessary. removed.
-if (!isDynamic && !externalPtr.count(name) &&
-    data->getTensorDesc() == MemoryDescUtils::convertToTensorDesc(output->second->getParentEdgesAtPort(0)[0]->getMemory().getDesc())) {
+if (!externalPtr.count(name) &&
+    (isDynamic || (data->getTensorDesc() == MemoryDescUtils::convertToTensorDesc(output->second->getParentEdgesAtPort(0)[0]->getMemory().getDesc()) && !isDynamic))) { // TODO: handle desc incompatible if isDynamic.
Suggested change:
-    (isDynamic ||(data->getTensorDesc() == MemoryDescUtils::convertToTensorDesc(output->second->getParentEdgesAtPort(0)[0]->getMemory().getDesc()) && !isDynamic))) { // TODO: handle desc incompatible if isDynamic.
+    (isDynamic || (data->getTensorDesc() == MemoryDescUtils::convertToTensorDesc(output->second->getParentEdgesAtPort(0)[0]->getMemory().getDesc())))) { // TODO: handle desc incompatible if isDynamic.

We do not need the additional !isDynamic check since the || operator is used.
Yes, right. Done.
But there is a question: do we need the tensor desc compatibility check for the dynamic case?
Force-pushed from af162b0 to 931020e
Force-pushed from 97b8873 to 8c87601
Squashed commits:
- fix
- Update src/plugins/intel_cpu/src/cpu_tensor.cpp (Co-authored-by: Maksim Kutakov <[email protected]>)
- Update src/plugins/intel_cpu/src/cpu_tensor.cpp (Co-authored-by: Maksim Kutakov <[email protected]>)
- remove unit test of check empty tensor data(). add mock IMemory for CPUTensor unit test.
- fix function template inherit issue
- unit test with mock memory update_strides
- fix
- fix
- fix CR
- fix and add unit test.
- trivial fix
- CR fix. code cleanup.
- CR fix
- fix after rebase
- …put memory to graph via ITensor implementation. fix throw exception
- CR fix: remove unrelated headers.
- CR fix: instance ProxyMemManager for output edge which NeedAllocation stage
- tensor_to_blob
- tensor_to_blob wrap to use api 1.0
- fix
Force-pushed from 931020e to 8fb04bc
Details:
Tickets: