This repository has been archived by the owner on Mar 20, 2023. It is now read-only.

Mpi logger #300

Closed
alkino wants to merge 81 commits into master from mpi_logger

Conversation

alkino
Member

@alkino alkino commented May 7, 2020

No description provided.

@bbpbuildbot
Collaborator

Can one of the admins verify this patch?

@alkino alkino changed the base branch from master to phase1 May 7, 2020 20:58
{
    MPI_Comm_rank(comm, &mpi_rank);
    if (mpi_rank == server_rank) {
        server_thr = std::thread([&](){
Collaborator

A few quick comments:

  • I don't think we want to spawn the thread (performance and thread-pinning aspects).
  • Another aspect is mixing two threading models: OpenMP in CoreNEURON and std::thread.

Member Author

I think we can afford one thread on one rank before doing hours of computation.

Is it really a problem, given that there is no OpenMP in this thread and no std::thread in the OpenMP parts?

Contributor

(Now I see where the Slack discussion came from.) I guess it's clear now that OpenMP is threading, and I sort of agree with @pramodk that the thread is not needed here. Wanting to run the logging in its own thread seems like premature optimization and adds unnecessary complication. I would say for now let's just skip threading.

Comment on lines 35 to 48
MPI_Probe(MPI_ANY_SOURCE, tag, comm, &status);
int msg_size = 0;
MPI_Get_count(&status, MPI_BYTE, &msg_size);
auto buf = std::make_unique<char[]>(msg_size);
MPI_Recv(buf.get(), msg_size, MPI_PACKED, MPI_ANY_SOURCE, tag, comm, MPI_STATUS_IGNORE);

int position = 0;
int level = 0;
MPI_Unpack(buf.get(), msg_size, &position, &level, 1, MPI_INT, comm);
unsigned long payload_size = 0;
MPI_Unpack(buf.get(), msg_size, &position, &payload_size, 1, MPI_UNSIGNED_LONG, comm);
std::string payload(payload_size, ' ');
MPI_Unpack(buf.get(), msg_size, &position, &payload[0], payload_size, MPI_BYTE, comm);
std::cout << "[" << level << "] " << payload << std::endl;
Collaborator

  • MPI_Probe and MPI_Recv are blocking calls.
  • It's true that they run in a separate thread, so the argument could be that they won't block the main thread.
  • Note, however, that this requires enabling thread safety in the MPI library, which comes at a cost (see the sketch below this list).
  • I wouldn't claim every MPI library is fully thread safe (e.g. MPI_THREAD_MULTIPLE support and the related performance penalty).
  • (I understand that collecting logs at server_rank is one policy, but it won't be scalable.)
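
For reference, a minimal sketch (illustrative only, not part of this PR) of what enabling that thread safety involves: the application would initialize MPI with MPI_Init_thread, request MPI_THREAD_MULTIPLE, and check the level actually provided before starting a background receiver thread.

#include <cstdio>
#include <mpi.h>

int main(int argc, char** argv) {
    // Request full thread support; the library may grant a lower level.
    int provided = MPI_THREAD_SINGLE;
    MPI_Init_thread(&argc, &argv, MPI_THREAD_MULTIPLE, &provided);
    if (provided < MPI_THREAD_MULTIPLE) {
        // A background thread calling MPI_Probe/MPI_Recv concurrently with
        // the main thread's MPI calls is not safe at this level.
        std::fprintf(stderr, "MPI provides thread level %d < MPI_THREAD_MULTIPLE\n", provided);
    }
    // ... rest of the application ...
    MPI_Finalize();
    return 0;
}

The thread-level constants are ordered, so comparing with < is the usual way to check whether the requested level was granted.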

Member Author

Yes, I made them blocking because this thread has nothing else to do.
Hm, I will look into MPI thread safety, but it seems unrelated to me here.
Why is it not scalable? Do you have an idea for another policy?

{
public:
    // Empty logger
    explicit mpi_logger(std::string logger_name, int server_rank_ = 0, MPI_Comm comm_ = MPI_COMM_WORLD)
Contributor

what do you mean by "server rank"?

server_thr = std::thread([&](){
    int stopped = 1;
    MPI_Finalized(&stopped);
    while (!stopped) {
Contributor

OK, I see how you thought about it. May I suggest an alternative way to do this?
The more MPI-like way of doing parallelism is that all ranks execute essentially the same program. This would mean that, in a given piece of code where you'd want to log, all ranks would call mpi_logger.log(...) (or something similar). The log function could then do one of two things:

  1. You could go the serialization route: you write a loop over the ranks and serialize the output with barriers, printing the log of one rank at each iteration and protecting it with a barrier (see the sketch after this comment). This way you ensure that the ranks don't write over each other, and you can then also add fancy things like masks for ranks to print or mute, etc.
  2. You could go the MPI_Gather route: you first gather the (potential) logs from all ranks to rank 0 and print from there. This could be the more efficient way of doing it, but I'm not sure right now.

I'm sure @pramodk has many improvements over my ideas, but I think this will generally be a better approach than the method here. By the way, the method you use also makes perfect sense, and I've seen it (maybe implemented it, in a previous life) in client-server architectures.

Final question for discussion: would the more MPI-like way of doing it work with spdlog? I don't see immediately why not, but I haven't thought very hard about this.
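
To make option 1 concrete, here is a minimal sketch of rank-serialized printing, assuming every rank calls it with an already-formatted message (the name log_serialized and the "[rank N]" prefix are hypothetical, not part of this PR):

#include <iostream>
#include <string>
#include <mpi.h>

// Option 1: ranks take turns printing, separated by barriers,
// so their output does not interleave.
void log_serialized(const std::string& msg, MPI_Comm comm) {
    int rank = 0;
    int size = 0;
    MPI_Comm_rank(comm, &rank);
    MPI_Comm_size(comm, &size);
    for (int r = 0; r < size; ++r) {
        if (r == rank) {
            std::cout << "[rank " << rank << "] " << msg << std::endl;
        }
        MPI_Barrier(comm);  // everyone waits until rank r has printed
    }
}

A rank mask would just be an extra predicate before the print; the MPI_Gather route would instead collect the strings on rank 0 (e.g. with MPI_Gatherv, since message lengths differ) and print them there.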

@pramodk pramodk force-pushed the phase1 branch 4 times, most recently from a68d38e to 1e89fc5, on August 9, 2020 11:13
Base automatically changed from phase1 to master August 9, 2020 21:48
@alkino
Member Author

alkino commented Aug 31, 2020

We should define this more and rewrite it.

@alkino alkino closed this Aug 31, 2020
@alkino alkino deleted the mpi_logger branch August 31, 2020 07:48

4 participants