map by partition: dask and pyspark inconsistency #470
pavlis started this conversation in Design & Development
Replies: 1 comment 3 replies
-
hmmm... this is tricky, and I was not aware of the currying concept before. I wonder if it is possible to create another function wrapper like this:
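Something along these lines might do it: the wrapper captures the extra arguments in a closure and hands mapPartitions the one-argument function it expects. This is only a sketch; myfunction, db, and the keyword argument are placeholders:

```python
def wrap_for_partitions(func, *args, **kwargs):
    # Capture the extra arguments in a closure so the returned callable
    # takes only the partition iterator, which is all mapPartitions accepts.
    def run_on_partition(iterator):
        return func(iterator, *args, **kwargs)
    return run_on_partition

# usage sketch:
myrdd = myrdd.mapPartitions(wrap_for_partitions(myfunction, db, collection="wf_TimeSeries"))
```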
-
This topic comes up from a frustrating experience I've had working on a revision to the write_distributed_data function. It was frustrating because of a fundamental difficulty in debugging the parallel code involved in any lazy/delayed computation, and the problem was hidden for a long time by misleading errors from pyspark. What I uncovered is a disconnect in pyspark that is a legacy of the same issue we found long ago requires a different usage of map operator calls in dask and spark. Consider the example of how we run Database.save_data in a map operator: in dask you pass the function to map directly, while in pyspark you need to use a lambda function.
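As a reminder, the two idioms look roughly like this (the container and database handle names are placeholders, and the collection argument is shown only for illustration):

```python
# dask: the bag map method accepts the function plus its extra arguments.
mybag = mybag.map(db.save_data, collection="wf_TimeSeries")

# pyspark: the RDD map method takes only a function object, so a lambda is
# used to bind the Database handle and any optional arguments.
myrdd = myrdd.map(lambda d: db.save_data(d, collection="wf_TimeSeries"))
```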
The reason the lambda is necessary is that pyspark is the python translation of spark, whose native language is scala. I don't know scala, but it is a "functional programming" language, which is why the map method of an RDD, used above, takes a function object as its only required argument. We use the lambda as a convenient way to adapt a regular python function whose arg0 is expected to handle a single datum from the container.
That was a review for all MsPASS developers, but it is needed to understand a related operator found in both dask and pyspark. In dask it is called map_partitions and in pyspark it is called mapPartitions. The documentation for both dask and spark states that one should use the partitioned form when the function carries an expensive overhead, because the cost can then be paid once per partition instead of once per datum. I realized, correctly I am certain, that this was the right approach to use in write_distributed_data: it allows the MongoDB insert_one transactions to be replaced with far fewer calls to insert_many, where the "many" is defined by the number of data in each partition. You can look at the current implementation in the (presently almost complete) branch with the (bad) name "cleanup_database". A key point is that the approach works fine in dask; I have a working version of it there. The dask implementation was straightforward because the api for map_partitions is pythonic and simply allows regular and keyword (kwargs) arguments. The only oddity I had to get around is that arg0 of the function received by dask's map_partitions is not a single datum, as it is for a map operator, but an "iterator" best thought of as a pointer into the bag container. The only real complexity is that the iterator can be traversed only once inside the function passed to map_partitions.
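Reduced to a sketch, the dask pattern looks roughly like the following. This is not the actual write_distributed_data code; the helper and variable names are hypothetical, but it shows the partition iterator arriving as arg0, the extra arguments passing through normally, and one insert_many call per partition:

```python
def save_partition(iterator, db, collection="wf_TimeSeries"):
    # The iterator can be traversed only once, so build the document list
    # in a single pass over the partition (doc_from_datum is hypothetical).
    doclist = [doc_from_datum(d) for d in iterator]
    if doclist:
        db[collection].insert_many(doclist)  # one insert_many per partition
    return doclist

# dask passes the extra arguments through to the function directly:
mybag = mybag.map_partitions(save_partition, db, collection="wf_TimeSeries")
```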
Now to the problem I need help with, which we should preserve for the record. The documentation for spark's mapPartitions method of RDD can be found here. The key point is that, like map, mapPartitions requires only one argument: the function to be run "by partition". That is consistent with pyspark being derived from a scala api, where that kind of construct is the norm; in python it is an oddity that clashes with normal use of the language. The problem this causes is that, as far as I can tell, a lambda function cannot be used to get around it. That is, a lambda construct like the one used with map above does not work, and I see no way to do something like handing the lambda argument d to myfunction as arg0. The reason is that "myfunction" needs an "iterable" for arg0, and what you get from a normal lambda like the one above is not an iterable. The second form might be possible if there were a way to make d be an iterable derived from the partitioning and the container, but I haven't seen anything on the web about how to do that.
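To be concrete about the constraint, the only thing mapPartitions will accept is a single callable that consumes the partition iterator and returns an iterable, roughly like this sketch (function name hypothetical):

```python
def process_partition(iterator):
    # iterator yields the data in one partition and can be traversed once;
    # the return value must itself be an iterable.
    return [d for d in iterator]

# mapPartitions takes this one-argument callable and nothing else:
myrdd2 = myrdd.mapPartitions(process_partition)
```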
The main solution I've seen described on the web is to use "currying", which was a new term to me. Here is one of many pages you can find on this topic for python programming. Currying is probably the solution to this problem, but I haven't worked it out yet; I just have to find the right incantation for this particular example. If I get it done I'll post it to this page. It does, however, define a more general problem: I'm not sure there is a generic way to make a dask map_partitions call equivalent to a spark mapPartitions call sequence.
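One candidate incantation, which I have not tested yet, would use functools.partial to bind everything except the partition iterator (myfunction, db, and the collection kwarg are placeholders):

```python
from functools import partial

# partial binds db and collection, leaving the partition iterator as the one
# remaining positional argument that mapPartitions will supply.
writer = partial(myfunction, db=db, collection="wf_TimeSeries")
myrdd = myrdd.mapPartitions(writer)
```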
Do any of you have any experience or insight on this problem?