we've been considering some kind of MPI support on and off in the past (say around 2008), until it became clear that (a) no one in the team had real expertise with MPI, but mainly (b) we would have to make big sacrifices in the flexibility of the code (most importantly scriptability), and we decided to go with OpenMP for what is needed. Woo targets primarily smaller simulations, where raw performance is not the top priority (though we do care about it a great deal) -- researcher-friendliness comes first.
I can summarize the results of that discussion for you here; I hope it makes sense.
- Woo (back then still Yade, which I was working on) represents the simulation as a fairly complicated data structure in which objects hold references to each other via shared_ptr (e.g. particles hold pointers to their contacts and vice versa). This is very hard to do via MPI, since memory is not shared between processes.
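To illustrate the problem (a schematic Python sketch with hypothetical class names, not Woo's actual data structures): objects that point at each other directly work fine in shared memory, but to cross a process boundary every pointer first has to be flattened into an ID and rebuilt on the other side.

```python
# Schematic sketch (hypothetical classes, not Woo's API): particles and
# contacts reference each other directly, which is natural in shared memory
# but cannot be sent over MPI as-is -- pointers must become IDs.

class Particle:
    def __init__(self, pid):
        self.id = pid
        self.contacts = []          # direct references to Contact objects

class Contact:
    def __init__(self, pA, pB):
        self.pA, self.pB = pA, pB   # direct references back to the particles
        pA.contacts.append(self)
        pB.contacts.append(self)

    def flatten(self):
        # what an MPI send would have to do: replace pointers by IDs
        return {'pA': self.pA.id, 'pB': self.pB.id}

p1, p2 = Particle(1), Particle(2)
c = Contact(p1, p2)
assert c.flatten() == {'pA': 1, 'pB': 2}
# the receiving rank then needs a global id->particle lookup to rebuild the
# references, plus ghost copies of particles owned by other ranks
```

The flatten/rebuild step (and the ghost-particle bookkeeping it implies) is exactly the machinery Woo does not have and would have to grow everywhere.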
- Python is used for scripting Woo and accessing its internals (its refcounting works very nicely with shared_ptr, but that is perhaps only a convenience). As far as I know, integrating Python with MPI codes is rather difficult. ESys-Particle has done that, but their level of access to internals is nowhere near what is possible with Woo.
- More or less everything would need a major rewrite to support MPI, so that non-local objects are handled correctly and efficiently. Ohloh.net estimates Woo at 18 person-years of effort; rewriting a large part of that is not doable by August.
- DEM usually computes in tiny timesteps and needs very many of them (in contrast to e.g. FEM codes, which sometimes need a small number of steps, each taking a long time); this is a big disadvantage for any message passing, since the per-step overhead becomes important. All nodes have to be synced twice in every step, and communication quickly becomes a bottleneck -- especially considering that serializing objects with complex pointer structures takes time in itself.
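A back-of-envelope estimate shows why per-step latency dominates when steps are tiny (the numbers below are illustrative assumptions, not measurements of Woo or any MPI implementation):

```python
# Illustrative numbers (assumptions, not measurements): a DEM step doing
# 200 us of useful work, but forced to synchronize all ranks twice per step
# at 50 us of latency per global sync.
compute_per_step = 200e-6      # s of useful work per step
sync_latency     = 50e-6       # s per global synchronization
syncs_per_step   = 2           # sync twice in every step
steps            = 1_000_000   # DEM needs very many tiny steps

comm = steps * syncs_per_step * sync_latency   # 100 s total
work = steps * compute_per_step                # 200 s total
overhead = comm / (comm + work)
print(f"communication overhead: {overhead:.0%}")   # → 33% with these numbers
```

And this is before counting the cost of serializing the pointer-heavy objects themselves; an FEM-style code with few, long steps amortizes the same latency into insignificance.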
Your effort in this direction is appreciated, and maybe you have some ideas on how to proceed. By all means try it -- I will support you with all the knowledge I have -- but I don't see a straightforward path.
What I think would be much more beneficial (and actually doable) is to make Woo run on Xeon Phi -- compile the libraries (incl. Python) and Woo itself for the k1om arch and then run it on a card with 56 cores. Compilation with icc works (I tried), but I did not get far with the rest for lack of time.
In general, every simulation can have different bottlenecks. You can use the woo.timing module for a quick check of what is happening, and then maybe use more detailed timers for loops where you suspect possible savings. Typical simulations have everything feasible parallelized (integrator, collision detection, contact resolution). OpenMP applies only to the C++ side; there is no computation running in Python at all (or there should not be -- I've seen someone from your institute computing buoyancy in Python at every step, and that is a performance killer, making the simulation run maybe 1000x slower, as a rough guess), and Python knows nothing about the parallelization (and should not).
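To illustrate that anti-pattern (a hypothetical sketch, not actual Woo engine code): the physics itself is trivial Archimedes, the cost comes entirely from crossing the Python boundary for every particle at every step.

```python
# Hypothetical sketch of the anti-pattern: applying buoyancy from a Python
# callback at every step. The formula is just F = rho_fluid * g * V, but
# evaluating it in Python per particle per step dominates the runtime;
# the same force applied inside a C++ engine is essentially free.
import math

RHO_FLUID = 1000.0   # kg/m^3 (water; illustrative)
G = 9.81             # m/s^2

def buoyancy_force(radius):
    """Upward buoyancy force [N] on a fully submerged sphere."""
    volume = 4/3 * math.pi * radius**3
    return RHO_FLUID * G * volume

# what the per-step Python callback ends up doing, N particles x M steps:
forces = [buoyancy_force(r) for r in (0.01, 0.02)]
```

The fix is not to optimize this loop but to move the force application into C++ entirely, so Python is only used to set the simulation up.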
I added a section (just a draft) on how to measure performance; check it out at https://woodem.org/prog/performance.html .
Hope this helps.