
Feasibility of adding MPI support?

0 votes
46 views
asked Jul 5, 2016 by Nabil (210 points)
Greetings.

The team I am working for is very interested in running their Woo programs on a cluster to reduce simulation times, so I was tasked with laying the groundwork for an MPICH implementation. After spending a number of days browsing your Python and C++ code (which I am very comfortable with), I am wondering whether this is a feasible long-term project (my internship ends in mid-August, and someone is expected to take over after me), and whether you have any suggestions as to where I could start (which algorithms you have already parallelized, time-critical for-loops, etc.)?

If I understand your code correctly: do you invoke OpenMP from both Python and C++, or only through the Boost.Python wrappers (I am not too familiar with this)?

Please let me know if you need clarification on anything since "MPI" is a pretty large topic :P

Thanks!!

1 Answer

+1 vote
answered Jul 5, 2016 by eudoxos (44,890 points)
selected Jul 29, 2016 by eudoxos
 
Best answer

Hi Nabil,

we've been considering some kind of MPI support on and off in the past (around 2008), until it became clear that (a) no one on the team had real expertise with MPI, but mainly (b) we would have to sacrifice a lot of the code's flexibility (most importantly scriptability); so we went with OpenMP for what is needed. Woo primarily targets smaller simulations, where raw performance is not the highest priority (though we do care about it a great deal) -- researcher-friendliness comes first.

Let me summarize the results of that discussion for you here; I hope it will make sense.

  1. Woo (back then still Yade, which I was working on) represents the simulation as a rather complicated data structure in which objects hold references to each other via smart pointers (e.g. particles hold pointers to their contacts and vice versa). This is very hard to distribute via MPI, since memory is not shared (see the sketch after this list).
  2. Python is used for scripting Woo and accessing its internals (its refcounting works very nicely with shared_ptr, but that is perhaps only a convenience). As far as I know, integrating Python with MPI codes is rather difficult. ESys-Particle has done it, but their level of access to the internals is nowhere near what is possible with Woo.
  3. More or less everything would need a major rewrite to support MPI, so that non-local objects are handled correctly and efficiently. Ohloh.net estimates Woo at 18 person-years of effort; rewriting a large part of that is not doable by August.
  4. DEM usually does little computation per timestep but needs very many of them (in contrast to e.g. FEM codes, which sometimes need few steps, each taking a long time); this is a big disadvantage for any message passing, since the per-message overhead becomes significant. All nodes have to be synced twice in every step, and communication quickly becomes a bottleneck -- especially considering that serializing objects with complex pointer structures takes time in itself.
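
To make point 1 concrete, here is a minimal Python sketch (hypothetical Particle/Contact classes, not Woo's actual API) of why such a mutually-referencing object graph is awkward to split across MPI ranks: any contact whose particles end up on different ranks needs a synchronized ghost copy on one side. It also hints at point 4: with illustrative numbers of, say, 100 µs of computation per step and 10 µs of latency per sync, the two syncs per step already eat 20% of wall time before any data is transferred.

    # Hypothetical sketch (not Woo's actual classes): a mutually-referencing
    # object graph and what happens when it is cut across MPI ranks.

    class Particle:
        def __init__(self, pid):
            self.pid = pid
            self.contacts = []          # particle -> its contacts

    class Contact:
        def __init__(self, a, b):
            self.a, self.b = a, b       # contact -> both particles (back-references)
            a.contacts.append(self)
            b.contacts.append(self)

    # ten particles in a row, contacts between neighbours
    parts = [Particle(i) for i in range(10)]
    conts = [Contact(parts[i], parts[i+1]) for i in range(9)]

    # naive spatial decomposition: first half on rank 0, the rest on rank 1
    rank0 = set(parts[:5])

    # contacts whose particles live on different ranks straddle the cut;
    # one side must keep a synchronized "ghost" copy of the other particle,
    # updated (at least) twice per timestep
    straddling = [c for c in conts if (c.a in rank0) != (c.b in rank0)]
    print(f'{len(straddling)} of {len(conts)} contacts cross the rank boundary')

In a real 3D packing, the cut surface (and with it the number of straddling contacts to keep in sync) grows as the domain is split over more ranks, which is exactly where the communication bottleneck comes from.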

Your effort in this direction is appreciated, and maybe you have some ideas on how to proceed. By all means try it -- I will support you with all the knowledge I have -- but I don't see a straightforward way forward.

What I think would be much more beneficial (and actually doable) is making Woo run on a Xeon Phi -- compiling the libraries (incl. Python) and Woo itself for the k1om arch and then running it on a card with 56 cores. The compilation with icc works (I tried), but I did not get far with the rest for lack of time.

In general, every simulation has different bottlenecks. You can use the woo.timing module for a quick check of where time is spent, and then add more detailed timers to loops where you suspect possible savings. In typical simulations, everything feasible is already parallelized (integrator, collision detection, contact resolution). OpenMP applies only to the C++ side; no computation runs in Python at all (or it should not -- I've seen someone from your institute computing buoyancy in Python at every step, which is a performance killer that makes the simulation run maybe 1000x slower, as a rough guess), and Python knows nothing about the parallelization (and should not).
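
As a concrete starting point, something like this gives the per-engine breakdown (I am writing the names from memory -- the performance page linked below has the exact API):

    import woo, woo.timing

    woo.master.timingEnabled = True    # collect per-engine execution times
    S = woo.master.scene
    for i in range(2000):
        S.one()                        # advance the simulation one step
    woo.timing.stats()                 # print how much time each engine consumed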

I added a section (just a draft) on how to measure performance, check it out at https://woodem.org/prog/performance.html .

Hope this helps.

Cheers, Vaclav

commented Jul 7, 2016 by Nabil (210 points)

Thanks for the detailed response!
I used the timing module and was very surprised to see that the VTK export takes 35-60% of the total running time (probably due to slow hard-drive writes and/or data serialization?). Anyway, it's definitely something I can use to diagnose possible hitches in their simulations!
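
My first instinct is to simply make the export run less often; something like this (I am guessing that the exporter in our scene is woo.dem.VtkExport and that, being a periodic engine, it has a stepPeriod -- I still need to verify both):

    import woo, woo.dem

    S = woo.master.scene
    for e in S.engines:
        if isinstance(e, woo.dem.VtkExport):
            e.stepPeriod *= 10    # export 10x less often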

I really like the Xeon Phi idea, actually. Simply running Woo on something with a TON of cores was something I considered earlier, but I forgot about it once I got bogged down with the whole MPI thing. Now, getting all the required libraries to compile with icc is another matter entirely (something to keep me busy) :D

(I'll probably be back with icc questions sometime in the near future)
Until then!

Cheers,
-Nabil

commented Jul 19, 2016 by eudoxos (44,890 points)

BTW, before you go to ICC, have a look here for the necessary compilation options and one outstanding issue (it might be gone by now; based on the debugging I tried back then, I am 70% certain it is/was not a bug in our code): https://github.com/woodem/woo/issues/2

...