Jon Stewart on 15 Nov 2003 01:01:59 -0000 |
[Date Prev] [Date Next] [Thread Prev] [Thread Next] [Date Index] [Thread Index]
[ALACPP] boost::shared_ptr performance |
Don't know whether any of you saw this, but thought you'd find it mildly interesting. Jon ------- Forwarded Message To: boost-users@xxxxxxxxxxxxxxx From: "Matthew Wilson" <stlsoft@xxxxxxxxxxx> Date: Sat, 15 Nov 2003 11:22:46 +1100 Subject: [Boost-users] boost::shared_ptr performance [I posted this in boost.devel, but got no response. Presumably that was the wrong group. :)] Hi Boosters I'm just doing some research on reference-counting, and contrasting self-counting types with externally counted types, the latter effected with the rather excellent shared pointer technique. [As a side question, is it true that Nicolai Josuttis was the inventor of the technique?] Naturally, one of the shared pointer's I've used is boost::shared_ptr. The measurements have demonstrated some surprising performance characteristics for shared_ptr in particular, and I want to solicit the opinion of the expert's here before I make any bold claims, as some results are just too surprising for me to trust them. There are five types tested. 1. A self-counting type, using an ancient Synesis (my company) template ref-counting bolt-in class 2. A self-counting type, using a custom-written (for this test) template ref-counting bolt-in class 3. Externally counting, using boost::shared_ptr 4. Externally counting, using a custom-written (for this test) shared pointer type 5. Externally counting, using a custom-written (for this test) shared pointer type, which has a custom allocator for the shared count members There were three scenarios investigated, and the same algorithms were applied to the above five types A: Saving refs vector<smart-ptr> ptrs; loop (100,000 times) { smart-ptr sp1(new Thing); sp1 = sp1; smart-ptr p2; p2 = p1; smart-ptr p3(p1); smart-ptr p4(p3); smart-ptr p5(p4); smart-ptr p6(p5); process_Thing(p6.get()); // A function to do something (actually it just returns the result of a call on Thing) to prevent elision by optimisation ptrs.push_back(p5); } The whole is executed within a single thread, and the program is tested in both single and multithreaded forms. B: Discarding refs This is identical to A ("Saving refs") except that saving them to the vector of ptrs is omitted. Again, this is done in one thread, and single and multi-threaded forms tested C: Multi-threaded sharing 20 smart-pointers are created, containing 20 Thing instances, in the main thread. Then 20 threads, using the following thread-proc, are created, and executed. Two times are recorded for each scenario: the total time, and the sum of the individual thread-times (obtained by the WinSTL threadtimes_counter, based on the Win32 API function GetThreadTimes()) std::vector<ptr_type> ptrs(ITERATIONS / 10); for(size_t i = 0; i < 10; ++i) { for(size_t j = 0; j < ITERATIONS / 10; ++j) { ptrs.push_back(things[i % things.size()]); } } ITERATIONS is 100,000. things is the shared shared_ptr array The purpose of this one is to really exercise the sharing between threads. Results (all results are expressed in terms of the most costly pointer) ============================================= [Note, all were compiled with Intel 7.0 with very aggressive speed optimisations (-O2 -Ox -QaxMKW /Qvc7 -EHsc), and executed in a single session on an otherwise quiescent 512MB 2GHz XP box.] A (single-threaded) - -------------------- 1. Synesis self-counter: 65% 2. Custom self-counter: 58% 3. boost::shared_ptr: 100% 4. custom SharedPtr: 53% 5. custom SharedPtr + rc pool): 48% I was not surprised that the crusty old Synesis type was slower than the custom self-counter. As expected, the custom shared-ptr with rc pool was the fastest. I was a little surprised that the Boost shared_ptr was up to twice as slow. However, they're all within 100% of each other, so it's no big result I guess. A (multi-threaded) - -------------------- 1. Synesis self-counter: 90% 2. Custom self-counter: 46% 3. boost::shared_ptr: 100% 4. custom SharedPtr: 28% 5. custom SharedPtr + rc pool): 27% This was surprising. I expected the use of thread-safe operations (atomic increments, etc.) to flatten out the curve, but the reverse has happened. Interestingly, the two real-world types (Synesis::RefCounter and boost::shared_ptr) fair less well. Again, not what I expected. B (single-threaded) - -------------------- 1. Synesis self-counter: 65% 2. Custom self-counter: 55% 3. boost::shared_ptr: 100% 4. custom SharedPtr: 59% 5. custom SharedPtr + rc pool): 53% This scenario has a lot more reference-increments, and a lot less deallocation of the reference-counted instance. Nonetheless, the results are virtually identical, in relative terms (the actual times taken are about 50% across the board). B (multi-threaded) - -------------------- 1. Synesis self-counter: 71% 2. Custom self-counter: 41% 3. boost::shared_ptr: 100% 4. custom SharedPtr: 32% 5. custom SharedPtr + rc pool): 31% Pretty much the same as with B (single-threaded). The results are the same, in relative terms Up to this point, the results are mildly interesting, but nothing earth shattering. C - - (Times are total time, and thread times, respectively) 1. Synesis self-counter: 2.9% 39% 2. Custom self-counter: 2.7% 35% 3. boost::shared_ptr: 100% 100% 4. custom SharedPtr: 1.0% 13% 5. custom SharedPtr + rc pool): 1.2% 15% This scenario tells quite a different tail indeed. One thing to note is that because of the degree of reference-sharing between the ptr-storing vectors in the 20 threads, the ref-count-pool becomes a pessimisation as a result of the number of actual smart pointer instances being created exceeding the arbitrary 1024 cached ref-counts available. It's likely that this number would not be exceeded in real-world scenarios, so the pool is probably still worth having. Naturally, the very big question is: what's happening to boost::shared_ptr? In thread times, it performs 2-3 times worse than in single threaded scenarios (irrespective of whether they are built for single or multiple threads). However, in total time, it performs extremely poorly, up to 100 times slower than some of the others! Clearly, there is some serious contention involved. There are three possibilities I have thought of: (i) It performs very poorly in multi-threaded sharing scenarios. If this is the case, can any people more expert in Boost than me proffer an explanation as to precisely the problem, and is any such person motivated to try and address the issue? (ii) The particular sharing scenario is well away from any real-world scenario in which boost::shared_ptr is likely to be used. If this is the case, why do four different implementations, representing two very different reference-counting strategies, all maintain a consistent relative performance? What are the scenarios for which boost::shared_ptr is optimised? (iii) There are environment settings for which boost::shared_ptr will perform much better. If that's so, can someone help me out with suggestions of what defines to make? Why are these settings not the default. I'd be very interested in your thoughts. Many thanks in advance. - -- Matthew Wilson STLSoft moderator (http://www.stlsoft.org) Contributing editor, C/C++ Users Journal (www.synesis.com.au/articles.html#columns) "You can tell a Yorkshireman, but you can't tell him much!" -- Uncle Michael - ---------------------------------------------------------------------------- - --- _______________________________________________ Unsubscribe & other changes: http://lists.boost.org/mailman/listinfo.cgi/boost _______________________________________________ Boost-users mailing list Boost-users@xxxxxxxxxxxxxxx http://lists.boost.org/mailman/listinfo.cgi/boost-users ------- End of Forwarded Message -- Jon Stewart Advanced Los Angeles C++ stew1@xxxxxxxxxxx http://www.alacpp.org _______________________________________________ alacpp mailing list alacpp@xxxxxxxxxxx http://lists.ellipsis.cx/mailman/listinfo/alacpp