Finding performance bottlenecks

sam_stickland

I've started this thread to find (and hopefully share) techniques to find performance bottlenecks in computer setups.

For example, just now I kicked off a render in Premiere Pro using the MainConcept CUDA H264 encoder. All six of my cores are at 60%, my disks are running at 25Mbps, my physical memory utilisation is 60% and my graphics card load is 20%. So where on earth is the bottleneck? None of these components seem particularly stressed, and certainly not max'ed out.

In this instance I suspect it could be the following:

The CUDA-enabled effects / encoder aren't using all the shaders on the graphics card, and those that are being used are max'ed out.
The CUDA-enabled effects / encoder are causing a lot of memory transfers to/from the graphics card, and those transfers are the bottleneck.

But these are both just guesses. Is there any way to measure this? I'm using a PC with an Nvidia GT460.

EDIT: I'd also checked to make sure that individual CPU cores weren't completely loaded.
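One way to get these numbers from the command line rather than a widget is to poll `nvidia-smi`, which ships with the Nvidia driver. This is a minimal sketch, assuming the driver exposes the utilisation counters for this card (some GeForce cards report a placeholder such as `[Not Supported]` instead). The helper names `parse_gpu_line` and `sample_gpu` are made up for illustration.

```python
import subprocess

# Fields requested from nvidia-smi: GPU load, memory-controller load,
# and memory in use / total (MiB).
FIELDS = ["utilization.gpu", "utilization.memory", "memory.used", "memory.total"]


def parse_gpu_line(line):
    """Parse one CSV line of nvidia-smi output into a dict.

    Values that are not plain integers (for example "[Not Supported]")
    are returned as None instead of raising.
    """
    values = [v.strip() for v in line.split(",")]
    result = {}
    for name, value in zip(FIELDS, values):
        result[name] = int(value) if value.isdigit() else None
    return result


def sample_gpu():
    """Run nvidia-smi once and return one dict per GPU."""
    cmd = [
        "nvidia-smi",
        "--query-gpu=" + ",".join(FIELDS),
        "--format=csv,noheader,nounits",
    ]
    out = subprocess.check_output(cmd, text=True)
    return [parse_gpu_line(l) for l in out.splitlines() if l.strip()]


if __name__ == "__main__":
    try:
        for gpu in sample_gpu():
            print(gpu)
    except (FileNotFoundError, subprocess.CalledProcessError) as exc:
        print("nvidia-smi not available:", exc)
```

Note that `utilization.gpu` is the share of time during which at least one kernel was running, not the share of shaders in use, so on its own it can't distinguish between the two guesses above. `utilization.memory` is the GPU's own memory-controller load, which is not the same thing as transfers across the bus to the card.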

Ralph_B

This is just a shot in the dark: maybe the MainConcept CUDA H264 encoder and the Mercury Playback Engine are competing for CUDA cores. Try rendering the same project without CUDA acceleration for the H264 encoding, and see whether that makes a difference.

sam_stickland

Interesting. Although my understanding is that CUDA applications allocate GPU memory themselves (i.e. it's not managed and moved out on a 'context' switch), so the switch between different CUDA tasks shouldn't be too heavy? My GPU's memory usage doesn't go above 7% either.

If memory serves, doing CPU-based H264 will be CPU bound - and slower. So all that proves is that CPU-based H264 encoding is less efficient?

Seems to me that with all these different cores and processing units we're waiting for the software to take better advantage of them. Especially since the trend is towards more cores/shaders rather than increasing Hz.

EDIT: I've just had a look and the CUDA memory model is more complex than I'd realised - Shared or Register memory would indeed need to be switched out.

duartix

I'm just curious, how do you measure the GPU use at 20%? Which software are you using?

sam_stickland

GPU-Z or "GPU Monitor" (which is a windows widget).

pdlumina

Random access plus read/write speeds, perhaps. This is relevant if your footage lives on the same disk as your operating system, where Premiere/MainConcept is installed, and where you are writing the resulting file.

If external drives: USB2 ports, LAN speed, etc

sam_stickland

@pdlumina Could be random access. It's definitely not read/write speeds though, as the recorded rates during the render are way below what I've benchmarked my drives at.

duartix

I very much doubt this is disk related. I'd put my money on memory bandwidth. Try under/overclocking it and see if the render times change.

sam_stickland

Well I'm actually just waiting on 16GB of 1600MHz RAM (an upgrade from the current 8GB), so it should be pretty easy to do a before and after test.

sam_stickland

Going from 8GB RAM @ 667MHz to 16GB @ 833MHz dropped the encoding time on a small render from 9:16 to 8:45 (MainConcept CUDA). The times are close enough (5.5%) that it makes me wish I'd taken measurements on the 8GB setup. GPU load 12%, and the GPU memory controller load was only 2%.
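For a rough sense of how much bandwidth that clock bump buys, here is a back-of-the-envelope sketch. It assumes the quoted figures are the DDR3 I/O clock (two transfers per clock) on a 64-bit channel. The channel count is left as a parameter because the thread doesn't say how the DIMMs are populated, and the function name is made up.

```python
def peak_bandwidth_gb_s(io_clock_mhz, channels=1, bus_bytes=8):
    """Theoretical peak DDR bandwidth in GB/s (10^9 bytes per second).

    Assumes double data rate: two transfers per I/O clock cycle,
    each moving bus_bytes bytes per channel.
    """
    transfers_per_second = io_clock_mhz * 1e6 * 2
    return transfers_per_second * bus_bytes * channels / 1e9


old = peak_bandwidth_gb_s(667)
new = peak_bandwidth_gb_s(833)
print(f"667MHz: {old:.1f} GB/s per channel")
print(f"833MHz: {new:.1f} GB/s per channel")
print(f"theoretical gain: {(new / old - 1) * 100:.0f}%")
```

On those assumptions the theoretical peak rises by roughly a quarter, against the 5.5% drop in render time. That would hint the render isn't purely bandwidth bound, though the two runs also differ in capacity and possibly in timings, so it's only a hint.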

Using the MainConcept H.264 CPU encoder (16GB @ 833MHz) was interesting. Overall CPU usage was 40%, no single core higher than 80%. GPU load 2%. Encoding time was 8:47. Almost exactly the same as the GPU encode, which is weird. I'm going to take this test render and strip all the effects and see if the render times are the same.

Does anybody know if it's possible to monitor the HyperTransport load on an AMD X6?

UPDATE: The H.264 encoder in PPro took 12:23 (16GB @ 833MHz) and pretty much kept all the CPU cores at 100%. So the H.264 encoder has clearly moved the bottleneck to a different part of the system.
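To keep the runs in this post comparable, the timings can be put on one scale. This is a small sketch using only the figures quoted above, with the 8GB CUDA run as the baseline. The labels and helper names are mine.

```python
def to_seconds(mmss):
    """Convert a 'minutes:seconds' string such as '9:16' to seconds."""
    minutes, seconds = mmss.split(":")
    return int(minutes) * 60 + int(seconds)


def percent_change(baseline, other):
    """Signed change of `other` relative to `baseline`, in percent.

    Negative means the render finished faster than the baseline.
    """
    base = to_seconds(baseline)
    return (to_seconds(other) - base) / base * 100


RUNS = [
    ("MainConcept CUDA, 8GB @ 667MHz", "9:16"),
    ("MainConcept CUDA, 16GB @ 833MHz", "8:45"),
    ("MainConcept CPU, 16GB @ 833MHz", "8:47"),
    ("PPro H.264, 16GB @ 833MHz", "12:23"),
]

baseline = RUNS[0][1]
for label, elapsed in RUNS:
    print(f"{label:34s} {elapsed:>5s} {percent_change(baseline, elapsed):+6.1f}%")
```

The two MainConcept runs on 16GB land within two seconds of each other, while the PPro encoder comes out roughly a third slower than the baseline.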

duartix

Your GPU does have a shotload of CUDA cores (336)...

It's possibly too optimistic to expect them all to be scheduled for H264 compression. Even 12%, looked at simplistically, is still about 40 cores devoted to the task!

Rendering is an almost perfect task for parallel processing. H264 encoding, however, is a very complex problem: some stages may indeed suit parallel processing (motion estimation and intra prediction), but others are loaded with data dependencies (compression). On top of that there is the scheduling issue, where the encoder has to ensure that all the data needed to complete a stage is available at the same time to keep it efficient.

Sincerely Sam, I believe that a 100% use of the GPU is unrealistic given the number of cores available.
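The point about data dependencies can be illustrated with Amdahl's law. That's my framing rather than anything measured in this thread, and the parallel fractions below are purely hypothetical, picked only to show the shape of the curve.

```python
CUDA_CORES = 336  # core count quoted above for this card


def amdahl_speedup(parallel_fraction, workers):
    """Upper bound on speedup when only part of a job can run in parallel."""
    serial_fraction = 1.0 - parallel_fraction
    return 1.0 / (serial_fraction + parallel_fraction / workers)


# 12% load, read naively as a share of the cores.
print("cores busy at 12% load:", round(0.12 * CUDA_CORES))

# Hypothetical parallel fractions, not measurements of any real encoder.
for fraction in (0.50, 0.90, 0.99):
    speedup = amdahl_speedup(fraction, CUDA_CORES)
    print(f"{fraction:.0%} parallel -> at most {speedup:.1f}x on {CUDA_CORES} cores")
```

Under this simple model, even a job that is 90% parallel gains less than 10x however many cores are available, which fits the idea that a fully loaded GPU is not a realistic expectation here.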

sam_stickland

@duartix Yeah, I get that, but does that mean then there isn't a single piece of my rig I can upgrade to get more performance?! ;) Later graphics cards are mostly adding more cores, not increasing the individual core speed.

The fact that the MainConcept encoding took almost exactly the same time on the CPU as the GPU (and wasn't CPU or GPU bound) suggests a limiting factor somewhere else. Probably memory bandwidth, I suspect, and there isn't much more to be done there, as I'm already running 833MHz SDRAM. My source media is on SSD btw.
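That reasoning can be written down as a simple check: if no monitored component is pegged, the limit is probably something that isn't being monitored. This is a sketch using the readings reported earlier in the thread. The 95% threshold and the function name are arbitrary choices of mine.

```python
SATURATION_THRESHOLD = 95.0  # percent; arbitrary cut-off for "pegged"


def saturated(readings, threshold=SATURATION_THRESHOLD):
    """Return the names of components whose load meets the threshold."""
    return [name for name, load in readings.items() if load >= threshold]


# Readings quoted in the thread (percent load). "cpu_busiest_core" uses the
# reported upper bound of 80% for any single core.
mainconcept_cpu = {"cpu_overall": 40, "cpu_busiest_core": 80, "gpu": 2}
mainconcept_cuda = {"gpu": 12, "gpu_memory_controller": 2}
ppro_h264 = {"cpu_overall": 100}

for label, readings in [
    ("MainConcept CPU", mainconcept_cpu),
    ("MainConcept CUDA", mainconcept_cuda),
    ("PPro H.264", ppro_h264),
]:
    hot = saturated(readings)
    verdict = ", ".join(hot) if hot else "nothing monitored is saturated"
    print(f"{label}: {verdict}")
```

Only the PPro encoder run flags anything (the CPU). For the two MainConcept runs the check comes back empty, which is exactly the puzzle: the limit sits somewhere these counters don't reach.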

Ralph_B

You could try overclocking the graphics card to wring a little more performance out of the system.

duartix

You are probably right and there is little to be done on the HW front, but don't lament your decision to upgrade to 16GB. I've got a feeling it's going to prove a major boon when using AE CS 6, and so far from what I've seen it might be the bigger benefit.

I look at renders as set and forget, so I do them mostly at night when the energy and time are cheaper... ;)

duartix

Sam, I've got an Intel i7 870 (8 virtual cores with HT) @ 2.93GHz fitted with 16GB of DDR3 @ 1600MHz (CL10). In some ways it might be faster than your rig, but my case is tiny and the best I could fit was a GeForce 430 GT, which is a shadow of a GPU when compared to yours. Being also curious as to how important the GPU is, if you can post some of the footage you are benchmarking, I wouldn't mind having a go at it. Better yet, if you would also share the project file, we could compare with/without filters and learn something about where the bottlenecks are. :)
