A Laptop GPU thermal lesson learned the hard way: Never rely on thermal throttling
Alternative title: The importance of undervolting for making a laptop GPU last
There is a misconception out there about GPU thermal throttling. The very existence of "thermal throttling" gives the impression that the hardware has been adequately tuned to protect itself from thermal damage.
Nothing could be further from the truth. Here is how I found that out the hard way.
I bought a Clevo P750TM1-G under the Sager brand in 2017. It came with the i7-8700K and the mobile GTX 1070. A year after purchase, when the factory thermal paste had all dried out, I installed The Witcher 3 and decided to see how far I could push this card. TW3, as you may already know, is a GPU-bound game. This was when the first signs of trouble began to appear.
At this point, I trusted the GPU to clock down automatically when it hit a safe thermal limit and then hold that reduced frequency. It didn't go like that. I turned on supersampled resolution for The Witcher 3 and saw the game hover around 30-40fps, which was still adequately playable since this laptop came with a G-SYNC monitor. However, MSI Afterburner was reading the GPU temperature at 95°C, and the whole laptop shut down after a few minutes at that setting.
Two years later, the laptop had been repasted several times. I hadn't played anything as heavy as supersampled The Witcher 3 until then, when I played Horizon: Zero Dawn on it. This game put a lot of stress on my laptop 1070: 100% GPU load to output ~40fps at native 1080p, with the temperature running hot at ~93°C. The Dynamic Resolution option in the game settings didn't help much. It still pushed the card to the same temperature, only now at 55-60fps.
This time the signs of trouble manifested differently: when I climbed too high up a mountain, the game started... looking funny. Some mountain surfaces began to look like film negatives of themselves, the whole screen brightened to an unnatural level, and then the game crashed. It didn't even occur to me that this was thermal damage to the hardware, because Horizon: Zero Dawn had a reputation for being a rough PC port. I didn't realize at the time that the game crashed because the GPU lost all signal to the rest of the laptop for a few seconds. The mountain area itself had no particular issue. The GPU was simply at the stage where running past a certain load threshold made it lose all signal, and the mountain area happened to put a lot of load on the card.
The clue that led me to this realization was Guild Wars 2. Being an MMORPG, it has a lot of areas where there are many things to render on screen. It too crashed when it put a heavy load on this GPU, but its crash message told me I was dealing with hardware damage: DXGI_ERROR_DEVICE_REMOVED. It was too late at that point. The load this GPU could sustain without crashing kept dropping until even lightly loaded games were affected. I brought the laptop to repair technicians, and they told me their hands were tied with this level of hardware damage. The GPU has now degraded to the point where I had to boot into Safe Mode, disable the GPU in Device Manager and uninstall the Nvidia driver in order to keep using the laptop. My laptop GTX 1070 is now effectively dead.
So, observations:
- This laptop GPU was never overclocked. It ran at stock frequency or lower during its four-year lifespan. Thermal damage can still happen even at stock clocks.
- It doesn't clock down enough to protect itself from heat damage, yet it pulls more power than necessary. In some games, it ran at 80% load at 0.9V. In others, it ran at 1.063V (the voltage limit) while only at 30% load. I didn't see this as an issue until I noticed that it was the sustained heat that was degrading my GPU. Obviously, higher voltage generates more heat.
- Software can legitimately damage your GPU if it makes the GPU run at a high enough sustained load. Sustained load draws sustained power, and sustained power generates sustained heat.
- Sustained heat around 90°C is still really dangerous, even though the lowest melting point of anything on the board is technically far higher than that and the GPU itself hard throttles at 100°C. If MSI Afterburner is reading 80+°C, lower the load, repaste and undervolt. What MSI Afterburner reads out is never the hottest point on your GPU.
- Undervolting is wondrous. For a while, undervolting actually helped by putting a hard limit on the voltage my GPU sustained while running Horizon: Zero Dawn. Since hardware thermal throttling was so unreliable, I treated undervolting as a manual thermal throttling method. If I had known the consequences of sustained heat four years ago, I would have undervolted my laptop 1070 out of the box.
- Everything I said about unreliable thermal throttling may very well be an Nvidia-specific issue. I don't have a laptop AMD GPU to test, but seeing how AMD has proven more power efficient than its competitors in both the CPU and GPU markets, I'm very curious to see how AMD cards manage power draw.
- On a broader note, I'm getting very serious about power efficiency in laptop hardware after this dead GPU experience, yet I'm seeing very worrying signs in the gaming laptop market. We are physically limited by our form factors, yet I see Intel and Nvidia continue to manufacture increasingly power-hungry parts in their latest generations of hardware. Power efficiency is not just about our electricity bills. It determines how well your parts can perform at a wattage that doesn't slowly kill the parts themselves. Is this going to be the future of gaming laptops? Everything just draws more power, runs hotter and dies faster?
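Two of the observations above can be put into rough numbers. What follows is a minimal sketch, not a vetted tool. It assumes the common first-order approximation that dynamic power scales with voltage squared at a fixed clock (P ~ C * V^2 * f), and it assumes an Nvidia driver install that ships the `nvidia-smi` command-line tool. The function names, the 30-sample window and the 2-second polling interval are made up for illustration, and the 80°C threshold is simply the rule of thumb from the list above, not a vendor specification.

```python
import subprocess
import time
from collections import deque

WARN_TEMP_C = 80  # assumption: the post's rule of thumb, not a vendor spec
QUERY = "temperature.gpu,power.draw,clocks.gr"  # nvidia-smi --query-gpu fields


def relative_dynamic_power(v_new, v_old):
    """Approximate dynamic power ratio at the same clock, from P ~ C * V^2 * f."""
    return (v_new / v_old) ** 2


def parse_smi_line(line):
    """Parse one CSV line like '93, 110.52, 1657' into (temp_c, watts, mhz).

    Some laptop GPUs report power as '[N/A]'; that becomes None.
    """
    temp, watts, mhz = (field.strip() for field in line.split(","))
    try:
        watts_value = float(watts)
    except ValueError:
        watts_value = None
    return int(temp), watts_value, int(mhz)


def sustained_hot(temps, threshold=WARN_TEMP_C, window=30):
    """True if the last `window` temperature samples are all at or above threshold."""
    recent = list(temps)[-window:]
    return len(recent) == window and all(t >= threshold for t in recent)


def read_gpu():
    """Ask nvidia-smi for one reading from the first GPU it lists."""
    out = subprocess.run(
        ["nvidia-smi", f"--query-gpu={QUERY}", "--format=csv,noheader,nounits"],
        capture_output=True, text=True, check=True,
    ).stdout
    return parse_smi_line(out.splitlines()[0])


def watch(interval_s=2.0, window=30):
    """Poll until interrupted, warning when heat has been sustained, not just spiking."""
    temps = deque(maxlen=window)
    while True:
        temp_c, watts, mhz = read_gpu()
        temps.append(temp_c)
        print(f"{temp_c} C, {watts} W, {mhz} MHz")
        if sustained_hot(temps, window=window):
            print("Sustained heat: lower the load, repaste or undervolt.")
        time.sleep(interval_s)
```

Under that approximation, `relative_dynamic_power(0.9, 1.063)` comes out to about 0.72, so holding the card at 0.9V instead of the 1.063V limit would cut dynamic power by roughly 28% at the same clock. Calling `watch()` while a game runs logs temperature, power and clock every couple of seconds. The temperature it reports is a single sensor reading, so the caveat about MSI Afterburner not showing the hottest point on the GPU likely applies here too.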