VOGONS


NV bumpgate lead-free solder debacle


First post, by mockingbird

Rank: Oldbie
Scali wrote:

Indeed, I'm surprised that anyone does NOT know this.
The 8800 series was THE DX10-hardware. Also the hardware that kicked off the GPGPU revolution by introducing Cuda, which later led to OpenCL and DirectCompute (both of which the original 8800 series support).
They also introduced physics acceleration on GPUs with PhysX.
Note also that they were actually launched even before Vista/DX10.
8800, like the R300, is one of the big milestones in the history of GPUs.

Show me an 8800 still in operation today... The nVidia chips were rushed to the market and they had defects in their engineering. The 8800 was the "milestone" of the downfall of many a videocard company who went bankrupt trying to satisfy the near 100% RMA rates on these things. nVidia also practically destroyed the 3rd-party motherboard chipset industry with near 100% RMA rates on laptops and motherboards that used their northbridges (which I'm sure played a role in AMD/Intel refusing to license chipset production to 3rd parties).

My 2900XT OTOH is still fully operational. It's sitting in a drawer right now, but it's seen quite a bit of use and I'm sure it will outlast any 8800 out there.


Reply 1 of 56, by PhilsComputerLab

Rank: l33t++

I've got quite a few 8800 type cards. 8800 GT, 9600GT, 9800 GTS, GTX+. They all work fine.

I do remember an issue with notebook chips, but I believe it was the 7 series that was affected. But I'm not 100% sure.


Reply 2 of 56, by Scali

Rank: l33t
mockingbird wrote:

Show me an 8800 still in operation today...

I have an 8800GTX that still works.

mockingbird wrote:

The nVidia chips were rushed to the market and they had defects in their engineering.

Incorrect.
The problem was that new RoHS regulations no longer allowed lead-based solder (see https://en.wikipedia.org/wiki/Soldering#Lead- … ronic_soldering), and not all lead-free replacements were as reliable. The lead made the solder somewhat elastic, which means it could absorb some of the expansion and shrinking that occurs when the chips heat up and cool down. Some of the new solders would crack under those conditions. The Xbox 360 RRoD was caused by the same issue, and AMD cards suffered from it as well, although they generally had smaller GPUs, which didn't suffer as much from changes in temperature. The problem is, you don't really know if it's going to crack until it's been stress-tested for quite a while.
And it doesn't happen on all cards, such as my 8800GTX.
So the problem is not in the chips (and these cards can often be fixed by 'reflowing': heating up the solder so it reseats itself and the cracks are filled).
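
To put rough numbers on that expansion/shrink cycle, here is a minimal back-of-the-envelope sketch. The CTE figures are generic textbook values and the package geometry is assumed, not G80-specific data:

```python
# Rough illustration of why heat/cool cycles crack solder joints.
# All numbers below are generic textbook values (assumed), not measurements
# of any specific nVidia package.

CTE_SILICON_DIE = 2.6e-6   # 1/degC, thermal expansion of a silicon die
CTE_FR4_BOARD   = 16e-6    # 1/degC, typical in-plane value for an FR-4 PCB
DELTA_T         = 60       # degC swing, e.g. ~40 degC idle to ~100 degC under load
DIST_TO_CORNER  = 10e-3    # m, distance from package centre to a corner joint (assumed)
JOINT_HEIGHT    = 0.4e-3   # m, solder joint stand-off height (assumed)

# The chip and the board expand by different amounts over the same temperature
# swing; the outermost joints have to absorb that difference as shear.
mismatch = (CTE_FR4_BOARD - CTE_SILICON_DIE) * DELTA_T * DIST_TO_CORNER
shear_strain = mismatch / JOINT_HEIGHT

print(f"corner displacement ~{mismatch * 1e6:.1f} um, shear strain ~{shear_strain:.1%}")
# Prints roughly: corner displacement ~8.0 um, shear strain ~2.0%
# Ductile leaded solder rides this out for many more cycles than the early
# lead-free alloys did, which is the cracking mechanism described above.
```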

Anyway, all that is completely beside the point that the GPU itself was nothing short of groundbreaking, as was the R300 a few years earlier. Even today, GPUs from AMD and nVidia still retain many of the architectural features first introduced in the 8800.

http://scalibq.wordpress.com/just-keeping-it- … ro-programming/

Reply 3 of 56, by candle_86

Rank: l33t
mockingbird wrote:
Scali wrote:

Indeed, I'm surprised that anyone does NOT know this.
The 8800 series was THE DX10-hardware. Also the hardware that kicked off the GPGPU revolution by introducing Cuda, which later led to OpenCL and DirectCompute (both of which the original 8800 series support).
They also introduced physics acceleration on GPUs with PhysX.
Note also that they were actually launched even before Vista/DX10.
8800, like the R300, is one of the big milestones in the history of GPUs.

Show me an 8800 still in operation today... The nVidia chips were rushed to the market and they had defects in their engineering. The 8800 was the "milestone" of the downfall of many a videocard company who went bankrupt trying to satisfy the near 100% RMA rates on these things. nVidia also practically destroyed the 3rd-party motherboard chipset industry with near 100% RMA rates on laptops and motherboards that used their northbridges (which I'm sure played a role in AMD/Intel refusing to license chipset production to 3rd parties).

My 2900XT OTOH is still fully operational. It's sitting in a drawer right now, but it's seen quite a bit of use and I'm sure it will outlast any 8800 out there.

I have an 8800GTS 320 that's still fully functional.

Reply 4 of 56, by mockingbird

Rank: Oldbie
Scali wrote:

Incorrect.

Incorrect.

The nVidia failures after the GeForce FX series (the GeForce FX series itself was not affected) were caused by bad engineering. There were several problems:

1) Incorrect underfill. The underfill is the epoxy-like substance that sits around the chip die. It's designed to soften and harden depending on the temperature. nVidia used an inferior underfill that was not meant for the kind of temperatures their new dies were putting out, and when the chips got really hot, the underfill did not soften enough to allow the die to 'float'. This led to catastrophic mechanical failures within the chip.

2) Poorly engineered heat dissipation: The chip die has something called 'bumps' which are meant to be strategically placed on the die where there are areas that produce a lot of heat. The bumps make contact with the substrate which in turn makes contact with the heatsink. On a properly designed chip, the heat then dissipates properly and it doesn't overheat. You can type 'nvidia bumpgate' in a search engine. It's been thoroughly discussed.

3) They used the wrong alloy for the bumps. nVidia used high-lead bumps which are more sensitive to thermal stressing than eutectic bumps.

And as for lead-free solder - The ATI 2900XT also used lead-free solder. Not that I'm saying that lead-free solder isn't an enormous problem, but nVidia's problems were much, much bigger.

nVidia would keep blundering all the way up until the Fermi. And for every one of your *functional* G8x/G9x cards, there are literally millions sitting in landfills. You can get away with having one in your system if you give it only very light use and keep it cool.


Reply 5 of 56, by obobskivich

Rank: l33t
philscomputerlab wrote:

I've got quite a few 8800 type cards. 8800 GT, 9600GT, 9800 GTS, GTX+. They all work fine.

I do remember an issue with notebook chips, but I believe it was the 7 series that was affected. But I'm not 100% sure.

I think you're right on the GeForce 7 (maybe others?) in mobile computers/devices being a problem too. I remember hearing about problems with Apple laptops that had nVidia chips at least. 😊

Also, because PhysX came up (how this relates to NV3x I haven't the foggiest): GeForce 8 did not introduce PhysX-on-GPU; Ageia wasn't even part of nVidia until 2008. PhysX-on-GPU was launched on the GTX 280/260 and 9800GTX/GTX+ (w/driver 177.39), and later extended to include GeForce 8 and other cards (starting with 177.79) later in the year. In 2006-7, Ageia was still trying to sell the Physx PPU, the Ageia P1, as a PCI (and for OEMs, PCIe) add-in card. Very few games supported it (I think the final count stands at around 12). Many more games today support GPU PhysX, of course.

mockingbird wrote:

The nVidia failures after the GeForce FX series (the GeForce FX series itself was not affected) were caused by bad engineering.

Curiosity: is NV40 affected by this? From what I've read the issues started with later production nodes (like 65nm and on), so I'm assuming the 130nm NV40 also predates it, but I could be mistaken (my other reason for this hunch is that 6800U and GT report similarly high throttle temperatures (>110°C) as their FX brethren).

Reply 6 of 56, by Scali

Rank: l33t
mockingbird wrote:
Scali wrote:

Incorrect.

Incorrect.

The nVidia failures after the GeForce FX series (the GeForce FX series itself was not affected) were caused by bad engineering. There were several problems:

1) Incorrect underfill. The underfill is the epoxy-like substance that sits around the chip die. It's designed to soften and harden depending on the temperature. nVidia used an inferior underfill that was not meant for the kind of temperatures their new dies were putting out, and when the chips got really hot, the underfill did not soften enough to allow the die to 'float'. This led to catastrophic mechanical failures within the chip.

2) Poorly engineered heat dissipation: The chip die has something called 'bumps' which are meant to be strategically placed on the die where there are areas that produce a lot of heat. The bumps make contact with the substrate which in turn makes contact with the heatsink. On a properly designed chip, the heat then dissipates properly and it doesn't overheat. You can type 'nvidia bumpgate' in a search engine. It's been thoroughly discussed.

3) They used the wrong alloy for the bumps. nVidia used high-lead bumps which are more sensitive to thermal stressing than eutectic bumps.

And as for lead-free solder - The ATI 2900XT also used lead-free solder. Not that I'm saying that lead-free solder isn't an enormous problem, but nVidia's problems were much, much bigger.

nVidia would keep blundering all the way up until the Fermi. And for every one of your *functional* G8x/G9x cards, there are literally millions sitting in landfills. You can get away with having one in your system if you give it only very light use and keep it cool.

This can pretty much be summarized into: 'Bad lead-free solder' (plus some other minute details, completely overblown in your post). So I was not incorrect. It's not the chips that failed (as you incorrectly claimed), it was the way they were mounted to the PCB.
The temperature argument itself is highly debatable, since nVidia has made large GPUs that ran very hot for many years. The 8800 wasn't exceptional in that sense.

Also, I don't care for the condescending tone and the hyperbole (by the way, you were claiming there were no working 8800s in existence. Various people have reported here that they have working 8800s. I don't see any response from your end).
If you think I don't know what bumpgate is, then guess again. The reason why I bought a Radeon 5770 is because the 8800GTS I used up until then died because of bumpgate. I documented that on my blog at the time: https://scalibq.wordpress.com/2009/11/1 ... %e2%80%a6/
But unlike you, I don't make unrealistic claims and go off on some blind rage against a brand. These things just happen from time to time. It wasn't the first time, and certainly won't be the last time.

Now, I'm getting REALLY tired of all these people constantly attacking me with these aggressive posts. Why don't you people just behave?


http://scalibq.wordpress.com/just-keeping-it- … ro-programming/

Reply 7 of 56, by sliderider

Rank: l33t++
obobskivich wrote:
philscomputerlab wrote:

I've got quite a few 8800 type cards. 8800 GT, 9600GT, 9800 GTS, GTX+. They all work fine.

I do remember an issue with notebook chips, but I believe it was the 7 series that was affected. But I'm not 100% sure.

I think you're right on the GeForce 7 (maybe others?) in mobile computers/devices being a problem too. I remember hearing about problems with Apple laptops that had nVidia chips at least. 😊

Also, because PhysX came up (how this relates to NV3x I haven't the foggiest): GeForce 8 did not introduce PhysX-on-GPU; Ageia wasn't even part of nVidia until 2008. PhysX-on-GPU was launched on the GTX 280/260 and 9800GTX/GTX+ (w/driver 177.39), and later extended to include GeForce 8 and other cards (starting with 177.79) later in the year. In 2006-7, Ageia was still trying to sell the Physx PPU, the Ageia P1, as a PCI (and for OEMs, PCIe) add-in card. Very few games supported it (I think the final count stands at around 12). Many more games today support GPU PhysX, of course.

mockingbird wrote:

The nVidia failures after the GeForce FX series (the GeForce FX series itself was not affected) were caused by bad engineering.

Curiosity: is NV40 affected by this? From what I've read the issues started with later production nodes (like 65nm and on), so I'm assuming the 130nm NV40 also predates it, but I could be mistaken (my other reason for this hunch is that 6800U and GT report similarly high throttle temperatures (>110°C) as their FX brethren).

The Apple laptops that had the die separation issues had the 8400M chip, but some tech news outlets were reporting that the problem went beyond just the mobile parts and extended to the desktop cards as well.

Reply 8 of 56, by swaaye

Rank: l33t++

I have personally seen solder problems with 7900, 8600 and 8800 cards. It tends to manifest as driver crashes, BSODs, or occasional display corruption. But I do have several 8600GT cards in use and an 8800GT that works fine too. However, that 8800GT needed a cooler replacement to become stable. The single-slot 8800GT cards tended to operate at over 100°C in games, and this must have exacerbated the solder/manufacturing/engineering problem.

And yes I HAVE NOTICED ALL OF THE REPORTED POSTS. I feel like a babysitter. I demand calmness and happy feelings among everyone or I'm just going to lock this nonsense.

Reply 9 of 56, by Scali

Rank: l33t
swaaye wrote:

I have personally seen solder problems with 7900, 8600 and 8800 cards. It tends to manifest as driver crashes, BSODs, or occasional display corruption. But I do have several 8600GT cards in use and an 8800GT that works fine too. However, that 8800GT needed a cooler replacement to become stable. The single-slot 8800GT cards tended to operate at over 100°C in games, and this must have exacerbated the solder/manufacturing/engineering problem.

Thing is, many users have reported that a ghett0 reflow in their oven fixed their cards. Which wouldn't work if it was really the chip itself that had cracked.
Also here's a story of Humus successfully ghett0-reflowing a Radeon from that era: http://www.humus.name/index.php?page=News&ID=283

http://scalibq.wordpress.com/just-keeping-it- … ro-programming/

Reply 10 of 56, by swaaye

Rank: l33t++
Scali wrote:

Thing is, many users have reported that a ghett0 reflow in their oven fixed their cards. Which wouldn't work if it was really the chip itself that had cracked.
Also here's a story of Humus successfully ghett0-reflowing a Radeon from that era: http://www.humus.name/index.php?page=News&ID=283

That's true. I resurrected a 7900 Go GTX board with the oven baking trick once. It only lasted a few months though.

Reply 11 of 56, by candle_86

Rank: l33t

I got a Go 7600 laptop back from the grave with baking.

Reply 12 of 56, by Scali

Rank: l33t
swaaye wrote:

That's true. I resurrected a 7900 Go GTX board with the oven baking trick once. It only lasted a few months though.

In those days I worked for a company that also developed their own hardware. The lead-free soldering was a big issue for our embedded devices as well. Reliability wasn't as good as before, and they had a lot more problems doing repairs on these units, because you needed higher temperatures for soldering. Everyone hated it.
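
For context on those higher temperatures, a small sketch with textbook melting points; which exact alloys any particular board used is an assumption:

```python
# Typical melting points of the solder alloys involved (textbook values;
# the exact alloy a given board used is an assumption).
melting_point_c = {
    "Sn63/Pb37 eutectic (classic leaded)": 183,
    "SAC305 (common lead-free Sn-Ag-Cu)": 217,
}

for alloy, mp in melting_point_c.items():
    print(f"{alloy}: melts at ~{mp} degC")
# Roughly 35 degC more on every rework pass, with correspondingly higher peak
# reflow temperatures, which is why repairs on lead-free boards are less forgiving.
```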

http://scalibq.wordpress.com/just-keeping-it- … ro-programming/

Reply 13 of 56, by candle_86

Rank: l33t

What's the big deal with lead solder anyway? Were kids eating video cards and getting sick? I mean really, c'mon.

Reply 14 of 56, by Scali

Rank: l33t
candle_86 wrote:

What's the big deal with lead solder anyway? Were kids eating video cards and getting sick? I mean really, c'mon.

Yea, the irony is that the lead-free replacements gave off toxic fumes, so it's not like they made things so much healthier or more environmentally friendly 😀
They just reintroduced problems with soldering that had been solved decades before by adding the lead.

http://scalibq.wordpress.com/just-keeping-it- … ro-programming/

Reply 15 of 56, by mockingbird

Rank: Oldbie
Scali wrote:

This can pretty much be summarized into: 'Bad lead-free solder' (plus some other minute details, completely overblown in your post).

I haven't blown anything out of proportion. Claiming that the massive nVidia failures were caused mostly by unleaded solder is untruthful in the face of the overwhelming evidence:

Nvidia chips show underfill problems

Underfill is basically a glue that surrounds the bumps, keeps them from getting contaminated, and moisture free. It also provides some mechanical support for the chips. There are two properties of underfill, Tg and stiffness. Tg is the Temperature of Glassification, which means the temperature at which it loses all stiffness. Instead of thinking about it melting, think of it turning to jello. Stiffness is how hard it is before it melts or turns to jello.

...

Remember when we said that Nvidia engineering wasn’t abjectly stupid? Scratch that. Remember when we said we were going to break out the electron microscope? It’s time. Remember the part about the PI layer being necessary for stiffer underfills? Guess what?

[Electron-microscope image from the article: test chip without the PI layer]

Once again, this is not saying that they will fail, or the one you have will fail. We are simply stating that according to all the packaging experts we talked to, none of them could come up with a scenario where this was not a massive problem. Once again, time will tell.

And time did tell. They did fail. En masse. Like I said, when these cards are put to heavy use, they have a 100% failure rate. And it's not a matter of re-balling the chip: among those who do in fact replace nVidia northbridges on laptop motherboards, more often than not they use a fresh, new-old-stock part.

Scali wrote:

Thing is, many users have reported that a ghett0 reflow in their oven fixed their cards. Which wouldn't work if it was really the chip itself that had cracked.

That's because a reflow accomplishes two things. Firstly, it reflows the solder between the chip and the PCB; secondly, it also reflows the bumps inside the chip package onto their pads. Consider as well that reflowing doesn't solve the underlying problem of the unleaded solder between the chip and the PCB, much less the incorrect bump/pad alloy. That is to say, even if the chip were professionally re-balled (the unleaded solder balls replaced with leaded ones), the chip would still fail and would need to be reflowed to bond the bumps and pads inside the chip itself.

Additionally, if the chip suffered a catastrophic mechanical failure because it cracked when the improper underfill did not allow it to 'float' under high heat, no amount of reflowing or re-balling will bring the chip back to life.

And as for the example you gave with the Xbox 360: when those were re-balled, they did not exhibit this problem, because the chips themselves didn't have any engineering problems; indeed, in their case it was only a question of the unleaded solder.

obobskivich wrote:

Curiosity: is NV40 affected by this? From what I've read the issues started with later production nodes (like 65nm and on), so I'm assuming the 130nm NV40 also predates it, but I could be mistaken (my other reason for this hunch is that 6800U and GT report similarly high throttle temperatures (>110°C) as their FX brethren).

That's a good question. I would imagine not, as I've personally seen an old high-end Dell Inspiron with a mobile GeForce 6 that still worked well after many years, and I have a fanless GeForce 6200 which still works quite well (after a re-cap, that is).



Reply 17 of 56, by mockingbird

Rank: Oldbie
Scali wrote:

Right, 'evidence' from Semi-accurate... I think we're done here.

I think the saddest part of the whole nVidia debacle was that the failure rates would have been far, far lower had nVidia not skipped QA, because then they would have found that the voltage they specified for their chips was excessive.

The final piece to the nVidia puzzle - voltage

I've long been wanting to try undervolting an nVidia GPU from the bad lot to see the difference it makes. I'd never gotten around to actually trying it though. So I did the mod on this dv9000 board. First by using a pencil, then, once I'd settled on a value, by soldering in the appropriate resistors. To finish it off, I removed the original aluminum foil/thermal material for the GPU, and applied Arctic MX-4. The CPU got the same treatment.

I was able to undervolt the GPU from 1.2v/1.15v (high/low) to 0.95v/0.89v. That is 250mV, which represents a 21% reduction in operating voltage. All with the GPU not only still operating correctly, but still presenting some overclocking headroom!!! Under FurMark the GPU Vcore measured 0.943v.

So, to sum it up: Nvidia underrated the current capability of the bumps inside most of their chips manufactured 2006 to 2009, they used the wrong materials for bumps and underfill, and to top it off, the parts were overvolted by 20%, thus using almost 40% more power and putting away 40% more heat than they could have. That's one hell of a booboo if you ask me.

That's not from the SemiAccurate website; it's based on the personal observations of our great member Th3_uN1Qu3 over at the badcaps forum.
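
As a rough sanity check on the quoted percentages, here is a minimal sketch that assumes dynamic power scales with the square of the voltage (leakage and clock changes ignored); the 1.2 V and 0.95 V values come straight from the quote above:

```python
# Back-of-the-envelope check of the quoted undervolt figures.
# Assumes dynamic power ~ V^2 at a fixed clock; leakage is ignored.

v_stock = 1.20   # V, stock high-performance voltage from the quote
v_mod   = 0.95   # V, undervolted value from the quote

volt_reduction  = 1 - v_mod / v_stock          # ~0.21 -> the quoted 21%
power_reduction = 1 - (v_mod / v_stock) ** 2   # ~0.37 -> the quoted "almost 40%"

print(f"voltage: -{volt_reduction:.0%}  dynamic power/heat: -{power_reduction:.0%}")
# Prints roughly: voltage: -21%  dynamic power/heat: -37%
```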

So, back to the original topic of that generation of nVidia cards being superior to their ATI counterparts. I must say, I've always been partial to nVidia myself because of the far superior drivers, but nevertheless, I would still go as far as to say that the claim of superiority is irrelevant in light of the fact that these cards were in essence a sort of vaporware. Not quite like Bitboys, who never released anything at all, but not far off from it.



Reply 18 of 56, by Scali

Rank: l33t

You realize that the claim that nVidia doesn't know at what voltages they can run the chips that they themselves designed is rather far-fetched, right?
The fact that some chips can run at lower voltages doesn't mean much. Manufacturers never run the chips at their limits, but keep some healthy margins. So the better picks of the bunch will be able to run at considerably lower voltages. But it's the worst picks of the bunch that are the issue.
Heck, I've undervolted my Core2 Duo severely as well, for all its lifetime. It's specced at 1.35v if I recall correctly, but it runs fine at 1v. It's a 2.4 GHz model, but I ran it at 3 GHz at 1.2v. Does that mean Intel doesn't know what voltages and clockspeeds their CPUs should run at either?

mockingbird wrote:

I would still go as far as to say that the claim of superiority is irrelevant in light of the fact that these cards were in essence a sort of vaporware. Not quite like Bitboys, who never released anything at all, but not far off from it.

Again, this doesn't make sense.
You dismiss the technical merits of the architecture based on the fact that the reliability wasn't that great.
Calling them 'vaporware' is just nonsense. Yes, my card died after about 3 years, but in the meantime I used it for a lot of software development, and it had pretty much run its course anyway. Calling it 'vaporware' is ridiculous.
My 9600Pro died a lot sooner, and I wouldn't call that one 'vaporware' either. It just happens.

http://scalibq.wordpress.com/just-keeping-it- … ro-programming/

Reply 19 of 56, by mockingbird

Rank: Oldbie
Scali wrote:

You realize that the claim that nVidia doesn't know at what voltages they can run the chips that they themselves designed is rather far-fetched, right?

When your chips are running at 100°C and failing left and right, I don't think it's a question of them deciding to run them at those voltages to "keep healthy margins". There was no margin at all at those voltages. Like I said, considering the poor engineering of the chips, they were being factory over-volted, and for no apparent reason.

Hey, the truth is sometimes stranger than fiction. I didn't make this up.

Again, this doesn't make sense.
You dismiss the technical merits of the architecture based on the fact that the reliability wasn't that great.

I think my point was very salient. Bitboys cards also had a lot of technical merit. In simulations they outperformed everything else, and I'm sure that had they had some millionaire backers, not to mention some luck, they might have put out some pretty impressive silicon.

And we're not talking about nVidia cards of that era failing after 3 years. We're talking about cards dropping like flies after several months of usage. Just look at consumer-submitted Newegg follow-up reviews of GeForce cards from that era to get a pretty good idea of just how long these cards lasted on average.

And again, this wasn't limited to one series of cards. This took place over a span of many years, perhaps even up until the very last G9x silicon. And keep in mind that G9x silicon was still being sold even after Fermi was released: while the high-end GeForce 2xx cards used the newer GT200 silicon, lower-end 2xx models were simply re-badged G9x parts, and were still being sold well into 2010.
