The largest¹ of the "Enterprise" series of Sun computer systems.

Can house up to sixteen system boards that can be grouped together logically into separate systems called "domains". System boards can be added to or removed from a running system with no downtime required.

Each system board can contain up to four CPUs and four gigabytes of RAM, as well as multiple Ethernet, SCSI, and fibre interfaces.

Each system board supports "Alternate Pathing" (AP), which allows multiple paths to a single device so that I/O can survive a component failure.

An E10k can therefore be used to create one enormous domain containing sixty-four CPUs and sixty-four gigabytes of RAM, or as many as sixteen separate domains.
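As a rough illustration of that arithmetic (a sketch only; the names and data model here are made up, not Sun's actual configuration tooling):

    # Hypothetical sketch of E10k capacity arithmetic in Python.
    BOARDS = 16          # system board slots in the chassis
    CPUS_PER_BOARD = 4   # maximum CPUs per system board
    GB_PER_BOARD = 4     # maximum gigabytes of RAM per system board

    # One enormous domain using every board:
    print(BOARDS * CPUS_PER_BOARD, "CPUs,", BOARDS * GB_PER_BOARD, "GB RAM")
    # -> 64 CPUs, 64 GB RAM

    # ...or sixteen separate single-board domains:
    domains = {f"domain{i}": [i] for i in range(BOARDS)}
    print(len(domains), "domains of", CPUS_PER_BOARD, "CPUs each")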

An E10k relies on another machine, the System Service Processor (SSP). The SSP holds the configuration information and boot PROM image for each domain, acts as the system console for each domain, and is responsible for booting each domain.
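A minimal sketch of the per-domain state the SSP has to track (the field names, paths, and port numbers here are hypothetical, not the real SSP's data model):

    # Illustrative model of the SSP's per-domain records.
    from dataclasses import dataclass

    @dataclass
    class Domain:
        name: str           # domain name
        boards: list        # system boards assigned to this domain
        obp_image: str      # boot PROM image the SSP keeps for the domain
        console_port: int   # SSP-side port acting as the domain's console

    # A hypothetical two-domain configuration:
    ssp_config = [
        Domain("dbprod", boards=[0, 1, 2, 3],
               obp_image="/export/obp/dbprod", console_port=5000),
        Domain("webdev", boards=[4],
               obp_image="/export/obp/webdev", console_port=5001),
    ]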

This is not a computer you'll find in someone's study.


Update

Sun has announced the successor to the E10k, the Sunfire 15000. This new machine is reportedly capable of 15 dynamic domains, a terabyte of RAM, and about 160 CPUs.


¹ This writeup is now very out of date and will be updated shortly. The Sunfire 25K is now the largest system made by Sun Microsystems.

A group of engineers in San Diego left their jobs at well-established corporations (a large number of them former NCR employees) to form their own company. They set out to build massively parallel computers with Sparc processors. The market they were targeting has traditionally been a hard one, and they had a hard time surviving. They were acquired by similar companies and reshaped several times, most notably in their second-to-last acquisition, by Cray Research, Inc. Through these mergers they were also joined by several engineers in Beaverton, Oregon.

Under Cray's leadership, they produced a machine with 64 Sparc processors called the CS6400 (more affectionately known as the SuperDragon, since it was an implementation of Sun's sun4d architecture, similar to what would be found in a SparcCenter 2000 computer from Sun). The CS6400 supported a feature called Dynamic System Domains, meaning that its multiple system boards could be electronically isolated into distinct sets (called domains), and that the partitioning could be changed dynamically while separate instances of the operating system were executing within each domain. Another feature called Alternate Pathing allowed SCSI and ethernet devices to be virtualized on top of pairs of SCSI and ethernet interface cards, allowing an operator to dynamically repartition system boards, or even physically remove system boards from the chassis, without interrupting I/O services provided to the end users.
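The core idea behind Alternate Pathing can be sketched in a few lines (purely illustrative; the real implementation lived in the Solaris kernel's device layer, and these class and method names are invented):

    # Illustrative failover logic for an Alternate Pathing style virtual device.
    class PathedDevice:
        def __init__(self, primary, alternate):
            self.paths = [primary, alternate]  # pair of physical interfaces
            self.active = 0                    # path currently carrying I/O

        def io(self, request):
            for _ in range(len(self.paths)):
                try:
                    return self.paths[self.active].submit(request)
                except IOError:
                    # The active path (or the board it lives on) failed or
                    # was detached: switch to the alternate and retry, so
                    # the caller never sees the failure.
                    self.active = (self.active + 1) % len(self.paths)
            raise IOError("all paths to device have failed")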

A relationship was established between the engineers who built the CS6400 and Sun Microsystems, because these large Sparc-based servers ran the Solaris Operating Environment. (Well, along with a few low-level tweaks in the kernel to get Solaris working on the slightly different hardware, and to support the Alternate Pathing feature and the Dynamic Reconfiguration feature that allows the kernel to release and claim resources as they are physically detached and attached.)

When SGI purchased Cray and analyzed what it had just bought, it found this quirky little division in San Diego building things with Sparc processors and working closely with Sun Microsystems. SGI didn't really want to keep the group, since it clashed with the sorts of technologies SGI was already producing, so it sold the group to Sun for about $50 million. Sun liked what the engineers were doing and how their computer systems worked, and it gladly acquired the division just as it was about to complete its follow-on to the CS6400: the Ultra Enterprise Server 10000 (also known as the Starfire).

Development was completed on the new machine under Sun's leadership, and it took off in the marketplace once the fine products from these brilliant engineers were finally coupled with the vast resources of Sun's marketing department. The Enterprise 10000 was essentially a more modern and refined take on what the CS6400 had been. It had easier, more reliable Dynamic Reconfiguration. (Although I'd advise using the Solaris 7 or Solaris 8 versions of DR over the Solaris 2.5.1 and Solaris 2.6 versions if possible; but that's just my personal preference.) It also had faster hardware based on a 16x16 crossbar implementation of Sun's UPA architecture (sun4u, the same as an Ultra-1, except on a larger scale). The server was praised in the computer industry for its success at scaling SMP up to the largest number of processors ever achieved, due in no small part to the amazing ASICs that drive its interconnect with a remarkably fast cache coherency snooping implementation. (All 64 processors can access any of the 64GB of memory in the system with uniform performance of ~12GBytes/sec of bandwidth and ~500ns of latency, while keeping their 8MB e-caches coherent.)
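A quick back-of-the-envelope check on those figures (using only the numbers quoted above, plus an assumed 64-byte coherence unit that this writeup does not state):

    # Sanity-check arithmetic on the quoted interconnect figures.
    bandwidth = 12e9   # ~12 GBytes/sec aggregate across the 16x16 crossbar
    latency = 500e-9   # ~500 ns to reach any of the 64GB, uniform for all CPUs
    cpus = 64
    line = 64          # assumed bytes per coherent transfer (cache line)

    print(f"~{bandwidth / cpus / 1e6:.0f} MB/s per CPU with all 64 streaming")
    print(f"~{bandwidth / line / 1e6:.0f} million cache lines/sec system-wide")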

Scott McNealy considers his company's acquisition of the Enterprise 10000 and its engineers the best deal since Microsoft bought DOS. The acquired division was directly responsible for several billion dollars in revenue during its first year within Sun's ranks, not to mention the additional revenue from selling service and accessories to go with all of that Enterprise 10000 hardware.


While Dynamic Reconfiguration sure is a great feature, it's not 100% correct to say that no disruption ever occurs to the instance of Solaris running in a domain whilst its boards are removed. Certain parts of the SunOS kernel are not pageable, and a patented design allows the Enterprise 10000 to reprogram the physical addresses of various chunks of memory on its system boards so that it can move the kernel from one board to another to facilitate board removal. This operation can take upwards of 30 seconds, during which the entire domain will appear hung. System administrators are given advice on how to minimize the need for these particular operations, so this system suspension is very rarely experienced. And besides, a 30-second delay certainly beats the shit out of rebooting your whole system just to replace something stupid like a bad SIMM. Reboots are bad. I've heard of situations where it could take upwards of 8 hours to reboot and fully restore service, but those were extreme cases with around 100TB of storage and complex database applications running.
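In outline, that kernel-relocation step looks something like the sketch below (conceptual only; the real mechanism reprograms address decoders in hardware while the operating system is quiesced, and every name here is invented):

    # Conceptual sketch of evacuating non-pageable kernel memory from a
    # system board that is about to be removed.
    def copy_rename(source_board, target_board, domain):
        domain.suspend()    # the entire domain appears hung from here...
        # Copy the kernel's non-pageable memory to the spare board verbatim.
        target_board.memory[:] = source_board.memory
        # Reprogram the target board to answer at the source board's old
        # physical address range, so kernel pointers remain valid without
        # relocating the kernel itself.
        target_board.base_address = source_board.base_address
        domain.resume()     # ...until here, up to ~30 seconds later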
