I’ve mastered the art of squandering my time
Hell, I can’t even find time to write a blog on a semi-regular basis. It’s a little sad. In the time since my last post, I’ve decided to change my home network architecture. Energy prices aren’t getting any lower, and speed just isn’t keeping up any more. The C3600, Octane2, and DS20E have got to go (or at least be powered down). I’m ganking a recently decommissioned Intellistation 6224-33U from a colleague. IBM refuses to upgrade the firmware for dual-core support (ditto for Sun on their W2100z, which is built on the same platform), probably because it would have killed the incentive to upgrade to a 6217 (same platform, does have dual-core support), though the 6217 has PCIe. So, it’s a dual Opteron 254 with 4GB of RAM. Currently one U320 hard drive and a Quadro FX 1100. I don’t really give a damn about the Quadro, but hey, it’s there.
I aim to put in a PERC 4/DC (dual channel PCI-X U320 controller) and 3 more U320 drives (36GB) I have lying around. It probably won’t even cap a channel, but there’s not really enough internal expansion to add more unless I rip out the CDROM drive and put in a 3 bay hotswap cage. That’s a distinct possibility, but if so, I’d be doing it with a PCI-X SATA (or SAS, whichever is cheaper, since SAS supports SATA drives) controller. There are only two SATA ports on the motherboard, and even with eBayed SCSI prices, it’s just not worth it. Sure, I could get 3 147GB 15k drives or 3 300GB 10k drives, but I don’t see the point. They’re still going to cost as much as 500GB SATA drives (or more). The SCSI will be used for a decommissioned external array. External SATA arrays are ludicrously expensive, given that I’d rather use eSATA if possible instead of a deprecated standard (internal connectors for external devices). SATA arrays with U320/Fibre Channel interfaces are ridiculously expensive, and they pretty much all use (crap) proprietary hardware RAID. Sun makes a few JBODs that’d work, but again, ludicrously expensive. A Powervault 220S (I already have one, but more never hurt) with dual U320 controllers is $99 on eBay with sleds (no drives), so I’ll attach one of those if I need more spindles or more space.
It is, of course, entirely possible that the price of SATA arrays with FC/SCSI connections will drop by the time I need to add more storage, but I’m not counting on that for now. In the meantime, I’ll be putting 2 400GB SATA drives in. I’m just not sure what kind of filesystem layout I want to have. I’ve got an Intel Pro/1000 MT (quad GigE) PCI-X card that’ll be going in there. The Fibre Channel array is a no-go until I decide to actually spend some money and pick up PCI-X HBAs on eBay, since the HBA I have now is a PCI64/66 card. I’m not sure how many PCI-X buses the Intellistation has, but PCI-X is a parallel bus, so adding a 66MHz card would drop the throughput of whatever bus it’s on to 533MB/s. That’s no good. If it ends up being on the same bus as the SCSI controller and GigE card, I’d be capped.
It’s not that PCI-X (64/100) is much better (800MB/s), but I’d rather avoid bringing it all down. It -should- be a split bus. prtconf -pv lists five PCI-X bridges, but until I actually try swapping cards in, it’ll be hard to tell if it’s actually split electrically, or if they’re just hanging the onboard USB/GigE/SCSI/SATA off different bridges and all the PCI-X ports are on the same bus. If it’s actually split, then no worries.
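For anyone who wants to check the bus numbers, here’s a rough sketch of the arithmetic in Python (my choice for illustration, nothing on the box requires it). It assumes peak theoretical bandwidth is simply bus width times clock, and that a shared parallel bus clocks down to its slowest card. The function names and the example slot population are made up for the sketch.

```python
# Peak theoretical bandwidth of a parallel PCI/PCI-X bus: width times clock.
# Real-world throughput is lower; this only reproduces the headline numbers.

def bus_bandwidth_mb_s(width_bits, clock_mhz):
    """Return peak bandwidth in MB/s (1 MB = 10**6 bytes)."""
    return width_bits / 8 * clock_mhz


def shared_bus_clock(card_clocks_mhz):
    """A shared parallel bus runs at the speed of its slowest card."""
    return min(card_clocks_mhz)


if __name__ == "__main__":
    for mhz in (66.67, 100.0, 133.33):
        print(f"64-bit @ {mhz:.0f}MHz: {bus_bandwidth_mb_s(64, mhz):.0f} MB/s")

    # Hypothetical population: a 133MHz PCI-X card sharing a bus
    # with the 64-bit/66MHz Fibre Channel HBA.
    clock = shared_bus_clock([133.33, 66.67])
    print(f"shared with a 66MHz card: {bus_bandwidth_mb_s(64, clock):.0f} MB/s")
```

If the slots really are on separate buses, the slow card only drags down its own segment, which is the whole reason I care whether the bus is split.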
The idea is this:
- Intellistation A Pro 6224-33U. Dual Opteron 254 with 4GB of RAM running Solaris Express Developer Edition (SXDE).
- 4×36GB U320 drives, HW RAID5 on the PERC4, stuck in a ZFS pool
- 2×400GB SATA150 drives, pooled with the SCSI drives
- 10×73GB FC drives over 2×1Gb HBAs (if the bus is split), pooled with the rest. If it’s not split, grab a dual 2Gb PCI-X/133 FC HBA off of eBay when I have $200 to blow and attach it that way.
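A back-of-the-envelope sketch of how much raw space that layout adds up to, in Python. It assumes the SATA and FC drives go into the pool with no ZFS-level redundancy (the plan above doesn’t settle that either way) and ignores filesystem overhead, so treat the total as an upper bound. The helper names are just for this sketch.

```python
# Rough capacity estimate for the planned pool, in GB.

def raid5_usable_gb(drives, size_gb):
    """RAID5 gives up one drive's worth of space to parity."""
    return (drives - 1) * size_gb


def striped_gb(drives, size_gb):
    """Assumption: drives are pooled with no redundancy."""
    return drives * size_gb


groups = {
    "SCSI, HW RAID5 on the PERC": raid5_usable_gb(4, 36),
    "SATA": striped_gb(2, 400),
    "Fibre Channel": striped_gb(10, 73),
}
total = sum(groups.values())

if __name__ == "__main__":
    for name, gb in groups.items():
        print(f"{name}: {gb} GB")
    print(f"total: {total} GB")
```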
Additional possibilities:
- Powervault 220S x2 with whatever drives I can scrape up. This would mean moving the current U320 drives to the onboard SCSI controller (again, dual channel U320, just that it’s only RAID0/1/0+1, not 5, and there’s no offload engine or battery backed cache). 80 pin (SCA-2) SCSI drives are much cheaper than 68 pin, since there’s a ton of servers getting decommissioned. SCSI (well, parallel SCSI) is disappearing as the trend to SAS and SATA drives continues in the datacenter, so this should only get better for me.
- Some other kind of FC array.
- FC or SCSI array with SATA drives. MTTDL (mean time to data loss) is much lower, but /shrug. It’s cheap!
- Bump the Intellistation to 8GB of RAM, assuming DDR1 ECC prices get better (unlikely). If it comes down to it, I’d rather have more spindles than more RAM anyway.
It can, at the very least, take over the role of OpenVPN server, dnsmasq server (I like having DNS on my home network), PostgreSQL server, Oracle server, SSH gateway, and LDAP server if I feel like being a pain in the ass and making everyone authenticate to the server (plus RADIUS) to get on my network. I don’t have encryption set up on my wireless network, and I’m not about to change that, but I could (should) set up a trunked VLAN subnet on the wireless which can only get out to the internet (and not route to the wired network) until you authenticate, at which point you get into the main subnet. I mean, what if some random person comes to my house (or parks outside) and needs the internet, like, now! Sure, there’s a coffee shop a block away, but what if they need it at 3AM when the coffee shop is closed?
Now, concerns…
ZFS keeps an ‘intent log.’ Similar to most journaled filesystems, it’s got a record of what it does and doesn’t do. Unlike most journaled filesystems (jfs, reiserfs, ext3/ext4, NTFS, HFS+, VxFS, XFS), it doesn’t check the filesystem and replay the journal if the system crashes. That’s not an issue in many cases. ZFS relies on filesystem metadata and self-heals. Due to this, ZFS requires that the writes be committed (fsync()) every time. Without replaying the journal, writes could be lost on power loss, and there’s not a way I know of to automagically fsck the filesystem when it comes up (actually, to my knowledge, fsck.zfs doesn’t exist). That being the case, you’d end up having corruption. ZFS can fix that. If it’s in the kernel or essential processes, though? You just hosed the server.
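If the difference between a buffered write and a committed one is fuzzy, here’s a minimal sketch in Python. Nothing in it is ZFS-specific; it just shows the generic userspace pattern, where os.fsync() is the call that blocks until the OS reports the data is on stable storage. The function name and file name are made up for the example.

```python
import os
import tempfile


def durable_write(path, data):
    """Write data and block until the OS reports it has been committed."""
    with open(path, "wb") as f:
        f.write(data)          # lands in Python's userspace buffer
        f.flush()              # hands the bytes to the OS (page cache)
        os.fsync(f.fileno())   # asks the OS to commit them to the device


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as scratch:
        target = os.path.join(scratch, "example.dat")
        durable_write(target, b"important bytes")
        print(os.path.getsize(target), "bytes committed")
```

Whether "committed" means platters or a controller’s battery-backed cache is up to the hardware underneath, which is exactly where my concern comes from.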
The entire point of a cache-backed drive is that it waits for sequential writes so it’s not constantly flipping around the platters, which really helps performance. A cache backed controller immediately returns success to the OS, though the write is not committed yet. When the system comes back on after a power loss or crash, it flushes the cache to disks, and you’re good to go. With flaky SATA drives or JBODs on a plain Jane controller (no cache, which a lot of Fibre Channel HBAs are), forcing a sync is good. With the cache backed controller, it’s bad. Solaris has a system tunable you can change to prevent this from happening (while still leaving the ZFS Intent Log up and running, though turning that off is another way around it, which is not at all recommended). That works great if everything in your ZFS pool is cache backed (real hardware RAID arrays, drives run by a cache-backed controller, etc). In a mixed environment (as mine will be)? I take either the risk of poor SCSI performance or data corruption. I could forgo the hardware RAID, but then why use the PERC at all? The only advantage I can see is that I’d still have a writeback cache, which would be flushed far too often. There’s a way to set this per controller in sd.conf, but that’s for Fibre Channel LUNs, not SCSI. Turning off ZFS’s cache flushing would put the data on the SATA disks at risk, since their write caches aren’t battery backed. Best solution for now? Make each SCSI disk its own logical drive on the PERC, then zpool those with the SATA disks.
The max throughput of GigE is 125MB/s. Given protocol overhead, 80-90MB/s is more realistic. The cost of a GigE switch which supports 802.3ad (link aggregation) is $50. That being the case, I’m going to put another wireless router in my house in repeater mode, upstairs (where my computer is, and where this’ll probably be), put an 802.3ad switch on it, connect the GigE on the Intellistation to the router, and the quad GigE to the switch. That’ll solve the problem with certain Broadcom wireless cards not coming up until I log into Gnome, since I’ll just wire them, plus I’ll have the advantage of being able to issue WOL packets (Wake On LAN). Link aggregation effectively makes multiple NICs appear to be one, along with the bandwidth. Intel’s got a proprietary way to do it via ‘teaming,’ but that’s only supported on their cards. Yeah, I have one, but Intel’s implementation is nonexistent on Solaris. Fortunately, Solaris doesn’t need it. I can aggregate whatever I want, regardless of the vendor (take that, Linux bonding, FreeBSD IPMP, and Windows’ lack of any comparable feature!).
I’ll have four aggregated GigE connections on the switch with a different subnet, so lookups for filesharing succeed with an IP in the hosts file (rather than routing through the wireless for no reason). This gives me an optimal throughput of 360MB/s or so on the network, and that can always be increased via another Pro/1000 MT (PCI-X versions are cheap!). I’d have to pick up a PCIe multiple port GigE card (another Intel, probably) for my desktop if I want more than 90MB/s, but that’s not necessary just yet. It’s faster than my hard drive is, anyway. Ideally, once throughput gets high enough (more spindles), and I have more throwaway money, I’ll pick one up. PCIe has a direct link to the CPU/RAM anyway, so it doesn’t need to touch my hard drive if I’m just streaming it over the network into RAM.
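The network numbers are simple multiplication, sketched here in Python. The per-link figures are the ones from this post (125MB/s line rate, 80-90MB/s after overhead); the assumption is that aggregated links scale perfectly, which is the best case rather than a promise.

```python
# Best-case throughput of N aggregated GigE links.

GIGE_LINE_RATE_MB_S = 1000 / 8   # 1 Gb/s expressed in MB/s
REALISTIC_PER_LINK_MB_S = 90     # upper end of the 80-90MB/s estimate


def aggregate_mb_s(links, per_link_mb_s):
    """Assumes perfect scaling across the aggregated links."""
    return links * per_link_mb_s


if __name__ == "__main__":
    print("line rate, 4 links:", aggregate_mb_s(4, GIGE_LINE_RATE_MB_S), "MB/s")
    print("realistic, 4 links:", aggregate_mb_s(4, REALISTIC_PER_LINK_MB_S), "MB/s")
    print("realistic, 8 links:", aggregate_mb_s(8, REALISTIC_PER_LINK_MB_S), "MB/s")
```

The 8 link line is what a second quad card would buy me on paper.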
How to share it, though? NFS and CIFS (Samba) are both rather unintelligent, and they issue an assload of commands for everything. Not a big deal on copying a few large files, but ever tried to move a ton of small files (say, music) over the network? Suck. NFSv4 fixes this. I don’t know of a Windows NFSv4 client. SMB/CIFSv2 fixes this. That’s only supported on Vista and Server 2008. What do I do here, then? I could install 2008 in a virtual machine just to share files. Seems like a damn waste, and I’ll never touch 99% of what it does. The machine’s going to be headless, and I don’t want to use RDP. I don’t want Active Directory on my network. SMB/CIFSv2 is the only thing Server 2008 offers me. Solaris does everything else I want to do better. iSCSI has none of this overhead, but I’d need to create a volume, export it to a system, then format it on the client. I can’t get direct access to that from multiple clients, and it doesn’t grow nicely. Yeah, I could create another iSCSI device, export it, mount it, and use the volume support Windows has to span them, but that sucks, and I still can’t access it from multiple systems. So, I could create a VM, install Server 2008, have it share the volume, and add more virtual disks as necessary (again, spanning via Windows) to share.
Again, this is not an ideal solution. My storage isn’t unified, and it’s a big hassle for me to go add more. Creating a 22TB ZFS pool at work took 15 seconds. Any idea how long that takes on Windows? Plus I have to go through filesystem checks if it crashes. I don’t really know what I’m going to do about that. Latency on closing a 1K file via NFS is about 4 seconds. It’s similar for CIFS. Assuming the process reading/writing is multithreaded, it shouldn’t bottleneck. I have no idea how many of the applications I use are actually multithreaded, though, and I don’t really feel like digging around Process Explorer to find out.
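To put a number on why per-file latency matters, here’s a crude model in Python. It assumes the transfer is purely latency-bound, uses the roughly 4 second per-file figure above, and pretends concurrent workers overlap perfectly. The file count and worker counts are hypothetical, picked only to show the shape of the problem.

```python
import math


def transfer_seconds(files, per_file_latency_s, workers=1):
    """Latency-bound estimate: each worker handles its files one at a time."""
    return math.ceil(files / workers) * per_file_latency_s


if __name__ == "__main__":
    files = 1000  # hypothetical music folder
    for workers in (1, 4, 8):
        secs = transfer_seconds(files, 4, workers)
        print(f"{workers} worker(s): {secs} s ({secs / 60:.1f} min)")
```

A single-threaded copy eats the full latency for every file, which is why it matters whether the applications I use are actually multithreaded.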
Best case scenario?
Get a Thumper. Given that I don’t have $25,000 to blow? Get a PCI-X SATA card and an external case. The idea here is to save money on electricity, and attaching 3 arrays with redundant power supplies isn’t going to help that. A single case with a 300W power supply, 8 SATA drives, and cables funneled out of the Intellistation might work, since every company out there seems to be full of jackasses. It can’t honestly be that hard to support the SATA2 spec (no, it’s not 3Gb/s throughput max) and give me a cheap port multiplier. Sequential throughput on a SATA drive is about 70MB/s. Random access is closer to 40. With 2 SATA drives plus 4 SCSI drives, I ought to be able to saturate a single GigE link pretty easily for now. With more drives, that’ll go up. Ten SATA drives plus the 4 SCSI should put me over the cap for the quad GigE. It’s not like I can’t add another card and aggregate those, too, but how fast do I really need it? PCI-X might disappear also (PCIe is gradually replacing it in servers), but the price of quad GigE PCI-X cards can only get better.
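A quick sanity check of the spindles versus network claim, sketched in Python. The SATA figure is the random access number above; the SCSI per-drive rate is an assumption (I’m guessing it’s no worse than SATA random access), and the model pretends drives add up linearly with no controller or bus bottleneck.

```python
# Does the pool outrun the network? Rates in MB/s.

SATA_RANDOM_MB_S = 40      # from the estimate in this post
SCSI_ASSUMED_MB_S = 40     # assumption: no worse than SATA random access

SINGLE_GIGE_MB_S = 90      # one realistic GigE link
QUAD_GIGE_MB_S = 360       # four aggregated links


def pool_throughput_mb_s(sata_drives, scsi_drives):
    """Assumes per-drive rates simply add up across the pool."""
    return sata_drives * SATA_RANDOM_MB_S + scsi_drives * SCSI_ASSUMED_MB_S


def saturates(sata_drives, scsi_drives, link_mb_s):
    return pool_throughput_mb_s(sata_drives, scsi_drives) >= link_mb_s


if __name__ == "__main__":
    for sata, scsi in ((2, 4), (10, 4)):
        rate = pool_throughput_mb_s(sata, scsi)
        print(f"{sata} SATA + {scsi} SCSI: {rate} MB/s,",
              "single GigE saturated:", saturates(sata, scsi, SINGLE_GIGE_MB_S),
              "quad GigE saturated:", saturates(sata, scsi, QUAD_GIGE_MB_S))
```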
Just think of my zoning times at 360MB/s! That’s not going to happen now (or for a while), but I should at least be getting twice the speed of my hard drive.
The real problem for scalability is that I only get two cores. Hopefully, by the time it doesn’t scale to the demands of database/fileserver load/whatever, it’ll be a long time from now. 1TB drives should be less than $100 in a year. Who knows what I could get by the time this is obsoleted? I’d still like to dangle SCSI/Fibre Channel arrays off it, but I don’t think that’s going to go over well.
Looking at moving, condos, marriage, etc. Given the cost of weddings, it’s unlikely that I’ll have extra money any time soon. Given the average size of a condo, I don’t think I’d get a good reaction from whirring and clicking arrays, no matter how appealing the blinkenlights may be, plus 3U equipment is loud (ok, not as loud as 2U or 1U, but 60dB isn’t quiet). Regardless, perhaps I should appeal against condos with $250+ association fees on the basis that they’re costing me at least 1.5TB (7200RPM SATA) or 1TB (10k 300GB SCSI) a month, or more RAM, or something…
