Melinda Soares-Furtado
Computational FAQ
Updated: Sep 3, 2020
This entry is intended to share some of my computational takeaways from exploring various software/hardware options during my postdoctoral appointment at UW-Madison. For context, I work in a small collaboration where data are shared among 3-5 users. Our data storage needs are about 5 TB per field, and there could be as many as ~10 fields to reduce.
ResearchDrive
The university provides secure, shareable storage space on the UW-Madison campus network. ResearchDrive is a university-wide file storage solution for faculty PIs, permanent PIs, and their research group members. It is suited to a variety of research purposes, including backup, archiving, and storage for the inputs/outputs of research computing, and it serves as a secure, permanent place for keeping data. 5 TB of storage are provided to each PI at no cost; additional storage space and maintenance can be obtained for $200/TB per year.
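To put that pricing in perspective for my use case, here's a quick back-of-the-envelope estimate (just a sketch; the ~5 TB per field and ~10 fields figures are the ones quoted above):

```python
# Rough annual ResearchDrive cost beyond the free allocation.
TB_PER_FIELD = 5        # approximate storage per field (my use case)
N_FIELDS = 10           # upper estimate of fields to reduce
FREE_TB = 5             # allocation provided to each PI at no cost
COST_PER_TB_YR = 200    # USD per TB per year for additional space

total_tb = TB_PER_FIELD * N_FIELDS
paid_tb = max(0, total_tb - FREE_TB)
print(f"{total_tb} TB total -> {paid_tb} TB paid -> ${paid_tb * COST_PER_TB_YR}/yr")
# 50 TB total -> 45 TB paid -> $9000/yr
```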
I will mention that ResearchDrive presents a bit of a bottleneck in terms of the time required to read/write files. This is why I'm leaning toward housing the data on external HDDs.
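If you want to quantify that bottleneck on your own setup, a crude throughput test is easy to run. Below is a minimal Python sketch; the mount path is a placeholder, so point it at your ResearchDrive mount (or an external drive) before running, and note that the read numbers can be inflated by OS caching.

```python
import os
import time

TEST_PATH = "/Volumes/ResearchDrive/throughput_test.bin"  # placeholder mount point
CHUNK = 64 * 1024 * 1024      # 64 MiB per write
N_CHUNKS = 16                 # 1 GiB total

buf = os.urandom(CHUNK)

# Time the writes.
t0 = time.perf_counter()
with open(TEST_PATH, "wb") as f:
    for _ in range(N_CHUNKS):
        f.write(buf)
    f.flush()
    os.fsync(f.fileno())      # force data to disk, not just the OS cache
write_s = time.perf_counter() - t0

# Time the reads (may be served from cache and look optimistic).
t0 = time.perf_counter()
with open(TEST_PATH, "rb") as f:
    while f.read(CHUNK):
        pass
read_s = time.perf_counter() - t0

gib = CHUNK * N_CHUNKS / 2**30
print(f"write: {gib / write_s:.2f} GiB/s, read: {gib / read_s:.2f} GiB/s")
os.remove(TEST_PATH)
```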
Hard Disk Drives (HDD)
I ended up purchasing the Seagate External Hard Drive 8TB Expansion. The price point is decent at ~$130. To see the cost breakdown per TB for external HDDs in 2020, check out this link. Moreover, this HDD is large enough to meet my long-term storage needs for a single field (~5 TB of data for a four-year observation window of an open cluster). This particular device arrives in NTFS format. For Mac users, this presents a slight snag that will need to be addressed: macOS has read-only access to NTFS volumes, meaning that you cannot write your data without reformatting the disk in a compatible format. I went with APFS. For more details regarding how to reformat your disk, see this helpful Quora response by Theo Lucia.
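For reference, the reformat itself can be done with Apple's built-in diskutil tool. Here's a minimal sketch that drives it from Python; the /dev/disk2 identifier and volume name are placeholders, and erasing a disk is destructive, so confirm the identifier with `diskutil list` first.

```python
import subprocess

# CAUTION: erasing a disk is destructive. Verify the identifier with
# `diskutil list` before running; /dev/disk2 below is a placeholder.
DISK = "/dev/disk2"
LABEL = "ClusterData"

# Show attached disks so you can confirm the identifier.
subprocess.run(["diskutil", "list"], check=True)

# Erase the external HDD and reformat it as APFS with the given label.
subprocess.run(["diskutil", "eraseDisk", "APFS", LABEL, DISK], check=True)
```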
It's also worth mentioning that HDDs do not last forever. In fact, they often have a pretty short lifespan of ~3-5 years. Be sure to back up your storage. I learned this the hard way when I attempted to fire up a six-year-old HDD.
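One low-effort way to keep a second copy current is to mirror the primary drive to a backup drive with rsync. A minimal sketch (both volume paths are hypothetical):

```python
import subprocess

SRC = "/Volumes/ClusterData/"      # trailing slash: copy contents, not the folder
DST = "/Volumes/ClusterBackup/"    # hypothetical second HDD

# -a preserves permissions/timestamps, -v is verbose, --delete mirrors removals.
subprocess.run(["rsync", "-av", "--delete", SRC, DST], check=True)
```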
Servers versus Workstations
A workstation is a personal computer, while servers are the central machines that provide data to workstations. You can read more about the hardware differences here. A server largely has the same basic hardware as a workstation: hard drives, RAM, processors, and network adapters. For instance, you can take a desktop tower and change its software, transforming it from a workstation into a server. This is what I plan to do.
After spec'ing out identical hardware for machines of each type, I found the major difference to be cost, largely due to server licensing fees. In short, a workstation will tend to save you at least 10% (depending on how you spec it out).
Server & Workstation Specs
Often companies like Dell have purchase agreements arranged with universities. These agreements provide additional savings and this is why your university will want you to use particular vendors.
Below I've provided the configurations that I've mapped out for both a Dell workstation and a Dell server. Since I am not using much machine learning at this time, I've decided to skip the GPU upgrades. I'm open to being convinced as to why this is a terrible mistake; I do see an increasingly important role for machine learning (ML) in astrophysical research, and I know that GPUs are important hardware elements for ML algorithms. I focus more on clock speed than core count, as my processes often run in serial. A typical laptop (i5/i7/i9) might have a processor speed of 2.4 GHz or so. I'm aiming for 3.8 GHz or better.
You can compare the power of various Mac machines --- single core and multi-core --- at the link provided here. Bear in mind that nearly all machines have multiple cores now; my laptop, for instance, is a quad-core.
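If you're not sure what your own machine has, Python can report this directly; the sysctl query below is macOS-specific, and the brand string shown in the comment is just an example.

```python
import os
import platform
import subprocess

print(f"Logical cores: {os.cpu_count()}")
print(f"Processor: {platform.processor()}")

# macOS-specific: query the CPU brand string, which includes the clock speed.
if platform.system() == "Darwin":
    brand = subprocess.run(
        ["sysctl", "-n", "machdep.cpu.brand_string"],
        capture_output=True, text=True,
    ).stdout.strip()
    print(f"CPU: {brand}")  # e.g. "Intel(R) Core(TM) i7-8559U CPU @ 2.70GHz"
```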
RAM is another important component that impacts your processing speed. You can learn about how RAM affects performance here. Servers offer error-correcting code (ECC) memory, which helps guard against data corruption.
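To see why RAM matters for image-reduction work, it helps to estimate what a stack of frames costs in memory. A quick sketch (the detector size and frame count are made-up numbers for illustration):

```python
import numpy as np

# Hypothetical full-frame image stack: 4k x 4k detector, float32 pixels.
n_images = 500
height = width = 4096
bytes_per_pixel = np.dtype(np.float32).itemsize   # 4 bytes

stack_gb = n_images * height * width * bytes_per_pixel / 2**30
print(f"Stack size: {stack_gb:.1f} GB")   # Stack size: 31.2 GB

# With 32 GB of RAM this barely fits and you'd be swapping; with 64 GB+
# you can hold the stack plus intermediate products in memory.
```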
One other spec to note is cache: you want about a MB of it, though I've been told that you almost never need to worry about upgrading the cache.
The first, the server option, is the PowerEdge T440 Fully Configurable, with a price tag of $4,632.79.
The second, the workstation option, is the Precision 7920 Tower - Build Your Own, with a price tag of $4,885.65.
Center for High Throughput Computing Resources
The Center for High Throughput Computing (CHTC) supports a variety of scalable computing resources and services for UW-affiliated researchers and their collaborators, including high-throughput computing (HTC), tightly-coupled computations (e.g., message passing interface, or "MPI"), high-memory systems, and GPUs. CHTC compute systems and personnel are funded by the National Science Foundation (NSF), the National Institutes of Health (NIH), the Department of Energy (DOE), the Morgridge Institute for Research, and various grants from the university.
Standard access to CHTC resources is provided to all UW-Madison researchers free of charge, and even external collaborators with an on-campus sponsor may be given access. CHTC also offers hardware buy-in options for priority access to computing capacity on a case-by-case basis, though standard access is more than sufficient for the vast majority of CHTC users.
To gain access to the CHTC, you will have to fill out the Large-Scale Computing Request Form. Don't worry if you do not know the answers to all the spec-related questions. Your request will be received by a Research Computing Facilitator with the CHTC, who will contact you to set up a consultation. The consultation allows you to discuss your computational research goals, review CHTC resources and policies, and get the relevant server accounts created. A faculty sponsor (usually a faculty advisor or project PI) will need to be present for only the first ~30 minutes of the meeting that CHTC has with a new research group; the meeting will last 45-60 minutes in total.
Read more about high-throughput computing here.
Also bear in mind that while the CHTC can process your jobs (which you submit in batch form), you still have to find a place to house these data long-term.
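For a sense of what submitting in batch form looks like: CHTC jobs are managed with HTCondor, which is developed at UW-Madison. Below is a minimal sketch of a submit description queued from Python; the executable name and resource requests are placeholders that the facilitators will help you tune.

```python
import subprocess
from pathlib import Path

# Minimal HTCondor submit description. The executable and resource
# requests below are placeholders for illustration.
submit = """\
executable     = run_reduction.sh
arguments      = $(Process)
request_cpus   = 1
request_memory = 4GB
request_disk   = 20GB
output         = job_$(Process).out
error          = job_$(Process).err
log            = reduction.log
queue 10
"""

Path("reduction.sub").write_text(submit)

# Queue the batch of 10 jobs on the CHTC pool.
subprocess.run(["condor_submit", "reduction.sub"], check=True)
```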
Internet Provider Fun
In Madison, the options for internet are very dependent upon your neighborhood. For example, in my neighborhood AT&T only offers DSL internet. Spectrum, on the other hand, offers cable internet but not fiber-optic cable. Be sure to do some homework to find all the options available to you. You can check your internet performance here. For reference, my current specs are 355 Mbps download & 17 Mbps upload.
By switching from AT&T DSL to Spectrum cable, I experienced a factor of ~30 improvement in download speed and a factor of ~3 improvement in upload speed. Unfortunately, Midvale Heights does not yet offer fiber-optic cable, which would further improve my internet performance.
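Those upload numbers matter more than they might seem once your datasets are measured in terabytes. A quick estimate of how long a single ~5 TB field would take to move at my current speeds:

```python
# Transfer time for one 5 TB field at my measured connection speeds.
FIELD_TB = 5
DOWNLOAD_MBPS = 355   # megabits per second
UPLOAD_MBPS = 17

field_megabits = FIELD_TB * 8e6   # 1 TB = 8,000,000 megabits (decimal units)

for label, mbps in [("download", DOWNLOAD_MBPS), ("upload", UPLOAD_MBPS)]:
    hours = field_megabits / mbps / 3600
    print(f"{label}: {hours / 24:.1f} days")
# download: 1.3 days
# upload: 27.2 days
```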