I have worked as a Site Reliability Engineer on the hardware team at my company, and the work was varied. We received and answered internal support tickets that neighboring engineering teams submitted for server issues, then coordinated with our remote data center operations team to get those issues fixed. You may also write a tool or application that lets you and your team automate tasks like installing the operating system on a server, or that lets internal teams request that a broken server be swapped for a working one. We wrote an in-house Ruby on Rails application that automated our provisioning tasks, so we no longer had to do them by hand for large batches of hardware.
You may get the chance to travel to a data center to help unbox, rack, and cable up large batches of servers. You might diagnose issues on the spot with those new servers, because a certain percentage of them often arrive from the vendor with defects. That can involve opening up a server and swapping out a memory module or motherboard. Cabling servers up to the power supplies and networking equipment can be really fun too! Some hardware teams are on-call for hardware issues (kind of like a doctor, but for hardware) and get paged to fix them at any hour during their on-call shift.
On our team, each engineer is usually on-call 24 hours a day for a week at a time, and your turn comes back around every X weeks, where X depends on how many people are on your team. But being on-call is different at every company, so if you are interviewing somewhere, it's good to ask how it works and what the expectations are.
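To make the rotation arithmetic concrete, here is a minimal Python sketch of a weekly round-robin schedule. It is only an illustration under simple assumptions (one engineer per week, everyone takes turns in a fixed order); the function name `on_call_schedule`, the team names, and the start date are all made up for the example and are not how any particular company schedules on-call.

```python
from datetime import date, timedelta


def on_call_schedule(engineers, start, weeks):
    """Return (week_start, engineer) pairs for a simple weekly round-robin.

    Assumption: exactly one engineer is on-call per week, in list order.
    """
    schedule = []
    for week in range(weeks):
        week_start = start + timedelta(weeks=week)
        # The modulo wraps back to the first engineer after the last one.
        engineer = engineers[week % len(engineers)]
        schedule.append((week_start, engineer))
    return schedule


# Hypothetical four-person team: each person is on-call every fourth week.
team = ["Ana", "Ben", "Chris", "Dee"]
for week_start, engineer in on_call_schedule(team, date(2024, 1, 1), 6):
    print(week_start.isoformat(), engineer)
```

With four people on the team, each person's week comes around once every four weeks; a smaller team means a more frequent turn.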
Simone recommends the following next steps:
- Check out Linux Academy and learn how to install an operating system on a virtual machine