Incident Manager Job at Crusoe, San Francisco, CA

  • Crusoe
  • San Francisco, CA

Job Description

Crusoe is on a mission to accelerate the abundance of energy and intelligence. As the only vertically integrated AI infrastructure company built from the ground up, we own and operate each layer of the stack, from electrons to tokens, to power the world's most ambitious AI workloads. When you join Crusoe, you join a team that is building the future, faster.

We're in the midst of the greatest industrial revolution of our time. The demand for AI compute is boundless, and power is a bottleneck. We're solving that — with an energy-first approach that makes AI infrastructure better for the world and faster for the people innovating with AI.

We're looking for problem-solving, opportunity-finding teammates with a sense of urgency, who believe in the scale of our ambition and thrive on a path not fully paved — people who want to grow their careers alongside a team of experts across energy, manufacturing, data center construction, and cloud services.

If you want to do the most meaningful work of your career, help our customers and partners advance their AI strategies, and be part of a high-performing team that believes in each other, come build with us at Crusoe.

About the Role

This Incident Manager role is critical for upholding service reliability and customer trust, directly impacting company success by minimizing downtime and resolving critical issues. You will spearhead the management of high-visibility incidents and customer escalations, ensuring rapid and effective responses to complex technical challenges.

Beyond immediate resolution, we are looking to sharpen our incident management practices to ensure a superior customer experience during "storms" as well as robust preventative measures afterward. You will leverage data analytics to drive greater resiliency and reliability, ensuring that every incident translates into a stronger product and process.

What You’ll Be Working On

Crisis Management & Data-Driven Resiliency
  • Handle the "Storm": Lead incident responses for high-visibility issues, ensuring minimal disruption to customer operations. You will act as the calm anchor during crises, managing communication and strategy to maintain customer trust during outages or critical failures.

  • Analytics & Reliability: Utilize data analytics to identify trends in incidents, translating these insights into actionable strategies for greater system resiliency and reliability.

  • Preventative Strategy: Develop robust incident response strategies and designs. Focus on the "preventative piece" by conducting deep post-incident reviews to ensure root causes are addressed and recurrences are eliminated.

Technical Execution & Customer Support
  • Troubleshoot and Resolve: Diagnose and resolve complex technical issues related to InfiniBand, containerization, and distributed training.

  • Implement and Optimize: Guide and assist customers in implementing and optimizing their HPC infrastructure to achieve maximum performance and efficiency.

  • Educate and Empower: Develop and deliver training materials, including internal training sessions, documentation, and knowledge base articles, to empower customers to effectively utilize our solutions.

  • Collaborate Internally: Work closely with internal engineering and product teams to provide valuable customer feedback. You will act as a key technical resource, helping our Customer Support Engineers (CSEs) and Customer Success Managers (CSMs) understand and resolve complex product issues.

What You’ll Bring to the Team

Technical Proficiency & Certifications
  • Core Tech Stack: Strong technical experience with Linux, Virtualization, Kubernetes, and handling customer incidents.

  • Certifications: We are looking for candidates who actively update their skill sets. NVIDIA, Linux, and Kubernetes certifications are strongly preferred to demonstrate a deep understanding of the products our CSEs and CSMs support.

  • Networking & Infrastructure: Solid understanding of the TCP/IP stack and Infrastructure-as-Code (IaC) practices.

  • Bonus Skills: Proficiency in one or more programming languages.

Essential Experience & Mindset
  • Experience: 4-5 years of customer-facing experience and 3-5+ years of experience in a team leadership role, acting as a liaison with external and internal customers.

  • Crisis Handling: A proven track record in crisis management, capable of navigating high-pressure situations with a focus on customer experience.

  • Problem Solving: A proven problem-solving mindset with the ability to diagnose and resolve complex technical issues.

  • Communication: Excellent communication skills, both written and verbal.

Benefits:

  • Competitive compensation

  • Restricted Stock Units

  • Paid time off & paid holidays

  • Comprehensive health, dental & vision insurance

  • Employer contributions to HSA

  • Paid parental leave

  • Paid life insurance, short-term and long-term disability

  • Professional development & tuition reimbursement

  • Mental health & wellness support

  • Commuter benefits (parking & transit)

  • Cell phone stipend

  • 401(k) Retirement plan with company match up to 4% of salary

  • Volunteer time off

Compensation Range

Compensation will be paid in the range of $136,125 - $165,000. Restricted Stock Units are included in all offers. Compensation will be determined by the applicant's knowledge, education, and abilities, as well as internal equity and alignment with market data.

Crusoe is an Equal Opportunity Employer. Employment decisions are made without regard to race, color, religion, disability, genetic information, pregnancy, citizenship, marital status, sex/gender, sexual preference/orientation, gender identity, age, veteran status, national origin, or any other status protected by law or regulation.

Job Tags

Temporary work, Immediate start
