[Remote] IBM Workload Scheduler Administration / Infrastructure Engineer
Note: This is a remote position open to candidates in the USA. Kastech Software Solutions Group is seeking a highly skilled IBM Workload Scheduler Administration / Infrastructure Engineer with 3–5+ years of experience. The role involves managing, maintaining, and optimizing enterprise batch scheduling infrastructure, ensuring high availability and reliable execution of critical business workloads.
Responsibilities
- IBM Workload Scheduler Administration
  - Administer Production IBM Workload Scheduler (formerly Tivoli Workload Scheduler) environment:
    - 28,000 unique daily jobs
    - Approximately 350,000 daily job runs
    - 44 servers
    - Three additional change-control environments
  - Install, configure, administer, patch, and upgrade IWS components:
    - Master Domain Manager (MDM)
    - Dynamic Agents
    - Dynamic Pools
    - Dynamic Workload Console (DWC)
- Change Management & Governance
  - Work closely with Product Owners and communicate workstreams through Jira
  - Manage job promotions using a Workload Application Template-based process
  - Perform safety and stability assessments for all job promotions
  - Manage change control across four separate environments
  - Enforce change management standards, policies, and governance
- Platform Availability & Operations
  - Maintain and continuously improve Production platform uptime target of 99.17% per month
  - Follow SOPs, DevOps practices, and disciplined change-control processes
  - Coordinate platform-impacting communications to a user community of approximately 500 developers and data engineers
  - Support Production infrastructure consisting of:
    - 44 servers
    - MDM, DWC, and Agent environments
- Troubleshooting & Support
  - Resolve:
    - Complex job failures
    - Performance bottlenecks
    - Agent-related issues
    - Infrastructure-related issues
  - Provide guidance on complex job scheduling designs to less experienced team members
- Monitoring, Security & Compliance
  - Monitor scheduler platform health and performance
  - Manage database maintenance activities
  - Perform backup, disaster recovery, and monthly failover testing
  - Define and maintain:
    - Security policies
    - User authorizations
    - Authentication for Dynamic Workload Console (DWC)
  - Respond to:
    - Cybersecurity vulnerability assessments
    - PCI compliance audits
    - Other regulatory audit requests
- Automation & DevOps
  - Design and implement Ansible-based automation solutions
  - Develop self-healing mechanisms to reduce unplanned outages
  - Coordinate with offshore teams performing SOP activities during non-business hours
  - Develop automation scripts using:
    - Python
    - IWS REST APIs
Skills
- Ability to modernize, implement, install, configure, upgrade, migrate, develop, or design IBM Workload Scheduler (IWS) / IBM Workload Automation (IWA) solutions
- Support migration activities across pre-production and production environments
- Participate in knowledge transfer and documentation to enable team self-sufficiency
- 3–5+ years of dedicated IBM Workload Scheduler administration experience
- Primary environment hosted on Red Hat Enterprise Linux (RHEL)
- Strong expertise in: IBM Workload Scheduler (IWS), Linux System Administration, Scripting and Automation
- Strong experience with IBM Workload Scheduler architecture (V10.1+), especially Dynamic Workload Broker and high availability of MDMs managing Fault Tolerant Agent and Dynamic Agent architectures
- Strong conceptual understanding of Master Domain Manager (MDM), Backup MDM (BMDM), Dynamic Workload Console (DWC), Fault Tolerant Agent (FTA), Dynamic Agent (DA)
- Strong grasp of the conman CLI to monitor and control the production plan and check job, job stream, and resource status
- Strong grasp of the composer CLI to define, modify, and extract scheduling objects
- Strong grasp of the planman CLI to control the pre-production plan and GUI mirroring
- Strong grasp of the lifecycle of the daily production planning process, including the phases of JnextPlan/FINAL
- Proficiency in navigating the DWC web-based GUI to monitor workloads, manage user access security, and define scheduling objects
- Experience installing IWS components and applying Fix Packs and Interim Fixes
- Troubleshooting with logs under TWSDATA/stdlist and adjusting trace levels for netman, batchman, writer, mailman, etc.
- Strong experience with IBM WebSphere Liberty
- Strong grasp of reading messages.log, traces.log, and FFDC logs
- Strong grasp of configuring JVM heap sizes
- Strong grasp of configuring tracing scope, tracing levels, and tracing retention
- Strong experience with Red Hat Enterprise Linux 8+
- Deep familiarity with bash/shell commands for text processing (for example, grep, awk, sed), file manipulation, and system navigation
- Ability to manage, start, stop, and troubleshoot systemd services using systemctl and journalctl for IWS agents and MDM
- Experience managing user accounts, groups, and service accounts, with deep knowledge of Linux file permissions (chmod, chown, ACLs on local filesystems and NFS)
- Ability to monitor system performance using tools like top, htop, vmstat, iostat, and sar to troubleshoot bottlenecks and platform unresponsiveness
- Understanding of Logical Volume Manager (LVM) and filesystem usage
- Checking TCP port availability, firewall rules (firewalld/iptables), and connectivity between MDM and Dynamic Agents using netstat, ss, ping, curl, etc.
- Managing SSL/TLS certificates, private keystores, and public truststores, and working with a Certificate Authority
- Strong experience with scripting (Bash Shell, Python, etc.) for automation
- Understanding of networking principles
- Basic understanding of Oracle database administration, enough to troubleshoot with DBAs and demonstrate when an issue lies in Oracle
- Understanding of basic SQL to query job metadata
- Understanding of checking database connectivity
- Understanding of AWS cloud infrastructure
- Experience using a secrets manager (CyberArk PPM, HashiCorp Vault, or similar)