Executive Summary
This interactive document outlines a strategic 1-year roadmap for DevOps infrastructure transformation. It addresses critical vulnerabilities, End-of-Life (EOL) software, and operational inefficiencies, and sets out how each will be resolved. The roadmap prioritizes immediate risk mitigation, followed by a phased implementation of modern, resilient, and centrally managed systems.
Key Transformation Pillars
- EOL Remediation: Upgrade or migrate EOL XenServer 7.2 (currently kept in an air-gapped environment), and eliminate old Linux distros and EOL Node.js versions.
- High Availability (HA): Full implementation of HA for on-prem virtualization and Nginx instances.
- CI/CD Modernization: Migrate from Jenkins to GitLab for unified, automated pipelines (covering AWS production).
- Centralized Management & Observability: IPA server with Ansible, plus centralized logging.
- Security Hardening: Protect every access path with SSL and MFA, including developers' private environments.
- Isolated Environments: Isolate the production and development environments from developer access and developers' private environments.
Anticipated Benefits
- ✔️ Reduced security attack surface.
- ✔️ Improved system uptime and service resilience.
- ✔️ Accelerated and more reliable development/deployment.
- ✔️ Enhanced developer productivity.
- ✔️ Stronger compliance posture.
- ✔️ Reduced chaos when key team members leave.
The Imperative for Change
Modernization is an ongoing process and is crucial for addressing significant risks and inefficiencies as they arise in a changing technical landscape. This section outlines why this transformation is not just an IT upgrade, but a vital measure for business continuity and operational excellence.
Why Modernization is Crucial
EOL Software Risks: XenServer 7.2, the CentOS 7 based Nginx proxy, and Node.js 16 are EOL, meaning no security patches or vendor support. This makes them prime targets for malicious actors, risking data loss, downtime, and reputational damage. The cumulative risk of multiple EOL systems is significantly greater than their individual vulnerabilities. For CE+ compliance, the Xen infrastructure is kept in an air-gapped environment, and the same approach has been applied to the Nginx-based router (CentOS 7).
Operational Inefficiencies: Decentralized management leads to inconsistencies and human error. Limited automation hinders development and deployment speed and reliability.
Holistic Transformation: The roadmap encompasses interconnected initiatives (EOL upgrades, HA, CI/CD, centralized management). Success in one area amplifies benefits in others, safeguarding core business operations and ensuring continuity.
Current Infrastructure Assessment & Key Shortcomings
This section provides a comprehensive analysis of our current infrastructure, highlighting critical weaknesses and operational deficiencies. Understanding these shortcomings is paramount to appreciating the necessity and strategic direction of the proposed 1-year modernization roadmap.
Critical End-of-Life (EOL) Software
⚠️ XenServer 7.2 (EOL)
Two physical hosts are running EOL XenServer 7.2. This means no security updates, bug fixes, or support, compromising all hosted VMs (dev, test, sand, .NET) and risking prolonged downtime. The hosts are placed in an air-gapped environment for compliance and security reasons, but an upgrade is still necessary.
⚠️ CentOS 7 (Nginx Proxy) (EOL June 2024)
One Nginx proxy runs on EOL CentOS 7, making this critical security boundary highly susceptible to exploits and posing significant compliance risks (PCI DSS, HIPAA, GDPR).
⚠️ Node.js 16 (EOL September 2023)
Node.js 16 environments are EOL due to an OpenSSL dependency. Applications are exposed to unpatched vulnerabilities, hindering new feature adoption and accumulating technical debt. Because the applications running on it are public-facing, this is critical.
General EOL Software Risks Summarized
EOL software universally means no security patches or vendor support, potential incompatibility, and increased operational risk. It's a "ticking time bomb" leading to weakened security, instability, and an impediment to innovation.
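The EOL exposure summarized above can be tracked with a small inventory check. The following is a minimal sketch, not an existing tool: the `Component` class, the `INVENTORY` list, and `eol_report` are hypothetical names. The CentOS 7 and Node.js 16 dates follow the months stated in this document, XenServer 7.2 is left without a date because none is given here, and the Rocky Linux 9 date is a planning assumption that should be verified against the vendor's published lifecycle.

```python
from dataclasses import dataclass
from datetime import date
from typing import List, Optional, Tuple


@dataclass(frozen=True)
class Component:
    name: str
    eol: Optional[date]  # None means no EOL date is recorded in this sketch


# Hypothetical inventory; dates should be confirmed against vendor lifecycle pages.
INVENTORY = [
    Component("XenServer 7.2", None),               # EOL per this document, date not stated
    Component("CentOS 7", date(2024, 6, 30)),       # EOL June 2024
    Component("Node.js 16", date(2023, 9, 11)),     # EOL September 2023
    Component("Rocky Linux 9", date(2032, 5, 31)),  # assumption, verify before relying on it
]


def eol_report(inventory: List[Component], today: date) -> List[Tuple[str, int]]:
    """Return (name, days past EOL) for dated components past EOL, worst first."""
    overdue = [
        (c.name, (today - c.eol).days)
        for c in inventory
        if c.eol is not None and c.eol < today
    ]
    return sorted(overdue, key=lambda item: item[1], reverse=True)


if __name__ == "__main__":
    for name, days in eol_report(INVENTORY, date.today()):
        print(f"{name}: {days} days past EOL")
```

Components without a recorded date are skipped rather than guessed, so the report understates risk until the inventory is complete.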
Operational Deficiencies
- Decentralized Operations & Management: Managing roughly 300 Linux instances individually is inefficient, error-prone, and unscalable for our team, leading to inconsistent configurations and patching. Jenkins operates as a separate CI/CD entity, missing integration opportunities.
- Limited High Availability & Resilience: At the front, the public-facing Nginx proxy is not fully active-active, creating a Single Point of Failure (SPOF). The on-premise virtualization layer also lacks robust HA, impacting development velocity if physical servers fail.
- Security & Compliance Gaps: Web-based VS Code access lacks Multi-Factor Authentication (MFA), although it is currently secured with SSL (public/private key) certificates. Extensive EOL software creates significant compliance challenges.
- Operational Inefficiencies & Development Bottlenecks: The desire to "ditch Jenkins," absence of a central log server, and "worry about upgrading dev env nodejs" highlight slow, error-prone deployments, inefficient troubleshooting (high MTTR), and perpetuation of technical debt.
- MFA for In-House Email Server Access: The in-house email server needs MFA; at the moment, access is protected only against brute-force (BF) and dictionary-based attacks.
Interconnected Fragility
The combination of EOL components, decentralized management, and limited HA creates an environment of interconnected fragility. A failure or compromise in one area can easily cascade to others.
Example of Cascading Risk:
Decentralized management hinders patching, and the lack of centralized logging delays detection of any resulting incident.
The "worry about upgrading dev env nodejs" is symptomatic of deeper issues: past negative experiences, inconsistent environments, and lack of automated testing. Addressing this involves implementing standardized environments (Docker, NVM) and robust automated testing within the new GitLab CI/CD pipeline, fostering better development practices.
Proposed 1-Year Modernization Roadmap
This section details the strategic, phased approach to transition our infrastructure. Each phase builds upon the previous, starting with immediate risk mitigation, then establishing foundational centralized systems, and culminating in comprehensive high availability and advanced automation.
Roadmap Progress Overview
Phase 1 (Jul 2025 - Oct 2025)
Foundational Upgrades & Risk Mitigation
Phase 2 (Nov 2025 - Jan 2026)
Centralization & CI/CD Transformation
Phase 3 (Feb 2026 - Apr 2026)
HA & Expanding Automation
Phase 4 (May 2026 - Jun 2026)
Monitoring & Continuous Improvement
Phase 1: Foundational Upgrades & Immediate Risk Mitigation (Months 1-3)
Objective: Urgently address critical EOL software risks, reduce attack surface, stabilize essential services.
Migrate XenServer 7.2 to XCP-ng:
Reason: XenServer 7.2 is EOL, posing security risks. XCP-ng is an open-source successor with a clear migration path.
Requirements: In-depth planning, hardware validation, license arrangements, comprehensive VM backup, phased migration (host by host), Xen Orchestra implementation for management and backup.
Replace EOL Linux Distros with Rocky Linux 9.x (Nginx Proxy):
Reason: EOL distros are a severe security liability and add the overhead of maintaining the air-gapped environment. Rocky Linux 9.x provides a secure, supported foundation.
Requirements: Provision new Rocky Linux 9.x VMs, Nginx installation and configuration migration (critical, because 150+ active .conf files exist), rigorous testing in a parallel environment, phased DNS cutover. Additional hardware must be deployed to support this properly.
Upgrade Node.js 16 to the Current LTS:
Reason: Node.js 16 EOL exposes apps to risks. Transition to the current LTS (e.g., Node.js 22.x) for security and features. In addition, running Node.js 16 forces us to keep old Linux distros in some places.
Requirements: LTS version 22 has been selected (a security assessment of NVM packages still has to be done), pilot application upgrade (currently in progress, but more developer capacity is required), dependency audit and update, thorough testing, iterative rollout to remaining apps.
Add MFA to Web-Based VS Code Access:
Reason: SSL-only protection is insufficient because of the possibility of human error and malicious activity. An added layer of MFA is needed, in line with best practice, to protect developer credentials.
Requirements: Solution evaluation (e.g., Coder, oauth2-proxy, Keycloak), IdP consideration, implementation and configuration, developer rollout and training. Additional hardware must be deployed to support this properly.
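To illustrate what the MFA layer adds on top of SSL, here is a minimal sketch of time-based one-time passwords (TOTP, RFC 6238), a second factor commonly offered by identity providers such as Keycloak. It assumes TOTP is the factor that ends up being chosen, which this roadmap has not decided. The function names are illustrative, and a production rollout would rely on the selected IdP rather than hand-written code.

```python
import hashlib
import hmac
import struct


def hotp(secret: bytes, counter: int, digits: int = 6) -> str:
    """HOTP (RFC 4226): HMAC-SHA1 over the 8-byte big-endian counter."""
    digest = hmac.new(secret, struct.pack(">Q", counter), hashlib.sha1).digest()
    offset = digest[-1] & 0x0F  # dynamic truncation offset from the last nibble
    code = struct.unpack(">I", digest[offset:offset + 4])[0] & 0x7FFFFFFF
    return str(code % 10 ** digits).zfill(digits)


def totp(secret: bytes, unix_time: int, step: int = 30, digits: int = 6) -> str:
    """TOTP (RFC 6238): HOTP with the counter derived from the clock."""
    return hotp(secret, unix_time // step, digits)


def verify(secret: bytes, submitted: str, unix_time: int, window: int = 1) -> bool:
    """Accept codes from adjacent 30-second steps to tolerate small clock drift."""
    counter = unix_time // 30
    for delta in range(-window, window + 1):
        candidate = counter + delta
        if candidate < 0:
            continue  # no time step exists before the epoch
        if hmac.compare_digest(hotp(secret, candidate), submitted):
            return True
    return False
```

The point of the sketch is that the code depends on a shared secret and the current time, so a stolen password or certificate alone is not enough to log in.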
Phase 2: Centralization & CI/CD Transformation (Months 4-6)
Objective: Streamline operations, enhance security by centralizing Linux identity and configuration management. Initiate pivotal migration from Jenkins to GitLab.
Deploy the IPA Server:
- Reason: Managing multiple Linux instances, physical hosts, and configurations individually is inefficient and insecure. The IPA server offers centralized identity and authentication, with options to add MFA-based solutions to strengthen the IT infrastructure.
- Requirements: Design topology, server provisioning (Rocky Linux 9.x), initial configuration, training on the WebUI/CLI. Additional, suitably skilled staff are needed during implementation, because the work has to run alongside the existing workload.
Integrate Ansible with IPA:
- Reason: Automate client enrollment and dynamic configuration management.
- Requirements: Ansible setup, client enrollment playbooks, configuration management playbooks. Additional, suitably skilled staff are needed during implementation, because the work has to run alongside the existing workload.
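One way to connect the two systems is an Ansible dynamic inventory built from the hosts and host groups registered in IPA. The sketch below only shows the JSON shape Ansible expects from an inventory script (`--list` output with a `_meta.hostvars` section). The host list is hard-coded and the names in `EXAMPLE_HOSTS` are made up; fetching the data from the IPA server is deliberately left out because the integration method has not been chosen yet.

```python
import json
from typing import Dict, Iterable, List, Tuple

# Hypothetical hosts and groups; in practice these would come from IPA host groups.
EXAMPLE_HOSTS: List[Tuple[str, List[str]]] = [
    ("nginx-proxy-01.example.internal", ["proxies", "rocky9"]),
    ("nginx-proxy-02.example.internal", ["proxies", "rocky9"]),
    ("dev-node-01.example.internal", ["dev", "nodejs"]),
]


def build_inventory(hosts: Iterable[Tuple[str, List[str]]]) -> Dict[str, dict]:
    """Build the structure an Ansible inventory script prints for --list."""
    inventory: Dict[str, dict] = {"_meta": {"hostvars": {}}}
    for fqdn, groups in hosts:
        # An empty hostvars entry stops Ansible from calling --host per machine.
        inventory["_meta"]["hostvars"][fqdn] = {}
        for group in groups:
            inventory.setdefault(group, {"hosts": []})["hosts"].append(fqdn)
    return inventory


if __name__ == "__main__":
    print(json.dumps(build_inventory(EXAMPLE_HOSTS), indent=2))
```

Saved as an executable script and passed to `ansible-playbook -i`, output in this shape lets enrollment and configuration playbooks target groups such as `proxies` without a hand-maintained hosts file.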
Web Panel for Ansible Management (AWX):
| Feature | AWX | Rundeck |
|---|---|---|
| FreeIPA Integration | Strong | Possible via plugins |
| Playbook Management | Core feature | Can execute, less native |
| User Interface | Ansible-centric | Job/operational task-centric |
AWX has been selected due to its native Ansible focus.
Plan Jenkins to GitLab Migration:
- Reason: GitLab offers integrated SCM, CI/CD, security. User wants to "ditch Jenkins".
- Requirements: Assess Jenkins pipelines, gradual iterative migration, manual pipeline rewriting (Groovy to YAML), team training.
Develop Initial GitLab CI/CD Pipelines (.NET & Node.js):
- Reason: Build internal expertise and demonstrate value quickly.
- Requirements: Define `.gitlab-ci.yml` for .NET (restore, build, test, publish) and Node.js (install, lint, test, package). Configure runners, artifacts, caching.
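The stage lists above can be turned into starter pipeline files. The following sketch renders a minimal `.gitlab-ci.yml` skeleton for each stack from the stage names in the requirement. The commands, the `PIPELINES` table, and the container images in the usage example are assumptions for illustration (for instance, `npm run lint` presumes a `lint` script exists in `package.json`). Runners, artifacts, and caching are intentionally omitted, so the Node.js jobs simply repeat `npm ci`.

```python
from typing import Dict, List

# Stage order per stack follows the requirement; the commands are assumptions.
PIPELINES: Dict[str, Dict[str, List[str]]] = {
    "dotnet": {
        "restore": ["dotnet restore"],
        "build": ["dotnet build"],
        "test": ["dotnet test"],
        "publish": ["dotnet publish -c Release -o publish"],
    },
    "node": {
        "install": ["npm ci"],
        "lint": ["npm ci", "npm run lint"],
        "test": ["npm ci", "npm test"],
        "package": ["npm ci", "npm pack"],
    },
}


def render_pipeline(stack: str, image: str) -> str:
    """Render a one-job-per-stage .gitlab-ci.yml skeleton as YAML text."""
    jobs = PIPELINES[stack]
    lines = [f"image: {image}", "", "stages:"]
    lines += [f"  - {stage}" for stage in jobs]
    for stage, script in jobs.items():
        lines += ["", f"{stage}:", f"  stage: {stage}", "  script:"]
        lines += [f"    - {command}" for command in script]
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    print(render_pipeline("node", "node:22"))
```

Calling `render_pipeline("dotnet", "mcr.microsoft.com/dotnet/sdk:8.0")` produces the equivalent skeleton for the .NET stack; both outputs are starting points to be refined with caching and artifacts once runners are configured.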
Integrate Automated Software Testing (Functional & Security):
- Reason: Core requirement, "shift-left" security.
- Requirements: Functional test integration. Security: SAST, DAST, Dependency Scanning, Container Scanning, Secret Detection in GitLab pipelines.
Phase 3: Implementing High Availability & Expanding Automation (Months 7-9)
Objective: Significantly enhance service resilience by implementing robust HA for critical frontend (Nginx) and backend (on-premise virtualization) components. Extend CI/CD automation to cover all environments, including AWS production.
Active-Active Nginx Proxies:
Reason: The current setup is fragile because it relies on a single Nginx instance, which is a SPOF. Active-active provides redundancy and load distribution.
Technology: Keepalived with VRRP (core HA).
Requirements: Build on Rocky Linux Nginx VMs. Configure Keepalived (VIPs, health checks). Rigorously test failover and load distribution. Synchronize Nginx configs (Ansible). Replace the old Dell PE 210 servers with two micro servers of a suitable generation, because these Nginx-based proxies act as public-facing Internet gateways.
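An active-active pair with Keepalived is usually built from two VRRP instances and two VIPs: each proxy is MASTER for one VIP and BACKUP for the other, so both carry traffic and either can take over both addresses. The sketch below renders such a `keepalived.conf` for each node. The interface name, VIP addresses (documentation range), router IDs, priorities, and the `pgrep`-based health check are placeholder assumptions, and spreading clients across the two VIPs (for example through DNS) is outside its scope.

```python
from typing import List

# Health check: the node gives up its VIPs when no nginx process is running.
HEALTH_CHECK = """vrrp_script chk_nginx {
    script "/usr/bin/pgrep nginx"
    interval 2
    fall 2
    rise 2
}
"""

INSTANCE = """vrrp_instance VI_{n} {{
    state {state}
    interface {interface}
    virtual_router_id {vrid}
    priority {priority}
    advert_int 1
    virtual_ipaddress {{
        {vip}
    }}
    track_script {{
        chk_nginx
    }}
}}
"""


def render_keepalived(node: int, interface: str, vips: List[str]) -> str:
    """Render keepalived.conf for proxy `node` (0 or 1) of a two-node pair."""
    blocks = [HEALTH_CHECK]
    for n, vip in enumerate(vips):
        owner = (n % 2 == node)  # each node owns every second VIP
        blocks.append(INSTANCE.format(
            n=n + 1,
            state="MASTER" if owner else "BACKUP",
            interface=interface,
            vrid=51 + n,  # must match on both nodes for the same VIP
            priority=150 if owner else 100,
            vip=vip,
        ))
    return "\n".join(blocks)


if __name__ == "__main__":
    print(render_keepalived(0, "eth0", ["192.0.2.10/24", "192.0.2.11/24"]))
```

Generating both files from one template keeps router IDs aligned across the pair, and the same rendering step could later be moved into the Ansible role that synchronizes the Nginx configs.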
High Availability for On-Premise Virtualization:
Reason: Guarantee uptime for on-prem dev/test/sand environments and management plane.
Selected Approach (Dual-Platform):
- Hyper-V Failover Clustering: For Windows Server 2019 hosts (.NET workloads).
- XCP-ng with Xen Orchestra HA: For migrated XCP-ng hosts (Linux workloads).
Requirements: Shared storage implementation/verification for both, HA configuration (Failover Cluster Manager for Hyper-V, Xen Orchestra for XCP-ng), comprehensive HA testing (failover, live migration). Suitable OS licenses are required.
This approach leverages existing skills (Hyper-V) and embraces modern open-source (XCP-ng) without the overhead of a full IaaS like OpenStack initially.
Extend GitLab CI/CD to All Environments:
Reason: Extend GitLab CI/CD to automate deployments across all environments including AWS production, improving velocity, consistency, and reliability.
Requirements:
- Pipeline standardization & templating (GitLab CI `include`).
- Secure AWS deployment configuration in GitLab CI/CD (with protected variables, IAM roles).
- Develop deployment scripts/templates (AWS CLI, SDKs, Terraform/CloudFormation via GitLab CI).
- Define clear environment promotion strategy (GitLab Environments, branch-based workflows, protected branches/tags for production).
- Incremental rollout to production, starting with less critical apps, monitoring, and refining.
Phase 4: Comprehensive Monitoring & Continuous Improvement (Months 10-12)
Objective: Establish robust observability, standardize development practices for long-term sustainability, complete MongoDB upgrades, and conduct a comprehensive review to inform future planning.
Centralized Logging:
Reason: Crucial for troubleshooting, security monitoring, and operational awareness across on-prem and AWS. User desires "graphical based controls". At the moment there is no centralized approach: log collection is segmented by workload, which makes evaluation time-consuming.
Planned implementation: Graylog (as it is all-in-one and user-friendly).
Requirements: Infrastructure provisioning, agent deployment (Grafana Alloy/Fluentd) on Linux/Windows/apps/AWS, dashboarding & alerting in Graylog UI.
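Besides agent-based collection, applications can send structured events straight to Graylog in GELF, its native JSON log format. The sketch below builds a GELF 1.1 message and sends it over UDP. It assumes a GELF UDP input has been enabled on the Graylog server on the conventional port 12201; the host names and the extra `environment` field in the test data are illustrative only.

```python
import json
import socket
import time
from typing import Any, Dict, Optional


def gelf_message(host: str, short_message: str, level: int = 6,
                 timestamp: Optional[float] = None, **fields: Any) -> Dict[str, Any]:
    """Build a GELF 1.1 payload; custom fields get the required '_' prefix."""
    if "id" in fields:
        raise ValueError("GELF does not allow an additional field named _id")
    message: Dict[str, Any] = {
        "version": "1.1",
        "host": host,
        "short_message": short_message,
        "timestamp": timestamp if timestamp is not None else time.time(),
        "level": level,  # syslog severity: 3 = error, 4 = warning, 6 = info
    }
    message.update({f"_{key}": value for key, value in fields.items()})
    return message


def send_gelf(message: Dict[str, Any], server: str, port: int = 12201) -> None:
    """Send one uncompressed GELF message as a single UDP datagram."""
    payload = json.dumps(message).encode("utf-8")
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.sendto(payload, (server, port))
```

Tagging each event with fields such as the environment or application name is what makes cross-workload dashboards and alerts possible later, whichever shipping agent is finally used.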
Standardize Development Practices:
Reason: Sustain the "latest software" goal and prevent drift back towards outdated software and upgrade anxiety.
Requirements:
- Policy definition & governance (OS versioning, Node.js LTS adoption process).
- Enforce Node.js Version Management (NVM locally, Docker in CI/CD & deployed envs).
- Standardized base images (VM templates for XCP-ng/Hyper-V, Docker images for .NET/Node.js) with regular updates.
- Automated patching & updates using the IPA & Ansible (scheduled GitLab CI or AWX jobs).
Complete MongoDB Upgrade (v6 to v7):
Reason: Complete the ongoing critical database upgrade from MongoDB v6 to v7 for dev/test/sand environments.
Requirements: Complete the upgrade process for all targeted replica sets, perform comprehensive functional/performance validation, set Feature Compatibility Version (FCV) to "7.0" cautiously, update documentation.
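Setting the FCV "cautiously" can be made concrete with a guard that refuses to raise it until every replica set member already runs a 7.0 binary. The sketch below is a pure-Python illustration: `can_raise_fcv` and `fcv_command` are hypothetical helper names, and the member versions are passed in as plain strings rather than read from a live cluster. The command document mirrors MongoDB's `setFeatureCompatibilityVersion` admin command, which from 7.0 onward requires `confirm: true`.

```python
from typing import Dict, List, Union


def can_raise_fcv(member_versions: List[str], target: str = "7.0") -> bool:
    """True only if every member's binary version matches the target major.minor."""
    wanted = target.split(".")[:2]
    return bool(member_versions) and all(
        version.split(".")[:2] == wanted for version in member_versions
    )


def fcv_command(target: str = "7.0") -> Dict[str, Union[str, bool]]:
    """Admin command document for raising the FCV."""
    return {"setFeatureCompatibilityVersion": target, "confirm": True}


def plan_fcv_change(member_versions: List[str], target: str = "7.0"):
    """Return the command to run, or raise if any member is still on an older binary."""
    if not can_raise_fcv(member_versions, target):
        raise RuntimeError(
            f"not all members run {target}.x binaries: {member_versions}"
        )
    return fcv_command(target)
```

With a driver such as PyMongo, the returned document could be passed to `client.admin.command(...)` on the primary. Because raising the FCV can make a binary downgrade difficult, running it only after the validation step above has passed is the cautious order.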
Year-End Review:
Reason: Conclude the 1-year roadmap with a review of achievements, lessons learned, and team feedback to inform future strategic planning.
Requirements:
- Assess progress against roadmap goals (quantify improvements).
- Capture lessons learned (workshops, feedback sessions).
- Gather team feedback on new tools/processes.
- Strategic planning for Year 2 and beyond (e.g., further cloud adoption, Kubernetes, advanced security, cost optimization, deeper automation).
Key Decisions Justification
- Migrating from XenServer to XCP-ng: Chosen for being open-source, actively maintained, with a clear upgrade path and familiar architecture.
- Adopting Rocky Linux 9: Stable, RHEL-compatible, community-supported replacement for CentOS with long-term support.
- Using the IPA server & Ansible: Provides centralized identity and configuration management with automation and auditing capabilities.
- Switching to GitLab CI/CD: Replaces Jenkins with integrated source control, pipelines, and security scanning, aligning with modern DevOps practices. Security-focused code scanning may also be possible (more testing and a PoC are required). At present, testing-team reviews tend to overlook web application security, which shifts that work onto the DevOps team and consumes time; an automated tool integrated into the delivery pipeline would help.
- Grafana Loki vs Graylog vs ELK: Trade-offs were evaluated; Grafana Loki or Graylog selected for usability and lower complexity vs ELK stack.
Impact & Future Planning
- Improved Security Posture: EOL risk eliminated, MFA enforced, automated patching in place.
- Increased Development Velocity: CI/CD automation reduces manual errors, accelerates deployments.
- Stronger Compliance Readiness: System centralization and logging support audit trails and access control.
- Lower Operational Overhead: Ansible & FreeIPA reduce manual administrative tasks.
- Path Forward: Year 2 can explore Kubernetes, enhanced cost monitoring, and full SRE practices.