MUHAMMED ALİ DOĞAN

Strategic Senior Site Reliability Engineer with 8 years of experience spanning the full infrastructure lifecycle—from Machine Learning and NLP development to co-founding DevOps startups and leading Enterprise SRE initiatives. My career is defined by a passion for deep-system internals and the belief that true reliability requires visibility from the application logic down to the packet level. I specialize in resolving the "unsolvable" performance bottlenecks that exist in the gaps between the application, the cloud network, and the kernel.

Currently, I am focused on the evolution of SRE in the AI era, developing custom R&D tools to analyze high-cardinality observability data for automated Root Cause Analysis (RCA) in AI/ML environments. I bridge the gap between traditional reliability forensics (L4-L7 Packet Analysis, Wireshark, RTCC) and modern cloud-native standards (Azure, LGTM Stack, OpenTelemetry). I don't just maintain systems; I build the tools and frameworks that make self-healing, intelligent infrastructure a reality.

Ankara, Türkiye

Professional Experience

Senior Site Reliability Engineer

Destel Bilişim

Remote & Ankara, Jan 2022 - Present
  • Agentic Observability & AI-SRE R&D: Spearheading the development of stateful, multi-agent diagnostic workflows using LangGraph to automate Root Cause Analysis (RCA). Architecting observability pipelines using LangSmith, focusing on multi-step trace analysis, latency attribution, and semantic telemetry analysis across high-cardinality data and vector databases.
  • High-Scale Cloud & Retail Transformation: Directed the design and implementation of end-to-end observability solutions for enterprise clients using Azure Monitor, Log Analytics, and New Relic, including a migration from Azure to New Relic for centralized observability across hundreds of stores. Managed complex architectures across cloud and retail in-store systems, ensuring secure cloud-to-on-prem integration and high availability.
  • Deep-System Forensics (L4–L7): Acting as the final escalation point for mission-critical system regressions. Correlating high-volume telemetry (logs, metrics, traces) with deep packet-level analysis using Wireshark, Aternity, and Riverbed AppResponse to resolve "unsolvable" performance bottlenecks.
  • SRE Governance & Leadership: Leading the Digital Performance team in guiding clients through the adoption of SRE principles. Defining and monitoring SLIs/SLOs, managing error budgets, and establishing operational ownership models to consistently improve MTTD/MTTR.
  • Software-Defined Automation: Applying a developer-first approach to build custom analysis agents for data aggregation and automated insight generation. Automating deployment, monitoring, and alerting pipelines using Python, Terraform, and GitHub Actions to streamline incident response and reduce manual toil and detection times.
  • Enterprise Reliability Architect: Engineered high-availability observability and performance standards for both high-scale Azure Cloud ecosystems and secure, air-gapped on-premise environments (Government/Banking), ensuring 99.9%+ uptime for mission-critical retail and disconnected infrastructures.

Site Reliability Engineer

EHSIM

Ankara, Mar 2021 - Oct 2021
  • Air-Gapped Infrastructure Governance: Held full production responsibility for the availability and reliability of a confidential, mission-critical defense system. Managed a strictly air-gapped, high-compliance environment, ensuring 99.9% uptime under rigorous security protocols.
  • Distributed Systems: Orchestrated and maintained a complex infrastructure stack built on OpenStack, Rancher (Kubernetes), and Ceph. Managed high-availability components including HAProxy and PostgreSQL, ensuring seamless data persistence and load balancing.
  • Deep-System Forensics & RCA: Led cross-layer Root Cause Analysis (RCA) spanning application, network, and storage (Ceph) layers. Resolved critical performance regressions involving Kubernetes clusters, etcd, and networking in a disconnected environment where external support was unavailable.
  • Observability Architecture: Designed and implemented comprehensive monitoring and alerting strategies using Zabbix, Grafana, and the ELK stack. Optimized system visibility to significantly reduce incident detection and response times (MTTD/MTTR).
  • Software-Defined Automation: Drew on a strong development background to build Python and Bash automation scripts. Eliminated operational toil by streamlining repeatable infrastructure maintenance and lifecycle tasks within isolated networks.

DevOps & Backend Engineer (Co-Founder)

Logarity

İstanbul, Apr 2020 - Sep 2020
  • Founding Product Engineering: Co-founded and developed Logarity, an ELK-based mini-SIEM solution designed for high-efficiency log retention and compliance.
  • High-Throughput Architecture: Designed and implemented distributed log ingestion pipelines using Kafka to ensure operational scalability under heavy data loads.
  • Custom Tooling & Agents: Developed high-performance, gRPC-based agents in Python to optimize reliable data transfer between endpoints and the central platform.
  • Reliability-First Backend: Extended the ELK stack with custom modules for agent management and long-term archiving, containerizing the entire ecosystem with Docker for reproducible on-prem deployments.
  • Operational Ownership: Owned the end-to-end DevOps lifecycle, including container orchestration and the reliability of distributed on-prem installations.

Software Engineer / DevOps (Co-Founder)

AllConfig

İstanbul, Jul 2019 - Mar 2020
  • Technical Leadership: Led a small development team in building a microservice-based network configuration management system from the ground up.
  • Network Automation: Designed Python-based backend services and APIs to automate and audit network device configurations using Netmiko.
  • Infrastructure-as-Code: Orchestrated service isolation and deployment workflows using Docker Swarm, focusing on operational simplicity for on-premise environments.
  • CI/CD Orchestration: Owned deployment pipelines, bridging the gap between application development and network operations using Azure DevOps.
  • Collaborative Design: Directed system design and operational decision-making under the mentorship of senior industry experts.

Machine Learning & NLP Engineer

VeriUs

İstanbul, Jun 2018 - May 2019
  • NLP Pipeline Engineering: Built robust data collection and preprocessing pipelines, including web scraping, tokenization, and noise filtering for NLP systems.
  • Model Deployment: Developed and containerized Python-based REST APIs (Flask/Docker) to serve intent detection and text summarization models for client-facing production use.
  • R&D & Performance Validation: Conducted model training experiments and performance validation for noisy text classification and intent detection.
  • Academic Contributions: Collaborated with Dr. Murat Can Ganiz on peer-reviewed academic research on Word Sense Disambiguation, building a foundation in the NLP concepts that underlie modern LLMs.

Core Skills

AI / MLOps Infrastructure

  • Agentic Reliability R&D: Architecting observability for high-cardinality RAG-based applications and agentic workflows using LangGraph and LangSmith to ensure production stability.
  • Automated RCA Tooling: Developing custom diagnostic agents to automate Root Cause Analysis (RCA) by querying semantic telemetry across model inference, data pipelines, and vector databases.
  • Performance Benchmarking: Monitoring latency attribution, throughput, and error patterns of AI workloads, maintaining a strict separation between research experimentation and production reliability ownership.

Reliability & SRE

  • SRE Governance: Defining and operating SLIs/SLOs and managing Error Budgets to guide operational and architectural decisions and optimize MTTD/MTTR.
  • Deep-System Forensics: Leading end-to-end incident management and resolving "unsolvable" performance bottlenecks across application, cloud network (L4–L7), and kernel layers.
  • Incident Lifecycle Optimization: Improving detection and recovery times through advanced observability, high-quality alerting, and disciplined data-driven operational practices.

Observability & Telemetry

  • Packet-Level RCA: Rare expertise in L4–L7 Deep Packet Analysis using Wireshark, Riverbed, and Aternity to identify anomalies and regressions invisible to traditional APM.
  • Observability Platform Design: Designing and operating high-volume observability platforms using metrics, logs, and traces (LGTM Stack, New Relic, ELK).
  • Intelligent Insights: Developing custom analysis agents for log, metric, and trace correlation to reduce noise and generate automated operational insights.

Programming & Automation

  • Developer-First SRE: Applying strong development skills in Python (FastAPI, Flask) and full-stack frameworks (Next.js, Flutter) to build custom internal reliability tools and dashboards.
  • Infrastructure-as-Code (IaC): Engineering infrastructure automation and CI/CD pipelines using Terraform, Azure DevOps, and GitHub Actions.
  • Strategic Automation: Building backend services and operational scripts (Bash, PowerShell) to eliminate manual toil and support repeatable infrastructure maintenance.

Cloud & Platform

  • Strategic Direction: Leading and mentoring observability teams, bridging the gap between executive business goals and deep-system execution.
  • Environment Versatility: Managing reliability across diverse environments, from strictly air-gapped, on-premise systems (Government/Banking) to high-scale Azure Cloud deployments.
  • Distributed Systems Monitoring: Expertise in monitoring and troubleshooting Kubernetes (Rancher), Docker, OpenStack, and bare-metal platforms to ensure the stability of production systems.

Certifications & Badges

Professional certifications and learning achievements

Trailblazer Technical Practitioner

Grafana Labs
Issued: Nov 2025
Grafana, Observability

Site Reliability Engineering (SRE) Foundation℠v1

DevOps Institute
Issued: Dec 2022
ID: 23833559
Troubleshooting, Observability, Site Reliability Engineering

End-to-End Visibility

Riverbed Technology
Troubleshooting, Observability, Network Performance, Application Performance

RCPE Associate: AppResponse

Riverbed Technology
Troubleshooting, Observability, Application Performance, APM

RCPE Associate: Introduction to NPM

Riverbed Technology
Troubleshooting, Observability, Network Performance, NPM

RCPE Certified Professional

Riverbed Technology
Troubleshooting, Observability, Network Performance, Application Performance

RCPE Foundation: Performance Foundations

Riverbed Technology
Troubleshooting, Observability, Performance Analysis, System Monitoring

RCPE Professional: AppResponse 11

Riverbed Technology
Troubleshooting, Observability, Application Performance, APM

RCPE Professional: Packet Analyzer Plus (PA+)

Riverbed Technology
Troubleshooting, Observability, Packet Analysis, Network Analysis