MUHAMMED ALİ DOĞAN

Strategic Senior Site Reliability Engineer with 8 years of experience spanning the full infrastructure lifecycle—from Machine Learning and NLP development to co-founding DevOps startups and leading Enterprise SRE initiatives. My career is defined by a passion for deep-system internals and the belief that true reliability requires visibility from the application logic down to the packet level. I specialize in resolving the "unsolvable" performance bottlenecks that exist in the gaps between the application, the cloud network, and the kernel.

Currently, I am focused on the evolution of SRE in the AI era, developing custom R&D tools to analyze high-cardinality observability data for automated Root Cause Analysis (RCA) in AI/ML environments. I bridge the gap between traditional reliability forensics (L4-L7 Packet Analysis, Wireshark, RTCC) and modern cloud-native standards (Azure, LGTM Stack, OpenTelemetry). I don't just maintain systems; I build the tools and frameworks that make self-healing, intelligent infrastructure a reality.

Ankara, Turkiye

Professional Experience

Senior Site Reliability Engineer

Destel Bilişim

Remote & Ankara

Jan 2022 - Present

Agentic Observability & AI-SRE R&D: Spearheading the development of stateful, multi-agent diagnostic workflows using LangGraph to automate Root Cause Analysis (RCA). Architecting observability pipelines using LangSmith, focusing on multi-step trace analysis, latency attribution, and semantic telemetry analysis across high-cardinality data and Vector Databases.
High-Scale Cloud & Retail Transformation: Directed the design and implementation of end-to-end observability solutions for enterprise clients using Azure Monitor, Log Analytics, and New Relic, including a migration from Azure to New Relic to centralize observability across hundreds of stores. Managed complex architectures across cloud and retail in-store systems, ensuring secure cloud-on-prem integration and high availability.
Deep-System Forensics (L4–L7): Acting as the final escalation point for mission-critical system regressions. Correlating high-volume telemetry (logs, metrics, traces) with deep packet-level analysis using Wireshark, Aternity, and Riverbed AppResponse to resolve "unsolvable" performance bottlenecks.
SRE Governance & Leadership: Leading the Digital Performance team in guiding clients through the adoption of SRE principles. Defined and monitored SLIs/SLOs, managed error budgets, and established operational ownership models to consistently improve MTTD/MTTR.
Software-Defined Automation: Leveraging "developer-first" skills to build custom analysis agents for data aggregation and automated insight generation. Optimized incident response by automating deployment, monitoring, and alerting pipelines using Python, Terraform, and GitHub Actions, reducing manual toil and detection times.
Enterprise Reliability Architect: Engineered high-availability observability and performance standards for both high-scale Azure Cloud ecosystems and secure, air-gapped on-premise environments (Government/Banking), ensuring 99.9%+ uptime for mission-critical retail and disconnected infrastructures.

Site Reliability Engineer

EHSIM

Ankara

Mar 2021 - Oct 2021

Air-Gapped Infrastructure Governance: Held full production responsibility for the availability and reliability of a confidential, mission-critical defense system. Managed a strictly air-gapped, high-compliance environment, ensuring 99.9% uptime under rigorous security protocols.
Distributed Systems: Orchestrated and maintained a complex infrastructure stack built on OpenStack, Rancher (Kubernetes), and Ceph. Managed high-availability components including HAProxy and PostgreSQL, ensuring seamless data persistence and load balancing.
Deep-System Forensics & RCA: Led cross-layer Root Cause Analysis (RCA) spanning application, network, and storage (Ceph) layers. Resolved critical performance regressions involving Kubernetes clusters, etcd, and networking in a disconnected environment where external support was unavailable.
Observability Architecture: Designed and implemented comprehensive monitoring and alerting strategies using Zabbix, Grafana, and the ELK stack. Optimized system visibility to significantly reduce incident detection and response times (MTTD/MTTR).
Software-Defined Automation: Drew on a strong development background to build Python- and Bash-based automation scripts. Eliminated operational toil by streamlining repeatable infrastructure maintenance and lifecycle tasks within isolated networks.

DevOps & Backend Engineer (Co-Founder)

Logarity

İstanbul

Apr 2020 - Sep 2020

Founding Product Engineering: Co-founded and developed Logarity, an ELK-based mini-SIEM solution designed for high-efficiency log retention and compliance.
High-Throughput Architecture: Designed and implemented distributed log ingestion pipelines using Kafka to ensure operational scalability under heavy data loads.
Custom Tooling & Agents: Developed high-performance, gRPC-based agents in Python to optimize reliable data transfer between endpoints and the central platform.
Reliability-First Backend: Extended the ELK stack with custom modules for agent management and long-term archiving, containerizing the entire ecosystem with Docker for reproducible on-prem deployments.
Operational Ownership: Owned the end-to-end DevOps lifecycle, including container orchestration and the reliability of distributed on-prem installations.

Software Engineer / DevOps (Co-Founder)

AllConfig

İstanbul

Jul 2019 - Mar 2020

Technical Leadership: Led a small development team in building a microservice-based network configuration management system from the ground up.
Network Automation: Designed Python-based backend services and APIs to automate and audit network device configurations using Netmiko.
Infrastructure-as-Code: Orchestrated service isolation and deployment workflows using Docker Swarm, focusing on operational simplicity for on-premise environments.
CI/CD Orchestration: Owned deployment pipelines, bridging the gap between application development and network operations using Azure DevOps.
Collaborative Design: Directed system design and operational decision-making under the mentorship of senior industry experts.

Machine Learning & NLP Engineer

VeriUs

İstanbul

Jun 2018 - May 2019

NLP Pipeline Engineering: Built robust data collection and preprocessing pipelines, including web scraping, tokenization, and noise filtering for NLP systems.
Model Deployment: Developed and containerized Python-based REST APIs (Flask/Docker) to serve intent detection and text summarization models for client-facing production use.
R&D & Performance Validation: Conducted model training experiments and performance validation for noisy text classification and intent detection.
Academic Contributions: Collaborated with Dr. Murat Can Ganiz on peer-reviewed academic research concerning Word Sense Disambiguation, providing a foundational understanding of LLM logic.

Core Skills

AI / MLOps Infrastructure

Agentic Reliability R&D: Architecting observability for high-cardinality RAG-based applications and agentic workflows using LangGraph and LangSmith to ensure production stability.
Automated RCA Tooling: Developing custom diagnostic agents to automate Root Cause Analysis (RCA) by querying semantic telemetry across model inference, data pipelines, and vector databases.
Performance Benchmarking: Monitoring latency attribution, throughput, and error patterns of AI workloads, maintaining a strict separation between research experimentation and production reliability ownership.

Reliability & SRE

SRE Governance: Defining and operating SLIs/SLOs and managing Error Budgets to guide operational and architectural decisions and optimize MTTD/MTTR.
Deep-System Forensics: Leading end-to-end incident management and resolving "unsolvable" performance bottlenecks across application, cloud network (L4–L7), and kernel layers.
Incident Lifecycle Optimization: Improving detection and recovery times through advanced observability, high-quality alerting, and disciplined data-driven operational practices.

Observability & Telemetry

Packet-Level RCA: Rare expertise in L4–L7 Deep Packet Analysis using Wireshark, Riverbed, and Aternity to identify anomalies and regressions invisible to traditional APM.
Observability Platform Design: Designing and operating high-volume observability platforms using metrics, logs, and traces (LGTM Stack, New Relic, ELK).
Intelligent Insights: Developing custom analysis agents for log, metric, and trace correlation to reduce noise and generate automated operational insights.

Programming & Automation

Developer-First SRE: Applying a strong development background in Python (FastAPI, Flask) and full-stack frameworks (Next.js, Flutter) to build custom internal reliability tools and dashboards.
Infrastructure-as-Code (IaC): Engineering infrastructure automation and CI/CD pipelines using Terraform, Azure DevOps, and GitHub Actions.
Strategic Automation: Building backend services and operational scripts (Bash, PowerShell) to eliminate manual toil and support repeatable infrastructure maintenance.

Cloud & Platform

Strategic Direction: Leading and mentoring observability teams, bridging the gap between executive business goals and deep-system execution.
Environment Versatility: Managing reliability across diverse environments, from strictly air-gapped, on-premise systems (Government/Banking) to high-scale Azure Cloud deployments.
Distributed Systems Monitoring: Expertise in monitoring and troubleshooting Kubernetes (Rancher), Docker, OpenStack, and bare-metal platforms to ensure the stability of production systems.

Latest Blog Posts

Recent thoughts and tutorials

AI, Technology Trends

The 2026 Technology World: 10 Critical Trends from Software Development to the Quantum Revolution

Ten trends shaping the technology world in 2026: Code Janitors, the AI plateau, an IPO wave, nuclear data centers, and quantum applications. A technical perspective.

Jan 28, 2026

4 min read

AI, Tech Trends

Tech Trends 2026: From AI Plateaus to the Rise of "Code Janitors"

Ten critical trends shaping 2026: the code janitor role, LLM plateau, IPO wave, humanoid robots, nuclear data centers, quantum practicality, and JavaScript evolution.

Jan 27, 2026

5 min read

AI, Security

Decoding ClawdBot: Is Anthropic's Web Crawler a Threat to Your Infrastructure?

Identify ClawdBot activity, distinguish it from spoofing, and implement robots.txt or WAF controls to protect bandwidth and content without hurting SEO.

Jan 26, 2026

4 min read
