MUHAMMED ALİ DOĞAN
Strategic Senior Site Reliability Engineer with 8 years of experience spanning the full infrastructure lifecycle—from Machine Learning and NLP development to co-founding DevOps startups and leading Enterprise SRE initiatives. My career is defined by a passion for deep-system internals and the belief that true reliability requires visibility from the application logic down to the packet level. I specialize in resolving the "unsolvable" performance bottlenecks that exist in the gaps between the application, the cloud network, and the kernel.
Currently, I am focused on the evolution of SRE in the AI era, developing custom R&D tools to analyze high-cardinality observability data for automated Root Cause Analysis (RCA) in AI/ML environments. I bridge the gap between traditional reliability forensics (L4-L7 Packet Analysis, Wireshark, RTCC) and modern cloud-native standards (Azure, LGTM Stack, OpenTelemetry). I don't just maintain systems; I build the tools and frameworks that make self-healing, intelligent infrastructure a reality.
Professional Experience
Senior Site Reliability Engineer
Destel Bilişim
- Agentic Observability & AI-SRE R&D: Spearheading the development of stateful, multi-agent diagnostic workflows using LangGraph to automate Root Cause Analysis (RCA). Architecting observability pipelines using LangSmith, focusing on multi-step trace analysis, latency attribution, and semantic telemetry analysis across high-cardinality data and Vector Databases.
- High-Scale Cloud & Retail Transformation: Directed the design and implementation of end-to-end observability solutions for enterprise clients using Azure Monitor, Log Analytics, and New Relic, including a migration from Azure to New Relic that centralized observability across hundreds of stores. Managed complex architectures across cloud and retail in-store systems, ensuring secure cloud-on-prem integration and high availability.
- Deep-System Forensics (L4–L7): Acting as the final escalation point for mission-critical system regressions, correlating high-volume telemetry (logs, metrics, traces) with deep packet-level analysis using Wireshark, Aternity, and Riverbed AppResponse to resolve "unsolvable" performance bottlenecks.
- SRE Governance & Leadership: Leading the Digital Performance team in guiding clients through the adoption of SRE principles. Defined and monitored SLIs/SLOs, managed error budgets, and established operational ownership models to consistently improve MTTD/MTTR.
- Software-Defined Automation: Leveraging "developer-first" skills to build custom analysis agents for data aggregation and automated insight generation. Optimized incident response by automating deployment, monitoring, and alerting pipelines using Python, Terraform, and GitHub Actions, reducing manual toil and detection times.
- Enterprise Reliability Architect: Engineered high-availability observability and performance standards for both high-scale Azure Cloud ecosystems and secure, air-gapped on-premise environments (Government/Banking), ensuring 99.9%+ uptime for mission-critical retail and disconnected infrastructures.
Site Reliability Engineer
EHSIM
- Air-Gapped Infrastructure Governance: Held full production responsibility for the availability and reliability of a confidential, mission-critical defense system. Managed a strictly air-gapped, high-compliance environment, ensuring 99.9% uptime under rigorous security protocols.
- Distributed Systems: Orchestrated and maintained a complex infrastructure stack built on OpenStack, Rancher (Kubernetes), and Ceph. Managed high-availability components including HAProxy and PostgreSQL, ensuring seamless data persistence and load balancing.
- Deep-System Forensics & RCA: Led cross-layer Root Cause Analysis (RCA) spanning application, network, and storage (Ceph) layers. Resolved critical performance regressions involving Kubernetes clusters, etcd, and networking in a disconnected environment where external support was unavailable.
- Observability Architecture: Designed and implemented comprehensive monitoring and alerting strategies using Zabbix, Grafana, and the ELK stack. Optimized system visibility to significantly reduce incident detection and response times (MTTD/MTTR).
- Software-Defined Automation: Applied a strong development background to build Python and Bash-based automation scripts. Eliminated operational toil by streamlining repeatable infrastructure maintenance and lifecycle tasks within isolated networks.
DevOps & Backend Engineer (Co-Founder)
Logarity
- Founding Product Engineering: Co-founded and developed Logarity, an ELK-based mini-SIEM solution designed for high-efficiency log retention and compliance.
- High-Throughput Architecture: Designed and implemented distributed log ingestion pipelines using Kafka to ensure operational scalability under heavy data loads.
- Custom Tooling & Agents: Developed high-performance, gRPC-based agents in Python for reliable data transfer between endpoints and the central platform.
- Reliability-First Backend: Extended the ELK stack with custom modules for agent management and long-term archiving, containerizing the entire ecosystem with Docker for reproducible on-prem deployments.
- Operational Ownership: Owned the end-to-end DevOps lifecycle, including container orchestration and the reliability of distributed on-prem installations.
Software Engineer / DevOps (Co-Founder)
AllConfig
- Technical Leadership: Led a small development team in building a microservice-based network configuration management system from the ground up.
- Network Automation: Designed Python-based backend services and APIs to automate and audit network device configurations using Netmiko.
- Infrastructure-as-Code: Orchestrated service isolation and deployment workflows using Docker Swarm, focusing on operational simplicity for on-premise environments.
- CI/CD Orchestration: Owned deployment pipelines, bridging the gap between application development and network operations using Azure DevOps.
- Collaborative Design: Directed system design and operational decision-making under the mentorship of senior industry experts.
Machine Learning & NLP Engineer
VeriUs
- NLP Pipeline Engineering: Built robust data collection and preprocessing pipelines, including web scraping, tokenization, and noise filtering for NLP systems.
- Model Deployment: Developed and containerized Python-based REST APIs (Flask/Docker) to serve intent detection and text summarization models for client-facing production use.
- R&D & Performance Validation: Conducted model training experiments and performance validation for noisy text classification and intent detection.
- Academic Contributions: Collaborated with Dr. Murat Can Ganiz on peer-reviewed academic research concerning Word Sense Disambiguation, which provided a foundation for understanding how language models handle word meaning in context.
Core Skills
AI / MLOps Infrastructure
- Agentic Reliability R&D: Architecting observability for high-cardinality RAG-based applications and agentic workflows using LangGraph and LangSmith to ensure production stability.
- Automated RCA Tooling: Developing custom diagnostic agents to automate Root Cause Analysis (RCA) by querying semantic telemetry across model inference, data pipelines, and vector databases.
- Performance Benchmarking: Monitoring latency attribution, throughput, and error patterns of AI workloads, maintaining a strict separation between research experimentation and production reliability ownership.
Reliability & SRE
- SRE Governance: Defining and operating SLIs/SLOs and managing Error Budgets to guide operational and architectural decisions and optimize MTTD/MTTR.
- Deep-System Forensics: Leading end-to-end incident management and resolving "unsolvable" performance bottlenecks across application, cloud network (L4–L7), and kernel layers.
- Incident Lifecycle Optimization: Improving detection and recovery times through advanced observability, high-quality alerting, and disciplined data-driven operational practices.
Observability & Telemetry
- Packet-Level RCA: Rare expertise in L4–L7 Deep Packet Analysis using Wireshark, Riverbed, and Aternity to identify anomalies and regressions invisible to traditional APM.
- Observability Platform Design: Designing and operating high-volume observability platforms using metrics, logs, and traces (LGTM Stack, New Relic, ELK).
- Intelligent Insights: Developing custom analysis agents for log, metric, and trace correlation to reduce noise and generate automated operational insights.
Programming & Automation
- Developer-First SRE: Applying a strong development background in Python (FastAPI, Flask) and full-stack frameworks (Next.js, Flutter) to build custom internal reliability tools and dashboards.
- Infrastructure-as-Code (IaC): Engineering infrastructure automation and CI/CD pipelines using Terraform, Azure DevOps, and GitHub Actions.
- Strategic Automation: Building backend services and operational scripts (Bash, PowerShell) to eliminate manual toil and support repeatable infrastructure maintenance.
Cloud & Platform
- Strategic Direction: Leading and mentoring observability teams, bridging the gap between executive business goals and deep-system execution.
- Environment Versatility: Managing reliability across diverse environments, from strictly air-gapped, on-premise systems (Government/Banking) to high-scale Azure Cloud deployments.
- Distributed Systems Monitoring: Expertise in monitoring and troubleshooting Kubernetes (Rancher), Docker, OpenStack, and bare-metal platforms to ensure the stability of production systems.
Latest Blog Posts
Recent thoughts and tutorials
The Tech World in 2026: 10 Critical Trends from Software Development to the Quantum Revolution
Ten trends shaping the tech world in 2026: Code Janitors, the AI plateau, the IPO wave, nuclear data centers, and quantum applications. A technical perspective.
Tech Trends 2026: From AI Plateaus to the Rise of "Code Janitors"
Ten critical trends shaping 2026: the code janitor role, LLM plateau, IPO wave, humanoid robots, nuclear data centers, quantum practicality, and JavaScript evolution.
Decoding ClawdBot: Is Anthropic's Web Crawler a Threat to Your Infrastructure?
Identify ClawdBot activity, distinguish it from spoofing, and implement robots.txt or WAF controls to protect bandwidth and content without hurting SEO.

