Logging in Java Microservices

Logging Challenges in Microservices

Key Problems

Distributed transactions:

  • A single request passes through many services
  • It is hard to trace the full execution path
  • Logs are scattered across different systems and files

Log correlation:

  • Logs for a single request must be linked across services
  • Without a shared context, debugging becomes much harder
  • Timestamps may differ between servers

Data volume:

  • Microservices generate an enormous amount of logs
  • Efficient aggregation and search are required
  • Storing and indexing large volumes is expensive

Structure:

  • Unstructured logs are hard to analyze
  • Different services use different formats
  • Standardization and parsing are needed

Explanation: In a monolith all logs live in one place; in microservices, without the right tooling, this turns into a distributed debugging nightmare.


Fundamental Principles

1. Correlation ID

The key concept: every request gets a unique identifier that is propagated through the entire chain of services.

@Component
public class CorrelationInterceptor implements HandlerInterceptor {
    
    private static final String CORRELATION_ID_HEADER = "X-Correlation-ID";
    
    @Override
    public boolean preHandle(HttpServletRequest request, 
                           HttpServletResponse response, 
                           Object handler) {
        String correlationId = request.getHeader(CORRELATION_ID_HEADER);
        if (correlationId == null) {
            correlationId = UUID.randomUUID().toString();
        }
        
        MDC.put("correlationId", correlationId);
        response.setHeader(CORRELATION_ID_HEADER, correlationId);
        return true;
    }
    
    @Override
    public void afterCompletion(HttpServletRequest request, 
                               HttpServletResponse response, 
                               Object handler, Exception ex) {
        MDC.clear();
    }
}

Explanation: MDC (Mapped Diagnostic Context) lets you attach contextual information to every log statement in the current thread.
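
The interceptor above covers the inbound side; for the ID to travel down the chain, every outgoing HTTP call must forward the same header. A minimal sketch for RestTemplate (the configuration class and bean names are illustrative):

import org.slf4j.MDC;
import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;
import org.springframework.web.client.RestTemplate;
import org.springframework.web.servlet.config.annotation.InterceptorRegistry;
import org.springframework.web.servlet.config.annotation.WebMvcConfigurer;

@Configuration
public class CorrelationConfig implements WebMvcConfigurer {

    private final CorrelationInterceptor correlationInterceptor;

    public CorrelationConfig(CorrelationInterceptor correlationInterceptor) {
        this.correlationInterceptor = correlationInterceptor;
    }

    @Override
    public void addInterceptors(InterceptorRegistry registry) {
        // Register the inbound interceptor shown above
        registry.addInterceptor(correlationInterceptor);
    }

    @Bean
    public RestTemplate restTemplate() {
        RestTemplate restTemplate = new RestTemplate();
        // Forward the correlation ID from MDC on every outgoing call
        restTemplate.getInterceptors().add((request, body, execution) -> {
            String correlationId = MDC.get("correlationId");
            if (correlationId != null) {
                request.getHeaders().add("X-Correlation-ID", correlationId);
            }
            return execution.execute(request, body);
        });
        return restTemplate;
    }
}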

2. Structured Logging

// Bad: unstructured logs
log.info("User John ordered 3 items for $25.50");

// Good: structured logs
log.info("Order created", 
    kv("userId", "john123"),
    kv("itemCount", 3),
    kv("totalAmount", 25.50),
    kv("action", "order_created"));

3. Logging at Different Levels

  • Application logs: business events and errors
  • Infrastructure logs: system events, performance metrics
  • Audit logs: compliance and security events
  • Access logs: HTTP requests and responses


Setting Up Basic Logging

Logback Configuration

<!-- logback-spring.xml -->
<configuration>
    <springProfile name="!prod">
        <appender name="CONSOLE" class="ch.qos.logback.core.ConsoleAppender">
            <encoder class="net.logstash.logback.encoder.LoggingEventCompositeJsonEncoder">
                <providers>
                    <timestamp/>
                    <logLevel/>
                    <loggerName/>
                    <mdc/>
                    <message/>
                    <stackTrace/>
                </providers>
            </encoder>
        </appender>
    </springProfile>
    
    <springProfile name="prod">
        <appender name="FILE" class="ch.qos.logback.core.rolling.RollingFileAppender">
            <file>/var/log/app/application.log</file>
            <rollingPolicy class="ch.qos.logback.core.rolling.SizeAndTimeBasedRollingPolicy">
                <fileNamePattern>/var/log/app/application.%d{yyyy-MM-dd}.%i.log</fileNamePattern>
                <maxFileSize>100MB</maxFileSize>
                <maxHistory>30</maxHistory>
                <totalSizeCap>3GB</totalSizeCap>
            </rollingPolicy>
            <encoder class="net.logstash.logback.encoder.LogstashEncoder">
                <includeContext>true</includeContext>
                <includeMdc>true</includeMdc>
                <customFields>{"service":"user-service"}</customFields>
            </encoder>
        </appender>
    </springProfile>
    
    <root level="INFO">
        <appender-ref ref="CONSOLE"/>
        <appender-ref ref="FILE"/>
    </root>
</configuration>

Explanation:

  • The JSON format makes parsing in the ELK stack easier
  • The rolling policy prevents the disk from filling up
  • Separate settings for dev and prod environments

Structured Logging with SLF4J

@Service
public class OrderService {
    
    private static final Logger log = LoggerFactory.getLogger(OrderService.class);
    
    public Order createOrder(CreateOrderRequest request) {
        MDC.put("userId", request.getUserId());
        MDC.put("operation", "create_order");
        
        try {
            log.info("Creating order", 
                kv("productId", request.getProductId()),
                kv("quantity", request.getQuantity()));
            
            Order order = processOrder(request);
            
            log.info("Order created successfully", 
                kv("orderId", order.getId()),
                kv("status", order.getStatus()),
                kv("amount", order.getTotalAmount()));
            
            return order;
            
        } catch (Exception e) {
            log.error("Order creation failed", 
                kv("productId", request.getProductId()),
                kv("error", e.getMessage()), e);
            throw e;
        } finally {
            MDC.clear();
        }
    }
}

ELK Stack (Elasticsearch, Logstash, Kibana)

Архитектура ELK

Microservices → Filebeat → Logstash → Elasticsearch → Kibana
     ↓              ↓           ↓            ↓          ↓
   [Logs]      [Shipping]  [Processing]  [Storage]  [Visualization]

Explanation:

  • Filebeat — lightweight log shipper
  • Logstash — log processing pipeline
  • Elasticsearch — search and analytics engine
  • Kibana — visualization and dashboard platform

Filebeat Configuration

# filebeat.yml
filebeat.inputs:
- type: log
  enabled: true
  paths:
    - /var/log/app/*.log
  fields:
    service: user-service
    environment: production
  fields_under_root: true
  multiline.pattern: '^\d{4}-\d{2}-\d{2}'
  multiline.negate: true
  multiline.match: after

processors:
- add_host_metadata:
    when.not.contains.tags: forwarded

output.logstash:
  hosts: ["logstash:5044"]

logging.level: info
logging.to_files: true
logging.files:
  path: /var/log/filebeat
  name: filebeat
  keepfiles: 7
  permissions: 0644

Logstash Configuration

# logstash.conf
input {
  beats {
    port => 5044
  }
}

filter {
  if [service] == "user-service" {
    json {
      source => "message"
    }
    
    date {
      match => [ "timestamp", "ISO8601" ]
    }
    
    if [level] == "ERROR" {
      mutate {
        add_tag => ["error"]
      }
    }
    
    # Extract the correlation ID
    if [mdc][correlationId] {
      mutate {
        add_field => { "correlation_id" => "%{[mdc][correlationId]}" }
      }
    }
  }
}

output {
  elasticsearch {
    hosts => ["elasticsearch:9200"]
    index => "microservices-logs-%{+YYYY.MM.dd}"
  }
}

Kibana Dashboard

Index Pattern: microservices-logs-*

Useful fields for visualization:

  • @timestamp — event time
  • service — microservice name
  • level — log level
  • correlation_id — request identifier
  • message — message text
  • mdc.userId — user

Explanation: ELK is a great fit for centralized logging and ad-hoc log search, but it requires significant resources at large volumes.


Fluentd for Log Aggregation

Architecture with Fluentd

Microservices → Fluentd Agent → Fluentd Aggregator → Storage
     ↓              ↓               ↓                  ↓
   [Logs]      [Collection]    [Processing]      [ES/S3/etc]

Fluentd Configuration

# fluent.conf
<source>
  @type tail
  path /var/log/app/*.log
  pos_file /var/log/fluentd/app.log.pos
  tag microservices.app
  format json
  time_key timestamp
  time_format %Y-%m-%dT%H:%M:%S.%L%z
</source>

<filter microservices.**>
  @type record_transformer
  <record>
    hostname ${hostname}
    environment "#{ENV['ENVIRONMENT']}"
    cluster "#{ENV['CLUSTER_NAME']}"
  </record>
</filter>

<match microservices.**>
  @type elasticsearch
  host elasticsearch
  port 9200
  index_name microservices-logs
  type_name _doc
  logstash_format true
  logstash_prefix microservices-logs
  logstash_dateformat %Y.%m.%d
  
  <buffer>
    @type file
    path /var/log/fluentd/buffer/elasticsearch
    flush_mode interval
    flush_interval 10s
    chunk_limit_size 64m
    queue_limit_length 128
  </buffer>
</match>

Explanation: Fluentd is more flexible than Logstash for routing and transformation and is better suited to complex log processing scenarios.


Distributed Tracing

OpenTelemetry

OpenTelemetry is a unified observability framework for traces, metrics, and logs.

<dependency>
    <groupId>io.opentelemetry</groupId>
    <artifactId>opentelemetry-api</artifactId>
</dependency>
<dependency>
    <groupId>io.opentelemetry.instrumentation</groupId>
    <artifactId>opentelemetry-spring-boot-starter</artifactId>
</dependency>

@RestController
public class UserController {
    
    private final Tracer tracer = GlobalOpenTelemetry.getTracer("user-service");
    
    @GetMapping("/users/{id}")
    public User getUser(@PathVariable String id) {
        Span span = tracer.spanBuilder("get-user")
            .setAttribute("user.id", id)
            .setAttribute("service.name", "user-service")
            .startSpan();
            
        try (Scope scope = span.makeCurrent()) {
            log.info("Fetching user", kv("userId", id));
            return userService.findUser(id);
        } catch (Exception e) {
            span.setStatus(StatusCode.ERROR, e.getMessage());
            throw e;
        } finally {
            span.end();
        }
    }
}

Zipkin Integration

# application.yml
management:
  tracing:
    sampling:
      probability: 1.0  # 100% for dev, 0.1 for prod
  zipkin:
    tracing:
      endpoint: http://zipkin:9411/api/v2/spans

logging:
  pattern:
    level: "%5p [%X{traceId:-},%X{spanId:-}]"

@Component
public class TracingFilter implements Filter {
    
    @Override
    public void doFilter(ServletRequest request, ServletResponse response, 
                        FilterChain chain) throws IOException, ServletException {
        
        HttpServletRequest httpRequest = (HttpServletRequest) request;
        Span span = Tracing.current().tracer().nextSpan()
            .name("http-request")
            .tag("http.method", httpRequest.getMethod())
            .tag("http.url", httpRequest.getRequestURL().toString())
            .start();
            
        try (CurrentTraceContext.Scope scope = 
             Tracing.current().currentTraceContext().newScope(span.context())) {
            chain.doFilter(request, response);
        } finally {
            // without finish() the span is never reported
            span.finish();
        }
    }
}

Explanation: Zipkin visualizes the request flow across microservices, showing timing and dependencies between calls.

Jaeger vs Zipkin

Zipkin:

  • Simpler to set up and use
  • Originated at Twitter, battle-tested
  • Good performance at moderate volumes
  • JSON/Thrift protocols

Jaeger:

  • Originated at Uber, designed for scale
  • Better performance at large volumes
  • Adaptive sampling strategies
  • More efficient gRPC protocol
  • Better storage options (Cassandra, Elasticsearch)

# Jaeger configuration
opentracing:
  jaeger:
    service-name: user-service
    sampler:
      type: probabilistic
      param: 0.1  # 10% sampling rate
    sender:
      agent-host: jaeger-agent
      agent-port: 6831

Explanation: The choice between Zipkin and Jaeger depends on scale and requirements. Zipkin is easier to start with; Jaeger is better at enterprise scale.


Observability Patterns

Three Pillars of Observability

  1. Logs — what happened
  2. Metrics — numerical measurements
  3. Traces — how requests flow through the system

Correlation Between Logs and Traces

@Service
public class PaymentService {
    
    public PaymentResult processPayment(PaymentRequest request) {
        Span currentSpan = Span.current();
        String traceId = currentSpan.getSpanContext().getTraceId();
        String spanId = currentSpan.getSpanContext().getSpanId();
        
        MDC.put("traceId", traceId);
        MDC.put("spanId", spanId);
        MDC.put("operation", "process_payment");
        
        try {
            log.info("Processing payment", 
                kv("amount", request.getAmount()),
                kv("currency", request.getCurrency()));
                
            // Create a child span for the external call
            Span childSpan = GlobalOpenTelemetry.getTracer("payment-service")
                .spanBuilder("external-payment-api")
                .setAttribute("payment.provider", "stripe")
                .startSpan();
                
            try (Scope scope = childSpan.makeCurrent()) {
                return externalPaymentService.charge(request);
            } finally {
                childSpan.end();
            }
            
        } finally {
            MDC.clear();
        }
    }
}

Custom Metrics with Micrometer

@Component
public class BusinessMetrics {
    
    private final MeterRegistry meterRegistry;
    private final Timer orderProcessingTime;
    
    public BusinessMetrics(MeterRegistry meterRegistry) {
        this.meterRegistry = meterRegistry;
        
        this.orderProcessingTime = Timer.builder("order.processing.time")
            .description("Order processing time")
            .register(meterRegistry);
            
        // A gauge is bound to a state object and a function at registration time
        Gauge.builder("users.active", this, BusinessMetrics::getActiveUserCount)
            .description("Number of active users")
            .register(meterRegistry);
    }
    
    public void recordOrder(String userId, double amount) {
        // Tags are part of a counter's identity, so the tagged counter is
        // resolved through the registry instead of being passed to increment()
        Counter.builder("orders.created")
            .description("Number of orders created")
            .tag("user.type", getUserType(userId))
            .register(meterRegistry)
            .increment();
        
        log.info("Order recorded in metrics", 
            kv("userId", userId),
            kv("amount", amount));
    }
}

Security and Compliance

Sensitive Data Handling

@Component
public class SecureLogging {
    
    // Masking sensitive data
    public void logUserAction(String userId, String creditCardNumber, String action) {
        log.info("User action performed", 
            kv("userId", userId),
            kv("creditCard", maskCreditCard(creditCardNumber)),
            kv("action", action));
    }
    
    private String maskCreditCard(String ccNumber) {
        if (ccNumber == null || ccNumber.length() < 4) {
            return "****";
        }
        return "**** **** **** " + ccNumber.substring(ccNumber.length() - 4);
    }
    
    // Audit trail
    public void auditSecurityEvent(String userId, String event, String details) {
        MDC.put("audit", "true");
        MDC.put("security", "true");
        
        log.info("Security event", 
            kv("userId", userId),
            kv("event", event),
            kv("details", details),
            kv("timestamp", Instant.now()),
            kv("source", "security-audit"));
            
        MDC.remove("audit");
        MDC.remove("security");
    }
}

GDPR Compliance

@Service
public class GdprCompliantLogging {
    
    // The salt should be externalized (configuration or a secret store)
    private static final String SECRET_SALT = "...";
    
    public void logWithDataRetention(String userId, String action) {
        // Attach retention metadata
        MDC.put("retention.policy", "90days");
        MDC.put("data.classification", "personal");
        
        log.info("User action", 
            kv("userId", hashUserId(userId)), // hash the PII
            kv("action", action));
    }
    
    private String hashUserId(String userId) {
        return DigestUtils.sha256Hex(userId + SECRET_SALT);
    }
}

Explanation: Never log PII (personally identifiable information) in plain text in production. Use hashing or masking.
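
Masking can also be centralized in the logging pipeline rather than at every call site. A sketch of a custom Logback converter (package, conversion word, and regex are illustrative); it would be registered in logback-spring.xml with <conversionRule conversionWord="maskedMsg" converterClass="com.example.logging.MaskingMessageConverter"/> and used in the pattern as %maskedMsg instead of %msg:

package com.example.logging;

import ch.qos.logback.classic.pattern.MessageConverter;
import ch.qos.logback.classic.spi.ILoggingEvent;

// Masks card-number-like digit sequences before the message is written out
public class MaskingMessageConverter extends MessageConverter {

    // Illustrative pattern: 13-19 consecutive digits
    private static final String CARD_NUMBER = "\\b\\d{13,19}\\b";

    @Override
    public String convert(ILoggingEvent event) {
        String message = super.convert(event);
        return message.replaceAll(CARD_NUMBER, "****");
    }
}

Note that this only covers the formatted message; structured kv() fields still need masking or hashing at the call site.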


Performance Optimization

Asynchronous Logging

<!-- Async appender for performance -->
<appender name="ASYNC_FILE" class="ch.qos.logback.classic.AsyncAppender">
    <appender-ref ref="FILE"/>
    <queueSize>1024</queueSize>
    <discardingThreshold>0</discardingThreshold>
    <includeCallerData>false</includeCallerData>
    <neverBlock>true</neverBlock>
</appender>

// Conditional logging for expensive operations
if (log.isDebugEnabled()) {
    log.debug("Expensive debug info: {}", expensiveDebugInfoGeneration());
}

// Lazy evaluation with the SLF4J 2.x fluent API
log.atDebug()
   .addArgument(() -> buildExpensiveUserReport(user))
   .log("User details: {}");

Log Sampling

@Component
public class SampledLogger {
    
    private final AtomicLong counter = new AtomicLong(0);
    private static final int SAMPLE_RATE = 100; // log every 100th event
    
    public void logSampled(String message, Object... args) {
        if (counter.incrementAndGet() % SAMPLE_RATE == 0) {
            log.info(message, args);
        }
    }
}

Resource Usage Monitoring

@Component
public class LoggingResourceMonitor {
    
    @Scheduled(fixedRate = 60000) // every minute
    public void monitorLogDiskUsage() {
        File logDir = new File("/var/log/app");
        long totalSpace = logDir.getTotalSpace();
        long freeSpace = logDir.getFreeSpace();
        double usagePercent = ((double)(totalSpace - freeSpace) / totalSpace) * 100;
        
        if (usagePercent > 85) {
            log.warn("High disk usage for logs", 
                kv("usage", usagePercent),
                kv("freeSpace", freeSpace),
                kv("totalSpace", totalSpace));
        }
    }
}

Monitoring and Alerting

Log-based Alerts

# Example alert rules (Prometheus-style, over log-derived metrics)
- alert: HighErrorRate
  expr: rate(log_entries{level="ERROR"}[5m]) > 10
  for: 2m
  annotations:
    summary: "High error rate detected"
    description: "Error rate is {{ $value }} errors per second"

- alert: ServiceDown
  expr: absent(log_entries{service="user-service"}[5m])
  for: 1m
  annotations:
    summary: "Service appears to be down"
    description: "No logs from user-service in last 5 minutes"

Health Check Logging

@Component
public class HealthCheckLogger {
    
    @EventListener
    public void onHealthChange(HealthChangedEvent event) {
        log.info("Health status changed", 
            kv("component", event.getComponentName()),
            kv("status", event.getStatus()),
            kv("details", event.getDetails()));
    }
    
    @Scheduled(fixedRate = 30000)
    public void logSystemHealth() {
        log.info("System health check", 
            kv("memoryUsage", getMemoryUsage()),
            kv("cpuUsage", getCpuUsage()),
            kv("activeConnections", getActiveConnections()));
    }
}

Best Practices

1. Logging Levels Strategy

// ERROR: errors that require immediate attention
log.error("Payment processing failed", kv("orderId", orderId), exception);

// WARN: potential problems
log.warn("High response time detected", kv("responseTime", responseTime));

// INFO: important business events
log.info("Order created", kv("orderId", orderId), kv("userId", userId));

// DEBUG: detailed troubleshooting (dev only)
log.debug("Database query executed", kv("sql", sql), kv("params", params));

// TRACE: very fine-grained debugging (usually disabled)
log.trace("Method entry", kv("method", "calculateTotal"), kv("args", args));

2. Standardized Log Format

public class LoggingUtils {
    
    private static final Logger log = LoggerFactory.getLogger(LoggingUtils.class);
    
    public static void logBusinessEvent(String event, String entityType, 
                                      String entityId, Map<String, Object> details) {
        MDC.put("event.type", "business");
        MDC.put("entity.type", entityType);
        MDC.put("entity.id", entityId);
        
        log.info("Business event: {}", event, 
            kv("details", details),
            kv("timestamp", Instant.now()));
    }
    
    public static void logTechnicalEvent(String component, String operation, 
                                       String status, long duration) {
        MDC.put("event.type", "technical");
        MDC.put("component", component);
        
        log.info("Technical event", 
            kv("operation", operation),
            kv("status", status),
            kv("duration", duration));
    }
}

3. Error Context Preservation

@Service
public class ErrorHandlingService {
    
    public void handleBusinessError(BusinessException e, String operation) {
        // Preserve the full error context
        log.error("Business operation failed", 
            kv("operation", operation),
            kv("errorCode", e.getErrorCode()),
            kv("errorMessage", e.getMessage()),
            kv("userId", getCurrentUserId()),
            kv("correlationId", MDC.get("correlationId")),
            kv("stackTrace", getStackTraceAsString(e)));
    }
    
    private String getStackTraceAsString(Exception e) {
        StringWriter sw = new StringWriter();
        e.printStackTrace(new PrintWriter(sw));
        return sw.toString();
    }
}

4. Production Deployment

# production-logging.yml
logging:
  level:
    root: INFO
    com.company: INFO
    org.springframework: WARN
    org.hibernate: WARN
  pattern:
    file: "%d{yyyy-MM-dd HH:mm:ss} [%thread] %-5level [%X{traceId:-},%X{spanId:-}] %logger{36} - %msg%n"
  file:
    name: /var/log/app/application.log
    max-size: 100MB
    max-history: 30

5. Log Rotation and Cleanup

#!/bin/bash
# log-cleanup.sh - automatic cleanup of old logs

LOG_DIR="/var/log/app"
RETENTION_DAYS=30

# Delete logs older than 30 days
find $LOG_DIR -name "*.log*" -mtime +$RETENTION_DAYS -delete

# Compress logs older than 7 days
find $LOG_DIR -name "*.log*" -mtime +7 ! -name "*.gz" -exec gzip {} \;

# Send a notification when disk usage is too high
USAGE=$(df $LOG_DIR | tail -1 | awk '{print $5}' | sed 's/%//')
if [ $USAGE -gt 85 ]; then
    echo "Warning: Log directory usage is ${USAGE}%" | mail -s "High disk usage" admin@company.com
fi

Troubleshooting Scenarios

1. Performance Issue Investigation

// Logging for performance analysis
@Around("@annotation(Monitored)")
public Object monitorPerformance(ProceedingJoinPoint joinPoint) throws Throwable {
    String methodName = joinPoint.getSignature().getName();
    long startTime = System.currentTimeMillis();
    
    log.info("Method execution started", 
        kv("method", methodName),
        kv("args", Arrays.toString(joinPoint.getArgs())));
    
    try {
        Object result = joinPoint.proceed();
        long duration = System.currentTimeMillis() - startTime;
        
        log.info("Method execution completed", 
            kv("method", methodName),
            kv("duration", duration),
            kv("status", "success"));
            
        return result;
    } catch (Exception e) {
        long duration = System.currentTimeMillis() - startTime;
        
        log.error("Method execution failed", 
            kv("method", methodName),
            kv("duration", duration),
            kv("error", e.getMessage()), e);
        throw e;
    }
}

2. Distributed Transaction Tracking

// Tracking distributed transactions
@Service
public class OrderOrchestrator {
    
    public void processOrder(OrderRequest request) {
        String correlationId = MDC.get("correlationId");
        String sagaId = UUID.randomUUID().toString();
        
        MDC.put("sagaId", sagaId);
        
        log.info("Saga started", 
            kv("sagaId", sagaId),
            kv("orderId", request.getOrderId()),
            kv("steps", List.of("inventory", "payment", "shipping")));
        
        try {
            // Step 1: Reserve inventory
            logSagaStep("inventory", "started");
            inventoryService.reserve(request.getProductId(), request.getQuantity());
            logSagaStep("inventory", "completed");
            
            // Step 2: Process payment
            logSagaStep("payment", "started");
            paymentService.charge(request.getPaymentDetails());
            logSagaStep("payment", "completed");
            
            // Step 3: Arrange shipping
            logSagaStep("shipping", "started");
            shippingService.schedule(request.getShippingAddress());
            logSagaStep("shipping", "completed");
            
            log.info("Saga completed successfully", kv("sagaId", sagaId));
            
        } catch (Exception e) {
            log.error("Saga failed, starting compensation", 
                kv("sagaId", sagaId),
                kv("error", e.getMessage()));
            // Compensation logic...
        }
    }
    
    private void logSagaStep(String step, String status) {
        log.info("Saga step " + status, 
            kv("step", step),
            kv("status", status),
            kv("sagaId", MDC.get("sagaId")));
    }
}

Conclusion

Key Principles

  1. Correlation IDs everywhere — the foundation of debugging distributed systems
  2. Structured logs — JSON format for easy parsing
  3. Appropriate log levels — don't flood production with DEBUG
  4. Centralized aggregation — a single source of truth for all logs
  5. Security first — never log sensitive data
  6. Performance awareness — async logging and sampling under high load
  7. Monitoring and alerting — proactive problem detection

Choosing Tools

  • ELK Stack: full-featured log management and analytics
  • Fluentd: when you need flexibility in log processing and routing
  • OpenTelemetry: the modern standard for observability
  • Zipkin: simple distributed tracing to get started
  • Jaeger: enterprise-scale distributed tracing

Evolution of the Approach

  1. Start simple: basic file logging + a correlation ID
  2. Add structure: JSON format and MDC
  3. Centralize: ELK or a similar stack
  4. Add tracing: Zipkin or Jaeger for the request flow
  5. Optimize: performance tuning and cost

Monitoring with Prometheus and Grafana in Java

What Is Application Monitoring

Application Performance Monitoring (APM) is the practice of tracking and analyzing application performance, availability, and behavior in real time. In a microservice architecture, monitoring is critical for:

  • Detecting problems early, before they affect users
  • Capacity planning and resource optimization
  • Troubleshooting and root cause analysis
  • SLA compliance and performance optimization

Types of Metrics

  • Business metrics: orders per second, conversion, revenue
  • Application metrics: response time, error rate, throughput
  • Infrastructure metrics: CPU, memory, disk I/O, network
  • Custom metrics: domain-specific indicators

The Four Golden Signals (Google SRE)

  1. Latency — how long requests take to complete
  2. Traffic — how many requests the system receives
  3. Errors — the percentage of failed requests
  4. Saturation — how heavily the system's resources are used

Explanation: These four metrics give a complete picture of system health. If they are within normal bounds, the system is doing well.


Prometheus: Time Series Database

What Is Prometheus

Prometheus is an open-source monitoring and alerting system built specifically for cloud-native applications. Key features:

  • Pull-based model — Prometheus itself scrapes metrics from applications
  • Time series database — efficient storage of time series data
  • Multi-dimensional data model — metrics with labels for drill-down
  • PromQL — a powerful query language for analyzing the data
  • Service discovery — automatic discovery of scrape targets

Prometheus Architecture

Applications → Metrics Endpoint → Prometheus Server → AlertManager
     ↓              ↓                    ↓              ↓
[/actuator/   [HTTP scraping]      [Storage +      [Notifications]
 prometheus]                        PromQL]

Explanation: Applications expose metrics via an HTTP endpoint; Prometheus periodically scrapes them and stores the samples in its time series database.

Data Model

A metric sample consists of:

  • Metric name — the name of the metric (for example, http_requests_total)
  • Labels — key-value pairs for drill-down ({method="GET", status="200"})
  • Timestamp — the time of the measurement
  • Value — the numeric value

Example:

http_requests_total{method="GET", endpoint="/api/users", status="200"} 1027 @1609459200

Metric Types in Prometheus

  • Counter — a monotonically increasing counter (e.g. number of requests)
  • Gauge — a value that can go up and down (e.g. memory usage)
  • Histogram — a distribution of values across buckets (e.g. latency distribution)
  • Summary — similar to a histogram, but quantiles are computed on the client side
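
In Micrometer (the metrics facade used by Spring Boot), these Prometheus types map onto meter types roughly as in the sketch below; the MeterRegistry is assumed to be injected, and all metric names are illustrative:

import io.micrometer.core.instrument.Counter;
import io.micrometer.core.instrument.DistributionSummary;
import io.micrometer.core.instrument.MeterRegistry;
import io.micrometer.core.instrument.Timer;

import java.time.Duration;
import java.util.concurrent.atomic.AtomicInteger;

public class MeterTypesSketch {

    void registerExamples(MeterRegistry registry) {
        // Counter -> Prometheus counter
        Counter requests = registry.counter("demo_http_requests_total", "method", "GET");
        requests.increment();

        // Gauge -> Prometheus gauge, bound to a state object
        AtomicInteger queueSize = registry.gauge("demo_jobs_queue_size", new AtomicInteger(0));
        queueSize.set(42);

        // Timer with histogram buckets -> Prometheus histogram
        Timer timer = Timer.builder("demo_request_duration_seconds")
            .publishPercentileHistogram()
            .register(registry);
        timer.record(Duration.ofMillis(120));

        // DistributionSummary with client-side percentiles -> Prometheus summary
        DistributionSummary payload = DistributionSummary.builder("demo_payload_size_bytes")
            .publishPercentiles(0.5, 0.95, 0.99)
            .register(registry);
        payload.record(1024);
    }
}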


Setting Up Monitoring in Spring Boot

Maven Dependencies

<dependency>
    <groupId>org.springframework.boot</groupId>
    <artifactId>spring-boot-starter-actuator</artifactId>
</dependency>
<dependency>
    <groupId>io.micrometer</groupId>
    <artifactId>micrometer-registry-prometheus</artifactId>
</dependency>

Actuator Configuration

# application.yml
management:
  endpoints:
    web:
      exposure:
        include: health, info, metrics, prometheus
      base-path: /actuator
  endpoint:
    health:
      show-details: always
      show-components: always
    metrics:
      enabled: true
    prometheus:
      enabled: true
  metrics:
    distribution:
      percentiles-histogram:
        http.server.requests: true
      percentiles:
        http.server.requests: 0.5, 0.95, 0.99
    tags:
      application: ${spring.application.name}
      environment: ${spring.profiles.active}

Explanation:

  • The /actuator/prometheus endpoint exposes metrics in Prometheus format
  • percentiles-histogram enables histogram buckets for latency analysis
  • Tags are added to every metric to identify the application (a programmatic equivalent is sketched below)
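
The same application-wide tags can also be added programmatically. A minimal sketch using a MeterRegistryCustomizer bean (class and tag values here are illustrative):

import io.micrometer.core.instrument.MeterRegistry;
import org.springframework.boot.actuate.autoconfigure.metrics.MeterRegistryCustomizer;
import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;

@Configuration
public class MetricsConfig {

    // Adds the same tags to every meter registered through this registry
    @Bean
    public MeterRegistryCustomizer<MeterRegistry> commonTags() {
        return registry -> registry.config()
            .commonTags("application", "user-service", "environment", "production");
    }
}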

Basic Metrics Out of the Box

Spring Boot Actuator automatically provides:

  • HTTP metrics: http_server_requests_* — latency, throughput, errors
  • JVM metrics: jvm_memory_*, jvm_gc_*, jvm_threads_*
  • System metrics: system_cpu_*, process_*
  • Database metrics: hikaricp_* for the connection pool
  • Custom application metrics: via the Micrometer API

Custom Metrics with Micrometer

Counter — Event Counters

@RestController
public class OrderController {
    
    private final MeterRegistry meterRegistry;
    
    public OrderController(MeterRegistry meterRegistry) {
        this.meterRegistry = meterRegistry;
    }
    
    @PostMapping("/orders")
    public Order createOrder(@RequestBody CreateOrderRequest request) {
        try {
            Order order = orderService.createOrder(request);
            
            // Increment the counter with tags; tags are part of a counter's
            // identity, so the tagged instance is resolved via the registry
            Counter.builder("orders.created")
                .description("Number of orders created")
                .tag("type", "business_metric")
                .tag("status", "success")
                .tag("user_type", getUserType(request.getUserId()))
                .register(meterRegistry)
                .increment();
            
            return order;
        } catch (Exception e) {
            Counter.builder("orders.errors")
                .description("Number of order creation errors")
                .tag("error_type", e.getClass().getSimpleName())
                .register(meterRegistry)
                .increment();
            throw e;
        }
    }
}

Gauge — Current Values

@Component
public class SystemMetrics {
    
    private final AtomicInteger activeUsers = new AtomicInteger(0);
    private final Queue<String> pendingJobs = new ConcurrentLinkedQueue<>();
    
    public SystemMetrics(MeterRegistry meterRegistry) {
        // Gauge for active users
        Gauge.builder("users.active", activeUsers, AtomicInteger::get)
            .description("Number of currently active users")
            .register(meterRegistry);
            
        // Gauge for the queue size
        Gauge.builder("jobs.pending", pendingJobs, Queue::size)
            .description("Number of pending background jobs")
            .register(meterRegistry);
            
        // Gauge backed by a custom computation
        Gauge.builder("memory.usage.percentage", this, SystemMetrics::getMemoryUsagePercentage)
            .description("Memory usage percentage")
            .register(meterRegistry);
    }
    
    private double getMemoryUsagePercentage() {
        Runtime runtime = Runtime.getRuntime();
        long totalMemory = runtime.totalMemory();
        long freeMemory = runtime.freeMemory();
        return ((double) (totalMemory - freeMemory) / totalMemory) * 100;
    }
}

Timer — Measuring Execution Time

@Service
public class PaymentService {
    
    private final MeterRegistry meterRegistry;
    private final Timer paymentProcessingTimer;
    
    public PaymentService(MeterRegistry meterRegistry) {
        this.meterRegistry = meterRegistry;
        this.paymentProcessingTimer = Timer.builder("payment.processing.duration")
            .description("Payment processing time")
            .publishPercentiles(0.5, 0.95, 0.99) // median, 95th and 99th percentiles
            .register(meterRegistry);
    }
    
    public PaymentResult processPayment(PaymentRequest request) {
        // record() times the supplied lambda and returns its result
        return paymentProcessingTimer.record(() -> externalPaymentService.process(request));
    }
    
    // Alternative: an explicit sample, useful when tags depend on the outcome
    public PaymentResult processPaymentAlternative(PaymentRequest request) {
        Timer.Sample sample = Timer.start(meterRegistry);
        try {
            PaymentResult result = externalPaymentService.process(request);
            sample.stop(Timer.builder("payment.external.duration")
                .tag("provider", request.getProvider())
                .tag("status", result.getStatus())
                .register(meterRegistry));
            return result;
        } catch (Exception e) {
            sample.stop(Timer.builder("payment.external.duration")
                .tag("provider", request.getProvider())
                .tag("status", "error")
                .register(meterRegistry));
            throw e;
        }
    }
}
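
Micrometer also offers the @Timed annotation as a less intrusive alternative to hand-written timers; it needs AOP support (spring-boot-starter-aop) and a TimedAspect bean. A minimal sketch with illustrative names:

import io.micrometer.core.annotation.Timed;
import io.micrometer.core.aop.TimedAspect;
import io.micrometer.core.instrument.MeterRegistry;
import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;
import org.springframework.stereotype.Service;

@Configuration
class TimedConfig {

    // Enables processing of @Timed on Spring beans
    @Bean
    public TimedAspect timedAspect(MeterRegistry registry) {
        return new TimedAspect(registry);
    }
}

@Service
class RefundService {

    // Records a timer named "payment.refund.duration" around every call
    @Timed(value = "payment.refund.duration", percentiles = {0.5, 0.95, 0.99})
    public void refund(String paymentId) {
        // business logic...
    }
}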

DistributionSummary — Distribution of Values

@Component
public class BusinessMetrics {
    
    private final MeterRegistry meterRegistry;
    
    public BusinessMetrics(MeterRegistry meterRegistry) {
        this.meterRegistry = meterRegistry;
    }
    
    public void recordOrderAmount(double amount, String category) {
        // Tags identify a distinct summary, so resolve the tagged instance first
        DistributionSummary.builder("order.amount")
            .description("Distribution of order amounts")
            .baseUnit("USD")
            .publishPercentiles(0.5, 0.75, 0.95, 0.99)
            .tag("category", category)
            .register(meterRegistry)
            .record(amount);
    }
}

Explanation: DistributionSummary is suited to analyzing the distribution of business values — order amounts, item counts, etc.


Prometheus Configuration

prometheus.yml

global:
  scrape_interval: 15s
  evaluation_interval: 15s

rule_files:
  - "rules/*.yml"

alerting:
  alertmanagers:
    - static_configs:
        - targets:
            - alertmanager:9093

scrape_configs:
  - job_name: 'prometheus'
    static_configs:
      - targets: ['localhost:9090']

  - job_name: 'spring-boot-apps'
    metrics_path: '/actuator/prometheus'
    static_configs:
      - targets:
          - 'user-service:8080'
          - 'order-service:8081'
          - 'payment-service:8082'
    scrape_interval: 10s
    scrape_timeout: 5s

  - job_name: 'kubernetes-pods'
    kubernetes_sd_configs:
      - role: pod
    relabel_configs:
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
        action: keep
        regex: true
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_path]
        action: replace
        target_label: __metrics_path__
        regex: (.+)

Explanation:

  • scrape_interval — how often Prometheus collects metrics
  • kubernetes_sd_configs — automatic discovery of pods in Kubernetes
  • relabel_configs — rules for filtering and modifying targets

Service Discovery in Kubernetes

# kubernetes deployment with annotations
apiVersion: apps/v1
kind: Deployment
metadata:
  name: user-service
spec:
  template:
    metadata:
      annotations:
        prometheus.io/scrape: "true"
        prometheus.io/port: "8080"
        prometheus.io/path: "/actuator/prometheus"
    spec:
      containers:
      - name: user-service
        image: user-service:latest
        ports:
        - containerPort: 8080

PromQL (Prometheus Query Language)

Basic Operators

Metric selectors:

# All values of a metric
http_requests_total

# Filtering by labels
http_requests_total{method="GET"}
http_requests_total{method!="GET"}  # NOT GET
http_requests_total{status=~"2.."}  # regex: 2xx statuses

Range queries (time ranges):

# Values over the last 5 minutes
http_requests_total[5m]

# Rate of change over 5 minutes
rate(http_requests_total[5m])

# Increase over an hour
increase(http_requests_total[1h])

Useful Functions

rate() — per-second rate of change:

# Request rate per second
rate(http_requests_total[5m])

# Error rate
rate(http_requests_total{status=~"5.."}[5m])

histogram_quantile() — computing percentiles:

# 95th percentile latency
histogram_quantile(0.95, rate(http_request_duration_seconds_bucket[5m]))

# 99th percentile over different time windows
histogram_quantile(0.99, rate(http_request_duration_seconds_bucket[1m]))

Aggregation operations:

# Sum across all instances
sum(rate(http_requests_total[5m]))

# Sum by service
sum by (service) (rate(http_requests_total[5m]))

# Average response time
avg(http_request_duration_seconds)

# Maximum memory usage
max(jvm_memory_used_bytes) by (application)

Practical Query Examples

Error rate (percentage of errors):

sum(rate(http_requests_total{status=~"5.."}[5m])) / 
sum(rate(http_requests_total[5m])) * 100

Apdex Score (Application Performance Index):

(
  sum(rate(http_request_duration_seconds_bucket{le="0.1"}[5m])) +
  sum(rate(http_request_duration_seconds_bucket{le="0.4"}[5m]))
) / 2 / sum(rate(http_request_duration_seconds_count[5m]))

Memory Usage (heap utilization percentage):

jvm_memory_used_bytes{area="heap"} / 
jvm_memory_max_bytes{area="heap"} * 100

Grafana: Visualization and Dashboards

What Is Grafana

Grafana is an open-source platform for visualization and analytics. It lets you build interactive dashboards for monitoring metrics from a variety of data sources.

Key capabilities:

  • Multi-datasource support — Prometheus, InfluxDB, Elasticsearch, etc.
  • Rich visualization options — graphs, tables, heatmaps, alerts
  • Dashboard templating — variables for dynamic dashboards
  • Alerting — notifications when thresholds are exceeded
  • User management — roles and permissions

Adding Prometheus as a Data Source

{
  "name": "Prometheus",
  "type": "prometheus", 
  "url": "http://prometheus:9090",
  "access": "proxy",
  "isDefault": true,
  "jsonData": {
    "httpMethod": "POST",
    "timeInterval": "15s"
  }
}

Main Panel Types

Time Series panel — for displaying metrics over time:

  • CPU usage, memory consumption
  • Request rate, error rate
  • Response time trends

Stat panel — for single-value metrics:

  • Current active users
  • Total orders today
  • System uptime

Table panel — for tabular data:

  • Top error endpoints
  • Service health status
  • Resource usage by service

Heatmap panel — for distribution analysis:

  • Response time distribution
  • Load patterns over time

Building a Dashboard for a Java Application

Application Overview Dashboard

{
  "dashboard": {
    "title": "Java Application Overview",
    "panels": [
      {
        "title": "Request Rate",
        "type": "timeseries",
        "targets": [
          {
            "expr": "sum(rate(http_requests_total[5m])) by (application)",
            "legendFormat": "{{application}}"
          }
        ]
      },
      {
        "title": "Error Rate %",
        "type": "timeseries", 
        "targets": [
          {
            "expr": "sum(rate(http_requests_total{status=~\"5..\"}[5m])) by (application) / sum(rate(http_requests_total[5m])) by (application) * 100",
            "legendFormat": "{{application}}"
          }
        ],
        "fieldConfig": {
          "defaults": {
            "unit": "percent",
            "color": {"mode": "palette-classic"}
          }
        }
      },
      {
        "title": "Response Time 95th Percentile",
        "type": "timeseries",
        "targets": [
          {
            "expr": "histogram_quantile(0.95, sum(rate(http_request_duration_seconds_bucket[5m])) by (le, application))",
            "legendFormat": "{{application}}"
          }
        ]
      }
    ]
  }
}

JVM Metrics Dashboard

Memory Usage Panel:

# Heap memory usage
jvm_memory_used_bytes{area="heap"} / jvm_memory_max_bytes{area="heap"} * 100

# Non-heap memory usage  
jvm_memory_used_bytes{area="nonheap"} / jvm_memory_max_bytes{area="nonheap"} * 100

# Garbage collection rate
rate(jvm_gc_collection_seconds_count[5m])

Thread Metrics Panel:

# Active threads
jvm_threads_live_threads

# Daemon threads
jvm_threads_daemon_threads

# Peak threads
jvm_threads_peak_threads

Business Metrics Dashboard

{
  "panels": [
    {
      "title": "Orders per Minute",
      "targets": [
        {
          "expr": "sum(rate(orders_created[1m])) * 60",
          "legendFormat": "Orders/min"
        }
      ]
    },
    {
      "title": "Revenue per Hour", 
      "targets": [
        {
          "expr": "sum(rate(order_amount_sum[1h])) * 3600",
          "legendFormat": "Revenue/hour"
        }
      ]
    },
    {
      "title": "Active Users",
      "type": "stat",
      "targets": [
        {
          "expr": "users_active",
          "legendFormat": "Active Users"
        }
      ]
    }
  ]
}

Dashboard Variables (Templating)

{
  "templating": {
    "list": [
      {
        "name": "application",
        "type": "query",
        "query": "label_values(http_requests_total, application)",
        "multi": true,
        "includeAll": true
      },
      {
        "name": "environment", 
        "type": "query",
        "query": "label_values(http_requests_total, environment)",
        "multi": false
      },
      {
        "name": "time_range",
        "type": "interval",
        "options": ["1m", "5m", "15m", "30m", "1h"]
      }
    ]
  }
}

Using variables in queries:

sum(rate(http_requests_total{application=~"$application", environment="$environment"}[$time_range]))

Explanation: Variables make dashboards reusable across different applications and environments.


Alerting

Prometheus Alert Rules

# alerts.yml
groups:
  - name: application_alerts
    rules:
      - alert: HighErrorRate
        expr: |
          sum(rate(http_requests_total{status=~"5.."}[5m])) by (application) /
          sum(rate(http_requests_total[5m])) by (application) * 100 > 5
        for: 2m
        labels:
          severity: warning
          team: backend
        annotations:
          summary: "High error rate detected"
          description: "Application {{ $labels.application }} has error rate of {{ $value }}%"

      - alert: HighLatency
        expr: |
          histogram_quantile(0.95,
            sum(rate(http_request_duration_seconds_bucket[5m])) by (le, application)
          ) > 0.5
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High latency detected"
          description: "95th percentile latency is {{ $value }}s for {{ $labels.application }}"

      - alert: ServiceDown
        expr: up == 0
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "Service is down"
          description: "{{ $labels.instance }} has been down for more than 1 minute"

Alertmanager Configuration

# alertmanager.yml
global:
  smtp_smarthost: 'localhost:587'
  smtp_from: 'alerts@company.com'

route:
  group_by: ['alertname', 'cluster', 'service']
  group_wait: 10s
  group_interval: 10s
  repeat_interval: 1h
  receiver: 'default'
  routes:
    - match:
        severity: critical
      receiver: 'critical-alerts'
    - match:
        team: backend
      receiver: 'backend-team'

receivers:
  - name: 'default'
    email_configs:
      - to: 'team@company.com'
        subject: 'Alert: {{ .GroupLabels.alertname }}'
        body: |
          {{ range .Alerts }}
          Alert: {{ .Annotations.summary }}
          Description: {{ .Annotations.description }}
          {{ end }}

  - name: 'critical-alerts'
    slack_configs:
      - api_url: 'https://hooks.slack.com/services/...'
        channel: '#alerts-critical'
        title: 'Critical Alert'
        text: '{{ .CommonAnnotations.summary }}'

  - name: 'backend-team'
    email_configs:
      - to: 'backend-team@company.com'
    pagerduty_configs:
      - service_key: 'your-pagerduty-key'

Grafana Alerts

{
  "alert": {
    "name": "High Memory Usage",
    "conditions": [
      {
        "query": {
          "queryType": "",
          "refId": "A",
          "expr": "jvm_memory_used_bytes{area=\"heap\"} / jvm_memory_max_bytes{area=\"heap\"} * 100"
        },
        "reducer": {
          "type": "last",
          "params": []
        },
        "evaluator": {
          "type": "gt",
          "params": [85]
        }
      }
    ],
    "frequency": "10s",
    "handler": 1,
    "noDataState": "no_data",
    "executionErrorState": "alerting"
  }
}

Explanation: Prometheus alerts work best for infrastructure metrics; Grafana alerts suit complex business logic and visualization-based alerting.


Advanced Monitoring Patterns

SLI/SLO Monitoring

Service Level Indicators (SLIs) are metrics of service quality:

# Availability SLI
sum(rate(http_requests_total{status!~"5.."}[5m])) / 
sum(rate(http_requests_total[5m]))

# Latency SLI  
histogram_quantile(0.99, sum(rate(http_request_duration_seconds_bucket[5m])) by (le))

# Error budget burn (fraction of the error budget consumed)
(1 - sli_availability) / (1 - slo_target)

RED Method (Rate, Errors, Duration)

# Rate - requests per second
sum(rate(http_requests_total[5m]))

# Errors - error percentage  
sum(rate(http_requests_total{status=~"5.."}[5m])) / 
sum(rate(http_requests_total[5m])) * 100

# Duration - response time percentiles
histogram_quantile(0.50, sum(rate(http_request_duration_seconds_bucket[5m])) by (le))
histogram_quantile(0.95, sum(rate(http_request_duration_seconds_bucket[5m])) by (le))
histogram_quantile(0.99, sum(rate(http_request_duration_seconds_bucket[5m])) by (le))

USE Method (Utilization, Saturation, Errors)

# CPU Utilization
100 - (avg(irate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)

# Memory Utilization
(1 - (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes)) * 100

# Disk Utilization
100 - (node_filesystem_avail_bytes / node_filesystem_size_bytes * 100)

# Network Saturation
rate(node_network_transmit_drop_total[5m])

Performance Optimization

Efficient Metrics Collection

@Component
public class OptimizedMetrics {
    
    // Reuse meters instead of creating new ones
    private static final Counter REQUEST_COUNTER = 
        Metrics.counter("http.requests", "endpoint", "unknown");
    
    // Cap cardinality to avoid memory leaks
    private final Map<String, Counter> endpointCounters = new ConcurrentHashMap<>();
    
    public void recordRequest(String endpoint) {
        // Limit the number of unique endpoints
        if (endpointCounters.size() > 100) {
            endpoint = "other";
        }
        
        endpointCounters.computeIfAbsent(endpoint, 
            e -> Metrics.counter("http.requests", "endpoint", e))
            .increment();
    }
}
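
Micrometer's MeterFilter can enforce the same cap at the registry level instead of in application code; a sketch, assuming an upper bound of 100 distinct values for the "endpoint" tag:

import io.micrometer.core.instrument.MeterRegistry;
import io.micrometer.core.instrument.config.MeterFilter;
import org.springframework.boot.actuate.autoconfigure.metrics.MeterRegistryCustomizer;
import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;

@Configuration
public class CardinalityLimitConfig {

    @Bean
    public MeterRegistryCustomizer<MeterRegistry> cardinalityLimit() {
        // After 100 distinct values of the "endpoint" tag on "http.requests",
        // further time series are denied instead of being created
        return registry -> registry.config().meterFilter(
            MeterFilter.maximumAllowableTags("http.requests", "endpoint", 100, MeterFilter.deny()));
    }
}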

Prometheus Configuration Tuning

# prometheus.yml - optimization
global:
  scrape_interval: 15s
  scrape_timeout: 10s
  evaluation_interval: 15s

  external_labels:
    cluster: 'production'
    region: 'us-east-1'

# Retention, storage, and memory tuning are applied via command-line flags
# and environment variables rather than prometheus.yml:
#   --storage.tsdb.retention.time=30d
#   --storage.tsdb.retention.size=100GB
#   --storage.tsdb.min-block-duration=2h
#   --storage.tsdb.max-block-duration=25h
#   GOMAXPROCS=4

scrape_configs:
  - job_name: 'high-frequency'
    scrape_interval: 5s
    static_configs:
      - targets: ['critical-service:8080']

  - job_name: 'low-frequency'
    scrape_interval: 60s
    static_configs:
      - targets: ['batch-service:8080']

Grafana Performance

{
  "refresh": "30s",
  "time": {
    "from": "now-1h",
    "to": "now"
  },
  "panels": [
    {
      "targets": [
        {
          "expr": "avg_over_time(metric[5m])",
          "interval": "30s",
          "maxDataPoints": 100
        }
      ]
    }
  ]
}

Explanation:

  • Limit maxDataPoints for better performance
  • Use an appropriate interval for aggregation
  • Avoid overly large time ranges

Monitoring in Production

High Availability Setup

Prometheus HA:

# prometheus-1.yml
global:
  external_labels:
    replica: 'prometheus-1'

# prometheus-2.yml  
global:
  external_labels:
    replica: 'prometheus-2'

Grafana HA:

# grafana.ini
[database]
type = mysql
host = mysql-cluster:3306
name = grafana
user = grafana
password = ${GRAFANA_DB_PASSWORD}

[session]
provider = mysql
provider_config = grafana:${GRAFANA_DB_PASSWORD}@tcp(mysql-cluster:3306)/grafana

[server]
root_url = https://grafana.company.com

Security Best Practices

# prometheus.yml - security
global:
  external_labels:
    cluster: 'production'

scrape_configs:
  - job_name: 'secure-app'
    scheme: https
    tls_config:
      ca_file: /etc/prometheus/ca.crt
      cert_file: /etc/prometheus/client.crt
      key_file: /etc/prometheus/client.key
    basic_auth:
      username: prometheus
      password_file: /etc/prometheus/password

Resource Planning

Prometheus Storage Requirements:

Samples/sec = Number of series × Scrape frequency
Storage/day = Samples/sec × 86400 × 16 bytes (raw; Prometheus typically compresses samples to 1–2 bytes)

Example:
10,000 series × 1/15s × 86400 × 16 bytes = ~900MB/day before compression

Memory Requirements:

RAM = Number of series × 6KB (rule of thumb)
Example: 100,000 series = ~600MB RAM minimum

Troubleshooting

Common Issues

High Cardinality Problems:

// BAD: Unbounded cardinality
Metrics.counter("user.actions", "user_id", userId); // Millions of users!

// GOOD: Bounded cardinality  
Metrics.counter("user.actions", "user_type", getUserType(userId)); // Few types

Missing Metrics:

# Check if endpoint is accessible
curl http://app:8080/actuator/prometheus

# Verify Prometheus targets
curl http://prometheus:9090/api/v1/targets

# Check for scrape errors
curl http://prometheus:9090/api/v1/query?query=up

# Debug PromQL queries
curl -G http://prometheus:9090/api/v1/query \
  --data-urlencode 'query=rate(http_requests_total[5m])'

Performance Issues:

# Check Prometheus performance
prometheus_tsdb_compactions_total
prometheus_config_last_reload_success_timestamp_seconds
prometheus_rule_evaluation_duration_seconds

# Check scrape duration
scrape_duration_seconds > 0.1

# Identify slow queries
topk(10, increase(prometheus_engine_query_duration_seconds_count[1h]))

Debug Dashboard

{
  "dashboard": {
    "title": "Prometheus Debug",
    "panels": [
      {
        "title": "Scrape Targets Status",
        "type": "table",
        "targets": [
          {
            "expr": "up",
            "format": "table",
            "instant": true
          }
        ]
      },
      {
        "title": "Scrape Duration",
        "type": "timeseries",
        "targets": [
          {
            "expr": "scrape_duration_seconds",
            "legendFormat": "{{instance}}"
          }
        ]
      },
      {
        "title": "Series Count",
        "type": "timeseries",
        "targets": [
          {
            "expr": "prometheus_tsdb_symbol_table_size_bytes",
            "legendFormat": "Series count"
          }
        ]
      }
    ]
  }
}

Best Practices

1. Metric Naming Conventions

// Follow the Prometheus naming conventions
public class MetricNamingBestPractices {
    
    // ✅ Good: Clear, descriptive names
    private final Counter HTTP_REQUESTS_TOTAL = 
        Metrics.counter("http_requests_total");
    
    private final AtomicLong MEMORY_USAGE_BYTES = 
        Metrics.gauge("memory_usage_bytes", new AtomicLong());
    
    private final Timer REQUEST_DURATION_SECONDS = 
        Metrics.timer("request_duration_seconds");
    
    // ❌ Bad: Unclear names
    private final Counter REQ_COUNT = Metrics.counter("req_count");
    private final AtomicLong MEM = Metrics.gauge("mem", new AtomicLong());
    
    // ✅ Good: Consistent units
    private final Counter BYTES_SENT_TOTAL = 
        Metrics.counter("bytes_sent_total");
    
    private final Timer DATABASE_QUERY_DURATION_SECONDS = 
        Metrics.timer("database_query_duration_seconds");
    
    // ❌ Bad: Mixed units
    private final Timer RESPONSE_TIME_MS = 
        Metrics.timer("response_time_ms"); // Should be seconds
}

2. Label Strategy

@Component
public class LabelingBestPractices {
    
    // ✅ Good: Low cardinality labels
    public void recordRequest(String method, String endpoint, int status) {
        Metrics.counter("http_requests_total",
            "method", method,              // GET, POST, PUT (low cardinality)
            "endpoint", normalizeEndpoint(endpoint), // /api/users/{id} (normalized)
            "status_class", getStatusClass(status)   // 2xx, 4xx, 5xx (low cardinality)
        ).increment();
    }
    
    // ❌ Bad: High cardinality labels
    public void recordRequestBad(String userId, String sessionId, String requestId) {
        Metrics.counter("requests_total",
            "user_id", userId,        // Millions of users!
            "session_id", sessionId,  // Millions of sessions!
            "request_id", requestId   // Every request unique!
        ).increment();
    }
    
    private String normalizeEndpoint(String endpoint) {
        // /api/users/123 -> /api/users/{id}
        return endpoint.replaceAll("/\\d+", "/{id}")
                      .replaceAll("/[a-f0-9-]{36}", "/{uuid}");
    }
    
    private String getStatusClass(int status) {
        return status / 100 + "xx";
    }
}

3. Error Monitoring Strategy

@Component
public class ErrorMonitoring {
    
    private final MeterRegistry registry;
    private final Timer errorResolutionTime;
    
    public ErrorMonitoring(MeterRegistry registry) {
        this.registry = registry;
        
        this.errorResolutionTime = Timer.builder("error_resolution_duration_seconds")
            .description("Time to resolve errors")
            .register(registry);
    }
    
    public void recordError(Exception e, String operation, String severity) {
        // Tags are part of the counter's identity, so register/look up the tagged instance
        Counter.builder("application_errors_total")
            .description("Total number of application errors")
            .tags(
                "error_type", e.getClass().getSimpleName(),
                "operation", operation,
                "severity", severity,
                "recoverable", String.valueOf(isRecoverable(e))
            )
            .register(registry)
            .increment();
        
        // Structured logging for correlation with metrics
        log.error("Application error occurred",
            kv("operation", operation),
            kv("error_type", e.getClass().getSimpleName()),
            kv("severity", severity),
            kv("message", e.getMessage()),
            e);
    }
    
    private boolean isRecoverable(Throwable t) {
        return !(t instanceof OutOfMemoryError || 
                 t instanceof StackOverflowError);
    }
}

4. Business Metrics Integration

@Service
public class BusinessMetricsService {
    
    private final MeterRegistry registry;
    
    public BusinessMetricsService(MeterRegistry registry, InventoryService inventoryService) {
        this.registry = registry;
        
        // Gauge bound to the inventory service at registration time
        Gauge.builder("inventory_items", inventoryService, InventoryService::getTotalItems)
            .description("Current inventory level")
            .register(registry);
    }
    
    @EventListener
    public void handleOrderCreated(OrderCreatedEvent event) {
        // Tagged counter resolved through the registry
        Counter.builder("orders_total")
            .description("Total number of orders")
            .tags(
                "product_category", event.getProductCategory(),
                "customer_segment", event.getCustomerSegment(),
                "payment_method", event.getPaymentMethod()
            )
            .register(registry)
            .increment();
        
        DistributionSummary.builder("revenue_usd")
            .description("Revenue in USD")
            .baseUnit("USD")
            .tag("currency", event.getCurrency())
            .register(registry)
            .record(event.getOrderAmount().doubleValue());
    }
    
    @EventListener
    public void handleOrderCancelled(OrderCancelledEvent event) {
        // A separate metric for cancellations
        Metrics.counter("orders_cancelled_total",
            "reason", event.getCancellationReason()
        ).increment();
    }
}

Deployment и DevOps

Docker Compose Setup

# docker-compose.yml
version: '3.8'

services:
  prometheus:
    image: prom/prometheus:latest
    container_name: prometheus
    ports:
      - "9090:9090"
    volumes:
      - ./prometheus.yml:/etc/prometheus/prometheus.yml
      - ./alerts.yml:/etc/prometheus/alerts.yml
      - prometheus_data:/prometheus
    command:
      - '--config.file=/etc/prometheus/prometheus.yml'
      - '--storage.tsdb.path=/prometheus'
      - '--web.console.libraries=/etc/prometheus/console_libraries'
      - '--web.console.templates=/etc/prometheus/consoles'
      - '--storage.tsdb.retention.time=30d'
      - '--web.enable-lifecycle'
      - '--web.enable-admin-api'

  grafana:
    image: grafana/grafana:latest
    container_name: grafana
    ports:
      - "3000:3000"
    environment:
      - GF_SECURITY_ADMIN_PASSWORD=admin123
      - GF_INSTALL_PLUGINS=grafana-piechart-panel
    volumes:
      - grafana_data:/var/lib/grafana
      - ./grafana/dashboards:/etc/grafana/provisioning/dashboards
      - ./grafana/datasources:/etc/grafana/provisioning/datasources

  alertmanager:
    image: prom/alertmanager:latest
    container_name: alertmanager
    ports:
      - "9093:9093"
    volumes:
      - ./alertmanager.yml:/etc/alertmanager/alertmanager.yml

  app:
    image: spring-boot-app:latest
    ports:
      - "8080:8080"
    environment:
      - SPRING_PROFILES_ACTIVE=docker
    depends_on:
      - prometheus

volumes:
  prometheus_data:
  grafana_data:
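
Compose монтирует ./prometheus.yml — ниже примерный минимальный вариант для scrape Spring Boot Actuator (имя job и адрес target — предположение под этот compose-файл):

# prometheus.yml (минимальный набросок)
global:
  scrape_interval: 15s

scrape_configs:
  - job_name: 'spring-boot-app'
    metrics_path: /actuator/prometheus
    static_configs:
      - targets: ['app:8080']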

Kubernetes Deployment

# prometheus-deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: prometheus
  namespace: monitoring
spec:
  replicas: 1
  selector:
    matchLabels:
      app: prometheus
  template:
    metadata:
      labels:
        app: prometheus
    spec:
      containers:
      - name: prometheus
        image: prom/prometheus:latest
        ports:
        - containerPort: 9090
        volumeMounts:
        - name: config
          mountPath: /etc/prometheus
        - name: storage
          mountPath: /prometheus
        args:
          - '--config.file=/etc/prometheus/prometheus.yml'
          - '--storage.tsdb.path=/prometheus'
          - '--storage.tsdb.retention.time=30d'
      volumes:
      - name: config
        configMap:
          name: prometheus-config
      - name: storage
        persistentVolumeClaim:
          claimName: prometheus-storage

---
apiVersion: v1
kind: ConfigMap
metadata:
  name: prometheus-config
  namespace: monitoring
data:
  prometheus.yml: |
    global:
      scrape_interval: 15s
    scrape_configs:
    - job_name: 'kubernetes-pods'
      kubernetes_sd_configs:
      - role: pod
      relabel_configs:
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
        action: keep
        regex: true

CI/CD Integration

# .github/workflows/monitoring.yml
name: Deploy Monitoring

on:
  push:
    branches: [main]
    paths: ['monitoring/**']

jobs:
  deploy:
    runs-on: ubuntu-latest
    steps:
    - uses: actions/checkout@v2
    
    - name: Validate Prometheus Config
      run: |
        docker run --rm -v $(pwd)/monitoring:/workspace \
          prom/prometheus:latest promtool check config /workspace/prometheus.yml
    
    - name: Validate Alert Rules
      run: |
        docker run --rm -v $(pwd)/monitoring:/workspace \
          prom/prometheus:latest promtool check rules /workspace/alerts.yml
    
    - name: Deploy to Kubernetes
      run: |
        kubectl apply -f monitoring/k8s/
        kubectl rollout status deployment/prometheus -n monitoring

Advanced Features

Recording Rules

# recording-rules.yml
groups:
  - name: application_rules
    interval: 30s
    rules:
      # Pre-calculate expensive queries
      - record: job:http_requests:rate5m
        expr: sum(rate(http_requests_total[5m])) by (job)
      
      - record: job:http_request_duration:p99
        expr: histogram_quantile(0.99, sum(rate(http_request_duration_seconds_bucket[5m])) by (job, le))
      
      # Business metrics aggregations
      - record: business:orders_per_minute
        expr: sum(rate(orders_created_total[1m])) * 60
      
      - record: business:revenue_per_hour
        expr: sum(rate(revenue_usd_sum[1h])) * 3600

Federation

# Global Prometheus config
scrape_configs:
  - job_name: 'federate'
    scrape_interval: 15s
    honor_labels: true
    metrics_path: '/federate'
    params:
      'match[]':
        - '{job=~"prometheus"}'
        - '{__name__=~"job:.*"}'
    static_configs:
      - targets:
        - 'prometheus-us-east:9090'
        - 'prometheus-eu-west:9090'
        - 'prometheus-ap-south:9090'

Custom Exporters

@Component
public class CustomDatabaseExporter {
    
    private final DataSource dataSource;
    private final CollectorRegistry registry;
    
    public CustomDatabaseExporter(DataSource dataSource, CollectorRegistry registry) {
        this.dataSource = dataSource;
        this.registry = registry;
        
        // Register custom collector
        new DatabaseMetricsCollector(dataSource).register(registry);
    }
    
    private static class DatabaseMetricsCollector extends Collector {
        private final DataSource dataSource;
        
        public DatabaseMetricsCollector(DataSource dataSource) {
            this.dataSource = dataSource;
        }
        
        @Override
        public List<MetricFamilySamples> collect() {
            List<MetricFamilySamples> samples = new ArrayList<>();
            
            try (Connection conn = dataSource.getConnection()) {
                // Query active connections
                ResultSet rs = conn.createStatement().executeQuery(
                    "SELECT count(*) as active_connections FROM pg_stat_activity WHERE state = 'active'"
                );
                
                if (rs.next()) {
                    samples.add(new MetricFamilySamples(
                        "database_active_connections",
                        Type.GAUGE,
                        "Number of active database connections",
                        Arrays.asList(new MetricFamilySamples.Sample(
                            "database_active_connections", 
                            Arrays.asList(), 
                            Arrays.asList(), 
                            rs.getDouble("active_connections")
                        ))
                    ));
                }
            } catch (SQLException e) {
                // Handle error
            }
            
            return samples;
        }
    }
}

Cost Optimization

Storage Management

# Prometheus storage optimization
global:
  scrape_interval: 30s  # Увеличить для non-critical metrics
  
scrape_configs:
  # Critical services - high frequency
  - job_name: 'critical-apps'
    scrape_interval: 15s
    static_configs:
      - targets: ['payment-service:8080', 'user-service:8080']
  
  # Non-critical services - low frequency  
  - job_name: 'batch-jobs'
    scrape_interval: 60s
    static_configs:
      - targets: ['reporting-service:8080']

# Retention настраивается флагами запуска Prometheus, а не в prometheus.yml:
#   --storage.tsdb.retention.time=15d   # срок хранения (15d — значение по умолчанию)
#   --storage.tsdb.retention.size=50GB  # ограничение по размеру хранилища

Metric Filtering

# Drop unnecessary metrics
metric_relabel_configs:
  # Drop detailed JVM metrics in production
  - source_labels: [__name__]
    regex: 'jvm_gc_collection_seconds_.*'
    action: drop
    
  # Drop high-cardinality HTTP metrics
  - source_labels: [__name__, uri]
    regex: 'http_request_duration_seconds_bucket;/api/users/[0-9]+'
    action: drop
    
  # Keep only important percentiles
  # Осторожно: action: keep отбрасывает ВСЕ сэмплы, не попавшие под regex,
  # поэтому такое правило уместно только для job, который отдаёт исключительно эту метрику
  - source_labels: [__name__, quantile]
    regex: 'http_request_duration_seconds;0\.(5|95|99)'
    action: keep

Efficient Dashboards

{
  "dashboard": {
    "title": "Optimized Dashboard",
    "refresh": "1m",
    "panels": [
      {
        "title": "Request Rate",
        "targets": [
          {
            "expr": "job:http_requests:rate5m",
            "interval": "30s",
            "maxDataPoints": 200
          }
        ]
      }
    ]
  }
}

Пояснение: Используйте recording rules для pre-calculation дорогих запросов, ограничивайте maxDataPoints и увеличивайте interval для better performance.


Заключение

Ключевые принципы эффективного мониторинга

  1. Start with the basics — HTTP метрики, JVM metrics, error rates
  2. Follow naming conventions — consistent metric names и labels
  3. Control cardinality — избегайте high-cardinality labels
  4. Monitor what matters — focus на business impact
  5. Automate alerting — proactive vs reactive monitoring
  6. Document everything — runbooks для alerts и dashboards

Эволюция monitoring stack

Phase 1: Basic monitoring

  • Spring Boot Actuator + Prometheus
  • Basic dashboards в Grafana
  • Simple alerts на infrastructure metrics

Phase 2: Advanced observability

  • Custom business metrics
  • SLI/SLO tracking
  • Distributed tracing integration
  • Advanced alerting rules

Phase 3: Enterprise scale

  • Multi-region federation
  • Cost optimization
  • Custom exporters
  • Integration с incident management

Выбор между инструментами

Prometheus vs. Alternatives:

  • Prometheus: Лучший выбор для cloud-native applications
  • InfluxDB: Если нужны более advanced time series features
  • Datadog/New Relic: managed-решения с меньшим operational overhead

Grafana vs. Alternatives:

  • Grafana: Де-факто стандарт для visualization
  • Prometheus UI: Достаточно для basic queries и debugging
  • Kibana: Если уже используете Elastic Stack

Главное правило

Monitor for actionability — каждая метрика и alert должны приводить к конкретным действиям. Если метрика не помогает в troubleshooting или decision making, она только добавляет noise.

Start simple, evolve gradually — начинайте с basic metrics и добавляйте complexity по мере роста понимания вашей системы.

OpenTelemetry

Основные концепции

OpenTelemetry (OTel) — единый стандарт для сбора телеметрии (трейсов, метрик, логов) из приложений. Состоит из спецификации, SDK и инструментов.

Observability — способность понимать внутреннее состояние системы по её внешним выходам. Включает три столпа:

  • Traces — путь запроса через систему
  • Metrics — числовые измерения во времени
  • Logs — структурированные записи событий

Ключевые термины

  • Trace — полная картина одного запроса через распределённую систему
  • Span — единица работы в трейсе (операция, вызов метода, HTTP-запрос)
  • Context — метаданные, передаваемые между сервисами
  • Instrumentation — код для сбора телеметрии
  • Exporter — компонент для отправки данных в бэкенды
  • Collector — прокси для приёма, обработки и маршрутизации телеметрии

Архитектура трейсинга

Client → Service A → Service B → Database
   |        |         |          |
   +--------+-- Trace (единый ID) --+
   |        |         |          |
  Span1   Span2     Span3      Span4

Каждый span содержит:

  • Trace ID — уникальный идентификатор всего трейса
  • Span ID — идентификатор конкретного span
  • Parent Span ID — ссылка на родительский span (пример вложенных span — ниже)
  • Timestamps — время начала и окончания
  • Attributes — метаданные (теги)
  • Events — события внутри span
  • Status — успех/ошибка
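
Связь Parent Span ID с родителем видна в коде: span, активный в текущем Scope, автоматически становится родителем нового. Минимальный набросок (имена операций условные, tracer — см. инициализацию SDK ниже):

Span parent = tracer.spanBuilder("handle-request").startSpan();
try (Scope ignored = parent.makeCurrent()) {
    // Дочерний span: родитель берётся из текущего контекста автоматически
    Span child = tracer.spanBuilder("load-user").startSpan();
    try (Scope ignoredChild = child.makeCurrent()) {
        // ... бизнес-логика ...
    } finally {
        child.end();
    }
} finally {
    parent.end();
}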

Настройка в Java

Зависимости Maven

<dependency>
    <groupId>io.opentelemetry</groupId>
    <artifactId>opentelemetry-bom</artifactId>
    <version>1.32.0</version>
    <type>pom</type>
    <scope>import</scope>
</dependency>
<dependency>
    <groupId>io.opentelemetry</groupId>
    <artifactId>opentelemetry-api</artifactId>
</dependency>
<dependency>
    <groupId>io.opentelemetry</groupId>
    <artifactId>opentelemetry-sdk</artifactId>
</dependency>

Инициализация SDK

// Создание OpenTelemetry SDK
OpenTelemetry openTelemetry = OpenTelemetrySdk.builder()
    .setTracerProvider(
        SdkTracerProvider.builder()
            .addSpanProcessor(BatchSpanProcessor.builder(
                OtlpGrpcSpanExporter.builder()
                    .setEndpoint("http://jaeger:4317")  // OTLP gRPC порт Jaeger
                    .build())
                .build())
            .setResource(Resource.getDefault()
                .merge(Resource.builder()
                    .put(ResourceAttributes.SERVICE_NAME, "my-service")
                    .put(ResourceAttributes.SERVICE_VERSION, "1.0.0")
                    .build()))
            .build())
    .build();

// Получение Tracer
Tracer tracer = openTelemetry.getTracer("my-service");

Создание Spans

Ручное создание

// Создание span с автоматическим закрытием
Span span = tracer.spanBuilder("process-order")
    .setSpanKind(SpanKind.INTERNAL)
    .startSpan();

try (Scope scope = span.makeCurrent()) {
    // Добавление атрибутов
    span.setAllAttributes(Attributes.of(
        AttributeKey.stringKey("order.id"), orderId,
        AttributeKey.longKey("order.amount"), amount
    ));
    
    // Выполнение бизнес-логики
    processOrder(orderId);
    
    // Добавление события
    span.addEvent("order-validated", 
        Attributes.of(AttributeKey.stringKey("result"), "success"));
    
} catch (Exception e) {
    // Отметка об ошибке
    span.recordException(e);
    span.setStatus(StatusCode.ERROR, "Order processing failed");
    throw e;
} finally {
    span.end();
}

Аннотации (с Spring Boot)

@WithSpan("user-service")
public User findUser(@SpanAttribute("user.id") String userId) {
    return userRepository.findById(userId);
}

Контекст и распространение

Context Propagation — механизм передачи трейсинг-информации между сервисами и потоками.

// Получение текущего контекста
Context current = Context.current();

// Выполнение в другом потоке с сохранением контекста
CompletableFuture.supplyAsync(() -> {
    // Этот код выполнится в контексте родительского span
    return processData();
}, Context.current().wrap(executor));

// HTTP-заголовки для передачи между сервисами
W3CTraceContextPropagator propagator = W3CTraceContextPropagator.getInstance();
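
Набросок inject/extract через полученный propagator (переменные connection и request — условные, под ваш HTTP-клиент и сервер):

// Клиент: записываем traceparent в заголовки исходящего запроса
TextMapSetter<HttpURLConnection> setter =
    (conn, key, value) -> conn.setRequestProperty(key, value);
propagator.inject(Context.current(), connection, setter);

// Сервер: восстанавливаем контекст из входящих заголовков
TextMapGetter<HttpServletRequest> getter = new TextMapGetter<>() {
    @Override
    public Iterable<String> keys(HttpServletRequest carrier) {
        return Collections.list(carrier.getHeaderNames());
    }
    @Override
    public String get(HttpServletRequest carrier, String key) {
        return carrier.getHeader(key);
    }
};
Context extracted = propagator.extract(Context.current(), request, getter);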

Автоматическая инструментация

Java Agent

# Запуск с автоматической инструментацией
java -javaagent:opentelemetry-javaagent.jar \
     -Dotel.service.name=my-service \
     -Dotel.exporter.otlp.endpoint=http://jaeger:4317 \
     -jar myapp.jar

Автоматически инструментирует (отдельные инструментации можно отключить — пример ниже):

  • HTTP-клиенты (OkHttp, Apache HttpClient)
  • Веб-фреймворки (Spring Boot, Servlet API)
  • Базы данных (JDBC, MongoDB, Redis)
  • Messaging (Kafka, RabbitMQ)
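
Пример отключения отдельной инструментации флагом агента (шаблон свойства — otel.instrumentation.<имя>.enabled):

java -javaagent:opentelemetry-javaagent.jar \
     -Dotel.service.name=my-service \
     -Dotel.instrumentation.jdbc.enabled=false \
     -jar myapp.jar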

Программная инструментация

// Инструментация OkHttp-клиента (модуль opentelemetry-okhttp-3.0 из opentelemetry-java-instrumentation)
Call.Factory tracedClient = OkHttpTelemetry.builder(openTelemetry)
    .build()
    .newCallFactory(new OkHttpClient());

Метрики

// Создание метрик
Meter meter = openTelemetry.getMeter("my-service");

// Counter - монотонно возрастающее значение
LongCounter requestCounter = meter.counterBuilder("http_requests_total")
    .setDescription("Total HTTP requests")
    .build();

// Histogram - распределение значений
DoubleHistogram responseTime = meter.histogramBuilder("http_request_duration")
    .setDescription("HTTP request duration")
    .setUnit("ms")
    .build();

// Gauge - текущее значение
ObservableDoubleGauge memoryUsage = meter.gaugeBuilder("memory_usage")
    .setDescription("Current memory usage")
    .buildWithCallback(measurement -> {
        measurement.record(Runtime.getRuntime().totalMemory());
    });

// Использование
requestCounter.add(1, Attributes.of(
    AttributeKey.stringKey("method"), "GET",
    AttributeKey.stringKey("endpoint"), "/api/users"
));

responseTime.record(150.0, Attributes.of(
    AttributeKey.stringKey("status"), "200"
));

Интеграция с Spring Boot

Конфигурация

# application.yml
management:
  tracing:
    enabled: true
    sampling:
      probability: 1.0
  otlp:
    tracing:
      endpoint: http://jaeger:4318/v1/traces
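
Для такой конфигурации в Spring Boot 3 обычно требуются Actuator, мост Micrometer Tracing → OpenTelemetry и OTLP-экспортёр (набросок зависимостей; версии, как правило, управляются BOM Spring Boot):

<dependency>
    <groupId>org.springframework.boot</groupId>
    <artifactId>spring-boot-starter-actuator</artifactId>
</dependency>
<dependency>
    <groupId>io.micrometer</groupId>
    <artifactId>micrometer-tracing-bridge-otel</artifactId>
</dependency>
<dependency>
    <groupId>io.opentelemetry</groupId>
    <artifactId>opentelemetry-exporter-otlp</artifactId>
</dependency>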

Кастомные метрики

@Component
public class OrderMetrics {
    
    private final MeterRegistry meterRegistry;
    private final Timer orderProcessingTime;
    
    public OrderMetrics(MeterRegistry meterRegistry) {
        this.meterRegistry = meterRegistry;
        
        this.orderProcessingTime = Timer.builder("order_processing_duration")
            .description("Order processing time")
            .register(meterRegistry);
    }
    
    public void recordOrder(String status) {
        // Тег status известен только в момент вызова — счётчик получаем из registry динамически
        meterRegistry.counter("orders_total", "status", status).increment();
    }
    
    public void recordProcessingTime(Duration duration) {
        orderProcessingTime.record(duration);
    }
}

Популярные бэкенды

Jaeger

  • Распределённый трейсинг
  • UI для анализа трейсов
  • Поддержка сэмплирования

Zipkin

  • Простой в настройке
  • Веб-интерфейс для трейсов
  • Легковесный

Prometheus + Grafana

  • Prometheus — сбор метрик
  • Grafana — визуализация
  • Alertmanager — уведомления

Коммерческие решения

  • Datadog — полнофункциональный APM
  • New Relic — мониторинг производительности
  • Dynatrace — AI-powered observability

Collector Configuration

# otel-collector.yml
receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317
      http:
        endpoint: 0.0.0.0:4318

processors:
  batch:
    timeout: 1s
    send_batch_size: 1024
  memory_limiter:
    check_interval: 1s
    limit_mib: 512

exporters:
  jaeger:
    endpoint: jaeger:14250
    tls:
      insecure: true
  prometheus:
    endpoint: "0.0.0.0:8889"

service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [memory_limiter, batch]
      exporters: [jaeger]
    metrics:
      receivers: [otlp]
      processors: [memory_limiter, batch]
      exporters: [prometheus]

Лучшие практики

Производительность

  • Используйте сэмплирование для снижения overhead (набросок настройки — ниже)
  • Настройте batch processing для экспорта
  • Ограничьте количество attributes на span
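
Минимальный набросок настройки сэмплирования и batch-экспорта (значения параметров условные):

SdkTracerProvider tracerProvider = SdkTracerProvider.builder()
    // Сэмплируем ~10% трейсов; parentBased сохраняет решение родительского span
    .setSampler(Sampler.parentBased(Sampler.traceIdRatioBased(0.1)))
    .addSpanProcessor(BatchSpanProcessor.builder(
            OtlpGrpcSpanExporter.builder()
                .setEndpoint("http://jaeger:4317")
                .build())
        .setMaxQueueSize(2048)                    // размер очереди span'ов
        .setMaxExportBatchSize(512)               // размер батча на экспорт
        .setScheduleDelay(Duration.ofSeconds(5))  // период отправки
        .build())
    .build();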

Безопасность

  • Не записывайте чувствительные данные в атрибуты
  • Используйте фильтрацию на уровне Collector (пример процессора — ниже)
  • Настройте TLS для передачи данных
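
Набросок для otel-collector.yml: процессор attributes удаляет потенциально чувствительные атрибуты (имена атрибутов условные); процессор нужно добавить в pipeline traces.

processors:
  attributes/scrub:
    actions:
      - key: http.request.header.authorization
        action: delete
      - key: user.email
        action: delete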

Мониторинг

  • Мониторьте сам OpenTelemetry (метрики SDK)
  • Отслеживайте производительность инструментации
  • Настройте алерты на критические метрики

Troubleshooting

Общие проблемы

  • Spans не появляются: проверьте exporter и endpoint
  • Высокий overhead: уменьшите сэмплирование
  • Потеря контекста: проверьте propagation в async-коде
  • Большие трейсы: ограничьте глубину инструментации

Отладка

// Debug-логи Java-агента включаются флагом при запуске JVM:
//   -Dotel.javaagent.debug=true

// No-op реализация для тестов: телеметрия не собирается и никуда не отправляется
OpenTelemetry noop = OpenTelemetry.noop();

Вопросы для собеседования

  1. Чем отличается trace от span?
  2. Как работает context propagation в микросервисах?
  3. Какие виды сэмплирования знаете?
  4. Как минимизировать overhead от трейсинга?
  5. Различия между push и pull моделями метрик?
  6. Как обеспечить безопасность телеметрии?
  7. Стратегии для high-load систем?