Логирование в микросервисах на Java
Вызовы логирования в микросервисах
Основные проблемы
Распределенные транзакции:
- Один request проходит через множество сервисов
- Сложно отследить полный путь выполнения
- Логи разбросаны по разным системам и файлам
Корреляция логов:
- Нужно связать логи одного request'а из разных сервисов
- Отсутствие единого контекста затрудняет debugging
- Временные метки могут отличаться между серверами
Объем данных:
- Микросервисы генерируют огромное количество логов
- Нужна эффективная агрегация и поиск
- Хранение и индексация больших объемов
Структурированность:
- Неструктурированные логи сложно анализировать
- Разные форматы в разных сервисах
- Нужна стандартизация и парсинг
Пояснение: В монолите все логи в одном месте, в микросервисах — это distributed debugging nightmare без правильных инструментов.
Фундаментальные принципы
1. Correlation ID
Ключевая концепция — каждый request получает уникальный идентификатор, который передается через всю цепочку сервисов.
@Component
public class CorrelationInterceptor implements HandlerInterceptor {
private static final String CORRELATION_ID_HEADER = "X-Correlation-ID";
@Override
public boolean preHandle(HttpServletRequest request,
HttpServletResponse response,
Object handler) {
String correlationId = request.getHeader(CORRELATION_ID_HEADER);
if (correlationId == null) {
correlationId = UUID.randomUUID().toString();
}
MDC.put("correlationId", correlationId);
response.setHeader(CORRELATION_ID_HEADER, correlationId);
return true;
}
@Override
public void afterCompletion(HttpServletRequest request,
HttpServletResponse response,
Object handler, Exception ex) {
MDC.clear();
}
}
Пояснение: MDC (Mapped Diagnostic Context) позволяет добавлять contextual information ко всем log statements в текущем потоке.
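Перехватчик выше только читает и сохраняет correlation ID; чтобы идентификатор действительно проходил через всю цепочку, его нужно добавлять и в исходящие вызовы. Ниже — минимальный набросок для RestTemplate (предполагается Spring Web; имя класса условное):
import java.io.IOException;
import org.slf4j.MDC;
import org.springframework.http.HttpRequest;
import org.springframework.http.client.ClientHttpRequestExecution;
import org.springframework.http.client.ClientHttpRequestInterceptor;
import org.springframework.http.client.ClientHttpResponse;

public class CorrelationIdPropagationInterceptor implements ClientHttpRequestInterceptor {
    private static final String CORRELATION_ID_HEADER = "X-Correlation-ID";

    @Override
    public ClientHttpResponse intercept(HttpRequest request, byte[] body,
                                        ClientHttpRequestExecution execution) throws IOException {
        // Берём correlationId из MDC текущего потока и прокидываем его дальше по цепочке
        String correlationId = MDC.get("correlationId");
        if (correlationId != null) {
            request.getHeaders().add(CORRELATION_ID_HEADER, correlationId);
        }
        return execution.execute(request, body);
    }
}
Такой interceptor регистрируется через restTemplate.setInterceptors(...); для Feign и WebClient используются аналогичные механизмы.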
2. Структурированное логирование
// Плохо: неструктурированные логи
log.info("User John ordered 3 items for $25.50");
// Хорошо: структурированные логи
log.info("Order created",
kv("userId", "john123"),
kv("itemCount", 3),
kv("totalAmount", 25.50),
kv("action", "order_created"));
3. Логирование на разных уровнях
Application logs: Бизнес-события и ошибки
Infrastructure logs: Системные события, performance metrics
Audit logs: Compliance и security events (пример выделения audit-логов — ниже)
Access logs: HTTP requests и responses
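Один из способов развести эти типы логов технически — отдельные логгеры и SLF4J Marker, которые в logback можно направить в разные appender'ы. Набросок (имена логгера и marker'а условные):
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;
import org.slf4j.Marker;
import org.slf4j.MarkerFactory;

public class AuditLogger {
    // Отдельный логгер "AUDIT" и marker: в logback-spring.xml их можно
    // отфильтровать и писать в отдельный файл для compliance
    private static final Logger AUDIT_LOG = LoggerFactory.getLogger("AUDIT");
    private static final Marker AUDIT_MARKER = MarkerFactory.getMarker("AUDIT");

    public void userLoggedIn(String userId, String sourceIp) {
        AUDIT_LOG.info(AUDIT_MARKER, "User login: userId={}, ip={}", userId, sourceIp);
    }
}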
Настройка базового логирования
Logback конфигурация
<!-- logback-spring.xml -->
<configuration>
<springProfile name="!prod">
<appender name="CONSOLE" class="ch.qos.logback.core.ConsoleAppender">
<encoder class="net.logstash.logback.encoder.LoggingEventCompositeJsonEncoder">
<providers>
<timestamp/>
<logLevel/>
<loggerName/>
<mdc/>
<message/>
<stackTrace/>
</providers>
</encoder>
</appender>
</springProfile>
<springProfile name="prod">
<appender name="FILE" class="ch.qos.logback.core.rolling.RollingFileAppender">
<file>/var/log/app/application.log</file>
<rollingPolicy class="ch.qos.logback.core.rolling.SizeAndTimeBasedRollingPolicy">
<fileNamePattern>/var/log/app/application.%d{yyyy-MM-dd}.%i.log</fileNamePattern>
<maxFileSize>100MB</maxFileSize>
<maxHistory>30</maxHistory>
<totalSizeCap>3GB</totalSizeCap>
</rollingPolicy>
<encoder class="net.logstash.logback.encoder.LogstashEncoder">
<includeContext>true</includeContext>
<includeMdc>true</includeMdc>
<customFields>{"service":"user-service"}</customFields>
</encoder>
</appender>
</springProfile>
<root level="INFO">
<appender-ref ref="CONSOLE"/>
<appender-ref ref="FILE"/>
</root>
</configuration>
Пояснение:
- JSON формат облегчает парсинг в ELK stack
- Rolling policy предотвращает переполнение диска
- Разные настройки для dev/prod окружений
Structured Logging с SLF4J
@Service
public class OrderService {
private static final Logger log = LoggerFactory.getLogger(OrderService.class);
public Order createOrder(CreateOrderRequest request) {
MDC.put("userId", request.getUserId());
MDC.put("operation", "create_order");
try {
log.info("Creating order",
kv("productId", request.getProductId()),
kv("quantity", request.getQuantity()));
Order order = processOrder(request);
log.info("Order created successfully",
kv("orderId", order.getId()),
kv("status", order.getStatus()),
kv("amount", order.getTotalAmount()));
return order;
} catch (Exception e) {
log.error("Order creation failed",
kv("productId", request.getProductId()),
kv("error", e.getMessage()), e);
throw e;
} finally {
MDC.clear();
}
}
}
ELK Stack (Elasticsearch, Logstash, Kibana)
Архитектура ELK
Microservices → Filebeat → Logstash → Elasticsearch → Kibana
↓ ↓ ↓ ↓ ↓
[Logs] [Shipping] [Processing] [Storage] [Visualization]
Пояснение:
- Filebeat — lightweight log shipper
- Logstash — log processing pipeline
- Elasticsearch — search и analytics engine
- Kibana — visualization и dashboard platform
Filebeat конфигурация
# filebeat.yml
filebeat.inputs:
- type: log
enabled: true
paths:
- /var/log/app/*.log
fields:
service: user-service
environment: production
fields_under_root: true
multiline.pattern: '^\d{4}-\d{2}-\d{2}'
multiline.negate: true
multiline.match: after
processors:
- add_host_metadata:
when.not.contains.tags: forwarded
output.logstash:
hosts: ["logstash:5044"]
logging.level: info
logging.to_files: true
logging.files:
path: /var/log/filebeat
name: filebeat
keepfiles: 7
permissions: 0644
Logstash конфигурация
# logstash.conf
input {
beats {
port => 5044
}
}
filter {
if [service] == "user-service" {
json {
source => "message"
}
date {
match => [ "timestamp", "ISO8601" ]
}
if [level] == "ERROR" {
mutate {
add_tag => ["error"]
}
}
# Извлечение correlation ID
if [mdc][correlationId] {
mutate {
add_field => { "correlation_id" => "%{[mdc][correlationId]}" }
}
}
}
}
output {
elasticsearch {
hosts => ["elasticsearch:9200"]
index => "microservices-logs-%{+YYYY.MM.dd}"
}
}
Kibana Dashboard
Index Pattern: microservices-logs-*
Полезные поля для визуализации:
- @timestamp — время события
- service — имя микросервиса
- level — уровень логирования
- correlation_id — идентификатор запроса
- message — текст сообщения
- mdc.userId — пользователь
Пояснение: ELK отлично подходит для centralized logging и ad-hoc поиска по логам, но требует значительных ресурсов для больших объемов.
Fluentd для log aggregation
Архитектура с Fluentd
Microservices → Fluentd Agent → Fluentd Aggregator → Storage
↓ ↓ ↓ ↓
[Logs] [Collection] [Processing] [ES/S3/etc]
Fluentd конфигурация
<!-- fluent.conf -->
<source>
@type tail
path /var/log/app/*.log
pos_file /var/log/fluentd/app.log.pos
tag microservices.app
format json
time_key timestamp
time_format %Y-%m-%dT%H:%M:%S.%L%z
</source>
<filter microservices.**>
@type record_transformer
<record>
hostname ${hostname}
environment "#{ENV['ENVIRONMENT']}"
cluster "#{ENV['CLUSTER_NAME']}"
</record>
</filter>
<match microservices.**>
@type elasticsearch
host elasticsearch
port 9200
index_name microservices-logs
type_name _doc
logstash_format true
logstash_prefix microservices-logs
logstash_dateformat %Y.%m.%d
<buffer>
@type file
path /var/log/fluentd/buffer/elasticsearch
flush_mode interval
flush_interval 10s
chunk_limit_size 64m
queue_limit_length 128
</buffer>
</match>
Пояснение: Fluentd более гибкий чем Logstash для routing и transformation, лучше подходит для complex log processing scenarios.
Distributed Tracing
OpenTelemetry
OpenTelemetry — unified observability framework для traces, metrics и logs.
<dependency>
<groupId>io.opentelemetry</groupId>
<artifactId>opentelemetry-api</artifactId>
</dependency>
<dependency>
<groupId>io.opentelemetry.instrumentation</groupId>
<artifactId>opentelemetry-spring-boot-starter</artifactId>
</dependency>
@RestController
public class UserController {
private final Tracer tracer = GlobalOpenTelemetry.getTracer("user-service");
@GetMapping("/users/{id}")
public User getUser(@PathVariable String id) {
Span span = tracer.spanBuilder("get-user")
.setAttribute("user.id", id)
.setAttribute("service.name", "user-service")
.startSpan();
try (Scope scope = span.makeCurrent()) {
log.info("Fetching user", kv("userId", id));
return userService.findUser(id);
} catch (Exception e) {
span.setStatus(StatusCode.ERROR, e.getMessage());
throw e;
} finally {
span.end();
}
}
}
Zipkin интеграция
# application.yml
management:
tracing:
sampling:
probability: 1.0 # 100% для dev, 0.1 для prod
zipkin:
tracing:
endpoint: http://zipkin:9411/api/v2/spans
logging:
pattern:
level: "%5p [%X{traceId:-},%X{spanId:-}]"
@Component
public class TracingFilter implements Filter {
@Override
public void doFilter(ServletRequest request, ServletResponse response,
FilterChain chain) throws IOException, ServletException {
Span span = Tracing.current().tracer().nextSpan()
.name("http-request")
.tag("http.method", ((HttpServletRequest) request).getMethod())
.tag("http.url", ((HttpServletRequest) request).getRequestURL().toString())
.start();
try (CurrentTraceContext.Scope scope =
Tracing.current().currentTraceContext().newScope(span.context())) {
chain.doFilter(request, response);
} finally {
span.finish(); // без finish() span не будет отправлен в Zipkin
}
}
}
Пояснение: Zipkin визуализирует request flow через микросервисы, показывая timing и dependencies между вызовами.
Jaeger vs Zipkin
Zipkin:
- Проще в setup и использовании
- Twitter origin, battle-tested
- Хорошая производительность для средних объемов
- JSON/Thrift протоколы
Jaeger:
- Uber origin, designed for scale
- Better performance для больших объемов
- Adaptive sampling strategies
- gRPC протокол более эффективен
- Better storage options (Cassandra, Elasticsearch)
# Jaeger configuration
opentracing:
jaeger:
service-name: user-service
sampler:
type: probabilistic
param: 0.1 # 10% sampling rate
sender:
agent-host: jaeger-agent
agent-port: 6831
Пояснение: Выбор между Zipkin и Jaeger зависит от scale и requirements. Zipkin проще для начала, Jaeger лучше для enterprise scale.
Observability Patterns
Three Pillars of Observability
1. Logs — что произошло
2. Metrics — numerical measurements
3. Traces — как requests проходят через систему
Correlation между Logs и Traces
@Service
public class PaymentService {
public PaymentResult processPayment(PaymentRequest request) {
Span currentSpan = Span.current();
String traceId = currentSpan.getSpanContext().getTraceId();
String spanId = currentSpan.getSpanContext().getSpanId();
MDC.put("traceId", traceId);
MDC.put("spanId", spanId);
MDC.put("operation", "process_payment");
try {
log.info("Processing payment",
kv("amount", request.getAmount()),
kv("currency", request.getCurrency()));
// Создание child span для external call
Span childSpan = GlobalOpenTelemetry.getTracer("payment-service")
.spanBuilder("external-payment-api")
.setAttribute("payment.provider", "stripe")
.startSpan();
try (Scope scope = childSpan.makeCurrent()) {
return externalPaymentService.charge(request);
} finally {
childSpan.end();
}
} finally {
MDC.clear();
}
}
}
Custom Metrics с Micrometer
@Component
public class BusinessMetrics {
    private final MeterRegistry meterRegistry;
    private final Timer orderProcessingTime;

    public BusinessMetrics(MeterRegistry meterRegistry) {
        this.meterRegistry = meterRegistry;
        this.orderProcessingTime = Timer.builder("order.processing.time")
                .description("Order processing time")
                .register(meterRegistry);
        // Gauge привязывается к объекту и функции, возвращающей текущее значение
        Gauge.builder("users.active", this, BusinessMetrics::getActiveUserCount)
                .description("Number of active users")
                .register(meterRegistry);
    }

    public void recordOrder(String userId, double amount) {
        // Counter.increment() теги не принимает, поэтому counter с нужным тегом
        // получаем через builder (повторная регистрация вернёт тот же meter)
        Counter.builder("orders.created")
                .description("Number of orders created")
                .tag("user.type", getUserType(userId))
                .register(meterRegistry)
                .increment();
        log.info("Order recorded in metrics",
                kv("userId", userId),
                kv("amount", amount));
    }
}
Security и Compliance
Sensitive Data Handling
@Component
public class SecureLogging {
// Маскирование sensitive data
public void logUserAction(String userId, String creditCardNumber, String action) {
log.info("User action performed",
kv("userId", userId),
kv("creditCard", maskCreditCard(creditCardNumber)),
kv("action", action));
}
private String maskCreditCard(String ccNumber) {
if (ccNumber == null || ccNumber.length() < 4) {
return "****";
}
return "**** **** **** " + ccNumber.substring(ccNumber.length() - 4);
}
// Audit trail
public void auditSecurityEvent(String userId, String event, String details) {
MDC.put("audit", "true");
MDC.put("security", "true");
log.info("Security event",
kv("userId", userId),
kv("event", event),
kv("details", details),
kv("timestamp", Instant.now()),
kv("source", "security-audit"));
MDC.remove("audit");
MDC.remove("security");
}
}
GDPR Compliance
@Service
public class GdprCompliantLogging {
public void logWithDataRetention(String userId, String action) {
// Добавление retention metadata
MDC.put("retention.policy", "90days");
MDC.put("data.classification", "personal");
log.info("User action",
kv("userId", hashUserId(userId)), // Хеширование PII
kv("action", action));
}
private String hashUserId(String userId) {
return DigestUtils.sha256Hex(userId + SECRET_SALT);
}
}
Пояснение: В production никогда не логируйте PII (personally identifiable information) в plain text. Используйте hashing или masking.
Performance Optimization
Asynchronous Logging
<!-- Async appender для performance -->
<appender name="ASYNC_FILE" class="ch.qos.logback.classic.AsyncAppender">
<appender-ref ref="FILE"/>
<queueSize>1024</queueSize>
<discardingThreshold>0</discardingThreshold>
<includeCallerData>false</includeCallerData>
<neverBlock>true</neverBlock>
</appender>
// Conditional logging для дорогих операций
if (log.isDebugEnabled()) {
log.debug("Expensive debug info: {}", expensiveDebugInfoGeneration());
}
// Lazy evaluation с lambda — «классический» SLF4J API Supplier не принимает,
// но fluent API SLF4J 2.x поддерживает ленивые аргументы
log.atDebug().setMessage("User details: {}")
.addArgument(() -> buildExpensiveUserReport(user))
.log();
Log Sampling
@Component
public class SampledLogger {
private final AtomicLong counter = new AtomicLong(0);
private static final int SAMPLE_RATE = 100; // Логируем каждое 100-е событие
public void logSampled(String message, Object... args) {
if (counter.incrementAndGet() % SAMPLE_RATE == 0) {
log.info(message, args);
}
}
}
Resource Usage Monitoring
@Component
public class LoggingResourceMonitor {
@Scheduled(fixedRate = 60000) // Каждую минуту; @Scheduled ставится на метод, а не на класс
public void monitorLogDiskUsage() {
File logDir = new File("/var/log/app");
long totalSpace = logDir.getTotalSpace();
long freeSpace = logDir.getFreeSpace();
double usagePercent = ((double)(totalSpace - freeSpace) / totalSpace) * 100;
if (usagePercent > 85) {
log.warn("High disk usage for logs",
kv("usage", usagePercent),
kv("freeSpace", freeSpace),
kv("totalSpace", totalSpace));
}
}
}
Monitoring и Alerting
Log-based Alerts
# Примеры alert-правил в формате Prometheus на основе метрик, получаемых из логов
- alert: HighErrorRate
expr: rate(log_entries{level="ERROR"}[5m]) > 10
for: 2m
annotations:
summary: "High error rate detected"
description: "Error rate is {{ $value }} errors per second"
- alert: ServiceDown
expr: absent(log_entries{service="user-service"}[5m])
for: 1m
annotations:
summary: "Service appears to be down"
description: "No logs from user-service in last 5 minutes"
Health Check Logging
@Component
public class HealthCheckLogger {
@EventListener
public void onHealthChange(HealthChangedEvent event) {
log.info("Health status changed",
kv("component", event.getComponentName()),
kv("status", event.getStatus()),
kv("details", event.getDetails()));
}
@Scheduled(fixedRate = 30000)
public void logSystemHealth() {
log.info("System health check",
kv("memoryUsage", getMemoryUsage()),
kv("cpuUsage", getCpuUsage()),
kv("activeConnections", getActiveConnections()));
}
}
Best Practices
1. Logging Levels Strategy
// ERROR: для ошибок, требующих немедленного внимания
log.error("Payment processing failed", kv("orderId", orderId), exception);
// WARN: для потенциальных проблем
log.warn("High response time detected", kv("responseTime", responseTime));
// INFO: для важных business events
log.info("Order created", kv("orderId", orderId), kv("userId", userId));
// DEBUG: для detailed troubleshooting (только в dev)
log.debug("Database query executed", kv("sql", sql), kv("params", params));
// TRACE: для очень детального debugging (обычно отключен)
log.trace("Method entry", kv("method", "calculateTotal"), kv("args", args));
2. Standardized Log Format
public class LoggingUtils {
public static void logBusinessEvent(String event, String entityType,
String entityId, Map<String, Object> details) {
MDC.put("event.type", "business");
MDC.put("entity.type", entityType);
MDC.put("entity.id", entityId);
log.info("Business event: {}", event,
kv("details", details),
kv("timestamp", Instant.now()));
}
public static void logTechnicalEvent(String component, String operation,
String status, long duration) {
MDC.put("event.type", "technical");
MDC.put("component", component);
log.info("Technical event",
kv("operation", operation),
kv("status", status),
kv("duration", duration));
}
}
3. Error Context Preservation
@Service
public class ErrorHandlingService {
public void handleBusinessError(BusinessException e, String operation) {
// Сохраняем полный контекст ошибки
log.error("Business operation failed",
kv("operation", operation),
kv("errorCode", e.getErrorCode()),
kv("errorMessage", e.getMessage()),
kv("userId", getCurrentUserId()),
kv("correlationId", MDC.get("correlationId")),
kv("stackTrace", getStackTraceAsString(e)));
}
private String getStackTraceAsString(Exception e) {
StringWriter sw = new StringWriter();
e.printStackTrace(new PrintWriter(sw));
return sw.toString();
}
}
4. Production Deployment
# production-logging.yml
logging:
level:
root: INFO
com.company: INFO
org.springframework: WARN
org.hibernate: WARN
pattern:
file: "%d{yyyy-MM-dd HH:mm:ss} [%thread] %-5level [%X{traceId:-},%X{spanId:-}] %logger{36} - %msg%n"
file:
name: /var/log/app/application.log
max-size: 100MB
max-history: 30
5. Log Rotation и Cleanup
#!/bin/bash
# log-cleanup.sh - автоматическая очистка старых логов
LOG_DIR="/var/log/app"
RETENTION_DAYS=30
# Удаление логов старше 30 дней
find $LOG_DIR -name "*.log*" -mtime +$RETENTION_DAYS -delete
# Компрессия логов старше 7 дней
find $LOG_DIR -name "*.log*" -mtime +7 ! -name "*.gz" -exec gzip {} \;
# Уведомление при превышении дискового пространства
USAGE=$(df $LOG_DIR | tail -1 | awk '{print $5}' | sed 's/%//')
if [ $USAGE -gt 85 ]; then
echo "Warning: Log directory usage is ${USAGE}%" | mail -s "High disk usage" admin@company.com
fi
Troubleshooting Scenarios
1. Performance Issue Investigation
// Логирование для performance analysis
@Around("@annotation(Monitored)")
public Object monitorPerformance(ProceedingJoinPoint joinPoint) throws Throwable {
String methodName = joinPoint.getSignature().getName();
long startTime = System.currentTimeMillis();
log.info("Method execution started",
kv("method", methodName),
kv("args", Arrays.toString(joinPoint.getArgs())));
try {
Object result = joinPoint.proceed();
long duration = System.currentTimeMillis() - startTime;
log.info("Method execution completed",
kv("method", methodName),
kv("duration", duration),
kv("status", "success"));
return result;
} catch (Exception e) {
long duration = System.currentTimeMillis() - startTime;
log.error("Method execution failed",
kv("method", methodName),
kv("duration", duration),
kv("error", e.getMessage()), e);
throw e;
}
}
2. Distributed Transaction Tracking
// Отслеживание распределенных транзакций
@Service
public class OrderOrchestrator {
public void processOrder(OrderRequest request) {
String correlationId = MDC.get("correlationId");
String sagaId = UUID.randomUUID().toString();
MDC.put("sagaId", sagaId);
log.info("Saga started",
kv("sagaId", sagaId),
kv("orderId", request.getOrderId()),
kv("steps", List.of("inventory", "payment", "shipping")));
try {
// Step 1: Reserve inventory
logSagaStep("inventory", "started");
inventoryService.reserve(request.getProductId(), request.getQuantity());
logSagaStep("inventory", "completed");
// Step 2: Process payment
logSagaStep("payment", "started");
paymentService.charge(request.getPaymentDetails());
logSagaStep("payment", "completed");
// Step 3: Arrange shipping
logSagaStep("shipping", "started");
shippingService.schedule(request.getShippingAddress());
logSagaStep("shipping", "completed");
log.info("Saga completed successfully", kv("sagaId", sagaId));
} catch (Exception e) {
log.error("Saga failed, starting compensation",
kv("sagaId", sagaId),
kv("error", e.getMessage()));
// Compensation logic...
}
}
private void logSagaStep(String step, String status) {
log.info("Saga step " + status,
kv("step", step),
kv("status", status),
kv("sagaId", MDC.get("sagaId")));
}
}
Заключение
Ключевые принципы
- Correlation ID везде — основа для debugging distributed systems
- Структурированные логи — JSON формат для легкого парсинга
- Appropriate log levels — не засоряйте production DEBUG'ом
- Centralized aggregation — один источник истины для всех логов
- Security first — никогда не логируйте sensitive data
- Performance awareness — async logging и sampling для high load
- Monitoring и alerting — proactive problem detection
Выбор инструментов
ELK Stack: Для полнофункционального log management и analytics
Fluentd: Когда нужна гибкость в log processing и routing
OpenTelemetry: Современный стандарт для observability
Zipkin: Простое distributed tracing для начала
Jaeger: Enterprise-scale distributed tracing
Эволюция подхода
- Start simple: Basic file logging + correlation ID
- Add structure: JSON формат и MDC
- Centralize: ELK или аналогичный stack
- Add tracing: Zipkin или Jaeger для request flow
- Optimize: Performance tuning и cost
Мониторинг с Prometheus и Grafana на Java
Что такое мониторинг приложений
Application Performance Monitoring (APM) — это практика отслеживания и анализа производительности, доступности и поведения приложений в real-time. В микросервисной архитектуре мониторинг критически важен для:
- Раннего обнаружения проблем до их влияния на пользователей
- Capacity planning и оптимизации ресурсов
- Troubleshooting и root cause analysis
- SLA compliance и performance optimization
Типы метрик
Business Metrics: Заказы в секунду, конверсия, revenue
Application Metrics: Response time, error rate, throughput
Infrastructure Metrics: CPU, память, disk I/O, network
Custom Metrics: Domain-specific показатели
The Four Golden Signals (Google SRE)
- Latency — время отклика запросов
- Traffic — количество запросов к системе
- Errors — процент failed запросов
- Saturation — использование ресурсов системы
Пояснение: Эти четыре метрики дают полную картину здоровья системы. Если они в норме — система работает хорошо.
Prometheus: Time Series Database
Что такое Prometheus
Prometheus — это open-source система мониторинга и alerting, специально созданная для cloud-native приложений. Ключевые особенности:
- Pull-based модель — Prometheus сам запрашивает метрики у приложений
- Time series database — эффективное хранение временных рядов
- Multi-dimensional data model — метрики с labels для детализации
- PromQL — мощный язык запросов для анализа данных
- Service discovery — автоматическое обнаружение targets
Архитектура Prometheus
Applications → Metrics Endpoint → Prometheus Server → AlertManager
↓ ↓ ↓ ↓
[/actuator/ [HTTP scraping] [Storage + [Notifications]
prometheus] PromQL]
Пояснение: Приложения exposing метрики через HTTP endpoint, Prometheus периодически их scraping и сохраняет в time series database.
Data Model
Metric sample состоит из:
- Metric name — имя метрики (например, http_requests_total)
- Labels — key-value пары для детализации ({method="GET", status="200"})
- Timestamp — время измерения
- Value — численное значение
Пример:
http_requests_total{method="GET", endpoint="/api/users", status="200"} 1027 @1609459200
Типы метрик в Prometheus
Counter — монотонно растущий счетчик (например, количество запросов)
Gauge — значение, которое может увеличиваться и уменьшаться (например, использование памяти)
Histogram — распределение значений по buckets (например, latency distribution)
Summary — похож на Histogram, но вычисляет quantiles на client-side (сравнение на примере Micrometer — ниже)
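В Micrometer разница между Histogram и Summary хорошо видна на примере Timer: можно публиковать buckets (и считать квантили на сервере через histogram_quantile) или готовые процентили, посчитанные на клиенте. Набросок (имена метрик условные):
import io.micrometer.core.instrument.MeterRegistry;
import io.micrometer.core.instrument.Timer;

public class LatencyMetrics {

    public LatencyMetrics(MeterRegistry registry) {
        // Histogram: на сервер публикуются buckets, квантили считает Prometheus
        Timer.builder("checkout.latency.histogram")
                .publishPercentileHistogram()
                .register(registry);

        // Summary: квантили считаются на стороне клиента и публикуются как готовые значения
        Timer.builder("checkout.latency.summary")
                .publishPercentiles(0.5, 0.95, 0.99)
                .register(registry);
    }
}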
Настройка мониторинга в Spring Boot
Maven зависимости
<dependency>
<groupId>org.springframework.boot</groupId>
<artifactId>spring-boot-starter-actuator</artifactId>
</dependency>
<dependency>
<groupId>io.micrometer</groupId>
<artifactId>micrometer-registry-prometheus</artifactId>
</dependency>
Конфигурация Actuator
# application.yml
management:
endpoints:
web:
exposure:
include: health, info, metrics, prometheus
base-path: /actuator
endpoint:
health:
show-details: always
show-components: always
metrics:
enabled: true
prometheus:
enabled: true
metrics:
distribution:
percentiles-histogram:
http.server.requests: true
percentiles:
http.server.requests: 0.5, 0.95, 0.99
tags:
application: ${spring.application.name}
environment: ${spring.profiles.active}
Пояснение:
- /actuator/prometheus — endpoint, отдающий метрики в Prometheus format
- percentiles-histogram включает histogram buckets для latency analysis
- Tags добавляются ко всем метрикам для идентификации приложения (программный вариант — ниже)
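Те же общие теги можно задать и программно — набросок с MeterRegistryCustomizer (предполагается Spring Boot Actuator + Micrometer; имя сервиса условное):
import io.micrometer.core.instrument.MeterRegistry;
import org.springframework.boot.actuate.autoconfigure.metrics.MeterRegistryCustomizer;
import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;

@Configuration
public class MetricsConfig {
    // Общие теги добавляются ко всем метрикам, регистрируемым в этом registry
    @Bean
    public MeterRegistryCustomizer<MeterRegistry> commonTags() {
        return registry -> registry.config().commonTags("application", "user-service");
    }
}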
Базовые метрики из коробки
Spring Boot Actuator автоматически предоставляет:
- HTTP metrics: http_server_requests_* — latency, throughput, errors
- JVM metrics: jvm_memory_*, jvm_gc_*, jvm_threads_*
- System metrics: system_cpu_*, process_*
- Database metrics: hikaricp_* для connection pool
- Custom application metrics: через Micrometer API
Custom метрики с Micrometer
Counter — счетчики событий
@RestController
public class OrderController {
    private final MeterRegistry meterRegistry;

    public OrderController(MeterRegistry meterRegistry) {
        this.meterRegistry = meterRegistry;
    }

    @PostMapping("/orders")
    public Order createOrder(@RequestBody CreateOrderRequest request) {
        try {
            Order order = orderService.createOrder(request);
            // Counter.increment() теги не принимает: counter с динамическими тегами
            // получаем через builder (повторная регистрация вернёт тот же meter)
            Counter.builder("orders.created")
                    .description("Number of orders created")
                    .tag("status", "success")
                    .tag("user_type", getUserType(request.getUserId()))
                    .register(meterRegistry)
                    .increment();
            return order;
        } catch (Exception e) {
            Counter.builder("orders.errors")
                    .description("Number of order creation errors")
                    .tag("error_type", e.getClass().getSimpleName())
                    .register(meterRegistry)
                    .increment();
            throw e;
        }
    }
}
Gauge — текущие значения
@Component
public class SystemMetrics {
    private final AtomicInteger activeUsers = new AtomicInteger(0);
    private final Queue<String> pendingJobs = new ConcurrentLinkedQueue<>();

    public SystemMetrics(MeterRegistry meterRegistry) {
        // Gauge для активных пользователей: builder принимает объект и функцию-значение,
        // а register() — только реестр
        Gauge.builder("users.active", activeUsers, AtomicInteger::get)
                .description("Number of currently active users")
                .register(meterRegistry);
        // Gauge для размера очереди
        Gauge.builder("jobs.pending", pendingJobs, Queue::size)
                .description("Number of pending background jobs")
                .register(meterRegistry);
        // Gauge для custom вычислений
        Gauge.builder("memory.usage.percentage", this, SystemMetrics::getMemoryUsagePercentage)
                .description("Memory usage percentage")
                .register(meterRegistry);
    }

    private double getMemoryUsagePercentage() {
        Runtime runtime = Runtime.getRuntime();
        long totalMemory = runtime.totalMemory();
        long freeMemory = runtime.freeMemory();
        return ((double) (totalMemory - freeMemory) / totalMemory) * 100;
    }
}
Timer — измерение времени выполнения
@Service
public class PaymentService {
    private final MeterRegistry meterRegistry;
    private final Timer paymentProcessingTimer;

    public PaymentService(MeterRegistry meterRegistry) {
        this.meterRegistry = meterRegistry;
        this.paymentProcessingTimer = Timer.builder("payment.processing.duration")
                .description("Payment processing time")
                .publishPercentiles(0.5, 0.95, 0.99) // медиана, 95-й и 99-й процентили
                .register(meterRegistry);
    }

    public PaymentResult processPayment(PaymentRequest request) {
        // Timer.record(Supplier) измеряет время выполнения и возвращает результат
        return paymentProcessingTimer.record(() -> externalPaymentService.process(request));
    }

    // Альтернативный способ: Timer.Sample + теги, зависящие от результата
    public PaymentResult processPaymentAlternative(PaymentRequest request) {
        Timer.Sample sample = Timer.start(meterRegistry);
        try {
            PaymentResult result = externalPaymentService.process(request);
            sample.stop(Timer.builder("payment.external.duration")
                    .tag("provider", request.getProvider())
                    .tag("status", result.getStatus())
                    .register(meterRegistry));
            return result;
        } catch (Exception e) {
            sample.stop(Timer.builder("payment.external.duration")
                    .tag("provider", request.getProvider())
                    .tag("status", "error")
                    .register(meterRegistry));
            throw e;
        }
    }
}
DistributionSummary — распределение значений
@Component
public class BusinessMetrics {
    private final MeterRegistry meterRegistry;

    public BusinessMetrics(MeterRegistry meterRegistry) {
        this.meterRegistry = meterRegistry;
    }

    public void recordOrderAmount(double amount, String category) {
        // record() теги не принимает: summary с нужным тегом получаем через builder
        // (повторная регистрация с теми же именем и тегами вернёт уже существующий meter)
        DistributionSummary.builder("order.amount")
                .description("Distribution of order amounts")
                .baseUnit("USD")
                .publishPercentiles(0.5, 0.75, 0.95, 0.99)
                .tag("category", category)
                .register(meterRegistry)
                .record(amount);
    }
}
Пояснение: DistributionSummary подходит для анализа распределения business metrics — размеры заказов, количество items, etc.
Prometheus конфигурация
prometheus.yml
global:
scrape_interval: 15s
evaluation_interval: 15s
rule_files:
- "rules/*.yml"
alerting:
alertmanagers:
- static_configs:
- targets:
- alertmanager:9093
scrape_configs:
- job_name: 'prometheus'
static_configs:
- targets: ['localhost:9090']
- job_name: 'spring-boot-apps'
metrics_path: '/actuator/prometheus'
static_configs:
- targets:
- 'user-service:8080'
- 'order-service:8081'
- 'payment-service:8082'
scrape_interval: 10s
scrape_timeout: 5s
- job_name: 'kubernetes-pods'
kubernetes_sd_configs:
- role: pod
relabel_configs:
- source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
action: keep
regex: true
- source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_path]
action: replace
target_label: __metrics_path__
regex: (.+)
Пояснение:
- scrape_interval — как часто Prometheus собирает метрики
- kubernetes_sd_configs — автоматическое обнаружение pods в Kubernetes
- relabel_configs — правила для filtering и modification targets
Service Discovery в Kubernetes
# kubernetes deployment с annotations
apiVersion: apps/v1
kind: Deployment
metadata:
name: user-service
spec:
template:
metadata:
annotations:
prometheus.io/scrape: "true"
prometheus.io/port: "8080"
prometheus.io/path: "/actuator/prometheus"
spec:
containers:
- name: user-service
image: user-service:latest
ports:
- containerPort: 8080
PromQL (Prometheus Query Language)
Основные операторы
Селекторы метрик:
# Все значения метрики
http_requests_total
# Фильтрация по labels
http_requests_total{method="GET"}
http_requests_total{method!="GET"} # НЕ GET
http_requests_total{status=~"2.."} # regex: статусы 2xx
Range queries (временные диапазоны):
# Значения за последние 5 минут
http_requests_total[5m]
# Rate of change за 5 минут
rate(http_requests_total[5m])
# Increase за час
increase(http_requests_total[1h])
Полезные функции
rate() — скорость изменения per second:
# Request rate per second
rate(http_requests_total[5m])
# Error rate
rate(http_requests_total{status=~"5.."}[5m])
histogram_quantile() — вычисление процентилей:
# 95й процентиль latency
histogram_quantile(0.95, rate(http_request_duration_seconds_bucket[5m]))
# 99й процентиль за разные временные окна
histogram_quantile(0.99, rate(http_request_duration_seconds_bucket[1m]))
Aggregation operations:
# Сумма по всем instances
sum(rate(http_requests_total[5m]))
# Сумма по service
sum by (service) (rate(http_requests_total[5m]))
# Среднее время отклика
avg(http_request_duration_seconds)
# Максимальное использование памяти
max(jvm_memory_used_bytes) by (application)
Практические примеры запросов
Error Rate (процент ошибок):
sum(rate(http_requests_total{status=~"5.."}[5m])) /
sum(rate(http_requests_total[5m])) * 100
Apdex Score (Application Performance Index):
(
  sum(rate(http_request_duration_seconds_bucket{le="0.1"}[5m])) +
  sum(rate(http_request_duration_seconds_bucket{le="0.4"}[5m]))
) / 2 / sum(rate(http_request_duration_seconds_count[5m]))
Memory Usage (процент использования heap):
jvm_memory_used_bytes{area="heap"} /
jvm_memory_max_bytes{area="heap"} * 100
Grafana: Визуализация и Dashboards
Что такое Grafana
Grafana — это open-source платформа для visualization и analytics. Позволяет создавать interactive dashboards для мониторинга метрик из различных data sources.
Ключевые возможности:
- Multi-datasource support — Prometheus, InfluxDB, Elasticsearch, etc.
- Rich visualization options — graphs, tables, heatmaps, alerts
- Dashboard templating — переменные для dynamic dashboards
- Alerting — уведомления при превышении thresholds
- User management — роли и permissions
Подключение Prometheus как Data Source
{
"name": "Prometheus",
"type": "prometheus",
"url": "http://prometheus:9090",
"access": "proxy",
"isDefault": true,
"jsonData": {
"httpMethod": "POST",
"timeInterval": "15s"
}
}
Основные типы панелей
Time Series Panel — для отображения метрик во времени:
- CPU usage, memory consumption
- Request rate, error rate
- Response time trends
Stat Panel — для single value metrics:
- Current active users
- Total orders today
- System uptime
Table Panel — для tabular data:
- Top error endpoints
- Service health status
- Resource usage by service
Heatmap Panel — для distribution analysis:
- Response time distribution
- Load patterns over time
Создание Dashboard для Java приложения
Application Overview Dashboard
{
"dashboard": {
"title": "Java Application Overview",
"panels": [
{
"title": "Request Rate",
"type": "timeseries",
"targets": [
{
"expr": "sum(rate(http_requests_total[5m])) by (application)",
"legendFormat": "{{application}}"
}
]
},
{
"title": "Error Rate %",
"type": "timeseries",
"targets": [
{
"expr": "sum(rate(http_requests_total{status=~\"5..\"}[5m])) by (application) / sum(rate(http_requests_total[5m])) by (application) * 100",
"legendFormat": "{{application}}"
}
],
"fieldConfig": {
"defaults": {
"unit": "percent",
"color": {"mode": "palette-classic"}
}
}
},
{
"title": "Response Time 95th Percentile",
"type": "timeseries",
"targets": [
{
"expr": "histogram_quantile(0.95, sum(rate(http_request_duration_seconds_bucket[5m])) by (le, application))",
"legendFormat": "{{application}}"
}
]
}
]
}
}
JVM Metrics Dashboard
Memory Usage Panel:
# Heap memory usage
jvm_memory_used_bytes{area="heap"} / jvm_memory_max_bytes{area="heap"} * 100
# Non-heap memory usage
jvm_memory_used_bytes{area="nonheap"} / jvm_memory_max_bytes{area="nonheap"} * 100
# Garbage collection rate
rate(jvm_gc_collection_seconds_count[5m])
Thread Metrics Panel:
# Active threads
jvm_threads_live_threads
# Daemon threads
jvm_threads_daemon_threads
# Peak threads
jvm_threads_peak_threads
Business Metrics Dashboard
{
"panels": [
{
"title": "Orders per Minute",
"targets": [
{
"expr": "sum(rate(orders_created[1m])) * 60",
"legendFormat": "Orders/min"
}
]
},
{
"title": "Revenue per Hour",
"targets": [
{
"expr": "sum(rate(order_amount_sum[1h])) * 3600",
"legendFormat": "Revenue/hour"
}
]
},
{
"title": "Active Users",
"type": "stat",
"targets": [
{
"expr": "users_active",
"legendFormat": "Active Users"
}
]
}
]
}
Dashboard Variables (Templating)
{
"templating": {
"list": [
{
"name": "application",
"type": "query",
"query": "label_values(http_requests_total, application)",
"multi": true,
"includeAll": true
},
{
"name": "environment",
"type": "query",
"query": "label_values(http_requests_total, environment)",
"multi": false
},
{
"name": "time_range",
"type": "interval",
"options": ["1m", "5m", "15m", "30m", "1h"]
}
]
}
}
Использование переменных в запросах:
sum(rate(http_requests_total{application=~"$application", environment="$environment"}[$time_range]))
Пояснение: Variables делают dashboards переиспользуемыми для разных приложений и окружений.
Alerting
Prometheus Alert Rules
# alerts.yml
groups:
- name: application_alerts
rules:
- alert: HighErrorRate
expr: |
sum(rate(http_requests_total{status=~"5.."}[5m])) by (application) /
sum(rate(http_requests_total[5m])) by (application) * 100 > 5
for: 2m
labels:
severity: warning
team: backend
annotations:
summary: "High error rate detected"
description: "Application {{ $labels.application }} has error rate of {{ $value }}%"
- alert: HighLatency
expr: |
histogram_quantile(0.95,
sum(rate(http_request_duration_seconds_bucket[5m])) by (le, application)
) > 0.5
for: 5m
labels:
severity: warning
annotations:
summary: "High latency detected"
description: "95th percentile latency is {{ $value }}s for {{ $labels.application }}"
- alert: ServiceDown
expr: up == 0
for: 1m
labels:
severity: critical
annotations:
summary: "Service is down"
description: "{{ $labels.instance }} has been down for more than 1 minute"
AlertManager конфигурация
# alertmanager.yml
global:
smtp_smarthost: 'localhost:587'
smtp_from: 'alerts@company.com'
route:
group_by: ['alertname', 'cluster', 'service']
group_wait: 10s
group_interval: 10s
repeat_interval: 1h
receiver: 'default'
routes:
- match:
severity: critical
receiver: 'critical-alerts'
- match:
team: backend
receiver: 'backend-team'
receivers:
- name: 'default'
email_configs:
- to: 'team@company.com'
subject: 'Alert: {{ .GroupLabels.alertname }}'
body: |
{{ range .Alerts }}
Alert: {{ .Annotations.summary }}
Description: {{ .Annotations.description }}
{{ end }}
- name: 'critical-alerts'
slack_configs:
- api_url: 'https://hooks.slack.com/services/...'
channel: '#alerts-critical'
title: 'Critical Alert'
text: '{{ .CommonAnnotations.summary }}'
- name: 'backend-team'
email_configs:
- to: 'backend-team@company.com'
pagerduty_configs:
- service_key: 'your-pagerduty-key'
Grafana Alerts
{
"alert": {
"name": "High Memory Usage",
"conditions": [
{
"query": {
"queryType": "",
"refId": "A",
"expr": "jvm_memory_used_bytes{area=\"heap\"} / jvm_memory_max_bytes{area=\"heap\"} * 100"
},
"reducer": {
"type": "last",
"params": []
},
"evaluator": {
"type": "gt",
"params": [85]
}
}
],
"frequency": "10s",
"handler": 1,
"noDataState": "no_data",
"executionErrorState": "alerting"
}
}
Пояснение: Prometheus alerts лучше для infrastructure metrics, Grafana alerts — для complex business logic и visualization-based alerts.
Advanced Monitoring Patterns
SLI/SLO мониторинг
Service Level Indicators (SLI) — метрики качества сервиса:
# Availability SLI
sum(rate(http_requests_total{status!~"5.."}[5m])) /
sum(rate(http_requests_total[5m]))
# Latency SLI
histogram_quantile(0.99, sum(rate(http_request_duration_seconds_bucket[5m])) by (le))
# Error Budget: доля израсходованного бюджета ошибок (error budget burn)
(1 - sli_availability) / (1 - slo_target)
RED Method (Rate, Errors, Duration)
# Rate - requests per second
sum(rate(http_requests_total[5m]))
# Errors - error percentage
sum(rate(http_requests_total{status=~"5.."}[5m])) /
sum(rate(http_requests_total[5m])) * 100
# Duration - response time percentiles
histogram_quantile(0.50, sum(rate(http_request_duration_seconds_bucket[5m])) by (le))
histogram_quantile(0.95, sum(rate(http_request_duration_seconds_bucket[5m])) by (le))
histogram_quantile(0.99, sum(rate(http_request_duration_seconds_bucket[5m])) by (le))
USE Method (Utilization, Saturation, Errors)
# CPU Utilization
100 - (avg(irate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)
# Memory Utilization
(1 - (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes)) * 100
# Disk Utilization
100 - (node_filesystem_avail_bytes / node_filesystem_size_bytes * 100)
# Network Saturation
rate(node_network_transmit_drop_total[5m])
Performance Optimization
Efficient Metrics Collection
@Component
public class OptimizedMetrics {
// Переиспользование метрик вместо создания новых
private static final Counter REQUEST_COUNTER =
Metrics.counter("http.requests", "endpoint", "unknown");
// Ограничение cardinality для избежания memory leaks
private final Map<String, Counter> endpointCounters = new ConcurrentHashMap<>();
public void recordRequest(String endpoint) {
// Ограничиваем количество unique endpoints
if (endpointCounters.size() > 100) {
endpoint = "other";
}
endpointCounters.computeIfAbsent(endpoint,
e -> Metrics.counter("http.requests", "endpoint", e))
.increment();
}
}
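Кардинальность можно ограничивать и декларативно через MeterFilter — набросок (предполагается Micrometer; лимит в 100 значений тега условный):
import io.micrometer.core.instrument.MeterRegistry;
import io.micrometer.core.instrument.config.MeterFilter;

public class CardinalityLimits {
    // Не более 100 уникальных значений тега "endpoint" у метрик с префиксом "http.requests";
    // всё, что сверх лимита, отбрасывается фильтром deny()
    public static void apply(MeterRegistry registry) {
        registry.config().meterFilter(
                MeterFilter.maximumAllowableTags("http.requests", "endpoint", 100, MeterFilter.deny()));
    }
}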
Prometheus Configuration Tuning
# prometheus.yml - optimization
global:
scrape_interval: 15s
scrape_timeout: 10s
evaluation_interval: 15s
external_labels:
cluster: 'production'
region: 'us-east-1'
# Retention, storage и GOMAXPROCS задаются не в prometheus.yml, а флагами запуска:
#   --storage.tsdb.retention.time=30d
#   --storage.tsdb.retention.size=100GB
#   --storage.tsdb.min-block-duration=2h
#   --storage.tsdb.max-block-duration=25h
#   --enable-feature=auto-gomaxprocs
scrape_configs:
- job_name: 'high-frequency'
scrape_interval: 5s
static_configs:
- targets: ['critical-service:8080']
- job_name: 'low-frequency'
scrape_interval: 60s
static_configs:
- targets: ['batch-service:8080']
Grafana Performance
{
"refresh": "30s",
"time": {
"from": "now-1h",
"to": "now"
},
"panels": [
{
"targets": [
{
"expr": "avg_over_time(metric[5m])",
"interval": "30s",
"maxDataPoints": 100
}
]
}
]
}
Пояснение:
- Ограничивайте maxDataPoints для лучшей производительности
- Используйте appropriate interval для aggregation
- Избегайте слишком больших временных диапазонов
Monitoring в Production
High Availability Setup
Prometheus HA:
# prometheus-1.yml
global:
external_labels:
replica: 'prometheus-1'
# prometheus-2.yml
global:
external_labels:
replica: 'prometheus-2'
Grafana HA:
# grafana.ini
[database]
type = mysql
host = mysql-cluster:3306
name = grafana
user = grafana
password = ${GRAFANA_DB_PASSWORD}
[session]
provider = mysql
provider_config = grafana:${GRAFANA_DB_PASSWORD}@tcp(mysql-cluster:3306)/grafana
[server]
root_url = https://grafana.company.com
Security Best Practices
# prometheus.yml - security
global:
external_labels:
cluster: 'production'
scrape_configs:
- job_name: 'secure-app'
scheme: https
tls_config:
ca_file: /etc/prometheus/ca.crt
cert_file: /etc/prometheus/client.crt
key_file: /etc/prometheus/client.key
basic_auth:
username: prometheus
password_file: /etc/prometheus/password
Resource Planning
Prometheus Storage Requirements:
Samples/sec = Number of series × Scrape frequency
Storage/day ≈ Samples/sec × 86400 × 16 bytes (несжатый размер сэмпла; после сжатия обычно 1–2 байта на сэмпл)
Example (верхняя оценка без учёта сжатия):
10,000 series × 1/15s × 86400 × 16 bytes ≈ 900MB/day
Memory Requirements:
RAM = Number of series × 6KB (rule of thumb)
Example: 100,000 series = ~600MB RAM minimum
Troubleshooting
Common Issues
High Cardinality Problems:
// BAD: Unbounded cardinality
Metrics.counter("user.actions", "user_id", userId); // Millions of users!
// GOOD: Bounded cardinality
Metrics.counter("user.actions", "user_type", getUserType(userId)); // Few types
Missing Metrics:
# Check if endpoint is accessible
curl http://app:8080/actuator/prometheus
# Verify Prometheus targets
curl http://prometheus:9090/api/v1/targets
# Check for scrape errors
curl http://prometheus:9090/api/v1/query?query=up
# Debug PromQL queries
curl -G http://prometheus:9090/api/v1/query \
--data-urlencode 'query=rate(http_requests_total[5m])'
Performance Issues:
# Check Prometheus performance
prometheus_tsdb_compactions_total
prometheus_config_last_reload_success_timestamp_seconds
prometheus_rule_evaluation_duration_seconds
# Check scrape duration
scrape_duration_seconds > 0.1
# Identify slow queries
topk(10, increase(prometheus_engine_query_duration_seconds_count[1h]))
Debug Dashboard
{
"dashboard": {
"title": "Prometheus Debug",
"panels": [
{
"title": "Scrape Targets Status",
"type": "table",
"targets": [
{
"expr": "up",
"format": "table",
"instant": true
}
]
},
{
"title": "Scrape Duration",
"type": "timeseries",
"targets": [
{
"expr": "scrape_duration_seconds",
"legendFormat": "{{instance}}"
}
]
},
{
"title": "Series Count",
"type": "timeseries",
"targets": [
{
"expr": "prometheus_tsdb_symbol_table_size_bytes",
"legendFormat": "Series count"
}
]
}
]
}
}
Best Practices
1. Metric Naming Conventions
// Следуйте Prometheus naming conventions
public class MetricNamingBestPractices {
// ✅ Good: Clear, descriptive names
private final Counter HTTP_REQUESTS_TOTAL =
Metrics.counter("http_requests_total");
private final AtomicLong MEMORY_USAGE_BYTES =
Metrics.gauge("memory_usage_bytes", new AtomicLong()); // gauge требует объект-источник значения
private final Timer REQUEST_DURATION_SECONDS =
Metrics.timer("request_duration_seconds");
// ❌ Bad: Unclear names
private final Counter REQ_COUNT = Metrics.counter("req_count");
private final AtomicLong MEM = Metrics.gauge("mem", new AtomicLong());
// ✅ Good: Consistent units
private final Counter BYTES_SENT_TOTAL =
Metrics.counter("bytes_sent_total");
private final Timer DATABASE_QUERY_DURATION_SECONDS =
Metrics.timer("database_query_duration_seconds");
// ❌ Bad: Mixed units
private final Timer RESPONSE_TIME_MS =
Metrics.timer("response_time_ms"); // Should be seconds
}
2. Label Strategy
@Component
public class LabelingBestPractices {
// ✅ Good: Low cardinality labels
public void recordRequest(String method, String endpoint, int status) {
Metrics.counter("http_requests_total",
"method", method, // GET, POST, PUT (low cardinality)
"endpoint", normalizeEndpoint(endpoint), // /api/users/{id} (normalized)
"status_class", getStatusClass(status) // 2xx, 4xx, 5xx (low cardinality)
).increment();
}
// ❌ Bad: High cardinality labels
public void recordRequestBad(String userId, String sessionId, String requestId) {
Metrics.counter("requests_total",
"user_id", userId, // Millions of users!
"session_id", sessionId, // Millions of sessions!
"request_id", requestId // Every request unique!
).increment();
}
private String normalizeEndpoint(String endpoint) {
// /api/users/123 -> /api/users/{id}
return endpoint.replaceAll("/\\d+", "/{id}")
.replaceAll("/[a-f0-9-]{36}", "/{uuid}");
}
private String getStatusClass(int status) {
return status / 100 + "xx";
}
}
3. Error Monitoring Strategy
@Component
public class ErrorMonitoring {
    private final MeterRegistry registry;
    private final Timer errorResolutionTime;

    public ErrorMonitoring(MeterRegistry registry) {
        this.registry = registry;
        this.errorResolutionTime = Timer.builder("error_resolution_duration_seconds")
                .description("Time to resolve errors")
                .register(registry);
    }

    public void recordError(Exception e, String operation, String severity) {
        // Counter.increment() теги не принимает — counter с тегами получаем через builder
        Counter.builder("application_errors_total")
                .description("Total number of application errors")
                .tag("error_type", e.getClass().getSimpleName())
                .tag("operation", operation)
                .tag("severity", severity)
                .tag("recoverable", String.valueOf(isRecoverable(e)))
                .register(registry)
                .increment();
        // Structured logging для correlation с metrics
        log.error("Application error occurred",
                kv("operation", operation),
                kv("error_type", e.getClass().getSimpleName()),
                kv("severity", severity),
                kv("message", e.getMessage()),
                e);
    }

    private boolean isRecoverable(Throwable t) {
        // параметр Throwable: instanceof c Error для типа Exception не скомпилируется
        return !(t instanceof OutOfMemoryError ||
                 t instanceof StackOverflowError);
    }
}
4. Business Metrics Integration
@Service
public class BusinessMetricsService {
    private final MeterRegistry registry;

    public BusinessMetricsService(MeterRegistry registry, InventoryService inventoryService) {
        this.registry = registry;
        // Gauge привязывается к объекту и функции, возвращающей текущее значение
        Gauge.builder("inventory_items", inventoryService, InventoryService::getTotalItems)
                .description("Current inventory level")
                .register(registry);
    }

    @EventListener
    public void handleOrderCreated(OrderCreatedEvent event) {
        // Counter и summary с динамическими тегами получаем через builder:
        // increment()/record() сами по себе теги не принимают
        Counter.builder("orders_total")
                .description("Total number of orders")
                .tag("product_category", event.getProductCategory())
                .tag("customer_segment", event.getCustomerSegment())
                .tag("payment_method", event.getPaymentMethod())
                .register(registry)
                .increment();
        DistributionSummary.builder("revenue_usd")
                .description("Revenue in USD")
                .baseUnit("USD")
                .tag("currency", event.getCurrency())
                .register(registry)
                .record(event.getOrderAmount().doubleValue());
    }

    @EventListener
    public void handleOrderCancelled(OrderCancelledEvent event) {
        // Отдельная метрика для cancellations
        Metrics.counter("orders_cancelled_total",
                "reason", event.getCancellationReason()
        ).increment();
    }
}
Deployment и DevOps
Docker Compose Setup
# docker-compose.yml
version: '3.8'
services:
prometheus:
image: prom/prometheus:latest
container_name: prometheus
ports:
- "9090:9090"
volumes:
- ./prometheus.yml:/etc/prometheus/prometheus.yml
- ./alerts.yml:/etc/prometheus/alerts.yml
- prometheus_data:/prometheus
command:
- '--config.file=/etc/prometheus/prometheus.yml'
- '--storage.tsdb.path=/prometheus'
- '--web.console.libraries=/etc/prometheus/console_libraries'
- '--web.console.templates=/etc/prometheus/consoles'
- '--storage.tsdb.retention.time=30d'
- '--web.enable-lifecycle'
- '--web.enable-admin-api'
grafana:
image: grafana/grafana:latest
container_name: grafana
ports:
- "3000:3000"
environment:
- GF_SECURITY_ADMIN_PASSWORD=admin123
- GF_INSTALL_PLUGINS=grafana-piechart-panel
volumes:
- grafana_data:/var/lib/grafana
- ./grafana/dashboards:/etc/grafana/provisioning/dashboards
- ./grafana/datasources:/etc/grafana/provisioning/datasources
alertmanager:
image: prom/alertmanager:latest
container_name: alertmanager
ports:
- "9093:9093"
volumes:
- ./alertmanager.yml:/etc/alertmanager/alertmanager.yml
app:
image: spring-boot-app:latest
ports:
- "8080:8080"
environment:
- SPRING_PROFILES_ACTIVE=docker
depends_on:
- prometheus
volumes:
prometheus_data:
grafana_data:
Kubernetes Deployment
# prometheus-deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
name: prometheus
namespace: monitoring
spec:
replicas: 1
selector:
matchLabels:
app: prometheus
template:
metadata:
labels:
app: prometheus
spec:
containers:
- name: prometheus
image: prom/prometheus:latest
ports:
- containerPort: 9090
volumeMounts:
- name: config
mountPath: /etc/prometheus
- name: storage
mountPath: /prometheus
args:
- '--config.file=/etc/prometheus/prometheus.yml'
- '--storage.tsdb.path=/prometheus'
- '--storage.tsdb.retention.time=30d'
volumes:
- name: config
configMap:
name: prometheus-config
- name: storage
persistentVolumeClaim:
claimName: prometheus-storage
---
apiVersion: v1
kind: ConfigMap
metadata:
name: prometheus-config
namespace: monitoring
data:
prometheus.yml: |
global:
scrape_interval: 15s
scrape_configs:
- job_name: 'kubernetes-pods'
kubernetes_sd_configs:
- role: pod
relabel_configs:
- source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
action: keep
regex: true
CI/CD Integration
# .github/workflows/monitoring.yml
name: Deploy Monitoring
on:
push:
branches: [main]
paths: ['monitoring/**']
jobs:
deploy:
runs-on: ubuntu-latest
steps:
- uses: actions/checkout@v2
- name: Validate Prometheus Config
run: |
docker run --rm -v $(pwd)/monitoring:/workspace \
prom/prometheus:latest promtool check config /workspace/prometheus.yml
- name: Validate Alert Rules
run: |
docker run --rm -v $(pwd)/monitoring:/workspace \
prom/prometheus:latest promtool check rules /workspace/alerts.yml
- name: Deploy to Kubernetes
run: |
kubectl apply -f monitoring/k8s/
kubectl rollout status deployment/prometheus -n monitoring
Advanced Features
Recording Rules
# recording-rules.yml
groups:
- name: application_rules
interval: 30s
rules:
# Pre-calculate expensive queries
- record: job:http_requests:rate5m
expr: sum(rate(http_requests_total[5m])) by (job)
- record: job:http_request_duration:p99
expr: histogram_quantile(0.99, sum(rate(http_request_duration_seconds_bucket[5m])) by (job, le))
# Business metrics aggregations
- record: business:orders_per_minute
expr: sum(rate(orders_created_total[1m])) * 60
- record: business:revenue_per_hour
expr: sum(rate(revenue_usd_sum[1h])) * 3600
Federation
# Global Prometheus config
scrape_configs:
- job_name: 'federate'
scrape_interval: 15s
honor_labels: true
metrics_path: '/federate'
params:
'match[]':
- '{job=~"prometheus"}'
- '{__name__=~"job:.*"}'
static_configs:
- targets:
- 'prometheus-us-east:9090'
- 'prometheus-eu-west:9090'
- 'prometheus-ap-south:9090'
Custom Exporters
@Component
public class CustomDatabaseExporter {
private final DataSource dataSource;
private final CollectorRegistry registry;
public CustomDatabaseExporter(DataSource dataSource, CollectorRegistry registry) {
this.dataSource = dataSource;
this.registry = registry;
// Register custom collector
new DatabaseMetricsCollector(dataSource).register(registry);
}
private static class DatabaseMetricsCollector extends Collector {
private final DataSource dataSource;
public DatabaseMetricsCollector(DataSource dataSource) {
this.dataSource = dataSource;
}
@Override
public List<MetricFamilySamples> collect() {
List<MetricFamilySamples> samples = new ArrayList<>();
try (Connection conn = dataSource.getConnection()) {
// Query active connections
ResultSet rs = conn.createStatement().executeQuery(
"SELECT count(*) as active_connections FROM pg_stat_activity WHERE state = 'active'"
);
if (rs.next()) {
samples.add(new MetricFamilySamples(
"database_active_connections",
Type.GAUGE,
"Number of active database connections",
Arrays.asList(new MetricFamilySamples.Sample(
"database_active_connections",
Arrays.asList(),
Arrays.asList(),
rs.getDouble("active_connections")
))
));
}
} catch (SQLException e) {
// Handle error
}
return samples;
}
}
}
Cost Optimization
Storage Management
# Prometheus storage optimization
global:
scrape_interval: 30s # Увеличить для non-critical metrics
scrape_configs:
# Critical services - high frequency
- job_name: 'critical-apps'
scrape_interval: 15s
static_configs:
- targets: ['payment-service:8080', 'user-service:8080']
# Non-critical services - low frequency
- job_name: 'batch-jobs'
scrape_interval: 60s
static_configs:
- targets: ['reporting-service:8080']
# Retention policies — задаются флагами запуска Prometheus, а не в prometheus.yml:
#   --storage.tsdb.retention.time=15d   (по умолчанию тоже 15d; уменьшите для экономии)
#   --storage.tsdb.retention.size=50GB  (ограничение размера хранилища)
Metric Filtering
# Drop unnecessary metrics
metric_relabel_configs:
# Drop detailed JVM metrics in production
- source_labels: [__name__]
regex: 'jvm_gc_collection_seconds_.*'
action: drop
# Drop high-cardinality HTTP metrics
- source_labels: [__name__, uri]
regex: 'http_request_duration_seconds_bucket;/api/users/[0-9]+'
action: drop
# Keep only important percentiles
- source_labels: [__name__, quantile]
regex: 'http_request_duration_seconds;0\.(5|95|99)'
action: keep
Efficient Dashboards
{
"dashboard": {
"title": "Optimized Dashboard",
"refresh": "1m",
"panels": [
{
"title": "Request Rate",
"targets": [
{
"expr": "job:http_requests:rate5m",
"interval": "30s",
"maxDataPoints": 200
}
]
}
]
}
}
Пояснение: Используйте recording rules для pre-calculation дорогих запросов, ограничивайте maxDataPoints и увеличивайте interval для better performance.
Заключение
Ключевые принципы эффективного мониторинга
- Start with the basics — HTTP метрики, JVM metrics, error rates
- Follow naming conventions — consistent metric names и labels
- Control cardinality — избегайте high-cardinality labels
- Monitor what matters — focus на business impact
- Automate alerting — proactive vs reactive monitoring
- Document everything — runbooks для alerts и dashboards
Эволюция monitoring stack
Phase 1: Basic monitoring
- Spring Boot Actuator + Prometheus
- Basic dashboards в Grafana
- Simple alerts на infrastructure metrics
Phase 2: Advanced observability
- Custom business metrics
- SLI/SLO tracking
- Distributed tracing integration
- Advanced alerting rules
Phase 3: Enterprise scale
- Multi-region federation
- Cost optimization
- Custom exporters
- Integration с incident management
Выбор между инструментами
Prometheus vs. Alternatives:
- Prometheus: Лучший выбор для cloud-native applications
- InfluxDB: Если нужны более advanced time series features
- DataDog/New Relic: Managed solutions с меньшими operational overhead
Grafana vs. Alternatives:
- Grafana: Де-факто стандарт для visualization
- Prometheus UI: Достаточно для basic queries и debugging
- Kibana: Если уже используете Elastic Stack
Главное правило
Monitor for actionability — каждая метрика и alert должны приводить к конкретным действиям. Если метрика не помогает в troubleshooting или decision making, она только добавляет noise.
Start simple, evolve gradually — начинайте с basic metrics и добавляйте complexity по мере роста понимания вашей системы.
OpenTelemetry
Основные концепции
OpenTelemetry (OTel) — единый стандарт для сбора телеметрии (трейсов, метрик, логов) из приложений. Состоит из спецификации, SDK и инструментов.
Observability — способность понимать внутреннее состояние системы по её внешним выходам. Включает три столпа:
- Traces — путь запроса через систему
- Metrics — числовые измерения во времени
- Logs — структурированные записи событий
Ключевые термины
Trace — полная картина одного запроса через распределённую систему
Span — единица работы в трейсе (операция, вызов метода, HTTP-запрос)
Context — метаданные, передаваемые между сервисами
Instrumentation — код для сбора телеметрии
Exporter — компонент для отправки данных в бэкенды
Collector — прокси для приёма, обработки и маршрутизации телеметрии
Архитектура трейсинга
Client → Service A → Service B → Database
| | | |
+--------+-- Trace (единый ID) --+
| | | |
Span1 Span2 Span3 Span4
Каждый span содержит:
- Trace ID — уникальный идентификатор всего трейса
- Span ID — идентификатор конкретного span
- Parent Span ID — ссылка на родительский span
- Timestamps — время начала и окончания
- Attributes — метаданные (теги)
- Events — события внутри span
- Status — успех/ошибка
Настройка в Java
Зависимости Maven
<dependency>
<groupId>io.opentelemetry</groupId>
<artifactId>opentelemetry-bom</artifactId>
<version>1.32.0</version>
<type>pom</type>
<scope>import</scope>
</dependency>
<dependency>
<groupId>io.opentelemetry</groupId>
<artifactId>opentelemetry-api</artifactId>
</dependency>
<dependency>
<groupId>io.opentelemetry</groupId>
<artifactId>opentelemetry-sdk</artifactId>
</dependency>
Инициализация SDK
// Создание OpenTelemetry SDK
OpenTelemetry openTelemetry = OpenTelemetrySdk.builder()
.setTracerProvider(
SdkTracerProvider.builder()
.addSpanProcessor(BatchSpanProcessor.builder(
OtlpGrpcSpanExporter.builder()
.setEndpoint("http://jaeger:4317") // OTLP gRPC endpoint (4317); 14250 — нативный gRPC-порт Jaeger
.build())
.build())
.setResource(Resource.getDefault()
.merge(Resource.builder()
.put(ResourceAttributes.SERVICE_NAME, "my-service")
.put(ResourceAttributes.SERVICE_VERSION, "1.0.0")
.build()))
.build())
.build();
// Получение Tracer
Tracer tracer = openTelemetry.getTracer("my-service");
Создание Spans
Ручное создание
// Создание span с автоматическим закрытием
Span span = tracer.spanBuilder("process-order")
.setSpanKind(SpanKind.INTERNAL)
.startSpan();
try (Scope scope = span.makeCurrent()) {
// Добавление атрибутов (у Span нет метода setAttributes — используем setAllAttributes/setAttribute)
span.setAllAttributes(Attributes.of(
AttributeKey.stringKey("order.id"), orderId,
AttributeKey.longKey("order.amount"), amount
));
// Выполнение бизнес-логики
processOrder(orderId);
// Добавление события
span.addEvent("order-validated",
Attributes.of(AttributeKey.stringKey("result"), "success"));
} catch (Exception e) {
// Отметка об ошибке
span.recordException(e);
span.setStatus(StatusCode.ERROR, "Order processing failed");
throw e;
} finally {
span.end();
}
Аннотации (с Spring Boot)
@WithSpan("user-service")
public User findUser(@SpanAttribute("user.id") String userId) {
return userRepository.findById(userId);
}
Контекст и распространение
Context Propagation — механизм передачи трейсинг-информации между сервисами и потоками.
// Получение текущего контекста
Context current = Context.current();
// Выполнение в другом потоке с сохранением контекста
CompletableFuture.supplyAsync(() -> {
// Этот код выполнится в контексте родительского span
return processData();
}, Context.current().wrap(executor));
// HTTP-заголовки для передачи между сервисами
W3CTraceContextPropagator propagator = W3CTraceContextPropagator.getInstance();
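Минимальный набросок, как этим пропагатором вручную записать текущий контекст в заголовки исходящего HTTP-запроса (carrier здесь — обычная Map, имя класса условное):
import io.opentelemetry.api.trace.propagation.W3CTraceContextPropagator;
import io.opentelemetry.context.Context;
import io.opentelemetry.context.propagation.TextMapSetter;
import java.util.HashMap;
import java.util.Map;

public class TraceHeaders {
    private static final W3CTraceContextPropagator PROPAGATOR =
            W3CTraceContextPropagator.getInstance();
    // setter описывает, как положить пару ключ/значение в carrier
    private static final TextMapSetter<Map<String, String>> SETTER =
            (carrier, key, value) -> carrier.put(key, value);

    public static Map<String, String> fromCurrentContext() {
        Map<String, String> headers = new HashMap<>();
        // Запишет заголовки traceparent/tracestate из текущего Context
        PROPAGATOR.inject(Context.current(), headers, SETTER);
        return headers; // эти заголовки добавляются к исходящему HTTP-запросу
    }
}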
Автоматическая инструментация
Java Agent
# Запуск с автоматической инструментацией
java -javaagent:opentelemetry-javaagent.jar \
-Dotel.service.name=my-service \
-Dotel.exporter.otlp.endpoint=http://jaeger:4317 \
-jar myapp.jar
Автоматически инструментирует:
- HTTP-клиенты (OkHttp, Apache HttpClient)
- Веб-фреймворки (Spring Boot, Servlet API)
- Базы данных (JDBC, MongoDB, Redis)
- Messaging (Kafka, RabbitMQ)
Программная инструментация
// Инструментация HTTP-клиента (библиотека opentelemetry-okhttp-3.0;
// точное имя фабрики может отличаться между версиями)
Call.Factory tracedClient = OkHttpTelemetry.builder(openTelemetry)
    .build()
    .newCallFactory(new OkHttpClient());
Метрики
// Создание метрик
Meter meter = openTelemetry.getMeter("my-service");
// Counter - монотонно возрастающее значение
LongCounter requestCounter = meter.counterBuilder("http_requests_total")
.setDescription("Total HTTP requests")
.build();
// Histogram - распределение значений
DoubleHistogram responseTime = meter.histogramBuilder("http_request_duration")
.setDescription("HTTP request duration")
.setUnit("ms")
.build();
// Gauge - текущее значение (асинхронный, читается по callback)
ObservableDoubleGauge memoryUsage = meter.gaugeBuilder("memory_usage")
.setDescription("Current memory usage")
.buildWithCallback(measurement -> {
measurement.record(Runtime.getRuntime().totalMemory());
});
// Использование
requestCounter.add(1, Attributes.of(
AttributeKey.stringKey("method"), "GET",
AttributeKey.stringKey("endpoint"), "/api/users"
));
responseTime.record(150.0, Attributes.of(
AttributeKey.stringKey("status"), "200"
));
Интеграция с Spring Boot
Конфигурация
# application.yml
management:
tracing:
enabled: true
sampling:
probability: 1.0
otlp:
tracing:
endpoint: http://jaeger:4318/v1/traces
Кастомные метрики
@Component
public class OrderMetrics {
    private final MeterRegistry meterRegistry;
    private final Timer orderProcessingTime;

    public OrderMetrics(MeterRegistry meterRegistry) {
        this.meterRegistry = meterRegistry;
        this.orderProcessingTime = Timer.builder("order_processing_duration")
                .description("Order processing time")
                .register(meterRegistry);
    }

    public void recordOrder(String status) {
        // Counter.increment() теги не принимает — counter с тегом получаем через builder
        Counter.builder("orders_total")
                .description("Total orders processed")
                .tag("status", status)
                .register(meterRegistry)
                .increment();
    }

    public void recordProcessingTime(Duration duration) {
        orderProcessingTime.record(duration);
    }
}
Популярные бэкенды
Jaeger
- Распределённый трейсинг
- UI для анализа трейсов
- Поддержка сэмплирования
Zipkin
- Простой в настройке
- Веб-интерфейс для трейсов
- Легковесный
Prometheus + Grafana
- Prometheus — сбор метрик
- Grafana — визуализация
- Alertmanager — уведомления
Коммерческие решения
- Datadog — полнофункциональный APM
- New Relic — мониторинг производительности
- Dynatrace — AI-powered observability
Collector Configuration
# otel-collector.yml
receivers:
otlp:
protocols:
grpc:
endpoint: 0.0.0.0:4317
http:
endpoint: 0.0.0.0:4318
processors:
batch:
timeout: 1s
send_batch_size: 1024
memory_limiter:
check_interval: 1s
limit_mib: 512
exporters:
jaeger:
endpoint: jaeger:14250
tls:
insecure: true
prometheus:
endpoint: "0.0.0.0:8889"
service:
pipelines:
traces:
receivers: [otlp]
processors: [memory_limiter, batch]
exporters: [jaeger]
metrics:
receivers: [otlp]
processors: [memory_limiter, batch]
exporters: [prometheus]
Лучшие практики
Производительность
- Используйте сэмплирование для снижения overhead (пример настройки ниже)
- Настройте batch processing для экспорта
- Ограничьте количество attributes на span
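Набросок настройки head-based сэмплирования в SDK (доля 10% условная):
import io.opentelemetry.sdk.trace.SdkTracerProvider;
import io.opentelemetry.sdk.trace.samplers.Sampler;

// parentBased уважает решение вышестоящего сервиса,
// traceIdRatioBased сэмплирует ~10% новых трейсов
SdkTracerProvider tracerProvider = SdkTracerProvider.builder()
        .setSampler(Sampler.parentBased(Sampler.traceIdRatioBased(0.1)))
        .build();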
Безопасность
- Не записывайте чувствительные данные в атрибуты
- Используйте фильтрацию на уровне Collector
- Настройте TLS для передачи данных
Мониторинг
- Мониторьте сам OpenTelemetry (метрики SDK)
- Отслеживайте производительность инструментации
- Настройте алерты на критические метрики
Troubleshooting
Общие проблемы
- Spans не появляются: проверьте exporter и endpoint
- Высокий overhead: уменьшите сэмплирование
- Потеря контекста: проверьте propagation в async-коде (пример ниже)
- Большие трейсы: ограничьте глубину инструментации
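Набросок для пункта про потерю контекста: обёртка пула потоков, автоматически переносящая текущий Context в задачи (processData() — условный метод из примера выше):
import io.opentelemetry.context.Context;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

ExecutorService delegate = Executors.newFixedThreadPool(4);
// Все задачи, отправленные в tracedExecutor, выполняются в контексте вызывающего потока
ExecutorService tracedExecutor = Context.taskWrapping(delegate);

tracedExecutor.submit(() -> processData());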
Отладка
// Включение debug-логов java-агента (свойство должно быть задано до старта агента, обычно через -D)
System.setProperty("otel.javaagent.debug", "true");
// Самодиагностика
OpenTelemetry.noop(); // Отключение для тестов
Вопросы для собеседования
- Чем отличается trace от span?
- Как работает context propagation в микросервисах?
- Какие виды сэмплирования знаете?
- Как минимизировать overhead от трейсинга?
- Различия между push и pull моделями метрик?
- Как обеспечить безопасность телеметрии?
- Стратегии для high-load систем?