Organizations need a more proactive approach to log management and log analytics to rein in proliferating cloud-based data. This article outlines four log management best practices.
The growing challenge in modern IT environments is the exponential increase in log telemetry data, driven by the expansion of cloud-native, geographically distributed, container- and microservice-based architectures. Organizations need a more proactive approach to log management to tame this proliferation of cloud data. By following key log analytics and log management best practices, teams can get more business value from their data.
Challenges driving the need for log analytics and log management best practices
As organizations undergo digital transformation and adopt more cloud computing techniques, data volume is proliferating. Gartner® predicts that by 2026, 40% of log telemetry will be processed through a telemetry pipeline product, up from less than 10% in 2022.* The resulting vast increase in data volume highlights the need for more efficient data handling solutions.
As Forrester notes, “One of the biggest challenges that cloud-native environments have brought with them is exponential data growth.”** This data explosion poses a significant challenge, with the sheer volume of data generated in cloud-native environments creating cost overruns, poor application services, and the risk of runaway data volumes. Thus, organizations face the critical problem of designing and implementing effective solutions to manage this growing data deluge and its associated implications.
Without robust log management and log analytics solutions, organizations will struggle to manage log ingest and retention costs and maintain log analytics performance while the data volume explodes. Fortunately, organizations are finding best practices to handle this challenging IT environment.
The following best practices aren’t just about enhancing the overall performance of a log management system. They’re a gateway to unlocking significant cost savings for a variety of log management and log analytics use cases. Whether you’re striving for peak performance or tightening your organization’s budget, these best practices will help ensure that log analytics provide the answers for your business as efficiently and cost-effectively as possible.
1. Consolidate log management tools into a unified observability platform
As businesses increasingly shift toward software-centric models, the number of specialized IT monitoring tools used to manage cloud environments has proliferated. While this specialization has merits, it has also given rise to a significant challenge: IT tool sprawl, where organizations adopt many tools to suit different purposes. Separate systems can also silo teams and lengthen mean time to identify (MTTI) an incident. As a result, application teams often must wait for other teams to provide critical error and performance logs. Once teams have all the data, they often have to manually correlate timestamps and session IDs to examine it in context.
The first best practice is to consolidate log management with application monitoring in a single platform. When teams combine these functions using the same rich source of data, the lines between various aspects of IT management begin to blur. Application performance monitoring (APM), infrastructure monitoring, log management, and artificial intelligence for IT operations (AIOps) can all converge into a single, integrated approach.
With a unified log management and analytics platform like Dynatrace, application teams gain a holistic view of distributed traces, user sessions, code-level visibility, and logs—all within the context of the error or security vulnerability they aim to resolve. This integrated approach represents significant time savings, drastically reducing MTTI and speeding mean time to resolution (MTTR). Moreover, by applying causal AI and topological mapping, a unified observability platform includes all the necessary data in context, making troubleshooting significantly more efficient and effective.
In a unified strategy, logs are not limited to applications but encompass infrastructure, business events, and custom metrics. This comprehensive approach to log management extends the benefits of unified observability to a broader spectrum of IT concerns, ensuring that organizations can operate with agility, respond to incidents swiftly, and maintain a competitive edge.
2. Adopt a centralized observability data lakehouse for better log analytics and log management
Another problem organizations face is runaway costs and lagging query performance because of data coming from disparate sources and inefficient access control policies.
A best practice to avoid these problems is to store data in a single data lakehouse with massively parallel processing, such as Dynatrace Grail. A data lakehouse combines the structure and cost-efficiency of a data lake with the contextual and high-speed querying capabilities of a data warehouse.
With Grail, teams can group resources and capabilities in custom buckets, which provide user control over access to specific data. As a result, only authorized users can view or modify data within a designated bucket. These permissions enable organizations to implement security measures that align with their data protection and compliance requirements.
While pursuing log management best practices, teams should also be able to define fine-grained access control for individual records. This level of detail adds a layer of access security within specific buckets or across buckets. Security policies dictate access to specific buckets or records within buckets, and you can link them to cost centers, job roles, organizations, or environments.
Log management best practices for implementing bucket-level and record-level access permissions
The following are some best practices to consider when implementing bucket-level and record-level access permissions.
Establish buckets for each cost center
Establish buckets for each cost center or specific IT environment (production and staging vs. development and testing) within your organization. Buckets serve as a hard data access boundary, limiting access for individuals or organizations. This boundary is crucial for controlling query-related costs. Well-defined bucket boundaries enforce data access limits, which enables organizations to implement cost controls and cost centers.
Associate different retention periods with different buckets
Setting retention periods for different buckets gives teams with different needs more flexibility. For example, debugging data for production applications doesn’t require a lengthy retention period. Users typically need only recent data for troubleshooting, not logs from a month ago. On the other hand, data involved in an organizational audit may require retention for years. Associating varying retention periods with different buckets enhances query performance by reducing the amount of data a query must scan. Different retention periods also help control retention costs.
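As a sketch, a bucket definition pairs a bucket with a retention period. The names and exact field set below are illustrative assumptions, not taken from this article:

```json
{
  "bucketName": "prod_debug_logs",
  "table": "logs",
  "displayName": "Production debug logs",
  "retentionDays": 35
}
```

A separate audit bucket could set a retention of several years, so short-lived debug data never inflates the storage or scan costs of audit queries.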
Implement access at the record level
Record-level access provides access within a single bucket or across buckets based on the team’s role. For instance, a team involved in an audit may need access throughout multiple bucket environments, while a single application team may require access only to its application and infrastructure logs within a specific Kubernetes namespace. Record-level access enables finer-grained security policies than buckets, ensuring that users see only the data they have permission to access in query results. To take better control of your query costs, use buckets to limit the data scanned or query timeframe.
The following example sets up permissions in Grail that grant an application team access to application and infrastructure logs in a particular Kubernetes namespace and host group:
- ALLOW storage:buckets:read WHERE ...; // Ensure the user has access to all relevant buckets
- ALLOW storage:logs:read WHERE storage:k8s.namespace.name="namespace1";
- ALLOW storage:logs:read WHERE storage:dt.host_group.id STARTSWITH "shared_host_";
Set up custom logging fields
To establish custom policies within your organization’s structure, you can set up custom log fields. These fields support any value or field name an administrator defines, making them an efficient way to associate policies with buckets and record-level access for any entity in your organization. Examples of how you can use custom fields in policies include associating policies with usernames, team names, cloud regions, Amazon Web Services accounts, Azure subscriptions, or Google Cloud Platform projects.
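For illustration, a record-level policy keyed to a custom field might look like the sketch below; storage:team and the AWS account condition are hypothetical examples of administrator-defined fields, not fields documented in this article:

```
// Hypothetical custom field "storage:team" defined by an administrator
ALLOW storage:logs:read WHERE storage:team = "checkout";
// Policies can also key on cloud scope, for example an AWS account
ALLOW storage:logs:read WHERE storage:aws.account.id = "123456789012";
```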
3. Optimize log parsing without schemas to reduce MTTR
Choose a solution, such as Dynatrace, that enables you to parse any log data on demand or set parsing rules at ingestion. This stands in stark contrast to log management vendors that require you to index your data into rigid schemas. With traditional schema-based solutions, you’re limited in the questions you can ask and the answers you can obtain from data based on the schemas you set up. If you require additional data while troubleshooting issues you didn’t pre-plan for, you have to invest time and money in reindexing the data. This realization inevitably occurs at the worst possible moment, precisely when you urgently need answers.
In contrast, with Dynatrace, you’re always in a position to ask any questions of your data at any time. Following are some best practices to further optimize log parsing:
- Use parsing on read. Use the parsing-on-read functionality when you need to analyze historical log data on demand. A prime example is when a business analyst requests information about how many units of a product were sold within the last month or wants to perform a year-to-year comparison. In Grail, this data resides in business events historical logs. Even when you didn’t anticipate this question, you can provide the data on demand without having to reindex it, ultimately saving money compared with other vendors.
- Set up processing rules. If you want to transform user requests into real-time dashboards or alerts, you can set up processing rules at ingestion to create fields and metrics when the log is ingested. This best practice enables you to create long-term reports or trends on fields your company frequently reviews in dashboards or monitors for alerts based on custom trend thresholds.
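As a rough sketch of parsing on read, the following DQL extracts a value from raw log content at query time, with no reindexing; the units_sold field and the parse pattern are illustrative assumptions:

```dql
fetch logs, from:now() - 30d
| filter matchesPhrase(content, "units_sold")
| parse content, "LD 'units_sold=' INT:units"
| summarize total_units = sum(units)
```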
4. Implement a log analytics platform that makes low-cost, high-performance queries available to everyone
Log management best practices should include making log analytics readily accessible to teams across the organization.
To extract meaning from large data sets, most organizations must use regular expressions (regex) to search for patterns. However, regexes are known for their complexity, especially for in-depth analysis, leading to potential errors in results. The computation required to evaluate regexes against large amounts of data can also cause performance issues. Regex syntax also varies across the languages and libraries that use it, which adds more complexity, especially in large enterprise environments. This complexity and high margin of error mean that data querying is often reserved for a few technical experts in an organization.
To address these problems, Dynatrace developed the Dynatrace Query Language (DQL), a simple, human-readable, and versatile tool that supports all languages without needing to modify the queries.
Because Dynatrace DQL is simple and straightforward, teams from across the organization can pick it up quickly to query any data at any time. This means that a developer and a business user can access the same data to ask different questions and gain unique insights.
The following are some DQL best practices to improve query performance.
Narrow the query time range
A simple yet effective method to boost query performance is to narrow the query time range. A shorter analysis window, when applicable, provides better performance based on identical data sets.
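For example, restricting a query to the last 30 minutes keeps the scanned window small; the timeframe shown is illustrative:

```dql
fetch logs, from:now() - 30m
| filter matchesPhrase(content, "error")
| summarize count()
```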
Use sampling options
In data science, sampling is a statistical analysis technique used to select, manipulate, and analyze a subset of data to identify patterns in the larger data set.
Sampling is a powerful tool for optimizing query performance, particularly for log data. Dynatrace offers various sampling ratios, allowing you to retrieve a representative subset of all available raw log records. Sampling can significantly improve query performance, especially when dealing with large data sets.
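A sketch of sampling in DQL: a samplingRatio of 100 reads roughly one of every 100 raw log records, so counts should be extrapolated by the same ratio:

```dql
fetch logs, samplingRatio: 100
| summarize sampled_errors = countIf(matchesPhrase(content, "error"))
// Multiply the sampled count by the ratio to estimate the true total
| fieldsAdd estimated_errors = sampled_errors * 100
```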
Limit the scanned data
Dealing with a large influx of log data? Even with a narrow query time range, queries can take considerable time to complete. To address this, you can stop the system from reading data after a specified amount, helping you control query performance and costs.
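In DQL, the scanLimitGBytes fetch parameter caps how much data a query reads; the value shown is illustrative:

```dql
fetch logs, scanLimitGBytes: 500
| summarize count()
```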
Use the recommended order of commands
To ensure optimal query performance, Dynatrace recommends following a specific order of commands in your DQL queries. For example, after the fetch command, reduce the data set, select or remove fields, and then apply other pipeline stages. It’s best to apply functions such as limit at the end of the query to avoid performance degradation.
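Putting the recommended order together, a query might look like this sketch; field names such as loglevel and host.name are assumptions:

```dql
fetch logs, from:now() - 2h             // 1. Fetch with a narrow timeframe
| filter loglevel == "ERROR"            // 2. Reduce the data set early
| fields timestamp, content, host.name  // 3. Keep only the fields you need
| sort timestamp desc                   // 4. Apply remaining pipeline stages
| limit 100                             // 5. Apply limit at the end
```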
Filter early to reduce the data set
Narrowing your data set by filtering and segmenting early in the query process before further processing can significantly improve performance. You can achieve this by filtering on dedicated buckets for all tables, ingested fields for business event queries, or relevant topological context fields for log queries.
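For instance, filtering on a dedicated bucket as the first pipeline step narrows the scan before any further processing; the bucket name and the dt.system.bucket field are assumptions:

```dql
fetch logs
| filter dt.system.bucket == "prod_logs"      // Dedicated bucket first
| filter k8s.namespace.name == "namespace1"   // Then topological context
| summarize count(), by:{k8s.namespace.name}
```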
Use string comparisons wisely
Take care to use operators that work optimally for what you’re looking for. For example, when comparing field values, use the match (==) or does-not-match (!=) operators only when you know the exact value you’re searching for. When you don’t know the exact field values you’re looking for, use matchesPhrase() instead of contains() for more efficient filtering.
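The contrast can be sketched as follows; the values shown are illustrative:

```dql
// Exact value known: use == or !=
fetch logs
| filter loglevel == "ERROR"

// Exact value unknown: prefer matchesPhrase() over contains()
fetch logs
| filter matchesPhrase(content, "connection timeout")
```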
For more DQL tips and example DQL strings, see the topic DQL best practices.
Start your log analytics and log management best practices journey
By following these best practices, organizations can effectively manage and analyze logs, enhance performance, and control costs on their Dynatrace log management and analytics platform.
* Gartner, Innovation Insight: Telemetry Pipelines Elevate the Handling of Operational Data, Gregg Siegfried, 20 July 2023 GARTNER is a registered trademark and service mark of Gartner, Inc. and/or its affiliates in the U.S. and internationally and is used herein with permission. All rights reserved.
** The Forrester Observability Reference Architecture: Putting It Into Practice, Forrester Research, Inc., October 21, 2022.