What Metadata?
I'm positive that is what operations teams often think developers are asking themselves. And what exactly is metadata? Well, the simple answer is any information outside the main data flows that is used to control or monitor the behavior of the application. With an answer that vague, it's little wonder metadata is often not part of the initial design. I break metadata into two major categories: configuration data, which controls the behavior of the application, and telemetry data, which is generated by the application.
Configuration Data
Configuration data is usually given little if any attention during application development. Just throw whatever you need in a properties file or perhaps an XML file. The file is picked up from somewhere in the file system and that's that. Unfortunately, this fails to take into consideration several factors that can impact the manageability of the component. Let's take a look at some of the common pitfalls.
Configuration vs. Code
This is one of the biggest issues I see when designing configuration data. People overlook that properties and XML files may contain what is essentially code. Anything that would only be changed by a developer is code, regardless of what file format is used to express it. Making files that are essentially code available for potential modification as configuration in a production deployment is a formula for disaster. Code must be tested before it is released to production. You can't test code that you allow to be changed in production. Therefore, files that are essentially code should be delivered embedded in your deliverable (e.g. inside the jar file) and not alongside configuration files you might expect to be modified in production.
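As a rough sketch of the "embedded in your deliverable" idea, the helper below reads a properties file from the classpath only, so a copy lying around on the production file system cannot replace it. The class name `EmbeddedResources` and whatever resource name you pass in are assumptions made for illustration.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.util.Properties;

final class EmbeddedResources {
    private EmbeddedResources() {}

    // Loads a properties file that ships inside the jar.
    // The lookup goes through the class loader, never the file system path
    // that operations edits, so the content is exactly what was tested.
    static Properties load(String resourceName) {
        ClassLoader loader = EmbeddedResources.class.getClassLoader();
        try (InputStream in = loader.getResourceAsStream(resourceName)) {
            if (in == null) {
                // Fail fast: a missing code-like resource is a packaging bug.
                throw new IllegalStateException("Missing embedded resource: " + resourceName);
            }
            Properties props = new Properties();
            props.load(in);
            return props;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

Failing fast on a missing resource is a deliberate choice here: if something that is really code did not make it into the jar, that is a build problem, not something to paper over with defaults.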
Configuration Resource
Where to put the configuration information and how to format it is a subject of continuous debate. Files are the obvious choice for most applications, but this brings up several interesting questions. First is the format. Java properties? XML? Something else (probably not)? Fortunately, this issue can usually be resolved by looking at the structure of the information. The more structured the information, the more likely XML is the right choice.
Files also bring up the interesting problem of where to put the file. If you deploy it inside your WAR, then it becomes a challenge for operations to locate the file. It's buried somewhere under your application server directory. You can put it in some other distinguished location, but if you do this, you need to allow operations to override that location at server start or you will constrain their ability to manage the number of instances of the application per server.
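One simple way to let operations override the location at server start is a JVM system property with a sensible default. This is only a sketch: the property name `myapp.config.dir` and the default directory `/etc/myapp` are made up for the example.

```java
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Properties;

final class ConfigLocation {
    // Both names are assumptions made for this sketch.
    static final String OVERRIDE_PROPERTY = "myapp.config.dir";
    static final String DEFAULT_DIR = "/etc/myapp";

    private ConfigLocation() {}

    // Returns the override directory if one was supplied, otherwise the default.
    // Passing the properties in (rather than reading System directly) keeps
    // the method easy to test.
    static Path resolve(Properties systemProperties, String fileName) {
        String dir = systemProperties.getProperty(OVERRIDE_PROPERTY, DEFAULT_DIR);
        return Paths.get(dir, fileName);
    }
}
```

With this in place, a second instance on the same server could be started with something like `-Dmyapp.config.dir=/srv/instance2/conf`, and the application would call `ConfigLocation.resolve(System.getProperties(), "service.properties")`.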
Delivering the configuration with the application has other issues, though. Invariably, there will be configuration values that must be set to reflect the production environment. Once operations has modified the file, how do you deliver your subsequent versions? You can't overwrite the version of the file on the production server, yet if you've added necessary configuration values, how do these get merged into the production file?
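One possible way around the merge question, offered as a sketch rather than a recommendation from this post, is to never merge files at all: ship a defaults file with every release and let operations keep a separate file containing only the values they changed. `java.util.Properties` supports this layering directly through its defaults constructor. The class name and the keys used in the usage note are hypothetical.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.util.Properties;

final class LayeredConfig {
    private LayeredConfig() {}

    // shippedDefaults: delivered with each release and freely overwritten.
    // siteOverrides:   owned by operations and never touched by a release.
    static Properties load(Reader shippedDefaults, Reader siteOverrides) {
        try {
            Properties defaults = new Properties();
            defaults.load(shippedDefaults);

            // getProperty() on 'effective' falls back to 'defaults'
            // for any key the override file does not mention.
            Properties effective = new Properties(defaults);
            effective.load(siteOverrides);
            return effective;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

If the shipped defaults contain `pool.size=10` and `cache.ttl.seconds=60` and the site file contains only `pool.size=50`, the effective configuration has a pool size of 50 and picks up the TTL, including any new keys added in later releases, from the defaults.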
Another challenge with file-based configuration is that it does not scale well. It works well enough for O(10^2) servers, but beyond that it becomes unwieldy. Attempting to manage files on thousands of servers requires appropriate tools to have any chance at all. Even with the best tools, files tend to get missed, and the result is very confusing production errors.
So what is the alternative? Centralizing configuration into services such as LDAP or a configuration database (CDB) can alleviate the challenges associated with distributing files. This definitely scales better to a large number of servers. But there are some challenges with this approach as well. The primary challenge is providing developers with a usable environment for testing. Developers typically require a local version of their resources with the ability to change the content at any time. This is relatively straightforward with files but much more difficult with LDAP or CDB. Still, the scalability offered to production may well be worth this hassle.
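One way to soften the developer-environment problem, sketched here under my own assumptions, is to have application code depend on a small interface and swap the backing store. The in-memory implementation below is what a developer might use locally; a production implementation that queries LDAP or a CDB is deliberately not shown, and all type names are hypothetical.

```java
import java.util.Map;
import java.util.Optional;
import java.util.concurrent.ConcurrentHashMap;

// The only thing application code is allowed to see.
interface ConfigSource {
    Optional<String> get(String key);
}

// Local stand-in for developers: values can be changed at any time.
final class InMemoryConfigSource implements ConfigSource {
    private final Map<String, String> values = new ConcurrentHashMap<>();

    void put(String key, String value) {
        values.put(key, value);
    }

    @Override
    public Optional<String> get(String key) {
        return Optional.ofNullable(values.get(key));
    }
}

// Example consumer: it neither knows nor cares where the value comes from.
final class PoolSettings {
    private final ConfigSource source;

    PoolSettings(ConfigSource source) {
        this.source = source;
    }

    int poolSize() {
        // Falls back to 10 when the key is absent; the default is arbitrary.
        return source.get("pool.size").map(Integer::parseInt).orElse(10);
    }
}
```

The trade-off is that the local stand-in can drift from how the central service really behaves, so some testing against the real thing is still needed.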
Telemetry
Okay, this is really logging, but I have borrowed the term from professional racing (or NASA, take your pick) for a good reason. From Wikipedia:
Telemetry is a technology that allows the remote measurement and reporting of information of interest to the system designer or operator.
Thinking of the information that a component sends to its logs as telemetry, rather than as just logs, gives a better perspective on what belongs in the stream for operators. Developers tend to think of logs only as tools to help themselves with debugging, when in fact they are a necessity for properly monitoring the health of a running application.
The question of course is what kind of information should be in the telemetry? Considering a common web service, at a minimum, I'd expect to find the following information easily in the stream:
- The request URI including parameters
- Basic parametric information about the request
- External resource interactions performed by the service. These should include status and timings.
- The result status of the request.
- The processing time of the request.
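To make the list concrete, here is a minimal sketch of a per-request telemetry record that renders itself as a single key=value line. The class name, field names, and line format are all assumptions; request parameters are assumed to travel as part of the URI, so there is no separate field for them.

```java
import java.util.ArrayList;
import java.util.List;

final class RequestTelemetry {
    // One interaction with an external resource made while serving the request.
    static final class ResourceCall {
        final String resource;
        final String status;
        final long millis;

        ResourceCall(String resource, String status, long millis) {
            this.resource = resource;
            this.status = status;
            this.millis = millis;
        }
    }

    private final String requestUri;
    private final List<ResourceCall> calls = new ArrayList<>();

    RequestTelemetry(String requestUri) {
        this.requestUri = requestUri;
    }

    void addResourceCall(String resource, String status, long millis) {
        calls.add(new ResourceCall(resource, status, millis));
    }

    // Renders URI, result status, total time, then one entry per resource call.
    String toLogLine(int resultStatus, long totalMillis) {
        StringBuilder sb = new StringBuilder();
        sb.append("uri=").append(requestUri)
          .append(" status=").append(resultStatus)
          .append(" ms=").append(totalMillis);
        for (ResourceCall c : calls) {
            sb.append(" call=").append(c.resource)
              .append(':').append(c.status)
              .append(':').append(c.millis);
        }
        return sb.toString();
    }
}
```

A fixed, machine-parseable line like this is what makes the scraping and monitoring described next practical; free-form debug messages are much harder to turn into graphs.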
If this information is made available to operations in real time, there are several types of monitors that can be created. Alarms can be set on thresholds for error status ratios or dramatic drops in request rates. Operational graphs can be made for average response time, with potential alerts for response time drifting out of SLA. Dependency graphs can be constructed that will help operations correlate resource failures to client impacts. And this is from the small amount of information proposed above. Additional telemetry can provide even more operational monitoring capabilities.
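As an illustration of the first kind of monitor, the sketch below fires once the ratio of error statuses exceeds a threshold. It assumes HTTP-style status codes where 500 and above count as errors, and it counts over the whole lifetime of the object; a real monitor would more likely use a sliding time window. The class name is hypothetical.

```java
final class ErrorRatioAlarm {
    private final double threshold;
    private long total;
    private long errors;

    // threshold is a fraction, e.g. 0.1 for "more than 10% errors".
    ErrorRatioAlarm(double threshold) {
        this.threshold = threshold;
    }

    // Feed in the result status of each request as it appears in the telemetry.
    void record(int status) {
        total++;
        if (status >= 500) { // assumption: 5xx means the request failed
            errors++;
        }
    }

    boolean isFiring() {
        return total > 0 && (double) errors / total > threshold;
    }
}
```

The same shape works for the other monitors mentioned above: swap the status check for a response-time comparison and you have a crude SLA drift alert.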
The Java logging facility can meet the needs of generating telemetry, although you may want to separate telemetry out under its own logger name. For small-scale deployments, this telemetry can simply be sent to log files. Scripts that regularly scrape the logs can be used to extract the relevant bits of information. For larger-scale operations, however, a central logging scheme is more appropriate. Logging using the socket handler may be sufficient, although for very large installations it may be desirable to move to a less reliable but more scalable transport such as multicast.
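Here is a minimal sketch of the separate-logger idea using `java.util.logging`. The logger name `telemetry` is an assumption; the point is only that it is distinct from the loggers developers use for debugging, so it can be routed on its own.

```java
import java.util.logging.Handler;
import java.util.logging.Level;
import java.util.logging.Logger;

final class TelemetryLog {
    // Assumed name; anything distinct from the debug loggers works.
    static final String LOGGER_NAME = "telemetry";

    // Held in a static field so the configured logger is not garbage collected.
    private static final Logger LOGGER = Logger.getLogger(LOGGER_NAME);

    private TelemetryLog() {}

    // Sends telemetry to the given handler and nowhere else.
    static void routeTo(Handler handler) {
        LOGGER.setUseParentHandlers(false); // keep it out of the debug logs
        LOGGER.setLevel(Level.INFO);
        LOGGER.addHandler(handler);
    }

    static void emit(String line) {
        LOGGER.info(line);
    }
}
```

For a small deployment the handler could be a `java.util.logging.FileHandler`; for central logging it could be a `java.util.logging.SocketHandler` pointed at a collector host and port of your choosing. Neither the application code nor the callers of `emit` need to change when the transport does.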
Summary
I know that I've only scratched the surface of metadata issues. The point of this posting wasn't to give you an exhaustive guide to configuration and telemetry, but rather to bring up some issues and initiate a dialog. As always, comments are most welcome.