I'm positive that is what operations teams often think developers are asking themselves. And what exactly is metadata? Well, the simple answer is any information outside the main data flows that is used to control or monitor the behavior of the application. And with an answer that vague, it's little wonder it's often not part of the initial design. I break metadata into two major categories: configuration data, which controls the behavior of the application, and telemetry data, which is generated by the application.
Configuration Data
Configuration data is usually given little if any attention during application development. Just throw whatever you need in a properties file or perhaps an XML file. The file is picked up from somewhere in the file system and that's that. Unfortunately, this fails to take into consideration several factors that can impact the manageability of the component. Let's take a look at some of the common pitfalls.
Configuration vs. Code
This is one of the biggest issues I see when designing configuration data. People overlook that properties and XML files may contain what is essentially code. Anything that would only be changed by a developer is code, regardless of what file format is used to express it. Making files that are essentially code available for potential modification as configuration in a production deployment is a formula for disaster. Code must be tested before it is released to production, and you can't test code that you allow to be changed in production. Therefore, files that are essentially code should be delivered embedded in your deliverable (e.g. inside the jar file) and not alongside the configuration files you expect to be modified in production.
Configuration Resource
Where to put the configuration information and how to format it is a subject of continuous debate. Files are the obvious choice for most applications, but this brings up several interesting questions. First is the format. Java properties? XML? Something else (probably not)? Fortunately, this issue can usually be resolved by looking at the structure of the information. The more structured the information, the more likely XML is the right choice.
Files bring up the interesting problem of where do you put the file. If you deploy it inside your WAR, then it becomes a challenge for operations to locate the file. It's buried somewhere under your application server directory. You can put it in some other distinguished location but if you do this, you need to allow operations to override that location at server start or you will constrain their ability to manage the number of instances of the application per server.
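One way to leave that override in the hands of operations is a JVM system property with a built-in default. The following is only a minimal sketch: the property name myapp.config.dir, the default directory /etc/myapp and the file name are hypothetical and stand in for whatever your application uses.

```java
import java.nio.file.Path;
import java.nio.file.Paths;

/** Resolves where the configuration file lives; all names here are hypothetical. */
final class ConfigLocator {
    // Hypothetical system property that operations can set at server start.
    static final String LOCATION_PROPERTY = "myapp.config.dir";
    // Hypothetical default location used when no override is given.
    static final String DEFAULT_DIR = "/etc/myapp";
    static final String FILE_NAME = "myapp.properties";

    /** Returns the override directory if set and non-blank, otherwise the default. */
    static Path resolve(String overrideDir) {
        String dir = (overrideDir == null || overrideDir.trim().isEmpty())
                ? DEFAULT_DIR
                : overrideDir.trim();
        return Paths.get(dir, FILE_NAME);
    }

    /** Convenience wrapper that reads the override from the JVM system properties. */
    static Path resolve() {
        return resolve(System.getProperty(LOCATION_PROPERTY));
    }
}
```

With something like this in place, each instance on a shared server could be started with its own -Dmyapp.config.dir=... argument, so the number of instances per server is not tied to a single fixed path.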
Delivering the configuration with the application has other issues, though. Invariably, there will be configuration values that must be set to reflect the production environment. Once operations has modified the file, how do you deliver your subsequent versions? You can't overwrite the version of the file on the production server, yet if you've added necessary configuration values, how do these get merged into the production file?
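One possible way to sidestep the merge, offered as a sketch rather than a prescription, is to never ship the file that operations edits. Defaults travel with the release, and a separate override file holds only the values production changes. java.util.Properties supports this layering directly through its defaults constructor. The class and method names below are hypothetical.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.util.Properties;

/** Layers operations-owned overrides on top of defaults shipped with the application. */
final class LayeredConfig {
    /**
     * Loads the shipped defaults first, then the production overrides.
     * Keys missing from the overrides fall through to the defaults, so a new
     * release can add keys without touching the file operations maintains.
     */
    static Properties load(Reader shippedDefaults, Reader productionOverrides) {
        try {
            Properties defaults = new Properties();
            defaults.load(shippedDefaults);
            // The defaults object is consulted whenever a key is not found.
            Properties effective = new Properties(defaults);
            effective.load(productionOverrides);
            return effective;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

In practice the defaults reader would wrap a resource inside the jar and the overrides reader would wrap the external file. Note that lookups must go through getProperty for the defaults to be consulted.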
Another challenge with file based configuration is that it does not scale well. It works well enough for O(10^2) servers but beyond that it is unwieldy. Attempting to manage files on thousands of servers requires appropriate tools to have any chance at all. Even with the best tools, there is a tendency for files to be missed and very confusing production errors result.
So what is the alternative? Centralizing configuration into services such as LDAP or a configuration database (CDB) can alleviate the challenges associated with distributing files. This definitely scales better to a large number of servers. But there are some challenges with this approach as well. The primary challenge is providing developers with a usable environment for testing. Developers typically require a local version of their resources with the ability to change the content at any time. This is relatively straightforward with files but much more difficult with LDAP or CDB. Still, the scalability offered to production may well be worth this hassle.
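The developer-environment problem can be softened by hiding the source of configuration behind a small interface, so a developer's local values are consulted before (or instead of) the central service. This is a sketch under assumptions: ConfigSource and ChainedConfigSource are hypothetical names, and the in-memory maps stand in for a local file and for an LDAP or CDB lookup.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.Optional;

/** A place configuration values can come from: a file, LDAP, a database, and so on. */
interface ConfigSource {
    Optional<String> get(String key);
}

/** Tries each source in order and returns the first value found. */
final class ChainedConfigSource implements ConfigSource {
    private final List<ConfigSource> sources;

    ChainedConfigSource(ConfigSource... sources) {
        this.sources = Arrays.asList(sources);
    }

    @Override
    public Optional<String> get(String key) {
        for (ConfigSource source : sources) {
            Optional<String> value = source.get(key);
            if (value.isPresent()) {
                return value;
            }
        }
        return Optional.empty();
    }

    /** Adapts an in-memory map, e.g. a developer's local settings. */
    static ConfigSource of(Map<String, String> values) {
        return key -> Optional.ofNullable(values.get(key));
    }
}
```

The trade-off is that a local source left in the chain by accident would silently shadow central values in production, so the chain itself probably belongs with the code rather than with the configuration.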
Telemetry
Okay, logging, but I have actually borrowed this term from professional racing (or NASA, take your pick) for a good reason. From Wikipedia:
Telemetry is a technology that allows the remote measurement and reporting of information of interest to the system designer or operator.
Thinking of the information that a component sends to logs as telemetry, instead of just logs, gives a better perspective on what belongs in the stream for operators. Developers tend to think of logs only as tools to help themselves debug, when in fact they are a necessity for properly monitoring the health of a running application.
The question of course is what kind of information should be in the telemetry? Considering a common web service, at a minimum, I'd expect to find the following information easily in the stream:
- The request URI including parameters
- Basic parametric information about the request
- External resource interactions performed by the service. These should include status and timings.
- The result status of the request.
- The processing time of the request.
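As a rough illustration of the list above, a request could emit one line per call carrying most of these fields. The class name, field names and key=value layout are all assumptions made for this sketch, not an established format.

```java
import java.util.ArrayList;
import java.util.List;

/** One telemetry record per request; field names and the line format are assumptions. */
final class RequestTelemetry {
    private final String uri;
    private final List<String> resourceCalls = new ArrayList<>();

    RequestTelemetry(String uri) {
        this.uri = uri;
    }

    /** Records an external resource interaction with its status and timing. */
    RequestTelemetry resourceCall(String name, String status, long millis) {
        resourceCalls.add(name + ":" + status + ":" + millis + "ms");
        return this;
    }

    /** Renders a single key=value line that is easy to scrape or parse. */
    String finish(int status, long millis) {
        return "uri=" + uri
                + " status=" + status
                + " time_ms=" + millis
                + " resources=" + String.join(",", resourceCalls);
    }
}
```

A call such as new RequestTelemetry("/orders?id=42").resourceCall("db", "ok", 12).finish(200, 35) yields one line containing the URI with its parameters, the resource interaction, the result status and the processing time.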
If this information is made available to operations in real time, there are several types of monitors that can be created. Alarms can be set on thresholds for error status ratios or dramatic drops in request rates. Operational graphs can be made for average response time, with potential alerts for response time drifting out of SLA. Dependency graphs can be constructed that will help operations correlate resource failures to client impacts. And all of this comes from the small amount of information proposed above. Additional telemetry can provide even more operational monitoring capabilities.
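To make the error-ratio alarm concrete, here is a deliberately simple sketch. It counts over the lifetime of the object, whereas a real monitor would almost certainly use a time window, and treating statuses of 500 and above as errors is an assumption.

```java
/** Tracks request outcomes and flags when the error ratio crosses a threshold. */
final class ErrorRatioAlarm {
    private final double threshold; // e.g. 0.05 means alarm above 5% errors
    private long total;
    private long errors;

    ErrorRatioAlarm(double threshold) {
        this.threshold = threshold;
    }

    /** Counts a request; statuses of 500 and above are treated as errors (an assumption). */
    void record(int status) {
        total++;
        if (status >= 500) {
            errors++;
        }
    }

    /** Fraction of recorded requests that were errors; zero when nothing was recorded. */
    double errorRatio() {
        return total == 0 ? 0.0 : (double) errors / total;
    }

    boolean isTriggered() {
        return errorRatio() > threshold;
    }
}
```

The same counters, reset per interval, would also give the request rate needed to detect a dramatic drop in traffic.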
The Java logging facility can meet the needs of generating telemetry although you may want to separate out telemetry into its own logger name. For small scale deployments, this telemetry can simply be sent to log files. Scripts that regularly scrape the logs can be used to extract the relevant bits of information. For larger scale operations however, a central logging scheme is more relevant. Logging using the socket handler may be sufficient although for very large scale installations, it may be desirable to move to a less reliable but more scalable transport such as multicast.
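Separating telemetry under its own logger name with java.util.logging might look like the sketch below. The logger name "telemetry" is hypothetical, and the in-memory handler is only a stand-in so the example runs anywhere; in a deployment it would be replaced by java.util.logging.FileHandler or java.util.logging.SocketHandler.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.logging.Handler;
import java.util.logging.Level;
import java.util.logging.LogRecord;
import java.util.logging.Logger;

final class TelemetryLogging {
    // Hypothetical logger name reserved for telemetry.
    static final String LOGGER_NAME = "telemetry";

    /** Handler that keeps messages in memory; stands in for a file or socket handler. */
    static final class CapturingHandler extends Handler {
        final List<String> messages = new ArrayList<>();

        @Override
        public void publish(LogRecord record) {
            if (isLoggable(record)) {
                messages.add(record.getMessage());
            }
        }

        @Override
        public void flush() { }

        @Override
        public void close() { }
    }

    /** Routes the telemetry logger to the given handler only, not to parent handlers. */
    static Logger configure(Handler handler) {
        Logger logger = Logger.getLogger(LOGGER_NAME);
        logger.setUseParentHandlers(false); // keep telemetry out of the debug log
        logger.setLevel(Level.INFO);
        logger.addHandler(handler);
        return logger;
    }
}
```

Because the telemetry logger does not forward to its parent, its destination and volume can be changed without affecting ordinary debug logging.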
Summary
I know that I've only scratched the surface of metadata issues. The point of this posting wasn't to give you an exhaustive guide to configuration and telemetry, but rather to bring up some issues and initiate a dialog. As always, comments are most welcome.
I agree with your configuration comments. I worked with a vendor product that configured everything in XML files distributed in the EAR. This made it impossible to have anything that varies from one environment to another from build through production. We overrode the configuration reader to read from XML and if not found read from database. Each environment has its own database, and each developer can optionally have their own XML for localhost testing. It's A Bad Thing (!) if a personal XML file happens to get deployed, though.
Re telemetry ... I'm currently working with code that displays "Exception occurred" but does not display the exception or stack trace. Some team standards might be in order. ;-)
Posted by: Jim Standley | Wednesday, October 24, 2007 at 11:49 AM
In non-Java systems you learn quickly to house everything (logs, config, scripts) outside the codebase. In designs, I look for layouts that allow code to be upgraded without touching config files and vice versa. The linux file system is a good working example.
Java systems are different. I'm not sure whether this is technical (containers, classpaths) or cultural (file system independence, wora); I suspect it's both. Certainly by 2007 I'd have hoped we'd be beyond throwing WAR/EAR file over to ops and driving production configurations out of software build chains. But it doesn't look that way yet.
Smartfrog is a project to watch:
http://www.hpl.hp.com/research/smartfrog/
Telemetry: I'm biased, but I think this will come to be done using Atom or some such over XMPP. What you've described in the information stream looks like an extended Atom Entry and while XMPP is inefficient relative to jmx/snmp, it's more flexible and scalable, and probably requires less upfront agreement between agents.
Posted by: Bill de hOra | Wednesday, October 24, 2007 at 02:13 PM
Configuration aspects I have had to pay attention to:
1. Who needs to change configuration?
- Developer, Operations, Data Managers, Product staff . . .
2. How often does configuration update need to happen?
- Code release boundary, Time based, Demand driven . . .
3. Push vs. Pull?
- Deployer based, knows all end points (?)
- Repository subscription based, I know what I need principle
In a large system, configuration data / metadata has the tendency to get scattered over various places. I have seen certain elements of configuration set up directly inside the confines of the code (I mean c, java ...). The first level of externalization is in configuration files that are shipped alongside code as properties, xml or even text files. In both of these scenarios, the frequency of change is limited by the code release cycle and requires development staff to be engaged. The second level of externalization is where the config files are deployed into alternate paths on the server file system. This gives operations staff the control; however, as you have mentioned, it becomes hard to manage as the number of servers increases. The third level of externalization, a CMDB, allows for ad hoc, on-demand changes, but you cannot have one for each developer unless you can create a decent replicated environment on every developer's box.
Another aspect of configuration: In larger deployments, where servers are going in and out of rotation making sure each of the servers have the correct / latest configuration deployed also becomes a challenge. It is up to the maturity of the configuration audit systems to validate, verify, reconcile and fix issues. Especially an issue with push based systems.
As you may have experienced, hierarchical configuration based on multiple levels of overrides and customizations makes this problem even harder. I am personally a fan of a CM service: a well-abstracted interface that can hide the complexity of the location of configuration, overrides, and specialization. One that could fetch (pull) config based on functional, location and other application semantics, either at startup time, or based on an explicit instruction through the server's admin interface, or on a periodic pull.
A well thought out, consistent, well-designed approach with "simplicity added" :-) along with a good portfolio of tools and finally the organizational discipline to keep it consistent could work.
Posted by: Sri Shivananda | Wednesday, October 24, 2007 at 10:49 PM
Very interesting and relevant post. Currently we run into some of the same issues you have mentioned in both areas.
Internally we developed a configuration mechanism that builds on top of java.util.preferences, adds more functionality and uses XML files as the backing store. The API hides the location/env/file-vs-db complexity from the app dev teams. We also ran into the same issues that you mentioned. Some of the basic requirements we had were
- individual server level granularity for different configs (A/B testing)
- remove to eliminate dependence on db services team
- make it fast for devs
- must support hot deployment of changes
As you said we are in the O(10^2) server complexity and this has worked fairly ok for us. We also had to build scripts and gui tools to view and make changes to the production env so that non-devs could make changes to n number of servers without bringing the server down.
Some of the requirements I mentioned above made LDAP and db a more difficult choice though at some point we went ahead and added db connectivity so - some rather static configs come from db in addition to hundreds of others that come from XML files.
On the logging - totally agree. We built a fairly sophisticated enterprise logging library. This allows us to traverse a single user and his activity across various systems that span java and C++, including status, timings, errors etc. This same data gets plugged into monitoring console to keep track of systems' health. We also built a sophisticated viewer that sits on top of this data so that non-devs can keep track of farms, server, single user down to a single page.
In summary, totally agree with you, and more often than not this is the last thing that gets resources when putting together the plans/stories.
Posted by: vmoharil | Thursday, October 25, 2007 at 06:49 AM
Regarding working with configuration metadata, we've found that the most effective way to manage environments of any significant scale or complexity is to use a model-driven system like ControlTier (http://open.controltier.org) to manage application deployment and ongoing management.
Why model-driven? Because the metadata you were referring to is needed for purposes other than just configuring a single application. You need to use that same metadata to configure other applications to work in conjunction with your application. You also need to use that same metadata to provide context to the automation that effects change on your environment.
So I guess my point is that you not only have to think about where the "metadata" lives within each application... but how it gets managed at an environment wide level and how it gets delivered.
-Damon
http://dev2ops.org
Posted by: Damon Edwards | Wednesday, May 21, 2008 at 08:53 PM