Jon Udell

Perspectives
Posted by Jon Udell | Jun 20th @ 8:18 AM

The WorldWide Telescope was first shown to the public at TED 2008, in a joint presentation by project leader Curtis Wong, manager of Next Media Research for Microsoft, and Roy Gould, a science educator with the Harvard-Smithsonian Center for Astrophysics. In this interview they discuss how -- and why -- the WorldWide Telescope combines many sources of astronomical data and imagery to create a seamless view of the night sky.




CW: I was interested in astronomy as a kid, but when you grow up in Los Angeles, the odds of seeing the Milky Way are pretty slim. I think the only time it happened recently was during the quake when the whole city lost power.

It wasn't until much later that I actually got to see the Milky Way, and other objects I'd seen pictures of, and it was really quite a transformative experience. I always wanted to figure out how to recreate and share that experience.

Early on, in the 80s, I made a little HyperCard stack called MacTelescope, which was my attempt to create the experience of looking at the sky, and then -- if you could manage to see the Milky Way -- to zoom into a section where there are all these interesting globular clusters and nebulae, if you know where to look.

JU: So there was already the idea of taking people on a guided tour.

CW: Exactly. Later I moved from the Voyager company to a company called Continuum, a little think-tank organization started by Bill Gates. The company was thinking about authoring tools and media. The project I wanted to do was called John Dobson's Universe.

John Dobson was a physical chemist at UC Berkeley who got drafted to work on the Manhattan Project. He was there to do the chemistry for tuballoy, which was the code word for uranium. He became a conscientious objector, left the project -- which was pretty hard to do -- and became a Vedantan monk. Then he became interested in looking at the sky but, being a monk, he didn't have any money for a telescope.

He knew that San Francisco shipyards were sources of glass discs, but they were too thin. Conventional wisdom said that you need thick glass to be able to grind a mirror. But he defied wisdom and found a way to use round porthole glass. He also came up with an ingenious way of mounting the telescope, using cardboard concrete form tubes. His design is now one of the most common designs for telescope mounts in the world.

He also created an organization called San Francisco Sidewalk Astronomers, where people who have telescopes are encouraged to take them out into the public and show people what's up in the sky, and explain what's going on.

John spent a lifetime in national parks, and in San Francisco, talking about astronomy to the general public. He was really good at taking complex ideas and conveying them to the public.

I remember once when he was showing the Andromeda galaxy, there was a picture with the galaxy in the background which looks like a kind of fog, with a lot of stars in front, and he said: "The stars you're looking at in front are like raindrops on your window, looking at a distant cloud."

So anyway, we started that project, got about halfway, then other things happened and it got cancelled.

JU: It's a great story!

CW: So I've thought about astronomy for a long time. At Microsoft, I heard a talk by Jim Gray who was applying his database expertise to astronomy.

JU: I looked up a paper that Jim published with a guy from Johns Hopkins, Alex Szalay, and it's called The World-Wide Telescope, but interestingly, the title also includes the phrase: An Archetype for Online Science.

The idea is that, not just in astronomy but in all of science, we're getting to the point where there's less direct observation, and more collection and analysis of data on a really large scale, happening in ways that are computationally assisted, and also assisted by the collaborative properties of the Internet.

CW: Exactly. We're getting this deluge of data. The challenge then becomes how you process it and how you use it. Bringing in computer technologies -- SQL, visualization -- can really help.

So Jim had written that paper with Alex in 2002, and he'd given a talk on some of the work he'd been doing with the Sloan Digital Sky Survey at that time.

JU: Yeah, I've listened to that talk, and I was hoping you could help me connect the dots between the work that was done there and the federated virtual observatory which, for him, became a case study in the use of XML web services to create an Internet telescope that was a federation of radio astronomy services.

CW: Right. Alex and several other research scientists came together to create the National Virtual Observatory, which establishes common protocols for accessing astronomical data and imagery.

When Jim told me they were starting that project, I told him I wanted to help. Part of the pitch was a PowerPoint I made about SkyServer that showed how you could embed that data and imagery in a virtual environment.

What they were thinking was that astronomers would be querying the database, and pulling out objects. But I thought that to make this an interesting educational resource, we would need to build an environment in which people could create and share stories about the objects, and could connect those stories to original source information.

Later he came back to me and said that the SkyServer data was released, and he wanted to redesign the website to make the data more accessible to the public. So I helped with that. SkyServer DR2 (data release 2) was the redesigned website. And I used that to make the case for building the learning environment that became WorldWide Telescope.

JU: Under the covers at SkyServer, there was a lot of work done to correlate observations from different sources of data.

CW: Exactly. Alex Szalay did a lot of that work, in collaboration with people from other universities.

JU: And that's the foundation for what we see now in WWT?

CW: That was sort of the first case. It was the Sloan Data, which is just the northern sky. Then at TechFest in 2006, we were working with Alex to take the SkyServer images -- he has an image server that'll give you an image from coordinates, and from that you can get data about other objects in the field of view. For our TechFest demo, our developer Jonathan Fay -- who had been fiddling around with building his own hierarchical multi-resolution tile-browsing engines -- put together an engine that would pull the images from SkyServer and assemble them into a mosaic that you could browse and zoom in and out of. That was the first manifestation.
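To make the tile-browsing idea concrete, here is a minimal sketch of the arithmetic behind a hierarchical multi-resolution mosaic. It is not Jonathan Fay's engine. It assumes a simple quadtree layout in which level 0 is a single tile and each deeper level doubles the number of tiles along each edge; the function names and the normalized 0-to-1 coordinates are illustrative assumptions.

```python
from __future__ import annotations

import math


def tiles_per_side(level: int) -> int:
    """Number of tiles along each edge of the mosaic at a zoom level."""
    return 2 ** level


def tile_for_point(x: float, y: float, level: int) -> tuple[int, int]:
    """Map a normalized position (0 <= x, y <= 1) to a (column, row) tile index."""
    n = tiles_per_side(level)
    # Clamp so that a position of exactly 1.0 falls in the last tile.
    col = min(int(x * n), n - 1)
    row = min(int(y * n), n - 1)
    return col, row


def children(level: int, col: int, row: int) -> list[tuple[int, int, int]]:
    """The four tiles one level deeper that cover the same area as this tile."""
    return [
        (level + 1, 2 * col + dc, 2 * row + dr)
        for dr in (0, 1)
        for dc in (0, 1)
    ]


def level_for_view(view_fraction: float) -> int:
    """Coarsest level at which one tile is no wider than the visible fraction
    of the whole mosaic (view_fraction = 1.0 means the whole mosaic is visible)."""
    if view_fraction <= 0:
        raise ValueError("view_fraction must be positive")
    return max(0, math.ceil(math.log2(1.0 / view_fraction)))
```

With a layout like this, a viewer requests only the handful of tiles that match the current zoom level and field of view, rather than the full-resolution mosaic.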

Meanwhile, we'd gotten some interest from Harvard. Alyssa Goodman found an intern for us who was passionate about education, and she came to work with us in the fall of 2006.

And then in January, as I was emailing back and forth with Jim about our plans, he disappeared after sailing out in San Francisco Bay to spread his mother's ashes.

So when TechFest came around again in 2007, we decided to rename it WorldWide Telescope in honor of Jim. We started building it in March. That summer was a big effort to secure image sets and data from lots of different sources, as well as building the engine and defining the authoring environment. We showed a rough prototype at the Astronomical Society of the Pacific meeting ... Roy, was that in Chicago?

RG: Yes.

CW: Roy, do you want to pick up the story from there?

RG: Sure. Let me rewind the tape on my end all the way back. I was smiling when Curtis told his childhood story about being interested in stars he couldn't see. I have the east coast version of that, looking up in New York City, hounding my parents for a telescope, and then when I finally got one, we went up on the roof and there was nothing to see.

JU: Just a little too much ambient light!

RG: Exactly. We thought we saw a star, but it was a plane.

My career has been in science education, and I was familiar with Curtis' work long before the WorldWide Telescope. When I heard his talk in Chicago, and at some subsequent conferences, I saw that it was addressing two of my passions. One is astronomy, what's out there in the night sky, and having a unified view of it. But also, as somebody who communicates science, I've always been interested in the learning interface. It's not just about using the resource, it's about learning from it.

It was clear there were lots of things that could be done, but it took a long time to see what all of them might be. It was only after I used it for a while that I realized what a great breakthrough it is.

First, there's the seamless experience of the night sky. It's true that all these images are accessible in principle, if you know where to find them, and of course many of us in the field do that when we prepare curricula or museum exhibits or planetarium shows.

But when you do that, you see the universe in a disconnected way. Once you go on WorldWide Telescope, it's a different experience. It's like you could look up with perfect vision.

JU: And with complete contextual awareness.

RG: Exactly. And I think we take the night sky for granted. We don't realize how important it used to be. When we have floods in the midwest, we call them disasters. Well, disaster literally means against the stars. Catastrophe is the Greek for against the stars. Romeo and Juliet were the star-crossed lovers. It's all through literature, it's part of common speech.

Then you fast-forward to the modern day, and few of us have even seen the stars, let alone have that relationship to them. For me that's number one about WorldWide Telescope. It's really inviting us to take a long look at the night sky again.

Of course you can do that through the WorldWide Telescope, but then you can also look at the night sky from your back yard, or from a dark location, and have that dual relationship. So you can both see the night sky and, in WWT, you can explore it.

JU: That's one of my favorite things to do. At night, of course, laptops create their own illumination. My first astronomy program ran under MS-DOS, amazingly enough, and I'd take my laptop from that era out in the backyard and use it as a guide. This is the latest and greatest incarnation of that tradition.

RG: And from the educational point of view, this pays enormous dividends. We've done research here about what students know when they graduate from high school about the night sky, and about the universe in general. From that, we know that the majority of high school students graduate placing the stars inside the orbit of Pluto.

JU: C'mon. Really?

RG: It's true. About 52 percent, and that's based on surveys of thousands of students in 37 states.

There are many reasons for that, but certainly one of them is this lack of good images of the sky. It's hard enough to see a picture of the solar system, let alone its context within the galaxy. That's one beautiful feature of WorldWide Telescope, you get a sense of where things are.

We also know that students think galaxies are closer than stars, because they tell us that stars are just point sources, and no matter what your magnification or telescope they remain points, so they must be very far away. Whereas galaxies, whatever they are, are big, and so they must be closer.

But if you go on WorldWide Telescope, and look at the Andromeda galaxy, the nearest big galaxy to us, and the furthest thing you can see with your naked eye, you get a physical feeling. You see all the stars in our Milky Way that are the veil of stars we look through, and you really get a sense that the Andromeda galaxy is vast and distant.

Another thing that's useful is the ability to look at the universe in different wavelengths of light. We see only the visible light our eyes can see, but that's like listening to one instrument in the orchestra.

You can download a Chandra image, an X-ray image of the sky, but what's that about? You can't really figure it out. Or one of the infrared images. In WorldWide Telescope, you can seamlessly crossfade back and forth between images taken at various wavelengths. So you see what's going on in the visible wavelengths, but you also see what's going on that's emitting these other forms of light. And that's important, because most of the action in the universe happens at wavelengths of light we can't see. The WorldWide Telescope automatically aligns these different images.

JU: Can you say a bit about how that's done?

CW: Yes, it was a challenge to register all these different surveys so you can do that kind of cross-comparison. There are some emerging standards from the National Virtual Observatory and others. The AVM (Astronomy Visualization Metadata) standard provides metatags at high precision within objects, to give exact position and scaling and orientation for images in the sky.

JU: Does that work by reprocessing existing survey data?

CW: Right. Generally the data exists. But for example, when Hubble makes these beautiful press-release images that they send out in color, the metadata is usually lost, because the images have been post-processed in Photoshop and other programs. So a lot of these images that we want to put out there for the public needed to have that metadata reintegrated. And sometimes they're composites of many Hubble images to create a large mosaic. So that composite image needs to have metadata put into it.

I think a source image is about 400 megabytes. So while it's technically in the public domain, they don't release it because it's way too much data for most people to download. They've released low-resolution versions, but we have the full-resolution image of the Crab Nebula, and other things, in WorldWide Telescope, because we can enable people to use the high-resolution images without having to download all of them.

JU: So part of it was going back to sources that were notionally available, but not practically available.

CW: Right.

JU: But part of it is about emerging standardization of how these images are described.

CW: Yes. The whole NVO group is working toward common standards and common protocols for image metadata, so they can all be used by everybody. There's AVM, part of the VAMP project -- Virtual Astronomy Multimedia Project -- leading the charge for annotation of imagery and other media in the context of the sky.

JU: And there are presumably many uses for those annotations.

CW: Yes. So we were one of the first guinea pigs for VAMP, and I think Google joined a bit later.

JU: So let's talk about the authoring aspect. I'm a huge fan of multimedia and audiovisual tools for educational and training purposes, and this is a wonderful example of that.

I was looking at what files get created when you author a slideshow in WorldWide Telescope, and it looks like the output is an XML file with coordinates and descriptions. To me that says two things.

First, it says that WorldWide Telescope presentations can be created using other tools, which is interesting.

It also says that presentations created inside WWT could potentially be played elsewhere, in other environments.
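As a rough illustration of why a plain XML format opens both of those doors, the sketch below writes and then reads back a tour using only Python's standard library. The element and attribute names (`Tour`, `Stop`, `ra`, `dec`, `zoom`) are invented for this example and are not the actual WWT file schema; the coordinates are approximate values in degrees.

```python
import xml.etree.ElementTree as ET


def build_tour(title, stops):
    """Serialize a list of stops into a small XML document (hypothetical schema)."""
    tour = ET.Element("Tour", {"title": title})
    for stop in stops:
        element = ET.SubElement(
            tour,
            "Stop",
            {
                "ra": f"{stop['ra']:.4f}",    # right ascension, degrees
                "dec": f"{stop['dec']:.4f}",  # declination, degrees
                "zoom": f"{stop['zoom']:.2f}",
            },
        )
        element.text = stop["description"]
    return ET.tostring(tour, encoding="unicode")


def read_stops(xml_text):
    """Recover (ra, dec, description) for each stop, as a separate player might."""
    root = ET.fromstring(xml_text)
    return [
        (float(stop.get("ra")), float(stop.get("dec")), stop.text)
        for stop in root.iter("Stop")
    ]


# Approximate positions, for illustration only.
example_stops = [
    {"ra": 10.68, "dec": 41.27, "zoom": 3.0, "description": "Andromeda galaxy"},
    {"ra": 283.40, "dec": 33.03, "zoom": 0.1, "description": "Ring Nebula"},
]
tour_xml = build_tour("A short tour", example_stops)
```

Any tool that can emit a file in the agreed format can author a tour, and any program that can parse it can play one back; the writer and the reader above share nothing except the format.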

CW: Exactly. The whole idea with the authoring, and these guided experiences, is ... let's go back to the educational intention. I spent a lot of years building instructional learning, where you bring in experts to tell you about subjects, in context, but also the self-directed discovery aspect of learning. Then there's a third part I wanted to bring in, constructive learning. There's always a duality between instructive and constructive learning.

JU: What do you mean by constructive?

CW: Basically, learning by doing. Where kids who don't know much about astronomy can take tours from experts, and be taken to unfamiliar places, but then pause those tours and explore on their own. In WorldWide Telescope you can always pause the tour, like stopping a tour bus and getting off to look around. At that point you can right-click to get more information, you can zoom into places that the tour didn't deeply explore because it had to keep moving, you can see other objects that are in the neighborhood.

If you notice down below, in the context menu, as you get to different objects you can not only see that object in different wavelengths, but you can also see other tours that intersect with that object.

The goal is that as you start to see more and more guided tours within this space, if you think of objects as nodes or intersections, there are more and more opportunities to cross over from one tour to another. Eventually we might see a kind of hypermedia web of learning where instead of hyperlinking among words, we're linking among stories and paths and ideas.

JU: And that'll include things I make for myself. If I make a narrative about a part of the sky, then the context surrounding the area I'm interested in will be available when someone else plays back that tour.

CW: Exactly. If you take a tour about stellar evolution, and learn about how stars get formed from nebulae, and you get to the planetary nebula stage, you might say, wow, those are really pretty, what's going on there? Then you might intersect with a tour about different planetary nebulae, where you dive deep into that category. Then you might find a tour just about the Ring Nebula, or the Helix Nebula.

Or conversely you may come across it in a different way. You're browsing the sky and you come across the Ring Nebula or the Helix Nebula, and you may say, what are these things? And you can see other things that relate categorically to them, which would then intersect with explanations of how planetary nebulae fit into the grand scheme of the origin of all the elements.
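The idea of objects as intersections between tours can be sketched as a small index from each object to the tours that visit it. This is only an illustration of the concept, not how WorldWide Telescope stores tours; the tour names and their contents are made up for the example.

```python
from collections import defaultdict


def index_by_object(tours):
    """Build a mapping from each sky object to the set of tours that visit it."""
    index = defaultdict(set)
    for tour_name, objects in tours.items():
        for obj in objects:
            index[obj].add(tour_name)
    return index


def crossovers(tours, current_tour, current_object):
    """Other tours a viewer could switch to while paused at current_object."""
    index = index_by_object(tours)
    return sorted(index[current_object] - {current_tour})


# Hypothetical tours, each listed as the objects it stops at.
example_tours = {
    "Stellar Evolution": ["Orion Nebula", "Ring Nebula"],
    "Planetary Nebulae": ["Ring Nebula", "Helix Nebula"],
    "The Ring Nebula": ["Ring Nebula"],
}
```

Pausing the stellar evolution tour at the Ring Nebula would, in this toy data, offer the planetary nebulae tour and the Ring Nebula tour as places to cross over.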

JU: Roy, what tours are you working on?

RG: We're working on two. One is a tour of black holes, in conjunction with a traveling exhibit the Smithsonian is producing.

Another is a tour of alien solar systems and their exoplanets, as they're known. There are more than 300 stars known to have planets orbiting them. Using a small educational telescope -- we have a network of five of these, they're called micro-observatories -- students in middle school and high school can take their own images and study these exoplanets. They can actually characterize them in a surprising amount of detail. They can figure out how large they are, and how far away they are from their stars.

What's more, we're using the tour of exoplanets to get teachers who have never used the micro-observatory telescopes involved in this particular curriculum. You know, you can make a brochure, and send them some images of the night sky, but there's nothing more exciting than having a tour that gives you a sense of the context, the depth of the sky, where things are.

JU: Do you think there will be a citizen science aspect to this?

RG: Absolutely. I think that's going to be a major use. Astronomy is probably unique among all the sciences, in that there are more amateurs than professionals. But this is a third category. You've got the amateurs, you've got the professionals, but now WorldWide Telescope makes possible the blossoming of citizen science. Especially given the flood of images we're going to have in the next few years, more than researchers can ever look at. You'll have images that have never been seen by humans, and that opens up a huge possibility.

JU: So part of it's about getting more eyeballs on this flood of imagery. Potentially another part is citizen-driven analysis of data. I wonder if SkyQuery will become part of the suite, so people can start to ask questions, like how many fast-moving objects are in this part of the sky, which is one of the queries Jim Gray mentioned in his 2002 paper.

RG: Yes, I'd love to see that. There are several ways the public can contribute. One is to look for things, and make serendipitous discoveries. With some of the NASA solar missions, for example, where we have so many images of the sun coming back, you can see comets that were never seen before as they get close to the sun. And ordinary citizens have discovered comets that nobody knew existed.

But to me the most important thing is what you just alluded to: Asking questions that nobody had thought to ask, even the professionals.

So for example, what's the volume of a black hole? How big is it inside? If you go on the web and look at the standard references, you'll find answers all over the place, all at odds with one another.

There are questions that researchers just haven't gotten around to asking, that many of the public will ask, and we don't know what those are yet.

Although there's all this wonderful technology in the WorldWide Telescope, in a sense it's the modern incarnation of a campfire that you sit around to trade stories. Our organization has telescopes in Australia and Chile and elsewhere, and when I go to those countries, I find that the native cultures have all sat around campfires and developed incredible stories about the night sky.

JU: So how does the collaboration work? I've made a little slideshow, it's stored as a file on my computer, how do I share that?

CW: We're trying to encourage the development of communities. Sky and Telescope is forming one, Astronomy magazine is forming one, Meade -- a telescope maker -- is forming one.

JU: So the unit of sharing is the WWT file which gets created when you make a slideshow. It's a bundle of XML and thumbnail images, and maybe audio if there's voiceover. In a lot of cases, those will want to live out on the web where people can link to them. If I post that file, and somebody clicks on it, and WorldWide Telescope is installed, then it'll launch and play the slideshow?

CW: Yes. I was talking to a storyteller from a local Snoqualmish tribe, and a lot of their stories happen to be about the sky. I wanted to try to capture some of those as examples of what you can do.

By the way, if you're playing the tour, you can pause and go into edit mode, and it's all open. You can change destinations, you can drop in your own audio narration, music, text, and images.

JU: So you've forked the thing you've downloaded, and at that point you can...

CW: ... put your own interpretation on it, exactly. It's just like View Source. Except easier, because you don't have to program in HTML.

JU: Yeah, it's quite straightforward.

Do you think that there's a need -- I suspect that there is -- for some sort of universal player that wouldn't require the full application, and wouldn't even require Windows, but would just be a way for anybody to play these things?

CW: Absolutely. That's a really good idea.

So by the way, I want to highlight one citizen science story for you. It relates to Galaxy Zoo, a website that allows the public to help the Sloan Digital Sky Survey catalog and tag the hundreds of millions of galaxies that were covered in the various data releases. In one case, a teacher from Amsterdam was looking at a galaxy and it looked really blue. She reported that to Galaxy Zoo, and they looked at it, and it was something they'd never seen before. So they retargeted the Very Large Array Radio Telescope to take a look at it. And based on those results, they've now secured Hubble time to study that galaxy in great detail.

That's a case where putting the data out there, letting the public look at it, dividing up the sky, and having that feedback mechanism can really advance science. Because when you think about it, as we start to get telescopes like LSST and Pan-STARRS and these other large telescopes that will generate many terabytes of imagery and data every night, it's going to be impossible for any one person or group to see what's up there. Image recognition's good, but nothing's as good as the human eye and human brain.

JU: Some guidance on what to look for is really useful. If someone's found something interesting, and that justifies spending the resources to take a closer look, that's beautiful. That's exactly how things ought to work.

CW: Right. And I think a lot of these telescope projects are thinking, how do we make this much data available to people? And how do we make that accessible in a simple way? So they've had conversations with us about how WorldWide Telescope might be able to help.

Also, NASA is very interested in how realtime data feeds from them would enable the public -- at the same time -- to have access to mission data.

JU: You start to think about what's possible, and you quickly realize there's an infinite number of interesting possibilities. It'll be great to see how this plays out over the next few years. Thanks!

Posted by Jon Udell | Jun 12th @ 5:22 AM

My guest for this week's Perspectives show is George Hripcsak, professor of biomedical informatics at Columbia and one of six researchers recently funded by Microsoft Research through its Computational Challenges of Genome Wide Association Studies (GWAS) program.




JU: For starters, what is a genome wide association study?

GH: A genome wide association study involves scanning markers across the human genome to find genetic variations associated with certain diseases.

JU: Specifically what's being looked for is single-nucleotide markers, right?

GH: Yes. Now our role in this project is the phenotype. We're trying to address the phenotypic computational challenge. Often it's simple. Someone has diabetes or doesn't. Or two people have it, but one gets complications and the other doesn't.

JU: So by phenotype you mean the expression of diabetes, in this case?

GH: Yes. Often you start with a disease, and some number of patients, often very small, but up to several thousand, plus a control group of patients without the disease.

You study the entire genotype, and you look for which sites on the genome are associated with that disease. Then you look into that site. Now the fact that you may find a certain genetic mutation at that site -- that's not necessarily the cause of the difference between the two sets of patients. The cause may be something near that marker on the genome. So you might sequence that area, looking for other information about what genes are in the area, and so on.

JU: So the computational challenge is one of correlation.

GH: The first step is correlation. But ... I'm at the zeroth step. There are other people working on the part I'm describing here. First they'll come up with associations, which is a computational challenge in its own way, because there is a vast number of markers -- a hundred thousand, someday a million -- that you're looking at, to see if they're associated with this trait, diabetes or not diabetes.

Then they need to figure out what proteins are coded at the marker, or near the marker. In order to get to that point you need the phenotype too. As long as it's something simple, like patients with or without diabetes, it may seem like that's the easy part of the experiment.

But as time goes on, and genotyping gets easier and cheaper -- and as we learn to handle patient privacy, that's the other thing that limits the study, we have to be careful about how we collect and store these data -- the hard part is going to be collecting the phenotype.
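To give a feel for the association step described above, here is a minimal sketch that tests each marker with a 2x2 chi-square test (cases versus controls, variant present versus absent) and applies a Bonferroni correction for the number of markers tested. This is a textbook simplification chosen for illustration, not the method used by any of the funded projects; the marker names and counts in the usage note are hypothetical.

```python
import math


def chi_square_2x2(case_with, case_without, control_with, control_without):
    """Pearson chi-square statistic for a 2x2 table of counts (1 degree of freedom)."""
    n = case_with + case_without + control_with + control_without
    denominator = (
        (case_with + case_without)
        * (control_with + control_without)
        * (case_with + control_with)
        * (case_without + control_without)
    )
    if denominator == 0:
        return 0.0  # a row or column is empty, so there is nothing to compare
    difference = case_with * control_without - case_without * control_with
    return n * difference ** 2 / denominator


def p_value_1df(chi2):
    """Upper-tail probability of a chi-square value with 1 degree of freedom."""
    return math.erfc(math.sqrt(chi2 / 2.0))


def significant_markers(tables, alpha=0.05):
    """Markers whose p-value survives a Bonferroni correction.

    tables maps a marker name to its four counts:
    (cases with variant, cases without, controls with, controls without).
    """
    threshold = alpha / len(tables)  # stricter as more markers are tested
    return [
        marker
        for marker, counts in tables.items()
        if p_value_1df(chi_square_2x2(*counts)) < threshold
    ]
```

For example, `significant_markers({"marker_a": (30, 10, 10, 30), "marker_b": (10, 10, 10, 10)})` keeps only `marker_a`. With a hundred thousand markers the threshold shrinks accordingly, which is part of why the association step is a computational and statistical challenge in its own right.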

JU: When you say "collecting the phenotype" -- that's clinical observation and description?

GH: Exactly. Imagine that every patient who comes into the hospital is given the option to participate in a trial where, in a secure fashion, their genotype is done, and then their information can be used to discover new things about disease. Some number of patients would agree to that, and then all you have to do is take their blood samples, check the DNA, and genotype it. It's relatively straightforward if you have the money to do it.

Then you have to find out what the phenotype of the patient is. But what questions should you ask? We don't know what diseases we might be studying, or what we might discover. We want to know the whole medical course of this patient: When they've been well, and when they've been sick.

What we have for these patients are their electronic health records. And in the future, with Microsoft HealthVault for example, we have the personal health records. And so the question is, with the patient's permission, can we use those data to come up with a reliable phenotype?

JU: OK. Now I see how this ties into your career history. You've done a lot of work in the area of mining clinical records, using a variety of techniques.

GH: Exactly. So in addition to working on the statistical models to do the genome association, we thought it'd be worth investing in the phenotype part of the problem. We've been working on it for 20 years, and it's harder than most people think.

JU: I wouldn't think it'd be easy, but tell us: What are the challenges unique to mining health records and clinical data?

GH: Two of our collaborators are Rich Smiley and Pamela Flood, both faculty members at Columbia, in anesthesiology. They're studying a specific protein, the beta-2 adrenergic receptor, and they're sampling about 2000 patients. They're just studying two SNPs, it's not a genome wide association, but they need to collect the phenotype on those 2000 patients. It's prohibitive to have a research nurse accurately gather all the information they need to do their study -- and what they're studying is labor, the length of time you spend in pre-term labor and how much pain you experience, and how that's associated with variations on these two sites. It's an enormous amount of work, and it might not be reliable.

But we do have an electronic health record. For each of these patients, the nurse has painstakingly documented what he or she recorded on the patient. Plus we have monitors, and lab tests, all fed into the electronic health record.

JU: But much of what's there is anecdotal, textual, and narrative, right?

GH: Well, it's a mixture of structured and narrative. So ideally we should be able to generate the phenotype rather than have a person do it. We're trying to do that by computational analysis of the health record.

Of course the health care record is intended for patient care, not for research. Anytime you take an information source intended for one purpose and try to use it for another you have to be careful. It takes a lot of processing and interpretation.

Whether it's structured or narrative data, people use different words to encode things. In a cardiovascular study, is it "chest pain"? Is it "angina"? Is it "coronary artery disease"?

The terminology varies, and often the terms are ambiguous. Someone says a person has diabetes, they probably mean diabetes mellitus, a problem with glucose. But there's a diabetes insipidus which is a completely different disease. All they have in common is that you urinate a lot.
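The two problems described here, varying words for related things and single words with more than one meaning, can be sketched as a lookup that either resolves a term, flags it as ambiguous, or reports it as unknown. The table below is a toy built from the examples in this conversation; it is not a clinical vocabulary, and the grouping label "possible coronary disease" is an assumption made for the example.

```python
# Toy mapping from words found in a record to the concepts they might denote.
TERM_TO_CONCEPTS = {
    "chest pain": ["possible coronary disease"],
    "angina": ["possible coronary disease"],
    "coronary artery disease": ["possible coronary disease"],
    "diabetes": ["diabetes mellitus", "diabetes insipidus"],
    "diabetes mellitus": ["diabetes mellitus"],
    "diabetes insipidus": ["diabetes insipidus"],
}


def resolve(term):
    """Return (status, candidate concepts) for a term found in a record.

    status is "resolved" when there is exactly one candidate, "ambiguous"
    when there are several, and "unknown" when the term is not in the table.
    """
    candidates = list(TERM_TO_CONCEPTS.get(term.strip().lower(), []))
    if not candidates:
        return "unknown", candidates
    if len(candidates) == 1:
        return "resolved", candidates
    return "ambiguous", candidates
```

In this sketch `resolve("angina")` and `resolve("chest pain")` land on the same study concept, while `resolve("diabetes")` comes back as ambiguous and would need surrounding context, such as glucose results, to settle.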

JU: So on the one hand you can try to provide a more structured data collection environment that's aware of these distinctions. And on the other hand you can do a lot of text mining, correlation, and natural language processing.

GH: Yes. But remember, the people who are collecting the data are not interested in your research study. As a nation, we're working on improving our terminology, whether you get there by natural language processing or by having the doctor fill out a template. Either way we want to end up with computable knowledge.

But when the purpose is clinical care, not research, we're always going to wind up with these problems.

Furthermore, there's a reason why we speak in narrative style, and not in templates. It's an efficient means of communication. It may be true that it's best for health care providers -- and for all other human beings -- to speak in narrative language, and to have our systems, as they improve, turn that narrative into something structured.

JU: Using natural language processing to extract structure from narrative is something you've been doing for a long time. What can you say about the progress of the state of that art?

GH: We did a study in 1995 where we had 200 chest x-ray reports, and we had 12 physicians review them. Six were radiologists, the ones who generate the reports, and six were internists, who generally use them. They weren't looking at the images, they were just looking at the reports dictated by the radiologist who did the initial reading.

We wanted to see if we could use natural language processing to say yes or no to a set of questions, like: Is this a report indicative of bacterial pneumonia, or of cancer, or of chronic obstructive pulmonary disease? We had six conditions we were looking for, and we compared the reliability of each doctor's reading to the other eleven, and we compared the computer system's interpretation to all twelve. We found that the computer system was about as accurate as the 12 experts.
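
The comparison described above can be sketched in a few lines. This is only an illustration under stated assumptions: the yes/no answers are invented, there are four reviewers instead of twelve, and plain percent agreement stands in for whatever reliability measure the study actually used.

```python
def agreement(a, b):
    """Fraction of reports on which two readers give the same yes/no answer."""
    return sum(x == y for x, y in zip(a, b)) / len(a)


def mean_agreement_with_others(name, readings):
    """Average agreement between one reader and every other reader."""
    others = [r for n, r in readings.items() if n != name]
    return sum(agreement(readings[name], o) for o in others) / len(others)


# Hypothetical answers to "does this report indicate pneumonia?" for 5 reports.
physicians = {
    "radiologist_1": [1, 0, 1, 1, 0],
    "radiologist_2": [1, 0, 1, 0, 0],
    "internist_1":   [1, 0, 0, 1, 0],
    "internist_2":   [1, 1, 1, 1, 0],
}
system = [1, 0, 1, 1, 0]  # hypothetical output of the language processor

# Each physician is scored against the other physicians ...
physician_scores = {n: mean_agreement_with_others(n, physicians) for n in physicians}
# ... and the system is scored against all of them.
system_score = sum(agreement(system, r) for r in physicians.values()) / len(physicians)
```

If the system's score falls inside the range of the physicians' scores, it is behaving about as reliably as an individual expert on that question.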

JU: And what is that system? There are general NLP frameworks, and also domain-specific ones...

GH: This one is a medical system called MedLEE, which Carol Friedman started building back in 1990 or 91. It went into production use in our hospital in 1995, and we've been using it ever since.

JU: And you've been training it as you use it?

GH: Well, improving it. It's not a data-driven system. So as it makes mistakes, we fix it, but it's not a machine learning system.

JU: It's a language understanding system.

GH: Yes. It uses a semantic grammar: it divides all words into classes. So rather than getting into the details of syntax, like noun phrase and verb, it says, this thing is a body part, this is a disease, this is a procedure, this is a medication, and then it has a grammar that has sequences of these classes. It also has some syntactic parsing to figure out negation and things like that, so it's a blended approach. It was used initially for radiology reports, but now it's used for all of medicine.
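
To make the idea of a semantic grammar concrete, here is a toy sketch. It is not MedLEE's lexicon, grammar, or negation logic; every word class, pattern, and function name below is an assumption chosen for illustration. Words are mapped to semantic classes, patterns are sequences of those classes, and a negation word marks the next match in the same sentence as negated.

```python
# Toy lexicon: invented entries, not MedLEE's vocabulary.
LEXICON = {
    "left": "modifier", "right": "modifier",
    "lung": "bodypart", "lobe": "bodypart", "chest": "bodypart",
    "infiltrate": "finding", "effusion": "finding",
    "pneumonia": "disease",
    "no": "negation", "without": "negation",
}

# Patterns are sequences of semantic classes; longer patterns are tried first.
PATTERNS = [
    ("modifier", "bodypart", "finding"),
    ("bodypart", "finding"),
    ("finding",),
    ("disease",),
]


def parse(report):
    """Return the semantic patterns found in a report, sentence by sentence."""
    results = []
    for sentence in report.lower().split("."):
        words = sentence.replace(",", " ").split()
        tagged = [(w, LEXICON[w]) for w in words if w in LEXICON]  # drop unknown words
        i, negated = 0, False
        while i < len(tagged):
            if tagged[i][1] == "negation":
                negated = True
                i += 1
                continue
            for pattern in PATTERNS:
                window = tagged[i:i + len(pattern)]
                if tuple(cls for _, cls in window) == pattern:
                    results.append({
                        "text": " ".join(w for w, _ in window),
                        "classes": pattern,
                        "negated": negated,
                    })
                    i += len(pattern)
                    negated = False  # negation applies only to the next match
                    break
            else:
                i += 1  # no pattern starts here; skip this word
    return results


findings = parse("No pleural effusion. Left lung infiltrate, consistent with pneumonia.")
```

Because the whole report is parsed, a question such as "is this indicative of pneumonia?" becomes a lookup over the structured findings rather than a separate classifier per question.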

It was as accurate as humans at answering simple questions. When you get to complex interpretations, it doesn't do as well, but you're still in a situation where a human can't be expected to read a million chest x-ray reports, or discharge summaries. If we can do things that there just isn't the money for people to do, even if the accuracy is a bit lower, that's still useful.

JU: Given that context, how will this apply to the funded project?

GH: So, I've outlined some challenges. Things are narrative, terminology varies. Another is that data are sometimes wrong. Mistakes can be made in recording information on the chart, and often those are mistakes that the doctor would notice and immediately discount. Or it may be a subtle mistake that isn't important to a human interpreting the case, but could matter for a research trial where you're trying to automatically understand what's in the chart.

Often, there's also missing data. The patient may go for care elsewhere. Or a data value may not have been recorded. Or a test may not have been done. So you don't really have a complete record. If you're doing a clinical trial, you have a lot of money to pay a lot of people to spend a lot of time tracking patients, following up with them, measuring everything that needs to be measured. But if you're just using the combination of electronic health record and personal health record, you have to rely on whatever was collected for that purpose.

JU: It's going to be sparse data, for the foreseeable future.

GH: Exactly. So our challenge is to generate a reliable phenotype from that electronic health record and personal health record. Or, if it's not reliable, to know that there's not enough information in those records to make a determination.

JU: So part of the challenge is to infer what's missing. How can you do that?

GH: Let's say you're trying to study complications of diabetes, and you want to do a genome-wide association study on people who've had severe diabetes from the point of view of treating it with insulin, but have had no complications, versus people who have had complications, to see if there's a genetic difference. If you can discover why some people don't have complications, can you develop a drug that mimics that in the other people?

To do that, we want to come up with a phenotype of people who have diabetes severe enough to be treated with insulin, but who don't have complications. And we want to use the electronic health record to identify them. What are the challenges?

Well, what if someone comes here for their diabetes care, because there's an expert in this medical center, but when they have complications, they go to the nearest hospital? My electronic health record doesn't have the data about their complications.

JU: Of course this is the promise, and the holy grail, of federated health records.

GH: Right. But this is just one example of many problems that can come up. When you're trying to identify someone who hasn't had complications, you don't know if you're missing the data, or if they're truly without complications.

How can you figure it out? Well, you can use information-theoretic methods to figure out, look, I have enough information such that if this person had complications, I'd know it. If this person has a history and a physical by an internist, or three different internists over the course of 10 years, and none of them ever mentioned a complication of diabetes, then odds are this patient doesn't have a complication.

JU: So you're interpreting the negative space?

GH: Exactly. Whereas another patient, who has diabetes, and disappears for 5 years, and then comes in and has a complete blood count but not a glucose, and then has some minor dermatological procedure, and then disappears for 5 years, and is here now -- I have no reason to think that person doesn't have diabetes complications. All I know is that he or she came in to have a mole removed. I have no information about diabetic complications, for example an ophthalmologic complication.
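
One way to picture this reasoning about "negative space" is a simple Bayesian update, used here as a stand-in for the information-theoretic methods mentioned above. The prior and the per-encounter detection rates are invented numbers, and the model assumes that a complication is never recorded when it is absent.

```python
def prob_complication_given_silence(prior, detection_rates):
    """Posterior probability that a complication exists even though no
    encounter mentioned one.

    prior            assumed probability of a complication before looking at the chart
    detection_rates  for each encounter, the assumed chance that the encounter
                     would have documented the complication if it were present
    """
    p_silent_if_present = 1.0
    for rate in detection_rates:
        p_silent_if_present *= (1.0 - rate)
    numerator = prior * p_silent_if_present
    # If there is no complication, silence is certain under this model.
    return numerator / (numerator + (1.0 - prior))


PRIOR = 0.4  # assumed, for illustration only

# Three full history-and-physical exams by internists, none mentioning a complication.
well_followed = prob_complication_given_silence(PRIOR, [0.7, 0.7, 0.7])

# A blood count and a mole removal: encounters unlikely to surface a complication.
sparse = prob_complication_given_silence(PRIOR, [0.05, 0.02])
```

With these made-up numbers the well-followed patient's probability drops to about 2 percent, while the sparsely documented patient stays close to the prior, which is the difference between "probably no complications" and "not enough information to say."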

JU: Electronic health records are moving into the mainstream. You mentioned Microsoft HealthVault, and Google Health was just announced. Most people have yet to encounter these things in their routine interaction with the health care system. I presume that in five years, many will have.

I think a lot of people have the notion that the information that's being collected will be of value, not only clinically but also to research. Your point is: No, not necessarily. So my question is, if you were the czar of electronic health information, what would you like to see happen in order to merge those two goals?

GH: I'd start with a caution. There's a knee-jerk reaction to say that we need to have doctors document more accurately, and more completely. But the problem is that you end up with a big structured template.

What I envision is an intelligent record that produces a summary for clinicians, which they can read and correct before writing their note, which is a succinct summary of their thinking.

Now that doesn't answer your question, which was: How does that then get used for research? But I think that to the degree we make documentation efficient in serving health care, it'll also be more accurate for the sake of research.

One thing that can go wrong, for example, is that if you're filling out a record for the sake of billing, you'll have an incentive to use diagnosis codes that optimize billing. Does that then reflect clinical accuracy? And would that then be useful for research?

The important thing is to be grounded in the clinical truth. Put health care first, and then use new computational methods to extract accurate information.

JU: So clinical truth is what the doctor said, in the doctor's own language. Of course there's a lot of shared convention around the terminology.

GH: They learn in medical school, and throughout their professional lives, what to document. Things aren't always called the same thing, but the nation is working on health care standards in various ways, both for transferring information between systems and for coming up with common vocabularies.

JU: So although many of us would assume that those vocabulary terms need to be fields in a template, you're saying that's not the first and best strategy. You'd like to see that language just used naturally, as doctors speak their narratives, and then we'll harvest what we need out of that.

Do you think natural language processing will get us there?

GH: It's not perfect. We achieved expert-level performance on a simple task. We have less than expert performance -- but not bad performance -- on the more complex task.

JU: How has the system improved since its introduction in the 1990s?

GH: What we've done is expand our breadth. Back then we were doing mainly radiology reports, and now we cover most of medicine. I don't know that the accuracy got better, though.

Modern natural language processing systems often depend on machine learning, and don't have deep linguistic knowledge.

JU: Well, there are both breeds.

GH: In medicine we're seeing more emphasis on statistics than on linguistics, but we believe the right answer is a combination of the two. In our case we've tried some statistical systems too, but our semantic system seems to outperform them.

If you have a specific question, and that's the only one you need to answer, a statistical system is probably the more efficient way to go. What we do is parse the entire report, and spit out everything we can figure out from it.

In the 1995 study our goal was to answer six questions, but the system actually parsed the whole report and reported everything it found, and then among those findings we picked out the ones that were indicators of pneumonia.

There are various techniques that you can use that do pretty well on a single question, but that don't do well if you give them an entire history and physical, and say, tell me everything there is to know about the patient. That's what MedLEE is good at.

Systems should make it easy for people to express what they need to express -- in this case, the clinical truth. If it turns out that a super-efficient template model works best, then that's great. It's an empirical study. People will experiment over time, and see what works.

JU: You've also mentioned the compromise approach: summarize, then present for approval or correction.

GH: Yes, but clinicians don't want to stop and correct. So we need to work on presenting a structured format that's useful enough to them to justify that effort.

JU: It's a perennial and vexing problem. In some ways, maybe, one of the grand computational challenges. At the interface between the data collector and the human being, the person is always going to regard the collector as an impediment.

So, when does your project start?

GH: We've already started. For that Rich Smiley and Pamela Flood study in pre-term labor, we're already taking data out of the electronic health record for them to do their associations.

It's nice to have a concrete problem to work on. Over the summer, what we're working on is a generic framework. So, how does the next person and the next person do this? And then we'll be working on putting together a pipeline of tools. You'll still need a person there to process the data, but it won't involve reading every chart.

JU: Well this sounds hopeful. Thanks!

GH: Thank you, Jon.

Blog.PostedBy: Jon Udell | May 29th @ 9:34 AM

My guests for this week's Perspectives show are Barbara Willett and Nigel Snoad. Barbara works for Mercy Corps in Afghanistan, as the design, monitoring, and evaluation manager for a number of agricultural development programs. Nigel Snoad is a lead capabilities researcher for Microsoft Humanitarian Systems. Together they've pioneered the use of FeedSync as a way to synchronize data collection and reporting in an environment where Internet connectivity is spotty, and where lightweight, two-way synchronization is essential.

Nigel Snoad and Barbara Willett


JU: Barbara, we want to discuss the database synchronization system that you've partnered with Nigel to develop, as part of the Mercy Corps work in Afghanistan.

BW: I'm the design, monitoring and evaluation manager here in Afghanistan. I arrived last year in March, about the same time we had a consultant in doing a general review. He also brought another consultant who'd worked with Mercy Corps on technical issues, including the development of databases.

JU: And can you explain what, in this context, is being designed, monitored, and evaluated? What are the programs you're supporting, and what do those programs do?

BW: In Afghanistan, almost all our programs revolve around agricultural development. We have a number of programs funded by USDA, the British Government, the European Commission, all with the same goal of improving the livelihoods of the Afghan people.

JU: Is your microfinance program among those?

BW: Yes, we have a microfinance program, but it's one of our older ones, and it's self-sustaining now, no longer administered directly by our monitoring and evaluation system.

JU: What are some examples of programs that are?

BW: The ABAD [Agro-Business and Agriculture Development] program is where all this started. It supports business development, and technical capacity training for farmers.

There's also a lot of work in animal health, redeveloping and reestablishing veterinary field units, and training veterinaries, para-veterinaries, and female livestock workers.

JU: So your management challenge, relative to these programs, is what?

BW: It's developing tools that are applicable to multiple programs doing the same kinds of things. Everybody's involved in agricultural development, and interested in measuring improvement in sales and production. It's a challenge to collect that data and share it -- both operational data and impact data.

JU: So the field workers are in various locations around the country, with intermittent Internet access?

BW: Right. And in these circumstances, we want to standardize how we collect, synchronize, share, and report on this information.

JU: You had a pre-existing system based on Microsoft Access, as I understand it, and there were problems synchronizing those databases to your central office.

BW: Initially we didn't have Access, actually. When I arrived there wasn't any centralized system at all. Everything was Excel-based, sharing spreadsheets month-to-month from this region to that region. So we started the Access system, then later we realized it wasn't really working out because of the Internet problems, and because the process was bulky and cumbersome.

JU: Nigel, that's where you come in, right?

NS: Yes. Our humanitarian team was in Afghanistan looking to talk with people, do a bit of show and tell, and mainly get feedback about what people would like, and what they really need. And then use that to iterate what we were doing, and look for partnerships to do pilots.

With Mercy Corps we said, here's what we've got, here's what we're thinking, does that make sense to you?

Mercy Corps was great for that, because they were fairly well advanced in their thinking about how they were using their Access solution, and the architecture of what they wanted to do was quite clear.

JU: So from your perspective, Barbara, what was the outcome? Did it look just like what you had before, except that the synchronization problems were magically solved?

BW: Yes. I just wanted things to share, I wanted to know that we all had the same database, and somehow, whether it's every week or every month or every minute, the information just has to connect.

On paper it looked like we could do that with Access replication, but when we realized the problems that was causing, we realized that this technology Microsoft had been talking about -- which seemed maybe a little beyond our needs -- might actually solve the problems that we had. They wanted to try it, we wanted to try it, so it seemed to dovetail well.

And yes, it made happen in reality what I'd wanted to happen on paper. I was willing to accept weekly or even monthly if that's what it took, but now it happens every 10 minutes.

JU: Is this a situation where the updates that flow in from various locations tend not to conflict with one another?

BW: Yes, conflict resolution hasn't yet been much of an issue. Our biggest issues have been in our own database development, because the database itself is still evolving. So each time that changes, it affects the job mapping and FeedSync.

NS: The conflict resolution stuff is in there, and I think it'll become increasingly important as the size of the data grows, and as the activity from all the endpoints grows.

But we were quite deliberate about trying this in one place, and seeing if from Mercy Corps' perspective it worked out the way the Microsoft team had envisioned. It really was a tight spiral between developing new ideas and technologies, and also proving them and using them.

It was great to be able to do that in a real environment, but also a fairly controlled one, which was our first pilot in Kunduz. But then, you took it all over the country, damn you. [Laughs]

BW: [Laughs]

JU: There was a preexisting synchronization system that was found wanting. What was that, why didn't it work, and why is the FeedSync solution working?

NS: The solution Mercy Corps was originally piloting was based on Access and Access replicas. Which is a great technology, but in Afghanistan they were struggling with an unreliable Internet connection, and those replicas weren't working well. There was a lot of data to send, and there was a peer-to-peer VPN that would work OK sometimes, but flake out sometimes, mainly due to the Internet connection.

And in some cases, there was no Internet connection at all. So you have to send something by courier, be it a file on disk or on a memory stick.

JU: With FeedSync it's the same in the no-Internet case, you still have to sneaker-net the file.

NS: Absolutely. But with Access, when you export to a file, that's a one-way transfer. And there's all kinds of data you want to get back. Corrections to data, if there's a problem. (That's where conflicts can arise.) Then there are the reference lists and the lookups -- names of provinces, names of villages, names of staff people who are attending training sessions. All these things have to flow back to the edge, and be kept consistent.

JU: The point being that FeedSync isn't just lightweight, and more resilient to poor connectivity, but also that it's two-way.

NS: It's a two-way technology, and you've got different versions of the same thing, not one version that you're trying to somehow import and export and merge.
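
As a rough illustration of what two-way, version-aware synchronization involves, the sketch below merges two sets of records by comparing a per-item update count, timestamp, and endpoint name. It is loosely modeled on the winner-picking idea in the published FeedSync specification but is heavily simplified. The `Item` layout, the `merge` function, and the sample records are all assumptions made for illustration, not the tools discussed in this interview. Timestamps are assumed to be ISO 8601 strings in UTC so that they compare correctly as text.

```python
from dataclasses import dataclass, field


@dataclass
class Item:
    id: str
    data: dict
    updates: int   # how many times the item has been changed
    when: str      # timestamp of the latest change (ISO 8601, UTC)
    by: str        # endpoint that made the latest change
    conflicts: list = field(default_factory=list)


def pick_winner(a, b):
    """More updates wins, then the later change, then the endpoint name."""
    return a if (a.updates, a.when, a.by) >= (b.updates, b.when, b.by) else b


def merge(local, incoming):
    """Merge an incoming set of items (id -> Item) into the local set."""
    merged = dict(local)
    for item_id, theirs in incoming.items():
        ours = merged.get(item_id)
        if ours is None:
            merged[item_id] = theirs  # new item from the other side
            continue
        winner = pick_winner(ours, theirs)
        loser = theirs if winner is ours else ours
        if loser.updates == winner.updates and loser.data != winner.data:
            # Both sides changed the same version independently:
            # keep the losing version so a person can review it.
            winner.conflicts.append(loser)
        merged[item_id] = winner
    return merged


kabul = {
    "staff-1": Item("staff-1", {"name": "New trainer"}, 1, "2008-05-01T08:00:00Z", "kabul"),
    "train-7": Item("train-7", {"attendees": 12}, 1, "2008-05-02T09:00:00Z", "kunduz"),
}
kunduz = {
    "train-7": Item("train-7", {"attendees": 15}, 2, "2008-05-03T10:00:00Z", "kunduz"),
}

# Running the same merge at both ends brings them to the same state:
# the staff list flows out to the field, and the correction flows back.
at_kabul = merge(kabul, kunduz)
at_kunduz = merge(kunduz, kabul)
```

The point of the design is that neither side exports a snapshot for the other to import. Each side holds versions of the same items, and the merge rule is symmetric, so the order in which updates arrive does not matter.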

JU: Barbara, is this two-way aspect evident to you as a user of the system?

BW: Absolutely. At first I didn't understand a lot of the terminology, and the discussions and explanations. I kept hearing the word lightweight, and I didn't really understand what that meant.

But compare it to Access replication, which basically takes the entire database and replicates it to another place -- which takes a long time, and then the Internet cuts out and you've corrupted the whole structure. Now, instead of that, you're sending just pieces of information. If it doesn't work right now, it'll work in a half hour, it just keeps trying, and it's completely lightweight and easy in that sense.

And definitely the two-way street. We were still very much developing things, and even if it were perfectly developed there are still changes that have to happen from our side. As Nigel said: staff lists. People enter training records and they have to apply them to names of staff, but we get new staff people all the time. We have to continually update the names from our side so they have an appropriate list to choose from.

NS: One of the things we thought about when we were building the job manager, which is the piece that runs the FeedSync on people's desktops, is that it's an application that just sits there. You build a bunch of jobs, and a job takes a data source and syncs it with another data source.

In the case of Mercy Corps, that means take a table from an Access database and sync it with a feed on a website that acts as a relay. That's a job, and you have one of those for each table in the database. Of course referential integrity is something you can try and manage, and there's some support for that.

The other piece is that we can run a sync to a file. It takes the table in the database and syncs to a feed, in this case an RSS feed, on a file source. If the memory stick is plugged in, and you've got things set up right, it just works. The user doesn't have to worry whether it's being exported to the right place, or about what the file is called. And similarly for the Internet case.
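
The shape of such a job manager can be sketched as follows. This is a hypothetical miniature, not the actual tool: `FlakyLink`, `run_jobs`, and the sample tables are invented, the transfer is shown in one direction only for brevity, and a real manager would wait between rounds rather than retrying immediately.

```python
class FlakyLink:
    """Stand-in for an unreliable connection: fails a set number of times, then works."""

    def __init__(self, failures):
        self.failures = failures

    def send(self, table, rows, relay):
        if self.failures > 0:
            self.failures -= 1
            raise ConnectionError("link down")
        relay.setdefault(table, {}).update(rows)


def run_jobs(jobs, pending):
    """Try each pending job once and return the names that still need a retry."""
    still_pending = []
    for name in pending:
        try:
            jobs[name]()
        except ConnectionError:
            still_pending.append(name)  # connection dropped: try again next round
    return still_pending


database = {"staff": {"s1": "new trainer"}, "trainings": {"t1": "12 attendees"}}
relay = {}                    # plays the role of the feed on the relay website
link = FlakyLink(failures=1)  # the first transfer attempt will fail

# One job per table, each pairing a data source with a destination.
jobs = {table: (lambda t=table: link.send(t, database[t], relay)) for table in database}

first_round = run_jobs(jobs, list(jobs))    # "staff" fails, "trainings" gets through
second_round = run_jobs(jobs, first_round)  # "staff" succeeds on the retry
```

Because each job moves one small table rather than the whole database, a dropped connection costs one retry of one job instead of a corrupted replica.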

JU: At this point, is the master database what's up on the server at Live Labs?

NS: Well the master source is really the database in Kabul, but yeah, the replicas are also being managed on feedsync.mslivelabs.com, where the plug is that anyone can go and set up a feed and a synchronization endpoint. All the databases sync to that, and then Kabul syncs to that and gets the data back down. And vice versa.

I should point out that FeedSync used to be called SSE [Simple Sharing Extensions], and this started back then. The first users of SSE in anger, if you like, were Mercy Corps in Afghanistan, which was exciting. But we had a lot to learn. Now they're moving to FeedSync. What that means is a slightly different version of the specification, a different service on the website, and a new version of tools I just gave to Faheem a day and a half ago -- he's the technical manager for Barbara's group. This latest version of the tools is the one that I hope we're releasing publicly in a couple of weeks.

JU: Barbara, I'm sure that going forward you'd like to see a better way to do schema evolution, so that the changes to the database structure can be part of this seamless synchronization.

BW: Yes, that would be ideal.

NS: That's a long conversation...

JU: Yes. Here's something else. I know that Mercy Corps works in parts of the world where connectivity is basically SMS more than Internet, or maybe exclusively SMS. That seems like something that FeedSync could be adapted for. Have you thought about that?

BW: Sure, some of our other offices have been using SMS as a way to share bits of information. We haven't found the need to do that yet, because we don't have that many people far out in the field who would need to enter data. In other programs where they're visiting sites and schools all over the country, then yes.

NS: We've already built an SMS adapter for this, it's in testing at the moment. And it does exactly what you suggest. Rather than sending the data over the Internet, it breaks it up into SMS packets. There can be a lot of packets, of course, and it can be expensive. But we were talking to a different NGO in Afghanistan, which operates in very insecure areas where they don't have Internet at all. There, they are very interested in doing something similar to what Barbara is doing, but using SMS. First, because in some of these areas the security is so bad they can't even be seen to be carrying data. Second, if it takes six hours to drive somewhere, the cost of a bunch of 1-cent SMS messages is a lot cheaper than the cost of the petrol involved.
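
Breaking a payload into SMS-sized packets might look something like the sketch below. The packet header format (`id:sequence:total:`), the function names, and the use of base64 are assumptions for illustration, not a description of the adapter mentioned above. The 160-character limit is the size of a single text message in the standard GSM 7-bit alphabet.

```python
import base64
import math

SMS_LIMIT = 160  # characters in a single GSM 7-bit text message


def to_sms_packets(msg_id, payload, limit=SMS_LIMIT):
    """Split bytes into numbered text packets that each fit in one SMS.

    msg_id must not contain a colon, because colons separate the header fields.
    """
    text = base64.b64encode(payload).decode("ascii")
    header_len = len(f"{msg_id}:999:999:")  # reserve room for up to 999 packets
    chunk = limit - header_len
    total = max(1, math.ceil(len(text) / chunk))
    if total > 999:
        raise ValueError("payload too large for this header format")
    return [
        f"{msg_id}:{i + 1}:{total}:{text[i * chunk:(i + 1) * chunk]}"
        for i in range(total)
    ]


def from_sms_packets(packets):
    """Reassemble packets, which may arrive in any order."""
    parts, total = {}, 0
    for packet in packets:
        _msg_id, seq, count, body = packet.split(":", 3)
        parts[int(seq)] = body
        total = int(count)
    if total == 0 or len(parts) != total:
        raise ValueError("some packets are missing")
    text = "".join(parts[i] for i in range(1, total + 1))
    return base64.b64decode(text)


payload = b"training record " * 30 + b"for Kunduz, May"
packets = to_sms_packets("t7", payload)
restored = from_sms_packets(list(reversed(packets)))  # simulate out-of-order delivery
```

The number of packets, multiplied by the per-message price, gives the cost to weigh against the alternative of driving the data to its destination.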

We've tried this, and it works quite nicely. I'm excited about that for the future. But to be honest, there are plenty of issues just keeping the Mercy Corps solution running. I wouldn't want to make you believe that this has been a dream installation where everything worked perfectly off the ground. Just the other day we ran into a problem where the feeds wouldn't sync.

BW: Sorry about that!

NS: Hey, no problem. It's now documented and it's part of the FAQ that'll go out when we release the tools.

JU: Barbara, how does Mercy Corps envision making use of the open toolkit which will be one of the outcomes of this project?

BW: There's been a cautious, wait-and-see approach from the beginning. Our IT has been a little like, oh, I don't know, this maybe could be interesting, let's see how it works. But now that people are understanding more, and seeing that this is not really a pilot any more, there's starting to be interest in ways we can be sharing information regionally, across offices, and how can other countries make use of the same technology. It's working its way into our lexicon.

JU: If there weren't the Internet problems you've had in Afghanistan, would there still be reasons to do things this way?

BW: Yes, I think there are other benefits. It's much closer to realtime, for one thing.

JU: The lightweight, near-realtime aspect is appealing even when there's enough bandwidth to do more heavyweight replication?

BW: Yes. And also, for us, going from one Access database to another identical database is one thing. But some of our initial discussions were about sharing across platforms that may not be identical, but use common variables. Across regions or countries, we all need to report certain pieces of information, but we collect it in different ways, and store it in different ways. If there's a system where we can upload it similarly, that would be a huge benefit.

JU: Great point. You could define a neutral common ground for data exchange.

BW: Exactly.
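
A neutral common ground of this kind can be pictured as a mapping from each office's local field names onto a shared set of variables. The sketch below is purely hypothetical: the source names, field names, and records are invented, and a real exchange format would also need to reconcile units and codes.

```python
# Each office keeps its own column names; the map says which shared
# variable each local column corresponds to. All names are invented.
FIELD_MAPS = {
    "afghanistan_access": {
        "Farmer_Name": "beneficiary",
        "SalesAFN": "sales_local",
        "Prov": "region",
    },
    "other_office_excel": {
        "name": "beneficiary",
        "sales": "sales_local",
        "district": "region",
    },
}


def to_common(source, record):
    """Translate one local record into the shared variables, dropping extras."""
    mapping = FIELD_MAPS[source]
    return {common: record[local] for local, common in mapping.items() if local in record}


combined = [
    to_common("afghanistan_access",
              {"Farmer_Name": "Farmer A", "SalesAFN": 1200, "Prov": "Kunduz", "Notes": "n/a"}),
    to_common("other_office_excel",
              {"name": "Farmer B", "sales": 800, "district": "North"}),
]
```

Once every source is expressed in the same variables, records from differently designed databases can be reported side by side without anyone changing how they collect or store the data.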

NS: For Mercy Corps, there's a whole pile of options they should consider when doing sync. FeedSync isn't the be-all and end-all, it's got some particular things it seems to be good for, but that last point about interoperability is really important.

From the start, we've been interested in how to use this to link up disparate systems. Sometimes that might be an Excel spreadsheet to a database. But also different organizations or, in Mercy Corps' case, different countries where they work run slightly different database designs.

That's where I think the real strength of the system will lie. We've got a lot of work to do thinking about how to make that better.

One of the things we're really concerned about is that, if Mercy Corps or another group wanted to roll this out, there would be the support to do that. I don't think we've got that perfect yet by any means, but we showed what we're doing to a bunch of other NGOs, and afterward a number of them were interested in taking this kind of tool -- be it FeedSync or some other -- and using it for their programs.

The best example might be where you've got a whole pile of agencies implementing a program. In Afghanistan there's a thing called the National Solidarity Program, and it's run all over the country. All the reporting happens in a standard format, but every organization has its own way of managing the process. The challenges are to integrate the data, and pass back success and lessons learned. The big NGOs have their own systems, almost all in Access, all with some of the same schemas because the ministry says this is how you will report, but no way to aggregate all that nicely.

JU: This makes good sense. Thanks!

NS: Yeah, thanks Barbara. And let me know how Faheem's getting along. There may be some issues, but I hope everything's OK.

BW: Yeah, it's good, thanks so much.

Blog.PostedBy: Jon Udell | May 22nd @ 8:42 AM

Caroline Arms is an information technologist who came to the Library of Congress to work on the American Memory project. The challenge of preserving digital content captured her interest, and her work since has focused on understanding and promoting formats that raise the probability that content will be usefully available to future generations. She is the co-compiler, with Carl Fleischhauer, of the Digital Formats website, and a member of the committee to standardize Office Open XML.

Caroline Arms


JU: I'm interested in your perspective on XML's role in the preservation of documents for the long term.

CA: I'd like to be able to go broader than XML. It's one aspect, but it's not the only answer. When we're talking about the challenge of preserving digital content we usually think more broadly.

JU: Great point. Of course there's a whole range of issues, from how you keep the disks spinning to... well, let's step back and talk about acid-free paper, which may be a more durable format than anything we've done electronically.

CA: Absolutely.

JU: So, OK, give us the broad view of how you have approached this problem at the Library of Congress.

CA: The Library's mission is to make its resources useful and available to Congress and to the American people, and to sustain and preserve a universal collection of knowledge and creativity for future generations.

Congress funded the National Digital Information Infrastructure and Preservation Program (NDIIPP), and I've been working as part of that since the early 2000s.

The program looks for every opportunity to raise the probability that content created today will be usable by those future generations.

I first came to the library to work on American Memory, which was digitizing out-of-copyright materials and making them available to everybody.

JU: Of course that project isn't just a resource for future generations...

CA: Right. So, there are many ways to think about raising that probability. The program is trying to build a network of organizations committed to the stewardship of digital content. Not just traditional libraries and archives, but certainly including them.

You mentioned spinning disks. We try to have conversations with storage vendors, and try to explain how we see the requirements for long-term cultural archives as being a little different from those for business continuity.

You also mentioned acid-free paper. In the book age, we can take in a book, make sure it's on acid-free paper, and it will still be there a hundred years from now. The phrase "benign neglect" gets used. Paper survives benign neglect. Digital content doesn't.

JU: It's a paradox. Recently I visited my parents, and we found a box of correspondence they had written from a yearlong trip to India many years ago. I realized that my own correspondence is probably a lot less likely to be available to my kids or grandkids.

CA: I have exactly the same experience. My father was away at the battlefront in World War II, and he wrote as frequently as he could. My mother still has all those letters. Today's forces are using email and cellphones and other ephemeral means of keeping in touch.

It's amazing to read the letters discussing what my name would be, because I was on the way.

JU: Of course once that box of letters is lost, it's lost. There is no backup, there are no perfect copies. It's a paradox that we're in an era when you can make perfect copies, and distribute them as widely as you want, so you'd think that superabundance would save the day, but that's not necessarily true.

CA: No. You have to act at the time of creation in order to up the probability. This is true for your own digital photographs, and for libraries. So we try to influence the early stages of content creation.

JU: How?

CA: Working on standardization efforts is one way. Another is to form partnerships that try to exploit synergies with content creators. We just look for opportunities in different industries.

For example, the scholarly publications industry has an interest in preserving their own content, and they also want it to be accessible through libraries, so we find synergies there.

JU: Of all the businesses I know, that one is most sophisticated in its thinking, and in its efforts toward long-term preservation. Those folks really get it, and have done a lot of good work to enable a level of fidelity and persistence that is unheard of elsewhere.

CA: Another community working toward this is the professional photographers. These are mainly individuals, not corporations, but they're realizing that for their own business purposes they need to have good practices. And the practices that are good for them are pretty much aligned with the practices that we believe will be helpful.

JU: What are some of those practices, and how do you interact with that group to help foster them?

CA: In the NDIIPP program, we've had some money we've been able to give out as awards. A couple of recent awards are to associations of photographers, and they're all to do with exploring what the good practices should be, and promoting them. So, discussion of formats, and in particular for photographs, the capturing and recording of metadata. Understanding what the tools that photographers use do, or don't do, about accumulating and retaining metadata.

Wise choice of format is important, but we don't think there's a single best format. In thinking about formats, the two key factors are disclosure -- that is, are specifications available -- and adoption. The more widely used a format is, the less likely that archival institutions will have to foot the bill for migrating it, or maintaining tools to render it.

We are interested in understanding the formats that are widely used, and promoting practices that will use those formats in good ways.

JU: Does this boil down to recommendations that the Library of Congress has made to photographers?

CA: No, it's working with photographers to find the synergies between the requirements, have the photographers promote the best practices, and perhaps to suggest what will be even better for us. But we don't have that much influence over what formats creators and publishers use. We have to learn to be able to handle the most widely used formats.

JU: Given that, what practices do you find most useful, and why?

CA: With photographs, there is value to us and to photographers in retaining as much color and spatial information as possible. The Library will be accepting a variety of formats. If your camera takes only JPEG, there's no point in going for anything else. But in general libraries and photographers have liked to keep full-resolution images without lossy compression. TIFF has been a standby, but it's not very good for embedding metadata.

There are explorations at the moment on formats and tools for getting metadata into images. Many photographers are positive about Adobe's DNG format, with XMP metadata, which is XML-based. An advantage of XMP is that you can embed it in images, or handle it as XML outside the image. XMP as a vehicle is now being supported by more and more tools.

But then within it, you have to have practices about what elements you record. In the photography world, the leading community is photography for journalism, so IPTC (International Press Telecommunications Council) is the leading metadata standard as far as elements are concerned.

This is a case where the Library has its own metadata standards, and we don't want to lose all the experience and compatibility with our own systems and tools, but clearly the commercial market and the equipment is gathering around the IPTC metadata schema. So we need to adjust our practices so we can take advantage of that.
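To make the point about handling XMP "as XML outside the image" concrete, here is a minimal Python sketch that reads a sidecar-style XMP packet with the standard library. The sample packet, its values, and the function name `read_xmp` are invented for illustration, and it uses Dublin Core elements only; it is not a description of the Library's tooling or of any photographer's workflow.

```python
import xml.etree.ElementTree as ET

# Namespaces commonly used inside XMP packets.
NS = {
    "rdf": "http://www.w3.org/1999/02/22-rdf-syntax-ns#",
    "dc": "http://purl.org/dc/elements/1.1/",
}

# A made-up sidecar packet; the creator and keywords are placeholder values.
SIDECAR = """<x:xmpmeta xmlns:x="adobe:ns:meta/">
  <rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#">
    <rdf:Description rdf:about=""
        xmlns:dc="http://purl.org/dc/elements/1.1/">
      <dc:creator>
        <rdf:Seq><rdf:li>Example Photographer</rdf:li></rdf:Seq>
      </dc:creator>
      <dc:subject>
        <rdf:Bag><rdf:li>harbor</rdf:li><rdf:li>dawn</rdf:li></rdf:Bag>
      </dc:subject>
    </rdf:Description>
  </rdf:RDF>
</x:xmpmeta>"""


def read_xmp(xml_text):
    """Return creators and keywords from an XMP packet held as plain XML."""
    root = ET.fromstring(xml_text)
    creators = [li.text for li in root.findall(".//dc:creator/rdf:Seq/rdf:li", NS)]
    keywords = [li.text for li in root.findall(".//dc:subject/rdf:Bag/rdf:li", NS)]
    return {"creators": creators, "keywords": keywords}


print(read_xmp(SIDECAR))
```

Because the packet is ordinary XML, the same few lines would work whether the metadata was pulled out of an image or stored alongside it.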

JU: You've said that you look to leading practitioners, like professional photographers and journalists, but of course anyone can produce something which -- though we won't know it at the time -- could prove to be of great cultural significance. So we have to hope the standards and practices trickle down to everybody, right?

CA: Yes. The standards and practices supported in cameras and software, or in Flickr and the other management services, those are all part of the environment that we're working in, and that we have to be conscious of.

The rapidity of change is a real challenge for us. The book in its hard cover on the shelf has been there for a long time, and will continue to be. But in the digital world things change very quickly.

JU: It's a huge challenge, and we've yet to see the emergence of a way of dealing with this that would separate various concerns. Storage, for example, is a separable concern. It should be possible for individuals and organizations to choose from a range of storage options which would offer a range of preservation guarantees.

CA: Absolutely.

JU: And that wouldn't necessarily be tied to other kinds of arrangements. You mentioned Flickr. On the one hand people are using it for archival purposes. But it's also a catalog, it's also a database, it's also an environment for sharing and use. We're bundling all those concerns together right now, and that makes it difficult to get at what really matters to you in a rational way.

CA: I agree entirely. Flickr is not making any commitments to the way it's archiving the content. It is tricky. In the last few years, these big services provided by Amazon and Google are a complete change in the business model for these things. But it's interesting that the storage service from Amazon has taken off in a way that some other attempts failed. There were several others, but they couldn't build the market and the trust. I think that somehow Amazon has the trust of people because it clearly has a big problem of its own. People trust that it will take good care of its own content, and that somehow it will solve these problems. So although as you say things aren't separate, in a way the building of trust can't necessarily be separate.

JU: Of course there are no long-term guarantees. This is where the scholarly publication folks have done the most thoughtful and intense work. They've even thought through what happens when the organization hosting the content fades away, and have seen that there needs to be a federation of cooperating businesses that transcends any individual organization.

I should be able to swap out Flickr's storage backend for a service that offered long-term guarantees, for which I'd pay a premium. That's not an option for anyone yet, but there's a whole slew of interesting business opportunities there for lots of players in lots of niches.

CA: What's unpredictable is quite how they will develop. It's a mixture of general moves in the technology and particular organizations deciding to go in a certain direction. And then the market, whether it's consumers or industry sectors, coming together to create critical mass.

What we found in NDIIPP is that it's very hard to drive this process. You can nudge, and promote awareness of problems, but what has actually emerged in the last year or two is probably quite different from what people were talking about in 2001 when the program got started.

JU: Where do you feel you have been successful in doing some nudging and promotion?

CA: I think some of the standardization efforts, for PDF/A, the archival format for PDF, and Office Open XML, are examples of where we've been able to play a role in moving in the right direction.

JU: The Library of Congress has been involved in both of those standardization efforts?

CA: Yes. In the PDF/A case, which happened first, this was an activity stimulated by the wishes of archival institutions and especially the legal and judicial community to have an archival document format that could substitute for paper.

The standard came out I think in 2004, and there are an increasing number of tools which can save in this format. It primarily outlaws features which are difficult for preservation.

JU: I was going to ask you to clarify that, because I think many people would say that PDF itself is a good archival format.

CA: The PDF/A format outlaws embedded audio and video, it requires that the text in the PDF be in reading order, it requires that the fonts used be embedded -- because in many cases PDF relies on the fonts you have on your computer -- and it requires that the fonts be legally embeddable. It also outlaws encrypting, and mandates XMP metadata.

JU: Do you think these restrictions tend to be easy to meet, or are they onerous?

CA: My guess is that in ordinary office documents, and documents that get submitted for court cases, it probably is not onerous.

JU: So in terms of Office Open XML, how did you approach that?

CA: We joined that effort after it was already underway. We learned that the British Library was actively involved, and we shared their interest. The general move to XML-based formats for text documents, and for the other office productivity documents, seemed to us like a very good move.

XML files, that you can look at with simple tools and hopefully understand the tag names, offer inherent advantages.

As I said, the two most important factors for preservability are disclosure and adoption. By disclosure we mean that the specification exists, in a public way, that will continue to be available. Clearly to have it exist as an international standard by a known standards organization raises the probability that it will continue to be available and used.

As to adoption, clearly the Microsoft products are widely adopted, and libraries will be collecting content produced by those applications. So this seemed like a good opportunity to influence the public availability of the specification.

JU: It's an interesting question as to what extent the Library will wind up interacting directly with documents produced by those applications, versus receiving content from organizations like scholarly publications, who are now for example beginning to be able to accept articles that were authored in Microsoft Word, but delivered in the NLM -- National Library of Medicine -- XML formats.

CA: Yes, you're right. Our traditional collecting has mainly been of published materials, and we expect they'll be in some form other than what your word processor creates. But we also collect the personal papers of famous individuals, so I'm sure we already have quite a lot of documents in word processing formats.

We believe it's important to be involved early in the content creation life cycle. If the tools begin to record more information about the transformations that go on, that's of value.

And beyond standard text documents, a phenomenal amount of valuable information is currently stored in PowerPoint files. Or, information that we might have collected on paper may be available as spreadsheets. We can't afford to assume that things will remain the way they are.

We're harvesting lots of documents from the web that may not have been published through traditional channels, and those are likely to be word processing documents or PDFs.

I'm confident we'll have plenty of documents in word processor formats that we will have to try to preserve.

JU: Of course the preponderance of what would be the equivalent of personal papers, at least for a certain era, will be email. And unfortunately we don't have any XML standards governing email.

CA: Right. So email is not something the Library of Congress spends a lot of time thinking about, but another government organization, NARA, the National Archives, for them email is very important. They capture the records of government agencies, and of each administration as it transfers power to the next.

So, I must mention that there are other XML formats. The Open Document Format is also a very important development for us, and we hope that it will be adopted. We have to keep an open mind and see where the marketplace moves.

We see that the general movement to XML-based formats, wherever they are appropriate, is a good thing.

JU: Yes. Whatever the XML format, there's a huge amount of untapped potential in the interweaving of content and metadata and, actually, data -- rows and columns sorts of data which are well represented in XML formats. The numbers in spreadsheets and databases are a form of content that is merging with documents, and should.

CA: Absolutely. One of the projects I've been involved with goes under the name Data-PASS. It's a consortium of social science data archives. They have a descriptive standard, it's a multi-level standard with a rich XML structure that supports the online subsetting of the data.

JU: So I think we're having this conversation in the nick of time, because you're retiring next month, right?

CA: Yes, in late May, actually. But I expect still to be engaged in the area. It's been a very exciting time, and I hope still to be involved even if I'm trying to have more time for family and travel.

JU: I hope so too. So, the challenges are daunting, but I think you're mostly optimistic about the future.

CA: Yes. I've learned to take a long-term perspective. You do see that even though the steps are small, there are lots of steps being taken in hopeful directions. Eventually these problems will be worked out. And as people become aware that this is not just a problem for libraries and archives, but also, as you've pointed out, for their own correspondence, their own photographs -- and also that businesses share the same problems -- I'm confident we're moving in the right direction. And I'm glad to have helped.

JU: Thanks!

Blog.PostedBy: Jon Udell | May 15th @ 9:20 AM

WinFS was an ambitious effort to embed an integrated storage engine into the Windows operating system, and use it to create a shared data ecosystem. Although WinFS never shipped as a part of Windows, many of the underlying technologies have shipped, or will ship, in SQL Server and in other products. In this interview Quentin Clark traces the lineage of those technologies back to WinFS, and forward to their current incarnations.

Quentin Clark led the WinFS project from 2002 to 2006. He's now a general manager in the SQL Server organization.


JU: You made a fascinating remark last time we spoke, which was that most of WinFS either already has shipped, or will ship. I think that would surprise a lot of people, and I'd like to hear more about what you meant by that.

QC: WinFS was about a lot of things. In part it was about trying to create something for the Windows platform and ecosystem around shared data between applications. Let's set that aside, because that part's not shipping.

JU: So you mean schemas that would define contacts, and other kinds of shared entities?

QC: Yeah. That's a mechanism, a technology required for that shared data platform. Now the notion of having that shared data platform as part of Windows isn't something we're delivering on this turn of the crank.

We may choose to do that sometime in the future, based on the technology we're finishing up here, in SQL, but it's not on the immediate roadmap.

JU: OK.

QC: Now let's look under the covers, and ask what was required to deliver on that goal. It's about schemas, it's about integrated storage, it's about object/relational, a bunch of things. And that's the layer you can look at and say, OK, the WinFS project, which went from ... well, it depends who you ask, but I think it went from 2002 until we shut it down in 2006 ... what was the technology that was being built for that effort, in order to meet those goals? And what happened to all that stuff?

You can catalog that stuff, and look at work that we're doing now for SQL Server 2008, or ADO.NET, or VS 2008 SP1, and trace its lineage back to WinFS.

JU: Let's do that.

QC: OK. I guess we can start at the top, with schemas. We're not doing anything with schemas. At the end of the WinFS project we had settled on a set of schemas. It was a very typical computer science problem, where the schemas started out as a super-small set of things, and then became the inclusion of all possible angles, properties, and interests of anybody interested in that topic whatsoever. We wound up with a contact schema with 200 or 300 properties.

Then by the time we shipped the WinFS beta we were back down to that super-small subset. Here's the 10 things about people that you need to know in common across applications.

But all that stuff is gone. The schemas, and a layer that we internally referred to as base, which was about the enforcement of the schemas, all that stuff we've put on the shelf. Because we didn't need it. It was for that particular application of all this other technology.

So that's the one piece that didn't go anywhere.

Next layer down is the APIs. The WinFS APIs were a precursor to a more generalized set of object/relational APIs, which is now shipping as what we call entity framework in ADO.NET.

What's getting delivered as part of VS 2008 SP1 is an expression of that, which allows you to describe your business objects in an abstract way, using a fairly generalized entity/relationship model. In fact we got best paper at SIGMOD last year on the model, it's a very good piece of work.

So you describe your business entities in that way, with a particular formal language...

JU: For people who haven't seen this, how would you characterize that language?

QC: It's pretty standard entity-relational. It's really a matter of describing to the system a set of properties and collections and relationships among entities. The important thing we tell people is to describe their entities as they think about them. Not as they think they should be expressed in a fully normalized database schema, and not as they need to program to them as objects, but in terms of how they think about them, and want to be able to report on them, or interact with them.

From there we can derive objects you can program against, we can derive schemas to build a store of them.
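As a rough illustration of that idea (one abstract description, from which both programmable objects and a storage schema are derived), here is a small Python sketch. The `ENTITIES` format, the entity names, and both helper functions are hypothetical; this is not the Entity Framework's language or API, only the general shape of the approach.

```python
from dataclasses import make_dataclass

# Hypothetical entity description: properties with types, plus relationships.
ENTITIES = {
    "Customer": {"properties": {"name": str, "city": str}, "has_many": ["Invoice"]},
    "Invoice": {"properties": {"total": float}, "has_many": []},
}

SQL_TYPES = {str: "TEXT", float: "REAL", int: "INTEGER"}


def derive_classes(entities):
    """Build one plain Python class per entity to program against."""
    return {
        name: make_dataclass(name, list(spec["properties"].items()))
        for name, spec in entities.items()
    }


def derive_schema(entities):
    """Build CREATE TABLE statements, adding a foreign key per relationship."""
    columns = {name: ["id INTEGER PRIMARY KEY"] for name in entities}
    for name, spec in entities.items():
        for prop, py_type in spec["properties"].items():
            columns[name].append(f"{prop} {SQL_TYPES[py_type]}")
        for child in spec["has_many"]:
            columns[child].append(f"{name.lower()}_id INTEGER REFERENCES {name}(id)")
    return [f"CREATE TABLE {name} ({', '.join(cols)})" for name, cols in columns.items()]


classes = derive_classes(ENTITIES)
print(classes["Customer"](name="Ada", city="Leeds"))
for statement in derive_schema(ENTITIES):
    print(statement)
```

The description says nothing about normalization or object layout; both are decided by the derivation step, which is the separation being described.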

The traceback to WinFS is that we had a very fixed way of doing this for a particular set of entities. We built the schema around items, and items were entities that had relationships to other items. We built this whole model on a more generic substrate that we never expressed.

So we said OK, we didn't ship the WinFS APIs, but we have this asset, a more generalized expression framework for entities, let's figure out how to finish that work up, and get that delivered as part of the next ADO release.

This stuff is now very well integrated with LINQ. You can do LINQ to relational, where LINQ will look down into the database, look at the schemas that are there, and express that directly up into LINQ. Or you can do LINQ to entities, which allows you to have a layer of abstraction between what you're programming to and your underlying physical database schema.

That work is ongoing, we're getting good feedback, we'll see how far it takes us.

JU: How much continuity is there in terms of the team?

QC: A lot. When I did the reorg, I had an Excel sheet of everyone in the organization and where we were moving them to. Last I looked at it, 80-plus percent of the team was still in SQL somewhere.

One of the interesting things about WinFS was that we started hiring a different kind of person. The database team is full of traditional hardcore systems database guys. When we did WinFS we were looking for a different thing.

JU: In fact you don't consider yourself to be a hardcore database guy, right?

QC: Right. I'm a good example. I started at Microsoft in the Word group, and went from there to IIS to something called Application Center, worked on the manageability technologies for a while, and then was asked to come over and do WinFS. So my background was much more about how to use databases, how do you build apps around them, and not so much what are the internal algorithms you should use for bitmap indexing.

Of course we had a lot of folks from the core database team, but we hired a lot of folks that had experience with compilers, with user interfaces, with building apps on the database. A lot of those folks who were leading the API effort for WinFS are now leading the API effort for all of SQL.

So that's the story for the API team. As for the rest of it, well, there's obviously a big chunk around file systems. If you want to do this shared data model, you want it to be applicable to all data, not just things you can express relationally. So we had to figure out how to merge database constructs with file systems.

A lot of people thought this was impossible, and would harken back to Cairo and various other projects announced and unannounced to the public world around integrated storage, that didn't necessarily produce fruit.

We had one key advantage. We found an architectural approach that allowed us to control the semantics, and provide transactional database consistency over the files that were involved, while still allowing the file system to be in control when it came to file-handle-level operations.

We did it with a kernel driver that allowed us to control the namespace, and keep the database involved. The database lives up in user mode. As far as the operating system is concerned, there's no difference between SQL Server and Microsoft Word. They're high-level user-mode apps that occasionally drop down and make requests of the kernel.

So there was a fundamental disconnect. How do we maintain control over this low-level system concept, the file system, by a user-mode app? We built a kernel-level driver to communicate back to the user-mode SQL process. It had a cache of what things should look like, and what things are in what state, but it was there along the API path for the file system, to allow it to control the namespace operations over files that were "in" WinFS.

People would often ask me if WinFS was a file system, and I'd struggle with the answer to that, because, well, you know, from a certain standpoint the answer is yes. The stuff I saw in the shell, was it in the WinFS filesystem? Well, OK. But there are no streams inside the database. So from a user perspective, those files were "in" the filesystem. But from an API perspective it was more nuanced than that. I could still use the Win32 APIs, get some file, open it, and from that point forward the semantics were exactly like NTFS. Because it was NTFS at that point.

There was a certain place along the API chain where the database was completely out of the way. This allowed us to get the perfect compatibility that had tripped up other integrated storage efforts in the past. Other efforts tried to get this compatibility by emulating all the Win32 APIs, which is tough. And the performance bar is very high.

JU: So how does this carry forward, if it does?

QC: It does. That approach was so good that we decided to generalize it for SQL Server 2008, as a feature called filestream. It's basically a new kind of blob support for the database. You configure a column for filestream, you can take a file and insert it as a record, you get back a file handle, you can stream things into that file handle. You can do queries and get back file handles, and get streaming API-level NTFS performance on the files you put in there.

What we have not done is the namespace support. So you don't get to walk through a directory of files. You examine a row, you ask that row to give you back the right token, you start doing the Win32 operations on it.

But the rest is integrated. You back up the database, you back up the filestream. From most perspectives -- except mirroring, which we didn't get to fully integrating -- it looks like any other blob.
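The access pattern described here (insert a record, stream bytes into a file, later query the row and get back a handle) can be imitated in miniature. The Python sketch below uses SQLite and ordinary files purely as stand-ins; the `BlobStore` class and its methods are invented for illustration. It is not the SQL Server filestream API, and it offers none of the transactional or backup guarantees discussed above.

```python
import os
import sqlite3
import tempfile
import uuid


class BlobStore:
    """Toy store: rows live in SQLite, blob bytes live in ordinary files."""

    def __init__(self, directory):
        self.directory = directory
        self.db = sqlite3.connect(":memory:")
        self.db.execute(
            "CREATE TABLE docs (id INTEGER PRIMARY KEY, title TEXT, token TEXT)"
        )

    def insert(self, title, chunks):
        """Insert a record and stream its content into a backing file."""
        token = uuid.uuid4().hex
        with open(os.path.join(self.directory, token), "wb") as handle:
            for chunk in chunks:
                handle.write(chunk)
        cursor = self.db.execute(
            "INSERT INTO docs (title, token) VALUES (?, ?)", (title, token)
        )
        self.db.commit()
        return cursor.lastrowid

    def open(self, doc_id):
        """Query the row, then hand back a plain file handle for streaming."""
        row = self.db.execute(
            "SELECT token FROM docs WHERE id = ?", (doc_id,)
        ).fetchone()
        return open(os.path.join(self.directory, row[0]), "rb")


with tempfile.TemporaryDirectory() as directory:
    store = BlobStore(directory)
    doc_id = store.insert("scan", [b"first chunk, ", b"second chunk"])
    with store.open(doc_id) as handle:
        print(handle.read())
```

In this toy version the application keeps the row and the file in step by hand, which is exactly the glue work an integrated feature is meant to remove.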

JU: Where do you see that being used to good effect?

QC: Right now there's a choice people have to make. There's a size limit on blobs in the database, because we put them inside database pages, and that leads to a performance problem as well. If you want to pull a 2-gigabyte stream out of the database with traditional blobs, it's not as performant as walking up to NTFS and using a file handle. We have to recreate the file by putting together a series of database pages that are themselves a level of indirection on file system pages.

So people today have to make a choice. Do I want the integration with the database, so backup works, my transactional semantics work, all this stuff works, and live with the performance and size limitations? Or do I want the best possible performance, and basically no limitations on size, by putting things in the file system, and then having my application logic figure out how to glue together the database world and these files that are now strewn about the file system? And when I do a backup, then I also have to teach my operations guys that when you back up the database you're not backing up all the data, you also have to worry about these files the database knows nothing about.

With filestream, people don't have to make the choice. They get the performance they want, with the database integration they expect.

Now the next place to take that, after 2008, is to add Win32 support. So we did this other feature as part of WinFS, which we're calling hierarchical ID. It's a column type, a new column type, which creates hierarchy support in the database.

We did this for WinFS because obviously if you're storing your data in a filesystem-like hierarchy, you need to be able to do things like show me all the stuff in this folder, and answer that query lickety-split. You can't be walking through record by record looking for matches.

JU: Or dealing with the SQL way of expressing hierarchy, which is doable but beyond my comprehension.

QC: Yeah, it's hard. The fundamental problem is that the query processor doesn't understand the concept of path. It understands matches on columns. It can find substrings within records, but it's kind of brute force. You can use fulltext indexing, but...

JU: ... but you don't get containment for free.

QC: That's right. So hierarchical ID is a column type that teaches the optimizer about hierarchy, about path, so you can do queries that find all the things contained within this part of the path.

So we have that feature also shipping in 2008, and there are all sorts of different uses for it. For example, people use it for compliance. They'll create a hierarchy of different confidentialities and compliance levels. This thing is confidential, which is a superset of things that are executive-eyes-only. Hierarchies like that are just out there in the world.
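One common way to get "containment for free" from an ordinary index is to store each item's full path as its key, so that everything under a folder sits in one contiguous key range. The Python sketch below shows that trick with SQLite; the table, the sample paths, and the `contained_in` helper are made up, and this is an illustration of the general idea rather than a description of how SQL Server's column type works internally.

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE items (path TEXT PRIMARY KEY)")
db.executemany(
    "INSERT INTO items VALUES (?)",
    [
        ("/docs/",),
        ("/docs/a.txt",),
        ("/docs/reports/",),
        ("/docs/reports/q1.xls",),
        ("/docsold/x.txt",),
        ("/music/song.mp3",),
    ],
)


def contained_in(folder):
    """Return every path strictly below folder (which must end with '/')."""
    # '0' is the character right after '/' in ASCII, so replacing the trailing
    # slash with '0' gives an upper bound just past all keys under folder.
    upper = folder[:-1] + "0"
    rows = db.execute(
        "SELECT path FROM items WHERE path > ? AND path < ? ORDER BY path",
        (folder, upper),
    )
    return [row[0] for row in rows]


print(contained_in("/docs/"))
```

The query is a range comparison on an indexed key rather than a substring match on every record, which is the difference between a path-aware lookup and the brute-force approach.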

JU: How do you build and visualize them?

QC: You tell us about them. You express the form of your hierarchy, and you populate the records accordingly. But I don't think there's a tool yet.

JU: So there's the filestream piece, and the hierarchical ID piece, and then the Win32 namespace piece is the shoe that hasn't yet dropped?

QC: That's right. In the next release we anticipate putting those two things together, the filesystem piece and the hierarchical ID piece, into a supported namespace. So you'll be able to type //machinename/sharename, up pops an Explorer window, drag and drop a file into it, go back to the database, type SELECT *, and suddenly a record appears.

Potential uses for that? It's all over the place. Take our own expense reports. We used to have these Excel form templates, and you'd fill it out and submit it to some system. Then we hit a phase where it was all online, so you're on the plane home and too bad for you. But imagine they could reintroduce that template again, and you could save that Excel file directly into the database.

Or more importantly, if you go to edit the thing, you don't have this process where you've taken a copy of the thing, you're editing it, you're sending it back through a mid-tier system that then has to reconcile the database records with the filesystem records. I can just say, oh, I need to add three more things. I double-click, and yes I'm still interacting with some web-based app, but the links I get are real Win32 links. I open the thing, I edit it, I stick it back, everything knows that it was changed within the right transactional semantics.

People are constantly having to bridge between the file world, and the world of data around the files. Providing Win32 support gives developers the opportunity to allow the desktop clients to directly interact with a file that's part of some application, without having to go through all the semantics of the mid-tier.

Are there always going to be some applications that will want to have mid-tier control over every aspect of every part of every workflow? Of course. But from a productivity standpoint, to be able to allow people to build applications more quickly, to be able to customize applications and not have to manage all those semantics themselves, that's huge.

Sync is another topic, but imagine we build the right things around synchronization, so people can take the files offline. It's a major productivity gain. As a developer, you know the consistency of the world you're dealing with. You're not having to create and manage and upload and deal with copying all on your own.

JU: You've alluded to the downside already, which is that it now becomes a new data management discipline that is neither familiar to the people from the filesystem world nor from the database world, it's a hybrid, and that's an obstacle.

QC: Sure, there's a learning curve, as with any other new technology.

So, that's the filesystem piece, and I'm really proud of the work we've done there. We're introducing the kernel driver in 2008, we're giving people this nice marriage between the two worlds, and then we get to take that next step in the next release and give people the complete picture.

I can live with the argument that we don't have integrated storage yet. Yes, we have filestream blobs in the database, which is a big step. We have the performance and the database consistency all in one package, and that's a huge step forward. But when we have Win32, at that point, unarguably, we have integrated storage.

JU: How do you think that plays out as the center of gravity shifts toward the cloud?

QC: There is no app in the world that doesn't need a database. Every cloud app has one under the covers somewhere. One thing we've learned in the last few years is that the fuzziness between structured data and unstructured data is just increasing. The major online apps that I interact with have both. You know, Hotmail has attachments. And they have limitations on attachments because they have trouble managing sizes and whatever else.

We have things now where people can create some space, put some files up there, but man, if you want any metadata around those files, too bad, it's just a dumb blob store.

JU: What I'm getting to here is that, well, part of the challenge for WinFS as originally conceived, with a heavy client component, was: How do you get the network effects? Five years later the center of gravity has shifted, there are shared spaces in the cloud where those effects can happen.

QC: Yes. And I think the technology we're building is underlying technology for the cloud apps. All of our major properties are built on SQL, and they want to use this stuff, we have work going on there, pre-release work to take advantage of these features, because they want them.

From a business standpoint, my first concern is how to provide value to our customers. And those are our customers. The people building the cloud apps are our customers.

Now, beyond that, one of the things we used to say about WinFS was that it was the world's best mashup playground, because you had all the data in one place. In the mashup world you're talking to one service at a time.

Do I think that the opportunity to build applications that solve real end-user problems building on technology like this continues to thrive? Sure.

When I think about the enterprise space, which is primarily where we sell SQL, they want this. They want a repository, and they want it not to be restricted on the types of data it has.

You'd be surprised, SQL's behind some of the biggest cloud services on the planet. And our customers who are building them have been struggling with this structured-versus-unstructured data problem.

Filestream alone gives them the answer. They don't so much need the Win32 aspect, because they have enough app development expertise in the mid-tier to bridge this stuff reasonably well. But they do want the transactional and backup consistencies that filestream gives them.

JU: Is that ultimate mashup playground also a good environment in which to iteratively work out what some key schemas need to be?

QC: Yeah, that leads to another interesting point. Going through the litany of technologies that have come from WinFS, one of them is the notion of what I refer to as semi-structured records. The schema is not necessarily all that well defined at the outset of the application. How does the database handle that? We had built WinFS around a feature called UDTs, which is a column type -- a CLR type system type.

We finished that up, and we built a whole spatial datatype on it in SQL Server 2008. It's all good stuff.

But when we stepped back and looked at the semi-structured data problem in a larger context, beyond the WinFS requirements, we saw the need to extend the top-level SQL type system in that way. Not just UDTs, but to have arbitrary extensibility.

So we did this feature in SQL Server 2008 that we internally refer to as sparse columns. It's a combination of various things. First, a large number of columns. Right now there's a 1024 limit on the number of columns in a single SQL table. We're way widening that out.

That comes of course with the ability to store data that's very sparsely populated across a large number of columns. In SQL Server 2005 we actually allocate space for every column in every row, whether it's filled or not.
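
Editor's sketch: the storage difference described here can be illustrated in a few lines of Python. This is only an analogy, not SQL Server's actual row format. The column names and the helpers `dense_row`, `sparse_row`, and `read` are hypothetical, introduced purely for illustration.

```python
from typing import Any, Dict, List, Optional

# Hypothetical wide table: 1024 declared columns, most of them empty per row.
COLUMNS = [f"attr_{i}" for i in range(1024)]


def dense_row(values: Dict[str, Any]) -> List[Optional[Any]]:
    """Reserve one slot per declared column, whether it is filled or not."""
    return [values.get(name) for name in COLUMNS]


def sparse_row(values: Dict[str, Any]) -> Dict[str, Any]:
    """Keep only the columns that actually hold a value."""
    return {name: value for name, value in values.items() if value is not None}


def read(row: Dict[str, Any], column: str) -> Optional[Any]:
    """In the sparse layout, an absent column reads back as NULL (None)."""
    return row.get(column)


record = {"attr_3": "red", "attr_700": 42}
dense = dense_row(record)
sparse = sparse_row(record)
print(len(dense), len(sparse))  # slots held by each layout
```

With two populated attributes, the dense layout still carries all 1024 slots while the sparse layout carries two entries, which is the trade-off a sparsely populated wide table is meant to exploit.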

JU: This is what the semantic web folks are interested in, right? Having attributes scattered through a sparse matrix?

QC: That's right. And that leads to another thing which we call column groups, which allow you to clump a few of them together and say, that's a thing, I'm going to put a moniker on that and treat it as an equivalence class in some dimension.

Then we have something called filter indices, where instead of creating an index that spans all the records in a table, you can specify what records it applies to.
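
Editor's sketch: an index restricted to a subset of rows can be tried out with SQLite, which supports a comparable feature it calls partial indexes. The example below uses Python's standard `sqlite3` module as a stand-in; it is not SQL Server syntax, and the `items` table, its columns, and the index name `idx_book_price` are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE items (id INTEGER PRIMARY KEY, kind TEXT, price REAL)")
conn.executemany(
    "INSERT INTO items (kind, price) VALUES (?, ?)",
    [("book", 12.0), ("photo", None), ("book", 30.0), ("song", None)],
)

# The WHERE clause limits the index to rows describing books, so rows of
# other kinds add nothing to its size or maintenance cost.
conn.execute("CREATE INDEX idx_book_price ON items (price) WHERE kind = 'book'")

# Read the stored index definition back from the catalog.
index_sql = conn.execute(
    "SELECT sql FROM sqlite_master WHERE type = 'index' AND name = 'idx_book_price'"
).fetchone()[0]

# A query whose predicate matches the index condition is a candidate for using it.
cheap_books = conn.execute(
    "SELECT id, price FROM items WHERE kind = 'book' AND price < 20 ORDER BY id"
).fetchall()
print(cheap_books)
```

Whether the query planner actually picks the index depends on the engine and its statistics; the sketch only shows how the index is scoped to the rows it applies to.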

JU: When it's really cheap to make lots of those equivalences, you get the ability to let people call things however they want to call them. There can be lots of aliases and labels floating around, and people can have their own vocabularies. You don't have to be so rigid about names. As you discover equivalences, you map them, and that's very efficient. Versus trying to get people in committees to agree how to call things, that's the hardest problem in the world. But if you can let people operate in their own semantic namespaces, and then bridge things together...
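
Editor's sketch: one generic way to record name equivalences as they are discovered is a union-find (disjoint-set) structure. This is an illustration of the idea in the exchange above, not a description of any SQL Server or entity data model feature; the class `AliasMap` and the attribute names are hypothetical.

```python
class AliasMap:
    """Tracks which labels have been declared to mean the same thing."""

    def __init__(self):
        self._parent = {}

    def _find(self, name):
        # Register unseen names as their own representative.
        self._parent.setdefault(name, name)
        # Walk up to the representative, shortening the path as we go.
        while self._parent[name] != name:
            self._parent[name] = self._parent[self._parent[name]]
            name = self._parent[name]
        return name

    def declare_equivalent(self, a, b):
        """Record that two labels refer to the same attribute."""
        root_a, root_b = self._find(a), self._find(b)
        if root_a != root_b:
            self._parent[root_b] = root_a

    def same(self, a, b):
        """True if the two labels have been bridged, directly or transitively."""
        return self._find(a) == self._find(b)


aliases = AliasMap()
aliases.declare_equivalent("author", "creator")
aliases.declare_equivalent("creator", "dc:creator")
print(aliases.same("author", "dc:creator"))
print(aliases.same("author", "title"))
```

Each group keeps its own vocabulary, and a mapping is added only when an equivalence is found, so nobody has to agree on names up front.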

QC: And that gets back to why the entity data model is so important. It lets people have their own way of describing, programming to, and interacting with the data they want to deal with.

JU: Now what about relationships? In WinFS, a relationship among entities was a first-class object. How does that carry forward?

QC: The notion of a relationship is a first-class object in the entity data model. Now what we haven't done there is bridged an understanding of that into the database itself. Can the query processor understand a relationship, and be optimal for navigating through those semantics? We haven't bridged that part of the world yet. It's certainly possible to create database schemas that allow you to have good query efficiency through your entity model, but it's still intellectual work. We'd like it to be so that the database can look at an EDM schema and create at least the appropriate indices so when you are examining things through that lens, we can make sure your experience is optimal.

QC: Finally there's synchronization. It went through a classic computer-science learning curve as well. At first we said, we need to synch with the cloud, with other WinFS instances, with server systems, how hard can this be?

Then we quickly realized how hard this was. What should be more infamous than people breaking their pick on integrated storage is people breaking their pick on multimaster replication. It's an incredibly difficult problem to get right.

Apps that have gotten this right for a particular domain have become wildly popular. Lotus Notes got it right for a particular domain, so did Exchange and Outlook, but a generalized solution has been very elusive.

Anyway, we did a partnership with Microsoft Research, and at some point along the arc we solved it fairly well. It's not trivial. This is not something that ends up being a simple solution to this very complex problem. It's actually reasonably sophisticated, but it works, and we built it in as part of the last WinFS beta.

As they realized they were onto something, they started to fork out a componentized version of it that's now finding its way into a bunch of Microsoft products. The official branding is Microsoft Sync Framework. I think they're on target for shipping it in six different products, and for embedding it all over the place.

Building an app like Outlook, from scratch, is hard. You can always interact with your data: when you're connected, the thing will always synchronize and reconcile, and when it's offline, it still provides a consistent experience. To build that from scratch is really hard. Taking the sync framework allows people to go and build that experience without having to solve the hard multimaster synchronization problems.
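
Editor's sketch: part of what makes multimaster replication hard is deciding whether two copies of an item are ordered or have diverged. A textbook technique for that is the version vector, shown below. This is a generic illustration and is not presented as the algorithm used by WinFS or the Microsoft Sync Framework; the replica names and the functions `dominates`, `compare`, and `merge` are hypothetical.

```python
from typing import Dict

# Maps a replica id to the number of updates seen from that replica.
Version = Dict[str, int]


def dominates(a: Version, b: Version) -> bool:
    """True if copy a has seen every update that copy b has seen."""
    return all(a.get(replica, 0) >= count for replica, count in b.items())


def compare(a: Version, b: Version) -> str:
    """Classify two copies as equal, ordered, or concurrently edited."""
    a_ahead, b_ahead = dominates(a, b), dominates(b, a)
    if a_ahead and b_ahead:
        return "equal"
    if a_ahead:
        return "a_newer"
    if b_ahead:
        return "b_newer"
    return "conflict"  # neither side has seen all of the other's updates


def merge(a: Version, b: Version) -> Version:
    """After reconciling, the result has seen everything both sides saw."""
    return {r: max(a.get(r, 0), b.get(r, 0)) for r in a.keys() | b.keys()}


laptop = {"laptop": 2, "server": 1}
server = {"laptop": 1, "server": 1}
phone = {"laptop": 1, "server": 1, "phone": 1}

print(compare(laptop, server))  # laptop has strictly more history
print(compare(laptop, phone))   # both edited offline
print(merge(laptop, phone))
```

Detecting the conflict is the easy half. Deciding how to reconcile the two edited copies is application-specific, which is one reason a general solution is so elusive.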

QC: Finally, we'd done a bunch of work to keep the SQL engine tamed and behaving properly on the desktop. Some of that has found its way into SQL Server 2008 and some has not, because there's a less pressing need for it. But for departments, and for SQL Server Express on the desktop, we still want to finish that.

JU: So to wrap up, I'd like you to reflect on how the original environment for WinFS was the end-user desktop, but now the environment in which many of these technologies have come to fruition is the enterprise datacenter and backoffice. How might these worlds yet come together?

QC: I was very happy to be able to take the technology forward, because I saw the broad applicability, not just in the problem space we were working on, but in terms of the general usefulness of the database.

My job is to grow the usefulness of the database. The work we did with WinFS was in line with that, and I'm happy with that, but there's a part of me which is still unfulfilled. Boy, what would it mean if every application could have some shared notions about, for example, the people in my life, that other applications could plug into and use.

Can we express that fully in a cloud way? Maybe. It harkens back to the old Hailstorm ideas. And we have things like Astoria [ADO.NET Data Services] that is a projection of entities over the web. That's awfully familiar, both in terms of WinFS and in terms of Hailstorm.

Where it goes, I don't know. We've made a choice right now to incubate some underlying platform technologies for the web, and allow the operating system team to cycle on the stuff that's on their plates right now.

But I think not too long from now we'll come out of those cycles and say, OK, we have all this fundamental technology, what's the next big innovation we can do?

That's kind of where we got tripped up in the Longhorn cycle. We were building too much of the house at once. We had guys working on the roof while we were still pouring concrete for the foundation.

At one point we realized we needed to decouple things. And that really did give this team the freedom to go off and take these underlying technologies, which we believe were fundamental to the database, and get them done correctly.

But I do at some point want to see that place in my heart fulfilled around the shared data ecosystem for users, because I believe the power of that is enormous.

I think we'll get there. But for now we'll let the concrete dry, and get the framing in place, and then we'll see how the rest of the house shapes up.
