Thursday, June 18, 2009

The Conversation Model in Interaction Design

As you can probably tell from the title of the blog, Savant is a program that lets users interact with it in natural language and accomplish a large variety of tasks. However, I am guilty of coming up with a title that is quite misleading. Natural Language Processing is a very difficult problem in computer science and I do not, by any means, claim to have solved it. The syntax for Savant is quite flexible and allows users to specify arbitrarily complex queries. The rules of Savant's grammar are very intuitive, and occasionally form grammatically correct English sentences. But by no means is it a complete natural language parser. For example, you could tell Savant, "move doc files to new folder Spring09". But you couldn't say, "move doc files to new folder called Spring09", or "move files with doc extension to a new folder called Spring09".

One can argue that all this is syntactic sugar and the grammar can be easily extended to parse these commands. But is that the right approach to solving the problem? Here are a couple of other ways we could tackle this issue. The first method involves filtering out non-essential words. Note that in the two alternate sentences in the above example, the main keywords were already there. Just by filtering out a few words, we can recover the original intent of the command. The second method involves richer interaction with the user as he is typing in the command. As the user types 'mov', we can look up the list of commands that start with that prefix. Then, for each of those commands, syntax suggestions can appear to guide the user to enter his command in a way that the program understands. A positive feedback loop can be started by rewarding the user with auto-completion and a detailed syntax breakdown. Conversely, the program can stop suggesting when the user starts going off track, creating negative feedback. This helps keep the core parser simple while still allowing flexibility for the user.
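Here is a minimal sketch of both methods in Python. The filler-word list, the command names and the syntax hints are all hypothetical stand-ins that I made up for illustration; they are not Savant's actual vocabulary.

```python
# Hypothetical filler words; a real list would have to be tuned against the grammar.
FILLER_WORDS = {"a", "an", "the", "called", "named"}

# Hypothetical commands and syntax hints, used only to demonstrate prefix lookup.
COMMAND_SYNTAX = {
    "move": "move <type> files to [new] folder <name>",
    "copy": "copy <type> files to [new] folder <name>",
    "delete": "delete <type> files",
}


def filter_fillers(command):
    """Method 1: drop non-essential words so only the keywords reach the parser."""
    kept = [word for word in command.split() if word.lower() not in FILLER_WORDS]
    return " ".join(kept)


def suggest(prefix):
    """Method 2: return (command, syntax hint) pairs for commands starting with prefix."""
    prefix = prefix.lower()
    return [
        (name, hint)
        for name, hint in sorted(COMMAND_SYNTAX.items())
        if name.startswith(prefix)
    ]


print(filter_fillers("move doc files to new folder called Spring09"))
print(suggest("mov"))   # suggestions keep coming while the user is on track
print(suggest("xyz"))   # an empty list is the negative feedback
```

With this word list, filtering turns the first alternate sentence into the command Savant already accepts. The second one ("move files with doc extension ...") still has its keywords in a different order, so filtering alone would probably not be enough there and the suggestion approach would have to do the rest.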

If you stop to think for a minute and imagine Savant as a human being, the latter method sounds like a dialogue between two people. The positive/negative feedback is something we constantly provide through various facial expressions. Think about it: how many times has your teacher re-explained a topic after seeing a blank face? How many times have you had the other person complete the word that you just couldn't think of? Can you see how these real-life scenarios map directly onto the interaction between man and software? In fact, I believe any kind of human-computer interaction can be explained in terms of a conversation between two people who speak different languages.

Here are two scenarios to illustrate my point. Think of the command line from UNIX or DOS. To be able to use the terminal you need to memorize the names of the commands and the correct sequence of parameters they take as arguments. So, how does this scenario look in the Conversation Model? It's similar to an Englishman learning Japanese to talk to a person from Japan, or vice versa. Now, learning Japanese can be a very hard prospect, especially for people who live far away from Japan. But there is no doubt that the best way to communicate with a Japanese person is in Japanese. That's the language he is most comfortable with, so you can express very complicated ideas and he'd still be able to understand. This is the case with learning the command line way. The learning curve is tremendous. But those who have mastered it are at the peak of productivity and can express intricate tasks with very concise notation.

The second scenario is that of the point-and-click interfaces that we are so used to these days. To make another analogy, this is similar to sign language. By sign language I don't mean languages like ASL, but the most basic hand gestures and facial expressions. The beauty of these gestures is that they are universal. Irrespective of what language the other person speaks, you can always point at your wrist to indicate that you want to know the time, or shake your head sideways to say no. The drawback of this system is that it only works for basic things. Try asking somebody for directions to the closest museum using sign language. It is apparent that it is not the most expressive of languages. Some tasks were never meant to be described using sign language, unless you begin a new convention involving careful hand gestures. The problem with GUIs is precisely this. Point-and-click makes sense for a handful of things. For others, it becomes a long, repetitive process of pressing buttons and ticking checkboxes.

This brings us back to our conversation with Savant. The motivation of the project was to not require the user to memorize Savant's vocabulary. Instead, Savant could master the human language, which would require solving the natural language processing problem. Or, the other option would be for Savant to have a conversation with the user and work with him to come up with something that it can carry out as required. Making this conversation efficient does not require a technological breakthrough. However, it does require proactive involvement on Savant's part to guide the user in typing his command.

Wednesday, June 17, 2009

Savant Architecture

In this article I am going to give a very high-level overview of the architecture of Savant. The current version of the code does not implement this architecture yet. However, keeping the Savant roadmap in mind, this is the best design I can think of.

[Image: Savant architecture diagram]

In terms of completeness, the bottom two blocks of the building have been laid. The parser has been written in PLY, the Python implementation of Lex and Yacc. This allows Savant to respond to any query that conforms to its context-free grammar. The runtime is written in pure Python with no platform-specific library dependencies. However, it is possible to extend the environment with native system calls.
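To give a feel for what parsing a query against a small grammar looks like, here is a hand-written sketch in plain Python. It does not use PLY and it is not Savant's grammar; the single rule `VERB TYPE files to [new] folder NAME` and the function name `parse_command` are assumptions made up for this example.

```python
def parse_command(text):
    """Parse 'VERB TYPE files to [new] folder NAME' into a dict, or raise ValueError."""
    tokens = text.split()
    pos = 0

    def expect(word=None):
        """Consume the next token, optionally requiring it to be a specific keyword."""
        nonlocal pos
        if pos >= len(tokens):
            raise ValueError("unexpected end of command")
        token = tokens[pos]
        if word is not None and token.lower() != word:
            raise ValueError(f"expected {word!r}, got {token!r}")
        pos += 1
        return token

    verb = expect().lower()
    if verb not in ("move", "copy"):
        raise ValueError(f"unknown verb {verb!r}")
    file_type = expect()
    expect("files")
    expect("to")
    # The keyword 'new' is optional in this toy rule.
    create = pos < len(tokens) and tokens[pos].lower() == "new"
    if create:
        pos += 1
    expect("folder")
    name = expect()
    if pos != len(tokens):
        raise ValueError(f"unexpected trailing words: {tokens[pos:]}")
    return {"verb": verb, "type": file_type, "create": create, "folder": name}


print(parse_command("move doc files to new folder Spring09"))
```

A query that strays from the rule, such as "move doc files to new folder called Spring09", raises a `ValueError` in this sketch, which is the kind of rigidity a grammar-based parser has.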

The main purpose of this post is to discuss the server aspect of Savant shown in the diagram. The current version of the code lacks this feature and runs as a command-line program. The original idea was to have the graphical front-end execute the Savant back-end on demand. That is, whenever the user typed in a command, the GUI would run Savant with the query, capture its output, process the results and notify the user. This way of launching Savant has two main drawbacks.

  1. There isn't any cross-platform way to launch external programs and capture their output.
  2. Launching native programs is impossible with RIA platforms like Flex or JavaFX, which limits implementation possibilities for the front-end.

In this revised architecture, I have designed Savant to run as a service on your system and listen for requests on a predefined port on localhost. This eliminates both of the drawbacks mentioned above. First, communicating over network ports is a very standard practice, and functions to do so are available in almost every programming language. Secondly, this opens the gate for applications written in Flex, Silverlight, JavaFX, or Titanium. Not only will this allow the GUI to be developed on these platforms, any RIA platform can have heightened access to the operating system in a standards-compliant manner. As a developer, this is a very exciting prospect. The next logical step after this can only be making Savant modular so that extensions to the parser can be added on the fly with only a few lines of code. An RIA-based front-end means more flexibility in terms of interacting with the user. If all of this can be implemented, Savant has the potential to be the Firefox of the natural language shell movement. Stay tuned for updates.
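A minimal sketch of the service idea, using only Python's standard `socketserver` and `socket` modules. The one-line-per-query protocol, the `received:` reply and the names `QueryHandler`, `start_server` and `send_query` are assumptions for illustration; the post does not define a wire format, and a real handler would pass the query on to the parser and runtime.

```python
import socket
import socketserver
import threading


class QueryHandler(socketserver.StreamRequestHandler):
    """Read one line containing a query and write one line back."""

    def handle(self):
        query = self.rfile.readline().decode("utf-8").strip()
        # Placeholder: a real service would run the query here.
        reply = f"received: {query}\n"
        self.wfile.write(reply.encode("utf-8"))


def start_server(port=0):
    """Serve on localhost in a background thread; port 0 lets the OS pick a free port."""
    server = socketserver.TCPServer(("127.0.0.1", port), QueryHandler)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    return server


def send_query(port, query):
    """What a front-end would do: connect, send the query, read the reply."""
    with socket.create_connection(("127.0.0.1", port)) as conn:
        conn.sendall((query + "\n").encode("utf-8"))
        with conn.makefile("r", encoding="utf-8") as reader:
            return reader.readline().strip()


if __name__ == "__main__":
    server = start_server()
    port = server.server_address[1]
    print(send_query(port, "move doc files to new folder Spring09"))
    server.shutdown()
    server.server_close()
```

The client side needs nothing beyond a plain socket, which is the point of the design: any front-end that can open a connection to localhost can talk to the service.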

Tuesday, June 16, 2009

Savant: Inspiration

How many times have you stared at the contents of a folder for 5 minutes trying to carefully select the files that you want to copy to your thumbdrive? Maybe you want to remove all the zip files that you downloaded off the web today. Or maybe you want to make a backup of all the folders matching the name hwX where X stands for a number.

Sure, you can do these things with the file manager of your choice. The most naive user would go and manually select every one of the 20 items he is thinking of. In the process, maybe, he clicks an empty area on the screen and his whole selection goes away. Since there is no UNDO feature in file managers, he has to start over again. (Makes me wonder why nobody has thought of this.)

A more average user, or more likely, a user with a very clear query in his head, will probably use the search feature built into most file managers. The user ticks numerous checkboxes and fills in whatever details he remembers about his query. Hopefully, if the search is narrow enough, the right results show up and the user is happy. However, more often than not, using the search feature can be a bit tedious. Ask yourself, how likely are you to use search when you want to copy, delete, move or zip more than 10 files?

On the other end of the spectrum, the macho command line geek will probably laugh at this problem. He will promptly bring up a terminal window and type in some obscure commands involving ls, rm with a bunch of regular expression arguments along with -l -u and what not. If the query is a bit more complicated, and the user is intellectually challenged, he will write a shell script to do the job for him. However, the truth is, in the average case, the command line geek will get his job done faster than the other user groups.

No, I am not advocating the use of the command line in our daily lives. But do you notice that, in a way, the more technologically inclined you are, the faster you can get things done? If you are not so good with computers you will probably end up following medieval ways of doing things. However, getting things done fast doesn't come without a cost. As a user, you are trading boring and repetitive chores for the intellectual effort of command line trickery. Unfortunately, for most users there is no trade-off to make. Such users resort to the boring and repetitive way of doing things.

So is there a sweet spot between tasks of high intellectual effort and the brainless and repetitive ones? Personally, I think so. In this article I will try to explain what motivates me to think in such a manner.

To put things into perspective I will zoom out a bit more and talk about a more fascinating trend. Back in the days of the first browsers, the address bar was a place where you just typed in the actual domain name. Search engines were not so popular back then. So it was important that you remembered the web address precisely. As browsers got more advanced they started packing more juice into the address bars. First came the drop down of addresses from the history. If you have visited a site in the recent past, all you have to remember is the first few characters of the domain name. A significant reduction in intellectual effort, a new potentially brainless chore. With the advent of Firefox 3, the address bar became even more "awesome". You could get away with typing only a part of the actual domain name, page title, or even tag name. This new feature quickly spread like wildfire and was adopted by every other browser in the market. With the smarter address bar, daily browsing reduced to typing in a few key strokes. Google Chrome took this even further by displaying Google search results as you type in.

Now, I ask the question again. Do you see a trend? With every new addition, we are actually accomplishing a much more complicated task. We started off by pointing the browser to a specific location on the web. We then progressed to doing a domain name search on our recent browsing history. That got elevated to performing a string search of titles, domain names, bookmarks, and tags from the history. Finally we ended up performing an actual web search on Google along with everything else. However, the actual intellectual effort of performing these increasingly complicated tasks didn't go up that much. It is still pretty much limited to a few keystrokes that we easily get used to.

So the moral of the story is that it is possible to do complex tasks without doing much. Which brings us back to the issue of selecting, copying, and moving files with your file manager. The current state-of-the-art approach to this problem is to use the sort and search functions of the manager to set things up to do whatever you want. Imagine if you had to do the same with web browsers. Instead of the address bar auto-completing recently visited sites, what if you were shown a list of all pages you have visited recently? Then you would have to sort that alphabetically, or by recency, and then perhaps do a search to find what you are actually looking for. Fortunately, browser UI designers have been clever enough to spare us the horror of going through all that.

So you see, the way we do things in a file manager is pretty silly. We somehow fail to realize how silly it is until an analogy is shown to us. What if we could tell the file manager what we want? What if we could tell it: "Delete all the zip files created today". Or maybe, "Move all files starting with hw to Documents/School/". Command line people may instantly conjure up the appropriate shell commands for each of those tasks. And that's exactly my point: we need something as powerful as the command line, but for average users. We need something that can understand commands given in natural language and perform them. I'm not talking about having a conversation with the computer. We just need to be able to give domain-specific commands in natural language.
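To make the two example commands concrete, here is a rough sketch of the file selection each one boils down to, written with Python's `pathlib`. The function names are made up for this example, and it only selects files without deleting or moving anything. Note one assumption: file creation time is not available portably, so the sketch uses the modification time as a stand-in for "created today".

```python
from datetime import date
from pathlib import Path


def files_with_extension(folder, extension, modified_today=False):
    """Select files in folder with the given extension, e.g. 'zip'."""
    today = date.today()
    matches = []
    for path in sorted(Path(folder).iterdir()):
        if not path.is_file() or path.suffix.lower() != "." + extension.lower():
            continue
        # Assumption: modification time approximates "created today".
        if modified_today and date.fromtimestamp(path.stat().st_mtime) != today:
            continue
        matches.append(path)
    return matches


def files_starting_with(folder, prefix):
    """Select files in folder whose names start with prefix, e.g. 'hw'."""
    return sorted(
        path
        for path in Path(folder).iterdir()
        if path.is_file() and path.name.startswith(prefix)
    )


if __name__ == "__main__":
    # "Delete all the zip files created today" would act on this selection:
    print(files_with_extension(".", "zip", modified_today=True))
    # "Move all files starting with hw to Documents/School/" would act on this one:
    print(files_starting_with(".", "hw"))
```

The selection logic is only a few lines; the hard part, and the part Savant is about, is getting from the sentence to these calls.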

Savant is a proof of concept of such an idea. It is a natural language shell for performing the routine tasks that we do using a file manager. It is simple in the sense that it can perform only a handful of tasks, but it is also powerful in the sense that it can perform those tasks on any complicated query that we describe in natural language. What's more exciting about this idea is that Savant need not be a solution for only a handful of problems. It can be as extensible as Firefox. Developers should be able to extend the program to do more interesting tasks. Alongside, they can also specify the natural language commands using plain old context-free grammars. To make parsing easy, the front-end can suggest syntax as the user types in his command. This way, it is clear to the user what the system can and cannot understand.

Sounds interesting. Stay tuned for more updates...