MIME-Version: 1.0 Server: CERN/3.0 Date: Sunday, 01-Dec-96 19:02:37 GMT Content-Type: text/html Content-Length: 14145 Last-Modified: Thursday, 02-May-96 10:38:48 GMT
Alfred Hong
May 1, 1996
Web sites today that provide a query front-end to a backend database are done mostly via keyword or categorical search forms. These search parameters are preselected and usually "inhuman." Since the query parameters are preselected, the answers are fixed and static to the user. Replacing these fixed query forms with a natural language query front-end could make things more human, more natural, and more flexible. A "middleware" mechanism that converts a natural language question into the backend database language to perform the user's query request might do the job. Since relational databases are very popular, SQL will be the target database language for conversion.
The approach in general involves:
Natural language processing is a difficult subject. To tackle the focus of this project, a specific domain is chosen for ideas and code-testing. The domain of querying for flight schedules is thus chosen.
Using sample schedule information from USAir's flight schedule database, the following relational database table definition and record shows the type of information that is dealt with.
( "San Francisco, CA", Ithaca, NY", "052396", "062296", "635A",
"1159A", "E5361/63", "X7", 1, "PHL", "B", 524 )
FLTS( toCity, fromCity, beginDate, endDate, leaveTime, arriveTime,
flightNum, frequency, stops, stopCities, meals, fare )
Several possible concepts/techniques to tackle the problem were looked at:
Of these techniques, although some tests were initially done with using WH- and GAP features (not with flight schedules), the extra WH- and GAP features are really not necessary for the problem. Simple bottom-up chart-parsing with semantic features suffice.
The idea of using some combination of question answering, conversational agents, information retrieval concepts, and template matching came about from the realization that the number of synonyms and phrasings possible for asking for a flight schedule is actually quite diverse, requiring a rather large lexicon for a small domain.
Because of the nice structured format of SQL queries, flight schedule questions can be directly mapped to that of parts of SQL queries. As shown in Table 1, for San Francisco can be mapped to the WHERE destCity = "San Francisco" construct, and for May 28 can be mapped to the WHERE departDate = "0528" construct.
Sentence | SQL |
---|---|
What flights are available | SELECT flights |
FROM schedules | |
for San Francisco | WHERE destCity = "San Francisco" AND |
from Ithaca | departCity = "Ithaca" AND |
for May 28 | departDate = "0528" |
The following is the result from parsing the sentence "What are the departure times for houston tomorrow":
S176:<S(((SELECT(((TIME ?V18) ?V25 DEPART) ((TOMORROW ?V18) ?V25 IAH)))) (1 WH-PRO116) (2 VP175))> from 0 to 8
Sample questions for testing were solicited from users given that they need to purchase a ticket to San Francisco tomorrow from Ithaca. The following are example questions.
2. What are the departure times for Houston for tomorrow?
3. How many stops are there to San Francisco for flight 400?
4. When does flight E5361 arrive in Orlando today?
5. What is the cheapest flight available to San Francisco tomorrow?
6. Can you book me the cheapest flight I can take to San Francisco
tomorrow?
7. Do I need to stay a Saturday night to get the lowest fares?
8. How many flights do you have available going to San Francisco
tomorrow?
9. Can you check tomorrow's flight schedule to San Francisco for me?
10. I need to fly to San Francisco tomorrow. Is there any flights
available?
1. What are the available arrival times for Miami for tomorrow?
Question types 1-5 can be parsed, but not the others for reasons mainly of not anticipating certain words and different sentential structures. For results, see the results documentation and grammar+lexicon code.
As a note, questions 5 and 6 involve the realization that the "cheap" concept means a SELECT MIN(FARE) operation is needed. Question 7 is a yes/no type question. Question 8 involves a COUNT operation, and question 10 has 2 sentences that violates the defined assumptions.
The results of the bottom-up chart-parsing method could be improved with further refinement as non-anticipated sentential constructs are discovered. This is not ideal, however, which motivates the search for other means of tackling the problem.
Questions are often asked in bits and pieces; this is quite true for querying flight schedules. This requires the flight schedule answering system to be aware of what has been asked in a session, then results can be further refined by the user with several questions. Viewed in this way, multiple sentences would not be a major problem since it is allowed.
Even though the number of synonyms required are quite large, a lexicon need to exhaustively cover all possibilities that the user may present to the system. For example, "flight", "flight schedule", "schedule", "reservation" refer to the same thing; other inferences are possible through other combinations of words with helper words and different word arrangements. For instance, "to go to miami" in the context of flight schedules implies flying to miami. To parse all possible combinations of questions also require a large number of rules.
The sentences I need to fly to San Francisco tomorrow. Is there any flight available? and the sentence Is there any flight available to San Francisco tomorrow? are the same queries. Notice, however, that key phrases are the same: to San Francisco, tomorrow, and any flight. The idea is that if these can be identified, then maybe sentence boundaries and syntactic structure are not as important, and the lexicon need to concentrate only on key words/phrases.
Fig. 1. Interface of Flight Schedule Query Application
To test the idea out, a Tcl/TK application has been written (Fig. 1).
As shown, there are several parts to the interface. The user inputs queries, clicks on the Analyze button, and key words/phrases are extracted from the input and stored. Extraction of key concepts is done by regular expression matching with a built-in lexicon that ignores irrelevant words (akin to stop-lists in information extraction) in the context of flight schedule query. In the figure, the history text box shows that the concepts flights, atlanta, and tomorrow are extracted from the Q1 sentence which is shown directly under the Q1 line. (Actually, the extraction of to atlanta instead of just atlanta would be better for indicating atlanta as a destination city.) Q2 is a followup question that requests for flights to houston instead. The extracted information then is used to do template matching to SQL syntax (this has not been implemented yet).
The user can input successively to refine or change their queries until the Reset button is chosen to restart a new session. Using Tcl/TK also makes it easy to interface to a Web page with a backend relational database engine[2].
Evaluation of this application can not be done since it has not been completely developed yet. However, the pattern matching and key concept extraction component certainly show promise at this point.
Thus with a combination of question answering, conversational agents, information retrieval, and template matching techniques, maybe the flight schedule query problem can be dealt with feasibly.
There are other problems that were overlooked. What if the user mistype their questions? What if the user does not follow correct English grammar rules? In this case, current forms-based methods of flight-schedule query on the Web are definitely at an advantage. Then again, these Web sites are limited and return relatively large sets of data that require further interpretation on the part of the user.
In general, natural language processing is a difficult subject. Even with a specific domain chosen, problems are abound. In terms of the bottom-up chart-parser method, if
then a full-proof flight-schedule query system could be possible.
However, even if the sentences could be parsed, the parsed result can not be used readily; integrating such a system with a web front-end is not straightforward because of the resident LISP back-end used for the chart-parsing. Even if it is done, it is rather resource intensive. This led to researching other possible ways to make it more feasible. Hopefully the alternative method embodied by the Tcl/TK application as discussed in the previous section will prove to be promising.