LLM navigation agent: Difference between revisions

Revision as of 07:59, 6 October 2023

The 2010 paper Where should I turn: moving from individual to collaborative navigation strategies to inform the interaction design of future navigation systems reports an interesting finding: navigation between a driver and a co-driver is largely collaborative, interactive, and conversational. This is plausible and matches everyday experience.

In contrast, most software navigators, be it a mobile map application or an onboard navigator / infotainment system, usually lack interaction with the driver. They are good at route planning and giving instructions, but have very poor, if any, capability to respond to an arbitrary question from the driver.

With powerful large language models (LLMs) emerging and becoming globally accessible, together with continued progress in ASR and TTS technologies, a software navigator that holds spoken conversations with the driver looks increasingly feasible. This project aims to create such a navigator agent (I know I won't finish it, though) or at least yield some valuable (to whom?) results.

Outline

The LLM as an agent [1] concept is used here. This agent has access to a collection of tools for completing different tasks, and has memory for history, information, decisions made, etc.

Skipping a lot (basically all) of the explanation, an LLM agent can perform actions such as searching the Internet, accessing peripheral hardware for sensor data, doing file I/O, and calling other programs, besides the rather basic task of conversing with the user in natural language.
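As a rough illustration of the tool idea, here is a minimal sketch of one agent step. Everything in it is an assumption made for illustration: the tool names (get_position, web_search), the fake_llm stand-in that picks a tool by keyword, and the placeholder coordinates. A real agent would replace fake_llm with an actual LLM call and the stubs with real map, GPS, and search backends.

```python
from typing import Callable, Dict, List

# Hypothetical tools. Real ones would talk to GPS hardware or a search engine.
def get_position(_: str) -> str:
    return "51.5074,-0.1278"  # placeholder coordinates, not real sensor data

def web_search(query: str) -> str:
    return f"(stub) top result for: {query}"

# Registry mapping tool names to callables the agent is allowed to use.
TOOLS: Dict[str, Callable[[str], str]] = {
    "get_position": get_position,
    "web_search": web_search,
}

def fake_llm(user_text: str, memory: List[str]) -> Dict[str, str]:
    """Stand-in for a real LLM: chooses a tool with a crude keyword rule."""
    if "where" in user_text.lower():
        return {"tool": "get_position", "arg": ""}
    return {"tool": "web_search", "arg": user_text}

def agent_step(user_text: str, memory: List[str]) -> str:
    """Run one turn: ask the (fake) LLM for a tool, call it, remember the exchange."""
    decision = fake_llm(user_text, memory)
    tool = TOOLS.get(decision["tool"])
    reply = tool(decision["arg"]) if tool else "Sorry, I don't know that."
    memory.append(f"user: {user_text}")
    memory.append(f"agent: {reply}")
    return reply
```

Calling agent_step("Where are we?", memory) with an empty list as memory routes to the position stub, while any other utterance falls through to the search stub. The memory list is just an append-only log here; what the agent should actually remember is an open question.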

So we can build an agent with at least the following required features, to make it capable of performing like a human co-driver:

Chat
Map and route operations
- This can be done by either using a cloud service via API, or a local map in e.g. OpenStreetMap format.
Positioning
- GPS hardware
- Internet-based, less accurate overall
Web search
- A good way to provide at least some information when the user says something navigation software might not expect, e.g. 'What's Alfie Kilgour's middle name?'
- It's never wrong to say 'Sorry, I don't know that', but this is what makes Google Assistant very annoying and look dumb.
Audio recording (for speech recognition)
Speech recognition
- Cloud / local, but either way it's going to be perceptibly slow
Text-to-Speech
- If we don't mind robotic voices, this can be fast
Audio playback
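The voice-related features above chain into one conversational turn: record, recognise, reply, synthesise, play back. The sketch below wires stub stages together and times each one, which is one possible way to see where the delay comes from. All stage functions are hypothetical placeholders that return canned data; none of them calls a real ASR, LLM, or TTS engine.

```python
import time
from typing import Callable, Dict, Tuple

# Hypothetical stage stubs returning canned data instead of using real engines.
def record_audio() -> bytes:
    return b"\x00" * 16000  # fake audio buffer

def speech_to_text(audio: bytes) -> str:
    return "how far is the next exit"

def agent_reply(text: str) -> str:
    return f"You asked: {text}"

def text_to_speech(text: str) -> bytes:
    return text.encode("utf-8")

def play_audio(audio: bytes) -> int:
    return len(audio)  # pretend playback, report bytes "played"

def run_turn() -> Tuple[str, Dict[str, float]]:
    """Run one driver-to-agent turn and record seconds spent in each stage."""
    timings: Dict[str, float] = {}

    def timed(name: str, fn: Callable, *args):
        start = time.perf_counter()
        result = fn(*args)
        timings[name] = time.perf_counter() - start
        return result

    audio = timed("record", record_audio)
    text = timed("asr", speech_to_text, audio)
    reply = timed("agent", agent_reply, text)
    speech = timed("tts", text_to_speech, reply)
    timed("playback", play_audio, speech)
    return reply, timings
```

With real backends swapped in, the timings dictionary would show which stage dominates the turn, for example whether speech recognition really is the slow part.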

Things to find out before digging in

There are more things to find out, of course.

How fast should the system be?

Get some pairs of people to drive, as in the Where should I turn paper, on real roads or in simulators.

The

@@ Line 4: / Line 4: @@
 With powerful large language models (LLMs) emerging and becoming globally accessible, together with continued progress in ASR and TTS technologies, a ''software navigator that holds spoken conversations with the driver'' looks increasingly feasible. This project aims to create such a navigator agent (I know I won't finish it, though) or at least yield some valuable (to whom?) results.
+=== Outline ===
+The ''LLM as an agent'' [https://lilianweng.github.io/posts/2023-06-23-agent/] concept is used here. This agent has access to a collection of ''tools'' for completing different tasks, and has memory for history, information, decisions made, etc.
+Skipping a lot (basically all) of the explanation, an LLM agent can perform actions such as searching the Internet, accessing peripheral hardware for sensor data, doing file I/O, and calling other programs, '''besides''' the rather basic task of conversing with the user in natural language.
+So we can build an agent with at least the following required features, to make it capable of performing like a human co-driver:
+* Chat
+* Map and route operations
+** This can be done by either using a cloud service via API, or a local map in e.g. OpenStreetMap format.
+* Positioning
+** GPS hardware
+** Internet-based, less accurate overall
+* Web search
+** A good way to provide at least some information when the user says something navigation software might not expect, e.g. 'What's Alfie Kilgour's middle name?'
+** It's never wrong to say 'Sorry, I don't know that', but this is what makes Google Assistant very annoying and look dumb.
+* Audio recording (for speech recognition)
+* Speech recognition
+** Cloud / local, but either way it's going to be perceptibly slow
+* Text-to-Speech
+** If we don't mind robotic voices, this can be fast
+* Audio playback
+=== Things to find out before digging in ===
+There are more things to find out, of course.
+==== How fast should the system be? ====
+Get some pairs of people to drive, as in the [https://dl.acm.org/doi/abs/10.1145/1753326.1753516 Where should I turn] paper, on real roads or in simulators.
+The
+==== How fast can the system be? ====