LLM navigation agent
(Created page with "The 2010 paper [https://dl.acm.org/doi/abs/10.1145/1753326.1753516 Where should I turn: moving from individual to collaborative navigation strategies to inform the interaction design of future navigation systems] states an interesting finding that the navigation in driving between a driver and a co-driver is largely '''collaborative, interactive, and conversational''', which is reasonable, acceptable and conforms to daily experiences. In contrast, most software navigat...") |
No edit summary |
||
(2 intermediate revisions by the same user not shown) | |||
Line 2: | Line 2: | ||
In contrast, most software navigators, be it a mobile map application or onboard navigator / infotainment system, are usually lacking interactions with the driver. They are good at route planning and giving instructions, but have very poor, if any, capability to respond to an arbitrary inquiry popped by the driver. | In contrast, most software navigators, be it a mobile map application or onboard navigator / infotainment system, are usually lacking interactions with the driver. They are good at route planning and giving instructions, but have very poor, if any, capability to respond to an arbitrary inquiry popped by the driver. | ||
[[File:Conversational Navigation.png|thumb]]
With powerful large language models (LLMs) emerging and becoming globally accessible, and with continued progress in ASR and TTS technologies, the possibility of ''a software navigator having vocal conversations with the driver'' seems to be growing. This is a project aimed at creating such a navigator agent (I know I won't finish it, though), or at least at yielding some valuable (to whom?) results.
=== Outline ===
The ''LLM as an agent'' [https://lilianweng.github.io/posts/2023-06-23-agent/] concept is used here. Such an agent has access to a collection of ''tools'' for completing different tasks, and has memory for past data, information, and decisions made.

Skipping a lot (basically all) of the explanation, an LLM agent can perform actions such as searching the Internet, accessing peripheral hardware for sensor data, doing file I/O, and calling other programs, '''besides''' the rather basic task of conversing with the user in natural language. A minimal sketch of such an agent loop is given below.
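As a rough illustration (not a committed design), the core loop could look like the sketch below: the LLM either answers in plain text or names a tool to call, and tool results are fed back into the conversation. The <code>chat_completion</code> function and the tool names are placeholders, not any particular API.

<syntaxhighlight lang="python">
# Minimal agent loop sketch. `chat_completion` stands in for whatever LLM
# backend is used; TOOLS maps tool names to ordinary Python callables.
import json

def chat_completion(messages):
    """Placeholder for an LLM call that returns either plain text or a
    JSON string like {"tool": "route", "args": {...}}."""
    raise NotImplementedError

TOOLS = {
    "route": lambda args: f"(route from {args['from']} to {args['to']})",
    "web_search": lambda args: f"(search results for {args['query']})",
}

def run_agent(user_utterance, history):
    history.append({"role": "user", "content": user_utterance})
    while True:
        reply = chat_completion(history)
        try:
            call = json.loads(reply)          # the model may ask for a tool
        except json.JSONDecodeError:
            call = None
        if not isinstance(call, dict) or "tool" not in call:
            history.append({"role": "assistant", "content": reply})
            return reply                      # plain text: speak this to the driver
        result = TOOLS[call["tool"]](call["args"])
        history.append({"role": "tool", "content": result})
</syntaxhighlight>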
So we can build an agent with at least the following features, to make it capable of performing like a human co-driver:
* Chat
* Map and route operations
** This can be done either with a cloud routing service via an API, or with a local map in e.g. OpenStreetMap format (see the routing sketch after this list).
* Positioning
** GPS hardware (see the NMEA sketch after this list)
** Internet-based, less accurate overall
* Web search
** A good way of providing more than no information when the user asks something a navigation software might not expect, e.g. 'What's Alfie Kilgour's middle name?'
** It's never wrong to say 'Sorry, I don't know that', but this is what makes Google Assistant very annoying and makes it look dumb.
* Audio recording (for speech recognition)
* Speech recognition
** Cloud or local, but either way it is going to be perceptibly slow
* Text-to-Speech
** If we don't really worry about robot voices, this can be fast
* Audio playback
* Translation, just in case
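As an example of the map and route tool, here is a minimal sketch that queries the public OSRM demo server for a driving route between two coordinates. The endpoint and response fields follow OSRM's HTTP API; the coordinates and the lack of error handling are purely illustrative.

<syntaxhighlight lang="python">
# Sketch of a routing tool backed by the public OSRM demo server.
# Coordinates are (longitude, latitude), as OSRM expects.
import requests

def get_route(start, end):
    """Return distance (m) and duration (s) of a driving route."""
    url = (
        "https://router.project-osrm.org/route/v1/driving/"
        f"{start[0]},{start[1]};{end[0]},{end[1]}?overview=false"
    )
    data = requests.get(url, timeout=10).json()
    route = data["routes"][0]
    return route["distance"], route["duration"]

# Example: two points in Edinburgh, roughly city centre to the airport.
print(get_route((-3.1883, 55.9533), (-3.3725, 55.9500)))
</syntaxhighlight>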
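For the GPS hardware option, one common approach is reading NMEA sentences from a serial GPS receiver. The sketch below assumes a receiver on <code>/dev/ttyUSB0</code> at 9600 baud (both assumptions) and uses pyserial and pynmea2.

<syntaxhighlight lang="python">
# Sketch: read position from a serial NMEA GPS receiver.
# Device path and baud rate are assumptions; adjust for the actual hardware.
import serial
import pynmea2

def read_position(port="/dev/ttyUSB0", baudrate=9600):
    with serial.Serial(port, baudrate, timeout=1) as gps:
        while True:
            line = gps.readline().decode("ascii", errors="ignore").strip()
            # GGA sentences carry a position fix.
            if line.startswith("$GPGGA") or line.startswith("$GNGGA"):
                msg = pynmea2.parse(line)
                return msg.latitude, msg.longitude

print(read_position())
</syntaxhighlight>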
=== Things to find out before digging in ===
There are more things to find out than are listed here, of course.
==== How fast should the system be? ====
Get some pairs of people to drive together, similar to the [https://dl.acm.org/doi/abs/10.1145/1753326.1753516 Where should I turn] paper, on real roads or in simulators.
Focus on how much time (on city roads / highways) is left for the navigator to instruct the driver, after the driver asks for instructions, before the vehicle misses the point to turn / stop / change lanes. A back-of-the-envelope calculation of this time budget is sketched below.
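As a rough feel for the scale involved (the distances and speeds below are illustrative assumptions, not measurements), the available time budget is simply the distance to the manoeuvre point divided by the vehicle speed:

<syntaxhighlight lang="python">
# Back-of-the-envelope time budget before a manoeuvre point is missed.
# Distances and speeds are illustrative assumptions, not measured values.
def time_budget(distance_m, speed_kmh):
    return distance_m / (speed_kmh / 3.6)   # seconds

print(f"city:    {time_budget(150, 50):.1f} s")   # ~10.8 s at 50 km/h, 150 m out
print(f"highway: {time_budget(300, 110):.1f} s")  # ~9.8 s at 110 km/h, 300 m out
</syntaxhighlight>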
==== How fast can the system be? ====
Benchmark the typical time consumed by one exchange (one question from the driver, one response from the navigator) across different combinations of LLM / ASR / TTS implementations. A minimal timing harness for one such combination is sketched below.
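This is a minimal sketch only, assuming local Whisper for ASR and pyttsx3 for TTS (both just example choices), with the LLM call left as a placeholder and the input file name made up:

<syntaxhighlight lang="python">
# Sketch: time the ASR -> LLM -> TTS pipeline for one exchange.
# whisper (openai-whisper) and pyttsx3 are example choices, not requirements.
import time
import whisper
import pyttsx3

def llm_reply(text):
    """Placeholder for whichever LLM backend is being benchmarked."""
    return "In 200 metres, turn left."

asr_model = whisper.load_model("base")
tts_engine = pyttsx3.init()

t0 = time.perf_counter()
question = asr_model.transcribe("driver_question.wav")["text"]
t1 = time.perf_counter()
answer = llm_reply(question)
t2 = time.perf_counter()
tts_engine.say(answer)
tts_engine.runAndWait()
t3 = time.perf_counter()

print(f"ASR {t1 - t0:.2f}s  LLM {t2 - t1:.2f}s  TTS {t3 - t2:.2f}s  total {t3 - t0:.2f}s")
</syntaxhighlight>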
==== Wake word or push to talk? ====
Obviously a wake word is more natural, but slower. If a button doesn't feel too unnatural to drivers, it's good enough. A crude push-to-talk recording sketch is given below.
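A crude sketch of the push-to-talk option, recording a fixed-length clip after a key press with sounddevice; the duration, sample rate, and output file name are arbitrary choices for illustration:

<syntaxhighlight lang="python">
# Sketch: crude push-to-talk -- press Enter, record a short clip for ASR.
# Duration, sample rate, and file name are arbitrary illustrative choices.
import sounddevice as sd
import soundfile as sf

SAMPLE_RATE = 16000   # 16 kHz mono suits most ASR models
DURATION_S = 5

input("Press Enter and speak...")
audio = sd.rec(int(DURATION_S * SAMPLE_RATE), samplerate=SAMPLE_RATE, channels=1)
sd.wait()                                  # block until recording finishes
sf.write("driver_question.wav", audio, SAMPLE_RATE)
print("Saved driver_question.wav")
</syntaxhighlight>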
==== How much visual? ====
Probably use an eye tracker or something similar to record how much the driver looks at the road versus at the screen, if there is one.
I strongly suspect someone has done this before, so I'll look for papers.
=== Implementation ===
''In all honesty, I don't think I can or want to do this alone, so I have no implementation plans for now.''
=== Random thinking ===
Actually, this all started with another personal niche: wanting a Yuduki Yukari (AI) rally co-driver that can automatically do recces, generate pace notes, and call them out in a timed session. That requires more precision but less conversation (not none, though).
Then I rethought it: an AI for daily navigation probably has a lot more potential users, so the idea has ended up here.