LLM navigation agent: Difference between revisions

Revision as of 08:13, 6 October 2023

The 2010 paper "Where should I turn: moving from individual to collaborative navigation strategies to inform the interaction design of future navigation systems" reports an interesting finding: navigation between a driver and a co-driver is largely collaborative, interactive, and conversational. This is plausible and matches everyday experience.

In contrast, most software navigators, whether a mobile map application or an onboard navigator / infotainment system, offer little interaction with the driver. They are good at route planning and giving instructions, but have very poor, if any, ability to respond to an arbitrary question from the driver.

With powerful large language models (LLMs) emerging and becoming globally accessible, and with continued progress in automatic speech recognition (ASR) and text-to-speech (TTS) technologies, a software navigator that holds spoken conversations with the driver looks increasingly feasible. This project aims to create such a navigator agent (I know I won't finish it, though) or at least yield some valuable (to whom?) results.

Outline

The "LLM as an agent" [1] concept is used here. The agent has access to a collection of tools for completing different tasks, and has memory for historical data, information, decisions made, etc.

Skipping a lot (basically all) of the explanation, an LLM agent can perform actions such as searching the Internet, accessing peripheral hardware for sensor data, doing file I/O, and calling other programs, besides the rather basic task of conversing with the user in natural language.
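As a rough illustration of the "tools plus memory" idea, here is a minimal sketch. The names `Agent`, `register`, `act` and `fake_web_search` are made up for this example and do not come from any particular agent framework. In a real agent the LLM would pick the tool and its argument; here the caller does, so the sketch runs offline.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class Agent:
    """Toy agent: a tool registry plus a running memory of what happened."""
    tools: Dict[str, Callable[[str], str]] = field(default_factory=dict)
    memory: List[str] = field(default_factory=list)

    def register(self, name: str, func: Callable[[str], str]) -> None:
        # Make a tool available to the agent under a short name.
        self.tools[name] = func

    def act(self, tool_name: str, argument: str) -> str:
        # Assumption: the caller chooses the tool. A real agent would let
        # the LLM decide which tool to call and with what argument.
        if tool_name not in self.tools:
            result = f"unknown tool: {tool_name}"
        else:
            result = self.tools[tool_name](argument)
        # Remember every action so later decisions can refer back to it.
        self.memory.append(f"{tool_name}({argument!r}) -> {result}")
        return result


def fake_web_search(query: str) -> str:
    # Hypothetical stand-in for a real search backend.
    return f"top result for '{query}'"


agent = Agent()
agent.register("web_search", fake_web_search)
print(agent.act("web_search", "nearest petrol station"))
print(agent.memory)
```

Keeping tools as plain callables means a map lookup, a GPS read, or a web search can all be plugged in the same way.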

So we can build an agent with at least the following required features, to make it capable of performing like a human co-driver:

Chat
Map and route operations
- This can be done by either using a cloud service via API, or a local map in e.g. OpenStreetMap format.
Positioning
- GPS hardware
- Internet-based, less accurate overall
Web search
- A good way of providing at least some information when the user says something navigation software might not expect, e.g. 'What's Alfie Kilgour's middle name?'
- It's never wrong to say 'Sorry, I don't know that', but this is what makes Google Assistant very annoying and look dumb.
Audio recording (for speech recognition)
Speech recognition
- Cloud / local, but either way it's going to be perceptibly slow
Text-to-Speech
- If we don't really worry about robot voices, this can be fast
Audio playback
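The features above could be chained into one spoken exchange roughly as follows. This is only a sketch under assumptions: every stage is a hypothetical callable (`record`, `recognize`, `respond`, `synthesize`, `play`), and the fake implementations just pass text around as bytes so the example runs without any audio hardware, ASR, LLM, or TTS backend.

```python
from typing import Callable, List


def one_exchange(
    record: Callable[[], bytes],            # audio recording
    recognize: Callable[[bytes], str],      # speech recognition (cloud or local)
    respond: Callable[[str], str],          # LLM agent: chat, map, positioning, web search
    synthesize: Callable[[str], bytes],     # text-to-speech
    play: Callable[[bytes], None],          # audio playback
) -> str:
    """Run one question/answer round trip and return the answer text."""
    audio_in = record()
    question = recognize(audio_in)
    answer = respond(question)
    play(synthesize(answer))
    return answer


# Fake stages (assumptions for illustration only).
played: List[bytes] = []


def fake_record() -> bytes:
    return b"where do I turn?"


def fake_recognize(audio: bytes) -> str:
    return audio.decode("utf-8")


def fake_respond(question: str) -> str:
    return f"You asked: {question} Turn left at the next junction."


def fake_synthesize(text: str) -> bytes:
    return text.encode("utf-8")


def fake_play(audio: bytes) -> None:
    played.append(audio)


print(one_exchange(fake_record, fake_recognize, fake_respond, fake_synthesize, fake_play))
```

Because each stage is just a function, a cloud ASR can be swapped for a local one without touching the rest of the chain.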

Things to find out before digging in

There are of course more things to find out than those listed here.

How fast should the system be?

Get some pairs of people to drive, similar to the Where should I turn paper, on real roads or in simulators.

Focus on:

How much time (city road / highway) is left for the navigator to instruct the driver, after the driver asks for instructions, before the vehicle would miss the point to turn / stop / change lanes.
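For a feel of the numbers before running any study, the time left can be estimated as distance to the decision point divided by speed, minus whatever time the driver needs to react. The sketch below assumes constant speed, and the distances, speeds, and the 2-second reaction allowance are made-up illustrative inputs, not measurements.

```python
def seconds_until_point(distance_m: float, speed_kmh: float) -> float:
    """Time until the vehicle reaches the decision point at constant speed."""
    if speed_kmh <= 0:
        raise ValueError("speed must be positive")
    speed_ms = speed_kmh / 3.6  # convert km/h to m/s
    return distance_m / speed_ms


def response_budget_s(distance_m: float, speed_kmh: float,
                      driver_reaction_s: float = 0.0) -> float:
    """Time the navigator has to answer, leaving the driver time to react."""
    return seconds_until_point(distance_m, speed_kmh) - driver_reaction_s


# Illustrative (assumed) scenarios: label, distance to the turn, speed.
scenarios = [("city road", 200.0, 50.0), ("highway", 500.0, 100.0)]
for label, distance_m, speed_kmh in scenarios:
    budget = response_budget_s(distance_m, speed_kmh, driver_reaction_s=2.0)
    print(f"{label}: {budget:.1f} s")
```

A negative budget would mean the navigator cannot answer in time even with zero latency, which is itself a useful thing to learn from the driving sessions.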

How fast can the system be?

Benchmark the typical time consumed by one exchange (one question from the driver, one response from the navigator) for different combinations of LLM / ASR / TTS implementations.
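A benchmark harness for this could look something like the sketch below. The backends here are fake stages that only sleep for a few milliseconds (the names like `asr_local` and the delays are invented placeholders), so the printed timings say nothing about real systems. The point is just the shape: time every ASR / LLM / TTS combination over a few repeats and take the median.

```python
import itertools
import statistics
import time
from typing import Callable, Dict, Tuple

Stage = Callable[[str], str]


def fake_stage(delay_s: float) -> Stage:
    """Build a placeholder stage that just waits, then passes the payload on."""
    def stage(payload: str) -> str:
        time.sleep(delay_s)
        return payload
    return stage


# Hypothetical backends with made-up delays; replace with real calls.
ASR: Dict[str, Stage] = {"asr_local": fake_stage(0.002), "asr_cloud": fake_stage(0.004)}
LLM: Dict[str, Stage] = {"llm_small": fake_stage(0.003), "llm_large": fake_stage(0.006)}
TTS: Dict[str, Stage] = {"tts_robotic": fake_stage(0.001), "tts_natural": fake_stage(0.005)}


def time_exchange(asr: Stage, llm: Stage, tts: Stage,
                  question: str, repeats: int = 3) -> float:
    """Median wall-clock seconds for one question -> answer round trip."""
    samples = []
    for _ in range(repeats):
        start = time.perf_counter()
        tts(llm(asr(question)))
        samples.append(time.perf_counter() - start)
    return statistics.median(samples)


results: Dict[Tuple[str, str, str], float] = {}
for (a_name, asr), (l_name, llm), (t_name, tts) in itertools.product(
        ASR.items(), LLM.items(), TTS.items()):
    results[(a_name, l_name, t_name)] = time_exchange(asr, llm, tts, "where do I turn?")

# Fastest combination first.
for combo, seconds in sorted(results.items(), key=lambda kv: kv[1]):
    print(combo, f"{seconds * 1000:.1f} ms")
```

The median is used rather than the mean so that one slow outlier (e.g. a network hiccup on a cloud backend) doesn't dominate a small number of repeats.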

Wake word or push to talk?

Obviously a wake word is more natural but slower. If a button doesn't feel too unnatural to drivers, it's good enough.

How much visual?

Probably use an eye tracker or something similar to record how much the driver looks at the road and at the screen, if there is one.

I'm fairly sure someone else has done this before, so I'll look for papers.
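If the eye tracker can label each gaze sample with what the driver is looking at, the "how much" part reduces to counting. A minimal sketch, assuming samples arrive at a fixed rate and are already labelled with hypothetical target names such as "road" and "screen" (real eye trackers will have their own export formats):

```python
from collections import Counter
from typing import Dict, Iterable


def glance_shares(samples: Iterable[str]) -> Dict[str, float]:
    """Fraction of gaze samples spent on each target.

    Assumes a fixed sampling rate, so sample share equals time share.
    """
    counts = Counter(samples)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {target: n / total for target, n in counts.items()}


# Made-up example data: 8 samples on the road, 2 on the screen.
samples = ["road"] * 8 + ["screen"] * 2
print(glance_shares(samples))
```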

Implementation

In all honesty I don't think I can, or want to, do it alone, so I have no implementation plans for now.

Random thinking

Actually this all started with another niche personal interest: wanting a Yuduki Yukari (AI) rally co-driver that can automatically do recces, generate pace notes, and call them out in a timed session. That requires more precision but less conversation (not none, though).

Then I thought about it again, and maybe an AI for daily navigation has a lot more potential users, so it ended up here.

@@ Line 29: / Line 29: @@
 === Things to find out before digging in ===
-There are more things to find out of course.
+There are more things to find out than listed here of course.
 ==== How fast should the system be? ====
 Get some pairs of people to drive, similar to the [https://dl.acm.org/doi/abs/10.1145/1753326.1753516 Where should I turn] paper. Real roads or simlulators.
-The
+Focus on:
+# How much time (city road / highway) is left for the navigator to instruct the driver, after the driver asks for instructions, before the vehicle would miss the point to turn / stop / change lanes.
+#
 ==== How fast can the system be? ====
+Benchmark typical time consumed for one exchange (one question from the driver, one response from the navigator), of different combinations of LLM / ASR / TTS implementations.
+==== Wake word or push to talk? ====
+Obviously wake word is more natural but slower. If a button doesn't feel too unnatural for drivers, it's good enough.
+==== How much visual? ====
+Probably use an eye tracker or something to record how much the driver is looking at the road and at the screen if there is one.
+I quite believe someone else has done this before so I'll look for papers.
+=== Implementation ===
+''In all honesty I don't think I can or I want to do it alone, so I'm not having plans about implementations now.''
+=== Random thinking ===
+Actually this all started with another personal niche of wanting a Yuduki Yukari (AI) rally co-driver that can automatically do recces, generate pace notes, and call them out in a timed session. That requires more precision but less conversation (not none, though).
+Then I re-thinked about it and maybe an AI for daily navigation has a lot more potential users, so it's come here.