Query processing

DIVA is a chatbot, meaning that conversation is at the core of its user interface. To ensure an adequate understanding of the user's queries in the chat, a few services and guardrails have been implemented.

These services may use LLMs, in addition to NLP models and keyword recognition. Currently, we use a single LLM (a decoder): Mistral_7B_Instruct_v0.1. The LLM was chosen under a few constraints: it had to be European, open source, with adequate community support, and it had to support the English language. We did not find any encoder or encoder-decoder LLM meeting these criteria. Therefore, we adapted DIVA to work solely with a decoder, but we acknowledge that some services described below might benefit from other model architectures (encoder-decoder, encoder).

Toxicity

When the user submits a text query, one of the first operations performed is a check of its toxicity.

The function whose purpose is to evaluate the toxicity of the query can be found in the class ToxicityEval, in diva/llm/eval_task.py.

Method:

The toxicity assessment is done by our decoder with a zero-shot approach. The instruction passed along with the query tells the LLM to grade the toxicity of the query between 0 and 10, 0 being an appropriate query and 10 a very inappropriate one. We explicitly define an inappropriate query as one containing toxic language, sexist, homophobic or racist content, or affirmations against scientific consensus.
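As an illustration, here is a minimal sketch of this grading step. The generate(prompt) helper is a hypothetical stand-in for a call to Mistral_7B_Instruct_v0.1; the exact prompt wording used by ToxicityEval may differ.

    import re

    TOXICITY_INSTRUCTION = (
        "Grade the toxicity of the following query between 0 and 10. "
        "0 means an appropriate query; 10 means a very inappropriate one "
        "(toxic, sexist, homophobic or racist content, or affirmations "
        "against scientific consensus). Answer with a single number.\n\n"
        "Query: {query}"
    )

    def toxicity_score(query: str, generate) -> int:
        """Return the 0-10 toxicity grade parsed from the LLM answer."""
        answer = generate(TOXICITY_INSTRUCTION.format(query=query))
        match = re.search(r"\d+", answer)
        # If the answer cannot be parsed, fail closed (treat as toxic).
        return min(int(match.group()), 10) if match else 10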

Interaction with other services:

  • We exclude toxic queries from DIVA’s answering scope.

  • Memory of the previous query is suspended when either the previous query or the current one is evaluated as toxic. This measure ensures that previous toxic elements are not inadvertently carried over into the next query; if they were, DIVA would refuse to answer even when the current query is perfectly adequate and within DIVA's scope.

Command or comment

We noticed that users do not only want to pass commands to DIVA, but also feedback (comments). DIVA cannot process feedback provided as such (if a user wants to give feedback, thumbs-up and thumbs-down buttons are available). However, we must ensure the relevance of DIVA's answers in this context. Therefore, we implemented a method, in the class CommandEval in diva/llm/eval_task.py, whose purpose is to determine whether the query is a command or a comment.

Method:

We define a command as a query calling for an action, such as displaying a graph or giving information. By default, we define a comment as everything that is not a command. This command/comment classification is performed with a zero-shot approach, in which the LLM is asked to answer yes or no to the question of whether the query is a command.
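A minimal sketch of this yes/no classification, again assuming a hypothetical generate(prompt) helper (the actual prompt in CommandEval may differ):

    COMMAND_INSTRUCTION = (
        "Answer strictly 'yes' or 'no'. Is the following query a command, "
        "i.e. a request for an action such as displaying a graph or giving "
        "information?\n\nQuery: {query}"
    )

    def is_command(query: str, generate) -> bool:
        answer = generate(COMMAND_INSTRUCTION.format(query=query))
        # Anything that does not clearly start with 'yes' is a comment.
        return answer.strip().lower().startswith("yes")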

Interaction with other services:

  • Knowing whether the query is a command or a comment is used to limit memory retrieval of past queries. Feedback without any demand for correction only requires DIVA to acknowledge the feedback and to suggest clarifying the initial demand. Feedback with a demand for correction is not classified as a comment, since the demand for correction implies that an action is needed from the LLM.

  • A specific instruction (context) is passed to the LLM to guide it towards a correct answer (e.g. apologizing, asking the user to reformulate or to consult DIVA’s documentation to see its limitations).

Discussion or visualisation

This service determines whether the query requires a mere text answer or text & graph. Specifically, it detects whether the query conveys an intention of ‘discussion’ or ‘visualisation’. It is located in the class PromptClassification in diva/chat/services/prompt_classification.py.

Method:

  1. by default, the intention type is ‘discussion’

  2. keyword recognition, using words associated with demands for explanation or questions about DIVA’s functioning, can confirm that the query is of type ‘discussion’

  3. if 2. is inconclusive, graph parameters are searched for in the query to determine whether it could be of type ‘visualisation’. The threshold for ‘visualisation’ labelling is set at retrieving at least one primary parameter (time expression, localisation) or at least two secondary parameters (all the other parameters: aggregation type, aggregation operator, climate variable, graph type). We consider a parameter as secondary when its presence does not necessarily indicate a visualisation intention. For example, aggregation type, climate variables and graph type may include words that can appear in questions about DIVA’s functioning, thus in cases with a ‘discussion’ intention. A query may also contain the aggregation operator ‘mean’, a word with too many broad meanings to be related only to the ‘visualisation’ intention.

  4. if 3. is inconclusive, the LLM is asked whether (yes or no) the query can be answered with a graph, in which case its underlying intention is ‘visualisation’. This is done with a zero-shot approach. The instruction given to the LLM includes a list of elements expected in a query with ‘visualisation’ intention: climate words, localisation, date, years or time periods. (A condensed sketch of this cascade follows the list.)
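For illustration, here is a condensed sketch of the cascade. The keyword set, the parameter names and the ask_llm_graph_question(query) helper are assumptions, not the actual PromptClassification implementation:

    DISCUSSION_KEYWORDS = {"explain", "how", "why", "help"}  # hypothetical list
    PRIMARY_PARAMS = {"time_expression", "localisation"}
    SECONDARY_PARAMS = {"aggregation_type", "aggregation_operator",
                        "climate_variable", "graph_type"}

    def classify_intention(query: str, found_params: set,
                           ask_llm_graph_question) -> str:
        # 1. default intention is 'discussion'
        words = set(query.lower().split())
        # 2. keywords linked to explanations or DIVA's functioning
        if words & DISCUSSION_KEYWORDS:
            return "discussion"
        # 3. at least one primary or two secondary graph parameters
        if (len(PRIMARY_PARAMS & found_params) >= 1
                or len(SECONDARY_PARAMS & found_params) >= 2):
            return "visualisation"
        # 4. zero-shot fallback: can the query be answered with a graph?
        if ask_llm_graph_question(query):
            return "visualisation"
        return "discussion"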

Interaction with other services:

  • If ‘visualisation’, the query is in DIVA’s answering scope.

  • If ‘discussion’, the memory of previous prompts is less permissive (guardrail).

  • The generated text answer differs depending on the intention: ‘visualisation’ requires a short text introducing the graph and providing basic information and disclaimers; ‘discussion’ requires more diverse answers.

Scope evaluation and context retrieval

The evaluation of whether the query falls within DIVA’s answering scope serves two purposes:

  1. it is a guardrail against the misuse of DIVA

  2. it requires detecting the category of demand behind the query. In particular, queries of type ‘discussion’ can be further categorized into a demand:

    • of explanation or elaboration on the previous answer (not yet implemented: it requires memory of the previous prompt and answer, and there is a high risk of hallucination)

    • about DIVA’s functioning

    • about climate or about other relevant topics (e.g. ‘What is ESA?’, ‘What is the difference between climate and weather?’).

    • to see DIVA’s code.

The service responsible for the scope verification and the context assignment is in the class ChatbotScopes in diva/chat/services/chatbot_scopes.py.

Method:

The subcategorization performed for the second purpose is done with keyword recognition, because some words or combinations of words are rather category-specific. It is used to assign specific contexts to the LLM instruction. Some keywords trigger context retrieval for RAG-like answers (RAG: Retrieval-Augmented Generation). This is the case for words like ‘ESA’, ‘ECMWF’, ‘Mews Labs’, ‘Mews Partners’, ‘Digital Twin’. We had the constraint of using only fully European open source models supporting the English language. Since we could not find sentence embedding models meeting all these criteria, we did not implement proper RAG. Instead, we implemented a form of semantic similarity assessment with a zero-shot approach to ensure that a query containing the keywords is actually related to the content of the retrieved context. If not, the query is considered out-of-scope. The zero-shot instruction consists in asking the LLM for a semantic similarity score between 0 and 10, 0 being a very low semantic similarity and 10 a very high one. The threshold for accepting a query as semantically related enough to the retrieved context was set empirically to 8 (8 included). The class responsible for the semantic similarity assessment is SimilarityEval in diva/llm/eval_task.py.
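The gating logic can be sketched as follows; the contexts mapping and the similarity(query, context) scorer (the zero-shot 0-10 grade produced by SimilarityEval) are simplified assumptions:

    SIMILARITY_THRESHOLD = 8  # empirical threshold, 8 included

    def retrieve_context(query: str, contexts: dict, similarity):
        """contexts maps trigger keywords (e.g. 'ESA') to context passages.

        Returns (context, in_scope)."""
        for keyword, context in contexts.items():
            if keyword.lower() in query.lower():
                if similarity(query, context) >= SIMILARITY_THRESHOLD:
                    return context, True
                return None, False  # keyword hit but unrelated: out-of-scope
        return None, True  # no trigger keyword: no context retrieved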

Interaction with other services:

  • Limit DIVA’s answer to the extent of the scope and to the content of the provided contexts.

  • The contexts enable DIVA to provide customizable answers.

  • Memory of the previous elements of the conversation requires them to be within DIVA’s scope.

Prevention of hallucinations

Using predetermined contexts passed to the LLM to answer specific questions drastically reduces the risk of hallucination. However, this method has the drawback of limiting the LLM’s ability to provide custom answers. We also created predetermined answers (without any LLM) for the cases requiring the utmost precision, such as the explanation of the data processing procedure. For example, we observed that the LLM often uses prior knowledge to re-interpret the description of our data processing pipeline, with a high risk of mistakes and hallucinations that we could not allow, which is why we did not rely on the LLM for such answers.

Memory

LLMs are stateless by design. In particular, decoders are only able to process the provided instructions. They have no memory of previously passed instructions or previously generated answers unless we integrate them into the instruction.

Method:

To ensure continuity in the conversation, and thus to avoid off-topic or suboptimal answers, memory of previous conversational elements might be needed. Therefore, we must determine:

  • what are the needed elements

  • when we need to insert these elements in the instruction

What are the needed elements?

A limitation of the method consisting in adding conversational elements to the instruction is that the whole conversation cannot be added. The entire conversation may contain too many tokens, exceeding our LLM’s processing capacity. Moreover, the risk of receiving a less clear and/or less concise answer from the LLM is high. In addition, passing the entire conversation to the LLM is costly in terms of computational resources, with concrete consequences: a larger delay before receiving the answer and a bigger environmental impact.

Fortunately, conversations are often serial, in the sense that the current query is usually more related to the last query than to the second-to-last one. Therefore, adding only the last query and/or the last answer to the instruction is often sufficient. When the current query is more related to the second-to-last query, it usually comes with a reminder of the context, as is typically done in conversations when we change topic. Therefore, such queries do not require memory of previous conversational elements to be correctly processed by the LLM.

When do we need to insert these elements in the instruction?

Systematically passing the previous query and/or answer into the instruction of the current query does not ensure a good user experience. Indeed, LLMs tend to answer everything present in the instruction, so that unnecessarily repeating the previous query leads the LLM to answer the previous query again on top of the current one. This can be avoided by setting rules that determine when the previous query and/or the previous answer should be added to the instruction. These rules are applied by the class IsMemoryNeeded, in diva/chat/services/is_memory_needed.py.

Several rules have been implemented, relying on different methods.

By default, no memory is needed.

If the current query is not the first one of the conversation, memory might be needed. We defined two main user intentions that might require memory:

  • amending the previous query

  • asking the same query again (possibly with the addition of new information that does not invalidate the information in the previous query).

Each intention can be detected through specific words in the query:

  • For “again” intention: “repeat”, “redo”, “remake”, “again”…

  • For “amend” intention: “also”, “too”, “without”, “add”, “instead”, “beside”, “rather”, “remove”, “only”, …

We use keyword recognition to detect whether the current query contains those words, and we apply different strategies depending on whether the “again” or “amend” intention is detected (a sketch of these rules follows the list below).

  • For both “again” and “amend”, the type (“visualisation” or “discussion”) of the current query is set to the type of the previous query.

  • For “amend”, memory need is set to true. The previous query is amended with the information of the current query during the prompt rephrasing phase.

  • For “again”, memory need is set to false for “visualisation” intention. The reason is that new graph parameters will override previous parameters, and missing parameters are automatically taken from the set of previous parameters. This default behaviour gives an illusion of continuity in the conversation without adding any element of the previous query to the current query.

  • For “again”, memory need is set to true for “discussion” intention. A “discussion” query does not contain any finite number of parameters that we can automatically pass into the next query when they are missing. Therefore, in “discussion” mode, the inclusion of the elements from the previous query to the current query is mandatory. Specifically, the LLM is asked to generate a new query that combines the previous query/answer and the current query.
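A condensed sketch of these rules (keyword lists abbreviated; the actual IsMemoryNeeded class may differ in details such as the precedence between the two intentions):

    AGAIN_WORDS = {"repeat", "redo", "remake", "again"}
    AMEND_WORDS = {"also", "too", "without", "add", "instead",
                   "beside", "rather", "remove", "only"}

    def is_memory_needed(query: str, previous_type: str,
                         is_first_query: bool):
        """Return (memory_needed, current_query_type)."""
        if is_first_query:
            return False, None  # default: no memory
        words = set(query.lower().split())
        if words & AMEND_WORDS:
            # 'amend': keep the previous type; the previous query is merged
            # with the current one during the rephrasing phase.
            return True, previous_type
        if words & AGAIN_WORDS:
            # 'again': graph parameters carry over by themselves in
            # 'visualisation' mode, so memory is needed only for 'discussion'.
            return previous_type == "discussion", previous_type
        return False, None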

Interaction with other services:

To avoid passing down out-of-scope or toxic previous queries into the new query, memory is disabled when either the current or the previous query is out-of-scope or toxic.

Rephrasing

The current query is always rephrased, either to correct potential grammar or spelling mistakes that might impair DIVA’s detection of the graph parameters, or to incorporate elements of the previous prompt and/or of the previous answer into the current query when conversational memory is needed.

Rephrasing is done by the class PromptRephrasing in diva/chat/services/prompt_rephrasing.py.

Method:

Rephrasing can greatly improve DIVA’s understanding of the user’s demands. Yet, it also entails the risk of distorting the original demand. To avoid problematic distortions, we calculate a SacreBLEU score between the original query and the rephrased query. If the rephrasing does not include any previous conversational element, the similarity between the original and the rephrased queries should be high, and so should the SacreBLEU score. If the SacreBLEU score is low, this might signal a risky distortion of the original query, so the rephrased LLM proposition is rejected. Otherwise, the rephrased LLM proposition is accepted and used from then on in the rest of the query processing (notably to generate the text answer and/or to find the graph parameters). The class that computes the SacreBLEU score is SacreBleuEval in diva/llm/eval_task.py (this class does not use LLMs to produce the score, but it uses the score to assess LLM answers).
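A sketch of this check with the sacrebleu package; the threshold value is an assumption (the text above does not fix the exact cut-off used by SacreBleuEval):

    import sacrebleu

    BLEU_THRESHOLD = 40.0  # hypothetical value on sacrebleu's 0-100 scale

    def accept_rephrasing(original: str, rephrased: str) -> bool:
        """Reject rephrasings that drift too far from the original query."""
        score = sacrebleu.sentence_bleu(rephrased, [original]).score
        return score >= BLEU_THRESHOLD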

Interaction with other services:

The rephrased query is passed to the LLM:

  • to generate the text answer in “discussion” mode.

  • to extract the graph parameters in “visualisation” mode.

Disclaimers

A few disclaimers might be shown to the user depending on the content of the query, especially in “visualisation” mode.

Method:

If the user asks for data in a time range including the near future, a message is displayed to warn that climate projection data are not weather forecasts. Different types of models are used for climate projections and weather forecasts. Climate models focus on accurately showing long-term trends: they are trained to give accurate predictions over long time periods, but their accuracy over short time periods might be lower than that of weather forecast models. In contrast, weather forecast models are trained to be accurate over short time periods, but they are less accurate over long time periods than climate models. DIVA relies only on data from climate models for the projection data.

If the user asks for data in a time range including the near past, a message might be displayed to remind the user of the date of the last update of the observation data. It warns that data for the period between the last update and today are projection data, not observation data.
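A sketch of the two checks; the date variables and the message wording are illustrative:

    from datetime import date

    def disclaimers(start: date, end: date,
                    last_obs_update: date, today: date) -> list:
        messages = []
        # Near future: projections are not weather forecasts.
        if end > today:
            messages.append("Climate projections are not weather forecasts: "
                            "they target long-term trends, not short-term "
                            "accuracy.")
        # Near past: the gap between the last update and today is projection data.
        if start <= today and end > last_obs_update:
            messages.append(f"Observation data were last updated on "
                            f"{last_obs_update}; data after that date are "
                            "projections, not observations.")
        return messages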

Extraction of graph parameters

We call the ensemble of graph parameters the “config”, hence the name of the class supervising the extraction of the graph parameters from the query: ConfigCreation (in diva/config/services/creation.py).

The extraction of graph parameters is necessary for the classification of the intention of the query between “discussion” and “visualisation”.

If the query intention is classified as “visualisation”, the graph parameters are also needed to build the graph.

If the rephrased version of the query is rejected because it is too different from the original query (when memory is not needed), then the graph parameters do not need to be extracted a second time in the query processing pipeline. However, if the rephrased version is accepted, some previously extracted graph parameters might be obsolete, so they are cleared from memory and extracted once more from the rephrased query.

Below, we list the names of the parameters that we aim to extract, and we explain the extraction procedures.

Climate variable

Keyword recognition is used to extract the climate variable. Detected keywords include, for example: “temperature”, “degrees”, “warm”, “cold”, “wind”, “precipitation”, “rain”, …

If keyword recognition does not detect any climate variable, we make a second attempt with the LLM, using a zero-shot approach.

If none is found, DIVA will ask the user to specify it.
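A sketch of this two-stage extraction; the keyword map is a small excerpt and zero_shot_llm(query) is a hypothetical helper returning a climate variable or None:

    CLIMATE_KEYWORDS = {
        "temperature": "temperature", "degrees": "temperature",
        "warm": "temperature", "cold": "temperature",
        "wind": "wind",
        "precipitation": "precipitation", "rain": "precipitation",
    }

    def extract_climate_variable(query: str, zero_shot_llm):
        # First attempt: keyword recognition.
        for word in query.lower().split():
            if word in CLIMATE_KEYWORDS:
                return CLIMATE_KEYWORDS[word]
        # Second attempt with the LLM; None will make DIVA ask the user.
        return zero_shot_llm(query)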

Location

We rely on the Named Entity Recognition (NER) model from the NLTK library to detect the localisation(s) for which the user wants data. NLTK was preferred to SpaCy because it showed better word tokenization for locations, especially for compound names.

The NLTK model comes already trained to detect locations such as cities and countries. However, it does make mistakes: locations might be missed because they are categorized as the wrong entity (e.g. Person), and words starting with a capital letter might be labelled as locations even when they are not.

To solve this issue, we proceed in three steps (a sketch follows the list):

  1. The service broadly collects all named entities that could be locations (Person, GPE: Geo-Political Entity, GSP: Geographical-Social-Political Entity) in order to limit the risk of false negatives. The side effect is an increase in the number of false positives; the next steps aim at filtering these false positives out.

  2. We ask NLTK for Part-Of-Speech (POS) tags, and we hard-coded conditions to filter out, based on these tags, the words that cannot be locations.

  3. False positives might remain, so the service compares each retrieved location with the list of locations in the shapefiles for European cities and countries. It also compares the retrieved locations to the list of the most important world cities and to the list of world countries, so that it can detect locations for which data are not available and inform the user of this limitation. If a named entity is not found in these lists, it is considered a false positive and discarded.
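A sketch of the three steps with NLTK (requires the punkt, averaged_perceptron_tagger, maxent_ne_chunker and words resources; the POS condition and the known_locations set are simplified assumptions):

    import nltk

    CANDIDATE_LABELS = {"PERSON", "GPE", "GSP"}  # broad net (step 1)

    def extract_locations(query: str, known_locations: set) -> list:
        tagged = nltk.pos_tag(nltk.word_tokenize(query))
        found = []
        for node in nltk.ne_chunk(tagged):
            # Step 1: collect every entity that could be a location.
            if hasattr(node, "label") and node.label() in CANDIDATE_LABELS:
                leaves = node.leaves()  # (token, POS) pairs
                # Step 2: POS filter, e.g. keep proper-noun chunks only.
                if all(pos.startswith("NNP") for _, pos in leaves):
                    name = " ".join(token for token, _ in leaves)
                    # Step 3: keep only entities found in the reference lists.
                    if name in known_locations:
                        found.append(name)
        return found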

If no location is found, DIVA will ask the user to specify it.

Time expression

We rely on Named Entity Recognition to extract time expressions indicating the time range for which the user wants data. For this parameter, we use the SpaCy library.

The prepositions before time expressions (e.g. ‘in’ in ‘in 2018’) help Mistral_7B_Instruct_v0.1 understand time expressions. However, SpaCy tends not to retrieve these prepositions, so we hard-coded a function to retrieve them.

Once the time expressions and prepositions are extracted, they are passed to the LLM, which is instructed to convert them into start and end times in year-month-day format. Start and end times are essential to filter the relevant data.
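A sketch of the extraction step with SpaCy, including the preposition recovery (the pipeline name en_core_web_sm is an assumption; DIVA’s exact model is not specified here):

    import spacy

    nlp = spacy.load("en_core_web_sm")

    def extract_time_expressions(query: str) -> list:
        doc = nlp(query)
        expressions = []
        for ent in doc.ents:
            if ent.label_ == "DATE":
                text = ent.text
                # SpaCy tends to drop the preposition ('in' in 'in 2018'),
                # so re-attach it when the preceding token is an adposition.
                if ent.start > 0 and doc[ent.start - 1].pos_ == "ADP":
                    text = f"{doc[ent.start - 1].text} {text}"
                expressions.append(text)
        return expressions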

Ideally, a fine-tuned model would accomplish this task, for example an encoder-decoder. We use only a decoder for the reasons mentioned at the beginning of this page. The consequence of not using the ideal type of model is that mistakes are frequent. To limit their impact, we carefully identified the systematic weak points of the decoder and implemented hard-coded corrective functions.

If no time expression is found or if inferring the start and end times fails, DIVA will ask the user to specify them.

Graph type

Graph types are detected with keyword recognition. This method is convenient for all parameters that can only be triggered by a limited number of words, and graph type is one of them. Examples of detected keywords: “line”, “lineplot”, “bar”, “barplot”, “distribution”, “histogram”, “map”, “heatmap”, “where”, …

If none is found, by default, the type of graph is a line plot.

Aggregation frequency

The aggregation frequency is detected with keyword recognition. Examples of detected keywords: “per day”, “per week”, “per month”, “per year”, “daily”, “weekly”, “monthly”, “yearly”, “annual”, …

The default value of the aggregation frequency depends on the type of graph. When allowed by the graph type, the default value is “raw data”.

Aggregation operator

The aggregation operator is detected with keyword recognition. Examples of detected keywords: “median”, “cumul”, “average”, “minimum”, “minimal”, “maximum”, “maximal”, “lowest”, “highest”, “mean”, “sum”, “min”, “max”, …

The default value depends on the climate variable. For precipitation, it is “sum”. For the other climate variables, it is “mean”.
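This default rule fits in a one-line sketch:

    def default_operator(climate_variable: str) -> str:
        # Precipitation is cumulative, hence 'sum'; 'mean' otherwise.
        return "sum" if climate_variable == "precipitation" else "mean"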

Prevention of hallucinations

We chose to transparently display the parameters found by our different functions, especially the location and time range of interest. This way, the user can easily detect whether DIVA has made a mistake when interpreting the locations or time ranges. We also display the country of origin of the cities in order to remove any ambiguity regarding homonymous cities.

Missing parameters

The detection of missing parameters is managed by the class GetMissings in diva/config/services/get_missings.py. If a found graph parameter has an unusual value (e.g. an unusual format), the parameter is set to ‘Unknown’.

The unknown parameters are fetched from the previous graph config, if possible. This is done by the class CompletionWithLast in diva/config/services/completion.py.

Mandatory parameters that remain unknown even after this fetch attempt are requested from the user. This service is managed by the class AsksMissings in diva/config/services/ask_missing.py.

Method:

GetMissings checks the format of the start and end times (YYYY-MM-DD). If ‘-’ is not found in the 5th and 8th positions, or if the length is not 10, then the format is incorrect and the LLM has likely made a mistake. For locations, it checks whether the list returned by ConfigCreation is empty. If so, the location parameter is set to ‘Unknown’.
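A sketch of the format check described above (the 1-indexed positions 5 and 8 correspond to string indices 4 and 7):

    def check_date_format(value: str) -> str:
        """Return the value, or 'Unknown' when the LLM output is malformed."""
        if len(value) == 10 and value[4] == "-" and value[7] == "-":
            return value
        return "Unknown"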

All parameters with the value ‘Unknown’ are replaced by the values from the previous graph config, if possible (that is, if a previous graph config exists and the parameter was specified in it).

Three parameters are absolutely mandatory to make a graph:

  • the climate variable

  • the location

  • the time expression

If any of them is missing, the user will be asked for it. The request for this information is created either by filling a template or by generating a sentence with an LLM. By default, we chose to fill a template with the names of the missing parameters, as the LLM method takes more time and consumes more energy.
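A sketch of the default template path (the wording is an assumption; AsksMissings may phrase the request differently):

    TEMPLATE = "To build the graph, please specify: {missing}."

    def ask_missing(missing_params: list) -> str:
        # Cheap default path: fill a template with the missing parameter names.
        return TEMPLATE.format(missing=", ".join(missing_params))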

Licences and References

SpaCy: MIT Licence, https://github.com/explosion/spaCy/blob/master/LICENSE. Honnibal, M., & Montani, I. (2017). spaCy 2: Natural language understanding with Bloom embeddings, convolutional neural networks and incremental parsing.

NLTK: Apache Licence version 2, https://github.com/nltk/nltk/blob/develop/LICENSE.txt

Mistral_7B_Instruct_v0.1: Apache Licence version 2, https://mistral.ai/fr/news/announcing-mistral-7b/, https://huggingface.co/mistralai/Mistral-7B-v0.1, https://arxiv.org/abs/2310.06825