#nlp


We’ve redesigned mobile email replies - with & without AI. Tap sentences while reading to enter local responses (or get suggestions). Then connect them on a usual draft screen (or let AI do just that). Result: Flexible workflows with varying speed and control. #CHI2025 preprint in comments.

#HCI #AI #NLP

ReadMe2KG: GitHub ReadMe to Knowledge Graph #Challenge has been published as part of the Natural Scientific Language Processing and Research Knowledge Graphs #NSLP2025 workshop, co-located with #eswc2025. This #NER task aims to complement the NFDI4DataScience KG via information extraction from GitHub README files.

task description: nfdi4ds.github.io/nslp2025/doc
website: codabench.org/competitions/539

@eswc_conf @GenAsefa @shufan @NFDI4DS #NFDIrocks #knowledgegraphs #semanticweb #nlp #informationextraction


8/n

[2] Jinhyuk Lee, Anthony Chen, Zhuyun Dai, Dheeru Dua, Devendra Singh Sachan, Michael Boratko, Yi Luan, Sébastien M. R. Arnold, Vincent Perot, Siddharth Dalmia, Hexiang Hu, Xudong Lin, Panupong Pasupat, Aida Amini, Jeremy R. Cole, Sebastian Riedel, Iftekhar Naim, Ming-Wei Chang, and Kelvin Guu. 2024. Can Long-Context Language Models Subsume Retrieval, RAG, SQL, and More? arxiv.org/abs/2406.13121

arXiv.org: "Can Long-Context Language Models Subsume Retrieval, RAG, SQL, and More?"
Long-context language models (LCLMs) have the potential to revolutionize our approach to tasks traditionally reliant on external tools like retrieval systems or databases. Leveraging LCLMs' ability to natively ingest and process entire corpora of information offers numerous advantages. It enhances user-friendliness by eliminating the need for specialized knowledge of tools, provides robust end-to-end modeling that minimizes cascading errors in complex pipelines, and allows for the application of sophisticated prompting techniques across the entire system. To assess this paradigm shift, we introduce LOFT, a benchmark of real-world tasks requiring context up to millions of tokens designed to evaluate LCLMs' performance on in-context retrieval and reasoning. Our findings reveal LCLMs' surprising ability to rival state-of-the-art retrieval and RAG systems, despite never having been explicitly trained for these tasks. However, LCLMs still face challenges in areas like compositional reasoning that are required in SQL-like tasks. Notably, prompting strategies significantly influence performance, emphasizing the need for continued research as context lengths grow. Overall, LOFT provides a rigorous testing ground for LCLMs, showcasing their potential to supplant existing paradigms and tackle novel tasks as model capabilities scale.
#NLP #NLProc #RAG

7/

REFERENCES

[1] Yifu Qiu, Varun Embar, Yizhe Zhang, Navdeep Jaitly, Shay B. Cohen, and Benjamin Han. 2025. Eliciting In-context Retrieval and Reasoning for Long-context Large Language Models. arxiv.org/abs/2501.08248

arXiv.org: "Eliciting In-context Retrieval and Reasoning for Long-context Large Language Models"
Recent advancements in long-context language models (LCLMs) promise to transform Retrieval-Augmented Generation (RAG) by simplifying pipelines. With their expanded context windows, LCLMs can process entire knowledge bases and perform retrieval and reasoning directly -- a capability we define as In-Context Retrieval and Reasoning (ICR^2). However, existing benchmarks like LOFT often overestimate LCLM performance by providing overly simplified contexts. To address this, we introduce ICR^2, a benchmark that evaluates LCLMs in more realistic scenarios by including confounding passages retrieved with strong retrievers. We then propose three methods to enhance LCLM performance: (1) retrieve-then-generate fine-tuning, (2) retrieval-attention-probing, which uses attention heads to filter and de-noise long contexts during decoding, and (3) joint retrieval head training alongside the generation head. Our evaluation of five well-known LCLMs on LOFT and ICR^2 demonstrates significant gains with our best approach applied to Mistral-7B: +17 and +15 points by Exact Match on LOFT, and +13 and +2 points on ICR^2, compared to vanilla RAG and supervised fine-tuning, respectively. It even outperforms GPT-4-Turbo on most tasks despite being a much smaller model.
#NLP #NLProc #RAG

6/

In extensive experiments on five LCLMs using both the LOFT and ICR² benchmarks, our best approach on Mistral-7B with a 32K token limit outperformed the vanilla RAG and SFT baselines by an average of +17 and +15 points (Exact Match) on LOFT, and by +13 and +2 points on ICR², respectively (picture). It even achieved performance comparable to the state-of-the-art GPT-4-Turbo, despite having only 7B parameters.
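For context on the metric: Exact Match (EM) counts a prediction as correct only if it equals a gold answer after light normalization. A minimal sketch using the common SQuAD-style normalization (the paper may normalize differently):

```python
# Rough sketch of SQuAD-style Exact Match; the paper's exact normalization may differ.
import re
import string

def normalize(text: str) -> str:
    """Lowercase, drop punctuation and English articles, collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())

def exact_match(prediction: str, gold_answers: list[str]) -> int:
    """1 if the normalized prediction equals any normalized gold answer, else 0."""
    return int(any(normalize(prediction) == normalize(g) for g in gold_answers))

print(exact_match("The Eiffel Tower", ["Eiffel Tower"]))  # -> 1
```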

#NLP #NLProc #RAG

4/

With a more realistic benchmark in hand, we systematically explored three approaches to enhance model performance:

1. Retrieve-then-generate supervised fine-tuning (picture): we train LCLMs to first retrieve relevant information from the context and then generate the final responses.

2. Retrieval-attention-probing: During inference, we probe attention heads activated for in-context retrieval, and use their top predictions to filter out confounders.
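A rough sketch of approach 2 (mine, not the authors' code; it assumes we already have one retrieval-oriented attention head's weights over the context tokens and the token span of each candidate passage): score each passage by its summed attention mass and keep only the top-scoring ones for decoding.

```python
# Hedged sketch of retrieval-attention-probing (not the authors' implementation).
import numpy as np

def probe_passages(attn_over_context, passage_spans, top_k=3):
    """attn_over_context: 1-D array of attention weights over context tokens.
    passage_spans: list of (start, end) token indices, one per passage.
    Returns the indices of the top_k passages by summed attention mass."""
    scores = np.array([attn_over_context[s:e].sum() for s, e in passage_spans])
    return [int(i) for i in np.argsort(scores)[::-1][:top_k]]

# Toy example: 3 passages over a 12-token context; passage 1 receives the most attention.
attn = np.array([0.01, 0.02, 0.01, 0.02,   # passage 0
                 0.20, 0.25, 0.15, 0.10,   # passage 1
                 0.05, 0.09, 0.05, 0.05])  # passage 2
spans = [(0, 4), (4, 8), (8, 12)]
print(probe_passages(attn, spans, top_k=1))  # -> [1]
```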

#NLP #NLProc #RAG

3/

This limitation often leads to inflated results. To address this, we created a more realistic dataset, ICR². It uses five retrievers to generate challenging negative documents (picture 1). Our results show a significant performance drop with standard RAG setups. For example, with GPT-4-Turbo, accuracy on NQ dropped from 0.85 to 0.67, and on HPQA it fell from 0.78 to 0.64 (picture 2).
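As a toy illustration of retriever-mined confounders (a hedged sketch, not the ICR² construction, which combines five retrievers over real corpora): rank passages with BM25, drop the gold ones, and keep the top-scoring rest as confounders.

```python
# Hedged sketch of retriever-based confounder mining (BM25 only, one toy query).
# pip install rank-bm25
import re
from rank_bm25 import BM25Okapi

def tok(text: str) -> list[str]:
    """Tiny tokenizer: lowercase alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())

corpus = [
    "Paris is the capital of France.",                              # gold passage
    "Lyon is the second-largest urban area in France.",             # plausible confounder
    "The Eiffel Tower is located in Paris.",                        # plausible confounder
    "Photosynthesis converts light energy into chemical energy.",   # random negative
]
gold = {0}

bm25 = BM25Okapi([tok(doc) for doc in corpus])
scores = bm25.get_scores(tok("What is the capital of France?"))

# Rank all passages by BM25 score and keep the best-scoring non-gold ones.
ranked = sorted(range(len(corpus)), key=lambda i: scores[i], reverse=True)
confounders = [i for i in ranked if i not in gold][:2]
print([corpus[i] for i in confounders])  # France-related distractors, not the photosynthesis passage
```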

#NLP #NLProc #RAG

2/

But are current LCLMs up to the task? If not, how can we improve their performance?

In our preprint [1], we evaluated five popular LCLMs using the LOFT benchmark [2], which involves answering questions paired with documents. However, LOFT relies on random sampling to create irrelevant (negative) documents for each query, failing to include confounding documents — those that are relevant but misleading — which are common in real-world scenarios.

#NLP #NLProc #RAG

1/

What if #LLMs had context windows so large that an entire knowledge base could fit into a single prompt? This would revolutionize Retrieval-Augmented Generation (RAG) applications by enabling retrieval, re-ranking, reasoning, and generation all in one step. With a Long-Context Language Model (LCLM), we could simplify RAG architecture by leveraging the model’s capability for In-Context Retrieval and Reasoning (ICR²).
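A minimal sketch of that single-prompt setup (mine, not from the preprint; the prompt wording and the generate() call are placeholders): the whole knowledge base goes into one prompt, and the model is asked to retrieve, cite, and answer in one pass.

```python
# Hedged sketch of in-context retrieval and reasoning with one giant prompt.
# `generate` stands in for whatever long-context model API you use.

def build_icr_prompt(corpus: list[str], question: str) -> str:
    """Put the entire (small) knowledge base into the context, then ask the model
    to pick supporting passages and answer in a single step."""
    passages = "\n".join(f"[{i}] {doc}" for i, doc in enumerate(corpus))
    return (
        "You are given a knowledge base of numbered passages.\n"
        f"{passages}\n\n"
        f"Question: {question}\n"
        "First list the IDs of the passages that support your answer, "
        "then give a concise answer."
    )

corpus = [
    "Marie Curie won Nobel Prizes in Physics (1903) and Chemistry (1911).",
    "The Nobel Prize in Physics 1903 was shared with Pierre Curie and Henri Becquerel.",
]
prompt = build_icr_prompt(corpus, "In which fields did Marie Curie win Nobel Prizes?")
print(prompt)
# response = generate(prompt)   # placeholder: call your long-context model here
```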

#NLP #NLProc #RAG

If you care about theoretical computer science, you should watch this lovely talk by Jon Kleinberg on language generation being easier than language identification. He mentions an interesting tension between hallucination and mode collapse.

arxiv.org/abs/2404.06757
youtube.com/live/BrdaVSZBuyU?f

arXiv.org: "Language Generation in the Limit"
Although current large language models are complex, the most basic specifications of the underlying language generation problem itself are simple to state: given a finite set of training samples from an unknown language, produce valid new strings from the language that don't already appear in the training data. Here we ask what we can conclude about language generation using only this specification, without further assumptions. In particular, suppose that an adversary enumerates the strings of an unknown target language L that is known only to come from one of a possibly infinite list of candidates. A computational agent is trying to learn to generate from this language; we say that the agent generates from L in the limit if after some finite point in the enumeration of L, the agent is able to produce new elements that come exclusively from L and that have not yet been presented by the adversary. Our main result is that there is an agent that is able to generate in the limit for every countable list of candidate languages. This contrasts dramatically with negative results due to Gold and Angluin in a well-studied model of language learning where the goal is to identify an unknown language from samples; the difference between these results suggests that identifying a language is a fundamentally different problem than generating from it.
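For reference, a rough formalization of the central definition, paraphrased from the abstract (notation mine, not the paper's exact statement):

```latex
% An adversary enumerates the strings of an unknown target language $L$ drawn from a
% countable candidate list $L_1, L_2, \dots$; the agent $G$ sees the enumeration so far.
\[
G \text{ generates from } L \text{ in the limit if, for every enumeration }
w_1, w_2, \dots \text{ of } L,\
\exists\, t_0 \text{ such that } \forall t \ge t_0:\;
G(w_1, \dots, w_t) \in L \setminus \{w_1, \dots, w_t\}.
\]
```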