large language models

Deep Research Comparator: A Platform for Fine-Grained Human Annotations of Deep Research Agents

Effectively evaluating deep research agents that autonomously search the web, analyze information, and generate reports remains a major challenge, particularly when it comes to assessing long reports and giving detailed feedback on their intermediate …

RefusalBench: Generative Evaluation of Selective Refusal in Grounded Language Models

The ability of language models in RAG systems to selectively refuse to answer based on flawed context is critical for safety, yet remains a significant failure point. Our large-scale study reveals that even frontier models struggle in this setting, …

The Amazon Nova Family of Models: Technical Report and Model Card

We present Amazon Nova, a new generation of state-of-the-art foundation models that deliver frontier intelligence and industry-leading price performance. Amazon Nova Pro is a highly capable multimodal model with the best combination of accuracy, …

Unified Embeddings for Multimodal Retrieval via Frozen LLMs

On Conditional and Compositional Language Model Differentiable Prompting

Prompts have been shown to be an effective method to adapt a frozen Pretrained Language Model (PLM) to perform well on downstream tasks. Prompts can be represented by a human-engineered word sequence or by a learned continuous embedding. In this …