Markus Dreyer

Principal Applied Scientist

Amazon AGI

Biography

Markus Dreyer is a Principal Applied Scientist at Amazon, where he works on large language models, text generation, and multi-agent systems. He contributed to the Amazon Nova family of foundation models and led the development of Nova Deep Research. His research spans summarization, question answering, and natural language understanding, and he has published at venues including ACL, EMNLP, NAACL, and NeurIPS. He holds a Ph.D. in Computer Science from Johns Hopkins University.

Interests

Artificial Intelligence
Large Language Models
Machine Learning
Natural Language Processing

Education

Ph.D. in Computer Science, 2011

Johns Hopkins University

M.Sc. in Computer Science, 2007

Johns Hopkins University

M.A. in Computational Linguistics, 2002

Heidelberg University

Experience

Principal Applied Scientist

Amazon

January 2020 – Present, Seattle, WA

Contributed to the Nova family of large language models, improving capabilities for information agents and tool use. Led development of Nova Deep Research, a multi-agent system. Research on text generation, question answering, and summarization.

Senior Applied Scientist / Manager

Amazon

March 2015 – December 2019, Seattle, WA

Machine learning methods for Alexa natural language understanding. Managed a cross-functional team delivering production-grade NLU systems.

Senior Research Scientist, Manager of MT Technology

SDL (Language Weaver)

January 2013 – February 2015, Los Angeles, CA

Led a research and engineering team developing AI-driven machine translation systems.

Research Scientist

SDL (Language Weaver)

January 2011 – December 2012, Los Angeles, CA

Designed and implemented a modular, scalable machine translation engine.

Research Assistant

CLSP, Johns Hopkins

January 2004 – December 2010, Baltimore, MD

Semi-supervised learning of nonconcatenative morphology, based on graphical models, the Dirichlet process and log-linear parameterization of finite-state machines.

Research Engineer

IBM, Speech Group

May 1999 – February 2003, Heidelberg, Germany

Text-to-speech (TTS) technology and parsing methods.

Featured Publications

Aaron Langford, Aayush Shah, Abhanshu Gupta, Abhimanyu Bhatter, Abhinav Goyal, Abhinav Mathur, Abhinav Mohanty, Abhishek Kumar, Abhishek Sethi, Abi Komma, others

January 2025 arXiv preprint arXiv:2506.12103

The Amazon Nova Family of Models: Technical Report and Model Card

We present Amazon Nova, a new generation of state-of-the-art foundation models that deliver frontier intelligence and industry-leading price performance. Amazon Nova Pro is a highly-capable multimodal model with the best combination of accuracy, speed, and cost for a wide range of tasks. Amazon Nova Lite is a low-cost multimodal model that is lightning fast for processing images, video, documents and text. Amazon Nova Micro is a text-only model that delivers our lowest-latency responses at very low cost. Amazon Nova Canvas is an image generation model that creates professional grade images with rich customization controls. Amazon Nova Reel is a video generation model offering high-quality outputs, customization, and motion control. Our models were built responsibly and with a commitment to customer trust, security, and reliability. We report benchmarking results for core capabilities, agentic performance, long context, functional adaptation, runtime performance, and human evaluation.

Recent Publications

Prahaladh Chandrahasan, Jiahe Jin, Zhihan Zhang, Tevin Wang, Andy Tang, Lucy Mo, Morteza Ziyadi, Leonardo FR Ribeiro, Zimeng Qiu, Markus Dreyer, others

January 2025 arXiv preprint arXiv:2507.05495

Deep Research Comparator: A Platform for Fine-Grained Human Annotations of Deep Research Agents

Effectively evaluating deep research agents that autonomously search the web, analyze information, and generate reports remains a major challenge, particularly when it comes to assessing long reports and giving detailed feedback on their intermediate steps. To address these gaps, we introduce Deep Research Comparator, a platform that offers a holistic framework for deep research agent hosting, side-by-side comparison, fine-grained human feedback collection, and ranking calculation. Given a user query, our platform displays the final reports from two different agents along with their intermediate steps during generation. Annotators can evaluate the overall quality of final reports based on side-by-side comparison, and also provide detailed feedback separately by assessing intermediate steps or specific text spans within the final report. Furthermore, we develop Simple Deepresearch, an end-to-end agent scaffold. This scaffold serves as a baseline that facilitates the easy integration of various large language models to transform them into deep research agents for evaluation.

Max Glockner, Xiang Jiang, Leonardo FR Ribeiro, Iryna Gurevych, Markus Dreyer

January 2025 Findings of the Association for Computational Linguistics: ACL 2025

NeoQA: Evidence-Based Question Answering with Generated News Events

Evaluating Retrieval-Augmented Generation (RAG) in large language models (LLMs) is challenging because benchmarks can quickly become stale. Questions initially requiring retrieval may become answerable from pretraining knowledge as newer models incorporate more recent information during pretraining, making it difficult to distinguish evidence-based reasoning from recall. We introduce NeoQA (News Events for Out-of-training Question Answering), a benchmark designed to address this issue. To construct NeoQA, we generated timelines and knowledge bases of fictional news events and entities along with news articles and Q&A pairs to prevent LLMs from leveraging pretraining knowledge, ensuring that no prior evidence exists in their training data. We propose our dataset as a new platform for evaluating evidence-based question answering, as it requires LLMs to generate responses exclusively from retrieved evidence and only when sufficient evidence is available. NeoQA enables controlled evaluation across various evidence scenarios, including cases with missing or misleading details. Our findings indicate that LLMs struggle to distinguish subtle mismatches between questions and evidence, and suffer from short-cut reasoning when key information required to answer a question is missing from the evidence, underscoring key limitations in evidence-based reasoning.

Aashiq Muhamed, Leonardo FR Ribeiro, Markus Dreyer, Virginia Smith, Mona T Diab

January 2025 arXiv preprint arXiv:2510.10390

RefusalBench: Generative Evaluation of Selective Refusal in Grounded Language Models

The ability of language models in RAG systems to selectively refuse to answer based on flawed context is critical for safety, yet remains a significant failure point. Our large-scale study reveals that even frontier models struggle in this setting, with refusal accuracy dropping below 50% on multi-document tasks, while exhibiting either dangerous overconfidence or overcaution. Static benchmarks fail to reliably evaluate this capability, as models exploit dataset-specific artifacts and memorize test instances. We introduce RefusalBench, a generative methodology that programmatically creates diagnostic test cases through controlled linguistic perturbation. Our framework employs 176 distinct perturbation strategies across six categories of informational uncertainty and three intensity levels. Evaluation of over 30 models uncovers systematic failure patterns: refusal comprises separable detection and categorization skills, and neither scale nor extended reasoning improves performance. We find that selective refusal is a trainable, alignment-sensitive capability, offering a clear path for improvement. We release two benchmarks – RefusalBench-NQ (single document) and RefusalBench-GaRAGe (multi-document) – and our complete generation framework to enable continued, dynamic evaluation of this critical capability.

Xiang Jiang, Markus Dreyer

January 2024 Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers)

CCSum: A Large-Scale and High-Quality Dataset for Abstractive News Summarization

Training a supervised news summarization model requires large amounts of high-quality training data consisting of news articles paired with reference summaries. However, obtaining such data is costly, and existing datasets contain a considerable amount of noise. We present a new large-scale and high-quality dataset for supervised abstractive news summarization containing 1.3 million training samples, which we call CCSum. In creating this dataset, we take advantage of the journalistic inverted-pyramid style in news writing: In some articles, the first sentence can be considered a summary of the reported story. Accordingly, among 35 million CommonCrawl News articles, we identify pairs of articles about the same news story and use one article’s first sentence as the summary for the other article. To ensure high quality, we apply strict filters whose parameters we optimize using Bayesian optimization. We show that the resulting dataset is more factual and informative than established summarization datasets; less than 1% of the summaries have major factual inconsistencies with the corresponding news articles, compared to 5.5% to 15.4% in existing datasets, according to our human evaluation. Summarization models trained on our dataset are more favored compared to those trained on CNN/Daily Mail. The proposed dataset can open new opportunities for future research in abstractive summarization.

Patents

Markus Dreyer, Can Liu, Sujith Ravi (2025). Meaning Summarization Techniques. US Patent 12,353,463.

Boya Yu, Avani Deshpande, Adrian Mark McLeod, Naga Sai Likhitha Patha, Markus Dreyer (2021). Word Embeddings for Natural Language Processing. US Patent 11,030,999.

William Clinton Dabney, Arpit Gupta, Faisal Ladhak, Markus Dreyer, Anjishnu Kumar (2020). Voice User Interface Knowledge Acquisition System. US Patent 10,755,177.

Markus Dreyer, Pavankumar Reddy Muddireddy, Anjishnu Kumar (2019). Extendable Label Recognition of Linguistic Input. US Patent 10,170,107.

Faisal Ladhak, Ankur Gandhe, Markus Dreyer, Ariya Rastrow, Björn Hoffmeister, Lambert Mathias (2019). Lattice Decoding and Result Confirmation Using Recurrent Neural Networks. US Patent 10,210,862.

Faisal Ladhak, Ankur Gandhe, Markus Dreyer, Ariya Rastrow, Björn Hoffmeister, Lambert Mathias (2019). Lattice Encoding Using Recurrent Neural Networks. US Patent 10,176,802.

Daniel Marcu, Markus Dreyer (2019). Method and System for Automatic Management of Reputation of Translators. US Patent 10,261,994.

Daniel Marcu, Markus Dreyer (2019). Method and System for Automatic Management of Reputation of Translators. US Patent 10,402,498.

Anjishnu Kumar, Markus Dreyer (2018). Neural Latent Variable Model for Spoken Language Understanding. US Patent 9,911,413.

Ariya Rastrow, Nikko Ström, Spyridon Matsoukas, Markus Dreyer, Ankur Gandhe, Denis Sergeyevich Filimonov, Julian Chan, Rohit Prasad (2018). Speech Processing with Learned Representation of User Interaction History. US Patent 10,032,463.
