# MSMARCO-MRC-Analysis
**Repository Path**: yukyin/MSMARCO-MRC-Analysis
## Basic Information
- **Project Name**: MSMARCO-MRC-Analysis
- **Description**: Analysis on the MS-MARCO leaderboard regarding the machine reading comprehension task.
- **Primary Language**: Unknown
- **License**: MIT
- **Default Branch**: master
- **Homepage**: None
- **GVP Project**: No
## Statistics
- **Stars**: 0
- **Forks**: 0
- **Created**: 2020-11-07
- **Last Updated**: 2020-12-19
## Categories & Tags
**Categories**: Uncategorized
**Tags**: None
## README
# Analysis on MS MARCO leaderboard
Analysis on the MS-MARCO [leaderboard](http://www.msmarco.org/leaders.aspx), including V1 and V2, regarding the machine reading comprehension task.
Contributed by Yuqiang Xie, Luxi Xing and Wei Peng, *National Engineering Laboratory for Information Security Technologies, IIE, CAS*.
## Introduction
[MS MARCO](https://arxiv.org/pdf/1611.09268.pdf) (Microsoft Machine Reading Comprehension) is a large scale dataset focused on machine reading comprehension, question answering, and passage ranking. The current dataset has 1,010,916 unique real queries that were generated by sampling and anonymizing Bing usage logs. For more details, see the project: [MS MARCO V2](https://github.com/dfcf93/MSMARCOV2).
### Compare to other MRC datasets
| **Dataset** | **Segment** | **Query Source** | **Answer** | **Queries** | **Document** |
| :-------------- | :---------- | :-------------- | :----------------- | :---------- | :----------------------- |
| MCTest | No | Crowd-sourced | Multi-choices | 2.6K | 660 |
| RACE | No | Crowd-sourced | Multi-choices | 97K | 28K |
| ARC | No | Generated | Multi-choices | 8K | 14M sentences |
| WikiQA | No | User logs | Sentence selection | 3K | 29.26K sentences |
| CNN/Daily Mail | No | Close | Fill in entity | 1.4M | 93K CNN, 220K DM |
| ReCoRD | No | Close | Fill in entity | 12K | 12K |
| Children’s Book | No | Close | Fill in the word | 688K | 688K contexts, 108 books |
| SQuAD | No | Crowd-sourced | Span | 100K | 536 |
| SQuAD2.0 | No | Crowd-sourced | No answer/Span | 150K | 536 |
| NewsQA | No | Crowd-sourced | Span | 100K | 10K |
| SearchQA | No | Generated | Span | 140K | 6.9M passages |
| ***HotpotQA*** | No | Crowd-sourced | Yes/No/Span | 113K | 5M wiki paragraphs |
| DuReader | No | Crowd-sourced | Human generated | 200K | 1M |
| NarrativeQA | No | Crowd-sourced | Human generated | 47K | 1572 stories |
| ***CoQA*** | No | Crowd-sourced | Human generated | 127K | 8000 conversations |
| ***QuAC*** | No | Crowd-sourced | Human generated | 100K | Daffy Duck’s Wiki page |
| MS MARCO V1 | **Yes** | User logs | Human generated | 100K | 1M passages, 200K docs. |
| **MS MARCO V2** | **Yes** | User logs | Human generated | **1M** | 8.8M passages, 3.2M docs. |
Each dimension for datasets is hard to compare in a table. It is also hard to contain all MRC datasets in a table due to the fast development of NLP.
For example, the conversation in **CoQA** is about text passages from seven diverse domains. Besides, **HotpotQA** has the following key features: (1) the questions require finding and reasoning over multiple supporting documents to answer; (2) the questions are diverse and not constrained to any pre-existing knowledge bases or knowledge schemas; (3) dataset give sentence-level supporting facts required for reasoning, allowing QA systems to reason with strong supervision and explain the predictions.
Moreover, CoQA and QuAC belong to conversational MRC (a sequence of QA pairs is similar to a dialog). And unlike CoQA, students in **QuAC** do not know the answers to their questions prior to asking them, which lessens the role of string matching and simple paraphrasing in answering their questions.
In all, compared with other MRC datasets, **MS MARCO V2** owns its advantage. In summary, it is a large-scale, open-domain, human-generated multi-document MRC dataset.
**Tips:** Segment stands for the main classification for the answer.
### Advantages
**MS MARCO** is a benchmark dataset for multi-passage MRC or so-called "generative MRC". Besides, different from other MRC datasets, MS MARCO has the following advantages (reference [MS MARCO V2](https://github.com/dfcf93/MSMARCOV2)):
1. Real questions: All questions have been sample from real anonymized bing queries.
2. Real Documents: Most Url's that we have source the passages from contain the full web documents (abstract). These can be used as extra contextual information to improve systems or be used to compete in our expert task.
3. Human Generated Answers: All questions have an answer written by a human (not just from a passage span). If there was no answer in the passages the judge read they have written 'No Answer Present.'
4. Human Generated Well-Formed: Some questions contain extra human evaluation to create well formed answers that could be used by intelligent agents like Cortana, Siri, Google Assistant, and Alexa. [v2]
5. Dataset Size: At over 1 million queries the dataset is large enough to train the most complex systems and also sample the data for specific applications. [v2]
In this repository, submissions to MS MARCO will be almost listed. This work will be updated persistently.
**Tips:** The official evaluation metrics include [ROUGE-L](http://aclweb.org/anthology/W04-1013) and [BLEU-1](http://www.anthology.aclweb.org/P/P02/P02-1040.pdf).
## V1 Leaderboard - Model List
Task Definition: First predict whether a question can be answered based only on the information contained in the provided passages. If the question cannot be answered, then the system should return “No Answer Present” as response. If the question can be answered, then the system should generate the correct answer.
|Rank|Model| Org. | Rouge-L | Bleu-1 | Note |
|:---:|:----|:-------|:-----:|:-----:|:-----:|
|1|[MARS](#MARS)| YUANFUDAO research NLP | 49.72| 48.02 ||
|2|Human Performance| - |47.00| 46.00 ||
|3|[V-Net](#V-Net)| Baidu NLP | 46.15 | 44.46 |[note](https://indexfziq.github.io/2019/03/08/VNET/)|
|4|[S-Net](#S-Net)| Microsoft AI and Research |45.23| 43.78|[note](https://indexfziq.github.io/2019/01/17/SNET/)|
|5|[R-Net](#R-Net)| Microsoft AI and Research |42.89| 42.22|
|6|[ReasoNet](#ReasoNet)| Microsoft AI and Research |38.81| 39.86|
|7|[Prediction](#Prediction)| Singapore Management University |37.67| 33.93|
|8|[FastQA_Ext](#FastQA_Ext)| DFKI German Research Center for AI |33.67| 33.93|
## V2 Leaderboard - Model List
For this version, we focus on *Q&A + Natural Language Generation*, where human performance is 63.21 on Rouge-L.
Task Definition: The generated answer should be well-formed. The answer should make sense even without the context of the question and retrieved passages.
|Rank| Model | Org. | Rouge-L | Bleu-1 | Note |
| :---: | :--- | :--- | :-----: | :---: | :--: |
|1| Human Performance | - | 63.21 | 53.03 | |
|2| [Masque NLGEN Style](#Masque) | NTT Media Intelligence Laboratories | 49.61 | 50.13 | [note](https://indexfziq.github.io/2019/04/06/masque/) |
|3| [V-Net](#V-Net) | Baidu NLP | 48.37 | 46.75 |[note](https://indexfziq.github.io/2019/03/08/VNET/)|
|4| Selector+Combine-Content-Generator | Shengjie Qian of Caiyun xiaoyi AI and BUPT | 47.39 | 45.26 |
|5| BERT+ Multi-Pointer-Generator | Tongjun Li of the ColorfulClouds Tech and BUPT| 47.37 |45.09 |
|6| SNET + CES2S | SYSU | 45.04 | 40.62 |
|7| Reader-Writer | Microsoft Business Applications Group AI Research | 43.89 | 42.59 |
|8| [ConZNet](#ConZNet) | Samsung Research | 42.14 | 38.62 | [note](https://blog.csdn.net/ganxiwu9686/article/details/88673819) |
**Tips:** What is the meaning of "well-formed"? For some question-answer pairs, the data also contains one or more answers that are generated by a post-hoc review-and-rewrite process, which is the process of "well-formed". This process involves a separate editor reviewing the provided answer and rewriting it if:
1. It does not have proper grammar;
2. There is a high overlap in the answer and one of the provided passages (indicating that the original editor may have copied the passage directly);
3. The answer can not be understood without the question and the passage context.
## Description of Models
### MARS
**Multi-Attention ReaderS Network.** *Jingming Liu.* [ [video](https://v.qq.com/x/page/k06284mr0hk.html) ]
* **Motivation**
* Transfer learning tasks like CoVe and ELMo store more generalized information in the lower layer (encoder).
* The answers from each site are quite different and have their own features.
* Multi-task learning may bring an improvement for the MRC task.
* The score for multi-passage needs more factors, not just the simple counting.
* Through training, the train set is large and the batch is small, which leads to the distribution between real samples and each batch inconsistent. The output of model is related to the order of the batch.
* **Contribution**
* Enhance Glove with CoVe, ELMo, POS, NER and word-match features, word dropout is 5%.
*[ nearly 2.0 improvement ]*
* Embed each site to a representation and combine it with the passage representation in the prediction layer.
*[ nearly 0.3-0.5 improvement ]*
* Apply multi-task learning into MRC.
* Main task:
* Golden span (the highest rouge(>0.8) span in every passage).
* Auxiliary tasks: *[ weight: 0.1~0.2 ]*
* If a word is in the answer;
* If a passage contains a golden span;
* If a sentence contains a golden span.
* The score of answer contains span score and vote score.
* Introduce EMA (Exponential Moving Average) into the training for weakening the effect of batch order.
*[ nearly 0.3 improvement ]*
* **Overview**