The UMass Amherst Linguistics Sentiment Corpora consist of n-gram counts extracted from over 700,000 online product reviews in Chinese, English, German, and Japanese. The files are UTF-8 encoded text. They are formatted to be read in as R data frames, but they can easily be manipulated with other tools. We are releasing them under a Creative Commons Attribution-Noncommercial-Share Alike license. Read on to find out more about their size and composition.
All the corpus files have the form of the following snippet:
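| | Token | Rating | TokenCount | RatingWideCount |
|---|---|---|---|---|
| 1000 | break | 5 | 18 | 114305 |
| 1001 | breakfast | 1 | 4 | 14794 |
| 1002 | breakfast | 2 | 14 | 20908 |
| 1003 | breakfast | 3 | 26 | 26266 |
| 1004 | breakfast | 4 | 66 | 80597 |
| 1005 | breakfast | 5 | 42 | 114305 |
| 1006 | bright | 1 | 0 | 14794 |
| 1007 | bright | 2 | 0 | 20908 |
| 1008 | bright | 3 | 0 | 26266 |
| 1009 | bright | 4 | 10 | 80597 |
| 1010 | bright | 5 | 1 | 114305 |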
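With this format, it is easy to get, for example, the relative frequency of a word W in a particular rating category R: divide the TokenCount of W for R by the RatingWideCount for R. In the snippet, for instance, the relative frequency of "breakfast" in rating category 4 is 66/80597, or about 0.00082.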
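As a rough illustration of that calculation, the following Python sketch parses rows in the layout of the snippet and computes relative frequencies. The names `parse_counts` and `relative_frequency` are made up for this example, and the parsing assumes whitespace-separated fields with a leading row number, as in the snippet; the actual files may call for a different reader.

```python
SNIPPET = """\
Token Rating TokenCount RatingWideCount
1000 break 5 18 114305
1001 breakfast 1 4 14794
1002 breakfast 2 14 20908
1003 breakfast 3 26 26266
1004 breakfast 4 66 80597
1005 breakfast 5 42 114305
1006 bright 1 0 14794
1007 bright 2 0 20908
1008 bright 3 0 26266
1009 bright 4 10 80597
1010 bright 5 1 114305
"""


def parse_counts(text):
    """Map (token, rating) to (TokenCount, RatingWideCount)."""
    counts = {}
    for line in text.strip().splitlines()[1:]:  # skip the header line
        fields = line.split()
        # Assumption: the first field is a row number and the last three are
        # Rating, TokenCount, RatingWideCount; anything in between is the
        # token, which may contain spaces for bigrams and trigrams.
        token = " ".join(fields[1:-3])
        rating, token_count, rating_wide = (int(f) for f in fields[-3:])
        counts[(token, rating)] = (token_count, rating_wide)
    return counts


def relative_frequency(counts, token, rating):
    """TokenCount of the token in a rating category over RatingWideCount."""
    token_count, rating_wide = counts[(token, rating)]
    return token_count / rating_wide


counts = parse_counts(SNIPPET)
for rating in range(1, 6):
    print(rating, relative_frequency(counts, "breakfast", rating))
```

Normalizing by RatingWideCount matters because the rating categories differ widely in size, so raw counts are not comparable across categories.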
To keep the corpus files to a manageable size, and to avoid drawing conclusions from too few data points, we removed n-grams that had too few tokens, according to the following thresholds:
| Threshold token counts | review | summary |
|---|---|---|
| English Amazon | 100 | 10 |
| English Tripadvisor | 100 | 10 |
| German Amazon | 50 | 10 |
| Japanese Amazon | 50 | 10 |
| Chinese MyPrice | 20 | – |
| Chinese Amazon | 50 | 10 |
The numbers for Chinese are character counts. The others are word counts.
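The text does not spell out exactly how a threshold is applied. One reading that is consistent with the snippet shown earlier is that an n-gram is kept when its TokenCount, summed over the five rating categories, reaches the threshold. The Python sketch below implements that reading only; the function name `apply_threshold` and the use of `>=` rather than `>` are assumptions, and the rows are copied from the snippet.

```python
from collections import defaultdict

# (token, rating, token_count) rows copied from the snippet shown earlier.
ROWS = [
    ("breakfast", 1, 4), ("breakfast", 2, 14), ("breakfast", 3, 26),
    ("breakfast", 4, 66), ("breakfast", 5, 42),
    ("bright", 1, 0), ("bright", 2, 0), ("bright", 3, 0),
    ("bright", 4, 10), ("bright", 5, 1),
]


def apply_threshold(rows, threshold):
    """Keep all rows of every token whose summed count meets the threshold.

    Assumption: the threshold is compared against the total over all rating
    categories, not against each category separately.
    """
    totals = defaultdict(int)
    for token, _rating, count in rows:
        totals[token] += count
    return [row for row in rows if totals[row[0]] >= threshold]


kept_10 = {token for token, _, _ in apply_threshold(ROWS, 10)}
kept_100 = {token for token, _, _ in apply_threshold(ROWS, 100)}
print(sorted(kept_10))   # tokens surviving a threshold of 10
print(sorted(kept_100))  # tokens surviving a threshold of 100
```

Under this reading, a token such as "bright" can still have zero counts in some rating categories, as it does in the snippet, because only its total is checked.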
The Japanese tokenization was done using MeCab. The Chinese tokenization procedure just inserted spaces between all characters. English and German tokenization was done as follows:
Reviews of a wide variety of products. This corpus is large enough that we have included files for unigrams, bigrams, and trigrams. 203,554 authors. Median number of reviews per author: 1.
| Review | 1 star | 2 star | 3 star | 4 star | 5 star | total |
|---|---|---|---|---|---|---|
| reviews | 29,642 | 32,602 | 100,272 | 160,817 | 204,461 | 527,794 |
| characters | 4,625,180 | 4,242,812 | 11,649,647 | 19,125,499 | 29,729,569 | 69,372,707 |

| Summary | 1 star | 2 star | 3 star | 4 star | 5 star | total |
|---|---|---|---|---|---|---|
| reviews | 29,669 | 32,642 | 100,437 | 161,050 | 204,670 | 528,468 |
| characters | 611,769 | 606,104 | 1,775,852 | 2,613,873 | 3,626,706 | 9,234,304 |
Electronics reviews. The number of authors is not known. Each rating is an average over several five-star rating categories ('value for the price', 'service and support', 'quality and reliability', 'features'), not all of which appear consistently with all reviews, probably due to a site update at some point. To calculate the overall rating of a review, we averaged all the scores that were entered for that review and rounded to the nearest integer (reviews without any ratings were ignored).
| Chinese MyPrice.com.cn | 1 star | 2 star | 3 star | 4 star | 5 star | total |
|---|---|---|---|---|---|---|
| reviews | 2,115 | 3,042 | 8,007 | 2,055 | 2,294 | 17,513 |
| characters | 73,798 | 111,659 | 236,184 | 65,264 | 56,847 | 543,752 |
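The overall-rating calculation described for the MyPrice reviews can be sketched in Python as follows. This is an illustration, not the code used to build the corpus: the function name `overall_rating`, the use of `None` for a category that was not filled in, and the sample review are all hypothetical, and the text does not say how ties such as 3.5 were rounded, so rounding halves up is an assumption.

```python
import math


def overall_rating(category_scores):
    """Average the scores that were entered and round to the nearest integer.

    Returns None for a review with no scores at all; such reviews were
    ignored. Assumption: a mean ending in .5 is rounded up.
    """
    entered = [s for s in category_scores.values() if s is not None]
    if not entered:
        return None
    mean = sum(entered) / len(entered)
    return math.floor(mean + 0.5)


# Hypothetical review in which one category was left blank.
review = {
    "value for the price": 4,
    "service and support": None,
    "quality and reliability": 5,
    "features": 3,
}
print(overall_rating(review))
```

`math.floor(mean + 0.5)` is used instead of Python's built-in `round`, which rounds halves to the nearest even integer and would map both 2.5 and 1.5 to 2.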
Book reviews. 40,625 authors. Median number of reviews per author: 1.
| Summary | 1 star | 2 star | 3 star | 4 star | 5 star | total |
|---|---|---|---|---|---|---|
| reviews | 3,322 | 2,684 | 3,993 | 8,598 | 34,946 | 53,543 |
| words | 16,830 | 13,518 | 20,779 | 43,607 | 182,377 | 277,111 |
| vocab | 3,434 | 3,019 | 3,785 | 6,025 | 11,482 | 15,930 |

| Review | 1 star | 2 star | 3 star | 4 star | 5 star | total |
|---|---|---|---|---|---|---|
| reviews | 3,323 | 2,687 | 3,994 | 8,601 | 34,952 | 53,557 |
| words | 570,687 | 512,643 | 767,958 | 1,513,776 | 4,769,921 | 8,134,985 |
| vocab | 27,352 | 26,239 | 32,818 | 46,306 | 80,569 | 112,323 |
Hotel reviews for hotels in American cities. 35,713 authors. Median number of reviews per author: 1.
| Summary | 1 star | 2 star | 3 star | 4 star | 5 star | total |
|---|---|---|---|---|---|---|
| reviews | 2,989 | 4,300 | 5,410 | 17,950 | 25,200 | 55,849 |
| words | 14,794 | 20,908 | 26,266 | 80,597 | 114,305 | 256,870 |
| vocab | 2,417 | 3,052 | 3,134 | 5,272 | 5,651 | 10,819 |

| Review | 1 star | 2 star | 3 star | 4 star | 5 star | total |
|---|---|---|---|---|---|---|
| reviews | 2,896 | 4,130 | 4,948 | 15,801 | 22,450 | 50,225 |
| words | 605,207 | 877,854 | 974,271 | 2,726,796 | 3,577,764 | 8,761,892 |
| vocab | 19,128 | 23,534 | 24,761 | 42,776 | 49,492 | 85,425 |
Book, music, movie, and electronics reviews. 16,623 authors. Median number of reviews per author: 1.
| Summary | 1 star | 2 star | 3 star | 4 star | 5 star | total |
|---|---|---|---|---|---|---|
| reviews | 2,984 | 1,880 | 2,646 | 4,427 | 15,774 | 27,711 |
| words | 12,756 | 8,135 | 12,080 | 20,841 | 73,890 | 127,702 |
| vocab | 3,024 | 2,378 | 3,038 | 4,806 | 9,624 | 14,975 |

| Review | 1 star | 2 star | 3 star | 4 star | 5 star | total |
|---|---|---|---|---|---|---|
| reviews | 2,987 | 1,881 | 2,647 | 4,431 | 15,784 | 27,730 |
| words | 407,888 | 319,341 | 467,556 | 788,915 | 2,205,666 | 4,189,366 |
| vocab | 35,177 | 30,835 | 37,644 | 53,539 | 95,453 | 144,418 |
Book, music, movie, and electronics reviews. 12,747 authors. Median number of reviews per author: 1. The morphological analysis was done using MeCab. MeCab decomposes the agglutinative verbal morphology, so the word counts are somewhat inflated.
| Summary | 1 star | 2 star | 3 star | 4 star | 5 star | total |
|---|---|---|---|---|---|---|
| reviews | 971 | 759 | 1,609 | 3,504 | 11,031 | 17,874 |
| words | 5,579 | 4,733 | 10,589 | 23,363 | 72,635 | 116,899 |
| vocab | 1,525 | 1,489 | 2,580 | 4,601 | 8,531 | 11,223 |

| Review | 1 star | 2 star | 3 star | 4 star | 5 star | total |
|---|---|---|---|---|---|---|
| reviews | 971 | 759 | 1,609 | 3,504 | 11,031 | 17,874 |
| words | 127,049 | 123,312 | 277,857 | 636,067 | 1,805,764 | 2,970,049 |
| vocab | 9,574 | 9,909 | 16,247 | 24,902 | 39,948 | 49,054 |
Constant, Noah, Christopher Davis, Christopher Potts, and Florian Schwarz. 2008. The pragmatics of expressive content: Evidence from large corpora. To appear in Sprache und Datenverarbeitung.
Potts, Christopher and Florian Schwarz. 2008. Exclamatives and heightened emotion: Extracting pragmatic generalizations from large corpora. Ms., UMass Amherst.
This material is based upon work supported by the National Science Foundation under Grant No. BCS-0642752. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the National Science Foundation.
This work is licensed under a Creative Commons Attribution-Noncommercial-Share Alike 3.0 License. Last update: 2009-01-03