UMass Amherst Linguistics Sentiment Corpora

Noah Constant, Christopher Davis, Christopher Potts, and Florian Schwarz

The UMass Amherst Linguistics Sentiment Corpora consist of n-gram counts extracted from over 700,000 online product reviews in Chinese, English, German, and Japanese. The files are UTF-8 encoded text. They are formatted to be read in as R data frames, but they can easily be manipulated with other tools. We are releasing them under a Creative Commons Attribution-Noncommercial-Share Alike 3.0 license. Read on to find out more about their size and composition.

Format

All the corpus files have the form of the following snippet:

	         Token Rating TokenCount RatingWideCount
	1000     break      5         18          114305
	1001 breakfast      1          4           14794
	1002 breakfast      2         14           20908
	1003 breakfast      3         26           26266
	1004 breakfast      4         66           80597
	1005 breakfast      5         42          114305
	1006    bright      1          0           14794
	1007    bright      2          0           20908
	1008    bright      3          0           26266
	1009    bright      4         10           80597
	1010    bright      5          1          114305

With this format, it is easy to compute, for example, the relative frequency of a word W in a particular rating category R: divide the TokenCount of W for R by the RatingWideCount for R. In the snippet above, the relative frequency of "breakfast" in rating category 4 is 66 / 80,597.
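As a rough illustration for readers not working in R, the calculation can be sketched in Python. This is only a sketch under stated assumptions: it assumes the files are whitespace-separated with a leading row index, exactly as in the snippet above, and that tokens contain no spaces (so it would need adjusting for bigram and trigram files). The function names parse_rows and relative_frequency are hypothetical and are not part of the corpus distribution.

```python
# Sketch: parse rows laid out like the format snippet and compute a
# relative frequency. Assumes whitespace-separated fields, a leading
# row index with no header entry, and single-word tokens.

SNIPPET = """\
         Token Rating TokenCount RatingWideCount
1001 breakfast      1          4           14794
1002 breakfast      2         14           20908
1003 breakfast      3         26           26266
1004 breakfast      4         66           80597
1005 breakfast      5         42          114305
"""

INT_COLUMNS = ("Rating", "TokenCount", "RatingWideCount")


def parse_rows(text):
    """Return one dict per data row, keyed by the header names."""
    lines = [ln for ln in text.splitlines() if ln.strip()]
    header = lines[0].split()
    rows = []
    for line in lines[1:]:
        fields = line.split()
        # Drop the leading row index, which has no header entry.
        values = fields[-len(header):]
        row = dict(zip(header, values))
        for name in INT_COLUMNS:
            row[name] = int(row[name])
        rows.append(row)
    return rows


def relative_frequency(rows, token, rating):
    """TokenCount of `token` in `rating` divided by RatingWideCount."""
    for row in rows:
        if row["Token"] == token and row["Rating"] == rating:
            return row["TokenCount"] / row["RatingWideCount"]
    raise KeyError((token, rating))


rows = parse_rows(SNIPPET)
freq = relative_frequency(rows, "breakfast", 4)
print(f"{freq:.6f}")
```

Dividing by RatingWideCount rather than comparing raw counts matters because the rating categories differ greatly in size; the 5-star category in the snippet is almost eight times larger than the 1-star category.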

In order to keep the corpus files to a manageable size, and to avoid making conclusions based on too few data points, we removed n-grams that had too few tokens, according to the following thresholds:

Threshold token counts

	                     review  summary
	English Amazon          100       10
	English Tripadvisor     100       10
	German Amazon            50       10
	Japanese Amazon          50       10
	Chinese MyPrice          20
	Chinese Amazon           50       10

The numbers for Chinese are character counts. The others are word counts.
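The filtering step can be sketched as follows. The sketch assumes that the threshold applies to an n-gram's total count summed over all five rating categories and that the cutoff is inclusive; neither detail is stated above, although the "bright" rows in the format snippet (11 tokens in total) would be consistent with a summary threshold of 10 applied this way. The function name filter_by_threshold and the "rare" entries are made up for illustration.

```python
from collections import defaultdict


def filter_by_threshold(counts, threshold):
    """Keep n-grams whose total count across ratings is >= threshold.

    `counts` maps (ngram, rating) pairs to token counts. The inclusive
    cutoff and the summing over ratings are assumptions.
    """
    totals = defaultdict(int)
    for (ngram, _rating), n in counts.items():
        totals[ngram] += n
    return {key: n for key, n in counts.items() if totals[key[0]] >= threshold}


# The "bright" counts come from the format snippet; "rare" is invented.
counts = {
    ("bright", 1): 0, ("bright", 2): 0, ("bright", 3): 0,
    ("bright", 4): 10, ("bright", 5): 1,
    ("rare", 1): 2, ("rare", 5): 3,
}
kept = filter_by_threshold(counts, threshold=10)
print(sorted({ngram for ngram, _ in kept}))
```

Note that rows with a count of 0 are retained for any n-gram that passes the threshold, which matches the zero rows visible in the snippet.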

Download

Corpora details

The Japanese tokenization was done using MeCab. The Chinese tokenization procedure just inserted spaces between all characters. English and German tokenization was done as follows:

Chinese Amazon

Reviews of a wide variety of products. This corpus is large enough that we have included files for unigrams, bigrams, and trigrams. 203,554 authors. Median number of reviews per author: 1.

Review
	                1 star      2 star      3 star      4 star      5 star       total
	reviews         29,642      32,602     100,272     160,817     204,461     527,794
	characters   4,625,180   4,242,812  11,649,647  19,125,499  29,729,569  69,372,707

Summary
	                1 star      2 star      3 star      4 star      5 star       total
	reviews         29,669      32,642     100,437     161,050     204,670     528,468
	characters     611,769     606,104   1,775,852   2,613,873   3,626,706   9,234,304
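For illustration, the character-level tokenization described for Chinese and the unigram, bigram, and trigram files mentioned for this corpus can be sketched as below. This is a hedged sketch, not the procedure actually used to build the corpus: it assumes that whitespace is discarded and that every other character, punctuation included, counts as a token. The function names and the example string are hypothetical.

```python
from collections import Counter


def tokenize_chars(text):
    """Treat every non-whitespace character as one token (an assumption)."""
    return [ch for ch in text if not ch.isspace()]


def ngrams(tokens, n):
    """Return the space-joined n-grams of a token list, in order."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


tokens = tokenize_chars("质量很好")
unigram_counts = Counter(ngrams(tokens, 1))
bigram_counts = Counter(ngrams(tokens, 2))
trigram_counts = Counter(ngrams(tokens, 3))

print(" ".join(tokens))
print(list(bigram_counts))
```

Joining the characters with spaces mirrors the description of inserting spaces between all characters, so the same n-gram code would work unchanged on word-tokenized text.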

Chinese MyPrice

Electronics reviews. The number of authors is not known. Each rating is an average over several five-star rating categories ('value for the price', 'service and support', 'quality and reliability', 'features'), not all of which appear with every review, probably because of a site update at some point. To calculate the overall rating of a review, we averaged all the scores that were entered for that review and rounded to the nearest integer; reviews without any ratings were ignored.
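The averaging rule can be sketched as follows. How exact halves (for example 4.5) were rounded is not stated, so the sketch assumes they round up; the function name overall_rating and the use of None for a missing category score are illustrative choices only.

```python
import math


def overall_rating(scores):
    """Average the category scores that were entered and round.

    `scores` maps category names to a 1-5 score, or None when the
    category was not rated. Returns None when nothing was entered,
    so that the review can be skipped.
    """
    present = [s for s in scores.values() if s is not None]
    if not present:
        return None
    mean = sum(present) / len(present)
    # Assumption: halves round up. Python's built-in round() would
    # instead round halves to the nearest even integer.
    return math.floor(mean + 0.5)


review = {
    "value for the price": 4,
    "service and support": None,   # category missing for this review
    "quality and reliability": 5,
    "features": 3,
}
print(overall_rating(review))
```

A review with only two entered scores of 4 and 5 has a mean of 4.5 and would be assigned 5 under the round-half-up assumption, whereas the built-in round() would give 4.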

Chinese MyPrice.com.cn
	                1 star      2 star      3 star      4 star      5 star       total
	reviews          2,115       3,042       8,007       2,055       2,294      17,513
	characters      73,798     111,659     236,184      65,264      56,847     543,752

English Amazon

Book reviews. 40,625 authors. Median number of reviews per author: 1.

Summary
	                1 star      2 star      3 star      4 star      5 star       total
	reviews          3,322       2,684       3,993       8,598      34,946      53,543
	words           16,830      13,518      20,779      43,607     182,377     277,111
	vocab            3,434       3,019       3,785       6,025      11,482      15,930

Review
	                1 star      2 star      3 star      4 star      5 star       total
	reviews          3,323       2,687       3,994       8,601      34,952      53,557
	words          570,687     512,643     767,958   1,513,776   4,769,921   8,134,985
	vocab           27,352      26,239      32,818      46,306      80,569     112,323

English Tripadvisor.com

Hotel reviews for hotels in American cities. 35,713 authors. Median number of reviews per author: 1.

Summary
	                1 star      2 star      3 star      4 star      5 star       total
	reviews          2,989       4,300       5,410      17,950      25,200      55,849
	words           14,794      20,908      26,266      80,597     114,305     256,870
	vocab            2,417       3,052       3,134       5,272       5,651      10,819

Review
	                1 star      2 star      3 star      4 star      5 star       total
	reviews          2,896       4,130       4,948      15,801      22,450      50,225
	words          605,207     877,854     974,271   2,726,796   3,577,764   8,761,892
	vocab           19,128      23,534      24,761      42,776      49,492      85,425

German Amazon

Book, music, movie, and electronics reviews. 16,623 authors. Median number of reviews per author: 1.

Summary
	                1 star      2 star      3 star      4 star      5 star       total
	reviews          2,984       1,880       2,646       4,427      15,774      27,711
	words           12,756       8,135      12,080      20,841      73,890     127,702
	vocab            3,024       2,378       3,038       4,806       9,624      14,975

Review
	                1 star      2 star      3 star      4 star      5 star       total
	reviews          2,987       1,881       2,647       4,431      15,784      27,730
	words          407,888     319,341     467,556     788,915   2,205,666   4,189,366
	vocab           35,177      30,835      37,644      53,539      95,453     144,418

Japanese Amazon

Book, music, movie, and electronics reviews. 12,747 authors. Median number of reviews per author: 1. The morphological analysis was done using MeCab. MeCab decomposes the agglutinative verbal morphology, so the word counts are somewhat inflated.

Summary
	                1 star      2 star      3 star      4 star      5 star       total
	reviews            971         759       1,609       3,504      11,031      17,874
	words            5,579       4,733      10,589      23,363      72,635     116,899
	vocab            1,525       1,489       2,580       4,601       8,531      11,223

Review
	                1 star      2 star      3 star      4 star      5 star       total
	reviews            971         759       1,609       3,504      11,031      17,874
	words          127,049     123,312     277,857     636,067   1,805,764   2,970,049
	vocab            9,574       9,909      16,247      24,902      39,948      49,054

Citation

Constant, Noah, Christopher Davis, Christopher Potts, and Florian Schwarz. 2008. The pragmatics of expressive content: Evidence from large corpora. To appear in Sprache und Datenverarbeitung.

Potts, Christopher and Florian Schwarz. 2008. Exclamatives and heightened emotion: Extracting pragmatic generalizations from large corpora. Ms., UMass Amherst.

Acknowledgements

This material is based upon work supported by the National Science Foundation under Grant No. BCS-0642752. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the National Science Foundation.

Creative Commons License

This work is licensed under a Creative Commons Attribution-Noncommercial-Share Alike 3.0 License. Last update: 2009-01-03