Amazon Books

Sep 13, 2021

1. Categorical data

In this section, we will analyze categorical data and answer the following questions:

  1. Which author's books receive the highest average rating (top authors).
  2. Which author has written the most bestsellers (top authors).
  3. Which genres become bestsellers more often.
  4. Which book has the most reviews (top books).

INSIGHT: By analyzing the categorical data, it is established:

  1. The following authors have the highest average rating: Nathan W. Pyle, Patrick Thorpe, Eric Carle, Emily Winfield Martin, Chip Gaines, Jill Twiss, Rush Limbaugh, Sherri Duskey Rinker, Alice Schertle, Pete Souza, Sarah Young, Lin-Manuel Miranda, Bill Martin Jr., Dav Pilkey. The average rating of their books is 4.9. These authors are worth paying attention to when buying a new book.
  2. Authors who have written the most bestsellers: Jeff Kinney - 12 books, Rick Riordan - 10 books, J.K. Rowling - 8 books, Stephenie Meyer - 7 books, Dav Pilkey - 6 books, Bill O'Reilly - 6 books, John Grisham - 5 books, E L James - 5 books, Suzanne Collins - 5 books, Charlaine Harris - 4 books. Readers of these authors will always find something to read.
  3. Books with the most reviews: Where The Crawdads Sing - 87841 reviews, The Girl On The Train - 79446 reviews, Becoming - 61133 reviews, Gone Girl - 57271 reviews, The Fault In Our Stars - 50482 reviews. Where The Crawdads Sing is definitely worth reading: it is the most talked-about book for a reason.
  4. Non-fiction books become bestsellers more often than fiction. Later we will find out how users rate these genres.
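The four questions above come down to grouping and counting. The following is a minimal sketch using only the Python standard library. The records and the field names (`name`, `author`, `rating`, `reviews`, `genre`) are invented for illustration and are not taken from the real dataset; with pandas, the same aggregations would typically be written with `groupby`.

```python
from collections import Counter, defaultdict
from statistics import mean

# Toy records invented for illustration; field names are assumptions.
books = [
    {"name": "Book A", "author": "Author X", "rating": 4.9, "reviews": 1200, "genre": "Fiction"},
    {"name": "Book B", "author": "Author X", "rating": 4.7, "reviews": 300, "genre": "Fiction"},
    {"name": "Book C", "author": "Author Y", "rating": 4.5, "reviews": 5000, "genre": "Non Fiction"},
    {"name": "Book D", "author": "Author Z", "rating": 4.6, "reviews": 800, "genre": "Non Fiction"},
    {"name": "Book E", "author": "Author Z", "rating": 4.2, "reviews": 150, "genre": "Non Fiction"},
]

# Question 1: average rating per author, highest first.
ratings_by_author = defaultdict(list)
for book in books:
    ratings_by_author[book["author"]].append(book["rating"])
avg_rating = {author: mean(r) for author, r in ratings_by_author.items()}
top_rated = sorted(avg_rating.items(), key=lambda kv: kv[1], reverse=True)

# Question 2: number of bestsellers per author.
bestsellers_per_author = Counter(book["author"] for book in books)

# Question 3: number of bestsellers per genre.
genre_counts = Counter(book["genre"] for book in books)

# Question 4: the book with the most reviews.
most_reviewed = max(books, key=lambda book: book["reviews"])

print(top_rated[0][0], genre_counts.most_common(1), most_reviewed["name"])
```

If the same title appears in the list for several years, it would be counted once per appearance here; whether that is desired depends on how "bestseller" is defined for the analysis.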

2. Numeric data

INSIGHT: By analyzing the numeric data, it is established:

User Rating:

  1. The data is not normally distributed; the distribution is skewed.
  2. The mean and median book ratings are both 4.6.
  3. There are outliers: a small number of books have a rating below 4.1.

Reviews:

  1. The data is not normally distributed; the distribution is skewed.
  2. The data has a wide range.
  3. There are outliers: a small number of books have review counts well above the 75th percentile.

Price:

  1. The data is not normally distributed; the distribution is skewed.
  2. There are books that cost much more than the average, as well as books with a price of 0, which is strange. Either the book is given away for free or this is an error.
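The observations above (skew, wide range, outliers) can be checked with a few summary statistics. The sketch below uses the standard library only. The 1.5 x IQR rule for flagging outliers is an assumption, since the text does not say which rule was used, and the `prices` values are invented for illustration.

```python
from statistics import mean, median, quantiles

def summarize(values):
    """Return basic summary statistics and IQR-based outliers for a list of numbers."""
    q1, _, q3 = quantiles(values, n=4)  # quartiles (default "exclusive" method)
    iqr = q3 - q1
    low, high = q1 - 1.5 * iqr, q3 + 1.5 * iqr  # assumed outlier fences
    return {
        "mean": mean(values),
        "median": median(values),
        "min": min(values),
        "max": max(values),
        "outliers": [v for v in values if v < low or v > high],
    }

# Invented prices: mostly moderate, one free book, one very expensive book.
prices = [0, 8, 9, 10, 11, 12, 13, 15, 105]
summary = summarize(prices)
print(summary)
```

A mean well above the median, as in this toy sample, is one simple sign of a right-skewed distribution.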

INSIGHT: The correlation matrix and the visualizations show no apparent linear relationship, positive or negative, between the rating, the number of reviews and the price of books.
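A correlation matrix of this kind holds the pairwise Pearson coefficient for each pair of numeric columns. Here is a minimal sketch, assuming Pearson correlation was the measure used (the text only says "linear relationship"). The column values are invented and will not reproduce the article's result.

```python
from math import sqrt

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length numeric sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    spread_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (spread_x * spread_y)

# Invented values for the three numeric columns discussed above.
columns = {
    "User Rating": [4.7, 4.6, 4.8, 4.3, 4.9, 4.5],
    "Reviews": [17350, 2052, 18979, 21424, 7665, 12643],
    "Price": [8, 22, 15, 6, 12, 11],
}

# Nested dict: matrix[a][b] is the correlation between columns a and b.
matrix = {a: {b: pearson(columns[a], columns[b]) for b in columns} for a in columns}

for name, row in matrix.items():
    print(name, {other: round(value, 2) for other, value in row.items()})
```

Coefficients close to 0 indicate no linear relationship; values near 1 or -1 indicate a strong positive or negative one.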

3. Testing hypotheses

In this section, we will test the following hypothesis: genres differ in terms of user rating.

We will test the hypothesis in the following steps:

  1. Formulate the null and alternative hypotheses.
  2. Check the distribution for normality using the Shapiro-Wilk test.
  3. Form two samples: books of the Non Fiction genre and books of the Fiction genre.
  4. Test for statistically significant differences between the two groups.

1. Let us formulate the null and alternative hypotheses.

H0 - There are no differences between the ratings of the genres

H1 - There are differences between the ratings of the genres

2. Let's check the distribution for normality using the Shapiro-Wilk test.

3. Let's form two samples for testing.

4. Let's carry out the test. Since the data in the samples is not normally distributed, we will use the nonparametric Mann-Whitney test.
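To illustrate what the Mann-Whitney test computes, here is a hand-rolled sketch using only the standard library. It ranks the pooled values (ties get average ranks), derives the U statistic, and computes a two-sided p-value from the normal approximation with a tie correction and no continuity correction. This is an assumption-laden illustration, not the implementation used for the article; in practice a library routine such as `scipy.stats.mannwhitneyu` would normally be used. The two samples are invented.

```python
from math import erfc, sqrt
from statistics import median

def mann_whitney_u(x, y):
    """Two-sided Mann-Whitney U test (normal approximation, tie-corrected)."""
    n1, n2 = len(x), len(y)
    n = n1 + n2
    # Pool both samples, remembering the group (0 = x, 1 = y) of each value.
    pooled = sorted((v, g) for g, sample in enumerate((x, y)) for v in sample)
    rank_sum_x = 0.0
    tie_term = 0
    i = 0
    while i < n:
        j = i
        while j < n and pooled[j][0] == pooled[i][0]:
            j += 1
        avg_rank = (i + 1 + j) / 2  # average of ranks i+1 .. j for this tie group
        t = j - i
        tie_term += t ** 3 - t
        rank_sum_x += avg_rank * sum(1 for k in range(i, j) if pooled[k][1] == 0)
        i = j
    u = rank_sum_x - n1 * (n1 + 1) / 2
    mu = n1 * n2 / 2
    sigma = sqrt(n1 * n2 / 12 * ((n + 1) - tie_term / (n * (n - 1))))
    if sigma == 0:  # all values identical: no evidence of a difference
        return u, 1.0
    z = (u - mu) / sigma
    return u, erfc(abs(z) / sqrt(2))  # two-sided p-value

# Invented rating samples for the two genres.
fiction = [4.7, 4.8, 4.9, 4.6, 4.8, 4.7, 4.9, 4.8]
non_fiction = [4.3, 4.4, 4.5, 4.2, 4.6, 4.4, 4.5, 4.3]

u_stat, p_value = mann_whitney_u(fiction, non_fiction)
print(u_stat, p_value, median(fiction), median(non_fiction))
```

A p-value below the chosen significance level (commonly 0.05) leads to rejecting H0. The normal approximation is only reasonable for samples that are not too small.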

Let's compare the median values in the groups.

Non_fiction median: 4.6, Fiction median: 4.7

Visualizing the density of distribution in samples.

INSIGHT: The test found statistically significant differences between the compared groups. Based on these results, it can be argued that users rate books differently depending on the genre and, judging by the median values, readers prefer works of fiction.

4. Final conclusions

The analysis established which authors receive the highest ratings from readers, which authors have written the most bestsellers, and which books receive the most reviews. In addition, it was found that non-fiction books become bestsellers more often, yet users rate fiction higher, which is confirmed by the statistically significant test results.