Abstract

Whatever the kind of screen work, the most common questions among viewers and fans are probably:

  • Which xxx is the best?
  • Which xxx do you recommend?
  • What do you think is the best xxx of 2021?

Here xxx can stand for any type of screen work, e.g. movies, TV series or TV shows. The same holds for one type in particular: TV anime.

In this post, I present my analysis of which TV anime were most popular in different periods and which features may have made them popular among viewers, using data from the anime community on Reddit together with anime record data.

Background

TV series in the US usually have 23-24 episodes in a “full season”, and many of them run through fall and winter, from late September to May of the following year. Anime producers in Japan follow a very different convention: TV anime usually air by season, with each season lasting three months and containing 12-13 episodes:

  • Winter: January - March
  • Spring: April - June
  • Summer: July - September
  • Fall: October - December

Therefore, in this project I divide the anime of each year into four groups by calendar season, since most episodes of an anime are released within one of the four seasons. If two anime belong to the same season, their episodes are released at roughly the same time.
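
The season assignment described above can be sketched in a few lines. This is only an illustration: the helper name season_of is hypothetical, and it assumes the season number (1 = Winter to 4 = Fall) is derived from the month of a submission's creation date.

```python
from datetime import datetime

def season_of(ts: datetime) -> int:
    """Map a timestamp to a broadcast season: 1=Winter, 2=Spring, 3=Summer, 4=Fall."""
    # Months 1-3 -> 1, 4-6 -> 2, 7-9 -> 3, 10-12 -> 4
    return (ts.month - 1) // 3 + 1

# A submission created on 2019-04-09 falls into the Spring season.
print(season_of(datetime(2019, 4, 9, 17, 34, 10)))
```

Because a show's episodes can spill over a season boundary, a per-anime season would still need to be chosen from its episodes, for example the earliest one.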

In addition, anime genres are very rich, covering many topics and themes, so we can analyze whether there is any connection between genre and popularity.

Finding the Hottest Anime in Different Periods

How to Extract the Data We Want?

First of all, we need to determine which submissions we actually want. Unlike IMDb or Rotten Tomatoes, Reddit has no individual page for each anime where people discuss or review only that specific work. The topics are therefore relatively sparse and broad, which a word cloud also shows.

from wordcloud import WordCloud
import matplotlib.pyplot as plt

wordcloud = WordCloud(width = 1000, height = 600, background_color="white",
                min_font_size = 16, font_step=2)
wordcloud.generate(sub_titles['text'].str.cat(sep=' '))
plt.figure(figsize=(20, 10))
plt.imshow(wordcloud, interpolation='bilinear')
plt.axis("off")
plt.show()

Figure: Word cloud of the titles of all submissions in the anime subreddit, from the text_submissions dataset

From the word cloud we can see that people discuss many topics in this subreddit, including but not limited to plots, favorite characters and anime recommendations. Simply applying LatentDirichletAllocation from scikit-learn would make it difficult to determine whether a post is about a specific anime and whether that anime is currently airing, because the model is very likely to fail to extract the anime's name properly.

Fortunately, the moderators of the anime subreddit and other contributors wrote a post bot that monitors the latest streaming info and automatically creates a post for each episode of an anime after it is released. This bot currently operates under the account AutoLovepon.

Figure: A typical discussion submission created by this bot

Then we can simply go through the full dataset, select the posts created by AutoLovepon, and use a regex to extract the anime title and episode number. Since the full dataset only contains metadata, I also used the praw library to obtain each submission's title from Reddit. As a result, we acquire data like the following:

  • anime: title of the anime
  • created_utc: the date the submission was created, which identifies the season the anime belongs to
  • com_mean, com_median, com_count: the mean and median of the scores of the comments under the submission, and the total number of comments
  • score: the score of the submission
sta_19[sta_19['anime'] == "One Punch Man Season 2"].head()

            id          created_utc                   anime  season   ep   com_mean  com_median  com_count  score
749  t3_bbatev  2019-04-09 17:34:10  One Punch Man Season 2       2  1.0  26.809337         2.0     2035.0   7756
784  t3_bdwte2  2019-04-16 17:40:48  One Punch Man Season 2       2  2.0  24.634706         2.0     1919.0   3743
830  t3_bgja8c  2019-04-23 17:39:02  One Punch Man Season 2       2  3.0  42.697674         3.0     1290.0   5468
881  t3_bj65ne  2019-04-30 17:37:40  One Punch Man Season 2       2  4.0  28.279570         3.0      930.0   3887
927  t3_bltood  2019-05-07 17:42:06  One Punch Man Season 2       2  5.0  22.986000         3.0     1000.0   2530
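
The title-parsing step mentioned above can be sketched as follows. The title layout "<anime> - Episode <n> discussion" is an assumption about how the bot names its posts, and TITLE_RE and parse_title are hypothetical names, not the code used for this post.

```python
import re

# Assumed title layout: "<anime> - Episode <n> discussion".
# The greedy ".+" keeps titles that themselves contain " - " intact.
TITLE_RE = re.compile(r"^(?P<anime>.+) - Episode (?P<ep>\d+(?:\.\d+)?) discussion")

def parse_title(title):
    """Return (anime, episode) for a bot-style title, or None if it does not match."""
    m = TITLE_RE.match(title)
    if m is None:
        return None
    return m.group("anime"), float(m.group("ep"))

print(parse_title("One Punch Man Season 2 - Episode 1 discussion"))
print(parse_title("Which anime do you recommend?"))
```

Returning None for non-matching titles makes it easy to drop posts that are not episode discussions.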

Rank the Top 5 Anime of Each Season

The score of each discussion submission is the most obvious metric to compare. Thus, we can calculate the mean score of each anime's discussion submissions and rank the anime by it.
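
A minimal sketch of this ranking on toy data is shown here. The function name top_n_by_season is hypothetical and is not the rank_seasons helper used for the tables in this post; it assumes one row per discussion submission with anime, season and score columns.

```python
import pandas as pd

def top_n_by_season(df: pd.DataFrame, n: int = 5) -> pd.DataFrame:
    """Rank anime within each season by the mean score of their submissions."""
    mean_scores = (
        df.groupby("anime")
          .agg(score=("score", "mean"), season=("season", "min"))
          .reset_index()
    )
    # Sort so that the best anime of each season comes first, then keep n per season.
    ranked = mean_scores.sort_values(["season", "score"], ascending=[True, False])
    top = ranked.groupby("season").head(n).copy()
    top["rank"] = top.groupby("season").cumcount()  # 0-based rank within a season
    return top.reset_index(drop=True)

toy = pd.DataFrame({
    "anime": ["A", "A", "B", "B", "C"],
    "season": [1, 1, 1, 1, 2],
    "score": [100, 300, 50, 70, 10],
})
result = top_n_by_season(toy, n=1)
print(result)
```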

Since we have both the score of a submission and the scores of its comments, I ranked the anime by:

  • mean of each submission’s score
import plotly.express as px
import plotly.io as pio 
pio.renderers.default='iframe'

top5_by_score = mean_scores.groupby('season').apply(lambda x: x.sort_values(by='score', ascending=False, na_position='first').head(5).reset_index()).droplevel(0)

fig_1 = px.line(
    top5_by_score, 
    x=top5_by_score.index, 
    y='score', 
    color='season', 
    symbol='season', 
    hover_data=['anime'],
    labels={
        "index": "Rank",
        "score": "Mean of Submission Score",
        "season": "Season"
    },
)

fig_1.show()
  • mean number of comments under the same submission
sta_19.fillna(0, inplace=True)
mean_com_count = sta_19.groupby('anime').agg({'com_count': 'mean', 'season': 'min'})
top5_by_com_count = mean_com_count.groupby('season').apply(lambda x: x.sort_values(by='com_count', ascending=False, na_position='first').reset_index().head(5)).droplevel(0)

fig_2 = px.line(
    top5_by_com_count, 
    x=top5_by_com_count.index, 
    y='com_count', 
    color='season', 
    symbol='season', 
    hover_data=['anime'],
    labels={
        "index": "Rank",
        "com_count": "Mean of comment's count",
        "season": "Season"
    },
)
fig_2.show()

  • median of total comments' score under the same submission
med_com_score = sta_19.groupby('anime').agg({'com_median': 'mean', 'season': 'min'})
top5_by_com_score_med = med_com_score.groupby('season').apply(lambda x: x.sort_values(by='com_median', ascending=False, na_position='first').reset_index().head(5)).droplevel(0)

fig_3 = px.line(
    top5_by_com_score_med, 
    x=top5_by_com_score_med.index, 
    y='com_median', 
    color='season', 
    symbol='season', 
    hover_data=['anime'],
    labels={
        "index": "Rank",
        "com_median": "Median of comment's Score",
        "season": "Season"
    },
)
fig_3.show()
  • mean of the comments' scores under the same submission
sta_19.fillna(0, inplace=True)
mean_com_score = sta_19.groupby('anime').agg({'com_mean': 'mean', 'season': 'min'})
top5_by_com_mean = mean_com_score.groupby('season').apply(lambda x: x.sort_values(by='com_mean', ascending=False, na_position='first').reset_index().head(5)).droplevel(0)

fig_4 = px.line(
    top5_by_com_mean, 
    x=top5_by_com_mean.index, 
    y='com_mean', 
    color='season', 
    symbol='season', 
    hover_data=['anime'],
    labels={
        "index": "Rank",
        "com_mean": "Mean of comment's Score",
        "season": "Season"
    },
)

fig_4.to_html("")

fig_4.show()

We can now see that each of these metrics produces a somewhat different result. To decide whether to keep more than one of them in our ranking model, I tested whether they are correlated, in particular the submission score with the mean and with the sum of the comment scores.

sta_19['comment_total'] = sta_19['com_mean'] * sta_19['com_count']
sta_19.plot.scatter(x='com_mean', y='score', logy=True, logx=True)
sta_19.plot.scatter(x='comment_total', y='score', logy=True, logx=True)

Figures: Log-log scatter plots of the submission score against com_mean and against comment_total

Apparently, both the mean and the sum of the comments' scores are highly correlated with the submission's score, so we only need the submission score to rank the anime. We can then rank the anime in each season from 2019 to 2021 by the mean score of their discussion submissions.
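
The visual impression from the scatter plots can be backed by a number. The sketch below is not part of the original analysis: it assumes a frame with score, com_mean and comment_total columns, and the function name score_correlations is hypothetical. It reports the Pearson correlation of the log-transformed values (matching the log-log axes) and a rank-based (Spearman) correlation.

```python
import numpy as np
import pandas as pd

def score_correlations(df: pd.DataFrame) -> pd.DataFrame:
    """Correlate the submission score with the two comment-score metrics."""
    cols = ["score", "com_mean", "comment_total"]
    # Clip at 1 so that log10 is defined for zero or negative values.
    logged = np.log10(df[cols].clip(lower=1))
    pearson_log = logged.corr()["score"]
    # Pearson correlation of the ranks is the Spearman correlation.
    spearman = df[cols].rank().corr()["score"]
    out = pd.DataFrame({"pearson_log": pearson_log, "spearman": spearman})
    return out.drop(index="score")

toy = pd.DataFrame({
    "score": [10, 100, 1000, 10000],
    "com_mean": [1, 2, 4, 8],
    "comment_total": [10, 200, 4000, 80000],
})
corr = score_correlations(toy)
print(corr)
```

Values close to 1 for both metrics would support keeping only the submission score.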

Now we can rank the top 5 anime in each season of each year:

  • 2019
rank_2019 = rank_seasons(2019)
rank_2019

   anime                                                     score  season
0  Kaguya-sama wa Kokurasetai: Tensai-tachi no Re...   7204.916667       1
1  Mob Psycho 100 Season 2                             6884.000000       1
2  Yakusoku no Neverland                               4324.083333       1
3  Tate no Yuusha no Nariagari                         3800.400000       1
4  Tensei shitara Slime Datta Ken                      3382.166667       1
0  Shingeki no Kyojin Season 3                        10257.600000       2
1  Kimetsu no Yaiba                                    4872.259259       2
2  One Punch Man Season 2                              3825.666667       2
3  Isekai Quartet                                      2550.916667       2
4  Hitori Bocchi no ○○ Seikatsu                        1643.750000       2
0  Dr. Stone                                           4207.291667       3
1  Vinland Saga                                        4027.791667       3
2  Enen no Shouboutai                                  2533.791667       3
3  Tsuujou Kougeki ga Zentai Kougeki de Ni-kai Ko...   1760.000000       3
4  Dungeon ni Deai o Motomeru no wa Machigatte Ir...   1641.583333       3
0  Boku no Hero Academia Season 4                      4608.181818       4
1  Sword Art Online: Alicization - War of Underworld   2264.833333       4
2  Fate/Grand Order: Zettai Majuu Sensen Babylonia     2143.090909       4
3  Shinchou Yuusha: Kono Yuusha ga Ore Tueee Kuse...   2067.833333       4
4  Ore o Suki na no wa Omae Dake ka yo                 1840.363636       4
  • 2020
rank_2020 = rank_seasons(2020)
rank_2020

   anime                                                     score  season
0  Boku no Hero Academia Season 4                      4224.571429       1
1  Fate/Grand Order: Zettai Majuu Sensen Babylonia     2351.200000       1
2  Eizouken ni wa Te wo Dasu na!                       2117.833333       1
3  Itai no wa Iya nano de Bougyoryoku ni Kyokufur...   2000.166667       1
4  Haikyuu!! To the Top                                1984.384615       1
0  Kaguya-sama wa Kokurasetai?: Tensai-tachi no R...  10105.000000       2
1  Kaguya-sama wa Kokurasetai?: Tensai-tachi no R...   9362.000000       2
2  Kami no Tou                                         8229.000000       2
3  Kami no Tou: Tower of God                           8040.500000       2
4  Otome Game no Hametsu Flag shika Nai Akuyaku R...   3128.000000       2
0  Re:Zero kara Hajimeru Isekai Seikatsu Season 2     12289.615385       3
1  Yahari Ore no Seishun Love Comedy wa Machigatt...   6253.833333       3
2  The God of High School                              4913.384615       3
3  Maou Gakuin no Futekigousha: Shijou Saikyou no...   3780.916667       3
4  Sword Art Online: Alicization - War of Underwo...   3347.000000       3
0  Shingeki no Kyojin: The Final Season               16821.500000       4
1  Jujutsu Kaisen                                      6458.266667       4
2  Haikyuu!! To the Top 2nd Season                     4450.000000       4
3  Haikyuu!!: To the Top Part 2                        2831.363636       4
4  Higurashi no Naku Koro ni [Reboot only thread]      2758.666667       4
  • 2021 (first two seasons)
rank_2021 = rank_seasons(2021)
rank_2021

   anime                                                     score  season
0  Shingeki no Kyojin: The Final Season               18219.307692       1
1  Re:Zero kara Hajimeru Isekai Seikatsu Season 2...  12320.250000       1
2  Jujutsu Kaisen                                     10981.727273       1
3  Mushoku Tensei: Isekai Ittara Honki Dasu            8043.727273       1
4  Horimiya                                            6959.461538       1
0  86 EIGHTY-SIX                                       7757.090909       2
1  Vivy: Fluorite Eye's Song                           5536.461538       2
2  Fumetsu no Anata e                                  5449.750000       2
3  Ijiranaide, Nagatoro-san                            4158.750000       2
4  Hige wo Soru. Soshite Joshikousei wo Hirou.         3315.076923       2

Furthermore, we can plot the rankings of all three years in a facet plot with one panel per year:

rank_2019['year'] = 2019
rank_2020['year'] = 2020
rank_2021['year'] = 2021

# DataFrame.append was removed in pandas 2.0; pd.concat keeps each table's 0-4 rank index
total = pd.concat([rank_2019, rank_2020, rank_2021])
fig = px.line(
        total, 
        x=total.index, 
        y='score', 
        color='season', 
        symbol='season',
        facet_col='year',
        hover_data=['anime'],
        labels={
            "index": "Rank",
            "score": "Mean of Submission Score",
            "season": "Season"
        },
    )

sp = total[total['anime'].str.contains('Shingeki no Kyojin')]
sp_data = px.scatter(
        sp, 
        x=sp.index, 
        y='score', 
        text="anime",
        facet_col='year'
    ).update_traces(mode="text")["data"]

for trace in sp_data:
    fig.add_trace(trace)

save_ploty(f"{Export_Path}/trend.html", [fig])

fig.show()

Based on the facet plot above, the community appears to have become more and more active over the years, as the mean score keeps increasing.

We can also see that some series are extremely popular. Season 3 and season 4 (which has 23 episodes) of Shingeki no Kyojin were released in season 2 of 2019, season 4 of 2020 and season 1 of 2021, and the average scores they received are significantly higher than those of other anime.

Therefore, apart from knowing which anime is the hottest in its own season, we can also see that Shingeki no Kyojin is the most popular anime in the community.

Explore What Would Make an Anime Hot

Since the full dataset only gives us the metadata of each submission and comment, and the text dataset is only a small subset of the full dataset, we cannot do much further analysis on it to detect what makes an anime popular.

Luckily, while working on this project, I came across an interesting dataset, anime-offline-database, which contains the tags of each anime, so we can take the tag data from it.

The columns I care about are:

  • title: The name of anime
  • tags: list of tags of the anime

We can look into what kind of theme or subject makes an anime popular. In other words, we try to connect the presence of a tag with the score an anime receives, and build a model that predicts the score from the tag data.

Merge the Records

The only difficulty is that the name of the same anime can differ slightly between the two datasets, which makes an exact match hard. The common differences are:

  • Lowercase vs. uppercase
  • Series numbering
  • Translation habits

Some examples are shown below:

merged_table[merged_table['anime'] != merged_table['anime_title']][['anime', 'anime_title']].head()

    anime                                   anime_title
0   3D Kanojo: Real Girl Season 2           3D Kanojo: Real Girl 2nd Season
2   Africa no Salaryman                     Africa no Salaryman (TV)
4   Aikatsu Friends!                        Aikatsu on Parade!
8   Ani ni Tsukeru Kusuri wa Nai! Season 3  Ani ni Tsukeru Kusuri wa Nai! 3
14  BEM                                     EMOMOMO

I solved the problem with difflib from Python's standard library: find the closest match for each anime name across the two datasets, then merge the anime whose names appear in the search results.

import difflib
import numpy as np

def find_closet(x):
    # perform fuzzy search on the anime title from reddit dataset 
    # with title in the anime-offline-database   
    res = difflib.get_close_matches(x, anime_meta_year['title'], cutoff=0.4)
    if len(res) == 0:
        # no candidate reached the similarity cutoff
        return np.nan
    return res[0]

As a result, only 50 anime were dropped for lack of a match, which is pretty good given that the original number of anime in the Reddit data is 537.
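
The cutoff trades coverage against precision: a permissive value keeps more anime but can also accept weak pairs such as BEM and EMOMOMO in the examples above. The sketch below illustrates this with the standard-library difflib only. The helper closest_title and the lowercasing step are assumptions for illustration, not the code used for the merge.

```python
import difflib

def closest_title(query, candidates, cutoff=0.4):
    """Return the candidate most similar to query (case-insensitive), or None."""
    lowered = {c.lower(): c for c in candidates}
    hits = difflib.get_close_matches(query.lower(), list(lowered), n=1, cutoff=cutoff)
    return lowered[hits[0]] if hits else None

candidates = [
    "3D Kanojo: Real Girl 2nd Season",
    "Africa no Salaryman (TV)",
    "Ani ni Tsukeru Kusuri wa Nai! 3",
]
print(closest_title("Africa no Salaryman", candidates))
# A stricter cutoff rejects a weak pair that a looser one accepts.
print(closest_title("BEM", ["EMOMOMO"], cutoff=0.3))
print(closest_title("BEM", ["EMOMOMO"], cutoff=0.6))
```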

Regression With Linear Model

After merging the tables, applying explode() and pivot() gives a tag matrix whose columns are the tags and whose index is the anime names.

# pivot() takes keyword-only arguments in pandas 2.0 and later
tag_matrix = tags.pivot(index='anime', columns='tags', values='present').fillna(0)
tag_matrix = tag_matrix.sort_index(axis=1, key=lambda x: tag_matrix[x].sum(), ascending=False)
tag_matrix.head()

tags comedy action drama fantasy slice of life present based on a manga male protagonist female protagonist shounen ... shounen ai soccer shounen-ai flat chested shrine maiden flash animation skateboarding slow when it comes to love fake romance exhibitionism
anime
100-man no Inochi no Ue ni Ore wa Tatte Iru 1.0 1.0 1.0 1.0 0.0 0.0 1.0 1.0 0.0 1.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
2.43: Seiin Koukou Danshi Volley-bu 0.0 0.0 1.0 0.0 1.0 0.0 0.0 1.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
22/7 1.0 0.0 1.0 0.0 1.0 1.0 0.0 0.0 1.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
3D Kanojo: Real Girl Season 2 1.0 0.0 1.0 0.0 1.0 1.0 1.0 1.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
A3! Season Autumn & Winter 1.0 0.0 1.0 0.0 1.0 0.0 0.0 0.0 1.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0

5 rows × 717 columns
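
The explode() and pivot() steps are not shown in full above, so here is a minimal sketch on toy data. It assumes a frame with one row per anime and a list-valued tags column; the function name build_tag_matrix is hypothetical.

```python
import pandas as pd

def build_tag_matrix(meta: pd.DataFrame) -> pd.DataFrame:
    """Turn a list-valued 'tags' column into a 0/1 matrix (anime x tag)."""
    tags = meta.explode("tags")   # one row per (anime, tag) pair
    tags["present"] = 1.0
    matrix = tags.pivot(index="anime", columns="tags", values="present").fillna(0)
    # Put the most frequent tags first, as in the matrix shown above.
    order = matrix.sum().sort_values(ascending=False).index
    return matrix[order]

meta = pd.DataFrame({
    "anime": ["A", "B"],
    "tags": [["comedy", "action"], ["comedy"]],
})
tag_matrix_toy = build_tag_matrix(meta)
print(tag_matrix_toy)
```

Note that pivot() requires each (anime, tag) pair to be unique, so duplicated tags would have to be dropped first.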

In this case, we explore LinearRegression and LogisticRegression from scikit-learn's linear_model module. We split the data into three parts: 70% as training data, 15% as test data and 15% as validation data.
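
The 70/15/15 split can be obtained by calling train_test_split twice. The sketch below runs on random toy data; the function name split_70_15_15 and the fixed seed are assumptions, not the code used for this post.

```python
import numpy as np
from sklearn.model_selection import train_test_split

def split_70_15_15(X, y, seed=0):
    """Split into 70% train, 15% test and 15% validation."""
    # First hold out 30% of the rows, then cut that part in half.
    X_train, X_rest, y_train, y_rest = train_test_split(
        X, y, test_size=0.30, random_state=seed)
    X_test, X_vali, y_test, y_vali = train_test_split(
        X_rest, y_rest, test_size=0.50, random_state=seed)
    return (X_train, y_train), (X_test, y_test), (X_vali, y_vali)

rng = np.random.default_rng(0)
X_toy = rng.integers(0, 2, size=(100, 5))   # toy 0/1 tag matrix
y_toy = rng.normal(3000, 1000, size=100)    # toy scores
train, test, vali = split_70_15_15(X_toy, y_toy)
print(len(train[0]), len(test[0]), len(vali[0]))
```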

import pandas as pd
from sklearn.linear_model import LinearRegression, LogisticRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

def rating_test(start, end, useLinear=True, step=1):
    # Fit a model on the first i tag columns (most frequent tags first)
    # and record the MSE on the test and validation sets.
    Model = LinearRegression if useLinear else LogisticRegression
    n_features = range(start, end, step)
    test_mses = []
    vali_mses = []
    for i in n_features:
        reg = Model().fit(train_set.iloc[:, 0:i], train_y)
        test_mses.append(mean_squared_error(test_y, reg.predict(test_set.iloc[:, 0:i])))
        vali_mses.append(mean_squared_error(vali_y, reg.predict(vali_set.iloc[:, 0:i])))
    # These values come from the test set, so label them as test MSE
    test_mses = pd.Series(test_mses, index=n_features, name='Test MSE').rename_axis('n_features')
    vali_mses = pd.Series(vali_mses, index=n_features, name='Validation MSE').rename_axis('n_features')
    return test_mses, vali_mses

test_mses, vali_mses = rating_test(1, 30)
test_mses.plot.line(grid=True, marker='o', legend=True)
vali_mses.plot.line(grid=True, marker='^', legend=True)

Figure: Test and validation MSE against the number of features (n_features) for linear regression

The plot shows that the MSE between the predicted y and the actual y is over $10^6$, so linear regression does not appear to work well on this dataset.
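
An MSE of $10^6$ corresponds to a root-mean-square error of about 1000 score points, so how bad it is depends on the spread of the scores. One way to put it in context, which was not part of the original analysis, is to compare the model with a baseline that always predicts the training mean. The sketch below does this on synthetic data; the function name compare_to_baseline is hypothetical.

```python
import numpy as np
from sklearn.dummy import DummyRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

def compare_to_baseline(X_train, y_train, X_test, y_test):
    """Return the test MSE of a mean-predicting baseline and of linear regression."""
    baseline = DummyRegressor(strategy="mean").fit(X_train, y_train)
    model = LinearRegression().fit(X_train, y_train)
    return {
        "baseline_mse": mean_squared_error(y_test, baseline.predict(X_test)),
        "model_mse": mean_squared_error(y_test, model.predict(X_test)),
    }

# Synthetic data in which the first tag really does drive the score.
rng = np.random.default_rng(0)
X_syn = rng.integers(0, 2, size=(200, 5))
y_syn = 3000 * X_syn[:, 0] + rng.normal(0, 100, size=200)
mses = compare_to_baseline(X_syn[:150], y_syn[:150], X_syn[150:], y_syn[150:])
print(mses)
```

If the model's MSE on the real data is no lower than the baseline's, the tags carry little linear signal about the score.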

Now, let’s test with Logistic Regression:

test_mses, vali_mses = rating_test(1, 30, False)
test_mses.plot.line(grid=True, marker='o', legend=True)
vali_mses.plot.line(grid=True, marker='^', legend=True)

Figure: Test and validation MSE against the number of features (n_features) for logistic regression

Similarly, the logistic model doesn’t work well on this dataset.

Therefore, we cannot use these simple linear models to predict the score of an anime.

Conclusion

With the given Reddit dataset, we can determine the hottest anime in different periods and rank the top n anime for a simple recommendation system. But the dataset has limitations, such as detecting the worst anime: the score mostly reflects how willing people are to discuss a submission, and it cannot tell us their attitude towards a series. For example, if the quality of an episode is extremely poor, users might go to the submission and leave comments analyzing why it is bad or complaining about the plot, and the submission may still receive a very good score as a result. It might be a good idea to combine the score with a sentiment score of the texts in the thread to detect whether people are complaining, praising or simply indifferent.

I also attempted to use linear models to find a connection between the tags of an anime and the mean score of its episode discussion submissions. However, these two models did not work well on the given data, so more powerful models may be needed to test whether such a connection exists.