What's The Hottest Anime in Each Season?

Abstract

Whatever the kind of screenplay, the most common questions among viewers and fans are probably:

Which xxx is the best?
Which xxx do you recommend?
What do you think is the best xxx of 2021?

Here xxx can be replaced by any type of screenplay, e.g. movies, TV series or TV shows. The same holds for one particular type of screenplay: TV anime.

In this post, I present my analysis of which TV anime were the most welcomed in different periods of time, and which features may have made them popular among viewers, with the help of data from the anime community on Reddit and anime record data.

Background

TV series in the US usually have 23-24 episodes in a “full season”, and many of them run through fall and winter, from late September to May of the next year. Anime producers in Japan follow a different convention: TV anime usually air by season, each lasting three months and containing 12-13 episodes:

Winter: January - March
Spring: April - June
Summer: July - September
Fall: October - December

Therefore, in this project I divide the anime of each year into four groups by calendar season, since most episodes of an anime are released within one of the four seasons. If two anime belong to the same season, their episodes are released at around the same time.
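As a minimal sketch of this grouping, assuming each submission carries a creation timestamp, the season number can be derived directly from the month. The function name airing_season is hypothetical and is not taken from the original notebook.

```python
from datetime import datetime

def airing_season(created: datetime) -> int:
    """Map a timestamp to a season number: 1 = Winter (Jan-Mar),
    2 = Spring (Apr-Jun), 3 = Summer (Jul-Sep), 4 = Fall (Oct-Dec)."""
    return (created.month - 1) // 3 + 1

# An episode thread posted in April belongs to the Spring season.
example = airing_season(datetime(2019, 4, 9, 17, 34, 10))
```

Integer division by three works here because every season covers exactly three consecutive calendar months.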

Besides, the genres of anime can be very rich, covering many topics and themes, so we can also analyze whether there is any connection between genre and popularity.

Finding the Hottest Anime in Different Periods

How to Extract the Data We Want?

First of all, we need to determine which submissions we actually want. Unlike IMDB or Rotten Tomatoes, Reddit has no individual page for each anime where people discuss or review only that specific work. The topics can be relatively sparse and broad, which a word cloud also shows.

from wordcloud import WordCloud
import matplotlib.pyplot as plt

wordcloud = WordCloud(width = 1000, height = 600, background_color="white",
                min_font_size = 16, font_step=2)
wordcloud.generate(sub_titles['text'].str.cat(sep=' '))
plt.figure(figsize=(20, 10))
plt.imshow(wordcloud, interpolation='bilinear')
plt.axis("off")
plt.show()

Figure: word cloud of the titles of all submissions in the anime subreddit, from the text_submissions dataset

The word cloud shows that people discuss many topics in this subreddit, including but not limited to plots, favorite characters and anime recommendations. It would be difficult to determine whether a submission is about a specific anime, and whether that anime is currently on air, by simply applying LatentDirichletAllocation from scikit-learn, as the model would very likely fail to extract the anime names properly.

Fortunately, the moderators of the anime subreddit and other contributors wrote a post bot [4] which monitors the latest streaming info and automatically creates a post for each episode of an anime after it is released. This bot currently operates under the account AutoLovepon.

Figure: a typical discussion submission created by this bot

Then we can simply go through the full dataset, select the posts created by AutoLovepon, and use a regex to extract the anime title and episode number. Since the full dataset only contains metadata, I also used the praw library to obtain the submission titles from Reddit. As a result, we acquire data like the following:

anime: the title of the anime
created_utc: the date the submission was created, which identifies the season the anime belongs to
com_mean, com_median, com_count: the mean and median of the comment scores under the submission, and the total number of comments
score: the score of the submission
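The title-parsing step can be sketched as follows. The pattern assumes the bot's titles look like "<anime title> - Episode <number> discussion"; that format, the pattern and the helper name parse_discussion_title are assumptions for illustration, not the regex used in the original notebook.

```python
import re

# Assumed title format: "<anime title> - Episode <number> discussion"
TITLE_RE = re.compile(
    r"^(?P<anime>.+?)\s+-\s+Episode\s+(?P<ep>\d+(?:\.\d+)?)\s+discussion",
    re.IGNORECASE,
)

def parse_discussion_title(title):
    """Return (anime, episode) for an episode thread, or None otherwise."""
    match = TITLE_RE.match(title.strip())
    if match is None:
        return None
    return match.group("anime"), float(match.group("ep"))

parsed = parse_discussion_title("One Punch Man Season 2 - Episode 1 discussion")
```

The non-greedy title group keeps extending until " - Episode" is reached, so titles that themselves contain " - " are still captured whole.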

sta_19[sta_19['anime'] == "One Punch Man Season 2"].head()

	id	created_utc	anime	season	ep	com_mean	com_median	com_count	score
749	t3_bbatev	2019-04-09 17:34:10	One Punch Man Season 2	2	1.0	26.809337	2.0	2035.0	7756
784	t3_bdwte2	2019-04-16 17:40:48	One Punch Man Season 2	2	2.0	24.634706	2.0	1919.0	3743
830	t3_bgja8c	2019-04-23 17:39:02	One Punch Man Season 2	2	3.0	42.697674	3.0	1290.0	5468
881	t3_bj65ne	2019-04-30 17:37:40	One Punch Man Season 2	2	4.0	28.279570	3.0	930.0	3887
927	t3_bltood	2019-05-07 17:42:06	One Punch Man Season 2	2	5.0	22.986000	3.0	1000.0	2530
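The comment statistics in the table above could be computed with a single groupby. This is a sketch on made-up data; it assumes a comments table with a link_id column naming the parent submission and a score column, which may differ from the column names in the actual dataset.

```python
import pandas as pd

# Made-up comments: three under submission t3_a, one under t3_b.
comments = pd.DataFrame({
    "link_id": ["t3_a", "t3_a", "t3_a", "t3_b"],
    "score": [1, 2, 30, 5],
})

# One row per submission with the mean, median and count of comment scores.
comment_stats = comments.groupby("link_id")["score"].agg(
    com_mean="mean", com_median="median", com_count="count"
)
```

A few highly upvoted comments pull the mean far above the median, which matches the gap between com_mean and com_median in the table.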

Rank the Top 5 Anime of Each Season

The score of each discussion submission is the most obvious metric to compare. Thus, we can calculate the mean score of each anime's discussion submissions and rank the anime by it.

import plotly.express as px

top5_by_score = mean_scores.groupby('season').apply(lambda x: x.sort_values(by='score', ascending=False, na_position='first').head(5).reset_index()).droplevel(0)

fig = px.line(
    top5_by_score, 
    x=top5_by_score.index, 
    y='score', 
    color='season', 
    symbol='season', 
    hover_data=['anime'],
    labels={
        "index": "Rank",
        "score": "Mean of Submission Score",
        "season": "Season"
    },
)
fig.show()
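The post does not show how mean_scores or the rank_seasons helper are built. One way they might look is sketched below, assuming a table with anime, created_utc, season and score columns as in the sample above. The function name rank_seasons_sketch and the demo rows are hypothetical.

```python
import pandas as pd

def rank_seasons_sketch(submissions: pd.DataFrame, year: int, n: int = 5) -> pd.DataFrame:
    """Top-n anime per season of one year, ranked by mean submission score."""
    in_year = submissions[submissions["created_utc"].dt.year == year]
    # One row per anime: mean score of its threads and the season it started in.
    per_anime = in_year.groupby("anime", as_index=False).agg(
        score=("score", "mean"), season=("season", "min")
    )
    ranked = per_anime.sort_values(["season", "score"], ascending=[True, False])
    top = ranked.groupby("season").head(n).copy()
    # Use the rank within each season (0 = highest score) as the index.
    top.index = top.groupby("season").cumcount().to_numpy()
    return top[["anime", "score", "season"]]

# Made-up rows for illustration only.
demo = pd.DataFrame({
    "anime": ["A", "A", "B", "C"],
    "created_utc": pd.to_datetime(
        ["2019-01-05", "2019-01-12", "2019-01-06", "2019-04-07"]
    ),
    "season": [1, 1, 1, 2],
    "score": [100, 300, 500, 50],
})
ranking = rank_seasons_sketch(demo, 2019)
```

Taking the minimum season per anime assigns a show that runs across two seasons to the season in which it started, mirroring the 'season': 'min' aggregation used elsewhere in the post.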

Since we have both the scores of submissions and the scores of comments, I ranked the anime by each of the following:

mean of each submission’s score

import plotly.express as px
import plotly.io as pio 
pio.renderers.default='iframe'

top5_by_score = mean_scores.groupby('season').apply(lambda x: x.sort_values(by='score', ascending=False, na_position='first').head(5).reset_index()).droplevel(0)

fig_1 = px.line(
    top5_by_score, 
    x=top5_by_score.index, 
    y='score', 
    color='season', 
    symbol='season', 
    hover_data=['anime'],
    labels={
        "index": "Rank",
        "score": "Mean of Submission Score",
        "season": "Season"
    },
)

fig_1.show()

mean number of comments under the same submission

sta_19.fillna(0, inplace=True)
mean_com_count = sta_19.groupby('anime').agg({'com_count': 'mean', 'season': 'min'})
top5_by_com_count = mean_com_count.groupby('season').apply(lambda x: x.sort_values(by='com_count', ascending=False, na_position='first').reset_index().head(5)).droplevel(0)

fig_2 = px.line(
    top5_by_com_count, 
    x=top5_by_com_count.index, 
    y='com_count', 
    color='season', 
    symbol='season', 
    hover_data=['anime'],
    labels={
        "index": "Rank",
        "com_count": "Mean of comment's count",
        "season": "Season"
    },
)
fig_2.show()

median of the comments' scores under the same submission

med_com_score = sta_19.groupby('anime').agg({'com_median': 'mean', 'season': 'min'})
top5_by_com_score_med = med_com_score.groupby('season').apply(lambda x: x.sort_values(by='com_median', ascending=False, na_position='first').reset_index().head(5)).droplevel(0)

fig_3 = px.line(
    top5_by_com_score_med, 
    x=top5_by_com_score_med.index, 
    y='com_median', 
    color='season', 
    symbol='season', 
    hover_data=['anime'],
    labels={
        "index": "Rank",
        "com_median": "Median of comment's Score",
        "season": "Season"
    },
)
fig_3.show()

mean of the comments' scores under the same submission

sta_19.fillna(0, inplace=True)
mean_com_score = sta_19.groupby('anime').agg({'com_mean': 'mean', 'season': 'min'})
top5_by_com_mean = mean_com_score.groupby('season').apply(lambda x: x.sort_values(by='com_mean', ascending=False, na_position='first').reset_index().head(5)).droplevel(0)

fig_4 = px.line(
    top5_by_com_mean, 
    x=top5_by_com_mean.index, 
    y='com_mean', 
    color='season', 
    symbol='season', 
    hover_data=['anime'],
    labels={
        "index": "Rank",
        "com_mean": "Mean of comment's Score",
        "season": "Season"
    },
)

fig_4.to_html("")

fig_4.show()

We can now see that each of these metrics produces a relatively different result. To decide whether we should keep several of them as metrics in our ranking model, I tested whether they are correlated, particularly the mean of the submission score and the mean of the comment score.

sta_19['comment_total'] = sta_19['com_mean'] * sta_19['com_count']
sta_19.plot.scatter(x='com_mean', y='score', logy=True, logx=True)
sta_19.plot.scatter(x='comment_total', y='score', logy=True, logx=True)

Figure: scatter plots of submission score against com_mean and against comment_total, both on log-log axes

Apparently, both the mean and the sum of the comments' scores are highly correlated with the submission's score, so we only need the submission score to rank the anime. We can then rank the anime in each season from 2019 to 2021 using the mean score of their discussion submissions.
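The visual impression from the scatter plots can be backed by a number. The sketch below computes a Pearson correlation on log-scaled values, matching the log-log axes. The helper name log_pearson is hypothetical, and the five rows are the One Punch Man Season 2 sample shown earlier, used only to illustrate the call and far too few to support a conclusion.

```python
import numpy as np

def log_pearson(x, y):
    """Pearson correlation of log10(x) and log10(y), ignoring non-positive values."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    keep = (x > 0) & (y > 0)  # log10 is undefined for zero or negative scores
    return float(np.corrcoef(np.log10(x[keep]), np.log10(y[keep]))[0, 1])

# com_mean and score from the five sample rows shown earlier.
com_mean = [26.809337, 24.634706, 42.697674, 28.279570, 22.986000]
score = [7756, 3743, 5468, 3887, 2530]
r = log_pearson(com_mean, score)
```

A value close to 1 over the whole dataset would support dropping the comment-based metrics in favor of the submission score.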

The top 5 anime in each season of each year are listed below.

2019

rank_2019 = rank_seasons(2019)
rank_2019

	anime	score	season
0	Kaguya-sama wa Kokurasetai: Tensai-tachi no Re...	7204.916667	1
1	Mob Psycho 100 Season 2	6884.000000	1
2	Yakusoku no Neverland	4324.083333	1
3	Tate no Yuusha no Nariagari	3800.400000	1
4	Tensei shitara Slime Datta Ken	3382.166667	1
0	Shingeki no Kyojin Season 3	10257.600000	2
1	Kimetsu no Yaiba	4872.259259	2
2	One Punch Man Season 2	3825.666667	2
3	Isekai Quartet	2550.916667	2
4	Hitori Bocchi no ○○ Seikatsu	1643.750000	2
0	Dr. Stone	4207.291667	3
1	Vinland Saga	4027.791667	3
2	Enen no Shouboutai	2533.791667	3
3	Tsuujou Kougeki ga Zentai Kougeki de Ni-kai Ko...	1760.000000	3
4	Dungeon ni Deai o Motomeru no wa Machigatte Ir...	1641.583333	3
0	Boku no Hero Academia Season 4	4608.181818	4
1	Sword Art Online: Alicization - War of Underworld	2264.833333	4
2	Fate/Grand Order: Zettai Majuu Sensen Babylonia	2143.090909	4
3	Shinchou Yuusha: Kono Yuusha ga Ore Tueee Kuse...	2067.833333	4
4	Ore o Suki na no wa Omae Dake ka yo	1840.363636	4

2020

rank_2020 = rank_seasons(2020)
rank_2020

	anime	score	season
0	Boku no Hero Academia Season 4	4224.571429	1
1	Fate/Grand Order: Zettai Majuu Sensen Babylonia	2351.200000	1
2	Eizouken ni wa Te wo Dasu na!	2117.833333	1
3	Itai no wa Iya nano de Bougyoryoku ni Kyokufur...	2000.166667	1
4	Haikyuu!! To the Top	1984.384615	1
0	Kaguya-sama wa Kokurasetai?: Tensai-tachi no R...	10105.000000	2
1	Kaguya-sama wa Kokurasetai?: Tensai-tachi no R...	9362.000000	2
2	Kami no Tou	8229.000000	2
3	Kami no Tou: Tower of God	8040.500000	2
4	Otome Game no Hametsu Flag shika Nai Akuyaku R...	3128.000000	2
0	Re:Zero kara Hajimeru Isekai Seikatsu Season 2	12289.615385	3
1	Yahari Ore no Seishun Love Comedy wa Machigatt...	6253.833333	3
2	The God of High School	4913.384615	3
3	Maou Gakuin no Futekigousha: Shijou Saikyou no...	3780.916667	3
4	Sword Art Online: Alicization - War of Underwo...	3347.000000	3
0	Shingeki no Kyojin: The Final Season	16821.500000	4
1	Jujutsu Kaisen	6458.266667	4
2	Haikyuu!! To the Top 2nd Season	4450.000000	4
3	Haikyuu!!: To the Top Part 2	2831.363636	4
4	Higurashi no Naku Koro ni [Reboot only thread]	2758.666667	4

2021 (First Two Seasons)

rank_2021 = rank_seasons(2021)
rank_2021

	anime	score	season
0	Shingeki no Kyojin: The Final Season	18219.307692	1
1	Re:Zero kara Hajimeru Isekai Seikatsu Season 2...	12320.250000	1
2	Jujutsu Kaisen	10981.727273	1
3	Mushoku Tensei: Isekai Ittara Honki Dasu	8043.727273	1
4	Horimiya	6959.461538	1
0	86 EIGHTY-SIX	7757.090909	2
1	Vivy: Fluorite Eye's Song	5536.461538	2
2	Fumetsu no Anata e	5449.750000	2
3	Ijiranaide, Nagatoro-san	4158.750000	2
4	Hige wo Soru. Soshite Joshikousei wo Hirou.	3315.076923	2

Furthermore, we can show the rankings of all three years in a facet plot with one panel per year:

rank_2019['year'] = 2019
rank_2020['year'] = 2020
rank_2021['year'] = 2021

# DataFrame.append was removed in pandas 2.0; pd.concat keeps the per-season rank index
total = pd.concat([rank_2019, rank_2020, rank_2021])
fig = px.line(
        total, 
        x=total.index, 
        y='score', 
        color='season', 
        symbol='season',
        facet_col='year',
        hover_data=['anime'],
        labels={
            "index": "Rank",
            "score": "Mean of Submission Score",
            "season": "Season"
        },
    )

sp = total[total['anime'].str.contains('Shingeki no Kyojin')]
sp_data = px.scatter(
        sp, 
        x=sp.index, 
        y='score', 
        text="anime",
        facet_col='year'
    ).update_traces(mode="text")["data"]

for trace in sp_data:
    fig.add_trace(trace)

save_ploty(f"{Export_Path}/trend.html", [fig])

fig.show()

Based on the facet plot above, the community has become more and more active over the years, as the mean score keeps increasing.

We can also see that some series are extremely well received. Season 3 and season 4 (which has 23 episodes) of Shingeki no Kyojin were released in season 2 of 2019, season 4 of 2020 and season 1 of 2021, and the average scores they received are significantly higher than those of typical anime.

Therefore, apart from knowing which anime is the hottest in its own season, we can also see that Shingeki no Kyojin is the most popular anime in the community.

Explore What Would Make an Anime Hot

Since the full dataset only gives us the metadata of each submission and comment, and the text dataset is only a small subset of the full dataset, we cannot do much further analysis with it to detect what makes an anime popular.

Luckily, while working on this project, I came across an interesting dataset, anime-offline-database, which contains the tags of each anime, so we can pick up the tag data from it.

The columns I care about are:

title: The name of anime
tags: list of tags of the anime

We can look into what kind of theme or subject makes an anime popular. In other words, we try to connect the presence of a tag with the score the anime receives, and build a model that predicts the score from the tag data.

Merge the Records

The only difficulty is that the name of the same anime can differ slightly between the two datasets, which makes an exact match hard. The common differences are:

Lower case vs. upper case
Series numbering
Translation habits

Some examples are shown below:

merged_table[merged_table['anime'] != merged_table['anime_title']][['anime', 'anime_title']].head()

	anime	anime_title
0	3D Kanojo: Real Girl Season 2	3D Kanojo: Real Girl 2nd Season
2	Africa no Salaryman	Africa no Salaryman (TV)
4	Aikatsu Friends!	Aikatsu on Parade!
8	Ani ni Tsukeru Kusuri wa Nai! Season 3	Ani ni Tsukeru Kusuri wa Nai! 3
14	BEM	EMOMOMO

I solved the problem with Python’s built-in difflib: get the closest match for each anime name between the two datasets, then merge the anime whose names appear in the search results.

import difflib
import numpy as np

def find_closet(x):
    # perform fuzzy search on the anime title from reddit dataset 
    # with title in the anime-offline-database   
    res = difflib.get_close_matches(x, anime_meta_year['title'], cutoff=0.4)
    if len(res) == 0:
        return np.nan
    return res[0]

As a result, only 50 anime were dropped because no match was found, which is pretty good given that the original number of anime in the Reddit data is 537.
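The effect of the cutoff can be seen on a couple of the title pairs listed above. The sketch below wraps difflib.get_close_matches in a hypothetical helper, closest_title, which also lower-cases both sides to absorb case differences. The lower-casing and the stricter cutoff of 0.6 are illustrative choices, not part of the original merge.

```python
import difflib

def closest_title(name, candidates, cutoff=0.4):
    """Return the candidate most similar to name, or None below the cutoff."""
    lowered = {c.lower(): c for c in candidates}  # compare case-insensitively
    hits = difflib.get_close_matches(name.lower(), list(lowered), n=1, cutoff=cutoff)
    return lowered[hits[0]] if hits else None

titles = ["3D Kanojo: Real Girl 2nd Season", "Africa no Salaryman (TV)", "EMOMOMO"]

good = closest_title("3D Kanojo: Real Girl Season 2", titles)
loose = closest_title("BEM", ["EMOMOMO"], cutoff=0.4)   # accepted at the loose cutoff
strict = closest_title("BEM", ["EMOMOMO"], cutoff=0.6)  # rejected at a stricter cutoff
```

A low cutoff keeps more anime in the merged table but also lets through pairs such as BEM and EMOMOMO from the examples above, so the cutoff trades coverage against false matches.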

Regression With Linear Model

After merging the tables, by applying explode() and pivot(), we can obtain a tag matrix, where the columns are the tags and the index holds the anime names.

tag_matrix = tags.pivot(index='anime', columns='tags', values='present').fillna(0)
tag_matrix = tag_matrix.sort_index(axis=1, key=lambda x: tag_matrix[x].sum(), ascending=False)
tag_matrix.head()

tags	comedy	action	drama	fantasy	slice of life	present	based on a manga	male protagonist	female protagonist	shounen	...	shounen ai	soccer	shounen-ai	flat chested	shrine maiden	flash animation	skateboarding	slow when it comes to love	fake romance	exhibitionism
anime
100-man no Inochi no Ue ni Ore wa Tatte Iru	1.0	1.0	1.0	1.0	0.0	0.0	1.0	1.0	0.0	1.0	...	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0
2.43: Seiin Koukou Danshi Volley-bu	0.0	0.0	1.0	0.0	1.0	0.0	0.0	1.0	0.0	0.0	...	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0
22/7	1.0	0.0	1.0	0.0	1.0	1.0	0.0	0.0	1.0	0.0	...	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0
3D Kanojo: Real Girl Season 2	1.0	0.0	1.0	0.0	1.0	1.0	1.0	1.0	0.0	0.0	...	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0
A3! Season Autumn & Winter	1.0	0.0	1.0	0.0	1.0	0.0	0.0	0.0	1.0	0.0	...	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0

5 rows × 717 columns
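The explode() and pivot() steps that produce such a matrix can be sketched on a tiny made-up table. The frame merged_demo and its contents are hypothetical; the marker column is named present only to mirror the pivot call above.

```python
import pandas as pd

# Made-up merged table: one row per anime with a list of tags.
merged_demo = pd.DataFrame({
    "anime": ["A", "B"],
    "tags": [["comedy", "action"], ["comedy"]],
})

# One row per (anime, tag) pair, marked as present.
long_tags = merged_demo.explode("tags").assign(present=1.0)

# Wide 0/1 matrix: anime as index, tags as columns, absent tags filled with 0.
tag_demo = long_tags.pivot(index="anime", columns="tags", values="present").fillna(0)

# Order columns from the most to the least frequent tag.
tag_demo = tag_demo[tag_demo.sum().sort_values(ascending=False).index]
```

pivot() raises an error when the same (anime, tag) pair occurs twice, so duplicated tags would need to be dropped before this step.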

In this case, we explore linear regression and logistic regression from scikit-learn's linear models. We split the data into three parts: 70% as training data, 15% as test data and 15% as validation data.
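The post does not show how the split is done. A minimal sketch using a shuffled index follows; the function name split_70_15_15 and the fixed seed are assumptions, and applying the imported train_test_split twice would serve the same purpose.

```python
import numpy as np
import pandas as pd

def split_70_15_15(X: pd.DataFrame, y: pd.Series, seed: int = 0):
    """Shuffle the rows and split them into train (70%), test (15%), validation (15%)."""
    order = np.random.default_rng(seed).permutation(len(X))
    n_train = round(0.70 * len(X))
    n_test = round(0.15 * len(X))
    train_idx = order[:n_train]
    test_idx = order[n_train:n_train + n_test]
    vali_idx = order[n_train + n_test:]  # whatever remains, about 15%
    return (
        (X.iloc[train_idx], y.iloc[train_idx]),
        (X.iloc[test_idx], y.iloc[test_idx]),
        (X.iloc[vali_idx], y.iloc[vali_idx]),
    )

# Made-up feature matrix and target with 20 rows.
X_demo = pd.DataFrame({"comedy": [i % 2 for i in range(20)]})
y_demo = pd.Series(range(20), dtype=float)
(train_X, train_t), (test_X, test_t), (vali_X, vali_t) = split_70_15_15(X_demo, y_demo)
```

Fixing the seed makes the three subsets reproducible between runs, so MSE curves from different models are compared on the same rows.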

from sklearn.linear_model import LinearRegression, LogisticRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

def rating_test(start, end, useLinear=True, step=1):
    test_mses = []
    vali_mses = []
    if useLinear:
        Model = LinearRegression
    else:
        Model = LogisticRegression
    for i in range(start, end, step):
        reg = Model().fit(train_set.iloc[:, 0:i], train_y)
        test_mses.append(mean_squared_error(test_y, reg.predict(test_set.iloc[:, 0:i])))
        vali_mses.append(mean_squared_error(vali_y, reg.predict(vali_set.iloc[:, 0:i])))
    # Both series are computed on held-out data, so label the first one as test MSE
    test_mses = pd.Series(test_mses, index=range(start, end, step), name='Test MSE').rename_axis('n_features')
    vali_mses = pd.Series(vali_mses, index=range(start, end, step), name='Validation MSE').rename_axis('n_features')
    return test_mses, vali_mses

test_mses, vali_mses = rating_test(1, 30)
test_mses.plot.line(grid=True, marker='o', legend=True)
vali_mses.plot.line(grid=True, marker='^', legend=True)

Figure: test and validation MSE of the linear regression against n_features

The MSE between the predicted and actual y in the plot is over $10^6$, so apparently linear regression does not work well on this dataset.

Now, let’s test logistic regression:
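An MSE of this size is easier to judge next to a reference point, because the scores themselves run into the thousands. One common reference is the error of always predicting the training mean. The sketch below uses made-up numbers, and the helper names mse and baseline_mse are hypothetical.

```python
import numpy as np

def mse(y_true, y_pred):
    """Mean squared error between two equally long sequences."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.mean((y_true - y_pred) ** 2))

def baseline_mse(train_y, eval_y):
    """MSE obtained by predicting the training mean for every evaluation row."""
    return mse(eval_y, np.full(len(eval_y), np.mean(train_y)))

# Made-up mean scores, only to show the scale of the numbers.
train_scores = np.array([500.0, 1500.0, 2500.0, 12000.0])
eval_scores = np.array([1000.0, 3000.0])

base = baseline_mse(train_scores, eval_scores)
base_rmse = base ** 0.5  # same unit as the score itself
```

If a fitted model does not beat this baseline on the test and validation sets, the tag features are adding no predictive value in that model.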

test_mses, vali_mses = rating_test(1, 30, False)
test_mses.plot.line(grid=True, marker='o', legend=True)
vali_mses.plot.line(grid=True, marker='^', legend=True)

Figure: test and validation MSE of the logistic regression against n_features

Similarly, the logistic model doesn’t work well on this dataset.

Therefore, we cannot use these simple linear models to predict the score of an anime.

Conclusion

With the given Reddit dataset, we can determine the hottest anime over different periods of time and rank the top n anime for a simple recommendation system. There are still some limitations to this dataset, such as detecting the worst anime: the score mostly reflects how willing people are to discuss a submission, and it does not tell us their attitude towards the series. For example, if the quality of an episode is extremely poor, users might still go to the submission and leave comments analyzing why it is bad or complaining about the plot, and the submission may receive a very good score as a result. It might be a good idea to combine the score with a sentiment score of the texts in the thread, to detect whether people are complaining, praising or simply indifferent to the episode.

I also attempted to use linear models to find a connection between the tags of an anime and the mean score of its episode discussion submissions. However, these two models didn’t work well on the given data, so more powerful models may be needed to test whether such a connection exists.