Simulation of Incentive Design
The next topic in the Decon Simulation series is ‘reward system design’. Before diving into the simulation, we will look at the problems that can arise in reward system design.
Where Will the Rugby Ball Bounce to?
On a traditional blockchain, users have a limited scope of behavior and rewards are given automatically without human intervention. On Bitcoin, for instance, you can receive a reward only through mining: the reward is automatically paid as a coinbase transaction to the miner who solved the puzzle.
This behavioral simplicity and fully automatic rewarding make it relatively easy to predict and drive user action. In other words, the reward system is easy to design because it can be built purely with math.
However, on blockchains where rewards are given for activities other than mining, reward system design becomes complex. For example, if users are to be motivated to upload posts (Steemit), propose new algorithms for given problems (Numerai), or post good data sets (Ocean), how should the designer distribute rewards?
Reward System Problem
Because it is impossible to completely automate the process of assessing whether a user made a meaningful contribution, the blockchain needs the help of other users for reward distribution. Many token economies solve this problem through voting: in a voting-based reward system, posts that get more likes from users receive more rewards.
For example, suppose a system seeks to give users appropriate rewards for their posts. An appropriate reward is assessed by the post’s quality and reader response. Because such assessment requires a quality check, it is very difficult to automate. Therefore, the number of likes a post receives becomes the reward criterion. In this case, pressing the like button is equivalent to voting, and the aggregate number of likes is the vote count.
It would seem reasonable for each post to receive a share of the ‘reward pool’ (the aggregate total of rewards) equal to its percentage of all likes. However, could such a system continuously drive high-quality activity?
Methods of Giving Rewards
At a glance, it seems reasonable to give rewards proportionally, based on the number of likes a person receives out of all likes.
However, some posts could attract few likes for reasons unrelated to effort (popularity, skin exposure, etc.), which could discourage those users from posting at all. In an environment with such users, a uniform reward structure is necessary.
Also, if there are too many users (the total reward being capped), an individual reward could shrink even for a user who attracted relatively many likes. To resolve this problem, a reward system in which payouts grow exponentially with the number of likes can be established to incentivize high-quality posts.
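To make the dilution concrete, here is a minimal sketch (our illustration, not code from the article) showing how a fixed reward pool shrinks individual proportional rewards as the user count grows:

```python
pool = 500.0  # fixed total reward (the article's default pool size)

def proportional(likes, pool):
    """Split the pool in proportion to each user's like count."""
    total = sum(likes)
    return [pool * n / total for n in likes]

# 10 users with 10 likes each: 50 per user
few = proportional([10] * 10, pool)
# 100 users with 10 likes each: only 5 per user, for identical effort
many = proportional([10] * 100, pool)
print(few[0], many[0])  # 50.0 5.0
```

With the pool capped, each user's payout falls roughly as 1/N even when per-user quality is unchanged.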
There is no best reward system, only the most appropriate.
Then, which out of proportional, uniform, and exponential reward systems is the best? Unfortunately, the answer to that question is “none”. There is no cure-all system that can be applied in all cases.
There can only be a system most suitable for a certain environment. An exponential reward system would be a good fit for an environment with many users, while a uniform system would be better suited for one with few users.
In truth, this problem is difficult to tackle before an environment is created and functional. Yet it is an arduous task to overhaul a design once a service is up and running. So Decon ran a simulation to find the optimal reward distribution system.
Nobody can anticipate where a rugby ball will bounce, but if reality is effectively incorporated into a simulation, we can get a rough estimate of the trajectory. Similarly, even though individual behavior is diverse, we can roughly model its patterns with simulation. Our aim was to come up with a better reward system.
Reward System Design
As an example of a complicated reward system, let’s take a look at review platforms. Users leave reviews about a certain activity and receive a reward depending on the number of likes their reviews get.
The reward is distributed from the reward pool. Entities that want to promote activity (such as restaurants or hotels that need reviews) decide the size of the reward pool, and each entity has different demands: one might want many reviews, another quality reviews, or a mixture of both.
Simulation Environment Setting
The settings and assumptions of the hypothetical simulation environment are as follows:
- Reward pool: The total reward amount invested by the service provider who wants to drive users to post reviews. The default amount is 500.
- Number of agents: The number of users in a network. The default number is 100.
- Episode: An episode is the process in which an agent decides how much effort to invest in writing a review, receives votes on the review, and ultimately receives a reward for it. The simulation ran for 500 episodes.
- Range of effort: Refers to the time and cost an agent invests to write a review. 0 means the agent did not write a review and 9 means the agent put maximum effort into writing one. The levels are all whole numbers.
- Asset distribution of agents: Following the Pareto principle that unequal distribution leads to the concentration of wealth, we used a Pareto distribution to reflect this inequality. For convenience, we sorted the agents by the size of their assets.
- Cost: Refers to the cost required to write a review, which is determined by the level of effort and asset of an agent. We will elaborate on this later on.
- Like: Refers to other agents’ assessment of a review. The level of effort and the agent’s review record determine a mean, and the actual number of likes is drawn randomly from a normal distribution around that mean.
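The asset setting above can be realized as in the sketch below (our illustration; the Pareto shape parameter a = 1.16, which roughly matches the 80/20 rule, is an assumption, not a value stated in the article):

```python
import numpy as np

n_agents = 100                   # default number of agents
rng = np.random.default_rng(42)  # fixed seed for reproducibility

# Heavy-tailed draws; shape a=1.16 is our assumption (approx. 80/20 rule)
raw = rng.pareto(1.16, size=n_agents) + 1
# Convert to asset ratios that sum to 1, then sort agents by asset size
assets = np.sort(raw / raw.sum())
```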
Agents compare the potential gain and the cost of writing a review to decide the level of effort they are willing to put in. The gain is the incentive for writing a review, and the reward is the gain minus the cost.
Gain Distribution Methodology
We used the aforementioned proportional, exponential, and uniform methods for gain distribution. Detailed definitions of each are as follows:
Proportional: The gain is distributed in direct proportion to the number of likes a review gets. Let’s say 10 agents received [1, 2, 3, 4, 5, 6, 7, 8, 9, 10] likes and the reward pool is 55. In the listed order, each agent would receive [1, 2, 3, 4, 5, 6, 7, 8, 9, 10] gain.
Exponential: Gain is distributed based on the squared ratio of likes. In other words, an agent who received relatively more likes than others can expect disproportionately more gain. We normalize the squared like counts so that the shares sum to 1.
If 10 agents received [1, 2, 3, 4, 5, 6, 7, 8, 9, 10] likes and the reward pool is 55, each agent, in the listed order, would receive [0.143, 0.571, 1.286, 2.286, 3.571, 5.143, 7.000, 9.143, 11.571, 14.286] gain (rounded to three decimal places). Notice that agents 8, 9, and 10 receive more than under the proportional system.
Uniform: Regardless of the number of likes, every agent who wrote a review receives an equal gain. Let’s say 5 out of 10 agents left a review and the reward pool is 55. The five non-writers would get 0 gain while the five writers would get 11 gain each. If agents 1~5 wrote reviews, the reward distribution would be [11, 11, 11, 11, 11, 0, 0, 0, 0, 0].
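The three distribution rules above can be sketched as follows (the function names are ours). Running them on the worked examples reproduces the numbers given in the text:

```python
def proportional_gain(likes, pool):
    """Each agent's share equals its fraction of all likes."""
    total = sum(likes)
    return [pool * n / total for n in likes]

def exponential_gain(likes, pool):
    """Square the like counts, then normalize the shares to sum to 1."""
    squared = [n ** 2 for n in likes]
    total = sum(squared)
    return [pool * s / total for s in squared]

def uniform_gain(wrote, pool):
    """Split the pool equally among the agents who wrote a review."""
    writers = sum(wrote)
    return [pool / writers if w else 0.0 for w in wrote]

likes = list(range(1, 11))                        # [1, 2, ..., 10]
print(proportional_gain(likes, 55))               # [1.0, 2.0, ..., 10.0]
print([round(g, 3) for g in exponential_gain(likes, 55)])
# [0.143, 0.571, 1.286, 2.286, 3.571, 5.143, 7.0, 9.143, 11.571, 14.286]
print(uniform_gain([True] * 5 + [False] * 5, 55)) # [11.0] * 5 + [0.0] * 5
```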
Each agent decides their behavior based on the ‘cost function’ that determines the cost of writing a review and the ‘like function’ that determines the number of likes the review would receive.
The cost of writing a review is determined by the following formula.
It is influenced by the asset (ratio) an agent already owns and the endeavor the agent invests in writing the review. The assumptions are as follows:
- The cost is 0 when an agent did not endeavor to write a review.
- Because agents who already hold considerable assets have small marginal utility, they are not driven to write a review for a (relatively small) incentive. Therefore, the cost was set higher for agents with more assets.
- With more endeavor invested, the cost of writing a review increases.
- Regardless of endeavor, the fixed cost of writing a review is captured by the constant term (b0).
- By including a multiplicative term of asset and endeavor, a non-linear pattern can be expressed. If the coefficient b3 is 0, the cost is a linear combination of asset and endeavor; increasing or decreasing b3 strengthens or weakens the non-linear trait.
- Endeavor takes integer values from 0 to 9 and increases linearly. We strengthened the non-linear trait by applying it exponentially, i.e., using exp(endeavor). A visual comparison of endeavor and exp(endeavor) is as follows:
Before determining the cost, note that an agent’s initial asset is not an absolute amount but a ratio allocated from a Pareto distribution. Therefore, the absolute value inevitably differs depending on the number of participating agents.
Naively, if there are 10 agents, each agent’s initial asset would be 0.1, but with 100 agents it would be 0.01. Therefore, it is appropriate to multiply the asset ratio by the total number of agents when calculating cost. In this simulation, the coefficients b1 and b3 reflect this logic.
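Putting these assumptions together, the cost function might look like the sketch below. This is a hedged reconstruction: the article’s exact formula appears as an image, so the functional form and the placeholder coefficient values (b0 = b1 = b2 = 1.0, b3 = 0.0) are our assumptions.

```python
import math

def cost(asset_ratio, endeavor, n_agents,
         b0=1.0, b1=1.0, b2=1.0, b3=0.0):
    """Hedged reconstruction of the cost function; coefficients are placeholders."""
    if endeavor == 0:
        return 0.0                            # no review written, no cost
    scaled_asset = asset_ratio * n_agents     # b1/b3 terms scale by agent count
    nonlinear = math.exp(endeavor)            # endeavor applied exponentially
    return (b0                                # fixed cost of writing at all
            + b1 * scaled_asset               # richer agents face higher cost
            + b2 * nonlinear                  # cost grows with endeavor
            + b3 * scaled_asset * nonlinear)  # optional interaction term
```

Setting b3 above 0 makes wealthy, hard-working agents pay disproportionately more, which is the non-linear trait the text describes.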
The number of likes a review receives is influenced by the level of endeavor (action) and the agent’s review history. Reviews with significant effort invested have higher quality and are therefore likely to receive more likes. Additionally, agents who have consistently written reviews have higher credibility.
To determine the number of likes, we take into account the level of endeavor itself rather than merely whether effort was made, which is why we used action rather than exp(endeavor). The like function formula is as follows:
The number of likes is drawn probabilistically from a normal distribution with mean (mu) and standard deviation (std). The mean is determined by the agent’s action and review history, and the standard deviation is set to 1.0.
Because action and review history have no direct correlation, they are combined through a linear combination.
- Because the number of likes cannot be a negative number, any value below 0 is replaced by 0.
- If action is 0, it means that an agent did not write a review, so the number of likes is also 0.
- However, since the exact number of likes cannot be anticipated, the number of likes is drawn probabilistically in accordance with the normal distribution.
- For the review history, we look back over past episodes according to the given window value: 1 means a review was written and 0 means it was not. For instance, if the window is 5 and we are on episode 20, we check whether the agent wrote reviews in episodes 15~19 and record, say, [0, 1, 1, 0, 0].
- If an agent has written reviews in the past, additional points are given. In other words, the number of reviews an agent has on record is incorporated. In the above example, 0+1+1+0+0=2 is used to determine the average.
- By adjusting the coefficients c1 and c2, we can determine whether action or review history matters more. This simulation uses c1 = 2.0 and c2 = 1.0 as defaults; we judged current action to be more important than review history.
Because the score can vary tremendously with the hyperparameter (the window size), we normalize it by dividing the score by the window size. In this simulation, the coefficient c2 reflects this logic.
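Combining the rules above, a hedged sketch of the like function follows (the exact formula appears as an image in the article, so the details here are our reconstruction):

```python
import numpy as np

rng = np.random.default_rng(0)

def likes(action, history, window=5, c1=2.0, c2=1.0, std=1.0):
    """Draw a like count for one review; details are our reconstruction."""
    if action == 0:
        return 0.0                          # no review, no likes
    recent = history[-window:]              # e.g. [0, 1, 1, 0, 0]
    score = sum(recent) / window            # normalize by the window size
    mu = c1 * action + c2 * score           # linear combination of the two
    return max(0.0, rng.normal(mu, std))    # clamp negatives to zero
```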
Unlike the cost and like functions, the gain function and the reward function are implicit. They either follow the chosen distribution methodology or are derived from the return values of other functions.
The gain an agent receives for writing a review is calculated from the number of likes. The amount is decided by applying the aforementioned proportional, exponential, or uniform method to the given reward pool.
Unlike the cost and like functions, the gain is determined not just by the likes an agent’s own review gets but also by the likes other agents received on theirs, so the gain function belongs to the environment. Simply put, the gain amounts are determined only after all agents complete their activity.
The reward is the gain minus the cost: reward = gain - cost.
Each agent learns through exploration so that it can obtain the maximum reward.
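The article leaves the learning algorithm to the next installment; as one common possibility (purely our illustration, not the authors’ stated method), an agent could use an epsilon-greedy rule over the ten effort levels, keeping a running value estimate for each:

```python
import random

def choose_effort(q_values, epsilon=0.1):
    """Epsilon-greedy: usually exploit the best-known effort, sometimes explore."""
    if random.random() < epsilon:
        return random.randrange(len(q_values))                  # explore
    return max(range(len(q_values)), key=q_values.__getitem__)  # exploit

def update(q_values, effort, reward, lr=0.1):
    """Move the value estimate of the chosen effort toward the observed reward."""
    q_values[effort] += lr * (reward - q_values[effort])

q = [0.0] * 10  # one value estimate per effort level 0-9
```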
In this article, we looked at the ‘reward system design’ problem, which we sought to analyze through simulation, and walked through the simulation settings, agent functions, and environment functions.
In our next article, we will look at the simulation results once the agents are set to learn.