Repast Data Sets

This tutorial explains how to create and edit data sets in Repast’s GUI.

Data sets can be understood as the middlemen of Repast data pipelines. They record information from each simulation run according to a user-specified set of instructions, which becomes the input for some other process. These downstream processes include:

  • Charts within the Repast GUI
  • Text sinks, i.e., printing to a console or writing to a tabular file
  • The rrepast R package

A data set is defined by a set of user-specified instructions. Whatever the instructions are, they are applied consistently across a single simulation run or a single series of queued runs. For example, if you are running a Repast model from the GUI, you cannot change data set instructions in the middle of its execution. If you are running a series of queued runs from the Repast batch run interface or the rrepast package, you cannot change data set instructions in the middle of that series. Data set instructions always feature the same three components:

  1. Type   (first dialog)
    1. Aggregate
    2. Non-Aggregate
  2. Simulation Variables   (second dialog)
    1. Standard Sources
    2. Method Sources
  3. Scheduling   (final dialog)
    1. One time collections
    2. Interval collection
    3. Priority

These categories will be explained as the tutorial proceeds. For now, just understand that a data set has a schedule of when to collect information. Each time it is scheduled to collect, a number of rows are added to the data set according to its type. The columnar content of those rows is determined by which simulation variables the data set is instructed to record.

Data sets are connected to a specific Repast model scenario. As an example, the Schelling model is used as in previous tutorials.

First Dialog


To add a new data set, right click on the Data Sets node of the Scenario Tree tab of the Control Panel, pictured above. A simple dialog will appear asking for a Data Set ID and type.

The data set type determines how one or more variables are recorded across many different objects simultaneously. Being able to choose between aggregate and non-aggregate data set types is a useful feature for the kind of agent-based models Repast is intended to support. An aggregate-type data set will record a single row/feature of data for each scheduled collection interval. It records the relevant state of the simulation’s many objects, then applies an aggregate function, such as sum, mean, etc., which is chosen by the user. A non-aggregate-type data set, on the other hand, creates a new row/feature for every observed object every scheduled collection interval.


Consider the above example. These two tables show data sets from the same simulation, which, like the Schelling model, tracks every agent’s age as the number of time-steps it has been “alive.” However, for the sake of explanation, there are only three agents. Both data sets are set to record their agents’ age every time-step. The aggregate data set creates a new row of data for each time-step, averaging the age of the model’s three agents for the “Current Age” value of each row. The non-aggregate data set instead creates three rows for each time-step, one for each of the three agents.
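To make the difference in row shape concrete, here is a minimal sketch in plain Java. It does not use the Repast API at all; the class name DataSetShapes, both method names, and the use of an agent index as an identifier column are my own assumptions for illustration.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration only; none of this is Repast API.
class DataSetShapes {

    // Aggregate type: one row per collection, holding the tick
    // and the mean age of all agents.
    static double[] aggregateRow(int tick, int[] ages) {
        double sum = 0;
        for (int age : ages) {
            sum += age;
        }
        return new double[] { tick, sum / ages.length };
    }

    // Non-aggregate type: one row per agent per collection,
    // holding the tick, the agent's index, and that agent's age.
    static List<int[]> nonAggregateRows(int tick, int[] ages) {
        List<int[]> rows = new ArrayList<>();
        for (int i = 0; i < ages.length; i++) {
            rows.add(new int[] { tick, i, ages[i] });
        }
        return rows;
    }
}
```

With three agents, aggregateRow yields a single row per time-step while nonAggregateRows yields three, which is the same difference the two tables illustrate.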

In general, aggregate data sets are easier to work with and can save you some effort, but non-aggregate ones facilitate more detailed inter-agent analysis. Here, I create the data set “DemoData” with an aggregate type and then click next.

Second Dialog


This next dialog screen will open on the Standard Sources tab. These are more basic data that can be collected without pinging the model’s objects. “Tick Count” is the time step. There’s no reason to uncheck it, and it should probably be checked if creating a non-aggregate data set. Random Seed and Run Number should be checked if the data set will be used for a series of queued runs. I’m just going to check all three because to be honest I have no idea how many times and for what reason I’m going to reference this post in the future. Always good to be prepared!


The method data sources tab is where you select what simulation variables to record in a data set. The second through fourth columns of this dialog determine what values are recorded to the data set, and the first column determines the name of the column containing these values. For example, the data set created from the instructions shown in the last two images would look like the following:


“Agent Type” is just the class name of the Java objects that you want to collect data from. The class doesn’t necessarily have to be an “agent”, but if you are doing agent-based modeling it usually will be. The “Method” is any method of that class that returns a value that can be added to the data set. Numbers, e.g., Doubles, Integers, etc., are obviously viable candidates, but I don’t have an exhaustive list of valid return value types at this point in time. The “Aggregate Operation” only applies to aggregate-type data sets and should be self-explanatory.
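As a rough picture of what a method source amounts to, the sketch below pairs a class with a value-returning method and applies a mean across all instances. This is plain Java and only my approximation of the idea, not how Repast implements it; DemoAgent, getAge, and MethodSourceSketch are hypothetical names.

```java
import java.util.List;
import java.util.function.ToDoubleFunction;

// Hypothetical agent class. Its name would play the role of the
// "Agent Type", and getAge() the role of the "Method".
class DemoAgent {
    private int age;

    // Called once per time-step in this sketch.
    void step() {
        age++;
    }

    // A public method returning a number is the kind of method
    // the tutorial describes as a viable data source.
    public int getAge() {
        return age;
    }
}

// Hypothetical helper standing in for a "mean" aggregate operation.
class MethodSourceSketch {
    static <T> double mean(List<T> agents, ToDoubleFunction<T> method) {
        // Call the chosen method on every agent, then average the results.
        return agents.stream().mapToDouble(method).average().orElse(Double.NaN);
    }
}
```

Calling MethodSourceSketch.mean(agents, DemoAgent::getAge) gives the single value an aggregate data set would store for that collection under these assumptions.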

I have never used the Custom Data Sources tab. Given the versatility of the discussed functionality, I would imagine it’s for much more advanced cases. I might cover it in the future.

Click next.

Third Dialog


The options on the Schedule Parameters dialog determine when information is recorded to the data set. The entries here result in data being collected on time-steps 1, 101, 201, etc., as represented by the table above. All of these settings besides “Priority” determine during what time-steps information is recorded. The Priority value itself determines when within those specified time-steps, relative to your model’s scheduled actions, said recording takes place.
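The collection pattern of 1, 101, 201, etc. can be expressed as a start tick plus a repeating interval. The sketch below assumes a start of 1 and an interval of 100, which I am inferring from that pattern rather than reading off the dialog; CollectionSchedule and collectsAt are hypothetical names, not Repast API, and Priority is left out.

```java
// Hypothetical illustration only; this is not Repast's scheduler.
class CollectionSchedule {

    // True when a data set with the given start tick and interval
    // would collect at this tick. The interval must be positive.
    static boolean collectsAt(long tick, long start, long interval) {
        if (interval <= 0) {
            throw new IllegalArgumentException("interval must be positive");
        }
        return tick >= start && (tick - start) % interval == 0;
    }
}
```

With start = 1 and interval = 100, collectsAt is true for ticks 1, 101, and 201 and false for every tick in between.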

That’s it! Click finish and your data set will be done.
