University of Nairobi Programming and Simulation in Python Questions & Code
Assignment (Total of 20 points)
• Single spaced, 12-point or larger font size; 1-inch margins
• Use headers and/or bullets to organize and convey key elements, and include page numbers
• Only Latin alphabet characters are allowed (i.e., do not include any words or phrases that contain non-English characters)
• Comment all the code provided and answer the questions in the document as needed
• All the provided code should be in one .py file
• There is no minimum or maximum number of words for each question, but your answers need to be comprehensive and address the question.
Hands-On Simulation Modeling with Python
Develop simulation models to get accurate results and enhance decision-making processes
Giuseppe Ciaburro
BIRMINGHAM—MUMBAI
Hands-On Simulation Modeling with Python
Copyright © 2020 Packt Publishing
All rights reserved. No part of this book may be reproduced, stored in a retrieval system,
or transmitted in any form or by any means, without the prior written permission of the
publisher, except in the case of brief quotations embedded in critical articles or reviews.
Every effort has been made in the preparation of this book to ensure the accuracy of the
information presented. However, the information contained in this book is sold without
warranty, either express or implied. Neither the author, nor Packt Publishing or its dealers
and distributors, will be held liable for any damages caused or alleged to have been caused
directly or indirectly by this book.
Packt Publishing has endeavored to provide trademark information about all of the
companies and products mentioned in this book by the appropriate use of capitals.
However, Packt Publishing cannot guarantee the accuracy of this information.
Commissioning Editor: Sunith Shetty
Acquisition Editor: Devika Battike
Senior Editor: David Sugarman
Content Development Editor: Joseph Sunil
Technical Editor: Sonam Pandey
Copy Editor: Safis Editing
Project Coordinator: Aishwarya Mohan
Proofreader: Safis Editing
Indexer: Manju Arasan
Production Designer: Joshua Misquitta
First published: July 2020
Production reference: 1160720
Published by Packt Publishing Ltd.
Livery Place
35 Livery Street
Birmingham
B3 2PB, UK.
ISBN 978-1-83898-509-7
www.packt.com
Packt.com
Subscribe to our online digital library for full access to over 7,000 books and videos, as
well as industry leading tools to help you plan your personal development and advance
your career. For more information, please visit our website.
Why subscribe?
• Spend less time learning and more time coding with practical eBooks and Videos
from over 4,000 industry professionals
• Improve your learning with Skill Plans built especially for you
• Get a free eBook or video every month
• Fully searchable for easy access to vital information
• Copy and paste, print, and bookmark content
Did you know that Packt offers eBook versions of every book published, with PDF and
ePub files available? You can upgrade to the eBook version at packt.com and as a print
book customer, you are entitled to a discount on the eBook copy. Get in touch with us at
customercare@packtpub.com for more details.
At www.packt.com, you can also read a collection of free technical articles, sign up for
a range of free newsletters, and receive exclusive discounts and offers on Packt books and
eBooks.
Contributors
About the author
Giuseppe Ciaburro holds a PhD in environmental technical physics, along with two
master’s degrees. His research was focused on machine learning applications in the study
of urban sound environments. He works at the Built Environment Control Laboratory
at the Università degli Studi della Campania Luigi Vanvitelli, Italy. He has over 18 years’
professional experience in programming (Python, R, and MATLAB), first in the field
of combustion, and then in acoustics and noise control. He has several publications to
his credit.
About the reviewers
Greg Walters has been involved with computers and computer programming since
1972. He is well versed in Visual Basic, Visual Basic.NET, Python, and SQL, and is an
accomplished user of MySQL, SQLite, Microsoft SQL Server, Oracle, C++, Delphi,
Modula-2, Pascal, C, 80×86 Assembler, COBOL, and Fortran. He is a programming
trainer and has trained numerous individuals in many pieces of computer software,
including MySQL, Open Database Connectivity, Quattro Pro, Corel Draw!, Paradox,
Microsoft Word, Excel, DOS, Windows 3.11, Windows for Workgroups, Windows 95,
Windows NT, Windows 2000, Windows XP, and Linux. He is currently retired and, in his
spare time, is a musician and loves to cook. He is also open to working as a freelancer on
various projects.
Yoon Hyup Hwang is a seasoned data scientist in the marketing and finance industries
and is the author of two applied machine learning books. He has almost a decade of
experience building numerous machine learning models and data science products.
He holds an M.S.E. in Computer and Information Technology from the University of
Pennsylvania and a B.A. in Economics from the University of Chicago. He enjoys training
various martial arts, snowboarding, and roasting coffee. He currently lives in the Greater
New York Area with his artist wife, Sunyoung, and a playful dog, Dali (named after
Salvador Dali).
Packt is searching for authors like you
If you’re interested in becoming an author for Packt, please visit authors.
packtpub.com and apply today. We have worked with thousands of developers and
tech professionals, just like you, to help them share their insight with the global tech
community. You can make a general application, apply for a specific hot topic that we are
recruiting an author for, or submit your own idea.
Table of Contents

Preface

Section 1: Getting Started with Numerical Simulation

Chapter 1: Introducing Simulation Models
  Introducing simulation models
    Decision-making workflow
    Comparing modeling and simulation
    Pros and cons of simulation modeling
    Simulation modeling terminology
  Classifying simulation models
    Comparing static and dynamic models
    Comparing deterministic and stochastic models
    Comparing continuous and discrete models
  Approaching a simulation-based problem
    Problem analysis
    Data collection
    Setting up the simulation model
    Simulation software selection
    Verification of the software solution
    Validation of the simulation model
    Simulation and analysis of results
  Dynamical systems modeling
    Managing workshop machinery
    Simple harmonic oscillator
    Predator-prey model
  Summary

Chapter 2: Understanding Randomness and Random Numbers
  Technical requirements
  Stochastic processes
    Types of stochastic process
    Examples of stochastic processes
    The Bernoulli process
    Random walk
    The Poisson process
  Random number simulation
    Probability distribution
    Properties of random numbers
  The pseudorandom number generator
    The pros and cons of a random number generator
    Random number generation algorithms
    Linear congruential generator
    Random numbers with uniform distribution
    Lagged Fibonacci generator
  Testing uniform distribution
    The chi-squared test
    Uniformity test
  Exploring generic methods for random distributions
    The inverse transform sampling method
    The acceptance-rejection method
  Random number generation using Python
    Introducing the random module
    The random.random() function
    The random.seed() function
    The random.uniform() function
    The random.randint() function
    The random.choice() function
    The random.sample() function
    Generating real-valued distributions
  Summary

Chapter 3: Probability and Data Generation Processes
  Technical requirements
  Explaining probability concepts
    Types of events
    Calculating probability
    Probability definition with an example
  Understanding Bayes’ theorem
    Compound probability
    Bayes’ theorem
  Exploring probability distributions
    Probability density function
    Mean and variance
    Uniform distribution
    Binomial distribution
    Normal distribution
  Summary

Section 2: Simulation Modeling Algorithms and Techniques

Chapter 4: Exploring Monte Carlo Simulations
  Technical requirements
  Introducing Monte Carlo simulation
    Monte Carlo components
    First Monte Carlo application
    Monte Carlo applications
    Applying the Monte Carlo method for Pi estimation
  Understanding the central limit theorem
    Law of large numbers
    Central limit theorem
  Applying Monte Carlo simulation
    Generating probability distributions
    Numerical optimization
    Project management
  Performing numerical integration using Monte Carlo
    Defining the problem
    Numerical solution
    Min-max detection
    Monte Carlo method
    Visual representation
  Summary

Chapter 5: Simulation-Based Markov Decision Processes
  Technical requirements
  Overview of Markov processes
    The agent-environment interface
    Exploring MDPs
    Understanding the discounted cumulative reward
    Comparing exploration and exploitation concepts
  Introducing Markov chains
    Transition matrix
    Transition diagram
    Markov chain applications
  Introducing random walks
    Simulating a one-dimensional random walk
    Simulating a weather forecast
  The Bellman equation explained
    Dynamic programming concepts
    Principle of optimality
    The Bellman equation
  Multi-agent simulation
  Summary

Chapter 6: Resampling Methods
  Technical requirements
  Introducing resampling methods
    Sampling concepts overview
    Reasoning about sampling
    Pros and cons of sampling
    Probability sampling
    How sampling works
  Exploring the Jackknife technique
    Defining the Jackknife method
    Estimating the coefficient of variation
    Applying Jackknife resampling using Python
  Demystifying bootstrapping
    Introducing bootstrapping
    Bootstrap definition problem
    Bootstrap resampling using Python
    Comparing Jackknife and bootstrap
  Explaining permutation tests
  Approaching cross-validation techniques
    The validation set approach
    Leave-one-out cross validation
    K-fold cross validation
    Cross-validation using Python
  Summary

Chapter 7: Using Simulation to Improve and Optimize Systems
  Technical requirements
  Introducing numerical optimization techniques
    Defining an optimization problem
    Explaining local optimality
    Defining the descent methods
    Approaching the gradient descent algorithm
    Understanding the learning rate
    Explaining the trial and error method
    Implementing gradient descent in Python
  Facing the Newton-Raphson method
    Using the Newton-Raphson algorithm for root-finding
    Approaching Newton-Raphson for numerical optimization
    Applying the Newton-Raphson technique
  Deepening our knowledge of stochastic gradient descent
  Discovering the multivariate optimization methods in Python
    The Nelder–Mead method
    Powell’s conjugate direction algorithm
    Summarizing other optimization methodologies
  Summary

Section 3: Real-World Applications

Chapter 8: Using Simulation Models for Financial Engineering
  Technical requirements
  Understanding the geometric Brownian motion model
    Defining a standard Brownian motion
    Addressing the Wiener process as random walk
    Implementing a standard Brownian motion
  Using Monte Carlo methods for stock price prediction
    Exploring the Amazon stock price trend
    Handling the stock price trend as time series
    Introducing the Black-Scholes model
    Applying Monte Carlo simulation
  Studying risk models for portfolio management
    Using variance as a risk measure
    Introducing the value-at-risk metric
    Estimating the VaR for some NASDAQ assets
  Summary

Chapter 9: Simulating Physical Phenomena Using Neural Networks
  Technical requirements
  Introducing the basics of neural networks
    Understanding biological neural networks
    Exploring ANNs
  Understanding feedforward neural networks
    Exploring neural network training
  Simulating airfoil self-noise using ANNs
    Importing data using pandas
    Scaling the data using sklearn
    Viewing the data using matplotlib
    Splitting the data
    Explaining multiple linear regression
    Understanding a multilayer perceptron regressor model
  Exploring deep neural networks
    Getting familiar with convolutional neural networks
    Examining recurrent neural networks
    Analyzing LSTM networks
  Summary

Chapter 10: Modeling and Simulation for Project Management
  Technical requirements
  Introducing project management
    Understanding what-if analysis
  Managing a tiny forest problem
    Summarizing the Markov decision process
    Exploring the optimization process
    Introducing MDPtoolbox
    Defining the tiny forest management example
    Addressing management problems using MDPtoolbox
    Changing the probability of fire
  Scheduling project time using Monte Carlo simulation
    Defining the scheduling grid
    Estimating the task’s time
    Developing an algorithm for project scheduling
    Exploring triangular distribution
  Summary

Chapter 11: What’s Next?
  Summarizing simulation modeling concepts
    Generating random numbers
    Applying Monte Carlo methods
    Addressing the Markov decision process
    Analyzing resampling methods
    Exploring numerical optimization techniques
    Using artificial neural networks for simulation
  Applying simulation model to real life
    Modeling in healthcare
    Modeling in financial applications
    Modeling physical phenomenon
    Modeling public transportation
    Modeling human behavior
  Next steps for simulation modeling
    Increasing the computational power
    Machine learning-based models
    Automated generation of simulation models
  Summary

Other Books You May Enjoy
  Leave a review – let other readers know what you think
Preface
Simulation modeling helps you to create digital prototypes of physical models to analyze
how they work and predict their performance in the real world. With this comprehensive
guide, you’ll learn about various computational statistical simulations using Python.
Starting with the fundamentals of simulation modeling, you’ll learn about concepts
such as randomness and explore data generating processes, resampling methods,
and bootstrapping techniques. You’ll then cover key algorithms such as Monte Carlo
simulations and the Markov Decision Process, which are used to develop numerical
simulation models, and discover how they can be used to solve real-world problems.
As you make progress, you’ll develop simulation models to help you get accurate results
and enhance decision-making processes. Using optimization techniques, you’ll learn to
modify the performance of a model to improve results and make optimal use of resources.
The book will guide you through creating a digital prototype using practical use cases
for financial engineering, prototyping project management to improve planning, and
simulating physical phenomena using neural networks.
By the end of this book, you’ll be able to construct and deploy simulation models of your
own to solve real-world challenges.
Who this book is for
Hands-On Simulation Modeling with Python is for simulation developers and engineers,
model designers, and anyone already familiar with the basic computational methods
that are used to study the behavior of systems. This book will help you explore advanced
simulation techniques such as Monte Carlo methods, statistical simulations, and
much more using Python. Working knowledge of the Python programming language
is required.
What this book covers
Chapter 1, Introducing Simulation Models, analyzes the basics of numerical simulation, highlights the difference between modeling and simulation, and discusses the strengths and weaknesses of simulation models. The different types of models are analyzed, and we study practical modeling cases to understand how to elaborate a model starting from the initial considerations.
Chapter 2, Understanding Randomness and Random Numbers, defines stochastic processes
and explains the importance of using them to address numerous real-world problems.
The main methods for generating random numbers with practical examples in Python
code, and the generation of uniform and generic distributions, are both explored. It also
explains how to perform a uniformity test using the chi-square method.
Chapter 3, Probability and the Data Generating Process, shows how to distinguish between
the different definitions of probabilities and how they can be integrated to obtain useful
information in the simulation of real phenomena.
Chapter 4, Monte Carlo Simulations, explores techniques based on Monte Carlo methods
for process simulation. We will first learn the basic concepts, and then we will see how to
apply them to practical cases.
Chapter 5, Simulation-Based Markov Decision Process, shows how to deal with decision-making processes with Markov chains. We will analyze the concepts underlying
Markovian processes and then analyze some practical applications to learn how to choose
the right actions for the transition between different states of the system.
Chapter 6, Resampling Methods, shows how to apply resampling methods to approximate
some characteristics of the distribution of a sample in order to validate a statistical model.
We will analyze the basics of the most common resampling methods and learn how to use
them by solving some practical problems.
Chapter 7, Use of Simulation to Improve and Optimize Systems, shows how to use the
main optimization techniques to improve the performance of our simulation models.
We will see how to use the gradient descent technique, the Newton-Raphson method,
and stochastic gradient descent. We will also see how to apply these techniques with
practical examples.
Chapter 8, Simulation Models for Financial Engineering, shows practical cases of using
simulation methods in a financial context. We will learn how to use Monte Carlo methods
to predict stock prices and how to assess the risk associated with a portfolio of shares.
Chapter 9, Simulating Physical Phenomena with Neural Networks, shows how to develop
models based on artificial neural networks to simulate physical phenomena. We will start
by exploring the basic concepts of neural networks, and we will examine their architecture
and its main elements. We will see how to train a network to update its weights.
Chapter 10, Modeling and Simulation for Project Management, deals with practical cases
of project management using the tools we learned how to use in the previous chapters.
We will see how to evaluate in advance the results of the actions undertaken in the
management of a forest using Markov processes, and then move on to evaluating the time
required for the execution of a project using the Monte Carlo simulation.
Chapter 11, What’s Next?, provides a better understanding of the problems associated with
building and deploying simulation models and additional resources and technologies to
learn how to hone your machine learning skills.
To get the most out of this book
Working knowledge of the Python programming language is required.
If you are using the digital version of this book, we advise you to type the code yourself
or access the code via the GitHub repository (link available in the next section). Doing
so will help you avoid any potential errors related to the copying and pasting of code.
Download the example code files
You can download the example code files for this book from your account at www.packt.com. If you purchased this book elsewhere, you can visit www.packtpub.com/support and register to have the files emailed directly to you.
You can download the code files by following these steps:
1. Log in or register at www.packt.com.
2. Select the Support tab.
3. Click on Code Downloads.
4. Enter the name of the book in the Search box and follow the onscreen instructions.
Once the file is downloaded, please make sure that you unzip or extract the folder using
the latest version of:
• WinRAR/7-Zip for Windows
• Zipeg/iZip/UnRarX for Mac
• 7-Zip/PeaZip for Linux
The code bundle for the book is also hosted on GitHub at https://github.com/
PacktPublishing/Hands-On-Simulation-Modeling-with-Python. In case
there’s an update to the code, it will be updated on the existing GitHub repository.
We also have other code bundles from our rich catalog of books and videos available at
https://github.com/PacktPublishing/. Check them out!
Download the color images
We also provide a PDF file that has color images of the screenshots/diagrams used in this
book. You can download it here: http://www.packtpub.com/sites/default/
files/downloads/9781838985097_ColorImages.pdf.
Conventions used
There are a number of text conventions used throughout this book.
Code in text: Indicates code words in text, database table names, folder names,
filenames, file extensions, pathnames, dummy URLs, user input, and Twitter handles.
Here is an example: “Mount the downloaded WebStorm-10*.dmg disk image file as
another disk in your system.”
A block of code is set as follows:
import random
import statistics
import matplotlib.pyplot as plt
When we wish to draw your attention to a particular part of a code block, the relevant
lines or items are set in bold:
import random
import statistics
import matplotlib.pyplot as plt
Any command-line input or output is written as follows:
$ python jakknife_estimator.py
Bold: Indicates a new term, an important word, or words that you see onscreen. For
example, words in menus or dialog boxes appear in the text like this. Here is an example:
“Select System info from the Administration panel.”
Tips or important notes
Appear like this.
Get in touch
Feedback from our readers is always welcome.
General feedback: If you have questions about any aspect of this book, mention the book
title in the subject of your message and email us at customercare@packtpub.com.
Errata: Although we have taken every care to ensure the accuracy of our content, mistakes
do happen. If you have found a mistake in this book, we would be grateful if you would
report this to us. Please visit www.packtpub.com/support/errata, selecting your
book, clicking on the Errata Submission Form link, and entering the details.
Piracy: If you come across any illegal copies of our works in any form on the Internet,
we would be grateful if you would provide us with the location address or website name.
Please contact us at copyright@packt.com with a link to the material.
If you are interested in becoming an author: If there is a topic that you have expertise in
and you are interested in either writing or contributing to a book, please visit authors.
packtpub.com.
Reviews
Please leave a review. Once you have read and used this book, why not leave a review on
the site that you purchased it from? Potential readers can then see and use your unbiased
opinion to make purchase decisions, we at Packt can understand what you think about
our products, and our authors can see your feedback on their book. Thank you!
For more information about Packt, please visit packt.com.
Section 1: Getting Started with Numerical Simulation
In this section, the basic concepts of simulation modeling are addressed. This section
helps you to understand the fundamental concepts and elements of numerical simulation.
This section contains the following chapters:
Chapter 1, Introducing Simulation Models
Chapter 2, Understanding Randomness and Random Numbers
Chapter 3, Probability and Data Generating Processes
Chapter 1: Introducing Simulation Models
A simulation model is a tool capable of processing information and data and predicting
the responses of a real system to certain inputs, thus becoming an effective support for
analysis, performance evaluation, and decision-making processes. The term simulation
refers to reproducing the behavior of a system. In general, we speak of simulation both in
the case in which a concrete model is used and in the case in which an abstract model is
used that reproduces reality using a computer. An example of a concrete model is a scale
model of an airplane that is then placed in a wind tunnel to carry out simulated tests to
estimate suitable performance measures.
Although, over the years, physicists have developed theoretical laws that we can use to
obtain information on the performance of dynamic systems, often, the application of these
laws to a real case takes too long. In these cases, it is convenient to construct a numerical
simulation model that allows us to simulate the behavior of the system under certain
conditions. This elaborated model will allow us to test the functionality of the system in a
simple and immediate way, saving considerable resources in terms of time and money.
In this chapter, we’re going to cover the following main topics:
• Introducing simulation models
• Classifying simulation models
• Approaching a simulation-based problem
• Dynamical systems modeling
Important Note
In this chapter, an introduction to simulation techniques will be discussed.
In order to deal with the topics at hand, it is necessary that you have a basic
knowledge of algebra and mathematical modeling.
Introducing simulation models
Simulation uses abstract models built to replicate the characteristics of a system. The
operation of a system is simulated using probability distributions to randomly generate
system events, and statistical observations are obtained from the simulated system. It plays
a very important role, especially in the design of a stochastic system and in the definition
of its operating procedures.
By not working directly on the real system, many scenarios can be simulated simply by
changing the input parameters, thus limiting the costs that would occur if this solution
were not used and, ultimately, reducing the time it would take. In this way, it is possible to
quickly try alternative policies and design choices and model systems of great complexity
by studying their behavior and evolution over time.
Important Note
Simulation is used when working on the real system is not convenient due to high costs, technical impossibility, or the non-existence of a real system.
Simulation allows you to predict what happens to the real system if certain
inputs are used. Changing these input parameters simulates different scenarios
that allow us to identify the most convenient one from various points of view.
Decision-making workflow
In a decision-making process, the starting point is identifying the problematic context
that requires a change and therefore a decision. The context that’s identified is then
analyzed in order to highlight what needs to be studied for the decisions that need to be
made; that is, those elements that seem the most relevant are chosen, the relationships
that connect them are highlighted, and the objectives to be achieved are defined. At this
point, a formal model is constructed, which allows us to simulate the identified system in
order to understand its behavior and to arrive at identifying the decisions to be made. The
following diagram describes the workflow that allows us to make a decision, starting from
observing the problematic context:
Figure 1.1 – Decision-making workflow
This represents a way of spreading knowledge and involves various actors. Constructing a
model is a two-way process:
• Definition of conceptual models
• Continuous interaction between the model and reality by comparison
In addition, learning also has a participatory characteristic: it proceeds through the
involvement of different actors. The models also allow you to analyze and propose
organized actions so that you can modify the current situation and produce the
desired solution.
Comparing modeling and simulation
To start, we will clarify the differences between modeling and simulation. A model is
a representation of a physical system, while simulation is the process of seeing how a
model-based system would work under certain conditions.
Modeling is a design methodology that is based on producing a model that implements
a system and represents its functionality. In this way, it is possible to predict the behavior
of a system and the effects of the variations or modifications that are made on it. Even if
the model is a simplified representation of the system, it must still be close enough to the
functional nature of the real system, but without becoming too complex and difficult
to handle.
Important Note
Simulation is the process that puts the model into operation and allows you
to evaluate its behavior under certain conditions. Simulation is a fundamental
tool for modeling because, without necessarily resorting to physical
prototyping, the developer can verify the functionality of the modeled system
with the project specifications.
Simulation allows us to study the system through a wide spectrum of conditions so that
we can understand how representative the model is of the system that it refers to.
Pros and cons of simulation modeling
Simulation is a tool that’s widely used in a variety of fields, from operational research
to the application industry. The technique succeeds when it overcomes the difficulties that every complex procedure contains. The following are the pros and cons of
simulation modeling. Let’s start with the concrete advantages that can be obtained from
the use of simulation models (pros):
• It reproduces the behavior of a system in reference to situations that cannot be
directly experienced.
• It represents real systems, even complex ones, while also considering the sources
of uncertainty.
• It requires limited resources in terms of data.
• It allows experimentation in limited time frames.
• The models that are obtained are easily demonstrable.
As anticipated, since it is a technique capable of reproducing complex scenarios, it has
some limitations (cons):
• The simulation provides indications of the behavior of the system, but not
exact results.
• The analysis of the output of a simulation could be complex and it could be difficult
to identify which may be the best configuration.
• The implementation of a simulation model could be laborious and, moreover, it may
take long calculation times to carry out a significant simulation.
• The results that are returned by the simulation depend on the quality of the input
data: it cannot provide accurate results in the case of inaccurate input data.
• The complexity of the simulation model depends on the complexity of the system it
intends to reproduce.
Nevertheless, simulation models represent the best solution for the analysis of
complex scenarios.
Simulation modeling terminology
In this section, we will analyze the elements that make up a model and those that
characterize a simulation process. We will give a brief description of each so that you
understand their meaning and the role they play in the numerical simulation process.
System
The context of an investigation is represented through a system, that is, the set of elements that interact with each other. The main problem linked to this element concerns the definition of the system boundaries: which elements of reality must be included in the system that represents it, which are left out, and what relationships exist between them.
State variables
A system is described in each instant of time by a set of variables. These are called
state variables. For example, in the case of a weather system, the temperature is a state
variable. In discrete systems, the variables change instantaneously at a finite number of precise moments in time. In continuous systems, the variables vary continuously with respect to time.
Events
An event is defined as any instantaneous occurrence that causes the value of at least one of the state variables to change. The arrival of a blizzard in a weather system is an event,
as it causes the temperature to drop suddenly. There are both external events and
internal events.
Parameters
Parameters represent essential terms when building a model. They are adjusted during
the model simulation process to ensure that the results are brought into the necessary
convergence margins. They can be modified iteratively through sensitivity analysis or in
the model calibration phase.
Calibration
Calibration represents the process by which the parameters of the model are adjusted in
order to adapt the results to the data observed in the best possible way. When calibrating
the model, we try to obtain the best possible accuracy. A good calibration requires
eliminating, or minimizing, errors in data collection and choosing a theoretical model
that is the best possible description of reality. The choice of model parameters is decisive
and must be done in such a way as to minimize the deviation of its results when applied
to historical data.
Accuracy
Accuracy is the degree of correspondence of the simulation result that can be inferred
from a series of calculated values with the actual data, that is, the difference between
the average modeled value and the true or reference value. Accuracy, when calculated,
provides a quantitative estimate of the quality expected from a forecast. Several indicators
are available to measure accuracy. The most used are mean absolute error (MAE), mean
absolute percentage error (MAPE), and mean squared error (MSE).
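As a quick illustration of these indicators, the following sketch computes MAE, MAPE, and MSE with NumPy; the observed and modeled values are small hypothetical arrays, used only to show the formulas in code:

import numpy as np

# Hypothetical observed (true) values and corresponding modeled (simulated) values
observed = np.array([10.0, 12.5, 9.8, 11.2, 10.7])
modeled = np.array([9.6, 12.9, 10.1, 10.8, 11.0])

errors = modeled - observed

mae = np.mean(np.abs(errors))                    # mean absolute error
mape = np.mean(np.abs(errors / observed)) * 100  # mean absolute percentage error (%)
mse = np.mean(errors ** 2)                       # mean squared error

print(f"MAE = {mae:.3f}, MAPE = {mape:.2f}%, MSE = {mse:.3f}")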
Sensitivity
The sensitivity of a model indicates the degree to which the model’s outputs are affected
by changes in the selected input parameters. A sensitivity analysis identifies the sensitive
parameters for the output of the model. It allows us to determine which parameters
require further investigation so that we have a more realistic evaluation of the model’s
output values. Furthermore, it allows us to identify which parameters are not significant
for the generation of a certain output and therefore can possibly be eliminated from
the model. Finally, it tells us which parameters should be considered in a possible and
subsequent analysis of the uncertainty of the output values provided by the model.
Validation
This is the process that verifies the accuracy of the proposed model. The model must be
validated to be used as a tool to support decisions. It aims to verify whether the model
that’s being analyzed corresponds conceptually to our intentions. The validation of a
model is based on the various techniques of multivariate analysis, which, from time to
time, study the variability and interdependence of attributes within a class of objects.
Classifying simulation models
Simulation models can be classified according to different criteria. The first distinction is
between static and dynamic systems. So, let’s see what differentiates them.
Comparing static and dynamic models
Static models are the representation of a system in an instant of time, or representative
models of a system in which the time variable plays no role. An example of a static
simulation is a Monte Carlo model.
Dynamic models, on the other hand, describe the evolution of the system over time. In
the simplest case, the state of the system at time t is described by a function x (t). For
example, in population dynamics, x (t) represents the population present at time t. The
equation that regulates the system is dynamic: it describes the instantaneous variation of
the population or the variation in fixed time intervals.
Comparing deterministic and stochastic models
A model is deterministic when its evolution, over time, is uniquely determined by its
initial conditions and characteristics. These models do not consider random elements
and lend themselves to be solved with exact methods that are derived from mathematical
analysis. In deterministic models, the output is well determined once the input data and
the relationships that make up the model have been specified, despite the time required
for data processing being particularly long. For these systems, the transformation rules
univocally determine the change of state of the system. Examples of deterministic systems
can be observed in some production and automation systems.
Stochastic models, on the other hand, evolve by inserting random elements into the evolution, obtained by drawing them from statistical distributions. Among the operational characteristics of these models, there is not just a single fixed relationship but rather probability density functions, which means there is no one-to-one correspondence between the data and the system history.
A final distinction is based on how the system evolves over time: this is why we
distinguish between continuous and discrete simulation models.
Comparing continuous and discrete models
Continuous models represent systems in which the state of the variables changes
continuously as a function of time. For example, a car moving on a road represents a
continuous system since the variables that identify it, such as position and speed, can
change continuously with respect to time.
In discrete models, the system is described by an overlapping sequence of physical operations, interspersed with inactivity pauses. These operations begin and end in well-defined instants (events). The system undergoes a change of state when each event
occurs, remaining in the same state throughout the interval between the two subsequent
events. This type of operation is easy to treat with the simulation approach.
Important Note
The stochastic or deterministic, or continuous or discrete, nature of a model
is not its absolute property and depends on the observer’s vision of the system
itself. This is determined by the objectives and the method of study, as well as
by the experience of the observer.
Now that we’ve analyzed the different types of models in detail, we will learn how to
develop a numerical simulation model.
Approaching a simulation-based problem
To tackle a numerical simulation process that returns accurate results, it is crucial to
rigorously follow a series of procedures that partly precede and partly follow the actual
modeling of the system. We can separate the simulation process workflow into the
following individual steps:
1. Problem analysis
2. Data collection
3. Setting up the simulation model
4. Simulation software selection
5. Verification of the software solution
6. Validation of the simulation model
7. Simulation and analysis of results
To fully understand the whole simulation process, it is essential to analyze the various
phases that characterize a study based on simulation in depth.
Problem analysis
In this initial step, the goal is to understand the problem by trying to identify the aims
of the study and the essential components, as well as the performance measures that
interest them. Simulation is not simply an optimization technique and therefore there
is no parameter that needs to be maximized or minimized. However, there is a series of
performance indices whose dependence on the input variables must be verified. If an
operational version of the system is already available, the work is simplified as it is enough
to observe this system to deduce its fundamental characteristics.
Data collection
This represents a crucial step in the whole process since the quality of the simulation
model depends on the quality of the input data. This step is closely related to the
previous one. In fact, once the objective of the study has been identified, data is collected
and subsequently processed. Processing the collected data is necessary to transform
it into a format that can be used by the model. The origin of the data can be different:
sometimes, the data is retrieved from company databases, but more often than not, direct
measurements in the field must be made through a series of sensors that, in recent years,
have become increasingly smart. These operations weigh down the entire study process,
thus lengthening their execution times.
Setting up the simulation model
This is the crucial step of the whole simulation process; therefore, it is necessary to pay
close attention to it. To set up a simulation model, it is necessary to know the probability
distributions of the variables of interest. In fact, to generate various representative
scenarios of how a system works, it is essential that a simulation generates random
observations from these distributions.
For example, when managing stocks, the distribution of the product being requested and
the distribution of time between an order and the receipt of the goods is necessary. On
the other hand, when managing production systems with machines that can occasionally
fail, it will be necessary to know the distribution of time until a machine fails and the
distribution of repair times.
If the system is not already available, it is only possible to estimate these distributions by
deriving them, for example, from the observation of similar, already existing systems.
If, from the analysis of the data, it is seen that this form of distribution approximates a
standard type distribution, the standard theoretical distribution can be used by carrying
out a statistical test to verify whether the data can be well represented by that probability
distribution. If there are no similar systems from which observable data can be obtained,
other sources of information must be used: machine specifications, instruction manuals
for the machines, experimental studies, and so on.
As we’ve already mentioned, constructing a simulation model is a complex procedure.
Referring to simulating discrete events, constructing a model involves the following steps:
1. Defining the state variables
2. Identifying the values that can be taken by the state variables
3. Identifying the possible events that change the state of the system
4. Realizing a simulated time measurement, that is, a simulation clock, that records the
flow of simulated time
5. Implementing a method for randomly generating events
6. Identifying the state transitions generated by events
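To make these steps concrete, here is a minimal sketch of a discrete-event skeleton (not taken from the book's code bundle; the event types and the arrival and service rates are assumed purely for illustration). It uses a simulation clock, an ordered event list, random event generation, and state transitions:

import heapq
import random

random.seed(1)

clock = 0.0                                   # simulation clock (step 4)
state = {'busy': False, 'queue': 0}           # state variables and their values (steps 1-2)
event_list = []                               # future events, ordered by time (step 3)

def schedule(time, kind):
    # step 5: randomly generated events are placed on the event list
    heapq.heappush(event_list, (time, kind))

schedule(random.expovariate(1.0), 'arrival')  # seed the simulation with the first arrival

while event_list and clock < 10.0:
    clock, kind = heapq.heappop(event_list)   # advance the clock to the next event
    if kind == 'arrival':                     # step 6: state transitions generated by events
        state['queue'] += 1
        schedule(clock + random.expovariate(1.0), 'arrival')
        if not state['busy']:
            state['busy'] = True
            state['queue'] -= 1
            schedule(clock + random.expovariate(1.5), 'departure')
    elif kind == 'departure':
        if state['queue'] > 0:
            state['queue'] -= 1
            schedule(clock + random.expovariate(1.5), 'departure')
        else:
            state['busy'] = False
    print(f"t={clock:.2f}  event={kind}  state={state}")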
After following these steps, we will have the simulation model ready for use. At this
point, it will be necessary to implement this model in a dedicated software platform;
let’s see how.
Simulation software selection
The choice of the software platform that you will perform the numerical simulation with
is fundamental for the success of the project. In this regard, we have several solutions that
we can adopt. This choice will be made based on our knowledge of programming. Let’s see
what solutions are available:
• Simulators: These are application-oriented packages for simulation. There
are numerous interactive software packages for simulation, such as MATLAB,
COMSOL Multiphysics, Ansys, SolidWorks, Simulink, Arena, AnyLogic, and
SimScale. These pieces of software represent excellent simulation platforms whose
performance differs based on the application solutions provided. These simulators
allow us to elaborate on a simulation environment using graphic menus without the
need to program. They are easy to use but many of them have excellent modeling
capabilities, even if you just use their standard features. Some of them provide
animations that show the simulation in action, which allows you to easily illustrate
the simulation to non-experts. The limitations presented by this software solution
are the high costs of the licenses, which can only be faced by large companies, and
the difficulty in modeling solutions that have not been foreseen by the standards.
• Simulation languages: A more versatile solution is offered by the different
simulation languages available. There are solutions that facilitate the task of the
programmer who, with these languages, can develop entire models or sub-models
with a few lines of code that would otherwise require much longer drafting times,
with a consequent increase in the probability of error. An example of a simulation
language is the general-purpose simulation system (GPSS). This is a generic
programming language that was developed by IBM in 1965. In it, a simulation clock
advances in discrete steps, modeling a system as transactions enter the system and
are passed from one service to another. It is mainly used as a process flow-oriented
simulation language and is particularly suitable for application problems. Another
example of a simulation language is SimScript, which was developed in 1963 as an
extension of Fortran. SimScript is an event-based scripting language, so different
parts of the script are triggered by different events.
• General-purpose programming languages: These are designed to be able to create
software in numerous areas of application. They are particularly suitable for
the development of system software such as drivers, kernels, and anything that
communicates directly with the hardware of a computer. Since these languages are
not specifically dedicated to a simulation activity, they require the programmer
to work harder to implement all the mechanisms and data structures necessary
in a simulator. On the other hand, by offering all the potential of a high-level
programming language, they offer the programmer a more versatile programming
environment. In this way, you can develop a numerical simulation model perfectly
suited to the needs of the researcher. In this book, we will use this solution by
devoting ourselves to programming with Python. This software platform offers a
series of tools that have been created by researchers from all over the world that
make the elaboration of a numerical modeling system particularly easy. In addition,
the open source nature of the projects written in Python makes this solution
particularly inexpensive.
Now that we’ve made the choice of the software platform we’re going to use and have
elaborated on the numerical model, we need to verify the software solution.
Verification of the software solution
In this phase, a check is carried out on the numerical code. This is known as debugging,
which consists of ensuring that the code correctly follows the desired logical flow, without
unexpected blocks or interruptions. The verification must be provided in real time during
the creation phase because correcting any concept or syntax errors becomes more difficult
as the complexity of the model increases.
Although verification is simple in theory, debugging large-scale simulation code is a difficult task due to virtual concurrency. The correctness or otherwise of executions
depends on time, as well as on the large number of potential logical paths. When
developing a simulation model, you should divide the code into modules or subroutines
in order to facilitate debugging. It is also advisable to have more than one person review
the code, as a single programmer may not be a good critic. In addition, it can be helpful to
perform the simulation when considering a large variety of input parameters and checking
that the output is reasonable.
Important Note
One of the best techniques that can be used to verify a discrete-event
simulation program is one based on tracking. The status of the system, the
content of the list of events, the simulated time, the status variables, and the
statistical counters are shown after the occurrence of each event and then
compared with handmade calculations to check the operation of the code.
A trace often produces a large volume of output that needs to be checked event by event
for errors. Possible problems may arise, including the following:
• There may be information that hasn’t been requested by the analyst.
• Other useful information may be missing, or a certain type of error may not be
detectable during a limited debugging run.
After the verification process, it is necessary to validate the simulation model.
Validation of the simulation model
In this step, it is necessary to check whether the model that has been created provides
valid results for the system in question. We must check whether the performance
measurements of the real system are well approximated by the measurements generated
by the simulation model. A simulation model of a complex system can only approximate
it. A simulation model is always developed for a set of objectives. A model that’s valid for
one purpose may not be valid for another.
Important Note
Validation is the process of establishing the level of accuracy with which the model represents the system. It is necessary to establish whether the model adequately represents
the behavior of the system. The value of a model can only be defined in relation
to its use. Therefore, validation is a process that aims to determine whether a
simulation model accurately represents the system for the set objectives.
In this step, the ability of the model to reproduce the real functionality of the system is ascertained; that is, it is ensured that the calibrated parameters, relative to the calibration
scenario, can be used to correctly simulate other system situations. Once the validation
phase is over, the model can be considered transferable and therefore usable for the
simulation of any new control strategies and new intervention alternatives. As widely
discussed in the literature on this subject, it is important to validate the model parameters
that were previously calibrated on the basis of data other than that used to calibrate the
model, always with reference to the phenomenon specific to the scenario being analyzed.
Simulation and analysis of results
A simulation is a process that evolves during its realization and where the initial results
help lead the simulation toward more complex configurations. Attention should be paid to
some details. For example, it is necessary to determine the transient length of the system
before reaching stationary conditions if you want performance measures of the system
at full capacity. It is also necessary to determine the duration of the simulation after the
system has reached equilibrium. In fact, it must always be kept in mind that a simulation
does not produce the exact values of the performance measures of a system since each
simulation is a statistical experiment that generates statistical observations regarding
the performance of the system. These observations are then used to produce estimates
of performance measures. Increasing the duration of the simulation can increase the
accuracy of these estimates.
The simulation results return statistical estimates of a system’s performance measures.
A fundamental point is that each measurement is accompanied by the confidence
interval, within which it can vary. These results could immediately highlight a better
system configuration than the others, but more often, more than one candidate
configuration will be identified. In this case, further investigations may be needed to
compare these configurations.
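As a small illustration of this point, assuming we already have the outputs of ten independent simulation replications (the values below are hypothetical), an approximate 95% confidence interval for a performance measure can be estimated as follows:

import math
import statistics

# Hypothetical outputs of 10 independent simulation replications
replications = [5.2, 4.8, 5.5, 5.1, 4.9, 5.3, 5.0, 5.4, 4.7, 5.2]

n = len(replications)
mean = statistics.mean(replications)
std = statistics.stdev(replications)   # sample standard deviation

t_quantile = 2.262                     # Student's t quantile for 95% and 9 degrees of freedom
half_width = t_quantile * std / math.sqrt(n)

print(f"Estimated measure: {mean:.3f} +/- {half_width:.3f}")

Increasing the number of replications, or the duration of each run, narrows this interval.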
Dynamical systems modeling
In this section, we will analyze a real case of modeling a production process. In this
way, we will learn how to deal with the elements of the system and how to translate the
production instances into the elements of the model. A model is created to study the
behavior of a system over time. It consists of a set of assumptions about the behavior of
the system being expressed using mathematical logical-symbolic relationships. These
relationships are between the entities that make up the system. Recall that a model is
used to simulate changes in the system and predict the effects of these changes on the
real system. Simple models are resolved analytically, using mathematical methods, while
complex models are numerically simulated on the computer, where the data is treated as
the data of a real system.
Managing workshop machinery
In this section, we will look at a simple example of how a discrete event simulation of
a dynamic system is created. A discrete event system is a dynamic system whose states
can assume logical or symbolic values, rather than numerical ones, and whose behavior
is characterized by the occurrence of instantaneous events within an irregular timing
sequence not necessarily known a priori. The behavior of these systems is described in
terms of states and events.
In a workshop, there are two machines, which we will call A1 and A2. At the beginning
of the day, five jobs need to be carried out: W1, W2, W3, W4, and W5. The following table
shows how long we need to work on the machines in minutes:
Figure 1.2 – Table showing work time on the machines
A zero indicates that a job does not require that machine. Jobs that require two machines
must pass through A1 and then through A2. Suppose that we decide to carry out the
jobs by assigning them to each machine so that when they become available, the first
executable job is started first, in the order from 1 to 5. If, at the same time, more jobs can
be executed on the same machine, we will execute the one with a minor index first.
The purpose of modeling is to determine the minimum time needed to complete all the
works. The events in which state changes can occur in the system are as follows:
1. A job becomes available for a machine.
2. A machine starts a job.
3. A machine ends a job.
Based on these rules and the evaluation times indicated in the previous table, we can
insert the sequence of the jobs, along with the events scheduled according to the execution
times, into a table:
Figure 1.3 – Table of job sequences
This table shows the times of the events in sequence, indicating the start and end of the
work associated with the two machines available in the workshop. At the end of each job,
a new job is sent to each machine according to the rules set previously. In this way, the
deadlines for the work and the subsequent start of another job are clearly indicated. This
is just as easy as it is to identify the time interval in which each machine is used and when
it becomes available again. The table solution we have proposed represents a simple and
immediate way of simulating a simple dynamic discrete system.
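The processing times of Figure 1.2 are not reproduced here, so the following sketch uses hypothetical durations; it only illustrates how the same scheduling rules can be simulated in Python by advancing the clock minute by minute and starting, on each idle machine, the first executable job in index order:

# Hypothetical processing times in minutes (the real values are given in Figure 1.2):
# job index -> [minutes on A1, minutes on A2]; 0 means the machine is not required
work = {1: [30, 20], 2: [0, 25], 3: [40, 0], 4: [15, 30], 5: [20, 10]}

t = 0
running = {'A1': None, 'A2': None}                 # (job, end time) currently on each machine
done_on_A1 = {j for j in work if work[j][0] == 0}  # jobs that never need A1
finished = set()

while len(finished) < len(work):
    # free the machines whose current job ends now
    for m in ('A1', 'A2'):
        if running[m] is not None and running[m][1] == t:
            j = running[m][0]
            running[m] = None
            if m == 'A1':
                done_on_A1.add(j)
                if work[j][1] == 0:
                    finished.add(j)
            else:
                finished.add(j)
    # start the first executable job, in index order, on each idle machine
    for m in ('A1', 'A2'):
        if running[m] is None:
            for j in sorted(work):
                busy = any(r is not None and r[0] == j for r in running.values())
                if j in finished or busy:
                    continue
                if m == 'A1' and work[j][0] > 0 and j not in done_on_A1:
                    running[m] = (j, t + work[j][0])
                    print(f"t={t}: W{j} starts on A1 (ends at {t + work[j][0]})")
                    break
                if m == 'A2' and work[j][1] > 0 and j in done_on_A1:
                    running[m] = (j, t + work[j][1])
                    print(f"t={t}: W{j} starts on A2 (ends at {t + work[j][1]})")
                    break
    t += 1

print(f"All jobs are completed after {t - 1} minutes")

With the actual durations from Figure 1.2, the same loop reproduces the sequence of events summarized in Figure 1.3.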
The example we just discussed is a typical case of a dynamic system in which time
proceeds in steps, in a discrete way. However, many dynamic systems are best described
by assuming that time passes continuously. In the next section, we will analyze the case of
a continuous dynamic system.
Simple harmonic oscillator
Consider a mass m resting on a horizontal plane, without friction, and attached to a wall
by an ideal spring, of elastic constant k. Suppose that, when the horizontal coordinate x
is zero, the spring is at rest. The following diagram shows the scheme of a simple
harmonic oscillator:
Figure 1.4 – The scheme of a harmonic oscillator
If the block of mass m is moved to the right with respect to its equilibrium position (x > 0), the spring, being elongated, pulls it back to the left. Conversely, if the block is placed to the left of its equilibrium position (x < 0), then the spring is compressed and pushes the block to
the right. In both cases, we can express the component along the x-axis of the force due to
the spring according to the following formula:
F_x = -k * x

Here, we have the following:
• F_x is the force.
• k is the elastic constant.
• x is the horizontal coordinate that indicates the position of the mass m.
From the second law of dynamics, we can derive the component of acceleration along x
as follows:
a_x = -(k / m) * x

Here, we have the following:
• a_x is the acceleration.
• k is the elastic constant.
• m is the mass of the block.
• x is the horizontal coordinate that indicates the position of the mass m.
If we indicate the rate of change of the speed with dv/dt and the speed with v = dx/dt, we can obtain the evolution equations of the dynamic system, as follows:

dx/dt = v
dv/dt = -ω^2 * x

Here, we have the following:
• ω^2 = k / m

For these equations, we must associate the initial conditions of the system, which we can write in the following way:

x(0) = x0
v(0) = v0

The solutions to the previous differential equations are as follows:

x(t) = x0 * cos(ωt) + (v0 / ω) * sin(ωt)
v(t) = v0 * cos(ωt) - x0 * ω * sin(ωt)
In this way, we obtained the mathematical model of the analyzed system. In order to study
the evolution of the oscillation phenomenon of the mass block m over time, it is enough
to vary the time and calculate the position of the mass at that instant and its speed.
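As a sketch of this idea, assuming illustrative values for the mass, the elastic constant, and the initial conditions, the analytical solutions can be evaluated and plotted in Python as follows:

import numpy as np
import matplotlib.pyplot as plt

# Hypothetical parameter values, chosen only for illustration
m = 1.0              # mass (kg)
k = 4.0              # elastic constant (N/m)
x0, v0 = 0.1, 0.0    # initial position (m) and initial speed (m/s)

omega = np.sqrt(k / m)
t = np.linspace(0, 10, 500)

# Analytical solutions of the evolution equations
x = x0 * np.cos(omega * t) + (v0 / omega) * np.sin(omega * t)
v = v0 * np.cos(omega * t) - x0 * omega * np.sin(omega * t)

plt.plot(t, x, label='position x(t)')
plt.plot(t, v, label='speed v(t)')
plt.xlabel('time (s)')
plt.legend()
plt.show()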
In decision-making processes characterized by high levels of complexity, the use of analytical models is not possible. In these cases, it is necessary to resort to models that differ from the analytical type in their use of the computer as a tool not only for calculation, as in mathematical programming models, but also for representing the elements that make up reality and for studying the relationships between them.
Predator-prey model
In the field of simulations, simulating the functionality of production and logistic
processes is considerably important. These systems are, in fact, characterized by high
complexity, numerous interrelationships between the different processes that pass through
them, segment failures, unavailability, and the stochasticity of the system parameters.
To understand how complex the analytical modeling of some phenomena is, let's analyze
a universally widespread biological model. This is the predator-prey model, which was
developed independently by the Italian researcher Vito Volterra and the American
biophysicist Alfred Lotka.
On an island, there are two populations of animals: prey and predators. The vegetation
of the island provides the prey with nourishment in quantities that we can consider as
unlimited, while the prey is the only food available for the predators. We can consider
the birth rate of the prey constant over time; this means that in the absence of predators,
the prey would grow by exponential law. Their mortality rate, on the other hand, depends
on the probability they have of falling prey to a predator and therefore on the number of
predators present per unit area.
As for the predators, the mortality rate is constant, while their growth rate depends on
the availability of food and therefore on the number of prey per unit area present on the
island. We want to study the trend of the size of the two populations over time, starting
from a known initial situation (number of prey and predators).
To carry out a simulation of this biological system, we can model it by means of the following system of differential equations, where x(t) and y(t) are the number of prey and predators at time t, respectively:

dx/dt = α * x(t) - β * x(t) * y(t)
dy/dt = -γ * y(t) + δ * x(t) * y(t)

Here, we have the following:
• α, β, γ, δ are positive real parameters related to the interaction of the two species.
• dx/dt is the instantaneous growth rate of the prey.
• dy/dt is the instantaneous growth rate of the predators.
The following hypotheses underlie this model:
• In the absence of predators, the number of prey increases according to an
exponential law, that is, with a constant rate.
• Similarly, in the absence of prey, the number of predators decreases at a
constant rate.
This is a deterministic and continuous simulation model. In fact, the state of the system, represented by the size of the two populations in each instant of time, is univocally determined by the initial state and the parameters of the model. Furthermore, at least in principle, the variables – that is, the sizes of the populations – vary continuously over time.
The differential system is in normal form and can be solved with respect to the derivatives of maximum order, but its variables cannot be separated. It is not possible to solve it in analytical form, but it can be solved immediately with numerical methods (for example, the Runge-Kutta method). The solution obviously depends on the values of the four constants and the initial values.
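For illustration, a minimal numerical sketch of this model is shown below. It is not part of the book's code files: the parameter values and initial populations are arbitrary assumptions, and SciPy's solve_ivp routine (with the RK45 Runge-Kutta method) is used here as one possible numerical solver:

import numpy as np
from scipy.integrate import solve_ivp

# Assumed example parameters of the predator-prey model
alpha, beta, gamma, delta = 1.0, 0.1, 1.5, 0.075

def lotka_volterra(t, z):
    x, y = z                               # x = prey, y = predators
    dxdt = alpha * x - beta * x * y        # prey equation
    dydt = -gamma * y + delta * x * y      # predator equation
    return [dxdt, dydt]

# Initial populations (assumed): 10 prey and 5 predators, simulated over 50 time units
sol = solve_ivp(lotka_volterra, t_span=(0, 50), y0=[10, 5],
                t_eval=np.linspace(0, 50, 500), method='RK45')

print(sol.y[0][:5])    # first few prey values
print(sol.y[1][:5])    # first few predator values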
Summary
In this chapter, we learned what is meant by simulation modeling. We understood
the difference between modeling and simulation, and we discovered the strengths of simulation models, as well as their weaknesses. To understand these concepts, we clarified the meaning of the terms that appear most frequently when dealing with these topics.
We then analyzed the different types of models: static versus dynamic, deterministic
versus stochastic, and continuous versus discrete. We then explored the workflow
connected to a numerical simulation process and highlighted the crucial steps. Finally, we
studied some practical modeling cases to understand how to elaborate on a model starting
from the initial considerations.
In the next chapter, we will learn how to approach a stochastic process and understand
the random number simulation concepts. Then, we will explore pseudorandom numbers, the difference between uniform and non-uniform random numbers, and the methods we can use to evaluate random distributions.
2
Understanding Randomness and Random Numbers
In many real-life situations, it is useful to flip a coin in order to decide what to do. Many
computers also use this procedure as part of their decision-making process. In fact, many
problems can be solved in a very effective and relatively simple way by using probabilistic
algorithms. In an algorithm of this type, decisions are made based on random contributions that resemble a dice roll, that is, with the help of a randomly chosen value.
The generation of random numbers has ancient roots, but only recently has the process
been sped up, allowing it to be used on a large scale in scientific research as well. These
generators are mainly used for computer simulations, statistical sampling techniques, or
in the field of cryptography.
In this chapter, we're going to cover the following topics:
• Stochastic processes
• Random number simulation
• The pseudorandom number generator
• Testing uniform distribution
• Exploring generic methods for random distributions
• Random number generation using Python
Technical requirements
In this chapter, we will introduce random number generation techniques. In order to
understand these topics, a basic knowledge of algebra and mathematical modeling is
needed.
To work with the Python code in this chapter, you need the following files (available on
GitHub at https://github.com/PacktPublishing/Hands-On-Simulation-Modeling-with-Python):
• LinearCongruentialGenerator.py
• LearmouthLewisGenerator.py
• LaggedFibonacciAlgorithm.py
• UniformityTest.py
• Random.Generation.py
Stochastic processes
A stochastic process is a family of random variables that depends on a parameter, t. A
stochastic process is specified using the following notation:

{X_t, t ∈ T}
Here, t is a parameter, and T is the set of possible values of t.
Usually, time is indicated by t, so a stochastic process is a family of time-dependent
random variables. The variability range of t, that is, the set, T, can be a set of real numbers,
possibly coinciding with the entire time axis. But it can also be a discrete set of values.
The random variables, Xt, are defined on the set, X, called the space of states. This can be a
continuous set, in which case it is defined as a continuous stochastic process, or a discrete
set, in which case it is defined as a discrete stochastic process.
Consider the following elements:

x_0, x_1, x_2, …, x_n ∈ X
These are the values that the random variables, Xt, can take; such values are called system states and
represent the possible results of an experiment. The Xt variables are linked together by
dependency relationships. We can know a random variable if we know both the values
it can assume and the probability distribution. So, to understand a stochastic process,
it is necessary not only to know the values that Xt can take but also the probability
distributions of the variables and the joint distributions between the values. Simpler
stochastic processes, in which the variability range of t is a discrete set of time values,
can also be considered.
Important note
In practice, there are numerous phenomena that are studied through the
theory of stochastic processes. A classic application in physics is the study of
the motion of a particle immersed in a medium, the so-called Brownian motion.
This study is carried out statistically using a stochastic process. There are
processes where even by knowing the past and the present, the future cannot
be determined; whereas, in other processes, the future is determined by the
present without considering the past.
Types of stochastic process
Stochastic processes can be classified according to the following characteristics:
• Space of states
• Time index
• Type of stochastic dependence between random variables
The state space can be discrete or continuous. In the first case, the stochastic process
with discrete space is also called a chain, and space is often referred to as the set of
non-negative integers. In the second case, the set of values assumed by the random
variables is not finite or countable, and the stochastic process is in continuous space.
The time index can also be discrete or continuous. A discrete-time stochastic process is
also called a stochastic sequence and is denoted as follows:

{X_n | n ∈ T}

Here, the set, T, is finite or countable.
In this case, the changes of state are observed only at certain instants, finite or countable in number.
If state changes occur at any instant in a finite or infinite set of real intervals, then there is a continuous-time process, which is denoted as follows:

{X(t) | t ∈ T}
The stochastic dependence between random variables, X(t), for different values of t
characterizes a stochastic process and sometimes simplifies its description. A stochastic process is stationary in the strict sense if its distribution function is invariant with respect to a shift on the time axis, T. A stochastic process is stationary in the broad sense if the first two moments of the distribution are independent of the position on the T axis.
Examples of stochastic processes
The mathematical treatment of stochastic processes seems complex, yet we find cases of
stochastic processes every day. For example, the number of patients admitted to a hospital
as a function of time, observed at noon each day, is a stochastic process in which the space
of states is discrete, being a finite subset of natural numbers, and time is discrete. Another
example of a stochastic process is the temperature measured in a room as a function of
time, observed at every instant, with continuous state space and continuous time. Let's
now look at a number of structured examples that are based on stochastic processes.
The Bernoulli process
The concept of a random variable allows us to formulate models that are useful for
the study of many random phenomena. An important early example of a probabilistic
model is the Bernoulli distribution, named in honor of the Swiss mathematician, James
Bernoulli (1654-1705), who made important contributions to the field of probability.
Some of these experiments consist of repeatedly performing a given test. For example,
we want to know the probability of getting a head when throwing a coin 1,000 times.
In each of these examples, we look for the probability of obtaining x successes in n trials.
If x indicates the successes, then n - x will be the failures.
A sequence of Bernoulli trials consists of repeatedly performing a Bernoulli trial under the following hypotheses:
• There are only two possible mutually exclusive results for each trial, arbitrarily
called success and failure.
• The probability of success, p, is the same for each trial.
• All tests are independent.
Independence means that the result of a test is not influenced by the result of any other
test. For example, the event, the third test was successful, is independent of the event, the
first test was successful.
The toss of a coin is a Bernoulli trial: the heads event can be considered successful, and
the tails event can be considered unsuccessful. In this case, the probability of success is p
= 1/2. In rolling two dice, the event, the sum of the points is seven, can be considered a success and the complementary event a failure. In this case, it is a Bernoulli trial and the probability of success is p = 1/6.
Important note
Two events are said to be complementary when the occurrence of the first
excludes the occurrence of the second but one of the two will certainly occur.
Let p denote the probability of success in a Bernoulli trial. The random variable, X, which
counts the number of successes in n trials is called the binomial random variable of the n
and p parameters. X can take integer values between 0 and n.
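As a small illustration (a sketch with assumed values of n and p, not taken from the book's code files), we can simulate n Bernoulli trials with NumPy and count the successes:

import numpy as np

rng = np.random.default_rng(42)   # seeded generator for reproducibility

n = 1000    # number of Bernoulli trials (assumed)
p = 0.5     # probability of success in each trial (assumed)

# One realization of the binomial random variable X: the number of successes in n trials
trials = rng.random(n) < p        # boolean array, True = success
x = trials.sum()
print(f"Successes in {n} trials: {x}")

# Equivalent direct sampling from the binomial distribution
print(rng.binomial(n, p))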
Random walk
The random walk is a discrete parameter stochastic process in which Xt, where X
represents a random variable, describes the position taken at time t by a moving point.
The term, random walk, refers to the mathematical formalization of statistics that describe
the displacement of an object that moves randomly. This kind of simulation is extremely
important for a physicist and has applications in statistical mechanics, fluid dynamics, and
quantum mechanics.
Random walks represent a mathematical model that is used universally to simulate a path
formalized by a succession of random steps. This model can assume a variable number
of degrees of freedom, depending on the system we want to describe. From a physical
point of view, the path traced over time will not necessarily simulate a real motion, but it
will represent the trend of the characteristics of the system over time. Random walks find
applications in chemistry, biology, and physics, but also in other fields such as economics,
sociology, and information technology.
The one-dimensional random walk is a model that is used to simulate the movement of
a particle moving along a straight line. There are only two potential movements on the
allowed path: either to the right (with a probability that is equal to p) or to the left (with
a probability that is equal to q) of the current position. Each step has a constant length
and is independent of the others, as shown in the following diagram:
Figure 2.1 – One-dimensional walking
The position of the point in each instant is identified by its abscissa, X(n). This position,
after n steps, will be characterized by a random term. Our aim is to calculate the probability of passing through the starting point again after n movements. Obviously, nothing assures us that the point will return to the starting position. The variable, X(n), returns
the abscissa of the particle after n steps. It is a discrete random variable with a binomial
distribution.
At each instant, the particle steps right or left based on the value returned by a random
variable, Z(n). This variable can take only two values: +1 and -1. It assumes a + 1 value
with a probability of p > 0 and a value of -1 with a probability that is equal to q. The sum
of the two probabilities is p + q = 1. The position of the particle at instant n is given by the following equation:

X_n = X_{n-1} + Z_n,   n = 1, 2, …
Let μ denote the average number of returns of the particle to the origin. Since the probability, p_n, of finding the particle at the origin after n steps decreases like 1/√(n * π), the expected number of returns is given by the following series, which diverges:

μ = Σ_{n=0}^∞ p_n ≈ Σ_{n=0}^∞ 1/√(n * π) → ∞

We can therefore assume that the probability of the particle returning to the origin tends to 1. This means that, although the frequency of the returns decreases as the number of steps taken increases, returns keep occurring over an infinite number of steps. So, we can conclude that a particle with equal probability of left and right movement, left free to walk randomly to infinity, returns with high probability an infinite number of times to the point from which it started.
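A minimal simulation sketch of this one-dimensional random walk (the number of steps and the probability p are assumptions made for illustration; this is not one of the book's code files) could be the following:

import numpy as np

rng = np.random.default_rng(0)    # seeded generator for reproducibility

n_steps = 1000                    # number of steps (assumed)
p = 0.5                           # probability of a step to the right (assumed)

# Z(n) takes the value +1 with probability p and -1 with probability q = 1 - p
z = rng.choice([1, -1], size=n_steps, p=[p, 1 - p])

# X(n) is the cumulative sum of the steps, starting from the origin
x = np.cumsum(z)

print("Final position:", x[-1])
print("Number of returns to the origin:", np.sum(x == 0))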
The Poisson process
There are phenomena in which certain events, with reference to a certain interval of time
or space, rarely happen. The number of events that occur in that interval varies from 0 to
n, and n cannot be determined a priori. For example, the number of cars passing through
an uncrowded street in a randomly chosen 5-minute time frame can be considered a rare
event. Similarly, the number of accidents at work that happen at a company in a week, or
the number of printing errors on a page of a book, is rare.
In the study of rare events, a reference to a specific interval of time or space is
fundamental. For the study of rare events, the Poisson probability distribution is used,
named in honor of the French mathematician, Simeon Denis Poisson (1781-1840), who
first obtained the distribution. The Poisson distribution is used as a model in cases where
the events or realizations of a process, distributed randomly in space or time, are counts,
that is, discrete variables.
The binomial distribution is based on a set of hypotheses that define the Bernoulli trials,
and the same happens for the Poisson distribution. The following conditions describe
the so-called Poisson process:
• The realizations of the events are independent, meaning that the occurrence of
an event in a time or space interval has no effect on the probability of the event
occurring a second time in the same, or another, interval.
• The probability of a single realization of the event in each interval is proportional to
the length of the interval.
• In any arbitrarily small part of the interval, the probability of the event occurring
more than once is negligible.
An important difference between the Poisson distribution and the binomial distribution
is the number of trials and successes. In a binomial distribution, the number, n, of trials
is finite and the number, x, of successes cannot exceed n; in a Poisson distribution, the
number of tests is essentially infinite and the number of successes can be infinitely large,
even if the probability of having x successes becomes very small as x increases.
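As an illustrative sketch (the rate value and sample sizes are assumptions, not taken from the book's code files), we can sample counts of rare events from a Poisson distribution with NumPy:

import numpy as np

rng = np.random.default_rng(1)    # seeded generator for reproducibility

lam = 3.0    # average number of rare events per interval (assumed rate)

# Number of events observed in 10 independent intervals of equal length
counts = rng.poisson(lam, size=10)
print(counts)

# Over many intervals, the empirical mean should be close to the rate lam
print(rng.poisson(lam, size=100000).mean())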
Random number simulation
The availability of random numbers is a necessary requirement in many applications.
In some cases, the quality of the final application strictly depends on the possibility
of generating good quality random numbers. Think, for example, of applications such
as video games, cryptography, generating visuals or sound effects, telecommunications,
signal processing, optimizations, and simulations. In an algorithm of this type, decisions are made based on the toss of a virtual coin, that is, on a randomly chosen value.
There is no single or general definition of a random number since it often depends on the
context. The concept of random number itself is not absolute, as any number or sequence
of numbers can appear to be random to an observer, but not to another who knows the
law with which they are generated. Put simply, a random number is defined as a number
selected in a random process from a finite set of numbers. With this definition, we focus
on the concept of randomness in the process of selecting a sequence of numbers.
In many cases, the problem of generating random numbers concerns the random
generation of a sequence of 0s and 1s, from which numbers in any format can be obtained:
integers, fixed points, floating points, or strings of arbitrary length. With the right
functions, it is possible to obtain good quality sequences that can also be used in scientific
applications, such as the Monte Carlo simulation. These techniques should be easy to
implement and be usable by any computer. In addition, like all software solutions, they
should be very versatile and quickly improved.
Important note
These techniques have a big problem that is inherent to the algorithmic nature
of the process: the final string can be predicted from the starting seed. This is
why we call this process pseudorandom.
Despite this, many problems of an algorithmic nature are solved very effectively and
relatively simply using probabilistic algorithms. The simplest example of a probabilistic
algorithm is perhaps the randomized quicksort. This is a probabilistic variant of the
homonymous sorting algorithm, where, by choosing the pivot element at random, the algorithm guarantees optimal complexity in the average case, no matter the distribution of the input. Cryptography is a field in which randomness plays
a fundamental role and deserves specific mention. In this context, randomness does
not lead to computational advantages, but it is essential to guarantee the security of
authentication protocols and encryption algorithms.
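As an aside, here is a minimal sketch of the randomized quicksort mentioned above (a simple list-based version with a random pivot, not an optimized in-place implementation):

import random

def randomized_quicksort(items):
    # Base case: a list of zero or one element is already sorted
    if len(items) <= 1:
        return items
    pivot = random.choice(items)                    # random pivot choice
    smaller = [x for x in items if x < pivot]
    equal = [x for x in items if x == pivot]
    larger = [x for x in items if x > pivot]
    return randomized_quicksort(smaller) + equal + randomized_quicksort(larger)

print(randomized_quicksort([7, 2, 9, 4, 4, 1]))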
Probability distribution
It is possible to characterize a random process from different points of view. One of
the most important characteristics is the probability distribution. The probability
distribution is a model that associates a probability with each observable modality
of a random variable.
The probability distribution can be either discrete or continuous, depending on whether the random variable is discrete or continuous. It is discrete if the phenomenon is
observable with an integer number of modes. The throw of the dice is a discrete
statistical phenomenon because the number of observable modalities is equal to 6. The
random variable can take only six values (1, 2, 3, 4, 5, and 6). Therefore, the probability
distribution of the phenomenon is discrete. The probability distribution is continuous
when the random variable assumes a continuous set of values; in this case, the statistical
phenomenon can be observed with an infinite or very high number of modalities. The
probability distribution of body temperature is continuous because it is a continuous
statistical phenomenon, that is, the values of the random variable vary continuously.
Let’s now look at different kinds of probability distributions.
Uniform distribution
In many cases, processes characterized by a uniform distribution are considered and
used. This means that each element is as likely as any of the others to be selected if an
infinite number of extractions is performed. If you represent the elements and their
respective probabilities of being extracted on a graph, you get a rectangular graph
as follows:
Figure 2.2 – The probabilities of the elements
Since the probability is expressed as a real number between 0 and 1, where 0 represents
the impossible event and 1 the certain event, in a uniform distribution, each element will
have a 1/n probability of being selected, where n is the number of items. In this case, the sum of all the probabilities must equal 1, since, in an extraction, one of the elements is certainly chosen. A uniform distribution is typical of artificial random
processes such as dice rolling, lotteries, and roulette and is also the most used in several
applications.
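As a short sketch (the sample size is an assumption made here), we can draw uniformly distributed values with NumPy and check that each face of a die appears with roughly the same frequency:

import numpy as np

rng = np.random.default_rng(2)    # seeded generator for reproducibility

# 60,000 simulated die rolls (assumed sample size)
rolls = rng.integers(1, 7, size=60000)

# Each face should appear with a relative frequency close to 1/6
values, counts = np.unique(rolls, return_counts=True)
for v, c in zip(values, counts):
    print(v, c / rolls.size)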
Gaussian distribution
Another very common probability distribution is the Gaussian or normal distribution,
which has a typical bell shape. In this case, the values that are closer to the center of the curve are more likely to be extracted than those that are far away from the center. The following diagram shows a typical Gaussian distribution:
Figure 2.3 – Gaussian distribution
Gaussian distribution is important because it is typical of natural processes. For example,
it can represent the distribution of the noise in many electronic components, or it can
represent the distribution of errors in measurements. It is, therefore, used to simulate
statistical distributions in the fields of telecommunications or signal processing.
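A minimal sampling sketch (the mean, standard deviation, and sample size are assumptions made for illustration) is shown below:

import numpy as np

rng = np.random.default_rng(3)    # seeded generator for reproducibility

mu, sigma = 0.0, 1.0              # assumed mean and standard deviation
samples = rng.normal(mu, sigma, size=100000)

# The sample mean and standard deviation should be close to mu and sigma
print(samples.mean(), samples.std())

# Roughly 68% of the values should fall within one standard deviation of the mean
print(np.mean(np.abs(samples - mu) < sigma))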
Properties of random numbers
By random number, we refer to a random variable distributed in a uniform way between 0
and 1. The statistical properties that a sequence of random numbers must possess are
as follows:
• Uniformity
• Independence
Suppose you divide the interval, [0, 1], into n subintervals of equal amplitude. The
consequence of the uniformity property is that if N observations of a random number
are made, then the number of observations in each subinterval is equal to N/n. The
consequence of the independence property is that the probability of obtaining a value in
a particular range is independent of the values that were previously obtained.
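To make these two properties concrete, here is a small sketch (the sample size N and the number of subintervals n are assumptions) that counts how many uniform samples fall in each subinterval and compares the counts with N/n:

import numpy as np

rng = np.random.default_rng(4)    # seeded generator for reproducibility

N = 100000    # number of observations (assumed)
n = 10        # number of subintervals of [0, 1] (assumed)

u = rng.random(N)                                  # uniform samples in [0, 1)
counts, _ = np.histogram(u, bins=n, range=(0, 1))  # observations per subinterval

print("Expected per subinterval:", N / n)
print("Observed per subinterval:", counts)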
The pseudorandom number generator
The generation of real random sequences using deterministic algorithms is impossible:
at most, pseudorandom sequences can be generated. These are, apparently, random
sequences that are actually perfectly predictable and can be repeated after a certain
number of extractions. A Pseudorandom Number Generator (PRNG) is an algorithm designed to output a sequence of values that appear to be generated randomly.
The pros and cons of a random number generator
A random number generation routine should have the following characteristics:
• Replicable
• Fast
• Not have large gaps between two generated numbers
• Have a sufficiently long running period
• Generate numbers with statistical properties that are as close as possible to
ideal ones
The most common cons of random number generators are as follows:
• Numbers not uniformly distributed
• Discretization of the generated numbers
• Incorrect mean or variance
• Presence of cyclical variations
Random number generation algorithms
The first to deal with the random number generation problem was John von Neumann, in
1949. He proposed a method called middle-square. This method allows us to understand
some important characteristics of a random number generation process. To start, we need
to provide an input as the seed or a value that starts the sequence. This is necessary to be
able to generate different sequences each time. However, it is important to ensure that the
good behavior of the generator does not depend on which seed is used. From here, the
first flaw of the middle-square method appears, that is, when using the zero value as
a seed, only a sequence of zeros will be obtained.
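As a minimal sketch of the middle-square idea for four-digit numbers (the seed value is an arbitrary assumption), the method squares the current value and keeps the middle four digits of the eight-digit square:

def middle_square(seed, n):
    # Generate n pseudorandom four-digit numbers with the middle-square method
    values = []
    x = seed
    for _ in range(n):
        x = (x * x) // 100 % 10000   # square, then keep the middle four digits
        values.append(x)
    return values

print(middle_square(5735, 10))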
Another shortcoming of this method is the repetitiveness of the sequences. As in all of the
other PRNGs that will be discussed, each value depends on the previous one, and, at most,
on the internal state variables of the generator. Since this is a limited number of digits, the
sequence can only be repeated from a certain point onward. The length of the sequence,
before it begins to repeat itself, is called the period. A long period is important because
many practical applications require a large amount of random data, and a repetitive
sequence might be less effective. In such cases, it is important that the choice of the seed
has no influence on the potential outcomes.
Another important aspect is the efficiency of the algorithm. The size of the output data
values and internal state, and, therefore, the generator input (seed), are often intrinsic
features of the algorithm and remain constant. For this reason, the efficiency of a PRNG
should be assessed not so much in terms of computational complexity, but in terms of the possibility of a fast and efficient implementation on the calculation architectures available.
In fact, depending on the architecture you are working on, the choice of different PRNGs,
or different design parameters of a certain PRNG, can result in a faster implementation by
many orders of magnitude.
Linear congruential generator
One of the most common methods for generating random numbers is the Linear Congruential Generator (LCG). The theory on which it rests is simple to understand
and implement. It also has the advantage of being computationally light. The recursive relationship underlying this technique is provided by the following equation:

x_{k+1} = (a * x_k + c) mod m
Here, we can observe the following:
• a is the multiplier (non-negative integers)
• c is the increment (non-negative integers)
• m is the modulus (a non-negative integer)
• x0 is the initial value or seed (a non-negative integer)
The modulo function, denoted by mod, results in the remainder of the Euclidean division
of the first number by the second. For example, 18 mod 4 gives 2 as it is the remainder of
the Euclidean division between the two numbers.
The linear congruence technique has the following characteristics:
• It is cyclical with a period that is approximately equal to m
• The generated numbers are discretized
To use this technique effectively, it is necessary to choose very large m values. As an example, let's set the parameters of the method and generate the first 16 pseudorandom values. Here is the Python code that generates that sequence of numbers:
import numpy as np

# LCG parameters: multiplier, increment, modulus, and seed
a = 2
c = 4
m = 5
x = 3

# Generate the first 16 pseudorandom values
for i in range(1, 17):
    x = np.mod((a * x + c), m)
    print(x)
The following results are returned:
0
4
2
3
0
4
2
3
0
4
2
3
0
4
2
3
In this case, the period is equal to 4. It is easy to verify that, at most, m distinct
integers, Xn, can be generated in the interval, [0, m – 1]. If c = 0, the generator is called
multiplicative. Let’s analyze the Python code line by line. The first line was used to import
the library:
import numpy as np
numpy is a library of additional scientific functions of the Python language, designed to perform operations on vectors and multidimensional matrices. numpy allows you to work
with vectors and matrices more efficiently and faster than you can do with lists and lists
of lists (matrices). In addition, it contains an extensive library of high-level mathematical
functions that can operate on these arrays.
After importing the numpy library, we set the parameters that will allow us to generate
random numbers using LCG:
a = 2
c = 4
m = 5
x = 3
At this point, we can use the LCG formula to generate random numbers. We only generate
the first 16 numbers, but we will see from the results that these are enough to understand
the algorithm. To do this, we use a for loop:
for i in range(1, 17):
    x = np.mod((a * x + c), m)
    print(x)
To generate random numbers according to the LCG formula, we have used the
np.mod() function. This function returns the remainder of a division when given
a dividend and divisor.
Random numbers with uniform distribution
A sequence of numbers uniformly distributed between [0, 1] can be obtained using the following formula:

U_n = X_n / m
The obtained sequence is periodic, with a period less than or equal to m. If the period is
m, then it has a full period. This occurs when the following conditions are true:
• c and m are relatively prime, that is, they have no common prime factors
• a - 1 is divisible by every prime factor, b, of m
• If m is divisible by 4, then a - 1 must also be divisible by 4
Important note
By choosing a large value of m, you can reduce both the phenomenon of
periodicity and the problem of generating rational numbers.
Furthermore, it is not necessary for simulation purposes that all numbers between [0, 1]
are generated, because these are infinite. However, it is necessary that as many numbers as
possible within the range have the same probability of being generated.
Generally, a value of m ≥ 10^9 is chosen so that the generated numbers constitute a dense subset
of the interval, [0, 1].
An example of a multiplicative generator that is widely used in 32-bit computers is the
Learmonth-Lewis generator. This is a generator in which the parameters assume the
following values:
• a = 7^5
• c = 0
• m = 2^31 - 1
Let’s analyze the code that generates the first 100 random numbers according to this
method:
import numpy as np

# Learmonth-Lewis parameters: a = 7^5, c = 0, m = 2^31 - 1
a = 7**5
c = 0
m = 2**31 - 1
x = 0.1          # seed
for i in range(1, 101):
    x = np.mod((a * x + c), m)
    u = x / m        # scale into the interval [0, 1]
    print(u)
The code we have just seen was analyzed line by line in the Linear congruential generator
section of this chapter. The difference, in addition to the values of the parameters, lies
in the generation of a uniform distribution in the range of [0, 1] through the following
command:
u = x/m
The following results are returned:
Figure 2.4 – LCG output
Since we are using a different generator, different parameter values, and a different seed, the output differs from the previous example.
A comparison between the different generators must be made based on the analysis
of the periodicity, the goodness of the uniformity of the numbers generated, and the
computational simplicity. This is because the generation of very large numbers can lead
to the use of expensive computer resources. Also, if the Xn numbers get too big, they are
truncated, which can cause a loss of the desired uniformity statistics.
Lagged Fibonacci generator
The lagged Fibonacci algorithm for generating pseudorandom numbers arises from the
attempt to generalize the method of linear congruences. One of the reasons that led to the
search for new generators was the need, useful for many applications especially in parallel
computing, to lengthen the generator period. The period of a linear generator, when m is approximately 10^9, is enough for many applications, but not all of them.
One of the techniques developed is to make X_{n+1} dependent on the two previous values, X_n and X_{n-1}, instead of only on X_n, as is the case in the LCG method. In this case, the period may come close to the value, m^2, because the sequence will not repeat itself until the following equality is obtained:

(X_{n+λ}, X_{n+λ+1}) = (X_n, X_{n+1})
The simplest generator of this type is the Fibonacci sequence represented by the following equation:

X_{n+1} = (X_n + X_{n-1}) mod m
This generator was first analyzed in the 1950s and provides a period, m, but the sequence
does not pass the simplest statistical tests. We then tried to improve the sequence using the following equation:

X_{n+1} = (X_n + X_{n-k}) mod m
This sequence, although better than the Fibonacci sequence, does not return satisfactory
results. We had to wait until 1958, when Mitchell and Moore proposed the following sequence:

X_n = (X_{n-24} + X_{n-55}) mod m,   n ≥ 55
Here, m is even and X0, …, X54 are arbitrary integers that are not all even. The constants 24 and 55 are not chosen at random but are numbers that define a sequence whose least significant bits (Xn mod 2) have a period of length 2^55 - 1. Therefore, the sequence (Xn) must have a period of length of at least 2^55 - 1. The sequence has a period of 2^(M-1) * (2^55 - 1), where m = 2^M.
Numbers 24 and 55 are commonly called lags and the sequence (Xn) is called a Lagged Fibonacci Generator (LFG). The LFG sequence can be generalized with the following equation:

X_n = (X_{n-l} ⊗ X_{n-k}) mod 2^M,   l > k > 0
Here, ⊗ refers to any of the following operations: +, -, ×, or ⊕ (exclusive or). Only some pairs (k, l) give sufficiently long periods. In these cases, the period is 2^(M-1) * (2^l - 1). The pairs (k, l) must be chosen appropriately. The only condition on the first l
values is that at least one of them must be odd; otherwise, the sequence will be composed
of even numbers.
Let’s look at how to implement a simple example of additive LFG in Python using the
following parameters: x0 = x1 = 1 and m = 2^32. Here is the code to generate the first 100
random numbers:
import numpy as np

# Initial values and modulus for the additive lagged Fibonacci generator
x0 = 1
x1 = 1
m = 2**32

# Generate the first 100 pseudorandom values
for i in range(1, 101):
    x = np.mod((x0 + x1), m)
    x0 = x1          # shift the two previous values
    x1 = x
    print(x)
Let’s analyze the Python code line by line. The first line was used to import the library:
import numpy as np
After importing the numpy library, we set the parameters that will allow us to generate
random numbers using LFG:
x0 = 1
x1 = 1
m = 2**32
At this point, we can use the LFG formula to generate random numbers. We only generate
the first 100 numbers. To do this, we use a for loop:
for i in range(1, 101):
    x = np.mod((x0 + x1), m)
    x0 = x1
    x1 = x
    print(x)
To generate random numbers according to the LFG formula, we use the np.mod()
function. This function returns the remainder of a division when given a dividend and
divisor. After generating the random number, the two previous values are updated as
follows:
x0 = x1
x1 = x
The following random numbers are printed:
Figure 2.5 – Table of random numbers using LFG
The initialization of an LFG is particularly complex, and the results of this method are
very sensitive to the initial conditions. If extreme care is not taken when choosing the
initial values, statistical defects may occur in the output sequence. These defects may affect not only the initial values but also subsequent values, recurring with a certain periodicity.
Another potential problem with LFG is that the mathematical theory behind the method
is incomplete, making it necessary to rely on statistical tests rather than theoretical
performance.
Testing uniform distribution
A goodness-of-fit test, in general, has the purpose of verifying whether a variable under examination does or does not follow a certain hypothesized distribution, on the basis, as usual, of experimental data. It is used to compare a set of
frequencies observed in a sample, with similar theoretical quantities assumed for the
population. By means of the test, it is possible to quantitatively measure the degree of
deviation between the two sets of values.
The results obtained in the samples do not always exactly agree with the theoretical results
that are expected according to the rules of probability. Indeed, it is very rare for this to
occur. For example, although theoretical considerations lead us to expect 100 heads and
100 tails from 200 flips of a coin, it is rare that these results are obtained exactly. However,
despite this, we must not unnecessarily deduce that the coin is rigged.
The chi-squared test
The chi-squared test is a hypothesis test that measures the significance of the relationship between two variables. It is a statistical inference technique that is based on
the chi-squared statistic and its probability distribution. It can be used with nominal and/
or ordinal variables, generally arranged in the form of contingency tables.
The main purpose of this statistic is to verify the differences between observed and
theoretical values, called expected values, and to make an inference on the degree of
deviation between the two. The technique is used with three different objectives that are
all based on the same fundamental principle:
• The randomness of the distribution of a categorical variable
• The independence of two qualitative variables (nominal or ordinal)
• The differences with a theoretical model
For now, we will just consider the first aspect. The method consists of a comparison
procedure between the observed empirical frequencies and the theoretical frequencies.
Let’s consider the following definitions:
• H0: Null hypothesis or the absence of a statistical relationship between two variables
• H1: Research hypothesis that supports the existence of the relationship; that is, H1 is true if H0 is false
• Fo: Observed frequencies, that is, the number of observations detected in a cell
• Fe: Expected frequencies, that is, the frequency that should be obtained based on the
marginal totals if there was no association between the two variables considered
The chi-squared test is based on the difference between observed and expected
frequencies. If the observed frequency is very different from the expected frequency, then
there is an association between the two variables.
As the difference between the observed frequency and the expected frequency increases,
so does the value of chi-squared. The chi-squared value is calculated using the following equation:

χ^2 = Σ (F_o - F_e)^2 / F_e
Let’s look at an example to understand how to calculate this value. We build a contingency
table that shows student choices for specific courses divided by gender. These are the observed values:
Figure 2.6 – Table of student choices divided by gender
In addition, we calculate the representation of each value as a percentage of column totals
(observed frequencies):
Figure 2.7 – Table of choices in percentages of column totals
Now we calculate the expected value, as follows:

Expected value = (Total row * Total column) / Total

Let's calculate it for the first cell (Biotechnology – Male):

Expected value = (985 * 1166) / 2239 = 512.9567
The contingency matrix of expected values is as follows:
Figure 2.8 – Table of contingency matrix
Let’s calculate the contingency differences (Fo – Fe), as follows:
Figure 2.9 – Table of contingency differences
Finally, we can calculate the chi-squared value, as follows:

χ^2 = Σ (F_o - F_e)^2 / F_e = (-101.957)^2 / 512.9567 + (-101.957)^2 / 472.0433 + ⋯ = 84.35
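As a cross-check, here is a minimal sketch of how such a chi-squared statistic could be computed with NumPy (the observed contingency table below is a small hypothetical example, not the one shown in the figures above):

import numpy as np

# Hypothetical 3x2 contingency table of observed frequencies (rows: courses, columns: gender)
observed = np.array([[25., 35.],
                     [40., 20.],
                     [30., 50.]])

# Expected frequencies under independence: (row total * column total) / grand total
row_totals = observed.sum(axis=1, keepdims=True)
col_totals = observed.sum(axis=0, keepdims=True)
total = observed.sum()
expected = row_totals @ col_totals / total

# Chi-squared statistic: sum of (Fo - Fe)^2 / Fe over all cells
chi2 = ((observed - expected) ** 2 / expected).sum()
print(chi2)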
If the two variables were independent, we would expect a chi-squared value of zero. On
the other hand, random fluctuations are always possible. So, even in the case of perfect
independence, we will never have zero. Therefore, even chi-squared values that
…