Developing and testing algorithms

In the previous lectures we focused on implementing small pieces of code to perform specific well-defined tasks for you. However, in general, computer programs can perform highly complicated data processing and can comprise of millions line of code (not in this course though). Usually, we say that a computer program is an *algorithm*, i.e., it is "a finite sequence of well-defined, computer-implementable instructions, typically to solve a class of problems or to perform a computation". In order to develop these codes computer scientists use what is called algorithmic thinking. In essence, algorithmic thinking is the process of solving a problem in a systematic way. In this lecture notebook we develop some algorithmic thinking and go through the basic steps that are required to make a computer program from scratch.

Learning objectives for this notebook:

  • Student can design an algorithm in a systematic way
  • Student can implement a algorithm in Python
  • Student can test algorithm outcome
  • Student can benchmark algorithm computational speed

Designing algorithms

Programming problems generally start with a very general description in written language of what the computer needs to do. As a computer scientist (and physicist) you usually have to better specify the problem. Here we will take the problem stated below as an example.

Determine the start date and length of the longest period of summer days in a KNMI dataset ranging from 1920 to 2020.

Although this may seem a trivial task, algorithmic thinking can improve the rapid and correct implementation of this problem in computer code.

Step 1: Understanding the problem and abstract

The first thing we do is to define more precisely what is required. The data file can be obtained from KNMI but is given to you as KNMI_20200129.txt. First open the file in a text editor.

  • The file has a header and the data consists of three columns
  • The temperature column consists of daily temperature readings in units of 0.1 degrees
  • Summer period is defined as a temperature in excess of 25 degrees Celcius
  • There are probably many summer periods, so we have to compare their lengths
  • We output the length of the summer period and the start date
  • We are not sure whether there are any summer periods in the data set

Now that we have defined the problem a bit better we can describe in words what we have to do:

Loop over all days in the data set. If the temperature of the day is higher than or equal to 25 degrees, add 1 day to the summer day length, and go to the next day. When encountering the first day with a temperature lower than 25 degrees, compare the calculated number of consecutive summer days, if the period is longer than the previous longest summer period, replace both the start date and the summer period length by the new start date and length. Set the length back to zero.

Step 2: Make a rough sketch of your algorithm

Below is a sketch of the algorithm that we need to write.

Note the clearly visible loop and conditional statements in the diagram.

Step 3: Define an outline of your program

We make a step by step implementation of the algorithm in Python.

We add a general description of the program. Imagine yourself looking at the code in a few years from now, you would most probably have forgotten everything about the code and be very happy to see a short description of what the code does.

We put all steps in comments and make sure that the program keeps running free of errors in all the steps we make (push run regularly). In this way we know that if programming errors occur we know at which step they occur, which makes it more easy to solve them. Below is a first outline of our program. Note that we defined a test data set to validate the correct operation of our program.

In [ ]:
"""This program calculates the longest period of summer days in a KNMI data set."""

# Load libraries

# Define parameters

# Read in data set

# Plot data set

# Generate test data set

# Loop over all days

# Print output

Step 4: Step by step implementation

The first step is reading in the relevant libraries. Also we already define a few of the important parameters, see the diagram. Next, we open the file with a text editor and look at the data format. As you can see the data is in units of tenths of degrees, hence we need to convert this. The date is in a general form. For now this is fine. Also you can see that the delimiter is a ,. The first real programming step is getting access to the data via reading in the text file.

A plot of the data versus the line number gives us a quick view of what the data set looks like and that the temperature is what we expect. From the data we can already see that there are a few days that have a temperature larger than the threshold of 25 degrees. If we would have to write a program for any KNMI data set, we would have to check with our algorithm whether there are any summer days in the data set.

In [ ]:
# Load libraries
import numpy as np
import matplotlib.pyplot as plt
import sys
sys.path.append('resource/asnlib/public/')
sys.path.append('../answer_checking/') 
from validate_answers import *

# Define parameters
Tsummer=25
index=0
length=0
lengthmax=0

# Read in data set
debilt=np.loadtxt('resource/asnlib/public/KNMI_20200129.txt', dtype=int, delimiter=',')
temps=debilt[:,2]/10

# Plot data set
plt.plot(temps)
plt.show()

# Generate test data set

# Loop over all days

# Print output

Next we generate a test data set to test our method on. We make a loop over all values of this data set and implement a simple loop that runs over all the numbers.

In [ ]:
# Load libraries
import numpy as np
import matplotlib.pyplot as plt

# Define parameters
Tsummer=25
index=0
length=0
lengthmax=0

# Read in data set
debilt=np.loadtxt('resource/asnlib/public/KNMI_20200129.txt', dtype=int, delimiter=',')
temps=debilt[:,2]/10

# Plot data set
# plt.plot(temps)
# plt.show()

# Generate test data set
test=np.array([3, 5, 20, 7, 12, 25, 23, 14, 10, 30, 30, 27, 5])

# Loop over all days
for cnt in range(len(test)):
    print(test[cnt])

# Print output

Now comes the hardest part, we have to implement the core of our code. We will see if we can implement the method in our sketch into working code. From our test data set we know that we should expect a summer day period length 3 starting at index 9 (Python indexing). Implement it and test it, this step may take some time.

Note that there is one small catch to the correct functioning of the program that you easily spot with the test data set (that's why we use it!). If the last day is a summer day adjacent to the longest summer day period (e.g. when we replace the 5 with 29) your code would give the incorrect answer. You could solve this by testing whether the last day is a summer day and then adding a dummy entry to the end of the array. For now, we manually check in the file that the last day is not a summer day, hence our code works correctly.

In [ ]:
# Load libraries
import numpy as np
import matplotlib.pyplot as plt

# Define parameters
Tsummer=25
index=0
length=0
lengthmax=0

# Read in data set
debilt=np.loadtxt('resource/asnlib/public/KNMI_20200129.txt', dtype=int, delimiter=',')
temps=debilt[:,2]/10

# Plot data set
# plt.plot(temps)
# plt.show()

# Generate test data set
test=np.array([3, 5, 20, 7, 12, 25, 23, 14, 10, 30, 30, 27, 5])

# Loop over all days
for cnt in range(len(test)):
    if test[cnt] >= Tsummer:
        length+=1
    else:
        if length>lengthmax:
            lengthmax=length
            index=cnt-length
        else:
            length=0
    
# Print output
print('The maximum length of summer days is ', lengthmax)
print('The longest summer starts at ', index) # Python index

Step 5: Implementing it on your final data set

Okay, the code works with the test data set. Now we can replace, the test data set with the real data set, which is a simple adjustment to your program. The only thing we have to do is extract the date at the correct location in the data set. We print everything to the command line. And we're done.

In [ ]:
# Load libraries
import numpy as np
import matplotlib.pyplot as plt

# Define parameters
Tsummer=25
index=0
length=0
lengthmax=0

# Read in data set
debilt=np.loadtxt('resource/asnlib/public/KNMI_20200129.txt', dtype=int, delimiter=',')
temps=debilt[:,2]/10

# Plot data set
# plt.plot(temps)
# plt.show()

# Generate test data set
test=np.array([3, 5, 20, 7, 12, 25, 23, 14, 10, 30, 30, 27, 5])

# Loop over all days
for cnt in range(len(temps)):
    if temps[cnt] >= Tsummer:
        length+=1
    else:
        if length>lengthmax:
            lengthmax=length
            index=cnt-length
        else:
            length=0
    
# Print output
print('The maximum length of summer days is ', lengthmax)
print('The longest summer starts at ', debilt[index,1]) # Python index

Benchmarking computational speed

An important consideration in programming is how long it takes your code to perform a certain task (how "fast" it is).

In this exercise, you will learn how to measure the speed of your code. For this, we will take an example of the calculation of the average (arithmatic mean) and standard deviation of a dataset.

Once we have coded this ourselves and understand how it works, we will also perform the calculation with the built-in numpy functions.

Finally, we will compare speed of our own functions with the built-in ones from numpy.

(a) Write a function that uses a for loop to calculate the average of numpy array. Test it using the following data: d = np.array([1,2,3,4,5,6,7,8,10])

In [ ]:
d = np.array([1,2,3,4,5,6,7,8,10])

###
### Your code here
###

(b) Write a function to calculate the standard deviation $\sigma$ of an array using a for loop, testing it again using d.

Hint: It is handy to remember that $\sigma_x^2 = \langle x^2 \rangle - \langle x \rangle^2$, where $\langle x^2 \rangle$ is the average value of $x^2$ and $\langle x \rangle$ is the average value of $x$.

In [ ]:
d = np.array([1,2,3,4,5,6,7,8,10])

###
### Your code here
###

(c) Now let's try finding the average and standard deviation for a very large (but boring) dataset! As we want to keep track of how fast our functions are, we will use the time library from Python.

The time library contains a function time() (see documentation here) that returns a floating point number corresponding to the number of in the epoch (typically the number of seconds since January 1, 1970, 00:00:00 (UTC)) with typically millisecond or better resolution.

To calculate how long each of our functions takes, we will record the time since the beginning of hte epoch right before and right after calling each function. Assuming your computer is not doing too many other things (like watching youtube videos or netflix in the background) this will give a pretty good estimate of how fast your code runs.

(On my computer, std_dev took 3.04 seconds typically)

In [ ]:
###
### Your code here
###

(d) Now let's try the same thing using the built-in functions np.average() and np.std(). How much faster do they run?

In [ ]:
###
### Your code here
###

What do you notice? Which one was faster?

On my computer, the averaging code took 1.58 seconds, whereas numpy took only 0.01 seconds!!! What is going on? Am I such a bad programmer?

The answer to this lies in the way that Python works. In particular, python is an interpreted language. This means that when you type code into python, it is not translated directly into "machine code" that can run on your processor, but intead into instructions that are fed one-by-one into the python interpreter program, which makes the execution of programs much slower.

This can get very bad for repetitive things like long for loops, as essentially the same instruction has to be loaded over and over into your processor one at a time, instead of just telling it directly to repeat the same thing over and over again.

The reason why np.average is so much faster is that this is a pre-compiled function: it already knows that it will be repeating the same thing over and over again and can load this all in one go into your process. For that reason, it is hundreds of times faster than using a for loop.

For that reason, in general, if you have python code that is using a for loop and you care about how fast it runs, it's always a good idea to check if there is a built-in function from python that can do it for you. Very often, there is, and for this reason well-written python code can often be just as fast as FORTRAN or C. The flip side of the coin is that when you are learning to program, it is always important to write out these things first with your own (slow) for loops so that you understand how the algorithm works.

Implement your own algorithm

A common task one encounters in experimental physics is picking out the positions of peaks in a dataset: these peaks could be the absorption resonances of an atomic gas, they could be the mechanical resonances of a object like a guitar string, or they could be peaks in a recorded sound signal. In this example, we will work with a “spectroscopy” dataset that represents the amplitude of a voltage measured as a function of a frequency applied to an object.

In general, if you look at a dataset, you can easily see with your eyes where the peaks are. However, it is a bit more tricky to have a computer do that for you. Obviously, for a few peaks you can do it yourself, but if you want to plogh through large datasets you have to rely on a computer. A naive approach, for example simply looking at the highest point will not work: it will just give you one peak, the tallest, even if there are multiple local “peaks” in your data.

One trick for this is to use the fact that between two “peaks” (maxima), there is always a “valley” (a minimum). The way the peak finding algorithm works is that you start at one side of the dataset and iterate through it left to right, as illustrated above. You can start by looking for a peak: for each step you take, if the next point is “higher” then the previous one, it becomes your candidate for where the peak is. As long as the next point keeps going up, then the position of the peak keeps moving with you. However, as soon as you see a step that goes down instead of up, then you know that the last point was actually a real peak, and you can make a note of its location and value.

Once you’ve found a peak, you then switch from looking for a peak to looking for a valley, and continue iterating through the dataset. Now you keep track of your best “candidate” for the position of the valley, updating it if the next point you look at is lower than your current “minimum” value. When you eventually take a step back up again, you know that the previous point was a valley, and so switch back to looking for a peak, continuing like this until the end of the dataset.

Exercise 1

Perform the design and implementation steps in this notebook to make a peak finding algorithm on the data set given in the file peaks.dat. Output the peak position frequency in variable fPeaks and the peak height in variable vPeaks. Yes, you have to start from scratch!

In [ ]:
###
### Your code here
###

answer_2c_1_1 = np.copy(fPeaks)
answer_2c_1_2 = np.copy(vPeaks)
In [ ]:
question = "answer_2c_1"
num = 2

to_check = [question + "_%d" % (n+1) for n in range(num)]
feedback = ""
passed = True
for var in to_check:
    res, msg = check_answer(eval(var), var)
    passed = passed and res
    print(msg); feedback += msg + "n"

assert passed == True, feedback
In [ ]: