using DataFrames # We need to load the DataFrames package to create a DataFrame
john_tuple = ("John", 25, 1.8)
john_ntuple = (name = "John", age = 25, height = 1.8)
john_dict = Dict("name" => "John", "age" => 25, "height" => 1.8)
john_df = DataFrame(name = "John", age = 25, height = 1.8);
Julia Basics
Key Terms
- REPL: The Julia REPL is the Julia Read-Eval-Print-Loop. This is the interactive command line interface for Julia. When you start Julia in the command line (terminal in Mac/Linux, command prompt in Windows), you are in the REPL, and it is a common way to interact with Julia.
- Package: A package is a collection of code that can be used to extend the functionality of Julia and complete specific tasks. Packages are installed using the Pkg package manager.
- Variable: A variable is a value or object that you have assigned a name. This may be as simple as a number or a sentence (a string variable), or as complex as a model or a plot.
- Function: A function is a block of code that performs a specific task. Functions are called by name and can take arguments, before completing some computation and returning a value or object. Sometimes functions are written and called for their side effects, i.e. they do not directly return an object, but instead perform some action.
- Method: A method is a specific implementation of a function.
- Multiple Dispatch: Multiple dispatch is a really exciting feature of Julia, but also one that is more difficult to understand for newer programmers. The basic premise is that in Julia, how functions behave depends on the types of the arguments that are passed to them. For example, the * operator (function) will behave differently if you try to multiply two integers (whole numbers), two floats (numbers with decimals), two matrices, or any combination of these. Each of these different behaviors is a different method of the * function.
What to Expect
As mentioned previously, this book (and page) is not meant to provide a ground-up description of everything you need to know about Julia. Instead, we’ll give an overview of some of the key concepts and features that should provide enough of an understanding that you can start using Julia with reasonable confidence. It’ll likely take a couple of passes through this page to really get a good understanding of the concepts, and that’s okay! It’s meant to act as a reference so you can come back to it later if you don’t understand something in the later, more-applied, sections. At the bottom of this page are some additional resources that you can use to gain a deeper understanding of Julia.
Data Types and Structures
There are a number of different data types and structures in Julia. Here are the key ones for you to be aware of.
Data Types:
- Integer
- Whole numbers
- Float
- Numbers with decimals
- Boolean
- true or false (written in lowercase)
- true has equal value to 1, e.g. 1 == true
- false has equal value to 0, e.g. 0 == false
In Julia, the == operator is used to check if two values are equal. It returns a boolean value, true or false, depending on whether the values are equal or not, i.e. 1 == true returns true because 1 and true are equal! It is different to the = operator, which is used to assign a value to a variable (see the Variables section below for more details).
In Julia there is also the === operator, which is used to check if two values are identical. This is different to the == operator, which checks if two values are equal. For example, 1 == true returns true because 1 and true are equal, but 1 === true returns false because 1 and true are not identical: 1 is an integer and true is a boolean, and === requires two values to have the same type as well as the same contents (for mutable objects such as arrays, it checks whether they are the very same object in memory).
- Char
- A single character, written with single quotes, e.g. 'H'
- String
- A sequence of characters, written with double quotes, e.g. "Hello World!"
Data Structures:
- Array
- An array is a collection of values that (usually) all share the same type. Arrays can be one-dimensional (vectors), two-dimensional (matrices), or multi-dimensional. Arrays are mutable, meaning that they can be changed after they are created. An example of an array is [1, 2, 3]
- DataFrame
- A DataFrame is a table-like data structure provided by the {DataFrames} package that is used to store tabular data. It is a collection of columns, where each column is a vector of values of the same type. DataFrames are mutable, meaning that they can be changed after they are created.
- Tuple
- A tuple is a collection of values that do not all have to be the same type. Tuples are very useful because they require very little memory, so are fast to create and access. They are also immutable, meaning that they cannot be changed after they are created, but because they are so fast to create, you can just create a new tuple with the values you want. An example of a tuple is ("John", 25, 1.8)
- Dictionary
- A dictionary is a collection of key-value pairs that do not need to be of the same type. Dictionaries are mutable, and are very useful for storing data that you want to access by a key (i.e. name), rather than an index. For example, you might want to store a person’s name, age, and height, e.g. Dict("name" => "John", "age" => 25, "height" => 1.8)
- Named Tuple
- A variant of the tuple is the named tuple. It is a cross between a tuple and a dictionary, and therefore has the benefit of being able to access values with keys instead of indices (though you can still use indices), but it is immutable and much smaller and faster than a dictionary. For our person example, a named tuple would look like (; name = "John", age = 25, height = 1.8). Note the ; at the beginning of the tuple. The use of semicolons is common in Julia to separate named arguments from unnamed arguments in functions. It is not essential when creating a named tuple with more than one element, but a named tuple with only one element needs either the leading semicolon, i.e. (; name = "John"), or a trailing comma after the pair, i.e. (name = "John",).
- Structs
- A struct is a custom data type that you can create to store data. It is similar to a named tuple in that it is immutable (by default) and you can access values by name instead of by index. One reason you may prefer to use a struct over a named tuple is that you can define methods for a struct (see the multiple dispatch section for more details). Creating structs is out of the scope of this book, but it is important to know that they exist and are a useful tool for organizing your data. If you want to learn more about structs, check out the documentation and this tutorial.
If you have an object and want to tell what type it is, you can use the typeof() function. If you have an array and want to tell what type the elements of the array are, you can use the eltype() function.
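As a rough illustration of the types and comparison operators described above, the sketch below inspects a few values. All of the variable names are made up for this example, and the integer type reported will depend on your machine.

```julia
# Illustrative values (the names are assumptions for this sketch)
an_integer = 1
a_float = 1.8
a_char = 'H'          # single quotes create a Char
a_string = "Hello"    # double quotes create a String
numbers = [1, 2, 3]

typeof(an_integer)    # Int64 on most modern (64-bit) machines
typeof(a_float)       # Float64
typeof(a_char)        # Char
typeof(a_string)      # String
eltype(numbers)       # the type of the elements, not of the array itself

# == compares values, === also requires the same type
1 == true             # true
1 === true            # false, because an integer is not a Bool
1 == 1.0              # true
1 === 1.0             # false

# Arrays are mutable, tuples are not
numbers[1] = 10       # fine: numbers is now [10, 2, 3]
person = ("John", 25, 1.8)
# person[1] = "Jane"  # would throw an error: tuples cannot be changed
```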
Variables
Variables are really just stored pieces of information that you’ve given a name to. This is useful because it allows you to run a calculation, for example, and then save it for use later on. That way you don’t need to run the calculation again, you can just pull the value out of storage! A slightly different example is if you have a constant value that you use multiple times in your code, e.g. the size of a population. Rather than typing out the value every time you need it, you can just store it in a variable and use the variable name instead. This not only saves you time and makes your code more readable, but can also reduce the chance of making a mistake (e.g. if you accidentally type the wrong value when copying it to a new calculation).
Assignment
Now we know what variables are, let’s look at how to create them. As mentioned earlier, we use the = operator to assign a value to a variable. For example, if we wanted to create a variable called x and assign it the value 1, we would write x = 1. But we aren’t just restricted to numbers, we can assign any type of value to a variable. This includes strings, arrays, tuples, dictionaries, and structs.
Earlier, when talking about data structures, we used the example of a person’s name, age, and height. Let’s see how we can create tuples, dictionaries, and dataframes to store this information.
When creating a dictionary, you can use the => operator to assign a value to a key. The key is always on the left, and the value is always on the right.
When an assignment ends with a semicolon (;), the semicolon has nothing to do with the value being assigned. It is used to suppress the output of the assignment, so when we run the code, we don’t see the result printed to the screen.
Because a person’s name is a string, their age is an integer, and their height is a float, an array is a poor fit for storing this information: arrays are designed to hold values of a single type, and mixing types forces Julia to fall back to the generic element type Any. To show how we can create and access arrays, let’s create a vector (1-D array) of multiple people’s names, as well as a small example matrix (2-D array).
people_vec = ["John", "Jane", "Joe"]
test_arr = [1 2 3; 4 5 6; 7 8 9]
3×3 Matrix{Int64}:
1 2 3
4 5 6
7 8 9
When creating a matrix, you can use a semi-colon to separate rows in the matrix. One alternative is to specify the exact positions of each value e.g.
test_arr = [
    1 2 3
    4 5 6
    7 8 9
]
Accessing Values
To access the value stored in a variable, we can often use indices. Julia, like R, is a 1-indexed language, meaning that the first element in an array has an index of 1, not 0 (like Python). In our examples, the first element of the objects we created is the person’s name, so we can access it with an index of 1.
john_tuple[1] # "John"
john_ntuple[1] # "John"
people_vec[1]
"John"
For dataframes and multi-dimensional arrays, we have to make a slight modification to use a comma that separates the indices for each dimension. In an array/dataframe, the first index is the row number, and the second index is the column number. To access the element in the first row and the first column of the array, we would use the following code.
test_arr[1, 1]
john_df[1, 1]
"John"
If we want to access an entire row or column, we can use the : operator. For example, if we want to access the first column of the array, we can use the following code.
test_arr[:, 1]
3-element Vector{Int64}:
1
4
7
john_df[:, 1]
1-element Vector{String}:
"John"
If we want to access the first row of the array, we can use the following code.
test_arr[1, :]
3-element Vector{Int64}:
1
2
3
In all cases where we used the : operator, we get a column vector as the output, not a single value, regardless of whether we are extracting a row or a column from the original array!
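The indexing rules above can be sketched with a couple of small objects. The names below are made up for this example; the last lines show one way (an assumption about what you might want, not the only option) to keep a row as a 1×3 matrix instead of a vector.

```julia
# Illustrative data (names are assumptions for this sketch)
names_vec = ["John", "Jane", "Joe"]
grid = [1 2 3; 4 5 6; 7 8 9]

names_vec[1]       # "John": indexing starts at 1
names_vec[end]     # "Joe": `end` is the last index
names_vec[2:3]     # ["Jane", "Joe"]: a range of indices returns a new vector

grid[2, 3]         # 6: row 2, column 3
col = grid[:, 2]   # [2, 5, 8]
row = grid[1, :]   # [1, 2, 3], returned as a 1-D Vector
size(row)          # (3,)

# Indexing with a range instead of a single number keeps that dimension,
# so this is a 1×3 Matrix rather than a Vector
row_matrix = grid[1:1, :]
size(row_matrix)   # (1, 3)
```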
However, none of these methods work for dictionaries. For dictionaries, you need to specify the key of the value you want to access.
john_dict["name"]
"John"
You can also use the key (or column name) to access the value in dataframes and named tuples.
john_df[1, :name] # The : operator before the column name turns it into a symbol that can be used to index the dataframe
john_df[1, "name"]
john_ntuple.name
"John"
Functions
Overview
Functions are a core part of programming in Julia, and programming in general. A function is a block of code that performs a specific task. As has been said before, a function is like a recipe you might use to bake a cake. The recipe tells you what ingredients you need, how to combine them, and how long to bake them for. And like a recipe, a function can be used over and over again to produce the same result (assuming you have identical inputs). This is a really powerful concept, and helps make your work and research reproducible by breaking up your code into small, reusable, and understandable chunks. And because it is meant to be reused, it will save you time in cases when you need to do the same thing multiple times (you don’t want to have to write the same code over and over again)!
So let’s look at a simple example of a function in Julia, and use it to explore some of the key concepts of functions.
Say we want to take a number, multiply it by 2, and then divide the result by 3. You could just write this out explicitly, but what if you want to do this for a bunch of different numbers? This is where a function comes in handy.
function multiply_by_two_divide_by_three(x)
    y = x * 2
    z = y / 3
    return z # it's good practice to explicitly return a value (or nothing in special cases)
end
multiply_by_two_divide_by_three (generic function with 1 method)
This function takes a single argument, x, and then multiplies it by 2 and divides it by 3. The return keyword tells Julia what value to return from the function. It also tells Julia that the function is finished, and it will not execute any code after the return statement.
Let’s try using this function.
multiply_by_two_divide_by_three(3)
2.0
multiply_by_two_divide_by_three(10)
6.666666666666667
Note that in both of these examples, a floating point number is returned i.e., a number with decimals.
Without going into too much detail, it is good practice to give functions short, descriptive names. A good example would be cumsum(), which is provided in Julia and calculates the cumulative sum of a vector.
If a function name is too long to write without separating the words, use snake case (words separated by underscores), e.g. multiply_by_two_divide_by_three(), rather than leaving it as a single block of text (multiplybytwodividebythree()), or using camelCase (multiplyByTwoDivideByThree()).
It is also good practice to add a docstring to your function. This is a short description of what the function does, and can be accessed by typing ? followed by the function name in the REPL. This means that you can quickly understand exactly what a function does without having to work your way through the code, really helping others who may read your code, but also future you if you revisit a project.
An example of adding a docstring to a function may be as simple as adapting our original code to look like the following.
"""
    multiply_by_two_divide_by_three(x)

Multiply `x` by 2 and divide by 3.
"""
function multiply_by_two_divide_by_three(x)
    y = x * 2
    z = y / 3
    return z
end
Read more about docstrings here.
Arguments & Keyword Arguments
Unlike R, Julia makes a distinction between arguments and keyword arguments. Arguments are the values that are passed to a function. In the example function above, x is an argument. In Julia, arguments are positional, meaning that the order in which you pass them to a function matters. To see this in practice, let’s write a new function that takes two arguments, x and y, and multiplies them together after subtracting one from argument x and adding one to argument y.
function multiply_together_offsets(x, y)
    z = (x - 1) * (y + 1)
    return z
end
multiply_together_offsets (generic function with 1 method)
Because Julia uses positional arguments, the following two function calls will return different values, even though the numbers 5 and 10 are used in both.
multiply_together_offsets(5, 10)
44
multiply_together_offsets(10, 5)
54
Keyword arguments are arguments that are passed to a function by name. Generally speaking, keyword arguments are used to set default values for arguments that can be changed by the user. Let’s modify our multiply_together_offsets function to use keyword arguments.
function multiply_together_offsets(x, y; offset_x = 1, offset_y = 1)
    z = (x - offset_x) * (y + offset_y)
    return z
end
multiply_together_offsets (generic function with 1 method)
We have added two new arguments to the function, offset_x and offset_y, and given them default values of 1.
In a function definition, the semi-colon (;) is what separates keyword arguments from positional arguments, so it is required there. When calling the function, the semi-colon is optional (a comma also works), but it is generally good style to use it, and to place keyword arguments after all positional arguments.
Now, when we call the function, we can specify the values of these arguments by name.
multiply_together_offsets(5, 10)
44
multiply_together_offsets(5, 10; offset_x = 2, offset_y = 3)
39
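To make the difference between positional and keyword arguments a little more concrete, here is a minimal sketch using a made-up function, scale_value(), which is not part of the examples above.

```julia
# Hypothetical function: one positional argument and one keyword argument
function scale_value(x; factor = 2)
    return x * factor
end

scale_value(5)               # 10, uses the default factor
scale_value(5; factor = 3)   # 15
scale_value(5, factor = 3)   # 15, a comma also works when calling
# scale_value(5, 3)          # MethodError: factor can only be passed by name
```

The commented-out call fails because there is no method of scale_value() that accepts two positional arguments.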
Scope
Scope is a relatively complicated concept, but it is important to understand it in order to write functions that are easy to understand and debug. Scope refers to the region of code in which a variable is visible. In Julia, variables that are defined within a function are not visible outside of the function. The reverse is not true, however. Variables that are defined outside of a function are visible within the function, but cannot be reassigned from inside it (unless you explicitly use the global keyword).
Let’s look at some examples.
function add_one(x)
    y = x + 1
    return y
end
add_one(5)
6
y
UndefVarError: `y` not defined
In this case, y is defined within the function add_one(), i.e. it is a local variable, and is therefore not visible outside of the function, but it can be used within the function!
global_x = 5

function print_global_x()
    return println(global_x)
end
print_global_x()
5
In this example, global_x is defined outside of the function print_global_x(), and is therefore visible within the function, but it cannot be reassigned there (at least not without the global keyword).
It’s not good practice to access global variables in your functions. Instead, if you want to use a variable in your function, pass it as an argument.
function modify_global_x()
    global_x = 10
    return global_x
end
modify_global_x()
global_x
5
Here, we have tried to modify global_x within the function modify_global_x(), but this has not worked. It looks like it worked when we called the function, but when we check the value of global_x outside of the function, it is still 5. This is because the assignment inside the function created a new local variable that happens to share the name global_x, leaving the global variable untouched.
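For completeness, the sketch below contrasts an ordinary assignment inside a function with one that uses the global keyword. The names counter, shadow_counter(), and update_counter() are made up for this illustration, and passing values in as arguments remains the better habit.

```julia
counter = 0   # a global variable (illustrative name)

function shadow_counter()
    counter = 10        # creates a new local variable called counter
    return counter
end

function update_counter()
    global counter = 10 # the global keyword makes the assignment target the global variable
    return counter
end

shadow_counter()           # returns 10
after_shadow = counter     # still 0: the global was not touched
update_counter()           # returns 10
after_update = counter     # now 10
```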
Multiple Dispatch
Multiple dispatch is the idea that the behavior of a function depends on the types of the arguments that are passed to it (as well as the number of arguments). To illustrate this, let’s go back to our original example function multiply_by_two_divide_by_three().
In the example above, we passed a single argument to the function, and it returned a floating point number. But what if we wanted to pass multiple numbers to the function, and have it return a vector of the results? We could do this by specifying another method of the function that accepts a tuple of numbers as an argument.
function multiply_by_two_divide_by_three(x::Tuple)
    y = zeros(Float64, length(x))
    z = similar(y)

    for i in eachindex(x)
        y[i] = x[i] * 2
        z[i] = y[i] / 3
    end
    return z
end
multiply_by_two_divide_by_three (generic function with 2 methods)
We have defined a new method of the function (i.e., a new way of using the function) by specifying the type of the argument x as a tuple (::Tuple), and this is illustrated in the printout multiply_by_two_divide_by_three (generic function with 2 methods).
You don’t have to understand exactly what the code is doing here (but have a look at the for loop section if you’re interested). Neither is the code particularly efficient, but it’s a relatively readable way to illustrate the concept of multiple dispatch.
Now let’s test out our new function method
multiply_by_two_divide_by_three((1, 2, 3))
3-element Vector{Float64}:
0.6666666666666666
1.3333333333333333
2.0
And we can see that the original method that just takes a single number as an argument still works.
multiply_by_two_divide_by_three(3)
2.0
It is important to note that keyword arguments are not considered in multiple dispatch i.e., trying to define a new method of a function that differs only by keyword arguments will not create a new method, but just overwrite the old one. So if you want/need a new method, use positional arguments.
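As another small sketch of the same idea, the made-up function describe() below has one method per argument type, plus one that differs only in the number of positional arguments. It uses Julia's compact one-line form for defining functions, which is equivalent to the function ... end form used elsewhere on this page.

```julia
# A hypothetical function with several methods
describe(x::Integer) = "an integer"
describe(x::AbstractFloat) = "a float"
describe(x::AbstractString) = "a string"
describe(x, y) = "two arguments"

describe(1)        # "an integer"
describe(1.5)      # "a float"
describe("hi")     # "a string"
describe(1, 2.0)   # "two arguments"

n_methods = length(methods(describe))  # 4
```

Julia picks the method whose argument types best match the values you pass, so the same function name can cover all of these cases.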
Packages
Packages are an essential part of the Julia ecosystem. You’ve already seen an example of a package in action: {DataFrames}. At its core, a package is a way for someone to share code, data, and documentation with other people. By design, Julia can’t do everything for everyone straight out of the box. Not only would it be an impossible task for the Julia developers to create a language that can do everything, but it would also be incredibly slow to load and run. Instead, packages extend the abilities of Julia by providing additional features (through functions) that are not included in the base language.
The {DataFrames} package, for example, creates a special data structure that is very easy to read, as well as providing a number of functions that make it easy to manipulate and analyze data.
To add a package to your Julia environment (project), you can use the add command in the package manager (accessed by pressing ] in the REPL). Then, you can use the using command to load the package into your current Julia session. See this page for more information in the context of setting up a new project.
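As a hedged sketch of that workflow, the installation step is shown only in comments because it needs internet access. The runnable part loads Statistics, a package that normally ships with Julia, which is an assumption about your installation.

```julia
# In the REPL's package mode (press ]), you would type:
#   add DataFrames
# The equivalent from ordinary Julia code uses the Pkg standard library:
#   using Pkg
#   Pkg.add("DataFrames")   # downloads and installs the package

# Once a package is installed, `using` loads it into the current session.
# Statistics is assumed to be available because it is bundled with Julia.
using Statistics

avg = mean([1, 2, 3, 4])   # 2.5, mean() is provided by Statistics
```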
Control Flow
Control flow refers to the order in which the statements in a program are executed. There are many different ways to control how a program is executed, but we will focus on the most common ones here. See the Julia documentation for more information.
If Statements
If statements are a way to control whether or not a block of code is executed, and fall under the general category of conditional evaluation (but I think “if statements” gives you a more intuitive sense of what we’re talking about in this section). There are many uses for conditional evaluation, so we’ll just show you some examples of how to use it, and you can explore further if you need to. The following is an example from the Julia documentation.
function number_relations(x, y)
    if x < y
        relation = "less than"
    elseif x == y
        relation = "equal to"
    else
        relation = "greater than"
    end
    return println("x is ", relation, " y.")
end
number_relations(2, 1)
x is greater than y.
In this example we are using an if statement to determine the relationship between two numbers. It’s important to note that the conditional statements are evaluated in sequence, and the first one that evaluates to true is executed, i.e. if, then elseif, then else in this example.
elseif and else statements are both optional (i.e. just an if statement is valid), and you can have as many elseif statements as you like (including 0, i.e. just if and else statements).
Short-Circuit Evaluation
If you want to check multiple conditions, you can use the && (and) and || (or) operators. These use what is known as short-circuit evaluation: the second condition is only evaluated if the first one has not already decided the result. For example, if we wanted to check if a number is between 0 and 10, we could do the following.
function number_between(x)
    if x > 0 && x < 10
        println("x is between 0 and 10")
    else
        println("x is not between 0 and 10")
    end
end
number_between(3)
x is between 0 and 10
number_between(11)
x is not between 0 and 10
In number_between(3), x is greater than 0 and less than 10, so both conditions evaluate to true, and the code in the if statement is executed. In number_between(11), x is greater than 0, but not less than 10, so while the first condition evaluates to true, the second condition evaluates to false, so the code in the else statement is executed. This is important to understand: all conditions joined by && must evaluate to true for the code in the if statement to be executed! Based on this, try to think about why the following code also works.
function number_between2(x)
    if x > 0 && ((x > 10) == false)
        println("x is between 0 and 10")
    else
        println("x is not between 0 and 10")
    end
end
number_between2(3)
x is between 0 and 10
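To see the short-circuit part in action, the sketch below uses a made-up helper, check(), that records its label every time it is called. The labels that are missing from the log show which conditions Julia never evaluated.

```julia
# Hypothetical helper that records when it is called
calls = String[]
function check(label, result)
    push!(calls, label)
    return result
end

check("first", false) && check("second", true)   # "second" is never evaluated
check("third", true) || check("fourth", true)    # "fourth" is never evaluated
calls   # ["first", "third"]

# A common idiom: only run something if a condition holds
x = 5
x > 0 && println("x is positive")
```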
Iteration
Iteration is a useful concept in programming, is a pretty intuitive way to think about many problems we come across in epidemiology (once you get used to it), and is very fast in Julia, so it’s worth spending some time to understand it. Do not expect to understand everything about iteration after reading this section, and you will likely need to come back to refer to it as you go through the book, but hopefully it will provide a good starting point for you to explore further.
For Loop
The most common way to iterate in Julia is using a for loop. We have already seen a for loop in the multiple dispatch section, but let’s look at a simpler example.
Let’s say we want to calculate the sum of the numbers from 1 to 10, i.e. 1 + 2 + 3 + … + 10. Julia has built-in functions for this (sum() for the total, and cumsum() for the running totals), but let’s write our own function to do it using a for loop.
There are multiple ways we could write this function, but the most intuitive way is to go through each of the numbers in 1 to 10, and add them to a running total.
function mycumsum(x)
    y = 0 # Initialize our running total to 0

    # For each index of x, add the corresponding number to our running total
    for i in eachindex(x)
        y += x[i] # This is equivalent to y = y + x[i]
    end
    return y
end
mycumsum(1:10)
55
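A for loop can also step through the values of a collection directly, without using indices at all. The function total_of() below is a made-up name for this sketch, and the last two lines compare it with Julia's built-in sum() and cumsum().

```julia
# Hypothetical function that loops over values rather than indices
function total_of(values)
    total = 0
    for value in values      # value takes each element in turn
        total += value
    end
    return total
end

total_of(1:10)                 # 55
total_of([2, 4, 6])            # 12
total_of(1:10) == sum(1:10)    # true
running = cumsum([1, 2, 3])    # [1, 3, 6], the running totals
```

Looping over values avoids any assumption about how the collection is indexed, so it also works for inputs such as [2, 4, 6] where the values are not valid positions.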
While Loop
Another way to iterate is using a while loop. The difference between a for loop and a while loop is that a for loop iterates over a sequence of values, whereas a while loop iterates until a condition is met. For example, let’s say we want to keep adding numbers to our running total until the total is greater than 100 (and stop counting). We might not know how many numbers we need to add to get to 100, so we can’t use a for loop, but we can use a while loop.
function mycumsum2(x)
    y = 0 # Initialize our running total to 0
    i = 1 # Initialize our counter to 1

    # While our running total is less than 100, add the next number to our running total
    while y < 100
        y += x[i] # This is equivalent to y = y + x[i]
        i += 1 # Update our counter so we can add the next number
    end
    # i points at the next number to add, so i - 1 numbers have been added so far
    return println("We added ", i - 1, " numbers to get to ", y)
end
mycumsum2(1:100)
We added 14 numbers to get to 105
while loops are very useful in many situations, but are more dangerous than for loops, because it’s easy to get stuck in an infinite loop. For example, if we accidentally passed 0:100 to our function and forgot to update our counter, we would keep adding 0 forever, the condition y < 100 would always be true, and the loop would never end. To avoid this, people often add break statements to their while loops, which will break out of the loop if a certain condition is met, i.e. if we added 100 numbers and still haven’t reached 100, we can exit the loop early. Generally speaking, use a for loop if you can, and be careful when using while loops.
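One way such a guard might look is sketched below. The function sum_until() and its max_steps keyword are made up for this example. It stops either when the target is reached or when it runs out of numbers (or allowed steps), whichever comes first.

```julia
# Hypothetical variant with a safety limit on the number of iterations
function sum_until(x, target; max_steps = 1_000)
    total = 0
    steps = 0
    while total < target
        if steps >= max_steps || steps >= length(x)
            break                # give up rather than loop forever
        end
        steps += 1
        total += x[steps]
    end
    return (total = total, steps = steps)
end

sum_until(1:100, 100)            # (total = 105, steps = 14)
sum_until(zeros(Int, 5), 100)    # (total = 0, steps = 5): stopped by the guard
```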
Map
An alternative to loops is the map() function. If you are familiar with functional programming (or the {purrr} package and functions in R), the map() function will be easy to grasp. If not, not to worry, as it’s just a different method of applying a function to each element in a sequence. The main difference to be aware of is that each application of the function happens independently of the others, so you can’t increment a counter and then update a value from the prior iteration of a loop.
Say we want to take an array of integers and return an array of their squares. We can use a for loop (shown below) to do this. Or we could use the map() function, which accepts a function and an array as arguments.
function squares(x)
    y = zeros(eltype(x), length(x))

    for i in eachindex(x)
        y[i] = x[i]^2
    end
    return y
end
squares(1:10)
10-element Vector{Int64}:
1
4
9
16
25
36
49
64
81
100
map(x -> x^2, 1:10)
10-element Vector{Int64}:
1
4
9
16
25
36
49
64
81
100
In the above example, we are using an anonymous function that takes each element of the array 1:10, assigns it to the variable x, and then squares it before returning it in a Vector. The map() function returns a vector of the same length as the input array, which is nice when we know we want the output vector to be the same length as the input vector, as it removes the need for us to manually perform bounds checking.
In slightly more complicated scenarios, the anonymous function may become unwieldy. Here, we can either write a named function that we use in the map() function, or use a do block.
function square_element(x)
    return x^2
end
map(square_element, 1:10)
10-element Vector{Int64}:
1
4
9
16
25
36
49
64
81
100
Much like the anonymous function, when we use a do block, we assign the elements being iterated over to a variable name that we can manipulate.
map(1:10) do x
    x^2
end
10-element Vector{Int64}:
1
4
9
16
25
36
49
64
81
100
The do block is useful when we only want to perform some operations once, so it’s not necessary to create a named function. It is also very helpful when we want to pass multiple arguments to a function, as we will see.
Here, we have two vectors of the same length and we want to perform element-wise addition, i.e. add the first elements of each vector together, then the second elements, and so on. Doing this without the do block is possible, but much more cumbersome. Here, we can use the zip() function to combine the two vectors into a sequence of tuples that can be iterated over by map().
vec_a = 1:10
vec_b = 11:20

map(zip(vec_a, vec_b)) do (a, b)
    a + b
end
10-element Vector{Int64}:
12
14
16
18
20
22
24
26
28
30
collect(zip(vec_a, vec_b))
10-element Vector{Tuple{Int64, Int64}}:
(1, 11)
(2, 12)
(3, 13)
(4, 14)
(5, 15)
(6, 16)
(7, 17)
(8, 18)
(9, 19)
(10, 20)
Without the do block, we could write this. Note the , between the two closing parentheses in the anonymous function, i.e. ...b),)
map(
    ((a, b),) -> a + b,
    zip(vec_a, vec_b)
)
10-element Vector{Int64}:
12
14
16
18
20
22
24
26
28
30
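As an aside, zip() is not the only option here. The sketch below shows two alternatives that, as far as I am aware, give the same element-wise sums: passing several collections straight to map(), and Julia's broadcasting ("dot") syntax. The names sums_map, sums_plus, and sums_dot are made up for this example.

```julia
vec_a = 1:10
vec_b = 11:20

# map() accepts several collections and passes one element from each to the function
sums_map = map((a, b) -> a + b, vec_a, vec_b)

# + is itself a function, so it can be passed to map() directly
sums_plus = map(+, vec_a, vec_b)

# Broadcasting applies + element by element
# (for ranges the result may itself be a range rather than a Vector)
sums_dot = vec_a .+ vec_b
```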
Additional Resources
I’d recommend checking out the following resources to learn more about Julia (roughly in descending order of preference due to complexity and target audience)