## Backend Engineering: distributed file system

Nowadays, we often have to process large files and collect various types of data. For example, server log files from a website can be several gigabytes per day. When working with gigabyte or

## Dynamic Programming 2 (Shortest Path, DAG)

In the first part, we learned about the Fibonacci Number. In this article, we will talk about the shortest path problem and will learn some properties of Dynamic Programming along the way. The shortest path is a graph problem but you don’t really need to know about graph algorithms for this article.

Let’s say we need to find the shortest between one city to another. There are $n$ cities in total, $0$ is the first city and $n-1$ is the last. The path and distance between pairs of cities are marked with arrows like the picture below:

Now we need to find the length of the shortest path between $0$ to $n-1$.

#### Finding Subproblems/States:

First, we need to find that are the parameters or states can define the problem. Remember the word ‘state’, we will be using it a lot. For this problem, the state is defined by the city you are currently in. Let’s say you are in city $u$, we will express our problem as a function $F(u)$. Just like Fibonacci, we will now try to find a recursive solution.

#### State Transition and Recursion

We don’t know which is next city to from $u$ to reach the destination in the shortest path. In this step, we will guess the next city and every guess will be a new subproblem. In the graph, we can go to the city $1$ and $3$ from $0$. So we will get the shortest path from $f(0)$ using this formula:

$f(0) = min(f(1) + 2, f(3) + 1)$

The idea is to go to each adjacent city and find the shortest path from there. The answer will bee the minimum of all the guesses. In general:

$f(n – 1) = 0 \\f(u) = min(f(u, v) + w(u, v)) \ where \ (u,v) \in E \ for\ u \ne n – 1$

Here $w(u, v)$ is the distance between $u$ and $v$.

#### Sub-problem Ordering

Now let’s think about the ordering of the function calls after we run the recursion.

Now we are in big trouble, the subproblems created a cycle. For example, to find the value of $f(0), we need to. know the value of$f(0)$but$f(2) again depends of $f(0)$. If we try to run this as a code, it will stick in an infinite loop. This brings us to the concept of DAG.

#### Directed Acyclic Graph (DAG)

A DAG or directed acyclic graph is a directed graph with no cycles. A dynamic programming solution will work only if the subproblems create a DAG.

Even though we can’t solve the shortest path problem in a general graph using the formula above, we still can solve it for a DAG. In the graph above, we can remove the $2->0$ arrow and it will become DAG.

#### The C++ Code

Now we can solve it using DP. The C++ code will be like this:

One interesting fact is, we can still find the shortest path a general graph using DP but we need to add an extra parameter $k$ which indicates the maximum number of edges we are allowed to use. Rings a bell? Yes, that is very similar to the bellman ford algorithm.

#### Complexity

We have $n$ subproblems in this problem $f(0), f(1)…f(n-1)$. Moreover, for every city $u$, we are going to a maximum of $n$ adjacent cities. Total complexity will be the multiplication of the number of subproblems and the complexity of internal operations which is $O(n^2)$.

#### Iterative Version

For this problem, writing the iterative version is a bit difficult but possible. For the iterative version, we need to arrange the subproblems in topological order because, in each subproblem, you must make sure you solved the dependant subproblems. For Fibonacci, it was very easy but for this graph, it’s a bit tricky. You can google about topological sorting in a graph, I also discussed in brief here.

For the example above, we will get ${0, 3, 1, 2, 4}$ after sorting. Now you can loop this array in reversed order for building the table. It’s ok if you can’t implement that for this problem, it’s quite complicated and won’t give any advantage. We will learn more about the Iterative version when we learn LIS and knapsack.

#### Conclusion

So far you have learned:

• How to define the subproblems
• How to transit from one subproblem to another to find the optimal solution
• What is DAG, why DP can’t work if there is a cyclic dependency?

Still not getting confidence? No problem, this just the start, you will be able to solve some problems after you read the next part.

Happy Coding

## Dynamic Programming 1 (Fibonacci)

Dynamic Programming may sound intimidating but the concept is quite easy. In dynamic programming, we divide a problem into multiple subproblems and build-up the solutions from that, but we must be careful not to solve one sub-problem more than once. That is dynamic programming in a nutshell. But to understand if a problem can be solved by dynamic programming or not, you need some experience. In this series, we will know what is Dynamic programming and solve some classic dynamic programming problems. I originally wrote the series in Bengali, this is the translated version.

Dynamic programming is a confusing term, especially the ‘dynamic’ part because it’s not really co-relate to how dynamic programming works. Dr. Richard Bellman used this term to save his research grant, strange but true! So don’t waste time trying to find the meaning behind the naming.

One of the pre-requisite to learn dynamic programming is understanding recursion. Also, I hope you know the basics of time and space complexity (Big O notation).

I am using the lecture series of MIT by Dr. Eric Demaine as the reference for this series. You can watch it here.

### Dynamic Programming

Dynamic programming is not an algorithm to solve a particular problem. Rather, it’s a problem-solving technique which can be used to solve many kinds of optimization or counting problem. For example Fibonacci, Coin Change, Matrix Chain Multiplication. Dynamic programming is a kind of brute force but it’s a clever brute force. There are many kinds of problems for which dynamic programming is the only known polynomial solution.

### Fibonacci Sequence

Fibonacci is not the best example of dynamic programming but it’s the easiest to start with. An Italian Mathmetician created this sequence while observing the production rate of rabbits. The sequence is like this::

0, 1, 1, 2, 3, 5, 8, 13, 21, 34 ….

Except for the first two numbers, every number is the summation of the previous two numbers. Let’s imagine a function $F(n)$ which returns the $n^{th}$ Fibonacci number. If we define it recursively, it will look like this:

The C++ code for this looks like this:

Now we divided $f(n)$ into two smaller subproblems. Do you know the time complexity for this code? If we call $F(5)$ and represent the function calls as a tree, it will look like this:

Except for the base case, every function is calling two other functions. The time complexity is $O(n^2)$ which is not really great, it’s exponential.

You might have noticed that I highlighted $f(3)$ in two places.  This is to show that for calculating $f(5)$ we are calculating $f(3)$ twice which is clearly a waste!

One of the principles of dynamic programming is we won’t solve one subproblem more than once. So what can we do? The answer is caching the results in a table. If we see that we already have the results for a subproblem in the table, we don’t need to call the function again.

#### The C++ Code

We used an array called $memo$ to save the results. We initialized it with $-1$ to indicate that the table is empty. You need to choose a value that can’t be part of the answer. During every function call, we are first checking if the result exists in the array.

#### Complexity

Now the complexity of the code becomes $O(n)$ as we have $n$ subproblems and we are solving every subproblem once. Inside the function, all the operations have constant time complexity.

#### Memoization

The process of saving the result for later use is called ‘Memoization’, another weird terminology. Memoization Came from the Latin word memorandum, there is actually not much difference with memorization. The term was coined by researcher Donald Mitchie.

#### Iterative Version

The implementation can be iteratively as well. All we need to do is arranging the sub-problems in topological order. That means we need to solve the subproblems which don’t depend on any other subproblems first (base case) and gradually build the results from smaller subproblems.

For Fibonacci, we already know the results of $f(0)$ and $f(1)$. We can save these values in the table first and gradually solve $f(2), f(3) …. f(n)$ in this order.

For solving the Dynamic Programming problem, it’s easier to find recursive formula at first. After that, we can write the code either recursively or iteratively. In fact, it’s easier to write the recursive code, it works almost like magic. The iterative version can be slightly difficult if there is more than one parameter.

But it’s important to understand the iterative version as well because often it provides some chance to save space.  For example, in the code above, you only need the value of the last two entries of the table, so you don’t really need to maintain the whole table.

There are some standard ways to attack dynamic programming problems. Next, we will see some optimization problem which will give us a better idea of the pattern.

(Next Part)

## ACM ICPC 2019 Dhaka Regional Problem H (Droplets)

Finally, the contest is over. And it’s time for the editorial of my problem. This year, I have submitted two questions for the regionals. Between these two problems, one got selected for the Dhaka regional. And that problem appeared as number H (Droplets) in the contest. Hence this problem itself was quite simple, it was supposed to be the 3rd most manageable problem of the contest. But, somehow, it turned out to be the 2nd easiest one.

[Few days before the contest, we realized that a similar problem appeared in a national-level contest in RUET. But ICPC committee decided to keep the problem because this is a much much easier version, the RUET version was one of the hardest problem in that contest.]

##### The Problem

The problem is about filling up a matrix with water. There is a matrix given and you have to imagine it as a 2-d box. Some parts of the matrix are enclosed. Now, you need to find out the amount of water the matrix can hold. Please read the full statement here carefully before proceeding.

These are the pictures I used in the statement to describe the problem. What the problem asked you to do is quite clear from these pictures.

##### The Solution

Let’s consider the smallest unit of the matrix first to solve this problem. Here, the smallest unit is the little squares. Now, for each square in the matrix, how to identify whether there is water in the square or not?
For that to happen, the water must have a way to reach the square from the top. If you understand this part, you already know this is a graph problem. Therefore, treat every square as a node of the graph. And, a node is connected to its neighbor if no lines are blocking them.

As you have the graph now, scan the top squares from left to right. Every time you encounter a square on the first row whose top is not blocked, you found a way for the water to enter into the matrix. Afterward, run a depth-first-search from that node and mark all the nodes you can reach. Keep continuing this process, and in the end, the number of nodes you reached is the answer.

The time complexity of this solution is O(n * m) for each test case.

Here is the judge solution:

That’s it, I hope you enjoyed the problem!

##### Last but not least

Not sure whether you have noticed it or not, I have used climate-change as the theme of this problem. Recently 11000 scientists from all over the world declared a climate change emergency, which is quite alarming for all of us. Certainly, it is high time we need to be conscious about the rapidly changing climate. I am personally anxious about it, and you should be too. Let’s do whatever we can to reduce resource wastage, plastic usage, and whatever we need to save our mother planet.

Thank you.

## Floyd’s Cycle Finding Algorithm

Suppose you have a linked list, now you have to identify whether there is any cycle in that linked list or not. This type of question is quite common for the interview. Today we will try to solve this problem using Floyd’s cycle finding algorithm.

In the picture above, a cycle with length $7$ is present in the linked list.

### Detect Cycle In A Linked List

The easiest way to detect a cycle is to use a dictionary or hashmap. For that, we need to traverse the list from the very first node and save each node in the dictionary. After reaching to any node, if that node is found to be already present in the dictionary, then the linked list is cyclic. The complexity and memory complexity of this algorithm is $O(n)$.

But we can reduce the memory complexity of this algorithm to $O(1)$ using Floyd’s algorithm. It is also known as the ‘Hare and turtle’ algorithm.

### Hare And Turtle Algorithm

Let’s assume we have two pointers; one is the turtle pointer, which we will denote by $’T’$, and the other pointer is the hare pointer, which we will denote by $’H’$. In the beginning, both of the pointers will be at the root node of the linked list.

In each second, the $H$ pointer will move two steps, but the $T$ pointer will move only one step. Then after one second, the position of the two pointers will be like this :

After one more additional second, $T$ will be at node $12$ and $H$ at the node $90$. After simulating these steps for some time, we will find that the two pointers will meet at the node $28$.

If the two pointers meet at the same node, it indicates that there is a cycle present in the linked list. If there were no cycles in that list, then the $H$ pointer would reach the endpoint of the list by approaching ahead.

Now the question is how we can detect the first node of the cycle?

### How To Detect The First Node of The Cycle

Let’s assume,

$m$= distance between the root node and the first node of the cycle,

$k$= distance between the first node of the cycle and the meeting point of the two pointers.

$l$= length of the cycle.

Now, if $T$ meets $H$ after circling the cycle for $C_{T}$ times, then the distance traveled by $T$ will be:

$D_{T}=m+(C_{T}∗l)+k$

And, if $H$ meets $T$ after circling the cycle for $C_{H}$ times, then the distance traveled by $H$ will be:

$D_{H}=m+(C_{H}∗l)+k$

As the $H$’s speed is two times of $T$’s speed, so we can say that:

$2∗(m+(C_{T}∗l)+k)=m+(C_{H}∗l)+k$

After some rearranging, it can be written as:

$m+k=(C_{H}–2∗C_{T})∗l$

Here, $l$ is the length of the cycle, so $m+k$ is a multiple of the length of the cycle. It means that if we travel $m+k$ length from the first node of the cycle, then we will return to the first node again. This is the most important part of the algorithm to understand.

Now, if you move $m$ steps ahead from the meeting point, then you will return to the first node of the cycle again. The cause is, the distance from the first node of the cycle to the meeting point is $k$. But, the value of both $m$ and $k$ is unknown. Then, how can you move $m$ steps?

### The Value of $m$

An interesting and quite easy way is there to determine the value of $m$. I hope you still remember that the distance between the root node and the first node of the cycle is also $m$.

Let’s assume; no $H$ pointer is there anymore, but another turtle pointer $’T_{2}’$ is now present at the root node, and $T$ is still in that meeting point. Now, if you move both the $T$ and $T_{2}$ pointer one step at a time, then the point where these two pointers will meet is the first node of the cycle!

This is the cycle detection algorithm of Floyd. The time complexity of this algorithm is still $O(n)$ (Why?), but the memory complexity has reduced to $O(1)$.

### C++ Code for Cycle Detection Algorithm

Let’s see a $C++$ code here:

Here is a golang implmentation if you are interested.

### Other Uses of Floyd’s Cycle Finding Algorithm

Besides detecting cycles in a linked list, this algorithm can also be used in some other cases. For example, it can be used to identify cycles in any mathematical functions or pseudo-random number generator.

That’s it, now you know how cycle finding algorithm works. If you want to test yourself whether you are clear about this algorithm or not, you can try to solve this: uva 350 pseudo-random numbers.

## Square Root Decomposition

Using the square-root decomposition technique, we can reduce the time complexity of several types of problems from $O(n)$ to $O(Sqrt(n))$. I will start with an example. Let’s say we have an integer array and we have to perform several operations on them. The operations can be of two types. The first type is, we have the find the sum of all the integers between index $l$ to $r$. The second type is, we have to update the value of index $i$. Some of you may already know how to solve this using a segment tree or binary search tree, today we will solve it with a new technique.

What is the naive way to solve this? Every time we are asked to find the sum, we will loop from $l$ to $r$, and for an update operation, we will update the value in index $i$. In this case, the complexity to find the sum is $O(n)$.

Now let’s try to optimize it. We will divide the array into multiple segments. Every segment will contain the sum of all integers of that segment. See the picture below:

The size of the array is $13$. When we create the segments, we want to ensure that the size of every part is almost the same. Also, the number of segments in the array should be nearly the same as the size of the segments. That’s where square root helps us. The square root of $13$ is approximately $3.61$ or $3$ if we ignore the decimal points. If we divide the array into segments of size $3$, we will get $13/3 = 4$ segments. The size of each segment will be $3$ except for the last part.

If the size of the array is not a square number, we will get one extra segment which is a bit smaller, but that’s not a problem. We will see soon how dividing in this way will help us immensely.

Now when we are asked to find the sum in the range $[l, r]$, we will check which segments fall into the range, and we will get the sum for those segments.

At first, we need to pre-process the segments. See the code below:

We are traversing the input array and every time $i % segment_size == 0$ becomes true, we go to next segment.

For $[l, r] = [1, 10]$, we need to find sum for the segments shown in the picture below:

The problem is, some of the segments are only partially in the range. The first and last segments can have this issue. In that case, we need to loop through each index of those segments. For other segments, we can get the sum. We have $sqrt(n)$ segments, and max size of each segment is also $sqrt(n)$. So even though we may need to loop through the first and the last segment, the total complexity remains $O(sqrt(n))$. Now you know why we divided the array this way.

Now one small question. For a particular index $i$, how do we know which segment $i$ is situated it? Very easy, just check the value of $i/segment_size$.

Let’s see the code for sum operation:

We divided the the code in 3 parts as discussed above.

Now lets come to update operation. It’s very easy, just find the segment $i$ index is located it and update the sum for that segment.

The complexity for update is $O(1)$.

#### Whats next?

You can even find the Lowest Comment ancestor of a tree using a similar technique. If you want the dig deeper, check out this PDF.

## Graph Algorithm 101 for Busy Engineers (Part 1)

This post is for the experienced software engineers who want to refresh their memory on graph algorithms quickly. I will not go into the details of any algorithms, rather than I will list down the most basic algorithms and will briefly tell you how they work. If you are working in the industry for a few years, chances are you have already forgotten most of the algorithms. However, I hope, this post will help you to bring back some of the memories. And certainly, you can google yourself later and learn the details.

(If you are a Bengali reader, you can check out the graph algorithm book which i wrote in 2016. Also, I have lots of articles about graph algorithm in my Bengali blog)

I assume the readers of this post are competent engineers, and they roughly know what a graph is. So, I will start with graph representation and will describe some algorithms very briefly. When I mention their complexity, I will use the variables $V$ and $E$, where $V$ is the number of nodes in the graph and $E$ is the number of edges in the graph.

### Graph Representation

There are two significant ways to represent a graph; the first one is an Adjacency matrix, and the second one is an Adjacency list. Adjacency matrix is a $V \times V$ matrix where $matrix[u][v]$ represents if there is an edge between $u$ and $v$.

For a weighted graph, the matrix contains the weight of the edges. It takes $O(V^2)$ memory. While the matrix makes it easy to check the cost or existence of the edge between $u$ and $v$, the operation of finding all the edges connected to a node becomes expensive as you need to traverse the whole row. On the contrary, the Adjacency list can solve this problem. For the above graph, the Adjacency list will look like this:

In C++, you can use a vector or, in Java, you can use an ArrayList to make this kind of list. It takes only $O(E)$ memory. But the downside is, you need to traverse the list to check if there is an edge between $u$ to $v$.

Now, the question is which representation is to use? Well, it depends on the type of problem you are solving and the type of operations you need to perform.

#### What does it do?

BFS is the most straightforward algorithm to find the shortest path in an unweighted graph. The problem we are trying to solve here is, “Given an undirected graph and a source node, find the shortest path from the source to every other node of the graph.”. It is a “Single source shortest path” (SSSP) algorithms.

#### How does it work?

Have a look at the graph below. Assume $1$ is the source node.

BFS will traverse the graph level-by-level. Here, the level 1 nodes have distance $1$ from the source, then, the level $2$ has distance $2$ and so on. It uses a queue to achieve this. At first, put the source into the queue. Now while the queue is not empty, pop out the first element of the queue and push every connected node in the queue (don’t push the same node twice!!). Doing this, it will ensure you that you will push lower level nodes in the queue before higher level nodes. As a result, every time you push a node, you can calculate the level by adding $1$ to the level of the parent node.

And, the complexity of this algorithm is $O(V + E)$.

### Dijkstra

#### What does it do?

Dijkstra is quite similar to BFS. It also finds the shortest path, but in a weighted graph. Like BFS, it is also an SSSP algorithm.

#### How does it work?

As I said, it’s similar to BFS. But instead of a queue, it will use a priority queue. This priority queue will contain node numbers and distance of the nodes from the source. The members of the queue will be sorted by the distance.

For example, let’s assume $d[u]$ is the distance from the source to $u$. Every time you go from node $u$ to $v$, not only you push v to the queue, but also you need to push $d[v]$ to the queue. That means every time you pop a node from the queue, you will get the closest node from the source.

For this algorithm, the complexity is $O(V \times logV + E)$

### Floyd-Warshall

#### What does it do?

Floyd Warshall is an all pair shortest path (APSP) algorithm. It can find the shortest path from all node to all other nodes in a directed path.

#### How does it work?

Even though the Floyd-Warshall algorithm is internally pretty complex, however, the code of this algorithm is surprisingly simple. First of all, you need to represent your graph using adjacency matrix, other representations won’t work. If two nodes don’t have a connection, you need to put infinite in those cells. Now you need to run $3$ nested loops. The first loop will select a node $k$ as the intermediate node. Then, the other two nested loop will select two nodes $u$ and $v$. Now you will check if it’s a good idea to visit $k$ while you are going from u to v. More specifically, you need to check if $distance[u][v] > distance[u][k] + distance[k][v]$. If the condition is true, you need to update $distance[u][v]$ and move on. In the end, the matrix will contain the shortest path between every node.

Here, the complexity of Floyd Warshall is $O(V^3)$.

### Bellman-Ford

#### What does it do?

Bellman form is another SSSP algorithm. But the speciality of this algorithm is, apart from finding the shortest path from the source, it can also detect the existence of a negative cycle in a graph. (A negative cycle is a sequence of edges which start and ends at the same node and the sum of the weight of the edges of the sequence is negative.)

#### How does it work?

Let’s assume $d[u]$ is the distance from the source to $u$. Initially, $d[source]$ is $0$, and for all other nodes, $d[u]$ is infinite. Then, imagine you are going from $u$ to $v$ and weight is cost(u, v). Now you can update the value of $d[v]$ if $d[u] + cost[u][v] < d[v]$. That means you found a path which is shorter. This is called “edge relaxation”. Now perform this relaxation for every node and keep updating the values in $d$. Then, you will get the shortest path from source to all nodes which uses at most 1 edge. Repeat these operations $V-1$ times and after that, you will have the correct shortest path. Remember, the shortest path can have at most $V-1$ edges.

Now, the concern is, how to detect a negative cycle? For this, first of all, you need to relax all the nodes one more time. Then, if the negative cycle is absent (no negative cycle at all), the value of any index of $d$ won’t update, because, there is no shortest path that has $V$ edges. And, if there is a negative cycle, at least one index will update. In the case, where a negative cycle is reachable from the source, the shortest path is not defined anymore as you can make the path as short as you want by going around the cycle many times.

The complexity of the algorithm is $O(V*E)$. If there is no specific reason, such as, the possibility of a negative cycle, don’t use this algorithm.

### Topological Sorting

#### What does it do?

Imagine you have a list of tasks, each task is dependent on some other tasks. You can represent it with a graph like below:

Now you need to find a correct order to complete the tasks. There can be more than one possible solutions. Topological sort or topsort can help you to do that.

#### How does it work?

Let us assume that there is no cycle in the graph, otherwise, there will be no valid order. Because a cycle will create a circular dependency, hence no valid ordering will be possible. At first, you need to find all the tasks which don’t depend on any other tasks. To do this, we count the in-degree of each node (the number of nodes has an edge towards the current node). The tasks with indegree $0$ can be performed first. Next, push those tasks in the list of answers and remove the outward nodes from those nodes. Now we will get some new tasks with in-degree zero, and we can perform them. Keep repeating this until you complete all the tasks.

Complexity of this algorithm is $O(V^2)$. In order to find all the possible orders of the tasks, you need to do backtracking.

That’s all for part 1, and I will talk about minimum spanning three, depth-first-search and some other algorithms in the next part.

## Promote empathy and be a better colleague

Recently I have read lot’s of articles on the corporate working environment. They advice on what do to climb the ladders, how to become the star employee, the ugly truth of human resource policies, etc. The thing I found dispiriting is almost advised to not to treat the co-workers as friends. They always suggest to keep a professional relationship with colleagues and hide anything personal, including any own problems. The logic they pose is simple and seems convincing at first. Your co-workers are your competitors for the next promotion and pay-rise. If you grow a personal relationship with them and share too much information, they can abuse the knowledge and back-stab you. No matter how close someone is, wait until the next promotion, and you will see they are not a friend.

I find this as a dangerous idea, and it promotes a trust-less environment. It asks you to forget the fact that your co-worker is another human just like you. Remember how dehumanizing it is to call an employee a resource? Promoting non-caring heartless relationship similarly dehumanize people.

Let’s face it, most of us spend at least 40 hours with colleagues every week which is 24% of total hours in a week and a whopping 35% if the time you are awake (assuming you sleep 8 hours a day). Is this healthy to spend so much time in an environment where everyone is worried about back-stabbing? Can you guarantee this habit won’t trickle down in your personal life too?

Now, why should you care about your co-workers? The simple answer is because they are humans and the only way to make the world a better place is caring and empathy. Your work life will be enjoyable when you have great people around you who care for each other and when you know they will defend you when you mess up. Show empathy to the co-worker who is having a bad week due to some personal issue and help them share the workload, most of the time they will return the favor when you have a bad day. Don’t be afraid to share your weakness, your fear with them, be open and help each other to grow as a better person.

I can assure you will be less stressed when you learn to treat the co-workers as friends, as human and stop worrying about back-stabbing. Are their downsides? Of course, there is, there will always be some person who is so-called “professional” and will take advantage of you. Probably you will get sabotaged a few times. But you will still hold up the moral high ground, and in your heart, you will know you did the right thing. So you want to get stressed and have a cold relationship with the people you spend so much time with? Or you want to be open and friendly but take the risk of potential back-stabbing? The choice is yours.

I had made some fantastic friends in my short professional career so far, and they helped a lot in difficult times. I can’t say how much grateful I am to them. Maybe some of them will someday something to hurt my career, but I would rather get hurt by trusting them rather than work in a stressful environment where everyone is ‘just a co-worker’.

It’s time we stop wearing the mask and promote empathy.

## Problem Solving: Consecutive Letters (MIST Inter University Contest 2019)

This problem appeared in 2019 inter-university contest hosted by MIST university, Dhaka, Bangladesh. Initially, our assumption was it will be medium difficulty problem but seems like the contestants found it easier. Around 53 team out of 120 was able to solve it. Let’s see the problem statement:

The time limit for this problem is $3$ seconds.

If you find the 1st query hard to understand, let me try to put it another way, you have to find the size of the largest substring which includes index $i$ and all the characters in the substring are same as $S[i]$.

Let’s see the first sample input:

I give you the query “1 0”. Now you have to find the size of the largest substring which includes index $i = 0$ and contains only the character S[0] = ‘A’. In this case the substring is S[0:1] = AA” and has size $2$ .

Now the next query is “2 1”, so you need to replace $S[1]$ with ‘#’ character, so the string becomes:

After then we again query “1 0”. Now the substring is S[0:0] = A” and has size 1.

Naive Solution

The first solution comes to mind is brute force. For every query of type 1, just loop from index $i$ towards left and right and see how many $S[i]$ you can find.

The time complexity for this code is O(Q*N) where N is the size of the string. As Q and N are very big, they will fail. As a problem setter, I had to be careful to make sure this code doesn’t pass. So I made some test cases where the loops for searching left and right will be very long. Those test cases have very few queries of type 2 and almost all the characters in the string are same.

Expected solution

There are several ways to solve the problem. I have solved it using a set and binary search. There is also a nice solution using a disjoint set. I will describe the solution using binary search first.

In brute force solution, you looped towards the left to find the starting point of the segment. Now I will show you how to find the $left$ point faster. Let’s forget about type 2 query for a moment.

For any given string, we can divide it by segments containing the same characters. See the following string for example:

I have highlighted the starting point of each segment.

After you take the original string as input, you can easily find these points and save them, right? Let’s say we have saved them in a sorted array which looks like $[0, 3, 6, 7]$.

Now let’s say I give you a query “1 4”.

Can you see that the $left$ point is the largest number in the set which is smaller than $i = 4$? In this case, it’s $3$. And how to find the $right$ point? It’s just the next number if the array, in this case, it’s $6$. So the size of the segment will be $right – left = 6 – 3 = 2$

As you have a sorted array already, you don’t need to search one by one, use the power of binary search. You can either write your own binary search or use lower_bound function available in your language and modify it.

Now how about query 2? Let’s say we put a “#” in $i = 1$. Now you can imagine this is creating a new segment from index $i = 1$.

Notice that not only it’s creating a segment from index $1$, it’s creating another segment from index $1 + 1 = 2$. So we insert $1$ and $2$ in our array $[0, 1, 2, 3, 6, 7]$.

Note that we never do type 1 query on a segment consisting “#”. There is no harm if you consider two consecutive “#” as two different segments. We can assume for type $1$ query, index $i$ will always create a new segment, index $i + 1$ may or may not create a new segment but we don’t need to handle it specially if you use a set.

If we use normal array to maintain the list of segment starts it will be very difficult to insert new segments as we need to keep them sorted. That’s why we will use a set. All the programming languages have some built-in set where you can insert a number in sorted way in O(LogN) complexity and also ensures uniqueness of the number.

Let’s see the code. To make our life easier, we will imagine that there is a segment starting at the very end of the string. That will save us pain of handling some special cases.

The complexity for both query of type 1 and type 2 is logN which makes the total time complexity O(Q*LogN) which is good enough to pass under 1 seconds if you use fast I/O.

Alternate solution using disjoint set

There is a nice offline solution using disjoint sets. It “offline” because you have to read all the queries first before answering them. Let’s see another example XXBBFFF, but now we will see the string as a graph:

All the characters converterd into nodes and there is edge between them if the characters are same. Now let’s say we have 3 queries:

Now we read all the type 2 queries first and we will erase the links with those index. In this case they are index $3$ and $5$.

Now treat the connected nodes as a connected components. You can use disjoint sets to maintain the member and the size of the components. Now we have these sets $\{0, 1\}, \{2\}, \{4\}, \{5\}, \{7\}$.

Now we answer the queries from last to first! The last query is “1 5”. What is the size of the component which contains the node 6? The answer is 1.

The next query is “2 6”, now we have to unblock the 5-th node and add back the edges.

Using union find algorithm, we will update the sets, now we have $\{0, 1\}, \{2\}, \{4\}, \{5, 6, 7\}$. Now we query “1 5” again, this time the answer is 3.

In this way we will climb from the end, keep unblocking the nodes and use union find to update the sets. So the type 1 query will be just finding the size of the set. In the end we have to print the queries in original order.

Union find operations have logN complexity, so in the end we have O(QlogN) time complexity. You can see a sample code here.

There is a third solution using segment tree. Code for that is significantly more complex compared to the solution described above. Also it takes much more extra memory and runs slower. During the contest we have let those solutions pass too. I am not going to describe that solution here.

That’s it for now, good wishes for your next contest.

## Tech Retro: Pull Request Ethics

Pull requests are part and parcel of a soft engineer’s life unless you work in a horrible place where people just merge the codes to master directly. How a team handles the pull request and code reviews tells a lot about the team’s culture.

Last week we were discussing pull request ethics in our bi-weekly tech retros in Traveloka.com. Some of the points we discussed are, they should have meaning commit message and description and criticism about code should never offend anyone. If your team is relatively new, you can think about setting some ethical standard about pull requests to avoid conflicts and maintain healthy relationships. Here are my two cents about pull request ethics: