A Basic Data Analysis Example In Core Java
In this post, I discuss a basic data analysis example using core Java. Consider a bank with a presence in 50 districts and 10,000 accounts in each district. The balance of each of those 500,000 accounts is available in, say, a text file. The bank manager wants to know: "In every district, what are the three accounts with the largest balances?"
If the data were in a database, it would be easy to write a query to solve this, probably with some grouping and/or ranking keyword. But let's say our application cannot go to the database and run a query, and due to some special circumstances we are required to do text processing.
Let's also assume our input data is in a text file with the following format:
account_id district_id balance
Sample records would be:
7323 19 2766.94
8906 7 7705.09
9352 47 2874.82
4857 8 9779.27
2067 39 1067.64
8893 20 6592.69
4944 37 8939.08
2565 23 3621.32
8948 19 5807.67
8189 44 7576.92
For ease of processing, the account_id (first column) and district_id (second column) are integers, whereas the balance (third column) is a float with two decimal places. Basically, we are dealing with numbers here.
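As a minimal illustration (the record value, class name, and variable names here are my own, just for this example), one such line can be split on whitespace and parsed into the three fields:

public class ParseRecord {
    public static void main(String[] args) {
        // One whitespace-separated record: account_id district_id balance
        String line = "7323 19 2766.94";
        String[] fields = line.trim().split("\\s+");
        int accountId = Integer.parseInt(fields[0]);
        int districtId = Integer.parseInt(fields[1]);
        float balance = Float.parseFloat(fields[2]);
        System.out.println(accountId + " | " + districtId + " | " + balance);
    }
}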
Any time you face the question "what are the k largest or k smallest items in a data set?", you would use a priority queue. A priority queue, as explained by Goodrich et al., "is a collection of prioritized elements that allows arbitrary element insertion and allows the removal of the element that has first priority. When an element is added to a priority queue, the user designates its priority by providing an associated key. The element with the minimal key will be the next to be removed from the queue."
The trick is to keep the priority queue bounded at k elements: every time an insertion fills it up, that is, its size reaches k+1, delete the minimum value.
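Here is a minimal sketch of this trick with k = 3 (the sample balances are taken from the records above; the class and variable names are my own). Java's java.util.PriorityQueue is a min-heap by default, so its head is always the smallest element:

import java.util.PriorityQueue;

public class TopK {
    public static void main(String[] args) {
        int k = 3;
        double[] balances = {2766.94, 7705.09, 2874.82, 9779.27, 1067.64};

        // PriorityQueue is a min-heap: its head is the smallest element it holds.
        PriorityQueue<Double> topK = new PriorityQueue<>();
        for (double b : balances) {
            topK.add(b);
            // Once the queue grows past k elements, evict the smallest,
            // so only the k largest values seen so far remain.
            if (topK.size() > k) {
                topK.poll();
            }
        }
        // Prints the three largest balances (in heap order, not sorted).
        System.out.println(topK);
    }
}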
But this problem asks for the three largest accounts in every district, so the analysis gains an extra dimension. Therefore we use a data structure within a data structure: a key-value pair in which the key is the district_id and the value is a priority queue.
The priority queue itself holds elements with two members each: account_id and balance. We solve our problem by maintaining only the three highest-balance elements in each district's priority queue.
The program logic would be as follows (a full code sketch is given after the list):
- Write a class Pair that implements the interface Comparable<Pair>. This class has two members: a key (int, the account_id) and a value (float, the balance).
- Override the compareTo method; it returns -1, 0, or 1 depending on whether the value is less than, equal to, or greater than the value of the object being compared.
- Define a map (a Map object) with Integer keys and PriorityQueue<Pair> values.
- Scan the data file and read it line by line.
- Split each line into account_id, district_id, and balance.
- Create a new Pair with the account_id and balance.
- Check if the district_id already has a value in the map. If not, add the pair to a new PriorityQueue and put the district_id and priority queue into the map as key and value.
- Otherwise, add the pair to the existing priority queue.
- If the size of the priority queue is greater than 3, remove the head element.
- After all lines are processed, the map will have the 50 districts as keys, and each district_id will map to a priority queue of its three largest accounts.
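Putting these steps together, here is a sketch of the whole program. The input file name accounts.txt and the class names are my own choices for illustration; compareTo uses Float.compare, which gives the same -1/0/1 ordering described above:

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.util.HashMap;
import java.util.Map;
import java.util.PriorityQueue;

// An (account_id, balance) pair, ordered by balance so that the
// priority queue keeps the lowest balance at its head.
class Pair implements Comparable<Pair> {
    final int key;     // account_id
    final float value; // balance

    Pair(int key, float value) {
        this.key = key;
        this.value = value;
    }

    @Override
    public int compareTo(Pair other) {
        return Float.compare(this.value, other.value);
    }

    @Override
    public String toString() {
        return key + " (" + value + ")";
    }
}

public class TopAccountsPerDistrict {
    public static void main(String[] args) throws IOException {
        Map<Integer, PriorityQueue<Pair>> topByDistrict = new HashMap<>();

        try (BufferedReader reader = new BufferedReader(new FileReader("accounts.txt"))) {
            String line;
            while ((line = reader.readLine()) != null) {
                // Split each line into account_id, district_id and balance.
                String[] fields = line.trim().split("\\s+");
                int accountId = Integer.parseInt(fields[0]);
                int districtId = Integer.parseInt(fields[1]);
                float balance = Float.parseFloat(fields[2]);

                // Get (or create) the district's priority queue.
                PriorityQueue<Pair> queue = topByDistrict.get(districtId);
                if (queue == null) {
                    queue = new PriorityQueue<>();
                    topByDistrict.put(districtId, queue);
                }
                queue.add(new Pair(accountId, balance));

                // Keep only the three largest balances per district.
                if (queue.size() > 3) {
                    queue.poll();
                }
            }
        }

        // Each district now maps to its three highest-balance accounts.
        for (Map.Entry<Integer, PriorityQueue<Pair>> entry : topByDistrict.entrySet()) {
            System.out.println("District " + entry.getKey() + ": " + entry.getValue());
        }
    }
}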
If you want to run the program, I have a 10,000-record input file on mediafire -> here.
Validation: You can load the input file into MS Excel. Insert a header (or blank) row at the top, select the three columns as data, and apply a filter to the second column. Pick any district, say 50, and sort the balance column from largest to smallest. You will see that the top three account_ids match the program's output.
The run time of the program is high. On my Windows 8.1 desktop with an i5-3470S 2.9 GHz processor and 4 GB RAM, the Java program (on JDK 1.7) processed a 10,000-record file in 0.3 seconds. In the world of computing this is slow. What if it were a million-record file? It would be dead slow.
So how would you make it faster? Parallel computing, of course, with the MapReduce algorithm on Hadoop. I will write about that and compare the results in a future post.
Reference
Data Structures & Algorithms in Java, by Michael T. Goodrich, Roberto Tamassia, and Michael H. Goldwasser. Sixth Edition, 2015, Wiley India Pvt Ltd.