Can you remember your first digital moment in life? For myself, I can remember, 40 years ago, lying in bed totally fascinated with my new digital watch. I could push a button on the side and the black screen would then display in bright red led digits the time! WOW! To keep this short, I will skip over all the other digital experiences in my life, e.g. my first 8086 computer with one 5 ¼ inch floppy drive, no hard drive, no mouse, and a monochrome monitor with a DOS prompt, and that was only 30 years ago! The tools of software development have had many changes over the years. Below I will share with you some of the ones I am currently using.
It is amazing – the changes over the years, but what is even more amazing is the speed of these changes and they keep coming faster and faster. To keep up with the changes, I am currently enrolled in a Big Data Analytics class at the Lucerne University of Applied Sciences and Arts. This blog will give you some insights into the tools I will use for the class project.
The project needs an easy way to retrieve news articles from the internet. An easy-to-use site I found is NewsAPI. It has a free developer version that will handle the needs of the project. The paid service starts at $495 (USD) a month.
Where would a 24/7 process to retrieve the data and a web-site to display the data run? Time to shift my sights up into the clouds. From a very quick and informal review, I found:
- Microsoft’s Azure had a very easy-to-use interface, but looked like it might be too expensive for a 24/7 service.
- Amazon Web Services (AWS) was a bit harder to navigate through the user interface than Azure, but the prices seemed to be cheaper.
- Oracle Cloud Services had the hardest to navigate user interface and looked like it would be the most expensive for a 24/7 service.
AWS has a free trial service period of one year! That sounded good, so I decided I would use AWS.
Looking closer at that offer, I found it is for their t2.micro platform (1 Virtual CPU and 1 GB of memory). I like my computer power to be BIG and manly, like 12 CPUs and at least 256 GB of memory! To put any database project of mine on a “micro” platform would be so demeaning and depressing…. but “free” sounded like a good price, and I was surprised to see the t2.micro has been more than enough power for me. The great thing about using the cloud is that you can very quickly change the resource you are using. If I start to see any slow response times with the t2.micro, I can very quickly and easily kick it up to something more powerful and then later kick it back down again, that is what they call elastic computing.
Now that I have a data source and a place to run the project, I need a place to store the data that I collect.
My first choice for a database has always been Oracle. I could have done just about all the programming for this project needed within Oracle stored procedures. It would have been easy to quickly develop an end user dashboard web site using Oracle Application Express (APEX). You can find on a lot of „NON“-Oracle website comments about free development licenses for Oracle. After reading the Oracle developer license agreement, I saw that it would not cover my project, so Oracle was eliminated for this project and I had to find another place to store the collected news articles.
The data source News-API returns the data as a JSON object. NoSQL ( “Not” “Only” “SQL”) has been a hot buzz word for over a decade now. To totally leave my warm and cozy Relational Database world at once would be too big of a jump for me. Why not use both worlds, together? Relational and NoSQL? I want to have good old traditional tables with columns and rows, and have the option to store my news article JSON objects as NoSQL.
MongoDB has a very cool sounding name, but I quickly eliminated it because it is known as a NoSQL database and not as a relational database.
I wanted a database that is as similar to Oracle as possible, except without having to pay for a license. One open-source database that has been gaining a lot of recognition lately is PostgreSQL. It supports columns containing JSON documents.
Here is a comparison of PostgreSQL with Oracle https://www.educba.com/oracle-vs-postgresql/
A lot of people I know have used MySQL as their database. One of the downsides of MySQL, even though it is open-source, is that it is owned by Oracle. In 2019, Oracle will start charging for Java. Why would a company that makes such a fortune from their licensed database make much of an effort to support an open-source database?
Here is a technical comparison of MySQL vs PostgreSQL http://blog.dumper.io/showdown-mysql-8-vs-postgresql-10/
The winner of the competition is PostgreSQL.
It tries to compete with Oracle, it is open-source with no license fees, and it has NoSQL support for JSON documents. The one thing it does not have is an SQL function for making API requests to a web-site, which means my code will have to be outside of the database.
For Oracle, I use their SQL Developer for an IDE. There is a PostgreSQL JDBC driver that is supposed to make it possible to use the SQL Developer for PostgreSQL.
For this project, I will be using Toad Edge for PostgeSQL. It takes a while to load, but rewards you with a very nice toad „ribbit“ sound when it is finished loading.
Which programming language could I use to retrieve data from a web site’s API?
Just about any programming language that you can think of could do it. So which one to use? For data analytics, I have been hearing a lot about Python. For an experienced programmer it is very fast and easy to start using Python. One great thing about PostgreSQL is that it supports many different languages, including Python. That means you can write PostgreSQL stored functions using Python! Sadly, AWS does not support Python within the RDS, so the code will have to be written and executed somewhere else.
For a Python IDE, I will be using Spyder , which I can strongly recommend.
It is very easy to use with no learning curve.
Everything is now coming together. Oh, that big blue elephant in the room, where is he going to live?
AWS already has a nice home for him called Relational Database Service (RDS).
“Amazon Relational Database Service (Amazon RDS) is a web service that makes it easier to set up, operate, and scale a relational database in the cloud. It provides cost-efficient, resizable capacity for an industry-standard relational database and manages common database administration tasks.”
The RDS is basically a Platform as a Service (PaaS). Just a few clicks and a wham-bam-thank-you-ma’am, you now have a PostgreSQL db up and running.
What about that Python code that needs to be executed a few times every day?
Do I really have to have a 24/7 solution for it, when it only needs a few seconds of execution time every day at most? No way, Jose! I can use serverless computing! I will only get charged for the time it takes to execute my code, and when it is not executing, there are no charges! In AWS, the serverless computing is called Lamda. To have my Python code execute on a regular schedule, I will use the AWS cloud watch.
Now my head, my data, and my processing will all be up in the clouds, how do I work with it and look at it?
I need a PC to make my connections to the cloud. Computers have come a long way in the past 30 years. My Microsoft Surface Pro Tablet with an i7 processor and 16GB RAM is all I need to make my connections to the cloud and do my work. When working at home, I use a wireless mouse, a wireless keyboard, and a wireless display adapter to connect the monitor.
That is it. Those are all the tools I need to get my project done.