
Arlon's Journal 63

I haven't posted in the blog journal for a while because they keep us busy in Data Science class, and that is a great thing! We got tons of practice wrangling data, using graphing tools such as matplotlib, pandas, and seaborn to visualize our data, regression tools like the k-nearest neighbors and LinearRegression learning algorithms from SciKit Learn, and creating huge, easy-to-read, very explanatory data science reports! We did some really exciting projects, like predicting used car prices from car features. You feed in a humongous array of used car data (number of cylinders, whether it has AC, cruise control, leather seats, the car's mileage, all that), and then run experiments using different subsets of those features to first learn from the data set and then predict prices with it. Taking a stab at a methodology for estimating used car prices from an extensive data set is awesome!
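To give a flavor of it, here is a minimal sketch of that kind of exercise. The file name and the column names are just placeholders I made up for illustration, not the actual class data set.

```python
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# "used_cars.csv" and the column names below are placeholders,
# not the actual data set we used in class.
cars = pd.read_csv("used_cars.csv")
features = ["Mileage", "Cylinder", "Doors", "Cruise", "Leather"]
X = cars[features]
y = cars["Price"]

# Hold back part of the data so we can score the model on cars it never saw.
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

model = LinearRegression()
model.fit(X_train, y_train)
print("R^2 on unseen cars:", model.score(X_test, y_test))
```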

There were other amazing exercises too, like figuring out the odds a plant has some characteristic based on some other characteristic, and it turns out there's a super simple, quick way to chart it in code and literally see the answer visually! The animal and funny names go over well with my mom and my friends, so that's good too. I try to explain what I'm doing in class for them, and they just want to hear all the various names of all the stuff: Pandas and Anaconda and Seaborn and MatPlotLib, which is sort of fun to even just say out loud. Try it, you'll feel like a robot, "MatPlotLib", it sounds like robot talk, but what it is, is a really powerful programming tool for graphing and visualizing data you've wrangled with NumPy and Pandas, all free packages for the Python programming language. The tools were all completely foreign to me at first, but practice goes a long way toward understanding exactly how this stuff works, and the instructor, lectures, class materials, and most notably the brilliant and very effective help from the teaching assistant all do a spectacular job, so it was a lot easier to get this stuff than it would have been had I gone through it on my own. Other data sets we experimented with during the course were election campaign data, college tuition data, weather data, iris petal data, and even home values!
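For instance, here's the kind of one-line chart I mean, using seaborn's built-in copy of the classic iris data set (the actual plant exercise in class may have looked a bit different):

```python
import seaborn as sns
import matplotlib.pyplot as plt

# seaborn ships a small copy of the classic iris data set.
iris = sns.load_dataset("iris")

# One plotting call: petal length vs. petal width, colored by species,
# and the grouping is visible immediately.
sns.scatterplot(data=iris, x="petal_length", y="petal_width", hue="species")
plt.show()
```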

Yes, we were predicting home values at one point in this computer science class. And here I am trying to get away from real estate appraisal. It was completely different though, but similar! It was training machine learning algorithms to predict home values based on huge data sets, and essentially, oddly, it was coming up with adjustments, adjustments like a real estate appraiser would make! It did that for the cars too! But it did it in a natural way that seems a lot more methodical than the real estate appraisal methodology. The machine learning approach is exact: it's still fuzzy in a sense, but it's based on enormous data sets, and there is no approximation other than exactly what you tell it. Very, very interesting, both from a computer science perspective, an engineering knowledge/programming perspective, and, much to my dismay, I know it's also very interesting from the perspective of a real estate appraiser...

So, after the data is obtained from the source, it is looked at to see exactly what is there, using various tools that show sections and subsections of the data. Then it is polished: either filling the NaN values that may be present with the median or mean of the column, or just dropping those rows if there's enough data present. Then the data is scaled, so all the features span the same range and make the same kind of difference. In other words, maybe a tiny fraction of a millimeter matters to some fine measurement, so that would be scaled up to the same scale as everything else, so everything counts on the same footing. That way you can compare that minute measurement to some larger measurement such as engine displacement or whatever; the point is that after scaling, all the characteristics either go from 0 to 1, or get centered at 0, or some other method brings them all into the same realm of existence.
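Here is a rough sketch of those look/polish/scale steps in code. The file name is a placeholder, and the choice of a median fill is just one option, only meant to show the shape of the workflow:

```python
import pandas as pd
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# "some_data.csv" is a placeholder file name, not one of the class data sets.
df = pd.read_csv("some_data.csv")

# Look at what's actually there first.
print(df.head())
df.info()
print(df.describe())

# Polish: fill NaN values with the column median (dropping those rows
# is the other option when there's plenty of data).
numeric = df.select_dtypes("number")
numeric = numeric.fillna(numeric.median())
# numeric = numeric.dropna()  # the drop-rows alternative

# Scale: bring every feature into the same span, either 0 to 1...
scaled_01 = MinMaxScaler().fit_transform(numeric)
# ...or centered at 0 with unit variance.
scaled_std = StandardScaler().fit_transform(numeric)
```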

After the data is scaled, it is split into training and test sets, so you can use part of the data to see whether the training is working. The test set isn't seen by the training algorithm at all; even though it's part of the same original data, it isn't learned from, so the trained model can be tested against data it has never seen before, to see whether it learned well or not. Then you switch algorithms around, get to know your data, and finally find a regression or some other algorithm that fits your data naturally well, without overfitting of course. Overfitting gives judgments that are too specific, aiming too closely at the known data points; instead the idea is to form a generalization, like the cloud of the nearest neighbors algorithm or the line of linear regression, and go off that larger generalization rather than aiming at the specific points you already know.
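A hedged sketch of that split-and-compare step, using a synthetic stand-in data set so it runs on its own (the class data sets were real ones, of course):

```python
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsRegressor
from sklearn.linear_model import LinearRegression

# Synthetic stand-in data so the sketch is self-contained.
X, y = make_regression(n_samples=500, n_features=6, noise=10, random_state=0)

# The test set is held back and never shown to the learner.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0)

for model in (KNeighborsRegressor(n_neighbors=5), LinearRegression()):
    model.fit(X_train, y_train)
    # A big gap between these two scores is the classic sign of overfitting:
    # great on the points it memorized, poor on points it has never seen.
    print(type(model).__name__,
          "train:", round(model.score(X_train, y_train), 3),
          "test:", round(model.score(X_test, y_test), 3))
```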

So you scale the data so it's all on the same footing, split it so you have training and test sets with comparable characteristics, and then you train against one of these algorithms in SciKit Learn that actually learns from the data to make predictions about it, or about other similar data. Once you find the right algorithm, hopefully you can make some good predictions with the tools! Our group project involved predicting the weather at one particular place, CSUMB specifically, based on past weather and humidity, and it seemed like it kind of worked!
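Putting the whole flow together, scikit-learn's Pipeline lets you chain the scaler and the learner, which is roughly how that scale-split-train-predict sequence looks in one place. This is a generic sketch on stand-in data, not our actual weather project code:

```python
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsRegressor

# Stand-in data again; the real project used CSUMB weather readings.
X, y = make_regression(n_samples=300, n_features=4, noise=5, random_state=1)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)

# The scaler and the learner chained together, so the test set gets scaled
# with the training set's parameters automatically.
pipe = make_pipeline(StandardScaler(), KNeighborsRegressor(n_neighbors=7))
pipe.fit(X_train, y_train)
print("Score on unseen data:", pipe.score(X_test, y_test))
```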

And what's super neat is that one of the things you can do with these tools is produce a huge report of charts and code, which you could use for countless applications. For starters, you could use it for documenting the programming language itself; like the book says, the book itself is written in Jupyter notebooks. So it's sort of a self-documenting system we're producing. Our group project and several of our homeworks are in this Jupyter notebook program, which looks like a report with code and charts on it, but also with report-looking text. The code you put in generates the charts, and then you can also just write text explaining what you're doing. So not only can you document programming, you can document a data set, a data analysis, a data prediction, or some huge science system you or a huge team invents. The possibilities are endless, and I'm eternally grateful to have been introduced to these invaluable tools, which I didn't even know about until about eight weeks ago! I think I will take these tools and run with them, for sure. I have my eye on some very specific problems; I'm not sure exactly how, but I think NumPy with MatPlotLib and Jupyter notebook would at least take a big stab at those problems, or at least give them a good run for their money. NumPy is a very, very powerful tool! In the very beginning we saw examples of NumPy operations running 50 times faster and more on huge data sets than if we were to write the loops ourselves. The reason is that NumPy uses vectorized operations, which are very fast, so it is specialized for and works well on enormous data sets.
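A tiny experiment along those lines, comparing a plain Python loop to the vectorized NumPy version of the same sum of squares (the exact speed-up will vary by machine and data size):

```python
import time
import numpy as np

data = np.random.rand(5_000_000)

# Plain Python loop.
start = time.perf_counter()
total = 0.0
for x in data:
    total += x * x
loop_time = time.perf_counter() - start

# The NumPy vectorized equivalent.
start = time.perf_counter()
total_np = np.sum(data * data)
numpy_time = time.perf_counter() - start

print(f"loop: {loop_time:.3f}s  numpy: {numpy_time:.3f}s  "
      f"speed-up: about {loop_time / numpy_time:.0f}x")
```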

So the idea is, you write the report up with all these code tools, which are a lot simpler than they sound, and get all your work onto one document. I have been doing that myself for years, in my own way, but this is meant for it. Your entire project, documentation, code, charts, data examples, learning algorithms, conclusions, final charts, and English explanations are all present on the same page. So you can take a step back, say that is great, and show it to everyone. Then if there are any problems with any or all of it, you can update parts of the document, or the entire thing, to improve the system. As time passes you can do many iterations of revisions and wind up with a polished, professional science report that shows people insights into real data, for real information analysis and communication on a big scale.

I personally installed my data science environment, which included Python 3, iPython, Anaconda, Conda, Jupyter Notebook, Spyder, NumPy, Pandas, SeaBorn, MatPlotLib, and SciKit Learn (among many other tools), on Ubuntu Budgie 20.04 LTS, and it worked beautifully. The first thing I always do when I install any system is mount the network server for file backup, and the second is install one of my favorite pieces of software, FreeFileSync. That way, if I wind up wanting to save anything, I can mirror a folder or the whole system to the network drive really flexibly. And of course the third thing is sudo apt install wine64 and then fixing the Notepad++ icon so it works, very important. I fall back on Notepad++ for everything: all kinds of programming, plus plain text notes, data files, anything text-based you'd want to edit at all, I use Notepad++ for primarily. For this class I ran all my Python files in Spyder, which launches out of Anaconda Navigator, running all tests there. Then when the work is done I copy it into Notepad++ as its own document, where I can power-edit things like I'm used to and make big changes, comment entire blocks out, etc., and then I sync from there. Same with Jupyter notebooks: all the stuff in my Jupyter notebook files essentially comes out of a file I also keep in Notepad++. That's just because I can use Notepad++ so effectively, I've been using it for decades, so I always add it to my tool set rather than replacing it.

Logic class looks pretty neat, with lots of math proofs by the look of it, and I can't wait! I super liked doing math proofs in the classes I have taken in the past, so I'm sure it will be fun!

We're starting Logic Class Saturday, and I just started my side classes at Cabrillo: US History to 1865, Political Science/Intro to Government, and Spanish I, all three classes simultaneously gang-stabbing me from all sides, so hopefully I survive; I'm sure I will (that was a metaphor of course, they are just classes). The point is that I may need to hold off on the blog for a while to save a few moments here and there... happy programming everyone!
