Though the current project started as a series of posts charting my grief journey after the death of my mother, I am no longer actively grieving. Now, the blog charts a conversation in living, mainly whatever I want it to be. This is an activity that goes well with the theme of this blog (updated 2018). The Sense of Doubt blog is dedicated to my motto: EMBRACE UNCERTAINTY. I promote questioning everything because just when I think I know something is concrete, I find out that it’s not.
Hey, Mom! The Explanation.
Here's the permanent dedicated link to my first Hey, Mom! post and the explanation of the feature it contains.
Posted by EditorDavid from the life-after-Python-3 dept.
The Python programming language "is a big hit for machine learning," read a headline this week at ZDNet, adding "But now it needs to change."
Python is the top language according to IEEE Spectrum's electrical engineering audience, yet you can't run Python in a browser and you can't easily run it on a smartphone. Plus no one builds games in Python these days. To build browser applications, developers tend to go for JavaScript, Microsoft's type-safety take on it, TypeScript, Google-made Go, or even old but trusty PHP. On mobile, why would application developers use Python when there's Java, Java-compatible Kotlin, Apple's Swift, or Google's Dart? Python doesn't even support compilation to the WebAssembly runtime, a web application standard supported by Mozilla, Microsoft, Google, Apple, Intel, Fastly, RedHat and others.
These are just some of the limitations raised by Armin Ronacher, a developer with a long history in Python who 10 years ago created the popular Flask Python microframework to solve problems he had when writing web applications in Python. Austria-based Ronacher is the director of engineering at US startup Sentry — an open-source project and tech company used by engineering and product teams at GitHub, Atlassian, Reddit and others to monitor user app crashes due to glitches on the frontend, backend or in the mobile app itself... Despite Python's success as a language, Ronacher reckons it's at risk of losing its appeal as a general-purpose programming language and being relegated to a specific domain, such as Wolfram's Mathematica, which has also found a niche in data science and machine learning...
Peter Wang, co-founder and CEO of Anaconda, maker of the popular Anaconda Python distribution for data science, cringes at Python's limitations for building desktop and mobile applications. "It's an embarrassing admission, but it's incredibly awkward to use Python to build and distribute any applications that have actual graphical user interfaces," he tells ZDNet. "On desktops, Python is never the first-class language of the operating system, and it must resort to third-party frameworks like Qt or wxPython." Packaging and redistribution of Python desktop applications are also really difficult, he says.
"Now budding Python developers can read up on the National Security Agency's own Python training materials," reports ZDNet: Software engineer Chris Swenson filed a Freedom of Information Act request with the NSA for access to its Python training materials and received a lightly redacted 400-page printout of the agency's COMP 3321 Python training course. Swenson has since scanned the documents, run OCR on the text to make it searchable, and hosted it on DigitalOcean Spaces. The material has also been uploaded to the Internet Archive...
"If you don't know any programming languages yet, Python is a good place to start. If you already know a different language, it's easy to pick Python up on the side. Python isn't entirely free of frustration and confusion, but hopefully you can avoid those parts until long after you get some good use out of Python," writes the NSA...
Swenson told ZDNet that it was "mostly just curiosity" that motivated him to ask the NSA about its Python training material. He also said the NSA had excluded some course material, but that he'll keep trying to get more from the agency... Python developer Kushal Das has pulled out some interesting details from the material. He found that the NSA has an internal Python package index, that its GitLab instance is gitlab.coi.nsa.ic.gov, and that it has a Jupyter gallery that runs over HTTPS. NSA also offers git installation instructions for CentOS, Red Hat Enterprise Linux, Ubuntu, and Windows, but not Debian.
In the early days of personal computing, every machine came with a BASIC interpreter. Everyone knew how to program in BASIC. There was lots of freely-shared code written in BASIC.
Fast forward to a couple of weeks ago. I needed to write a small program (to assist with generating content for my Quora Space, actually). I didn’t want to spend a lot of time on it, and I didn’t want to spend any money on it.
I installed Python on my PC (for free), googled a few fine points of syntax since I hadn’t written any Python in a year or two, and banged out the utility program I needed in about three hours.
My program needed a library that could read in a .CSV file. I knew before I even looked that Python would have a library that does what I need, and that it would be simple to use. Python has a library for everything. For free.
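To give a feel for how little code that takes, here is a minimal sketch using the csv module from Python's standard library. The sample data and the read_rows helper are made up for illustration; they are not taken from the utility described above.

```python
import csv
import io

# Hypothetical sample data standing in for a real .CSV file on disk.
SAMPLE = "title,views\nWhy Python?,120\nBASIC nostalgia,45\n"

def read_rows(handle):
    """Return each CSV row as a dict keyed by the header line."""
    return list(csv.DictReader(handle))

# io.StringIO lets the sketch run without a file; open("some.csv", newline="")
# would be used for a real one.
rows = read_rows(io.StringIO(SAMPLE))
total_views = sum(int(row["views"]) for row in rows)
print(rows[0]["title"], total_views)
```

csv.DictReader treats the first line as the header, so each row can be addressed by column name instead of by position.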
Don’t get me wrong. Python is a much better language than BASIC, but it is popular for a lot of the same reasons BASIC was popular: free to use, ubiquitous, simple, interpreted, lots of shared code.
Python has become the gold standard for applied machine learning. Currently, there are more job openings for data scientists and machine learning engineers who know Python than there are for all the other languages combined. A logical question at this point might be: why is Python used so often in applied machine learning? While there are many reasons for its ubiquity in this space, three often rise to the top.
One of the top reasons for Python’s widespread adoption is its simplicity. While it’s not a hard and fast rule, the lower the barrier to entry a programming language has, often the more it will be used. Python is simple. Python might be the highest-level language out there. That means just about anyone can learn it. The less the developer must worry about the code itself, the more focus and emphasis can be put on finding solutions.
The second, and possibly the number one, reason for Python’s popularity is its libraries. A library in Python is a bundle of prewritten code you can import into your environment to extend the language’s functionality.
There are libraries for just about every aspect of applied machine learning. For example, pandas is a library for massaging data. scikit-learn is a general-purpose library for building traditional models, and it also has many tools you use throughout the machine learning pipeline. There’s matplotlib for visualization and Keras for building deep learning models. There are also many libraries for niche needs, like NLTK for natural language processing and a library called BeautifulSoup for web scraping.
The third reason Python remains popular is the Jupyter Notebook. Jupyter Notebooks are a powerful way to author your code in Python. A Jupyter Notebook is a web-based interface that allows for rapid prototyping and sharing of data-related projects. Rather than writing and re-writing an entire program, you can write lines of code and run them one at a time or in small batches. This makes coding easier to debug and understand.
The success of the Jupyter Notebook hinges on a form of programming called literate programming. Literate programming is a software development style created by Stanford computer scientist Donald Knuth. This type of programming emphasizes a prose-first approach where human-friendly text is punctuated with code blocks. It excels at demonstration, research, and teaching, especially in science.
The simplicity, readability, libraries and integrated development environment make Python one of the most used languages in the machine learning space.
At some point Python shifted from “easiest to learn while good enough for practical work” to become the de rigueur target for scripting language bindings to various APIs and native (compiled) libraries.
For example, tools like GNS3, a network simulation engine, are complex systems written in C/C++ but primarily used through the Python bindings which are exposed from these native libraries. The same is true of TensorFlow and OpenCV and many others. This holds not just in machine learning, data analytics, network modeling, penetration testing, or systems administration (orchestration like Ansible or configuration management with SaltStack) but across many domains and fields.
It’s the language that’s good enough for people in many specialities to learn while allowing them to focus primarily on their specialty while empowering them to build automation around tools in their field.
It turns out that having a language with minimal ceremony and syntax, but a roughly pseudo code (outline) æsthetic has considerable appeal both to those developing specialized tools, libraries and frameworks, and to their various target users.
Richard Kenneth Eng, Used Fortran, Tandem TAL, C/C++, C#, Obj-C, Java, Smalltalk, Python, Go
Many languages are used in “almost everything.” This includes Java, JavaScript, C, C++, C#, Ruby, Perl, Smalltalk, Lisp, etc. However, the real question is, in what areas do these languages enjoy substantial usage?
Python enjoys substantial usage in only a few areas like data science, machine learning, sysadmin scripting, backend web, and embedded programming.
In areas like mobile programming, systems programming, real-time control systems, cloud computing, supercomputer modelling, high-performance computer graphics, etc., Python is barely used, if at all.
Python’s main advantage is its user-friendliness (readable syntax). This is why it’s so popular.
That said, I’d refute the idea that it’s used in “almost everything”. There’s a lot of stuff where Python would not be the best choice for your programming language. Simulation software, for example. Much more of a C/C++ sort of thing. Python also isn’t particularly popular in things like mobile development, which is a pretty big deal right now. That’s more Java’s territory.
However, Python is just plain easy to use. This has more to do with it being user friendly than anything else. You can, effectively, do more work in less time. The trade off for this is that Python is not always the most “efficient” language, often taking more processing time or memory storage to do the same thing as in other languages. This is because it’s often doing more behind the scenes than is “optimal” for the sake of taking work off your shoulders. For example, dynamic typing.
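As a small illustration of dynamic typing, here is a sketch in which the names describe and add are invented for the example. The same name is rebound to values of different types, and one function works on several types because the interpreter checks types at run time rather than ahead of time.

```python
def describe(value):
    """Report the runtime type of whatever object is passed in."""
    return type(value).__name__

item = 42             # the name is bound to an int
first = describe(item)
item = "forty-two"    # the same name is rebound to a str, no declaration needed
second = describe(item)

def add(a, b):
    # No types are declared: the interpreter works out what "+" means
    # for these particular objects each time the line runs.
    return a + b

print(first, second, add(1, 2), add("a", "b"), add([1], [2]))
```

That run-time lookup on every operation is part of the extra work done behind the scenes, which is where some of the speed cost comes from.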
But computers have gotten like, really good lately. It’s certainly not uncommon for your program to take 0.1 seconds to run even in Python, or for you to only need like, 1% of your RAM’s storage. In these cases, the tradeoff of Python really doesn’t matter that much.
Jon Obermark, Software Engineer at Bank of America
It is easy to write, and extraordinarily flexible. So you can construct straightforward things rapidly, and rescue yourself from difficult situations more quickly than you can in most languages.
It has very few rules that cannot be broken, so you can make things work in all sorts of devious ways with awkward hacks and temporary workarounds.
This allows you to keep something standing while you refactor it. But if you let it go too long, you can end up with a horrific mess. Python that has been too quickly thrown together or 'rescued' too much can be very hard to debug and very slow to refactor, in a way a more rigid language never is. So in the long run, it demands discipline that it does not enforce. But there is often enough rigor to go around, if your work is relatively “normal”.
So it is good for the middle ground: reasonably disciplined programmers who need to be very responsive, but do not need to produce things of extremely high quality, security or performance.
At the same time, it also interfaces fairly naturally with C/C++. So if you find that parts of your project do need these traits, you can move them into a “more real” language piecewise, keeping the Python as “glue”.
Garry Taylor, Been programming since 8 bit computers
It’s not. There are many areas where Python isn’t used in great numbers, or at all.
Embedded systems, AAA games (or really any games of note), desktop applications, smartphone apps, systems level development, or any large scale application development are areas where Python is not at all popular.
Python is big in certain areas, like automating systems, scripting, and it has a decent share of web site development, and scientific uses, but there are many areas where use of Python is rare, or almost unheard of.
If you want to find a language which is used in almost everything, C++ is probably the no. 1.
There are some really cool Python frameworks that are mainly used for web application development, and there are a lot of programmers who use Python for security purposes.
Python is widely used to create automated testing frameworks, automated tests, and it is also used to develop desktop applications used by software testers.
You can use Python to develop internal tools, build scripts, build system monitoring tools and login tools.
I should also mention that Python is highly recommended when dealing with strings and text file manipulation because it provides you with tons of built-in functions that will make your life easier.
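Here is a rough sketch of what that looks like in practice, using only built-in string methods and collections.Counter from the standard library. The log text and the summarize helper are hypothetical, chosen just to show strip, partition, upper and join working together.

```python
import io
from collections import Counter

# Hypothetical log text standing in for a real text file.
TEXT = "  ERROR: disk full \nwarning: low memory\nERROR: disk full\n"

def summarize(handle):
    """Count identical (level, message) pairs, ignoring case and stray spaces."""
    counts = Counter()
    for line in handle:
        # partition splits on the first ":" only and always returns three parts.
        level, _, message = line.strip().partition(":")
        counts[(level.upper(), message.strip())] += 1
    return counts

counts = summarize(io.StringIO(TEXT))
report = "; ".join(
    f"{level} {msg!r} x{n}" for (level, msg), n in counts.most_common()
)
print(report)
```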
Thirty years ago my bachelor’s thesis was about how difficult it is to carry a way of solving problems across languages like COBOL, FORTRAN or Pascal.
The simple conclusion, in human terms, was that those languages were like foreign tongues, too different to express the same idea in the same way, and I wasn’t talking about syntax only, but about the inner construction of compilers, which had to be very close to the hardware of the machines running them.
Today the scenario is very different. The PC’s Intel architecture simplifies the computing layer and lets programmers focus on solving problems in the software layer. Soon enough there were common libraries, to write or paint on the screen for example, and magically (or almost) they could be used from different languages (compilers), so people started inventing new ways of computing (new languages).
And there we are with Python.
It has a very simple syntax, is loose enough with data types, lets you just type what you need (no countless parentheses, colons, semicolons, etc.), can represent algorithms the way Pascal does but more simply, and can represent OOP concepts more simply than Java, for example.
All this mix made it a good language for beginners, and the rest is history….
Firstly, because Python has a pretty small learning curve. Also, it’s pretty good for prototyping.
Let me give you an example. As a project last year I had to build a handwritten digit recognizer. I wanted the system to be fast, so my obvious choice for the final implementation was C++.
However, I had to test out various approaches, and it was easy to prototype them in Python. You are free from the burden of compiling and recompiling every time you make changes.
Another great advantage is the IDLE environment, where you can call and test out the various functions in your program. In a typical C or C++ program you would have to write an infinite loop, accept some input, and then use a switch case to call these functions.
They aren’t. While popular and even in the #3 spot for modern programming language usage… only about 10% of developers are using it as a primary development tool. Java is #1 at 16%, C at 15%, C++ 6%, and C# at just under 4%.
It can be highly useful, lots of people might use it as a secondary tool, its usage is rising, etc., but it is hardly “everyone” and isn’t even close to “most” or even “many”. It probably never will be.
Relative to other programming languages Python is:
easy and fast to write
easy to maintain
adequately fast
available almost everywhere
Python is also excellent glue. You can call other programs from it, including pipelines. You can fairly easily integrate C functions into Python. It is good at dealing with operating systems calls and structures.
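A minimal sketch of the "call other programs" part, using subprocess from the standard library. To keep it runnable anywhere, the external program here is just a second Python process that upper-cases its input; that stand-in and the run_filter name are assumptions for the example, and any command-line tool could take its place.

```python
import subprocess
import sys

# Stand-in for an external program: a child Python process that
# upper-cases whatever arrives on its standard input.
CHILD = [
    sys.executable,
    "-c",
    "import sys; sys.stdout.write(sys.stdin.read().upper())",
]

def run_filter(text):
    """Pipe text through the external program and return what it prints."""
    result = subprocess.run(
        CHILD, input=text, capture_output=True, text=True, check=True
    )
    return result.stdout

output = run_filter("glue code\n")
print(output, end="")
```

With check=True a non-zero exit status raises an exception, so a failure in the external program is not silently ignored.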
It is used for personal programming, scientific programming, data munging, and data marshalling for supercomputers and render farms, and even AI, among other things.
So Python is flexible, adaptable, versatile, fast enough, and easy to use. That’s why I think it’s everywhere…
A long-term Data Science roadmap which WON’T help you become an expert in only several months
Some thoughts on becoming a data scientist. It isn’t easy or fast and requires a lot of effort, but if you are interested in data science, it is worth it.
From time to time I am asked: how does one become a data scientist? What courses are necessary? How long will it take? How did you become a DS? I have answered this question several times, so it seems to me that writing a post could be a good idea to help the aspiring data scientists.
About me
I got a master’s degree at MSU Faculty of Economics (Russia, Moscow) and worked for ~4 years as an analyst/consultant in the ERP-system implementation sphere. It involved talking with clients, discussing their needs and formalizing them, writing documentation, explaining tasks to programmers, testing the results, organizing projects and many other things.
But it was a stressful job, with lots of problems. More importantly, I did not really like it. Most of the work was not inspiring, though I liked working with data. So, in the spring and summer of 2016 I started looking for something else. I got a Green Belt in Lean Six Sigma, but there were no opportunities nearby. One day I found out about BigData. After a couple of weeks of googling and reading numerous articles I realized that this could be my dream career.
I left my job and 8 months later got my first position as a data scientist in a bank. Since then I have worked in a couple of companies but my passion for data science is still strong. I have completed several courses in ML and DL, made several projects (such as a chat-bot or a digit recognizer app), took part in many ML competitions and activities, got three silver medals on Kaggle and so on. Thus, I have some experience with studying data science and working as a data scientist. Of course I still have a lot of things to learn and a lot of skills to acquire.
Disclaimer
This article contains my opinions. Some people may disagree with them and I want to point out that I do not want to offend anyone. I think that anyone wishing to become a data scientist must invest a lot of time and effort in it or they will fail. Courses or MOOCs claiming that you can become an expert in ML/DL/DS in several weeks or months are not entirely truthful. You can get some knowledge and skills within weeks/months. But without extended practice (which is not a part of most courses) you will not prevail.
You do need internal motivation, but, more importantly, you need discipline, so that you will continue working after the motivation has gone away.
Let me repeat again — you need to do things by yourself. If you ask the most basic questions without even trying to use Google/StackOverflow or thinking for a couple of minutes, you will never be able to catch up with the professionals.
In most of the courses which I took, only around 10–20% of people completed them. Most of those who dropped out did not have dedication or patience.
Who is a Data Scientist?
There are many pictures showing a data scientist’s core skills. For the purposes of this post any of them is good, so let us look at this one. It shows that you need Math & Stats, Programming & Devops, Domain knowledge and Soft skills.
That’s a lot! How is it possible to know all of this? Well, it really takes a lot of time. But here is the good news: it is not necessary to know everything.
There was an interesting talk on 21 October 2018 at Yandex. It was said that there are many types of specialists, who have different combinations of the aforementioned skills.
Data Scientists are supposed to be in the middle, but in fact they can be in any part of the triangle, having different levels in any of the three spheres.
In this article I will talk about data scientists as they are usually assumed — those who can talk with customers, perform analysis, build models and deliver them.
Switching careers? This means you already have something!
Some people say that switching careers is quite difficult. While that is true, having a career to switch from means you already know something. Maybe you have experience with programming & devops, maybe you have worked in a math/stats heavy sphere, or maybe you have honed your soft skills every day. At a bare minimum you have expertise in your domain. So always try to use your strengths.
First, read fucking Hastie, Tibshirani, and whoever. Chapters 1–4 and 7–8. If you don’t understand it, keep reading it until you do.
You can read the rest of the book if you want. You probably should, but I’ll assume you know all of it.
Take Andrew Ng’s Coursera. Do all the exercises in Python and R. Make sure you get the same answers with all of them.
Now forget all of that and read the deep learning book. Put TensorFlow and PyTorch on a Linux box and run examples until you get it. Do stuff with CNNs and RNNs and just feed forward NNs.
Once you do all of that, go on arXiv and read the most recent useful papers. The literature changes every few months, so keep up.
There. Now you can probably be hired most places. If you need resume filler, do some Kaggle competitions. If you have debugging questions, use StackOverflow. If you have math questions, read more. If you have life questions, I have no idea.
Still not enough. Come up with a novel problem where there’s no training data and figure out how to collect some. Learn to write a scraper, then do some labeling and feature extraction. Install everything on EC2 and automate it. Write code to continuously retrain and redeploy your models in production as new data becomes available.
While being short, harsh and very difficult, this guide is quite great and it will get you to a hireable level.
Of course there are many other ways to data science, so I will offer mine. It is not perfect, but it is based on my experience.
My Roadmap
There is one skill which will get you very far. If you do not have it yet, I urge you to develop it. This skill is… formulating thoughts, searching for information, finding it and understanding it. Seriously! Some people cannot formulate thoughts, some are unable to find solutions to the most basic questions, some do not know how to properly create google queries. This is a basic and necessary skill, and you must perfect it!
Choose a programming language and study it. Usually it would be Python or R. I highly recommend choosing Python. I won’t list the reasons, because there are a lot of arguments about R/Python out there already; I personally think Python is more versatile and useful. Spend 2–4 weeks on learning the language, so that you can do basic things. Get a general understanding of the libraries used, such as pandas/matplotlib or tidyverse/ggplot2.
Go through the ML course by Andrew Ng. It is old, but it gives a great foundation. It could be useful to complete the tasks in Python/R, but it is not necessary.
Now take one more good course in ML (or a couple of them). For R users I recommend Analytics Edge; for Python users, mlcourse.ai. If you know Russian, this course on Coursera is also great. In my opinion mlcourse.ai is the best among these three. Why? It provides good theory and some tough assignments, which could already be enough. However, it also teaches people to take part in Kaggle competitions and make standalone projects. This makes it great for practice.
Study SQL. In most companies data is kept in relational databases, so you will need to be able to get it out. Make yourself comfortable with using select, group by, CTEs, joins and other things.
Try to work with raw data to get the experience of working with dirty datasets.
While the previous point may not be necessary, this one is mandatory: finish at least one or two complete projects. Perform a detailed analysis and modelling of some dataset, or create an app, for example. The main thing to learn is how to create an idea, plan its implementation, get data, work with it and bring the project to completion.
Go to Kaggle, study kernels and take part in competitions.
Join a good community. I joined ods.ai, a community of 15k+ active Russian data scientists (by the way, it is open to data scientists from any country), and it helped me a lot.
Studying Deep Learning is a completely different topic.
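For the SQL step in the roadmap above, here is a small sketch of the constructs it names (select, group by, a CTE and a join), run through Python's built-in sqlite3 module so it needs no database server. The customers and orders tables and their contents are invented for the example.

```python
import sqlite3

# An in-memory database with two tiny, made-up tables.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE orders (id INTEGER PRIMARY KEY, customer_id INTEGER, amount REAL);
    INSERT INTO customers VALUES (1, 'Ann'), (2, 'Bob');
    INSERT INTO orders VALUES (1, 1, 10.0), (2, 1, 15.0), (3, 2, 7.5);
""")

QUERY = """
    WITH totals AS (              -- CTE: one aggregated row per customer
        SELECT customer_id, SUM(amount) AS total
        FROM orders
        GROUP BY customer_id
    )
    SELECT c.name, t.total        -- join the aggregate back to the names
    FROM customers AS c
    JOIN totals AS t ON t.customer_id = c.id
    ORDER BY t.total DESC
"""

rows = conn.execute(QUERY).fetchall()
conn.close()
print(rows)
```

The same query text would work with little change against most relational databases; only the connection code is specific to SQLite.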
This is only the beginning. Following this roadmap (or doing something similar) will help you start your journey to becoming a data scientist. The rest is up to you!
- New note - On 1807.06, I ceased daily transmission of my Hey Mom feature after three years of daily conversations. I plan to continue Hey Mom posts at least twice per week but will continue to post the days since ("Days Ago") count on my blog each day. The blog entry numbering in the title has changed to reflect total Sense of Doubt posts since I began the blog on 0705.04, which include Hey Mom posts, Daily Bowie posts, and Sense of Doubt posts. Hey Mom posts will still be numbered sequentially. New Hey Mom posts will use the same format as all the other Hey Mom posts; all other posts will feature this format seen here.