Hello, and welcome to The Aggregate, the newsletter offering in-depth analysis of topical yet unusual datasets and technical topics. If you want to sign up, a button to do that is below, or just read on!
The Weekly Technical:
If you are a data journalist, a data scientist, or anyone else who works with data on a daily basis, some tasks need to be done on a recurring schedule. For example:
1. A data journalist is scraping a website that updates a table daily.
2. A data engineer is running an ETL job to transform unstructured input data into structured CSV output files.
3. A data scientist is running a regression on a dataset that is updated daily and wants to see whether the significance of the given variables changes over time.
While you could run python ETL.py by hand every day at a specified time, what if the computer could run the code automatically, on a schedule? A piece of software known as cron, found on Unix-like operating systems (and available through WSL on a Windows computer), makes that possible. The computer runs your code at an exact time, so you no longer have to hope you have access to the command line whenever the script is due.
What is a Cron Job?
Cron is a job scheduler for Unix-like operating systems such as Linux or macOS. You use cron to schedule jobs (shell commands, shell scripts, or scripts in other languages) to run at specific times.
It is called a cron job because you write a separate cron entry for each script you want to run, and each individual entry is known as a job.
How to Write a Cron Job
The syntax of a cron schedule can look obtuse, so let us start with an example and explain it from there.
0 12 * * *
There are five inputs when writing a cron schedule expression. The first is the minute, the second is the hour, the third is the day of the month, the fourth is the month, and the fifth is the day of the week.
Parts 1, 2, and 4 are pretty simple: input a value from [0-59], [0-23], and [1-12] to select the minute, hour, and month when the script associated with a cron job will run. The first day field (3) represents the day of the month, so it has a range of [1-31]. The second day field (5) represents the day of the week and takes a value from [0-6], with zero representing Sunday and six representing Saturday. An asterisk (*) in any field means every value, so the example above runs at minute 0 of hour 12, on every day of every month.
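To make the five fields concrete, here is a small Python sketch that splits a schedule expression into named fields, checks each value against the ranges described above, and tests whether a given moment falls on the schedule. It is only an illustration, not how cron itself is implemented: the function names parse_cron and matches are made up for this example, and it handles only * and single integers, while real cron also accepts lists, ranges, and steps.

```python
from datetime import datetime

# Field names and allowed ranges, in the order they appear in a cron expression.
FIELDS = [
    ("minute", 0, 59),
    ("hour", 0, 23),
    ("day_of_month", 1, 31),
    ("month", 1, 12),
    ("day_of_week", 0, 6),  # 0 = Sunday ... 6 = Saturday
]


def parse_cron(expression):
    """Split a five-field cron expression into a dict of field -> int or None.

    None stands for '*' (every value). Only '*' and single integers are
    handled in this sketch.
    """
    parts = expression.split()
    if len(parts) != len(FIELDS):
        raise ValueError(f"expected 5 fields, got {len(parts)}")
    schedule = {}
    for part, (name, low, high) in zip(parts, FIELDS):
        if part == "*":
            schedule[name] = None
            continue
        value = int(part)
        if not low <= value <= high:
            raise ValueError(f"{name} must be in [{low}-{high}], got {value}")
        schedule[name] = value
    return schedule


def matches(schedule, moment):
    """Return True if the datetime `moment` falls on the schedule."""
    # Python counts Monday as 0; cron counts Sunday as 0.
    cron_weekday = (moment.weekday() + 1) % 7
    simple_fields = (
        ("minute", moment.minute),
        ("hour", moment.hour),
        ("month", moment.month),
    )
    for name, actual in simple_fields:
        if schedule[name] is not None and schedule[name] != actual:
            return False
    dom, dow = schedule["day_of_month"], schedule["day_of_week"]
    dom_ok = dom is None or dom == moment.day
    dow_ok = dow is None or dow == cron_weekday
    # When both day fields are restricted, cron runs if either one matches.
    if dom is not None and dow is not None:
        return dom_ok or dow_ok
    return dom_ok and dow_ok


noon_daily = parse_cron("0 12 * * *")
print(matches(noon_daily, datetime(2020, 6, 1, 12, 0)))   # True
print(matches(noon_daily, datetime(2020, 6, 1, 12, 30)))  # False
```

Trying a few expressions against dates you know is a quick way to check that a schedule means what you think it means before putting it in a crontab.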
For a cron job to do something, there needs to be an associated script or shell command that the cron job will execute at the given time. For example, imagine you had a shell script named print.sh that just echoed the number five. If you wanted to run it on a repeating schedule, you would write the following.
0 12 * * * print.sh
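At 12:00 PM every day, cron would execute the given script. Cron runs jobs with a minimal environment, so in practice it is safer to give the full path to the script than a bare file name.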
A word of warning: for cron jobs to work reliably, it is best to have a server that is running constantly, or at least running when the script is due to be triggered. If you are running a cron job locally, be careful not to shut down your computer at a time when a script is supposed to run.
To save a given cron job, you write it in a crontab file, which can be created and edited with crontab -e. Be warned that on many systems crontab -e opens vi by default, so it is :w to save without exiting, :q to quit (which only works if you have already saved the file), and :wq to save the file and exit vi.
An Actual Use Case
For a personal data journalism project, I wanted to scrape data from entgroup’s table on Chinese daily box office data. I could have scraped the data by hand every day, but I knew that the data was updated daily and that the tables had the same format, so the task was a good candidate for automation. Using rvest and DBI, I wrote a script that scraped the box office data and appended it to a MySQL database.
Instead of running this script by hand on a daily basis, I wrote the following cron job to automate it within my crontab file.
0 6 * * * Rscript entgroupdata.R > entgroup.log
Much easier than running the command by hand every day.
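For readers who work in Python rather than R, here is a rough sketch of the append step in a scheduled scrape like the one above. It is not the script described in this section: it uses an in-memory SQLite database as a stand-in for MySQL, and the table name, column names, and sample rows are all made up for illustration. The point it shows is that a job run by cron should be safe to run twice, so the table uses a primary key and the insert skips rows that are already stored. SQLite spells this INSERT OR IGNORE, while MySQL spells the same idea INSERT IGNORE.

```python
import sqlite3

# Hypothetical table layout: one row per film per day.
SCHEMA = """
CREATE TABLE IF NOT EXISTS box_office (
    report_date TEXT NOT NULL,
    title       TEXT NOT NULL,
    gross       REAL,
    PRIMARY KEY (report_date, title)
)
"""

INSERT = (
    "INSERT OR IGNORE INTO box_office (report_date, title, gross) "
    "VALUES (?, ?, ?)"
)
COUNT = "SELECT COUNT(*) FROM box_office"


def append_rows(conn, rows):
    """Insert scraped rows, skipping any (report_date, title) already stored.

    Returns the number of rows actually added.
    """
    conn.execute(SCHEMA)
    before = conn.execute(COUNT).fetchone()[0]
    conn.executemany(INSERT, rows)
    conn.commit()
    return conn.execute(COUNT).fetchone()[0] - before


# Made-up placeholder rows standing in for one day of scraped data.
conn = sqlite3.connect(":memory:")
day_one = [("2020-06-01", "Film A", 1.5), ("2020-06-01", "Film B", 0.7)]
print(append_rows(conn, day_one))  # 2: first run of the day stores both rows
print(append_rows(conn, day_one))  # 0: an accidental second run adds nothing
```

With an append step like this, a duplicate or overlapping cron run leaves the table unchanged instead of doubling the day's data.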
Now, some links…
Dana Kopel (Ssense): THE MUSEUM DOES NOT EXIST Pay What You Can, See What You Can’t
But the museum is also a place of work: art handlers, educators, curators, and countless others make the contemporary museum function. Since the COVID-19 crisis began, the worker activist initiative Art + Museum Transparency have been tracking museum layoffs on their Twitter account. The former Tenement Museum chief program officer, Michelle Moon, also keeps count in a public spreadsheet: as of mid-April, over ten thousand US museum workers have been laid off or furloughed as a result of decisions made by museum executives.
This likely excludes people who are rarely considered “employees” to begin with: temporary, contract, and gig workers. The Guggenheim, whose endowment was valued at $92 million in 2017, chose to pay regular staff but not those known as “on-call” when the museum first closed. Online, the Guggenheim Union shared a letter from a member pleading with executives for fair compensation during the crisis. “I’m asking you, mother to mother,” She wrote. “I have 3 small children. Your actions to not pay us for the same duration that you pay yourselves is unfair and cruel.” Meanwhile, at The Shed, an arts institution that cost $475 million to build—and directly benefited from $1.2 billion in public funds redirected to Hudson Yards from low-income Manhattan neighborhoods—nearly eighty unionized visitor experience workers have been furloughed. Art handlers at the Shed, who are not unionized, were abruptly forced to forgo anticipated pay.
Sara Jerde (AdWeek): Condé Nast Pins Revenue Growth on New Data Offering
Amid a pandemic that has left advertising spend sputtering and could permanently change consumer behavior, Condé Nast is releasing a new first-party data offering it hopes will draw interest from advertisers.
The offering, Now|New|Next Segments, uses Condé Nast’s first-party data across brands to share insights that speak to how consumers are spending now, how those behaviors might change and who will be the next customer who does spend as stay-at-home orders continue to lift.
It’s the first time the company (now a combined entity of Condé Nast and Condé Nast International) has released a data segment that can speak to its combined, international audience. “To be honest, really living in this moment in time engineered the idea,” said Pam Drucker Mann, global chief revenue officer and president of U.S. revenue.
It’s a way for the media company to distinguish itself from competitors looking to boost their data capabilities amid the pandemic. Data was a topic already on the minds of publishers before Covid-19 hit, with media businesses putting a bigger emphasis on acquiring first-party data as they faced new privacy regulations, starting with CCPA and now the looming cookie-less future.
Samuel Moyn (NYRB): The Trouble with Comparisons
One of the deepest American critics of such apologetic comparisons at the time was the Harvard University historian Charles Maier. Comparative exercises were crucial, Maier observed, but they were potentially misleading, too—especially when analogies were made without the balance provided by its obverse, disanalogy. “Any genuine comparative exercise emphasizes uniqueness as much as similarity; it establishes what is common in contrast to what is distinctive,” Maier, as master of comparative analysis himself, concluded. “Comparison must be a two-edged sword.” Indeed, as one of the greatest modern historians, the Frenchman Marc Bloch, had argued fifty years earlier, the whole point of comparison, when responsible, is to isolate what is singular and thus in need of new attention. A comparison cannot be about ignoring distinctions, but must isolate them, or it is negligent or reckless.
L’Atelier: The Virtual Economy
Elsewhere, a 17-year-old just earned $500 for designing a gun that can’t shoot anyone. A 21-year-old earned $125 for controlling someone’s identity for an hour, and a 16-year-old just won $3 million in prize money for eliminating everyone on a virtual island in front of millions of people.
This is the Virtual Economy. An agglomeration of sophisticated platforms, fledgling and often dubious marketplaces, skilled nixers, volatile assets, and ambitious pioneers that exist or operate uniquely in virtual environments. It sits just out of reach, behind a digital curtain, invisible to most of us. Within it, there is a galaxy of activity and opportunity. A new economic frontier that may just be the answer to the generational wealth gap.
It is a place frequented, with varying degrees of immersion, by some 2.5 billion people through phones, consoles, laptops, desktops, and headsets. It is an environment native to the tech savvy, a dual citizenship of sorts for the technically fluent. It is a place where people go to socialise, to play, to create, to work, to fantasise, to deceive, and to prosper.
Alexandra Scaggs (Barrons): Cities and States Need Funding Help. It Won’t Come Cheap.
A broad range of U.S. companies are now getting some form of support from the Federal Reserve’s purchases of corporate bond funds. But state and local governments have to wait longer—and clear arguably higher hurdles—to access central-bank financing.
The disparity has already had consequences for bond markets. And it could continue to weigh down municipal bonds’ performance relative to corporate bonds.
In Tuesday testimony to Congress, Fed Chair Jerome Powell said that the municipal liquidity facility should be fully operational by the end of this month. Before they can apply to borrow from the Fed, municipalities must first file a “Notice of Interest” with the central bank, and the New York Fed posted the materials necessary to send such a notice last week.
Miscellany:
For a refresher on regulatory policy, I have been rereading Jerry Brito’s Regulation: A Primer. It is a very brief book, only 120 pages long. It is somewhat libertarian-flavored, but that does not detract from the useful thing the primer does, which is to actually explain what goes on in regulatory agencies like FERC (the Federal Energy Regulatory Commission).
The Summer 2020 edition of American Affairs came out pretty recently.
Thanks!
Thanks for taking the time to read this. I will be back soon! In the meantime, you can follow me on Twitter or reach out via email.