xfe.li

Listening to the Twitter Stream API in real time with PHP.

2022-08-31 • Félix Dorn
This article is over a year old. I may think differently now, or it may be outdated.

Monitoring tweets about your company or an event, building real-time analysis on a variety of subjects: Twitter’s Streaming API has an enormous number of use cases for individuals, companies and governments. Let us dig into how it works and how to integrate it with PHP.

Recently, Twitter released a new version of their API, soberly named “Twitter V2”, which includes more features and filtering options (that is good). While it is not at feature parity yet, it is pretty close; notably, all the stream-related endpoints surpass the previous version.

This new version comes with a number of changes that break older packages [01], most notably fennb/phirehose. There are alternatives, like spatie/laravel-twitter-streaming-api and spatie/twitter-streaming-api, which kept the same public API for V2 as for V1, drastically limiting the use cases but allowing for an easy upgrade.

Anyway, the last two both rely on my package, so let us use that. I will go over everything you need to know to work with the API.

Following along

You will need an approved developer account. If you do not have one, apply here. The process usually takes a few days for the “essential access”. If you want more features and more tweets pulled per month, you may apply for the “elevated access” once you have the “essential access”; it takes a week in my experience.

You will then need to create an “application” and get your bearer token.

Make sure you have PHP and Composer installed and require the package:

composer require redwebcreation/twitter-stream-api

I also recommend downloading other packages to make development easier:

composer require --dev symfony/var-dumper vlucas/phpdotenv nunomaduro/collision

symfony/var-dumper is for debugging (it includes the dd and dump functions), vlucas/phpdotenv parses and loads env files, and nunomaduro/collision is a nice error handler for the terminal.

You should also create an .env file with the following contents:

TWITTER_BEARER_TOKEN=...

Then, create a tinker.php file and paste the following:

<?php

require __DIR__ . '/vendor/autoload.php';

use Felix\TwitterStream\Streams\VolumeStream;
use Felix\TwitterStream\Streams\FilteredStream;
use Felix\TwitterStream\TwitterConnection;
use NunoMaduro\Collision\Provider;
use Dotenv\Dotenv;

// Nicer error output in the terminal.
(new Provider)->register();

// Load the variables from the .env file into $_ENV.
$dotenv = Dotenv::createImmutable(__DIR__);
$dotenv->load();

$bearerToken = $_ENV['TWITTER_BEARER_TOKEN'];

$stream = new VolumeStream();
$connection = new TwitterConnection($bearerToken);

// Print the text of every incoming tweet.
$stream->listen($connection, function (object $tweet) {
    echo $tweet->data->text . PHP_EOL;
});

You can already listen to a stream!

php tinker.php

You should see a bunch of text after a few seconds. Congrats, you are listening to Twitter’s Streaming API in real-time! Yay.

Types of stream

You have already used one of the two types available in the example above: the volume stream, which returns roughly 1% of all the new tweets. There is another one called the filtered stream, which returns all tweets matching a set of rules (more on that later).

There are no technicalities surrounding the volume stream. For this reason, I will use it to demonstrate how this package works before diving into the specifics of the filtered stream.

use Felix\TwitterStream\Streams\VolumeStream;
use Felix\TwitterStream\TwitterConnection;

$stream = new VolumeStream();
$connection = new TwitterConnection(bearerToken: '...');

$stream
    ->withTweetLimit(100)
    ->listen($connection, function (object $tweet) {
        echo $tweet->data->text;
    });

Let's break down this piece of code. The TwitterConnection object uses your bearer token to authenticate you. Then, and this is true for any stream that implements the TwitterStream interface, we call the listen(Connection, callable) method to start listening to the stream. Be careful, as Twitter heavily limits the number of calls for both streams: "50 requests per 15-minute window" [02], that is, one request every 18 seconds on average.

Each incoming tweet is passed to the callable. In this case, $tweet is an object that follows the same structure as the default Tweet object.
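The 18-second figure follows directly from the rate limit. Here is a minimal sketch of the arithmetic; the helper name minSecondsBetweenConnections is mine and is not part of the package:

```php
<?php

// Hypothetical helper, not part of the package: spreads the allowed
// requests evenly over the rate-limit window.
function minSecondsBetweenConnections(int $maxRequests, int $windowMinutes): float
{
    return ($windowMinutes * 60) / $maxRequests;
}

// 50 requests per 15-minute window, as quoted above.
echo minSecondsBetweenConnections(50, 15) . PHP_EOL; // 18
```

If your script reconnects in a loop (after a crash, for instance), sleeping at least that long between attempts should keep you under the limit.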

You may also access inside the callable:

  • The number of tweets received, via $stream->tweetsReceived().

  • The UNIX timestamp at which the stream started, via $stream->createdAt()

  • The number of seconds since the stream started, via $stream->timeElapsedInSeconds()

  • A way to stop further processing, via $this->stopListening()

However, due to PHP's limitations, you cannot stop the stream after a given amount of time; you may only stop processing as soon as you receive a tweet after an arbitrary deadline. It's usually a technicality more than a problem. However, if it's a deal-breaker for you, check out ReactPHP, an event-driven, non-blocking I/O toolset that could solve this problem [03].
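Here is a minimal sketch of such a deadline check, assuming the callback is the only place where your script regains control. The deadlineReached helper is hypothetical; the commented usage relies on the createdAt() and stopListening() methods listed above and is untested:

```php
<?php

// Hypothetical helper: true once $maxSeconds have passed since $startedAt.
// $now can be injected to make the function easy to test.
function deadlineReached(int $startedAt, int $maxSeconds, ?int $now = null): bool
{
    $now ??= time();

    return ($now - $startedAt) >= $maxSeconds;
}

// Inside the listen() callback, this could look like the following
// (untested sketch, using the methods listed above):
//
//     if (deadlineReached($stream->createdAt(), 60)) {
//         $this->stopListening();
//     }

var_dump(deadlineReached(1_000, 60, 1_030)); // bool(false)
var_dump(deadlineReached(1_000, 60, 1_060)); // bool(true)
```

Keep in mind that the check only runs when a tweet arrives, so on a quiet filtered stream the script may keep listening well past the deadline.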

Fields & Expansions

By default, Twitter sends little data about the tweet. To get more information, you will need to explicitly request it using fields and (fields) expansions.

Fields

Fields allow for more customization regarding the payload returned per tweet. Let's see that in an example below:

$stream
    ->fields([
        // alternatively, you can also pass in an array
        'tweet' => 'author_id'
    ])
    ->listen(...);

Which could return:

{
    "data": {
        "id": "1234321234321234321",
        "text": "Hello world!",
        "author_id": "5678765678765678765"
    }
}

Here's the list of all the available field types and their respective object model (last updated: Aug. 2022):

You can also check out Twitter’s documentation for more details.

Expansions

Expansions let you expand IDs into their complete objects. For example, if you request the extra author_id field, you may expand it using the author_id expansion:

$stream
    ->fields(['tweet' => 'author_id'])
    ->expansions('author_id')
    ->listen(...);

Which could return:

{
    "data": {
        "id": "1234321234321234321",
        "text": "Hello world!",
        "author_id": "5678765678765678765"
    },
    "includes": {
        "users": [
            {
                "id": "5678765678765678765",
                "name": "John Doe",
                "username": "johndoe"
            }
        ]
    }
}
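Note that the expanded user lives next to the tweet, under includes, not inside it, so you have to match the IDs yourself. A minimal sketch using the sample payload above; the findAuthor helper is hypothetical and not part of the package:

```php
<?php

// The sample payload from above, decoded into objects the same way the
// callback receives them.
$payload = json_decode(<<<'JSON'
{
    "data": {
        "id": "1234321234321234321",
        "text": "Hello world!",
        "author_id": "5678765678765678765"
    },
    "includes": {
        "users": [
            {"id": "5678765678765678765", "name": "John Doe", "username": "johndoe"}
        ]
    }
}
JSON);

// Hypothetical helper: finds the expanded user matching the tweet's author_id.
function findAuthor(object $tweet): ?object
{
    foreach ($tweet->includes->users ?? [] as $user) {
        if ($user->id === $tweet->data->author_id) {
            return $user;
        }
    }

    return null;
}

$author = findAuthor($payload);
echo "@{$author->username}: {$payload->data->text}" . PHP_EOL; // @johndoe: Hello world!
```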

The list of expansions is quite extensive and not all expansions work the same way; please check out Twitter's documentation on the subject.

Filtering the stream

This part only applies if you're interested in the filtered stream.

Twitter built its own query language that enables fine-grained control over which tweets you receive. Let's dig into it.

Building a rule

Rules are a list of filters that narrow down the results from the 6000 tweets per second that you could theoretically get to "only" a few hundred per second, depending on the specificity of your filters, of course. Each rule contains a query and a label ("tag") for this query, and is stored on Twitter's side. Rules persist between connections. However, they do expire if unused for more than 180 days; you'll get a 30-day notice. A filtered stream can receive tweets from more than one rule: five for the "essential access", twenty-five for the "elevated access" and a thousand for the "academic research access" [04]. Each rule must be unique to your stream.

Note: if you change your rules while connected to the stream, Twitter will use the new rules immediately [05].
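Tags become useful once several rules feed the same stream: each tweet delivered by the filtered stream carries a matching_rules array naming the rules it matched. Here is a minimal sketch that tallies tweets per tag. The payloads are hand-written samples trimmed to the relevant keys, not real API output, and the countByTag helper is hypothetical:

```php
<?php

// Hypothetical helper: counts tweets per rule tag.
function countByTag(iterable $tweets): array
{
    $counts = [];

    foreach ($tweets as $tweet) {
        // A tweet may match several rules at once.
        foreach ($tweet->matching_rules ?? [] as $rule) {
            $counts[$rule->tag] = ($counts[$rule->tag] ?? 0) + 1;
        }
    }

    return $counts;
}

// Hand-written sample payloads, trimmed to the relevant keys.
$tweets = [
    json_decode('{"data": {"id": "1", "text": "a"}, "matching_rules": [{"id": "10", "tag": "cat pictures"}]}'),
    json_decode('{"data": {"id": "2", "text": "b"}, "matching_rules": [{"id": "10", "tag": "cat pictures"}, {"id": "11", "tag": "dog pictures"}]}'),
];

print_r(countByTag($tweets));
```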

Before jumping into rule building, let's learn how to save and delete rules using this package.

Save, read and delete rules

You cannot update a rule; to change one, delete it and save a new one.

use Felix\TwitterStream\Rule\RuleManager;

$rule = new RuleManager($connection);

Let's create a rule:

$rule->save(
    # tweets must contain the word cat and have at least one image
    "cat has:images",
    "images of cats"
);

You may now retrieve your newly saved rule:

$rule->all();

Which returns an array of Felix\TwitterStream\Rule\Rule:

[
    0 => Felix\TwitterStream\Rule\Rule{
        +value: "cat has:images",
        +tag: "images of cats",
        +id: "4567654567654567654"
    }
]

Note: Felix\TwitterStream\Rule\Rule is merely a data object; it does not contain any methods.

To delete the rule, pass its ID to the delete method:

$rule->delete('4567654567654567654');

Batch Processing

To save many rules at once:

use Felix\TwitterStream\Rule\Rule;

$rule->saveMany([
    new Rule("cats has:images", "cat pictures"),
    new Rule("dogs has:images", "dog pictures"),
    new Rule("horses has:images", "horse picture"),
]);

To delete several rules at once, pass their IDs to deleteMany:

$rule->deleteMany([
    '1484148414841484148',
    '2585258525852585258',
    '5101510151015101510'
]);
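For reference, Twitter's rules endpoint takes a JSON body with an add key holding value/tag pairs. Here is a sketch of what an equivalent raw payload looks like; it illustrates the API's format and is not necessarily what the package sends internally. The buildAddRulesPayload helper is hypothetical:

```php
<?php

// Hypothetical helper: builds the JSON body for
// POST /2/tweets/search/stream/rules from a [query => tag] map.
function buildAddRulesPayload(array $rules): string
{
    $add = [];

    foreach ($rules as $value => $tag) {
        $add[] = ['value' => $value, 'tag' => $tag];
    }

    return json_encode(['add' => $add], JSON_THROW_ON_ERROR);
}

echo buildAddRulesPayload([
    'cats has:images' => 'cat pictures',
    'dogs has:images' => 'dog pictures',
]) . PHP_EOL;
```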

Validating your rules

Twitter has a dry-run mode, meaning you'll hit the endpoint but no rules will be created.

You can either use the validate method:

$rule->validate('cats ha:images');

Alternatively, the save and saveMany methods both have a dryRun parameter:

$rule->save('...', '...', dryRun: true);

$rule->saveMany([...], dryRun: true);

This package considers changing named parameters a breaking change, so you may use them safely.

Both ways would throw the following exception:

[UnprocessableEntity] cats ha:images : Reference to invalid operator 'ha'. Operator is not available in current product or product packaging. Please refer to complete available operator list at http://t.co/filteredstreamoperators. ( at position 6); Reference to invalid field 'ha' (at position 6) [https://api.twitter.com/2/problems/invalid-rules]
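On Twitter's side, the dry run is a query parameter on the rules endpoint. A minimal sketch of the URL involved; the rulesEndpoint helper is hypothetical, and I am not claiming this is how the package builds its requests:

```php
<?php

// Twitter's rules endpoint; appending dry_run=true asks the API to validate
// the submitted rules without saving them.
const RULES_ENDPOINT = 'https://api.twitter.com/2/tweets/search/stream/rules';

// Hypothetical helper, not part of the package.
function rulesEndpoint(bool $dryRun = false): string
{
    return $dryRun
        ? RULES_ENDPOINT . '?' . http_build_query(['dry_run' => 'true'])
        : RULES_ENDPOINT;
}

echo rulesEndpoint(dryRun: true) . PHP_EOL;
// https://api.twitter.com/2/tweets/search/stream/rules?dry_run=true
```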

Operators

Finally, how to build rules. We're well past 10,000 characters and the only rule you've seen was about cats. Images of cats. Let's do better.

Types of operators

To prevent you from retrieving all of Twitter in real time, you have to have at least one "standalone" operator. Standalone operators may be a hashtag, a word, an emoji, etc. These standalone operators cannot be a stopword (a word like "the", "is", "an", "you", etc. [07]). Here are a few examples:

  • cats, tweets containing the word "cats"

  • cool dogs, tweets containing the words “cool” and “dogs”, in any position

    Writing operators with a space between them is equivalent to writing cool AND dogs (more on boolean operators later).

  • "no way", tweets that contain the words “no way”, next to each other.

  • #future, tweets containing the hashtag future.

  • @afelixdorn, tweets that mention the given username.

Standalone operators are case-insensitive, meaning that the rule cool dogs would match “COOL DOGS”, “cOol dOGs”… Accents and diacritics, on the other hand, are respected: pequeño and pequeno are two different rules.

Note: a rule like no way may return a tweet without "no way" in it because the "no way" is in the quoted tweet. You may also encounter this behavior for replies (the parent tweet matches the rule but the reply is returned).

Quick tip: while debugging your rules, you can look up a tweet without knowing the author using the following URL template: https://twitter.com/_/status/ID_HERE [08]
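To make these two properties concrete, here is a rough local approximation of a single-keyword match. It is only a sketch for testing sample texts on your machine: Twitter's real tokenizer is more involved, and the keywordMatches helper is hypothetical. It assumes the mbstring extension is available:

```php
<?php

// Rough local approximation of a single-keyword match: case-insensitive,
// but accents and diacritics are kept as-is. This is NOT Twitter's
// tokenizer; it only mirrors the two properties described above.
function keywordMatches(string $keyword, string $text): bool
{
    // Lowercase the text, then split it on anything that is not a letter,
    // a digit or an underscore.
    $words = preg_split('/[^\p{L}\p{N}_]+/u', mb_strtolower($text), -1, PREG_SPLIT_NO_EMPTY);

    return in_array(mb_strtolower($keyword), $words, true);
}

var_dump(keywordMatches('dogs', 'cOol dOGs'));          // bool(true)
var_dump(keywordMatches('pequeño', 'Un PEQUEÑO gato')); // bool(true)
var_dump(keywordMatches('pequeno', 'Un pequeño gato')); // bool(false)
```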
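If you log tweet IDs while debugging, a tiny helper can turn them into clickable links. A minimal sketch; the tweetUrl helper is hypothetical:

```php
<?php

// Hypothetical helper: builds a lookup URL from a tweet ID alone.
// The "_" stands in for the unknown username, as in the template above.
function tweetUrl(string $id): string
{
    if (!ctype_digit($id)) {
        throw new InvalidArgumentException('Tweet IDs are numeric strings.');
    }

    return "https://twitter.com/_/status/{$id}";
}

echo tweetUrl('1234321234321234321') . PHP_EOL;
// https://twitter.com/_/status/1234321234321234321
```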

On the other hand, "conjunction-required" operators are not needed for a rule to be valid, but they let you narrow the results down to the tweets relevant to your use case.

Here are a few examples (I will omit the standalone operator that would be required):

  • -cats means “tweets without the word ‘cats’”. It is not a standalone operator because querying all tweets without a word may be too unspecific.

  • is:retweet, tweets that are “true” retweets. It does not include quoted tweets; there is an is:quote operator for that.

  • -is:retweet, all tweets except “true” retweets.

  • lang:fr, tweets identified as written in French by Twitter.

  • point_radius:[174 -41 20km], tweets posted within a circle whose center is given by the first two parameters (longitude 174, then latitude -41) and whose radius is the third one (20km).

I will quickly list all the available operators as of now (August 2022), just to give you a peek into how much you can do with Twitter’s Stream API: from:, to:, url:, retweets_of:, context:, entity:, conversation_id:, bio:, bio_name:, bio_location:, place:, place_country:, point_radius:, bounding_box:, is:retweet, is:reply, is:quote, is:verified, -is:nullcast, has:hashtags, has:cashtags, has:links, has:mentions, has:media, has:images, has:videos, has:geo, sample:, lang:, followers_count:, tweets_count:, following_count:, listed_count:, url_title:, url_description:, url_contains:, source:, in_reply_to_tweet_id:, retweets_of_tweet_id:.
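The parameter order of point_radius is easy to get wrong: Twitter documents it as longitude first, then latitude, with longitude within ±180 and latitude within ±90 degrees. Here is a minimal sketch of a formatter that guards against swapped coordinates; the pointRadius helper is hypothetical and not part of the package:

```php
<?php

// Hypothetical helper: formats a point_radius operator, longitude first.
function pointRadius(float $longitude, float $latitude, string $radius): string
{
    if ($longitude < -180 || $longitude > 180) {
        throw new InvalidArgumentException('Longitude must be within ±180 degrees.');
    }

    if ($latitude < -90 || $latitude > 90) {
        throw new InvalidArgumentException('Latitude must be within ±90 degrees.');
    }

    return sprintf('point_radius:[%s %s %s]', $longitude, $latitude, $radius);
}

// Roughly Wellington, New Zealand: longitude 174, latitude -41.
echo pointRadius(174, -41, '20km') . PHP_EOL; // point_radius:[174 -41 20km]
```

Swapping the two coordinates here would throw, because 174 is not a valid latitude.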

Wow. Took some time.

Let us build a rule that retrieves tweets about songs that people are listening to.

$rule->new('listening to music')
    ->raw('#nowplaying')
    ->isNotRetweet()
    ->lang('en')
    ->save();

Okay, this is cool, you can try and run it. Here is a complete example:

<?php

use Felix\TwitterStream\Rule\RuleManager;
use Felix\TwitterStream\Streams\FilteredStream;
use Felix\TwitterStream\TwitterConnection;

require __DIR__ . '/vendor/autoload.php';

$stream = new FilteredStream();
$connection = new TwitterConnection(bearerToken: '');
$rule = new RuleManager($connection);

// Build the rule and store it on Twitter's side.
$rule->new('listening to music')
    ->raw('#nowplaying')
    ->isNotRetweet()
    ->lang('en')
    ->sample(10) // only returns 10% of the available tweets
    ->save();

// Dump every tweet matching the saved rules.
$stream->listen($connection, dump(...));

If you are unfamiliar with the first-class callable syntax dump(...), here is the RFC. This example also assumes that you have symfony/var-dumper installed.

Compiling this would produce the following:

#nowplaying -is:retweet lang:en sample:10

Note: while the query builder makes heavy use of magic methods to let you use nice method names like isNotRetweet, exceptFromLang, andNotFrom…, you still get full autocompletion (as long as your editor understands PHPDoc).
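If you are curious how chained calls can end up as a single string, here is a toy builder. To be clear, this is not the package's implementation (which relies on magic methods); it is a deliberately simplified sketch with explicit methods, and the ToyRuleBuilder class is hypothetical:

```php
<?php

// Toy query builder, NOT the package's implementation: it only shows how
// chained calls can be compiled into a single rule string.
final class ToyRuleBuilder
{
    /** @var string[] */
    private array $parts = [];

    public function raw(string $operator): self
    {
        $this->parts[] = $operator;

        return $this;
    }

    public function isNotRetweet(): self
    {
        return $this->raw('-is:retweet');
    }

    public function lang(string $code): self
    {
        return $this->raw("lang:{$code}");
    }

    public function sample(int $percent): self
    {
        return $this->raw("sample:{$percent}");
    }

    public function compile(): string
    {
        // Operators are joined with a space, like in the compiled rule above.
        return implode(' ', $this->parts);
    }
}

echo (new ToyRuleBuilder())
    ->raw('#nowplaying')
    ->isNotRetweet()
    ->lang('en')
    ->sample(10)
    ->compile() . PHP_EOL; // #nowplaying -is:retweet lang:en sample:10
```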

To quickly debug a rule, you may call dd() at any time on the query builder: $rule->new()->...()->dd(). If the function dd does not exist, it defaults to var_dump and die.

Boolean Operators

We are talking about ANDs and ORs here.

ANDs

We have seen previously that separating operators with a space is equivalent to writing “AND”, which means there is no real need for the AND keyword. You may use it if it makes your query easier to understand, but be careful: rules have a maximum length, and you lose 4 characters for every AND you add.

Here are a few examples:

$rule->raw('dog')->andRaw('doggy'); // (1) dog AND doggy
$rule->raw("I'm famous")->andNotVerified(); // (2) "I'm famous" AND -is:verified
$rule->raw('big')->and->raw('house'); // (3) big AND house
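To see the cost of explicit ANDs, here is a minimal sketch that strips them from a rule string and measures the difference. The stripExplicitAnds helper is hypothetical, and it is deliberately naive: it would also rewrite an " AND " sitting inside a quoted phrase.

```php
<?php

// Hypothetical helper: drops explicit ANDs, since a space already means AND.
// Naive on purpose: it does not protect quoted phrases.
function stripExplicitAnds(string $rule): string
{
    return preg_replace('/\s+AND\s+/', ' ', $rule);
}

$verbose = 'dog AND doggy AND -is:retweet';
$compact = stripExplicitAnds($verbose);

echo $compact . PHP_EOL;                                    // dog doggy -is:retweet
echo (mb_strlen($verbose) - mb_strlen($compact)) . PHP_EOL; // 8
```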

These would return exactly the same tweets as the examples below (without ANDs):

$rule->raw('dog')->raw('doggy'); // (1) dog doggy
$rule->raw("I'm famous")->exceptVerified(); // (2) "I'm famous" -is:verified
$rule->raw('big')->raw('house'); // (3) big house

You can use exceptSomething or notSomething interchangeably for is and has operators. Often, one sounds better than the other; apart from that, there is no rule and no difference.

ORs

ORs follow the same syntax as ANDs and behave as one would expect: “Successive operators with OR between them will result in OR logic, meaning that Tweets will match if either condition is met.” [09]

$rule->raw('study')->orRaw('paper'); // study OR paper
$rule->raw('apple')->raw('iphone')->or->raw('ipad'); // apple iphone OR ipad

About the order of operations: AND is applied before OR. tomato potato OR carrot is evaluated as (tomato potato) OR carrot, which corresponds to “tweets containing both ‘tomato’ and ‘potato’, or tweets containing ‘carrot’”. Likewise, tomato OR potato carrot is evaluated as tomato OR (potato carrot). To get “tweets containing ‘tomato’ and either ‘potato’ or ‘carrot’”, group explicitly with parentheses: tomato (potato OR carrot).
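Twitter's query language accepts parentheses, so when a rule mixes ANDs and ORs, spelling out the grouping removes any doubt about how it is evaluated. A minimal sketch with a hypothetical anyOf helper that builds the grouped string (which you could then pass to raw()):

```php
<?php

// Hypothetical helper: wraps alternatives in parentheses so the grouping
// is explicit in the resulting rule.
function anyOf(string ...$terms): string
{
    return '(' . implode(' OR ', $terms) . ')';
}

echo 'tomato ' . anyOf('potato', 'carrot') . PHP_EOL; // tomato (potato OR carrot)
echo anyOf('tomato', 'potato') . ' carrot' . PHP_EOL; // (tomato OR potato) carrot
```

The parentheses cost two characters per group, which is usually a fair price for a rule you can read at a glance.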

Do not forget to check out Twitter's documentation.

Conclusion

Twitter’s Stream API is a great way to listen to what is happening right now, but rules are very hard to get right: iterate on them.

Trends change. If you are planning on running your script for a long time, check your data regularly to make sure you are getting what you think you are getting.

Anyway, thanks for reading.


[03]

I am only assuming that using ReactPHP would be the most straightforward way to implement a timeout-based disconnection; it may not be the case. Please reach out if you know better. In the meantime, here are links to two probably relevant packages: reactphp/promise and reactphp/promise-timer.

[06]

The list of stopwords isn't public, and common lists of stopwords match very poorly with Twitter's (undisclosed) list, most likely because those lists are meant to filter out insignificant words in natural language data, not to prevent developers from abusing Twitter's API. The list above was built through trial and error by calling the API to check each stopword individually. I checked ~700 stopwords and only found four ("the", "is", "an", "you").

[08]

The is:retweet operator, “matches on Retweets that match the rest of the specified rule. This operator looks only for true Retweets (for example, those generated using the Retweet button). Quote Tweets will not be matched by this operator.” – source