-
Notifications
You must be signed in to change notification settings - Fork 721
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
A Simpler Fix to the Streaming Code due to Changes from Twitter on Jan. 13, 2014. #196
Conversation
…ports to function from within the module.
@@ -34,6 +36,9 @@ def __iter__(self): | |||
while True: | |||
try: | |||
utf8_buf = self.buf.decode('utf8').lstrip() | |||
if utf8_buf and utf8_buf[0] != '{': # Remove the hex delimiter length and extra whitespace. | |||
utf8_buf = utf8_buf.lstrip('0123456789abcdefABCDEF') | |||
utf8_buf = utf8_buf.lstrip() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Need to try it out but I'm worried this should break the delimited=true argument from twitter API by doing this this way
All right, so I did a few tryouts and I confirm there are still a few issues there:
But the gzip simplification is indeed better, I'll add it to my branch, and the lstrip way is certainly a better path than my re way. I'll try and port it to my branch later on. Btw, I removed completely in my version the except TwitterHTTPErrror which you commented here as pointless, I agree with you. |
RouxRC, People embrace small changes. I reverted to the master branch and just focussed upon the problem facing me: the broken public streams of January 13th. Everyone can judge whether I solved the problem of the moment. For my purposes, I did. You believe that it doesn't handle user streams. While I don't use this library for user streams, as a public service, I'll take a look at those later today. People also don't want to lose capabilities, such as Python3 support. This pull request supports both Python dialects. I hope you migrate your pull request to support both dialects. Anon, |
I just did yes (cf 82c7c43). My branch now works with both version of python. All I was saying is that the bug we try to fix here corresponds to the same source than the bug that broke userstream when it migrated 2 months earlier than the public one. Fixing only one of the two here seems to me like as bad as not supporting python 3 imho, it's just as much incomplete and "losing capabilities" :) |
The latest fix addresses RouxRC's concerns about the user stream. I have reworked my fix to improve the logic in two ways. First, the chunk header processing is removed from buffer management. That was the wrong place for it. Second, the This code has run for 3 days and processed over 12M tweets. While it could still have problems, I am quite happy with it and hope Mr. Verdone chooses to base the canonical fix for his important library on it. |
Thanks Andrew, this looks a lot better indeed! And it looks like the delimited argument should work with this version now as well. It makes total sense this would be more efficient this way, I was hesitating doing something like this with the chunks size as discussed earlier. Glad to see it's working well when implemented! |
Hi folks - Just wanted to express my support for this fix, and ask if there's anything I can specifically do to help? I'm tied up until the evening, but would be glad to jump in and help with whatever may be worthwhile at this stage. |
Matthew, The key thing needed right now is that the patch needs to be used and abused. I primarily use it sitting on the sample and filter endpoints. At this point, it has probably downloaded 25M+ tweets to my systems. It has had a modest number of timeouts, but then any network app will have timeouts. I don't particularly use it to track the site stream or user accounts. (I should probably set up a daemon to track my personal account to test the code.) BTW, I did not get a chance to attend your seminar at Data Day Texas. Folks tell me it a good presentation. I saw your talk though. Keep up the good work. Anon, |
…This reduces memory consumption and is faster than the += operator for buffer concatenation and trimming.
Gentlefolk, I am a programmer who appreciates precise code. As Python is new to me, though I am an old and grumpy programmer, I am always looking for how to improve the code. The prior patch, while simplistically Pythonic, was not efficient and wasted memory with many large extra buffers being created and thrown away. I've chosen to handle it by just parsing what is needed into an int() for the chunk size determination and then moving to a byte array to handle the perhaps multi-stage read from the underlying socket. Also, the byte array is used to trim the trailing CRLF off of the buffer, saving us a needless trip through the JSON parser. In other words, this patch has no effect on your data. It has an affect on your memory usage/garbage collection. It reduces both. If you are happy with the prior patch, then you need not upgrade to this patch. I am using it in my production stack for the last 12+ hours. Also, as before, this patch has been tested on Python v3.3.3 and Python v2.7.6 on OS X 10.8.5 (Mountain Lion). Because I use Anon, |
…y use by using a chunk sized bytearray.
…ion. This slightly improves locality of code and data.
…fy buffer management to only operate on strings and not re-encode the string as bytes. This improves readability at the expense of breakage if Twitter starts spanning JSON across HTTP chunks. This is an unlikely change to their infrastructure. That said, this is a totally optional patch.
All HTTP chunks are read in their entirety. Cosmetic code improvements. (The socket's blocking state is set in a more compact form after a DeMorgan's boolean transformation.) Hangups by Twitter, as with timeouts, are signaled via a message to allow gracious recovery.
I stage my fixes to this patch, |
Simplify protocol parsing checks. Move HTTP exception catching to where it might actually be thrown.
A few cosmetic edits of the iterator function.
…n the Twitter stream.
Gentlefolk, This is a candidate release patch. I propose it become the formal branch of this library and have dubbed it version v1.10.3. I once again formally thank RouxRC for his efforts moving this library forward. Any errors in this patch remain mine and do not reflect upon RouxRC or his code. This library is a high performance streaming library. Compared to other Twitter libraries, it is easily an order of magnitude faster at delivering tweets to your application. Why is that? When streaming, this library pierces Python's urllib abstraction and takes control of the socket. It interprets the HTTP stream directly. That makes it fast. It also makes it vulnerable to changes. It needed to be upgraded when Twitter upgraded the protocol version. Twitter's switch to HTTP v1.1 was long overdue. Summary of changes:
Thank you for your patience while I refine this patch and I ask Mr. Verdone to select this patch as the basis for moving this library forward. Enjoy and Anon, |
Great job, thanks for your work on this. So far it is working well. Will let you know if I run into any issues with it. |
Yowza! Whoa, so I travel for most of December and January, and in the interim Twitter utterly breaks my library, and then a task force self-assembles to totally fix the mess. Great work! As I understand it, this is the "best" branch in that it is 2.7 and 3.3+ compatible, and fixes the specific blocking issues. Is that correct? If so I'll merge this and release it as 1.11. @RouxRC if I merge this it will probably break your branch. Are you willing to resubmit your improvements afterward? They can go out as 1.12. |
Yep no worries, like I said, I'll keep using my branch until I get the time to port the other changes sometime in the coming weeks. Let's move forward :) |
A Simpler Fix to the Streaming Code due to Changes from Twitter on Jan. 13, 2014. Gentlefolk, This is a candidate release patch. I propose it become the formal branch of this library and have dubbed it version v1.10.3. I once again formally thank RouxRC for his efforts moving this library forward. Any errors in this patch remain mine and do not reflect upon RouxRC or his code. This library is a high performance streaming library. Compared to other Twitter libraries, it is easily an order of magnitude faster at delivering tweets to your application. Why is that? When streaming, this library pierces Python's urllib abstraction and takes control of the socket. It interprets the HTTP stream directly. That makes it fast. It also makes it vulnerable to changes. It needed to be upgraded when Twitter upgraded the protocol version. Twitter's switch to HTTP v1.1 was long overdue. Summary of changes: - Based upon RouxRC's code, I turned off gzip compression. My version is slightly different than RouxRC's version. - Instead of incrementally reading arbitrary lengths of bytes from the socket and seeing if they parse in the JSON parser, a good technique, the switch to HTTP chunking forced us to process in chunk sized blocks. Based upon inspection, Twitter never sends partial JSON in a chunk. They also send keep-alive delimiters in single 7 byte long chunks. This code depends upon both of these observations. It does not do general purpose HTTP chunk processing. It is a Twitter specific HTTP chunk parser. - Chunk oriented processing allowed me to isolate stream interpretation to the chunk code and migrate the wrapper code to operate exclusively using strings. This makes the wrapper code more readable. - Once I had opened up the wrapper code, I cleaned it up. This involved modest edits in how certain socket parameters were determined and moving data exclusive to the generator into the generator and out of the containing object. - As this is exclusively socket oriented code, the HTTP exception catching was removed from the method. The exception was moved to wrap the opening of the socket by url lib. - Due to reading the data in larger chunks and, hence, running it through the JSON parser less often, this code is about 10% faster than the prior generation. - When Twitter hangs up on us, this code emits a `hangup` message in the stream. - This code has been tested using Python v2.7.6 and v3.3.3 on OS X 10.8.5 (Mountain Lion). I have tested it on the high volume sample stream and on a user stream under both versions of Python. It is believed, but not tested, that it will function under Python v2.6.x. It uses the bytearray type. I believe that has been back ported all the way to Python v2.6.x. As the code is not particularly tricky, I do not foresee that it has introduced any new issues that were not already apparent in this library. - I use this patch in production and have captured 50M+ tweets with it. It is solid and reliable. If you find it to not be so, please contact me. I use it in production and have a vested interest in ensuring that it catches all corner cases. Thank you for your patience while I refine this patch and I ask Mr. Verdone to select this patch as the basis for moving this library forward. Enjoy and Anon, Andrew
Gentlefolk,
First allow me to applaud and thank RouxRC on his fine work pushing this little library forward. This pull request would not have been possible without his efforts. That said, RouxRC has dropped a quite large pull request that had problems in my Python3 system. When I rolled back to Python2, the system started working but would not stay connected for very long.
As a result of the above, I felt I needed to look into RouxRC's pull request in more detail and use just the parts I needed to get my streams running again. Any errors in this pull request are mine and do not reflect negatively on RouxRC or his code.
RouxRC's streaming patch falls into two main areas. First, it modifies the
api.py
to respect a new property,gzip
. I have largely copied his work here. I did simplify how he set the headers from 3 lines to one. The second fix occurs instream.py
. Once the gzip compression was turned off, I was able to see what Twitter is now sending us. Even though length delimiters are turned off, Twitter nonetheless is inserting a hex encoded number and a\r\n
between JSON items. RouxRC and I differ in how we chose to address these changes. My fix is focussed upon handling those extra items if they appear in the stream, removing them and then returning to the existing control flow. The important code block is below and my changes occur with theif
statement:RouxRC's code does this in a different way. I believe my solution is both shorter and more robust. But you, gentle reader, are the ultimate arbiter. RouxRC did identify in his code some opportunities to raise
TwitterHTTPError
s. I've tried to duplicate those.I've run this set of fixes under Python 3.3.3 and Python 2.7.6 both on OS X 10.8.5 (Mountain Lion).
In addition to the changes to the library, I modernized a test script,
stream-example
to take the OAuth parameters needed by modern Twitter. As I also use PyCharm, I've added.idea
to the.gitignore
file.Again, this simpler fix would not have been possible without the fine work of RouxRC. Thank you RouxRC.
Anon,
Andrew