Thursday, May 14

The Log Line That Tells You Nothing

The alert fires at 2am. You ssh in, open the logs, and find this:


  ERROR myapp: connection pool exhausted


  One line. The error itself, and nothing else.


  You know what broke. You have no idea why. Was it a sudden traffic spike? A query that held a connection too long? A retry loop that ran away? The answer was in the DEBUG logs — but you turned those off six months ago because the noise was unbearable.

  

  Then there's the other version of this problem. A teammate pings you: "hey, can you send me the error log for that failure yesterday?" You zip it up and send it over. They come back ten minutes later: "this just has the error line, there's no context here." You know they're right. You don't have anything better to send them. The conversation stalls because the information was never captured in the first place.


  The logging level trap


  Every Python developer eventually ends up in the same place. You start with DEBUG because you want visibility. The files balloon. Grepping through megabytes of chatter to find anything useful makes you want to quit your job. So you raise the level to WARNING, the logs go quiet, and life is good — until the next incident, when you realize you've traded noise for blindness.


  The standard advice is to log more context in your error messages. So you start stuffing state into every  logger.error() call. It helps, a little. But you're essentially rebuilding, by hand, the context that the preceding DEBUG logs already had — and you still can't reconstruct the sequence of events that led there.

  

  The real problem is that log levels are a blunt instrument. You don't want DEBUG logs. You want DEBUG logs when something goes wrong.


  What you actually want


  Silent during normal operation. Full context the moment an error fires.


  I built incident-logging (https://pypi.org/project/incident-logging/) because I kept wanting exactly this and kept hacking around the absence of it. It's a single-file Python logging handler — no dependencies — that buffers your DEBUG and INFO records silently. The moment a WARNING, ERROR, or CRITICAL is emitted, it flushes the recent buffer followed by the triggering message, then clears and starts over.


  The buffer is a ring buffer. It holds the most recent N records. Old ones fall off. So you always get the context  window just before the incident — not hours of irrelevant history, not nothing. And when a teammate asks for the error log, you actually have something useful to give them.
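To make the buffering idea concrete, here is a minimal sketch of how such a handler could be written with the standard library. It is not the actual incident-logging source: the class name BufferedIncidentHandler and its parameters are made up for illustration, and it assumes a bounded deque is an acceptable ring buffer.

```python
import logging
from collections import deque


class BufferedIncidentHandler(logging.Handler):
    """Hypothetical sketch: keep recent low-level records, flush them on WARNING+."""

    def __init__(self, target_handler, buffer_size=30, trigger_level=logging.WARNING):
        super().__init__(level=logging.DEBUG)
        self.target_handler = target_handler
        self.trigger_level = trigger_level
        # deque(maxlen=N) silently drops the oldest record: a ring buffer
        self.buffer = deque(maxlen=buffer_size)

    def emit(self, record):
        if record.levelno < self.trigger_level:
            self.buffer.append(record)      # stay silent, just remember it
            return
        # Incident: replay the buffered context, then the triggering record
        for buffered in self.buffer:
            self.target_handler.handle(buffered)
        self.target_handler.handle(record)
        self.buffer.clear()                 # start over for the next incident
```

The logger itself still has to be set to DEBUG, otherwise the low-level records never reach the handler and there is nothing to buffer.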

  How to use it


import logging
from logging.handlers import RotatingFileHandler
from incident_logging import IncidentHandler

logger = logging.getLogger("myapp")
logger.setLevel(logging.DEBUG)

# Incident log — only emits when WARNING or above fires, with recent context
incident = RotatingFileHandler("incidents.log", maxBytes=1_000_000, backupCount=3)
incident.setFormatter(logging.Formatter("%(asctime)s %(levelname)-8s %(message)s"))
logger.addHandler(IncidentHandler(target_handler=incident, buffer_size=30))

  Source and install instructions: https://github.com/BenLin0/EmergencyLogging



Wednesday, June 24

Sudoku programming: Adding logic into the already-logically-perfect program

Since I already have a backtracking program that can solve any solvable sudoku, I am ready to add logical reasoning to it, running after the board is loaded and before the backtracking trysudoku() function is called. As a last resort, trysudoku() still takes over if the sudoku is not completely solved once the logical reasoning is done.

The new function is called preset(). Apparently, when I was writing this part, I only thought of it as preprocessing before the main course, trysudoku(). Anyway, in the second commit of the project, you can see that this new preset() function has these logical reasoning parts:

Part 1: walk through all empty cells to determine whether exactly one number fits (that is, does not conflict with other elements). If there is only one possible number, set the cell to that number.

    count = 0
    for i in range(9):
        for j in range(9):
            if board[i][j] != 0:
                continue    # cell already filled
            onlyonefit = False
            fitnum = 0
            for trynum in [1, 2, 3, 4, 5, 6, 7, 8, 9]:
                if not conflict(board, i, j, trynum):
                    if not onlyonefit:
                        onlyonefit = True
                        fitnum = trynum
                    else:
                        onlyonefit = False
                        break   # a second number fits: jump out of trynum
            if onlyonefit:
                board[i][j] = fitnum
                count += 1


Yes this simple logic works for some sudoku games.  I should make this part as a function and call it repeatedly until there is no cell being set (count == 0). 
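As a sketch of that idea, Part 1 can be wrapped in a function and repeated until a pass sets nothing. The conflict() helper below is an assumption (the post does not show it), empty cells are assumed to be 0 as in the Part 1 code, and the names fill_single_candidates and preset_part1 are made up for this example.

```python
def conflict(board, row, col, num):
    """Assumed helper: True if num already appears in the row, column or 3x3 block."""
    if num in board[row]:
        return True
    if any(board[r][col] == num for r in range(9)):
        return True
    top, left = 3 * (row // 3), 3 * (col // 3)
    return any(board[r][c] == num
               for r in range(top, top + 3)
               for c in range(left, left + 3))


def fill_single_candidates(board):
    """One pass of Part 1; returns how many cells were set."""
    count = 0
    for i in range(9):
        for j in range(9):
            if board[i][j] != 0:
                continue
            fits = [n for n in range(1, 10) if not conflict(board, i, j, n)]
            if len(fits) == 1:
                board[i][j] = fits[0]
                count += 1
    return count


def preset_part1(board):
    """Repeat the pass until it stops making progress (count == 0)."""
    total = 0
    while True:
        count = fill_single_candidates(board)
        if count == 0:
            return total
        total += count
```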

Part 2: for each row, loop through the numbers [1..9]. If the number already exists in this row, skip it; otherwise, count how many cells in the row could hold this number. If there is only one such cell, the number has to go in that cell.


    for i in range(9):
        for trynum in [1, 2, 3, 4, 5, 6, 7, 8, 9]:
            possiblesitenum = 0
            possiblesite = 0
            Found = False
            for j in range(9):
                if type(board[i][j]) is int:
                    if board[i][j] == trynum:
                        Found = True
                        break   #found the number.
                else:   #is a set
                    if trynum in board[i][j]:
                        possiblesitenum += 1
                        possiblesite = j
            if Found == False and possiblesitenum == 1:
                logging.info("This num {} can only be in one site {} of this row {}".format(trynum, possiblesite, i))
                board[i][possiblesite] = trynum
                count += 1

After checking each row, do the same checking to each column, and each block.

Again, this is also simple logic: Each [1...9] number must show up in a row once, and only once. If there is only one cell available for a number for one row/column/block, then the number must be in this cell.
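The row, column and block checks can share one loop if each of the 27 groups is treated as a list of cell coordinates. The following is only a sketch under different assumptions from the Part 2 code above: empty cells are 0 (as in Part 1) rather than candidate sets, conflict() is an assumed helper, and units() and fill_hidden_singles() are names invented here.

```python
def conflict(board, row, col, num):
    """Assumed helper: True if num already appears in the row, column or 3x3 block."""
    if num in board[row]:
        return True
    if any(board[r][col] == num for r in range(9)):
        return True
    top, left = 3 * (row // 3), 3 * (col // 3)
    return any(board[r][c] == num
               for r in range(top, top + 3)
               for c in range(left, left + 3))


def units():
    """Yield the 27 groups (9 rows, 9 columns, 9 blocks) as lists of (row, col)."""
    for i in range(9):
        yield [(i, j) for j in range(9)]                     # row i
        yield [(j, i) for j in range(9)]                     # column i
        top, left = 3 * (i // 3), 3 * (i % 3)
        yield [(top + r, left + c) for r in range(3) for c in range(3)]  # block i


def fill_hidden_singles(board):
    """Set every number that has exactly one possible cell in some group."""
    count = 0
    for cells in units():
        for num in range(1, 10):
            if any(board[r][c] == num for r, c in cells):
                continue    # already placed in this group
            sites = [(r, c) for r, c in cells
                     if board[r][c] == 0 and not conflict(board, r, c, num)]
            if len(sites) == 1:
                r, c = sites[0]
                board[r][c] = num
                count += 1
    return count
```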

Aha, after reading the program again, I found I did call the preset() function (including the Part 1 and Part 2) repeatedly until there is no cell being set (count == 0). 

Done with this logic, I am able to commit this code, and get back to paper sudoku to learn more logic. To be continued.


 

Thursday, October 5

Practice caution when using cache (lru_cache in python)

I have a simple search code in python:

def SearchNode(name):
   for node in Nodes:
         if node.name == name:
              return node
   return None

Pretty simple. Nodes is global variable in this module. Because this function is called frequently, I added cache from functools for it:
from functools import lru_cache

@lru_cache(maxsize=5000)
def SearchNode(name):
   for node in Nodes:
         if node.name == name:
              return node
   return None
It seemed too trivial to need a unit test, so I applied it directly to my project. Two days later, my program was producing all kinds of errors, and it took me several hours to identify the source: the @lru_cache line. Calls to this function always returned "None".

 I actually added some code to iterate the legit names and do "SearchNode(name)", and the return values were always "None"!

After another hour of tracing and thinking, I finally understood the problem:

When building Nodes, I call this function for a new name, to make sure the name does not exist in the Nodes, before adding into the structure. So for each new name, the answer is always None, and the @lru_cache remembers this answer. So when I actually need the node, the @lru_cache still gives me the "None".
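The effect is easy to reproduce in a few lines. This is only a sketch: a dict named nodes stands in for the real Nodes structure, and add_node() is a made-up helper that mimics the "check before adding" step.

```python
from functools import lru_cache

nodes = {}   # hypothetical stand-in for the global Nodes structure


@lru_cache(maxsize=5000)
def search_node(name):
    return nodes.get(name)


def add_node(name):
    # The existence check caches None for this name...
    if search_node(name) is None:
        nodes[name] = object()


add_node("alpha")
# ...so the later lookup is answered from the cache, not from nodes
stale = search_node("alpha")
```

Here stale is None even though "alpha" is in nodes by the time of the second call.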


After locating the source of the error, I can think of several ways to solve it (while still keeping the cache mechanism): disable the cache first and only enable it after Nodes is built; reset the cache after Nodes is built; or let the cache remember only non-None results. @lru_cache can do the second one, since the decorated function has a cache_clear() method, but it has no switch to disable it temporarily or to skip None results, so I made my own cache, using the third way:
SearchNode_cache = {}
def SearchNode(name):
   if name in SearchNode_cache:
         return SearchNode_cache[name]
   for node in Nodes:
         if node.name == name:
              SearchNode_cache[name] = node   # only found nodes are cached, never None
              return node
   return None

You can easily add 2 lines to set a size limit for the cache and remove the oldest item in it, to implement a Least Recently Used mechanism.
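One possible way to add that size limit, sketched with collections.OrderedDict. The Node class, the make_search() factory and the max_size parameter are assumptions made for this example, not part of my project code.

```python
from collections import OrderedDict


class Node:
    """Minimal stand-in for the real node type."""
    def __init__(self, name):
        self.name = name


def make_search(nodes, max_size=5000):
    """Build a search function with a bounded cache that never stores None."""
    cache = OrderedDict()

    def search(name):
        if name in cache:
            cache.move_to_end(name)           # mark as most recently used
            return cache[name]
        for node in nodes:
            if node.name == name:
                cache[name] = node            # only successful lookups are cached
                if len(cache) > max_size:
                    cache.popitem(last=False) # evict the least recently used entry
                return node
        return None

    search.cache = cache   # exposed for inspection only
    return search
```

Lesson 2 below still applies: if nodes can be removed or renamed later, the cached entries become stale.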


Lesson learnt:
1, Any one simple line, no matter how innocent it looks, can jeopardize your project in a big way.
2, If the value of the Nodes can be modified, my current way is not correct.
3, So don't just copy my code. My application scenario might not be the same as yours.
4, Practice caution when using cache.


Tuesday, September 5

Learning Python: Step by step


Content:
1, Introduction. Setting up environment: Commandline/Notepad++
2, First project
3, Statement, Branches(if-then), Loop
4, Upgrade the environment: Professional IDE
5, Math, Class
6, Upgrade the environment again: Being professional: SVN/GIT
7, Further learning



Thursday, October 22

Show the complete apache config file

In the Apache config file, you can use "Include" or "IncludeOptional" to include other config files. Many Linux distributions take advantage of that to organize their config files. For example, the default config file on Ubuntu is /etc/apache2/apache2.conf, and it includes enabled modules, configuration files, and enabled sites this way:

IncludeOptional mods-enabled/*.conf
IncludeOptional conf-enabled/*.conf
IncludeOptional sites-enabled/*.conf

If you want to see the complete config settings in one place, there is no existing tool for that. This Stack Overflow page answered the question pretty well: you can use apachectl -S to see the Virtual Host settings, or apachectl -M to see the loaded modules, but to see all settings you have to go through all the files, starting by familiarizing yourself with the general structure of the httpd config files.

So I created this python program to generate the complete apache config file:

#!/usr/bin/python2.7
# CombineApacheConfig.py 
__author__ = 'ben'
import sys, os, os.path, logging, fnmatch, re


def Help():
    print("Usage: python CombineApacheConfig.py inputfile[default:/etc/apache2/apache2.conf] outputfile[default:/tmp/apache2.combined.conf]")


def InputParameter():
    if len(sys.argv) != 3:
        Help()
        return "/etc/apache2/apache2.conf", "/tmp/apache2.combined.conf"
    return sys.argv[1], sys.argv[2]


def ProcessMultipleFiles(InputFiles):
    if InputFiles.endswith('/'):              #Updated as Pierrick's comment
        InputFiles = InputFiles + "*"
    Content = ''
    LocalFolder = os.path.dirname(InputFiles)
    basenamePattern = os.path.basename(InputFiles)
    for root, dirs, files in os.walk(LocalFolder):
        for filename in fnmatch.filter(files, basenamePattern):
            Content += ProcessInput(os.path.join(root, filename))
    return Content


def RemoveExcessiveLinebreak(s):
    Length = len(s)
    s = s.replace(os.linesep + os.linesep + os.linesep, os.linesep + os.linesep)
    NewLength = len(s)
    if NewLength < Length:
        s = RemoveExcessiveLinebreak(s)
    return s


def ProcessInput(InputFile):
    global ServerRoot

    Content = ''
    if logging.root.isEnabledFor(logging.DEBUG):
        Content = '# Start of ' + InputFile + os.linesep
    with open(InputFile, 'r') as infile:
        for line in infile:
            stripline = line.strip(' \t')
            if stripline.startswith('#'):
                continue
            searchroot = re.search(r'ServerRoot\s+(\S+)', stripline, re.I)      #search for ServerRoot
            if searchroot:
                ServerRoot = searchroot.group(1).strip('"')
                logging.info("ServerRoot: " + ServerRoot)
            if stripline.lower().startswith('include'):
                match = stripline.split()
                if len(match) == 2:
                    IncludeFiles = match[1]
                    IncludeFiles = IncludeFiles.strip('"') #Inserted according to V's comment.
                    if not IncludeFiles.startswith('/'):
                        IncludeFiles = os.path.join(ServerRoot, IncludeFiles)

                    Content += ProcessMultipleFiles(IncludeFiles) + os.linesep
                else:
                    Content += line     # if it is not pattern of 'include(optional) path', then continue.
            else:
                Content += line
    Content = RemoveExcessiveLinebreak(Content)
    if logging.root.isEnabledFor(logging.DEBUG):
        Content += '# End of ' + InputFile + os.linesep + os.linesep
    return Content


if __name__ == "__main__":
    logging.basicConfig(level=logging.DEBUG, format='[%(asctime)s][%(levelname)s]:%(message)s')
    InputFile, OutputFile = InputParameter()
    try:
        ServerRoot = os.path.dirname(InputFile)
        Content = ProcessInput(InputFile)
    except Exception as e:
        logging.error("Failed to process " + InputFile,  exc_info=True)
        exit(1)

    try:
        with open(OutputFile, 'w') as outfile:
            outfile.write(Content)
    except Exception as e:
        logging.error("Failed to write to " + OutputFile,  exc_info=True)
        exit(1)

    logging.info("Done writing " + OutputFile)

The usage is simple: run it as "python CombineApacheConfig.py". Since no additional parameters are given, it reads the default Ubuntu Apache config file, /etc/apache2/apache2.conf, and writes the complete config file to /tmp/apache2.combined.conf. If your config file is in a different location, give the input file and output file locations. For example, RHEL keeps the config file in /etc/httpd/conf/httpd.conf, so you can run "python CombineApacheConfig.py /etc/httpd/conf/httpd.conf /tmp/apache2.combined.conf".

Note: The Apache server-info page http://127.0.0.1/server-info also provides similar information, but in a human-readable format rather than in the config file format. The page works only when it is opened from the same computer.
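For readers on Python 3, the include-expansion step could also be sketched with glob instead of os.walk plus fnmatch. This is an alternative technique, not what the script above does: the function name expand_include is made up, and the sorting is only there to make the result deterministic. Unlike os.walk, it does not descend into subfolders.

```python
import glob
import os


def expand_include(pattern, server_root):
    """Return the sorted list of files that an Include argument points to."""
    pattern = pattern.strip('"')                     # tolerate quoted paths
    if not os.path.isabs(pattern):
        pattern = os.path.join(server_root, pattern) # relative to ServerRoot
    if os.path.isdir(pattern):
        pattern = os.path.join(pattern, "*")         # "Include conf.d/" style
    return sorted(p for p in glob.glob(pattern) if os.path.isfile(p))
```

A pattern that matches nothing simply returns an empty list, which roughly corresponds to the forgiving behaviour of IncludeOptional.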



Nice, but is there a way to dump the configs if macro_module is being used.

Thanks
 
It's a nice looking idea, but doesn't seem to be including any included files... is this working for other people?
 
Let me correct/clarify that a bit... It seems to work if the include is a full path and not a relative path to the httpd config root (which works for apache and is allowed as far as I know).

This works
Include /etc/httpd/conf/vhosts.d/*.conf

This does not work (although apache is OK with it)
Include vhosts.d/*.conf

For some reason, this does not work (from a whm/cpanel httpd.conf)
Include "/etc/apache2/conf.d/userdata/std/2_4/username/*.conf"

I'm guessing the issue in the cpanel example might be the quotations around the string

 
Hi V,

Confirmed that path with quotations will not work. But the relative path definitely works well for me.

I will update the program to deal with quotations later. With source code, you should be able to do that as well :)

 
Inserted one line "IncludeFiles = IncludeFiles.strip('"')" to deal with the config file with quotation marks according to V's comment.

Thanks, V.
 
It does not work with folder includes. Like "Include conf.d/"
As workaround, I added the next lines in ProcessMultipleFiles:

if InputFiles.endswith('/'):
    InputFiles = InputFiles + "*"

Also, the script may need a supplementary optional argument to specify relative path root. The script does not work on RHEL based systems. There, the base config path is /etc/httpd/conf/httpd.conf and root is /etc/httpd/.

Regards,
 
Thank you, Pierrick. Confirmed that my version did not work with "Include conf.d/" scenario and your update was good, so I updated the post accordingly.

About the RHEL and optional relative path root, I will need that environment to try it out. Thanks.
 
Pierrick, after looking into an RHEL server, I found the reason was that my code ignored "ServerRoot" that Apache used to identify the root of config file ("relative path root" of your comment).

By adding the logic of ServerRoot (making a global variable, make it default as the folder of httpd.conf but available to be modified as reading the config file), the code is working well in both RHEL environment which has ServerRoot specified, and Debian environment.

Thanks for pointing out the issue!

 
After revisiting the previous comment, I think maybe Verdon's frustration come from the same source: the ServerRoot in RHEL environment was not handled properly so the relative path was not correctly processed.

Please use the new version!
 
This Python program is a helpful tool for generating the complete Apache config file.
 

Thursday, October 15

Python logging OnMarkRotatingFileHandler

When there is a need, there is a solution.

As explained in the previous post "The TimedRotatingFileHandler of python logging system", that handler does not do what I want. So I made this new handler, OnMarkRotatingFileHandler, to fulfill my need. For example, assuming the interval setting is "Hour":

1, If the program starts at 8:20AM, the TimedRotatingFileHandler will start a new log file at 9:20AM, but my OnMarkRotatingFileHandler will start a new log file at 9:00AM.

2, Suppose the previous log file was last modified at 8:58AM. With the old handler, if you start the program after 9:58AM, it will rotate the log; if you start it before 9:58AM, say at 9:55AM, it will just keep appending to the existing log until 10:55AM, if your program runs that long.
With my new handler, a program started at or after 9:00AM rotates the log file and writes to a new one.


Source code:
#filename: logHandler.py
__author__ = 'ben'

import logging.handlers

class OnMarkRotatingFileHandler(logging.handlers.TimedRotatingFileHandler):

        def __init__(self, filename, when='h', interval=1, backupCount=0, encoding=None, delay=False, utc=False):
            #super().__init__() # in Python 2 use super(D, self).__init__()
            super(OnMarkRotatingFileHandler, self).__init__(filename, when, interval, backupCount, encoding, delay, utc)


        def floor_to(self, num, scale):
            return int(num/scale) * scale


        def computeRollover(self, currentTime):
            temp_result = super(OnMarkRotatingFileHandler, self).computeRollover(currentTime)
            if not self.when.startswith('W'):
                result = self.floor_to(temp_result, self.interval)
            else:
                result = temp_result    # need to find out the first date of time (is it 1970/1/1?), what weekday that is.

            return result


Most methods are inherited from the TimedRotatingFileHandler. The W0/W1.../W6 options are not implemented yet. But you get the idea.
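The arithmetic can be checked in isolation. The sketch below assumes that, for the plain interval settings, the base class's computeRollover returns the current time plus the interval in seconds (as it does in CPython); the function names are made up, and times are seconds since the epoch.

```python
def plain_rollover(current_time, interval):
    """What the base class is assumed to compute: one interval from now."""
    return current_time + interval


def on_mark_rollover(current_time, interval):
    """Floor the base result to a multiple of the interval (the 'mark')."""
    return (plain_rollover(current_time, interval) // interval) * interval


HOUR = 3600
start = 8 * HOUR + 20 * 60                 # 8:20 on the epoch day, UTC
old = plain_rollover(start, HOUR)          # 9:20
new = on_mark_rollover(start, HOUR)        # 9:00
```

One caveat: multiples of the interval are counted from the epoch, which is in UTC, so with a daily interval the mark falls at UTC midnight rather than local midnight.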

To use it, place logHandler.py alongside your code, then you can either import it and load it as

    import logging
    import logHandler
    h = logHandler.OnMarkRotatingFileHandler("filename")
    logger = logging.getLogger('app')
    logger.addHandler(h)


or you can use it in the logging.ini:
    class=logHandler.OnMarkRotatingFileHandler

then load it in logging.config like every other handler does:
    logging.config.fileConfig('logging.ini')


Have fun hacking!



The TimedRotatingFileHandler of python logging system

By its name, TimedRotatingFileHandler sounds like the best fit for my need: rotate every day, keep 30 days of copies. I know logrotate on Linux provides the same service, but I prefer to make the program easy to install and run without involving a system admin (root). Another reason I am not using logrotate is that when my program is writing to the log at midnight, logrotate performs "mv myprogram.log myprogram.log.20151015", and my program keeps writing to that file under its new name, myprogram.log.20151015, while another program that reads myprogram.log for display purposes will fail.


So I want to have a native logging handler to do log rotating, and change to a new file of "myprogram.log" at midnight. The TimedRotatingFileHandler looks perfect to me.


But it is not.


The TimedRotatingFileHandler only rotates the log when the program has been running for more than 24 hours (assuming the interval setting is 1 day), or when the log file has not been modified for more than 24 hours before the program loads it. You can check the code, but that is how it works. If your program runs every day from 1AM to 11PM, the log is never rotated; it keeps appending forever. The handler uses the last modified time of the log file, or the current time when the file doesn't exist, to decide when to rotate.


One limitation is that in Linux there is no "file creation time" or "birth time" to rely on. Most file systems don't support this file attribute. I guess ext4 does, under some configurations, but we can't rely on it.






Friday, April 24

Using MySQL Cluster

Recently in one project I migrated a database from MySQL using InnoDB to MySQL Cluster with 4 nodes. In theory the process should be simple, but actually there are many problems:

(1), Lack of documentation. The online community of MySQL Cluster is not as strong as that of ordinary MySQL. Some blog posts are old, and some of the documents are so ambiguous that you can only follow them if you already know what to do. The concepts of "data node" and "sql node" are very confusing.


(2), The default configuration is not usable. My test database is of normal size and fits into a server with 1G of memory, using MySQL with InnoDB and an all-default configuration. After changing the table creation command from "ENGINE=INNODB" to "ENGINE=NDB", I tried to create the same table structure in MySQL Cluster but failed. I had to add these settings to make it work:
MaxNoOfConcurrentOperations = 262144
MaxNoOfLocalOperations = 288360
MaxNoOfOrderedIndexes  = 1024
MaxNoOfAttributes = 4000

(3), It requires more memory during the initial setup phase. As in (2), I was only trying to create the table structure, not loading any data yet, and it still complained about not having enough memory. So finally I had to increase the memory of each node from the original 1G to 4G, then allocate 2G for data and 200M for indexes:
DataMemory   = 2000M
IndexMemory  = 200M

(4), The same "create table" command works well in InnoDB, but fails in Cluster. For example, if you have an auto-increment column that is not the Primary Key of the table, the InnoDB engine accepts that and creates a hidden auto-increment ID for the table. The same table fails in the NDB engine, which claims "you can't have two auto-increment column". Another example: in InnoDB, a varchar column can have a size of 8000. But in NDB, all columns of a table combined must be less than 2400 characters (including overhead). If you defined a varchar column with a size of 3000, you must either make it smaller or change it to a TEXT or BLOB column when migrating to the NDB engine.


(5), When a node is registered in the management configuration file, that node must be active for the whole system to run. It doesn't support "hot-plug". If one node is shut down, the whole system is down, and you need to restart each node to get the whole system running again.


(6), Views, stored routines and triggers are not propagated through the cluster automatically. They must be created/modified on each sql node that you connect to. They are not attached to the NDB engine, so they are not part of the cluster system.


(7), When several threads access the database concurrently, as my program does, I see a lot of "Waiting for ndbcluster global schema lock", "System lock", and "checking permissions" in the process status. I actually had to increase TransactionDeadLockDetectionTimeout from the default 1.2 seconds to 1 minute to make it work. The same process takes longer to complete than with the traditional InnoDB engine.



Wednesday, February 25

MySQL: Changing the size of the log file

I was told that the new MySQL (5.6+) is flexible about this, which is very good. I know people have been fighting with it for a long time. Since Ubuntu 14.04 still ships MySQL 5.5 by default, I would like to write this down as a reference.

Situation: The default InnoDB log file size is small, 5M, if it is not specified in /etc/mysql/my.cnf. As soon as you start the mysql server, ib_logfile0 and ib_logfile1 are created in the /var/lib/mysql folder at 5M each. After that, you cannot simply change the size.

With the default small log file size, you can run into
Error code 1206: The number of locks exceeds the lock table size.
when running a complicated query or loading a big data file.

In an ideal world (as in the new MySQL (5.6+)), you just add innodb_log_file_size=128M to the my.cnf file and restart the mysql server; the software loads the new configuration and resizes the log file.

But in reality (as in MySQL 5.5 and earlier), if you modify my.cnf and restart, your mysql service will not start, and the log tells you that the setting is not consistent with the existing log file size. Sure, we know that, and that is exactly what we are trying to do: resize the log file. But what should we do now?

So you have to change innodb_log_file_size back to 5M in order to proceed. Add
innodb_fast_shutdown = 0
to the config file to flush any pending transactions into the database. So my.cnf has these 2 lines:
innodb_fast_shutdown = 0
innodb_log_file_size   =  5M

At this time, you should be able to start the mysql service:
service mysql start
Good. Then we stop it properly:
service mysql stop
At this time, because all pending transactions are purged from the log file, we can safely remove the ib_logfile*:
mv /var/lib/mysql/ib_logfile* /tmp
Now you can edit the my.cnf file to set innodb_log_file_size to the size you want, and remove (comment out) the shutdown parameter:
#innodb_fast_shutdown = 0
innodb_log_file_size   =  128M
Then restart the service:
service mysql start

The ib_logfiles will be created with new size now.


Great read thanks for sharing this
 

Friday, May 9

One tip about logging.debug() in python

The idea of logging.debug( output string ) is that, if the logging system is set to a higher level, the output string is ignored and not sent to the logging system.

One problem: if the logging.debug() call invokes functions to build the output string, those functions ARE executed, no matter what the logging level is.

For example, I used this sentence to find out the size of the critical data object in the environment:
logging.debug("  size of dataprovider1 is: %s", sys.getsizeof(cPickle.dumps(dataprovider1)))
It works fine in the debug environment. But in the production environment, even though the logging level was set to INFO and this message was not written to the log file, the function was still executed, and an object of 90G in size was created, which crashed the production system immediately.

Solution: use the isEnabledFor() function of the logging system.
if logging.root.isEnabledFor(logging.DEBUG):
    logging.debug("  size of dataprovider1 is: %s", sys.getsizeof(cPickle.dumps(dataprovider1)))
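A small self-contained demonstration of the difference; expensive() is a hypothetical stand-in for the pickling call, and the logger name is arbitrary.

```python
import logging

calls = []


def expensive():
    calls.append(1)          # record that the work was actually done
    return "big value"


logger = logging.getLogger("lazy_demo")
logger.setLevel(logging.INFO)

# The argument is evaluated before debug() even gets a chance to drop the message
logger.debug("value: %s", expensive())
unguarded_calls = len(calls)

# With the guard, the call is skipped entirely at INFO level
if logger.isEnabledFor(logging.DEBUG):
    logger.debug("value: %s", expensive())
guarded_calls = len(calls) - unguarded_calls
```

The %s-style arguments only defer the string formatting, not the evaluation of the arguments themselves.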


Monday, April 14

Add memory info to python logging: several options?

Our program is using a lot of memory. Too much memory, maybe. So I want to write the memory usage of the current process into the logging information.

Getting the memory size of the current process is the easy part (in Linux; ru_maxrss is the peak resident set size, reported in kilobytes):
resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
I can think of several ways to add this to the logging system: adding a handler (derived from either StreamHandler or FileHandler), or changing the Formatter. Finally I decided to add a filter:
import logging, resource

class MemFilter(logging.Filter):
    def filter(self, record):
        # attach the memory figure so the format string can use %(mem)s
        record.mem = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
        return True
logging.basicConfig(format='[%(levelname)s][MEM:%(mem)s] [%(asctime)s]:%(message)s')
logging.root.addFilter(MemFilter())
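One thing to keep in mind: a filter attached to a logger is only consulted for records logged through that logger itself, not for records propagated up from child loggers. If the program uses named loggers, a variant that attaches the filter to the handler may be safer. The sketch below assumes a Unix system (the resource module); make_mem_handler is a made-up helper name.

```python
import logging
import resource   # Unix only


class MemFilter(logging.Filter):
    def filter(self, record):
        record.mem = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
        return True


def make_mem_handler(stream):
    """Build a stream handler whose records always carry the mem attribute."""
    handler = logging.StreamHandler(stream)
    handler.setFormatter(logging.Formatter("[%(levelname)s][MEM:%(mem)s]:%(message)s"))
    handler.addFilter(MemFilter())   # runs for every record this handler sees
    return handler
```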

Case closed.


Wednesday, April 2

Display current time for each command, in Python Interactive Interface

I like making the prompt sign (ps) show the current time, so that I know exactly what time I executed a command, and what time it completed. I did that for the DOS command line, the Linux terminal, and the MySQL terminal. Here I will show you how to do that in the Python Interactive Interface (Python Interpreter).

Here is what it is after setting:



The first thing is to set up a $PYTHONSTARTUP environment variable pointing to a python file. Exactly how to do that depends on your operating system. No matter how it is done, the file it points to will be executed when "python" (Python Interactive Interface) is called.
In my Ubuntu, execute
export PYTHONSTARTUP=~/.pythonstartup.py
Next step, in my home folder "~", I created the file .pythonstartup.py as:
# filename: .pythonstartup.py
# after setting $PYTHONSTARTUP=~/.pythonstartup.py, this file will be executed before python interactive interface is open
import sys, time
sys.ps1 = time.strftime("%A %H:%M:%S") + ">>>"
Pretty simple. Then the prompt sign (I guess that is the full name of "ps") will show the time, such as "Wednesday 16:49:51" in the above screenshot.
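One caveat, based on the documented behaviour of sys.ps1: a plain string is computed once, when the startup file runs, so the prompt keeps showing the start-up time. If a non-string object is assigned instead, the interpreter re-evaluates its str() each time it shows the prompt. A minimal sketch of that variant (the class name TimePrompt is made up for this example):

```python
import sys
import time


class TimePrompt:
    """Prompt object whose text is rebuilt every time the interpreter shows it."""

    def __str__(self):
        return time.strftime("%A %H:%M:%S") + ">>>"


sys.ps1 = TimePrompt()
```

Put these lines in .pythonstartup.py in place of the string assignment to get a prompt that follows the clock.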

PS: the way for showing the same information in linux command line (Ubuntu) is to modify the PS1:
PS1='${debian_chroot:+($debian_chroot)}\[\033[01;32m\]\u@\h\[\033[00m\]:\[\033[01;34m\]\w\[\033[00m\] [\D{%a} \t] \n\$ '
(add this in .bashrc)

For mysql client tool, add this line under [mysql] of the ~/.my.cnf:
prompt=\\u@\\d [\\w \\R:\\m:\\s]>

For MacBook, the operation is similar to Linux. Add these 2 lines to ~/.bash_profile (create this file if it does not exist):

export CLICOLOR=1
export PS1='\[\033[01;32m\]\u@\h\[\033[00m\]:\[\033[01;34m\]\w\[\033[00m\] [\D{%a} \t] \n\$ '

For "DOS" command line in Windows system, go to the Control Panel->System Properties->Advanced->Environment Variables, where you can modify path information. Check for "PROMPT". Modify it or add it if you don't have it yet. Set the value as "$P [$T] $_$G ", that will let your command line have path, time, and start from the second line.


If you are using Git Bash in Windows, it already has most of what I want, except the computer time. Open C:\Program Files\Git\etc\profile.d\git-prompt.sh and add one line 
PS1="$PS1"'\[\033[00m\] [\D{%a} \t]'                 # current time
after the other $PS1 lines. That will do the trick.



Friday, March 14

Using python to execute a .sql file for mysql

The task is to execute one import.sql file to import several data files into mysql. The catch is that some files are known to be missing, so executing import.sql will produce errors, but we want to ignore the errors and import whatever exists on the hard drive.

There are two ways to do that:

1, Execute mysql command line, feed in the file as pipe input:
    from subprocess import Popen, PIPE
    f = open("import.sql")
    process = Popen(['mysql', '--force'], stdout=PIPE, stdin=f, stderr=PIPE)
    out, err = process.communicate()
The trick is to use '--force' argument of mysql to ignore the error. This is equivalent to:
mysql --force < import.sql
The err will have the error output for missing data file, then you can write that into log file.
2, Start the mysql command line client, then send it "source import.sql":
    from subprocess import Popen, PIPE
    process = Popen('mysql', stdout=PIPE, stdin=PIPE, stderr=PIPE)
    # communicate() expects bytes here, and the command needs a trailing newline
    out, err = process.communicate(input=b"source import.sql\n")
Last time I checked, this way did not work very well. It always failed at the step with the missing data file.
Reference: http://bugs.mysql.com/bug.php?id=533 : "--force is intended to be used in Batch Mode only."
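The first approach can be wrapped into a small helper that also writes the error output to a log file. This is only a sketch, assuming Python 3; the names run_sql_file and log_path are my own, and the command parameter exists so the helper can be tried with a stand-in program when no mysql client is installed.

```python
import subprocess


def run_sql_file(sql_path, log_path, command=("mysql", "--force")):
    """Feed sql_path to the client on stdin and append its stderr to log_path.

    Returns (exit code, captured stdout as bytes).
    """
    with open(sql_path, "rb") as sql_file:
        result = subprocess.run(
            list(command),
            stdin=sql_file,
            stdout=subprocess.PIPE,
            stderr=subprocess.PIPE,
        )
    # With --force the client keeps going after an error, so stderr is
    # where the complaints about missing data files end up.
    if result.stderr:
        with open(log_path, "ab") as log_file:
            log_file.write(result.stderr)
    return result.returncode, result.stdout
```

A call would look like run_sql_file("import.sql", "import_errors.log"), where both file names are just examples.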


Thursday, March 13

python: replacing strings in file

In perl, this one-liner is easy:
perl -p -i -e 's/replace this/using that/g' filename
"-i" means to write the modified text back to the same file.

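When it comes to python, to modify the file in place you will need fileinput with inplace=1, which redirects standard output into the file.
import fileinput
import sys
for line in fileinput.FileInput(filename, inplace=1):
    line = line.replace("replace this", "using that")
    sys.stdout.write(line)
Note that each line already ends with its own line break. A plain print adds a second one, which is why my first version generated 2 line breaks for each line on my Windows machine: \r\n became \r\n\r\n. Writing the line with sys.stdout.write avoids that.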

The other way is to open the same file twice, once for reading and once for writing:
with open(filename) as f:
    s = f.read()
s = s.replace('replace this', 'using that')
with open(filename, 'w') as f:
    f.write(s)
This is the "python way" because s is handled as one whole string instead of line by line, which makes the program easy to read. It is also fast, as long as the file fits in memory.
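If the Windows line endings matter, the read-everything approach can be told to leave them alone. A minimal sketch, assuming Python 3; replace_in_file is a hypothetical helper name, and the utf-8 default is my assumption about the file.

```python
def replace_in_file(path, old, new, encoding="utf-8"):
    """Replace every occurrence of old with new in the file at path.

    newline="" turns off newline translation, so \r\n stays \r\n.
    Returns the number of replacements made.
    """
    with open(path, "r", encoding=encoding, newline="") as f:
        text = f.read()
    count = text.count(old)
    if count:
        # Only rewrite the file when something actually changed.
        with open(path, "w", encoding=encoding, newline="") as f:
            f.write(text.replace(old, new))
    return count
```

Calling replace_in_file(filename, "replace this", "using that") does the same job as the perl one-liner for files that fit in memory.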



Wednesday, July 3

head_gedit, to check head of text files in gedit really quick (Ubuntu)

Recently I got some big text files whose headers I need to check to verify the format really quickly. The traditional way is to go to the command line, cd to the folder, then run "head xxxx.txt".

Because I made the "Mass Sequential Rename" script before, I know it doesn't take long to integrate "head" functionality into the right-click menu of nautilus. Here is the result:


#!/bin/bash
# Nautilus script for head and gedit.
# Modified by Ben(AT)Fadshop.net. http://benincampus.blogspot.com .July 2, 2013
# Based on http://benincampus.blogspot.com/2009/03/mass-sequential-rename.html
# save this file as ~/.gnome2/nautilus-scripts/head_gedit

set -e

IFS=$'\n'   #for space in the file name.
for FILE in $NAUTILUS_SCRIPT_SELECTED_FILE_PATHS
do
  filename="$(basename "$FILE")"

  head "$FILE" > "/tmp/head_gedit_$filename"
  gedit "/tmp/head_gedit_$filename"
  rm  "/tmp/head_gedit_$filename"
done


So the program creates a head file in the /tmp/ folder, uses gedit to show it, then removes it immediately.

Save the script as ~/.gnome2/nautilus-scripts/head_gedit , make it executable (chmod +x ~/.gnome2/nautilus-scripts/head_gedit), then in nautilus, right-click the text files you want to show (you can select multiple files!) and choose "Scripts -> head_gedit". You will see their heads in gedit instantly.

You can fine-tune how many lines you want to show, or you can add a prompt to let the user input a number by referring to the "Mass Sequential Rename" script.

Have fun!


Tuesday, June 11

Two programmer's jokes. Wrong jokes.


A mother said to her programmer son, "While you're out, buy some bread."
Her son never came home.
The mother created an infinite loop because no exit condition is given:
while (you are out){buy bread;}
So the programmer keeps buying bread, with no way to exit the loop.
Actually, as I said in the title, the joke is wrong. When the store runs out of bread, or the programmer runs out of money, or the programmer keeps bugging the store attendant so much that the attendant cannot assist anyone else (a CPU-intensive task), or the store closes, an exception is thrown. To be clearer: the programmer is thrown out of the store. And once he is out of money, the only thing he can do is return home.

The next day, the mother said:  "Could you please go to the store for me and buy one carton of milk, and if they have eggs, get 6!" 
A short time later the programmer son comes back with 6 cartons of milk.
This programmer is not a good programmer. After he buys one carton of milk, he checks the condition (they have eggs), finds it fulfilled, and should then buy 6 more cartons. So he should go home with 7 cartons of milk!
buy one carton of milk
if (they have eggs){
    buy 6
}



Sunday, June 2

Java: Remove an element from array

In Java, the common practice for removing an element from an array is to call the Apache Commons Lang library:
newarray = org.apache.commons.lang3.ArrayUtils.remove(oldarray, i)

The source code shows that to remove the ith element, this method creates a new array "newarray" of length oldarray.length-1, copies elements [0, i-1] of oldarray into it, then copies elements [i+1, length-1] after them.

I have an application that needs to call this function a lot, and I ran into its limitations.

My scenario is: in a [8000][12000] sparse array (double values), I want to remove the rows and columns that are all zero. In my data, 6000 rows and 7000 columns get removed.
To remove the 6000 rows, this remove() function is called 6000 times.
To remove the 7000 columns, I need to loop over the 2000 rows remaining from the previous step and remove each column from each row, so this remove() function is called 7000*2000=14,000,000 times.

It takes 4 minutes to complete the task.

Also, because each remove() operation creates a new array and leaves the old one for garbage collection, the program uses almost 4 GB of memory during these 4 minutes.

So I decided to create my own ArrayRemove(array, i) function that stores the result in the same array:

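    // Shifts the elements after index one slot to the left, in place.
    // The array keeps its length; the last slot is left holding stale data.
    private void ArrayRemove(Object array, int index) {
        int length = java.lang.reflect.Array.getLength(array);
        if ((index < length - 1) && (index > -1)) {
            System.arraycopy(array, index + 1, array, index, length - index - 1);
        }
    }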


Using this new function, the same task takes 1.5 minutes, and peak memory is 500 MB.

The downside of this function is that the array is NOT resized. If it contains 7000 elements before the call, it still contains 7000 elements afterwards. The last element is obsolete and no longer useful, but array.length is still 7000. In my case the sparse array is still [8000][12000], but only the first [2000][5000] entries are meaningful.

So you need to remember how many rows and columns were removed, and form a new array:
       removerowcount = RemoveRow(array);
       // Keep only the leading rows; the trailing ones are obsolete.
       array = Arrays.copyOf(array, array.length - removerowcount);

       removecolumncount = RemoveColumn(array);
       double[][] temparray = new double[array.length][];
       for (int i = 0; i < array.length; i++) {
              // Keep only the leading columns of each remaining row.
              temparray[i] = Arrays.copyOf(array[i], array[0].length - removecolumncount);
       }
       array = temparray;


The best thing is, you only need to form a new array twice (the first time is to drop the trailing obsolete rows before starting RemoveColumn), not 14,006,000 times.

Have fun! Make sure to add your own error handling code.

PS: To remove an element from an array, the above code works. But to achieve what I actually intended (removing the all-zero rows and columns), there is a better way (a Matlab way): keep another 2 arrays (nullrows[] and nullcolumns[]) that record which rows and columns are null, then remove these rows and columns in one batch.
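The batch idea from the PS can be sketched in a few lines. I am using Python rather than Java here purely for brevity, with plain nested lists standing in for the double[][] array; the function name is my own, and the lists of rows and columns to keep play the role of nullrows[] and nullcolumns[] in reverse.

```python
def drop_zero_rows_and_columns(matrix):
    """Return a new matrix without the rows and columns that are all zero."""
    # First pass: record which rows have at least one non-zero value.
    keep_rows = [r for r, row in enumerate(matrix) if any(v != 0 for v in row)]
    width = len(matrix[0]) if matrix else 0
    # Second pass: record which columns have a non-zero value in a kept row.
    keep_cols = [
        c for c in range(width)
        if any(matrix[r][c] != 0 for r in keep_rows)
    ]
    # Build the result once, copying only the cells that survive.
    return [[matrix[r][c] for c in keep_cols] for r in keep_rows]


sample = [
    [0, 0, 0],
    [1, 0, 2],
    [0, 0, 0],
    [0, 0, 3],
]
print(drop_zero_rows_and_columns(sample))  # [[1, 2], [0, 3]]
```

Each cell is looked at a fixed number of times and only one new array is built, instead of one per removed element.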


Thursday, May 2

mysql note: IsNumeric() functionality

Situation: There is one text field "old" in the table a, and it can be:
?
NON
2+
0
4.0
-0.3
=======
The task is to identify the numbers (the last 3 items) and put them into a double field "new", which currently holds its default value (null). So the new field should be:
(null)
(null)
(null)
0
4.0
-0.3
=========
I can't use "update a set new = old". The result is:
0
0
0
0
4
0.3
========
So the idea is to use an "IsNumeric()" function, as in "update a set new = old where IsNumeric(old)", so that only the last 3 items, which are numeric, are updated.

There is such a function in MS SQL Server, but not in MySQL. The first Googled page gives one way:
update a set new = old where old = concat( '', 0 + old )
The result is:
(null)
(null)
(null)
0
(null)
-0.3
========
The reason is that 0 + old turns '4.0' into the number 4, and once that is converted back to a string it is '4', which does not equal '4.0'.

Another Googled page claims to have "MySQL Equivalent of ISNUMERIC()", and the solution is using regex:
update a set new = old where old  REGEXP ('[0-9]');
The result is:

(null)
(null)
2
0
4.0
-0.3
========

Because this regex matches any string that contains a digit, "2+" is selected, and that is wrong.

A popular regex for validating a number is '^[-+]?[0-9]*\.?[0-9]*$', but when I use it, the result is not right:

update a set new = old where old  REGEXP ('^[-+]?[0-9]*\.?[0-9]*$');
The result is:

0
(null)
2
0
4.0
-0.3
========


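At first I couldn't understand why '?' and '2+' pass the regex. The reason is not the regex engine but the string literal: MySQL treats the backslash inside a quoted string as an escape character, so '\.' reaches the regex as a bare '.', which matches any single character. Writing the dot as '\\.' or '[.]' avoids that.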
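To see what a lost backslash does to the six sample values, here is a small sketch. It uses Python's re module as a stand-in for MySQL's regex engine (an assumption: both treat these simple patterns the same way), and the pattern names are my own.

```python
import re

SAMPLES = ["?", "NON", "2+", "0", "4.0", "-0.3"]

# The pattern as the regex engine sees it once the backslash is gone:
# the bare '.' matches any single character.
UNESCAPED_DOT = re.compile(r"^[-+]?[0-9]*.?[0-9]*$")

# The dot written as a character class, which needs no backslash at all.
LITERAL_DOT = re.compile(r"^[-+]?[0-9]*[.]?[0-9]*$")


def matching(pattern, values):
    """Return the values that the pattern accepts."""
    return [v for v in values if pattern.search(v)]


print(matching(UNESCAPED_DOT, SAMPLES))  # ['?', '2+', '0', '4.0', '-0.3']
print(matching(LITERAL_DOT, SAMPLES))    # ['0', '4.0', '-0.3']
```

The first list is the same set of rows that got updated above. Both patterns still accept an empty string, so this regex alone is not a complete IsNumeric().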

Finally, this regex gives me what I need: '^(([0-9+-.$]{1})|([+-]?[$]?[0-9]*(([.]{1}[0-9]*)|([.]?[0-9]+))))$' (From this page)

I am sorry, this regex is too long for me to explain, and I am exhausted already. If you are interested, please check that page for what it means. For now, I can just use it as:
update a set new = old where old  REGEXP ('^(([0-9+-.$]{1})|([+-]?[$]?[0-9]*(([.]{1}[0-9]*)|([.]?[0-9]+))))$');

(null)
(null)
(null)
0
4.0
-0.3
=========

Problem solved.

BTW, at the time of writing, regex in MySQL can only be used for validation like in this case; it cannot be used for string replacement or extraction, such as retrieving a number from a string. That limits the moves we can make. I hate that. (MySQL 8.0 later added REGEXP_REPLACE and REGEXP_SUBSTR.)


Wednesday, March 20

DateDiff function in SQL Server

The DateDiff function in SQL Server might not work the way you expect.

For example, say we are looking for records that are older than 4 hours. You might think to do it like this:

select * from table where DateDiff(hour, CreateDate, getdate()) >=4
but that is not correct. Assume the CreateDate is 05:59 and the current time is 09:00. You will find that this record is selected, as if a record that is 3 hours and 1 minute old were "older than 4 hours."

We can easily confirm that using these 2 queries:



 select datediff(hour, '2013-03-19 05:59', '2013-03-19 09:00')

The time difference is 3 hours and 1 minute, so I would expect this query to return “3”


select datediff(hour, '2013-03-19 05:00', '2013-03-19 09:59')

The time difference is 4 hours and 59 minutes, so I would expect this query to return “4”

Actually, these two queries both return “4”.

So DateDiff does not measure elapsed time. It counts how many hour boundaries are crossed between the two datetimes. Within the same day, that amounts to subtracting the "5" of the first datetime's hour from the "9" of the second to get "4".
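The counting rule can be imitated outside SQL Server. A minimal sketch in Python, where datediff_hour is a hypothetical stand-in that truncates both times to the hour before subtracting; it is not SQL Server's implementation, only the same counting rule applied to the two examples above.

```python
from datetime import datetime


def datediff_hour(start, end):
    """Count the hour boundaries crossed between start and end."""
    def floor_to_hour(t):
        return t.replace(minute=0, second=0, microsecond=0)
    delta = floor_to_hour(end) - floor_to_hour(start)
    return int(delta.total_seconds() // 3600)


def elapsed_hours(start, end):
    """Real elapsed time in hours, for comparison."""
    return (end - start).total_seconds() / 3600


a = (datetime(2013, 3, 19, 5, 59), datetime(2013, 3, 19, 9, 0))
b = (datetime(2013, 3, 19, 5, 0), datetime(2013, 3, 19, 9, 59))

print(datediff_hour(*a), round(elapsed_hours(*a), 2))  # 4 3.02
print(datediff_hour(*b), round(elapsed_hours(*b), 2))  # 4 4.98
```

Both pairs cross four hour boundaries, even though the real gap differs by almost two hours.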

Back to our initial question: How to look for records that are older than 4 hours? Use DateAdd:

Select * from table where DateAdd(hour, 4, CreateDate) <= getDate()

This will give a precise calculation of 4 hours.


Added on Sept 16, 2013:
The last query does the job, but it cannot use an index on CreateDate, if there is one. The DB has to visit every row and calculate DateAdd(hour, 4, CreateDate) to find the candidates.
If you want the DB to use an existing index on a column, it's important not to wrap that column in a calculation.
So the query can be changed to:

Select * from table where CreateDate <= DateAdd(hour, -4, getDate())




Tuesday, October 25

Registry Redirection, for 32-bit application in 64-bit Windows OS

Some people might have noticed, and some might not:

In Microsoft's 64-bit operating systems, there is one "C:\Program Files\" folder and one "C:\Program Files (x86)" folder. 32-bit applications are placed in the second one. The tricky thing is, when a 32-bit application asks for the Program Files folder (for example through the %ProgramFiles% environment variable), it is given the "C:\Program Files (x86)" folder. Likewise, when a 32-bit application tries to access the Windows\System32\ folder, it is actually accessing the Windows\SysWOW64\ folder. This is called "WoW64 File System Redirection".

Microsoft uses this "WoW64" approach to keep 32-bit applications on the same machine as 64-bit applications, and to let the 32-bit applications access the 32-bit environment (DLLs) without messing with the 64-bit environment.

Take a wild guess, what is the meaning of "WoW64"?

For the registry, we have the same story. For example, there are the registry paths HKEY_LOCAL_MACHINE\SOFTWARE\Microsoft and HKEY_LOCAL_MACHINE\SOFTWARE\Wow6432Node\Microsoft. When you run "regedit" in your OS, it is of course a 64-bit application, so you can see both paths. When requesting "HKLM\SOFTWARE\Microsoft", a 64-bit application gets the content of the first path, but a 32-bit application gets the content of the second path.

If a 32-bit application wants the content of the first path, its source code needs to open the registry key with the extra KEY_WOW64_64KEY option:
RegOpenKeyEx(HKEY_LOCAL_MACHINE, path, 0, KEY_SET_VALUE|KEY_WOW64_64KEY, &hKey);

Note: Visual Studio, by default, creates 32-bit applications, even when it is running on a 64-bit operating system.

Yesterday I was too lazy to use "RegOpenKeyEx". I just wanted to call the external regedit.exe to import an existing Ben.reg file, as part of the configuration stage. The registry keys should end up under HKEY_LOCAL_MACHINE\SOFTWARE\Ben. I ran "regedit /s Ben.reg" by hand a thousand times and the content was imported successfully. But when the same command line was called from my application, it reported "imported successfully" while the keys were not under HKEY_LOCAL_MACHINE\SOFTWARE\Ben. Of course, now you know the keys were under HKEY_LOCAL_MACHINE\SOFTWARE\Wow6432Node\Ben, since the application generated by Visual Studio is a 32-bit application.

After googling thousands of webpages, I found one working solution from Greg Domjan:

Add a class:
// Turns WoW64 file system redirection off for the lifetime of the object.
class Wow64RedirectOff {
    typedef BOOL (WINAPI *FN_Wow64DisableWow64FsRedirection) ( __out PVOID *OldValue );
    typedef BOOL (WINAPI *FN_Wow64RevertWow64FsRedirection) ( __in PVOID OldValue );

public:
    Wow64RedirectOff() {
        // Look the function up at run time; it does not exist on every Windows version.
        LPFN_Disable = (FN_Wow64DisableWow64FsRedirection)GetProcAddress(
            GetModuleHandle(TEXT("kernel32")),"Wow64DisableWow64FsRedirection");
        if( LPFN_Disable ) {
            LPFN_Disable(&OldValue);
        }
    }

    ~Wow64RedirectOff() {
        if( LPFN_Disable ) {
            FN_Wow64RevertWow64FsRedirection LPFN_Revert = (FN_Wow64RevertWow64FsRedirection)GetProcAddress(
                GetModuleHandle(TEXT("kernel32")),"Wow64RevertWow64FsRedirection");
            if( LPFN_Revert ) {
                // Restore the redirection state saved by the constructor.
                LPFN_Revert(OldValue);
            }
        }
    }

private:
    FN_Wow64DisableWow64FsRedirection LPFN_Disable;
    PVOID OldValue;
};

Then in the program, define Wow64RedirectOff scopedRedirect; before calling the external program "regedit". You can wrap these 2 actions in one pair of braces so that scopedRedirect is destructed right after the external program terminates, and the WoW64 redirection is restored to normal.
